# Random Forest

- Random Forest is an ensemble of many decision trees
- In this notebook it is used as a classification algorithm (it can also be used for regression).

Why do we need Random Forest over Decision Trees?
- Decision Trees are easy to build, use and interpret, but a single tree is often inaccurate
- DTs tend to overfit the training data and generalise poorly to unseen data, so our model may not work as desired
- Random Forest = simplicity of DTs + much better accuracy, obtained by combining many trees
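
The claim above can be checked on a toy problem. The sketch below is only an illustration: the data comes from `make_classification` (an assumption, not the notebook's dataset), and the sample size and number of trees are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the notebook uses its own pre-processed .csv)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# 3-fold cross-validated accuracy of a single tree versus a small forest
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=3)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=30, random_state=42), X, y, cv=3)

print('Decision tree mean accuracy: {:.4f}'.format(tree_scores.mean()))
print('Random forest mean accuracy: {:.4f}'.format(forest_scores.mean()))
```

On data like this the forest usually scores higher, but the size of the gap depends on the dataset.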

## Input

1. A .csv file produced by pre_processing.ipynb
2. The pre-processed input data has had the following techniques applied:
   * MinMax Scaling
   * PCA
   * Correlation
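
pre_processing.ipynb itself is not shown here, so the following is only a rough sketch of what MinMax scaling followed by PCA could look like. The random input array and the choice of 3 components are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw feature matrix: 100 rows, 6 numeric columns
rng = np.random.default_rng(42)
raw_features = rng.normal(loc=50.0, scale=10.0, size=(100, 6))

# MinMax scaling maps every column to the range [0, 1]
scaled = MinMaxScaler().fit_transform(raw_features)

# PCA then projects the scaled features onto fewer dimensions
reduced = PCA(n_components=3).fit_transform(scaled)

print(scaled.min(axis=0), scaled.max(axis=0))
print(reduced.shape)
```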

## Output/Analysis

1. Visualising the accuracy of RF with k-fold validation.
2. Comparing the accuracy of RF model with and without PCA.   

In [49]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [50]:
def load_data(filename):
    return pd.read_csv(filename)

# Split the input file into train and test datasets

I/P: dataframe

O/P: cross_val_df, test_df (an 80% / 20% split of the rows; features and labels are separated later)

In [51]:
def prep_training(network_data):
    return train_test_split(network_data, train_size=0.8, test_size=0.2, random_state=42)
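
As a small illustration of this split, the sketch below applies the same call to a toy dataframe and then separates features from the label. The dataframe and its column names `feature_a` and `feature_b` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the real network data (hypothetical columns)
toy_df = pd.DataFrame({'feature_a': range(10),
                       'feature_b': range(10, 20),
                       'label': [0, 1] * 5})

# Same call as prep_training: 80% for cross-validation, 20% held out for testing
cross_val_df, test_df = train_test_split(toy_df, train_size=0.8, test_size=0.2,
                                         random_state=42)

# Features and target are separated only when fitting or predicting
x_cross_val = cross_val_df.drop('label', axis=1)
y_cross_val = cross_val_df['label']

print(cross_val_df.shape, test_df.shape)
```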

# Hyperparameter Tuning for Random Forest

The following hyperparameter tuning draws on:
1. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
2. https://medium.com/@ODSC/optimizing-hyperparameters-for-random-forest-algorithms-in-scikit-learn-d60b7aa07ead

In [52]:
from pprint import pprint

Instead of tuning every available parameter, we will focus on a few of the following hyperparameters:
1. n_estimators = number of trees in the forest
2. max_features = max number of features considered for splitting a node
3. max_depth = max number of levels in each decision tree
4. min_samples_split = min number of data points required in a node before the node is split
5. min_samples_leaf = min number of data points allowed in a leaf node
6. bootstrap = method for sampling data points (with or without replacement)
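
To see the current values of these hyperparameters, one option is to query the classifier with `get_params()`. This is a minimal sketch; the `tuned_names` list is just a helper introduced here, and the printed defaults depend on the installed scikit-learn version.

```python
from pprint import pprint

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)

# get_params() returns every constructor parameter with its current value
all_params = clf.get_params()

# Helper list (introduced for this sketch) of the six hyperparameters named above
tuned_names = ['n_estimators', 'max_features', 'max_depth',
               'min_samples_split', 'min_samples_leaf', 'bootstrap']
pprint({name: all_params[name] for name in tuned_names})
```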

To use CV, we first need to create a parameter grid to sample from during fitting:

Parameters taken from the reference GitHub repository:
- n_estimators = [30,60,90,120]
- max_features = not included
- max_depth = [10,15,20]
- min_samples_split = [2,4,6]
- min_samples_leaf = not included
- bootstrap = not included


### RandomizedSearch CV versus GridSearchCV

Grid search works well with a small number of hyperparameters. However, if the number of parameters to consider is particularly high and their influence on the result is imbalanced, random search is the better choice. [Reference](https://towardsdatascience.com/machine-learning-gridsearchcv-randomizedsearchcv-d36b89231b10)

We tune only three hyperparameters (4 × 3 × 3 = 36 combinations), so an exhaustive search is affordable. Therefore, we have used GridSearchCV.
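
For comparison, a random search over the same three hyperparameters could look like the sketch below. The notebook does not use this approach; the synthetic data and `n_iter=5` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

param_distributions = {'n_estimators': [30, 60, 90, 120],
                       'max_depth': [10, 15, 20],
                       'min_samples_split': [2, 4, 6]}

# Only n_iter randomly chosen combinations are evaluated, not all 36
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_distributions,
                                   n_iter=5, cv=3, random_state=42)
random_search.fit(X, y)

print(random_search.best_params_)
```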

In [53]:
def creatingParamterGrid():
    n_estimators = [30,60,90,120]   # Number of trees in random forest
    max_depth = [10,15,20]    # Maximum number of levels in tree
    min_samples_split = [2,4,6]    # Minimum number of samples required to split a node
    
    # Create the parameter grid
    parameter_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
    return parameter_grid

## Cross-Validated Grid Search

We are now ready to create our grid-search object from the classifier and the parameter grid created so far.
Instead of passing a `PredefinedSplit` object to the `cv` parameter, we simply pass the number of folds.

In [54]:
def prepToFindOptimalHyperParams(clf,parameter_grid):
    grid_search = GridSearchCV(estimator=clf, cv=3, param_grid = parameter_grid)
    return grid_search
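
When `cv` is an integer and the estimator is a classifier, scikit-learn splits the data with `StratifiedKFold`, so each fold keeps roughly the same class balance. The sketch below shows the folds that `cv=3` would produce on a tiny, made-up label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 12 samples, 6 per class
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 6 + [1] * 6)

# Equivalent of cv=3 for a classifier
skf = StratifiedKFold(n_splits=3)
folds = list(skf.split(X, y))

for i, (train_idx, val_idx) in enumerate(folds):
    # Each validation fold holds 4 samples, 2 from each class
    print(i, len(train_idx), len(val_idx), y[val_idx].tolist())
```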

## Training the Model

Now that we have created our `grid_search` object, we are ready to train our model.

In [55]:
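def train(cross_val_df,grid_search):
    # Columns that encode the target (binary label and attack-category columns); excluded from the features
    ll = ["label","attack_cat_0","attack_cat_1","attack_cat_2","attack_cat_3","attack_cat_4","attack_cat_5","attack_cat_6","attack_cat_7","attack_cat_8","attack_cat_9"]
    # Fit every hyperparameter combination with cross-validation, using "label" as the target
    grid_search.fit(cross_val_df.drop(ll,axis=1), cross_val_df["label"])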

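In total we fit (number of unique hyperparameter combinations × number of folds) + 1 models: 36 × 3 + 1 = 109. The extra fit is the final refit of the best combination on the whole cross-validation set.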
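
The count above can be reproduced from the parameter grid itself. This is a small sketch in plain Python that assumes the grid and the 3 folds used in this notebook.

```python
from itertools import product

parameter_grid = {'n_estimators': [30, 60, 90, 120],
                  'max_depth': [10, 15, 20],
                  'min_samples_split': [2, 4, 6]}
n_folds = 3

# Every unique combination of the three hyperparameters
combinations = list(product(*parameter_grid.values()))

# One fit per combination per fold, plus the final refit of the best combination
n_fits = len(combinations) * n_folds + 1

print(len(combinations), n_fits)
```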

## Cross-validated Results

To examine the results for each individual fold, we use `grid_search`'s `cv_results_` attribute, for example `pd.DataFrame(grid_search.cv_results_).head()`.
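
As an illustration of what `cv_results_` contains, the sketch below runs a much smaller grid on synthetic data (both are assumptions made to keep it quick) and selects the per-fold and summary columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately tiny grid (assumptions)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
small_grid = {'n_estimators': [5, 10], 'max_depth': [3, 5]}

small_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                            cv=3, param_grid=small_grid)
small_search.fit(X, y)

# One row per hyperparameter combination, one split*_test_score column per fold
results = pd.DataFrame(small_search.cv_results_)
columns = ['params', 'split0_test_score', 'split1_test_score',
           'split2_test_score', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```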

## Optimal Hyperparameters

The best combination and its mean cross-validated accuracy are available as `grid_search.best_params_` and `grid_search.best_score_`.

# Test the model and find out its accuracy

In [56]:
def testRandomForestModel(test_df,grid_search):
    # Same target-related columns as in train(); excluded from the features
    ll = ["label","attack_cat_0","attack_cat_1","attack_cat_2","attack_cat_3","attack_cat_4","attack_cat_5","attack_cat_6","attack_cat_7","attack_cat_8","attack_cat_9"]
    # predict() uses the best estimator refitted by the grid search
    acc=accuracy_score(test_df["label"],grid_search.predict(test_df.drop(ll,axis=1)))
    print('Acc: {:.4f}'.format(acc))

# Main Function

In [57]:
def main(network_data): 
    clf = RandomForestClassifier(random_state=42)
    cross_val_df, test_df = prep_training(network_data)
    parameter_grid = creatingParamterGrid()
    pprint(parameter_grid)
    grid_search = prepToFindOptimalHyperParams(clf,parameter_grid)
    train(cross_val_df,grid_search)
    testRandomForestModel(test_df,grid_search)
    return grid_search

# Classification with RF after MinMax Scaling 

In [58]:
network_data = load_data("./data_minmax_labelenc.csv")

In [59]:
grid_search= main(network_data)

{'max_depth': [10, 15, 20],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [30, 60, 90, 120]}
Acc: 0.9495


In [60]:
grid_search.best_params_

{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 90}

In [61]:
grid_search.best_score_

0.9479474904012962

# Classification with RF after MinMax Scaling + Dimension Reduction (using PCA)

In [76]:
# network_data = load_data('dataset_minmax_pca.csv')
# main(network_data)

# Classification with RF after MinMax Scaling + Correlation analysis

In [77]:
# network_data = load_data('dataset_minmax_corr.csv')
# main(network_data)