# Random Forest

- Random Forest is an ensemble of many decision trees.
- It is used here as a classification algorithm.

Why do we need Random Forest over Decision Trees?
- Decision Trees (DTs) are easy to build, use and interpret, but a single tree is often inaccurate.
- DTs tend to overfit and generalize poorly to unseen data, so the model may not work as desired.
- Random Forest = simplicity of DTs + much better accuracy.
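
One way to see why combining trees can help: if every tree were an independent voter that is right with probability p > 0.5, a majority vote over many of them would be right more often than any single tree. The sketch below computes that probability exactly. The independence assumption, the function name and the value p = 0.7 are illustrative only; trees in a real forest are correlated, so the gain in practice is smaller.

```python
from math import comb

def majority_vote_accuracy(n_trees: int, p: float) -> float:
    """Probability that a strict majority of n_trees independent voters,
    each correct with probability p, is correct (n_trees should be odd)."""
    majority = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(majority, n_trees + 1)
    )

# Accuracy of the vote grows with the number of (idealized) trees.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

With three voters at p = 0.7 the vote is right with probability 0.784, already above the 0.7 of a single voter.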

## Input

1. A .csv file produced by pre_processing.ipynb
2. The pre-processed input data uses the following techniques:
   * MinMax scaling
   * PCA
   * Correlation analysis
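
As a reminder of what the first technique does, min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). The helper below is a plain-Python sketch of that formula, not the code from pre_processing.ipynb; the function name and the sample values are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```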

## Output/Analysis

1. Visualising the accuracy of RF with k-fold cross-validation.
2. Comparing the accuracy of the RF model with and without PCA.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
def load_data(filename):
    return pd.read_csv(filename)

# Split the input file into train and test datasets

I/P: dataframe

O/P: cross_val_df, test_df (both still contain the `label` column)

In [3]:
def prep_training(network_data):
    return train_test_split(network_data, train_size=0.8, test_size=0.2, random_state=42)
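
A small sketch of what this split produces, on a hypothetical 10-row frame standing in for the network data. The `stratify` argument is an optional addition that is not used in `prep_training`; it keeps the class ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature column and a balanced binary label.
toy_df = pd.DataFrame({
    "feature": range(10),
    "label": [0, 1] * 5,
})

# Same 80/20 split and seed as prep_training, plus stratification on the label.
cross_val_df, test_df = train_test_split(
    toy_df,
    train_size=0.8,
    test_size=0.2,
    random_state=42,
    stratify=toy_df["label"],
)

print(len(cross_val_df), len(test_df))   # 8 2
print(sorted(test_df["label"]))          # [0, 1]
```

Both returned frames keep the `label` column, which is why the later steps drop it before fitting and predicting.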

# Hyperparameter Tuning for Random Forest

The hyperparameter tuning below draws on the following references:
1. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
2. https://medium.com/@ODSC/optimizing-hyperparameters-for-random-forest-algorithms-in-scikit-learn-d60b7aa07ead

In [4]:
from pprint import pprint

Instead of tuning every parameter of the classifier, we focus on a few of the following hyperparameters:
1. n_estimators = number of trees in the forest
2. max_features = max number of features considered for splitting a node
3. max_depth = max number of levels in each decision tree
4. min_samples_split = min number of data points placed in a node before the node is split
5. min_samples_leaf = min number of data points allowed in a leaf node
6. bootstrap = method for sampling data points (with or without replacement)
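
To see the full set of parameters a classifier exposes, and the defaults that apply when a parameter is left out of the grid, `get_params()` can be printed. This is a minimal sketch; the `tunable` list simply repeats the six names above.

```python
from pprint import pprint
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)
params = clf.get_params()
pprint(params)  # every constructor parameter with its current value

# The six hyperparameters listed above are all part of that dictionary.
tunable = ["n_estimators", "max_features", "max_depth",
           "min_samples_split", "min_samples_leaf", "bootstrap"]
print(all(name in params for name in tunable))  # True
```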

To use CV, we first need to create a parameter grid to sample from during fitting:

Parameters from the reference GitHub repository:
- n_estimators = [30,60,90,120]
- max_features = not included
- max_depth = [10,15,20]
- min_samples_split = [2,4,6]
- min_samples_leaf = not included
- bootstrap = not included


### RandomizedSearchCV versus GridSearchCV

Grid search works well when there are only a few hyperparameters to tune. However, if the number of parameters is particularly high and their influence on the result is imbalanced, random search is the better choice. [Reference](https://towardsdatascience.com/machine-learning-gridsearchcv-randomizedsearchcv-d36b89231b10)

We tune only three hyperparameters with a handful of values each, so we have used GridSearchCV.
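
The difference between the two strategies can be illustrated in plain Python: grid search enumerates every combination, while random search evaluates a fixed budget of randomly drawn ones. This is only a sketch of the idea, not scikit-learn's implementation; the seed and the budget of 10 are arbitrary choices, and `RandomizedSearchCV` can additionally sample from continuous distributions.

```python
import itertools
import random

parameter_grid = {
    "n_estimators": [30, 60, 90, 120],
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 4, 6],
}

names = list(parameter_grid)
# Grid search: every combination of the listed values.
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*parameter_grid.values())
]

# Random search: a fixed budget of combinations drawn without replacement.
rng = random.Random(42)                       # arbitrary seed
sampled = rng.sample(all_combinations, k=10)  # illustrative budget of 10

print(len(all_combinations), len(sampled))  # 36 10
```

With only 36 combinations, evaluating all of them is affordable, which is consistent with choosing the exhaustive search here.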

In [5]:
def creatingParamterGrid():
    n_estimators = [30,60,90,120]   # Number of trees in random forest
    max_depth = [10,15,20]    # Maximum number of levels in tree
    min_samples_split = [2,4,6]    # Minimum number of samples required to split a node
    
    # Create the parameter grid
    parameter_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
    return parameter_grid

## Cross-Validated Grid Search

We are now ready to create our grid-search object, using the classifier and the parameter grid created so far.
Instead of passing a `PredefinedSplit` object to the `cv` parameter, we simply pass the number of folds.

In [6]:
def prepToFindOptimalHyperParams(clf,parameter_grid):
    grid_search = GridSearchCV(estimator=clf, cv=3, param_grid = parameter_grid)
    return grid_search
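
When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified k-fold splitting. The sketch below spells that out by passing a `StratifiedKFold` object directly; it is an equivalent alternative, not the code used in this notebook. The shortened grid, `scoring="accuracy"` (the classifier's default metric) and `n_jobs=-1` are illustrative additions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

clf = RandomForestClassifier(random_state=42)
# Shortened grid, for illustration only.
parameter_grid = {"n_estimators": [30, 60], "max_depth": [10, 15]}

grid_search = GridSearchCV(
    estimator=clf,
    param_grid=parameter_grid,
    cv=StratifiedKFold(n_splits=3),  # what cv=3 resolves to for a classifier
    scoring="accuracy",              # same metric as the classifier's default score
    n_jobs=-1,                       # optional: run the fits in parallel
)

print(grid_search.cv.get_n_splits())  # 3
```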

## Training the Model

Now that we have created our `grid_search` object, we are ready to train our model.

In [7]:
def train(cross_val_df,grid_search):
    grid_search.fit(cross_val_df.drop("label",axis=1), cross_val_df["label"])

The number of models trained is (number of unique hyperparameter combinations × number of folds) + 1; the extra fit is the final refit of the best combination on the whole training set.
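
For the grid and fold count used in this notebook, the arithmetic works out as follows. The variable names are chosen here for illustration.

```python
from math import prod

parameter_grid = {
    "n_estimators": [30, 60, 90, 120],
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 4, 6],
}
n_folds = 3

# 4 * 3 * 3 unique combinations, each fitted once per fold.
n_combinations = prod(len(values) for values in parameter_grid.values())
# +1 for the final refit of the best combination on all training data.
total_fits = n_combinations * n_folds + 1

print(n_combinations, total_fits)  # 36 109
```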

## Cross-validated Results

To examine the results for each individual fold, we use `grid_search`'s `cv_results_` attribute.
pd.DataFrame(grid_search.cv_results_).head()

## Optimal Hyperparameters

grid_search.best_params_
grid_search.best_score_
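
A self-contained sketch of how these attributes relate, run on synthetic data so that it finishes in seconds. The dataset, the tiny grid and the name `toy_search` are assumptions made for illustration; the numbers it prints say nothing about the network data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; sizes and grid values are arbitrary.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
toy_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=3,
)
toy_search.fit(X, y)

# One row per hyperparameter combination, best-ranked first.
results = pd.DataFrame(toy_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[columns])

# best_score_ is the highest mean_test_score; best_params_ is its combination.
print(toy_search.best_params_, toy_search.best_score_)
```

Sorting by `rank_test_score` is often more informative than `head()`, which only shows the first few combinations in grid order.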

# Test the model and analyse it

In [8]:
def testRandomForestModel(test_df,grid_search):
    Ytest = test_df["label"]
    Ypred = grid_search.predict(test_df.drop("label",axis=1))
    acc=accuracy_score(Ytest,Ypred)
    print('Accuracy of Random Forest: {:.4f}'.format(acc))  
    print(classification_report(Ytest,Ypred))
    print(confusion_matrix(Ytest, Ypred))

# Main Function

In [9]:
def main(network_data): 
    clf = RandomForestClassifier(random_state=42)
    cross_val_df, test_df = prep_training(network_data)
    parameter_grid = creatingParamterGrid()
    pprint(parameter_grid)
    grid_search = prepToFindOptimalHyperParams(clf,parameter_grid)
    train(cross_val_df,grid_search)
    testRandomForestModel(test_df,grid_search)
    print(grid_search.best_params_)
    print(grid_search.best_score_)
    return grid_search

# Classification with RF without preprocessing

In [10]:
network_data = load_data('https://raw.githubusercontent.com/divyaKh/CMPE255Project/main/2.Data_Cleaning/cleaned_dataset_label_encoding.csv')

In [11]:
network_data = network_data.drop("attack_cat",axis=1)
grid_search = main(network_data)

{'max_depth': [10, 15, 20],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [30, 60, 90, 120]}
Accuracy of Random Forest: 0.9488
              precision    recall  f1-score   support

           0       0.92      0.94      0.93     18675
           1       0.96      0.96      0.96     32860

    accuracy                           0.95     51535
   macro avg       0.94      0.95      0.94     51535
weighted avg       0.95      0.95      0.95     51535

[[17508  1167]
 [ 1472 31388]]
{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 60}
0.9476564252995031
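
The accuracy, precision and recall printed above can be recomputed from the confusion matrix alone, which is a useful check on how to read it: rows are true labels and columns are predicted labels. The variable names below are chosen for illustration.

```python
# Confusion matrix from the run above.
tn, fp = 17508, 1167    # true label 0: predicted 0, predicted 1
fn, tp = 1472, 31388    # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_0 = tn / (tn + fn)   # of everything predicted 0, how much was 0
recall_0 = tn / (tn + fp)      # of everything truly 0, how much was found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(total)                                        # 51535
print(round(accuracy, 4))                           # 0.9488
print(round(precision_0, 2), round(recall_0, 2))    # 0.92 0.94
print(round(precision_1, 2), round(recall_1, 2))    # 0.96 0.96
```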


# Classification with RF after MinMax Scaling 

In [10]:
network_data1 = load_data('../input/dataset_minmax.csv')

In [11]:
network_data1

|  | # dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sttl | ... | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.833334e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000033 | 0.000000 | 0.090909 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.016393 | 0.0 | 0 |
| 1 | 1.333334e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000121 | 0.000000 | 0.125000 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.016393 | 0.0 | 0 |
| 2 | 8.333335e-08 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000073 | 0.000000 | 0.200000 | 0.996078 | ... | 0.000000 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.032787 | 0.0 | 0 |
| 3 | 1.000000e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000061 | 0.000000 | 0.166667 | 0.996078 | ... | 0.017241 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.016949 | 0.032787 | 0.0 | 0 |
| 4 | 1.666667e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000146 | 0.000000 | 0.100000 | 0.996078 | ... | 0.017241 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.016949 | 0.032787 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 257668 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.396552 | 0.266667 | 0.359375 | 0.0 | 0.0 | 0.0 | 0.389831 | 0.377049 | 0.0 | 1 |
| 257669 | 8.429368e-03 | 0.856061 | 0.000000 | 0.4 | 0.000845 | 0.000726 | 0.000042 | 0.000024 | 0.000034 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 1 |
| 257670 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.034483 | 0.044444 | 0.187500 | 0.0 | 0.0 | 0.0 | 0.033898 | 0.180328 | 0.0 | 1 |
| 257671 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.500000 | 0.288889 | 0.453125 | 0.0 | 0.0 | 0.0 | 0.491525 | 0.475410 | 0.0 | 1 |


In [12]:
grid_search1= main(network_data1)

{'max_depth': [10, 15, 20],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [30, 60, 90, 120]}
Accuracy of Random Forest: 0.9495
              precision    recall  f1-score   support

           0       0.93      0.94      0.93     18675
           1       0.96      0.96      0.96     32860

    accuracy                           0.95     51535
   macro avg       0.94      0.95      0.95     51535
weighted avg       0.95      0.95      0.95     51535

[[17480  1195]
 [ 1407 31453]]
{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 90}
0.9479474904012962


In [13]:
pd.DataFrame(grid_search1.cv_results_).head()

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.568965 | 0.108508 | 0.202967 | 0.003435 | 10 | 2 | 30 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.934146 | 0.936111 | 0.934597 | 0.934951 | 0.00084 | 36 |
| 1 | 12.168635 | 0.578116 | 0.492135 | 0.013797 | 10 | 2 | 60 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.934976 | 0.936155 | 0.934844 | 0.935325 | 0.000589 | 31 |
| 2 | 15.135262 | 0.464047 | 0.561972 | 0.00733 | 10 | 2 | 90 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.93499 | 0.93614 | 0.935266 | 0.935466 | 0.00049 | 27 |
| 3 | 20.88407 | 0.936567 | 0.750132 | 0.009459 | 10 | 2 | 120 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.934743 | 0.93614 | 0.935033 | 0.935305 | 0.000602 | 32 |
| 4 | 5.044115 | 0.124677 | 0.207318 | 0.009411 | 10 | 4 | 30 | {'max_depth': 10, 'min_samples_split': 4, 'n_e... | 0.933884 | 0.93614 | 0.935295 | 0.935107 | 0.000931 | 35 |


Here:
1. {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 90} are the best parameters found by hyperparameter tuning.
2. 0.9479474904012962 is the best mean cross-validation score of the Random Forest.

# Classification with RF after MinMax Scaling + Correlation analysis

In [14]:
network_data2 = load_data('../input/dataset_minmax_corr.csv')

In [15]:
grid_search2 = main(network_data2)

{'max_depth': [10, 15, 20],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [30, 60, 90, 120]}
Accuracy of Random Forest: 0.9401
              precision    recall  f1-score   support

           0       0.92      0.92      0.92     18675
           1       0.95      0.95      0.95     32860

    accuracy                           0.94     51535
   macro avg       0.94      0.94      0.94     51535
weighted avg       0.94      0.94      0.94     51535

[[17117  1558]
 [ 1530 31330]]
{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 120}
0.9383422787135555


In [16]:
pd.DataFrame(grid_search2.cv_results_).head()

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.457005 | 0.375925 | 0.193947 | 0.004409 | 10 | 2 | 30 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.923144 | 0.921063 | 0.920771 | 0.921659 | 0.001057 | 35 |
| 1 | 9.768146 | 1.0758 | 0.405202 | 0.041636 | 10 | 2 | 60 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.923115 | 0.922606 | 0.921266 | 0.922329 | 0.00078 | 30 |
| 2 | 14.390445 | 1.20349 | 0.550731 | 0.018484 | 10 | 2 | 90 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.923115 | 0.921572 | 0.921207 | 0.921965 | 0.000827 | 32 |
| 3 | 18.294489 | 0.779368 | 0.743094 | 0.020131 | 10 | 2 | 120 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.923464 | 0.922416 | 0.921149 | 0.922343 | 0.000947 | 29 |
| 4 | 4.328513 | 0.054004 | 0.189706 | 0.004133 | 10 | 4 | 30 | {'max_depth': 10, 'min_samples_split': 4, 'n_e... | 0.921267 | 0.923915 | 0.92259 | 0.922591 | 0.001081 | 27 |


# Classification with RF after Dimension Reduction (using PCA)

In [17]:
network_data3 = load_data('../input/dataset_pca.csv')

In [18]:
grid_search3 = main(network_data3)

{'max_depth': [10, 15, 20],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [30, 60, 90, 120]}
Accuracy of Random Forest: 0.9296
              precision    recall  f1-score   support

           0       0.92      0.88      0.90     18675
           1       0.94      0.96      0.95     32860

    accuracy                           0.93     51535
   macro avg       0.93      0.92      0.92     51535
weighted avg       0.93      0.93      0.93     51535

[[16506  2169]
 [ 1461 31399]]
{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 90}
0.927892964597675


In [19]:
pd.DataFrame(grid_search3.cv_results_).head()

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.495456 | 0.77205 | 0.230829 | 0.005162 | 10 | 2 | 30 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.909391 | 0.909449 | 0.909346 | 0.909396 | 4.2e-05 | 32 |
| 1 | 30.630508 | 0.938306 | 0.399791 | 0.007541 | 10 | 2 | 60 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.91041 | 0.91041 | 0.910743 | 0.910521 | 0.000157 | 25 |
| 2 | 42.824832 | 0.249863 | 0.589165 | 0.013895 | 10 | 2 | 90 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.910512 | 0.909653 | 0.910903 | 0.910356 | 0.000522 | 26 |
| 3 | 58.379558 | 2.197414 | 0.764653 | 0.013248 | 10 | 2 | 120 | {'max_depth': 10, 'min_samples_split': 2, 'n_e... | 0.910308 | 0.909464 | 0.910671 | 0.910148 | 0.000506 | 27 |
| 4 | 14.224579 | 0.076404 | 0.199939 | 0.001074 | 10 | 4 | 30 | {'max_depth': 10, 'min_samples_split': 4, 'n_e... | 0.90846 | 0.910192 | 0.908371 | 0.909008 | 0.000838 | 35 |
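
To compare the four runs side by side, the test accuracies printed above can be collected and ranked. The snippet below only reuses those reported numbers; the dictionary keys are shorthand labels chosen here.

```python
# Test accuracies as printed by each run above.
test_accuracy = {
    "no preprocessing": 0.9488,
    "MinMax scaling": 0.9495,
    "MinMax scaling + correlation analysis": 0.9401,
    "PCA": 0.9296,
}

# Rank the runs from most to least accurate.
ranked = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<40} {acc:.4f}")

best_name, best_acc = ranked[0]
pca_gap = best_acc - test_accuracy["PCA"]
print(f"Best: {best_name}; gap to PCA: {pca_gap:.4f}")
# Best: MinMax scaling; gap to PCA: 0.0199
```

The first two runs differ by less than 0.001, which may be within run-to-run variation, while the PCA run is about two percentage points lower on this test split.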
