# Random Forest

- A Random Forest is an ensemble of many decision trees.
- It is used here as a classification algorithm (random forests also support regression).

Why do we need Random Forest over Decision Trees?
- Decision Trees are easy to build, use and interpret, but a single tree is often inaccurate.
- DTs tend to overfit and generalise poorly to unseen data, so the model may not work as desired.
- Random Forest = simplicity of DTs + much better accuracy, obtained by combining many trees trained on bootstrap samples with random feature subsets.
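
The comparison above can be tried on a small scale. The sketch below is only illustrative: it uses synthetic data from `make_classification` (an assumption, not the notebook's dataset) and arbitrary settings, and simply reports held-out accuracy for a single tree and for a forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the notebook's network dataset).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One fully grown decision tree versus a forest of 50 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Accuracy on data neither model saw during training.
tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print("Decision tree held-out accuracy:", tree_acc)
print("Random forest held-out accuracy:", forest_acc)
```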

## Input

1. .csv - produced by pre_processing.ipynb
2. The pre-processed input data has had the following techniques applied:
    #TODO

## Output/Analysis

1. Visualising the accuracy of RF with k-fold validation.
2. Comparing the accuracy of the RF model with and without PCA.
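
A minimal sketch of k-fold validation for a random forest is shown below. It assumes synthetic data and 5 stratified folds; both are illustrative choices, not values taken from this notebook. The per-fold scores in `fold_scores` are what one would then plot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0)

# 5 stratified folds: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold; the mean summarises the model.
fold_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```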

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
def load_data(filename):
    return pd.read_csv(filename)

In [4]:
network_data = load_data('data_minmax.csv')

# Split the input file into test and train datasets

I/P: dataframe

O/P: x_train, y_train, x_test, y_test

In [5]:
network_data.head(5)

Unnamed: 0,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,...,attack_cat_Backdoor,attack_cat_DoS,attack_cat_Exploits,attack_cat_Fuzzers,attack_cat_Generic,attack_cat_Normal,attack_cat_Reconnaissance,attack_cat_Shellcode,attack_cat_Worms,label
0,1.833334e-07,9.4e-05,0.0,3.3e-05,0.0,0.090909,0.996078,0.0,0.030121,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
1,1.333334e-07,9.4e-05,0.0,0.000121,0.0,0.125,0.996078,0.0,0.147128,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2,8.333335e-08,9.4e-05,0.0,7.3e-05,0.0,0.2,0.996078,0.0,0.142685,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
3,1e-07,9.4e-05,0.0,6.1e-05,0.0,0.166667,0.996078,0.0,0.1002,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
4,1.666667e-07,9.4e-05,0.0,0.000146,0.0,0.1,0.996078,0.0,0.142017,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0


In [6]:
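def prep_training():
    # Column 0 is the saved index ('Unnamed: 0'); columns 1-205 are the features
    # and column 206 is the target ('label').
    # Caution: the one-hot attack_cat_* columns fall inside x; in the rows shown
    # above attack_cat_Normal is 1 whenever label is 0, so they may leak the target.
    x = network_data.iloc[:, 1:206]
    y = network_data.iloc[:, 206]
    print(x)
    print(y)
    print("Shape of x: ", x.shape)
    print("Shape of y: ", y.shape)
    # Hold out 20% of the rows as the test set
    return train_test_split(x, y, test_size=0.2)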

In [7]:
x_train,x_test,y_train,y_test = prep_training()

           spkts     dpkts    sbytes    dbytes      rate      sttl      dttl  \
0       0.000094  0.000000  0.000033  0.000000  0.090909  0.996078  0.000000   
1       0.000094  0.000000  0.000121  0.000000  0.125000  0.996078  0.000000   
2       0.000094  0.000000  0.000073  0.000000  0.200000  0.996078  0.000000   
3       0.000094  0.000000  0.000061  0.000000  0.166667  0.996078  0.000000   
4       0.000094  0.000000  0.000146  0.000000  0.100000  0.996078  0.000000   
...          ...       ...       ...       ...       ...       ...       ...   
257668  0.000094  0.000000  0.000006  0.000000  0.111111  0.996078  0.000000   
257669  0.000845  0.000726  0.000042  0.000024  0.000034  0.996078  0.992126   
257670  0.000094  0.000000  0.000006  0.000000  0.111111  0.996078  0.000000   
257671  0.000094  0.000000  0.000006  0.000000  0.111111  0.996078  0.000000   
257672  0.000094  0.000000  0.000006  0.000000  0.111111  0.996078  0.000000   

           sload     dload     sloss  .

# Split the training dataset into training and cross-validation datasets

I/P: x_train, y_train

O/P: x_train_new, x_cv, y_train_new, y_cv

In [8]:
# Splitting train in train and cv data
x_train_new, x_cv, y_train_new, y_cv = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

In [9]:
x_train_new.shape, y_train_new.shape, x_cv.shape, y_cv.shape, x_test.shape, y_test.shape

((164910, 205), (164910,), (41228, 205), (41228,), (51535, 205), (51535,))
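
The shapes above follow from applying two successive 80/20 splits. As a quick check, the sketch below repeats the two splits on a plain index array of the same total length (164910 + 41228 + 51535 = 257673 rows) and reports the resulting sizes and fractions. The `random_state` values are arbitrary and do not affect the sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Total number of rows, taken from the shapes printed above.
n_rows = 164910 + 41228 + 51535
idx = np.arange(n_rows)

# First split: 80% train, 20% test.
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=0)
# Second split: 80% of the train part stays for training, 20% becomes CV.
train_new_idx, cv_idx = train_test_split(train_idx, test_size=0.2, random_state=0)

sizes = (len(train_new_idx), len(cv_idx), len(test_idx))
fractions = tuple(round(s / n_rows, 3) for s in sizes)
print("train / cv / test sizes:", sizes)
print("train / cv / test fractions:", fractions)
```

Overall the data ends up split roughly 64% / 16% / 20% between training, cross-validation and test.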

# Hyperparameter Tuning for Random Forest

The hyperparameter tuning below follows these references:
1. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
2. https://medium.com/@ODSC/optimizing-hyperparameters-for-random-forest-algorithms-in-scikit-learn-d60b7aa07ead

In [10]:
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint

In [11]:
# Regressor instance, used here only to list the available hyperparameters and their defaults
rf = RandomForestRegressor(random_state = 42)

In [12]:
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


Instead of tuning all of the parameters above, we will focus on the following subset of hyperparameters:
1. n_estimators = number of trees in the forest
2. max_features = max number of features considered for splitting a node
3. max_depth = max number of levels in each decision tree
4. min_samples_split = min number of data points required in a node before the node is split
5. min_samples_leaf = min number of data points allowed in a leaf node
6. bootstrap = method for sampling data points (with or without replacement)
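
For orientation, the sketch below shows where these six hyperparameters appear in the `RandomForestClassifier` constructor. The values are placeholders chosen for illustration, not tuned results from this notebook.

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder values (assumptions for illustration only).
clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # features considered when splitting a node
    max_depth=20,          # maximum number of levels in each tree
    min_samples_split=2,   # samples required in a node before it may split
    min_samples_leaf=1,    # samples required in each leaf
    bootstrap=True,        # sample training rows with replacement per tree
    random_state=42,
)

# Read the six settings back from the estimator.
names = ["n_estimators", "max_features", "max_depth",
         "min_samples_split", "min_samples_leaf", "bootstrap"]
tuned = {name: clf.get_params()[name] for name in names}
print(tuned)
```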

To use RandomizedSearchCV, we first need to create a parameter grid to sample from during fitting:

Parameter values taken from the reference GitHub repository:
- n_estimators = [100,200,300,400]
- max_features = not included
- max_depth = [20,22,24]
- min_samples_split = [2,4,6]
- min_samples_leaf = not included
- bootstrap = not included
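
The number of distinct combinations in such a grid can be counted before running any search. The sketch below uses scikit-learn's `ParameterGrid` on the three lists above; the variable name `grid` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The three hyperparameter lists given above.
grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [20, 22, 24],
    "min_samples_split": [2, 4, 6],
}

# Every combination of one value per hyperparameter: 4 * 3 * 3.
n_candidates = len(ParameterGrid(grid))
print("Number of candidate settings:", n_candidates)
```

With 3-fold cross-validation, each candidate is fitted three times.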


In [13]:
from sklearn.model_selection import RandomizedSearchCV

In [14]:
# Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
n_estimators = [100,200,300,400]

In [15]:
# Number of features to consider at every split
# max_features = ['auto', 'sqrt']

In [16]:
# Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
max_depth = [20,22,24]

In [17]:
# Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
min_samples_split = [2,4,6]

In [18]:
# Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]

In [19]:
# Method of selecting samples for training each tree
# bootstrap = [True, False]

In [21]:
# Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}

In [22]:
pprint(random_grid)

{'max_depth': [20, 22, 24],
 'min_samples_split': [2, 4, 6],
 'n_estimators': [100, 200, 300, 400]}


In [23]:
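# Use the random grid to search for best hyperparameters
# First create the base model to tune
# NOTE: a regressor is tuned here although the final model is a RandomForestClassifier,
# so the search scores candidates with R^2 rather than accuracy.
rf = RandomForestRegressor()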

In [24]:
# Random search of parameters, using 3-fold cross-validation and all available cores.
# n_iter = 100 exceeds the 36 combinations in random_grid, so all 36 are evaluated.
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

Finally, fit the RandomizedSearchCV object to the data frames containing features and labels and print the optimal hyperparameter values.

In [25]:
# Fit the random search model
print(len(x_train_new), len(y_train_new))
rf_random.fit(x_train_new, y_train_new)
rf_random.best_params_

164910 164910
Fitting 3 folds for each of 36 candidates, totalling 108 fits




{'n_estimators': 100, 'min_samples_split': 2, 'max_depth': 20}
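
As a hedged alternative, the search can be run directly on a classifier so that candidates are scored by accuracy, and the refitted best model can be reused as is. The sketch below assumes synthetic data and a deliberately tiny grid (`small_grid`, an illustrative name) so that it runs quickly; it is not the configuration used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Tiny illustrative grid so the search finishes in seconds.
small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5],
    "min_samples_split": [2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4,              # sample 4 of the 8 possible combinations
    cv=3,
    scoring="accuracy",    # classification metric instead of R^2
    random_state=0,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is already trained on all of X, y.
best_clf = search.best_estimator_
print(search.best_params_)
```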

# Train the Random Forest classifier

In [26]:
def trainAndTestRandomForest(_bootstrap, _max_depth,_max_features,
                            _min_samples_leaf, _min_samples_split, 
                            _n_estimators):
    clf = RandomForestClassifier(max_depth=_max_depth, 
                                 bootstrap = _bootstrap, 
                                 max_features = _max_features,
                                 min_samples_leaf = _min_samples_leaf, 
                                 min_samples_split = _min_samples_split, 
                                 n_estimators = _n_estimators)
    # Train the Random Forest classifier on the training split
    clf = clf.fit(x_train_new,y_train_new)
    # Return the fitted classifier; prediction happens in a separate step
    return clf

# Test the model and find out its accuracy

In [None]:
from sklearn import metrics

def tellAcurracyOfModel(clf):
    # Predict the labels for the test dataset
    y_pred = clf.predict(x_test)
    # Model accuracy: how often is the classifier correct?
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
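
To make the accuracy figure concrete, the sketch below computes it on a small hand-made example, together with a confusion matrix. The label vectors are invented for illustration and are not model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels (illustrative only).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Accuracy = correct predictions / all predictions = 4 / 6.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print("Accuracy:", acc)
print(cm)
```

Accuracy alone can hide which class is being misclassified, which is why the confusion matrix is a useful companion.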

# Classification with RF after MinMax Scaling + Dimension Reduction (using PCA)

In [None]:
#TODO: make one function of the entire functionality and call it for different datasets and find accuracy
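
One possible shape for this step is a scikit-learn pipeline that chains MinMax scaling, PCA and the classifier. The sketch below is an assumption about how it could be done, not the notebook's implementation; it uses synthetic data and keeps enough components to explain 95% of the variance, an arbitrary threshold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the notebook's dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale to [0, 1], project onto principal components, then classify.
pca_rf = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=0.95),  # keep components explaining 95% of variance
    RandomForestClassifier(n_estimators=25, random_state=0),
)
pca_rf.fit(X_tr, y_tr)

n_kept = pca_rf.named_steps["pca"].n_components_
pca_acc = pca_rf.score(X_te, y_te)
print("Components kept:", n_kept)
print("Held-out accuracy with PCA:", pca_acc)
```

Fitting the scaler and PCA inside the pipeline means they only ever see the training split, which keeps the test set untouched.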

# Classification with RF after MinMax Scaling + Correlation analysis
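
A common form of correlation analysis is to drop one feature from every highly correlated pair before training. The sketch below is a hypothetical helper (`drop_correlated` and the 0.95 threshold are assumptions, not taken from this notebook) shown on a toy data frame.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: 'a_copy' is a rescaled duplicate of 'a'; 'b' is independent noise.
rng = np.random.default_rng(0)
a = rng.random(100)
demo = pd.DataFrame({"a": a, "a_copy": a * 2.0, "b": rng.random(100)})

reduced, dropped = drop_correlated(demo)
print("Dropped columns:", dropped)
print("Remaining columns:", list(reduced.columns))
```

The reduced frame would then be split and passed to the classifier in the same way as the full feature set.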