# Random Forest

- A Random Forest is an ensemble of many decision trees.
- It can be used for classification and regression; this notebook uses it as a classifier.

Why do we need Random Forest over Decision Trees?
- Decision Trees are easy to build, use and interpret, but a single tree tends to overfit its training data.
- As a result, DTs often generalise poorly to unseen data, so the model may not work as desired.
- Random Forest = simplicity of DTs + usually much better accuracy
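
The accuracy claim above can be checked empirically. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` rather than this notebook's network data; the dataset size, seeds and variable names are illustrative assumptions, and the printed scores say nothing about results on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption, not the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
forest_scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")

print("Decision tree mean accuracy:", tree_scores.mean())
print("Random forest mean accuracy:", forest_scores.mean())
```

Comparing the two means (and the spread across folds) gives a rough sense of how much the ensemble helps on a given dataset.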

## Input

1. A .csv file produced by pre_processing.ipynb
2. The pre-processed input data has had the following techniques applied:
    #TODO

## Output/Analysis

1. Visualising the accuracy of RF with k-fold cross-validation.
2. Comparing the accuracy of the RF model with and without PCA.
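
Both analyses can be sketched together. The example below is a hedged illustration only: it assumes synthetic data, 5 folds, 10 PCA components and scikit-learn pipelines, none of which are taken from this notebook's actual pre-processing.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the pre-processed data (assumption)
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=8, random_state=0)

# The same folds are reused for both models so the scores are comparable
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

rf_plain = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
rf_pca = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=10),  # number of components is an arbitrary choice here
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores_plain = cross_val_score(rf_plain, X, y, cv=kfold, scoring="accuracy")
scores_pca = cross_val_score(rf_pca, X, y, cv=kfold, scoring="accuracy")

print("Per-fold accuracy without PCA:", scores_plain)
print("Per-fold accuracy with PCA:   ", scores_pca)
```

The two per-fold arrays can be passed to any plotting library (for example as a grouped bar chart) to visualise the comparison.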

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
def load_data(filename):
    return pd.read_csv(filename)

# Split the input file into train and test datasets

I/P: dataframe

O/P: x_train, x_test, y_train, y_test

In [3]:
def prep_training(network_data):
    # Columns 1..205 are the features; column 206 is the label
    x = network_data.iloc[:, 1:206]
    y = network_data.iloc[:, 206]
    print(x)
    print(y)
    print("Shape of x: ", x.shape)
    print("Shape of y: ", y.shape)
    # Hold out 20% of the rows as the test set
    return train_test_split(x, y, test_size=0.2)
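
To make the column selection in `prep_training` easier to follow, here is a miniature sketch. The toy frame, its column names, and the idea that column 0 is an id column to be skipped are all assumptions for illustration; the real file has 205 feature columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame: column 0 is an id, columns 1-3 are features,
# column 4 is the label
toy = pd.DataFrame({
    "id": range(10),
    "f1": range(10, 20),
    "f2": range(20, 30),
    "f3": range(30, 40),
    "label": [0, 1] * 5,
})

x = toy.iloc[:, 1:4]  # feature columns only (positions 1, 2, 3)
y = toy.iloc[:, 4]    # label column

# train_test_split returns the pieces in this order:
# x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```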

# Split the train dataset into train and cross-validation datasets

I/P: x_train, y_train

O/P: x_train_new, x_cv, y_train_new, y_cv

In [4]:
def splitIntoTrainAndCV(x_train,y_train):
    # Splitting train in train and cv data
    _x_train_new, _x_cv, _y_train_new, _y_cv = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
    print(_x_train_new.shape, _y_train_new.shape, _x_cv.shape, _y_cv.shape)
    return {'x_train_new':_x_train_new, 'x_cv':_x_cv, 'y_train_new': _y_train_new, 'y_cv':_y_cv}
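
Applying two successive 80/20 splits leaves 64% of the rows for training, 16% for cross-validation and 20% for testing. A small sketch, assuming a dummy array of 1000 rows (the size is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 1000 rows, one feature (illustrative only)
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# First split: 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Second split: 80% of the train portion stays train, 20% becomes CV
x_train_new, x_cv, y_train_new, y_cv = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42
)

print(len(x_train_new), len(x_cv), len(x_test))  # 640 160 200
```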

# Hyperparameter Tuning for Random Forest

The hyperparameter tuning below is based on:
1. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
2. https://medium.com/@ODSC/optimizing-hyperparameters-for-random-forest-algorithms-in-scikit-learn-d60b7aa07ead

In [5]:
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint

Rather than tuning every available parameter, we focus on a few of the following hyperparameters:
1. n_estimators = number of trees in the forest
2. max_features = max number of features considered for splitting a node
3. max_depth = max number of levels in each decision tree
4. min_samples_split = min number of data points placed in a node before the node is split
5. min_samples_leaf = min number of data points allowed in a leaf node
6. bootstrap = whether data points are sampled with or without replacement
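
Each hyperparameter in the list above is a constructor argument of scikit-learn's `RandomForestClassifier`. The sketch below sets all six explicitly and reads them back with `get_params()`; the specific values are arbitrary placeholders, not tuned results.

```python
from pprint import pprint
from sklearn.ensemble import RandomForestClassifier

# Placeholder values chosen only to show where each hyperparameter goes
rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_features="sqrt",    # features considered at each split
    max_depth=20,           # maximum depth of each tree
    min_samples_split=4,    # samples needed before a node may be split
    min_samples_leaf=1,     # samples required in each leaf
    bootstrap=True,         # sample rows with replacement for each tree
)

tuned_names = ["n_estimators", "max_features", "max_depth",
               "min_samples_split", "min_samples_leaf", "bootstrap"]
params = rf.get_params()
pprint({name: params[name] for name in tuned_names})
```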

To use RandomizedSearchCV, we first need to create a parameter grid to sample from during fitting:

Parameters from the reference GitHub repository:
- n_estimators = [100,200,300,400]
- max_features = not included
- max_depth = [20,22,24]
- min_samples_split = [2,4,6]
- min_samples_leaf = not included
- bootstrap = not included
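
The size of this search space is easy to check. A minimal sketch, using scikit-learn's `ParameterGrid` to count the combinations of the three lists above (the variable names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [20, 22, 24],
    "min_samples_split": [2, 4, 6],
}

# ParameterGrid enumerates every combination of the listed values
n_combinations = len(ParameterGrid(grid))
print(n_combinations)  # 4 * 3 * 3 = 36
```

With only 36 distinct combinations, a randomized search cannot evaluate more than 36 different settings, whatever value of `n_iter` is requested.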


In [6]:
from sklearn.model_selection import RandomizedSearchCV

In [7]:
def creatingRandomGrid():
    # Number of trees in random forest
    n_estimators = [100,200,300,400]
    # Maximum number of levels in tree
    max_depth = [20,22,24]
    # Minimum number of samples required to split a node
    min_samples_split = [2,4,6]
    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
    return random_grid

In [8]:
def prepToFindOptimalHyperParams(random_grid):
    # Use the random grid to search for best hyperparameters
    # First create the base model to tune: a classifier, to match the model trained later
    rf = RandomForestClassifier()

    # Random search of parameters, using 3-fold cross-validation and all available cores.
    # n_iter = 100 exceeds the 36 combinations in the grid, so at most those 36 are evaluated
    rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                                   n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
    return rf_random

Finally, fit the RandomizedSearchCV object to the data frames containing features and labels and print the optimal hyperparameter values.
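
The fit-and-read-back step can be sketched end to end as follows. This is an illustration under stated assumptions: synthetic data, a deliberately tiny grid and `n_iter=4` so that it runs quickly; none of these values come from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training features and labels (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Tiny illustrative grid; the notebook's real grid is larger
small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5],
    "min_samples_split": [2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4,         # sample 4 of the 8 possible combinations
    cv=3,             # 3-fold cross-validation per combination
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

After fitting, `best_params_` is a plain dict keyed by parameter name, which is what the next cell returns.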

In [9]:
def findBestHyperParameters(rf_random, x_train_new, y_train_new):
    # Fit the random search model
    print(len(x_train_new), len(y_train_new))
    rf_random.fit(x_train_new, y_train_new)
    bestParamsDict = rf_random.best_params_
    return bestParamsDict

# Train the Random Forest classifier

In [10]:
def trainAndTestRandomForest(_max_depth,
                            _min_samples_split,
                            _n_estimators,x_train_new,y_train_new):
    # Build the classifier with the tuned hyperparameters
    clf = RandomForestClassifier(max_depth=_max_depth, 
                                 min_samples_split = _min_samples_split, 
                                 n_estimators = _n_estimators)
    # Train Random Forest Classifier
    clf = clf.fit(x_train_new,y_train_new)
    return clf

# Test the model and find out its accuracy
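
Accuracy is the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the two lists are invented for illustration) shows that `accuracy_score` agrees with the manual calculation:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Manual calculation: correct predictions divided by total predictions
manual_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

acc = accuracy_score(y_true, y_pred)
print(acc, manual_acc)  # 0.8 0.8
```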

In [11]:
from sklearn import metrics

def tellAcurracyOfModel(clf, x_test, y_test):
    # Predict the response for the test dataset
    y_pred = clf.predict(x_test)
    # Model accuracy: how often is the classifier correct?
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Main Function

In [12]:
def main(network_data):

    x_train, x_test, y_train, y_test = prep_training(network_data)

    newDict = splitIntoTrainAndCV(x_train, y_train)
    x_train_new = newDict['x_train_new']
    x_cv = newDict['x_cv']
    y_train_new = newDict['y_train_new']
    y_cv = newDict['y_cv']

    random_grid = creatingRandomGrid()
    pprint(random_grid)
    rf_random = prepToFindOptimalHyperParams(random_grid)
    bestParamsDict = findBestHyperParameters(rf_random, x_train_new, y_train_new)

    # Read the tuned values out of the dict and pass them to the training function
    _n_estimators = bestParamsDict['n_estimators']
    _min_samples_split = bestParamsDict['min_samples_split']
    _max_depth = bestParamsDict['max_depth']

    clf = trainAndTestRandomForest(_max_depth, _min_samples_split, _n_estimators, x_train_new, y_train_new)
    # x_test and y_test are local to main, so they are passed in explicitly
    tellAcurracyOfModel(clf, x_test, y_test)

# Classification with RF after MinMax Scaling 

In [13]:
network_data = load_data('data_minmax.csv')
main(network_data)

FileNotFoundError: [Errno 2] No such file or directory: 'data_minmax.csv'

# Classification with RF after MinMax Scaling + Dimension Reduction (using PCA)

In [None]:
network_data = load_data('dataset_minmax_pca.csv')
main(network_data)

# Classification with RF after MinMax Scaling + Correlation analysis

In [None]:
network_data = load_data('dataset_minmax_corr.csv')
main(network_data)