# Random Forest

- Random Forest is an ensemble of many decision trees.
- Here it is used as a classification algorithm (it can also be used for regression).

Why do we need Random Forest over Decision Trees?
- Decision Trees are easy to build, use and interpret, but a single tree is often inaccurate.
- DTs tend to overfit the training data, so they generalise poorly to unseen data and the model may not work as desired.
- Random Forest = simplicity of DTs + much better accuracy, obtained by combining the predictions of many trees.
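
The points above can be checked with a small experiment. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the real dataset; the variable names are illustrative and the exact scores will depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the pre-processed .csv (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# A single fully grown tree versus a forest of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_train_acc = tree.score(X_tr, y_tr)
tree_test_acc = tree.score(X_te, y_te)
forest_test_acc = forest.score(X_te, y_te)

print(f"tree   train accuracy: {tree_train_acc:.3f}")
print(f"tree   test accuracy:  {tree_test_acc:.3f}")
print(f"forest test accuracy:  {forest_test_acc:.3f}")
```

A large gap between the tree's train and test accuracy is the overfitting described above; the forest's test accuracy is the number to compare it against.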

## Input

1. .csv - produced by pre_processing.ipynb
2. The pre-processed input data has had the following techniques applied:
    #TODO

## Output/Analysis

1. Visualising the accuracy of RF with k-fold cross-validation.
2. Comparing the accuracy of the RF model with and without PCA.
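
As a rough illustration of the first item, per-fold accuracy can be collected with `cross_val_score`. This is a sketch only: the data is synthetic, and the choice of 5 stratified folds and 50 trees is an assumption, not a setting taken from this notebook. Plotting is left out; `fold_scores` is the array one would visualise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 5 stratified folds keep the class balance similar in every fold.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

fold_scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="accuracy")

for i, score in enumerate(fold_scores, start=1):
    print(f"fold {i}: accuracy = {score:.3f}")
print(f"mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```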

In [9]:
import numpy as np

# Split the input file into test and train datasets

I/P: dataframe

O/P: x_train, y_train, x_test, y_test
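
One way this step could look is sketched below with `train_test_split`. The toy dataframe, the column name `label`, and the 80/20 ratio are all hypothetical; the real column names and ratio depend on the output of pre_processing.ipynb.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the pre-processed .csv (assumption).
df = pd.DataFrame(
    {"f1": range(20), "f2": range(20, 40), "label": [0, 1] * 10}
)

# "label" is a hypothetical name for the target column.
features = df.drop(columns=["label"])
labels = df["label"]

# Hold out 20% of the rows for testing, keeping class proportions.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(x_train.shape, x_test.shape)
```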

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Split the train dataset into train and cross-validation datasets

I/P: x_train, y_train

O/P: x_train_new, x_cv, y_train_new, y_cv
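
A second call to `train_test_split` is one possible way to carve out the cross-validation set. In this sketch the arrays are toy data and the 75/25 ratio is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy training data standing in for x_train / y_train (assumption).
x_train = np.arange(40).reshape(20, 2)
y_train = np.array([0, 1] * 10)

# Keep 25% of the training rows aside as a cross-validation set.
x_train_new, x_cv, y_train_new, y_cv = train_test_split(
    x_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

print(x_train_new.shape, x_cv.shape)
```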

# Hyperparameter Tuning for Random Forest

The hyperparameter tuning below follows these references:
1. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
2. https://medium.com/@ODSC/optimizing-hyperparameters-for-random-forest-algorithms-in-scikit-learn-d60b7aa07ead

In [11]:
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint

In [12]:
rf = RandomForestRegressor(random_state = 42)

In [13]:
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


Rather than tuning all of the parameters above, we will focus on the following set of hyperparameters:
1. n_estimators = number of trees in the forest
2. max_features = max number of features considered for splitting a node
3. max_depth = max number of levels in each decision tree
4. min_samples_split = min number of data points required in a node before the node is split
5. min_samples_leaf = min number of data points allowed in a leaf node
6. bootstrap = method for sampling data points (with or without replacement)

To use RandomizedSearchCV, we first need to create a parameter grid to sample from during fitting:

In [14]:
from sklearn.model_selection import RandomizedSearchCV

In [15]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

In [18]:
# Number of features to consider at every split
max_features = ['auto', 'sqrt']

In [19]:
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

In [20]:
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

In [21]:
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

In [22]:
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [23]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [24]:
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


In [25]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()

In [26]:
# Random search of parameters, using 3-fold cross-validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2,
                               random_state=42, n_jobs=-1)
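
To get a feel for how much of the grid a random search with `n_iter = 100` actually covers, the grid size can be counted with `ParameterGrid`. The sketch below rebuilds the same grid as above; the names `total_combinations`, `coverage`, and `search_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cells above.
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["auto", "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Full grid size: 10 * 2 * 12 * 3 * 3 * 2 combinations.
total_combinations = len(ParameterGrid(random_grid))

n_iter, cv = 100, 3
coverage = n_iter / total_combinations  # share of the grid that is sampled
search_fits = n_iter * cv               # model fits during the search itself

print(f"total combinations: {total_combinations}")
print(f"sampled share:      {coverage:.1%}")
print(f"fits during search: {search_fits}")
```

The search therefore tries only a small share of the grid, which is the trade-off random search makes to keep the run time manageable.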

Finally, fit the RandomizedSearchCV object to the data frames containing features and labels and print the optimal hyperparameter values.

In [27]:
# Fit the random search model
# Assumes x_train and y_train come from the train/test split described above
#TODO
rf_random.fit(x_train, y_train)
rf_random.best_params_
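
The fitting step can be tried end to end on synthetic data. This is a minimal sketch under several assumptions: it uses `RandomForestClassifier` because the task in this notebook is classification (the cells above use `RandomForestRegressor`), the grid is deliberately tiny so the search finishes quickly, and `max_features` uses `'sqrt'` and `'log2'` because recent scikit-learn releases no longer accept `'auto'`. The names `x_demo`, `y_demo`, and `small_grid` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training features and labels (assumption).
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A much smaller grid than the one above, to keep the demo fast.
small_grid = {
    "n_estimators": [10, 20, 40],
    "max_features": ["sqrt", "log2"],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=small_grid,
    n_iter=5,         # sample 5 combinations from the grid
    cv=3,             # 3-fold cross-validation per combination
    random_state=42,
)
search.fit(x_demo, y_demo)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refitted on all of the data passed to `fit` with the best parameters found.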

# Classification of input data with RF 

In [12]:
# Classifier Name
CLF_NAME = RandomForestClassifier

In [13]:
clf = RandomForestClassifier(max_depth=2, random_state=0)

In [14]:
clf.fit(X, y)

RandomForestClassifier(max_depth=2, random_state=0)

In [15]:
print(clf.predict([[0, 0, 0, 0]]))

[1]
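
The cells above do not show where `X` and `y` come from. A self-contained version might look like the sketch below, assuming a synthetic four-feature dataset from `make_classification`; the real notebook would use the pre-processed training data instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-feature dataset standing in for the real X and y (assumption).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
    random_state=0, shuffle=False,
)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict the class of a single sample with four feature values.
prediction = clf.predict([[0, 0, 0, 0]])
print(prediction)
```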


# Classification with RF after Dimension Reduction (using PCA)
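
One possible shape for this comparison is sketched below: the same classifier is scored with and without a PCA step inside a `Pipeline`. Everything specific here is an assumption, including the synthetic data, the scaling step, the choice of 5 components, and the 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pre-processed dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Baseline: Random Forest on the original features.
rf_plain = RandomForestClassifier(n_estimators=50, random_state=0)

# Variant: scale, project onto 5 principal components, then classify.
rf_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores_plain = cross_val_score(rf_plain, X_demo, y_demo, cv=5, scoring="accuracy")
scores_pca = cross_val_score(rf_pca, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"without PCA: mean accuracy = {scores_plain.mean():.3f}")
print(f"with PCA:    mean accuracy = {scores_pca.mean():.3f}")
```

Putting PCA inside the pipeline means it is refitted on the training part of each fold, so no information from the held-out fold leaks into the projection.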