## What is Random Forest?


In this tutorial, we see how to train and tune a Random Forest using scikit-learn.

source:
1. https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
2. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

First, we load our data. The iris dataset has four feature columns (the parameters), and a separate target array holds the dependent value (the species label).

https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

In [2]:
import numpy as np
import pandas as pd
import os
from sklearn.datasets import load_iris
# print(os.listdir("./input"))

Now we load our dataset.
Note that the iris dataset is returned as a dictionary-like object; we set x and y to its data and target.
Then we split the data into train and test sets.

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # print(iris)
x = iris.data       # feature matrix; print(x.shape)
y = iris.target     # class labels; print(y.shape)

# Hold out 30% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

The output below clearly shows that there are four features for each Iris flower data point.

In [3]:
print(x_train[:5])

[[4.6 3.1 1.5 0.2]
 [6.4 2.8 5.6 2.2]
 [5.6 2.7 4.2 1.3]
 [6.1 2.6 5.6 1.4]
 [5.6 3.  4.5 1.5]]
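
As a quick check of what the four columns mean, the names stored with the dataset can be printed. This is a minimal sketch that reloads the data on its own; it only assumes the `feature_names` and `target_names` attributes of the object returned by `load_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measurement columns and of the three species
print(iris.feature_names)
print(iris.target_names)

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = iris.data.shape
print(n_samples, n_features)
```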


### Normalization (Scaling)
Next, we fit the scaler on the training dataset and apply the same transform to the test dataset.
After fitting, the scale and offset computed from the training data are stored in the scaler. We reuse them on the test dataset with scaler.transform.

In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform( x_train )
x_test = scaler.transform( x_test )
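
To see what the fitted scaler stores, the sketch below inspects its `mean_` and `scale_` attributes and reproduces `transform` by hand. It is self-contained, so it reloads and splits the data itself; the variable names and `random_state=0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
# random_state=0 is an arbitrary seed so the sketch is repeatable
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learns mean and std from the training set
x_te_scaled = scaler.transform(x_te)      # reuses the training mean and std

print(scaler.mean_)   # per-feature offset learned from the training data
print(scaler.scale_)  # per-feature scale (standard deviation) learned from the training data

# transform is equivalent to subtracting the stored mean and dividing by the stored scale
manual = (x_te - scaler.mean_) / scaler.scale_
print(np.allclose(manual, x_te_scaled))
```

Note that the test set is never used to compute the mean or the scale, which avoids leaking test information into training.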

### Cross Validation with Regularization
Every model has its own hyperparameters. These parameters can change the training results significantly, so we tune them via cross validation.

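One method is called grid search cross validation. It exhaustively evaluates every combination of the candidate values (the grid). The number of combinations is the product of the number of candidate values per hyperparameter; for example, two hyperparameters with n values each give n squared combinations.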
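
The size of a grid can be checked with scikit-learn's `ParameterGrid`, which enumerates every combination. The grid below is a hypothetical toy example, not the one used in this tutorial.

```python
from math import prod
from sklearn.model_selection import ParameterGrid

# Hypothetical toy grid: 3 x 3 x 2 candidate values
toy_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, None],
    "bootstrap": [True, False],
}

# ParameterGrid enumerates every combination of the listed values
n_combinations = len(ParameterGrid(toy_grid))

# The count equals the product of the list lengths
expected = prod(len(values) for values in toy_grid.values())
print(n_combinations, expected)
```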

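The grid is the parameter set: it lists every hyperparameter we want to tune together with the range of values we would like to test. In logistic regression, for example, there are two hyperparameters we could tune: the regularization coefficient C and the regularization method (L1 or L2). For the random forest below, we tune the number of trees, the number of features considered at each split, the maximum depth, the minimum number of samples per split and per leaf, and whether to bootstrap.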

Finally, one uses K-fold cross validation to score each candidate model. The cv argument of GridSearchCV (and of RandomizedSearchCV, which is used below) sets the number of folds. An average score over the folds is then computed for each candidate. The best candidate and its hyperparameters are reported at the end.

source:
1. https://www.youtube.com/watch?v=IXPgm1e0IOo
2. https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
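
The grid search procedure described above can be sketched on a deliberately small grid so that it runs quickly. This example uses a `RandomForestClassifier` with accuracy scoring as an assumption for illustration; the grid values, `cv=5`, and `random_state=0` are arbitrary choices and not the settings used later in this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

x, y = load_iris(return_X_y=True)

# Hypothetical small grid: 2 x 2 = 4 candidates
small_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=5,                # number of cross-validation folds
    scoring="accuracy",
)
search.fit(x, y)

# One averaged score per candidate
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 3))

# The best candidate and its averaged score
print("best:", search.best_params_, round(search.best_score_, 3))
```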

In [7]:
# Simply remove warnings
import warnings
warnings.filterwarnings("ignore")
from pprint import pprint
# Grid search cross validation
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split.
# 1.0 means all features; it stands in for 'auto', which recent scikit-learn releases no longer accept.
max_features = [1.0, 'sqrt']
# Maximum number of levels in a tree (None means no limit)
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Whether each tree is trained on a bootstrap sample
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Random search over the grid: sample 100 parameter combinations,
# score each with 3-fold cross validation, and use all available cores
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(x_train, y_train)
print("tuned hpyerparameters :(best parameters) ",rf_random.best_params_)
# For a regressor, best_score_ is the mean cross-validated R^2 of the best candidate, not a classification accuracy
print("accuracy :",rf_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   11.0s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   44.9s


tuned hpyerparameters :(best parameters)  {'n_estimators': 400, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.4min finished


In [8]:
print("accuracy :",rf_random.best_score_)

accuracy : 0.9305327755100079
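
Where `best_score_` comes from can be checked against `cv_results_`. This is a scaled-down sketch of the search above, under stated assumptions: the search space, `n_iter=5`, and the seeds are hypothetical and chosen only to keep the run short. With a `RandomForestRegressor` and default scoring, the value is a mean R^2 over the folds.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

x, y = load_iris(return_X_y=True)
# Splitting also shuffles the samples, which are stored sorted by class
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

# Hypothetical small search space for a fast run
small_space = {
    "n_estimators": [10, 20, 50],
    "max_depth": [2, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_space,
    n_iter=5,         # number of sampled combinations
    cv=3,             # number of folds
    random_state=42,
)
search.fit(x_tr, y_tr)

# best_score_ is the mean test score of the best-ranked candidate
mean_scores = search.cv_results_["mean_test_score"]
best_row = search.best_index_
print(search.best_params_)
print(search.best_score_, mean_scores[best_row])
```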


### Predictions
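Finally, we refit a random forest with the best hyperparameters found above and evaluate it on the test set. For a RandomForestRegressor, score returns the coefficient of determination R^2.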

In [10]:
# bootstrap must be the boolean True, not the string 'True'
rf = RandomForestRegressor(n_estimators=400, min_samples_split=10, min_samples_leaf=4, max_features='sqrt', max_depth=90, bootstrap=True)
rf.fit(x_train, y_train)
print("score", rf.score(x_test, y_test))

score 0.9544023255938269
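
Because the iris target is a class label, one possible variant is to swap in `RandomForestClassifier`, so that `predict` returns class labels and the score is a true accuracy. This is a sketch of that alternative, not the method used above; it reuses the tuned hyperparameter values purely for illustration, and the stratified split and seeds are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
# stratify=y keeps the class proportions equal in both splits (an assumption of this sketch)
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=400,
    min_samples_split=10,
    min_samples_leaf=4,
    max_features="sqrt",
    max_depth=90,
    bootstrap=True,
    random_state=0,
)
clf.fit(x_tr, y_tr)

# Predicted class labels (0, 1, or 2) for the held-out samples
y_pred = clf.predict(x_te)
print(y_pred[:10])

# Fraction of test samples classified correctly
test_accuracy = accuracy_score(y_te, y_pred)
print("test accuracy:", test_accuracy)
```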


### end

### P.S.1

One could manually find the cv scores with the following lines. This example shows that with C=1 and the L2 norm as regularization in logistic regression, the average score of the 10-fold cross validation is 94.25%.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Mean accuracy over 10 cross-validation folds
print(cross_val_score(LogisticRegression(C=1, penalty="l2"), x_train, y_train, scoring='accuracy', cv=10).mean())

0.9425252525252527
