<center> <h2> Random Forest </h2> </center>

<center><img src="pics/RF.png" width=600></center>

<h3> <font color="blue"> Pros </font> </h3>

1. A single Decision Tree tends to overfit if it is not pruned properly. Random Forest reduces this risk: the votes (classification) or estimates (regression) are counted or averaged across many trees. The trees are decorrelated because each split considers only a random subset of the features.
2. Since this is a "wisdom of the crowd" (ensemble) method, it is less sensitive to small changes in the dataset.
3. Works well with **high-dimensional** data.
4. Parallelization : The trees are built independently, so training is easy to split across many cores or machines, which speeds up computation.
5. Like Decision Trees, Random Forests handle outliers and non-linear data very well.
6. Unbalanced data can be handled well, though the same is true of other tree-based approaches.

<h3> <font color = "red"> Cons </font> </h3>

1. Building many trees takes time and can be expensive in terms of memory consumption.
2. Random Forest can still **overfit**, so be careful with the **hyper-parameters**.
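
One way to spot the overfitting mentioned above is to compare the training accuracy with the out-of-bag (OOB) estimate that scikit-learn computes when `oob_score=True`. The sketch below is only an illustration: it uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `rf_oob` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=0)

# oob_score=True scores each row using only the trees that did not see it
# in their bootstrap sample.
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_oob.fit(X_demo, y_demo)

train_accuracy = rf_oob.score(X_demo, y_demo)
oob_accuracy = rf_oob.oob_score_
print('Train accuracy :', train_accuracy)
print('OOB accuracy   :', oob_accuracy)
```

A large gap between the two numbers suggests the forest is memorising the training rows, which is a hint to limit depth or raise `min_samples_split`.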

<h3> <font color = "green"> How it works </font> </h3>

<p> It is a simple model if you understand how Decision Trees work. A Random Forest is a collection of many trees,
    where both features AND data points are selected randomly to create "de-correlated" trees. </p>
<p> Bootstrapping (random sampling of the data points with replacement) is done for each tree, so every tree is trained on a different sample of the rows. </p>
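
The idea can be sketched by hand with plain Decision Trees. This is a simplified illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the data is synthetic, the names (`X_demo`, `trees`, `forest_pred`) are made up for the example, and the trees are combined with a hard majority vote, whereas scikit-learn averages the predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only for illustration (an assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features='sqrt' makes every split look at a random subset of features.
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([t.predict(X_demo) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
train_accuracy = (forest_pred == y_demo).mean()
print('Training accuracy of the hand-made forest :', train_accuracy)
```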

<h3> <span style="background:yellow"> Hyper-parameters </span> </h3>

1. **Maximum Depth** : Deeper trees fit the training data more closely, so be careful with overfitting.
2. **Number of Trees** : Generally, the higher the number the better, though the gains level off while the computational cost keeps growing.
3. **Max Features** : Size of the random subset of features considered when looking for the best split at each node.
4. **Split Criterion** : Gini or Entropy. Either normally gives good results; the better choice is problem specific.
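
As a rough guide, the four hyper-parameters above map onto `RandomForestClassifier` arguments as shown below. The values are illustrative placeholders rather than recommendations, and `rf_demo` is a name made up for this example.

```python
from sklearn.ensemble import RandomForestClassifier

rf_demo = RandomForestClassifier(
    max_depth=10,         # Maximum Depth: None (the default) grows trees until the leaves are pure
    n_estimators=200,     # Number of Trees
    max_features='sqrt',  # Max Features: sqrt(n_features) candidates per split
    criterion='gini',     # Split Criterion: 'gini' or 'entropy'
    n_jobs=-1,            # build the trees in parallel on all available cores
    random_state=42,      # fixed seed so the run is reproducible
)
print(rf_demo.get_params()['max_depth'])
```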

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
credit = pd.read_csv('/Users/bt/Documents/GITHUB/creditcard.csv')

In [3]:
credit.drop('Time', axis=1, inplace=True)

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [5]:
X = credit.drop('Class', axis=1)
y = credit[['Class']]

X_train, X_valid, y_train, y_valid = train_test_split(X,y,stratify=y,test_size=0.3, shuffle=True)

In [6]:
sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_valid = sc.transform(X_valid)

In [7]:
model = RandomForestClassifier(n_estimators=100,max_features='sqrt')

In [8]:
model.fit(X_train,y_train.values.ravel())

RandomForestClassifier(max_features='sqrt')

In [9]:
predictions = model.predict(X_valid)

In [20]:
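# confusion_matrix orders rows (actual) and columns (predicted) by label: 0 ('No') first, then 1 ('Yes')
print(pd.DataFrame(confusion_matrix(y_valid,predictions), columns=['Pred:No','Pred:Yes'], index=['Actual:No','Actual:Yes']))

            Pred:No  Pred:Yes
Actual:No     85286         9
Actual:Yes       34       114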


In [21]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score

In [27]:
print('Accuracy Score : ' + str(accuracy_score(y_valid,predictions)))
print('Recall Score : ' + str(recall_score(y_valid,predictions)))
print('Precision Score : ' + str(precision_score(y_valid,predictions)))
print('F1 Score : ' + str((f1_score(y_valid,predictions)))) 

Accuracy Score : 0.9994967405170698
Recall Score : 0.7702702702702703
Precision Score : 0.926829268292683
F1 Score : 0.8413284132841328
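
The four scores above follow directly from the confusion matrix counts. The short check below recomputes them by hand, treating class 1 as the positive class (the default for these scikit-learn metrics); the variable names are made up for the example.

```python
# Counts read off the confusion matrix printed earlier:
# scikit-learn lays it out as [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp, fn, tp = 85286, 9, 34, 114

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that were right
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy  :', accuracy)
print('Recall    :', recall)
print('Precision :', precision)
print('F1        :', f1)
```

With so few positives, accuracy is dominated by the 85286 true negatives, which is why recall and precision are the more informative numbers here.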


In [12]:
from sklearn.model_selection import cross_validate

In [13]:
model = RandomForestClassifier()

In [15]:
cv_results = cross_validate(model,X_train,y_train.values.ravel(),cv=5,scoring='recall')

In [16]:
print(np.mean(cv_results['test_score']))

0.790537084398977
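
`cross_validate` can also report several metrics in one pass by giving `scoring` a list. A minimal sketch on synthetic, imbalanced data (an assumption for illustration; `X_demo`, `y_demo` and `cv_demo` are made-up names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data with roughly 10% positives, used only for illustration.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

cv_demo = cross_validate(RandomForestClassifier(n_estimators=50, random_state=0),
                         X_demo, y_demo, cv=5,
                         scoring=['recall', 'precision'])

# With a list of scorers, the result keys are named 'test_<metric>'.
for metric in ('test_recall', 'test_precision'):
    print(metric, cv_demo[metric].mean())
```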


In [29]:
from sklearn.model_selection import RandomizedSearchCV

In [28]:
rf = RandomForestClassifier()

In [30]:
from pprint import pprint

In [32]:
pprint(rf.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [40]:
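n_estimators = [int(x) for x in np.linspace(start=400,stop=600,num=3)]
# 'auto' means the same as 'sqrt' for classifiers; it was removed in scikit-learn 1.3, so drop it on newer versions
max_features = ['auto','sqrt']
max_depth = [int(x) for x in np.linspace(20,100,num=5)]
min_samples_split = [10,20]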

In [41]:
random_grid = {'n_estimators' : n_estimators,
              'max_features' : max_features,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split}

In [42]:
pprint(random_grid)

{'max_depth': [20, 40, 60, 80, 100],
 'max_features': ['auto', 'sqrt'],
 'min_samples_split': [10, 20],
 'n_estimators': [400, 500, 600]}


In [43]:
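# Samples 10 parameter combinations by default (n_iter=10); with no scoring argument, candidates are ranked by accuracy
rf_random_grid = RandomizedSearchCV(rf, param_distributions=random_grid, cv=5, n_jobs=-1)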

In [None]:
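# Pass y as a 1-D array, as in the earlier fit call, to avoid a DataConversionWarning
rf_random_grid.fit(X_train,y_train.values.ravel())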
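
Once a search like this has been fitted, the chosen settings and the refitted model can be read from `best_params_`, `best_score_` and `best_estimator_`. The sketch below shows this on a deliberately tiny problem so that it runs quickly. The synthetic data, the small grid, `scoring='recall'` and all names are assumptions made for illustration, not the settings used in the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced dataset, used only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

# A deliberately small grid so the example finishes in seconds.
small_grid = {'n_estimators': [20, 40],
              'max_depth': [5, 10],
              'min_samples_split': [10, 20]}

search_demo = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                 param_distributions=small_grid,
                                 n_iter=4, cv=3, scoring='recall',
                                 random_state=0)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)   # the sampled combination with the best mean CV score
print(search_demo.best_score_)    # mean cross-validated recall of that combination
best_model = search_demo.best_estimator_  # refitted on all of X_demo by default
```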