# Random Forest Tuning & Cross-validation Lab (V-304)

This notebook demonstrates how to build a random forest ensemble model in Python with scikit-learn. Topics covered include:

*   Relevant import statements
*   Encoding categorical features as dummy variables
*   Stratification during data splitting
*   Fitting a model
*   Pickling a model
*   Using GridSearchCV to cross-validate the model and tune the following hyperparameters:
    - max_depth
    - max_features
    - min_samples_split
    - n_estimators
    - min_samples_leaf
*   Model evaluation using precision, recall, and F1 score



## Import statements

These packages are needed for the modeling and evaluation code below.

In [6]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
#plt.rcParams["figure.figsize"] = (15, 11)
pd.set_option('display.max_columns', None)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

import pickle

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Data preparation

In this step, we'll read in the data and prepare it for modeling.          

In [7]:
# Read in data
df_original = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Churn_Modelling.csv')
df_original.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [8]:
# Drop identifier columns (no predictive value) and the sensitive Gender column
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], axis=1)
churn_df.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,41,1,83807.86,1,0,1,112542.58,0
2,502,France,42,8,159660.8,3,1,0,113931.57,1
3,699,France,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,43,2,125510.82,1,1,1,79084.1,0


In [9]:
# Dummy encode categoricals
churn_df2 = pd.get_dummies(churn_df, drop_first=True)
churn_df2.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1
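To make the effect of `drop_first` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not the churn data. Note that recent pandas versions return boolean dummy columns rather than the 0/1 integers shown in the output above; passing `dtype=int` is one way to get integers.

```python
import pandas as pd

# Toy frame with hypothetical values and one categorical column
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Age": [42, 41, 39, 43],
})

# drop_first=True drops the first category in sorted order ("France"),
# so France becomes the implicit baseline (all dummy columns are 0/False)
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids a redundant column: a row that is neither Germany nor Spain must be France.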


In [10]:
# Split data
y = churn_df2["Exited"]

X = churn_df2.copy()
X = X.drop("Exited", axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
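As a quick illustration of what `stratify=y` does, the sketch below splits a synthetic, imbalanced target (an assumption for demonstration, not the churn data) and compares the positive-class rate in each partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20% positives (hypothetical data)
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratification, the class balance is preserved in both partitions
print(y_demo.mean(), y_a.mean(), y_b.mean())
```

Without stratification, a random split of an imbalanced target can leave the test set with a noticeably different class balance than the training set.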

## Modeling

### Cross-validated hyperparameter tuning

In [11]:
%%time

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

rf = RandomForestClassifier(random_state=0)

# Multi-metric scoring is documented as a list, tuple, or dict (a set may be rejected)
scoring = ['accuracy', 'precision', 'recall', 'f1']

rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')

# Fitting is commented out here; the fitted model is loaded from a pickle in a later cell
#rf_cv.fit(X_train, y_train)

CPU times: user 196 µs, sys: 37 µs, total: 233 µs
Wall time: 242 µs
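The grid search is expensive because every parameter combination is fit once per fold. A minimal sketch of the arithmetic, using scikit-learn's `ParameterGrid` to count the candidates in the same grid (the `_demo` names are illustrative only):

```python
from sklearn.model_selection import ParameterGrid

cv_params_demo = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'max_features': [2, 3, 4],
    'n_estimators': [75, 100, 125, 150],
}

n_candidates = len(ParameterGrid(cv_params_demo))  # 5 * 3 * 3 * 3 * 4
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 540 2700
```

With `refit` set, one additional fit of the best combination follows the search.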


In [12]:
path = '/content/drive/MyDrive/Colab Notebooks/'

In [13]:
# # Pickle the model
# with open(path+'rf_cv_model.pickle', 'wb') as to_write:
#   pickle.dump(rf_cv, to_write)

In [14]:
# Open pickled model
with open(path+'rf_cv_model.pickle', 'rb') as to_read:
  rf_cv = pickle.load(to_read)
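The pickle round trip can be checked without touching Drive. The following sketch serializes a small forest to bytes in memory and confirms the restored model makes the same predictions. The data is synthetic and the model size is an arbitrary choice for speed.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem (hypothetical data, not the churn set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file
blob = pickle.dumps(model)
restored = pickle.loads(blob)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two cautions apply to pickled models in general: load them with the same scikit-learn version that wrote them where possible, and never unpickle files from an untrusted source.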

In [15]:
rf_cv.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 125}

In [16]:
rf_cv.best_score_

0.580528563620339

In [17]:
def make_results(model_name, model_object):
  '''
  Accepts a model name (any string of your choice) and
  a fitted GridSearchCV object.
  
  Returns a pandas df with the F1, recall, precision, and accuracy scores
  for the model with the best mean F1 score across all validation folds.  
  '''

  # Get all the results from the CV and put them in a df
  cv_results = pd.DataFrame(model_object.cv_results_)

  # Isolate the row of the df with the max(mean f1 score)
  best_estimator_results = cv_results.iloc[cv_results['mean_test_f1'].idxmax(), :]

  # Extract accuracy, precision, recall, and f1 score from that row
  f1 = best_estimator_results.mean_test_f1
  recall = best_estimator_results.mean_test_recall
  precision = best_estimator_results.mean_test_precision
  accuracy = best_estimator_results.mean_test_accuracy
  
  # Create table of results
  # (DataFrame.append was removed in pandas 2.0, so build the frame directly)
  table = pd.DataFrame({'Model': [model_name],
                        'F1': [f1],
                        'Recall': [recall],
                        'Precision': [precision],
                        'Accuracy': [accuracy]
                        }
                       )
  
  return table


In [18]:
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
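An alternative way to locate the best row is the `best_index_` attribute that `GridSearchCV` exposes when `refit` is set. The sketch below is an illustration on synthetic data with a deliberately tiny grid, not a replacement for `make_results`; it checks that the row's mean F1 matches `best_score_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2, None], "n_estimators": [10, 20]},
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=3,
    refit="f1",
)
search.fit(X_demo, y_demo)

# best_index_ points at the cv_results_ row chosen by the refit metric
cv_results = pd.DataFrame(search.cv_results_)
best_row = cv_results.loc[search.best_index_]

summary = pd.DataFrame([{
    "Model": "demo forest",
    "F1": best_row["mean_test_f1"],
    "Recall": best_row["mean_test_recall"],
    "Precision": best_row["mean_test_precision"],
    "Accuracy": best_row["mean_test_accuracy"],
}])
print(summary)
```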


In [19]:
results = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/results1.csv', index_col=0)
results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


In [20]:
results = pd.concat([rf_cv_results, results])
results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


### Hyperparameters tuned with separate validation set

In [21]:
# Create separate validation data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, 
                                            stratify=y_train, random_state=10)

In [22]:
# Mark each training row: 0 = belongs to the validation fold, -1 = always in training
split_index = [0 if x in X_val.index else -1 for x in X_train.index]

In [23]:
from sklearn.model_selection import PredefinedSplit

# PredefinedSplit provides train/test indices to split data into training
# and test sets using a predefined scheme.
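To see how the `-1`/`0` encoding is interpreted, here is a minimal sketch with a hand-written fold array (the six entries are arbitrary). Entries of `-1` are never used for validation, and entries of `0` form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 = always in training; 0 = member of validation fold 0
test_fold = np.array([-1, -1, 0, -1, 0, -1])
ps = PredefinedSplit(test_fold)

splits = list(ps.split())
train_idx, val_idx = splits[0]

print(ps.get_n_splits())  # 1
print(train_idx)          # [0 1 3 5]
print(val_idx)            # [2 4]
```

Because there is only one split, a grid search using it evaluates each parameter combination once, against the same validation rows.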

In [24]:
cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

rf = RandomForestClassifier(random_state=0)

# Multi-metric scoring is documented as a list, tuple, or dict (a set may be rejected)
scoring = ['accuracy', 'precision', 'recall', 'f1']

custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(rf, cv_params, scoring=scoring, cv=custom_split, refit='f1')

In [25]:
%%time
#rf_val.fit(X_train, y_train)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs


In [26]:
# # Pickle the model
# with open(path+'rf_val_model.pickle', 'wb') as to_write:
#   pickle.dump(rf_val, to_write)

In [27]:
# Open pickled model
with open(path+'rf_val_model.pickle', 'rb') as to_read:
  rf_val = pickle.load(to_read)

In [28]:
rf_val.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 150}

In [29]:
rf_val_results = make_results('Random Forest Validated', rf_val)
results = pd.concat([rf_val_results, results])
results.sort_values(by=['F1'], ascending=False)

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
0,Random Forest Validated,0.57551,0.460784,0.766304,0.861333
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


In [30]:
results.to_csv(path+'results2.csv', index=False)

## Model selection and final results

We now have three models. If we've decided that we're done optimizing them, we can use the best one to predict on the test holdout data. We'll use the cross-validated random forest, which had the highest mean F1 score (0.5805) and whose best parameters set no depth limit (`max_depth=None`). If we were instead to use the model that was tuned against a separate validation set, we would first retrain it on the full training set (training + validation sets). Note that `GridSearchCV` with `refit='f1'` already refits the best parameter combination on all of the data passed to `.fit()`.

**NOTE**: _It might be tempting to see how all models perform on the test holdout data, and then to choose the one that performs best. While this **can** be done, it biases the final model, because you used your test data to go back and make an upstream decision. The test data should represent **unseen** data. In competitions, for example, you must submit your final model before receiving the test data._
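A final evaluation on the holdout set could look like the sketch below. It uses synthetic data and a plain `RandomForestClassifier` as stand-ins, since the fitted `rf_cv` object lives in a pickle file; in this notebook the analogous step would presumably call `rf_cv.best_estimator_.predict(X_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Stand-in for the selected model; the test set is touched only once, here
final_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = final_model.predict(X_te)

scores = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
}
print(scores)
```

Reporting all four metrics matters on imbalanced data, where accuracy alone can look strong even when recall on the minority class is poor.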