# Bank customer churn modeling - Part 4
## Random forest tuning & cross-validation


Code written by: Gabriela Novo de Oliveira

## 1. Overview


The modeling objective is to build and test a random forest model that uses banking data to predict whether a customer will churn. If a customer churns, it means they left the bank and took their business elsewhere. If we can predict customers who are likely to churn, we can take measures to retain them before they do. These measures could be promotions, discounts, or other incentives to boost customer satisfaction and, therefore, retention.

To complete this notebook I will:

* Perform feature engineering.
* Perform encoding of categorical features as dummies.
* Conduct stratification during data splitting.
* Fit a model.
* Perform model evaluation using precision, recall, and F1 score.
* Use `GridSearchCV` to cross-validate the model and tune the following hyperparameters:  
    - `max_depth`  
    - `max_features`  
    - `min_samples_split`
    - `n_estimators`  
    - `min_samples_leaf`  



I will be using `numpy` and `pandas` for data operations and `matplotlib` for plotting. I will import evaluation metrics from `sklearn.metrics`, `GridSearchCV` and `train_test_split` from `sklearn.model_selection`, and `RandomForestClassifier` from `sklearn.ensemble` to build, tune, and cross-validate a random forest.

## 2. Importing packages and libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# Show all columns so Jupyter doesn't truncate wide dataframes.
pd.set_option('display.max_columns', None)

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

from sklearn.ensemble import RandomForestClassifier

## 3. Data Analysis

In [4]:
# Read in data
df_original = pd.read_csv('Churn_Modelling.csv') # This is the original dataset

### 3.1. Feature engineering

#### 3.1.1. Feature selection

In this step, I'll prepare the data for modeling. I will begin by dropping the columns that I don't expect to offer any predictive signal: `RowNumber`, `CustomerId`, and `Surname`. Dropping them keeps them from introducing noise into the model.

I'll also drop the `Gender` column, because I don't want our model to make predictions based on gender.

In [5]:
# Create a new df that drops RowNumber, CustomerId, Surname, and Gender cols
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], 
                            axis=1)

In [6]:
churn_df.head()

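|   | CreditScore | Geography | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |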


#### 3.1.2. Feature transformation

Next, I'll dummy encode the `Geography` variable, which is categorical. I do this with the `pd.get_dummies()` function, setting `drop_first=True`. This replaces the `Geography` column with two new Boolean columns called `Geography_Germany` and `Geography_Spain`; the dropped category, France, becomes the baseline that is implied when both columns are `False`.

In [7]:
# Dummy encode categoricals (drop_first takes a boolean, not the string 'True')
churn_df2 = pd.get_dummies(churn_df, drop_first=True)
churn_df2.head()

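|   | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True |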
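To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is invented for illustration; only the `Geography` column name mirrors the dataset). It shows that the first category in sorted order is the one dropped.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Age": [42, 41, 39, 43],
})

# drop_first=True drops the first category in sorted order ("France"),
# which then acts as the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Rows where both dummy columns are false correspond to France, so no information is lost by dropping that column.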


#### 3.1.3. Split the data

I'll split the data into features and target variable, and then into training data (75%) and test data (25%) using the `train_test_split()` function. I'll include the `stratify=y` parameter so that the roughly 80/20 class ratio of the target variable is maintained in both the training and test datasets after splitting.

Lastly, I will set `random_state` so the split is reproducible.

In [9]:
# Define the y (target) variable
y = churn_df2["Exited"]

# Define the X (predictor) variables
X = churn_df2.copy()
X = X.drop("Exited", axis = 1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
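As a quick illustration of what `stratify` does, the sketch below uses a hypothetical target with exactly 80 zeros and 20 ones (not the bank data) and checks the class share on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 retained (0) and 20 churned (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# With stratification, both parts keep the 20% positive share.
print(len(y_a), y_a.mean())
print(len(y_b), y_b.mean())
```

Without `stratify`, the positive share in a small test set can drift noticeably from the overall share, which makes recall and precision estimates less stable.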

## 4. Modeling

### 4.1. Cross-validated hyperparameter tuning

The cross-validation process is the same as it was for the decision tree model. The only difference is that there will be more hyperparameters being tuned now. The steps are included below:

1. Instantiate the classifier (and set the `random_state`). 

2. Create a dictionary of hyperparameters to search over.

3. Create a set of scoring metrics to capture. 

4. Instantiate the `GridSearchCV` object. Pass as arguments:
  - The classifier (`rf`)
  - The dictionary of hyperparameters to search over (`cv_params`)
  - The set of scoring metrics (`scoring`)
  - The number of cross-validation folds you want (`cv=5`)
  - The scoring metric that you want GridSearch to use when it selects the "best" model (i.e., the model that performs best on average over all validation folds) (`refit='f1'`)

5. Fit the data (`X_train`, `y_train`) to the `GridSearchCV` object (`rf_cv`).


In [10]:
%%time

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

scoring = {'accuracy': 'accuracy', 
           'precision': 'precision', 
           'recall': 'recall', 
           'f1': 'f1'}

rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')

rf_cv.fit(X_train, y_train)

CPU times: user 12min 37s, sys: 7.84 s, total: 12min 45s
Wall time: 12min 50s
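The long run time follows from the size of the grid. A small sketch, using scikit-learn's `ParameterGrid` on the same `cv_params` dictionary, counts how many models the search has to fit:

```python
from sklearn.model_selection import ParameterGrid

cv_params = {'max_depth': [2, 3, 4, 5, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}

# Every combination of values is one candidate: 5 * 3 * 3 * 3 * 4
n_candidates = len(ParameterGrid(cv_params))

# Each candidate is fit once per fold, plus one final refit on all training data
n_folds = 5
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)
```

That is 540 candidates and 2,701 forest fits in total. If the run time is a concern, passing `n_jobs=-1` to `GridSearchCV` is one option for spreading the fits across CPU cores.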


I will use the model's `best_params_` attribute to check the hyperparameters that had the best average F1 score across all the cross-validation folds.

In [11]:
rf_cv.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 150}

To check the best average F1 score of this model on the validation folds, I will use the `best_score_` attribute. 

In [12]:
rf_cv.best_score_

0.5833023473561428
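To see where `best_score_` comes from, here is a small sketch on synthetic data (the `demo_cv` search and its tiny grid are assumptions for illustration, not the notebook's model). With multi-metric scoring and `refit='f1'`, `best_score_` should be the mean validation F1 of the row that `best_index_` points to in `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for the churn features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, weights=[0.8, 0.2], random_state=0
)

demo_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {"max_depth": [2, None]},
    scoring={"accuracy": "accuracy", "f1": "f1"},
    cv=3,
    refit="f1",
)
demo_cv.fit(X_demo, y_demo)

# Mean validation F1 of the winning candidate, read straight from cv_results_
best_row_f1 = demo_cv.cv_results_["mean_test_f1"][demo_cv.best_index_]
print(np.isclose(demo_cv.best_score_, best_row_f1))
```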

The best model had a mean F1 score of 0.5833 across the validation folds, which is not terrible. When I ran the grid search, I specified that I also wanted to capture precision, recall, and accuracy. I did this because an F1 score on its own is difficult to interpret; these other metrics are more directly interpretable, so they're worth knowing.

The following cell defines a helper function that extracts these scores from the fit `GridSearchCV` object and returns a pandas dataframe with all four scores from the model with the best average F1 score during validation.

In [13]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fit GridSearchCV model object.

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean F1 score across all validation folds.
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean f1 score)
    # (idxmax returns an index label, so select with .loc rather than .iloc)
    best_estimator_results = cv_results.loc[cv_results['mean_test_f1'].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'Model': [model_name],
                          'F1': [f1],
                          'Recall': [recall],
                          'Precision': [precision],
                          'Accuracy': [accuracy]
                         }
                        )

    return table

In [14]:
# Make a results table for the rf_cv model using above function
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results

|   | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Random Forest CV | 0.583302 | 0.47514 | 0.758639 | 0.862133 |
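F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two. The sketch below applies that formula to the mean precision and recall reported above:

```python
# Mean precision and recall of the best model, from the results table
precision = 0.758639
recall = 0.47514

# Harmonic mean of precision and recall
f1_from_means = 2 * precision * recall / (precision + recall)
print(round(f1_from_means, 4))
```

The result, about 0.584, is close to the reported 0.5833 but not identical. The reported value is the average of the F1 scores computed within each fold, which is generally not the same as the F1 computed from fold-averaged precision and recall.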


I will concatenate these results to the master results table from when I built the single decision tree model.

In [15]:
# Read in master results table
results = pd.read_csv('decision_tree_results.csv', index_col=0)
results

|   | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Tuned Decision Tree | 0.560655 | 0.469255 | 0.701608 | 0.8504 |


In [16]:
# Concatenate the random forest results to the master table
results = pd.concat([rf_cv_results, results])
results

|   | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Random Forest CV | 0.583302 | 0.47514 | 0.758639 | 0.862133 |
| 0 | Tuned Decision Tree | 0.560655 | 0.469255 | 0.701608 | 0.8504 |
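Both rows carry the index label 0 because `pd.concat` keeps each frame's original index. This is harmless here, but if unique row labels are wanted, one option is `ignore_index=True`. A minimal sketch with two one-row frames (only the `Model` and `F1` columns are reproduced):

```python
import pandas as pd

a = pd.DataFrame({"Model": ["Random Forest CV"], "F1": [0.583302]})
b = pd.DataFrame({"Model": ["Tuned Decision Tree"], "F1": [0.560655]})

stacked = pd.concat([a, b])                        # keeps labels 0, 0
renumbered = pd.concat([a, b], ignore_index=True)  # relabels rows 0, 1

print(stacked.index.tolist(), renumbered.index.tolist())
```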


The scores in the table above show that the random forest model performs better than the single decision tree model on every metric. Nice!

Now, I'll build another random forest model, only this time I'll tune the hyperparameters using a separate validation dataset.



### 4.2. Hyperparameters tuned with separate validation set  

I will begin by splitting the training data to create a validation dataset.

I'll use `train_test_split` to divide `X_train` and `y_train` into 80% training data (`X_tr`, `y_tr`) and 20% validation data (`X_val`, `y_val`).

In [17]:
# Create separate validation data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, 
                                            stratify=y_train, random_state=10)

When tuning hyperparameters with `GridSearchCV` using a separate validation dataset, a few extra steps need to be taken. `GridSearchCV` wants to cross-validate the data. In fact, if the `cv` argument were left blank, it would split the data into five folds for cross-validation by default. 

Instead, I am going to tell it exactly which rows of `X_train` are for training, and which rows are for validation.  

To do this, I need to make a list of length `len(X_train)` where each element is either a 0 or -1. A 0 in index _i_ will indicate to `GridSearchCV` that index _i_ of `X_train` is to be held out for validation. A -1 at a given index will indicate that that index of `X_train` is to be used as training data. 

I'll make this list using a list comprehension that looks at the index number of each row in `X_train`. If that index number is in `X_val`'s list of index numbers, then the list comprehension appends a 0. If it's not, then it appends a -1.

So if the training data is:  
[A, B, C, D],  
and the list is:   
[-1, 0, 0, -1],  
then `GridSearchCV` will use a training set of [A, D] and validation set of [B, C].

In [18]:
# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
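The same list can also be built without an explicit loop. The sketch below uses hypothetical stand-ins (`X_train_toy`, `X_val_toy`) and compares the list comprehension with a vectorized version based on `Index.isin` and `np.where`; this is an alternative, not a required change.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train and X_val; only the index labels matter.
X_train_toy = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[7, 3, 9, 1])
X_val_toy = X_train_toy.loc[[3, 9]]

# Same logic as the notebook's list comprehension
loop_version = [0 if i in X_val_toy.index else -1 for i in X_train_toy.index]

# Vectorized equivalent: 0 where the row is in the validation set, else -1
vectorized = np.where(X_train_toy.index.isin(X_val_toy.index), 0, -1).tolist()

print(loop_version, vectorized)
```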

Now that I have this list, I need to import a new class called `PredefinedSplit`. This class is what allows me to pass the list I just made to `GridSearchCV`. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit

In [19]:
from sklearn.model_selection import PredefinedSplit
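The [A, B, C, D] example from earlier can be checked directly. This sketch passes the list [-1, 0, 0, -1] to `PredefinedSplit` and prints the single train/validation split it produces (the `rows` array is just a labeled stand-in for the data):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

rows = np.array(["A", "B", "C", "D"])

# -1 means "always train"; 0 means "validation fold number 0"
ps = PredefinedSplit(test_fold=[-1, 0, 0, -1])
splits = list(ps.split())

for train_idx, val_idx in splits:
    print("train:", rows[train_idx], "validation:", rows[val_idx])
```

Because only one fold number (0) appears in the list, there is exactly one split, so each hyperparameter combination is scored once rather than averaged over five folds.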

Now I can build the model. Everything is the same as when I cross-validated, except this time I pass the `split_index` list to `PredefinedSplit` and assign the resulting object to a new variable called `custom_split`.

Then I'll use this variable for the `cv` argument when I instantiate `GridSearchCV`.

In [29]:
rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

scoring = {
    'accuracy': 'accuracy', 
    'precision': 'precision', 
    'recall': 'recall', 
    'f1': 'f1'
}

custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(rf, cv_params, scoring=scoring, cv=custom_split, refit='f1')


Now fit the model.

In [30]:
rf_val.fit(X_train, y_train)

Now I will check the parameters of the best-performing model on the validation set:

In [31]:
rf_val.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 150}

The best hyperparameters differ slightly from those of the cross-validated model: `min_samples_leaf` is 1 instead of 2, and `min_samples_split` is 3 instead of 2.

Now, I'll generate the model results using the `make_results` function, add them to the master table, and then sort them by F1 score in descending order.

In [32]:
# Create model results table
rf_val_results = make_results('Random Forest Validated', rf_val)

# Concatenate model results table with master results table
results = pd.concat([rf_val_results, results])

# Sort master results by F1 score in descending order and keep the sorted table
results = results.sort_values(by=['F1'], ascending=False)
results

|   | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Random Forest CV | 0.583302 | 0.47514 | 0.758639 | 0.862133 |
| 0 | Random Forest Validated | 0.579592 | 0.464052 | 0.771739 | 0.862667 |
| 0 | Tuned Decision Tree | 0.560655 | 0.469255 | 0.701608 | 0.8504 |


I'll save the new master table to use later when I build more models.

In [34]:
# Save the master results table
results.to_csv('model_results2.csv', index=False)

## 5. Model selection and final results

The results in the table show that the cross-validated random forest model performs a little better on F1 and recall than the one tuned on a separate validation set, while the latter has slightly higher precision and accuracy.

The cross-validated model performs well for precision and accuracy, but its recall is 0.4751. This means that out of all the people in the validation folds who _actually_ left the bank, the model successfully identifies only about 48% of them.
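To make that interpretation concrete, here is a sketch with hypothetical counts (invented for illustration, not taken from the notebook's data): 100 actual churners, of whom the model flags 48, plus 15 retained customers flagged by mistake.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 100 actual churners (1) followed by 400 retained (0)
y_true = np.array([1] * 100 + [0] * 400)

# Hypothetical predictions: 48 churners caught, 52 missed,
# 15 retained customers wrongly flagged, 385 correctly left alone
y_pred = np.array([1] * 48 + [0] * 52 + [1] * 15 + [0] * 385)

recall = recall_score(y_true, y_pred)        # 48 / 100
precision = precision_score(y_true, y_pred)  # 48 / (48 + 15)

print(round(recall, 2), round(precision, 2))
```

With numbers like these, most flagged customers really are at risk, but more than half of the customers who leave are never flagged, which matters if the goal is to reach churners before they go.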