Problem Statement


Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it’s cheaper to save the existing customers rather than to attract new ones.
We need to predict whether a customer will leave the bank soon. You have the data on
clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you need an F1
score of at least 0.59. Check the F1 for the test set.
Additionally, measure the AUC-ROC metric and compare it with the F1.
1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking into account the
imbalance. Briefly describe your findings.
3. Improve the quality of the model. Make sure you use at least two approaches to
fixing class imbalance. Use the training set to pick the best parameters. Train
different models on training and validation sets. Find the best one. Briefly
describe your findings.
4. Perform the final testing.


**Data description**


● Dataset URL (CSV File): https://bit.ly/2XZK7Bo



● Features


○ RowNumber — data string index

○ CustomerId — unique customer identifier

○ Surname — surname

○ CreditScore — credit score

○ Geography — country of residence

○ Gender — gender

○ Age — age

○ Tenure — period of maturation for a customer’s fixed deposit (years)

○ Balance — account balance

○ NumOfProducts — number of banking products used by the customer

○ HasCrCard — customer has a credit card

○ IsActiveMember — customer’s activeness

○ EstimatedSalary — estimated salary


● Target
○ Exited — customer has left

Importing the data.

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np 

Reading the data

In [None]:
df = pd.read_csv('https://bit.ly/2XZK7Bo')
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [None]:
# Checking the data shape
df.shape

(10000, 14)

In [None]:
# Checking for nulls in the data
df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

Observation: The Tenure column has 909 null values, about 9% of the 10,000 rows. Tenure is an important feature for training the model, so I dropped the 909 incomplete observations and kept the column.

In [None]:
# Dropping the 909 observations with null values in 'Tenure'
df = df.dropna(axis=0, subset=['Tenure'])

In [None]:
# Confirming if the above step was successful
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [None]:
# Describing the data
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0,9091.0
mean,5013.909911,15691050.0,650.736553,38.949181,4.99769,76522.740015,1.530195,0.704983,0.515565,100181.214924,0.203938
std,2884.433466,71614.19,96.410471,10.555581,2.894723,62329.528576,0.581003,0.456076,0.499785,57624.755647,0.402946
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2521.5,15628990.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51227.745,0.0
50%,5019.0,15691060.0,652.0,37.0,5.0,97318.25,1.0,1.0,1.0,100240.2,0.0
75%,7511.5,15752850.0,717.0,44.0,7.0,127561.89,2.0,1.0,1.0,149567.21,0.0
max,9999.0,15815660.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [None]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [None]:
# Dropping the irrelevant columns i.e. RowNumber and Surname
df = df.drop(['RowNumber', 'Surname'], axis = 1)

In [None]:
# Transforming the Geography and Gender columns using One Hot Encoding 

dummies_df = pd.get_dummies(df[['Geography', 'Gender']])
dummies_df.head()

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,1,0,0,1,0
1,0,0,1,1,0
2,1,0,0,1,0
3,1,0,0,1,0
4,0,0,1,1,0


In [None]:
# Joining the dummies_df to the df dataframe

new_df = pd.concat([df, dummies_df], axis=1, sort=False)
print(new_df.shape)
new_df.head(1)

(9091, 17)


Unnamed: 0,CustomerId,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,15634602,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1,1,0,0,1,0


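Dropping the 'CustomerId', 'Geography' and 'Gender' columns: CustomerId is an identifier with no predictive value, and Geography and Gender are now represented by the one-hot encoded columns.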

In [None]:
new_df = new_df.drop(['CustomerId', 'Geography', 'Gender'], axis = 1)

Splitting the dataset into training, validation and test sets. The target ratio is 60:20:20.


In [None]:
# Splitting the data into Training and Test sets

from sklearn.model_selection import train_test_split

new_df_train, new_df_test = train_test_split(new_df, test_size=0.20, random_state=12345)

In [None]:
# checking the sizes of two datasets

print(new_df_train.shape) #7272 records
print(new_df_test.shape) #1819 records

(7272, 14)
(1819, 14)


In [None]:
# Defining the features and target

features = new_df.drop('Exited', axis = 1)
target = new_df['Exited']

In [None]:
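# Splitting the features and target into training and validation sets (80:20).
# Because this splits the full new_df with the same test_size and random_state
# as the split above, the validation rows are the same rows as new_df_test.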

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.20, random_state = 12345)

# new_df_train, new_df_valid = train_test_split(new_df, test_size=0.20, random_state=12345)
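A 60:20:20 ratio can be obtained by carving the validation set out of the training portion rather than out of the full dataframe. The following is a minimal sketch, assuming a small synthetic stand-in for `new_df` (the `demo_df` frame, its values and the `stratify` argument are illustrative choices, not part of the original notebook). The first call holds out 20% for testing; the second takes 25% of the remaining 80%, which is 20% of the whole.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for new_df: 1000 rows, one feature, a binary target
rng = np.random.RandomState(12345)
demo_df = pd.DataFrame({
    'Age': rng.randint(18, 92, size=1000),
    'Exited': rng.binomial(1, 0.2, size=1000),
})

# Step 1: hold out 20% of all rows as the test set
demo_train_full, demo_test = train_test_split(
    demo_df, test_size=0.20, random_state=12345, stratify=demo_df['Exited'])

# Step 2: take 25% of the remaining 80% (= 20% of the total) for validation
demo_train, demo_valid = train_test_split(
    demo_train_full, test_size=0.25, random_state=12345,
    stratify=demo_train_full['Exited'])

print(len(demo_train), len(demo_valid), len(demo_test))

# The three subsets share no rows
assert set(demo_train.index).isdisjoint(demo_valid.index)
assert set(demo_train.index).isdisjoint(demo_test.index)
assert set(demo_valid.index).isdisjoint(demo_test.index)
```

Stratifying on the target keeps the share of exited customers roughly equal across the three subsets, which matters when the positive class is small.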

In [None]:
# Importing the classification models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Creating the models
model_lr = LogisticRegression(random_state=12345, solver='liblinear')
model_rf = RandomForestClassifier(random_state=12345, n_estimators=3)
model_dt = DecisionTreeClassifier(random_state=12345)

Training the models without taking the class imbalance into account.

In [None]:
# Training the models

model_lr.fit(features_train, target_train)
model_rf.fit(features_train, target_train)
model_dt.fit(features_train, target_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=12345, splitter='best')

In [None]:
# Making predictions using the three trained models

lr_pred = model_lr.predict(features_valid) 
rf_pred = model_rf.predict(features_valid)
dt_pred = model_dt.predict(features_valid)

In [None]:
# Checking the F1 score of the three models

from sklearn.metrics import f1_score

print('Logistic Regression F1 score: ', f1_score(target_valid, lr_pred))
print('Random Forest F1 score: ', f1_score(target_valid, rf_pred))
print('Decision Tree Classifier F1 score: ', f1_score(target_valid, dt_pred))

Logistic Regression F1 score:  0.07881773399014778
Random Forest F1 score:  0.4820359281437126
Decision Tree Classifier F1 score:  0.49934640522875823


Examining the balance of classes

In [None]:
from sklearn.metrics import confusion_matrix

print("Logistic Regression Confusion Matrix: ", confusion_matrix(target_valid, lr_pred))
print("Random Forest Confusion Matrix: ", confusion_matrix(target_valid, rf_pred))
print("Decision Tree Confusion Matrix: ",confusion_matrix(target_valid, dt_pred))

Logistic Regression Confusion Matrix:  [[1429   21]
 [ 353   16]]
Random Forest Confusion Matrix:  [[1312  138]
 [ 208  161]]
Decision Tree Confusion Matrix:  [[1245  205]
 [ 178  191]]


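Observations: The classes are considerably imbalanced. Each row of a confusion matrix sums to the number of true cases in that class: 1450 customers stayed and 369 left, so only about 20% of the validation set belongs to the positive class. The models miss many of these customers: logistic regression identifies only 16 of the 369, and even the decision tree misses 178.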

The F1 scores are also quite low, with Decision Tree Classifier scoring the highest at 0.499. This is below the required threshold of 0.59.
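The class shares and F1 scores can be recomputed directly from the confusion matrices printed above. The sketch below copies those matrices by hand; the `summarize` helper is a hypothetical name introduced for illustration. It relies on scikit-learn's layout for binary problems, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrices copied from the output above: rows are true classes,
# columns are predicted classes, so the layout is [[TN, FP], [FN, TP]]
matrices = {
    'Logistic Regression': np.array([[1429, 21], [353, 16]]),
    'Random Forest': np.array([[1312, 138], [208, 161]]),
    'Decision Tree': np.array([[1245, 205], [178, 191]]),
}

def summarize(cm):
    """Return (positive share, precision, recall, F1) for a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    positive_share = (fn + tp) / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return positive_share, precision, recall, f1

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (share, precision, recall, f1) in results.items():
    print(f'{name}: positives={share:.3f} precision={precision:.3f} '
          f'recall={recall:.3f} F1={f1:.3f}')
```

The recall column makes the effect of the imbalance explicit: a model can classify most customers correctly while still missing most of those who leave.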

Retraining the models after addressing the imbalance of classes

In [None]:
model_lr = LogisticRegression(random_state=12345, class_weight='balanced', solver='liblinear')
model_rf = RandomForestClassifier(random_state=12345, n_estimators=3, class_weight = 'balanced')
model_dt = DecisionTreeClassifier(random_state=12345, class_weight = 'balanced')

In [None]:
model_lr.fit(features_train, target_train)
model_rf.fit(features_train, target_train)
model_dt.fit(features_train, target_train)

lr_pred = model_lr.predict(features_valid)
rf_pred = model_rf.predict(features_valid)
dt_pred = model_dt.predict(features_valid)

print('Logistic Regression F1 score: ', f1_score(target_valid, lr_pred))
print("Random Forest F1 score: ", f1_score(target_valid, rf_pred))
print("Decision Tree F1 score: ",f1_score(target_valid, dt_pred))

Logistic Regression F1 score:  0.5152091254752852
Random Forest F1 score:  0.5022026431718062
Decision Tree F1 score:  0.4918032786885246


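Observation: After setting class_weight='balanced', the F1 score improved sharply for logistic regression (0.079 to 0.515) and slightly for the random forest (0.482 to 0.502), while the decision tree dropped marginally (0.499 to 0.492). All three remain below the 0.59 threshold.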
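The task asks for at least two approaches to fixing class imbalance, and class weighting is one of them. Upsampling the minority class is a common second option. The following is a minimal sketch on made-up toy data; the `upsample` helper, its `repeat` parameter and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat, random_state=12345):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle features and target together so rows stay aligned
    return shuffle(features_up, target_up, random_state=random_state)

# Toy data: 8 negatives and 2 positives (made up for illustration)
toy_features = pd.DataFrame({'Age': [25, 31, 38, 42, 47, 52, 58, 63, 29, 44]})
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='Exited')

features_up, target_up = upsample(toy_features, toy_target, repeat=4)
print(target_up.value_counts())
```

Only the training set should be upsampled. The validation and test sets keep their original class distribution so that the F1 score reflects real conditions.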

In [None]:
# Finding the best depth for the Random Forest model

for depth in range(1, 10):
        model_rf = RandomForestClassifier(random_state=12345, n_estimators=3, class_weight = 'balanced',max_depth = depth)

        model_rf.fit(features_train, target_train) 

        rf_pred = model_rf.predict(features_valid) 

        print("max_depth =", depth, ": ", end='')
        print(f1_score(target_valid, rf_pred))

max_depth = 1 : 0.48893572181243417
max_depth = 2 : 0.5351239669421488
max_depth = 3 : 0.5244536940686785
max_depth = 4 : 0.5704845814977973
max_depth = 5 : 0.570203644158628
max_depth = 6 : 0.5652620760534429
max_depth = 7 : 0.5727590221187427
max_depth = 8 : 0.5565410199556541
max_depth = 9 : 0.5672727272727273


Observation: The best F1 score for the Random Forest comes at max_depth = 7 (0.5728), with n_estimators fixed at 3.

In [None]:
# Finding the best depth for the Decision Tree model

for depth in range(1, 10):
        model_dt = DecisionTreeClassifier(random_state=12345, class_weight = 'balanced', max_depth = depth)

        model_dt.fit(features_train, target_train) 

        dt_pred = model_dt.predict(features_valid) 

        print("max_depth =", depth, ": ", end='')
        print(f1_score(target_valid, dt_pred))

max_depth = 1 : 0.48893572181243417
max_depth = 2 : 0.51340206185567
max_depth = 3 : 0.5285868392664509
max_depth = 4 : 0.5337026777469991
max_depth = 5 : 0.5661538461538461
max_depth = 6 : 0.5454545454545453
max_depth = 7 : 0.5550239234449761
max_depth = 8 : 0.5372549019607843
max_depth = 9 : 0.5441176470588236


Observation: The best F1 score for the Decision Tree comes at max_depth = 5 (0.5662).

In [None]:
# Finding the best n_estimators for the Random Forest model (max_depth fixed at 7)

for n_est in range(11, 20):
        model_rf = RandomForestClassifier(random_state=12345, n_estimators=n_est, class_weight = 'balanced',max_depth = 7)

        model_rf.fit(features_train, target_train) 

        rf_pred = model_rf.predict(features_valid) 

        print("n_estimator =", n_est, ": ", end='')
        print(f1_score(target_valid, rf_pred))

n_estimator = 11 : 0.6037735849056604
n_estimator = 12 : 0.6068476977567886
n_estimator = 13 : 0.6173708920187794
n_estimator = 14 : 0.6093023255813954
n_estimator = 15 : 0.6186046511627907
n_estimator = 16 : 0.6119577960140681
n_estimator = 17 : 0.6102088167053364
n_estimator = 18 : 0.6157407407407408
n_estimator = 19 : 0.610011641443539


Observation: Over the range 11 to 19, the best n_estimators is 15, which gives an F1 score of 0.6186 on the validation set and clears the 0.59 threshold.
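The searches above tune max_depth with n_estimators fixed at 3, then tune n_estimators with max_depth fixed at 7. A joint search over both parameters is one alternative, since the best depth may change as the number of trees grows. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, standing in for the churn features); the grid ranges are illustrative only.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features (assumption)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# Try every (max_depth, n_estimators) pair and keep the best validation F1
best_params, best_f1 = None, -1.0
for depth, n_est in product(range(3, 8), range(11, 20, 2)):
    model = RandomForestClassifier(random_state=12345, n_estimators=n_est,
                                   max_depth=depth, class_weight='balanced')
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    if score > best_f1:
        best_params, best_f1 = (depth, n_est), score

print('best (max_depth, n_estimators):', best_params, 'F1:', best_f1)
```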

In [None]:
# Recreating the models with the improved parameters
model_lr = LogisticRegression(random_state=12345, class_weight='balanced', solver='liblinear')
model_rf = RandomForestClassifier(random_state=12345, n_estimators=15, class_weight = 'balanced', max_depth = 7)
model_dt = DecisionTreeClassifier(random_state=12345, class_weight = 'balanced', max_depth = 5)

In [None]:
# Retraining the models
model_lr.fit(features_train, target_train)
model_rf.fit(features_train, target_train)
model_dt.fit(features_train, target_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=12345, splitter='best')

Testing the models

In [None]:
# Splitting the test dataset to features and target

test_features = new_df_test.drop('Exited', axis = 1)
test_target = new_df_test['Exited']

In [None]:
# Making predictions with the three trained models

lr_pred = model_lr.predict(test_features)
rf_pred = model_rf.predict(test_features)
dt_pred = model_dt.predict(test_features)

In [None]:
# Evaluating the F1 Score for each model on the test dataset

print('Logistic Regression F1 score: ', f1_score(test_target, lr_pred))
print("Random Forest F1 score: ", f1_score(test_target, rf_pred))
print("Decision Tree F1 score: ",f1_score(test_target, dt_pred))

Logistic Regression F1 score:  0.5152091254752852
Random Forest F1 score:  0.6186046511627907
Decision Tree F1 score:  0.5661538461538461


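Findings: The best model is the Random Forest (n_estimators=15, max_depth=7, class_weight='balanced') with an F1 score of 0.6186, above the required 0.59. Logistic regression scores 0.5152 and the decision tree 0.5662. These test scores are identical to the validation scores because test_features contains the same rows as features_valid: both splits used the same test_size and random_state on the full new_df. The hyperparameters were therefore tuned on the test rows, so 0.6186 is likely an optimistic estimate of performance on unseen data.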
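The task also asks for the AUC-ROC metric alongside F1. F1 is computed from hard class labels at a single decision threshold, while AUC-ROC is computed from the predicted probability of the positive class and summarizes ranking quality across all thresholds. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, standing in for the churn features), reusing the random forest settings selected above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features (assumption)
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345, stratify=y)

model = RandomForestClassifier(random_state=12345, n_estimators=15,
                               max_depth=7, class_weight='balanced')
model.fit(X_train, y_train)

# F1 uses the predicted labels (default probability threshold of 0.5)
f1 = f1_score(y_test, model.predict(X_test))

# AUC-ROC uses the predicted probability of the positive class (column 1)
proba_one = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba_one)

print('F1:', f1)
print('AUC-ROC:', auc)
```

An AUC-ROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the two metrics sit on different scales and should not be compared number for number.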