# Predicting whether a customer will leave the bank soon

In this project, we will use machine learning algorithms to build a model that analyzes data on clients’ past behavior and contract terminations with the bank and predicts whether a customer will leave the bank soon.

# Contents 
* [Data preparation]()
* [Training the model without taking into account the imbalance]()
* [Improving the quality of the model]()
* [Final testing]()
* [General conclusion]()

# Data preparation

First, we will load the libraries and the data that we will use in this project.

In [1]:
# Loading all required libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler 
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score 

In [2]:
# Loading the data files into DataFrame
df=pd.read_csv('/datasets/Churn.csv')

We will display general data info and a sample of the data.

In [3]:
# printing the general/summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
# printing a sample of data
df.head(10)

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
| 5 | 6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8.0 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 7 | 15592531 | Bartlett | 822 | France | Male | 50 | 7.0 | 0.0 | 2 | 1 | 1 | 10062.8 | 0 |
| 7 | 8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4.0 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 9 | 15792365 | He | 501 | France | Male | 44 | 4.0 | 142051.07 | 2 | 0 | 1 | 74940.5 | 0 |
| 9 | 10 | 15592389 | H? | 684 | France | Male | 27 | 2.0 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |


The 'RowNumber', 'CustomerId', and 'Surname' columns only identify a row or a customer and carry no information the algorithm can generalize from. Therefore, we will drop them.

In [5]:
# dropping 'RowNumber', 'CustomerId', 'Surname' columns
df=df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

The Tenure column has 909 missing values (only 9091 of its 10000 entries are non-null). Let’s check which unique values this column has.

In [6]:
#printing unique values of Tenure column
df['Tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

The column contains only whole numbers from 0 to 10, and there does not seem to be a correlation between it and the other columns. We will therefore replace the missing values with the column median.

In [7]:
#replacing the missing values in Tenure column with median value.
df['Tenure']=df['Tenure'].fillna(df['Tenure'].median())
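
As a minimal sketch of what this fill does, the toy example below applies the same `fillna` call to a small hypothetical Tenure column. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the real data
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 8.0, 4.0, np.nan, 5.0, 1.0]})

# median() skips NaN by default, so it is computed from 1, 2, 4, 5, 8
median_tenure = toy['Tenure'].median()

# Replace the missing entries with that median
toy['Tenure'] = toy['Tenure'].fillna(median_tenure)

print(median_tenure)
print(toy['Tenure'].isna().sum())
```

Because the median of whole numbers taken over an odd count of values is itself a whole number here, the filled column keeps the same kind of values as the observed ones.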

In [8]:
# printing the general/summary information about the DataFrame to ensure that no missing values are left
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


Next, we will split the data frame into target and features subsets and prepare the features.

In [9]:
#split the data frame into target and features subsets
target=df['Exited']
features=df.drop(['Exited'], axis=1)

We will use One-Hot Encoding to transform categorical features into numerical features.

In [10]:
# transforming categorical features into numerical features
features=pd.get_dummies(features, drop_first=True)
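
To illustrate the effect of `drop_first=True`, here is a small sketch on hypothetical rows that mimic the two categorical columns. The toy frame is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy rows that mimic the two categorical columns
toy = pd.DataFrame({
    'CreditScore': [619, 608, 376],
    'Geography': ['France', 'Spain', 'Germany'],
    'Gender': ['Female', 'Female', 'Male'],
})

# drop_first=True drops the first category of each column
# (France and Female here), which avoids redundant columns
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```

A row with zeros in both Geography columns then stands for France, so no information is lost. This is also consistent with the 11 feature columns reported after the split: 8 numeric columns plus 3 dummy columns.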

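Next, we will split the features and target into training, validation, and test subsets in a 60/20/20 ratio: 20% of the rows are held out for testing, and 25% of the remaining 80% go to validation. After that, we will check the class balance of the training set.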

In [11]:
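# splitting data into training, validation and test sets
# step 1: hold out 20% for testing (features/target are overwritten with the remaining 80%)
features, features_test, target, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)
# step 2: take 25% of the remaining 80% (20% of the total) for validation
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)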

In [12]:
#printing the sizes of the sets to ensure that the data was split correctly
print(features_train.shape)
print(features_test.shape)
print(features_valid.shape)
print(target_train.shape)
print(target_test.shape)
print(target_valid.shape)

(6000, 11)
(2000, 11)
(2000, 11)
(6000,)
(2000,)
(2000,)
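
The two-step split can be checked on synthetic data. The sketch below also passes `stratify`, which the notebook does not use; it is shown only as an optional variant that keeps the class share roughly equal across the three subsets. The data and the names `X` and `y` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positive class (assumption)
X = pd.DataFrame({'x': np.arange(1000)})
y = pd.Series([1] * 200 + [0] * 800)

# Step 1: hold out 20% of all rows for testing
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Step 2: 25% of the remaining 80% equals 20% of the total
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345, stratify=y_rest)

print(len(X_train), len(X_valid), len(X_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```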


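To put the numerical features ('CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary') on a common scale, we will standardize them. The scaler is fitted on the training set only and then applied to all three subsets.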

In [13]:
#scaling numerical features
pd.options.mode.chained_assignment = None
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()
scaler.fit(features_train[numeric]) 
features_train[numeric]=scaler.transform(features_train[numeric])
features_valid[numeric]=scaler.transform(features_valid[numeric])
features_test[numeric]=scaler.transform(features_test[numeric])
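
As a hedged alternative to switching off the chained-assignment warning, the sketch below scales explicit copies of the frames. The toy frames and the names `train_scaled` and `valid_scaled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the train/validation features
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0, 50.0], 'HasCrCard': [1, 0, 1, 1]})
valid = pd.DataFrame({'Age': [35.0, 60.0], 'HasCrCard': [0, 1]})
numeric = ['Age']

# Work on explicit copies so the assignment below does not touch a slice
train_scaled = train.copy()
valid_scaled = valid.copy()

# Fit on the training data only, then apply the same transform everywhere
scaler = StandardScaler()
scaler.fit(train[numeric])
train_scaled[numeric] = scaler.transform(train[numeric])
valid_scaled[numeric] = scaler.transform(valid[numeric])

print(train_scaled['Age'].mean(), train_scaled['Age'].std(ddof=0))
print(valid_scaled['Age'].tolist())
```

After scaling, the training column has zero mean and unit (population) standard deviation, while the validation column is expressed in the training set's units and need not be centered.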

The data is ready. Next, we will examine the balance of classes. To do this we will count the values in the target set.

In [14]:
#counting the values in the target set
target_train.value_counts()

0    4781
1    1219
Name: Exited, dtype: int64
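
The counts above can be turned into shares with a short calculation. The sketch rebuilds a target with the same class counts; the variable names are assumptions.

```python
import pandas as pd

# Rebuild a target with the class counts printed above
target_counts = pd.Series([0] * 4781 + [1] * 1219, name='Exited')

# Share of each class in the training target
shares = target_counts.value_counts(normalize=True)

# How many class-0 rows there are per class-1 row
ratio = 4781 / 1219

print(shares.round(3).to_dict())
print(round(ratio, 2))
```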

As can be seen, the classes are imbalanced: only 1219 of the 6000 training rows (about 20%) belong to class 1. First, we will train the models without taking the imbalance into account.

# Training the model without taking into account the imbalance

We will investigate the quality of different models by changing hyperparameters. We will start with the Decision Tree and train it with maximum depths from 2 to 99. We will assess each model by calculating the F1 score on the validation set.

In [15]:
#declaring variables for storing best max_depth and best f1 score

best_max_depth=0
best_f1=0

#training models with different maximum depths in range from 2 to 99
for i in range(2,100):
    model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    model.fit(features_train, target_train)
    #calculating f1 score of the model on the validation data set
    predictions=model.predict(features_valid)
    f1 = f1_score(target_valid, predictions)
    #if f1 score of the model is greater than previous best f1 score, updating best max_depth and best f1 variables
    if f1>best_f1:
        best_max_depth=i
        best_f1=f1

#printing best f1 score
print(f'Best f1 score {best_f1} is achieved with max_depth={best_max_depth}')

Best f1 score 0.5583596214511041 is achieved with max_depth=7


As can be seen, the best F1 score of 0.5583596214511041 is achieved with max_depth=7.
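
Since the models are compared by F1 score rather than accuracy, a small worked example may help. The labels below are hypothetical; the sketch computes F1 by hand as the harmonic mean of precision and recall and compares it with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 real positives, the model finds 2 of them
# and raises 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

In this toy case 7 of 10 predictions are correct, so accuracy is 0.7, while F1 is 4/7 (about 0.57). This gap is why F1 is the more informative metric when the positive class is rare.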

Next, we will train the Random Forest with different numbers of estimators (from 2 to 49) and different maximum depths (from 2 to 19). We will assess each model by calculating the F1 score on the validation set.

In [16]:
#declaring variables for storing best n_estimators, best max_depth and best f1 score

best_n_estimators=0
best_f1=0
best_max_depth=0

#training models with different maximum depths and n_estimators
for i in range(2,50):
    for j in range(2,20):
        model = RandomForestClassifier(random_state=12345, n_estimators=i,max_depth=j)
        model.fit(features_train, target_train)
        #calculating f1 score of the model on the validation data set
        predictions=model.predict(features_valid)
        f1 = f1_score(target_valid, predictions)
        #if f1 score of the model is greater than previous best f1 score, updating best n_estimators, best max_depth and best f1 variables
        if f1>best_f1:
            best_n_estimators=i
            best_max_depth=j
            best_f1=f1

#printing best f1 score
print(f'Best f1 score {best_f1} is achieved with n_estimators={best_n_estimators} and max_depth={best_max_depth}')

Best f1 score 0.5822784810126581 is achieved with n_estimators=37 and max_depth=15


As can be seen, the best F1 score of 0.5822784810126581 is achieved with n_estimators=37 and max_depth=15.
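
The nested loops can also be written with scikit-learn's `ParameterGrid`, which enumerates every combination of the listed values. The following is only a sketch on synthetic data with a deliberately small grid; the dataset, the grid values, and the variable names are assumptions, and it still scores on a single validation set, as the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic imbalanced data as a stand-in for the bank dataset (assumption)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# A deliberately small grid; the notebook scans far more combinations
grid = ParameterGrid({'n_estimators': [10, 20], 'max_depth': [3, 6]})

best_f1, best_params = -1.0, None
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    f1 = f1_score(y_valid, model.predict(X_valid))
    # Keep the combination with the highest validation F1
    if f1 > best_f1:
        best_f1, best_params = f1, params

print(best_params, best_f1)
```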

Finally, we will train the Logistic Regression with the liblinear solver and assess the model by calculating the F1 score on the validation set.

In [17]:
#training the model
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(features_train, target_train) 

#calculating f1 score of the model on the validation data set 
predictions=model.predict(features_valid)
f1 = f1_score(target_valid, predictions)

#printing f1 score
f1

0.30131826741996237

Based on the above calculations, our best model is Random Forest with 37 estimators and max_depth=15 (F1 of about 0.58), and the model with the lowest F1 score is Logistic Regression (about 0.30).

In the next step, we will check the quality of our best model using the test set. We will calculate f1 score and AUC-ROC metrics.

In [18]:
#training the model
model = RandomForestClassifier(random_state=12345, n_estimators=37,max_depth=15) 
model.fit(features_train, target_train) 

#calculating f1 score and AUC-ROC of the model on the test data set    
predictions=model.predict(features_test)
f1 = f1_score(target_test, predictions)
print(f'f1:{f1}')
probabilities_test = model.predict_proba(features_test)
auc_roc=roc_auc_score(target_test, probabilities_test[:, 1])
print(f'auc_roc: {auc_roc}')

f1:0.550595238095238
auc_roc: 0.8511979823455232
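
Unlike F1, AUC-ROC is computed from the predicted probabilities of class 1, not from the hard predictions. A minimal sketch on hypothetical values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
proba_one = [0.1, 0.4, 0.35, 0.8]

# AUC-ROC equals the share of (positive, negative) pairs in which
# the positive example receives the higher score: 3 of 4 pairs here
auc = roc_auc_score(y_true, proba_one)

print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the metric does not depend on the 0.5 decision threshold used by `predict`.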


As can be seen, the F1 score (about 0.55) is lower than the threshold F1 score (0.59). Next, we will try to improve the quality of the model by addressing the class imbalance.

# Improving the quality of the model

First, we will adjust the class weights by using the class_weight argument. We will use only our best model, Random Forest, again with different numbers of estimators and different depths.

In [19]:
#declaring variables for storing best n_estimators, best max_depth and best f1 score

best_n_estimators=0
best_f1=0
best_max_depth=0

#training models with different maximum depths and n_estimators
for i in range(2,50):
    for j in range(2,20):
        model = RandomForestClassifier(random_state=12345, n_estimators=i,max_depth=j, class_weight='balanced')
        model.fit(features_train, target_train)
        #calculating f1 score of the model on the validation data set
        predictions=model.predict(features_valid)
        f1 = f1_score(target_valid, predictions)
        #if f1 score of the model is greater than previous best f1 score, updating best n_estimators, best max_depth and best f1 variables
        if f1>best_f1:
            best_n_estimators=i
            best_max_depth=j
            best_f1=f1

#printing best f1 score
print(f'Best f1 score {best_f1} is achieved with n_estimators={best_n_estimators} and max_depth={best_max_depth}')

Best f1 score 0.6038216560509554 is achieved with n_estimators=37 and max_depth=10
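
For context, `class_weight='balanced'` weights each class inversely to its frequency. The sketch below reproduces those weights for the training class counts shown earlier, using scikit-learn's `compute_class_weight` helper and the same formula written by hand; the variable names are assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Target rebuilt from the training class counts shown earlier
y = np.array([0] * 4781 + [1] * 1219)

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

print(weights.round(3))
print(manual.round(3))
```

With these counts, a class 1 row weighs roughly four times as much as a class 0 row, which matches the ratio between the two classes.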


As can be seen, the best F1 score of 0.6038216560509554 is achieved with n_estimators=37 and max_depth=10. It is above the threshold, but let's see if we can improve it further with upsampling. As we saw, about 4 times as many customers stayed with the bank as left it (4781 versus 1219 in the training set). We will build a function to upsample the minority class.

In [20]:
#defining the function to upsample the minority class
def upsample(features, target, repeat):

    #splitting features and target into subsets by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    #repeating the minority class and concatenating the classes back together
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    #shuffling the sets
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    #returning the new balanced target and feature sets
    return features_upsampled, target_upsampled

#using the function to upsample the training sets
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
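
To see what `repeat=4` does to the class counts, the sketch below runs the same logic on a stand-in target with the training counts from above. The single feature column `x` is a hypothetical placeholder, and the function is repeated only so that the example runs on its own.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    # Same logic as the function above, repeated so the sketch runs alone
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)
    return shuffle(features_up, target_up, random_state=12345)

# Stand-in data with the training class counts (4781 zeros, 1219 ones)
toy_target = pd.Series([0] * 4781 + [1] * 1219, name='Exited')
toy_features = pd.DataFrame({'x': range(len(toy_target))})

toy_features_up, toy_target_up = upsample(toy_features, toy_target, 4)

# Class 1 is repeated 4 times: 1219 * 4 = 4876 rows
counts = toy_target_up.value_counts()
print(counts[0], counts[1], len(toy_target_up))
```

The two classes end up nearly equal in size (4781 versus 4876 rows). Note that upsampling only duplicates existing rows, so it adds no new information about class 1.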

Next, we will train our best model, Random Forest, on the upsampled training set, again with different numbers of estimators and different depths. Note that the code below also keeps class_weight='balanced'.

In [21]:
#declaring variables for storing best n_estimators, best max_depth and best f1 score

best_n_estimators=0
best_f1=0
best_max_depth=0

#training models with different maximum depths and n_estimators
for i in range(2,50):
    for j in range(2,20):
        model = RandomForestClassifier(random_state=12345, n_estimators=i,max_depth=j, class_weight='balanced')
        model.fit(features_upsampled, target_upsampled)
        #calculating f1 score of the model on the validation data set
        predictions=model.predict(features_valid)
        f1 = f1_score(target_valid, predictions)
        #if f1 score of the model is greater than previous best f1 score, updating best n_estimators, best max_depth and best f1 variables
        if f1>best_f1:
            best_n_estimators=i
            best_max_depth=j
            best_f1=f1

#printing best f1 score
print(f'Best f1 score {best_f1} is achieved with n_estimators={best_n_estimators} and max_depth={best_max_depth}')

Best f1 score 0.6027027027027029 is achieved with n_estimators=35 and max_depth=17


As can be seen, the best F1 score of 0.6027027027027029 is achieved with n_estimators=35 and max_depth=17. This is slightly lower than the result obtained with class weights alone.

# Final testing

Finally, we will check the quality of our best model (Random Forest with class_weight='balanced', n_estimators=37, and max_depth=10) on the test set. We will calculate the F1 score and AUC-ROC metrics.

In [22]:
model = RandomForestClassifier(random_state=12345, n_estimators=37,max_depth=10, class_weight='balanced') 
model.fit(features_train, target_train) 
#calculating f1 score and AUC-ROC of the model on the test data set
predictions=model.predict(features_test)
f1 = f1_score(target_test, predictions)
print(f'f1:{f1}')
probabilities_test = model.predict_proba(features_test)
auc_roc=roc_auc_score(target_test, probabilities_test[:, 1])
print(f'auc_roc: {auc_roc}')

f1:0.6339712918660286
auc_roc: 0.8587135666122254


The F1 score of the model on the test set (about 0.63) is even higher than on the validation set (about 0.60), and it is above the threshold of 0.59. At the same time, the AUC-ROC improves slightly in comparison with the model built before balancing the classes (about 0.859 versus 0.851).

# General conclusion

In this project, we used machine learning algorithms to develop a model that analyzed the data on clients’ past behavior and termination of contracts with the bank. The model was made to predict whether a customer will leave the bank soon.

First, we examined the data, dropped the unnecessary columns, and filled in the missing values. Next, we split the data frame into target and features subsets and prepared the features. We used One-Hot Encoding to transform categorical features into numerical features. To standardize numerical features, we scaled them.

We found that the classes were imbalanced. First, we trained the models without taking the imbalance into account. Using the training and validation sets, we compared different models and hyperparameters by F1 score. In this case, the best model was Random Forest with 37 estimators and max_depth=15, and the model with the lowest F1 score was Logistic Regression. Next, we checked the quality of our best model on the test set by calculating the F1 score and AUC-ROC metrics. The F1 score (0.55) was lower than the threshold F1 score (0.59), so we tried to improve the quality of the model by addressing the class imbalance.

We used two different techniques to address the class imbalance: class weight adjustment and upsampling. With class weight adjustment, we achieved the best validation F1 score of 0.6038216560509554 with n_estimators=37 and max_depth=10, which was above the threshold. Upsampling gave a slightly lower validation F1 score of 0.6027027027027029.

Finally, we checked the quality of our best model on the test set. Its F1 score (about 0.63) was even higher than on the validation set and above the threshold of 0.59. At the same time, the AUC-ROC improved slightly in comparison with the model built before balancing the classes (about 0.859 versus 0.851).