# Bank customer churn prediction

Beta Bank customers are leaving, and the bank has concluded that retaining existing customers is cheaper than attracting new ones.

The goal is to predict whether a customer will leave the bank soon. The bank provided historical data on clients' past behavior and contract terminations.
The task is to build a model with the highest possible F1 score. The bank requires an F1 score of at least 0.59.
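As a reminder of what the target metric measures, here is a minimal sketch of F1 as the harmonic mean of precision and recall. The labels are toy values chosen for illustration, not taken from the bank data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (hypothetical, not from the bank data): 1 = customer left, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print('precision:', precision, 'recall:', recall)
print('F1 (manual):', f1_manual, 'F1 (sklearn):', f1)
```

Because F1 only counts the positive class, it is sensitive to class imbalance, which is why the class balance is checked later on.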

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

# Data Description

**Feature variables**
<br>RowNumber - row index in data
<br>CustomerId - unique customer ID
<br>Surname - client's surname
<br>CreditScore - credit rating
<br>Geography - country of residence
<br>Gender - client's gender
<br>Age - client's age
<br>Tenure - how many years a person has been a bank client
<br>Balance - account balance
<br>NumOfProducts - number of bank products used by the client
<br>HasCrCard - whether the client has a credit card
<br>IsActiveMember - whether the client is an active member
<br>EstimatedSalary - estimated salary

**Target variable**
<br>Exited - did the client leave the bank

# Studying general information

In [332]:
# importing libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve 

In [333]:
# opening the data file
df = pd.read_csv('Churn.csv')

In [334]:
# printing first 5 lines of data
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [335]:
# looking at the general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [336]:
# finding the number of the missing values in 'Tenure'
df['Tenure'].isnull().sum()

909

In [337]:
# checking the numeric values
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
RowNumber,10000.0,5000.5,2886.89568,1.0,2500.75,5000.5,7500.25,10000.0
CustomerId,10000.0,15690940.0,71936.186123,15565701.0,15628528.25,15690740.0,15753230.0,15815690.0
CreditScore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
Age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
Tenure,9091.0,4.99769,2.894723,0.0,2.0,5.0,7.0,10.0
Balance,10000.0,76485.89,62397.405202,0.0,0.0,97198.54,127644.2,250898.09
NumOfProducts,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
HasCrCard,10000.0,0.7055,0.45584,0.0,0.0,1.0,1.0,1.0
IsActiveMember,10000.0,0.5151,0.499797,0.0,0.0,1.0,1.0,1.0
EstimatedSalary,10000.0,100090.2,57510.492818,11.58,51002.11,100193.9,149388.2,199992.48


In [338]:
# checking the categorical (object-type) values
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Surname,10000,2932,Smith,32
Geography,10000,3,France,5014
Gender,10000,2,Male,5457


In [340]:
# checking for duplicates
print('Num. duplicates:', df.duplicated().sum())

Num. duplicates: 0


The target variable 'Exited' must be checked for class balance. A commonly used threshold for the class balance level (the ratio of the minority class to the majority class) is 25-30%. We set it to 30%: if the value found is below 30%, the data requires class balancing.

In [341]:
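# checking the class imbalance
class_counts = df['Exited'].value_counts()
print('Classes in target variables:', class_counts)

# ratio of the minority class (1) to the majority class (0), in percent
print('Class balance level:', round((class_counts[1] / class_counts[0]) * 100, 2), '%')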

Classes in target variables: 0    7963
1    2037
Name: Exited, dtype: int64
Class balance level: 25.58 %


**Summary:** The data contains 14 columns and 10000 rows. Findings and next steps:

* Found 909 NaN values in the 'Tenure' column. That is only 9% of the data, so we will drop these rows.
* Change the 'Tenure' column from float to int (it holds whole numbers of years).
* We don't need the 'RowNumber', 'CustomerId' and 'Surname' columns for building a prediction model. Drop them.
* Process 'Geography' and 'Gender' with One-Hot Encoding (OHE).
* The target variable is imbalanced. We will use either downsampling or upsampling to address it.

## Data preprocessing

In [342]:
# dropping rows with missing values in 'Tenure'
df.dropna(subset=['Tenure'], inplace=True)
# dropping columns that are not needed for modeling
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

In [343]:
# changing datatype in 'Tenure'
df['Tenure'] = df['Tenure'].astype(int)

In [344]:
# transforming data with OHE
data = pd.get_dummies(df, columns=(['Geography', 'Gender']), drop_first=True)

**Summary:**
* dropped the unnecessary columns and the rows with missing values
* changed the data type of 'Tenure' to 'int'
* encoded the categorical columns with OHE
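To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The rows are made up for illustration; only the column names mirror the real data.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same categorical columns as the bank data
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 43],
})

# drop_first=True drops the first category of each column (France, Female),
# so a row with all dummy columns equal to 0 represents that dropped category
encoded = pd.get_dummies(toy, columns=['Geography', 'Gender'], drop_first=True)
print(list(encoded.columns))
```

Dropping one category per column avoids a perfectly redundant dummy column, since the dropped category is implied by the others.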

# Feature preparation

In [345]:
# setting features and target variables
features = data.drop(['Exited'], axis=1)
target = data['Exited']

In [346]:
# checking for multicollinearity between the variables in 'features'
vif_data = pd.DataFrame()
vif_data["feature"] = features.columns
vif_data["VIF"] = [variance_inflation_factor(features.values, i) for i in range(len(features.columns))]
vif_data

Unnamed: 0,feature,VIF
0,CreditScore,21.208943
1,Age,12.213651
2,Tenure,3.851954
3,Balance,3.189289
4,NumOfProducts,7.827226
5,HasCrCard,3.287674
6,IsActiveMember,2.078944
7,EstimatedSalary,3.892154
8,Geography_Germany,1.792308
9,Geography_Spain,1.484706


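CreditScore (21.2) and Age (12.2) exceed the commonly used VIF threshold of 10, and NumOfProducts (7.8) exceeds the stricter threshold of 5. However, these values were computed without an intercept column, which tends to inflate VIF for features with large means, so we do not treat them as evidence of harmful multicollinearity and keep all features.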
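`variance_inflation_factor` does not add an intercept to the design matrix on its own, so VIFs computed on raw features can be inflated for columns whose mean is large relative to their spread. The sketch below computes VIF with an intercept using NumPy only. The helper name `vif_with_intercept` and the synthetic columns are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def vif_with_intercept(X):
    """Hypothetical helper: VIF per column, regressing each column
    on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        # with an intercept the residuals have zero mean, so this is the usual R^2
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic, independent columns with large means (illustrative only)
rng = np.random.default_rng(0)
score_like = rng.normal(650, 100, size=500)
age_like = rng.normal(40, 10, size=500)

vifs = vif_with_intercept(np.column_stack([score_like, age_like]))
print(vifs)  # values close to 1 are expected for independent columns
```

Applied to `features`, a calculation like this would show whether the high values for CreditScore and Age come from real collinearity or only from the missing intercept.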

In [347]:
# splitting the data in train, valid, and test sets of data
features_train, features_test, target_train, target_test = train_test_split(features, target,test_size=0.2,random_state=12345)
features_train,features_valid, target_train, target_valid = train_test_split(features_train,target_train,test_size = 0.25,random_state=12345)

In [348]:
# checking
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5454 entries, 3706 to 208
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        5454 non-null   int64  
 1   Age                5454 non-null   int64  
 2   Tenure             5454 non-null   int32  
 3   Balance            5454 non-null   float64
 4   NumOfProducts      5454 non-null   int64  
 5   HasCrCard          5454 non-null   int64  
 6   IsActiveMember     5454 non-null   int64  
 7   EstimatedSalary    5454 non-null   float64
 8   Geography_Germany  5454 non-null   uint8  
 9   Geography_Spain    5454 non-null   uint8  
 10  Gender_Male        5454 non-null   uint8  
dtypes: float64(2), int32(1), int64(5), uint8(3)
memory usage: 378.2 KB


**Summary:** 
* data was split into 3 sets: train, valid, and test (60% / 20% / 20%).
* features were additionally checked for multicollinearity with VIF; all features were kept.
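The split above does not stratify by the target, so the share of class 1 can differ slightly between the three sets. A possible variant, shown here on synthetic arrays rather than the bank data, passes `stratify` to `train_test_split` to keep the class share equal across the sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): 1000 rows, 20% positives
X = np.arange(2000).reshape(1000, 2)
y = np.array([1] * 200 + [0] * 800)

# First cut off 20% for the test set, then 25% of the rest (20% overall) for validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=12345)

print('sizes:', len(X_train), len(X_valid), len(X_test))
print('share of class 1:', y_train.mean(), y_valid.mean(), y_test.mean())
```

Switching the notebook to a stratified split would change which rows land in each set, so the scores reported below would not be reproduced exactly.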

# Balancing classes

We will try both upsampling and downsampling and choose the more effective one by the F1 score of a logistic regression on the validation set.

## Upsampling

In [349]:
repeat = ((target_train == 0).sum() / (target_train == 1).sum()).round().astype(int)
repeat

4

In [350]:
# writing the function for upsampling
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, repeat)

In [351]:
# checking
model = LogisticRegression(random_state=12345,solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.4493392070484581


## Downsampling

In [352]:
# writing the function for downsampling
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

In [353]:
model = LogisticRegression(random_state=12345,solver='liblinear')
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)


print("F1:", f1_score(target_valid, predicted_valid))
print(features_downsampled.info())

F1: 0.3584710743801653
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1553 entries, 5658 to 1410
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        1553 non-null   int64  
 1   Age                1553 non-null   int64  
 2   Tenure             1553 non-null   int32  
 3   Balance            1553 non-null   float64
 4   NumOfProducts      1553 non-null   int64  
 5   HasCrCard          1553 non-null   int64  
 6   IsActiveMember     1553 non-null   int64  
 7   EstimatedSalary    1553 non-null   float64
 8   Geography_Germany  1553 non-null   uint8  
 9   Geography_Spain    1553 non-null   uint8  
 10  Gender_Male        1553 non-null   uint8  
dtypes: float64(2), int32(1), int64(5), uint8(3)
memory usage: 107.7 KB
None


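**Summary:** Logistic regression gave an F1 score of 0.45 with upsampling and only 0.36 with downsampling. Note that fraction=0.1 keeps only a tenth of the class-0 rows, which leaves far fewer class-0 than class-1 rows, so the downsampled set is imbalanced in the opposite direction. We will use upsampling when building the models below.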
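The `fraction` argument of `downsample` controls how many class-0 rows are kept. A value that roughly equalizes the classes can be derived from the class counts, as in this sketch on a toy target. The name `balanced_fraction` is illustrative and was not used in the original analysis.

```python
import pandas as pd

# Toy target (illustrative only): 80 zeros and 20 ones
target = pd.Series([0] * 80 + [1] * 20)

n_zeros = (target == 0).sum()
n_ones = (target == 1).sum()

# Keeping this share of class-0 rows leaves about as many zeros as ones
balanced_fraction = n_ones / n_zeros

zeros_kept = target[target == 0].sample(frac=balanced_fraction, random_state=12345)
print('fraction:', balanced_fraction)
print('zeros kept:', len(zeros_kept), 'ones:', n_ones)
```

With roughly four class-0 rows per class-1 row in the training set, a fraction near 0.25 would balance the classes, whereas 0.1 removes more class-0 rows than needed.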

# Choosing the best model

We will build and compare the four following models:
* Decision Tree Classifier
* Random Forest Classifier
* Logistic Regression Classifier
* Gradient Boosting Classifier

## Decision Tree Classifier

In [354]:
df_dtc = pd.DataFrame() # create dataframe 
for depth in range(1, 10):
    # defining the model
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)

    # fitting the model
    model.fit(features_upsampled, target_upsampled)

    # predicting values
    predicted_train = model.predict(features_upsampled)
    predicted_valid = model.predict(features_valid)

    # calculating f1_score
    f1_train = f1_score(target_upsampled, predicted_train)
    f1_valid = f1_score(target_valid, predicted_valid)

    # predicting probabilities
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]

    # calculating auc_roc score
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid) 

    # creating the temporary dataframe with metrics
    df_dtc_temp = pd.DataFrame({'max_depth': [depth],'f1_train': [f1_train],'f1_valid': [f1_valid],'AUC_ROC': [auc_roc]})

    # appending the new values to df_dtc (DataFrame.append was removed in pandas 2.0)
    df_dtc = pd.concat([df_dtc, df_dtc_temp])


dtc = df_dtc.reset_index(drop=True)
round(dtc.sort_values(by='f1_valid', ascending=False).head(),2)

Unnamed: 0,max_depth,f1_train,f1_valid,AUC_ROC
6,7,0.82,0.56,0.82
5,6,0.79,0.55,0.81
8,9,0.86,0.54,0.78
7,8,0.84,0.54,0.8
4,5,0.75,0.54,0.81


**Summary:** The best result is with max_depth=7: F1 on the validation set is 0.56 and AUC-ROC is 0.82. The F1 score is below the required 0.59.
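The sweep above grows a DataFrame one row at a time. A minimal sketch of the same pattern that collects the metrics in a list of dicts and builds the DataFrame once is shown here. It runs on synthetic data from `make_classification`, not on the bank data, and the names `rows` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345)

rows = []  # one dict of metrics per max_depth value
for depth in range(1, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    rows.append({
        'max_depth': depth,
        'f1_valid': f1_score(y_valid, model.predict(X_valid)),
        'AUC_ROC': roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
    })

# Build the results table once, after the loop
results = pd.DataFrame(rows)
print(results.sort_values(by='f1_valid', ascending=False).head().round(2))
```

Building the table once avoids repeated copying and does not depend on `DataFrame.append`, which is no longer available in recent pandas versions.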

## Random Forest Classifier

In [355]:
df_rfc = pd.DataFrame() # create dataframe 
for est in range(1, 50):
    # defining the model
    model = RandomForestClassifier(max_depth=5, n_estimators=est, random_state=12345)

    # fitting the model
    model.fit(features_upsampled, target_upsampled)

    # predicting values
    predicted_train = model.predict(features_upsampled)
    predicted_valid = model.predict(features_valid)

    # calculating f1_score
    f1_train = f1_score(target_upsampled, predicted_train)
    f1_valid = f1_score(target_valid, predicted_valid)

    # predicting probabilities
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]

    # calculating auc_roc score
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid) 

    # creating the temporary dataframe with metrics
    df_rf_temp = pd.DataFrame({'est': [est],'f1_train': [f1_train],'f1_valid': [f1_valid],'AUC_ROC': [auc_roc]})

    # appending the new values to df_rfc (DataFrame.append was removed in pandas 2.0)
    df_rfc = pd.concat([df_rfc, df_rf_temp])


rfc = df_rfc.reset_index(drop=True)
round(rfc.sort_values(by='f1_valid', ascending=False).head(),2)

Unnamed: 0,est,f1_train,f1_valid,AUC_ROC
9,10,0.78,0.6,0.85
7,8,0.76,0.59,0.84
13,14,0.77,0.59,0.85
28,29,0.78,0.59,0.85
11,12,0.78,0.59,0.85


**Summary:** The best result is with n_estimators=10: F1 on the validation set is 0.60 and AUC-ROC is 0.85. Both are higher than for DecisionTreeClassifier.

## Logistic Regression

In [356]:
log_reg = pd.DataFrame() # create dataframe 

model = LogisticRegression(solver='liblinear', class_weight='balanced', random_state=12345)

# fitting the model
model.fit(features_upsampled, target_upsampled)

# predicting values
predicted_train = model.predict(features_upsampled)
predicted_valid = model.predict(features_valid)

# calculating f1_score
f1_train = f1_score(target_upsampled, predicted_train)
f1_valid = f1_score(target_valid, predicted_valid)

# predicting probabilities
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

# calculating auc_roc score
auc_roc = roc_auc_score(target_valid, probabilities_one_valid) 

# creating the temporary dataframe with metrics
df_lr_temp = pd.DataFrame({'f1_train': [f1_train],'f1_valid': [f1_valid],'AUC_ROC': [auc_roc]})

# appending the new values to log_reg (DataFrame.append was removed in pandas 2.0)
log_reg = pd.concat([log_reg, df_lr_temp])


lr_reg = log_reg.reset_index(drop=True)
round(lr_reg.sort_values(by='f1_valid', ascending=False).head(),2)

Unnamed: 0,f1_train,f1_valid,AUC_ROC
0,0.67,0.45,0.71


**Summary:** Logistic regression showed the worst result, with F1 = 0.45 on the validation set and AUC-ROC = 0.71.
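One possible contributor to the weak logistic regression result is that the features are on very different scales (for example, Balance in the hundreds of thousands next to 0/1 flags), which matters for a regularized linear model. The sketch below shows how scaling could be added with a pipeline. It uses synthetic data and was not run on the bank data, so whether it helps here is an open question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=12345)
X[:, 0] *= 100000  # put one feature on a much larger scale, like an account balance

# The scaler is fitted on the training data only, inside the pipeline
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear', class_weight='balanced', random_state=12345),
)
pipe.fit(X, y)
pred = pipe.predict(X)
print(pred[:10])
```

Wrapping the scaler and the model in one pipeline keeps the validation and test sets from leaking into the scaling statistics.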

## Gradient Boosting Classifier

In [358]:
df_gb = pd.DataFrame() # create dataframe 
for est in range(1, 50):
    # defining the model
    model = GradientBoostingClassifier(max_depth=9, n_estimators=est, random_state=12345)

    # fitting the model
    model.fit(features_upsampled, target_upsampled)

    # predicting values
    predicted_train = model.predict(features_upsampled)
    predicted_valid = model.predict(features_valid)

    # calculating f1_score
    f1_train = f1_score(target_upsampled, predicted_train)
    f1_valid = f1_score(target_valid, predicted_valid)

    # predicting probabilities
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]

    # calculating auc_roc score
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid) 

    # creating the temporary dataframe with metrics
    df_gb_temp = pd.DataFrame({'est': [est],'f1_train': [f1_train],'f1_valid': [f1_valid],'AUC_ROC': [auc_roc]})

    # appending the new values to df_gb (DataFrame.append was removed in pandas 2.0)
    df_gb = pd.concat([df_gb, df_gb_temp])


gb = df_gb.reset_index(drop=True)
round(gb.sort_values(by='f1_valid', ascending=False).head(),2)

Unnamed: 0,est,f1_train,f1_valid,AUC_ROC
35,36,0.98,0.6,0.83
38,39,0.99,0.6,0.83
40,41,0.99,0.6,0.83
39,40,0.99,0.6,0.83
41,42,0.99,0.6,0.83


**Summary:** Gradient boosting reaches F1 = 0.60 and AUC-ROC = 0.83 on the validation set. Its train F1 of 0.98 is far above the validation F1, which suggests overfitting.

We analyzed the 4 models and compared their F1 and AUC-ROC scores on the validation set:

|model|f1_valid|AUC ROC|
|---|---|---|
|DecisionTree|0.56|0.82|
|RandomForest|0.60|0.85|
|LogisticRegression|0.45|0.71|
|GradientBoosting|0.60|0.83|

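Random forest and gradient boosting share the highest F1 (0.60), but random forest has the higher AUC-ROC (0.85 versus 0.83) and a much smaller gap between train and validation F1. Thus 'Random Forest Classifier' with max_depth=5 and n_estimators=10 is the best model.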

# Model testing

In [360]:
model = RandomForestClassifier(random_state=12345, max_depth=5, n_estimators=10)
model.fit(features_upsampled, target_upsampled) # fit the model
predicted_test = model.predict(features_test)

# predict probabilities
probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

# calculate auc_roc
auc_roc = roc_auc_score(target_test, probabilities_one_test) 

f1_test = f1_score(target_test, predicted_test) # count the test score
print('f1_score:', f1_test)
print('auc_roc:', auc_roc)

f1_score: 0.6078212290502794
auc_roc: 0.8510784038874871


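**Summary:** On the test set the model reaches F1 = 0.61, above the required 0.59, and AUC-ROC = 0.85. Both are close to the validation scores (0.60 and 0.85), so there is no sign that the model was overfitted to the validation set.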
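A common sanity check is to compare the score with a constant model that predicts class 1 for every client. The sketch below uses a toy target with a 20% share of class 1, which is roughly the share in this dataset. It is an illustration and was not computed on the actual test set.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy target (illustrative only): 20% of clients left
y_true = np.array([1] * 20 + [0] * 80)

# Constant model: predict "left" for everyone
constant_pred = np.ones_like(y_true)

# precision = 0.2, recall = 1.0, so F1 = 2 * 0.2 * 1.0 / 1.2
baseline_f1 = f1_score(y_true, constant_pred)
print('baseline F1:', round(baseline_f1, 2))
```

An F1 of about 0.61 is well above a baseline of this kind, which supports the claim that the model has learned something useful.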


# Overall conclusion

1. Opened and analyzed the data; checked for duplicates and class imbalance in the target variable.
2. Deleted the unnecessary columns and the rows with missing values, then balanced the classes in the target variable by upsampling.
3. Analyzed a total of 4 models: "Decision Tree", "Random Forest", "Gradient Boosting", and "Logistic Regression". The Random Forest model showed the best combination of F1 score (0.60) and AUC-ROC (0.85) on the validation set.
4. The F1 score is considerably lower than the AUC-ROC value.

According to the task, it was necessary to build a classification model with the best possible F1 score, with a required minimum of 0.59.
'Random Forest Classifier' with max_depth=5 and n_estimators=10 is the best model, with an F1 score of 0.61 and an AUC-ROC of 0.85 on the test set.