# Customer churn

Customers have started leaving Beta Bank: a few every month, but enough to be noticeable. The bank's marketers have calculated that retaining existing customers is cheaper than attracting new ones.

The task is to predict whether a customer will leave the bank in the near future. Historical data on customer behavior and contract terminations is provided.

Build a model with the highest possible *F1* score. To pass the project, the score must reach at least 0.59. Verify the *F1* score on the test set yourself.

Additionally, measure *AUC-ROC* and compare its value with the *F1* score.

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data preparation

In [1]:
import pandas as pd
from sklearn.metrics import f1_score, confusion_matrix, roc_auc_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
data = pd.read_csv('datasets/Churn.csv')
print(data.head())
print(data.shape)
print(data.isna().sum())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0     2.0       0.00              1          1               1   
1     1.0   83807.86              1          0               1   
2     8.0  159660.80              3          1               0   
3     1.0       0.00              2          0               0   
4     2.0  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         790

In [2]:
# remove identifier columns that have no predictive value
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
# fill the gaps in Tenure with the median value
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())
print(data.isna().sum())

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64


In [3]:
# one-hot encode the categorical features; drop_first=True drops one dummy column per feature
data_ohe = pd.get_dummies(data, drop_first=True)
target = data_ohe['Exited']
features = data_ohe.drop('Exited', axis=1)
# hold out 20% of the data as the test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)
# take 25% of the remaining data as the validation set (a 60/20/20 split overall)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)
# check the result
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)
print(target_train.shape)
print(target_valid.shape)
print(target_test.shape)
print(data_ohe.info())


(6000, 11)
(2000, 11)
(2000, 11)
(6000,)
(2000,)
(2000,)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Age                10000 non-null  int64  
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  int64  
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  uint8  
 10  Geography_Spain    10000 non-null  uint8  
 11  Gender_Male        10000 non-null  uint8  
dtypes: float64(3), int64(6), uint8(3)
memory usage: 732.6 KB
None


In [4]:
# scale the numeric features; the scaler is fitted on the training set only
scaler = StandardScaler()
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
# silence SettingWithCopyWarning for the column assignments below
pd.options.mode.chained_assignment = None
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])


## Problem research

In [5]:
# Let's try to train a logistic regression model
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(confusion_matrix(target_valid, predicted_valid)) 
print("F1:", f1_score(target_valid, predicted_valid))
print("AUC_ROC:",auc_roc)

[[1549   60]
 [ 311   80]]
F1: 0.30131826741996237
AUC_ROC: 0.7703391568208876
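
The F1 score can be reproduced by hand from the confusion matrix, which helps show why it is so low. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and copies the four counts from the output above:

```python
# Counts copied from the confusion matrix printed above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1549, 60
fn, tp = 311, 80

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

Precision is moderate (about 0.57), but recall is only about 0.20: the model misses roughly four out of five churners, and that drags F1 down.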


In [6]:
# Let's try to train a random forest model
best_score = 0
best_depth = 0
best_est = 0
for depth in range(1, 16, 1):
    for est in range(21, 101, 10):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345)
        model.fit(features_train, target_train)
        predicted_valid=model.predict(features_valid)
        score=f1_score(target_valid, predicted_valid) 
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est
print(best_score)
print(best_depth)
print(best_est)

0.5772870662460569
15
41


Neither model reaches the target F1 of 0.59.
Let's check the class balance:

In [7]:
target_train[target_train==1].count()

1219

In [8]:
target_train[target_train==0].count()

4781

Conclusion: the classes are imbalanced. The training set contains almost four times fewer positive examples (1219) than negative ones (4781).
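
To put numbers on the imbalance, the sketch below computes the negative-to-positive ratio from the two counts above. It also computes the per-class weights that scikit-learn's `class_weight='balanced'` option is documented to use, `n_samples / (n_classes * class_count)`. The variable names are illustrative only.

```python
# Class counts printed by the two cells above
n_pos, n_neg = 1219, 4781
n_samples, n_classes = n_pos + n_neg, 2

# how many negative examples there are per positive example
imbalance_ratio = n_neg / n_pos

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weight_pos = n_samples / (n_classes * n_pos)
weight_neg = n_samples / (n_classes * n_neg)

print(f"negatives per positive: {imbalance_ratio:.2f}")
print(f"weight for class 0: {weight_neg:.3f}")
print(f"weight for class 1: {weight_pos:.3f}")
```

With these weights, each positive example counts about 3.9 times as much as a negative one, so both classes contribute equally to the total weight.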

## Fighting imbalance

In [9]:
# train a logistic regression model with class weights to account for the imbalance
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print("F1:", f1_score(target_valid, predicted_valid))
print("AUC_ROC:",auc_roc)

F1: 0.4741532976827095
AUC_ROC: 0.7725660805030526
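
Class weighting raised F1 from about 0.30 to 0.47 while AUC-ROC barely moved (0.770 to 0.773). One plausible reading is that F1 depends on where the decision boundary sits, whereas AUC-ROC depends only on how the model ranks customers. The sketch below illustrates this on invented toy scores (not the bank data), with hypothetical helper functions `f1_at` and `auc_roc`:

```python
# Toy labels and scores, invented for illustration only
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

def f1_at(threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auc_roc():
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print("F1 at threshold 0.50:", round(f1_at(0.50), 3))
print("F1 at threshold 0.33:", round(f1_at(0.33), 3))
print("AUC-ROC (no threshold involved):", round(auc_roc(), 3))
```

The same scores give a noticeably different F1 at the two thresholds, while AUC-ROC is a single number for the whole ranking.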


In [10]:
# try a random forest with different hyperparameters, using class weights to account for the imbalance
best_score = 0
best_depth = 0
best_est = 0
for depth in range(1, 16, 1):
    for est in range(21, 101, 10):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345, class_weight='balanced')
        model.fit(features_train, target_train)
        predicted_valid=model.predict(features_valid)
        score=f1_score(target_valid, predicted_valid) 
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est
print(best_score)
print(best_depth)
print(best_est)

0.5963060686015831
11
21


In [11]:
# check whether upsampling the positive class improves the quality of the model
# helper that enlarges the training set by duplicating the positive-class rows
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled
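
Before picking a value for `repeat`, it can help to see what class balance each choice would produce. The sketch below uses only the training-set class counts reported earlier (1219 positive, 4781 negative); the helper name `ratio_after` is hypothetical.

```python
# Training-set class counts reported earlier in the notebook
n_pos, n_neg = 1219, 4781

def ratio_after(repeat):
    """Positive-to-negative ratio after repeating the positive rows `repeat` times."""
    return (n_pos * repeat) / n_neg

for repeat in (1, 2, 3, 4, 5):
    print(f"repeat={repeat}: {n_pos * repeat} positives, "
          f"ratio to negatives = {ratio_after(repeat):.2f}")
```

A repeat factor of 4 gives 4876 positive rows against 4781 negative ones, which is the closest to a 1:1 balance among whole-number factors.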

In [12]:
# upsample the training set by repeating the positive-class rows four times
# (this gives a class balance of about 1:1)

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
best_score = 0
best_depth = 0
best_est = 0
for depth in range(1, 16, 1):
    for est in range(21, 101, 10):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345, class_weight='balanced')
        model.fit(features_upsampled, target_upsampled)
        predicted_valid=model.predict(features_valid)
        score=f1_score(target_valid, predicted_valid) 
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est
print(best_score)
print(best_depth)
print(best_est)


0.5993031358885018
11
91


Conclusion: upsampling the training set by duplicating positive-class rows improves the validation F1 only marginally (0.5993 versus 0.5963) over the model that relies on class weights alone.

## Model testing

In [13]:
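# check the model's performance on the test set
# note: n_estimators=80 and max_depth=12 differ from the best values found on validation above (91 and 11)
model = RandomForestClassifier(n_estimators=80, max_depth=12, random_state=12345, class_weight='balanced')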
model.fit(features_upsampled, target_upsampled)
predicted_test=model.predict(features_test)
probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
auc_roc=roc_auc_score(target_test, probabilities_one_test)
print("F1:", f1_score(target_test, predicted_test))
print("Recall:", recall_score(target_test, predicted_test))
print("AUC_ROC:",auc_roc)

F1: 0.6231386025200458
Recall: 0.6370023419203747
AUC_ROC: 0.8590202643853913
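
F1 and recall together determine precision, because F1 = 2PR / (P + R) can be solved for P. A small sketch using the two values printed above (the variable names are illustrative only):

```python
# Test-set metrics copied from the output above
f1 = 0.6231386025200458
recall = 0.6370023419203747

# Solve F1 = 2 * P * R / (P + R) for P
precision = f1 * recall / (2 * recall - f1)

print(f"precision: {precision:.3f}")
```

So roughly 61% of the customers the model flags as likely to leave actually do, which is close to the recall and explains why F1 sits between the two.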


In [14]:
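# baseline: a constant model that predicts churn (class 1) for every customer
from sklearn.metrics import precision_score
dummy_model = DummyClassifier(strategy='constant', constant=1)
dummy_model.fit(features_train, target_train)
dummy_predicted = dummy_model.predict(features_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("F1:", f1_score(target_test, dummy_predicted))
print("Recall:", recall_score(target_test, dummy_predicted))
print("Precision:", precision_score(target_test, dummy_predicted))

F1: 0.3518747424804285
Recall: 1.0
Precision: 0.2135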
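
A model that predicts churn for every customer has a recall of 1 by construction, and its precision equals the share of positives in the test set. The sketch below assumes that share is 0.2135 (the value printed above, which corresponds to 427 of the 2000 test rows) and checks that it reproduces the baseline F1:

```python
# Assumed share of churners in the test set (427 of 2000 rows)
positive_share = 427 / 2000

# Always predicting class 1: every churner is found, most flags are wrong
recall = 1.0
precision = positive_share

f1 = 2 * precision * recall / (precision + recall)
print(f"baseline F1: {f1:.4f}")
```

The result matches the F1 printed for the constant model, so the baseline comparison rests on F1 rather than on recall.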


Conclusion: the final model reaches an F1 of about 0.62 on the test set, well above the 0.35 of a constant model that always predicts churn, and an AUC-ROC of 0.859, which makes it usable for predicting customer churn. The model correctly identifies 63.7% of the customers who churn (recall). The constant model flags every customer, so it trivially finds all churners, but only about 21% of the customers it flags actually churn.