<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-description" data-toc-modified-id="Project-description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project description</a></span></li><li><span><a href="#Data-description" data-toc-modified-id="Data-description-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data description</a></span></li><li><span><a href="#Data-pre-processing" data-toc-modified-id="Data-pre-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data pre-processing</a></span><ul class="toc-item"><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></li><li><span><a href="#Research-task" data-toc-modified-id="Research-task-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Research task</a></span><ul class="toc-item"><li><span><a href="#Random-forest" data-toc-modified-id="Random-forest-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Decision-tree" data-toc-modified-id="Decision-tree-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></li><li><span><a href="#Dealing-with-imbalance" data-toc-modified-id="Dealing-with-imbalance-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Dealing with imbalance</a></span><ul class="toc-item"><li><span><a href="#Upsampling" data-toc-modified-id="Upsampling-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Upsampling</a></span><ul class="toc-item"><li><span><a href="#Random-forest" data-toc-modified-id="Random-forest-5.1.1"><span 
class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Decision-tree" data-toc-modified-id="Decision-tree-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Logistic regression</a></span></li></ul></li><li><span><a href="#Balancing-classes" data-toc-modified-id="Balancing-classes-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Balancing classes</a></span><ul class="toc-item"><li><span><a href="#Random-forest" data-toc-modified-id="Random-forest-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Decision-tree" data-toc-modified-id="Decision-tree-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-5.2.3"><span class="toc-item-num">5.2.3&nbsp;&nbsp;</span>Logistic regression</a></span></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></li><li><span><a href="#Model-testing" data-toc-modified-id="Model-testing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model testing</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Bank churn prediction

## Project description

**Brief information:** the marketing department of "Beta-Bank" noticed that customers were leaving the bank. The bank considers it cheaper to retain existing customers than to attract new ones.

**Objective:** develop a binary classification model that predicts whether or not a customer is about to leave the bank based on customer behavior.

**Tasks:** compare several binary classification models by F1 score and select the best one. The F1 score on the test set must be at least 0.59.

## Data description

The dataset contains a table with the following columns:
* `RowNumber` — row index
* `CustomerId` — unique client id
* `Surname` — client's surname
* `CreditScore` — customer's credit score
* `Geography` — country of residence
* `Gender` — client's gender
* `Age` — customer's age
* `Tenure` — how many years a person has been a bank customer
* `Balance` — client's balance
* `NumOfProducts` — number of banking products in use
* `HasCrCard` — credit card holder status
* `IsActiveMember` — client activity
* `EstimatedSalary` — customer's estimated salary
* `Exited` — whether the customer has left the bank or not

## Data pre-processing

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split 
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.utils import shuffle
from sklearn.metrics import f1_score, roc_curve, roc_auc_score

In [None]:
data = pd.read_csv('https://code.s3.yandex.net/datasets/Churn.csv')

In [None]:
data.info()

We will remove the columns `RowNumber`, `CustomerId`, `Surname` from the dataset for further analysis. Customer IDs are always unique and have no connection to the target. Also, the customer's row number and last name are not related to the probability of the customer leaving the bank.

In [None]:
data.drop(["RowNumber", "CustomerId", "Surname"], axis = 1, inplace=True)

In [None]:
data.head()

In [None]:
data.describe()

Note that there are missing values in the `Tenure` column. Before filling them, let's look at the distribution of the column. It is close to uniform, so a reasonable approach is to fill the missing values with random integers from 0 to 10, each drawn with the frequency observed in the non-missing data.

In [None]:
#tenure column histogram (one bin per value from 0 to 10)
data['Tenure'].hist(bins=11, figsize=(10,6))
plt.title('Tenure histogram')
plt.xlabel('Number of years')
plt.ylabel('Number of observations');

In [None]:
#list with probabilities of every year
tenure_probs = data['Tenure'].value_counts(normalize=True).sort_index().to_list()

In [None]:
#setting random seed
np.random.seed(42)
#filling missing values from 0 to 10 with the corresponding probability from declared list
data['Tenure'] = data['Tenure'].fillna(pd.Series(np.random.choice(11, size=data.shape[0], p=tenure_probs)))

In [None]:
#check
data.info()

In [None]:
#plot check (one bin per value from 0 to 10)
data['Tenure'].hist(bins=11, figsize=(10,6))
plt.title('Tenure histogram')
plt.xlabel('Number of years')
plt.ylabel('Number of observations');

Thus, the missing values in the `Tenure` column are filled in, and the histogram keeps the shape of the original distribution, so the fill did not distort the column.
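
The visual check above can also be made numeric. The following is a minimal sketch on a toy column (the values are hypothetical, not the bank data): it draws one replacement per missing row from the observed frequencies and reports how much the share of any value shifts after filling.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Tenure column (hypothetical values, not the bank data)
rng = np.random.default_rng(42)
toy_tenure = pd.Series(rng.integers(0, 11, size=1000).astype(float))
toy_tenure.iloc[::10] = np.nan  # make every 10th value missing

# Observed share of each value among non-missing rows, ordered by value
probs = toy_tenure.value_counts(normalize=True).sort_index()

# Draw exactly one replacement per missing row
missing = toy_tenure.isna()
draws = rng.choice(probs.index.to_numpy(), size=missing.sum(), p=probs.to_numpy())
filled = toy_tenure.copy()
filled[missing] = draws

# Largest absolute change in the share of any single value after filling
shares_after = filled.value_counts(normalize=True).sort_index()
max_shift = (shares_after - probs).abs().max()
print(f"missing after fill: {filled.isna().sum()}, max share shift: {max_shift:.4f}")
```

A small shift means the filled column keeps roughly the same proportions as the observed values.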

Before model training, the text columns `Geography` and `Gender` need to be encoded as categorical features. This is done with one-hot encoding inside the model pipelines.

### Conclusions

* **Since the distribution of the `Tenure` column is close to uniform, missing values were filled with random values from 0 to 10 drawn with their observed frequencies;**

* **The columns `RowNumber`, `CustomerId` and `Surname` have been deleted: IDs are unique for every customer and have no relation to the target, and neither the row number nor the surname is related to the probability of the customer leaving the bank. Using such features to build a model can lead to overfitting;**

* **The columns with text values (`Gender` and `Geography`) need to be encoded as categorical features.**

## Research task

Determining a bank's customer churn is a binary classification task.

In this case, the target characteristic is the column `Exited` (if the customer has left the bank - 1, if not - 0). All other columns containing customer socio-demographic characteristics (country, age, gender) and consumer characteristics (credit score, number of years as a customer, account balance, number of bank products, having a credit card, and customer activity) are features.

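Before building models, we'll examine the class balance in the target and split the dataset into a combined training/validation set and a test set (the ratio is 4:1). Validation is done by cross-validation inside `GridSearchCV`, so no separate validation set is held out. We'll use decision tree, random forest, and logistic regression models.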

In [None]:
#features and target variables
target = data['Exited']
features = data.drop('Exited', axis=1)

In [None]:
#class balance
target.value_counts().plot(kind='bar', figsize=(10,4))
plt.title('Class distribution')
plt.xlabel('Class')
plt.ylabel('Number of observations');

Given the class imbalance (class 0 is about four times larger than class 1), we'll use a stratified split so that both sets keep this ratio.

In [None]:
#split dataset
features_train_valid, features_test, target_train_valid, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target)
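
As an illustration of what `stratify` does, here is a minimal sketch on toy data (the 800/200 target and the column `x` are hypothetical). It compares the share of class 1 in both parts after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with roughly the same 4:1 imbalance (hypothetical data)
toy_target = pd.Series([0] * 800 + [1] * 200)
toy_features = pd.DataFrame({"x": range(len(toy_target))})

X_rest, X_test, y_rest, y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42, stratify=toy_target)

# With stratification both parts keep the class 1 share of the full set
share_rest = y_rest.mean()
share_test = y_test.mean()
print(f"class 1 share, train/valid part: {share_rest:.3f}, test part: {share_test:.3f}")
```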

It is important to identify potential churners correctly, since it is cheaper to keep current customers than to attract new ones. Because the classes are imbalanced, we will evaluate model quality with the F1 score. F1 is computed for the positive class (class 1), so it reflects how well the model predicts customers who leave.
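
To show why F1 is more informative than accuracy here, the sketch below computes F1 from confusion-matrix counts for two hypothetical models on 1000 customers, 200 of whom leave. The helper name `f1_from_counts` and all counts are illustrative assumptions.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model A predicts that nobody leaves: 80% accuracy, but no churner is found
f1_a = f1_from_counts(tp=0, fp=0, fn=200)

# Model B finds 120 of 200 churners and raises 80 false alarms
f1_b = f1_from_counts(tp=120, fp=80, fn=80)

print(f"model A F1: {f1_a:.2f}, model B F1: {f1_b:.2f}")
```

Model A looks acceptable by accuracy, yet its F1 is zero because it never predicts class 1.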

### Random forest

In [None]:
%%time
#text columns encoder for rf and dt with other columns passing through as they are
#the builtin object dtype selects the text columns (Geography, Gender)
one_hot_encoder = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
#pipe for rf with encoding and model
rf_pipe = Pipeline(steps=[('preprocessor', one_hot_encoder), ('rf', RandomForestClassifier(random_state=42))])

#parameters grid for rf
rf_grid = {
  "rf__max_depth": range(2, 20, 2),
  "rf__n_estimators": range(10, 350, 25),
  "rf__max_features": range(2, 14)
}
#gridsearching
gcv_rf = GridSearchCV(estimator=rf_pipe, param_grid=rf_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_rf.fit(features_train_valid, target_train_valid)
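
The run time of this search is driven by the size of the grid. A rough estimate, assuming the default 5-fold cross-validation of `GridSearchCV` (the `cv` argument is not set above):

```python
import math

# Same value ranges as in rf_grid above
rf_grid_sizes = {
    "rf__max_depth": range(2, 20, 2),
    "rf__n_estimators": range(10, 350, 25),
    "rf__max_features": range(2, 14),
}

# Number of hyperparameter combinations in the full grid
n_candidates = math.prod(len(values) for values in rf_grid_sizes.values())

n_folds = 5  # assumption: GridSearchCV default when cv is not passed
n_fits = n_candidates * n_folds  # the final refit on the whole set is extra

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```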

In [None]:
#df with results for rf
results_rf = pd.DataFrame(gcv_rf.cv_results_)

In [None]:
#model quality versus parameters plot
fig = px.scatter_3d(results_rf, x='param_rf__max_depth', y='param_rf__max_features', z='param_rf__n_estimators',
                    color='mean_test_score', color_continuous_scale=px.colors.sequential.Magma)
fig.show()

In [None]:
print(f'Best params for rf without class balancing: {gcv_rf.best_params_}')
print(f'Best F1 metrics for rf without class balancing: {gcv_rf.best_score_}')

### Decision tree

In [None]:
%%time
#pipe for dt with encoding and model (reuses one_hot_encoder defined above)
dt_pipe = Pipeline(steps=[('preprocessor', one_hot_encoder), ('dt', DecisionTreeClassifier(random_state=42))])

#parameters grid for dt
dt_grid = {
  "dt__max_depth": range(2, 10),
  "dt__min_samples_leaf": range(2, 11),
  "dt__min_samples_split": range(2, 11)
  }

#gridsearching
gcv_dt = GridSearchCV(estimator=dt_pipe, param_grid=dt_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_dt.fit(features_train_valid, target_train_valid)

In [None]:
#df with results for dt
results_dt = pd.DataFrame(gcv_dt.cv_results_)

In [None]:
#model quality versus parameters plot
fig = px.scatter_3d(results_dt, x='param_dt__max_depth', y='param_dt__min_samples_leaf', z='param_dt__min_samples_split',
                    color='mean_test_score', color_continuous_scale=px.colors.sequential.Magma)
fig.show()

In [None]:
print(f'Best params for dt without class balancing: {gcv_dt.best_params_}')
print(f'Best F1 metrics for dt without class balancing: {gcv_dt.best_score_}')

### Logistic regression

For logistic regression, we'll also standardize the numerical values in the pipeline.
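
Standardization brings features with very different ranges to a common scale. The sketch below is a plain-Python illustration of the z-score that `StandardScaler` applies per column (mean 0, population standard deviation 1). The helper `standardize` and the sample values are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales
balances = [0.0, 50_000.0, 100_000.0, 150_000.0]
ages = [25.0, 35.0, 45.0, 55.0]

balances_scaled = standardize(balances)
ages_scaled = standardize(ages)
print(balances_scaled)
print(ages_scaled)
```

After scaling, both columns have comparable magnitudes, so the regularized logistic regression does not treat `Balance` differently from `Age` merely because of its units.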

In [None]:
%%time
#column transformer for lr: scaling for numeric columns, one-hot encoding for text columns
one_hot_scaler = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(), make_column_selector(dtype_include=object))
)

#pipe for lr with encoding, scaling and model
log_pipe = Pipeline(steps=[('preprocessor', one_hot_scaler), ('log', LogisticRegression(random_state=42, max_iter=1000))])

#parameters grid for lr
log_grid = {
  "log__solver": ['liblinear', 'lbfgs'],
  }

#gridsearching
gcv_log = GridSearchCV(estimator=log_pipe, param_grid=log_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_log.fit(features_train_valid, target_train_valid)

In [None]:
print(f'Best params for lr without class balancing: {gcv_log.best_params_}')
print(f'Best F1 metrics for lr without class balancing: {gcv_log.best_score_}')

### Conclusions

* **The worst F1 score among the models without class balancing was demonstrated by logistic regression (F1 score is ~0.32);**
* **The best quality was demonstrated by a random forest model with the following parameters: tree depth - 12, the maximum number of features - 11, number of estimators - 60. The overall score is slightly less than 0.59;**
* **There is a need to correct the class imbalance. We'll try different techniques such as upsampling and internal methods of balancing for each model with hyperparameter tuning.**

## Dealing with imbalance

### Upsampling

In [None]:
#upsampling function: repeats the class 1 rows `repeat` times
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=42)
    
    return features_upsampled, target_upsampled

#how many times the smaller class must be repeated (ratio taken from the training part only)
class_counts = target_train_valid.value_counts()
repeat = round(class_counts[0] / class_counts[1])
#calling function
features_upsampled, target_upsampled = upsample(features_train_valid, target_train_valid, repeat)

In [None]:
#check for imbalance
target_upsampled.value_counts().plot(kind='bar', figsize=(10,4))
plt.title('Class distribution')
plt.xlabel('Class')
plt.ylabel('Number of observations');
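
One caveat of upsampling the whole set before cross-validation is that copies of the same class 1 row can land in both the training and the validation folds, which inflates the cross-validated score. A possible alternative, shown here as a minimal sketch on synthetic data (the data, the helper `upsample_indices` and the model settings are assumptions, not the notebook's method), upsamples only the training part of each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data, roughly 4:1 (hypothetical stand-in for the bank data)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=42)

def upsample_indices(y_part, repeat):
    """Positions of all rows, with class 1 positions repeated `repeat` times."""
    zeros = np.flatnonzero(y_part == 0)
    ones = np.flatnonzero(y_part == 1)
    return np.concatenate([zeros] + [ones] * repeat)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in cv.split(X, y):
    # Upsample the training fold only; the validation fold stays untouched
    up = train_idx[upsample_indices(y[train_idx], repeat=4)]
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[up], y[up])
    fold_scores.append(f1_score(y[valid_idx], model.predict(X[valid_idx])))

print(f"mean F1 on untouched validation folds: {np.mean(fold_scores):.3f}")
```

Scores obtained this way are measured on rows the model has never seen in any form, so they are closer to what can be expected on the test set.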

#### Random forest

In [None]:
%%time
#gridsearching upsampled set for rf with pipe and without internal balancing
gcv_rf_upsampled = GridSearchCV(estimator=rf_pipe, param_grid=rf_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_rf_upsampled.fit(features_upsampled, target_upsampled)

In [None]:
print(f'Best params for rf with upsampling: {gcv_rf_upsampled.best_params_}')
print(f'Best F1 metrics for rf with upsampling: {gcv_rf_upsampled.best_score_}')

#### Decision tree

In [None]:
%%time
#gridsearching upsampled set for dt with pipe and without internal balancing
gcv_dt_upsampled = GridSearchCV(estimator=dt_pipe, param_grid=dt_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_dt_upsampled.fit(features_upsampled, target_upsampled)

In [None]:
print(f'Best params for dt with upsampling: {gcv_dt_upsampled.best_params_}')
print(f'Best F1 metrics for dt with upsampling: {gcv_dt_upsampled.best_score_}')

#### Logistic regression

In [None]:
%%time
#gridsearching upsampled set for lr with pipe and without internal balancing
gcv_log_upsampled = GridSearchCV(estimator=log_pipe, param_grid=log_grid, scoring='f1', verbose=1, n_jobs=-1)
gcv_log_upsampled.fit(features_upsampled, target_upsampled)

In [None]:
print(f'Best params for lr with upsampling: {gcv_log_upsampled.best_params_}')
print(f'Best F1 metrics for lr with upsampling: {gcv_log_upsampled.best_score_}')

### Balancing classes

Let's use the built-in methods in each model for class balancing without upsampling.
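
With `class_weight='balanced'`, scikit-learn weights each class as `n_samples / (n_classes * class_count)`. A small plain-Python sketch of that formula on a hypothetical 800/200 target (the helper name is illustrative):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequencies."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical target with a 4:1 imbalance
toy_labels = [0] * 800 + [1] * 200
weights = balanced_class_weights(toy_labels)
print(weights)
```

Errors on the rare class are penalized four times more heavily than errors on the frequent one, without duplicating any rows.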

#### Random forest

In [None]:
%%time
#grid for rf
rf_grid_balanced = {
  "rf__max_depth": range(2, 20, 2),
  "rf__n_estimators": range(10, 350, 25),
  "rf__max_features": range(2, 14),
  "rf__class_weight": ['balanced']
}
#gridsearching
gcv_rf_balanced = GridSearchCV(estimator=rf_pipe, param_grid=rf_grid_balanced, scoring='f1', verbose=1, n_jobs=-1)
gcv_rf_balanced.fit(features_train_valid, target_train_valid)

In [None]:
print(f'Best params for rf with balanced classes: {gcv_rf_balanced.best_params_}')
print(f'Best F1 metrics for rf with balanced classes: {gcv_rf_balanced.best_score_}')

#### Decision tree

In [None]:
%%time
#grid for dt
dt_grid_balanced = {
  "dt__max_depth": range(2, 10),
  "dt__min_samples_leaf": range(2, 11),
  "dt__min_samples_split": range(2, 11),
  "dt__class_weight": ['balanced']
  }

#gridsearching
gcv_dt_balanced = GridSearchCV(estimator=dt_pipe, param_grid=dt_grid_balanced, scoring='f1', verbose=1, n_jobs=-1)
gcv_dt_balanced.fit(features_train_valid, target_train_valid)

In [None]:
print(f'Best params for dt with balanced classes: {gcv_dt_balanced.best_params_}')
print(f'Best F1 metrics for dt with balanced classes: {gcv_dt_balanced.best_score_}')

#### Logistic regression

In [None]:
#grid for lr
log_grid_balanced = {
  "log__solver": ['liblinear', 'lbfgs'],
  "log__class_weight": ['balanced']
  }

#gridsearching
gcv_log_balanced = GridSearchCV(estimator=log_pipe, param_grid=log_grid_balanced, scoring='f1', verbose=1, n_jobs=-1)
gcv_log_balanced.fit(features_train_valid, target_train_valid)

In [None]:
print(f'Best params for lr with balanced classes: {gcv_log_balanced.best_params_}')
print(f'Best F1 metrics for lr with balanced classes: {gcv_log_balanced.best_score_}')

### Conclusions

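* **The highest cross-validated F1 score (~0.96) was obtained by the random forest fitted on the upsampled set, with the following hyperparameters: depth - 18, maximum number of features - 2, number of estimators - 110. This score is optimistic: the smaller class was upsampled before cross-validation, so copies of the same class 1 rows fall into both the training and validation folds, and the value is not directly comparable with the scores obtained without upsampling;**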
* **Now let's check the F1 score and ROC-AUC for the built model on the test dataset.**

## Model testing

In [None]:
#model predictions for test dataset
predictions_test = gcv_rf_upsampled.predict(features_test)

In [None]:
print(f'F1 metrics for test dataset: {f1_score(target_test, predictions_test)}')

In [None]:
#ROC-AUC
probabilities_test = gcv_rf_upsampled.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

fpr, tpr, thresholds = roc_curve(target_test, probabilities_one_test)

plt.figure()

plt.plot(fpr, tpr)

# ROC-curve for random model
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC-curve')
plt.show()

In [None]:
auc_roc = roc_auc_score(target_test, probabilities_one_test)
print(f'ROC-AUC for test dataset: {auc_roc}')
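
ROC-AUC can be read as the probability that a randomly chosen class 1 customer receives a higher predicted probability than a randomly chosen class 0 customer. The sketch below computes it directly from that definition on four hypothetical predictions (the helper `pairwise_auc` is illustrative, not part of scikit-learn).

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of class 1
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"pairwise ROC-AUC: {toy_auc}")
```

A model that ranks customers at random gets about 0.5 by this measure, and a perfect ranking gets 1.0.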

## Summary

* **The highest F1 score was demonstrated by the random forest fitted on the upsampled set. On the test set its F1 score is ~0.59 and its ROC-AUC is 0.84, which is well above the 0.5 of a random model. Random forest hyperparameters are as follows: tree depth - 18, maximum number of features - 2, number of estimators - 110;**
* **The model is useful for the bank because it is cheaper to keep existing customers than to attract new ones. With the model, it is possible to predict which customers are likely to leave the bank in the near future. This signals to the marketing department that these customers should be offered either a rate change or additional services that can retain them. However, this requires additional marketing analysis of customer preferences.**