# Customer Churn Prediction

Customers have started leaving Beta Bank. Every month, a few at a time, but noticeably. The bank's marketers have calculated that retaining existing customers is cheaper than acquiring new ones.

The task is to predict whether a customer will leave the bank in the near future. Historical data on customer behavior and contract terminations is provided.

The goal is to build a model with the highest possible *F1* score; the metric must reach at least 0.59.

In addition, *AUC-ROC* is measured and compared with the *F1* score.
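
For reference, *F1* is the harmonic mean of precision and recall and is computed at a fixed decision threshold, whereas *AUC-ROC* is computed from the raw scores. A minimal sketch in plain Python; the confusion-matrix counts are made up for illustration and the helper name `f1_from_counts` is hypothetical:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall; 0.0 when undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 60 churners caught, 40 false alarms, 40 churners missed.
score = f1_from_counts(tp=60, fp=40, fn=40)
print(round(score, 3))
```

With precision and recall both at 0.6 the score is also 0.6, just above the 0.59 target.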

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data preparation

In [None]:
!pip install scikit-learn==1.1.3

In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, roc_auc_score, recall_score
from sklearn.utils import shuffle

In [None]:
df = pd.read_csv('/datasets/Churn.csv')

In [None]:
df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [None]:
df.duplicated().sum()

0

In [None]:
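# Insert "_" before every capital letter except the first, then lowercase,
# so CamelCase column names become snake_case.
pat = re.compile(r'(?<!^)(?=[A-Z])')
df = df.rename(lambda s: pat.sub('_', s).lower(), axis=1)
df.columns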

Index(['row_number', 'customer_id', 'surname', 'credit_score', 'geography',
       'gender', 'age', 'tenure', 'balance', 'num_of_products', 'has_cr_card',
       'is_active_member', 'estimated_salary', 'exited'],
      dtype='object')
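
The renaming pattern can be tried on its own. This is a small sketch using the same regular expression as the cell above; `to_snake` is a hypothetical helper name, and `AUCScore` is a made-up column name (not in this dataset) that shows the pattern's limitation with runs of capitals:

```python
import re

# Zero-width pattern: matches the empty position before any capital letter
# that is not at the very start of the string.
pat = re.compile(r'(?<!^)(?=[A-Z])')

def to_snake(name: str) -> str:
    """Convert a CamelCase name to snake_case."""
    return pat.sub('_', name).lower()

for name in ['CustomerId', 'NumOfProducts', 'HasCrCard', 'AUCScore']:
    print(name, '->', to_snake(name))
```

Every capital letter gets its own underscore, so an acronym such as `AUC` is split into single letters. None of the columns in this dataset contain acronyms, so the simple pattern is sufficient here.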

Drop the columns that will not help the model find patterns.

In [None]:
df = df.drop(['row_number', 'customer_id', 'surname'], axis=1)

Check the share of missing values in `tenure` and drop those rows.

In [None]:
df['tenure'].isna().mean()

0.0909

In [None]:
df = df.dropna(subset=['tenure']).reset_index(drop=True)

Separate the target from the features.

In [None]:
features = df.drop(['exited'], axis=1)
target = df['exited']

Split the data into training, validation, and test sets.

In [None]:
seed = 0

In [None]:
# 60% goes to training; the remaining 40% is split evenly into validation and test.
features_train, _features, target_train, _target = train_test_split(
    features, target, test_size=0.4, random_state=seed)
features_valid, features_test, target_valid, target_test = train_test_split(
    _features, _target, test_size=0.5, random_state=seed)

In [None]:
print(features_train.shape, target_train.shape)
print(features_valid.shape, target_valid.shape)
print(features_test.shape, target_test.shape)

(5454, 10) (5454,)
(1818, 10) (1818,)
(1819, 10) (1819,)
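
The two-step split gives roughly a 60/20/20 ratio. The arithmetic can be sketched as below, assuming the test part of each split is rounded up, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function:

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test), rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 9091  # rows left after dropping missing tenure values
n_train, n_rest = split_sizes(n_total, 0.4)
n_valid, n_test = split_sizes(n_rest, 0.5)
print(n_train, n_valid, n_test)  # 5454 1818 1819
```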


Convert the categorical features to numeric ones with One-Hot Encoding.

In [None]:
columns = ['gender', 'geography']

# Fit the encoder on the training set only, then reuse it for all three sets.
# `sparse=False` matches the pinned scikit-learn 1.1.3 (renamed `sparse_output` in 1.2).
ohe = OneHotEncoder(drop='first', handle_unknown='ignore', sparse=False)
ohe.fit(features_train[columns])

new_columns = ohe.get_feature_names_out()

# Add the encoded columns and drop the original categorical ones.
features_train[new_columns] = ohe.transform(features_train[columns])
features_train = features_train.drop(columns, axis=1)

features_valid[new_columns] = ohe.transform(features_valid[columns])
features_valid = features_valid.drop(columns, axis=1)

features_test[new_columns] = ohe.transform(features_test[columns])
features_test = features_test.drop(columns, axis=1)

In [None]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5454 entries, 8425 to 2732
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit_score       5454 non-null   int64  
 1   age                5454 non-null   int64  
 2   tenure             5454 non-null   float64
 3   balance            5454 non-null   float64
 4   num_of_products    5454 non-null   int64  
 5   has_cr_card        5454 non-null   int64  
 6   is_active_member   5454 non-null   int64  
 7   estimated_salary   5454 non-null   float64
 8   gender_Male        5454 non-null   float64
 9   geography_Germany  5454 non-null   float64
 10  geography_Spain    5454 non-null   float64
dtypes: float64(6), int64(5)
memory usage: 511.3 KB
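
To make the encoding explicit, here is a hand-rolled sketch of what `drop='first'` together with `handle_unknown='ignore'` amounts to. It is plain Python rather than the scikit-learn implementation; `fit_categories` and `encode` are hypothetical helpers, and `'Italy'` is a made-up value that does not occur in the data:

```python
def fit_categories(values):
    """Learn the sorted category list from training values only."""
    return sorted(set(values))

def encode(value, categories):
    """One-hot encode with the first category dropped.

    An unseen value gets all zeros, the same row as the dropped category.
    """
    return [1.0 if value == cat else 0.0 for cat in categories[1:]]

train_geo = ['France', 'Spain', 'Germany', 'France']
cats = fit_categories(train_geo)   # ['France', 'Germany', 'Spain']
print(encode('Germany', cats))     # [1.0, 0.0]
print(encode('France', cats))      # [0.0, 0.0]  (dropped baseline)
print(encode('Italy', cats))       # [0.0, 0.0]  (unseen, hypothetical value)
```

This is why only `geography_Germany` and `geography_Spain` appear above: France is the baseline. It also shows a caveat of this parameter combination: an unseen category becomes indistinguishable from the baseline.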


Scale the features by standardization.

In [None]:
numeric = ['credit_score', 'age', 'tenure', 'balance', 'num_of_products', 'estimated_salary']

# Fit on the training set only so that no validation or test statistics leak in.
scaler = StandardScaler()
scaler.fit(features_train[numeric])

# Silence SettingWithCopyWarning for the column assignments below.
pd.options.mode.chained_assignment = None
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

In [None]:
features_train.head()

| | credit_score | age | tenure | balance | num_of_products | has_cr_card | is_active_member | estimated_salary | gender_Male | geography_Germany | geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8425 | 0.229308 | 0.568647 | 0.34361 | -0.064517 | -0.922773 | 0 | 1 | 0.684 | 0.0 | 0.0 | 0.0 |
| 2404 | -0.533089 | 1.040674 | -1.031839 | 0.62773 | -0.922773 | 1 | 0 | 1.473778 | 1.0 | 0.0 | 0.0 |
| 3465 | -0.564421 | -1.508269 | 0.34361 | -1.228064 | 0.804932 | 1 | 0 | 0.911338 | 0.0 | 0.0 | 1.0 |
| 5932 | -0.31377 | -1.130648 | 1.375197 | 1.0163 | -0.922773 | 1 | 1 | 0.902799 | 1.0 | 0.0 | 1.0 |
| 8224 | 1.211024 | 0.757458 | -0.000252 | 0.450016 | -0.922773 | 1 | 0 | -0.900346 | 0.0 | 1.0 | 0.0 |
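
Standardization subtracts the training mean and divides by the training standard deviation. This is a plain-Python sketch of the same idea (not the scikit-learn code); `fit_scaler` and `transform` are hypothetical helpers and the validation ages are made up:

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Return (mean, std) estimated from the training values only."""
    # Population standard deviation (ddof=0), as StandardScaler uses.
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Apply the already fitted mean and std to any set of values."""
    return [(v - mean) / std for v in values]

train_age = [42, 41, 42, 39, 43]   # ages from the first rows of the dataset
valid_age = [30, 50]               # made-up validation values

mean, std = fit_scaler(train_age)
print(transform(train_age, mean, std))
print(transform(valid_age, mean, std))
```

The validation values are transformed with the training statistics, so they do not necessarily end up with zero mean and unit variance. That is expected and avoids leaking information from the validation and test sets.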


Preprocessing summary:
- The data contains no duplicates.
- Rows with a missing `tenure` value (about 9% of the data) were dropped.
- The row index, customer ID, and surname columns were removed; they are not needed for training.
- The data was split into training, validation, and test sets.
- The features were encoded and scaled and are now ready for model training.

## Exploring the task

In [None]:
pd.DataFrame({'count'   : target_train.value_counts(),
              'fraction': target_train.value_counts(normalize=True)})

| | count | fraction |
|---|---|---|
| 0 | 4350 | 0.79758 |
| 1 | 1104 | 0.20242 |


In [None]:
model = LogisticRegression(solver='liblinear', random_state=seed)
model.fit(features_train, target_train)
proba = model.predict_proba(features_valid)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_valid, pred))
print('AUC-ROC :', roc_auc_score(target_valid, proba))

F1      : 0.3008130081300813
AUC-ROC : 0.7737525632262474


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
for d in range(1, 21):
    model = DecisionTreeClassifier(max_depth=d, random_state=seed)
    model.fit(features_train, target_train)
    proba = model.predict_proba(features_valid)[:, 1]
    pred = proba > 0.5
    f1 = f1_score(target_valid, pred)
    if best_f1 < f1:
        best_f1 = f1
        best_auc = roc_auc_score(target_valid, proba)
        best_depth = d

print('Depth   :', best_depth)
print('F1      :', best_f1)
print('AUC-ROC :', best_auc)

Depth   : 7
F1      : 0.5758513931888545
AUC-ROC : 0.8200619987869803


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
best_est = None
for est in range(10, 51, 10):
    for d in range(1, 11):
        model = RandomForestClassifier(max_depth=d, n_estimators=est, random_state=seed)
        model.fit(features_train, target_train)
        proba = model.predict_proba(features_valid)[:, 1]
        pred = proba > 0.5
        f1 = f1_score(target_valid, pred)
        if best_f1 < f1:
            best_f1 = f1
            best_auc = roc_auc_score(target_valid, proba)
            best_depth = d
            best_est = est

print('Depth      :', best_depth)
print('Estimators :', best_est)
print('F1         :', best_f1)
print('AUC-ROC    :', best_auc)

Depth      : 10
Estimators : 10
F1         : 0.5728987993138935
AUC-ROC    : 0.8421524361479883


The classes are imbalanced: class 0 makes up about 80% of the sample. Under this imbalance the decision tree and the random forest do much better than logistic regression (F1 of about 0.58 and 0.57 versus 0.30), but they still fall short of the 0.59 target.

## Dealing with class imbalance

### Class weighting
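
The cells in this subsection pass `class_weight='balanced'`, which weights each class inversely to its frequency: `n_samples / (n_classes * count)`. A short sketch of that formula applied to the training class counts; `balanced_weights` is a hypothetical helper written for illustration:

```python
def balanced_weights(counts: dict) -> dict:
    """weight[c] = n_samples / (n_classes * count[c])"""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Class counts of the training target shown in the previous section.
weights = balanced_weights({0: 4350, 1: 1104})
print({c: round(w, 3) for c, w in weights.items()})  # {0: 0.627, 1: 2.47}
```

Each churned customer therefore counts roughly four times as much as a retained one, so both classes contribute the same total weight to the loss.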

In [None]:
model = LogisticRegression(solver='liblinear', random_state=seed, class_weight='balanced')
model.fit(features_train, target_train)
proba = model.predict_proba(features_valid)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_valid, pred))
print('AUC-ROC :', roc_auc_score(target_valid, proba))

F1      : 0.5059055118110236
AUC-ROC : 0.7766715123275539


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
for d in range(1, 21):
    model = DecisionTreeClassifier(max_depth=d, random_state=seed, class_weight='balanced')
    model.fit(features_train, target_train)
    proba = model.predict_proba(features_valid)[:, 1]
    pred = proba > 0.5
    f1 = f1_score(target_valid, pred)
    if best_f1 < f1:
        best_f1 = f1
        best_auc = roc_auc_score(target_valid, proba)
        best_depth = d

print('Depth   :', best_depth)
print('F1      :', best_f1)
print('AUC-ROC :', best_auc)

Depth   : 7
F1      : 0.5777262180974478
AUC-ROC : 0.8290489347568665


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
best_est = None
for est in range(10, 51, 10):
    for d in range(1, 11):
        model = RandomForestClassifier(max_depth=d, n_estimators=est, random_state=seed,
                                       class_weight='balanced')
        model.fit(features_train, target_train)
        proba = model.predict_proba(features_valid)[:, 1]
        pred = proba > 0.5
        f1 = f1_score(target_valid, pred)
        if best_f1 < f1:
            best_f1 = f1
            best_auc = roc_auc_score(target_valid, proba)
            best_depth = d
            best_est = est

print('Depth      :', best_depth)
print('Estimators :', best_est)
print('F1         :', best_f1)
print('AUC-ROC    :', best_auc)

Depth      : 7
Estimators : 20
F1         : 0.6113861386138614
AUC-ROC    : 0.863130938742503


### Upsampling

In [None]:
major = target_train == 0
major_count = major.sum()

# Resample the minority class with replacement up to the majority size.
# Both sample() calls use the same n and random_state, so the drawn
# feature rows and target values stay aligned.
features_upsampled = pd.concat(
    [features_train[major],
     features_train[~major].sample(major_count, replace=True, random_state=seed)],
    ignore_index=True
)
target_upsampled = pd.concat(
    [target_train[major],
     target_train[~major].sample(major_count, replace=True, random_state=seed)],
    ignore_index=True
)
features_upsampled, target_upsampled = shuffle(
    features_upsampled, target_upsampled, random_state=seed)

In [None]:
target_upsampled.count(), (target_upsampled==1).sum(), (target_upsampled==0).sum()

(8700, 4350, 4350)
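
The same idea can be written by drawing row positions once, which keeps each feature row paired with its label by construction. This is a plain-Python sketch on toy data, assuming a binary target; `upsample` is a hypothetical helper and not the code used above:

```python
import random

def upsample(rows, labels, minority=1, seed=0):
    """Resample minority rows with replacement until classes are equal."""
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y != minority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    # Draw row positions once so features and labels stay paired.
    drawn = rng.choices(minor_idx, k=len(major_idx))
    idx = major_idx + drawn
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy features (credit score, age) and labels from the first dataset rows.
rows = [[619, 42], [608, 41], [502, 42], [699, 39], [850, 43]]
labels = [1, 0, 1, 0, 0]
up_rows, up_labels = upsample(rows, labels)
print(up_labels.count(0), up_labels.count(1))  # 3 3
```

Only the training set is resampled; the validation and test sets keep their original class ratio so that the metrics reflect real conditions.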

In [None]:
model = LogisticRegression(solver='liblinear', random_state=seed)
model.fit(features_upsampled, target_upsampled)
proba = model.predict_proba(features_valid)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_valid, pred))
print('AUC-ROC :', roc_auc_score(target_valid, proba))

F1      : 0.5
AUC-ROC : 0.7744206868002272


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
for d in range(1, 21):
    model = DecisionTreeClassifier(max_depth=d, random_state=seed)
    model.fit(features_upsampled, target_upsampled)
    proba = model.predict_proba(features_valid)[:, 1]
    pred = proba > 0.5
    f1 = f1_score(target_valid, pred)
    if best_f1 < f1:
        best_f1 = f1
        best_auc = roc_auc_score(target_valid, proba)
        best_depth = d

print('Depth   :', best_depth)
print('F1      :', best_f1)
print('AUC-ROC :', best_auc)

Depth   : 8
F1      : 0.5778834720570749
AUC-ROC : 0.8074254137263774


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
best_est = None
for est in range(10, 51, 10):
    for d in range(1, 11):
        model = RandomForestClassifier(max_depth=d, n_estimators=est, random_state=seed)
        model.fit(features_upsampled, target_upsampled)
        proba = model.predict_proba(features_valid)[:, 1]
        pred = proba > 0.5
        f1 = f1_score(target_valid, pred)
        if best_f1 < f1:
            best_f1 = f1
            best_auc = roc_auc_score(target_valid, proba)
            best_depth = d
            best_est = est

print('Depth      :', best_depth)
print('Estimators :', best_est)
print('F1         :', best_f1)
print('AUC-ROC    :', best_auc)

Depth      : 7
Estimators : 50
F1         : 0.6102502979737784
AUC-ROC    : 0.8631867761593485


### Downsampling

In [None]:
minor = target_train == 1
minor_count = minor.sum()

# Keep all minority rows and a random subset of the majority class of equal size.
# The same n and random_state keep feature rows and target values aligned.
features_downsampled = pd.concat(
    [features_train[minor],
     features_train[~minor].sample(minor_count, random_state=seed)],
    ignore_index=True
)
target_downsampled = pd.concat(
    [target_train[minor],
     target_train[~minor].sample(minor_count, random_state=seed)],
    ignore_index=True
)
features_downsampled, target_downsampled = shuffle(
    features_downsampled, target_downsampled, random_state=seed)

In [None]:
target_downsampled.count(), (target_downsampled==1).sum(), (target_downsampled==0).sum()

(2208, 1104, 1104)

In [None]:
model = LogisticRegression(solver='liblinear', random_state=seed)
model.fit(features_downsampled, target_downsampled)
proba = model.predict_proba(features_valid)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_valid, pred))
print('AUC-ROC :', roc_auc_score(target_valid, proba))

F1      : 0.510934393638171
AUC-ROC : 0.7760091650380754


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
for d in range(1, 21):
    model = DecisionTreeClassifier(max_depth=d, random_state=seed)
    model.fit(features_downsampled, target_downsampled)
    proba = model.predict_proba(features_valid)[:, 1]
    pred = proba > 0.5
    f1 = f1_score(target_valid, pred)
    if best_f1 < f1:
        best_f1 = f1
        best_auc = roc_auc_score(target_valid, proba)
        best_depth = d

print('Depth   :', best_depth)
print('F1      :', best_f1)
print('AUC-ROC :', best_auc)

Depth   : 5
F1      : 0.586046511627907
AUC-ROC : 0.8403261675315047


In [None]:
best_f1 = 0
best_auc = 0
best_depth = None
best_est = None
for est in range(10, 51, 10):
    for d in range(1, 11):
        model = RandomForestClassifier(max_depth=d, n_estimators=est, random_state=seed)
        model.fit(features_downsampled, target_downsampled)
        proba = model.predict_proba(features_valid)[:, 1]
        pred = proba > 0.5
        f1 = f1_score(target_valid, pred)
        if best_f1 < f1:
            best_f1 = f1
            best_auc = roc_auc_score(target_valid, proba)
            best_depth = d
            best_est = est

print('Depth      :', best_depth)
print('Estimators :', best_est)
print('F1         :', best_f1)
print('AUC-ROC    :', best_auc)

Depth      : 8
Estimators : 40
F1         : 0.5940594059405941
AUC-ROC    : 0.8586485419695205


### Summary

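The highest F1 on the validation set (0.611) was obtained by a random forest with `max_depth=7` and `n_estimators=20`, trained with class weighting.
Its AUC-ROC is about 0.86: a randomly chosen churned customer receives a higher predicted score than a randomly chosen retained one in about 86% of cases. This measures ranking quality across all thresholds; it is not the share of correct predictions.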
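
AUC-ROC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition in plain Python; the labels and scores are made up and `pairwise_auc` is a hypothetical helper:

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]               # made-up classes
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]  # made-up predicted probabilities
print(pairwise_auc(labels, scores))
```

Here 6 of the 9 positive-negative pairs are ordered correctly, so the value is about 0.67. The result depends only on the ordering of the scores, not on the 0.5 threshold, which is why AUC-ROC and F1 can move independently.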

## Testing the model

In [None]:
model = RandomForestClassifier(max_depth=7,
                               n_estimators=20,
                               random_state=seed,
                               class_weight='balanced')
model.fit(features_train, target_train)
proba = model.predict_proba(features_test)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_test, pred))
print('AUC-ROC :', roc_auc_score(target_test, proba))
print('Recall  :', recall_score(target_test, pred))

F1      : 0.6120591581342434
AUC-ROC : 0.8366733039396956
Recall  : 0.6810126582278481


In [None]:
# Baseline for comparison: always predict class 1 (churn).
model = DummyClassifier(strategy='constant', constant=1)
model.fit(features_train, target_train)
proba = model.predict_proba(features_test)[:, 1]
pred = proba > 0.5
print('F1      :', f1_score(target_test, pred))
print('AUC-ROC :', roc_auc_score(target_test, proba))

F1      : 0.3568202348690154
AUC-ROC : 0.5
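
The baseline F1 follows directly from the class balance. A model that always predicts churn has recall 1 and precision equal to the share of churned customers `p`, so F1 = 2p / (1 + p). A short sketch of that arithmetic; both helper names are hypothetical:

```python
def constant_f1(positive_share: float) -> float:
    """F1 of a model that labels every customer as churned."""
    precision = positive_share   # every prediction is positive
    recall = 1.0                 # no churner is missed
    return 2 * precision * recall / (precision + recall)

def share_from_f1(f1: float) -> float:
    """Invert the formula: p = F1 / (2 - F1)."""
    return f1 / (2 - f1)

# Baseline F1 printed above for the test set.
print(round(share_from_f1(0.3568202348690154), 3))  # 0.217
```

The baseline score of about 0.357 therefore corresponds to roughly 21.7% churned customers in the test set, close to the 20% seen in the training set.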


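The final model reaches an `F1` of 0.61 on the test set, above the required 0.59 and about 70% higher than the constant baseline (0.36). Its recall is 0.68, meaning it flags 68% of the customers who actually left. The `AUC-ROC` of 0.84 means that a randomly chosen churned customer is ranked above a randomly chosen retained one in about 84% of cases; the constant model scores 0.5, which is equivalent to random ranking.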