<div class="alert alert-block alert-info"><h1>Course project:<br>"Python Libraries for Data Science: Continuation"</h1></div>

## Task description

**Task**

Using the available data on bank customers, train a model on the training dataset that predicts default on the current loan, then generate predictions for the examples in the test dataset.

**Data files**

train.csv - training dataset<br>
test.csv - test dataset

**Target variable**

Credit Default - whether the customer defaulted on the loan

**Quality metric**

F1-score (sklearn.metrics.f1_score)

**Solution requirements**

*Target metric*
* F1 > 0.5
* The metric is computed on predictions for the positive class (1 - loan default)
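
As a reminder of what the F1 requirement measures, here is a minimal sketch that computes F1 for the positive class from confusion-matrix counts. The helper `f1_from_counts` and the counts are invented for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from confusion-matrix counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Invented counts: 60 defaults caught, 40 false alarms, 40 defaults missed
toy_f1 = f1_from_counts(tp=60, fp=40, fn=40)
meets_target = toy_f1 > 0.5
```

Because F1 is the harmonic mean of precision and recall, it stays low unless both are reasonably high.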

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years in the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - length of credit history in years
* **Maximum Open Credit** - maximum open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last delinquency
* **Bankruptcies** - bankruptcies
* **Purpose** - loan purpose
* **Term** - loan term
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Default** - loan default flag (0 - repaid on time, 1 - default)

## Importing libraries and configuring display settings

In [None]:
import datetime
import os
import pickle
import shap

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import xgboost as xgb
import lightgbm as lgbm

import sklearn
import catboost as ctb

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, precision_score, f1_score
from sklearn import metrics

from catboost import Pool, cv
from catboost.utils import get_roc_curve
from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve
from catboost.utils import select_threshold

In [None]:
matplotlib.rcParams.update({'font.size': 14})
# Use full option names: short forms such as 'max_columns' can be ambiguous in newer pandas
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 300)

## Defining the required functions

### Main function: handling missing values and outliers, generating new features

In [None]:
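def metamorphosis(raw_df):
    """Clean one raw frame and add engineered features.

    Relies on the notebook globals `raw_train_df` and `TARGET`,
    so it must be called after they are defined.
    """
    # Work on a copy so the raw DataFrame stays untouched
    df = raw_df.copy()

    # Constants: medians come from the frame being processed,
    # outlier thresholds always come from the training data
    cred_score_median = df['Credit Score'].median()
    ann_inc_median = df['Annual Income'].median()
    wired_loan_amount = np.percentile(raw_train_df['Current Loan Amount'], 85)
    wired_open_credit = np.percentile(raw_train_df['Maximum Open Credit'], 99)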
    
    
    # Converts types
    object_cols = df.select_dtypes(include='object').columns.tolist()
    float_cols = df.select_dtypes(include='float64').columns.tolist()

    df[object_cols] = df[object_cols].astype('category')
    df[float_cols] = df[float_cols].astype('float32')


    # Fill NaN section
    df['Annual Income'] = df['Annual Income'].fillna(ann_inc_median)
    df['Credit Score'] = df['Credit Score'].fillna(cred_score_median)
    df['Bankruptcies'] = df['Bankruptcies'].fillna(0)
    df['Years in current job'] = df['Years in current job'].fillna('< 1 year')
    
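    # Drop this column instead of imputing it
    df = df.drop(columns=['Months since last delinquent'])

    # Maximum Open Credit: cap extreme values at the 99th percentile of train
    df.loc[df['Maximum Open Credit'] > wired_open_credit * 2, 'Maximum Open Credit'] = wired_open_credit
    # Current Loan Amount: mark extreme values as missing
    df.loc[df['Current Loan Amount'] > wired_loan_amount * 2, 'Current Loan Amount'] = np.nan

    # Train only: drop the rows marked above and fix the target dtype
    if TARGET in df.columns:
        df.dropna(inplace=True)
        df[TARGET] = df[TARGET].astype('int8')

    # Any remaining marked values (test data) get the 85th percentile of train.
    # Assign back instead of calling fillna(inplace=True) on a column selection.
    df['Current Loan Amount'] = df['Current Loan Amount'].fillna(wired_loan_amount)
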
        
    df['Term'] = pd.Series(df['Term'].map({'Short Term': 0, 'Long Term': 1}), dtype=np.int8)
    # Years in current job
    job_years_dict = {'< 1 year': 0,
                      '1 year': 1,
                      '2 years': 2,
                      '3 years': 3,
                      '4 years': 4,
                      '5 years': 5,
                      '6 years': 6,
                      '7 years': 7,
                      '8 years': 8,
                      '9 years': 9,
                      '10+ years': 10}
    df['Years in current job'] = pd.Series(df['Years in current job'].map(job_years_dict), dtype=np.int8)
    
    df['No Tax Liens'] = (df['Tax Liens'] == 0)
    df['No Credit Problems'] = (df['Number of Credit Problems'] == 0)
    df['No Bankruptcies'] = (df['Bankruptcies'] == 0)

    # Credit score buckets: 0 means below 580, -1 marks 755 and above
    df['Credit Score Cat'] = 0
    df.loc[(df['Credit Score'] >= 580) & (df['Credit Score'] < 670), 'Credit Score Cat'] = 1
    df.loc[(df['Credit Score'] >= 670) & (df['Credit Score'] < 730), 'Credit Score Cat'] = 2
    df.loc[(df['Credit Score'] >= 730) & (df['Credit Score'] < 735), 'Credit Score Cat'] = 3
    df.loc[(df['Credit Score'] >= 735) & (df['Credit Score'] < 755), 'Credit Score Cat'] = 4
    # >= rather than >, so a score of exactly 755 does not fall back into bucket 0
    df.loc[df['Credit Score'] >= 755, 'Credit Score Cat'] = -1
    
    
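    # One-hot encode the remaining categorical columns; an explicit uint8 dtype
    # keeps the dummy columns consistent across pandas versions
    df = pd.get_dummies(df, drop_first=True, dtype='uint8')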
    
    renew_prop = 'Purpose_renewable energy'
    if renew_prop not in df.columns.tolist():
        df[renew_prop] = 0
    
    int_columns = [
               'Annual Income',
               'Tax Liens',
               'Number of Open Accounts',
               'Maximum Open Credit',
               'Number of Credit Problems',
               'Bankruptcies',
               'Current Credit Balance',
               'Monthly Debt',
               'Credit Score']
    
    df[int_columns] = df[int_columns].astype('int32')

    
    return df
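
Note that `metamorphosis` takes the medians from whichever frame it receives, while the percentile thresholds always come from `raw_train_df`. One possible alternative, sketched below and not what the notebook does, is to compute every constant on the training data once and reuse it for both frames. The helpers `fit_constants` and `apply_constants` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_constants(train: pd.DataFrame) -> dict:
    """Compute imputation and clipping constants from the training frame only."""
    return {
        "income_median": train["Annual Income"].median(),
        "score_median": train["Credit Score"].median(),
        "loan_p85": np.percentile(train["Current Loan Amount"], 85),
    }


def apply_constants(df: pd.DataFrame, consts: dict) -> pd.DataFrame:
    """Fill gaps in any frame (train or test) with the train-derived constants."""
    out = df.copy()
    out["Annual Income"] = out["Annual Income"].fillna(consts["income_median"])
    out["Credit Score"] = out["Credit Score"].fillna(consts["score_median"])
    return out


# Toy frames, invented for illustration only
toy_train = pd.DataFrame({
    "Annual Income": [100.0, 200.0, 300.0],
    "Credit Score": [700.0, 720.0, 740.0],
    "Current Loan Amount": [10.0, 20.0, 30.0],
})
toy_test = pd.DataFrame({
    "Annual Income": [np.nan, 500.0],
    "Credit Score": [650.0, np.nan],
    "Current Loan Amount": [15.0, 25.0],
})

consts = fit_constants(toy_train)
toy_test_filled = apply_constants(toy_test, consts)
```

With this split, the test frame is imputed with training statistics rather than its own.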

### Helper functions

In [None]:
def heatmap(df):
    corr = df.corr()

    plt.figure(figsize=(10, 10))

    ax = sns.heatmap(
        corr, 
        vmin=-1, vmax=1, center=0,
        cmap=sns.diverging_palette(10, 240, n=200),
        square=True
    )

    ax.set_xticklabels(
        ax.get_xticklabels(),
        rotation=45,
        horizontalalignment='right'
    )
    plt.show()

In [None]:
def get_cat_features(df):
    
    cat_features_list = []
    object_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
       
    for cf in object_cols:
        cat_features_list.append(df.columns.get_loc(cf))
    
    return cat_features_list

In [None]:
def get_classification_report(y_train_true, y_train_pred, y_test_true, y_test_pred):
    f1_test = f1_score(y_test_true, y_test_pred)
    precision_test = precision_score(y_test_true, y_test_pred)
    
    print('F1-score: ', f1_test)
    print('Precision: ', precision_test)
    print('TRAIN\n\n' + classification_report(y_train_true, y_train_pred))
    print('TEST\n\n' + classification_report(y_test_true, y_test_pred))
    print('Confusion Matrix\n')
    print(pd.crosstab(y_test_true, y_test_pred))
    
    return f1_test, precision_test

In [None]:
def show_feature_importances(feature_names, feature_importances, get_top=None):
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
    feature_importances = feature_importances.sort_values('importance', ascending=False)
       
    plt.figure(figsize = (10, len(feature_importances) * 0.5))
    
    # Pass x and y by keyword: recent seaborn versions reject positional data arguments
    sns.barplot(x=feature_importances['importance'], y=feature_importances['feature'])
    
    plt.xlabel('Importance')
    plt.title('Importance of features')
    plt.show()
    
    if get_top is not None:
        return feature_importances['feature'][:get_top].tolist()

## Data analysis

### Defining constants and loading the data

In [None]:
TRAIN_FILEPATH = 'train.csv'
TEST_FILEPATH = 'test.csv'
METRICS_FILEPATH = 'metrics.pkl'
TARGET = 'Credit Default'

In [None]:
raw_train_df = pd.read_csv(TRAIN_FILEPATH)
raw_test_df = pd.read_csv(TEST_FILEPATH)
raw_train_df.head().T

In [None]:
raw_train_df.describe().T

### Missing-value analysis

In [None]:
plt.figure(figsize=(12, 12))
colours = ['darkblue', 'white'] 
sns.heatmap(raw_train_df.isnull(), cmap=sns.color_palette(colours))

In [None]:
for col in raw_train_df.columns:
    pct_missing = np.mean(raw_train_df[col].isna())
    if pct_missing:
        print('{} - {}%'.format(col, round(pct_missing*100)))
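
The same per-column share of missing values can be obtained without an explicit loop. The sketch below uses a small invented frame in place of `raw_train_df`; the values carry no meaning.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_train_df (values invented)
toy_df = pd.DataFrame({
    "Annual Income": [1.0, np.nan, 3.0, np.nan],
    "Credit Score": [700.0, 710.0, np.nan, 730.0],
    "Term": ["Short Term", "Long Term", "Short Term", "Long Term"],
})

# The mean of a boolean mask is the fraction of missing entries per column
missing_pct = toy_df.isna().mean().mul(100).round(1)
# Keep only columns that have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
```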

### Checking the shape of the loaded data

In [None]:
raw_train_df.shape, raw_test_df.shape

In [None]:
raw_train_df[TARGET].value_counts()

## Handling outliers and missing values, generating new features

### Applying the main preprocessing function and inspecting feature correlations

In [None]:
train_df = metamorphosis(raw_train_df)
test_df = metamorphosis(raw_test_df)

heatmap(train_df.select_dtypes(exclude='uint8'))

### Measuring how strongly each feature correlates with the target class

In [None]:
sns.set(font_scale=1)
# Select the target by name: after get_dummies it is no longer the last column
corr_with_TARGET = train_df.corr()[TARGET].drop(TARGET).sort_values(ascending=False)
plt.figure(figsize=(9, 9))
sns.barplot(x=corr_with_TARGET.values, y=corr_with_TARGET.index)
plt.title('Correlation with TARGET variable')

### Loan amount versus annual income, colored by target class

In [None]:
# DataFrame.plot creates its own figure, so a separate plt.figure call would leave an empty one
train_df.plot(kind="scatter", x="Annual Income", y="Current Loan Amount", alpha=0.4,
    c="Credit Default", cmap=plt.get_cmap("jet"), colorbar=True, figsize=(12,12),
    sharex=False)
plt.grid(True)
plt.show()

### Distribution of the loan amount

In [None]:
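# The seaborn styles were renamed in matplotlib 3.6; fall back to the old name if needed
bright_style = 'seaborn-v0_8-bright' if 'seaborn-v0_8-bright' in plt.style.available else 'seaborn-bright'
plt.style.use(bright_style)
plt.figure(figsize=(12,12))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(train_df['Current Loan Amount'], kde=True, stat='density')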
plt.title('Distribution of Current Loan Amount')
plt.grid(True)

### Credit score category: a new feature

In [None]:
pd.crosstab(train_df['Credit Default'], train_df['Credit Score Cat'])
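
The `Credit Score Cat` buckets built with chained `df.loc` assignments in `metamorphosis` can also be expressed with `pd.cut`. This is only a sketch on invented scores; it assumes half-open intervals `[low, high)` and treats 755 and above as the `-1` bucket.

```python
import numpy as np
import pandas as pd

# Invented scores that sit on and around each bucket edge
scores = pd.Series([500, 580, 669, 670, 730, 734, 735, 754, 755, 7000])

edges = [-np.inf, 580, 670, 730, 735, 755, np.inf]
labels = [0, 1, 2, 3, 4, -1]

# right=False makes every interval closed on the left and open on the right
score_cat = pd.cut(scores, bins=edges, right=False, labels=labels).astype("int8")
```

Keeping the edges in one list makes gaps or overlaps between buckets easier to spot than in a sequence of separate conditions.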

## Building the model with CatBoost

### Selecting and validating the model on the training set

In [None]:
X = train_df.drop(TARGET, axis=1)
y = train_df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, shuffle=True, random_state=42)

cat_features = get_cat_features(X)
cat_features

In [None]:
y.value_counts()[0] / y.value_counts()[1]

In [None]:
y.value_counts()
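
The class ratio computed above is a natural starting point for the `class_weights` parameter. The sketch below derives a weight pair from an invented label vector; the notebook's actual weights were chosen by the author and are not reproduced here.

```python
import pandas as pd

# Invented labels: 72 non-defaults and 28 defaults
toy_y = pd.Series([0] * 72 + [1] * 28)

counts = toy_y.value_counts()
# How many majority-class rows there are per minority-class row
ratio = float(counts.loc[0] / counts.loc[1])

# Weight 1 for class 0, the imbalance ratio for class 1
class_weights = [1.0, round(ratio, 2)]
```

A weight near the imbalance ratio makes both classes contribute roughly equally to the loss; values below or above it trade recall against precision.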

In [None]:
frozen_params = {'silent': True,
                 'random_state': 42,
                 'n_estimators': 1200,
                 'eval_metric': 'F1',
                 'custom_metric': 'Precision',
                 'learning_rate': 0.01,
                 'class_weights': [1, 2.2],
                 'early_stopping_rounds': 800
                }

In [None]:
cat_model = ctb.CatBoostClassifier(**frozen_params)

cat_model.fit(X_train, y_train, cat_features, eval_set=(X_test, y_test), plot=True)

### Checking the metrics

In [None]:
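# Load the metrics saved by the previous run, if any
if os.path.exists(METRICS_FILEPATH):
    with open(METRICS_FILEPATH, 'rb') as file:
        pre_metrics = pickle.load(file)
else:
    pre_metrics = 'Previous metrics are not defined'

y_train_pred = cat_model.predict(X_train)
y_test_pred = cat_model.predict(X_test)

# Named current_metrics so the imported sklearn `metrics` module is not shadowed
current_metrics = get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

with open(METRICS_FILEPATH, 'wb') as file:
    pickle.dump(current_metrics, file)

# Show the previous metrics, or a note when nothing has changed since the last run
if pre_metrics == current_metrics:
    print("\nWe've been here before, haven't we?")
else:
    print('\n', pre_metrics)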

### Ranking features by importance

In [None]:
important_features_top = show_feature_importances(X.columns, cat_model.feature_importances_, get_top=15)

### Making predictions on the test set

In [None]:
final_params = {'silent': True,
                 'random_state': 42,
                 'n_estimators': 800,
                 'eval_metric': 'F1',
                 'custom_metric': 'Precision',
                 'learning_rate': 0.01,
                 'class_weights': [1, 2.6],
                }

In [None]:
model = ctb.CatBoostClassifier(**final_params)

model.fit(X, y, cat_features)

### Saving the results

In [None]:
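# Align test columns with the training features: metamorphosis appends
# 'Purpose_renewable energy' at the end when the test data lacks it
test_predictions = model.predict(test_df.reindex(columns=X.columns, fill_value=0))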

submit = pd.read_csv('/kaggle/input/credit-default-prediction-ai-big-data/sampleSubmission.csv')
submit['Credit Default'] = test_predictions
submit.to_csv('submission.csv', index=False)
submit.head()
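
When train and test are one-hot encoded separately, a category that appears only in train produces a column the test frame lacks, which is why `metamorphosis` adds `Purpose_renewable energy` by hand. A more general way to bring the test columns into the training layout is `reindex`; the sketch below uses invented toy frames.

```python
import pandas as pd

# Toy frames: 'renewable energy' occurs only in train (values invented)
train_raw = pd.DataFrame({
    "Purpose": ["business loan", "renewable energy", "other"],
    "Term": [0, 1, 0],
})
test_raw = pd.DataFrame({
    "Purpose": ["other", "business loan"],
    "Term": [1, 0],
})

train_enc = pd.get_dummies(train_raw, dtype="uint8")
test_enc = pd.get_dummies(test_raw, dtype="uint8")

# Same columns in the same order as train; dummies missing from test become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

The `fill_value=0` default is only appropriate for dummy columns, so it is worth checking which columns were actually missing before relying on it.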

## Plotting the curves

In [None]:
pool1 = Pool(data=X_train, label=y_train, cat_features=cat_features)
eval_pool = Pool(X_test, y_test, cat_features=cat_features)
curve = get_roc_curve(cat_model, eval_pool)
(fpr, tpr, thresholds) = curve
roc_auc = sklearn.metrics.auc(fpr, tpr)
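
`sklearn.metrics.auc` integrates the curve with the trapezoidal rule. The sketch below reproduces that calculation by hand on an invented four-segment ROC curve, so the numbers can be checked on paper.

```python
import numpy as np

# Invented ROC points, ordered by increasing false positive rate
toy_fpr = np.array([0.0, 0.0, 0.5, 0.5, 1.0])
toy_tpr = np.array([0.0, 0.5, 0.5, 1.0, 1.0])

# Trapezoidal rule: segment width times the mean height of its two endpoints
widths = np.diff(toy_fpr)
mean_heights = (toy_tpr[1:] + toy_tpr[:-1]) / 2
toy_auc = float(np.sum(widths * mean_heights))
```

Vertical segments have zero width and add nothing, so the area here comes from the two horizontal steps.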

In [None]:
(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)

plt.figure(figsize=(16, 8))
lw = 2

plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc, alpha=0.5)

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', alpha=0.5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('Receiver operating characteristic', fontsize=20)
plt.legend(loc="lower right", fontsize=16)

In [None]:
plt.figure(figsize=(16, 8))
lw = 2

plt.plot(thresholds, fpr, color='blue', lw=lw, label='FPR', alpha=0.5)
plt.plot(thresholds, fnr, color='green', lw=lw, label='FNR', alpha=0.5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('Threshold', fontsize=16)
plt.ylabel('Error Rate', fontsize=16)
plt.title('FPR-FNR curves', fontsize=20)
plt.legend(loc="lower left", fontsize=16)
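
The FPR and FNR curves cross at the threshold where both error rates are equal, which is one common way to pick a decision threshold. The sketch below finds that point on invented labels and probabilities with plain NumPy; the helper `fpr_fnr_at` is hypothetical, and the imported but unused `select_threshold` from CatBoost is not called here.

```python
import numpy as np


def fpr_fnr_at(y_true, proba, threshold):
    """Return (FPR, FNR) when predicting class 1 for proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    false_pos = np.sum(pred & (y_true == 0))
    false_neg = np.sum(~pred & (y_true == 1))
    return float(false_pos / np.sum(y_true == 0)), float(false_neg / np.sum(y_true == 1))


# Invented labels and predicted probabilities
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
toy_proba = [0.12, 0.22, 0.42, 0.62, 0.32, 0.52, 0.72, 0.92]

# Scan a grid of thresholds and keep the first one where FPR and FNR are closest
grid = np.linspace(0.05, 0.95, 19)
rates = [fpr_fnr_at(toy_y, toy_proba, t) for t in grid]
best_idx = int(np.argmin([abs(fpr - fnr) for fpr, fnr in rates]))
balanced_threshold = float(grid[best_idx])
```

Whether equal error rates are the right target depends on the relative cost of missed defaults and false alarms; for an F1 objective, the threshold that maximizes F1 on validation data may differ.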

In [None]:
shap_values = cat_model.get_feature_importance(pool1, type='ShapValues')

expected_value = shap_values[0,-1]
shap_values = shap_values[:,:-1]

print(shap_values.shape)

shap.initjs()
shap.force_plot(expected_value, shap_values[1,:], X_train.iloc[1,:])

In [None]:
shap.summary_plot(shap_values, X_train)
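
The slicing used above relies on the layout of the array returned for `type='ShapValues'`: one row per object, one column per feature, and the expected value in the last column. The sketch below mimics that layout with invented numbers to show how the contributions add up to a raw score. It assumes a binary Logloss model, where the raw score is a log-odds value.

```python
import numpy as np

# Hypothetical output for 2 objects and 3 features; all numbers are invented.
# The last column holds the expected value and is the same in every row.
raw = np.array([
    [0.20, -0.10, 0.05, -0.90],
    [-0.30, 0.40, 0.10, -0.90],
])

expected_value = raw[0, -1]
contributions = raw[:, :-1]

# Per-feature contributions plus the expected value give the raw model score
raw_scores = contributions.sum(axis=1) + expected_value
# Under the Logloss assumption, the sigmoid turns log-odds into a probability
proba = 1.0 / (1.0 + np.exp(-raw_scores))
```

Positive contributions push a prediction toward default and negative ones away from it, which is what the force plot visualizes for a single object.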

## Conclusions
Overall, the resulting model is balanced: its Precision, Recall, and F1 are roughly equal. Fighting overfitting with regularization led to worse results and, in my view, is not needed. Resampling to balance the classes is not needed either, because class weights can be set directly in the classifier.