**Author:** 
<br>
Muhammad Insan Aprilian (insanaprilian50@gmail.com)
<br>
<br>

<span style="font-size:16pt;font-weight:bold">Problem Statement</span>
<br>


You are being asked to create a credit scoring model for a lending company. You are given a file with historical data. 
1.	The upper management wants the overall default rate of their portfolio to be below 2.5%, please provide recommendation on the optimal credit score cutoff rate. 
2.	Please create a credit score for each individual, validate your solution, and provide guidance on the next steps. 
3.	Please create deciles by credit score and provide risk and default levels by deciles (by decile and cumulative). Bonus if you can provide confidence (or methodology how you would do it) for your scores/default rates by bin. 


<span style="font-size:16pt;font-weight:bold">The Data Science Workflow</span>
**<p>1.  Import Packages</p>**

**<p>2.  Import Data</p>**
<p>&nbsp; &nbsp;     2.1.  Metadata Definition</p>

**<p>3.  Data Exploration</p>**
<p>&nbsp; &nbsp;     3.1.  Missing Value Check</p>
<p>&nbsp; &nbsp;     3.2.  Outlier Check (IQR based)</p>
<p>&nbsp; &nbsp;     3.3.  Predictor Distribution to Target</p>
<p>&nbsp; &nbsp;     3.4.  Data Split</p>
<p>&nbsp; &nbsp;     3.5.  Data Transformation</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     3.5.1.  Categorical - Woe Encoder</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     3.5.2.  Numerical - Missing Imputation</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     3.5.3.  Numerical - Standardization</p>

**<p>4.  Predictor Selection</p>**    
<p>&nbsp; &nbsp;     4.1.  Predictive power comparison</p>
<p>&nbsp; &nbsp;     4.2.  Correlations</p>

**<p>5.  Modeling</p>**
<p>&nbsp; &nbsp;     5.1.  Logistic Regression Session</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.1.1.  Tuning parameter</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.1.2.  Train the model</p>
<p>&nbsp; &nbsp;     5.2.  XGBoost Session</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.2.1.  Train initial model</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.2.2.  Evaluate predictor</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;     5.2.2.1.  Weight of each predictor</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;     5.2.2.2.  Gain of each predictor</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;     5.2.2.3.  Selected Predictor</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.2.3.  Tuning parameter</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.2.4.  Final Model</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp;     5.2.5.  Evaluate final model</p>
<p>&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;     5.2.5.1.  Gain of each predictor</p>

**<p>6.  Score the dataset</p>**

**<p>7.  Performance characteristics</p>**
<p>&nbsp; &nbsp;     7.1.  Performance per sample</p>
<p>&nbsp; &nbsp;     7.2.  ROC Curve</p>
<p>&nbsp; &nbsp;     7.3.  Score Linearity on Holdout Sample</p>
<p>&nbsp; &nbsp;     7.4.  Cut-Off Estimation</p>

**<p>8.  Conclusion</p>**

# Import Packages

- `time`, `datetime` - getting the current time for logs
- `math` - basic mathematical functions (logarithm, etc.)
- `numpy` - mathematical and numerical calculations
- `scipy` - metric evaluation calculations
- `pandas` - working with large data structures
- `scikit-learn` - the machine learning (and statistical) algorithms used for training the models
- `matplotlib` - plotting charts
- `seaborn` - statistical visualisations
- `xgboost` - gradient boosting used for training the models
- `category_encoders` - categorical-type transformations
- `ia_pkg` - helper functions used throughout this notebook

In [None]:
!pip install category_encoders

In [None]:
import pandas as pd
import numpy as np
import time
import datetime

import ia_pkg

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML, Markdown

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Checking used library version
ia_pkg.pkg_version()

# Import Data

In [None]:
#Import data from CSV
data = pd.read_csv('credit_ds_v3.csv', index_col='X')
print('Data loaded on', datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
#Running DataFrame optimizer to reduce memory usage
from ia_pkg.function import optimizer
data = optimizer(data)

In [None]:
#Remove rows with duplicated index
data=data[~data.index.duplicated(keep='first')]

In [None]:
print('Number of rows:',data.shape[0])
print('Number of columns:',data.shape[1])

In [None]:
data.head()

## Metadata Definition

In [None]:
col_target = 'default_flag'

cols_pred = list(data.drop(col_target,axis=1).columns)

cols_pred_num = list(data[cols_pred].select_dtypes(include=np.number).columns)
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
cols_pred_cat = list(data[cols_pred].select_dtypes(include='object').columns)

print('List of numerical predictors:', len(cols_pred_num),'\n\n', data[cols_pred_num].dtypes)
print('\nList of categorical predictors: ', len(cols_pred_cat), '\n\n', data[cols_pred_cat].dtypes)

# Data Exploration

A statistical summary supports a preliminary investigation of the dataset.

In [None]:
data.describe(include='all').T

## Missing Value Check

In [None]:
# Investigate columns with null values
missingCol = data.isnull().sum()
print("There are", len(missingCol[missingCol != 0]),"columns with missing value")

In [None]:
# Compute the missing rate of every column that contains nulls
missingRate = []
for col in cols_pred:
    if data[col].isnull().any():
        missingRate.append({'Predictor' : col,
                       'Missing rate' : data[col].isnull().sum() / data.shape[0]})
pd.DataFrame(missingRate).set_index('Predictor').sort_values('Missing rate',ascending=False)

## Outlier Check (IQR based)

In [None]:
from ia_pkg.function import cnt_outliers, replace_with_thresholds

# Check number of 1.5 IQR based outlier
cnt_outliers(data,cols_pred_num,plot=True)

Every numerical predictor appears to contain outliers, which indicates high variability in the dataset.
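
The `cnt_outliers` helper comes from the private `ia_pkg` package, so its implementation is not visible here. As a minimal sketch of the usual 1.5 × IQR rule it is named after, the hypothetical function `count_iqr_outliers` below counts the values that fall outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` for each column; the function name, its arguments, and the demo data are assumptions for illustration only.

```python
import pandas as pd

def count_iqr_outliers(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column (hypothetical helper)."""
    counts = {}
    for col in columns:
        s = df[col].dropna()                      # ignore missing values
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((s < lower) | (s > upper)).sum())
    return pd.Series(counts, name='n_outliers')

# Toy data: column 'x' has one extreme value, column 'y' has none
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100],
                     'y': [10, 11, 12, 13, 14, 15]})
outlier_counts = count_iqr_outliers(demo, ['x', 'y'])
print(outlier_counts)
```

The multiplier `k` is exposed as a parameter because 1.5 is a convention rather than a fixed rule; a larger value flags only the more extreme observations.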

## Predictor Distribution to Target

**Categorical predictor**

In [None]:
from ia_pkg.plots import stacked_plot, dist_plot
stacked_plot(data,
            cat_columns=cols_pred_cat,
            col_target=col_target)

From the graph, the *'E'* value of **branch_code** looks riskier than the other values, by roughly a factor of two (9% on the population versus 17%).

**Numerical Predictor**

In [None]:
dist_plot(data,
            columns=cols_pred_num,
            col_target=col_target)

From the graphs, several predictors clearly show good potential for the model. The assessment is based on how well each predictor separates the behavior of defaulting and non-defaulting users.

Predictors carrying information such as **delinquency score**, **utilization**, **remaining bill** and **overlimit** are assumed to be good predictors of default. Further investigation (correlation check, etc.) is still needed.

## Data Split

- Split the data into three parts (train, valid, test)
- Add a new column indicating which part each observation belongs to
- The split is random
- The random seed is set so the results are reproducible
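
The steps above can be sketched without `ia_pkg`, whose `data_split` implementation is not shown in this notebook. The hypothetical function `random_split_labels` below shuffles the row positions with a seeded generator and assigns sample names by cumulative proportion; the name, signature, and rounding behavior are assumptions, not the package's actual logic.

```python
import numpy as np
import pandas as pd

def random_split_labels(index, sizes=(0.8, 0.1, 0.1),
                        names=('train', 'test', 'valid'), seed=42):
    """Return a Series of sample names, one per row (hypothetical helper)."""
    rng = np.random.default_rng(seed)             # seeded so the split is reproducible
    n = len(index)
    shuffled = rng.permutation(n)                 # random order of row positions
    bounds = np.round(np.cumsum(sizes) * n).astype(int)
    bounds[-1] = n                                # make sure every row gets a label
    labels = np.empty(n, dtype=object)
    start = 0
    for name, end in zip(names, bounds):
        labels[shuffled[start:end]] = name
        start = end
    return pd.Series(labels, index=index, name='data_type')

split_demo = random_split_labels(pd.RangeIndex(1000))
split_counts = split_demo.value_counts()
print(split_counts)
```

Because the labels come from a permutation rather than independent draws, the sample sizes match the requested proportions exactly (up to rounding).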

In [None]:
from ia_pkg.function import data_split
data['data_type'] = data_split(data,
                               sample_sizes=[0.8,0.1,0.1],
                               sample_names=['train','test','valid'],
                               seed=42)

In [None]:
# Boolean masks for each sample
train_mask = (data['data_type'] == 'train')
valid_mask = (data['data_type'] == 'valid')
test_mask = (data['data_type'] == 'test')

In [None]:
data_summary = data.groupby(['data_type']).aggregate({col_target:['sum','count']})
data_summary.columns = [col_target, 'rows']
data_summary[col_target+' rate'] = data_summary[col_target] / data_summary['rows']

display(data_summary)

## Data Transformation

### Categorical - Woe Encoder

The WoE method was chosen to transform the string-type categorical predictors into numeric form. WoE estimates a weight for each unique value of a predictor based on its ability to separate the target (in this case default / not default).

WoE also handles null values flexibly, since they can be grouped into a 'special segment', so no imputation is needed in this case.
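
To make the idea concrete, here is a minimal sketch of a WoE calculation. It is not the `woe_transform` implementation from `ia_pkg`, which is not shown. It assumes the convention WoE = ln(share of non-defaults / share of defaults) per category, so riskier categories get a lower WoE; the opposite sign convention is also common. The function name `woe_by_category` and the `smoothing` constant (added to avoid division by zero) are assumptions.

```python
import numpy as np
import pandas as pd

def woe_by_category(cat, target, smoothing=0.5):
    """WoE per category; nulls are grouped into their own 'Null' segment (hypothetical helper)."""
    cat = pd.Series(cat, dtype=object).fillna('Null').reset_index(drop=True)
    y = pd.Series(target).reset_index(drop=True)
    stats = pd.DataFrame({'cat': cat, 'y': y}).groupby('cat')['y'].agg(['sum', 'count'])
    bad = stats['sum'] + smoothing                       # defaults per category
    good = stats['count'] - stats['sum'] + smoothing     # non-defaults per category
    # Assumed convention: ln(distribution of goods / distribution of bads)
    return np.log((good / good.sum()) / (bad / bad.sum()))

cat_demo = ['A'] * 4 + ['B'] * 4 + [None] * 2
y_demo = [0, 0, 0, 1,  0, 1, 1, 1,  0, 1]
woe_demo = woe_by_category(cat_demo, y_demo)
print(woe_demo)
```

In practice the mapping would be fitted on the training sample only and then applied to all samples, which is what the `mask=train_mask` argument in the call below suggests.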

In [None]:
from ia_pkg.function import woe_transform
#fit and transform WoE on categorical predictor
data_woe = woe_transform(data,
                         mask=train_mask,
                         cat_columns=cols_pred_cat,
                         col_target=col_target)

In [None]:
#Stored the WoE output on cols_woe
data_woe.columns = [i + '_woe' for i in data_woe.columns]
cols_woe = list(data_woe.columns)

data[cols_woe] = data_woe

In [None]:
woe_change = []
# List the transformation result for each unique value of every categorical predictor
for col,col_woe in zip(cols_pred_cat,cols_woe):
    woe_change.append(data[[col,col_woe,col_target]].fillna('Null').groupby([col,col_woe]).agg(
        {col_woe: ['count'],
         col_target : ['sum','mean']}))

# Show the table for the first categorical predictor (branch_code)
woe_change[0].columns = ['branch_code_woe count',
                         'default_flag count',
                         'default_flag rate']
pd.DataFrame(woe_change[0])

### Numerical - Missing Imputation

Missing values are imputed by filling each predictor with its mean.

In [None]:
cols_num_missing = data[cols_pred_num].columns[data[cols_pred_num].isnull().any()].tolist()
#filling the missing value with mean
for c in cols_num_missing:
    mean = data[c].mean()
    data[c+'_imp'] = data[c].fillna(mean,axis=0)

### Numerical - Standardization

Numerical predictors are scaled to limit the effect that outliers and differences in magnitude have on the model. Standardization is one such scaling method: it centers each predictor at mean 0 and scales it to variance 1.

An advantage of this method is that it keeps the shape of the predictor's original distribution.
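
As a small worked example of the transformation, the sketch below standardizes a toy array by hand, using the mean and the population standard deviation (`ddof=0`, the same statistic `StandardScaler` uses) computed on the training part only. The arrays `train_x` and `holdout_x` are made-up illustration data.

```python
import numpy as np

train_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # toy training values
holdout_x = np.array([3.0, 10.0])                # toy holdout values, one of them extreme

# Fit the statistics on the training sample only
mu = train_x.mean()
sigma = train_x.std()                            # population std (ddof=0)

# Apply the same shift and scale to every sample
z_train = (train_x - mu) / sigma
z_holdout = (holdout_x - mu) / sigma

print(z_train)
print(z_holdout)
```

The extreme holdout value stays extreme after scaling: standardization changes location and scale only, so the relative spacing of the observations, and with it the shape of the distribution, is preserved.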

In [None]:
from sklearn.preprocessing import StandardScaler
#listed the imputation and non-imputation predictor for scaling
cols_pred_num2 = list(map(lambda x: x+'_imp' if x in cols_num_missing else x, cols_pred_num))       

scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit(data[train_mask][cols_pred_num2])
# print(scaler.mean_)
data_sd = scaler.transform(data[train_mask|valid_mask|test_mask][cols_pred_num2])

In [None]:
# stored the standardscaler output on cols_sd
cols_sd = [i+'_sd' for i in cols_pred_num]

data[cols_sd] = data_sd
data[cols_sd].head()

### Collect all the transformed predictors

In [None]:
cols_shortlist = []

for c in cols_sd:
    cols_shortlist.append(c)
for c in cols_woe:
    cols_shortlist.append(c)

display(cols_shortlist)

# Predictor Selection
<br>

The best predictors for the model are selected from all transformed predictors. The selection metrics are **Gini** and **IV** (predictive power) and **inter-predictor correlation**.

## Predictive power comparison

Calculates the IV and Gini of each predictor and sorts the predictors by their power. The power is calculated for each of the samples (train, valid, test).
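
The `gini` function is imported from `ia_pkg.metrics` and its source is not shown. A minimal sketch, assuming it follows the usual credit-scoring definition Gini = 2 × AUC − 1, is given below. The AUC is computed from ranks (the Mann-Whitney formulation), so only `numpy` and `pandas` are needed; the name `gini_from_scores` is hypothetical.

```python
import numpy as np
import pandas as pd

def gini_from_scores(y_true, score):
    """Gini = 2*AUC - 1, with AUC computed from average ranks (hypothetical helper)."""
    y = np.asarray(y_true)
    ranks = pd.Series(np.asarray(score)).rank(method='average').to_numpy()
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    # Mann-Whitney U statistic of the positives, normalised to [0, 1]
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1

y_demo = [0, 0, 1, 1]
gini_perfect = gini_from_scores(y_demo, [0.1, 0.2, 0.8, 0.9])
gini_partial = gini_from_scores(y_demo, [0.1, 0.8, 0.2, 0.9])
gini_reversed = gini_from_scores(y_demo, [0.9, 0.8, 0.2, 0.1])
print(gini_perfect, gini_partial, gini_reversed)
```

A predictor that ranks defaults in the opposite direction yields a negative Gini, which is why the cell below takes the absolute value before sorting.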

In [None]:
from ia_pkg.metrics import iv,gini

power_tab = []
for j in range(0,len(cols_shortlist)):
    power_tab.append({'Name':cols_shortlist[j]
                    ,'Gini Train':gini(data.loc[train_mask,col_target],data.loc[train_mask,cols_shortlist[j]])                    
                    ,'Gini Validate':gini(data.loc[valid_mask,col_target],data.loc[valid_mask,cols_shortlist[j]])
                    ,'Gini Test':gini(data.loc[test_mask,col_target],data.loc[test_mask,cols_shortlist[j]])
                    ,'IV Train':iv(data.loc[train_mask,col_target],data.loc[train_mask,cols_shortlist[j]])
                    ,'IV Validate':iv(data.loc[valid_mask,col_target],data.loc[valid_mask,cols_shortlist[j]])
                    ,'IV Test':iv(data.loc[test_mask,col_target],data.loc[test_mask,cols_shortlist[j]])     
                     })
power_out = pd.DataFrame.from_records(power_tab)
power_out = power_out.set_index('Name').abs()
power_out = power_out.sort_values('Gini Validate',ascending=False)

pd.options.display.max_rows = 1000
display(power_out)
pd.options.display.max_rows = 30

## Correlations

Show the correlation matrix of all predictors.

In [None]:
cormat = data[sorted(cols_shortlist)].corr()

plt.rcParams.update({'font.size': 15})
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True

fig, ax = plt.subplots(figsize=(12,10), dpi=50)
fig.suptitle('Correlations of Variables',fontsize=25)
sns.heatmap(cormat, ax=ax, annot=True, fmt="0.1f", linewidths=.5, annot_kws={"size":15},cmap="OrRd")
plt.tick_params(labelsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

plt.show()
plt.clf();plt.close()

In [None]:
max_ok_correlation = 0.5

# find the highest pairwise correlations (greater than max_ok_correlation in absolute value)
hicors = []
for i in range(0,len(cormat)):
    for j in range(i+1,len(cormat)):
        c = cormat.iloc[i, j]  # positional lookup without chained indexing
        if abs(c) > max_ok_correlation:
            hicors.append((cormat.index[i],cormat.index[j],c))
hicors.sort(key= lambda x: abs(x[2]), reverse=True)

hicors2 = pd.DataFrame(hicors, columns = ['predictor_1', 'predictor_2', 'corr'])

# print list of highest correlations
hicors2

Combining the outputs of these selection methods, we keep the predictors with the highest individual predictive power and eliminate those that both rank at the bottom (low Gini) and are correlated with another predictor (|corr| > 0.5).

This new predictor set is expected to prevent the low-quality and multicollinearity issues that may otherwise occur in the model (e.g. Logistic Regression).
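
The selection in this notebook was made by hand from the two tables above. One possible way to automate a similar rule is sketched below: walk through the predictors from strongest to weakest and keep a predictor only if its absolute correlation with every predictor already kept stays within the threshold. The function `drop_correlated` and the toy inputs are assumptions for illustration and will not necessarily reproduce the manual shortlist.

```python
import pandas as pd

def drop_correlated(corr, power, max_corr=0.5):
    """Greedy selection: strongest predictors first, skip any that correlate
    too strongly with one already kept (hypothetical helper)."""
    kept = []
    for name in power.sort_values(ascending=False).index:
        if all(abs(corr.loc[name, k]) <= max_corr for k in kept):
            kept.append(name)
    return kept

# Toy inputs: 'a' and 'b' are highly correlated, 'a' has the higher Gini
corr_demo = pd.DataFrame([[1.0, 0.9, 0.1],
                          [0.9, 1.0, 0.2],
                          [0.1, 0.2, 1.0]],
                         index=list('abc'), columns=list('abc'))
power_demo = pd.Series({'a': 0.4, 'b': 0.3, 'c': 0.1})
kept_demo = drop_correlated(corr_demo, power_demo)
print(kept_demo)
```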

# Modeling

Two methods (cross-validated Logistic Regression and XGBoost) are fitted on the training data set, each with a different set of predictors.

For Logistic Regression, we use the transformed (*WoE* and *imputation-standardization*) and selected (*individual Gini* and *correlation-based*) predictors, called **pred_lr**.

For XGBoost, we use the WoE-transformed categorical predictors and the untransformed numerical predictors. XGBoost's decision trees are robust to outliers and null values, so no numerical transformation is applied. This set is called **pred_xgb**.

## Logistic Regression Session

Selected predictor for Logistic Regression

In [None]:
cols_shortlist2 = ['number_of_cards_sd',
#  'outstanding_sd',
 'credit_limit_sd',
#  'bill_sd',
#  'total_cash_usage_sd',
 'total_retail_usage_sd',
#  'remaining_bill_sd',
 'payment_ratio_sd',
#  'overlimit_percentage_sd',
 'payment_ratio_3month_sd',
 'payment_ratio_6month_sd',
 'delinquency_score_sd',
#  'years_since_card_issuing_sd',
#  'total_usage_sd',
#  'remaining_bill_per_number_of_cards_sd',
 'remaining_bill_per_limit_sd',
 'total_usage_per_limit_sd',
 'total_3mo_usage_per_limit_sd',
#  'total_6mo_usage_per_limit_sd',
#  'utilization_3month_sd',
#  'utilization_6month_sd',
 'branch_code_woe']

In [None]:
pred_lr = cols_shortlist2

### Tuning parameter

Tuning identifies which regularization parameter (C) is best for the model, judged by its ability to balance bias and variance.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
#Grid Search
logreg = LogisticRegression(class_weight='balanced',penalty='l2')

param = {'C':[0.001,0.005,0.01,0.05,0.1,0.5,1,5,10,50,100,500,1000,5000,10000]}
gs = GridSearchCV(logreg,param,scoring='roc_auc',refit=True,cv=5)
gs.fit(data[train_mask|valid_mask][pred_lr],data[train_mask|valid_mask][col_target])
print('Best roc_auc: {:.4}, with best C: {}'.format(gs.best_score_, gs.best_params_))

### Train the model

In [None]:
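from sklearn.linear_model import LogisticRegression

# Start from the C selected by the ROC-AUC grid search above
logreg = LogisticRegression(C=gs.best_params_.get('C'))

# This second grid search re-tunes C over the same grid with the default scoring
# and without class_weight='balanced'; its best_estimator_ is the model used below
gs = GridSearchCV(logreg,param,refit=True,cv=5)
gs.fit(data[train_mask|valid_mask][pred_lr],data[train_mask|valid_mask][col_target])
# logreg.fit(data[train_mask|valid_mask][pred_lr],data[train_mask|valid_mask][col_target])

# Predicted probability of default for every row
lr_scored = gs.predict_proba(data[pred_lr])[:,1]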

Display the model's coefficients and intercept

In [None]:
o = []
o.append('|LR MODEL | COEFFICIENTS |\n| --- | --- |')
o.append('| Intercept: | {} |'.format(gs.best_estimator_.intercept_[0]))
for p,b in zip(pred_lr,list(gs.best_estimator_.coef_[0])):
    o.append('| {} | {} |'.format(p,b))
display(Markdown('\n'.join(o)))

Coefficient magnitude indicates each predictor's contribution to the model.

Since the predictors were scaled with the same method, we can say that **delinquency_score** *(B=.676)* and **total_usage_per_limit** *(B=-.977)* are the biggest contributors to the model. **delinquency_score** has a positive and **total_usage_per_limit** a negative relationship with default.

## XGBoost Session

Requires the xgboost library to be installed.

First we train a gradient boosting model using a "standard" set of hyperparameters.

In [None]:
pred_xgb = cols_pred_num + [c + '_woe' for c in cols_pred_cat]

### Train initial model

In [None]:
from ia_pkg.metrics import gini
import xgboost as xgb
# pred_xgb.remove('delinquency_score')
dt_xgb = data[pred_xgb]

xgb_params = {'eta': 0.1,
  'max_depth': 3,
  'objective': 'binary:logistic',
  'eval_metric': 'auc',
  'min_child_weight': 30,
  'subsample': 0.85}

evals_result = {}

ibooster= xgb.train(params= xgb_params,
                        dtrain= xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),
                        num_boost_round= 200,
                        early_stopping_rounds = 20,
                        evals= ((xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),'train'),
                                 (xgb.DMatrix(dt_xgb[valid_mask],data[valid_mask][col_target]),'valid')
                                ), 
                        evals_result= evals_result,)

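# ntree_limit / best_ntree_limit were removed in XGBoost 2.0; iteration_range (XGBoost >= 1.4) replaces them
ixgb_scored= ibooster.predict(xgb.DMatrix(dt_xgb), iteration_range=(0, ibooster.best_iteration + 1))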

In [None]:
print('     Train gini:',gini(data[train_mask][col_target], ixgb_scored[train_mask]))
print('Validation gini:',gini(data[valid_mask][col_target], ixgb_scored[valid_mask]))

### Evaluate predictor

Predictors are evaluated by their importance ranking on two metrics (weight and gain). First, we set the number of top predictors we want to see.

In [None]:
n_top = 10 #how many best predictors I want to see

#### Weight of each predictor

Select the *n_top* predictors with the highest weight (i.e. those used most often to split the data across all trees)

In [None]:
pred_xgb_wgh = [x[0] for x in sorted([(k, v) for k, v in ibooster.get_score(importance_type = 'weight').items()]\
                                     , key=lambda x:x[1], reverse = True)]
if len(pred_xgb_wgh) > n_top:
    pred_xgb_wgh = pred_xgb_wgh[:n_top]

#### Gain of each predictor

Select *n_top* predictors with highest gain (i.e. relative contribution of the corresponding feature to the model calculated by taking each feature’s contribution for each tree in the model)

In [None]:
pred_xgb_gain = [x[0] for x in sorted([(k, v) for k, v in ibooster.get_score(importance_type = 'gain').items()]\
                                      , key=lambda x:x[1], reverse = True)]
if len(pred_xgb_gain) > n_top:
    pred_xgb_gain = pred_xgb_gain[:n_top]

#### Selected Predictor

Select the final predictors by combining (union or intersection) the outputs of the two metrics.

In [None]:
def union(lst1, lst2):
    final_list  = list(set(lst1) | set(lst2))
    return final_list

def intersect(lst1, lst2):
    final_list = list(set(lst1) & set(lst2))
    return final_list

In [None]:
pred_xgb = union(pred_xgb_wgh, pred_xgb_gain)
display(pred_xgb)

### Tuning parameter

Hyperparameter tuning is applied to two inputs (max_depth and learning rate).
Each parameter combination is then evaluated on the valid sample.

There are two options for choosing the best combination:

*best_valid* is the combination with the best Gini on the valid sample

*best_diff* is the combination with the lowest train-valid Gini difference

In [None]:
from ia_pkg.metrics import gini

import xgboost as xgb

dt_xgb = data[pred_xgb]

col_result = ['eta', 'max_depth', 'gini_train', 'gini_valid', 'difference']
result = pd.DataFrame(columns = col_result)
grid_params = {
            'eta' : [0.1,0.2,0.3],
            'max_depth' : [2,3,4]
#               'min_child_weight' : [10,20,30,40,50],
#               'subsample' : [0.5, 0.6, 0.7, 0.8, 0.9]      
}

flag = False

for eta in grid_params['eta']:
    for max_depth in grid_params['max_depth']:
        xgb_params = {'eta': eta,
                            'max_depth': max_depth,
                            'objective': 'binary:logistic',
                            'eval_metric': 'auc',
                            'min_child_weight': 30,
                            'subsample': 0.85}

        evals_result = {}

        tbooster = xgb.train(params = xgb_params,
                                    dtrain = xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),
                                    num_boost_round = 200,
                                    early_stopping_rounds = 20,
                                    evals = ((xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),'train'),
                                             (xgb.DMatrix(dt_xgb[valid_mask],data[valid_mask][col_target]),'valid')
                                            ), 
                                    evals_result = evals_result,)

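        # iteration_range (XGBoost >= 1.4) replaces ntree_limit, which was removed in XGBoost 2.0
        txgb_scored = tbooster.predict(xgb.DMatrix(dt_xgb), iteration_range=(0, tbooster.best_iteration + 1))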
        gini_train = gini(data[train_mask][col_target], txgb_scored[train_mask])
        gini_valid = gini(data[valid_mask][col_target], txgb_scored[valid_mask])
        added = [eta, max_depth, gini_train, gini_valid, (gini_train-gini_valid)]
        if flag == False:
            result = pd.DataFrame([added], columns = col_result)
            flag = True
        else:
            result = pd.concat([result, pd.DataFrame([added], columns = col_result)], axis=0)

In [None]:
display(result)

In [None]:
best_valid = result.loc[result['gini_valid'] == result['gini_valid'].max(),['eta', 'max_depth']].to_dict('list')
best_diff = result.loc[result['difference'] == result['difference'].min(),['eta', 'max_depth']].to_dict('list')

print('hyperparameter for best_valid: ', best_valid)
print('hyperparameter for best_diff: ', best_diff)

### Final Model

In [None]:
import xgboost as xgb

dt_xgb = data[pred_xgb]
tuning = best_diff # set hyperparameter

xgb_params = {'eta': tuning.get('eta')[0],
    'max_depth': tuning.get('max_depth')[0],
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'min_child_weight': 30,
    'subsample': 0.85}

evals_result = {}

fbooster = xgb.train(params = xgb_params,
                        dtrain = xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),
                        num_boost_round = 500,
                        early_stopping_rounds = 20,
                        evals = ((xgb.DMatrix(dt_xgb[train_mask],data[train_mask][col_target]),'train'),
                                 (xgb.DMatrix(dt_xgb[valid_mask],data[valid_mask][col_target]),'valid')
                                ), 
                        evals_result = evals_result,)

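# iteration_range (XGBoost >= 1.4) replaces ntree_limit, which was removed in XGBoost 2.0
fxgb_scored = fbooster.predict(xgb.DMatrix(dt_xgb), iteration_range=(0, fbooster.best_iteration + 1))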

### Evaluate final model

In [None]:
print('     Train gini:',gini(data[train_mask][col_target], fxgb_scored[train_mask]))
print('Validation gini:',gini(data[valid_mask][col_target], fxgb_scored[valid_mask]))

#### Gain Importance

In [None]:
fs = fbooster.get_score(importance_type = 'gain') # available importance types: 'gain', 'cover', 'weight'
imp = sorted([(k, v) for k, v in fs.items()], key = lambda x:x[1], reverse = True)
imp.reverse()

fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111)
ax.barh(range(len(imp)), [v for k, v in imp], color="blue",  align='center')
plt.yticks(range(len(imp)), [k for k, v in imp], fontsize=15)
plt.xticks(fontsize=15)
plt.xlabel('Importance',fontsize=15)
plt.ylim([-1, len(imp)])
plt.xlim([0, max([v for k, v in imp])*1.2])
plt.show()

Gain importance tells us each predictor's relative contribution across the trees in the model. As you can see, **total_usage_per_limit**, **total_retail_usage**, and **delinquency_score** are the highest contributors to the model.

Furthermore, for a more specific explanation of these predictors, the SHAP module could be used.

# Score the dataset

Create a new column with the prediction (probability of default).

In [None]:
col_score = 'LR_SCORE'
col_score1 = 'XGB_SCORE'

data[col_score] = lr_scored
print('Column',col_score,'with the prediction added/modified. Number of columns:',data.shape[1])

data[col_score1] = fxgb_scored
print('Column',col_score1,'with the prediction added/modified. Number of columns:',data.shape[1])

# Performance characteristics
Performance characteristics of the models (Gini, Lift, KS) and their visualisations.
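
The `lift` and `kolmogorov_smirnov` functions come from `ia_pkg.metrics` and their source is not shown. The sketch below illustrates the textbook definitions under two assumptions: a higher score means a higher probability of default, and lift at 10% is the default rate in the 10% highest-scored observations divided by the overall default rate. The names `ks_statistic` and `lift_at` are hypothetical, and the package's argument order and sign conventions may differ.

```python
import numpy as np

def ks_statistic(y_true, score):
    """Maximum gap between the cumulative score distributions of defaults and non-defaults."""
    y = np.asarray(y_true)
    s = np.asarray(score, dtype=float)
    order = np.argsort(s, kind='mergesort')
    y_sorted, s_sorted = y[order], s[order]
    cum_bad = np.cumsum(y_sorted) / y.sum()
    cum_good = np.cumsum(1 - y_sorted) / (len(y) - y.sum())
    # Evaluate only at the last row of each distinct score so ties are handled together
    last_of_tie = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    return float(np.max(np.abs(cum_bad - cum_good)[last_of_tie]))

def lift_at(y_true, score, pct=10):
    """Default rate among the pct% highest scores divided by the overall default rate."""
    y = np.asarray(y_true)
    s = np.asarray(score, dtype=float)
    n_top = max(1, int(np.ceil(len(y) * pct / 100)))
    top = np.argsort(-s, kind='mergesort')[:n_top]
    return float(y[top].mean() / y.mean())

# Toy data: 20 observations, 3 defaults concentrated at the high scores
y_demo = np.array([0] * 16 + [1, 0, 1, 1])
score_demo = np.arange(20) / 20
ks_demo = ks_statistic(y_demo, score_demo)
lift_demo = lift_at(y_demo, score_demo, pct=10)
print(ks_demo, lift_demo)
```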

In [None]:
from ia_pkg.metrics import gini, lift, kolmogorov_smirnov
lift_perc = 10

## Performance per sample

In [None]:
perf = pd.DataFrame({'sample':[
    'train',
    'valid',
    'test'    
    ], 'LR_gini':[
    gini(data[train_mask][col_target],data[train_mask][col_score]) #train
    ,gini(data[valid_mask][col_target],data[valid_mask][col_score]) #valid
    ,gini(data[test_mask][col_target],data[test_mask][col_score]) #test
    ], 'XGB_gini':[
    gini(data[train_mask][col_target],data[train_mask][col_score1]) #train
    ,gini(data[valid_mask][col_target],data[valid_mask][col_score1]) #valid
    ,gini(data[test_mask][col_target],data[test_mask][col_score1]) #test
    ], 'LR_lift'+str(lift_perc):[
    lift(data[train_mask][col_target],-data[train_mask][col_score],lift_perc) #train
    ,lift(data[valid_mask][col_target],-data[valid_mask][col_score],lift_perc) #valid
    ,lift(data[test_mask][col_target],-data[test_mask][col_score],lift_perc) #test
    ], 'XGB_lift'+str(lift_perc):[
    lift(data[train_mask][col_target],-data[train_mask][col_score1],lift_perc) #train
    ,lift(data[valid_mask][col_target],-data[valid_mask][col_score1],lift_perc) #valid
    ,lift(data[test_mask][col_target],-data[test_mask][col_score1],lift_perc) #test
    ], 'LR_KS':[
    kolmogorov_smirnov(data[train_mask][col_score],data[train_mask][col_target]) #train
    ,kolmogorov_smirnov(data[valid_mask][col_score],data[valid_mask][col_target]) #valid
    ,kolmogorov_smirnov(data[test_mask][col_score],data[test_mask][col_target]) #test
    ], 'XGB_KS':[
    kolmogorov_smirnov(data[train_mask][col_score1],data[train_mask][col_target]) #train
    ,kolmogorov_smirnov(data[valid_mask][col_score1],data[valid_mask][col_target]) #valid
    ,kolmogorov_smirnov(data[test_mask][col_score1],data[test_mask][col_target]) #test
    ]}).set_index('sample')

In [None]:
display(perf)

There is a large difference in overall performance between the two models. This could indicate a data leakage issue in the XGB model, since its performance on valid and test is very high; further investigation is needed to confirm it.

## ROC Curve

In [None]:
from sklearn.metrics import roc_curve

# Compute the ROC curve for each model on the test sample
fpr, tpr, _ = roc_curve(data[test_mask][col_target], data[test_mask][col_score])
fpr1, tpr1, _ = roc_curve(data[test_mask][col_target], data[test_mask][col_score1])

# Plot both ROC curves on the same axes so they share one scale
f, ax1 = plt.subplots(figsize=(6,6))
lw = 2
ax1.plot(fpr, tpr, color='y',label='LR ROC curve')
ax1.plot(fpr1, tpr1, color='r',label='XGB ROC curve')
ax1.plot([0, 1], [0, 1], color='b', lw=lw, linestyle='--')

ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('Receiver operating characteristic')
ax1.legend(loc='lower right')
plt.show()

## Score Linearity on Holdout Sample

In [None]:
from ia_pkg.plots import plot_score_linearity
plot_score_linearity(data[test_mask],
                    col_score=col_score1,
                    col_target=col_target,
                    bins=10)

In [None]:
plot_score_linearity(data[test_mask],
                    col_score=col_score,
                    col_target=col_target,
                    bins=10)

The score distribution is plotted by decile to show how linearly the output score relates to the actual default rate.

As you can see, the XGB model produces more consistent monotonicity than the LR model, in line with its higher Gini.

Also, looking at the PD score on the x-axis for both models, the scores tend to concentrate at low PD values, so the default classification threshold of 0.5 would not be a relevant cutoff in this case.
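
The problem statement also asks for default levels by decile (per decile and cumulative) with some measure of confidence. A minimal sketch of one way to tabulate this is shown below; it is not part of `ia_pkg`. It assumes equal-sized deciles built from score ranks, with decile 1 holding the lowest PD scores, and uses the Wilson score interval (95% by default) as one possible confidence methodology. The function name `decile_table` and the simulated data are assumptions.

```python
import numpy as np
import pandas as pd

def decile_table(score, target, bins=10, z=1.96):
    """Default rate per score decile, cumulative rate, and Wilson interval (hypothetical helper)."""
    df = pd.DataFrame({'score': np.asarray(score), 'target': np.asarray(target)})
    # Rank first so tied scores never produce duplicate bin edges
    df['decile'] = pd.qcut(df['score'].rank(method='first'), bins, labels=False) + 1
    tab = df.groupby('decile').agg(n=('target', 'count'),
                                   defaults=('target', 'sum'),
                                   min_score=('score', 'min'),
                                   max_score=('score', 'max'))
    tab['default_rate'] = tab['defaults'] / tab['n']
    tab['cum_default_rate'] = tab['defaults'].cumsum() / tab['n'].cumsum()
    # Wilson score interval for a binomial proportion
    p, n = tab['default_rate'], tab['n']
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    tab['ci_low'] = centre - half
    tab['ci_high'] = centre + half
    return tab

# Simulated scores and outcomes, for illustration only
rng = np.random.default_rng(0)
score_demo = rng.uniform(0, 0.2, 2000)
target_demo = (rng.uniform(size=2000) < score_demo).astype(int)
deciles_demo = decile_table(score_demo, target_demo)
print(deciles_demo)
```

The Wilson interval is used here because it behaves sensibly when a bin has very few or zero defaults, which is common in the low-risk deciles; a bootstrap over observations would be an alternative.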

## Cut-Off Estimation

In [None]:
from ia_pkg.function import cutoff_df
from ia_pkg.plots import cutoff_plot

# Initialized all variable (expected default rate and scores)
exp_def_rate =0.025 # setting up the 2.5% expected default rate
scores = [col_score, col_score1]

# plot cutoff score
fig, ax = plt.subplots(figsize=(8,5))
for s in scores:
    dt = cutoff_df(data[valid_mask|test_mask],
                   col_score=s,
                   col_target=col_target)
    cutoff_plot(dt,
                col_score=s,
                exp_def_rate=exp_def_rate,
                ax=ax)

plt.title('Cumulative Distribution - PD Score vs Expected Default Rate')
plt.xlabel('PD Score')
plt.ylabel('Cumulative default rate')
plt.show()

To answer the problem statement, the cutoff score is estimated by plotting the cumulative default rate against the sorted score.

Both models are shown to find the optimal cutoff that keeps the portfolio default rate below 2.5%.

From the graph, we can read off the cutoff score for each model:

**LR model cutoff score : 0.01482** 

**XGB model cutoff score : 0.0177**
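
The cutoffs above were read from the plot produced by `cutoff_df` and `cutoff_plot`, whose implementations are not shown. A minimal sketch of the underlying idea follows, assuming applicants are accepted when their PD score is at or below the cutoff: sort by score, compute the cumulative default rate of the accepted population at each distinct score, and take the highest score whose cumulative rate is still within the target. The function name `cutoff_for_target_rate` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def cutoff_for_target_rate(score, target, max_default_rate=0.025):
    """Highest score cutoff whose accepted population (score <= cutoff) stays
    within the target default rate (hypothetical helper)."""
    df = pd.DataFrame({'score': np.asarray(score, dtype=float),
                       'target': np.asarray(target)})
    # Aggregate tied scores, sorted ascending by score
    by_score = df.groupby('score')['target'].agg(['sum', 'count'])
    cum_rate = by_score['sum'].cumsum() / by_score['count'].cumsum()
    ok = cum_rate[cum_rate <= max_default_rate]
    return float(ok.index.max()) if len(ok) else None

# Toy data with a 15% target so the arithmetic is easy to follow
score_demo = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
target_demo = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]
cutoff_demo = cutoff_for_target_rate(score_demo, target_demo, max_default_rate=0.15)
print(cutoff_demo)
```

The cumulative default rate is not monotone in the score, so the sketch takes the largest qualifying cutoff rather than stopping at the first violation. On real data the estimate should be checked on a sample that was not used for fitting or tuning.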

# Conclusion

- A credit score is created to predict the probability of default on the provided dataset


- Transformation and selection procedures are applied to find the predictors most useful for the target


- Two models (LR and XGB) built with different methods are presented for comparison. In the evaluation, XGB performs better than LR


- The *delinquency_score* and *total_usage_per_limit* predictors are the highest contributors in both models


- No decision on the final model has been made yet, since the suspiciously high performance and the interpretability of one of the models (the XGB model) need further investigation


- Optimal cutoff for a 2.5% default rate: 0.01482 for the LR model and 0.0177 for the XGB model