<a href="https://colab.research.google.com/github/dborgesm/Banking-Analytics/blob/master/Modelling_for_Scorecard_Application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modelling

Logistic Regression was used and to verify the model performance the AUC-ROC curve was plotted. 

In [0]:
#Installing packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [0]:
!pip install scorecardpy

In [0]:
import scorecardpy as sc

Loading the train and testing data from last script and also the clean data set from data preprocessing


In [0]:
!gdown https://drive.google.com/uc?id=1MEfgUPP_chwwGCoLScjY0mqv0RArNfxQ


Downloading...
From: https://drive.google.com/uc?id=1MEfgUPP_chwwGCoLScjY0mqv0RArNfxQ
To: /content/train_woe.csv
190MB [00:01, 175MB/s]


In [0]:
!gdown https://drive.google.com/uc?id=1XRQRUVl235ayeIg1i0Y17h-cIPQ1htFk


Downloading...
From: https://drive.google.com/uc?id=1XRQRUVl235ayeIg1i0Y17h-cIPQ1htFk
To: /content/test_woe.csv
82.3MB [00:00, 95.0MB/s]


In [0]:
!gdown https://drive.google.com/uc?id=19tu2dEmP9aDPcclvrYh5gAdDRf0ux_0W

Downloading...
From: https://drive.google.com/uc?id=19tu2dEmP9aDPcclvrYh5gAdDRf0ux_0W
To: /content/as_data.csv
445MB [00:08, 50.9MB/s]


In [0]:
train_woe = pd.read_csv("/content/train_woe.csv")
test_woe = pd.read_csv("/content/test_woe.csv")
data = pd.read_csv("/content/as_data.csv")

In [0]:
# I will eliminate the columns from the "data" data set so i can creat my test and training data with no woe applied, also this data set will contain all the indeterminate cases, not only the charge off. 
drop_col = ['open_act_il',
        'percent_bc_gt_75', 'mo_sin_rcnt_tl',
        'inq_last_6mths', 'dti',
        'tot_hi_cred_lim', 'mo_sin_rcnt_rev_tl_op',
        'annual_inc', 'inq_last_12m',
        'mths_since_recent_bc', 'initial_list_status',
        'home_ownership', 'num_actv_rev_tl',
        'total_cu_tl', 'mths_since_recent_inq',
        'mort_acc', 'mo_sin_old_rev_tl_op',
        'purpose', 'inq_fi',
        'mo_sin_old_il_acct', 'mths_since_last_major_derog',
        'zip_code', 'open_il_12m',
        'num_accts_ever_120_pd', 'num_rev_accts',
        'addr_region', 'pct_tl_nvr_dlq',
        'total_il_high_credit_limit', 'num_il_tl',
        'emp_length', 'tot_coll_amt',
        'term', 'delinq_2yrs', 'pub_rec_bankruptcies',
        'num_tl_90g_dpd_24m', 'pub_rec']

In [0]:
# Eliminating unused variables
data.drop(drop_col, axis=1, inplace=True)

# Same train-test split as before
train_nowoe, test_nowoe = sc.split_df(data,
                          y = 'default',
                          ratio = 0.7, seed = 251082976).values()

#eliminating the indeterminates in the training data
indeterminate = train_nowoe[train_nowoe.loan_status == "Late (31-120 days)"].index
train_nowoe.drop(indeterminate, inplace=True)
train_nowoe.drop(["loan_status"], axis=1, inplace=True)
test_nowoe.drop(["loan_status"], axis=1, inplace=True)
# Give breaks for WoE
breaks_adj = {
    'all_util': [4,40,60],
    'bc_open_to_buy': [4000,10000,18000,29000],
    'int_rate': [12,15,17],
    'mths_since_rcnt_il': [1.0,7.0,15.0],
    'il_util': [1,60,80],
    'acc_open_past_24mths': [2.0,4.0,7.0],
    'max_bal_bc': [500.0,6500.0]
}
# Apply breaks.
bins_adj = sc.woebin(train_nowoe, y="default",
                     breaks_list=breaks_adj)

[INFO] creating woe binning ...
Binning on 1343093 rows and 8 columns in 00:00:37


I will try 3 models, because some of the variables have IV less than 0.1 the lowest value is 0.7 and 0.8


Model 2: il util and acc_open_past_24mths are not included

In [0]:
train_woe1 =  train_woe.drop(["acc_open_past_24mths_woe", "il_util_woe"], axis=1)
test_woe1 = test_woe.drop(["acc_open_past_24mths_woe", "il_util_woe"], axis=1)

In [0]:
train_nowoe1 =  train_nowoe.drop(["acc_open_past_24mths", "il_util"], axis=1)
test_nowoe1 = test_nowoe.drop(["acc_open_past_24mths", "il_util"], axis=1)
# Give breaks for WoE
breaks_adj1 = {
    'all_util': [4,40,60],
    'bc_open_to_buy': [4000,10000,18000,29000],
    'int_rate': [12,15,17],
    'mths_since_rcnt_il': [1.0,7.0,15.0],
    'max_bal_bc': [500.0,6500.0]
}
# Apply breaks.
bins_adj1 = sc.woebin(train_nowoe1, y="default",
                     breaks_list=breaks_adj1)

[INFO] creating woe binning ...
Binning on 1343093 rows and 6 columns in 00:00:28


Model 3 where only the variables with a IV greater or equal than 0.1


In [0]:
train_woe2 =  train_woe.drop(["acc_open_past_24mths_woe", "il_util_woe", "max_bal_bc_woe", "bc_open_to_buy_woe"], axis=1)
test_woe2 = test_woe.drop(["acc_open_past_24mths_woe", "il_util_woe", "max_bal_bc_woe", "bc_open_to_buy_woe"], axis=1)

train_nowoe2 =  train_nowoe.drop(["acc_open_past_24mths", "il_util", "max_bal_bc", "bc_open_to_buy"], axis=1)
test_nowoe2 = test_nowoe.drop(["acc_open_past_24mths", "il_util", "max_bal_bc", "bc_open_to_buy"], axis=1)
# Give breaks for WoE
breaks_adj2 = {
    'all_util': [4,40,60],
    'int_rate': [12,15,17],
    'mths_since_rcnt_il': [1.0,7.0,15.0],
}
# Apply breaks.
bins_adj2 = sc.woebin(train_nowoe1, y="default",
                     breaks_list=breaks_adj2)

[INFO] creating woe binning ...
Binning on 1343093 rows and 6 columns in 00:00:32


## Generating a logistic regression object

In [0]:
from sklearn.linear_model import LogisticRegression

cl_logreg1 = LogisticRegression(penalty='l1', # Lasso penalization
                                     tol=0.0001, # Tolerance for parameters
                                     C=1.0, # Penalty constant, see below
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=251082976, # Random seed
                                     max_iter=10000, # Maximum iterations
                                     verbose=1, # Show process. 1 is yes.
                                     solver = 'saga',
                                     warm_start=False # Train anew or start from previous weights. For repeated training. 
                                    )
cl_logreg2 = LogisticRegression(penalty='l1', # Lasso penalization
                                     tol=0.0001, # Tolerance for parameters
                                     C=1.0, # Penalty constant, see below
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=251082976, # Random seed
                                     max_iter=10000, # Maximum iterations
                                     verbose=1, # Show process. 1 is yes.
                                     solver = 'saga',
                                     warm_start=False # Train anew or start from previous weights. For repeated training. 
                                    )

cl_logreg3 = LogisticRegression(penalty='l1', # Lasso penalization
                                     tol=0.0001, # Tolerance for parameters
                                     C=1.0, # Penalty constant, see below
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=251082976, # Random seed
                                     max_iter=10000, # Maximum iterations
                                     verbose=1, # Show process. 1 is yes.
                                     solver = 'saga',
                                     warm_start=False # Train anew or start from previous weights. For repeated training. 
                                    )

## Training!
Applying the method ```fit``` to our data, giving it the training set and the target variable as inputs.

In [0]:
model1 = cl_logreg1.fit(X = train_woe.iloc[:, 1:],
                    y = train_woe['default']
                   )

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 31 epochs took 29 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   29.3s finished


In [0]:
model2 = cl_logreg2.fit(X = train_woe1.iloc[:, 1:], 
                    y = train_woe1['default'] 
                   )

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 32 epochs took 28 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.5s finished


In [0]:
model3 = cl_logreg3.fit(X = train_woe2.iloc[:, 1:], # All rows and from the second var to end
                    y = train_woe2['default'] # The target
                   )

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 37 epochs took 31 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.4s finished


Let's check the parameters

In [0]:
#Parameters of the full model (all variables after appying WoE)
coef_df1 = pd.concat([pd.DataFrame({'column': train_woe.columns[1:]}), 
                    pd.DataFrame(np.transpose(model1.coef_))],
                    axis = 1
                   )

coef_df1
model1.intercept_

array([-0.00215631])

In [0]:
# Parameters of the second model 
coef_df2 = pd.concat([pd.DataFrame({'column': train_woe1.columns[1:]}), 
                    pd.DataFrame(np.transpose(model2.coef_))],
                    axis = 1
                   )

coef_df2
model2.intercept_

array([-0.0017227])

In [0]:
# Parameters of the third model
coef_df3 = pd.concat([pd.DataFrame({'column': train_woe2.columns[1:]}), 
                    pd.DataFrame(np.transpose(model3.coef_))],
                    axis = 1
                   )

display(coef_df3)
display(model3.intercept_)

Unnamed: 0,column,0
0,mths_since_rcnt_il_woe,0.472666
1,all_util_woe,0.426562
2,int_rate_woe,0.941109


array([-0.00178921])

Let's use cross validation to verify the performance of each model

In [0]:
from sklearn.model_selection import cross_val_score

In [0]:
# Full model
cv_logreg = cross_val_score(cl_logreg1, train_woe.iloc[:,1:], train_woe.default, 
                            cv = 5)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 31 epochs took 24 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   24.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 31 epochs took 23 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 30 epochs took 22 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 32 epochs took 23 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 31 epochs took 23 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.8s finished


In [0]:
cv_logreg.mean()

0.6074434985393898

In [0]:
# Model 2 
cv_logreg1 = cross_val_score(cl_logreg2, train_woe1.iloc[:,1:], train_woe1.default, 
                            cv = 5)

In [0]:
cv_logreg1.mean()

0.6137490597922799

In [0]:
# Model 3
cv_logreg3 = cross_val_score(cl_logreg3, train_woe2.iloc[:,1:], train_woe2.default, 
                            cv = 5)

In [0]:
cv_logreg3.mean()

0.6091261300035847

Looking at the performance I will choose the second model, having the full model with IV values less than 0.8 will bring noise to the model, also the mean score cross validation value is not that different. 

Searching for the best parameters using GridSearch

In [0]:
from sklearn.model_selection import  GridSearchCV

In [0]:
c_val = 1/np.linspace(0.01,10,25)
params = dict(C=c_val)

In [0]:
c_val = 1/np.linspace(0.01,10,25)
params = dict(C=c_val)
grid1 = GridSearchCV(estimator=cl_logreg1, param_grid = params, n_jobs=-1, cv=5)

In [0]:
lasso_model1 = grid1.fit(X = train_woe1.iloc[:, 1:], # All rows and from the second var to end
                    y = train_woe1['default'] # The target
                   )

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 32 epochs took 28 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.3s finished


In [0]:
display(lasso_model1.best_score_)
display(lasso_model1.best_params_)

0.6137490106790818

{'C': 100.0}

The best parameter resulted C = 100, let's add this inputs to the logistic regression


In [0]:
cl_logreg_final = LogisticRegression(penalty='l1', # Lasso penalization
                                     tol=0.0001, # Tolerance for parameters
                                     C=100.0, # Penalty constant, see below
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=251082976, # Random seed
                                     max_iter=10000, # Maximum iterations
                                     verbose=1, # Show process. 1 is yes.
                                     solver = 'saga',
                                     warm_start=False # Train anew or start from previous weights. For repeated training. 
                                    )

final_model = cl_logreg_final.fit(X = train_woe1.iloc[:, 1:], # All rows and from the second var to end
                    y = train_woe1['default'] # The target
                   )

# Parameters of the second model 
coef_df_final = pd.concat([pd.DataFrame({'column': train_woe1.columns[1:]}), 
                    pd.DataFrame(np.transpose(model2.coef_))],
                    axis = 1
                   )

coef_df_final
final_model.intercept_

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 32 epochs took 28 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.8s finished


array([-0.00172296])

In [0]:
coef_df_final


Unnamed: 0,column,0
0,bc_open_to_buy_woe,0.349993
1,mths_since_rcnt_il_woe,0.435899
2,all_util_woe,0.204709
3,int_rate_woe,0.906979
4,max_bal_bc_woe,0.266724


## Applying the model to the test set

In [0]:
pred_class_test = final_model.predict(test_woe1.iloc[:, 1:])
probs_test = final_model.predict_proba(test_woe1.iloc[:, 1:])
print(probs_test[0:5], pred_class_test[0:5])

[[0.70822563 0.29177437]
 [0.72836679 0.27163321]
 [0.55902512 0.44097488]
 [0.51623331 0.48376669]
 [0.73636048 0.26363952]] [0 0 0 0 0]


In [0]:
pred_class_test.mean()

0.4268481371906146

Scikit-learn will give, by default, one probability per class.  The second column is the one that applies for class Default = 1.

We will get the confusion matrix to check our accuracy. These are included in the subpackage ```sklearn.metrics```

In [0]:
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve

In [0]:
confusion_matrix(y_true = test_woe1['default'], y_pred = pred_class_test)

array([[307681, 196451],
       [ 25338,  51561]])

Measure Errors and AUC


In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Calculate confusion matrix
confusion_matrix_sc = confusion_matrix(y_true = test_nowoe1['default'], 
                    y_pred = pred_class_test)

# Turn matrix to percentages
confusion_matrix_sc = confusion_matrix_sc.astype('float') / confusion_matrix_sc.sum(axis=1)[:, np.newaxis]

# Turn to dataframe
df_cm = pd.DataFrame(
        confusion_matrix_sc, index=['good', 'bad'], columns=['good', 'bad'], 
)

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(test_nowoe1["default"], probs_test[:,1])

# Save the AUC in a variable to display it. Round it first
auc = np.round(roc_auc_score(y_true = test_nowoe1['default'], 
                             y_score = probs_test[:,1]),
              decimals = 3)

fig, ax = plt.subplots(1,2,figsize=(22, 9) )

# Make it nicer
heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f', vmin=0, vmax=0.6, ax=ax[0])

heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, 
                             ha='right', fontsize=fontsize)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,
                             ha='right', fontsize=fontsize)

# Add labels
plt.title("Confusion matrix")
plt.ylabel('True label')
plt.xlabel('Predicted label')


plt.plot([0, 1], [0, 1],'r--')
plt.plot(fpr,tpr,label="ROC, auc="+str(auc))
plt.legend(loc=4)
plt.title("ROC")
plt.show()

## Scorecards

 As optional arguments it receives a PDO, a base score of 700, and decimal base odds (so instead of 50:1, it receives 0.02).
 

In [0]:
data_sc = sc.scorecard(bins_adj, final_model, 
             train_woe1.columns[1:], # The column names in the trained LR
             points0=700, # Base points
             odds0=.02, # Base odds
             pdo=50) # PDO 


In [0]:
# Applying the credit score. Applies over the original data!
train_score = sc.scorecard_ply(train_nowoe1, data_sc, 
                               print_step=0)
test_score = sc.scorecard_ply(test_nowoe1, data_sc, 
                               print_step=0)

In [0]:
test_score.describe()

Unnamed: 0,score
count,581031.0
mean,430.238949
std,49.161598
min,343.0
25%,388.0
50%,434.0
75%,472.0
max,521.0
