## IS453 Financial Analytics
## Week 11 - Credit Scoring Lab Data

### Credit risk scorecard construction with scorecardpy

## HMEQ Dataset

The HMEQ data set reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan in which the obligor uses the equity in their home as the underlying collateral.
The data originally come from the website of the book *Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS*: https://www.bartbaesens.com/book/6/credit-risk-analytics.
A cleaner version of the data is available on Kaggle: https://www.kaggle.com/akhil14shukla/loan-defaulter-prediction/data


**Variables definition**

1. BAD: Binary response variable (this is the class column)
    - 1 = applicant defaulted on the loan or is seriously delinquent
    - 0 = applicant paid the loan or is current on loan payments
2. LOAN: Requested loan amount
3. MORTDUE: Amount due on existing mortgage
4. VALUE: Value of current property
5. REASON: Reason for the loan
    - DebtCon = debt consolidation (customer uses the home equity loan to pay back high-interest loans)
    - HomeImp = home improvement
6. JOB: Occupational categories
    - ProfExe
    - Mgr
    - Office
    - Self
    - Sales
    - Other
7. YOJ: Years at present job
8. DEROG: Number of major derogatory reports (issued on past loans when the customer failed to honor the contract or repay on time)
9. DELINQ: Number of delinquent credit lines
10. CLAGE: Age of oldest credit line in months
11. NINQ: Number of recent credit inquiries
12. CLNO: Number of credit lines
13. DEBTINC: Debt-to-income ratio in percent

**Install scorecardpy**
This is a Python version of the R package scorecard. The following links have more information:

https://pypi.org/project/scorecardpy/

https://github.com/shichenxie/scorecardpy/

https://cran.r-project.org/web/packages/scorecard/scorecard.pdf

In [None]:
# make sure you are running Python 3.9 or later

# depending on your environment, either pip install or conda install the following packages
# !pip install pandas==2.1.1
# !pip install scorecardpy==0.1.9.7

# after downloading, restart your kernel

In [1]:
# suppress scorecardpy compatibility warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics
import scorecardpy as sc
import pprint

**Read in the original hmeq_data.csv file**

The file contains missing values, which is fine because the binning step handles them.

In [2]:
# sample code
hmeq_data = pd.read_csv('hmeq_data.csv')

# use a copy of hmeq_data for credit risk model
hmeq_data_forsc = hmeq_data.copy()

# check for missing values
hmeq_data_forsc.isnull().sum()

BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64

Drop MORTDUE because it is highly correlated with VALUE.
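
One way to check a claim like this is a pairwise Pearson correlation. The sketch below uses a small made-up frame (the numbers are illustrative and are not rows from `hmeq_data.csv`); on the lab data the same call could be applied to `hmeq_data` before the column is dropped.

```python
import pandas as pd

# toy values for illustration only, not rows from hmeq_data.csv
toy = pd.DataFrame({
    "MORTDUE": [20000, 50000, 80000, 110000, 35000],
    "VALUE":   [30000, 65000, 95000, 140000, 50000],
})

# Pearson correlation; pandas skips pairs where either value is missing
corr = toy["MORTDUE"].corr(toy["VALUE"])
print(f"corr(MORTDUE, VALUE) = {corr:.3f}")
```

A coefficient close to 1 means the two columns carry largely the same information, so keeping both adds little to the model.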


In [3]:
# sample code

hmeq_data_forsc.drop(columns='MORTDUE', inplace=True)
hmeq_data_forsc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   VALUE    5848 non-null   float64
 3   REASON   5708 non-null   object 
 4   JOB      5681 non-null   object 
 5   YOJ      5445 non-null   float64
 6   DEROG    5252 non-null   float64
 7   DELINQ   5380 non-null   float64
 8   CLAGE    5652 non-null   float64
 9   NINQ     5450 non-null   float64
 10  CLNO     5738 non-null   float64
 11  DEBTINC  4693 non-null   float64
dtypes: float64(8), int64(2), object(2)
memory usage: 558.9+ KB


**Do train-test split**

`sc.split_df` returns a dictionary containing the train and test datasets. It uses a fixed default random seed, so the split is reproducible.
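
For intuition, a seeded split that samples 70% within each class can be sketched in plain pandas. This is a hypothetical helper (`stratified_split` is not part of scorecardpy), and whether `sc.split_df` works exactly this way internally is an assumption; the toy frame stands in for the lab data.

```python
import pandas as pd

def stratified_split(df, y, ratio=0.7, seed=0):
    """Hypothetical helper: sample `ratio` of the rows within each class of `y`."""
    train = pd.concat(
        [grp.sample(frac=ratio, random_state=seed) for _, grp in df.groupby(y)]
    ).sort_index()
    test = df.drop(train.index).sort_index()
    return {"train": train, "test": test}

# toy data: 80 good (0) and 20 bad (1) applications
toy = pd.DataFrame({"BAD": [0] * 80 + [1] * 20, "LOAN": range(100)})
split = stratified_split(toy, y="BAD")
train_toy, test_toy = split.values()
print(train_toy.shape, test_toy.shape)
print(train_toy["BAD"].mean(), test_toy["BAD"].mean())
```

Sampling within each class keeps the bad rate the same in both parts, which matters when defaults are a minority of the data.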

In [4]:
# sample code

# split data into 70% train and 30% test
train, test = sc.split_df(hmeq_data_forsc, y = 'BAD', ratio = .7).values()
print(train.shape)
print(test.shape)

(4172, 12)
(1788, 12)


**Generate WOE bins**

`sc.woebin()` returns the binning for each variable as a Python dictionary, and `sc.woebin_plot()` can plot the WOE for the bins. The binning optimizes for IV but does not attempt to make the WOE trend monotonic.

scorecardpy bins categorical variables directly as part of this process, so it is not necessary to one-hot encode them in advance.

It also creates a missing bin for each variable that has missing values, so there is no need to impute or remove missing values.

*Ignore any Python warning messages.*

In [5]:
# automatically calculate bin ranges, bins is a dictionary
bins = sc.woebin(train, y = 'BAD')

for variables, bindetails in bins.items():
    print(variables, " : ")
    display(bindetails)
    print("--"*50)

[INFO] creating woe binning ...


CLAGE  : 



Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,CLAGE,missing,209,0.050096,155,54,0.258373,0.335453,0.006205,0.213844,missing,True
1,CLAGE,"[-inf,70.0)",220,0.052733,136,84,0.381818,0.908056,0.054704,0.213844,70.0,False
2,CLAGE,"[70.0,170.0)",1708,0.409396,1291,417,0.244145,0.259807,0.029793,0.213844,170.0,False
3,CLAGE,"[170.0,240.0)",1168,0.279962,979,189,0.161815,-0.254891,0.01681,0.213844,240.0,False
4,CLAGE,"[240.0,280.0)",374,0.089645,343,31,0.082888,-1.01385,0.066341,0.213844,280.0,False
5,CLAGE,"[280.0,inf)",493,0.118169,436,57,0.115619,-0.644697,0.03999,0.213844,inf,False


----------------------------------------------------------------------------------------------------
REASON  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,REASON,missing,168,0.040268,136,32,0.190476,-0.057025,0.000129,0.014904,missing,True
1,REASON,DebtCon,2730,0.654362,2222,508,0.186081,-0.085788,0.004692,0.014904,DebtCon,False
2,REASON,HomeImp,1274,0.305369,982,292,0.229199,0.177056,0.010083,0.014904,HomeImp,False


----------------------------------------------------------------------------------------------------
DEROG  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,DEROG,missing,482,0.115532,422,60,0.124481,-0.560767,0.030411,0.379931,missing,True
1,DEROG,"[-inf,1.0)",3163,0.75815,2642,521,0.164717,-0.233648,0.038509,0.379931,1.0,False
2,DEROG,"[1.0,2.0)",313,0.075024,188,125,0.399361,0.981765,0.09224,0.379931,2.0,False
3,DEROG,"[2.0,inf)",214,0.051294,88,126,0.588785,1.748839,0.218771,0.379931,inf,False


----------------------------------------------------------------------------------------------------
JOB  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,JOB,missing,191,0.045781,176,15,0.078534,-1.07254,0.03718,0.124782,missing,True
1,JOB,Office,663,0.158917,572,91,0.137255,-0.448386,0.027747,0.124782,Office,False
2,JOB,ProfExe,880,0.21093,740,140,0.159091,-0.275114,0.01466,0.124782,ProfExe,False
3,JOB,"Mgr%,%Other",2218,0.53164,1701,517,0.233093,0.198965,0.022307,0.124782,"Mgr%,%Other",False
4,JOB,"Sales%,%Self",220,0.052733,151,69,0.313636,0.60672,0.022887,0.124782,"Sales%,%Self",False


----------------------------------------------------------------------------------------------------
DEBTINC  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,DEBTINC,missing,866,0.207574,326,540,0.623557,1.894565,1.044727,1.863035,missing,True
1,DEBTINC,"[-inf,40.0)",2674,0.64094,2510,164,0.061331,-1.338278,0.741917,1.863035,40.0,False
2,DEBTINC,"[40.0,42.0)",367,0.087967,328,39,0.106267,-0.739558,0.03796,1.863035,42.0,False
3,DEBTINC,"[42.0,inf)",265,0.063519,176,89,0.335849,0.708046,0.03843,1.863035,inf,False


----------------------------------------------------------------------------------------------------
LOAN  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,LOAN,"[-inf,6000.0)",219,0.052493,113,106,0.484018,1.325945,0.124071,0.167658,6000.0,False
1,LOAN,"[6000.0,17000.0)",1963,0.470518,1559,404,0.205807,0.039509,0.000743,0.167658,17000.0,False
2,LOAN,"[17000.0,38000.0)",1748,0.418984,1482,266,0.152174,-0.327758,0.040642,0.167658,38000.0,False
3,LOAN,"[38000.0,inf)",242,0.058006,186,56,0.231405,0.189499,0.002202,0.167658,inf,False


----------------------------------------------------------------------------------------------------
CLNO  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,CLNO,missing,145,0.034756,111,34,0.234483,0.206724,0.001578,0.084517,missing,True
1,CLNO,"[-inf,10.0)",416,0.099712,280,136,0.326923,0.667759,0.053173,0.084517,10.0,False
2,CLNO,"[10.0,24.0)",2152,0.51582,1762,390,0.181227,-0.118164,0.006947,0.084517,24.0,False
3,CLNO,"[24.0,33.0)",951,0.227948,800,151,0.15878,-0.277438,0.0161,0.084517,33.0,False
4,CLNO,"[33.0,inf)",508,0.121764,387,121,0.238189,0.227259,0.006719,0.084517,inf,False


----------------------------------------------------------------------------------------------------
NINQ  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,NINQ,missing,350,0.083893,297,53,0.151429,-0.333547,0.008412,0.158557,missing,True
1,NINQ,"[-inf,1.0)",1767,0.423538,1489,278,0.157329,-0.288345,0.032201,0.158557,1.0,False
2,NINQ,"[1.0,2.0)",937,0.224593,756,181,0.19317,-0.039651,0.000349,0.158557,2.0,False
3,NINQ,"[2.0,4.0)",832,0.199425,634,198,0.237981,0.226112,0.01089,0.158557,4.0,False
4,NINQ,"[4.0,inf)",286,0.068552,164,122,0.426573,1.094048,0.106706,0.158557,inf,False


----------------------------------------------------------------------------------------------------
YOJ  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,YOJ,missing,364,0.087248,312,52,0.142857,-0.401866,0.012423,0.057791,missing,True
1,YOJ,"[-inf,5.0)",1313,0.314717,997,316,0.24067,0.240885,0.019585,0.057791,5.0,False
2,YOJ,"[5.0,6.0)",246,0.058965,196,50,0.203252,0.023802,3.4e-05,0.057791,6.0,False
3,YOJ,"[6.0,10.0)",772,0.185043,643,129,0.167098,-0.216439,0.008109,0.057791,10.0,False
4,YOJ,"[10.0,21.0)",1086,0.260307,855,231,0.212707,0.08121,0.001759,0.057791,21.0,False
5,YOJ,"[21.0,inf)",391,0.09372,337,54,0.138107,-0.441205,0.015881,0.057791,inf,False


----------------------------------------------------------------------------------------------------
DELINQ  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,DELINQ,missing,409,0.098035,356,53,0.129584,-0.514745,0.022075,0.60362,missing,True
1,DELINQ,"[-inf,1.0)",2944,0.705657,2539,405,0.137568,-0.445745,0.121867,0.60362,1.0,False
2,DELINQ,"[1.0,2.0)",444,0.106424,292,152,0.342342,0.73702,0.070214,0.60362,2.0,False
3,DELINQ,"[2.0,inf)",375,0.089885,153,222,0.592,1.762133,0.389464,0.60362,inf,False


----------------------------------------------------------------------------------------------------
VALUE  : 


Unnamed: 0,variable,bin,count,count_distr,good,bad,badprob,woe,bin_iv,total_iv,breaks,is_special_values
0,VALUE,missing,78,0.018696,3,75,0.961538,4.608769,0.411314,0.565594,missing,True
1,VALUE,"[-inf,40000.0)",231,0.055369,149,82,0.354978,0.792667,0.042762,0.565594,40000.0,False
2,VALUE,"[40000.0,50000.0)",217,0.052013,155,62,0.285714,0.473603,0.013314,0.565594,50000.0,False
3,VALUE,"[50000.0,90000.0)",1649,0.395254,1346,303,0.183748,-0.101266,0.00393,0.565594,90000.0,False
4,VALUE,"[90000.0,125000.0)",1086,0.260307,944,142,0.130755,-0.504405,0.056474,0.565594,125000.0,False
5,VALUE,"[125000.0,170000.0)",464,0.111218,362,102,0.219828,0.123222,0.001751,0.565594,170000.0,False
6,VALUE,"[170000.0,200000.0)",212,0.050815,194,18,0.084906,-0.987593,0.035997,0.565594,200000.0,False
7,VALUE,"[200000.0,inf)",235,0.056328,187,48,0.204255,0.029986,5.1e-05,0.565594,inf,False


----------------------------------------------------------------------------------------------------
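
The `woe`, `bin_iv` and `total_iv` columns in the tables above can be reproduced from the `good` and `bad` counts alone. The sketch below does this for CLAGE, assuming the convention that the printed values imply: WOE is the log of the bin's share of all bads over its share of all goods, so a positive WOE marks a riskier bin.

```python
import numpy as np

# good / bad counts per CLAGE bin, copied from the CLAGE table above
good = np.array([155, 136, 1291, 979, 343, 436])
bad = np.array([54, 84, 417, 189, 31, 57])

good_distr = good / good.sum()   # share of all goods that fall in each bin
bad_distr = bad / bad.sum()      # share of all bads that fall in each bin

woe = np.log(bad_distr / good_distr)
bin_iv = (bad_distr - good_distr) * woe
total_iv = bin_iv.sum()

print(np.round(woe, 6))
print(np.round(bin_iv, 6))
print(round(total_iv, 6))
```

The results should agree with the CLAGE table to within rounding, for example a WOE of about 0.3355 for the missing bin and a total IV of about 0.2138.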


### Logistic regression with WOE encoding

Use `sc.woebin_ply` to replace each raw value with the WOE of the bin it falls into.

Then fit the logistic regression model on the WOE-encoded values.
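
Conceptually, WOE encoding is a lookup: find the bin a raw value falls into and substitute that bin's WOE. The sketch below illustrates the idea for DEROG with the breaks and WOE values from the binning output; `derog_to_woe` is a hypothetical helper for illustration, not the scorecardpy implementation.

```python
import numpy as np

# DEROG bins from the binning output: [-inf,1.0), [1.0,2.0), [2.0,inf), plus a missing bin
inner_breaks = np.array([1.0, 2.0])
bin_woe = np.array([-0.233648, 0.981765, 1.748839])
missing_woe = -0.560767

def derog_to_woe(values):
    """Hypothetical helper: map raw DEROG values to the WOE of their bin."""
    values = np.asarray(values, dtype=float)
    # side="right" makes the bins left-closed, e.g. 1.0 falls into [1.0,2.0)
    idx = np.searchsorted(inner_breaks, values, side="right")
    return np.where(np.isnan(values), missing_woe, bin_woe[idx])

encoded = derog_to_woe([0.0, 1.0, 3.0, np.nan])
print(encoded)
```

Because every value in a bin gets the same WOE, the encoded columns in `train_woe` contain only a handful of distinct values each.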

In [6]:
# sample code

# prepare a dataset with the WOE values for Logistic Regression training
# woebin_ply() converts original values of input data into woe
train_woe = sc.woebin_ply(train, bins)
test_woe = sc.woebin_ply(test, bins)
train_woe

[INFO] converting into woe values ...
[INFO] converting into woe values ...


Unnamed: 0,BAD,CLAGE_woe,REASON_woe,DEROG_woe,JOB_woe,DEBTINC_woe,LOAN_woe,CLNO_woe,NINQ_woe,YOJ_woe,DELINQ_woe,VALUE_woe
0,1,0.259807,0.177056,-0.233648,0.198965,1.894565,1.325945,0.667759,-0.039651,0.081210,-0.445745,0.792667
1,1,0.259807,0.177056,-0.233648,0.198965,1.894565,1.325945,-0.118164,-0.288345,-0.216439,1.762133,-0.101266
3,1,0.335453,-0.057025,-0.560767,-1.072540,1.894565,1.325945,0.206724,-0.333547,-0.401866,-0.514745,4.608769
4,0,0.259807,0.177056,-0.233648,-0.448386,1.894565,1.325945,-0.118164,-0.288345,0.240885,-0.445745,-0.504405
5,1,0.259807,0.177056,-0.233648,0.198965,-1.338278,1.325945,0.667759,-0.039651,-0.216439,-0.445745,0.473603
...,...,...,...,...,...,...,...,...,...,...,...,...
5951,0,-0.254891,-0.085788,-0.233648,0.198965,-1.338278,0.189499,-0.118164,-0.288345,0.081210,-0.445745,-0.504405
5952,0,-0.254891,-0.085788,-0.233648,0.198965,-1.338278,0.189499,-0.118164,-0.288345,0.081210,-0.445745,-0.504405
5955,0,-0.254891,-0.085788,-0.233648,0.198965,-1.338278,0.189499,-0.118164,-0.288345,0.081210,-0.445745,-0.504405
5957,0,-0.254891,-0.085788,-0.233648,0.198965,-1.338278,0.189499,-0.118164,-0.288345,0.081210,-0.445745,-0.504405


In [7]:
# sample code

# create the X, y parts of data for train and test
y_train = train_woe.loc[:, 'BAD']
X_train = train_woe.loc[:, train_woe.columns != 'BAD']
y_test = test_woe.loc[:, 'BAD']
X_test = test_woe.loc[:, X_train.columns]  # same feature columns, in the same order, as X_train

# create a logistic regression model object
lr = linear_model.LogisticRegression(class_weight='balanced')
lr.fit(X_train, y_train)
pd.Series(np.concatenate([lr.intercept_, lr.coef_[0]]),
          index = np.concatenate([['intercept'], lr.feature_names_in_]) )


intercept      0.017257
CLAGE_woe      1.059094
REASON_woe     0.064315
DEROG_woe      0.691944
JOB_woe        0.949626
DEBTINC_woe    0.952677
LOAN_woe       0.461214
CLNO_woe       0.951065
NINQ_woe       0.529535
YOJ_woe        1.069509
DELINQ_woe     0.957346
VALUE_woe      0.881628
dtype: float64

### Generate scorecard

Use `sc.scorecard` to generate the scorecard

In [8]:
# sample code

# generate a scorecard from the bins and the fitted model; points are derived from the model's log-odds of default
# bins = bins created by sc.woebin
# lr = fitted logistic regression model
# anchor the scale: 600 points corresponds to a probability of default of 5%
# odds0 = p/(1-p) = 0.05/(1-0.05) = 0.0526 ~= 1/19; pdo = 20 means every 20 points halves the odds of default
card = sc.scorecard(bins, lr, X_train.columns, points0 = 600, odds0 = 1/19, pdo = 20, basepoints_eq0 = True)

pprint.pprint(card)

{'CLAGE':   variable            bin  points
0    CLAGE        missing    37.0
1    CLAGE    [-inf,70.0)    19.0
2    CLAGE   [70.0,170.0)    39.0
3    CLAGE  [170.0,240.0)    55.0
4    CLAGE  [240.0,280.0)    78.0
5    CLAGE    [280.0,inf)    66.0,
 'CLNO':    variable          bin  points
26     CLNO      missing    41.0
27     CLNO  [-inf,10.0)    28.0
28     CLNO  [10.0,24.0)    50.0
29     CLNO  [24.0,33.0)    54.0
30     CLNO   [33.0,inf)    41.0,
 'DEBTINC':    variable          bin  points
18  DEBTINC      missing    -5.0
19  DEBTINC  [-inf,40.0)    84.0
20  DEBTINC  [40.0,42.0)    67.0
21  DEBTINC   [42.0,inf)    27.0,
 'DELINQ':    variable         bin  points
42   DELINQ     missing    61.0
43   DELINQ  [-inf,1.0)    59.0
44   DELINQ   [1.0,2.0)    26.0
45   DELINQ   [2.0,inf)    -2.0,
 'DEROG':    variable         bin  points
9     DEROG     missing    58.0
10    DEROG  [-inf,1.0)    51.0
11    DEROG   [1.0,2.0)    27.0
12    DEROG   [2.0,inf)    12.0,
 'JOB':    variable   
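
The points in the scorecard can be traced back to the model coefficients and the WOE values. The sketch below reproduces the CLAGE points, assuming the usual points-to-double-the-odds scaling and assuming that `basepoints_eq0 = True` spreads the base points evenly over the 11 variables; both assumptions are inferred from the printed numbers rather than taken from the library source.

```python
import numpy as np

# scaling parameters passed to sc.scorecard
points0, odds0, pdo = 600, 1 / 19, 20
b = pdo / np.log(2)                # points per unit of log-odds
a = points0 + b * np.log(odds0)    # offset so that odds0 maps to points0

# values copied from the model and binning output
intercept = 0.017257
coef_clage = 1.059094
n_vars = 11
clage_woe = np.array([0.335453, 0.908056, 0.259807, -0.254891, -1.01385, -0.644697])

# base points shared equally by all variables (assumed meaning of basepoints_eq0=True)
base_share = (a - b * intercept) / n_vars
clage_points = np.round(base_share - b * coef_clage * clage_woe)
print(clage_points)
```

Under these assumptions the result matches the CLAGE entries of the card (37, 19, 39, 55, 78, 66): bins with a higher WOE, meaning more bads, receive fewer points.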

**Ex Q1. Calculate the approval status for a new application**

Manually calculate the score and the approval status, given a cutoff score of 600, for an application with the following information:
- LOAN = 88,900
- VALUE = 57,264
- REASON = DebtCon
- JOB = Other
- YOJ = 16.0
- DEROG = 0
- DELINQ = 0
- CLAGE = 221.8
- NINQ = 0
- CLNO = 16
- DEBTINC = 36.1

Your answer here

Use `sc.scorecard_ply` to score a new application with the same values

In [9]:
# sample code

# calculate credit score for new application
col = ['LOAN','VALUE','REASON','JOB','YOJ','DEROG','DELINQ','CLAGE','NINQ','CLNO','DEBTINC']
val = [[88900,57264,'DebtCon','Other',16.0,0.0,0.0,221.8,0.0,16.0,36.1]]
new_appl = pd.DataFrame(val, columns = col)

new_appl_score = sc.scorecard_ply(new_appl, card, only_total_score = False).transpose()
new_appl_score.index = new_appl_score.index.str.replace('_points', '')

summary = pd.concat([new_appl.transpose(), new_appl_score], axis=1)
summary.columns = ['App Value', 'Points']
print(summary)

        App Value  Points
LOAN        88900    44.0
VALUE       57264    49.0
REASON    DebtCon    47.0
JOB         Other    41.0
YOJ          16.0    44.0
DEROG         0.0    51.0
DELINQ        0.0    59.0
CLAGE       221.8    55.0
NINQ          0.0    51.0
CLNO         16.0    50.0
DEBTINC      36.1    84.0
score         NaN   575.0


### Score all the test and train data

Use `sc.scorecard_ply` to score all the test and train data

In [12]:
# sample code

# credit score for samples in test and train
train_score = sc.scorecard_ply(train, card)
test_score = sc.scorecard_ply(test, card)

### Evaluate the model's performance

**Calculate Percentage Correctly Classified measures on the scorecard model**


In [11]:
# sample code

# check model performance at a cutoff with roughly 5:1 good:bad odds
# (560 = 600 - 2*pdo, so the odds of default are 4 * 1/19 = 4/19)
cutoff = 560

# create sets of predicted bad to compare with actual bad
predicted_bad_train = (train_score < cutoff)
predicted_bad_train_list = predicted_bad_train.astype(int).values.flatten().tolist()
predicted_bad_test = (test_score < cutoff)
predicted_bad_test_list = predicted_bad_test.astype(int).values.flatten().tolist()

print('*** Training Data Performance ***')
print('Confusion matrix:')
print(metrics.confusion_matrix(y_train, predicted_bad_train_list))
print('PCC measures:')
print(metrics.classification_report(y_train, predicted_bad_train_list))

print('*** Test Data Performance ***')
print('Confusion matrix:')
print(metrics.confusion_matrix(y_test, predicted_bad_test_list))
print('PCC measures:')
print(metrics.classification_report(y_test, predicted_bad_test_list))

*** Training Data Performance ***
Confusion matrix:
[[1981 1359]
 [  35  797]]
PCC measures:
              precision    recall  f1-score   support

           0       0.98      0.59      0.74      3340
           1       0.37      0.96      0.53       832

    accuracy                           0.67      4172
   macro avg       0.68      0.78      0.64      4172
weighted avg       0.86      0.67      0.70      4172

*** Test Data Performance ***
Confusion matrix:
[[882 549]
 [ 16 341]]
PCC measures:
              precision    recall  f1-score   support

           0       0.98      0.62      0.76      1431
           1       0.38      0.96      0.55       357

    accuracy                           0.68      1788
   macro avg       0.68      0.79      0.65      1788
weighted avg       0.86      0.68      0.72      1788
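
The classification report does not print specificity by name, but it can be read off the confusion matrix. The sketch below uses the training confusion matrix printed above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, with class 0 first).

```python
import numpy as np

# training confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[1981, 1359],
               [35, 797]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual bads that are predicted bad
specificity = tn / (tn + fp)   # share of actual goods that are predicted good
print(f"recall = {recall:.2f}, specificity = {specificity:.2f}")
```

Specificity is the same number as the recall reported for class 0, so both measures are already present in the report.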



**Ex Q2. Compare the train vs test model performance**

- How do the f1-scores for the training and test dataset compare?
- How do the recall and specificity compare?
- Does the model appear to be overfitting the training data? 

Your answer here

### Evaluate effect of changing the cutoff score

Examine the distribution of the scores

In [None]:
# combine scores for train and test data to assess distribution for entire population
combined_score = pd.concat([train_score, test_score], ignore_index=True)

# plot distribution of scores on combined data
combined_score.hist(figsize = (7, 4), bins = 60)
plt.tight_layout()

In [None]:
# sample code
cutoff = 560

# approve applications scoring at or above the cutoff (scores below it are predicted bad)
approval_count = train_score[train_score["score"]>=cutoff].count()['score']
approval_rate = approval_count/train_score.shape[0]
print(f'Cutoff score of {cutoff:.0f}: {approval_count:,.0f} applications approved ({approval_rate:.1%})')

In [None]:
# sample code

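# calculate expected number of defaults
# this treats every approved application as if it scored exactly at the cutoff, so it is an upper-bound estimate
odds_at_cutoff = 5  # good:bad odds at the cutoff score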

default_prob = 1/(1+odds_at_cutoff)
defaults = default_prob*approval_count
print(f'Cutoff score of {cutoff:.0f}: {defaults:.0f} defaults expected')
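
The value `odds_at_cutoff = 5` can be related to the scorecard scaling. The sketch below assumes the standard points-to-double-the-odds relationship with the parameters passed to `sc.scorecard` (600 points at odds of default 1/19, pdo of 20); `bad_odds_at` and `default_prob_at` are hypothetical helper names.

```python
# scaling parameters passed to sc.scorecard
points0, odds0, pdo = 600, 1 / 19, 20

def bad_odds_at(score):
    """Odds of default (bad:good) implied by a score; doubles for every pdo points lost."""
    return odds0 * 2 ** ((points0 - score) / pdo)

def default_prob_at(score):
    """Probability of default implied by a score."""
    odds = bad_odds_at(score)
    return odds / (1 + odds)

for score in (600, 560):
    good_to_bad = 1 / bad_odds_at(score)
    print(f"score {score}: good:bad odds = {good_to_bad:.2f}:1, PD = {default_prob_at(score):.1%}")
```

At a score of 560 the implied good:bad odds are 19/4 = 4.75 to 1, which the lab rounds to 5 to 1.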

**Ex Q3. Evaluate the effect of adjusting the cutoff score**

Change the cutoff score to 640
- What is the number of applications approved?
- What is the number of defaults expected? 
- How do the recall and specificity change?

In [None]:
# your code here

Your answer here

### DIY

**Use scorecardpy for your group assignment**