# Credit Risk Modelling in Python

In this notebook we predict the expected loss on a dataset that has already been preprocessed, using the models created in the previous notebooks. A raw input must first go through the whole Credit_Risk_Modelling_Preprocessing notebook before being brought here to predict the expected loss.

We recall from the previous notebooks that the expected loss (EL) can be calculated as:

EL = PD * LGD * EAD

- PD = Probability of Default. Likelihood that a borrower will be unable or unwilling to repay their debt in full or on time. It usually refers to a particular time horizon.
- LGD = Loss Given Default. Share of an asset that is lost if a borrower defaults, i.e. the proportion of the total exposure that cannot be recovered by the lender once a default has occurred.
- EAD = Exposure at Default. Total value that a lender is exposed to when a borrower defaults, i.e. the maximum that a bank may lose when a borrower defaults on a loan.
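
As a quick illustration of the formula, the sketch below computes the expected loss for a single hypothetical loan. The three input values are made up for the example and are not taken from the dataset.

```python
# Illustrative values only (hypothetical loan, not from the dataset)
pd_estimate = 0.10        # PD: 10% chance of default over the horizon
lgd_estimate = 0.40       # LGD: 40% of the exposure is lost on default
ead_estimate = 10_000.0   # EAD: amount exposed at the time of default

# EL = PD * LGD * EAD
expected_loss = pd_estimate * lgd_estimate * ead_estimate
print(f"Expected loss: {expected_loss:.2f}")
```

With these inputs the lender expects to lose about 400 on a 10,000 exposure, i.e. 4% of the exposed amount.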

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import pickle
from custom_model_classes import LogisticRegression_with_p_values, LinearRegression

## Read preprocessed data

In [2]:
loan_data_inputs_test = pd.read_csv('loan_data_inputs_test.csv', index_col = 0)
loan_data_inputs_test.drop(columns='Unnamed: 0',inplace=True)
loan_data_targets_test = pd.read_csv('loan_data_targets_test.csv', index_col = 0, header = None)

In [3]:
loan_data_inputs_test.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,dti:>35,mths_since_last_record:Missing,mths_since_last_record:0-2,mths_since_last_record:3-20,mths_since_last_record:21-31,mths_since_last_record:32-80,mths_since_last_record:81-86,mths_since_last_record:>86,purpose:educ__sm_b__wedd__ren_en__mov__house,purpose:oth__med__vacation
362514,19677589,21900299,32500,32500,32500.0,60 months,14.99,773.01,C,C5,...,0,1,0,0,0,0,0,0,0,0
288564,29755527,32278795,11000,11000,11000.0,60 months,20.99,297.53,E,E4,...,0,1,0,0,0,0,0,0,0,0
213591,1343334,1588314,30000,30000,30000.0,36 months,14.65,1034.83,C,C2,...,0,1,0,0,0,0,0,0,0,0
263083,33131681,35775007,14400,14400,14400.0,60 months,14.49,338.74,C,C4,...,0,1,0,0,0,0,0,0,0,0
165001,3293168,4066358,15000,15000,14900.0,36 months,8.9,476.3,A,A5,...,0,1,0,0,0,0,0,0,0,0


In [4]:
loan_data_targets_test.head()

0,1
362514,1
288564,1
213591,1
263083,1
165001,1


## Read variables to use in PD, saved PD model and predict PD

In [5]:
with open("features_list_final", "rb") as fp:
    features_list_final = pickle.load(fp)
    
with open("ref_categories_final", "rb") as fp:
    ref_categories_final = pickle.load(fp)

In [6]:
inputs_test = loan_data_inputs_test.loc[:,features_list_final]
inputs_test = inputs_test.drop(ref_categories_final, axis = 1)
inputs_test.head()

Unnamed: 0,grade:A,grade:B,grade:C,grade:D,grade:E,grade:F,home_ownership:OWN,home_ownership:MORTGAGE,addr_state:NM_VA,addr_state:NY,...,mths_since_last_delinq:Missing,mths_since_last_delinq:4-30,mths_since_last_delinq:31-56,mths_since_last_delinq:>=57,mths_since_last_record:Missing,mths_since_last_record:3-20,mths_since_last_record:21-31,mths_since_last_record:32-80,mths_since_last_record:81-86,mths_since_last_record:>86
362514,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
288564,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
213591,0,0,1,0,0,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
263083,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
165001,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


In [7]:
with open("pd_model.sav", "rb") as fp:
    pd_model = pickle.load(fp)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [8]:
pd_model.model.predict_proba(inputs_test)

array([[0.07586062, 0.92413938],
       [0.15166964, 0.84833036],
       [0.11439684, 0.88560316],
       ...,
       [0.02677902, 0.97322098],
       [0.04025101, 0.95974899],
       [0.04763147, 0.95236853]])

In our target we defined 0 as a bad borrower (default) and 1 as a good one (no default). Therefore the PD is the probability of being a bad borrower, i.e. the probability that the target takes the value 0, which is the first column of the array above.
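
scikit-learn classifiers order the columns of `predict_proba` according to their `classes_` attribute, so the column for class 0 can be looked up instead of hard-coded. The sketch below shows the idea with plain NumPy arrays. Here `classes` and `proba` are hypothetical stand-ins for `pd_model.model.classes_` and the `predict_proba` output, assuming the wrapped model is a scikit-learn classifier.

```python
import numpy as np

# Hypothetical stand-in for the fitted model's `classes_` attribute
classes = np.array([0, 1])

# Stand-in for predict_proba output: one row per loan, one column per class
proba = np.array([[0.07586062, 0.92413938],
                  [0.15166964, 0.84833036],
                  [0.11439684, 0.88560316]])

# Find the column that corresponds to class 0 (default)
default_col = int(np.flatnonzero(classes == 0)[0])

# PD is the probability of class 0 for every loan
pd_prob_example = proba[:, default_col]
print(pd_prob_example)
```

Looking the column up this way keeps the PD correct even if the target encoding or class order changes.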

In [9]:
# Column 0 holds the probability of class 0 (default)
pd_prob = pd_model.model.predict_proba(inputs_test)[:, 0]

In [10]:
pd_prob

array([0.07586062, 0.15166964, 0.11439684, ..., 0.02677902, 0.04025101,
       0.04763147])

In [11]:
loan_data_inputs_test['PD'] = pd_prob

## Read variables to use in LGD, saved LGD models and predict LGD

Because many observations had a recovery_rate of zero, we took a 2-step approach:
1) Is the recovery rate equal to 0 or greater than 0? -> Logistic regression
2) If the recovery rate is greater than 0, how much exactly is it? -> Linear regression
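
The two steps above can be sketched as follows, assuming both stages model the recovery rate as described. The arrays are hypothetical stand-ins for the stage 1 and stage 2 predictions. The final line uses the usual definition LGD = 1 - recovery rate, which is an assumption about how the estimate is meant to be used.

```python
import numpy as np

# Hypothetical stage 1 output: 1 if the recovery rate is predicted to be > 0, else 0
stage_1_flag = np.array([1, 0, 1])

# Hypothetical stage 2 output: predicted recovery rate, given that it is > 0
stage_2_rate = np.array([0.12564521, 0.16920196, 0.09723996])

# Stage 1 gates stage 2: a 0 flag forces the combined estimate to 0.
# A recovery rate is a proportion, so bound it to [0, 1].
recovery_rate = np.clip(stage_1_flag * stage_2_rate, 0, 1)

# Assumption: LGD is the share that is not recovered
lgd_from_recovery = 1 - recovery_rate
print(recovery_rate, lgd_from_recovery)
```

The clipping step guards against the linear regression predicting values outside the valid range for a proportion.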

### Stage 1

In [12]:
with open("features_lgd", "rb") as fp:
    features_lgd = pickle.load(fp)
    
with open("features_reference_cat_lgd", "rb") as fp:
    features_reference_cat_lgd = pickle.load(fp)

In [13]:
inputs_test_lgd = loan_data_inputs_test.loc[:,features_lgd]
inputs_test_lgd = inputs_test_lgd.drop(features_reference_cat_lgd, axis = 1)
inputs_test_lgd.head()

Unnamed: 0,grade:A,grade:B,grade:C,grade:D,grade:E,grade:F,home_ownership:MORTGAGE,home_ownership:NONE,home_ownership:OTHER,home_ownership:OWN,...,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,total_acc,acc_now_delinq,total_rev_hi_lim
362514,0,0,1,0,0,0,1,0,0,0,...,9.32,0.0,0.0,,,7.0,0.0,20.0,0.0,25100.0
288564,0,0,0,0,1,0,1,0,0,0,...,24.19,1.0,0.0,3.0,,18.0,0.0,43.0,0.0,24000.0
213591,0,0,1,0,0,0,1,0,0,0,...,13.34,0.0,0.0,41.0,,9.0,0.0,38.0,0.0,30000.0
263083,0,0,1,0,0,0,1,0,0,0,...,17.69,0.0,0.0,,,12.0,0.0,31.0,0.0,17400.0
165001,1,0,0,0,0,0,1,0,0,0,...,8.3,0.0,0.0,39.0,,7.0,0.0,20.0,0.0,8700.0


In [14]:
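# Assign the result back: an in-place fillna on a selected column may act on a copy
inputs_test_lgd['mths_since_last_delinq'] = inputs_test_lgd['mths_since_last_delinq'].fillna(0)
inputs_test_lgd['mths_since_last_record'] = inputs_test_lgd['mths_since_last_record'].fillna(0)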

In [15]:
with open("lgd_model_stage_1.sav", "rb") as fp:
    lgd_model_stage_1 = pickle.load(fp)


In [16]:
lgd_model_stage_1.model.predict(inputs_test_lgd)

array([1, 0, 1, ..., 1, 1, 1])

In [17]:
loan_data_inputs_test['LGD_1'] = lgd_model_stage_1.model.predict(inputs_test_lgd)

### Stage 2

In [18]:
with open("lgd_model_stage_2.sav", "rb") as fp:
    lgd_model_stage_2 = pickle.load(fp)

In [19]:
lgd_model_stage_2.predict(inputs_test_lgd)

array([0.12564521, 0.16920196, 0.09723996, ..., 0.10792421, 0.10086725,
       0.08829037])

In [20]:
loan_data_inputs_test['LGD_2'] = lgd_model_stage_2.predict(inputs_test_lgd)

### Combining the 2 stages

In [21]:
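# Stage 1 (0/1) gates the stage 2 estimate.
# Caution: if both stages were fitted on recovery_rate, as described above,
# this product is an estimated recovery rate and LGD would be 1 minus it.
loan_data_inputs_test['LGD'] = loan_data_inputs_test['LGD_1']*loan_data_inputs_test['LGD_2']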

## Read variables to use in EAD, saved EAD model and predict EAD

In [22]:
with open("reg_ead.sav", "rb") as fp:
    reg_ead = pickle.load(fp)

In [25]:
# Same variables used for lgd are used for ead model.
# The output of the model is the credit conversion factor
loan_data_inputs_test['ccf'] = reg_ead.predict(inputs_test_lgd)
# The ccf is a proportion, so bound it to the interval [0, 1]
loan_data_inputs_test['ccf'] = loan_data_inputs_test['ccf'].clip(lower = 0, upper = 1)
# The EAD is calculated as ccf * initial funded amount
loan_data_inputs_test['EAD'] = loan_data_inputs_test['ccf']*loan_data_inputs_test['funded_amnt']
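
To make the bounding step concrete, here is a minimal sketch on made-up numbers. `ccf_raw` and `funded_amnt` are hypothetical stand-ins for the regression output and the funded amounts.

```python
import numpy as np

# Hypothetical raw regression outputs: a linear model can fall outside [0, 1]
ccf_raw = np.array([-0.05, 0.64, 1.10])

# Hypothetical funded amounts for the same three loans
funded_amnt = np.array([10_000, 30_000, 15_000])

# Bound the credit conversion factor to [0, 1]
ccf = np.clip(ccf_raw, 0, 1)

# EAD = ccf * funded amount
ead = ccf * funded_amnt
print(ead)
```

After clipping, the EAD can never be negative and can never exceed the funded amount.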

## With all estimations, calculate the final Expected Loss

In [26]:
loan_data_inputs_test['EL'] = loan_data_inputs_test['PD'] * loan_data_inputs_test['LGD'] * loan_data_inputs_test['EAD']

In [28]:
loan_data_inputs_test[['member_id','funded_amnt','PD','LGD','EAD','EL']]

Unnamed: 0,member_id,funded_amnt,PD,LGD,EAD,EL
362514,21900299,32500,0.075861,0.125645,28790.530994,274.417637
288564,32278795,11000,0.151670,0.000000,9758.399675,0.000000
213591,1588314,30000,0.114397,0.097240,19193.946020,213.512381
263083,35775007,14400,0.059426,0.000000,13087.245790,0.000000
165001,4066358,15000,0.031304,0.091791,8772.848036,25.207839
...,...,...,...,...,...,...
115,1287293,8000,0.120097,0.067547,4455.528534,36.144358
296284,31546575,24000,0.033560,0.000000,21004.356848,0.000000
61777,10787405,35000,0.026779,0.107924,23393.376294,67.609308
91763,8724517,18000,0.040251,0.100867,11368.069624,46.154457
