You have now moved to a new team assisting the retail banking arm, which has been experiencing higher-than-expected default rates on personal loans. Loans are an important source of revenue for banks, but they are also associated with the risk that borrowers may default on their loans. A default occurs when a borrower stops making the required payments on a debt.

The risk team has begun to look at the existing book of loans to see if more defaults should be expected in the future and, if so, what the expected loss will be. They have collected data on customers and now want to build a predictive model that can estimate the probability of default based on customer characteristics. A better estimate of the number of customers defaulting on their loan obligations will allow us to set aside sufficient capital to absorb that loss. They have decided to work with you in the QR team to help predict the possible losses due to the loans that would potentially default in the next year.

Charlie, an associate in the risk team, who has been introducing you to the business area, sends you a small sample of their loan book and asks if you can try building a prototype predictive model, which she can then test and incorporate into their loss allowances.

Here is your task
The risk manager has collected data on the loan borrowers. The data is in tabular format, with each row providing details of the borrower, including their income, total loans outstanding, and a few other metrics. There is also a column indicating if the borrower has previously defaulted on a loan. You must use this data to build a model that, given details for any loan described above, will predict the probability that the borrower will default (also known as PD: the probability of default). Use the provided data to train a function that will estimate the probability of default for a borrower. Assuming a recovery rate of 10%, this can be used to give the expected loss on a loan.

You should produce a function that can take in the properties of a loan and output the expected loss.
You can explore any technique ranging from a simple regression or a decision tree to something more advanced. You can also use multiple methods and provide a comparative analysis.
Submit your code below.

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('Task 3 and 4_Loan_Data.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               10000 non-null  int64  
 1   credit_lines_outstanding  10000 non-null  int64  
 2   loan_amt_outstanding      10000 non-null  float64
 3   total_debt_outstanding    10000 non-null  float64
 4   income                    10000 non-null  float64
 5   years_employed            10000 non-null  int64  
 6   fico_score                10000 non-null  int64  
 7   default                   10000 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 625.1 KB


In [6]:
df.head()

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
0,8153374,0,5221.545193,3915.471226,78039.38546,5,605,0
1,7442532,5,1958.928726,8228.75252,26648.43525,2,572,1
2,2256073,0,3363.009259,2027.83085,65866.71246,4,602,0
3,4885975,0,4766.648001,2501.730397,74356.88347,5,612,0
4,4700614,1,1345.827718,1768.826187,23448.32631,6,631,0


In [7]:
df.tail()

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
9995,3972488,0,3033.647103,2553.733144,42691.62787,5,697,0
9996,6184073,1,4146.239304,5458.163525,79969.50521,8,615,0
9997,6694516,2,3088.223727,4813.090925,38192.67591,5,596,0
9998,3942961,0,3288.901666,1043.09966,50929.37206,2,647,0
9999,5533570,1,1917.65248,3050.248203,30611.62821,6,757,0


In [70]:
df.describe()

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,4974577.0,1.4612,4159.677034,8718.916797,70039.901401,4.5528,637.5577,0.1851
std,2293890.0,1.743846,1421.399078,6627.164762,20072.214143,1.566862,60.657906,0.388398
min,1000324.0,0.0,46.783973,31.652732,1000.0,0.0,408.0,0.0
25%,2977661.0,0.0,3154.235371,4199.83602,56539.867903,3.0,597.0,0.0
50%,4989502.0,1.0,4052.377228,6732.407217,70085.82633,5.0,638.0,0.0
75%,6967210.0,2.0,5052.898103,11272.26374,83429.166133,6.0,679.0,0.0
max,8999789.0,5.0,10750.67781,43688.7841,148412.1805,10.0,850.0,1.0


In [7]:
df.corr()

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
customer_id,1.0,0.006729,-0.013857,0.003541,-0.008064,-0.008098,0.008044,0.006927
credit_lines_outstanding,0.006729,1.0,0.080249,0.85221,0.022272,-0.0879,-0.258177,0.862815
loan_amt_outstanding,-0.013857,0.080249,1.0,0.397403,0.835815,-0.158416,-0.031373,0.098978
total_debt_outstanding,0.003541,0.85221,0.397403,1.0,0.394397,-0.174353,-0.232246,0.758868
income,-0.008064,0.022272,0.835815,0.394397,1.0,0.001814,-0.010528,0.016309
years_employed,-0.008098,-0.0879,-0.158416,-0.174353,0.001814,1.0,0.255873,-0.284506
fico_score,0.008044,-0.258177,-0.031373,-0.232246,-0.010528,0.255873,1.0,-0.324515
default,0.006927,0.862815,0.098978,0.758868,0.016309,-0.284506,-0.324515,1.0
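The `default` column of this matrix is the part that matters for feature selection. As a minimal sketch, the values below are copied from the output above into a Series (in the notebook itself, `df.corr()['default']` should give the same numbers) and ranked by absolute correlation:

```python
import pandas as pd

# Correlation of each column with `default`, copied from the df.corr() output above
corr_with_default = pd.Series({
    "customer_id": 0.006927,
    "credit_lines_outstanding": 0.862815,
    "loan_amt_outstanding": 0.098978,
    "total_debt_outstanding": 0.758868,
    "income": 0.016309,
    "years_employed": -0.284506,
    "fico_score": -0.324515,
})

# Rank by strength of the linear relationship, ignoring the sign
ranked = corr_with_default.reindex(
    corr_with_default.abs().sort_values(ascending=False).index
)
print(ranked)
```

The four strongest entries are the four features used in the models further down. Correlation only captures linear association, so this is a screening aid rather than a selection rule.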


In [25]:
df[ df['default'] == 0] 

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
0,8153374,0,5221.545193,3915.471226,78039.38546,5,605,0
2,2256073,0,3363.009259,2027.830850,65866.71246,4,602,0
3,4885975,0,4766.648001,2501.730397,74356.88347,5,612,0
4,4700614,1,1345.827718,1768.826187,23448.32631,6,631,0
5,4661159,0,5376.886873,7189.121298,85529.84591,2,697,0
...,...,...,...,...,...,...,...,...
9995,3972488,0,3033.647103,2553.733144,42691.62787,5,697,0
9996,6184073,1,4146.239304,5458.163525,79969.50521,8,615,0
9997,6694516,2,3088.223727,4813.090925,38192.67591,5,596,0
9998,3942961,0,3288.901666,1043.099660,50929.37206,2,647,0


In [27]:
df.sort_values(by='total_debt_outstanding', ascending=False)

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
2133,6869178,5,8276.572480,43688.784100,121048.353200,2,701,1
9203,4836461,5,9105.964213,42558.451490,133913.382300,3,601,1
8080,2335501,5,8048.848585,41978.368840,116584.130800,5,513,1
4658,4435910,5,6694.514660,41095.348330,126382.503600,2,649,1
3204,6184791,5,7177.776071,40614.444320,124551.109300,5,593,1
...,...,...,...,...,...,...,...,...
4509,7237328,1,50.203718,103.133734,1000.000000,6,615,0
4413,1267580,0,57.706078,94.426487,1000.000000,4,634,0
6795,7208811,1,80.689935,78.613971,1053.763493,6,587,0
6558,4919219,0,57.348647,57.214220,1000.000000,3,628,0


In [66]:
my_input = 'fico_score'

df[ df[my_input] <= 550].sort_values(by=my_input, ascending=False)

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
992,1146826,1,3988.563878,7287.83646,56036.14865,3,550,0
8736,3933004,5,4789.405005,19576.93980,79943.21840,6,550,1
1112,7832950,5,6165.982666,25371.39523,100796.46540,3,550,1
5858,2505345,5,4864.976346,19722.49110,72277.30041,4,550,1
915,7273820,5,4142.745291,21820.94088,75350.08916,4,550,1
...,...,...,...,...,...,...,...,...
2629,1337395,5,4271.314690,22756.28103,83475.30929,4,438,1
5521,1252008,5,5176.915602,22990.26543,82417.59227,2,425,1
7001,2585781,4,6734.984475,26384.58439,97668.03091,2,418,1
6556,6901345,3,5281.352243,16411.51801,79905.09892,1,409,1


In [193]:
# will use RandomForestRegressor model

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


# output / target var / dependent var
y = df['default']

# inputs / features / independent var
# fico_score - listed as a driver of default probability in a Copilot search; by observation, the lower the score, the more likely the default (notably below 550)
# credit_lines_outstanding - by observation, borrowers who defaulted typically have several credit lines outstanding
# total_debt_outstanding - by observation, the higher the total debt, the more likely the default (notably above 15000)
# years_employed - by observation, the fewer the years employed, the more likely the default (notably below 4)
features = ['fico_score', 'credit_lines_outstanding', 'total_debt_outstanding','years_employed']

X = df[features]


# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=7)

# Define the model. Set random_state to 1
loan_default_model = RandomForestRegressor(random_state=1)

# Fit Model
loan_default_model.fit(train_X, train_y)

# Make validation predictions (the mean absolute error is computed in a later cell)
val_predictions = loan_default_model.predict(val_X)




In [194]:
val_X

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed
1977,697,1,8038.248905,5
3880,567,2,18391.211750,3
52,610,1,4179.967159,5
2551,516,0,4709.254935,5
2246,743,0,2329.688758,2
...,...,...,...,...
1355,679,1,5796.005855,6
4856,713,0,6713.580429,4
2392,566,1,14473.137850,1
3734,772,1,2860.209762,8


In [195]:
pd.merge(val_X, pd.DataFrame(index=val_X.index,data=val_predictions, columns=['predicted default']), left_index=True, right_index=True)

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed,predicted default
1977,697,1,8038.248905,5,0.00
3880,567,2,18391.211750,3,0.00
52,610,1,4179.967159,5,0.00
2551,516,0,4709.254935,5,0.00
2246,743,0,2329.688758,2,0.00
...,...,...,...,...,...
1355,679,1,5796.005855,6,0.00
4856,713,0,6713.580429,4,0.00
2392,566,1,14473.137850,1,0.03
3734,772,1,2860.209762,8,0.00


In [196]:
final_df = pd.concat([val_X, pd.DataFrame(index=val_X.index,data=val_predictions, columns=['predicted default']), pd.DataFrame(data=df['default'].loc[val_X.index]) ],axis=1)


final_df

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed,predicted default,default
1977,697,1,8038.248905,5,0.00,0
3880,567,2,18391.211750,3,0.00,0
52,610,1,4179.967159,5,0.00,0
2551,516,0,4709.254935,5,0.00,0
2246,743,0,2329.688758,2,0.00,0
...,...,...,...,...,...,...
1355,679,1,5796.005855,6,0.00,0
4856,713,0,6713.580429,4,0.00,0
2392,566,1,14473.137850,1,0.03,0
3734,772,1,2860.209762,8,0.00,0


In [197]:
# sanity check: the 'default' values carried into final_df match the original dataframe at the same index labels

compare_df = ( final_df['default'].loc[final_df.index] == df['default'].loc[final_df.index] )

compare_df

1977    True
3880    True
52      True
2551    True
2246    True
        ... 
1355    True
4856    True
2392    True
3734    True
7467    True
Name: default, Length: 2500, dtype: bool

In [198]:
compare_df.value_counts() # all True, confirmed

default
True    2500
Name: count, dtype: int64

In [199]:
print(f'Mean Absolute Error = {mean_absolute_error(val_y, val_predictions)}')

Mean Absolute Error = 0.004843999999999999


In [143]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2500 entries, 1977 to 7467
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   fico_score                2500 non-null   int64  
 1   credit_lines_outstanding  2500 non-null   int64  
 2   total_debt_outstanding    2500 non-null   float64
 3   years_employed            2500 non-null   int64  
 4   predicted default         2500 non-null   float64
 5   default                   2500 non-null   int64  
dtypes: float64(2), int64(4)
memory usage: 201.3 KB


In [148]:
# rows where the raw Random Forest output differs from the 0/1 label (any fractional prediction counts as a mismatch)

final_df[ ~(final_df['default'] == final_df['predicted default']) ]

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed,predicted default,default
3534,521,3,15241.377410,5,0.11,0
2170,622,3,19691.359430,4,0.32,0
6931,629,3,16001.271260,4,0.01,0
3552,691,3,14744.247130,4,0.26,0
6088,602,3,19622.360860,5,0.04,0
...,...,...,...,...,...,...
3080,616,5,9935.454440,6,0.59,1
9735,692,4,19607.013500,6,0.92,0
7422,650,3,11115.935330,3,0.42,1
7213,650,1,5819.928334,0,0.03,0
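The filter above flags every row whose raw regressor output is not exactly 0 or 1, so a score of 0.01 for a non-defaulter counts as a mismatch. One way to compare like with like is to threshold the score first. The sketch below does this on the nine rows displayed above; the 0.5 cut-off is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Rows copied from the output above (index, predicted default, default)
rows = pd.DataFrame(
    {
        "predicted default": [0.11, 0.32, 0.01, 0.26, 0.04, 0.59, 0.92, 0.42, 0.03],
        "default": [0, 0, 0, 0, 0, 1, 0, 1, 0],
    },
    index=[3534, 2170, 6931, 3552, 6088, 3080, 9735, 7422, 7213],
)

THRESHOLD = 0.5  # assumed cut-off for turning a score into a 0/1 label

rows["predicted label"] = (rows["predicted default"] >= THRESHOLD).astype(int)
misclassified = rows[rows["predicted label"] != rows["default"]]
print(misclassified)
```

Under that assumed threshold, only two of the displayed rows remain genuine disagreements; the rest are small non-zero scores for borrowers who did not default.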


In [191]:
# might need to use logistic regression - useful for categorical/binary outputs i.e. yes/no, 0/1, demo/rep/ind
# theory: https://www.youtube.com/watch?v=zM4VZR0px8E

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, mean_absolute_error

# X (features) and y (binary target) are the ones defined in the Random Forest cell above

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Initialize the logistic regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)


print(f'Mean Absolute Error = {mean_absolute_error(y_test, y_pred)}')
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)



Mean Absolute Error = 0.0032
Accuracy: 1.00
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2039
           1       0.99      1.00      0.99       461

    accuracy                           1.00      2500
   macro avg       0.99      1.00      0.99      2500
weighted avg       1.00      1.00      1.00      2500



In [159]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2500 entries, 1977 to 7467
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   fico_score                2500 non-null   int64  
 1   credit_lines_outstanding  2500 non-null   int64  
 2   total_debt_outstanding    2500 non-null   float64
 3   years_employed            2500 non-null   int64  
dtypes: float64(1), int64(3)
memory usage: 97.7 KB


In [165]:
# check if X_test and val_X are the same
pd.DataFrame(data=(X_test.index == val_X.index)).value_counts() #confirmed

True    2500
Name: count, dtype: int64

In [169]:
X_test

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed
1977,697,1,8038.248905,5
3880,567,2,18391.211750,3
52,610,1,4179.967159,5
2551,516,0,4709.254935,5
2246,743,0,2329.688758,2
...,...,...,...,...
1355,679,1,5796.005855,6
4856,713,0,6713.580429,4
2392,566,1,14473.137850,1
3734,772,1,2860.209762,8


In [180]:
final_df2 = pd.concat([X_test, pd.DataFrame(data=y_pred,columns=['default predicted'], index=X_test.index), pd.DataFrame(data=df['default'].loc[X_test.index]) ], axis=1)

In [179]:
final_df2

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed,default predicted,default
1977,697,1,8038.248905,5,0,0
3880,567,2,18391.211750,3,0,0
52,610,1,4179.967159,5,0,0
2551,516,0,4709.254935,5,0,0
2246,743,0,2329.688758,2,0,0
...,...,...,...,...,...,...
1355,679,1,5796.005855,6,0,0
4856,713,0,6713.580429,4,0,0
2392,566,1,14473.137850,1,0,0
3734,772,1,2860.209762,8,0,0


In [185]:
comparison_df = ( final_df2['default predicted'] == final_df2['default'] )

comparison_df.value_counts()

True     2492
False       8
Name: count, dtype: int64
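With 0/1 labels and 0/1 predictions, the mean absolute error is simply the share of misclassified rows, which ties the counts above to the metrics printed earlier. A quick arithmetic check, using only numbers shown in the outputs:

```python
n_test = 2500   # rows in the test split
n_wrong = 8     # rows where the predicted label differs from the actual one

# Each wrong row contributes |0 - 1| = 1 to the error sum, each correct row 0
mae = n_wrong / n_test
accuracy = 1 - mae

print(mae)
print(f"{accuracy:.2f}")
```

So the reported accuracy of 1.00 is 0.9968 rounded to two decimals, not a perfect score.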

In [188]:
( y_test == final_df2['default'] ).value_counts()

default
True    2500
Name: count, dtype: int64

In [551]:
df

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
0,8153374,0,5221.545193,3915.471226,78039.38546,5,605,0
1,7442532,5,1958.928726,8228.752520,26648.43525,2,572,1
2,2256073,0,3363.009259,2027.830850,65866.71246,4,602,0
3,4885975,0,4766.648001,2501.730397,74356.88347,5,612,0
4,4700614,1,1345.827718,1768.826187,23448.32631,6,631,0
...,...,...,...,...,...,...,...,...
9995,3972488,0,3033.647103,2553.733144,42691.62787,5,697,0
9996,6184073,1,4146.239304,5458.163525,79969.50521,8,615,0
9997,6694516,2,3088.223727,4813.090925,38192.67591,5,596,0
9998,3942961,0,3288.901666,1043.099660,50929.37206,2,647,0


In [552]:
# GIVEN THE ABOVE LOGISTIC REG MODEL, FIGURE OUT HOW TO CALCULATE PD AND EXPECTED LOSS
# theory: https://www.wallstreetmojo.com/probability-of-default/

# Probability of Default (PD) - likelihood that the borrower will default
# Recovery Rate (RR) - % of borrowed capital that can be recovered
# Loss Given Default (LGD) - % of borrowed capital that is lost or cannot be recovered, i.e. LGD = 1 - RR
# Exposure at Default (EAD) - $ amount the lender is exposed to when the borrower defaults; a.k.a. "borrowed capital"

# assumptions: RR = 10%
# find: (1) PD as a function of the borrower's characteristics (the logistic function below; a probability, not a probability density)
#       (2) Expected Loss = PD * EAD * LGD
#             source:  https://fastercapital.com/topics/expected-loss-(el).html#:~:text=from%20various%20perspectives.-,1.,loss%20given%20default%20(LGD).

# (1) PD(z) = 1 / (1 + e^(-z)), where z = β0 + β1*x1 + β2*x2 + … + βn*xn
#                               where β1, β2, …, βn are the coefficients associated with each borrower characteristic.
#                               where x1, x2, …, xn are the values of the borrower’s characteristics.
#           how to extract β from the logistic regression model? - model.coef_ for β1, …, βn
#                                                                - model.intercept_ for the intercept β0
#           the x values are simply the feature columns of the input rows (e.g. X_test)
# how is β calculated? - by maximum likelihood; the tutorial linked here uses stochastic gradient descent, whereas scikit-learn's LogisticRegression uses the lbfgs solver by default: https://machinelearningmastery.com/logistic-regression-tutorial-for-machine-learning/

# (2) given RR = 10%, Loss Given Default (LGD) = 90%, so Expected Loss = PD * EAD * 0.90




In [458]:
X_test

Unnamed: 0,fico_score,credit_lines_outstanding,total_debt_outstanding,years_employed
1977,697,1,8038.248905,5
3880,567,2,18391.211750,3
52,610,1,4179.967159,5
2551,516,0,4709.254935,5
2246,743,0,2329.688758,2
...,...,...,...,...
1355,679,1,5796.005855,6
4856,713,0,6713.580429,4
2392,566,1,14473.137850,1
3734,772,1,2860.209762,8


In [574]:
model_coef = model.coef_ # coefficients of the fitted logistic regression as a NumPy array of shape (1, n_features)
                         # array([[β1, β2, ..., βn]])
intercept = model.intercept_[0] # β0

input_df = X_test

# guideline: keep the inputs as close as possible to what the model exposes
# inputs: model_coef, intercept, input_df (e.g. X_test)
# output: output_df - DataFrame indexed like input_df (i.e. like the original df), with columns:
#                        'z': linear score β0 + β1*x1 + ... + βn*xn of each row (borrower)
#                        'PD': probability of default of each row (borrower)


def calculate_PD(model_coef, intercept, input_df):

    # label the coefficients with the input_df columns; IMPORTANT: the columns must be in the order used to fit the model
    betas_ser = pd.Series(data=np.ravel(model_coef), index=input_df.columns)

    # z = β0 + β1*x1 + β2*x2 + ... + βn*xn for every row at once (dot product aligned on column names)
    z = input_df.dot(betas_ser) + intercept

    output_df = pd.DataFrame({'z': z}, index=input_df.index)
    output_df['PD'] = 1 / (1 + np.exp(-output_df['z'])) # logistic function

    return output_df
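The hand-rolled calculation in `calculate_PD` can be cross-checked against scikit-learn's own `predict_proba`, which for a binary `LogisticRegression` should return the same probabilities. The sketch below uses synthetic data with made-up column names (`feature_a`, `feature_b`), purely so that it runs on its own; it is not the loan data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only; column names and values are made up
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
noise = rng.normal(scale=0.5, size=200)
y = (X["feature_a"] + 0.5 * X["feature_b"] + noise > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Manual route: z = β0 + Σ βi * xi, then the logistic function
z = X.to_numpy() @ model.coef_.ravel() + model.intercept_[0]
pd_manual = 1 / (1 + np.exp(-z))

# Built-in route: column 1 of predict_proba is P(class == 1)
pd_builtin = model.predict_proba(X)[:, 1]

print(np.allclose(pd_manual, pd_builtin))
```

In the notebook, `model.predict_proba(X_test)[:, 1]` would play the role of the `PD` column, which makes the manual version mainly useful as a way to understand what the model is doing.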


In [575]:
analysis_df = calculate_PD(model_coef, intercept, input_df)



In [576]:
analysis_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2500 entries, 1977 to 7467
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   z       2500 non-null   float64
 1   PD      2500 non-null   float64
dtypes: float64(2)
memory usage: 58.6 KB


In [577]:
analysis_df.head()

Unnamed: 0,z,PD
1977,-22.248593,2.175499e-10
3880,-5.054286,0.006341452
52,-20.335657,1.473453e-09
2551,-25.821968,6.104663e-12
2246,-23.765253,4.773992e-11
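To see how `z` turns into `PD`, the logistic function can be applied by hand to a few of the `z` values shown above. This is only a small check of the formula PD = 1 / (1 + e^(-z)); the helper name `sigmoid` is introduced here for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


# z values copied from the analysis_df.head() output above
for z in (-22.248593, -5.054286, -20.335657):
    print(z, sigmoid(z))
```

A score of z = 0 corresponds to PD = 0.5, which is the boundary `model.predict` uses to assign the 0/1 label.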


In [579]:
df.head()

Unnamed: 0,customer_id,credit_lines_outstanding,loan_amt_outstanding,total_debt_outstanding,income,years_employed,fico_score,default
0,8153374,0,5221.545193,3915.471226,78039.38546,5,605,0
1,7442532,5,1958.928726,8228.75252,26648.43525,2,572,1
2,2256073,0,3363.009259,2027.83085,65866.71246,4,602,0
3,4885975,0,4766.648001,2501.730397,74356.88347,5,612,0
4,4700614,1,1345.827718,1768.826187,23448.32631,6,631,0


In [585]:
# calculating expected loss requires the loan amount outstanding column from the original df
loan_amount_df = pd.DataFrame(data=df['loan_amt_outstanding'].loc[input_df.index])

loan_amount_df

Unnamed: 0,loan_amt_outstanding
1977,3197.668552
3880,6159.160924
52,1888.053856
2551,3223.384312
2246,3263.762985
...,...
1355,3194.125308
4856,2969.221427
2392,4865.029134
3734,3500.838904


In [587]:
analysis_df = pd.concat([analysis_df, loan_amount_df],axis=1)

analysis_df

Unnamed: 0,z,PD,loan_amt_outstanding
1977,-22.248593,2.175499e-10,3197.668552
3880,-5.054286,6.341452e-03,6159.160924
52,-20.335657,1.473453e-09,1888.053856
2551,-25.821968,6.104663e-12,3223.384312
2246,-23.765253,4.773992e-11,3263.762985
...,...,...,...
1355,-24.602487,2.066691e-11,3194.125308
4856,-27.989176,6.989649e-13,2969.221427
2392,-7.906449,3.682247e-04,4865.029134
3734,-32.445364,8.112554e-15,3500.838904


In [588]:
# Expected Loss = PD * EAD * LGD = PD * loan amount * (1 - RR) = PD * loan amount * 0.90
# portfolio Expected Loss = take the sum of the Expected Loss column

analysis_df['Expected Loss'] = analysis_df['PD'] * analysis_df['loan_amt_outstanding'] * 0.90
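The task asks for a function that takes the properties of a single loan and returns its expected loss. A minimal sketch of such a function is shown below, assuming the model is passed in as a coefficient mapping plus an intercept. The function names and the coefficient values are hypothetical and chosen only so that the example runs; they are not the values fitted in this notebook.

```python
import math

RECOVERY_RATE = 0.10  # assumption given in the task statement


def probability_of_default(loan, coefficients, intercept):
    """Logistic PD for one loan; `loan` and `coefficients` are dicts keyed by feature name."""
    z = intercept + sum(beta * loan[name] for name, beta in coefficients.items())
    # Numerically stable logistic function (avoids overflow for large |z|)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def expected_loss(loan, coefficients, intercept, recovery_rate=RECOVERY_RATE):
    """Expected loss = PD * EAD * LGD, with EAD = loan_amt_outstanding and LGD = 1 - RR."""
    pd_ = probability_of_default(loan, coefficients, intercept)
    return pd_ * loan["loan_amt_outstanding"] * (1.0 - recovery_rate)


# Hypothetical coefficients for illustration only, NOT the values fitted in this notebook
example_coefficients = {
    "fico_score": -0.02,
    "credit_lines_outstanding": 2.0,
    "total_debt_outstanding": 0.0001,
    "years_employed": -0.5,
}
example_intercept = 5.0

# Borrower at index 1 of df.head() above
loan = {
    "fico_score": 572,
    "credit_lines_outstanding": 5,
    "total_debt_outstanding": 8228.75252,
    "years_employed": 2,
    "loan_amt_outstanding": 1958.928726,
}
print(expected_loss(loan, example_coefficients, example_intercept))
```

With the fitted model, the coefficient mapping could be built as `dict(zip(X_train.columns, model.coef_[0]))` and the intercept taken from `model.intercept_[0]`.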

In [591]:
analysis_df[ (analysis_df['PD'] >= 0.50) & (analysis_df['PD'] <= 0.70)  ]

Unnamed: 0,z,PD,loan_amt_outstanding,Expected Loss
4484,0.61287,0.648595,5435.881905,3173.117852
6234,0.620577,0.65035,2993.392193,1752.076898
4575,0.036081,0.509019,3015.903829,1381.638015
7148,0.778042,0.685258,4206.631181,2594.364798
965,0.163987,0.540905,4197.826085,2043.563162
7739,0.425456,0.604788,3051.61728,1661.02332
222,0.420759,0.603665,3872.037505,2103.671841
9661,0.310216,0.576938,3486.675626,1810.436335
7828,0.153006,0.538177,4465.159228,2162.741935
1335,0.613058,0.648638,3828.030126,2234.705073
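The `Expected Loss` column can be sanity-checked by recomputing PD * loan amount * 0.90 for a few of the rows above, and the portfolio figure is then just the column sum. The sketch below uses four rows copied from the output; because the displayed PDs are rounded, the comparison allows a small tolerance.

```python
import numpy as np
import pandas as pd

LGD = 0.90  # 1 - recovery rate of 10%

# Four rows copied from the output above
sample = pd.DataFrame(
    {
        "PD": [0.648595, 0.65035, 0.509019, 0.685258],
        "loan_amt_outstanding": [5435.881905, 2993.392193, 3015.903829, 4206.631181],
        "Expected Loss": [3173.117852, 1752.076898, 1381.638015, 2594.364798],
    },
    index=[4484, 6234, 4575, 7148],
)

recomputed = sample["PD"] * sample["loan_amt_outstanding"] * LGD

# The displayed PDs are rounded, so allow a small absolute tolerance
print(np.allclose(recomputed, sample["Expected Loss"], atol=0.01))

# Expected loss of this four-loan subset only; on the full frame the
# portfolio figure would be analysis_df['Expected Loss'].sum()
subset_expected_loss = sample["Expected Loss"].sum()
print(round(subset_expected_loss, 2))
```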


In [None]:
# below is just studying the data set

In [561]:
df2 = calculate_PD(model_coef, intercept, input_df)
df2



Unnamed: 0,z,PD
1977,-22.248593,2.175499e-10
3880,-5.054286,6.341452e-03
52,-20.335657,1.473453e-09
2551,-25.821968,6.104663e-12
2246,-23.765253,4.773992e-11
...,...,...
1355,-24.602487,2.066691e-11
4856,-27.989176,6.989649e-13
2392,-7.906449,3.682247e-04
3734,-32.445364,8.112554e-15


In [565]:
df2 = pd.concat([ df2, pd.DataFrame(data=df['default'].loc[df2.index]) ], axis=1) # puts the original default labels next to the PD for comparison

df2 


Unnamed: 0,z,PD,default
1977,-22.248593,2.175499e-10,0
3880,-5.054286,6.341452e-03,0
52,-20.335657,1.473453e-09,0
2551,-25.821968,6.104663e-12,0
2246,-23.765253,4.773992e-11,0
...,...,...,...
1355,-24.602487,2.066691e-11,0
4856,-27.989176,6.989649e-13,0
2392,-7.906449,3.682247e-04,0
3734,-32.445364,8.112554e-15,0


In [573]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2500 entries, 1977 to 7467
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   z        2500 non-null   float64
 1   PD       2500 non-null   float64
 2   default  2500 non-null   int64  
dtypes: float64(2), int64(1)
memory usage: 78.1 KB


In [566]:
df2[ df2['default'] == 1 ]


Unnamed: 0,z,PD,default
270,13.503012,0.999999,1
403,20.705488,1.000000,1
1719,21.331641,1.000000,1
3435,17.536726,1.000000,1
5077,16.052972,1.000000,1
...,...,...,...
8279,5.048070,0.993619,1
37,17.152549,1.000000,1
2109,17.968240,1.000000,1
7422,0.225902,0.556236,1


In [571]:
df2[ (df2['PD'] <= 0.7) & (df2['PD'] >= 0.5)] 

Unnamed: 0,z,PD,default
4484,0.61287,0.648595,1
6234,0.620577,0.65035,1
4575,0.036081,0.509019,0
7148,0.778042,0.685258,1
965,0.163987,0.540905,1
7739,0.425456,0.604788,1
222,0.420759,0.603665,0
9661,0.310216,0.576938,1
7828,0.153006,0.538177,0
1335,0.613058,0.648638,1


In [597]:
randomForest_pred_default_df = pd.DataFrame(data=val_predictions, index=val_X.index, columns=['predicted default'])
randomForest_pred_default_df

Unnamed: 0,predicted default
1977,0.00
3880,0.00
52,0.00
2551,0.00
2246,0.00
...,...
1355,0.00
4856,0.00
2392,0.03
3734,0.00


In [602]:
df3 = pd.concat( [analysis_df['PD'], randomForest_pred_default_df['predicted default'] ],axis=1)
df3 

Unnamed: 0,PD,predicted default
1977,2.175499e-10,0.00
3880,6.341452e-03,0.00
52,1.473453e-09,0.00
2551,6.104663e-12,0.00
2246,4.773992e-11,0.00
...,...,...
1355,2.066691e-11,0.00
4856,6.989649e-13,0.00
2392,3.682247e-04,0.03
3734,8.112554e-15,0.00


In [603]:
df3[ (df3['PD'] <= 0.7) & (df3['PD'] >= 0.5) ] # in this PD band the Random Forest scores often disagree sharply with the logistic-regression PDs (e.g. 0.00 vs 0.60 for row 222), which is why the logistic-regression PD is the one used for Expected Loss

Unnamed: 0,PD,predicted default
4484,0.648595,0.14
6234,0.65035,0.97
4575,0.509019,0.44
7148,0.685258,0.98
965,0.540905,0.97
7739,0.604788,0.66
222,0.603665,0.0
9661,0.576938,0.68
7828,0.538177,0.17
1335,0.648638,0.97
