# **IRB Retail Mortgage**
# **PD Model Development**


In the previous notebooks, we completed the full data preparation and feature engineering required for retail PD modelling. This included portfolio-level analysis, target definition, exploratory data analysis, variable selection, monotonic binning, Weight of Evidence (WOE) transformation, and missing value treatment across categorical, numerical count, and numerical continuous variables.

All candidate predictors have now been transformed into stable, interpretable, and model-ready WOE variables. The objective of this notebook is to develop the Probability of Default (PD) model using logistic regression. Specifically, we will:
- Assemble the final modelling dataset using WOE-transformed variables
- Perform univariate sanity checks on coefficient signs
- Build a baseline multivariate logistic regression model
- Assess statistical significance and economic intuition
- Iteratively refine the model by removing weak or unstable predictors
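
As a compact statement of what the steps above estimate, the model can be written as follows. This assumes the standard logistic specification on WOE inputs; the notation ($PD_i$, $\beta_j$, $WOE_{ij}$, $k$) is introduced here for illustration and is not taken from the earlier notebooks.

$$
\ln\!\left(\frac{PD_i}{1 - PD_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j \, WOE_{ij}
$$

Here $PD_i$ is the probability that `default_flag` equals 1 for loan $i$, and $WOE_{ij}$ is the WOE value of variable $j$ for that loan. If WOE was computed as $\ln(\text{share of goods} / \text{share of bads})$ per bin, which the sample rows suggest (riskier bins carry negative WOE) but which should be confirmed against the binning notebook, then every $\beta_j$ is expected to be negative. This is the sign logic checked in the univariate step.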

## **Loading the Dataset**

In [1]:
# importing the required modules
import sys
sys.path.append("..")

from src.config import *
from src.utility_functions import *

In [2]:
# loading the WOE-transformed modelling dataset produced by the previous notebooks
model_df = save_as_pickle_if_not_exists_and_load(pickle_file_path = f"{main_dir}/data/processed/mortgage_model_data_woe.pkl")
model_df.head(5)

Pickle file already exists at: /Users/sharmadipanshu/Developer/KPMG/irb_retail_pd_sas_vs_python//data/processed/mortgage_model_data_woe.pkl
(2260668, 84)


Unnamed: 0,id,term,loan_amnt,int_rate,installment,grade,sub_grade,purpose,verification_status,annual_inc,emp_length,home_ownership,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,t0,default_flag,clubbed_term,clubbed_sub_grade,clubbed_purpose,clubbed_verification_status,clubbed_home_ownership,clubbed_emp_length,clubbed_term_woe,clubbed_term_iv,clubbed_sub_grade_woe,clubbed_sub_grade_iv,clubbed_purpose_woe,clubbed_purpose_iv,clubbed_verification_status_woe,clubbed_verification_status_iv,clubbed_home_ownership_woe,clubbed_home_ownership_iv,clubbed_emp_length_woe,clubbed_emp_length_iv,delinq_2yrs_adj,clubbed_delinq_2yrs,clubbed_delinq_2yrs_woe,clubbed_delinq_2yrs_iv,inq_last_6mths_adj,clubbed_inq_last_6mths,clubbed_inq_last_6mths_woe,clubbed_inq_last_6mths_iv,pub_rec_adj,clubbed_pub_rec,clubbed_pub_rec_woe,clubbed_pub_rec_iv,open_acc_adj,clubbed_open_acc,clubbed_open_acc_woe,clubbed_open_acc_iv,total_acc_adj,clubbed_total_acc,clubbed_total_acc_woe,clubbed_total_acc_iv,loan_amnt_category,loan_amnt_bin_id,loan_amnt_woe,int_rate_category,int_rate_bin_id,int_rate_woe,installment_category,installment_bin_id,installment_woe,annual_inc_category,annual_inc_bin_id,dti_category,dti_bin_id,revol_bal_category,revol_bal_bin_id,revol_bal_woe,revol_util_category,revol_util_bin_id,dti_bin_id_adj,annual_inc_bin_id_adj,revol_util_bin_id_adj,dti_woe,annual_inc_woe,revol_util_woe
774204,139972821,60 months,10400.0,22.35,289.32,D,D5,debt_consolidation,Not Verified,0.0,,RENT,,1.0,1.0,5.0,0.0,0.0,,23.0,2018-09-30,0,60 months,D5,1. medium_risk,Not Verified,1. rent,unknown,-0.3569,0.0419,-0.6693,0.0122,-0.0923,0.0051,0.4333,0.0524,-0.1773,0.0133,-0.2178,0.0033,1,1,-0.081,0.0008,1,1,-0.1399,0.0053,0,0,0.0475,0.0019,2. 3_5,2. 3_5,0.1796,0.0028,4. 21_30,4. 21_30,-0.0119,0.0,10025.0 - 15000.0,3,-0.0778,20.9 - 30.99,8,-0.9367,230.64 - 327.69,2,0.0459,0.0 - 39502.0,0,missing,-1,0.0 - 22577.0,0,-0.0333,missing,-1,0,0,0,0.3402,-0.209,0.4562
1488539,135666561,60 months,12000.0,17.47,301.28,D,D1,credit_card,Not Verified,0.0,,RENT,,0.0,0.0,1.0,0.0,0.0,,3.0,2018-06-30,0,60 months,D1,0. low_risk,Not Verified,1. rent,unknown,-0.3569,0.0419,-0.4494,0.0086,0.2187,0.0135,0.4333,0.0524,-0.1773,0.0133,-0.2178,0.0033,0,0,0.0289,0.0007,0,0,0.1971,0.022,0,0,0.0475,0.0019,1. 0_2,1. 0_2,0.2806,0.0004,1. 0_5,1. 0_5,0.193,0.0005,10025.0 - 15000.0,3,-0.0778,15.1 - 17.47,6,-0.4327,230.64 - 327.69,2,0.0459,0.0 - 39502.0,0,missing,-1,0.0 - 22577.0,0,-0.0333,missing,-1,0,0,0,0.3402,-0.209,0.4562
2081649,122649959,36 months,16800.0,9.93,541.54,B,B2,debt_consolidation,Not Verified,0.0,,MORTGAGE,,0.0,0.0,2.0,0.0,0.0,,10.0,2017-11-30,0,36 months,B2,1. medium_risk,Not Verified,0. owned,unknown,0.1765,0.0207,0.6412,0.018,-0.0923,0.0051,0.4333,0.0524,0.1304,0.0098,-0.2178,0.0033,0,0,0.0289,0.0007,0,0,0.1971,0.022,0,0,0.0475,0.0019,1. 0_2,1. 0_2,0.2806,0.0004,2. 6_10,2. 6_10,0.1058,0.0009,15025.0 - 40000.0,4,-0.0971,8.88 - 10.47,2,0.6826,327.7 - 1246.76,3,-0.0898,0.0 - 39502.0,0,missing,-1,0.0 - 22577.0,0,-0.0333,missing,-1,0,0,0,0.3402,-0.209,0.4562
2100238,123937016,36 months,20000.0,10.42,649.3,B,B3,debt_consolidation,Not Verified,0.0,,MORTGAGE,,1.0,0.0,2.0,0.0,0.0,,13.0,2017-11-30,0,36 months,B3,1. medium_risk,Not Verified,0. owned,unknown,0.1765,0.0207,0.4299,0.0091,-0.0923,0.0051,0.4333,0.0524,0.1304,0.0098,-0.2178,0.0033,1,1,-0.081,0.0008,0,0,0.1971,0.022,0,0,0.0475,0.0019,1. 0_2,1. 0_2,0.2806,0.0004,3. 11_20,3. 11_20,0.0264,0.0002,15025.0 - 40000.0,4,-0.0971,8.88 - 10.47,2,0.6826,327.7 - 1246.76,3,-0.0898,0.0 - 39502.0,0,missing,-1,0.0 - 22577.0,0,-0.0333,missing,-1,0,0,0,0.3402,-0.209,0.4562
1654329,79967,36 months,5000.0,7.43,155.38,A,A2,other,Not Verified,,< 1 year,NONE,1.0,,,,,0.0,,,2007-08-31,0,36 months,A2,2. high_risk,Not Verified,2. other,0. <2_YEARS,0.1765,0.0207,1.6643,0.046,-0.0544,0.0004,0.4333,0.0524,0.4045,0.0001,-0.0075,0.0,missing,0,0.0289,0.0007,missing,0,0.1971,0.022,missing,0,0.0475,0.0019,0. missing,1. 0_2,0.2806,0.0004,0. missing,1. 0_5,0.193,0.0005,500.0 - 6000.0,0,0.2224,7.39 - 8.81,1,0.9118,4.93 - 162.66,0,0.3412,missing,-1,-1.0 - 7.31,0,0.0 - 22577.0,0,-0.0333,missing,-1,0,8,0,0.3402,0.4164,0.4562


In [3]:
# defining the variable lists
# categorical variables
clubbed_categorical_vars = ['clubbed_term', 'clubbed_sub_grade', 'clubbed_purpose', 'clubbed_verification_status', 'clubbed_home_ownership', 'clubbed_emp_length']

# numerical continuous variables 
num_continuous_vars = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'revol_bal', 'revol_util']

# numerical count variables 
num_count_vars = ['delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc']

# WOE column names follow the naming conventions used in the earlier notebooks
final_var_list = (
    [f"{i}_woe" for i in clubbed_categorical_vars]
    + [f"clubbed_{i}_woe" for i in num_count_vars]
    + [f"{i}_woe" for i in num_continuous_vars]
)
model_df[final_var_list].isna().sum()

clubbed_term_woe                   0
clubbed_sub_grade_woe              0
clubbed_purpose_woe                0
clubbed_verification_status_woe    0
clubbed_home_ownership_woe         0
clubbed_emp_length_woe             0
clubbed_delinq_2yrs_woe            0
clubbed_inq_last_6mths_woe         0
clubbed_open_acc_woe               0
clubbed_pub_rec_woe                0
clubbed_total_acc_woe              0
loan_amnt_woe                      0
int_rate_woe                       0
installment_woe                    0
annual_inc_woe                     0
dti_woe                            0
revol_bal_woe                      0
revol_util_woe                     0
dtype: int64

## **Single Factor Analysis**

- The PD model is developed using a structured Single Factor Analysis (SFA) and Multi-Factor Analysis (MFA) framework. In SFA, each candidate WOE-transformed variable is evaluated individually for sign logic, statistical relevance, and economic intuition. Variables failing these checks are excluded.

- Variables that pass SFA are then evaluated jointly through MFA, starting with a parsimonious core model and gradually adding variables. The final model is selected on stability, interpretability, and each variable's marginal contribution, rather than on statistical metrics alone.
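
The SFA checks described above can be sketched as a loop of one-variable logistic regressions. This is a minimal illustration, not the notebook's actual implementation: the helper names (`fit_univariate_logit`, `single_factor_analysis`), the 5% significance threshold, and the synthetic data are all assumptions. It also assumes WOE is defined so that riskier bins have lower WOE, which makes a negative coefficient the expected sign. The fit is written with the standard library only so that it runs anywhere; in the notebook itself a library such as statsmodels would be the more natural choice.

```python
import math
import random


def _score_and_information(x, y, b0, b1):
    """Gradient and information-matrix terms of the logistic log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_univariate_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson; return (b0, b1, se_b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_information(x, y, b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error of the slope from the inverse information matrix at the solution
    _, _, h00, h01, h11 = _score_and_information(x, y, b0, b1)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, se_b1


def single_factor_analysis(columns, y, alpha=0.05):
    """Run one univariate logit per WOE column and flag sign and significance."""
    results = {}
    for name, x in columns.items():
        _, b1, se = fit_univariate_logit(x, y)
        z = b1 / se
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
        sign_ok = b1 < 0  # assumed convention: higher WOE means lower risk
        results[name] = {
            "coef": b1,
            "p_value": p_value,
            "sign_ok": sign_ok,
            "passes": sign_ok and p_value < alpha,
        }
    return results


# Synthetic illustration only: these values are NOT from the mortgage dataset.
rng = random.Random(42)
woe_levels = [-0.9, -0.4, 0.0, 0.5, 0.9]
n = 5000
informative = [rng.choice(woe_levels) for _ in range(n)]
noise = [rng.choice(woe_levels) for _ in range(n)]
# Default probability falls as the informative WOE rises (true slope of -1)
y = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(-2.0 - 1.0 * w))) else 0
    for w in informative
]

results = single_factor_analysis(
    {
        "informative_woe": informative,
        "wrong_sign_woe": [-w for w in informative],  # deliberately flipped
        "noise_woe": noise,
    },
    y,
)
for name, res in results.items():
    print(name, round(res["coef"], 4), res["sign_ok"], res["passes"])
```

Applied to the notebook's data, the same loop would run over the columns in `final_var_list` with `default_flag` as the target. Variables with a positive coefficient or a weak Wald test would be candidates for exclusion, subject to the economic-intuition review that a purely statistical check cannot replace.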
