# **IRB Retail Mortgage**
## **Feature Engineering**

In this notebook, we move from **exploratory analysis** to **feature engineering**, which is a core step in retail PD model development.

The previous notebook, [`01_data_preparation_and_eda.ipynb`](../notebooks/01_data_preparation_and_eda.ipynb), finalised the modelling base table, defined the target variable (`default_flag`), and documented variable-level treatment strategies. Building on those outcomes, this notebook covers:

1. Variable categorisation (identifier, target, numeric continuous, numeric count, categorical)
2. Coarse classing and binning strategy
3. Weight of Evidence (WOE) calculation
4. Information Value (IV) analysis
5. Variable screening and shortlist creation

Model estimation and validation are handled in subsequent notebooks.
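
For orientation, the WOE and IV steps listed above can be illustrated on toy numbers. This is a minimal sketch, assuming the common convention in which WOE for a bin is the log of the share of goods over the share of bads, and IV is the sum over bins of (share of goods minus share of bads) times WOE. Some references flip the ratio, which changes the sign of WOE but not the IV. The function name `woe_iv` and the bin counts are hypothetical and are not taken from `src.utility_functions`.

```python
import math

def woe_iv(bin_counts):
    """Return per-bin WOE and the total IV.

    bin_counts: dict mapping bin label -> (n_good, n_bad),
    where "bad" means default_flag == 1 (assumed convention).
    """
    total_good = sum(n_good for n_good, _ in bin_counts.values())
    total_bad = sum(n_bad for _, n_bad in bin_counts.values())

    table = {}
    total_iv = 0.0
    for label, (n_good, n_bad) in bin_counts.items():
        dist_good = n_good / total_good   # share of all goods falling in this bin
        dist_bad = n_bad / total_bad      # share of all bads falling in this bin
        # A bin with zero goods or zero bads would fail here; such bins
        # would need merging or smoothing before this step.
        woe = math.log(dist_good / dist_bad)
        contribution = (dist_good - dist_bad) * woe
        table[label] = {"woe": woe, "iv_contribution": contribution}
        total_iv += contribution
    return table, total_iv

# Hypothetical counts for a three-bin variable (not real data)
toy_counts = {"A": (400, 10), "B": (350, 30), "C": (250, 60)}
woe_table, total_iv = woe_iv(toy_counts)

for label, row in woe_table.items():
    print(label, round(row["woe"], 4), round(row["iv_contribution"], 4))
print("IV:", round(total_iv, 4))
```

Under this convention a positive WOE marks a bin with proportionally fewer defaults than the portfolio, and every bin contributes a non-negative amount to the IV.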

In [None]:
# importing the required modules
import sys
sys.path.append("..")

from src.config import *
from src.utility_functions import *

In [2]:
# loading the dataset after EDA and preliminary variable strategy decisions
model_df = save_as_pickle_if_not_exists_and_load(pickle_file_path = f"{main_dir}/data/processed/mortgage_model_data.pkl")
model_df.head(5)

Pickle file already exists at: /Users/sharmadipanshu/Developer/KPMG/irb_retail_pd_sas_vs_python//data/processed/mortgage_model_data.pkl
(2260668, 22)


Unnamed: 0,id,term,loan_amnt,int_rate,installment,grade,sub_grade,purpose,verification_status,annual_inc,emp_length,home_ownership,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,t0,default_flag
0,68407277,36 months,3600.0,13.99,123.03,C,C4,debt_consolidation,Not Verified,55000.0,10+ years,MORTGAGE,5.91,0.0,1.0,7.0,0.0,2765.0,29.7,13.0,2015-12-31,0
1,68355089,36 months,24700.0,11.99,820.28,C,C1,small_business,Not Verified,65000.0,10+ years,MORTGAGE,16.06,1.0,4.0,22.0,0.0,21470.0,19.2,38.0,2015-12-31,0
2,68341763,60 months,20000.0,10.78,432.66,B,B4,home_improvement,Not Verified,63000.0,10+ years,MORTGAGE,10.78,0.0,0.0,6.0,0.0,7869.0,56.2,18.0,2015-12-31,0
3,66310712,60 months,35000.0,14.85,829.9,C,C5,debt_consolidation,Source Verified,110000.0,10+ years,MORTGAGE,17.06,0.0,0.0,13.0,0.0,7802.0,11.6,17.0,2015-12-31,0
4,68476807,60 months,10400.0,22.45,289.91,F,F1,major_purchase,Source Verified,104433.0,3 years,MORTGAGE,25.37,1.0,3.0,12.0,0.0,21929.0,64.5,35.0,2015-12-31,0


In [17]:
# categorisation of variables into identification and reference variables, numerical variables, and categorical variables

# identification and reference variables
id_cols = ['id', 'issue_d', 't0']

# target variable
target_var = 'default_flag'

# numerical continuous variables 
num_continuous_vars = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'revol_bal', 'revol_util']

# numerical count variables 
num_count_vars = ['delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc']

# categorical variables 
categorical_vars = ['term', 'grade', 'sub_grade', 'purpose', 'verification_status', 'home_ownership', 'emp_length']

# checking for columns that are not present in any of the lists defined above
defined_vars = id_cols + [target_var] + num_continuous_vars + num_count_vars + categorical_vars
undefined_vars = [col for col in model_df.columns if col not in defined_vars]
print(f"Undefined Variables: {undefined_vars}")

Undefined Variables: []
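
The check above only runs in one direction: it finds dataframe columns that are missing from the lists, not list entries that are missing from the dataframe. A complementary sketch is shown below. To keep it self-contained, it assumes the column names are exactly those in the `head()` output above (in the notebook, `model_df.columns.tolist()` would be used instead). The names `listed_vars`, `missing_from_data` and `duplicated_vars` are introduced here for illustration only.

```python
from collections import Counter

# Column names copied from the head() output above (assumption: this is the full set)
columns = ['id', 'term', 'loan_amnt', 'int_rate', 'installment', 'grade', 'sub_grade',
           'purpose', 'verification_status', 'annual_inc', 'emp_length', 'home_ownership',
           'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
           'revol_util', 'total_acc', 't0', 'default_flag']

# Variable lists as defined in the categorisation cell
id_cols = ['id', 'issue_d', 't0']
target_var = 'default_flag'
num_continuous_vars = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'revol_bal', 'revol_util']
num_count_vars = ['delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc']
categorical_vars = ['term', 'grade', 'sub_grade', 'purpose', 'verification_status', 'home_ownership', 'emp_length']

listed_vars = id_cols + [target_var] + num_continuous_vars + num_count_vars + categorical_vars

# forward check (same as the cell above): columns not assigned to any list
undefined_vars = [col for col in columns if col not in listed_vars]

# reverse check: listed names that do not exist as columns
missing_from_data = [col for col in listed_vars if col not in columns]

# names assigned to more than one list
duplicated_vars = [col for col, n in Counter(listed_vars).items() if n > 1]

print(f"Undefined Variables: {undefined_vars}")
print(f"Listed but missing from data: {missing_from_data}")
print(f"Listed more than once: {duplicated_vars}")
```

If the column set really is the one shown in the `head()` output, the reverse check flags `issue_d`, which appears in `id_cols` but not among the displayed columns. That would only matter for code that later selects `model_df[id_cols]` directly.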
