### **Loan Default Prediction**

---

#### **Introduction**

In the financial sector, particularly in lending institutions, assessing the creditworthiness of loan applicants is a critical process that directly affects the profitability and stability of banks. Home equity loans, which allow homeowners to borrow against the equity in their property, carry inherent risk because borrowers may default. Lenders must therefore carefully evaluate how likely a borrower is to repay before approving a loan: if too many borrowers fail to repay, the lender can suffer significant financial losses.

The Home Equity dataset (HMEQ) contains detailed information about recent home equity loan applications. It includes data such as the loan amount requested, the applicant's employment history, credit background, and whether they eventually defaulted on the loan or not.

This project aims to use this dataset to build tools that help banks make smarter lending decisions. By combining data analysis and machine learning techniques with human expertise, we can improve the accuracy of credit assessments and reduce the number of loans that are not repaid.

---

#### **Problem Identification**

One of the biggest challenges in lending is identifying which applicants are likely to repay their loans and which may default. If this is not done correctly, banks can suffer large financial losses due to non-performing loans (NPL), which are loans that are not being paid back as agreed. Currently, banks still rely on manual review by loan officers to decide who gets approved. While experienced reviewers are valuable, this method has some key limitations such as:

- It can be slow and difficult to scale.

- Decisions may vary between different reviewers.

- Human judgement can sometimes be influenced by unconscious biases. 

- Important patterns in the data may be missed due to the complexity or volume of applications.


These issues can result in poor lending decisions, either approving risky applications or rejecting reliable ones.

---

#### **Project Objectives**

Our main objective is to develop an effective and reliable system that helps banks assess the risk of default among home equity loan applicants using data-driven methods. Specifically, we aim:

1. To analyze the Home Equity loan dataset and identify the most important factors that influence whether a borrower will repay or default.

2. To build and test classification machine learning models to predict clients who are likely to default on their loans.

3. To clean and prepare the dataset for modeling to ensure accurate results.

4. To minimize the risk of false negatives, that is, defaulting loans predicted as non-default, since these misclassifications result in high losses.
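
Objective 4 can be made concrete with a small example. The sketch below uses made-up labels (not HMEQ data) to show why accuracy alone can hide missed defaults and why recall on the default class is a natural metric to track. The helper `confusion_counts` and the toy labels are assumptions introduced here purely for illustration.

```python
# Toy labels for illustration only (1 = default, 0 = repaid); not taken from HMEQ.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn), treating label 1 (default) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed defaults
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual defaults that were caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, missed defaults={fn}")
```

In this toy case the predictions are correct for 7 of 10 loans, yet only 2 of the 4 defaults are caught, so accuracy looks acceptable while half of the costly cases slip through.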

---

#### **Dataset Description**

The Home Equity Loans (HMEQ) dataset contains detailed information on 5,960 recent home equity loan applications. It was collected to help lenders understand and predict which applicants are more likely to default (fail to repay) and which are likely to repay their loans successfully. The dataset includes both demographic and financial details about each applicant, making it suitable for credit risk modeling, loan approval decisions, and predictive analytics.

- **Total Records**: 5,960 recent home equity loan applications.

- **Default Rate**: Approximately 20% (1,189 out of 5,960 applicants defaulted).

- **Variable Groups**: Loan and financial variables, purpose and employment variables, and credit history variables. In total, there are 12 features and 1 target variable.

- **Target Variable**: `BAD` indicates whether the borrower defaulted (1) or repaid (0).

<br>

| Variable  | Description                                                                 |
|-----------|-----------------------------------------------------------------------------|
| BAD       | 1 = Client defaulted; 0 = Loan repaid on time |
| LOAN      | Amount of the loan approved for the home equity loan |
| MORTDUE   | Amount still due on the existing mortgage |
| VALUE     | Current market value of the property |
| REASON    | Purpose of the loan: HomeImp = home improvement, DebtCon = debt consolidation |
| JOB       | Type of job held by the applicant |
| YOJ       | Number of years at current job |
| DEROG     | Number of major derogatory reports (late payments, collections, charge-offs); indicates past credit problems |
| DELINQ    | Number of delinquent credit lines (a credit line becomes delinquent when minimum payments are missed for 30-60+ days) |
| CLAGE     | Age of the oldest credit line in months (a credit line is a reusable loan that lets you borrow money up to a certain limit) |
| NINQ      | Number of recent credit inquiries (one for each time a lender pulls the credit report) |
| CLNO      | Total number of credit lines currently open (how many accounts the borrower is managing) |
| DEBTINC   | Debt-to-income ratio (%); measures the borrower's ability to manage monthly payments |

#### **Setup and Library Imports**

In [1]:
# Data manipulation and visualization 
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go  
import shap 

# Data preprocessing 
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Imbalanced data handling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# Cross-validation and evaluation
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

# Model saving
import pickle
import joblib


#### **Data Understanding**

In [9]:
# Read the dataset 
path = "../data/raw/home_equity_loan_applications.csv"
df = pd.read_csv(path)
df.head()

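|   | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|-----|------|---------|-------|--------|-----|-----|-------|--------|-------|------|------|---------|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |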


In [12]:
# Check the basic information about the data
print("Shape of the dataset: ", df.shape)
df.info()

Shape of the dataset:  (5960, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB


In [23]:
# Check for missing values and duplicated rows
missing_values = np.round(df.isnull().sum() / len(df) * 100, 2)
print("Percentage of missing values in each column:")
print(missing_values.astype(str) + " %")
print()

duplicated_rows = df.duplicated().sum()
print("Number of duplicated rows: ", duplicated_rows)

Percentage of missing values in each column:
BAD          0.0 %
LOAN         0.0 %
MORTDUE     8.69 %
VALUE       1.88 %
REASON      4.23 %
JOB         4.68 %
YOJ         8.64 %
DEROG      11.88 %
DELINQ      9.73 %
CLAGE       5.17 %
NINQ        8.56 %
CLNO        3.72 %
DEBTINC    21.26 %
dtype: object

Number of duplicated rows:  0


**Findings**: The dataset has 5,960 rows and 13 columns: 1 target variable and 12 features. Its main characteristics are:

- The dataset contains float, integer, and object data types.

- Eleven columns have missing values. `DEBTINC` has the highest share of missing values at 21.26% of rows, followed by `DEROG` at 11.88%, so imputation will be needed.

- There are no duplicated rows in the dataset, so no deduplication is needed.

- The dataset column names are hard to understand, so we might need to rename them accordingly.
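
As a rough illustration of what the imputation step mentioned above could look like, the sketch below fills numeric columns with the median and non-numeric columns with the most frequent value. This is one common baseline and an assumption on our part, not necessarily the strategy adopted later in this project. The `toy` frame and the `impute_simple` helper are hypothetical and exist only for this example; the column names are borrowed from HMEQ but the values are made up.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using HMEQ column names; the values are not from the dataset.
toy = pd.DataFrame({
    "DEBTINC": [30.0, np.nan, 40.0, 35.0],
    "DEROG": [0.0, 1.0, np.nan, 0.0],
    "JOB": ["Other", None, "Office", "Other"],
})


def impute_simple(frame):
    """Fill numeric columns with the median and other columns with the mode."""
    out = frame.copy()  # leave the input frame untouched
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the first mode if there are ties
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


imputed = impute_simple(toy)
print(imputed)
```

In a modeling pipeline, the same idea could be expressed with `SimpleImputer` (already imported above) using `strategy="median"` and `strategy="most_frequent"`, fitted on the training split only so that no information leaks from the test data.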

In [33]:
# Get the categorical and numerical features 
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("Numerical features: ", numerical_features)
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
categorical_features.append("BAD")
print("Categorical features: ", categorical_features)

Numerical features:  ['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
Categorical features:  ['REASON', 'JOB', 'BAD']


In [34]:
# Change the target variable and object to categorical
for col in categorical_features:
    df[col] = df[col].astype('category')

In [36]:
# Check summary statistics for categorical columns
df.describe(include="category")

|        | BAD  | REASON  | JOB   |
|--------|------|---------|-------|
| count  | 5960 | 5708    | 5681  |
| unique | 2    | 2       | 6     |
| top    | 0    | DebtCon | Other |
| freq   | 4771 | 3928    | 2388  |


In [41]:
# Check the unique values in each categorical column
for col in categorical_features:
    print(df[col].value_counts())
    print()

REASON
DebtCon    3928
HomeImp    1780
Name: count, dtype: int64

JOB
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: count, dtype: int64

BAD
0    4771
1    1189
Name: count, dtype: int64



**Findings**: There are three categorical columns in the dataset: `REASON`, `JOB`, and `BAD`.

- `BAD`: The target variable is imbalanced, which is typical of real loan application data: only around 20% of loans defaulted. The class imbalance must be handled appropriately during modeling so that it does not introduce noise or bias.

- `REASON`: There are two unique values in the reason column. Most loans are *DebtCon* (a loan to pay off other existing debts), which accounts for about 66% of all applications (about 69% of those with a recorded reason); the rest are *HomeImp* (a loan for home improvement).

- `JOB`: There are six unique values in the job column. The most common is *Other* (no specific title), followed by *ProfExe*, *Office*, *Mgr*, *Self*, and *Sales*.
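
To give a sense of how strong the imbalance in `BAD` is, the sketch below derives the default rate and inverse-frequency class weights from the counts printed above. The weighting formula, `n_samples / (n_classes * class_count)`, is the same heuristic scikit-learn applies for `class_weight="balanced"`. Treat this as one possible way to handle the imbalance, shown under that assumption; resampling (for example with the `SMOTE` import above) is an alternative.

```python
# Class counts taken from the value_counts() output for BAD above.
class_counts = {0: 4771, 1: 1189}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Share of applications that defaulted.
default_rate = class_counts[1] / n_samples

# Inverse-frequency weights: rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print(f"default rate: {default_rate:.4f}")
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these counts, each defaulted loan would weigh roughly four times as much as a repaid one during training, which pushes a classifier to pay more attention to the minority class.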

In [31]:
# Check summary statistics for numerical columns
df.describe(include=["int64", "float64"])

|       | BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|-------|-----|------|---------|-------|-----|-------|--------|-------|------|------|---------|
| count | 5960.0 | 5960.0 | 5442.0 | 5848.0 | 5445.0 | 5252.0 | 5380.0 | 5652.0 | 5450.0 | 5738.0 | 4693.0 |
| mean  | 0.199497 | 18607.969799 | 73760.8172 | 101776.048741 | 8.922268 | 0.25457 | 0.449442 | 179.766275 | 1.186055 | 21.296096 | 33.779915 |
| std   | 0.399656 | 11207.480417 | 44457.609458 | 57385.775334 | 7.573982 | 0.846047 | 1.127266 | 85.810092 | 1.728675 | 10.138933 | 8.601746 |
| min   | 0.0 | 1100.0 | 2063.0 | 8000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.524499 |
| 25%   | 0.0 | 11100.0 | 46276.0 | 66075.5 | 3.0 | 0.0 | 0.0 | 115.116702 | 0.0 | 15.0 | 29.140031 |
| 50%   | 0.0 | 16300.0 | 65019.0 | 89235.5 | 7.0 | 0.0 | 0.0 | 173.466667 | 1.0 | 20.0 | 34.818262 |
| 75%   | 0.0 | 23300.0 | 91488.0 | 119824.25 | 13.0 | 0.0 | 0.0 | 231.562278 | 2.0 | 26.0 | 39.003141 |
| max   | 1.0 | 89900.0 | 399550.0 | 855909.0 | 41.0 | 10.0 | 15.0 | 1168.233561 | 17.0 | 71.0 | 203.312149 |
