# Loan Default Prediction Project

This project explores patterns, correlations, and risk indicators associated with loan default behavior using a filtered LendingClub-style loan dataset. The notebook performs a structured exploratory data analysis to understand which borrower attributes are most associated with default outcomes.

## Modeling Approach

I developed a binary classification model using LightGBM, a gradient-boosted decision tree algorithm optimized for speed and performance on large tabular datasets. LightGBM was selected for its ability to:
* Handle heterogeneous, non-linear relationships
* Work well with 50–100+ features
* Manage class imbalance through built-in weighting

The target variable is a binary indicator of loan status (0 = fully paid, 1 = default); the code below reads it from the `outcome` column.

### Import Packages

In [52]:
# Data handling
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_classif, VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models (start with these)
from lightgbm import LGBMClassifier
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import RandomizedSearchCV

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report, precision_recall_curve
)

# Handling imbalance
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

### Import Dataset

In [54]:
df = pd.read_csv("../filtered.csv")
df.dropna(inplace = True)

### Feature Engineering

In addition to some formatting changes to the date and term columns, I created several features for both organizational purposes and model training. I averaged the FICO range bounds, created various ratios with income, and built a simple delinquency score as well as a delinquency flag. More specific descriptions of the other engineered features are in the comments in the cell below.

In [55]:
#term
df['term'] = df['term'].astype(str).str.extract(r'(\d+)').astype(float)

#interest rate: strip a trailing '%' (if present) and store the result as float
df['int_rate'] = df['int_rate'].astype(str).str.rstrip('%').replace('', np.nan).astype(float)

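#emp_length
s = df['emp_length'].astype(str)
# keep digits only: "< 1 year" -> "1", "10+ years" -> "10", anything without digits -> ""
s = s.str.replace(r'[^0-9]+', '', regex = True)
df['emp_length'] = pd.to_numeric(s, errors = 'coerce')
# unparseable values become 0; cap the range at 0-10 years
df['emp_length'] = df['emp_length'].fillna(0).clip(lower = 0, upper = 10)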

#issue date
df['issue_d'] = pd.to_datetime(df['issue_d'], format="%b-%Y", errors='coerce')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format="%b-%Y", errors='coerce')

# Credit history length (in years) using months diff for accuracy
mask = df['issue_d'].notna() & df['earliest_cr_line'].notna()
df['credit_history_length'] = np.nan  # init

# months difference, then convert to years
months_diff = (df.loc[mask, 'issue_d'].dt.year - df.loc[mask, 'earliest_cr_line'].dt.year) * 12 + \
              (df.loc[mask, 'issue_d'].dt.month - df.loc[mask, 'earliest_cr_line'].dt.month)
df.loc[mask, 'credit_history_length'] = (months_diff / 12).astype(float)

#feature engineering
df['fico_avg'] = df[['fico_range_low', 'fico_range_high']].mean(axis = 1)

#zip code
df['zip3'] = df['zip_code'].astype(str).str[:3]
df.drop(columns = ['zip_code'], inplace = True)

#Drop original columns
drop_list = ['loan_status', 'fico_range_low', 'fico_range_high', 'earliest_cr_line', 'issue_d']
df.drop(columns = drop_list, inplace = True)

# 1. Installment burden relative to monthly income
df["installment_to_income"] = df["installment"] / (df["annual_inc"] / 12 + 1e-6)

# 2. Loan amount relative to annual income
df["loan_to_income"] = df["loan_amnt"] / (df["annual_inc"] + 1e-6)

# 3. Revolving balance relative to income
df["revol_bal_to_income"] = df["revol_bal"] / (df["annual_inc"] + 1e-6)

# 4. Revolving balance relative to total revolving high credit limit
df["revol_ratio"] = df["revol_bal"] / (df["total_rev_hi_lim"] + 1e-6)

# 5. High utilization flag (very high revolving utilization)
df["high_util_flag"] = (df["revol_util"] > 80).astype(int)

# 6. Weighted delinquency score (more weight for more severe delinquencies)
df["recent_delinquency_score"] = (
    df["num_tl_30dpd"] +
    2 * df["num_tl_90g_dpd_24m"] +
    3 * df["num_accts_ever_120_pd"]
)

# 7. Simple "ever delinquent" flag
df["ever_delinquent_flag"] = (
    (df["delinq_2yrs"] > 0) |
    (df["num_tl_30dpd"] > 0) |
    (df["num_tl_90g_dpd_24m"] > 0) |
    (df["num_accts_ever_120_pd"] > 0)
).astype(int)

# 8. Share of accounts that are installment loans
df["installment_loan_ratio"] = df["num_il_tl"] / (df["total_acc"] + 1e-6)

# 9. Share of accounts that are revolving
df["revolving_loan_ratio"] = df["num_rev_accts"] / (df["total_acc"] + 1e-6)

# 10. Total balance (ex mortgage) vs total high credit limit
df["total_bal_over_high"] = df["total_bal_ex_mort"] / (df["tot_hi_cred_lim"] + 1e-6)

# 11. Current balance vs total high credit limit
df["total_cur_bal_ratio"] = df["tot_cur_bal"] / (df["tot_hi_cred_lim"] + 1e-6)

# 12. FICO bucket (categorical risk bands)
df["fico_bucket"] = pd.cut(
    df["fico_avg"],
    bins=[0, 640, 680, 720, 760, 900],
    labels=[1, 2, 3, 4, 5]
)
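
To make the engineered features concrete, here is a minimal sketch that applies three of the formulas from the cell above to a two-row toy frame. The column names match the notebook, but the values are invented purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Two hypothetical borrowers (made-up values, for illustration only)
toy = pd.DataFrame({
    "installment": [300.0, 450.0],
    "annual_inc": [60_000.0, 36_000.0],
    "delinq_2yrs": [0, 1],
    "num_tl_30dpd": [0, 1],
    "num_tl_90g_dpd_24m": [0, 2],
    "num_accts_ever_120_pd": [0, 1],
})

# Monthly installment relative to monthly income (epsilon guards against zero income)
toy["installment_to_income"] = toy["installment"] / (toy["annual_inc"] / 12 + 1e-6)

# Weighted delinquency score: more severe delinquencies get larger weights
toy["recent_delinquency_score"] = (
    toy["num_tl_30dpd"]
    + 2 * toy["num_tl_90g_dpd_24m"]
    + 3 * toy["num_accts_ever_120_pd"]
)

# Flag is 1 if any of the delinquency counters is non-zero
toy["ever_delinquent_flag"] = (
    (toy["delinq_2yrs"] > 0)
    | (toy["num_tl_30dpd"] > 0)
    | (toy["num_tl_90g_dpd_24m"] > 0)
    | (toy["num_accts_ever_120_pd"] > 0)
).astype(int)

print(toy[["installment_to_income", "recent_delinquency_score", "ever_delinquent_flag"]])
```

The second borrower has one 30-day, two 90-day, and one 120-day delinquency, so the score is 1 + 2·2 + 3·1 = 8 and the flag is set.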

### Modeling

After the train/test split, I drew a stratified sample of 100,000 loans from the training subset. I wanted to perform hyperparameter tuning, but computational limitations made searching over the full training set of roughly 750,000 loans impractical, so the search runs on the sample instead.

In [56]:
# Create target array and features data frame
y = df['outcome']

X = df.drop(columns=['outcome'])
# encode categorical variables
X = pd.get_dummies(X, drop_first = True)

In [57]:
X_train, x_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = .25, 
                                                    random_state = 42)

In [58]:
# Make a stratified sample of up to 100k rows from X_train / y_train
max_sample_size = 100_000

X_sample, _, y_sample, _ = train_test_split(
    X_train,
    y_train,
    train_size=max_sample_size,
    stratify=y_train,
    random_state=42,
)

print(X_sample.shape, y_sample.shape)


(100000, 1122) (100000,)
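
As a rough illustration of what `stratify` does in the sampling step, the sketch below draws a stratified subsample from synthetic data and compares the positive rate before and after. The arrays here are randomly generated stand-ins, not the loan data, and the 20% positive rate is only chosen to resemble the class balance reported in the training logs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows, 20% positives (assumed, for illustration)
rng = np.random.default_rng(0)
y_toy = np.array([1] * 2_000 + [0] * 8_000)
X_toy = rng.normal(size=(y_toy.size, 3))

# Draw a stratified subsample of 1,000 rows, discarding the remainder
X_sub, _, y_sub, _ = train_test_split(
    X_toy,
    y_toy,
    train_size=1_000,
    stratify=y_toy,
    random_state=42,
)

# With stratification the subsample keeps (approximately) the original class ratio
print(f"positive rate, full data: {y_toy.mean():.3f}")
print(f"positive rate, subsample: {y_sub.mean():.3f}")
```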


### Hyperparameter Optimization and Model Fitting
I used RandomizedSearchCV, again because of computational limitations. GridSearchCV searches the parameter grid exhaustively, but given the size of the dataset and of the search space, RandomizedSearchCV was the more practical option. The search runs on the sample to find the best-performing parameters, which I then use to train a model on the entire training set.
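
To put the trade-off in numbers, the following back-of-the-envelope sketch counts how many parameter combinations an exhaustive grid search would cover for the search space defined in the next cell, and compares the resulting number of model fits with the randomized search. The grid sizes are copied from `param_dist`, `n_iter`, and `cv` in that cell; `range` stands in for `np.arange` so the snippet needs only the standard library.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring param_dist
grid_sizes = {
    "num_leaves": len(range(31, 200, 10)),
    "max_depth": 8,
    "learning_rate": 30,
    "n_estimators": len(range(200, 1200, 100)),
    "min_child_samples": len(range(10, 200, 10)),
    "subsample": 11,
    "colsample_bytree": 11,
    "reg_lambda": 20,
}

cv_folds = 4   # cv=4 in the search
n_iter = 10    # n_iter=10 in the search

n_combinations = prod(grid_sizes.values())
grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"grid combinations:      {n_combinations:,}")
print(f"fits for a full grid:   {grid_fits:,}")
print(f"fits for random search: {random_fits:,}")
```

The flip side is that ten random draws cover only a tiny fraction of this space, so the selected parameters are best read as a reasonable configuration rather than the optimum.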

In [60]:
param_dist = {
    "num_leaves": np.arange(31, 200, 10),
    "max_depth": [-1, 4, 5, 6, 7, 8, 9, 10],
    "learning_rate": np.linspace(0.005, 0.2, 30),
    "n_estimators": np.arange(200, 1200, 100),
    "min_child_samples": np.arange(10, 200, 10),
    "subsample": np.linspace(0.5, 1.0, 11),
    "colsample_bytree": np.linspace(0.5, 1.0, 11),
    "reg_lambda": np.linspace(0.0, 5.0, 20),
}

lgbm_base = LGBMClassifier(
    objective="binary",
    class_weight="balanced",
    random_state=42,
    n_jobs = -1
)

search = RandomizedSearchCV(
    estimator = lgbm_base, 
    param_distributions = param_dist,
    n_iter=10,
    scoring = "roc_auc",
    cv = 4,
    n_jobs = -1,
    random_state = 42
)
    
search.fit(X_sample, y_sample)

[LightGBM] [Info] Number of positive: 14997, number of negative: 60003
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14997, number of negative: 60003
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002

[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Number of positive: 14998, number of negative: 60002
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.117783 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9863
[LightGBM] [Info] Number of d

In [61]:
# Refit the best estimator found by the search (tuned on the sample) on the full training set
best_LGBM = search.best_estimator_

best_LGBM.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 150562, number of negative: 602373
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.078158 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 11526
[LightGBM] [Info] Number of data points in the train set: 752935, number of used features: 970
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
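
The `pavg=0.500000` line in the log above is consistent with `class_weight="balanced"`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts printed in the log; it is a plain-Python illustration of the weighting, not a call into LightGBM itself.

```python
# Class counts taken from the LightGBM log for the full training set
n_pos, n_neg = 150_562, 602_373
n_samples = n_pos + n_neg
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * count of that class)
w_pos = n_samples / (n_classes * n_pos)
w_neg = n_samples / (n_classes * n_neg)

# After weighting, both classes contribute the same total weight
weighted_pos_share = (w_pos * n_pos) / (w_pos * n_pos + w_neg * n_neg)

print(f"weight for defaults (1):   {w_pos:.3f}")
print(f"weight for fully paid (0): {w_neg:.3f}")
print(f"weighted positive share:   {weighted_pos_share:.3f}")
```

Each default therefore counts roughly four times as much as a fully paid loan during training, which pushes the model toward higher recall on defaults at the cost of precision.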


### Model Results
The LightGBM model achieved a recall of 0.683 on the default class, meaning that ~68.3% of all loan defaults in the test set were successfully identified, at a precision of 0.330. The model also obtained an ROC AUC of 0.7308.


In [103]:
y_prob = best_LGBM.predict_proba(x_test)[:, 1]
y_predict = best_LGBM.predict(x_test)

auc = roc_auc_score(y_test, y_prob)
print(f"Validation ROC AUC: {auc:.4f}")

Validation ROC AUC: 0.7308


In [105]:
print(classification_report(y_test, y_predict, digits=3))

              precision    recall  f1-score   support

           0      0.891     0.651     0.752    200523
           1      0.330     0.683     0.445     50456

    accuracy                          0.658    250979
   macro avg      0.610     0.667     0.599    250979
weighted avg      0.778     0.658     0.691    250979
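
As a quick sanity check on the report above, the sketch below recomputes each class's F1 score from the rounded precision and recall, and compares the precision on defaults with the share of defaults in the test set. The inputs are the three-decimal values and supports printed in the report, so results can differ slightly from what the unrounded predictions would give.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the classification report
report = {
    0: (0.891, 0.651, 200_523),
    1: (0.330, 0.683, 50_456),
}

for label, (p, r, support) in report.items():
    print(f"class {label}: F1 = {f1(p, r):.3f} (support {support:,})")

# Share of defaults in the test set, i.e. the precision of flagging every loan
total = sum(support for _, _, support in report.values())
base_rate = report[1][2] / total

# How much more often a flagged loan defaults compared with a random loan
lift = report[1][0] / base_rate

print(f"default base rate: {base_rate:.3f}")
print(f"precision lift over base rate: {lift:.2f}x")
```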

