# Loan Default Prediction Project

This project explores patterns, correlations, and risk indicators associated with loan default behavior using a filtered LendingClub-style loan dataset. The notebook performs a structured exploratory data analysis to understand which borrower attributes are most associated with default outcomes.

## Modeling Approach

We developed a binary classification model. The cells below train a scikit-learn logistic regression, which is a linear model and not a gradient-boosted one. LightGBM, a gradient-boosted decision tree algorithm optimized for speed and performance on large tabular datasets, was selected for its ability to:
* Handle heterogeneous, non-linear relationships
* Work well with 50-100+ features
* Manage class imbalance through built-in weighting
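
The target variable encodes loan status (0 = fully paid, 1 = default). In the code below it is the `outcome` column; the raw `loan_status` column is dropped during feature engineering.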

### Libraries Import

In [138]:
# Data handling
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder 
from sklearn.feature_selection import SelectKBest, mutual_info_classif, VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)

### Data Import

In [141]:
df = pd.read_csv("../filtered.csv")
df.dropna(inplace = True)
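
Because `dropna` without arguments removes every row that has a missing value in any column, it can be worth checking how much data it discards. The following is a minimal sketch on a toy frame; the column values are made up for illustration and are not taken from `filtered.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "annual_inc": [50000, np.nan, 70000, 80000],
    "emp_length": ["2 years", "10+ years", None, "< 1 year"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Rows that survive a blanket dropna() versus rows that are discarded
rows_kept = len(toy.dropna())
rows_lost = len(toy) - rows_kept

print(missing_share)
print(f"rows kept: {rows_kept}, rows lost: {rows_lost}")
```

If a handful of sparse columns account for most of the lost rows, dropping or imputing those columns first may preserve more training data.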

### Feature Engineering

In addition to some formatting changes for the date and duration fields, I created features for both organizational purposes and model training: I averaged the FICO range bounds, built several ratios against income, and built a simple delinquency score as well as a delinquency flag. More specific descriptions of the other engineered features are in the comments in the cell below.

In [144]:
#term
df['term'] = df['term'].astype(str).str.extract(r'(\d+)').astype(float)

#interest rate (strip a trailing '%' if present and assign the result back to the column)
df['int_rate'] = df['int_rate'].astype(str).str.rstrip('%').replace('', np.nan).astype(float)

#emp_length
s = df['emp_length'].astype(str)
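# keep digits only: '10+ years' -> '10', '< 1 year' -> '1'; values without digits become NaN and are set to 0 below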
s = s.str.replace(r'[^0-9]+', '', regex = True)
df['emp_length'] = pd.to_numeric(s, errors = 'coerce')
df['emp_length'] = df['emp_length'].fillna(0).clip(lower = 0, upper = 10)

#issue date
df['issue_d'] = pd.to_datetime(df['issue_d'], format="%b-%Y", errors='coerce')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format="%b-%Y", errors='coerce')

# Credit history length (in years) using months diff for accuracy
mask = df['issue_d'].notna() & df['earliest_cr_line'].notna()
df['credit_history_length'] = np.nan  # init

# months difference, then convert to years
months_diff = (df.loc[mask, 'issue_d'].dt.year - df.loc[mask, 'earliest_cr_line'].dt.year) * 12 + \
              (df.loc[mask, 'issue_d'].dt.month - df.loc[mask, 'earliest_cr_line'].dt.month)
df.loc[mask, 'credit_history_length'] = (months_diff / 12).astype(float)

#feature engineering
df['fico_avg'] = df[['fico_range_low', 'fico_range_high']].mean(axis = 1)

#zip code
df['zip3'] = df['zip_code'].astype(str).str[:3]
df.drop(columns = ['zip_code'], inplace = True)

#Drop original columns
drop_list = ['loan_status', 'fico_range_low', 'fico_range_high', 'earliest_cr_line', 'issue_d']
df.drop(columns = drop_list, inplace = True)

# 1. Installment burden relative to monthly income
df["installment_to_income"] = df["installment"] / (df["annual_inc"] / 12 + 1e-6)

# 2. Loan amount relative to annual income
df["loan_to_income"] = df["loan_amnt"] / (df["annual_inc"] + 1e-6)

# 3. Revolving balance relative to income
df["revol_bal_to_income"] = df["revol_bal"] / (df["annual_inc"] + 1e-6)

# 4. Revolving balance relative to total revolving high credit limit
df["revol_ratio"] = df["revol_bal"] / (df["total_rev_hi_lim"] + 1e-6)

# 5. High utilization flag (very high revolving utilization)
df["high_util_flag"] = (df["revol_util"] > 80).astype(int)

# 6. Weighted delinquency score (more weight for more severe delinquencies)
df["recent_delinquency_score"] = (
    df["num_tl_30dpd"] +
    2 * df["num_tl_90g_dpd_24m"] +
    3 * df["num_accts_ever_120_pd"]
)

# 7. Simple "ever delinquent" flag
df["ever_delinquent_flag"] = (
    (df["delinq_2yrs"] > 0) |
    (df["num_tl_30dpd"] > 0) |
    (df["num_tl_90g_dpd_24m"] > 0) |
    (df["num_accts_ever_120_pd"] > 0)
).astype(int)

# 8. Share of accounts that are installment loans
df["installment_loan_ratio"] = df["num_il_tl"] / (df["total_acc"] + 1e-6)

# 9. Share of accounts that are revolving
df["revolving_loan_ratio"] = df["num_rev_accts"] / (df["total_acc"] + 1e-6)

# 10. Total balance (ex mortgage) vs total high credit limit
df["total_bal_over_high"] = df["total_bal_ex_mort"] / (df["tot_hi_cred_lim"] + 1e-6)

# 11. Current balance vs total high credit limit
df["total_cur_bal_ratio"] = df["tot_cur_bal"] / (df["tot_hi_cred_lim"] + 1e-6)

# 12. FICO bucket (categorical risk bands)
df["fico_bucket"] = pd.cut(
    df["fico_avg"],
    bins=[0, 640, 680, 720, 760, 900],
    labels=[1, 2, 3, 4, 5]
)
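
To make the string-cleaning and bucketing steps above easier to check, here is a small sketch that applies the same transformations to two hand-written rows. The input strings are assumptions about how the raw columns look (LendingClub-style values such as `" 36 months"` or `"10+ years"`), not rows from the actual dataset.

```python
import pandas as pd

# Two made-up rows in the assumed raw format
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "int_rate": ["13.56%", "7.9%"],
    "emp_length": ["< 1 year", "10+ years"],
    "fico_range_low": [660, 720],
    "fico_range_high": [664, 724],
})

# term: pull out the number of months
toy["term"] = toy["term"].str.extract(r"(\d+)", expand=False).astype(float)

# int_rate: drop the percent sign and convert to a number
toy["int_rate"] = pd.to_numeric(toy["int_rate"].str.rstrip("%"), errors="coerce")

# emp_length: keep digits only, then cap to the 0-10 range
digits = toy["emp_length"].str.replace(r"[^0-9]+", "", regex=True)
toy["emp_length"] = pd.to_numeric(digits, errors="coerce").fillna(0).clip(lower=0, upper=10)

# FICO midpoint and risk band (intervals are right-closed by default)
toy["fico_avg"] = toy[["fico_range_low", "fico_range_high"]].mean(axis=1)
toy["fico_bucket"] = pd.cut(
    toy["fico_avg"],
    bins=[0, 640, 680, 720, 760, 900],
    labels=[1, 2, 3, 4, 5],
)

print(toy[["term", "int_rate", "emp_length", "fico_avg", "fico_bucket"]])
```

One detail this makes visible: the digit-only regex maps `"< 1 year"` to 1, the same value as `"1 year"`, so the two groups are not distinguished after cleaning.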

### Train Test Split

In [147]:
X = df.drop(columns = ['outcome']) 
y = df['outcome'] 

X_train, x_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 30)

cat_cols = X.select_dtypes(include=['object', 'category']).columns
num_cols = X.select_dtypes(exclude=['object', 'category']).columns
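
The split above is random but not stratified. With an imbalanced target, passing `stratify` is one common way to keep the default rate the same in both partitions. The sketch below uses synthetic arrays with an assumed 20% positive rate; it is an illustration, not a rerun of the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target (20% positives, assumed for illustration)
rng = np.random.default_rng(30)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)

# stratify=y_toy keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=30, stratify=y_toy
)

print("train default rate:", y_tr.mean())
print("test default rate: ", y_te.mean())
```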

### Modeling

My pipeline consists of 4 main components:
* One-hot encode categorical variables with controls on rare and high-cardinality categories (`min_frequency`, `max_categories`) to keep the feature space efficient and stable.
* Remove near-zero-variance features to eliminate uninformative predictors created during encoding.
* Select the top 50 features using mutual information to reduce dimensionality and focus on the most predictive signals.
* Train a logistic regression model with L2 regularization and class_weight="balanced" to handle class imbalance. Balanced weighting shifts the predicted probabilities toward the minority class, so they are best read as risk scores and not as calibrated default probabilities.

These steps are chained in a single end-to-end pipeline so that the same preprocessing is applied at training and prediction time.
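
For reference, the `class_weight="balanced"` setting mentioned above weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with scikit-learn's `compute_class_weight` on a synthetic target; the 80/20 class ratio is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic target: 800 fully paid (0) and 200 defaults (1)
y_toy = np.array([0] * 800 + [1] * 200)

# Weights as applied by class_weight="balanced"
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# Same numbers computed by hand: n_samples / (n_classes * class_count)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
print(manual)
```

In this toy setup each default counts four times as much as a fully paid loan in the loss, which pushes the model toward higher recall on defaults.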

In [150]:
#Modeling
ohe = OneHotEncoder(
    handle_unknown = "ignore",
    sparse_output = True,
    min_frequency = 1000,
    max_categories = 50)

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", num_cols),
        ("cat", ohe, cat_cols),
    ]
)

nzv = VarianceThreshold(threshold = 1e-5)

selector = SelectKBest(mutual_info_classif, k = 50)

logit = LogisticRegression(solver = "saga",
                           penalty = "l2",
                           max_iter = 1000,
                           class_weight = "balanced",
                           verbose = 1)

# No resampling step is used, so the sklearn Pipeline imported at the top is sufficient


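# Note: the first step is `ohe` itself, so the encoder is applied to every column,
# numeric ones included; the `preprocess` ColumnTransformer defined above is not used.
pipe = Pipeline(steps = [
    ("encoding", ohe),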
    ("nzv", nzv),
    ("kbest", selector),
    ("model", logit)])

In [159]:
pipe.fit(X_train, y_train)

Epoch 1, change: 1.00000000
Epoch 2, change: 0.47393380
Epoch 3, change: 0.26401383
Epoch 4, change: 0.20883194
Epoch 5, change: 0.16764044
Epoch 6, change: 0.13639280
Epoch 7, change: 0.09404495
Epoch 8, change: 0.03884402
Epoch 9, change: 0.03324460
Epoch 10, change: 0.03007730
Epoch 11, change: 0.01723062
Epoch 12, change: 0.00932272
Epoch 13, change: 0.00700648
Epoch 14, change: 0.00729282
Epoch 15, change: 0.00552663
Epoch 16, change: 0.00488580
Epoch 17, change: 0.00430526
Epoch 18, change: 0.00375261
Epoch 19, change: 0.00340069
Epoch 20, change: 0.00304295
Epoch 21, change: 0.00275291
Epoch 22, change: 0.00248463
Epoch 23, change: 0.00226285
Epoch 24, change: 0.00202310
Epoch 25, change: 0.00185319
Epoch 26, change: 0.00166277
Epoch 27, change: 0.00152103
Epoch 28, change: 0.00135190
Epoch 29, change: 0.00124085
Epoch 30, change: 0.00109822
Epoch 31, change: 0.00100544
Epoch 32, change: 0.00091584
Epoch 33, change: 0.00083276
Epoch 34, change: 0.00073799
Epoch 35, change: 0.000

In [163]:
y_pred = pipe.predict(x_test)
y_proba = pipe.predict_proba(x_test)

### Results

My logistic regression model achieves a recall of 0.68 for the default class at the default 0.5 decision threshold, meaning it identifies roughly two thirds of the borrowers who ultimately default. This matters in credit risk applications, where the cost of missing a high-risk borrower is significant. Precision for this class is lower at 0.31, so about seven in ten loans flagged as likely defaults are in fact fully paid. This is an intentional trade-off: catching more true defaults increases false positives. Overall accuracy is 0.63 and ROC AUC is about 0.71.

In [166]:
print("\nClassification report: \n", classification_report(y_test, y_pred))


Classification report: 
               precision    recall  f1-score   support

           0       0.88      0.62      0.73    200798
           1       0.31      0.68      0.42     50181

    accuracy                           0.63    250979
   macro avg       0.60      0.65      0.58    250979
weighted avg       0.77      0.63      0.67    250979



In [168]:
print("ROC AUC: ", roc_auc_score(y_test, y_proba[:, 1]))

ROC AUC:  0.7061978389155072
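
The precision and recall figures in the Results depend on the 0.5 cutoff that `predict` applies to the predicted default probability. A minimal sketch of how moving that cutoff trades precision against recall is shown below. The labels and scores are hand-made toy values, and `metrics_at` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted default probabilities (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_default = np.array([0.10, 0.20, 0.35, 0.45, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])


def metrics_at(threshold):
    """Return (precision, recall) for the default class at a given cutoff."""
    y_hat = (p_default >= threshold).astype(int)
    return precision_score(y_true, y_hat), recall_score(y_true, y_hat)


for threshold in (0.3, 0.5, 0.7):
    precision, recall = metrics_at(threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

With the real model, the same loop could be run on `y_test` and `y_proba[:, 1]` to pick a cutoff that matches the relative cost of missed defaults and false alarms.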
