# Loan Default Prediction Project

This project explores patterns, correlations, and risk indicators associated with loan default behavior using a filtered LendingClub-style loan dataset. The notebook performs a structured exploratory data analysis to understand which borrower attributes are most associated with default outcomes.

## Model Approach

I trained an XGBoost model to predict the likelihood of loan default using a filtered LendingClub-style dataset. My objective was to identify high-risk borrowers by ranking loan applicants according to their probability of default and then selecting an optimal decision threshold for classification. I chose XGBoost because it performs exceptionally well on large, tabular datasets with mixed feature types, which is ideal for consumer lending data. Its gradient-boosted tree framework enables the model to capture complex nonlinear relationships between borrower characteristics and default risk, while its built-in regularization controls overfitting. Additionally, XGBoost offers strong support for handling imbalanced classification problems through mechanisms like sample weighting, making it well-suited for detecting minority-class defaults. I tuned the model using RandomizedSearchCV and evaluated its performance on a held-out test set using both the default and optimized thresholds.

### Library Import

In [90]:
# Data handling
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_classif, VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.pipeline import Pipeline

# Models
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV


# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report, precision_recall_curve
)

### Data Import

In [93]:
df = pd.read_csv("../filtered.csv")
df.dropna(inplace = True)

### Feature Engineering

In addition to some formatting changes for the term and date fields, I created several features for both organizational purposes and model training. I averaged the FICO range bounds, built various ratios against income, and added a simple weighted delinquency score and an ever-delinquent flag. The comments in the cell below describe the other engineered features.

In [94]:
#term
df['term'] = df['term'].astype(str).str.extract(r'(\d+)').astype(float)

#interest rate (assign the result back; the conversion is not in-place)
df['int_rate'] = df['int_rate'].astype(str).str.rstrip('%').replace('', np.nan).astype(float)

#emp_length
s = df['emp_length'].astype(str)
# Strip non-digits: '< 1 year' becomes 1 and '10+ years' becomes 10
s = s.str.replace(r'[^0-9]+', '', regex = True)
df['emp_length'] = pd.to_numeric(s, errors = 'coerce')
df['emp_length'] = df['emp_length'].fillna(0).clip(lower = 0, upper = 10)

#issue date
df['issue_d'] = pd.to_datetime(df['issue_d'], format="%b-%Y", errors='coerce')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format="%b-%Y", errors='coerce')

# Credit history length (in years) using months diff for accuracy
mask = df['issue_d'].notna() & df['earliest_cr_line'].notna()
df['credit_history_length'] = np.nan  # init

# months difference, then convert to years
months_diff = (df.loc[mask, 'issue_d'].dt.year - df.loc[mask, 'earliest_cr_line'].dt.year) * 12 + \
              (df.loc[mask, 'issue_d'].dt.month - df.loc[mask, 'earliest_cr_line'].dt.month)
df.loc[mask, 'credit_history_length'] = (months_diff / 12).astype(float)

#feature engineering
df['fico_avg'] = df[['fico_range_low', 'fico_range_high']].mean(axis = 1)

#zip code
df['zip3'] = df['zip_code'].astype(str).str[:3]
df.drop(columns = ['zip_code'], inplace = True)

#Drop original columns
drop_list = ['loan_status', 'fico_range_low', 'fico_range_high', 'earliest_cr_line', 'issue_d']
df.drop(columns = drop_list, inplace = True)

# 1. Installment burden relative to monthly income
df["installment_to_income"] = df["installment"] / (df["annual_inc"] / 12 + 1e-6)

# 2. Loan amount relative to annual income
df["loan_to_income"] = df["loan_amnt"] / (df["annual_inc"] + 1e-6)

# 3. Revolving balance relative to income
df["revol_bal_to_income"] = df["revol_bal"] / (df["annual_inc"] + 1e-6)

# 4. Revolving balance relative to total revolving high credit limit
df["revol_ratio"] = df["revol_bal"] / (df["total_rev_hi_lim"] + 1e-6)

# 5. High utilization flag (very high revolving utilization)
df["high_util_flag"] = (df["revol_util"] > 80).astype(int)

# 6. Weighted delinquency score (more weight for more severe delinquencies)
df["recent_delinquency_score"] = (
    df["num_tl_30dpd"] +
    2 * df["num_tl_90g_dpd_24m"] +
    3 * df["num_accts_ever_120_pd"]
)

# 7. Simple "ever delinquent" flag
df["ever_delinquent_flag"] = (
    (df["delinq_2yrs"] > 0) |
    (df["num_tl_30dpd"] > 0) |
    (df["num_tl_90g_dpd_24m"] > 0) |
    (df["num_accts_ever_120_pd"] > 0)
).astype(int)

# 8. Share of accounts that are installment loans
df["installment_loan_ratio"] = df["num_il_tl"] / (df["total_acc"] + 1e-6)

# 9. Share of accounts that are revolving
df["revolving_loan_ratio"] = df["num_rev_accts"] / (df["total_acc"] + 1e-6)

# 10. Total balance (ex mortgage) vs total high credit limit
df["total_bal_over_high"] = df["total_bal_ex_mort"] / (df["tot_hi_cred_lim"] + 1e-6)

# 11. Current balance vs total high credit limit
df["total_cur_bal_ratio"] = df["tot_cur_bal"] / (df["tot_hi_cred_lim"] + 1e-6)

# 12. FICO bucket (categorical risk bands)
df["fico_bucket"] = pd.cut(
    df["fico_avg"],
    bins=[0, 640, 680, 720, 760, 900],
    labels=[1, 2, 3, 4, 5]
)
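
As a quick check of the parsing steps in the cell above, the sketch below runs the same string operations on a hypothetical three-row toy frame. The values are made up for illustration and are not taken from the dataset. Two details are worth noticing: stripping non-digits maps `"< 1 year"` to 1 rather than 0, and `pd.cut` uses right-closed bins by default, so a FICO average of exactly 640 lands in bucket 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (illustrative values only, not from filtered.csv)
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "int_rate": ["13.56%", "7.02%", "10.0%"],
    "emp_length": ["< 1 year", "10+ years", "3 years"],
    "fico_avg": [640.0, 662.0, 722.0],
})

# term: keep the first run of digits; expand=False returns a Series
term = toy["term"].astype(str).str.extract(r"(\d+)", expand=False).astype(float)

# int_rate: drop the trailing percent sign and convert to float
int_rate = (
    toy["int_rate"].astype(str).str.rstrip("%").replace("", np.nan).astype(float)
)

# emp_length: strip every non-digit character, then clip to the 0-10 range
emp = toy["emp_length"].astype(str).str.replace(r"[^0-9]+", "", regex=True)
emp = pd.to_numeric(emp, errors="coerce").fillna(0).clip(lower=0, upper=10)

# fico_bucket: right-closed bins, i.e. (0, 640], (640, 680], ...
bucket = pd.cut(
    toy["fico_avg"],
    bins=[0, 640, 680, 720, 760, 900],
    labels=[1, 2, 3, 4, 5],
)

print(pd.DataFrame({"term": term, "int_rate": int_rate,
                    "emp_length": emp, "fico_bucket": bucket}))
```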

### Train and Test Cohorts

In [95]:
y = df['outcome']

X = df.drop(columns=['outcome'])
X = pd.get_dummies(X, drop_first=True)

In [96]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state = 12,
    stratify = y
)

#Class imbalance ratio (neg / pos); computed here but not passed to XGBClassifier in the cells below
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()

scale_pos_weight = neg / pos

### Modeling

As with my LightGBM approach, I used a computationally efficient strategy to tune an XGBoost classifier for predicting loan defaults. I drew a stratified sample of 100,000 observations from the training data to reduce the computational load, particularly during hyperparameter tuning. To address class imbalance, I computed per-row sample weights with class_weight="balanced" and passed them to the search's fit call. The base XGBoost model uses reasonable defaults for binary classification, and RandomizedSearchCV evaluates 10 parameter combinations with 4-fold cross-validation, scored by ROC AUC. With the best parameters, I then trained the final XGBoost model on the full training set of roughly 750,000 observations (the 25% test split holds 250,979).

In [97]:
#Tuning train test split for computational considerations
X_tune, _, y_tune, _ = train_test_split(
    X_train,
    y_train,
    train_size = 100000,
    stratify = y_train,
    random_state = 42)

sample_weight_tune = compute_sample_weight(
    class_weight="balanced",
    y = y_tune
)
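
The `scale_pos_weight` ratio from the split cell and the balanced sample weights computed here encode the same class ratio in two different forms. The sketch below illustrates this on a hypothetical 8-to-2 label vector (`y_toy` is an assumption made for the example): the balanced weight for a class is `n_samples / (n_classes * class_count)`, so the positive-to-negative weight ratio equals `neg / pos`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels: 8 non-defaults and 2 defaults
y_toy = np.array([0] * 8 + [1] * 2)

# Balanced weights: n_samples / (n_classes * count of that class)
w = compute_sample_weight(class_weight="balanced", y=y_toy)
w_neg = w[y_toy == 0][0]  # 10 / (2 * 8) = 0.625
w_pos = w[y_toy == 1][0]  # 10 / (2 * 2) = 2.5

# Same ratio that scale_pos_weight expresses as a single number
weight_ratio = w_pos / w_neg
scale_pos_weight_toy = (y_toy == 0).sum() / (y_toy == 1).sum()

print(weight_ratio, scale_pos_weight_toy)
```

Either mechanism rebalances the loss toward the minority class, so in practice one of the two is usually enough.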

In [101]:
#modeling
xgb_base = XGBClassifier(
    objective="binary:logistic",  # predict probability of default (class 1)
    eval_metric="auc",
    tree_method="hist",
    random_state=42,
    n_jobs=-1
)

param_dist = {
    "n_estimators": [200, 400, 600],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0],
    "min_child_weight": [1, 5, 10],
    "gamma": [0, 1],
}

search = RandomizedSearchCV(
    estimator = xgb_base,
    param_distributions = param_dist,
    n_iter = 10,
    scoring = "roc_auc",
    cv = 4,
    verbose = 1,
    n_jobs = 1)

search.fit(X_tune, y_tune, sample_weight = sample_weight_tune)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


In [148]:
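# best_estimator_ is the tuned model object from the search (not a parameter dict)
best_params = search.best_estimator_

# Refit on the full training set; note that no sample_weight is passed here
best_params.fit(X_train, y_train)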
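
The search was fit with balanced sample weights, but the refit on `X_train` calls `fit` without them, so the final model is trained on the raw class distribution. One possible variation, which is an assumption about what might help and not something this notebook ran, is to recompute balanced weights on the full training labels and pass them to the final fit. The sketch below compares the two on synthetic data with `LogisticRegression` as a simple stand-in for XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (roughly 80/20); not the loan dataset
X_toy, y_toy = make_classification(
    n_samples=3000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Balanced weights computed on the full training labels
w_tr = compute_sample_weight(class_weight="balanced", y=y_tr)

# Stand-in model fit without and with the weights
unweighted = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000).fit(X_tr, y_tr, sample_weight=w_tr)

# Minority-class recall at the default 0.5 threshold
recall_unweighted = recall_score(y_te, unweighted.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(recall_unweighted, recall_weighted)
```

Weighting shifts predicted probabilities upward for the minority class, which typically raises recall at 0.5 at the cost of precision. Threshold tuning on an unweighted model trades off the same quantities from the other direction.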

### Results

At the default 0.5 threshold, the model behaves very conservatively, labeling a loan as a default only when the predicted probability is high. As a result, it catches only 8.9% of true defaults (recall of 0.089 on class 1). Accuracy still looks acceptable at 0.805, but that is largely an artifact of the class imbalance, since about 80% of the test loans are non-defaults. I therefore tuned the decision threshold.

In [122]:
y_prob = best_params.predict_proba(X_test)[:, 1]  # P(default)
y_pred = best_params.predict(X_test)

print(classification_report(y_test, y_pred, digits=3))

print("Test AUC:", roc_auc_score(y_test, y_prob))

              precision    recall  f1-score   support

           0      0.812     0.984     0.890    200724
           1      0.589     0.089     0.155     50255

    accuracy                          0.805    250979
   macro avg      0.701     0.537     0.522    250979
weighted avg      0.767     0.805     0.743    250979

Test AUC: 0.7296492557056875
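
To put the accuracy figure in context, the arithmetic below uses only the support counts and rounded metrics printed in the report above. It computes the accuracy of a trivial rule that labels every loan as a non-default, plus an approximate count of defaults caught at the 0.5 threshold. The count is an estimate from the rounded recall, not an exact confusion-matrix entry.

```python
# Support counts from the classification report above
n_non_default = 200724
n_default = 50255
n_total = n_non_default + n_default

# Accuracy of always predicting class 0 (the majority class)
majority_baseline_accuracy = n_non_default / n_total

# Reported metrics at the default 0.5 threshold (rounded in the report)
reported_accuracy = 0.805
reported_recall_default = 0.089

# Approximate number of true defaults the model flags
approx_defaults_caught = reported_recall_default * n_default

print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
print(f"Model accuracy:                   {reported_accuracy:.3f}")
print(f"Approx. defaults caught:          {approx_defaults_caught:.0f} of {n_default}")
```

The model's 0.805 accuracy is only about half a percentage point above the 0.800 baseline, which is why recall and F1 on the default class are the more informative numbers here.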


In [108]:
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-9)

best_idx = f1_scores[:-1].argmax()  # thresholds has one fewer entry than precision/recall
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)
print("Best F1 (defaults):", f1_scores[best_idx])

Best threshold: 0.22123234
Best F1 (defaults): 0.44589846760034424
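
The threshold above is selected on the same test set that is later used to report performance, so the tuned metrics may be somewhat optimistic. A common alternative is to choose the threshold on a separate validation split and leave the test set untouched. The sketch below shows that pattern on synthetic data with `LogisticRegression` as a stand-in model. The split sizes, names, and data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80/20); not the loan dataset
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hold out a validation split that is used only for threshold selection
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_prob = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_prob)

# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so indices line up with thresholds
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_idx = f1.argmax()
val_threshold = thresholds[best_idx]

# precision_recall_curve treats score >= threshold as a positive prediction
val_pred = (val_prob >= val_threshold).astype(int)
print(val_threshold, f1_score(y_val, val_pred))
```

The chosen `val_threshold` would then be applied once to the test-set probabilities to get an unbiased estimate.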


### Threshold Tuning Results

I tuned the decision threshold by using precision_recall_curve() to find the value that maximizes the F1 score for the default class. At the tuned threshold of about 0.221, recall improved from 0.089 to 0.624 and the F1 score from 0.155 to 0.446, so about 62.4% of true loan defaults were correctly labeled by the XGBoost model. Precision on defaults fell from 0.589 to 0.347 and accuracy from 0.805 to 0.690, which is an acceptable trade-off here because the class imbalance makes accuracy a misleading metric for model evaluation.

In [109]:
y_pred_best = (y_prob >= best_threshold).astype(int)  # >= matches precision_recall_curve's convention

print("\nClassification report at tuned threshold:")
print(classification_report(y_test, y_pred_best, digits=3))


Classification report at tuned threshold:
              precision    recall  f1-score   support

           0      0.882     0.706     0.784    200724
           1      0.347     0.624     0.446     50255

    accuracy                          0.690    250979
   macro avg      0.615     0.665     0.615    250979
weighted avg      0.775     0.690     0.717    250979
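
As a consistency check on the two reports, the F1 score for the default class can be recomputed from the rounded precision and recall values, along with the share of non-defaults that the tuned threshold flags. This is plain arithmetic on the printed numbers, so small rounding differences from the exact values are expected.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (default) metrics from the two classification reports
f1_at_default_threshold = f1_from(0.589, 0.089)
f1_at_tuned_threshold = f1_from(0.347, 0.624)

# Class 0 recall at the tuned threshold is 0.706, so the remainder of
# non-defaults is flagged as defaults (false alarms)
false_alarm_share = 1 - 0.706

print(f"F1 at 0.5 threshold:   {f1_at_default_threshold:.3f}")
print(f"F1 at tuned threshold: {f1_at_tuned_threshold:.3f}")
print(f"Non-defaults flagged:  {false_alarm_share:.1%}")
```

Both recomputed F1 values match the reports to three decimals. The tuned threshold buys the higher recall by flagging roughly 29% of non-defaulting loans.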

