# 1. Introduction

This notebook focuses on training, evaluating, and selecting machine learning models for **loan default (credit risk) prediction**.

All data preprocessing steps (exploratory data analysis, feature encoding, numerical scaling, and feature assembly) were completed in a previous notebook. This notebook therefore assumes that the input features are fully numerical and ready for modeling.

The objective of this notebook is to experiment with multiple model families, compare their performance using appropriate evaluation metrics, and select the most suitable model for final evaluation on unseen data.

The modeling workflow is designed to ensure fair comparison between models and to prevent data leakage by keeping the test dataset completely isolated until the final evaluation stage.



# 2. Data Loading

In [1]:
import numpy as np
import pandas as pd  # used later for the tuning-result and model-comparison DataFrames

In [2]:
# Load processed feature matrices
X_train = np.load("../artifacts/X_train.npy")
X_test = np.load("../artifacts/X_test.npy")

# Load target variables
y_train = np.load("../artifacts/y_train.npy")
y_test = np.load("../artifacts/y_test.npy")

# Sanity checks: shapes and alignment
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

# Row alignment checks
assert X_train.shape[0] == y_train.shape[0], "Mismatch between X_train and y_train"
assert X_test.shape[0] == y_test.shape[0], "Mismatch between X_test and y_test"

print("Data loaded successfully and shapes are aligned.")

X_train shape: (204277, 31)
y_train shape: (204277,)
X_test shape: (51070, 31)
y_test shape: (51070,)
Data loaded successfully and shapes are aligned.


# 3. Training–Validation Split

- To enable fair comparison between different machine learning models, the training data is further split into training and validation subsets.

- The training subset is used to fit the models, while the validation subset is used to evaluate and compare them during model selection. This separation makes overfitting visible (a model that merely memorizes the training subset scores poorly on the validation subset) and ensures that model choices are not influenced by the test data.

- The validation split is created only from the training data, and the test dataset remains completely untouched until the final evaluation stage. Stratified sampling is used to preserve the class distribution of the target variable, which is important due to class imbalance in loan default prediction.

In [3]:
from sklearn.model_selection import train_test_split

# Split training data into train and validation sets
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.25,        # 25% of the 80% training portion = 20% of all data (60/20/20 overall)
    random_state=42,
    stratify=y_train
)

print("X_train_sub shape:", X_train_sub.shape)
print("y_train_sub shape:", y_train_sub.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)


X_train_sub shape: (153207, 31)
y_train_sub shape: (153207,)
X_val shape: (51070, 31)
y_val shape: (51070,)
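
As a quick check that stratification behaved as intended, the positive-class rate can be compared across the subsets. The sketch below uses synthetic labels (the roughly 12% default rate and the `*_demo` names are assumptions for illustration, not figures from this dataset); in this notebook the equivalent check would compare `y_train_sub.mean()` with `y_val.mean()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for X_train / y_train (assumed ~12% positive rate)
X_demo = rng.normal(size=(10_000, 5))
y_demo = (rng.random(10_000) < 0.12).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify=..., both subsets should keep nearly the same default rate
print(f"full:  {y_demo.mean():.4f}")
print(f"train: {y_tr.mean():.4f}")
print(f"val:   {y_va.mean():.4f}")
```

If the two rates differ noticeably, the split was probably made without `stratify` or on misaligned arrays.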


# 4. Evaluation Metrics

Loan default prediction is a **class-imbalanced classification problem**, where the number of non-defaulters significantly exceeds the number of defaulters. In such scenarios, overall accuracy can be misleading, as a model may achieve high accuracy by simply predicting the majority class.

To evaluate model performance more effectively, the following metrics are used:

### Recall (Default Class)
Recall measures the proportion of actual defaulters that are correctly identified by the model. This metric is especially important in credit risk modeling, as failing to identify a defaulter (**false negative**) can result in significant financial loss.

### Precision
Precision measures the proportion of predicted defaulters that are truly defaulters. This metric helps assess the cost of **false positives**, such as rejecting creditworthy applicants.

### F1-score
The F1-score is the **harmonic mean of precision and recall**. It provides a balanced measure when there is a trade-off between identifying defaulters and avoiding false alarms.
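
The three threshold-dependent metrics above can be written directly in terms of confusion-matrix counts. The following is a minimal sketch with made-up counts (`tp`, `fp`, `fn` and the helper name are illustrative assumptions, not results from this notebook).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive (default) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 70 defaulters caught, 230 false alarms, 30 missed
p, r, f1 = precision_recall_f1(tp=70, fp=230, fn=30)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with high recall but low precision still ends up with a modest F1-score.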

### ROC-AUC
The Area Under the Receiver Operating Characteristic Curve (**ROC-AUC**) evaluates a model’s ability to distinguish between defaulters and non-defaulters across different classification thresholds. A higher ROC-AUC indicates better overall discriminative performance.
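
One way to read ROC-AUC is as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes that quantity by brute force on a toy example (the labels, scores, and the `pairwise_auc` helper are assumptions for illustration); `roc_auc_score` yields the same quantity far more efficiently.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative only)
y_toy = [0, 0, 1, 0, 1, 1]
s_toy = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

auc = pairwise_auc(y_toy, s_toy)
print(f"pairwise AUC = {auc:.4f}")  # 6 of 9 pairs are ordered correctly
```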

Accuracy may still be reported for completeness, but it is **not used as the primary metric** for model selection.
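
To make the accuracy caveat concrete, the sketch below scores a trivial classifier that always predicts "no default". The label counts are hypothetical (880 non-defaulters and 120 defaulters), chosen only to illustrate the effect.

```python
# Hypothetical label vector: 880 non-defaulters (0) and 120 defaulters (1)
y_true = [0] * 880 + [1] * 120
y_pred = [0] * len(y_true)  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # high, despite the model being useless
print(f"recall   = {recall:.2f}")    # no defaulter is ever identified
```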


# 5. Model Training and Hyperparameter Tuning


## 1. Logistic Regression

Logistic Regression is used as a **baseline model** for this credit risk prediction task. It is a widely used linear classification algorithm that provides interpretable results and serves as a strong reference point for evaluating more complex models.

As a baseline, Logistic Regression helps establish whether the engineered features contain sufficient predictive signal before introducing non-linear or ensemble-based models.

The model is first trained using **default hyperparameters** to obtain baseline performance. Subsequently, **limited hyperparameter tuning** is applied to improve performance, particularly with respect to **class imbalance handling** and **regularization**.
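
For the class imbalance handling mentioned above, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that heuristic in plain Python on a hypothetical label vector (the 880/120 counts and the helper name are assumptions for illustration).

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}


# Hypothetical labels: 880 non-defaulters (0), 120 defaulters (1)
y_toy = [0] * 880 + [1] * 120
weights = balanced_class_weights(y_toy)
print(weights)  # the minority class receives the larger weight
```

Errors on the rarer class then cost more during fitting, which typically raises recall on defaulters at the expense of precision.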

Model performance is evaluated on the **validation dataset** using the following metrics:
- Recall
- Precision
- F1-score
- ROC-AUC

The **test dataset is not used** at this stage and is reserved exclusively for **final model evaluation**.
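
Since the same four metrics are computed for every candidate model, the evaluation step could be wrapped in a small helper. The function below is a hypothetical sketch (the name `evaluate_on_validation`, its `threshold` parameter, and the synthetic demo data are assumptions, not part of the original notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def evaluate_on_validation(model, X_val, y_val, threshold=0.5):
    """Hypothetical helper: validation metrics for a fitted binary classifier."""
    proba = model.predict_proba(X_val)[:, 1]
    pred = (proba >= threshold).astype(int)
    return {
        "recall": recall_score(y_val, pred, zero_division=0),
        "precision": precision_score(y_val, pred, zero_division=0),
        "f1": f1_score(y_val, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_val, proba),  # threshold independent
    }


# Tiny synthetic demo (not the loan data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo[:300], y_demo[:300])
metrics = evaluate_on_validation(demo_model, X_demo[300:], y_demo[300:])
print(metrics)
```

In this notebook the call would look like `evaluate_on_validation(log_reg, X_val, y_val)`, with the test set still left untouched.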


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score

In [5]:
log_reg = LogisticRegression(
    max_iter=3000,
    n_jobs=-1,
    random_state=42
)

In [6]:
log_reg.fit(X_train_sub, y_train_sub)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'lbfgs' |
| max_iter | 3000 |


In [7]:
y_val_pred = log_reg.predict(X_val)
y_val_proba = log_reg.predict_proba(X_val)[:, 1]

In [8]:
# Compute evaluation metrics
recall = recall_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
roc_auc = roc_auc_score(y_val, y_val_proba)

# Print validation results
print("Logistic Regression (Baseline) - Validation Metrics")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

Logistic Regression (Baseline) - Validation Metrics
Recall:    0.0352
Precision: 0.6220
F1-score:  0.0667
ROC-AUC:   0.7457


### Tuning Logistic Regression

In [9]:
# enabling class weights as 'balanced'

log_reg = LogisticRegression(
    max_iter=3000,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced'
)

log_reg.fit(X_train_sub, y_train_sub)

y_val_pred = log_reg.predict(X_val)
y_val_proba = log_reg.predict_proba(X_val)[:, 1]

# Compute evaluation metrics
recall = recall_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
roc_auc = roc_auc_score(y_val, y_val_proba)

# Print validation results
print("Logistic Regression (class_weight='balanced') - Validation Metrics")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

Logistic Regression (class_weight='balanced') - Validation Metrics
Recall:    0.6903
Precision: 0.2159
F1-score:  0.3289
ROC-AUC:   0.7454


In [10]:
# tuning regularization
C_values = [0.01, 0.1, 1, 10, 100]

results = []

for C in C_values:
    model = LogisticRegression(
        max_iter=3000,
        random_state=42,
        n_jobs=-1,
        class_weight='balanced',
        C=C
    )
    
    model.fit(X_train_sub, y_train_sub)
    
    y_val_pred = model.predict(X_val)
    y_val_proba = model.predict_proba(X_val)[:, 1]
    
    recall = recall_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    roc_auc = roc_auc_score(y_val, y_val_proba)
    
    results.append({
        "C": C,
        "Recall": recall,
        "Precision": precision,
        "F1": f1,
        "ROC_AUC": roc_auc
    })

# Display results
for r in results:
    print(r)

{'C': 0.01, 'Recall': 0.6902714550665993, 'Precision': 0.21589410958181723, 'F1': 0.3289145978950751, 'ROC_AUC': 0.7453953366526369}
{'C': 0.1, 'Recall': 0.6902714550665993, 'Precision': 0.21589410958181723, 'F1': 0.3289145978950751, 'ROC_AUC': 0.7453927331805815}
{'C': 1, 'Recall': 0.6902714550665993, 'Precision': 0.21589410958181723, 'F1': 0.3289145978950751, 'ROC_AUC': 0.7453931291175083}
{'C': 10, 'Recall': 0.6902714550665993, 'Precision': 0.21588272516346763, 'F1': 0.32890138582044587, 'ROC_AUC': 0.7453932112930969}
{'C': 100, 'Recall': 0.6902714550665993, 'Precision': 0.21588272516346763, 'F1': 0.32890138582044587, 'ROC_AUC': 0.7453932673219071}


In [11]:
# Validation metrics are nearly identical across all C values, so the default C=1 is kept
best_C = 1

In [13]:
# Train final Logistic Regression model with best regularization
final_log_reg = LogisticRegression(
    max_iter=3000,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    C=best_C
)

final_log_reg.fit(X_train_sub, y_train_sub)

# Get validation probabilities once
y_val_proba = final_log_reg.predict_proba(X_val)[:, 1]

# ROC-AUC (threshold independent)
roc_auc = roc_auc_score(y_val, y_val_proba)

# Thresholds to evaluate
thresholds = np.linspace(0.1, 0.9, 17)

threshold_results = []

for t in thresholds:
    y_val_pred = (y_val_proba >= t).astype(int)

    recall = recall_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)

    threshold_results.append({
        "Threshold": round(t, 2),
        "Recall": recall,
        "Precision": precision,
        "F1": f1
    })

# Display results
for r in threshold_results:
    print(r)

print("\nROC-AUC (independent of threshold):", roc_auc)

{'Threshold': np.float64(0.1), 'Recall': 0.9973023098971505, 'Precision': 0.11930454426269187, 'F1': 0.21311475409836064}
{'Threshold': np.float64(0.15), 'Recall': 0.988703422694318, 'Precision': 0.1251173508577281, 'F1': 0.22212541904202732}
{'Threshold': np.float64(0.2), 'Recall': 0.9699881976058, 'Precision': 0.13258815395252363, 'F1': 0.2332880517426654}
{'Threshold': np.float64(0.25), 'Recall': 0.9419996627887371, 'Precision': 0.14170492302229437, 'F1': 0.24635125005511707}
{'Threshold': np.float64(0.3), 'Recall': 0.9089529590288316, 'Precision': 0.15283644714087263, 'F1': 0.26167362392000776}
{'Threshold': np.float64(0.35), 'Recall': 0.8669701568032372, 'Precision': 0.16538548132900197, 'F1': 0.27778077899627246}
{'Threshold': np.float64(0.4), 'Recall': 0.8179059180576631, 'Precision': 0.1801470588235294, 'F1': 0.29526157217200766}
{'Threshold': np.float64(0.45), 'Recall': 0.7597369752149722, 'Precision': 0.19704390414553088, 'F1': 0.31292753220597935}
{'Threshold': np.float64(0.

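### Final threshold selection

A threshold of 0.55 was selected based on validation performance. At this threshold the class-weighted model reaches a recall of about 0.62 and a precision of about 0.24 (F1 ≈ 0.35), compared with a recall of 0.69 and a precision of 0.22 (F1 ≈ 0.33) at the default threshold of 0.5.
In credit risk modeling, false negatives are generally more costly than false positives, so recall remains the higher-priority metric even though the chosen threshold trades a small amount of recall for better precision.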
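
A selection rule of this kind can be made explicit, for example by taking the highest-F1 threshold among those that meet a minimum recall. The sketch below illustrates that rule on synthetic scores; it is not necessarily how 0.55 was chosen here, and the `min_recall` value, helper names, and simulated data are all assumptions.

```python
import numpy as np


def sweep_thresholds(y_true, proba, thresholds):
    """Recall, precision, and F1 of the positive class at each threshold."""
    rows = []
    for t in thresholds:
        pred = proba >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        rows.append({"threshold": float(round(t, 2)), "recall": recall,
                     "precision": precision, "f1": f1})
    return rows


def pick_threshold(rows, min_recall):
    """Highest-F1 row among those meeting the recall floor (None if none do)."""
    eligible = [r for r in rows if r["recall"] >= min_recall]
    return max(eligible, key=lambda r: r["f1"]) if eligible else None


# Synthetic scores (not the loan data): positives tend to score higher
rng = np.random.default_rng(0)
y_demo = (rng.random(5_000) < 0.12).astype(int)
proba_demo = np.clip(rng.normal(0.45 + 0.2 * y_demo, 0.15), 0.0, 1.0)

rows = sweep_thresholds(y_demo, proba_demo, np.linspace(0.1, 0.9, 17))
best = pick_threshold(rows, min_recall=0.6)
print(best)
```

Stating the recall floor up front keeps the threshold choice reproducible and makes the recall versus precision trade-off an explicit decision.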

In [15]:
# Initialize comparison DataFrame (run once, at the start)
model_comparison_df = pd.DataFrame(columns=[
    "model_name",
    "variant",
    "C",
    "threshold",
    "val_recall",
    "val_precision",
    "val_f1",
    "val_roc_auc"
])

# Add final Logistic Regression results (after tuning)
model_comparison_df.loc[len(model_comparison_df)] = {
    "model_name": "LogisticRegression",
    "variant": "class_weighted + threshold_tuned",
    "C": 1,
    "threshold": 0.55,
    "val_recall": 0.6162535828696678,
    "val_precision": 0.24027083881146463,
    "val_f1": 0.34574090715603273,
    "val_roc_auc": 0.7453931291175083
}

model_comparison_df


| | model_name | variant | C | threshold | val_recall | val_precision | val_f1 | val_roc_auc |
|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | class_weighted + threshold_tuned | 1 | 0.55 | 0.616254 | 0.240271 | 0.345741 | 0.745393 |


## 2. Decision Tree Classifier


In [16]:
from sklearn.tree import DecisionTreeClassifier

In [19]:
dt = DecisionTreeClassifier()
dt.fit(X_train_sub,y_train_sub)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [27]:
y_val_proba = dt.predict_proba(X_val)[:,1]
y_val_pred = dt.predict(X_val)

recall = recall_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
roc_auc = roc_auc_score(y_val, y_val_proba)

# Print validation results
print("Decision Tree Model - not tuned")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

Decision Tree Model - not tuned
Recall:    0.2271
Precision: 0.1972
F1-score:  0.2111
ROC-AUC:   0.5528


In [36]:
# Tune the decision tree over max_depth

max_depths = [3, 5, 7, 8, 9, 10, 15]
max_depth_scores = []

for d in max_depths:
    dt = DecisionTreeClassifier(max_depth=d)
    dt.fit(X_train_sub, y_train_sub)

    y_val_proba = dt.predict_proba(X_val)[:, 1]
    y_val_pred = dt.predict(X_val)

    recall = recall_score(y_val, y_val_pred)
    # zero_division=0: a very shallow tree may predict no defaulters at all,
    # which leaves precision undefined (it is reported as 0 here)
    precision = precision_score(y_val, y_val_pred, zero_division=0)
    f1 = f1_score(y_val, y_val_pred)
    roc_auc = roc_auc_score(y_val, y_val_proba)

    max_depth_scores.append((d, recall, precision, f1, roc_auc))

df_max_depth = pd.DataFrame(
    max_depth_scores,
    columns=['max_depth', 'recall', 'precision', 'f1', 'roc_auc']
)


In [37]:
df_max_depth

| | max_depth | recall | precision | f1 | roc_auc |
|---|---|---|---|---|---|
| 0 | 3 | 0.0 | 0.0 | 0.0 | 0.682161 |
| 1 | 5 | 0.016523 | 0.624204 | 0.032194 | 0.710305 |
| 2 | 7 | 0.045861 | 0.507463 | 0.084119 | 0.717787 |
| 3 | 8 | 0.037936 | 0.494505 | 0.070467 | 0.71836 |
| 4 | 9 | 0.056989 | 0.448276 | 0.101122 | 0.714794 |
| 5 | 10 | 0.06289 | 0.412611 | 0.109144 | 0.70371 |
| 6 | 15 | 0.145338 | 0.263206 | 0.187269 | 0.598233 |


In [43]:
# Depths 7 and 8 gave near-identical validation ROC-AUC above; depth 7 is used here
max_depth = 7

dt = DecisionTreeClassifier(max_depth=max_depth, class_weight='balanced')
dt.fit(X_train_sub,y_train_sub)

y_val_proba = dt.predict_proba(X_val)[:,1]
y_val_pred = dt.predict(X_val)

recall = recall_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
roc_auc = roc_auc_score(y_val, y_val_proba)

# Print validation results
print("Decision Tree Model - tuned max_depth=7 & class_weight=balanced")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")


Decision Tree Model - tuned max_depth=7 & class_weight=balanced
Recall:    0.6724
Precision: 0.1991
F1-score:  0.3072
ROC-AUC:   0.7186


In [45]:
# Decision tree tuning - min_samples_leaf

min_sample_leaves = [1, 5, 10, 20, 50]

min_sample_leaf_scores = []

for leaf in min_sample_leaves:
    dt = DecisionTreeClassifier(max_depth=7,class_weight='balanced',min_samples_leaf=leaf)
    dt.fit(X_train_sub,y_train_sub)

    y_val_proba = dt.predict_proba(X_val)[:,1]
    y_val_pred = dt.predict(X_val)

    recall = recall_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    roc_auc = roc_auc_score(y_val, y_val_proba)

    min_sample_leaf_scores.append((leaf,recall,precision,f1,roc_auc))

df_min_sample_leaf_scores = pd.DataFrame(min_sample_leaf_scores,columns=['min_samples_leaf','recall','precision','f1','roc_auc'])

df_min_sample_leaf_scores

    


| | min_samples_leaf | recall | precision | f1 | roc_auc |
|---|---|---|---|---|---|
| 0 | 1 | 0.672399 | 0.199081 | 0.307206 | 0.718552 |
| 1 | 5 | 0.672399 | 0.199081 | 0.307206 | 0.718552 |
| 2 | 10 | 0.672399 | 0.199072 | 0.307195 | 0.718573 |
| 3 | 20 | 0.672736 | 0.199122 | 0.307289 | 0.718981 |
| 4 | 50 | 0.673411 | 0.198894 | 0.307089 | 0.719081 |


In [47]:
# Add final Decision Tree results to the comparison DataFrame

model_comparison_df.loc[len(model_comparison_df)] = {
    "model_name": "DecisionTree",
    "variant": "max_depth=7 + class_weight=balanced",
    "C": None,                 # Not applicable for Decision Tree
    "threshold": None,         # Threshold tuning not applied
    "val_recall": 0.672399,
    "val_precision": 0.199081,
    "val_f1": 0.307206,
    "val_roc_auc": 0.718552
}

model_comparison_df



| | model_name | variant | C | threshold | val_recall | val_precision | val_f1 | val_roc_auc |
|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | class_weighted + threshold_tuned | 1.0 | 0.55 | 0.616254 | 0.240271 | 0.345741 | 0.745393 |
| 1 | DecisionTree | max_depth=7 + class_weight=balanced | | | 0.672399 | 0.199081 | 0.307206 | 0.718552 |
