# Modeling Phase

In this notebook, we will develop baseline models to predict credit default.
These models will later be audited for explainability, robustness, and compliance.

## Objectives:
- Train baseline models: Logistic Regression, Random Forest, XGBoost.
- Evaluate performance using AUC-ROC, plus precision, recall, and F1-score at a threshold that flags the top 15% of predicted scores.
- Save models for future auditing.


In [1]:
# Credit Risk Audit Tool - Modeling Phase
# Dataset: Prosper Loan Data

import sys
import os

# Add the project root to the Python path
sys.path.append(os.path.abspath('..'))

import pandas as pd
import numpy as np
from audit_tool import modeling as mdl
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

pd.set_option('display.max_columns', None)

# Load Lasso-selected, scaled training and test sets
X_train = pd.read_csv('../data/processed/prosperloan/X_train_lasso.csv')
X_test = pd.read_csv('../data/processed/prosperloan/X_test_lasso.csv')
y_train = pd.read_csv('../data/processed/prosperloan/y_train_lasso.csv').squeeze()
y_test = pd.read_csv('../data/processed/prosperloan/y_test_lasso.csv').squeeze()

print("Lasso-reduced, scaled datasets loaded successfully.")


Lasso-reduced, scaled datasets loaded successfully.


## Logistic Regression
The Logistic Regression model serves as a baseline. 
Its interpretability will also be useful in the explainability phase.


In [2]:
# Logistic Regression (baseline)
# This model serves as a simple interpretable benchmark.
# It will also be useful for explainability techniques (e.g., SHAP, LIME).

# Train model
log_model = mdl.train_logistic_regression(X_train, y_train)

print("----- TRAIN -----")
mdl.evaluate_with_threshold(log_model, X_train, y_train, top_pct=0.15)

print("----- TEST -----")
mdl.evaluate_with_threshold(log_model, X_test, y_test, top_pct=0.15)


----- TRAIN -----
Threshold (top 15%): 0.2394
AUC-ROC: 0.777
              precision    recall  f1-score   support

           0      0.902     0.901     0.901     77541
           1      0.439     0.441     0.440     13608

    accuracy                          0.832     91149
   macro avg      0.670     0.671     0.671     91149
weighted avg      0.833     0.832     0.833     91149

----- TEST -----
Threshold (top 15%): 0.2374
AUC-ROC: 0.781
              precision    recall  f1-score   support

           0      0.903     0.903     0.903     19386
           1      0.448     0.450     0.449      3402

    accuracy                          0.835     22788
   macro avg      0.676     0.676     0.676     22788
weighted avg      0.835     0.835     0.835     22788
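
The internals of `mdl.evaluate_with_threshold` are not shown in this notebook. The "Threshold (top 15%)" line suggests that it flags the highest-scoring 15% of loans as defaults and reports metrics at that cutoff. Below is a minimal sketch of that idea using only NumPy; the function name `evaluate_top_pct` and its return format are assumptions for illustration, not the project's actual API. Note that about 15% of loans in this dataset are defaults (13,608 of 91,149 in training), so the number of flagged loans is close to the number of true defaults, which is why class-1 precision and recall are nearly equal in the reports above.

```python
import numpy as np

def evaluate_top_pct(y_true, scores, top_pct=0.15):
    """Flag the top `top_pct` share of scores as positive and compute metrics.

    Hypothetical helper: an illustration of a quantile-based cutoff,
    not the implementation used by audit_tool.modeling.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)

    # The cutoff is the (1 - top_pct) quantile of the predicted scores.
    threshold = np.quantile(scores, 1.0 - top_pct)
    y_pred = (scores >= threshold).astype(int)

    # Confusion-matrix counts for the positive (default) class.
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "threshold": float(threshold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy data: 20 evenly spaced scores, 3 true defaults.
demo_scores = np.linspace(0.0, 1.0, 20)
demo_labels = np.zeros(20, dtype=int)
demo_labels[[5, 18, 19]] = 1

result = evaluate_top_pct(demo_labels, demo_scores, top_pct=0.15)
print(result)
```

In the notebook itself, the scores would come from something like `log_model.predict_proba(X_test)[:, 1]`, assuming the trained object follows the scikit-learn estimator interface.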



In [3]:
mdl.save_model(log_model, '../models/prosperloan/logisticregression.pkl')

## Random Forest
The Random Forest model allows us to capture non-linear relationships
and will provide feature importance for explainability.
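
The tuning helper `mdl.train_random_forest_tuned` is defined elsewhere in the project, so its exact search strategy is not visible here. As a rough sketch, assuming it wraps scikit-learn's `GridSearchCV` with AUC scoring over `max_depth`, `min_samples_leaf`, and `n_estimators`, the search could look like the following. The synthetic data, grid values, and variable names are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced dataset standing in for the real training set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Illustrative grid; the project's real grid is not shown in this notebook.
param_grid = {
    "max_depth": [5, 15],
    "min_samples_leaf": [10],
    "n_estimators": [50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select hyperparameters by cross-validated AUC
    cv=3,
)
search.fit(X_demo, y_demo)

best_rf_demo = search.best_estimator_
print("Best hyperparameters:", search.best_params_)
print("Best CV AUC score:", search.best_score_)

# Impurity-based feature importances, one value per input column.
print("Feature importances:", best_rf_demo.feature_importances_)
```

Returning both the refitted best estimator and the search object, as the notebook's `best_rf, grid` pair suggests, keeps the full cross-validation results available for later inspection.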


In [2]:
best_rf, grid = mdl.train_random_forest_tuned(X_train, y_train)

print("----- TRAIN -----")
mdl.evaluate_with_threshold(best_rf, X_train, y_train, top_pct=0.15)

print("----- TEST -----")
mdl.evaluate_with_threshold(best_rf, X_test, y_test, top_pct=0.15)

Best hyperparameters: {'max_depth': 15, 'min_samples_leaf': 10, 'n_estimators': 200}
Best CV AUC score: 0.8005978122021729
----- TRAIN -----
Threshold (top 15%): 0.5832
AUC-ROC: 0.893
              precision    recall  f1-score   support

           0      0.928     0.928     0.928     77541
           1      0.589     0.592     0.590     13608

    accuracy                          0.877     91149
   macro avg      0.759     0.760     0.759     91149
weighted avg      0.878     0.877     0.878     91149

----- TEST -----
Threshold (top 15%): 0.5651
AUC-ROC: 0.805
              precision    recall  f1-score   support

           0      0.907     0.906     0.907     19386
           1      0.468     0.471     0.469      3402

    accuracy                          0.841     22788
   macro avg      0.688     0.688     0.688     22788
weighted avg      0.842     0.841     0.841     22788



In [3]:
mdl.save_model(best_rf, '../models/prosperloan/randomforest.pkl')

## XGBoost
XGBoost, a gradient-boosted tree ensemble, is widely used for tabular financial
prediction because it tends to perform well and offers many tuning and regularization options.


In [4]:
best_xgb, grid_results = mdl.train_xgboost_tuned(X_train, y_train)

print("----- TRAIN -----")
mdl.evaluate_with_threshold(best_xgb, X_train, y_train, top_pct=0.15)

print("----- TEST -----")
mdl.evaluate_with_threshold(best_xgb, X_test, y_test, top_pct=0.15)



→ Best internal validation AUC: 0.8101
→ Best parameters:
 {'learning_rate': 0.1, 'max_depth': 5.0, 'n_estimators': 300.0, 'reg_alpha': 0.1, 'reg_lambda': 10.0}
----- TRAIN -----
Threshold (top 15%): 0.2859
AUC-ROC: 0.846
              precision    recall  f1-score   support

           0      0.917     0.917     0.917     77541
           1      0.527     0.530     0.528     13608

    accuracy                          0.859     91149
   macro avg      0.722     0.723     0.723     91149
weighted avg      0.859     0.859     0.859     91149

----- TEST -----
Threshold (top 15%): 0.2816
AUC-ROC: 0.811
              precision    recall  f1-score   support

           0      0.909     0.908     0.909     19386
           1      0.480     0.482     0.481      3402

    accuracy                          0.845     22788
   macro avg      0.695     0.695     0.695     22788
weighted avg      0.845     0.845     0.845     22788



In [5]:
mdl.save_model(best_xgb, '../models/prosperloan/xgboost.pkl')

All models have been trained and evaluated.
Saved models will be used in the upcoming explainability and robustness audit phases.
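
How `mdl.save_model` serializes models is not shown here; the `.pkl` extension suggests pickle or joblib. The sketch below assumes plain `pickle` and shows a save and reload round trip of the kind the later audit notebooks would need. The helper names `save_model_sketch` and `load_model_sketch` are hypothetical, and a plain dictionary stands in for a fitted estimator.

```python
import pickle
import tempfile
from pathlib import Path

def save_model_sketch(model, path):
    """Serialize `model` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)

def load_model_sketch(path):
    """Load an object written by save_model_sketch.

    Only unpickle files from a trusted source: pickle can execute
    arbitrary code during loading.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Stand-in object; any picklable fitted estimator would work the same way.
demo_model = {"name": "demo", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "models" / "demo.pkl"
    save_model_sketch(demo_model, demo_path)
    restored = load_model_sketch(demo_path)

print(restored == demo_model)
```

Pickled scikit-learn and XGBoost models are generally only reliable when reloaded with the same library versions used to train them, so recording those versions alongside the saved files is worth considering.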
