# Modeling Phase

In this notebook, we will develop baseline models to predict credit default.
These models will later be audited for explainability, robustness, and compliance.

## Objectives
- Train baseline models: Logistic Regression, Random Forest, XGBoost.
- Evaluate performance using AUC-ROC, F1-score, precision, and recall.
- Save models for future auditing.
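
The evaluation step relies on `mdl.evaluate_model`, whose implementation is not shown in this notebook. As a rough illustration only, the sketch below assumes such a helper wraps scikit-learn's `classification_report` and `roc_auc_score` and returns both. The name `evaluate_model_sketch` and the synthetic data are hypothetical and are not part of `audit_tool`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_model_sketch(model, X_test, y_test):
    """Hypothetical stand-in for mdl.evaluate_model: returns (report, auc)."""
    # Hard class predictions feed the precision/recall/F1 report.
    y_pred = model.predict(X_test)
    # AUC needs a ranking score, so use the predicted probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    report = classification_report(y_test, y_pred)
    auc = roc_auc_score(y_test, y_score)
    return report, auc


# Synthetic, imbalanced stand-in data (not the Prosper Loan dataset).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
report, auc = evaluate_model_sketch(demo_model, X_te, y_te)
print(report)
print("AUC:", auc)
```

Computing AUC from probabilities rather than hard labels matters here: the score then reflects how well the model ranks borrowers, independent of any single decision threshold.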


In [1]:
# Credit Risk Audit Tool - Modeling Phase
# Dataset: Prosper Loan Data

import sys
import os

# Add the project root to the Python path
sys.path.append(os.path.abspath('..'))

import pandas as pd
import numpy as np
from audit_tool import modeling as mdl
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

pd.set_option('display.max_columns', None)

X_train = pd.read_csv('../data/processed/prosperloan/X_train.csv')
X_test = pd.read_csv('../data/processed/prosperloan/X_test.csv')
y_train = pd.read_csv('../data/processed/prosperloan/y_train.csv').squeeze()
y_test = pd.read_csv('../data/processed/prosperloan/y_test.csv').squeeze()

print("Data loaded successfully.")


Data loaded successfully.


## Logistic Regression
The Logistic Regression model serves as a baseline. 
Its interpretability will also be useful in the explainability phase.
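
To make the interpretability point concrete, the sketch below fits a scikit-learn `LogisticRegression` on synthetic data and reads each coefficient as an odds ratio. It assumes standardized features so that coefficients are comparable; the data and the `feature_i` column names are hypothetical, and the model returned by `mdl.train_logistic_regression` may be configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
columns = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Standardize so that coefficient magnitudes are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-standard-deviation increase in that feature.
coef_table = pd.DataFrame(
    {"coefficient": clf.coef_[0], "odds_ratio": np.exp(clf.coef_[0])},
    index=columns,
)
print(coef_table.sort_values("coefficient", key=lambda s: s.abs(), ascending=False))
```

An odds ratio above 1 means the feature pushes predictions toward class 1, and a value below 1 pushes them away from it.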


In [2]:
# Train Logistic Regression
log_model = mdl.train_logistic_regression(X_train, y_train)

# Evaluate
report_lr, auc_lr = mdl.evaluate_model(log_model, X_test, y_test)
print("Logistic Regression Report:\n", report_lr)
print("AUC:", auc_lr)

# Save model
mdl.save_model(log_model, '../models/prosperloan/logistic_regression.pkl')


Logistic Regression Report:
               precision    recall  f1-score   support

           0       0.89      0.97      0.93     19293
           1       0.65      0.31      0.42      3355

    accuracy                           0.87     22648
   macro avg       0.77      0.64      0.68     22648
weighted avg       0.86      0.87      0.85     22648

AUC: 0.8753243089564852


## Random Forest
The Random Forest model captures non-linear relationships
and provides feature importances for the explainability phase.
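
As an illustration of the feature importances mentioned above, the sketch below fits a scikit-learn `RandomForestClassifier` on synthetic data and ranks its impurity-based importances. The data, the `feature_i` column names, and the hyperparameters are assumptions for the example; the model built by `mdl.train_random_forest` may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_demo = pd.DataFrame(X_demo, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor high-cardinality features, so an audit may want to cross-check them against permutation importance.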


In [3]:
# Train Random Forest
rf_model = mdl.train_random_forest(X_train, y_train)

# Evaluate
report_rf, auc_rf = mdl.evaluate_model(rf_model, X_test, y_test)
print("Random Forest Report:\n", report_rf)
print("AUC:", auc_rf)

# Save model
mdl.save_model(rf_model, '../models/prosperloan/random_forest.pkl')


Random Forest Report:
               precision    recall  f1-score   support

           0       0.89      0.97      0.93     19293
           1       0.67      0.30      0.42      3355

    accuracy                           0.87     22648
   macro avg       0.78      0.64      0.67     22648
weighted avg       0.86      0.87      0.85     22648

AUC: 0.8717406597436983


## XGBoost
XGBoost, a gradient-boosted tree ensemble, is often used in financial
prediction problems due to its strong predictive performance and flexibility.


In [4]:
# Train XGBoost
xgb_model = mdl.train_xgboost(X_train, y_train)

# Evaluate
report_xgb, auc_xgb = mdl.evaluate_model(xgb_model, X_test, y_test)
print("XGBoost Report:\n", report_xgb)
print("AUC:", auc_xgb)

# Save model
mdl.save_model(xgb_model, '../models/prosperloan/xgboost.pkl')


XGBoost Report:
               precision    recall  f1-score   support

           0       0.89      0.96      0.93     19293
           1       0.63      0.34      0.44      3355

    accuracy                           0.87     22648
   macro avg       0.76      0.65      0.69     22648
weighted avg       0.86      0.87      0.86     22648

AUC: 0.8766865398235989


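All three models have been trained and evaluated on the same test set of 22,648 loans.
Their performance is similar: AUC ranges from 0.872 (Random Forest) to 0.877 (XGBoost), and recall on class 1 stays between 0.30 and 0.34, so each model identifies only about a third of class-1 cases.
The saved models will be used in the upcoming explainability and robustness audit phases.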
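
For the later audit phases, the saved models need to be loaded back. The serialization format used by `mdl.save_model` is not shown here; the sketch below assumes plain `pickle`, as the `.pkl` extension suggests, and uses a tiny stand-in model and a temporary directory rather than the real files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in this notebook it would be log_model, rf_model or xgb_model.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logistic_regression.pkl")

    # Save: assumed to mirror what mdl.save_model does internally.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load: what a later audit notebook would do.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.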
