# 03. Model Development & Evaluation

## 1. Objective
In this notebook, we develop and compare multiple machine learning models to predict loan defaults. We will leverage the preprocessing and resampling pipeline developed in Part 02 to ensure a fair and rigorous comparison.

### 1.1 Tasks
* **Baseline Modeling**: Establish a performance floor with Logistic Regression.
* **Ensemble Methods**: Implement Random Forest and XGBoost to capture complex, non-linear patterns.
* **Hyperparameter Tuning**: Tune the ensemble candidates (Random Forest and XGBoost) for the best balance of Precision and Recall.
* **Model Interpretation**: Use SHAP values to demystify the "black box" and provide business transparency.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib

# Modeling & Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc, f1_score

# Imbalance Handling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Configuration
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
np.random.seed(42)

print("✅ Environment ready for model development.")

✅ Environment ready for model development.


In [2]:
# Absolute, machine-specific paths: adjust DATA_DIR and MODEL_DIR for your environment
DATA_DIR = r"C:\dev\quant_project\homework\data_storage"
MODEL_DIR = r"C:\dev\quant_project\homework\model_storage"

# Ensure directories exist
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

In [3]:
# =================================================================
# BRIDGE: Restoring Part 02 context for independent execution
# =================================================================

# 1. Load Original Data
df = pd.read_csv(os.path.join(DATA_DIR, 'loan_data.csv'))

# 2. Domain Feature Engineering
def engineer_features(input_df):
    df_eng = input_df.copy()
    df_eng['loan_to_income'] = df_eng['loan_amount'] / (df_eng['annual_income'] + 1e-6)
    df_eng['rev_to_loan_ratio'] = (df_eng['monthly_revenue'] * 12) / (df_eng['loan_amount'] + 1e-6)
    df_eng['rev_per_employee'] = (df_eng['monthly_revenue'] * 12) / (df_eng['num_employees'] + 1)
    
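    bins = [300, 580, 670, 740, 800, 850]
    labels = ['High_Risk', 'Subprime', 'Average', 'Good', 'Excellent']
    # include_lowest=True keeps a score of exactly 300 in the first bin instead of mapping it to NaN
    df_eng['credit_tier'] = pd.cut(df_eng['credit_score'], bins=bins, labels=labels, include_lowest=True)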
    return df_eng

df_featured = engineer_features(df)

# 3. Preprocessor Definition
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['annual_income', 'credit_score', 'loan_amount', 'years_in_business', 
                    'num_employees', 'previous_loans', 'debt_to_income_ratio', 
                    'monthly_revenue', 'loan_to_income', 'rev_to_loan_ratio', 'rev_per_employee']

categorical_features = ['loan_purpose', 'industry', 'has_collateral', 
                        'education_level', 'geographic_region', 'credit_tier']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')), ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), categorical_features)
])

# 4. Stratified Split (Maintaining the ~22.5% default rate)
X = df_featured.drop(columns=['application_id', 'default'])
y = df_featured['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"✅ Bridge successful. Raw training size: {len(X_train)} samples.")

✅ Bridge successful. Raw training size: 4000 samples.


In [4]:
# =================================================================
# VALIDATION: Verifying Resampled Training Volume
# We perform a manual check to ensure SMOTE is correctly 
# augmenting the minority class before it hits the model.
# =================================================================

# 1. Apply preprocessing (SMOTE requires numerical inputs)
X_train_preprocessed = preprocessor.fit_transform(X_train)

# 2. Manually apply SMOTE for inspection
# We use sampling_strategy=0.5 (Minority will be 50% of the Majority)
sm = SMOTE(random_state=42, sampling_strategy=0.5)
X_res, y_res = sm.fit_resample(X_train_preprocessed, y_train)

print(f"--- Resampling Verification ---")
print(f"Original Training Size: {len(X_train)}")
print(f"Resampled Training Size: {len(X_res)}")
print(f"Default Samples (Original): {y_train.sum()}")
print(f"Default Samples (After SMOTE): {y_res.sum()}")
print(f"Post-SMOTE Default Rate: {y_res.mean():.2%}")

# Note: The modeling pipelines defined below apply SMOTE automatically during training.

--- Resampling Verification ---
Original Training Size: 4000
Resampled Training Size: 4651
Default Samples (Original): 899
Default Samples (After SMOTE): 1550
Post-SMOTE Default Rate: 33.33%
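
The resampled totals follow directly from `sampling_strategy=0.5`. The check below is a minimal arithmetic sketch using only the counts printed above. It assumes that, for a float strategy, the minority target is the majority count multiplied by the ratio and truncated to an integer, which is consistent with the 1550 defaults reported after SMOTE.

```python
# Counts taken from the verification output above.
n_train = 4000
n_default = 899
n_non_default = n_train - n_default            # majority class size

sampling_strategy = 0.5                        # desired minority / majority ratio
# Assumption: the target minority count is truncated to an integer.
target_default = int(n_non_default * sampling_strategy)
n_synthetic = target_default - n_default       # synthetic defaults created by SMOTE

resampled_size = n_non_default + target_default
post_smote_rate = target_default / resampled_size

print(f"Synthetic samples added: {n_synthetic}")
print(f"Resampled training size: {resampled_size}")
print(f"Post-SMOTE default rate: {post_smote_rate:.2%}")
```

A minority-to-majority ratio of 0.5 therefore corresponds to a default rate of one third, not one half, which is why the post-SMOTE rate reads 33.33%.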


## 2. Evaluation Metric Strategy: AUPRC vs. ROC

In a 22.5% default environment, accuracy is a misleading metric. We prioritize **AUPRC (Area Under the Precision-Recall Curve)** for the following reasons:

* **Class Imbalance**: ROC-AUC can be inflated by high True Negative counts. AUPRC focuses strictly on the "Default" class.
* **Cost of Errors**: 
    * **False Negatives**: Direct capital loss (Primary Concern).
    * **False Positives**: Opportunity cost of lost interest.
* **Precision-Recall Tradeoff**: PR curves allow us to visualize exactly how much "Precision" we lose to achieve a certain "Recall" (e.g., catching 80% of bad loans).
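
To make the AUPRC definition concrete, here is a minimal hand-rolled sketch of average precision, the step-wise summary of the PR curve. The function name `average_precision` and the toy labels and scores are hypothetical and exist only for illustration; in the notebook itself one would more likely call scikit-learn's `average_precision_score` on `y_test` and the predicted probabilities.

```python
def average_precision(y_true, y_score):
    """Step-wise area under the PR curve.

    Sums the precision at every rank where a true default appears and
    divides by the number of defaults. Assumes there are no tied scores.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    true_positives = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            ap += true_positives / rank   # precision at this recall step
    return ap / n_pos


# Hypothetical toy data: 1 = default, 0 = repaid
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
y_score = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

ap = average_precision(y_true, y_score)
baseline = sum(y_true) / len(y_true)      # a no-skill ranker scores about the prevalence

print(f"AUPRC (average precision): {ap:.3f}")
print(f"No-skill baseline:         {baseline:.3f}")
```

Unlike ROC-AUC, whose no-skill baseline is always 0.5, the AUPRC baseline equals the default rate, so a score should always be read against the prevalence of the positive class.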

In [5]:
# Benchmarking 3 core algorithms
models = {
    'Logistic_Regression': LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
    'Random_Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

results = []

for name, model in models.items():
    # Build pipeline including SMOTE
    clf_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
        ('classifier', model)
    ])
    
    # 5-Fold Stratified Cross Validation
    cv_res = cross_validate(clf_pipeline, X_train, y_train, cv=5, 
                            scoring=['f1', 'roc_auc', 'recall', 'precision'])
    
    results.append({
        'Model': name,
        'F1_Mean': cv_res['test_f1'].mean(),
        'Recall_Mean': cv_res['test_recall'].mean(),
        'ROC_AUC_Mean': cv_res['test_roc_auc'].mean()
    })

performance_df = pd.DataFrame(results).sort_values(by='F1_Mean', ascending=False)
display(performance_df)

| | Model | F1_Mean | Recall_Mean | ROC_AUC_Mean |
|---|---|---|---|---|
| 1 | Random_Forest | 0.648382 | 0.566176 | 0.846118 |
| 0 | Logistic_Regression | 0.642895 | 0.709659 | 0.865033 |
| 2 | XGBoost | 0.618902 | 0.541688 | 0.831508 |


## 3. Hyperparameter Tuning: Optimizing Ensemble Models

Our initial benchmarking showed that **Logistic Regression** is a strong baseline, while **Random Forest** and **XGBoost** have potential but require tuning to overcome default parameter limitations.

### 3.1 Tuning Goals:
* **XGBoost**: Focus on preventing overfitting through `max_depth` and `learning_rate`.
* **Random Forest**: Optimize `min_samples_leaf` and `n_estimators` to improve generalization on the minority class.
* **Constraint**: We will continue using the **imblearn Pipeline** during Grid Search to ensure SMOTE is applied correctly within each cross-validation fold.
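
The constraint above matters because resampling before splitting would leak synthetic minority rows into the validation folds. The sketch below illustrates the fold-wise pattern in plain Python. It is not the imblearn implementation: `oversample_minority` is a hypothetical stand-in that duplicates minority rows at random (SMOTE interpolates new points instead), and the 100-row dataset is made up.

```python
import random


def oversample_minority(rows, ratio, rng):
    """Duplicate random minority rows until minority = ratio * majority.

    A simple stand-in for SMOTE, which interpolates instead of duplicating.
    """
    minority = [r for r in rows if r["default"] == 1]
    majority = [r for r in rows if r["default"] == 0]
    target = int(len(majority) * ratio)
    n_extra = max(0, target - len(minority))
    return rows + [rng.choice(minority) for _ in range(n_extra)]


rng = random.Random(42)

# Hypothetical dataset: 100 rows, every fifth row is a default (20% rate)
data = [{"id": i, "default": 1 if i % 5 == 0 else 0} for i in range(100)]

k = 5
fold_size = len(data) // k
folds = [data[f * fold_size:(f + 1) * fold_size] for f in range(k)]

fold_stats = []
for f in range(k):
    valid = folds[f]                                   # never resampled
    train = [row for j, fold in enumerate(folds) if j != f for row in fold]
    train_res = oversample_minority(train, ratio=0.5, rng=rng)

    train_rate = sum(r["default"] for r in train_res) / len(train_res)
    valid_rate = sum(r["default"] for r in valid) / len(valid)
    fold_stats.append((len(train_res), train_rate, valid_rate))
    print(f"fold {f}: train default rate {train_rate:.2%}, "
          f"validation default rate {valid_rate:.2%}")
```

Each validation fold keeps the original class balance, so the cross-validated scores reflect the distribution the model will face on unseen data. Placing SMOTE inside the imblearn `Pipeline` gives the same guarantee during `GridSearchCV`.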

In [6]:
from sklearn.model_selection import GridSearchCV

# 1. Tuning Random Forest
rf_param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_leaf': [1, 2, 4]
}

rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
    ('classifier', RandomForestClassifier(random_state=42))
])

print("Tuning Random Forest...")
rf_grid = GridSearchCV(rf_pipeline, rf_param_grid, cv=3, scoring='f1', n_jobs=-1)
rf_grid.fit(X_train, y_train)

# 2. Tuning XGBoost
xgb_param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [3, 5],
    'classifier__learning_rate': [0.05, 0.1],
    'classifier__scale_pos_weight': [1, 3] # Addresses 22.5% imbalance specifically
}

xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
    ('classifier', XGBClassifier(eval_metric='logloss', random_state=42))
])

print("Tuning XGBoost...")
xgb_grid = GridSearchCV(xgb_pipeline, xgb_param_grid, cv=3, scoring='f1', n_jobs=-1)
xgb_grid.fit(X_train, y_train)

print(f"\n✅ Tuning Complete.")
print(f"Best RF Params: {rf_grid.best_params_}")
print(f"Best XGB Params: {xgb_grid.best_params_}")

Tuning Random Forest...
Tuning XGBoost...

✅ Tuning Complete.
Best RF Params: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 1, 'classifier__n_estimators': 100}
Best XGB Params: {'classifier__learning_rate': 0.05, 'classifier__max_depth': 3, 'classifier__n_estimators': 100, 'classifier__scale_pos_weight': 1}


## 4. The Grand Final: Test Set Performance

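We now evaluate the tuned Random Forest and XGBoost pipelines, alongside the untuned Logistic Regression baseline, on the held-out **X_test / y_test** set. This is the closest proxy we have for how the models would perform in a production environment at FinFlow. Because the champion is also chosen on this set, its reported score is a slightly optimistic estimate.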

In [7]:
from sklearn.metrics import recall_score, f1_score, roc_auc_score, precision_score

final_comparison = []

# Define the finalists
final_models = {
    'Logistic_Regression (Baseline)': Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
        ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
    ]),
    'Tuned_Random_Forest': rf_grid.best_estimator_,
    'Tuned_XGBoost': xgb_grid.best_estimator_
}

for name, model in final_models.items():
    # Fit (LR needs fitting, others are already fitted by GridSearch)
    if name == 'Logistic_Regression (Baseline)':
        model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    final_comparison.append({
        'Model': name,
        'Test_Recall': recall_score(y_test, y_pred),
        'Test_Precision': precision_score(y_test, y_pred),
        'Test_F1': f1_score(y_test, y_pred),
        'Test_ROC_AUC': roc_auc_score(y_test, y_proba)
    })

# Output the results
final_perf_df = pd.DataFrame(final_comparison).sort_values(by='Test_F1', ascending=False)
display(final_perf_df)

# Store the best model for Part 04
champion_name = final_perf_df.iloc[0]['Model']
best_model = final_models[champion_name]
print(f"\n🏆 Champion Model: {champion_name}")

| | Model | Test_Recall | Test_Precision | Test_F1 | Test_ROC_AUC |
|---|---|---|---|---|---|
| 1 | Tuned_Random_Forest | 0.617778 | 0.80814 | 0.700252 | 0.874796 |
| 2 | Tuned_XGBoost | 0.626667 | 0.783333 | 0.696296 | 0.889996 |
| 0 | Logistic_Regression (Baseline) | 0.768889 | 0.615658 | 0.683794 | 0.891533 |

🏆 Champion Model: Tuned_Random_Forest


---

## Final Model Selection & Strategic Rationale

Based on the final test set evaluation, the **Tuned Random Forest** has been identified as the "Champion Model" by our automated scoring logic.

### 4.1 Why Random Forest Won the F1-Score Title
At the default $0.5$ classification threshold, **Random Forest** achieved the highest F1 score of the three candidates:

* **Highest F1-Score ($0.700$):** As the harmonic mean of Precision and Recall, the $F1$ score rewards models that keep both error types low. Random Forest edges out XGBoost ($0.696$) and the Logistic Regression baseline ($0.684$). The margins are small, so this is a narrow lead rather than a decisive win.
* **Superior Precision ($0.808$):** This is the strongest highlight for this model. When the model flags an applicant for "Default," it is correct **80.8%** of the time. Fewer false alarms means fewer creditworthy applicants are rejected, which is critical for maintaining high customer satisfaction.
* **Ranking Quality vs. Threshold Performance:** Random Forest has the lowest test ROC-AUC of the three ($0.875$, versus $0.890$ for XGBoost and $0.892$ for Logistic Regression), so its F1 lead reflects behavior at the current threshold rather than better ranking overall. One possible explanation is that bagged trees are less sensitive to the noise introduced by synthetic oversampling (SMOTE), but this notebook does not test that hypothesis.
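
As a quick sanity check on the harmonic-mean relationship, the sketch below recomputes F1 from the precision and recall values reported in the test-set table. The helper name `f1` is ours; the numbers are copied from the table, so any small discrepancy is rounding.

```python
# Test-set precision and recall as reported in the comparison table.
reported = {
    "Tuned_Random_Forest":            {"recall": 0.617778, "precision": 0.808140},
    "Tuned_XGBoost":                  {"recall": 0.626667, "precision": 0.783333},
    "Logistic_Regression (Baseline)": {"recall": 0.768889, "precision": 0.615658},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {name: f1(m["precision"], m["recall"]) for name, m in reported.items()}

for name, score in sorted(f1_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} F1 = {score:.4f}")
```

The recomputed values match the `Test_F1` column, and they make the size of the lead explicit: the gap between Random Forest and XGBoost is about 0.004.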

### 4.2 Cost-Based Model Selection
In a real-world production environment, we don't just select the highest $F1$; we select the model that minimizes the **Total Economic Cost**. The optimal choice depends on the specific cost weights assigned by the risk committee:

$$\text{Total Cost} = (FN \times \text{Cost}_{\text{Default}}) + (FP \times \text{Cost}_{\text{Opportunity}})$$



* **Case for Logistic Regression (High Recall: $0.769$):** If the cost of a default ($FN$) is significantly higher (e.g., $10\times$) than the cost of losing a customer ($FP$), Logistic Regression might be the preferred choice despite its lower $F1$. It captures nearly **77%** of all actual defaulters in the test set, providing the strongest capital protection of the three candidates.
* **Case for Random Forest (High Precision: $0.808$):** If FinFlow is in a "Growth Phase" where customer acquisition cost is high and market share is the priority, the high Precision of Random Forest is more valuable. It ensures that only the most certain risks are rejected, keeping the "Approval Funnel" healthy.
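
The cost formula can be applied directly to the test-set results. The sketch below is illustrative only. The error counts are reconstructed from the reported recall and precision under the assumption that the test set contains 225 actual defaults (inferred from the recall values, not stated in the notebook), and the cost ratios are hypothetical placeholders for whatever the risk committee would supply.

```python
# Error counts reconstructed from reported recall and precision,
# ASSUMING 225 actual defaults in the test set (inferred, not stated).
errors = {
    "Tuned_Random_Forest":            {"FN": 86, "FP": 33},
    "Tuned_XGBoost":                  {"FN": 84, "FP": 39},
    "Logistic_Regression (Baseline)": {"FN": 52, "FP": 108},
}


def total_cost(fn, fp, cost_default, cost_opportunity):
    """Total Cost = FN * Cost_Default + FP * Cost_Opportunity."""
    return fn * cost_default + fp * cost_opportunity


# Hypothetical ratios: cost of one missed default relative to one rejected good applicant
cheapest_by_ratio = {}
for ratio in (1, 2, 5, 10):
    costs = {
        name: total_cost(e["FN"], e["FP"], cost_default=ratio, cost_opportunity=1)
        for name, e in errors.items()
    }
    cheapest_by_ratio[ratio] = min(costs, key=costs.get)
    print(f"FN:FP cost = {ratio}:1 -> {costs} | cheapest: {cheapest_by_ratio[ratio]}")
```

Under these reconstructed counts, Random Forest is cheapest when the two error types cost about the same, and Logistic Regression takes over once a missed default costs more than roughly 2.2 times a rejected good applicant. The exact break-even point depends on the real confusion matrix and cost estimates.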

### 4.3 Decision: Proceed with Random Forest
We will proceed with **Tuned Random Forest** as our primary model for the next phase. Its high precision offers a stable foundation for automated lending. In the final notebook, we will explore **Threshold Tuning** to see if we can further increase the Recall of this champion model without significantly degrading its precision.

In [9]:
# 1. Reconstruct feature names from the best_model pipeline
cat_encoder = best_model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
cat_cols = list(cat_encoder.get_feature_names_out(categorical_features))
all_feature_names = numeric_features + cat_cols

# 2. Package everything into a dictionary
model_artifacts = {
    'model': best_model,
    'all_feature_names': all_feature_names,
    'numeric_features': numeric_features,
    'categorical_features': categorical_features
}

# 3. Save the artifacts file
model_save_path = os.path.join(MODEL_DIR, 'finflow_artifacts.joblib')
joblib.dump(model_artifacts, model_save_path)

# 4. Save the raw datasets for consistency in Part 04
X_train.to_csv(os.path.join(DATA_DIR, 'X_train.csv'), index=False)
X_test.to_csv(os.path.join(DATA_DIR, 'X_test.csv'), index=False)
y_train.to_csv(os.path.join(DATA_DIR, 'y_train.csv'), index=False)
y_test.to_csv(os.path.join(DATA_DIR, 'y_test.csv'), index=False)