# Ensemble Models: Random Forest & XGBoost (Imbalance-Aware)

## Objective
Move beyond linear and single-tree baselines to **ensemble models**, which typically cope better with extreme class imbalance.

This notebook:
- Trains a **Random Forest** using `class_weight="balanced"`
- Trains an **XGBoost** model using `scale_pos_weight = (#neg / #pos)` and `eval_metric="aucpr"`
- Evaluates models using **ROC-AUC** and **PR-AUC**
- Saves the trained XGBoost model for explainability analysis in Part D


In [4]:
import os
import numpy as np
import joblib

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    classification_report, confusion_matrix
)

from xgboost import XGBClassifier


In [5]:
# Load processed data
X_train = np.load("../data/processed/v1_train_test/X_train_scaled.npy")
X_test  = np.load("../data/processed/v1_train_test/X_test_scaled.npy")
y_train = np.load("../data/processed/v1_train_test/y_train.npy")
y_test  = np.load("../data/processed/v1_train_test/y_test.npy")


In [6]:
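def evaluate_classifier(name, y_true, y_pred, y_proba):
    """Print ranking metrics (from probabilities) and threshold metrics (from hard labels)."""
    print(f"\n=== {name} ===")
    # ROC-AUC and PR-AUC are threshold-free: they score the ranking given by y_proba.
    print("ROC-AUC:", roc_auc_score(y_true, y_proba))
    print("PR-AUC :", average_precision_score(y_true, y_proba))
    # The confusion matrix and report depend on y_pred, i.e. on the decision threshold.
    print("Confusion Matrix [ [TN FP], [FN TP] ]:")
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=4, zero_division=0))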
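The two ranking metrics printed by `evaluate_classifier` have very different no-skill baselines on imbalanced data, which is why PR-AUC is reported alongside ROC-AUC. The sketch below scores a constant (uninformative) prediction against labels built from the test-set support shown in this notebook's classification reports (98 fraud cases out of 56,962 rows). The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Test-set class counts as reported in the classification reports.
n_total, n_pos = 56962, 98
y_true = np.zeros(n_total, dtype=int)
y_true[:n_pos] = 1

# An uninformative model: every row gets the same score.
constant_scores = np.full(n_total, 0.5)

prevalence = n_pos / n_total
baseline_roc = roc_auc_score(y_true, constant_scores)
baseline_ap = average_precision_score(y_true, constant_scores)

print(f"prevalence       = {prevalence:.4f}")
print(f"no-skill ROC-AUC = {baseline_roc:.4f}")
print(f"no-skill PR-AUC  = {baseline_ap:.4f}")
```

The no-skill PR-AUC equals the positive-class prevalence (about 0.0017 here), while the no-skill ROC-AUC is 0.5, so PR-AUC values should be read against a much lower floor.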


## Random Forest (class_weight="balanced")

Random Forest is a bagging ensemble that reduces variance compared to a single decision tree.
We handle class imbalance using `class_weight="balanced"`, which weights each class inversely to its frequency in the training labels.
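To make the effect of `class_weight="balanced"` concrete, the sketch below reproduces the weights with scikit-learn's `compute_class_weight` helper and with the documented formula `n_samples / (n_classes * np.bincount(y))`. The class counts are the training-split counts printed later in this notebook; the label array is rebuilt from those counts purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-split class counts as printed in this notebook.
n_neg, n_pos = 227451, 394
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights from the documented formula.
manual = len(y) / (2 * np.bincount(y))

print("balanced weights:", dict(zip([0, 1], weights.round(4))))
print("weight ratio (pos / neg):", round(weights[1] / weights[0], 2))
```

The ratio of the two weights equals `#neg / #pos`, so the Random Forest weighting and XGBoost's `scale_pos_weight` express the same relative emphasis on the fraud class.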


In [7]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    min_samples_leaf=50,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

rf_pred  = rf.predict(X_test)
rf_proba = rf.predict_proba(X_test)[:, 1]

evaluate_classifier("Random Forest", y_test, rf_pred, rf_proba)



=== Random Forest ===
ROC-AUC: 0.9711203171476805
PR-AUC : 0.7862166007983233
Confusion Matrix [ [TN FP], [FN TP] ]:
[[56817    47]
 [   12    86]]
              precision    recall  f1-score   support

           0     0.9998    0.9992    0.9995     56864
           1     0.6466    0.8776    0.7446        98

    accuracy                         0.9990     56962
   macro avg     0.8232    0.9384    0.8720     56962
weighted avg     0.9992    0.9990    0.9990     56962



## Class Imbalance Handling for XGBoost

XGBoost supports imbalance handling using:

\[
\mathrm{scale\_pos\_weight} = \frac{\#\,\mathrm{negative}}{\#\,\mathrm{positive}}
\]

This scales the training loss of every fraud (positive-class) example by that factor, so missing a fraud case costs far more than mislabelling a legitimate transaction.
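As a conceptual illustration of that penalty, the sketch below computes a binary log loss in which each positive example's term is multiplied by a weight. This is a simplified stand-in, not XGBoost's internal code: XGBoost applies the weight to the per-row gradient statistics of positive examples, which is assumed here to be equivalent in effect. The function name `weighted_log_loss` and the toy data are hypothetical.

```python
import numpy as np

def weighted_log_loss(y_true, p, pos_weight):
    """Mean binary log loss with each positive example's term scaled by pos_weight."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_sample.mean()

# Toy case: three legitimate rows and one fraud row, all scored 0.1,
# so the single fraud case is effectively missed.
y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.1, 0.1, 0.1])

plain = weighted_log_loss(y, p, pos_weight=1.0)
weighted = weighted_log_loss(y, p, pos_weight=577.29)

print(f"unweighted loss: {plain:.4f}")
print(f"weighted loss  : {weighted:.4f}")
```

With the weight applied, the missed fraud row dominates the loss, which pushes the booster to correct it in later trees.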


In [8]:
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
scale_pos_weight = neg / pos
print(f"neg={neg}, pos={pos}, scale_pos_weight={scale_pos_weight:.2f}")


neg=227451, pos=394, scale_pos_weight=577.29



## XGBoost (Imbalance-Aware Boosting Model)

XGBoost is a gradient boosting model that captures:
- non-linear relationships
- feature interactions
- complex decision boundaries

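We account for class imbalance using:
- `scale_pos_weight`, which reweights the positive class in the training loss
- `eval_metric="aucpr"`, which selects PR-AUC as the metric reported on validation data. It does not change the training objective, and because `fit` is called here without an `eval_set`, it has no effect on this run.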


In [9]:
xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_weight=1,
    gamma=0,
    scale_pos_weight=scale_pos_weight,
    base_score=0.5,
    objective="binary:logistic",
    eval_metric="aucpr",
    random_state=42,
    n_jobs=-1
)

xgb.fit(X_train, y_train)

xgb_pred  = xgb.predict(X_test)
xgb_proba = xgb.predict_proba(X_test)[:, 1]

evaluate_classifier("XGBoost (imbalance-aware)", y_test, xgb_pred, xgb_proba)



=== XGBoost (imbalance-aware) ===
ROC-AUC: 0.9844605962812812
PR-AUC : 0.8729609452078972
Confusion Matrix [ [TN FP], [FN TP] ]:
[[56844    20]
 [   15    83]]
              precision    recall  f1-score   support

           0     0.9997    0.9996    0.9997     56864
           1     0.8058    0.8469    0.8259        98

    accuracy                         0.9994     56962
   macro avg     0.9028    0.9233    0.9128     56962
weighted avg     0.9994    0.9994    0.9994     56962



## Save Trained XGBoost Model

The trained XGBoost model is saved so that the explainability notebook (Part D)
uses the **exact same model** used for evaluation.
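A quick way to confirm that a reloaded model is the same one that was evaluated is to compare its predictions before and after the save. The sketch below shows this round trip with `joblib`. To stay small it uses a tiny `RandomForestClassifier` on random data and a temporary directory as stand-ins, rather than the notebook's XGBoost model and `../models` path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data and model; stand-ins for X_train / y_train and the XGBoost model.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original probabilities exactly.
same = np.array_equal(model.predict_proba(X), restored.predict_proba(X))
print("Predictions identical after reload:", same)
```

In Part D the equivalent call would be `joblib.load("../models/xgb_model.joblib")`, ideally run with the same library versions that were used to save the file.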


In [10]:
os.makedirs("../models", exist_ok=True)
joblib.dump(xgb, "../models/xgb_model.joblib")
print("Saved: ../models/xgb_model.joblib")


Saved: ../models/xgb_model.joblib


## Key Takeaways

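- Random Forest reaches fraud-class precision of 0.6466 at recall 0.8776 (PR-AUC 0.7862), a significant precision gain over Logistic Regression while maintaining strong recall.
- XGBoost achieves the best overall performance: the highest PR-AUC (0.8730 vs. 0.7862) and ROC-AUC (0.9845 vs. 0.9711), with fraud-class precision 0.8058 and recall 0.8469. Random Forest keeps slightly higher recall at the default decision threshold, at the cost of more false positives (47 vs. 20).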
- The saved XGBoost model will be used in Part D for explainability (feature importance + SHAP).
