# Notebook 2:<br> **Model Training, Tuning, and Evaluation**

This notebook trains and optimizes classification models for the OULAD early-warning task using the preprocessed artifacts from `01_dataset_preprocessing.ipynb`.

**Prediction time:** <br>
Day `CUTOFF_DAY` (as defined in Notebook 1)  

**Target:** <br>
Each student's final outcome is mapped to a three-class `risk_tier` label:
1. Low Risk: Pass/Distinction  
2. Medium Risk: Fail  
3. High Risk: Withdrawn  
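
The mapping above can be sketched as a dictionary lookup. This is a minimal illustration only: it assumes the raw OULAD outcome is stored in a `final_result` column with the values `Pass`, `Distinction`, `Fail`, and `Withdrawn`. The real mapping is done in Notebook 1 and may differ in detail; `RISK_TIER_MAP` and the toy frame are hypothetical names.

```python
import pandas as pd

# Hypothetical mapping from OULAD outcome labels to the three risk tiers
RISK_TIER_MAP = {
    "Pass": "Low Risk",
    "Distinction": "Low Risk",
    "Fail": "Medium Risk",
    "Withdrawn": "High Risk",
}

# Toy data standing in for the real student table (column name is an assumption)
toy = pd.DataFrame({"final_result": ["Pass", "Withdrawn", "Fail", "Distinction"]})
toy["risk_tier"] = toy["final_result"].map(RISK_TIER_MAP)

# Unmapped outcomes would surface as NaN, so fail fast if any appear
assert toy["risk_tier"].notna().all(), "Unmapped final_result value found."
print(toy)
```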

## Setup Notebook

In [18]:
# Imports

# Standard library imports
from pathlib import Path
import json

# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib

# Sklearn imports
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedGroupKFold, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    f1_score, balanced_accuracy_score, precision_score, recall_score,
    classification_report, confusion_matrix,
    roc_curve, precision_recall_curve, auc
)
from sklearn.preprocessing import label_binarize

# Project utilities
import sys
sys.path.append(str(Path.cwd().parent))
from utils import summarize_cv, SCORING_METRICS

# Constants (match Notebook 1)
RANDOM_STATE = 42
CUTOFF_DAY = 98     # Important: Ensure that this is the same as in Notebook 1

# Directories for outputs
# Prefer ./outputs under the current working directory; fall back to ../outputs
# (robust when the kernel's cwd varies)
candidate_out = Path.cwd() / "outputs"
if candidate_out.exists():
    OUT_DIR = candidate_out
else:
    OUT_DIR = Path("../outputs")
FIG_DIR = OUT_DIR / "figures"
TAB_DIR = OUT_DIR / "tables"
DATA_DIR = OUT_DIR / "data"
MODEL_DIR = OUT_DIR / "models"

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

## Data Loading and Setup

### 1. Train/Test Data

Load the train/test splits and the fitted preprocessing pipeline saved by Notebook 1:

In [19]:
def load_series_csv(path: Path) -> pd.Series:
    """Load a CSV as a Series by taking its last column.

    Works for single-column files and for files with a leading saved-index column.
    """
    df = pd.read_csv(path)
    return df.iloc[:, -1]

# Paths:
X_train_path = DATA_DIR / "processed" / "X_train_raw.csv"
X_test_path  = DATA_DIR / "processed" / "X_test_raw.csv"
y_train_path = DATA_DIR / "processed" / "y_train.csv"
y_test_path  = DATA_DIR / "processed" / "y_test.csv"
groups_path  = DATA_DIR / "processed" / "groups_train.csv"
preprocess_path = MODEL_DIR / "preprocess_pipeline.joblib"

# Load processed data:
X_train = pd.read_csv(X_train_path)
X_test  = pd.read_csv(X_test_path)
y_train = load_series_csv(y_train_path)
y_test  = load_series_csv(y_test_path)
groups_train = load_series_csv(groups_path)

preprocess = joblib.load(preprocess_path)

print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)
print("groups_train:", groups_train.shape)

X_train: (20284, 37)
X_test: (5069, 37)
y_train: (20284,)
y_test: (5069,)
groups_train: (20284,)


### 2. Sanity Checks

In [20]:
# Show the % distribution of classes in the TRAIN split
print("Train class distribution (%):")
print((y_train.value_counts(normalize=True) * 100).round(1))  # normalize=True -> proportions; *100 -> percent

# Show the % distribution of classes in the TEST split
print("\nTest class distribution (%):")
print((y_test.value_counts(normalize=True) * 100).round(1))

# Ensure training matrices/vectors align:
# - X_train rows must match y_train labels
# - groups_train must have one group ID per training row (for group-aware CV)
assert len(X_train) == len(y_train) == len(groups_train), "Train X/y/groups lengths do not match."

# Ensure test features and labels align (one label per test row)
assert len(X_test) == len(y_test), "Test X/y lengths do not match."


Train class distribution (%):
risk_tier
Low Risk       60.7
Medium Risk    27.8
High Risk      11.5
Name: proportion, dtype: float64

Test class distribution (%):
risk_tier
Low Risk       60.7
Medium Risk    27.8
High Risk      11.5
Name: proportion, dtype: float64


### 3. Cross-Validation Setup

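Cross-validation (CV) gives a more reliable estimate of model performance than a single split. This notebook uses **StratifiedGroupKFold**, which keeps class proportions similar across folds while placing all rows that share a group ID in the same fold, so no group appears in both the training and validation parts of a fold. **Macro-F1** is the primary metric because it weights every class equally regardless of its frequency, which suits the imbalanced risk tiers.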


In [21]:
# Define a 5-fold stratified, group-aware CV splitter
cv = StratifiedGroupKFold(
    n_splits=5,               # number of folds
    shuffle=True,             # shuffle before splitting for randomness
    random_state=RANDOM_STATE # reproducible fold assignments
)

# Scoring metrics imported from utils.py:
# - macro_f1: averages F1 across classes
# - balanced_acc: averages recall across classes
# - precision_macro / recall_macro: diagnostics to understand trade-offs across classes
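
The contents of `utils.py` are not shown in this notebook. As a rough guide to what the imported names might look like, here is a hypothetical sketch inferred from the metric names above and the column names of the comparison table. `SCORING_METRICS_SKETCH` and `summarize_cv_sketch` are illustrative stand-ins, not the project's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for utils.SCORING_METRICS: short name -> sklearn scorer string
SCORING_METRICS_SKETCH = {
    "macro_f1": "f1_macro",
    "balanced_acc": "balanced_accuracy",
    "precision_macro": "precision_macro",
    "recall_macro": "recall_macro",
}

def summarize_cv_sketch(name, res):
    """Collapse a cross_validate result dict into one row of means and stds."""
    row = {"model": name}
    for metric in SCORING_METRICS_SKETCH:
        scores = res[f"test_{metric}"]
        row[f"{metric}_mean"] = float(np.mean(scores))
        row[f"{metric}_std"] = float(np.std(scores))
    row["fit_time_mean"] = float(np.mean(res["fit_time"]))
    return row

# Toy 3-class problem, only to exercise the helper end to end
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
res_toy = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=3, scoring=SCORING_METRICS_SKETCH
)
row_toy = summarize_cv_sketch("logreg_toy", res_toy)
print(row_toy)
```

Balanced accuracy is the unweighted mean of per-class recall, so `balanced_acc` and `recall_macro` coincide whenever every class is present in the evaluated fold.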

## Models

### 4. Models Used

1. **Logistic Regression (Multinomial)**, since it models all classes jointly in a single probability framework.

2. **Random Forest**, since it reduces overfitting by training many diverse decision trees and averaging their predictions, while still offering practical interpretability through feature-importance scores.
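
To make "a single probability framework" concrete, the sketch below fits a three-class logistic regression on synthetic data and checks that its predicted probabilities are a softmax over the per-class linear scores. The data and the names `X_soft`, `y_soft`, and `clf_soft` are illustrative assumptions. It relies on the behavior of recent scikit-learn versions, where the default `lbfgs` solver fits a multinomial model for three or more classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (stand-in, not the OULAD features)
X_soft, y_soft = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

clf_soft = LogisticRegression(max_iter=1000).fit(X_soft, y_soft)

# One linear score per class, shape (n_samples, 3)
scores = clf_soft.decision_function(X_soft)

# Softmax: shift for numerical stability, exponentiate, normalize each row
shifted = scores - scores.max(axis=1, keepdims=True)
manual_proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

proba = clf_soft.predict_proba(X_soft)
print("softmax matches predict_proba:", np.allclose(manual_proba, proba))
print(proba[:3].round(3))
```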


#### 4.1. Logistic Regression (Multinomial)

In [22]:
logreg_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(
        max_iter=3000,
        solver="saga",
        random_state=RANDOM_STATE
    ))
])

#### 4.2. Random Forest

In [23]:
rf_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(
        n_estimators=500,
        random_state=RANDOM_STATE,
        n_jobs=-1,
        class_weight="balanced_subsample"
    ))
])


#### 4.3. Define Models Based on Pipelines

In [24]:
models = {
    "logreg_default": logreg_pipe,
    "rf_default": rf_pipe,
}


In [25]:
rows = []
for name, pipe in models.items():
    res = cross_validate(
        pipe,
        X_train,
        y_train,
        groups=groups_train,
        cv=cv,
        scoring=SCORING_METRICS,
        return_train_score=False
    )
    rows.append(summarize_cv(name, res))

compare_df = pd.DataFrame(rows).sort_values("macro_f1_mean", ascending=False)
compare_df.to_csv(TAB_DIR / "table_12_model_comparison_cv_train_only.csv", index=False)
compare_df

| | model | macro_f1_mean | macro_f1_std | balanced_acc_mean | balanced_acc_std | precision_macro_mean | precision_macro_std | recall_macro_mean | recall_macro_std | fit_time_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rf_default | 0.498384 | 0.002729 | 0.512165 | 0.00314 | 0.611766 | 0.021072 | 0.512165 | 0.00314 | 12.316097 |
| 0 | logreg_default | 0.455669 | 0.003851 | 0.474944 | 0.003341 | 0.562743 | 0.059405 | 0.474944 | 0.003341 | 15.534321 |
