# Model Benchmarking & Audit

This notebook compares several regression algorithms on the project's two scenarios, both of which predict the final grade G3:
1. **Early Prediction**: without the G1/G2 grades.
2. **Late Prediction**: with the G1/G2 grades.

## Models Tested
- **Linear Regression**: simple baseline.
- **Random Forest**: robust ensemble model.
- **Ensemble (Voting)**: averages the predictions of Linear Regression and Random Forest.
- **LightGBM**: gradient boosting, a common strong choice for tabular data.
- **SVR** (Support Vector Regressor): often works well on small datasets.
- **MLPRegressor**: simple neural network.
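
To make the voting ensemble concrete, here is a minimal sketch on synthetic data (an assumption: `make_regression` stands in for the student dataset, and the `*_demo` names are illustrative). It checks that an unweighted `VotingRegressor` predicts the plain mean of its members' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: not the student dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voting = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
])
voting.fit(X_demo, y_demo)

# The fitted member models are exposed through named_estimators_
pred_lr = voting.named_estimators_['lr'].predict(X_demo)
pred_rf = voting.named_estimators_['rf'].predict(X_demo)
pred_voting = voting.predict(X_demo)

# With no weights given, the ensemble output is the mean of both models
matches_mean = np.allclose(pred_voting, (pred_lr + pred_rf) / 2)
print(matches_mean)
```

Passing `weights=[...]` to `VotingRegressor` would turn this into a weighted average instead.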

## Methodology
- **Cross-Validation**: 5-fold, so that every score is measured on data the model was not trained on.
- **Metrics**: RMSE (root mean squared error, in grade points) and R2 (coefficient of determination).
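
As an illustration of this protocol, the sketch below computes both metrics over the same five folds with `cross_validate`. It is a sketch under assumptions, not the notebook's own evaluation loop: the data is synthetic, the `KFold` is explicitly shuffled, and the `*_demo` and `*_per_fold` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the student dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Explicit, shuffled 5-fold split (assumption: chosen here for illustration)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One cross-validation pass that scores both metrics on identical folds
scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv,
    scoring={'rmse': 'neg_root_mean_squared_error', 'r2': 'r2'},
)

# scikit-learn maximises scores, so RMSE comes back negated
rmse_per_fold = -scores['test_rmse']
r2_per_fold = scores['test_r2']
print(f"RMSE: {rmse_per_fold.mean():.4f} +/- {rmse_per_fold.std():.4f}")
print(f"R2:   {r2_per_fold.mean():.4f} +/- {r2_per_fold.std():.4f}")
```

Reporting the per-fold standard deviation next to the mean helps judge whether small gaps between models are meaningful.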

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
import lightgbm as lgb
import time

In [6]:
# Load the data
mat = pd.read_csv('../sources/student/student-mat.csv', sep=';')
por = pd.read_csv('../sources/student/student-por.csv', sep=';')

# Simple merge: stack the two course datasets row-wise
df = pd.concat([mat, por], ignore_index=True)

# Basic feature engineering
df['TotalAlc'] = df['Dalc'] + df['Walc']            # workday + weekend alcohol consumption
df['ParentEdu'] = df['Medu'] + df['Fedu']           # mother's + father's education level
df['HasFailed'] = (df['failures'] > 0).astype(int)  # 1 if at least one past failure

In [7]:
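def get_data(scenario):
    """Return (X, y) for the 'Early' or 'Late' scenario; y is the final grade G3."""
    data = df.copy()
    if scenario == 'Early':
        # Early: drop the period grades G1/G2 along with the target
        X = data.drop(['G1', 'G2', 'G3'], axis=1)
    else:  # Late
        # Late: keep G1 and G2 as features, drop only the target
        X = data.drop(['G3'], axis=1)

    y = data['G3']
    return X, y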

categorical_features = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']

In [8]:
# Models
lr = LinearRegression()
rf = RandomForestRegressor(random_state=42)
# VotingRegressor averages the predictions of its fitted estimators
ensemble = VotingRegressor(estimators=[('lr', lr), ('rf', rf)])

models = {
    'Linear Regression': lr,
    'Random Forest': rf,
    'Ensemble (LR + RF)': ensemble,
    'LightGBM': lgb.LGBMRegressor(random_state=42, verbose=-1),
    'SVR': SVR(),
    'MLP (Neural Net)': MLPRegressor(random_state=42, max_iter=500)
}

print("| Scenario | Model | RMSE (CV) | R2 (CV) | Time (s) |")
print("|---|---|---|---|---|")

for scenario in ['Early', 'Late']:
    X, y = get_data(scenario)
    
    num_feats = [c for c in X.columns if c not in categorical_features]
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), num_feats),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ])

    for name, model in models.items():
        start = time.time()
        pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('model', model)])
        
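        # Performance (RMSE & R2). With cv=5 and a regressor, scikit-learn uses
        # an unshuffled KFold, so folds follow the mat-then-por row order.
        # The scorer returns negated RMSE (higher is better), hence the sign flip.
        cv_rmse = cross_val_score(pipeline, X, y, cv=5, scoring='neg_root_mean_squared_error')
        rmse = -cv_rmse.mean()
        
        cv_r2 = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
        r2 = cv_r2.mean()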
        
        end = time.time()
        duration = end - start
        
        print(f"| {scenario} | {name} | {rmse:.4f} | {r2:.4f} | {duration:.4f} |")

| Scenario | Model | RMSE (CV) | R2 (CV) | Time (s) |
|---|---|---|---|---|
| Early | Linear Regression | 3.5821 | 0.0228 | 0.0758 |
| Early | Random Forest | 3.3330 | 0.1131 | 2.5767 |
| Early | Ensemble (LR + RF) | 3.3302 | 0.1469 | 2.6986 |
| Early | LightGBM | 3.4969 | 0.0099 | 3.3790 |
| Early | SVR | 3.3577 | 0.1648 | 0.2536 |
| Early | MLP (Neural Net) | 3.8192 | -0.1987 | 4.2673 |
| Late | Linear Regression | 1.6549 | 0.7839 | 0.0695 |
| Late | Random Forest | 1.6541 | 0.7746 | 2.4708 |
| Late | Ensemble (LR + RF) | 1.5910 | 0.7955 | 2.4667 |
| Late | LightGBM | 1.7296 | 0.7526 | 3.4117 |
| Late | SVR | 1.9710 | 0.7142 | 0.2621 |
| Late | MLP (Neural Net) | 1.8951 | 0.7037 | 4.2729 |
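
One possible way to read these results side by side is sketched below. The scores are copied from the table above; the names `results`, `rmse_wide`, `Gain` and `best_per_scenario` are illustrative and not part of the notebook.

```python
import pandas as pd

# RMSE and R2 values copied from the results table above
results = pd.DataFrame(
    [
        ('Early', 'Linear Regression', 3.5821, 0.0228),
        ('Early', 'Random Forest', 3.3330, 0.1131),
        ('Early', 'Ensemble (LR + RF)', 3.3302, 0.1469),
        ('Early', 'LightGBM', 3.4969, 0.0099),
        ('Early', 'SVR', 3.3577, 0.1648),
        ('Early', 'MLP (Neural Net)', 3.8192, -0.1987),
        ('Late', 'Linear Regression', 1.6549, 0.7839),
        ('Late', 'Random Forest', 1.6541, 0.7746),
        ('Late', 'Ensemble (LR + RF)', 1.5910, 0.7955),
        ('Late', 'LightGBM', 1.7296, 0.7526),
        ('Late', 'SVR', 1.9710, 0.7142),
        ('Late', 'MLP (Neural Net)', 1.8951, 0.7037),
    ],
    columns=['Scenario', 'Model', 'RMSE', 'R2'],
)

# One row per model, with Early and Late RMSE side by side
rmse_wide = results.pivot(index='Model', columns='Scenario', values='RMSE')
# Reduction in RMSE once G1/G2 are available as features
rmse_wide['Gain'] = rmse_wide['Early'] - rmse_wide['Late']

# Lowest-RMSE model in each scenario
best_idx = results.groupby('Scenario')['RMSE'].idxmin()
best_per_scenario = results.loc[best_idx, ['Scenario', 'Model', 'RMSE']]

print(rmse_wide.sort_values('Late'))
print(best_per_scenario)
```

Ranking by RMSE and by R2 need not agree, because each mean is averaged over folds separately; in the Early scenario, for instance, SVR has the highest R2 while the ensemble has the lowest RMSE.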


