# Modeling

## 1. Preparation

### 🔹 Objective

Load the processed data (`data/processed/`), separate the explanatory variables (features) from the target variable, then split the data into training, validation, and test sets. A random seed is also fixed so that results are reproducible.

In [116]:
# 1. Imports
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import random
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt

In [145]:
# 2. Fix the seed for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
TARGET_NAME = "SalePrice"

# 3. Load the processed data
# Data directory
DATA_DIR = Path("./../data/")
PROCESSED_DIR = DATA_DIR / "processed"

# Load the files
X = pd.read_csv(PROCESSED_DIR / "X_train_processed.csv")
y = pd.read_csv(PROCESSED_DIR / "y_train.csv").squeeze()  # .squeeze() returns a Series when there is a single column
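
As a quick illustration of why the seed is fixed, the sketch below (an illustrative example, not part of the original pipeline) re-seeds both global generators and checks that the same draws come back. The helper name `draw_sample` is hypothetical.

```python
import random
import numpy as np

SEED = 42

def draw_sample(seed):
    """Re-seed both global generators and return one draw from each."""
    np.random.seed(seed)
    random.seed(seed)
    return np.random.rand(3), random.random()

first_np, first_py = draw_sample(SEED)
second_np, second_py = draw_sample(SEED)

# Identical seeds give identical draws, so seeded steps are repeatable
print(np.array_equal(first_np, second_np), first_py == second_py)
```

Note that `train_test_split` and the estimators used later receive the seed explicitly through `random_state`, so they do not rely on the global generators.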

**Train / validation / test split**

In [146]:
# 4. Split the data: 80% train, then the remaining 20% halved into test and validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
X_test, X_val, y_test, y_val = train_test_split(
    X_val, y_val, test_size=0.5, random_state=SEED)

print("X_train shape:", X_train.shape)
print("X_val shape  :", X_val.shape)
print("X_test shape :", X_test.shape)

X_train shape: (1168, 75)
X_val shape  : (146, 75)
X_test shape : (146, 75)
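
The two-stage split above amounts to an 80/10/10 partition. Here is a minimal sketch with plain index arrays, assuming 1,460 rows (the total implied by the printed shapes). It mirrors the proportions only, not the exact rows `train_test_split` would select, and `three_way_split` is a hypothetical helper.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle row indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_frac))
    n_test = int(round(n_samples * test_frac))
    n_train = n_samples - n_val - n_test
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1460)
print(len(train_idx), len(val_idx), len(test_idx))  # 1168 146 146
```

The three index sets are disjoint, which is what keeps the test rows out of both training and model selection.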


**The log transform is applied to the targets**

`log1p(y)` computes **log(1 + y)**, so all metrics reported below are on the log scale.

In [148]:
y_train = np.log1p(y_train)
y_val = np.log1p(y_val)
y_test = np.log1p(y_test)
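
Because the models are trained on `log1p(y)`, their predictions have to be mapped back with `np.expm1` before they can be read as prices. A small sketch with made-up prices (the values are illustrative, not taken from the dataset):

```python
import numpy as np

# Illustrative sale prices, not taken from the dataset
prices = np.array([100_000.0, 200_000.0, 755_000.0])

log_prices = np.log1p(prices)      # log(1 + y), the scale the models are trained on
recovered = np.expm1(log_prices)   # exp(z) - 1, the inverse transform

print(np.allclose(recovered, prices))  # True
```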


## 3. Baseline

### Objective

- Build a simple model (linear regression)
- Compute its performance
- Log the results to MLflow

In [152]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import mlflow
import mlflow.sklearn
import numpy as np


Evaluation function

In [153]:
def evaluate_model(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {"rmse": rmse, "mae": mae, "r2": r2}
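
To make the three metrics concrete, here is a hand-computed sketch in plain NumPy on a toy example (illustrative values). It follows the standard definitions of RMSE, MAE and R² rather than calling scikit-learn, and `manual_metrics` is a hypothetical name.

```python
import numpy as np

def manual_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))            # root of the mean squared error
    mae = np.mean(np.abs(errors))                   # mean absolute error
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": float(rmse), "mae": float(mae), "r2": float(r2)}

toy = manual_metrics([1, 2, 3, 4], [1, 2, 3, 5])
print({k: round(v, 4) for k, v in toy.items()})  # {'rmse': 0.5, 'mae': 0.25, 'r2': 0.8}
```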


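Training function with MLflow

In [None]:
def train_baseline_model(X_train, y_train, X_val, y_val):
    # The run name is an assumed label; adapt it to your tracking conventions
    with mlflow.start_run(run_name="Baseline_LinearRegression"):
        model = LinearRegression()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        metrics = evaluate_model(y_val, y_pred)
        # Record the validation metrics in the active run
        mlflow.log_metrics(metrics)

    return model, metrics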


Run the training

In [155]:
baseline_model, baseline_metrics = train_baseline_model(X_train, y_train, X_val, y_val)
baseline_metrics



{'rmse': 0.14197568102675445,
 'mae': 0.09991617009387864,
 'r2': 0.8940443634854975}

Display the results

In [157]:
for metric, value in baseline_metrics.items():
    print(f"{metric.upper()}: {value:.4f}")

RMSE: 0.1420
MAE: 0.0999
R2: 0.8940
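
Since the targets were log-transformed, these errors are in log units. As a rough reading aid (an approximation that treats one RMSE as a typical log-error and ignores the `+1` in `log1p`, which is negligible for house prices), `expm1` converts a log-scale error into a relative price error:

```python
import math

baseline_rmse_log = 0.14197568102675445  # baseline RMSE reported above, in log units

# A log-error of d corresponds to a multiplicative error of exp(d),
# i.e. a relative deviation of exp(d) - 1 on the price scale.
relative_error = math.expm1(baseline_rmse_log)
print(f"Typical relative error: about {relative_error:.1%}")
```

Under this reading, the baseline is typically off by roughly 15% of the sale price.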


## 4. Model experimentation

Useful imports

In [158]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error



#### Generic training function + MLflow logging (sklearn models)

In [None]:
def train_and_log_model(name, model_class, params, X_train, y_train, X_val, y_val):
    with mlflow.start_run(run_name=name):
        model = model_class(**params)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        metrics = evaluate_model(y_val, y_pred)

        # Record the hyperparameters and validation metrics in the active run
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)

    return model, metrics


Experiment 1: Random Forest

In [160]:
rf_params = {
    "n_estimators": 100,
    "max_depth": 5,
    "random_state": 42
}

rf_model, rf_metrics = train_and_log_model(
    name="RandomForest",
    model_class=RandomForestRegressor,
    params=rf_params,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val
)




Experiment 2: XGBoost

In [161]:
xgb_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 4,
    "random_state": 42
}

xgb_model, xgb_metrics = train_and_log_model(
    name="XGBoost",
    model_class=XGBRegressor,
    params=xgb_params,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val
)




Experiment 3: SVR

In [162]:
svr_params = {
    "kernel": "rbf",
    "C": 100,
    "epsilon": 0.1
}

svr_model, svr_metrics = train_and_log_model(
    name="SVR",
    model_class=SVR,
    params=svr_params,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val
)




Experiment 4: KNN

In [163]:
knn_params = {
    "n_neighbors": 5
}

knn_model, knn_metrics = train_and_log_model(
    name="KNN",
    model_class=KNeighborsRegressor,
    params=knn_params,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val
)






##### Performance summary

In [164]:
results = {
    "RandomForest": rf_metrics,
    "XGBoost": xgb_metrics,
    "SVR": svr_metrics,
    "KNN": knn_metrics
}

for model_name, scores in results.items():
    print(f"\n{model_name}")
    for metric, value in scores.items():
        print(f"  {metric.upper()}: {value:.4f}")



RandomForest
  RMSE: 0.1540
  MAE: 0.1107
  R2: 0.8754

XGBoost
  RMSE: 0.1344
  MAE: 0.0945
  R2: 0.9051

SVR
  RMSE: 0.1596
  MAE: 0.1143
  R2: 0.8660

KNN
  RMSE: 0.1742
  MAE: 0.1220
  R2: 0.8405
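
To compare the candidates at a glance, the validation scores printed above (including the baseline) can be sorted by RMSE. A minimal sketch using the reported values, rounded as printed; `reported` and `ranking` are illustrative names:

```python
# Validation scores as printed above (rounded to four decimals)
reported = {
    "LinearRegression (baseline)": {"rmse": 0.1420, "mae": 0.0999, "r2": 0.8940},
    "RandomForest": {"rmse": 0.1540, "mae": 0.1107, "r2": 0.8754},
    "XGBoost": {"rmse": 0.1344, "mae": 0.0945, "r2": 0.9051},
    "SVR": {"rmse": 0.1596, "mae": 0.1143, "r2": 0.8660},
    "KNN": {"rmse": 0.1742, "mae": 0.1220, "r2": 0.8405},
}

# Lower RMSE is better, so sort in ascending order
ranking = sorted(reported.items(), key=lambda item: item[1]["rmse"])

for rank, (name, scores) in enumerate(ranking, start=1):
    print(f"{rank}. {name:<28} RMSE={scores['rmse']:.4f}  R2={scores['r2']:.4f}")

best_name = ranking[0][0]
```

With these numbers XGBoost ranks first and the linear baseline second, ahead of Random Forest, SVR and KNN.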


## 5. Hyperparameter tuning
We now tune the hyperparameters of the XGBoost model using GridSearchCV.

The goal is to further improve its performance while logging the results with MLflow.

#### Defining the search grid

In [None]:
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [2, 3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.6, 0.8, 1.0],
}
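
The size of this grid determines the cost of the search, since every combination is fitted once per fold. A quick count with `itertools.product`, assuming the 3-fold setting (`cv=3`) used for the search:

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
n_folds = 3  # matches cv=3 in the grid search

# One candidate per element of the Cartesian product of all value lists
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(candidates), len(candidates) * n_folds)  # 135 405
```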

Cross-validated search + MLflow logging

In [None]:
with mlflow.start_run(run_name="XGBoost_Tuning"):

    xgb = XGBRegressor(random_state=42, n_jobs=-1, verbosity=0)
    grid_search = GridSearchCV(
        estimator=xgb,
        param_grid=param_grid,
        cv=3,
        scoring='neg_root_mean_squared_error',
        verbose=1,
        n_jobs=-1
    )

    grid_search.fit(X_train, y_train)

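    # Best hyperparameters
    best_params = grid_search.best_params_
    print("Best Parameters:", best_params)

    # Evaluation on the validation set
    best_model = grid_search.best_estimator_
    val_preds = best_model.predict(X_val)
    eval_metrics = evaluate_model(y_val, val_preds)

    # Log the selected hyperparameters and validation metrics to the active run
    mlflow.log_params(best_params)
    mlflow.log_metrics(eval_metrics)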

    print(f"Validation RMSE: {eval_metrics['rmse']:.2f}")
    print(f"Validation MAE: {eval_metrics['mae']:.2f}")
    print(f"Validation R²: {eval_metrics['r2']:.4f}")


Fitting 3 folds for each of 135 candidates, totalling 405 fits


Best Parameters: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 0.8}




Validation RMSE: 0.13
Validation MAE: 0.10
Validation R²: 0.9080
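
For intuition about what `cv=3` and `scoring='neg_root_mean_squared_error'` do, here is a simplified NumPy sketch: contiguous, unshuffled folds, an ordinary least-squares model as a stand-in for XGBoost, and the RMSE negated so that larger is better. The helper names and the synthetic data are assumptions for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=3):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def cv_neg_rmse(X, y, n_splits=3):
    """Average negated RMSE of a least-squares fit over the folds."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        # Ordinary least squares with an intercept column (stand-in model)
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        rmse = np.sqrt(np.mean((y[val_idx] - A_val @ coef) ** 2))
        scores.append(-rmse)  # negated: a higher score means a lower error
    return float(np.mean(scores))

# Synthetic regression problem with a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=90)

score = cv_neg_rmse(X_demo, y_demo)
print(f"Mean CV score (negated RMSE): {score:.4f}")
```

The grid search keeps the candidate with the highest such score, which is the one with the lowest cross-validated RMSE.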


##### Storing the metrics and parameters of the best model

In [None]:
import pickle
from pathlib import Path

# Build dictionaries with the reference metrics (thresholds) and the best hyperparameters
metrics = {
    "rmse": eval_metrics["rmse"],
    "mae": eval_metrics["mae"],
    "r2": eval_metrics["r2"]
}
params = {
    "n_estimators": best_params["n_estimators"],
    "max_depth": best_params["max_depth"],
    "learning_rate": best_params["learning_rate"],
    "subsample": best_params["subsample"]
}

# Save to .pkl files
metrics_path = Path("./../models/xgb_best_metrics.pkl")
params_path = Path("./../models/xgb_best_params.pkl")

with open(metrics_path, "wb") as f:
    pickle.dump(metrics, f)
    
with open(params_path, "wb") as f:
    pickle.dump(params, f)
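
Stored this way, the metrics can be reloaded later, for example as reference values when checking a retrained model. A hedged sketch of such a round trip, using a temporary directory and illustrative numbers; the helper `meets_threshold` and its tolerance are assumptions, not part of the project:

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative values only; the real ones come from eval_metrics
reference = {"rmse": 0.13, "mae": 0.10, "r2": 0.908}

def meets_threshold(new_rmse, saved, tolerance=0.05):
    """Hypothetical check: accept an RMSE within `tolerance` of the saved one."""
    return new_rmse <= saved["rmse"] * (1 + tolerance)

with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = Path(tmp_dir) / "xgb_best_metrics.pkl"
    with open(metrics_path, "wb") as f:
        pickle.dump(reference, f)
    with open(metrics_path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == reference, meets_threshold(0.135, loaded))
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.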
    

## 6. Final evaluation

We now evaluate **the best model** on a **test** set that was **never seen** during training or tuning.

In [168]:
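# Final predictions
y_pred = best_model.predict(X_test)

# Metrics (looked up by key rather than relying on dict ordering)
test_metrics = evaluate_model(y_test, y_pred)
rmse, mae, r2 = test_metrics["rmse"], test_metrics["mae"], test_metrics["r2"]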

print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test R²: {r2:.4f}")

Test RMSE: 0.1534
Test MAE: 0.0981
Test R²: 0.8703


## 7. Export & Tracking

In [171]:
import joblib

# Save locally
joblib.dump(best_model, "./../models/best_model.pkl")


['./../models/best_model.pkl']