# Notebook: Multi-class Prediction of Patient Outcomes

**Team name**: Les goats

**Members**: ALAQAD Zachary, SABI Yanis, MICHON Louis

**Objective**: Reach the best possible score using XGBoost and Optuna

**CV score ≈ 0.35626** (multi-class log loss, lower is better)

This notebook implements:
- Feature engineering (medical ratios, log transforms, missing-value count)
- Hyperparameter optimization with **Optuna**
- **StratifiedKFold** cross-validation
- Final training and submission generation


## Introduction and library imports

In this Kaggle challenge, we tackle a multi-class classification problem on medical data. The goal is to predict each patient's outcome among three categories:
**C** means the patient is alive at N_Days, **CL** means the patient is alive at N_Days thanks to a liver transplant, and **D** means the patient died at N_Days. The evaluation metric is the multi-class log loss.
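
To make the metric concrete, here is a minimal sketch of the multi-class log loss: the average of the negative log-probability assigned to each patient's true class. The function name `multiclass_log_loss`, the toy probabilities, and the class encoding (0 = C, 1 = CL, 2 = D) are illustrative assumptions, not part of the competition code. `sklearn.metrics.log_loss`, used later in this notebook, should give the same value on these inputs.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is overconfident
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical encoding: 0 = C, 1 = CL, 2 = D
y_true = [0, 2, 1]
probs = [
    [0.90, 0.02, 0.08],  # confident and correct
    [0.20, 0.05, 0.75],  # fairly confident and correct
    [0.50, 0.30, 0.20],  # true class only gets 0.30
]

loss = multiclass_log_loss(y_true, probs)
uniform = multiclass_log_loss(y_true, [[1 / 3] * 3] * 3)

print(round(loss, 4))     # 0.5323
print(round(uniform, 4))  # 1.0986, i.e. ln(3)
```

A model that always predicts 1/3 for each class scores ln(3) ≈ 1.0986, which gives a baseline for reading the scores reported below.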

## 1️⃣ Imports

In [16]:
import pandas as pd
import numpy as np
import xgboost as xgb
import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss

## 2️⃣ Loading the data


In [17]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.head()


Unnamed: 0,id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,0,1055.0,,19724.0,F,,,,N,1.3,,3.64,,,,,209.0,10.5,3.0,C
1,1,3282.0,Placebo,17884.0,F,N,Y,Y,N,0.7,309.0,3.6,96.0,1142.0,71.3,106.0,240.0,12.4,4.0,C
2,2,1653.0,,20600.0,F,,,,N,2.2,,3.64,,,,,139.0,9.5,2.0,C
3,3,999.0,D-penicillamine,22514.0,F,N,Y,N,N,1.0,498.0,3.35,89.0,1601.0,164.3,85.0,394.0,9.7,3.0,C
4,4,2202.0,,17897.0,F,,,,N,17.2,,3.15,,,,,432.0,11.2,3.0,C


## 3️⃣ Feature Engineering
We create medical ratios, logarithmic transforms,
and derived variables to improve class separability.


In [18]:
def advanced_features(df):
    df = df.copy()

    # Time variables: Age is given in days, convert it to years
    df['Age_Yrs'] = df['Age'] / 365.25

    # Medical ratios (the small epsilon avoids division by zero)
    df['Bili_Alb'] = df['Bilirubin'] / (df['Albumin'] + 1e-5)
    df['SGOT_Platelets'] = df['SGOT'] / (df['Platelets'] + 1e-5)
    df['Copper_Alb'] = df['Copper'] / (df['Albumin'] + 1e-5)

    # APRI score (AST-to-platelet ratio index), using 40 as the upper normal limit for SGOT
    df['APRI'] = (df['SGOT'] / 40) / (df['Platelets'] / 100 + 1e-5)

    # Stabilize skewed distributions with log(1 + x)
    skewed_cols = [
        'Bilirubin', 'Cholesterol', 'Copper',
        'Alk_Phos', 'SGOT', 'Tryglicerides'
    ]
    for col in skewed_cols:
        df[f'Log_{col}'] = np.log1p(df[col])

    # Missing-value count per row (this also counts NaNs in the derived columns above)
    df['NA_count'] = df.isnull().sum(axis=1)

    return df


In [19]:
train = advanced_features(train)
test = advanced_features(test)
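
In `advanced_features`, `NA_count` is computed after the derived columns exist, so one missing raw value is counted again for every ratio or log column built from it. Whether that implicit weighting helps is an empirical question. The sketch below shows a hypothetical variant, `features_with_raw_na_count`, that counts missing values on the raw columns only; the toy frame reuses a few values from the `train.head()` output and keeps only three columns for brevity.

```python
import numpy as np
import pandas as pd

def features_with_raw_na_count(df):
    """Hypothetical variant: count missing values before deriving features."""
    df = df.copy()
    # Count NaNs first, while only raw columns are present
    df['NA_count'] = df.isnull().sum(axis=1)
    # Derived columns inherit NaN from any missing input
    df['Bili_Alb'] = df['Bilirubin'] / (df['Albumin'] + 1e-5)
    df['Log_Copper'] = np.log1p(df['Copper'])
    return df

toy = pd.DataFrame({
    'Bilirubin': [1.3, 0.7],
    'Albumin': [3.64, 3.60],
    'Copper': [np.nan, 96.0],  # first patient has no Copper measurement
})

out = features_with_raw_na_count(toy)

# For comparison: counting after derivation, as the notebook's function does
late_count = out.drop(columns='NA_count').isnull().sum(axis=1)

print(out['NA_count'].tolist())  # [1, 0]
print(late_count.tolist())       # [2, 0]
```

Swapping one ordering for the other changes the feature values, so the cross-validation score would need to be re-measured before preferring either.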

## 4️⃣ Preparing the variables
We encode the target and handle the categorical variables.


In [20]:
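# Encode the target labels (C, CL, D) as integers
target_le = LabelEncoder()
y = target_le.fit_transform(train['Status'])

# Drop the identifier and the target from the features
X = train.drop(['id', 'Status'], axis=1)
X_test = test.drop(['id'], axis=1)

# Cast text columns to pandas categoricals for XGBoost's native categorical support
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    X[col] = X[col].astype('category')
    X_test[col] = X_test[col].astype('category')

X.shape, X_test.shape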


((15000, 30), (10000, 30))
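
Because `astype('category')` infers the category list separately for each frame, the integer code behind a label can differ between `X` and `X_test` whenever one frame lacks a label. How XGBoost handles such a mismatch may depend on the installed version, so sharing one dtype is a cautious option. The sketch below illustrates this on toy frames (`train_toy` and `test_toy` are hypothetical names); it is not what the cell above does.

```python
import pandas as pd

train_toy = pd.DataFrame({'Drug': ['Placebo', 'D-penicillamine', None]})
test_toy = pd.DataFrame({'Drug': ['Placebo', None, 'Placebo']})

# Inferred independently, the test frame only knows about 'Placebo'
independent_codes = test_toy['Drug'].astype('category').cat.codes.tolist()

for col in ['Drug']:
    # Fix the category list from the training data and reuse it everywhere
    shared = pd.CategoricalDtype(
        categories=sorted(train_toy[col].dropna().unique())
    )
    train_toy[col] = train_toy[col].astype(shared)
    test_toy[col] = test_toy[col].astype(shared)

print(independent_codes)                     # [0, -1, 0]
print(train_toy['Drug'].cat.codes.tolist())  # [1, 0, -1]
print(test_toy['Drug'].cat.codes.tolist())   # [1, -1, 1]
```

With the shared dtype, `'Placebo'` maps to code 1 in both frames. Note that a test label never seen in training would become missing under this scheme.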

## 5️⃣ Hyperparameter optimization with Optuna
Goal: minimize the **multi-class log loss**.


In [21]:
def objective(trial):
    params = {
        # Upper bound on the number of trees; early stopping picks the effective count
        'n_estimators': 1500,
        # Search space sampled by Optuna
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.03, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 15),
        'subsample': trial.suggest_float('subsample', 0.5, 0.9),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.8),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'gamma': trial.suggest_float('gamma', 0.1, 5.0),
        # Fixed settings
        'objective': 'multi:softprob',
        'num_class': 3,
        'tree_method': 'hist',
        'enable_categorical': True,
        'random_state': 42,
        'early_stopping_rounds': 50
    }

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    losses = []

    for tr_idx, val_idx in skf.split(X, y):
        X_tr, X_val = X.iloc[tr_idx], X.iloc[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]

        # The validation fold drives early stopping and is also used for scoring
        model = xgb.XGBClassifier(**params)
        model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

        preds = model.predict_proba(X_val)
        losses.append(log_loss(y_val, preds))

    # Value minimized by Optuna: mean log loss over the 5 folds
    return np.mean(losses)


In [15]:
print("Starting Optuna optimization (100 trials)...")

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)

study.best_params

[I 2025-12-29 02:23:23,034] A new study created in memory with name: no-name-b36daae7-acfb-41be-aa13-05e918c16237


Starting Optuna optimization (100 trials)...


[I 2025-12-29 02:23:58,926] Trial 0 finished with value: 0.37553871796974503 and parameters: {'learning_rate': 0.005474147019478653, 'max_depth': 9, 'min_child_weight': 1, 'subsample': 0.7483461113733648, 'colsample_bytree': 0.4926237841248225, 'reg_alpha': 0.031430064288991, 'reg_lambda': 0.014475806851866906, 'gamma': 2.0375751261896644}. Best is trial 0 with value: 0.37553871796974503.
[I 2025-12-29 02:24:16,209] Trial 1 finished with value: 0.3757369510086206 and parameters: {'learning_rate': 0.02271709530496473, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 0.5432219832109165, 'colsample_bytree': 0.7508427031543737, 'reg_alpha': 0.4862685828472312, 'reg_lambda': 0.4389698357914538, 'gamma': 0.3050438279154053}. Best is trial 0 with value: 0.37553871796974503.
[I 2025-12-29 02:24:40,546] Trial 2 finished with value: 0.37273934543388404 and parameters: {'learning_rate': 0.009541809864353431, 'max_depth': 4, 'min_child_weight': 13, 'subsample': 0.5024402610778742, 'colsample_b

{'learning_rate': 0.013250959420056307,
 'max_depth': 8,
 'min_child_weight': 8,
 'subsample': 0.6406889455936455,
 'colsample_bytree': 0.40147241815135704,
 'reg_alpha': 0.18726315759641388,
 'reg_lambda': 0.01758709449374469,
 'gamma': 0.26915605167506695}

## 6️⃣ Final training with cross-validation


In [22]:
# Start from the best Optuna trial and add the fixed settings
best_params = study.best_params
best_params.update({
    'n_estimators': 2000,
    'objective': 'multi:softprob',
    'num_class': 3,
    'tree_method': 'hist',
    'enable_categorical': True,
    'random_state': 42,
    'early_stopping_rounds': 100
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
final_preds = np.zeros((len(X_test), 3))

for fold, (tr_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold+1}")

    X_tr, X_val = X.iloc[tr_idx], X.iloc[val_idx]
    y_tr, y_val = y[tr_idx], y[val_idx]

    model = xgb.XGBClassifier(**best_params)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=100)

    # Average the test-set probabilities over the 5 fold models
    final_preds += model.predict_proba(X_test) / skf.n_splits


Fold 1
[0]	validation_0-mlogloss:0.91631
[100]	validation_0-mlogloss:0.51708
[200]	validation_0-mlogloss:0.41981
[300]	validation_0-mlogloss:0.38852
[400]	validation_0-mlogloss:0.37712
[500]	validation_0-mlogloss:0.37169
[600]	validation_0-mlogloss:0.36919
[700]	validation_0-mlogloss:0.36769
[800]	validation_0-mlogloss:0.36712
[900]	validation_0-mlogloss:0.36671
[992]	validation_0-mlogloss:0.36698
Fold 2
[0]	validation_0-mlogloss:0.91616
[100]	validation_0-mlogloss:0.51240
[200]	validation_0-mlogloss:0.41460
[300]	validation_0-mlogloss:0.38396
[400]	validation_0-mlogloss:0.37244
[500]	validation_0-mlogloss:0.36706
[600]	validation_0-mlogloss:0.36424
[700]	validation_0-mlogloss:0.36304
[800]	validation_0-mlogloss:0.36235
[872]	validation_0-mlogloss:0.36280
Fold 3
[0]	validation_0-mlogloss:0.91705
[100]	validation_0-mlogloss:0.51908
[200]	validation_0-mlogloss:0.42454
[300]	validation_0-mlogloss:0.39697
[400]	validation_0-mlogloss:0.38628
[500]	validation_0-mlogloss:0.38248
[600]	validat
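
The loop above averages test predictions across folds but does not store the validation predictions, so it does not report a single cross-validation score of its own. One common way to obtain such a score is to collect out-of-fold (OOF) probabilities and score them once. The sketch below shows the pattern on synthetic data with `LogisticRegression` as a lightweight stand-in for `XGBClassifier`; the data, the variable names (`X_demo`, `oof`, `test_avg`), and the resulting score are illustrative assumptions and say nothing about the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class data standing in for the real features (assumption)
X_all, y_all = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)
X_demo, y_demo = X_all[:500], y_all[:500]
X_demo_test = X_all[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros((len(X_demo), 3))             # one row per training sample
test_avg = np.zeros((len(X_demo_test), 3))   # fold-averaged test probabilities

for tr_idx, val_idx in skf.split(X_demo, y_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[tr_idx], y_demo[tr_idx])
    # Each sample is predicted exactly once, by a model that never saw it
    oof[val_idx] = clf.predict_proba(X_demo[val_idx])
    test_avg += clf.predict_proba(X_demo_test) / skf.n_splits

cv_score = log_loss(y_demo, oof)
print(f"OOF log loss: {cv_score:.5f}")
```

In the notebook's loop, the equivalent change would be to allocate an `oof` array before the loop and fill `oof[val_idx]` from `model.predict_proba(X_val)` inside it.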

## 7️⃣ Generating the submission


In [23]:
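# One probability column per class, named after the original labels
submission = pd.DataFrame(
    final_preds,
    columns=[f"Status_{c}" for c in target_le.classes_]
)

# Put the test identifiers first, then write the file
submission.insert(0, 'id', test['id'])
submission.to_csv('submission_final_optuna.csv', index=False)

submission.head()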


Unnamed: 0,id,Status_C,Status_CL,Status_D
0,15000,0.906233,0.003648,0.09012
1,15001,0.970647,0.002377,0.026976
2,15002,0.732635,0.010616,0.25675
3,15003,0.976911,0.00309,0.019999
4,15004,0.98361,0.010172,0.006218


The file **submission_final_optuna.csv** is created and will be submitted through the Kaggle platform.
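
Before uploading, a quick sanity check on the file can catch shape or probability errors. The helper below, `check_submission`, is a hypothetical sketch, not part of the original pipeline. It assumes the expected layout is `id` followed by one column per class and that each row of probabilities should sum to 1, as `multi:softprob` outputs do. The demo rows are copied from the `submission.head()` output above.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids, class_cols):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(sub.columns) != ['id'] + list(class_cols):
        return ['unexpected columns']
    if sub['id'].tolist() != list(expected_ids):
        problems.append('ids do not match the test set')
    probs = sub[list(class_cols)].to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append('missing probabilities')
    elif ((probs < 0) | (probs > 1)).any():
        problems.append('probabilities outside [0, 1]')
    elif not np.allclose(probs.sum(axis=1), 1.0, atol=1e-4):
        problems.append('rows do not sum to 1')
    return problems

cols = ['Status_C', 'Status_CL', 'Status_D']
good = pd.DataFrame({
    'id': [15000, 15001],
    'Status_C': [0.906233, 0.970647],
    'Status_CL': [0.003648, 0.002377],
    'Status_D': [0.09012, 0.026976],
})
bad = good.copy()
bad.loc[0, 'Status_D'] = 0.29012  # first row now sums to about 1.2

good_report = check_submission(good, [15000, 15001], cols)
bad_report = check_submission(bad, [15000, 15001], cols)
print(good_report)  # []
print(bad_report)   # ['rows do not sum to 1']
```

On the real data, the call would be `check_submission(submission, test['id'].tolist(), [f"Status_{c}" for c in target_le.classes_])`.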
