# 🧪 QRT Leukemia Challenge: Experimentation Notebook

**Goal**: Test improvements to beat the current score of **0.7111** (GBSA)

**Target score**: 0.7744 (Challenge Winner)

---

## Score History

| Version | Model | IPCW C-index | Notes |
|---------|-------|--------------|-------|
| v1 | Ridge Baseline | 0.6537 | Ignores censoring |
| v2 | Random Survival Forest | 0.7040 | Tuned by grid search |
| v3 | Gradient Boosting Surv | **0.7111** | Current best |
| v4 | Ensemble RSF+GBSA | ? | To test |
| v5 | GBSA + More Features | ? | To test |
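
The scores in this table are concordance indices: roughly, the share of comparable patient pairs in which the patient who dies first also receives the higher predicted risk. As an illustration only, here is a minimal pure-Python sketch of the unweighted (Harrell) version. It is not the IPCW variant used for scoring, which also reweights pairs by the inverse probability of censoring; the name `harrell_cindex` and the toy data are made up for the example.

```python
def harrell_cindex(times, events, risks):
    """Unweighted concordance index for right-censored data.

    A pair (i, j) is comparable when patient i has an observed event
    and a strictly shorter time than patient j. It is concordant when
    patient i also has the higher predicted risk. Ties in predicted
    risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: time in years, event flag (True = death observed), predicted risk.
times = [1.0, 2.0, 3.0, 4.0]
events = [True, False, True, True]
risks = [0.9, 0.1, 0.2, 0.8]

c = harrell_cindex(times, events, risks)  # 3 concordant pairs out of 4
```

A model that ignores censoring, like the Ridge baseline, treats a censored follow-up time as if it were a death time, which is one plausible reason for its lower score.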

---

## 1. Setup: Loading Data and Base Models

In [1]:
# ============================================================
# Configuration and imports
# ============================================================
import os, sys, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.base import clone
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Survival models
from sksurv.ensemble import RandomSurvivalForest, GradientBoostingSurvivalAnalysis

# Local modules
from src.config import (
    RANDOM_STATE, TAU_YEARS, ID_COL, TARGET_TIME, TARGET_EVENT,
    RSF_DEFAULT_PARAMS, RSF_PARAM_GRID
)
from src.data_loader import load_all_data, merge_train_data, clean_target
from src.features import build_molecular_features, get_feature_columns
from src.preprocessing import get_default_preprocessor
from src.evaluation import to_sksurv_y, ipcw_cindex
from src.models import create_rsf_model

np.random.seed(RANDOM_STATE)
print("✓ Configuration loaded")

✓ Configuration loaded


In [2]:
# ============================================================
# Data loading and preparation
# ============================================================
clinical_train, clinical_test, molecular_train, molecular_test, y_train = load_all_data()

# Molecular feature engineering
mol_feat_train = build_molecular_features(molecular_train)
mol_feat_test = build_molecular_features(molecular_test)

# Merge
X_train_full = merge_train_data(clinical_train, mol_feat_train, y_train)
X_test_full = clinical_test.merge(mol_feat_test, on=ID_COL, how="left").fillna(0)

# Align test columns with train columns
for col in X_train_full.columns:
    if col not in X_test_full.columns:
        X_test_full[col] = 0
X_test_full = X_test_full[[c for c in X_train_full.columns if c in X_test_full.columns]]

train_full = clean_target(X_train_full)
feature_cols = get_feature_columns(train_full)

# Split
train_df, valid_df = train_test_split(
    train_full, test_size=0.2, random_state=RANDOM_STATE,
    stratify=train_full[TARGET_EVENT]
)
ytr_s = to_sksurv_y(train_df)
yva_s = to_sksurv_y(valid_df)

# Preprocessor
preprocess = get_default_preprocessor(feature_cols)

print("✓ Data ready:")
print(f"  • Train: {train_df.shape[0]} patients × {len(feature_cols)} features")
print(f"  • Valid: {valid_df.shape[0]} patients")

✓ Data ready:
  • Train: 2538 patients × 92 features
  • Valid: 635 patients


In [3]:
# ============================================================
# Reference scores (already obtained in main.ipynb)
# ============================================================
REFERENCE_SCORES = {
    "Baseline (Ridge)": 0.6537,
    "KMeans Clustering": 0.6182,
    "Random Survival Forest": 0.7040,
    "Gradient Boosting Surv": 0.7111,
    "Challenge Winner": 0.7744
}

def print_score_comparison(new_score, model_name):
    """Compare a new score against the reference scores."""
    print(f"\n📊 {model_name}: {new_score:.4f}")
    print(f"   vs GBSA (0.7111): {(new_score - 0.7111)*100:+.2f}%")
    print(f"   vs Winner (0.7744): {(new_score - 0.7744)*100:+.2f}%")
    return new_score

print("✓ Utility functions ready")

✓ Utility functions ready


---

## 2. Experiment 1: RSF + GBSA Ensemble

In [4]:
# ============================================================
# Training the base models
# ============================================================
print("🔄 Training RSF...")
X_train_prep = preprocess.fit_transform(train_df[feature_cols])
X_valid_prep = preprocess.transform(valid_df[feature_cols])

# RSF
rsf = RandomSurvivalForest(
    n_estimators=200, min_samples_leaf=20, min_samples_split=10,
    max_features=0.5, random_state=RANDOM_STATE, n_jobs=-1
)
rsf.fit(X_train_prep, ytr_s)
rsf_risk = rsf.predict(X_valid_prep)
rsf_c = ipcw_cindex(ytr_s, yva_s, rsf_risk)
print(f"  RSF: {rsf_c:.4f}")

# GBSA
print("🔄 Training GBSA...")
gbsa = GradientBoostingSurvivalAnalysis(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    min_samples_split=10, min_samples_leaf=5,
    random_state=RANDOM_STATE, verbose=0
)
gbsa.fit(X_train_prep, ytr_s)
gbsa_risk = gbsa.predict(X_valid_prep)
gbsa_c = ipcw_cindex(ytr_s, yva_s, gbsa_risk)
print(f"  GBSA: {gbsa_c:.4f}")

🔄 Training RSF...
  RSF: 0.7040
🔄 Training GBSA...
  GBSA: 0.7111


In [5]:
# ============================================================
# Ensemble: weighted average
# ============================================================
print("🔀 Test Ensemble RSF + GBSA...\n")

rsf_norm = MinMaxScaler().fit_transform(rsf_risk.reshape(-1, 1)).flatten()
gbsa_norm = MinMaxScaler().fit_transform(gbsa_risk.reshape(-1, 1)).flatten()

results_ensemble = {}
for w in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    ensemble = w * gbsa_norm + (1 - w) * rsf_norm
    c = ipcw_cindex(ytr_s, yva_s, ensemble)
    results_ensemble[w] = c
    print(f"  w_gbsa={w:.1f} -> {c:.4f}")

best_w = max(results_ensemble, key=results_ensemble.get)
best_ensemble_c = results_ensemble[best_w]
print_score_comparison(best_ensemble_c, f"Ensemble (w_gbsa={best_w})")

🔀 Test Ensemble RSF + GBSA...

  w_gbsa=0.2 -> 0.7085
  w_gbsa=0.3 -> 0.7097
  w_gbsa=0.4 -> 0.7103
  w_gbsa=0.5 -> 0.7102
  w_gbsa=0.6 -> 0.7106
  w_gbsa=0.7 -> 0.7113
  w_gbsa=0.8 -> 0.7118

📊 Ensemble (w_gbsa=0.8): 0.7118
   vs GBSA (0.7111): +0.07%
   vs Winner (0.7744): -6.26%


0.7117595360786825
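
A concordance index depends only on how patients are ordered, so the normalization step does not change either model's own score; it only affects how the two models mix. Min-max scaling is sensitive to a few extreme risk values, so one alternative worth trying is to blend ranks instead. Below is a minimal pure-Python sketch; `rank_normalize` and `blend` are hypothetical helpers, not part of the notebook's `src` package, and the toy lists stand in for `gbsa_risk` and `rsf_risk`.

```python
def rank_normalize(scores):
    """Map scores to average ranks scaled into [0, 1] (ties share a rank)."""
    n = len(scores)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda k: scores[k])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return [r / (n - 1) for r in ranks]


def blend(scores_a, scores_b, weight_a):
    """Weighted average of two rank-normalized score vectors."""
    ra, rb = rank_normalize(scores_a), rank_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(ra, rb)]


# Toy scores: the outlier 50.0 would dominate a min-max scaling, but not the ranks.
model_a = [0.1, 0.4, 0.2, 50.0]
model_b = [3.0, 1.0, 2.0, 4.0]
blended = blend(model_a, model_b, weight_a=0.8)
```

The best weight found in this experiment (0.8) sits on the edge of the tested grid, so extending the grid toward 1.0 may be worth a try. The gain over GBSA alone is small and was selected on the same validation split it is reported on.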

---

## 3. Experiment 2: Tuning GBSA with More Trees

In [6]:
# ============================================================
# Manual grid search for GBSA
# ============================================================
print("🎯 Tuning GBSA...\n")

param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3, 4]
}

best_gbsa_score = 0
best_gbsa_params = None

from itertools import product
total = len(param_grid['n_estimators']) * len(param_grid['learning_rate']) * len(param_grid['max_depth'])
i = 0

for n_est, lr, depth in product(
    param_grid['n_estimators'],
    param_grid['learning_rate'],
    param_grid['max_depth']
):
    i += 1
    model = GradientBoostingSurvivalAnalysis(
        n_estimators=n_est, learning_rate=lr, max_depth=depth,
        min_samples_split=10, min_samples_leaf=5,
        random_state=RANDOM_STATE, verbose=0
    )
    model.fit(X_train_prep, ytr_s)
    risk = model.predict(X_valid_prep)
    c = ipcw_cindex(ytr_s, yva_s, risk)
    
    if c > best_gbsa_score:
        best_gbsa_score = c
        best_gbsa_params = {'n_estimators': n_est, 'learning_rate': lr, 'max_depth': depth}
    
    print(f"  [{i}/{total}] n={n_est}, lr={lr}, d={depth} -> {c:.4f}")

print(f"\n✓ Best params: {best_gbsa_params}")
print_score_comparison(best_gbsa_score, "GBSA Tuned")

🎯 Tuning GBSA...

  [1/18] n=100, lr=0.05, d=2 -> 0.7027
  [2/18] n=100, lr=0.05, d=3 -> 0.6968
  [3/18] n=100, lr=0.05, d=4 -> 0.7003
  [4/18] n=100, lr=0.1, d=2 -> 0.7081
  [5/18] n=100, lr=0.1, d=3 -> 0.7052
  [6/18] n=100, lr=0.1, d=4 -> 0.7079
  [7/18] n=200, lr=0.05, d=2 -> 0.7087
  [8/18] n=200, lr=0.05, d=3 -> 0.7052
  [9/18] n=200, lr=0.05, d=4 -> 0.7085
  [10/18] n=200, lr=0.1, d=2 -> 0.7118
  [11/18] n=200, lr=0.1, d=3 -> 0.7111
  [12/18] n=200, lr=0.1, d=4 -> 0.7098
  [13/18] n=300, lr=0.05, d=2 -> 0.7117
  [14/18] n=300, lr=0.05, d=3 -> 0.7088
  [15/18] n=300, lr=0.05, d=4 -> 0.7093
  [16/18] n=300, lr=0.1, d=2 -> 0.7141
  [17/18] n=300, lr=0.1, d=3 -> 0.7116
  [18/18] n=300, lr=0.1, d=4 -> 0.7093

✓ Best params: {'n_estimators': 300, 'learning_rate': 0.1, 'max_depth': 2}

📊 GBSA Tuned: 0.7141
   vs GBSA (0.7111): +0.30%
   vs Winner (0.7744): -6.03%


0.7141465240109957
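
The top configurations in this grid differ by about 0.003 on a single 635-patient validation split, which may be within split-to-split noise. One way to make the selection more robust is to average each configuration's score over several folds. The sketch below uses only the standard library; `evaluate` is a hypothetical callable standing for "fit a GBSA with these parameters on the training indices and return the IPCW C-index on the validation indices", and a toy function takes its place here. Stratification on `TARGET_EVENT`, which the notebook's split uses, is left out for brevity.

```python
import random
from itertools import product
from statistics import mean


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid


def cv_grid_search(param_grid, evaluate, n_samples, n_splits=5, seed=0):
    """Return (best_params, best_mean_score, all_results) over the grid."""
    names = list(param_grid)
    splits = list(kfold_indices(n_samples, n_splits, seed))
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [evaluate(params, tr, va) for tr, va in splits]
        results.append((params, mean(fold_scores)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# Toy stand-in for "fit on train_idx, score on valid_idx" (hypothetical).
def toy_evaluate(params, train_idx, valid_idx):
    return 0.70 + 0.00001 * params["n_estimators"] - 0.002 * abs(params["max_depth"] - 2)


grid = {"n_estimators": [100, 200, 300], "max_depth": [2, 3, 4]}
best_params, best_score, results = cv_grid_search(grid, toy_evaluate, n_samples=50)
```

Keeping the per-configuration results also makes it possible to look at the spread across folds before declaring a winner.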

---

## 4. Experiment 3: More Genetic Features

In [7]:
# ============================================================
# Increase TOP_GENES from 30 to 50
# ============================================================
print("🧬 Testing with more genes (TOP_GENES=50)...\n")

# Rebuild the features with more genes
mol_feat_train_v2 = build_molecular_features(molecular_train, top_genes=50, top_effects=20)
mol_feat_test_v2 = build_molecular_features(molecular_test, top_genes=50, top_effects=20)

# Redo the merge
X_train_v2 = merge_train_data(clinical_train, mol_feat_train_v2, y_train)
X_test_v2 = clinical_test.merge(mol_feat_test_v2, on=ID_COL, how="left").fillna(0)

for col in X_train_v2.columns:
    if col not in X_test_v2.columns:
        X_test_v2[col] = 0

train_full_v2 = clean_target(X_train_v2)
feature_cols_v2 = get_feature_columns(train_full_v2)

# Signed difference, so a drop in the feature count is reported as such
n_added = len(feature_cols_v2) - len(feature_cols)
print(f"  Features: {len(feature_cols_v2)} (vs {len(feature_cols)} before)")
print(f"  Change in feature count: {n_added:+d}")

🧬 Testing with more genes (TOP_GENES=50)...

  Features: 82 (vs 92 before)
  Change in feature count: -10
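
Raising `top_genes` from 30 to 50 was expected to add columns, yet the count fell from 92 to 82, so it is worth checking which columns disappeared before reading anything into this experiment's score. One possible cause to verify is the explicit `top_effects=20`: if the default used in the first run is larger, this call would shrink the effect columns. A quick diagnostic is to diff the two column lists. In the sketch below, `diff_features` and `count_by_prefix` are hypothetical helpers, and the toy lists (including the `AGE` and `EFFECT__` names) only stand in for `feature_cols` and `feature_cols_v2`.

```python
from collections import Counter


def diff_features(old_cols, new_cols):
    """Return the columns removed and added between two feature lists."""
    old_set, new_set = set(old_cols), set(new_cols)
    removed = sorted(old_set - new_set)
    added = sorted(new_set - old_set)
    return removed, added


def count_by_prefix(cols, sep="__"):
    """Count columns per name prefix (the part before `sep`)."""
    return Counter(c.split(sep, 1)[0] if sep in c else "(no prefix)" for c in cols)


# Toy stand-ins for feature_cols / feature_cols_v2 (made-up column names).
old_cols = ["AGE", "GENE__TP53", "GENE__TET2", "EFFECT__missense", "EFFECT__stop"]
new_cols = ["AGE", "GENE__TP53", "GENE__TET2", "GENE__NPM1", "EFFECT__missense"]

removed, added = diff_features(old_cols, new_cols)
removed_by_prefix = count_by_prefix(removed)
```

Grouping the removed columns by prefix shows at a glance whether the loss comes from gene columns, effect columns, or something else.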


In [8]:
# Split and training
train_df_v2, valid_df_v2 = train_test_split(
    train_full_v2, test_size=0.2, random_state=RANDOM_STATE,
    stratify=train_full_v2[TARGET_EVENT]
)
ytr_v2 = to_sksurv_y(train_df_v2)
yva_v2 = to_sksurv_y(valid_df_v2)

preprocess_v2 = get_default_preprocessor(feature_cols_v2)
X_train_v2_prep = preprocess_v2.fit_transform(train_df_v2[feature_cols_v2])
X_valid_v2_prep = preprocess_v2.transform(valid_df_v2[feature_cols_v2])

# GBSA with the rebuilt feature set
gbsa_v2 = GradientBoostingSurvivalAnalysis(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    min_samples_split=10, min_samples_leaf=5,
    random_state=RANDOM_STATE, verbose=0
)
gbsa_v2.fit(X_train_v2_prep, ytr_v2)
gbsa_v2_risk = gbsa_v2.predict(X_valid_v2_prep)
gbsa_v2_c = ipcw_cindex(ytr_v2, yva_v2, gbsa_v2_risk)

print_score_comparison(gbsa_v2_c, "GBSA + 50 genes")


📊 GBSA + 50 genes: 0.7108
   vs GBSA (0.7111): -0.03%
   vs Winner (0.7744): -6.36%


0.7108462613405228

---

## 5. Experiment 4: Co-mutations (Gene-Gene Interactions)

In [9]:
# ============================================================
# Add co-mutation features
# ============================================================
print("🧬 Creating co-mutation features...\n")

# Gene pairs known to be important in AML/MDS
important_gene_pairs = [
    ('TP53', 'RUNX1'),
    ('ASXL1', 'TET2'),
    ('DNMT3A', 'TET2'),
    ('SRSF2', 'TET2'),
    ('TP53', 'ASXL1'),
    ('NPM1', 'FLT3'),
    ('RUNX1', 'ASXL1'),
]

def add_comutation_features(df, gene_pairs):
    """Add one binary column per gene pair, set to 1 when both genes are mutated."""
    df_out = df.copy()
    for g1, g2 in gene_pairs:
        col1 = f'GENE__{g1}'
        col2 = f'GENE__{g2}'
        # Skip pairs whose gene columns are absent from the feature set
        if col1 in df.columns and col2 in df.columns:
            df_out[f'COMUT__{g1}_{g2}'] = ((df[col1] > 0) & (df[col2] > 0)).astype(int)
    return df_out

# Apply to the data
train_full_v3 = add_comutation_features(train_full, important_gene_pairs)
feature_cols_v3 = [c for c in train_full_v3.columns if c not in [ID_COL, TARGET_TIME, TARGET_EVENT]]

comut_cols = [c for c in feature_cols_v3 if c.startswith('COMUT__')]
print(f"  Co-mutations created: {len(comut_cols)}")
for c in comut_cols:
    print(f"    • {c}: {train_full_v3[c].sum()} patients")

🧬 Creating co-mutation features...

  Co-mutations created: 7
    • COMUT__TP53_RUNX1: 19 patients
    • COMUT__ASXL1_TET2: 307 patients
    • COMUT__DNMT3A_TET2: 134 patients
    • COMUT__SRSF2_TET2: 299 patients
    • COMUT__TP53_ASXL1: 30 patients
    • COMUT__NPM1_FLT3: 15 patients
    • COMUT__RUNX1_ASXL1: 250 patients


In [10]:
# Training with co-mutations
train_df_v3, valid_df_v3 = train_test_split(
    train_full_v3, test_size=0.2, random_state=RANDOM_STATE,
    stratify=train_full_v3[TARGET_EVENT]
)
ytr_v3 = to_sksurv_y(train_df_v3)
yva_v3 = to_sksurv_y(valid_df_v3)

preprocess_v3 = get_default_preprocessor(feature_cols_v3)
X_train_v3_prep = preprocess_v3.fit_transform(train_df_v3[feature_cols_v3])
X_valid_v3_prep = preprocess_v3.transform(valid_df_v3[feature_cols_v3])

gbsa_v3 = GradientBoostingSurvivalAnalysis(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    min_samples_split=10, min_samples_leaf=5,
    random_state=RANDOM_STATE, verbose=0
)
gbsa_v3.fit(X_train_v3_prep, ytr_v3)
gbsa_v3_risk = gbsa_v3.predict(X_valid_v3_prep)
gbsa_v3_c = ipcw_cindex(ytr_v3, yva_v3, gbsa_v3_risk)

print_score_comparison(gbsa_v3_c, "GBSA + Co-mutations")


📊 GBSA + Co-mutations: 0.7111
   vs GBSA (0.7111): +0.00%
   vs Winner (0.7744): -6.33%


0.7111113627068091
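
The score is unchanged from the baseline. One plausible reason is that trees of depth 3 can already represent the interaction of two binary gene columns, so explicit pair flags add little. If co-mutations are explored further, an option is to enumerate all gene pairs and keep only those with enough carriers, rather than relying on a hand-picked list in which some pairs cover only 15 to 30 patients. The sketch below is pure Python; `comutation_counts`, the `min_support` threshold, and the toy data are assumptions for illustration.

```python
from itertools import combinations


def comutation_counts(gene_flags, min_support=1):
    """Count patients carrying both mutations for every gene pair.

    gene_flags maps a gene name to a list of 0/1 flags, one per patient.
    Only pairs seen in at least `min_support` patients are returned.
    """
    counts = {}
    for g1, g2 in combinations(sorted(gene_flags), 2):
        both = sum(
            1 for a, b in zip(gene_flags[g1], gene_flags[g2]) if a > 0 and b > 0
        )
        if both >= min_support:
            counts[(g1, g2)] = both
    return counts


# Toy mutation flags for six patients (hypothetical data).
gene_flags = {
    "TP53": [1, 0, 0, 1, 0, 0],
    "TET2": [1, 1, 1, 0, 1, 0],
    "ASXL1": [0, 1, 1, 0, 1, 0],
}
pairs = comutation_counts(gene_flags, min_support=2)
```

The retained pairs could then be passed to `add_comutation_features` in place of `important_gene_pairs`.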

---

## 6. Experiment Summary

In [11]:
# ============================================================
# Summary table
# ============================================================
print("="*70)
print("   EXPERIMENT SUMMARY")
print("="*70)

experiments = {
    "Baseline GBSA": REFERENCE_SCORES["Gradient Boosting Surv"],
}

# Add results if available
if 'best_ensemble_c' in dir():
    experiments["Ensemble RSF+GBSA"] = best_ensemble_c
if 'best_gbsa_score' in dir():
    experiments["GBSA Tuned"] = best_gbsa_score
if 'gbsa_v2_c' in dir():
    experiments["GBSA + 50 genes"] = gbsa_v2_c
if 'gbsa_v3_c' in dir():
    experiments["GBSA + Co-mutations"] = gbsa_v3_c

experiments["Challenge Winner"] = REFERENCE_SCORES["Challenge Winner"]

for exp, score in sorted(experiments.items(), key=lambda x: x[1], reverse=True):
    bar = "█" * int(score * 40)
    gap = score - REFERENCE_SCORES["Challenge Winner"]
    print(f"  {exp:25s} │ {score:.4f} │ {gap:+.4f} │ {bar}")

print("="*70)
best_exp = max(experiments, key=experiments.get)
if best_exp != "Challenge Winner":
    print(f"\n🏆 Best experiment: {best_exp} ({experiments[best_exp]:.4f})")

======================================================================
   EXPERIMENT SUMMARY
======================================================================
  Challenge Winner          │ 0.7744 │ +0.0000 │ ██████████████████████████████
  GBSA Tuned                │ 0.7141 │ -0.0603 │ ████████████████████████████
  Ensemble RSF+GBSA         │ 0.7118 │ -0.0626 │ ████████████████████████████
  GBSA + Co-mutations       │ 0.7111 │ -0.0633 │ ████████████████████████████
  Baseline GBSA             │ 0.7111 │ -0.0633 │ ████████████████████████████
  GBSA + 50 genes           │ 0.7108 │ -0.0636 │ ████████████████████████████
======================================================================


---

## 7. Next Steps

### To test:
- [ ] CoxPH with ElasticNet regularization
- [ ] Parse CYTOGENETICS (del(5q), -7, complex karyotype)
- [ ] Stacking with a meta-learner
- [ ] XGBoost/LightGBM with AFT (Accelerated Failure Time)
- [ ] Neural Network (DeepSurv)
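
For the CYTOGENETICS item, a first pass could extract a few binary flags with regular expressions. The sketch below assumes the column holds ISCN-style karyotype strings such as `46,XY,del(5)(q13q33)[20]`, looks only at the first clone, and treats three or more abnormalities as a complex karyotype. All three points are assumptions to check against the real data, and `parse_cytogenetics` is a hypothetical helper.

```python
import re


def parse_cytogenetics(karyotype):
    """Extract a few binary flags from an ISCN-style karyotype string."""
    text = (karyotype or "").lower().replace(" ", "")
    # Keep only the first clone (before any "/") and drop the "[n]" cell count.
    first_clone = re.sub(r"\[.*?\]", "", text.split("/")[0])
    parts = [p for p in first_clone.split(",") if p]
    # The first two fields are chromosome count and sex chromosomes, e.g. "46,xy".
    abnormalities = parts[2:]
    return {
        # Matches both "del(5)(q13q33)" and the shorthand "del(5q)".
        "del_5q": int(any(re.match(r"del\(5(?:\)\(q|q)", a) for a in abnormalities)),
        "monosomy_7": int("-7" in abnormalities),
        "n_abnormalities": len(abnormalities),
        # Assumption: "complex" means three or more abnormalities in the clone.
        "complex_karyotype": int(len(abnormalities) >= 3),
    }


simple = parse_cytogenetics("46,XY,del(5)(q13q33)[20]")
complex_case = parse_cytogenetics("45,XX,-7,del(5)(q13q33),+8[10]/46,XX[10]")
normal = parse_cytogenetics("46,XX[20]")
```

Missing values are handled by treating `None` as an empty string, which yields all-zero flags; a separate missing-indicator column might be preferable.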

### Notes:
- Update the summary table after each experiment
- Save the best models with joblib
- Carry the results over to main.ipynb once validated
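
To keep the summary table up to date across sessions, scores could be appended to a small JSON file after each experiment. This is a minimal sketch using only the standard library; `log_experiment` and the file name `experiments.json` are hypothetical, and a temporary directory is used here so that the example runs anywhere.

```python
import json
import os
import tempfile


def log_experiment(path, name, score):
    """Add or update one experiment score in a JSON file and return all scores."""
    scores = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            scores = json.load(fh)
    scores[name] = round(float(score), 4)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(scores, fh, indent=2, sort_keys=True)
    return scores


# Example with a temporary file; in practice the path would live in the project.
log_path = os.path.join(tempfile.mkdtemp(), "experiments.json")
log_experiment(log_path, "GBSA Tuned", 0.7141465240109957)
scores = log_experiment(log_path, "Ensemble RSF+GBSA", 0.7117595360786825)
```

The summary cell could then read this file instead of probing for variables with `dir()`.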