# Project 7: Implement a scoring model (feature selection and choice of the scoring model)

## Table of contents: <a class="anchor" id="0"></a>

1. [Library imports and general configuration](#library)
2. [Loading the data](#load)
3. [Selecting the training and test data](#train_test)
4. [Feature selection](#feats)
5. [Handling imbalanced data](#imbalanced)
6. [Pipeline, optimization and model training](#pipe)
7. [Choice of scores](#scores)
8. [Modeling](#model)
    1. [Baseline model: Dummy classifier](#dummy)
    2. [Logistic regression](#reglog)
    3. [LightGBM](#lightgbm)
9. [Selecting the best model](#best)
    1. [Setting the probability threshold](#predict_proba)
    2. [Saving the model](#save)
    3. [Feature importance](#feat_imp)

## Library imports and general configuration <a class="anchor" id="library"></a>

In [None]:
# Configuration to allow long-running executions
import sys
sys.path.insert(0, '..')

# Try to raise the Jupyter IOPub data rate limit for long-running cells
# (this setting limits output volume; it is not an execution timeout)
try:
    from IPython.display import display, Markdown
    import IPython
    IPython.get_ipython().kernel.shell.run_cell(
        "%config IPKernelApp.iopub_data_rate_limit=1e10"
    )
    print("✅ Notebook configuration for long-running executions enabled")
except Exception:
    print("⚠️ Configuration not applied (non-Jupyter environment)")



In [None]:
# builtin
import time
import os

# data
import numpy as np
import pandas as pd
import random

# Custom helper functions
import fct_eda
import fct_preprocessing


# Auto-reload the custom helper modules when they change
%load_ext autoreload
%autoreload 2
%reload_ext autoreload

# viz
import matplotlib.pyplot as plt
import seaborn as sns

# Feature Selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, RFE
from sklearn.feature_selection import chi2, f_classif

# Balancing data
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as pipe

# Model
from sklearn.model_selection import train_test_split, KFold, RepeatedStratifiedKFold, cross_validate
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import set_config
set_config(display='diagram')
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, fbeta_score
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.metrics import precision_recall_curve

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier


# Local feature importance
from lime import lime_tabular
import shap

import mlflow
import mlflow.sklearn

# MLflow configuration (disables system metrics logging)
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "false"

# Model serialization
from pickle import dump
from pickle import load
import dill

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# warnings.filterwarnings('ignore')

In [None]:
# Pandas configuration
pd_option_dictionary = {
    'display.max_rows': 500,
    'display.max_columns': 200,
    'display.width': 300,
    'display.precision': 4,
    'display.max_colwidth': None,
    'display.float_format' : '{:.2f}'.format,
}

for pat, value in pd_option_dictionary.items():
    pd.set_option(pat, value)

## Loading the data <a class="anchor" id="load"></a>

In [None]:
df = fct_preprocessing.preprocessing_no_NaN()

In [None]:
# Check that the consolidated dataset contains no NaN
fct_eda.shape_total_nan(df)

In [None]:
df.head()

## Selecting the training and test data <a class="anchor" id="train_test"></a>

In machine learning, a model must never be validated on the data used to train it. It has to be tested on data it has never seen, which gives an estimate of its future performance. The dataset is shuffled randomly and then split into two parts:
- a **train set**, used to **train the model** (80% of the data)
- a **test set**, reserved for **evaluating the model** (20% of the data)

Splitting the dataset into training and test data makes it possible to **detect overfitting** (a model that is too complex, learns the training data perfectly but fails to generalize) or **underfitting** (a model that is too simple or poorly chosen).

In [None]:
df_feat = df.copy()

In [None]:
# Define the features and the target
col_X = [f for f in df.columns if f not in ['TARGET']]
X = df_feat[col_X]
y = df_feat['TARGET']

In [None]:
# List of quantitative variables
num_feat = X.select_dtypes(exclude='object').columns.tolist()   

In [None]:
# One-hot encode the categorical variables
X, categ_feat = fct_eda.categories_encoder(X, nan_as_category = False)
X.head()

In [None]:
# Training set (80%) and test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    shuffle = True,
                                                    random_state = 42)
print(f"Number of rows in the training data: {len(X_train)} \nNumber of rows in the test data: {len(X_test)}")

In [None]:
X_train.shape

## Feature selection <a class="anchor" id="feats"></a>

Although we have already removed some quantitative features that were strongly correlated with each other, as well as qualitative features after testing their association with the target using the Chi2 and Kruskal-Wallis tests, there are still far too many variables to include them all in our model.

Feature selection is the process of reducing the number of input variables when building a predictive model. Reducing the number of input variables has two main benefits:
- **lower computational cost**
- **better model performance**

Statistics-based methods evaluate the **relationship between each input variable and the target variable using statistics**. The variables with the strongest relationship to the target are kept.

There are two main families of feature selection techniques: **supervised** and **unsupervised** (features are selected with or without reference to the target).

Supervised methods fall into three groups:
- **intrinsic**: algorithms that perform **automatic feature selection** during training
- **wrapper**: methods that evaluate several models using procedures that **add and/or remove predictors** in order to find the **optimal combination** that **maximizes model performance**
- **filter**: methods that select subsets of features based on their **relationship with the target**

As a first step, we will:
- remove features with no variance, that is, variables that take one and the same value across all observations and carry no information (such columns can be produced by the one-hot encoding)
- select categorical and then numerical variables with the help of **statistical methods**
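
To make the first step concrete, here is a minimal pure-Python sketch of what a zero-variance filter does. It is an illustration only, not scikit-learn's implementation; the helper names `constant_columns` and `drop_columns` and the toy rows are hypothetical. The columns to drop are learned on the training rows and then applied unchanged to the test rows, so both sets keep the same columns.

```python
def constant_columns(rows):
    """Return the indices of columns that take a single value across all rows."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if len({row[j] for row in rows}) == 1]


def drop_columns(rows, to_drop):
    """Return a copy of rows without the columns listed in to_drop."""
    drop = set(to_drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]


# Hypothetical toy data: column 1 is constant in the training rows.
train_rows = [[0.2, 1, 5], [0.7, 1, 3], [0.4, 1, 5]]
test_rows = [[0.9, 1, 5], [0.1, 1, 5]]

# Learn the columns to drop on the training rows only...
dropped = constant_columns(train_rows)
# ...and apply the same selection to both sets.
train_kept = drop_columns(train_rows, dropped)
test_kept = drop_columns(test_rows, dropped)
```

Note that column 2 happens to be constant in the toy test rows but is kept, because the decision is made from the training data alone.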

### VarianceThreshold <a class="anchor" id="Variance"></a>

Here we remove the zero-variance columns:

In [None]:
transform = VarianceThreshold(0)

# Fit on the training set only, then apply the same selection to the test set
X_train_trans = transform.fit_transform(X_train)
X_test_trans = transform.transform(X_test)

In [None]:
mask = transform.get_support()
feat_suppr = X.columns[~mask].tolist()

print('Removed columns')
feat_suppr

In [None]:
# Updated list of categorical variables
categ_feat = [elem for elem in categ_feat if elem not in feat_suppr]

# Updated list of numerical variables
num_feat = [elem for elem in num_feat if elem not in feat_suppr]

In [None]:
# New DataFrames
X_train = X_train[num_feat + categ_feat]
X_test = X_test[num_feat + categ_feat]

In [None]:
X_train.shape

## Functions used during modeling <a class="anchor" id="functions"></a>

| Actual \ Predicted     | 0                    | 1                     |
| ---------------------- | -------------------- | --------------------- |
| **0 (non-defaulter)**  | TN *(true negative)* | FP *(false positive)* |
| **1 (defaulter)**      | FN *(false negative)* | TP *(true positive)* |


In [None]:
def score_metier(ytest, y_pred):
    # Unpack the confusion matrix; labels=[0, 1] guarantees a 2x2 matrix
    (tn, fp, fn, tp) = confusion_matrix(ytest, y_pred, labels=[0, 1]).ravel()

    # Business cost to minimize: a false negative costs 10 times a false positive
    cost = 10 * fn + fp

    return cost

In [None]:
def eval_metrics(best_model, xtest, ytest, beta_value):
    y_pred = best_model.predict(xtest)
    # AUC is computed from the predicted probabilities of the positive class
    y_proba = best_model.predict_proba(xtest)[:, 1]

    score_biz = score_metier(ytest, y_pred)
    betascore = fbeta_score(ytest, y_pred, beta=beta_value, zero_division=0)
    recall = recall_score(ytest, y_pred, zero_division=0)
    precision = precision_score(ytest, y_pred, zero_division=0)
    accuracy = accuracy_score(ytest, y_pred)
    auc = roc_auc_score(ytest, y_proba)

    print(f'Business score: {score_biz}')
    print(f'Beta score: {betascore}')
    print(f'Recall: {recall}')
    print(f'Precision: {precision}')
    print(f'Accuracy: {accuracy}')
    print(f'AUC: {auc}')

    return score_biz, betascore, recall, precision, accuracy, auc, y_pred


In [None]:
def plot_confusion_matrix(y_true, y_pred, model_name="Model", labels=None, save_path=None):
    """Plot the confusion matrix as a heatmap and optionally save it to save_path."""
    if labels is None:
        labels = ['Y=0 (Non-defaulter)', 'Y=1 (Defaulter)']

    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm,
                xticklabels=labels,
                yticklabels=labels,
                annot=True,
                fmt='d',
                linewidths=.5,
                cmap=sns.cubehelix_palette(as_cmap=True),
                cbar=False)
    plt.title(f'Confusion matrix: {model_name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path)
    plt.show()

def plot_recall_precision_threshold(model, X_test, y_test, model_name="Model", beta=2, cost_fn_ratio=10):
    """
    Plot precision, recall and F-beta against the decision threshold and compute the optimal thresholds
    
    Parameters:
    -----------
    model : classifier
        Fitted model
    X_test : array-like
        Test features
    y_test : array-like
        Test target
    model_name : str
        Model name used in the title
    beta : float
        Beta parameter of the F-beta score
    cost_fn_ratio : float
        Business cost ratio (FN / FP); currently unused, score_metier hard-codes a ratio of 10
    """
    
    # Predicted probabilities of the positive class
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Precision and recall for each threshold
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    
    # F-beta score for each threshold
    fbeta_scores = []
    for p, r in zip(precision, recall):
        if (p + r) == 0:
            fbeta_scores.append(0)
        else:
            f_beta = (1 + beta**2) * (p * r) / ((beta**2 * p) + r)
            fbeta_scores.append(f_beta)
    
    # Business score for each threshold
    business_scores = []
    for threshold in thresholds:
        y_pred_thresh = (y_proba >= threshold).astype(int)
        score = score_metier(y_test, y_pred_thresh)
        business_scores.append(score)
    
    # Find the optimal thresholds (the last precision/recall point has no threshold)
    best_idx_fbeta = np.argmax(fbeta_scores[:-1])
    best_threshold_fbeta = thresholds[best_idx_fbeta]
    
    best_idx_business = np.argmin(business_scores)
    best_threshold_business = thresholds[best_idx_business]
    
    # Plot
    fig, ax1 = plt.subplots(figsize=(12, 6))
    
    # Axis 1: precision and recall
    ax1.plot(thresholds, precision[:-1], label='Precision', linewidth=2, color='blue')
    ax1.plot(thresholds, recall[:-1], label='Recall', linewidth=2, color='red')
    ax1.set_xlabel('Threshold', fontsize=11)
    ax1.set_ylabel('Precision / Recall', fontsize=11, color='black')
    ax1.tick_params(axis='y', labelcolor='black')
    ax1.grid(True, alpha=0.3)
    
    # Axis 2: F-beta score
    ax2 = ax1.twinx()
    ax2.plot(thresholds, fbeta_scores[:-1], label=f'F{beta}-Score', linewidth=2, color='green', linestyle='--')
    ax2.set_ylabel(f'F{beta}-Score', fontsize=11, color='green')
    ax2.tick_params(axis='y', labelcolor='green')
    ax2.legend(loc='center right', fontsize=10)
    
    # Mark the optimal thresholds, then build the legend so that these lines are included
    ax1.axvline(best_threshold_fbeta, color='green', linestyle=':', linewidth=2, label=f'Optimal F{beta} threshold')
    ax1.axvline(best_threshold_business, color='orange', linestyle=':', linewidth=2, label='Optimal business threshold')
    ax1.legend(loc='upper right', fontsize=10)
    
    plt.title(f'Precision-Recall-Threshold: {model_name}', fontweight='bold', fontsize=12)
    plt.tight_layout()
    plt.show()
    
    return best_threshold_fbeta, best_threshold_business
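
The threshold search performed by this function can be illustrated on a tiny example without any library. The sketch below is a simplified stand-in, not the notebook's function: the names `business_cost` and `best_cost_threshold`, the toy labels, the probabilities and the candidate thresholds are all hypothetical.

```python
def business_cost(y_true, y_pred, fn_weight=10):
    """Business cost: each false negative counts fn_weight times a false positive."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_weight * fn + fp


def best_cost_threshold(y_true, y_proba, thresholds):
    """Return the threshold with the lowest business cost and the cost of every threshold."""
    costs = []
    for thr in thresholds:
        y_pred = [1 if p >= thr else 0 for p in y_proba]
        costs.append(business_cost(y_true, y_pred))
    best = min(range(len(thresholds)), key=costs.__getitem__)
    return thresholds[best], costs


# Hypothetical toy data: 3 defaulters among 8 clients.
y_true_toy = [0, 0, 0, 0, 1, 0, 1, 1]
y_proba_toy = [0.05, 0.10, 0.20, 0.35, 0.30, 0.60, 0.55, 0.90]
thresholds_toy = [0.1, 0.3, 0.5, 0.7]

best_thr, costs = best_cost_threshold(y_true_toy, y_proba_toy, thresholds_toy)
# costs is [4, 2, 11, 20]; the lowest cost is reached at threshold 0.3
```

Because a false negative weighs ten times a false positive, the cost-optimal threshold on this toy data sits below the default 0.5: missing a single defaulter costs more than wrongly flagging several good clients.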

## Handling imbalanced data <a class="anchor" id="imbalanced"></a>

During the exploratory analysis, we noticed that the data were highly **imbalanced** between defaulters and non-defaulters. Non-defaulters are heavily over-represented (> 91%).

Most machine learning models will tend to **ignore the minority class** and therefore **perform poorly** on it, even though the minority class is usually the one whose performance matters most.

One approach to imbalanced datasets is to oversample the minority class. The simplest method is to **duplicate examples of the minority class**, although this adds no new information to the model.

It is also possible to **weight the classes**, that is, to adjust the model's cost function so that misclassifying a minority-class observation is penalized more heavily than misclassifying a majority-class observation. This rebalances the influence of each class during training and typically improves detection of the minority class. Since no new data points are created, class weighting can also be combined with other methods such as oversampling.
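
As an illustration of class weighting, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`. The function name `balanced_class_weights` and the 920/80 split are hypothetical, chosen only to mimic a roughly 92/8 imbalance.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical target: 920 non-defaulters (0) and 80 defaulters (1).
y_toy = [0] * 920 + [1] * 80
weights = balanced_class_weights(y_toy)
# weights[1] is 6.25 and weights[0] is about 0.54
```

With these weights, both classes contribute the same total weight to the loss (920 x 0.54 and 80 x 6.25 are both about 500).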

Rather than duplicating existing examples, new **examples can be synthesized from the existing ones**. This type of data augmentation for the minority class is called **SMOTE** (Synthetic Minority Oversampling Technique).

A **random example from the minority class** is chosen and its k nearest neighbors within that class are found (typically k = 5). **One neighbor is chosen at random**, a segment is drawn between the two points in feature space, and the synthetic example is generated at a random position on that segment.
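
The interpolation step can be sketched in a few lines of pure Python. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function names `k_nearest` and `smote_sample` and the toy minority points are hypothetical.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote_sample(minority, k, rng):
    """Create one synthetic point on the segment between a minority sample and one of its neighbors."""
    base = rng.choice(minority)                           # random minority example
    neighbor = rng.choice(k_nearest(base, minority, k))   # random neighbor among the k nearest
    gap = rng.random()                                    # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Hypothetical minority-class points in a 2-D feature space.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [8.0, 8.0]]
rng = random.Random(42)
synthetic = smote_sample(minority_points, k=2, rng=rng)
```

The synthetic point always lies between two existing minority examples, which is why it stays plausible in feature space.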
 
It is recommended to first use **random undersampling** to reduce the number of examples in the majority class, and then use **SMOTE** to oversample the minority class in order to balance the class distribution. The approach is effective because the new synthetic minority-class examples are plausible (close in feature space to existing minority-class examples).

A general drawback is that the synthetic examples are created without taking the majority class into account.

https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [None]:
# Target distribution
plt.figure(figsize=(5,3))
sns.countplot(x = 'TARGET', data = df, palette=['green', 'red'])
plt.title('Target distribution', fontweight='bold', fontsize = 12)
x = [0, 1]
plt.xticks(x, ['Non-defaulter', 'Defaulter'])
plt.xlabel('');

In [None]:
Counter(df['TARGET'])

## Pipeline, optimization and model training <a class="anchor" id="pipe"></a>

We first create a **pipeline** for each of our models. The pipeline lets us apply preprocessing steps to the data, that is, transformations such as handling the class imbalance and standardizing the features, and then choose the type of model.

The pipeline is then passed to an optimization and training function that uses **cross-validation** to test the robustness of the predictive model by repeating the split procedure. This yields several training and validation errors and therefore an **estimate of the variability of the model's generalization performance**. Here we used the **KFold** method, which splits the train set into five parts, trains the model on four of them and validates it on the fifth. The procedure is repeated for every possible choice of validation part, and the five scores are averaged, so that we can compare our models and be confident we pick the one with the best average performance. (The search code in this notebook uses the stratified variant `RepeatedStratifiedKFold`, with the number of folds set in `GRIDSEARCH_CONFIG`.)
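
The fold mechanics described above can be sketched without any library. This is a simplified illustration (no shuffling, no stratification), not scikit-learn's `KFold`; the function name `kfold_indices` and the fold scores are hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Five folds over ten samples: each fold validates on two samples and trains on eight.
folds = list(kfold_indices(n_samples=10, n_splits=5))

# Hypothetical validation scores, one per fold, averaged to compare models.
fold_scores = [0.61, 0.58, 0.63, 0.60, 0.59]
mean_score = sum(fold_scores) / len(fold_scores)
```

The spread of the per-fold scores, not just their mean, is what gives an idea of how stable the model's performance is.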

Hyperparameter tuning is done either with **GridSearchCV**, which tests every possible combination of hyperparameters to find the one that minimizes the error (an exhaustive method that is very **costly in computing power and time**), or with **RandomizedSearchCV**, which samples random combinations of hyperparameters. The latter is somewhat **less precise** but much **faster**. It will be used for the more complex models.
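
To give a sense of the difference in cost, the sketch below enumerates a grid exhaustively and then draws a random subset of it. It only illustrates the two sampling schemes and does not call scikit-learn; the grid values mirror the LightGBM grid defined later in this notebook, while the variable names and the choice of 5 sampled candidates are hypothetical.

```python
import itertools
import random

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [31, 50],
    "max_depth": [-1, 10, 15],
}

# Exhaustive search: every combination of values (2 * 3 * 2 * 3 = 36 candidates).
names = list(param_grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*param_grid.values())]

# Randomized search: only a fixed number of candidates drawn at random from the grid.
rng = random.Random(42)
sampled_combinations = rng.sample(all_combinations, k=5)
```

Each candidate is fitted once per cross-validation fold, so with 5 folds the exhaustive search needs 180 fits against 25 for the randomized one.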

Finally, to assess the **real performance of our models**, we compute the selected metrics on the test data.

Cross-validation can be used both for **hyperparameter tuning** and for estimating the **generalization performance** of a model. However, using it for both purposes at the same time is problematic. Hyperparameter tuning is itself a form of learning, so we need **another, outer cross-validation loop** to properly evaluate the generalization performance of the complete modeling procedure. Whenever parts of the machine learning pipeline are optimized (for example hyperparameters or transformations), **nested cross-validation** is needed to evaluate the generalization performance of the predictive model. Otherwise, the results are often too optimistic. This is what was done to visualize the distribution of the selected metric on the training and validation data.

## Choice of scores <a class="anchor" id="scores"></a>

According to the column description file:
- **1** => the client has payment difficulties: he or she was more than X days late on at least one of the first Y installments of the loan in our sample (**defaulter**)
- **0** => all other cases (**non-defaulter**)

In a confusion matrix, defaulters are the positive class (Y=1) and non-defaulters the negative class (Y=0).

Since it would be extremely costly for the bank to grant a loan to a defaulting client who would repay it only partly or not at all, we need to **minimize the number of false negatives**, that is, clients predicted as non-defaulters who actually default.

We must also try to **minimize false positives**, that is, predicting that a client will default when he or she will not (risk of losing customers and revenue).

However, a false positive does not cost the same as a false negative: the latter is far more costly for the bank. **We therefore give false negatives 10 times more weight** (business cost function).
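
A short worked example shows how this weighting changes the ranking of two models. The error counts below are hypothetical and the helper name `weighted_cost` is introduced only for this illustration.

```python
def weighted_cost(fn, fp, fn_weight=10):
    """Business cost: a false negative costs fn_weight times a false positive."""
    return fn_weight * fn + fp


# Hypothetical error counts for two models on the same test set.
model_a = {"fn": 50, "fp": 100}   # 150 errors in total
model_b = {"fn": 20, "fp": 300}   # 320 errors in total

cost_a = weighted_cost(**model_a)  # 10 * 50 + 100 = 600
cost_b = weighted_cost(**model_b)  # 10 * 20 + 300 = 500
```

Model B makes more than twice as many errors overall, yet it has the lower business cost because it misses fewer defaulters.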

**Recall**, which measures the true positive rate, is to be favored over precision, which is the classifier's ability not to label a negative sample as positive.

To do this, we rely on the **F-beta score**, the weighted harmonic mean of precision and recall. The beta parameter sets the weight of recall in the score: when it is greater than one, recall is favored. We will test several values of beta and keep the one that gives the best score.
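
For reference, the F-beta score is usually written as follows, with P for precision and R for recall (notation introduced here for brevity). This is the same expression as the one computed in `plot_recall_precision_threshold`.

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
```

With hypothetical values P = 0.2 and R = 0.6, this gives F1 = 0.30 and F2 ≈ 0.43: the larger beta rewards the higher recall.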

We also report **accuracy** and **AUC** for comparison. The **training time** is tracked as well.

The best model is chosen by **cross-validation on the F-beta score**, then on the test data according to the **business score and the F-beta score**.

## Modeling <a class="anchor" id="model"></a>

We want to classify applications as **loan granted or refused**. This is therefore a **binary classification** model.

### Baseline model: Dummy classifier <a class="anchor" id="dummy"></a>

This classifier makes predictions using simple rules. It is useful as a **simple baseline** to compare against other classifiers and will not be optimized. It ignores the input variables and therefore uses no information from the features, so there is **no need to transform our features beforehand**.
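
The arithmetic below shows why such a baseline is informative on imbalanced data. It assumes a majority-class rule (the `most_frequent` strategy) and a hypothetical test set of 920 non-defaulters and 80 defaulters; the counts are illustrative, not results from this notebook.

```python
# Hypothetical test set with roughly 92% non-defaulters.
n_negative, n_positive = 920, 80

# A majority-class rule predicts 0 (non-defaulter) for every client.
tp, fp = 0, 0
fn, tn = n_positive, n_negative

accuracy = (tp + tn) / (n_negative + n_positive)   # 0.92
recall = tp / (tp + fn)                            # 0.0
business_cost = 10 * fn + fp                       # 800
```

An accuracy of 92% looks good, but the recall is zero and every defaulter is missed, which is why accuracy alone is not a suitable selection criterion here.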

In [None]:
# Dummy classifier baseline model
# The instance is named dummy_clf for consistency
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train, y_train)
y_pred_dummy = dummy_clf.predict(X_test)

print("=" * 70)
print("DUMMY CLASSIFIER BASELINE")
print("=" * 70)

# Evaluate metrics
score_biz, betascore, recall, precision, accuracy, auc, y_pred = eval_metrics(dummy_clf, X_test, y_test, beta_value=2)



In [None]:
# Plot the confusion matrix with the function defined above
plot_confusion_matrix(y_test, y_pred_dummy, model_name="DummyClassifier")


### Logistic regression <a class="anchor" id="reglog"></a>

**Logistic regression** is a binary classification model that estimates the probability that an observation belongs to a class using the logistic (sigmoid) function.

**Advantages:**
- Simple, interpretable and fast to train
- Outputs probabilities that are usually well calibrated (before any resampling or class weighting)
- Solid performance when the classes are close to linearly separable

**Strategy:**
We will test **all the available rebalancing strategies** to identify the one that gives the best performance:
1. **No rebalancing** (baseline)
2. **Class weight balanced** (class weighting)
3. **SMOTE 0.5** (synthetic oversampling with ratio 0.5)
4. **SMOTE 0.7** (synthetic oversampling with ratio 0.7)
5. **Undersample 0.5** (random undersampling with ratio 0.5)
6. **Combine** (undersampling followed by SMOTE)

For each strategy, we tune the hyperparameters with RandomizedSearchCV.
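
To clarify what the ratios in the list above mean, the sketch below computes the class counts each strategy would aim for. It follows imbalanced-learn's convention that a float `sampling_strategy` is the desired minority-to-majority ratio after resampling; the helper names, the starting counts and the rounding are assumptions for illustration, and the library's exact rounding may differ.

```python
def oversample_counts(n_major, n_minor, ratio):
    """Grow the minority class until minority / majority reaches ratio."""
    return n_major, max(n_minor, int(n_major * ratio))


def undersample_counts(n_major, n_minor, ratio):
    """Shrink the majority class until minority / majority reaches ratio."""
    return min(n_major, int(n_minor / ratio)), n_minor


start = (92_000, 8_000)  # hypothetical (majority, minority) counts

# SMOTE 0.5: the minority class grows to half the size of the majority class.
after_smote = oversample_counts(*start, ratio=0.5)       # (92000, 46000)

# Undersample 0.5: the majority class shrinks to twice the size of the minority class.
after_under = undersample_counts(*start, ratio=0.5)      # (16000, 8000)

# Combine: undersample to a ratio of 0.3, then oversample to a ratio of 0.5.
after_combine = oversample_counts(*undersample_counts(*start, ratio=0.3), ratio=0.5)
# (26666, 13333)
```

The combined strategy ends with far fewer rows than SMOTE alone, which also shortens training.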

#### Configuration and setup

In [None]:
# Flexible search configuration - ADJUST to your needs
GRIDSEARCH_CONFIG = {
    'cv_splits': 2,           # Reduced to 2 for speed
    'cv_repeats': 1,          # 1 repeat
    'use_randomized': True,   # Use RandomizedSearchCV instead of GridSearchCV
    'n_iter': 1,              # A SINGLE iteration (grid already simplified)
    'verbose': 1              # 0=silent, 1=minimal output, 2=detailed
}

# Rebalancing strategies to test (reduced to the essentials)
balancing_strategies = {
    'none': {'strategy': 'none', 'sampling_ratio': None},
    'class_weight': {'strategy': 'class_weight', 'sampling_ratio': None},
    'smote_0.5': {'strategy': 'smote', 'sampling_ratio': 0.5},
    'smote_0.7': {'strategy': 'smote', 'sampling_ratio': 0.7},
    'undersample_0.5': {'strategy': 'undersample', 'sampling_ratio': 0.5},
    'combine': {'strategy': 'combine', 'sampling_ratio': 0.3}
}

# Hyperparameter grid for logistic regression (OPTIMIZED)
# Note: classifier__max_iter here overrides the max_iter set on the estimator
param_grid_lr = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
    'classifier__max_iter': [100]
}

# Hyperparameter grid for LightGBM
param_grid_lgb = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__num_leaves': [31, 50],
    'classifier__max_depth': [-1, 10, 15]
}

# Custom scorer for the F2-score
f2_scorer = make_scorer(fbeta_score, beta=2)


#### Testing all rebalancing strategies

In [None]:
# Helper to use either GridSearchCV or RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

def create_search_cv(pipeline, param_grid, cv, scoring, config):
    """
    Create a GridSearchCV or RandomizedSearchCV according to the configuration
    """
    if config['use_randomized']:
        return RandomizedSearchCV(
            pipeline,
            param_grid,
            n_iter=config['n_iter'],
            cv=cv,
            scoring=scoring,
            n_jobs=-1,
            verbose=config['verbose'],
            random_state=42
        )
    else:
        return GridSearchCV(
            pipeline,
            param_grid,
            cv=cv,
            scoring=scoring,
            n_jobs=-1,
            verbose=config['verbose']
        )

print("✅ create_search_cv function initialized")

In [None]:
# Definition of create_balanced_pipeline
# IMPORTANT: uses imblearn.pipeline.Pipeline ONLY when a resampling step is present
def create_balanced_pipeline(classifier, strategy='none', sampling_ratio=None, scaler=None):
    """
    Create a pipeline with optional data rebalancing followed by classification
    
    Parameters:
    -----------
    classifier : estimator
        The classifier to use (LogisticRegression, LGBMClassifier, etc.)
    strategy : str
        Rebalancing strategy: 'none', 'smote', 'smote_auto', 'undersample', 'combine'
    sampling_ratio : float
        Target minority/majority ratio after resampling (0 to 1)
    scaler : transformer
        Scaler used to normalize the data (StandardScaler, etc.)
    
    Returns:
    --------
    sklearn or imblearn Pipeline depending on the strategy
    """
    
    steps = []
    has_sampling = False  # Flag that decides which Pipeline class to use
    
    # Add the scaler if one is provided
    if scaler is not None:
        steps.append(('scaler', scaler))
    
    # Add the rebalancing step according to the strategy
    if strategy == 'none':
        # No rebalancing
        pass
    
    elif strategy == 'smote':
        # SMOTE with the specified ratio
        smote = SMOTE(sampling_strategy=sampling_ratio, random_state=42)
        steps.append(('smote', smote))
        has_sampling = True
    
    elif strategy == 'smote_auto':
        # SMOTE auto (full 1:1 balance)
        smote = SMOTE(sampling_strategy='auto', random_state=42)
        steps.append(('smote', smote))
        has_sampling = True
    
    elif strategy == 'undersample':
        # Random undersampling
        rus = RandomUnderSampler(sampling_strategy=sampling_ratio, random_state=42)
        steps.append(('undersample', rus))
        has_sampling = True
    
    elif strategy == 'combine':
        # Combination: undersampling (sampling_ratio) followed by SMOTE (fixed ratio of 0.5)
        under = RandomUnderSampler(sampling_strategy=sampling_ratio, random_state=42)
        over = SMOTE(sampling_strategy=0.5, random_state=42)
        steps.append(('undersample', under))
        steps.append(('smote', over))
        has_sampling = True
    
    # Add the classifier
    steps.append(('classifier', classifier))
    
    # Return the appropriate pipeline
    if has_sampling:
        # Use the imblearn pipeline only when a resampling step is present
        return pipe(steps)
    else:
        # No rebalancing: use the standard sklearn Pipeline
        return Pipeline(steps)

# LightGBM-specific function: handles rebalancing without a scaler
def create_lgbm_pipeline(classifier, strategy='smote', sampling_ratio=None):
    """
    Create a pipeline for LightGBM with the appropriate rebalancing
    LightGBM does not need a scaler (tree-based)
    
    Parameters:
    -----------
    classifier : LGBMClassifier
        The LightGBM classifier
    strategy : str
        Rebalancing strategy: 'class_weight', 'smote', 'undersample', 'combine'
    sampling_ratio : float
        Target minority/majority ratio after resampling
    
    Returns:
    --------
    imblearn or sklearn Pipeline depending on the strategy
    """
    
    # For class_weight, use a plain sklearn Pipeline
    # (the class weighting is handled by the classifier itself)
    if strategy == 'class_weight':
        return Pipeline([('classifier', classifier)])
    
    # For the other strategies, use the imblearn pipeline with a resampling step
    steps = []
    
    if strategy == 'smote':
        smote = SMOTE(sampling_strategy=sampling_ratio, random_state=42)
        steps.append(('smote', smote))
    
    elif strategy == 'undersample':
        rus = RandomUnderSampler(sampling_strategy=sampling_ratio, random_state=42)
        steps.append(('undersample', rus))
    
    elif strategy == 'combine':
        under = RandomUnderSampler(sampling_strategy=sampling_ratio, random_state=42)
        over = SMOTE(sampling_strategy=0.5, random_state=42)
        steps.append(('undersample', under))
        steps.append(('smote', over))
    
    # Add the classifier
    steps.append(('classifier', classifier))
    
    # Return an imblearn pipeline for the strategies that resample
    return pipe(steps)

print("✅ create_balanced_pipeline defined (sklearn Pipeline for 'none', imblearn for the other strategies)")
print("✅ create_lgbm_pipeline defined (pipeline tailored to LightGBM)")


In [None]:
# Initialize the dictionary that stores the results
results_lr = {}



In [None]:
# MLflow initialization
import sys
sys.path.append('..')

from src.mlflow_tracking.tracker import MLFlowTracker

# Initialize the MLflow tracker
tracker = MLFlowTracker(experiment_name="credit-scoring-projet7", tracking_uri="../mlruns")

print("\n" + "="*70)
print("TESTING ALL BALANCING STRATEGIES - LOGISTIC REGRESSION")
print("="*70)

# Loop over all strategies
for strategy_name, strategy_config in balancing_strategies.items():
    print(f"\n{'─'*70}")
    print(f"📊 Testing: {strategy_name}")
    print(f"{'─'*70}")
    
    # Start an MLflow run
    run_name = f"LogisticRegression_{strategy_name}"
    tracker.start_run(run_name=run_name)
    
    try:
        # Create the base model
        if strategy_config['strategy'] == 'class_weight':
            lr = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
            pipeline = create_balanced_pipeline(lr, 'none', scaler=StandardScaler())
        else:
            lr = LogisticRegression(random_state=42, max_iter=1000)
            pipeline = create_balanced_pipeline(
                lr, 
                strategy_config['strategy'], 
                sampling_ratio=strategy_config['sampling_ratio'],
                scaler=StandardScaler()
            )
        
        # Configure repeated stratified cross-validation
        cv = RepeatedStratifiedKFold(
            n_splits=GRIDSEARCH_CONFIG['cv_splits'], 
            n_repeats=GRIDSEARCH_CONFIG['cv_repeats'], 
            random_state=42
        )
        
        # GridSearchCV or RandomizedSearchCV, depending on the configuration
        grid_search = create_search_cv(
            pipeline,
            param_grid_lr,
            cv=cv,
            scoring=f2_scorer,
            config=GRIDSEARCH_CONFIG
        )
        
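        # Training (the search type depends on GRIDSEARCH_CONFIG)
        print(f"\n {'RandomizedSearchCV' if GRIDSEARCH_CONFIG['use_randomized'] else 'GridSearchCV'}...")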
        start_time = time.time()
        grid_search.fit(X_train, y_train)
        training_time = time.time() - start_time
        
        # Best model
        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
        best_cv_score = grid_search.best_score_
        
        # Predictions on the test set (labels for threshold metrics, probabilities for AUC)
        y_pred_test = best_model.predict(X_test)
        y_pred_proba_test = best_model.predict_proba(X_test)[:, 1]
        
        # Compute the metrics
        score_biz = score_metier(y_test, y_pred_test)
        betascore = fbeta_score(y_test, y_pred_test, beta=2)
        recall = recall_score(y_test, y_pred_test)
        precision = precision_score(y_test, y_pred_test, zero_division=0)
        accuracy = accuracy_score(y_test, y_pred_test)
        # ROC AUC is a ranking metric: score it on probabilities, not on hard labels
        auc = roc_auc_score(y_test, y_pred_proba_test)
        
        # Display the results
        print(f"\n Results:")
        print(f"├─ Training time: {training_time:.2f}s")
        print(f"├─ Best CV F2-Score: {best_cv_score:.4f}")
        print(f"├─ Test F2-Score: {betascore:.4f}")
        print(f"├─ Test Recall: {recall:.4f}")
        print(f"├─ Test Precision: {precision:.4f}")
        print(f"├─ Test Accuracy: {accuracy:.4f}")
        print(f"├─ Test AUC: {auc:.4f}")
        print(f"└─ Business score: {score_biz}")
        
        # Log parameters and hyperparameters to MLflow
        params_to_log = {
            'strategy': strategy_config['strategy'],
            'sampling_ratio': strategy_config['sampling_ratio'] if strategy_config['sampling_ratio'] else 'N/A',
            'scaler': 'StandardScaler',
            'training_time': training_time,
            **best_params
        }
        tracker.log_params(params_to_log)
        
        # Log metrics to MLflow
        metrics_to_log = {
            'best_cv_f2_score': best_cv_score,
            'test_f2_score': betascore,
            'test_recall': recall,
            'test_precision': precision,
            'test_accuracy': accuracy,
            'test_auc': auc,
            'score_metier': score_biz
        }
        tracker.log_metrics(metrics_to_log)
        
        # Log the model to MLflow
        tracker.log_model(best_model, model_name=f"lr_{strategy_name}")
        
        # Store the results
        results_lr[strategy_name] = {
            'model': best_model,
            'best_params': best_params,
            'best_cv_score': best_cv_score,
            'test_metrics': {
                'f2_score': betascore,
                'recall': recall,
                'precision': precision,
                'accuracy': accuracy,
                'auc': auc,
                'score_metier': score_biz
            },
            'y_pred': y_pred_test,
            'training_time': training_time,
            'run_id': mlflow.active_run().info.run_id
        }
        
        # Plot confusion matrix
        plot_confusion_matrix(y_test, y_pred_test, model_name=f"LogisticRegression_{strategy_name}")
        
        # End the run
        tracker.end_run()
        
    except Exception as e:
        print(f"❌ Error with strategy {strategy_name}: {e}")
        tracker.end_run()
        continue

print("\n" + "="*70)
print("✅ ALL LOGISTIC REGRESSION RUNS COMPLETED")
print("="*70)

#### Comparison of results

### LightGBM <a class="anchor" id="lightgbm"></a>
LightGBM is a gradient boosting framework built on decision trees and optimized for speed and performance (histogram-based split finding, leaf-wise tree growth).

Objective: apply the same process as for logistic regression, with:
- All balancing strategies (`none`, `class_weight`, `smote_*`, `smote_auto`, `undersample`, `combine`)
- Optimization of the model's main hyperparameters
- Full MLflow tracking (parameters, metrics, thresholds, model)
- A final comparative selection.

LightGBM specifics for imbalanced data:
- `class_weight`, or `is_unbalance=True` / `scale_pos_weight` (`class_weight` is used here for consistency)
- Risk of overfitting with too many leaves → control `num_leaves` and `max_depth`
- Regularization via `reg_alpha` / `reg_lambda`
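To make the weighting options above concrete, here is a minimal sketch (not taken from the notebook) of how the two kinds of weight are usually derived from class counts. The formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn documents for `class_weight='balanced'`; using `n_negative / n_positive` for `scale_pos_weight` is a common rule of thumb. The helper names and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def pos_weight_ratio(y, positive=1):
    """Ratio of negatives to positives, a common choice for scale_pos_weight."""
    n_pos = sum(1 for label in y if label == positive)
    n_neg = len(y) - n_pos
    return n_neg / n_pos


# Toy labels (hypothetical): 92 repaid loans (0) and 8 defaults (1)
y_toy = [0] * 92 + [1] * 8
weights = balanced_class_weights(y_toy)
ratio = pos_weight_ratio(y_toy)
print(weights, ratio)
```

On a binary problem both options give the minority class the same relative weight (`weights[1] / weights[0]` equals `ratio`), which is why only one of them should be set at a time.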

The F2-score is optimized (recall takes priority), and the business score (10*FN + FP) is tracked as well.
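As an illustration of the two metrics named above, the sketch below computes the business cost `10*FN + FP` and the F-beta score from raw confusion counts. It is a stand-in written for this note, not the notebook's own `score_metier` or `f2_scorer` (both are defined elsewhere); the function names and toy labels are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Cost in which a missed default (FN) weighs ten times a false alarm (FP)."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fn_cost * fn + fp_cost * fp


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Toy example (hypothetical labels): 1 TP, 2 FN, 1 FP, 4 TN
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 0, 0, 0]
cost = business_cost(y_true_toy, y_pred_toy)
f2 = f_beta(y_true_toy, y_pred_toy, beta=2.0)
print(cost, round(f2, 4))
```

Both metrics penalize false negatives more than false positives, but the business cost is a count to minimize while F2 is a ratio to maximize, so the two can rank models differently.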

In [None]:
print("\n" + "="*70)
print("LIGHTGBM - CONFIGURATION")
print("="*70)

# LightGBM hyperparameter grid: one value per parameter, so the search evaluates a single configuration (kept minimal for speed)
param_grid_lgbm = {
    'classifier__num_leaves': [31],
    'classifier__max_depth': [7],
    'classifier__learning_rate': [0.05],
    'classifier__n_estimators': [100],
    'classifier__reg_alpha': [0.0],
    'classifier__reg_lambda': [0.0]
}

# Dictionary that stores the results
results_lgbm = {}

print("LightGBM configuration complete!")
print(f"Number of strategies to test: {len(balancing_strategies)}")
print(f"Hyperparameters in the grid: {list(param_grid_lgbm.keys())}")


In [None]:
# RandomizedSearchCV for optimization (cheaper than GridSearchCV).
# Reference snippet: `pipeline` and `cv` must already exist (they are built inside the training loops).
random_search = RandomizedSearchCV(
    pipeline,
    param_grid_lgbm,
    n_iter=GRIDSEARCH_CONFIG['n_iter'],  # taken from the shared config
    cv=cv,
    scoring=f2_scorer,
    n_jobs=1,  # IMPORTANT: n_jobs=1 avoids pickling problems
    verbose=GRIDSEARCH_CONFIG['verbose'],
    random_state=42
)

In [None]:
# ============================================================================
# MLFLOW CLEANUP SCRIPT - RUN THIS FIRST
# ============================================================================

import mlflow

print("\n" + "="*70)
print("🧹 CLEANING UP ACTIVE MLFLOW RUNS")
print("="*70)

# End every active run (bounded loop, in case several runs are stacked)
run_ended = False
max_attempts = 5

for i in range(max_attempts):
    try:
        active_run = mlflow.active_run()
        if active_run:
            print(f"⚠️ Active run detected: {active_run.info.run_id}")
            mlflow.end_run()
            print(f"✅ Run {active_run.info.run_id} ended")
            run_ended = True
        else:
            break
    except Exception as e:
        print(f"ℹ️ Attempt {i+1}/{max_attempts}: {e}")
        break

if run_ended:
    print("\n✅ All active runs have been ended")
else:
    print("\n✅ No active run detected")

# Final check
try:
    active_run = mlflow.active_run()
    if active_run:
        print(f"\n⚠️ WARNING: a run is still active: {active_run.info.run_id}")
        print("Run: mlflow.end_run()")
    else:
        print("\n✅ Ready to start LightGBM training")
except Exception:
    print("\n✅ Ready to start LightGBM training")

print("="*70 + "\n")

In [31]:
# ============================================================================
# FULL SCRIPT: LIGHTGBM WITH COLUMN NAME CLEANING
# ============================================================================

import re
import sys
import time
import mlflow
sys.path.append('..')

from src.mlflow_tracking.tracker import MLFlowTracker

# ============================================================================
# STEP 1: COLUMN NAME CLEANING FUNCTION
# ============================================================================

def clean_column_names(df):
    """Replace characters that LightGBM rejects in feature names with underscores."""
    forbidden_chars = r'[\[\]{}:,"\']'
    column_mapping = {}
    
    for col in df.columns:
        new_col = re.sub(forbidden_chars, '_', str(col))
        new_col = re.sub(r'_+', '_', new_col)
        new_col = new_col.strip('_')
        column_mapping[col] = new_col
    
    df_clean = df.rename(columns=column_mapping)
    
    changes = [(old, new) for old, new in column_mapping.items() if old != new]
    if changes:
        print(f"✅ {len(changes)} columns renamed for LightGBM compatibility")
    
    return df_clean

# ============================================================================
# STEP 2: CLEAN THE DATA
# ============================================================================

print("\n" + "="*70)
print("🧹 CLEANING COLUMN NAMES")
print("="*70)

X_train_clean = clean_column_names(X_train)
X_test_clean = clean_column_names(X_test)

print(f"✅ X_train_clean: {X_train_clean.shape}")
print(f"✅ X_test_clean: {X_test_clean.shape}")

# ============================================================================
# STEP 3: LIGHTGBM TRAINING
# ============================================================================

# Initialize the MLflow tracker
tracker = MLFlowTracker(experiment_name="credit-scoring-projet7", tracking_uri="../mlruns")

# Dictionary that stores the results
results_lgbm = {}

print("\n" + "="*70)
print("TESTING ALL BALANCING STRATEGIES - LIGHTGBM")
print("="*70)

# Loop over all balancing strategies
for strategy_name, strategy_config in balancing_strategies.items():
    print(f"\n{'─'*70}")
    print(f"📊 Testing: {strategy_name}")
    print(f"{'─'*70}")
    
    run_name = f"LightGBM_{strategy_name}"
    tracker.start_run(run_name=run_name)
    
    try:
        # Build the LightGBM model
        if strategy_config['strategy'] == 'class_weight':
            lgbm = LGBMClassifier(
                class_weight='balanced',
                random_state=42,
                n_jobs=-1,
                verbose=-1
            )
            pipeline = create_lgbm_pipeline(lgbm, strategy='class_weight')
        else:
            lgbm = LGBMClassifier(
                random_state=42,
                n_jobs=-1,
                verbose=-1
            )
            pipeline = create_lgbm_pipeline(
                lgbm, 
                strategy=strategy_config['strategy'], 
                sampling_ratio=strategy_config['sampling_ratio']
            )
        
        # Cross-validation
        cv = RepeatedStratifiedKFold(
            n_splits=GRIDSEARCH_CONFIG['cv_splits'], 
            n_repeats=GRIDSEARCH_CONFIG['cv_repeats'], 
            random_state=42
        )
        
        # Hyperparameter search (uses the grid defined above as param_grid_lgbm)
        grid_search = create_search_cv(
            pipeline,
            param_grid_lgbm,
            cv=cv,
            scoring=f2_scorer,
            config=GRIDSEARCH_CONFIG
        )
        
        # Training
        print(f"\n🔍 {'RandomizedSearchCV' if GRIDSEARCH_CONFIG['use_randomized'] else 'GridSearchCV'}...")
        start_time = time.time()
        grid_search.fit(X_train_clean, y_train)  # ← cleaned data
        training_time = time.time() - start_time
        
        # Best model
        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
        best_cv_score = grid_search.best_score_
        
        # Predictions
        y_pred_test = best_model.predict(X_test_clean)  # ← cleaned data
        y_pred_proba_test = best_model.predict_proba(X_test_clean)[:, 1]
        
        # Metrics
        score_biz = score_metier(y_test, y_pred_test)
        betascore = fbeta_score(y_test, y_pred_test, beta=2)
        recall = recall_score(y_test, y_pred_test)
        precision = precision_score(y_test, y_pred_test, zero_division=0)
        accuracy = accuracy_score(y_test, y_pred_test)
        auc = roc_auc_score(y_test, y_pred_proba_test)
        
        # Display
        print(f"\n📈 Results:")
        print(f"├─ Training time: {training_time:.2f}s")
        print(f"├─ Best CV F2-Score: {best_cv_score:.4f}")
        print(f"├─ Test F2-Score: {betascore:.4f}")
        print(f"├─ Test Recall: {recall:.4f}")
        print(f"├─ Test Precision: {precision:.4f}")
        print(f"├─ Test Accuracy: {accuracy:.4f}")
        print(f"├─ Test AUC: {auc:.4f}")
        print(f"└─ Business score: {score_biz}")
        
        # MLflow logging
        params_to_log = {
            'model_type': 'LightGBM',
            'strategy': strategy_config['strategy'],
            'sampling_ratio': strategy_config['sampling_ratio'] if strategy_config['sampling_ratio'] else 'N/A',
            'scaler': 'None',
            'training_time': training_time,
            'cv_splits': GRIDSEARCH_CONFIG['cv_splits'],
            'cv_repeats': GRIDSEARCH_CONFIG['cv_repeats'],
            **best_params
        }
        tracker.log_params(params_to_log)
        
        metrics_to_log = {
            'best_cv_f2_score': best_cv_score,
            'test_f2_score': betascore,
            'test_recall': recall,
            'test_precision': precision,
            'test_accuracy': accuracy,
            'test_auc': auc,
            'score_metier': score_biz
        }
        tracker.log_metrics(metrics_to_log)
        
        tracker.log_model(best_model, model_name=f"lgbm_{strategy_name}")
        
        # Store the results
        results_lgbm[strategy_name] = {
            'model': best_model,
            'best_params': best_params,
            'best_cv_score': best_cv_score,
            'test_metrics': {
                'f2_score': betascore,
                'recall': recall,
                'precision': precision,
                'accuracy': accuracy,
                'auc': auc,
                'score_metier': score_biz
            },
            'y_pred': y_pred_test,
            'y_pred_proba': y_pred_proba_test,
            'training_time': training_time,
            'run_id': mlflow.active_run().info.run_id
        }
        
        # Confusion matrix
        plot_confusion_matrix(y_test, y_pred_test, model_name=f"LightGBM_{strategy_name}")
        
        tracker.end_run()
        
    except Exception as e:
        print(f"❌ Error with strategy {strategy_name}: {e}")
        import traceback
        traceback.print_exc()
        tracker.end_run()
        continue

# ============================================================================
# STEP 4: RESULTS SUMMARY
# ============================================================================

print("\n" + "="*70)
print("✅ ALL LIGHTGBM RUNS COMPLETED")
print("="*70)

print("\n" + "="*70)
print("📊 SUMMARY - LIGHTGBM RESULTS")
print("="*70)

if results_lgbm:
    print(f"\n{'Strategy':<20} {'F2-Score':<12} {'Recall':<10} {'Precision':<12} {'AUC':<10} {'Business score':<15}")
    print("─" * 90)
    
    for strategy_name, result in results_lgbm.items():
        metrics = result['test_metrics']
        print(f"{strategy_name:<20} {metrics['f2_score']:<12.4f} {metrics['recall']:<10.4f} "
              f"{metrics['precision']:<12.4f} {metrics['auc']:<10.4f} {metrics['score_metier']:<15}")
    
    best_strategy = max(results_lgbm.items(), key=lambda x: x[1]['test_metrics']['f2_score'])
    print("\n" + "─" * 90)
    print(f"🏆 Best Strategy: {best_strategy[0]} (F2-Score: {best_strategy[1]['test_metrics']['f2_score']:.4f})")
    
    # Additional statistics
    print("\n📈 Additional statistics:")
    print(f"├─ Number of strategies tested: {len(results_lgbm)}")
    print(f"├─ Total training time: {sum(r['training_time'] for r in results_lgbm.values()):.2f}s")
    print(f"└─ Mean time per strategy: {sum(r['training_time'] for r in results_lgbm.values())/len(results_lgbm):.2f}s")
else:
    print("\n⚠️ No results available")

print("\n" + "="*70)


🧹 CLEANING COLUMN NAMES
✅ 6 columns renamed for LightGBM compatibility
✅ 6 columns renamed for LightGBM compatibility
✅ X_train_clean: (246004, 299)
✅ X_test_clean: (61501, 299)
📁 Tracking directory: c:\ashash\7\projet7-scoring-credit\notebooks\..\mlruns
⚠️ Error:  Model registry functionality is unavailable; got unsupported URI 'c:\ashash\7\projet7-scoring-credit\notebooks\..\mlruns' for model registry data storage. Supported URI schemes are: ['', 'file', 'databricks', 'databricks-uc', 'uc', 'http', 'https', 'postgresql', 'mysql', 'sqlite', 'mssql']. See https://www.mlflow.org/docs/latest/tracking.html#storage for how to run an MLflow server against one of the supported backend storage locations.

TESTING ALL BALANCING STRATEGIES - LIGHTGBM

──────────────────────────────────────────────────────────────────────

UnsupportedModelRegistryStoreURIException:  Model registry functionality is unavailable; got unsupported URI 'c:\ashash\7\projet7-scoring-credit\notebooks\..\mlruns' for model registry data storage. Supported URI schemes are: ['', 'file', 'databricks', 'databricks-uc', 'uc', 'http', 'https', 'postgresql', 'mysql', 'sqlite', 'mssql']. See https://www.mlflow.org/docs/latest/tracking.html#storage for how to run an MLflow server against one of the supported backend storage locations.
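
The exception above lists the URI schemes MLflow accepts. A raw Windows path such as `c:\...` is parsed as a URI whose scheme is `c`, which is not in that list. One common remedy is to pass a `file://` URI instead. The sketch below shows that conversion with the standard library only; the helper name `to_file_uri` and the sample Windows path are hypothetical, and whether the project's `MLFlowTracker` forwards `tracking_uri` unchanged to MLflow is an assumption.

```python
from pathlib import Path, PureWindowsPath


def to_file_uri(path):
    """Return an absolute file:// URI for a local directory path."""
    return Path(path).resolve().as_uri()


# Relative path used in the notebook, converted to a scheme MLflow accepts
tracking_uri = to_file_uri("../mlruns")
print(tracking_uri)

# The same conversion applied to a Windows-style absolute path (hypothetical path)
windows_uri = PureWindowsPath(r"c:\projects\scoring\mlruns").as_uri()
print(windows_uri)
```

The resulting string could then be given to `mlflow.set_tracking_uri(...)` or to the tracker's `tracking_uri` argument, assuming the tracker accepts a URI string.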

In [None]:
# Cleanly close any active MLflow run
import mlflow
try:
    if mlflow.active_run():
        mlflow.end_run()
        print("✅ MLflow run closed cleanly")
except Exception as e:
    print(f"Note: {e}")



## Selecting the best model <a class="anchor" id="best"></a>



In [None]:
all_results = []

# --- Logistic Regression ---
if 'results_lr' in globals() and results_lr:
    for strategy_name, result in results_lr.items():
        metrics = result['test_metrics']
        all_results.append({
            'Model': 'Logistic Regression',
            'Strategy': strategy_name,
            'F2-Score': metrics['f2_score'],
            'Recall': metrics['recall'],
            'Precision': metrics['precision'],
            'Accuracy': metrics['accuracy'],
            'AUC': metrics['auc'],
            'Score M√©tier': metrics['score_metier'],
            'CV F2-Score': result['best_cv_score'],
            'Training Time (s)': result['training_time']
        })

# --- LightGBM ---
if 'results_lgbm' in globals() and results_lgbm:
    for strategy_name, result in results_lgbm.items():
        metrics = result['test_metrics']
        all_results.append({
            'Model': 'LightGBM',
            'Strategy': strategy_name,
            'F2-Score': metrics['f2_score'],
            'Recall': metrics['recall'],
            'Precision': metrics['precision'],
            'Accuracy': metrics['accuracy'],
            'AUC': metrics['auc'],
            'Score M√©tier': metrics['score_metier'],
            'CV F2-Score': result['best_cv_score'],
            'Training Time (s)': result['training_time']
        })

df_models_recap = pd.DataFrame(all_results)

if len(df_models_recap) == 0:
    print("\n No results found.")
else:
    df_models_recap = df_models_recap.sort_values('F2-Score', ascending=False).reset_index(drop=True)
    df_models_recap.insert(0, 'Rank', range(1, len(df_models_recap) + 1))

    # =====================
    #     STYLE DATAFRAME
    # =====================

    styled_df = (
        df_models_recap.style
        .background_gradient(subset=['Score M√©tier'], cmap='RdYlGn_r')  # reversed map: lower cost = greener
        .background_gradient(subset=['F2-Score'], cmap='RdYlGn', vmin=0.4, vmax=0.8)
        .background_gradient(subset=['Recall'], cmap='YlOrRd', vmin=0.4, vmax=1.0)
        .background_gradient(subset=['Precision'], cmap='Blues', vmin=0.3, vmax=0.8)
        .background_gradient(subset=['AUC'], cmap='Purples', vmin=0.5, vmax=0.9)
        .highlight_max(subset=['F2-Score'], color='#90EE90')
        .highlight_max(subset=['Recall'], color='#FFD700')
        .highlight_max(subset=['AUC'], color='#87CEEB')
        .highlight_min(subset=['Score M√©tier'], color="#90EE90")
        .set_properties(**{
            'text-align': 'center',
            'font-size': '12px',
            'border': '1px solid #ddd',
            'padding': '8px'
        })
        .set_table_styles([
            {
                'selector': 'th',
                'props': [
                    ('background-color', '#2E86AB'),
                    ('color', 'white'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('padding', '12px'),
                    ('border', '1px solid white'),
                    ('font-size', '13px')
                ]
            },
            {
                'selector': 'tr:hover',
                'props': [
                    ('background-color', "#1fbb5b"),
                    ('cursor', 'pointer')
                ]
            }
        ])
        .format({
            'F2-Score': '{:.4f}',
            'Recall': '{:.4f}',
            'Precision': '{:.4f}',
            'Accuracy': '{:.4f}',
            'AUC': '{:.4f}',
            'CV F2-Score': '{:.4f}',
            'Score M√©tier': '{:,.0f}',
            'Training Time (s)': '{:.2f}'
        })
    )

    print("\n")
    display(styled_df)


In [None]:
# Bar plot comparing the models on the business score and the F2-score
print("\n" + "="*100)
print("📊 COMPARATIVE BAR PLOT: F2-SCORE vs BUSINESS SCORE")
print("="*100 + "\n")


plot_df = pd.DataFrame(all_results) 
plot_df['Modèle'] = plot_df['Model'].str[:2].str.upper() + "_" + plot_df['Strategy'].str[:8]

# Set the style before creating the figure so that it takes effect
sns.set_style("whitegrid")
fig, ax1 = plt.subplots(figsize=(14, 7))

# Explicit x positions so the bars and the line share the same coordinates
x_pos = np.arange(len(plot_df))
bars = ax1.bar(x_pos, plot_df['Score M√©tier'].values, color=plt.cm.viridis(np.linspace(0, 1, len(plot_df))), alpha=0.8)

ax1.set_title("Model evaluation by business score and F2-score", 
              fontweight="bold", fontsize=14, pad=20)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(plot_df['Modèle'].values, rotation=45, ha='right', fontsize=10)
ax1.yaxis.label.set_color('tab:purple')
ax1.tick_params(axis='y', colors='tab:purple')
ax1.set_ylabel('Business score (lower is better)', fontsize=11, fontweight='bold', color='tab:purple')
ax1.grid(axis='y', alpha=0.3)

# Twin axis for the F2-score
ax2 = ax1.twinx()
line = ax2.plot(x_pos, plot_df['F2-Score'].values, marker='o', 
                color='tab:orange', linewidth=3, markersize=8, label='F2-Score')
ax2.set_ylim(0, 1)
ax2.yaxis.label.set_color('tab:orange')
ax2.tick_params(axis='y', colors='tab:orange')
ax2.set_ylabel('F2-Score (higher is better)', fontsize=11, fontweight='bold', color='tab:orange')
ax2.grid(None)

# Annotate each point with its value
for i, val in enumerate(plot_df['F2-Score'].values):
    ax2.text(i, val + 0.02, f'{val:.4f}', ha='center', fontsize=9, fontweight='bold', color='tab:orange')

# Legends
ax1.legend(['Business score'], loc='upper left', fontsize=10)
ax2.legend(['F2-Score'], loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()



In [None]:

# ============================================================================
# 🏆 EXTRACTING THE BEST MODEL (RANK 1) AND FEATURE IMPORTANCE
# ============================================================================

print("\n" + "="*100)
print("🏆 BEST MODEL - RANK 1")
print("="*100)

# Retrieve the row of the best model (the table is sorted by F2-score)
best_model_row = df_models_recap.loc[0]

print(f"\n📊 Best model:")
print(f"  ├─ Model: {best_model_row['Model']}")
print(f"  ├─ Strategy: {best_model_row['Strategy']}")
print(f"  ├─ F2-Score: {best_model_row['F2-Score']:.4f}")
print(f"  ├─ Recall: {best_model_row['Recall']:.4f}")
print(f"  ├─ Precision: {best_model_row['Precision']:.4f}")
print(f"  ├─ Accuracy: {best_model_row['Accuracy']:.4f}")
print(f"  ├─ AUC: {best_model_row['AUC']:.4f}")
print(f"  ├─ Business score: {best_model_row['Score M√©tier']:.0f}")
print(f"  └─ Training Time: {best_model_row['Training Time (s)']:.2f}s")

# Look up the fitted model in the result dictionaries
model_type = best_model_row['Model']
strategy = best_model_row['Strategy']

if model_type == 'Logistic Regression':
    best_model_obj = results_lr[strategy]['model']
    print(f"\n✅ Selected model: Logistic Regression - {strategy}")
else:
    best_model_obj = results_lgbm[strategy]['model']
    print(f"\n✅ Selected model: LightGBM - {strategy}")

# ============================================================================
# FEATURE IMPORTANCE
# ============================================================================

print("\n" + "="*100)
print("üìä FEATURE IMPORTANCE")
print("="*100)

# Extract the final estimator when the best model is a pipeline
if hasattr(best_model_obj, 'named_steps'):
    classifier = best_model_obj.named_steps.get('classifier')
else:
    classifier = best_model_obj

# Default value so the checks below never hit an undefined name
importance_values = None

# Extract the importance according to the model type
if model_type == 'Logistic Regression':
    # For logistic regression, use the absolute coefficients
    if hasattr(classifier, 'coef_'):
        importance_values = np.abs(classifier.coef_[0])
        importance_type = "Absolute Coefficients (Logistic Regression)"
    else:
        print("⚠️ Could not extract the model coefficients")
        
elif model_type == 'LightGBM':
    # For LightGBM, use feature_importances_
    if hasattr(classifier, 'feature_importances_'):
        importance_values = classifier.feature_importances_
        importance_type = "Feature Importances (LightGBM)"
    else:
        print("⚠️ Could not extract the model feature importances")

# Build the importance dataframe
if importance_values is not None:
    # Retrieve the feature names
    if model_type == 'LightGBM':
        feature_names = X_test_clean.columns.tolist()
    else:
        feature_names = X_test.columns.tolist()
    
    # Build a dataframe sorted by decreasing importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance_values
    }).sort_values('Importance', ascending=False).reset_index(drop=True)
    
    # Show the top 20 features
    print(f"\n🔝 TOP 20 FEATURES - {importance_type}:")
    print("\n")
    top_20 = importance_df.head(20)
    print(top_20.to_string(index=False))
    
    # =====================
    # PLOT TOP 20 FEATURES
    # =====================
    
    plt.figure(figsize=(12, 8))
    
    # Build the horizontal bar plot
    colors = plt.cm.viridis(np.linspace(0, 1, len(top_20)))
    bars = plt.barh(range(len(top_20)), top_20['Importance'].values, color=colors)
    
    # Customize the plot
    plt.yticks(range(len(top_20)), top_20['Feature'].values, fontsize=10)
    plt.xlabel('Importance', fontsize=12, fontweight='bold')
    plt.ylabel('Features', fontsize=12, fontweight='bold')
    plt.title(f'Top 20 Features - {model_type} ({strategy})\n{importance_type}', 
              fontsize=14, fontweight='bold', pad=20)
    
    # Annotate each bar with its value
    for i, (idx, row) in enumerate(top_20.iterrows()):
        plt.text(row['Importance'], i, f" {row['Importance']:.4f}", 
                va='center', fontsize=9, fontweight='bold')
    
    plt.tight_layout()
    plt.grid(axis='x', alpha=0.3, linestyle='--')
    plt.show()
    
    # =====================
    # PLOT TOP 10 FEATURES (PIE CHART)
    # =====================
    
    top_10 = importance_df.head(10)
    other_importance = importance_df.iloc[10:]['Importance'].sum()
    
    plt.figure(figsize=(10, 8))
    
    # Build the data for the pie chart
    pie_labels = list(top_10['Feature'].values) + ['Others']
    pie_values = list(top_10['Importance'].values) + [other_importance]
    colors_pie = plt.cm.Set3(np.linspace(0, 1, len(pie_labels)))
    
    # Draw the pie chart
    wedges, texts, autotexts = plt.pie(pie_values, 
                                        labels=pie_labels,
                                        autopct='%1.1f%%',
                                        startangle=90,
                                        colors=colors_pie,
                                        textprops={'fontsize': 10})
    
    # Make the percentages bold
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')
        autotext.set_fontsize(9)
    
    plt.title(f'Top 10 Features Distribution - {model_type} ({strategy})', 
              fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.show()
    
    # =====================
    # STATISTIQUES
    # =====================
    
    print("\n" + "="*100)
    print("📈 IMPORTANCE STATISTICS")
    print("="*100)
    
    print(f"\n{'Total Features':<30} {len(importance_df)}")
    print(f"{'Importance Max':<30} {importance_df['Importance'].max():.4f}")
    print(f"{'Importance Min':<30} {importance_df['Importance'].min():.4f}")
    print(f"{'Importance Mean':<30} {importance_df['Importance'].mean():.4f}")
    print(f"{'Importance Std':<30} {importance_df['Importance'].std():.4f}")
    
    # Cumulative share of total importance
    importance_df['Cumulative_Importance'] = importance_df['Importance'].cumsum() / importance_df['Importance'].sum()
    
    print(f"\n{'Cumulative Importance:':<30}")
    print(f"  ├─ Top 5 features: {importance_df['Cumulative_Importance'].iloc[4]:.2%}")
    print(f"  ├─ Top 10 features: {importance_df['Cumulative_Importance'].iloc[9]:.2%}")
    print(f"  ├─ Top 20 features: {importance_df['Cumulative_Importance'].iloc[19]:.2%}")
    print(f"  └─ All features: {importance_df['Cumulative_Importance'].iloc[-1]:.2%}")
    
    print("\n" + "="*100)


In [None]:
# Display the hyperparameters of the best model

In [None]:
# ============================================================================
# 📊 RESULTS AND HYPERPARAMETERS - LIGHTGBM CLASS_WEIGHT
# ============================================================================

print("\n" + "="*100)
print("📊 DETAILED RESULTS - LightGBM CLASS_WEIGHT")
print("="*100)

if 'results_lgbm' in globals() and 'class_weight' in results_lgbm:
    result = results_lgbm['class_weight']
    metrics = result['test_metrics']
    best_params = result['best_params']
    
    print(f"\n📈 TEST METRICS:")
    print(f"  ├─ F2-Score: {metrics['f2_score']:.4f}")
    print(f"  ├─ Recall: {metrics['recall']:.4f}")
    print(f"  ├─ Precision: {metrics['precision']:.4f}")
    print(f"  ├─ Accuracy: {metrics['accuracy']:.4f}")
    print(f"  ├─ AUC: {metrics['auc']:.4f}")
    print(f"  └─ Business score: {metrics['score_metier']:.0f}")
    
    print(f"\n⚙️ HYPERPARAMETERS:")
    for param_name, param_value in best_params.items():
        print(f"  ├─ {param_name}: {param_value}")
    
    print(f"\n📊 CROSS-VALIDATION:")
    print(f"  └─ Best CV F2-Score: {result['best_cv_score']:.4f}")
    
    # Plot comparatif
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Metrics comparison
    metrics_names = ['F2-Score', 'Recall', 'Precision', 'Accuracy', 'AUC']
    metrics_values = [metrics['f2_score'], metrics['recall'], metrics['precision'], 
                      metrics['accuracy'], metrics['auc']]
    colors_metrics = plt.cm.viridis(np.linspace(0, 1, len(metrics_names)))
    
    bars1 = ax1.bar(metrics_names, metrics_values, color=colors_metrics, alpha=0.8)
    ax1.set_ylim(0, 1)
    ax1.set_ylabel('Score', fontsize=11, fontweight='bold')
    ax1.set_title('LightGBM Class_Weight - Test Metrics', fontsize=12, fontweight='bold')
    ax1.grid(axis='y', alpha=0.3)
    
    # Annotate each bar with its value
    for bar, val in zip(bars1, metrics_values):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'{val:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=10)
    
    plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # Plot 2: Business metrics
    ax2.text(0.5, 0.8, 'BUSINESS RESULTS', ha='center', fontsize=14, fontweight='bold',
            transform=ax2.transAxes)
    ax2.text(0.5, 0.6, f"Business score: {metrics['score_metier']:.0f}", ha='center', fontsize=13,
            transform=ax2.transAxes, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
    ax2.text(0.5, 0.4, f"CV F2-Score: {result['best_cv_score']:.4f}", ha='center', fontsize=13,
            transform=ax2.transAxes, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))
    ax2.text(0.5, 0.2, f"Training time: {result['training_time']:.2f}s", ha='center', 
            fontsize=13, transform=ax2.transAxes, bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\n✅ Results displayed successfully!")
else:
    print("⚠️ LightGBM class_weight results not found. Make sure the training cell has been run.")
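
The F2-score reported in this cell is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below spells out the formula without scikit-learn; the function name `f_beta` and the precision/recall pairs are illustrative assumptions, not values from this notebook.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score from precision and recall; beta > 1 favours recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # same convention as zero_division=0 in scikit-learn
    return (1 + beta ** 2) * precision * recall / denom


# Two mirror-image models: identical F1, but F2 rewards the high-recall one.
high_recall = f_beta(precision=0.5, recall=0.8)
high_precision = f_beta(precision=0.8, recall=0.5)
print(f"F2 (P=0.5, R=0.8): {high_recall:.4f}")
print(f"F2 (P=0.8, R=0.5): {high_precision:.4f}")
```

This is why F2 suits a credit-default setting where missing a defaulter costs more than rejecting a good client.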


In [None]:
# Add a threshold to optimize the business score

In [None]:
# ============================================================================
# 🎯 THRESHOLD OPTIMIZATION FOR THE BUSINESS SCORE
# ============================================================================

print("\n" + "="*100)
print("🎯 THRESHOLD OPTIMIZATION FOR THE BUSINESS SCORE")
print("="*100)

if 'results_lgbm' in globals() and 'class_weight' in results_lgbm:
    result = results_lgbm['class_weight']
    best_model = result['model']
    y_pred_proba = result['y_pred_proba']
    
    # Compute the business score on a grid of thresholds (0.00 to 1.00, step 0.01)
    thresholds = np.linspace(0.0, 1.0, 101)
    business_scores = []
    f2_scores = []
    recalls = []
    precisions = []
    
    for threshold in thresholds:
        y_pred_threshold = (y_pred_proba >= threshold).astype(int)
        biz_score = score_metier(y_test, y_pred_threshold)
        f2 = fbeta_score(y_test, y_pred_threshold, beta=2, zero_division=0)
        recall = recall_score(y_test, y_pred_threshold, zero_division=0)
        precision = precision_score(y_test, y_pred_threshold, zero_division=0)
        
        business_scores.append(biz_score)
        f2_scores.append(f2)
        recalls.append(recall)
        precisions.append(precision)
    
    # Find the threshold that minimizes the business score (lower is better)
    optimal_idx_business = np.argmin(business_scores)
    optimal_threshold_business = thresholds[optimal_idx_business]
    optimal_business_score = business_scores[optimal_idx_business]
    
    # Find the threshold that maximizes the F2-score
    optimal_idx_f2 = np.argmax(f2_scores)
    optimal_threshold_f2 = thresholds[optimal_idx_f2]
    optimal_f2_score = f2_scores[optimal_idx_f2]
    
    print("\n🎯 OPTIMAL THRESHOLDS:")
    print(f"  ├─ Optimal threshold for the business score: {optimal_threshold_business:.2f}")
    print(f"  │  └─ Business score at this threshold: {optimal_business_score:.0f}")
    print(f"  ├─ Optimal threshold for the F2-Score: {optimal_threshold_f2:.2f}")
    print(f"  │  └─ F2-Score at this threshold: {optimal_f2_score:.4f}")
    print(f"  └─ Business score at the default threshold (0.50): {business_scores[50]:.0f}")
    
    # Build a comparison table
    threshold_analysis = pd.DataFrame({
        'Threshold': thresholds,
        'Business Score': business_scores,
        'F2-Score': f2_scores,
        'Recall': recalls,
        'Precision': precisions
    })
    
    print("\n📊 TOP 10 THRESHOLDS (by business score):")
    top_10_business = threshold_analysis.nsmallest(10, 'Business Score')
    print(top_10_business.to_string(index=False))
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: business score vs. threshold
    ax1.plot(thresholds, business_scores, linewidth=2.5, color='red', label='Business score')
    ax1.axvline(optimal_threshold_business, color='darkred', linestyle='--', linewidth=2, 
                label=f'Optimal ({optimal_threshold_business:.2f})')
    ax1.axvline(0.5, color='gray', linestyle=':', linewidth=2, label='Default (0.50)')
    ax1.scatter([optimal_threshold_business], [optimal_business_score], 
               color='darkred', s=200, zorder=5, marker='*', edgecolors='black', linewidth=2)
    ax1.set_xlabel('Threshold', fontsize=11, fontweight='bold')
    ax1.set_ylabel('Business score (lower is better)', fontsize=11, fontweight='bold', color='red')
    ax1.set_title('Business Score vs. Threshold', fontsize=12, fontweight='bold')
    ax1.grid(True, alpha=0.3)
    ax1.legend(fontsize=10, loc='upper right')
    ax1.tick_params(axis='y', labelcolor='red')
    
    # Plot 2: F2-Score, Recall, Precision vs Threshold
    ax2.plot(thresholds, f2_scores, linewidth=2.5, color='green', label='F2-Score')
    ax2.plot(thresholds, recalls, linewidth=2.5, color='blue', label='Recall')
    ax2.plot(thresholds, precisions, linewidth=2.5, color='orange', label='Precision')
    ax2.axvline(optimal_threshold_f2, color='darkgreen', linestyle='--', linewidth=2,
                label=f'Optimal F2 ({optimal_threshold_f2:.2f})')
    ax2.axvline(optimal_threshold_business, color='darkred', linestyle='--', linewidth=2,
                label=f'Optimal business ({optimal_threshold_business:.2f})')
    ax2.set_xlabel('Threshold', fontsize=11, fontweight='bold')
    ax2.set_ylabel('Score', fontsize=11, fontweight='bold')
    ax2.set_title('F2-Score, Recall and Precision vs. Threshold', fontsize=12, fontweight='bold')
    ax2.set_ylim(0, 1)
    ax2.grid(True, alpha=0.3)
    ax2.legend(fontsize=10, loc='best')
    
    plt.tight_layout()
    plt.show()
    
    # Apply the optimal threshold
    y_pred_optimized = (y_pred_proba >= optimal_threshold_business).astype(int)
    
    # Index 50 of the threshold grid corresponds to the default threshold 0.50
    default_business_score = business_scores[50]
    gain = default_business_score - optimal_business_score
    # Guard against a zero default score to avoid a division by zero
    improvement_pct = gain / default_business_score * 100 if default_business_score else 0.0
    
    print(f"\n📈 COMPARISON WITH/WITHOUT OPTIMIZATION (threshold = {optimal_threshold_business:.2f}):")
    print(f"  ├─ Business score (default 0.50): {default_business_score:.0f}")
    print(f"  ├─ Business score (optimized {optimal_threshold_business:.2f}): {optimal_business_score:.0f}")
    print(f"  ├─ Gain: {gain:.0f} points")
    print(f"  └─ Improvement: {improvement_pct:.1f}%")
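
The sweep in this cell relies on `score_metier`, which is defined elsewhere in the notebook. As a rough illustration of the idea, the sketch below uses a hypothetical cost function in which a false negative costs ten times as much as a false positive. The 10:1 weights, the function names, and the toy data are all assumptions, not the notebook's actual definition.

```python
def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Hypothetical cost: weighted count of false negatives and false positives."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_cost * fn + fp_cost * fp


def best_threshold(y_true, y_proba, n_steps=101):
    """Return (lowest cost, threshold) over an evenly spaced grid on [0, 1]."""
    grid = [i / (n_steps - 1) for i in range(n_steps)]
    scored = []
    for t in grid:
        y_pred = [int(p >= t) for p in y_proba]
        scored.append((business_cost(y_true, y_pred), t))
    return min(scored)  # ties are broken in favour of the lowest threshold


# Toy labels and predicted default probabilities (illustrative only)
y_true = [0, 0, 1, 1, 0, 1]
y_proba = [0.10, 0.40, 0.35, 0.80, 0.60, 0.90]

default_cost = business_cost(y_true, [int(p >= 0.5) for p in y_proba])
cost, threshold = best_threshold(y_true, y_proba)
print(f"default cost: {default_cost}, best cost: {cost} at threshold {threshold:.2f}")
```

Because false negatives dominate the cost, the best threshold falls well below 0.50. In practice the threshold is usually chosen on a validation split or by cross-validation, so that the test-set score remains an unbiased estimate.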
    
   

In [None]:
# ============================================================================
# 🔍 LIME - LOCAL INTERPRETABLE MODEL-AGNOSTIC EXPLANATIONS (COMPATIBLE)
# ============================================================================

print("\n" + "="*100)
print("🔍 LIME - LOCAL EXPLANATION OF PREDICTIONS")
print("="*100)

if 'results_lgbm' in globals() and 'class_weight' in results_lgbm:
    result = results_lgbm['class_weight']
    best_model = result['model']
    
    # Prepare the data for LIME
    if hasattr(best_model, 'named_steps'):
        # Pipeline: use the cleaned data
        X_explain = X_test_clean.values
        feature_names = X_test_clean.columns.tolist()
    else:
        X_explain = X_test.values
        feature_names = X_test.columns.tolist()
    
    # Initialize the LIME explainer
    explainer = lime_tabular.LimeTabularExplainer(
        training_data=X_explain,
        feature_names=feature_names,
        class_names=['Non-defaulter (0)', 'Defaulter (1)'],
        mode='classification',
        random_state=42,
        verbose=False
    )
    
    print(f"\n✅ LIME explainer initialized with {len(feature_names)} features")
    
    # Predict on the explained data to pick instances of interest
    y_pred_proba_best = best_model.predict_proba(X_explain)[:, 1]
    y_pred_best = (y_pred_proba_best >= 0.5).astype(int)
    
    # Locate the misclassified instances
    fn_indices = np.where((y_pred_best == 0) & (y_test.values == 1))[0]
    fp_indices = np.where((y_pred_best == 1) & (y_test.values == 0))[0]
    
    # Explain the first false negative and the first false positive, when available
    cases = {
        'False Negative (costly error)': fn_indices[0] if len(fn_indices) > 0 else None,
        'False Positive': fp_indices[0] if len(fp_indices) > 0 else None,
    }
    
    # Build the explanation for each case
    for case_name, idx in cases.items():
        if idx is not None:
            print(f"\n{'─'*100}")
            print(f"📍 CASE: {case_name}")
            print(f"{'─'*100}")
            
            try:
                # Generate the LIME explanation
                exp = explainer.explain_instance(
                    data_row=X_explain[idx],
                    predict_fn=best_model.predict_proba,
                    num_features=10,
                    top_labels=2
                )
                
                # Display the prediction and the ground truth
                print(f"\n📊 Prediction: class {y_pred_best[idx]} (probability: {y_pred_proba_best[idx]:.4f})")
                print(f"📊 Ground truth: class {y_test.values[idx]}")
                
                print("\n🎯 TOP 10 INFLUENTIAL FEATURES:")
                # Feature contributions toward label 1 (defaulter)
                exp_list = exp.as_list(label=1)
                for i, (feature, weight) in enumerate(exp_list, 1):
                    arrow = "➕" if weight > 0 else "➖"
                    print(f"  {i:2d}. {arrow} {feature:<75s} (weight: {weight:+.4f})")
                
                # Plot the LIME weights
                fig, ax = plt.subplots(figsize=(12, 6))
                
                # Map each LIME condition ("a < feat <= b", "feat > a", ...) back to its feature name;
                # longest names are tried first so that a short name cannot shadow a longer one
                names_by_length = sorted(feature_names, key=len, reverse=True)
                features_lime = [next((name for name in names_by_length if name in x[0]), x[0]) for x in exp_list]
                weights_lime = [x[1] for x in exp_list]
                colors_lime = ['green' if w > 0 else 'red' for w in weights_lime]
                
                # Horizontal bar chart of the weights
                y_pos = np.arange(len(features_lime))
                ax.barh(y_pos, weights_lime, color=colors_lime, alpha=0.7, edgecolor='black')
                ax.set_yticks(y_pos)
                ax.set_yticklabels(features_lime, fontsize=10)
                ax.set_xlabel('Feature weight (impact on the prediction)', fontsize=11, fontweight='bold')
                ax.set_title(f'LIME Explanation - {case_name}\n(Prediction: {y_pred_best[idx]}, Ground truth: {y_test.values[idx]})', 
                           fontsize=12, fontweight='bold')
                ax.axvline(x=0, color='black', linewidth=2)
                ax.grid(axis='x', alpha=0.3)
                
                plt.tight_layout()
                plt.show()
                
                print(f"\n✅ LIME explanation generated for {case_name}")
                
            except Exception as e:
                print(f"⚠️ Error while generating the LIME explanation: {str(e)}")
                continue
    
    print("\n" + "="*100)
    print("📌 LIME SUMMARY:")
    print("="*100)
    print(f"├─ Instances available for analysis: {len(X_explain)}")
    print(f"├─ False negatives found: {len(fn_indices)} (costly errors)")
    print(f"└─ False positives found: {len(fp_indices)} (errors to reduce)")
    
    
    print("\n✅ LIME analysis complete!")

else:
    print("⚠️ LightGBM class_weight results not found.")
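
LIME reports each contribution as a condition string rather than a bare feature name. With discretized numeric features these usually look like `feat <= b`, `a < feat <= b` or `feat > a`, and categorical features appear as `feat=value`. The helper below is a minimal sketch for recovering the feature name from such strings. The function name `lime_feature_name` and the example conditions are hypothetical, and other string formats would need their own handling.

```python
import re

# Comparison operators that can appear in a LIME condition string
_OPERATORS = re.compile(r"\s*(?:<=|>=|<|>|=)\s*")


def lime_feature_name(condition: str) -> str:
    """Extract the feature name from a LIME condition string."""
    parts = _OPERATORS.split(condition)
    if len(parts) == 1:
        return condition.strip()   # no operator: already a bare name
    if len(parts) == 3:
        return parts[1].strip()    # "a < feat <= b": the name sits in the middle
    return parts[0].strip()        # "feat <= b", "feat > a", "feat=value"


for cond in ["0.25 < EXT_SOURCE_2 <= 0.50", "AGE > 45.00", "CODE_GENDER=1"]:
    print(f"{cond!r:35s} -> {lime_feature_name(cond)}")
```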


In [None]:
# ============================================================================
# 📊 SHAP - SHAPLEY ADDITIVE EXPLANATIONS
# ============================================================================

print("\n" + "="*100)
print("📊 SHAP - SHAPLEY ADDITIVE EXPLANATIONS")
print("="*100)

if 'results_lgbm' in globals() and 'class_weight' in results_lgbm:
    result = results_lgbm['class_weight']
    best_model = result['model']
    
    # Prepare the data for SHAP
    if hasattr(best_model, 'named_steps'):
        # Pipeline: use the cleaned data
        X_explain = X_test_clean.iloc[:200]  # Limit to 200 rows for performance
        feature_names = X_test_clean.columns.tolist()
        classifier = best_model.named_steps.get('classifier')
    else:
        X_explain = X_test.iloc[:200]
        feature_names = X_test.columns.tolist()
        classifier = best_model
    
    print("\n✅ SHAP data preparation:")
    print(f"  ├─ Number of instances: {len(X_explain)}")
    print(f"  ├─ Number of features: {len(feature_names)}")
    print(f"  └─ Model type: {type(classifier).__name__}")
    
    try:
        # Create the SHAP explainer
        print("\n🔄 Initializing the SHAP explainer (TreeExplainer)...")
        explainer = shap.TreeExplainer(classifier)
        
        # Compute the SHAP values
        print("⏳ Computing SHAP values (this may take a few seconds)...")
        shap_values = explainer.shap_values(X_explain)
        
        print("✅ SHAP values computed!")
        
        # Select the SHAP values of class 1; the return type depends on the shap version
        if isinstance(shap_values, list):
            print("\n📊 SHAP VALUES - Class 1 (defaulter):")
            shap_vals = shap_values[1]
        elif getattr(shap_values, 'ndim', 2) == 3:
            # Some versions return an array of shape (n_samples, n_features, n_classes)
            print("\n📊 SHAP VALUES - Class 1 (defaulter):")
            shap_vals = shap_values[:, :, 1]
        else:
            shap_vals = shap_values
            print("\n📊 SHAP VALUES:")
        
        # 1. SUMMARY PLOT (Bar)
        print(f"\n{'─'*100}")
        print("📈 PLOT 1: MEAN ABSOLUTE SHAP VALUES (Feature Importance)")
        print(f"{'─'*100}")
        
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_vals, X_explain, feature_names=feature_names, 
                         plot_type="bar", show=False)
        plt.title("SHAP Summary Plot - Feature Importance (Mean |SHAP|)", fontsize=12, fontweight='bold')
        plt.xlabel("Mean |SHAP value|", fontsize=11, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("✅ Summary plot generated!")
        
        # 2. SUMMARY PLOT (Beeswarm)
        print(f"\n{'─'*100}")
        print("📊 PLOT 2: SHAP BEESWARM (Feature Values vs SHAP Impact)")
        print(f"{'─'*100}")
        
        plt.figure(figsize=(12, 8))
        # plot_type="dot" draws the beeswarm; "violin" would draw a violin summary instead
        shap.summary_plot(shap_vals, X_explain, feature_names=feature_names, 
                         plot_type="dot", show=False)
        plt.title("SHAP Beeswarm Plot - Feature Impact", fontsize=12, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("✅ Beeswarm plot generated!")
        
        # 3. PER-INSTANCE PLOTS (horizontal bar charts standing in for force plots)
        print(f"\n{'─'*100}")
        print("⚡ PLOT 3: FORCE PLOTS (per-instance explanations)")
        print(f"{'─'*100}")
        
        # Select 3 instances of interest
        y_pred_proba = best_model.predict_proba(X_explain)[:, 1]
        
        # Instance with the highest predicted probability of default
        high_risk_idx = np.argmax(y_pred_proba)
        # Instance with the lowest predicted probability of default
        low_risk_idx = np.argmin(y_pred_proba)
        # Instance whose predicted probability is closest to 0.5
        medium_risk_idx = np.argmin(np.abs(y_pred_proba - 0.5))
        
        instances = {
            'High Risk (Prob=%.2f)' % y_pred_proba[high_risk_idx]: high_risk_idx,
            'Medium Risk (Prob=%.2f)' % y_pred_proba[medium_risk_idx]: medium_risk_idx,
            'Low Risk (Prob=%.2f)' % y_pred_proba[low_risk_idx]: low_risk_idx,
        }
        
        for instance_name, idx in instances.items():
            print(f"\n✨ Instance: {instance_name}")
            
            # Bar chart of the top SHAP values for this instance
            fig, ax = plt.subplots(figsize=(14, 4))
            
            # Indices of the 10 features with the largest |SHAP| for this instance
            instance_shap = shap_vals[idx]
            top_features_idx = np.argsort(np.abs(instance_shap))[-10:]
            
            # Matching names, SHAP values and raw feature values
            top_features = [feature_names[i] for i in top_features_idx]
            top_shap_values = instance_shap[top_features_idx]
            # Positional indexing on both axes (row idx, selected columns)
            top_feature_values = X_explain.iloc[idx, top_features_idx].values
            
            colors = ['green' if v > 0 else 'red' for v in top_shap_values]
            
            y_pos = np.arange(len(top_features))
            ax.barh(y_pos, top_shap_values, color=colors, alpha=0.7, edgecolor='black')
            ax.set_yticks(y_pos)
            ax.set_yticklabels([f"{feat}\n(val: {val:.2f})" 
                                for feat, val in zip(top_features, top_feature_values)], fontsize=9)
            ax.set_xlabel('SHAP Value', fontsize=11, fontweight='bold')
            ax.set_title(f'Top 10 SHAP Values - {instance_name}', fontsize=12, fontweight='bold')
            ax.axvline(x=0, color='black', linewidth=2)
            ax.grid(axis='x', alpha=0.3)
            
            plt.tight_layout()
            plt.show()
        
        # 4. STATISTICS
        print("\n" + "="*100)
        print("📈 SHAP STATISTICS:")
        print("="*100)
        
        # Top 10 features by mean |SHAP|
        mean_shap = np.abs(shap_vals).mean(axis=0)
        top_10_idx = np.argsort(mean_shap)[-10:][::-1]
        
        print("\n🔝 TOP 10 FEATURES (Mean |SHAP|):")
        for rank, idx in enumerate(top_10_idx, 1):
            print(f"  {rank:2d}. {feature_names[idx]:<50s} (Mean |SHAP|: {mean_shap[idx]:.4f})")
        
        print("\n✅ SHAP analysis complete!")
        
    except Exception as e:
        print(f"⚠️ Error during the SHAP analysis: {str(e)}")
        import traceback
        traceback.print_exc()

else:
    print("⚠️ LightGBM class_weight results not found.")
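
SHAP values are additive: for one instance, the explainer's base value plus the sum of the per-feature contributions reproduces the model output. Assuming the `TreeExplainer` default output for a binary LightGBM classifier, that output is a log-odds margin rather than a probability, so a sigmoid is needed to get back to the predicted probability. The sketch below illustrates this with made-up numbers; `margin_to_probability`, the base value and the contributions are hypothetical.

```python
import math


def margin_to_probability(base_value: float, shap_row: list) -> float:
    """Sum the base value and the contributions (log-odds), then apply the sigmoid."""
    margin = base_value + sum(shap_row)
    return 1.0 / (1.0 + math.exp(-margin))


base_value = -2.0             # hypothetical expected log-odds over the background data
shap_row = [0.8, -0.3, 1.1]   # hypothetical contributions of three features

probability = margin_to_probability(base_value, shap_row)
print(f"margin = {base_value + sum(shap_row):+.2f}, probability = {probability:.4f}")
```

A positive SHAP value therefore pushes the prediction toward default (class 1), which is worth keeping in mind when reading the colours of the per-instance bar charts.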


In [None]:
# ============================================================================
# 🏆 EXTRACTING THE MODEL'S ACTUAL TOP 30 FEATURE IMPORTANCES
# ============================================================================

print("\n" + "="*100)
print("🏆 EXTRACTING THE MODEL'S ACTUAL TOP 30 FEATURE IMPORTANCES")
print("="*100)

# 1️⃣ Check whether the LightGBM model exists
print("\n1️⃣  Looking for the trained LightGBM model...")

# Look for the trained LightGBM model in the notebook namespace
lgb_model = None
if 'clf' in locals():
    if hasattr(clf, 'feature_importances_'):
        lgb_model = clf
        print("   ✅ LightGBM model found: clf")
    elif hasattr(clf, 'named_steps'):
        if 'classifier' in clf.named_steps and hasattr(clf.named_steps['classifier'], 'feature_importances_'):
            lgb_model = clf.named_steps['classifier']
            print("   ✅ LightGBM model found in Pipeline: clf.named_steps['classifier']")

if lgb_model is None:
    print("   ⚠️  LightGBM model not found. Falling back to the predefined features.")
    # Fallback to the predefined features (checked manually)
    top_30_features = [
        'CREDIT_DURATION', 'EXT_SOURCE_2', 'INSTAL_DAYS_PAST_DUE_MEAN',
        'PAYMENT_RATE', 'POS_CNT_INSTLEMENT_FUTURE_MEAN', 'CREDIT_GOODS_PERC',
        'AGE', 'POS_NB_CREDIT', 'BURO_CREDIT_ACTIVE_Active_SUM', 'BURO_AMT_CREDIT_SUM_DEBT_MEAN',
        'YEARS_EMPLOYED', 'YEARS_ID_PUBLISH', 'INSTAL_PAYMENT_DIFF_MEAN', 'BURO_AMT_CREDIT_SUM_MEAN',
        'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'BURO_YEARS_CREDIT_ENDDATE_MEAN', 'AMT_CREDIT',
        'YEARS_LAST_PHONE_CHANGE', 'POS_MONTHS_BALANCE_MEAN', 'INSTAL_DAYS_BEFORE_DUE_MEAN',
        'BURO_AMT_CREDIT_SUM_DEBT_SUM', 'CODE_GENDER', 'PREV_YEARS_DECISION_MEAN',
        'REGION_POPULATION_RELATIVE', 'DEBT_RATIO', 'BURO_AMT_CREDIT_SUM_SUM',
        'BURO_YEARS_CREDIT_ENDDATE_MAX', 'PREV_PAYMENT_RATE_MEAN', 'FEATURE_30'
    ]
else:
    # 2️⃣ Extract the feature importances from the actual LightGBM model
    print("\n2️⃣  Extracting feature importances from the LightGBM model...")
    
    importances = lgb_model.feature_importances_
    feature_names = X_train_clean.columns.tolist()
    
    # Build a DataFrame of the importances
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False).reset_index(drop=True)
    
    # Select the top 30 features
    top_30_features = importance_df.head(30)['feature'].tolist()
    
    print("\n   🎯 TOP 30 FEATURE IMPORTANCES (from the actual model):")
    print(f"   {'-'*80}")
    for i, (idx, row) in enumerate(importance_df.head(30).iterrows(), 1):
        print(f"   {i:2d}. {row['feature']:40s} │ importance: {row['importance']:10.6f}")
    print(f"   {'-'*80}")

# 3️⃣ Select the 30 features
print("\n3️⃣  Selecting the 30 features...")

available_cols = [col for col in top_30_features if col in X_train_clean.columns]
missing_cols = [col for col in top_30_features if col not in X_train_clean.columns]

print(f"   ✅ {len(available_cols)} features available")
if missing_cols:
    print(f"   ⚠️  {len(missing_cols)} features missing: {missing_cols}")

# Check that all 30 features are available
if len(available_cols) < 30:
    print(f"   ⚠️  WARNING: only {len(available_cols)} features available (30 requested)")

# Select the data
X_train_feat = X_train_clean[available_cols].copy()
X_test_feat = X_test_clean[available_cols].copy()

print("\n   📈 Shapes before encoding:")
print(f"      Train: {X_train_feat.shape}")
print(f"      Test: {X_test_feat.shape}")

# 4️⃣ Encode the categorical variables
print("\n4️⃣  Encoding categorical variables...")

categorical_cols = X_train_feat.select_dtypes(include=['object']).columns.tolist()
# include='number' also counts float32/int32 columns, not only float64/int64
numeric_cols = X_train_feat.select_dtypes(include='number').columns.tolist()

print(f"   Categorical: {len(categorical_cols)}")
if categorical_cols:
    print(f"      {categorical_cols}")
print(f"   Numeric: {len(numeric_cols)}")

# One-hot encoding
if len(categorical_cols) > 0:
    print("\n   🔄 Applying one-hot encoding...")
    X_train_encoded = pd.get_dummies(X_train_feat, columns=categorical_cols, drop_first=True)
    X_test_encoded = pd.get_dummies(X_test_feat, columns=categorical_cols, drop_first=True)
    
    # Align the columns between train and test (categories absent from one set are filled with 0)
    all_cols = set(X_train_encoded.columns) | set(X_test_encoded.columns)
    for col in all_cols:
        if col not in X_train_encoded.columns:
            X_train_encoded[col] = 0
        if col not in X_test_encoded.columns:
            X_test_encoded[col] = 0
    
    X_train_encoded = X_train_encoded[sorted(all_cols)]
    X_test_encoded = X_test_encoded[sorted(all_cols)]
    
    print("   ✅ Encoding complete!")
    print(f"      Train shape: {X_train_encoded.shape}")
    print(f"      Test shape: {X_test_encoded.shape}")
    print(f"      Total columns: {len(X_train_encoded.columns)}")
else:
    print("   ✅ No categorical variables")
    X_train_encoded = X_train_feat.copy()
    X_test_encoded = X_test_feat.copy()

# 5️⃣ Export the data
print("\n5️⃣  Exporting the encoded data...")

X_test_export = X_test_encoded.copy()

output_path_parquet = "notebooks/data_top30_features_encoded.parquet"
output_path_csv = "notebooks/data_top30_features_encoded.csv"

# Make sure the output directory exists before writing
os.makedirs(os.path.dirname(output_path_parquet), exist_ok=True)

X_test_export.to_parquet(output_path_parquet, index=False)
X_test_export.to_csv(output_path_csv, index=False)

print(f"   ✅ {output_path_parquet}")
print(f"   ✅ {output_path_csv}")
print("\n   📊 Final file:")
print(f"      Shape: {X_test_export.shape}")
print(f"      Columns: {list(X_test_export.columns[:10])}...")
print(f"      Memory: {X_test_export.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n✅ Extraction and encoding of the 30 features complete!")
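
The selection step in this cell boils down to ranking features by importance and keeping the first k. A dependency-free sketch of that ranking, with the cumulative share of importance covered by the selection, is given below. The helper `top_k_features` and the importance values are toy assumptions; only the feature names are borrowed from the fallback list.

```python
def top_k_features(importances: dict, k: int) -> list:
    """Return (name, importance, cumulative share) for the k most important features."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(importances.values())
    selected, running = [], 0.0
    for name, value in ranked[:k]:
        running += value
        share = running / total if total else 0.0
        selected.append((name, value, share))
    return selected


# Toy importances (illustrative values only)
toy_importances = {"AGE": 20, "EXT_SOURCE_2": 40, "AMT_CREDIT": 10, "PAYMENT_RATE": 30}

top_2 = top_k_features(toy_importances, k=2)
for name, value, share in top_2:
    print(f"{name:15s} importance={value:3d} cumulative={share:.0%}")
```

Looking at the cumulative share is one way to judge whether 30 features is a reasonable cut-off or whether fewer would capture nearly the same total importance.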

In [None]:

# ============================================================================
# 📊 DATA DRIFT ANALYSIS - EVIDENTLY (encoded feat_lgb30)
# ============================================================================

print("\n" + "="*100)
print("📊 DATA DRIFT ANALYSIS - EVIDENTLY (encoded feat_lgb30)")
print("="*100)

try:
    from evidently.report import Report
    from evidently.metrics import DataDriftTable, ColumnDriftMetric
    from evidently.metric_preset import DataDriftPreset
    import warnings
    warnings.filterwarnings('ignore')
    
    print("\n✅ Evidently imported successfully!")
    
except ImportError as e:
    print(f"\n❌ Evidently import error: {e}")
    print("Installing...")
    import subprocess
    # Note: the install is unpinned; the import paths below follow the older Evidently
    # layout and may not exist in a more recent release
    subprocess.check_call([sys.executable, "-m", "pip", "install", "evidently", "-q"])
    from evidently.report import Report
    from evidently.metrics import DataDriftTable, ColumnDriftMetric
    from evidently.metric_preset import DataDriftPreset
    print("✅ Evidently installed!")

# Prepare the data (numeric columns only for Evidently)
print("\n🔄 Preparing the data (numeric columns only)...")

# Work on copies of the encoded data
X_train_for_drift = X_train_encoded.copy()
X_test_for_drift = X_test_encoded.copy()

# Convert every column to float64
for col in X_train_for_drift.columns:
    try:
        X_train_for_drift[col] = X_train_for_drift[col].astype('float64')
        X_test_for_drift[col] = X_test_for_drift[col].astype('float64')
    except (ValueError, TypeError):
        # If the conversion fails, drop the column from both sets
        X_train_for_drift = X_train_for_drift.drop(col, axis=1)
        X_test_for_drift = X_test_for_drift.drop(col, axis=1)

# Keep only the numeric columns
numeric_cols_drift = X_train_for_drift.select_dtypes(include=['float64', 'int64']).columns.tolist()

X_ref = X_train_for_drift[numeric_cols_drift].copy()
X_prod = X_test_for_drift[numeric_cols_drift].copy()

print(f"  ├─ Reference (Train): {X_ref.shape[0]} rows, {X_ref.shape[1]} columns (numeric)")
print(f"  ├─ Production (Test): {X_prod.shape[0]} rows, {X_prod.shape[1]} columns (numeric)")
print(f"  └─ Columns processed: {X_ref.shape[1]}")

# Build DataFrames WITHOUT a target column (Evidently does not handle it well here)
reference_data = X_ref.copy()
production_data = X_prod.copy()

print("\n✅ Data prepared for Evidently (no target column)!")

# Generate the Evidently report
print("\n🔄 Generating the Evidently report...")

try:
    # Minimal approach: a single DatasetDriftMetric instead of DataDriftPreset
    from evidently.metrics import DatasetDriftMetric
    
    report = Report(metrics=[
        DatasetDriftMetric()
    ])
    
    # Run the report
    report.run(reference_data=reference_data, 
               current_data=production_data)
    
    print("✅ Evidently report generated successfully!")
    
    # Save the HTML report
    html_path = 'data_drift_analysis_feat30_evidently.html'
    report.save_html(html_path)
    
    print(f"\n✅ Evidently HTML report saved: {html_path}")
    
except Exception as e:
    print(f"\n⚠️ Evidently report failed: {type(e).__name__}: {e}")
    print("   → Continuing with the custom HTML table...")

# Generate a custom HTML drift-analysis table
print(f"\n{'─'*100}")
print("📋 GENERATING THE CUSTOM HTML TABLE")
print(f"{'─'*100}")

drift_analysis = []

for col in numeric_cols_drift:
    if col != 'target':
        ref_mean = X_ref[col].mean()
        ref_std = X_ref[col].std()
        prod_mean = X_prod[col].mean()
        prod_std = X_prod[col].std()
        
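        # Relative difference between the means, in percent (1e-10 avoids dividing by zero)
        mean_diff_pct = abs(prod_mean - ref_mean) / (abs(ref_mean) + 1e-10) * 100
        
        # Flag drift when the mean shifts by more than 5% (a simple heuristic, not a statistical test)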
        is_drift = mean_diff_pct > 5
        
        drift_analysis.append({
            'Feature': col,
            'Train_Mean': ref_mean,
            'Train_Std': ref_std,
            'Test_Mean': prod_mean,
            'Test_Std': prod_std,
            'Mean_Diff_%': mean_diff_pct,
            'Drift': 'OUI' if is_drift else 'NON'
        })

df_drift = pd.DataFrame(drift_analysis)
df_drift = df_drift.sort_values('Mean_Diff_%', ascending=False)

# Compute the drift statistics
total_features = len(df_drift)
drifted_features = (df_drift['Drift'] == 'OUI').sum()
share_drifted = (drifted_features / total_features * 100) if total_features > 0 else 0

# Build a custom, professional-looking HTML page
html_content = """
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Data Drift Analysis - feat_lgb30</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            padding: 30px 20px;
            min-height: 100vh;
        }
        .container {
            max-width: 1600px;
            margin: 0 auto;
            background-color: white;
            border-radius: 12px;
            box-shadow: 0 10px 40px rgba(0,0,0,0.2);
            overflow: hidden;
        }
        .header {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            padding: 40px;
            text-align: center;
        }
        .header h1 {
            font-size: 32px;
            margin-bottom: 10px;
        }
        .header p {
            font-size: 16px;
            opacity: 0.9;
        }
        .content {
            padding: 40px;
        }
        .summary-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
            gap: 20px;
            margin-bottom: 40px;
        }
        .summary-card {
            background: white;
            border-radius: 8px;
            padding: 25px;
            border-left: 5px solid;
            box-shadow: 0 2px 8px rgba(0,0,0,0.1);
            transition: transform 0.3s ease;
        }
        .summary-card:hover {
            transform: translateY(-5px);
        }
        .summary-card.status-ok {
            border-left-color: #27ae60;
            background: #f0fdf4;
        }
        .summary-card.status-warning {
            border-left-color: #f39c12;
            background: #fffbf0;
        }
        .summary-card.status-danger {
            border-left-color: #e74c3c;
            background: #fef5f5;
        }
        .summary-card h3 {
            font-size: 14px;
            color: #7f8c8d;
            text-transform: uppercase;
            letter-spacing: 1px;
            margin-bottom: 10px;
            font-weight: 600;
        }
        .summary-card .value {
            font-size: 36px;
            font-weight: bold;
            margin-bottom: 5px;
        }
        .summary-card.status-ok .value {
            color: #27ae60;
        }
        .summary-card.status-warning .value {
            color: #f39c12;
        }
        .summary-card.status-danger .value {
            color: #e74c3c;
        }
        .summary-card .subtitle {
            font-size: 13px;
            color: #95a5a6;
        }
        .drift-summary {
            background: #f8f9fa;
            border-radius: 8px;
            padding: 25px;
            margin-bottom: 30px;
            border-left: 5px solid #3498db;
        }
        .drift-summary h2 {
            color: #2c3e50;
            margin-bottom: 15px;
            font-size: 20px;
        }
        .drift-summary p {
            color: #34495e;
            line-height: 1.8;
            font-size: 15px;
        }
        .drift-summary .drift-status {
            display: inline-block;
            padding: 8px 16px;
            border-radius: 4px;
            font-weight: bold;
            margin-top: 10px;
        }
        .drift-summary .drift-status.detected {
            background-color: #ffebee;
            color: #c0392b;
        }
        .drift-summary .drift-status.not-detected {
            background-color: #e8f5e9;
            color: #27ae60;
        }
        .table-section h2 {
            color: #2c3e50;
            margin-bottom: 20px;
            font-size: 20px;
            padding-bottom: 10px;
            border-bottom: 2px solid #ecf0f1;
        }
        table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 0;
        }
        thead {
            background-color: #34495e;
            color: white;
        }
        th {
            padding: 16px 12px;
            text-align: left;
            font-weight: 600;
            font-size: 13px;
            text-transform: uppercase;
            letter-spacing: 0.5px;
            border: 1px solid #2c3e50;
        }
        td {
            padding: 12px 12px;
            border-bottom: 1px solid #ecf0f1;
            font-size: 14px;
        }
        tbody tr {
            transition: background-color 0.3s ease;
        }
        tbody tr:hover {
            background-color: #f8f9fa;
        }
        tbody tr:nth-child(even) {
            background-color: #ffffff;
        }
        .drift-yes {
            background-color: #ffcccb;
            color: #c0392b;
            font-weight: bold;
            padding: 4px 8px;
            border-radius: 4px;
            text-align: center;
            display: inline-block;
        }
        .drift-no {
            background-color: #c8e6c9;
            color: #27ae60;
            font-weight: bold;
            padding: 4px 8px;
            border-radius: 4px;
            text-align: center;
            display: inline-block;
        }
        .high-drift {
            background-color: #ffebee !important;
        }
        .medium-drift {
            background-color: #fff3e0 !important;
        }
        .numeric {
            text-align: right;
            font-family: 'Courier New', monospace;
            color: #34495e;
        }
        .percent {
            color: #e74c3c;
            font-weight: bold;
        }
        .percent.ok {
            color: #27ae60;
        }
        .footer {
            background-color: #f8f9fa;
            padding: 20px 40px;
            border-top: 1px solid #ecf0f1;
            text-align: center;
            color: #7f8c8d;
            font-size: 12px;
        }
        .footer p {
            margin: 5px 0;
        }
        .legend {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 20px;
            margin-top: 30px;
            padding-top: 20px;
            border-top: 1px solid #ecf0f1;
        }
        .legend-item {
            font-size: 13px;
            color: #34495e;
        }
        .legend-item strong {
            display: block;
            margin-bottom: 5px;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="header">
            <h1>📊 Data Drift Analysis Report</h1>
            <p>Analysis of feat_lgb30 Model Features - Production vs Reference Dataset</p>
        </div>
        
        <div class="content">
            <!-- DRIFT SUMMARY SECTION -->
            <div class="drift-summary">
                <h2>Dataset Drift Summary</h2>
                <p><strong>Drift Detection Status:</strong></p>
                <p>Drift is """ + ("DETECTED" if drifted_features > 0 else "NOT DETECTED") + """ for """ + f"{share_drifted:.2f}%" + """ of columns (""" + f"{drifted_features}" + """ out of """ + f"{total_features}" + """).</p>
                <div class="drift-status """ + ("detected" if drifted_features > 0 else "not-detected") + """">
                    """ + ("⚠️ DRIFT DETECTED" if drifted_features > 0 else "✅ NO SIGNIFICANT DRIFT") + """
                </div>
            </div>
            
            <!-- KEY METRICS CARDS -->
            <div class="summary-grid">
                <div class="summary-card """ + ("status-danger" if drifted_features > 5 else ("status-warning" if drifted_features > 0 else "status-ok")) + """">
                    <h3>🎯 Total Features Analyzed</h3>
                    <div class="value">""" + str(total_features) + """</div>
                    <div class="subtitle">feat_lgb30 + encoded categorical</div>
                </div>
                
                <div class="summary-card """ + ("status-danger" if drifted_features > 5 else ("status-warning" if drifted_features > 0 else "status-ok")) + """">
                    <h3>🔴 Drifted Columns</h3>
                    <div class="value">""" + str(drifted_features) + """</div>
                    <div class="subtitle">Threshold: > 5% mean difference</div>
                </div>
                
                <div class="summary-card status-ok">
                    <h3>📦 Reference Dataset</h3>
                    <div class="value">""" + f"{X_ref.shape[0]:,}" + """</div>
                    <div class="subtitle">Training samples</div>
                </div>
                
                <div class="summary-card status-ok">
                    <h3>📋 Current Dataset</h3>
                    <div class="value">""" + f"{X_prod.shape[0]:,}" + """</div>
                    <div class="subtitle">Test/Production samples</div>
                </div>
                
                <div class="summary-card """ + ("status-danger" if share_drifted > 10 else ("status-warning" if share_drifted > 5 else "status-ok")) + """">
                    <h3>📊 Share of Drifted Columns</h3>
                    <div class="value">""" + f"{share_drifted:.2f}" + """%</div>
                    <div class="subtitle">Percentage of affected columns</div>
                </div>
            </div>
            
            <!-- DETAILED TABLE -->
            <div class="table-section">
                <h2>Detailed Column Drift Analysis</h2>
                <table>
                    <thead>
                        <tr>
                            <th>Column</th>
                            <th style="text-align: center;">Type</th>
                            <th>Reference Distribution</th>
                            <th>Current Distribution</th>
                            <th class="numeric">Mean Difference %</th>
                            <th style="text-align: center;">Stat Test</th>
                            <th class="numeric">Drift Score</th>
                            <th style="text-align: center;">Data Drift</th>
                        </tr>
                    </thead>
                    <tbody>
"""

# Helper that renders SVG histograms from the actual data
def create_histogram_svg(data, color="#3498db", width=140, height=70):
    """Return an inline SVG histogram (5 equal-width bins) for a pandas Series."""
    try:
        # Drop missing values; an empty or all-NaN series gets a placeholder
        data_clean = data.dropna()
        if len(data_clean) == 0:
            return f'<svg width="{width}" height="{height}"><text x="10" y="30" font-size="10" fill="#e74c3c">No data</text></svg>'
        
        min_val, max_val = data_clean.min(), data_clean.max()
        
        # Count values in 5 equal-width bins
        bins = 5
        bin_edges = np.linspace(min_val, max_val, bins + 1)
        bin_counts, _ = np.histogram(data_clean, bins=bin_edges)
        bin_max = max(bin_counts) if len(bin_counts) > 0 else 1
        
        # Open the SVG element
        bar_width = width / (bins + 1)
        svg = f'<svg width="{width}" height="{height}" style="border: 1px solid #ecf0f1; border-radius: 4px; background: #f9f9f9;">'
        
        # Draw one bar per bin, scaled to the tallest bin
        for i, count in enumerate(bin_counts):
            bar_height = (count / bin_max) * (height - 20) if bin_max > 0 else 1
            x = i * bar_width + 2
            y = height - bar_height - 15
            
            svg += f'<rect x="{x}" y="{y}" width="{bar_width - 3}" height="{bar_height}" fill="{color}" opacity="0.8" stroke="{color}" stroke-width="1" />'
        
        # Label the histogram with the number of non-missing values
        svg += f'<text x="5" y="65" font-size="9" fill="#34495e"><tspan font-weight="bold" font-size="8">n={len(data_clean)}</tspan></text>'
        
        svg += '</svg>'
        return svg
        
    except Exception:
        # Any failure (e.g. non-numeric data) yields an error placeholder
        return f'<svg width="{width}" height="{height}"><text x="10" y="30" font-size="10" fill="#e74c3c">Error</text></svg>'

# Add the table rows
for idx, row in df_drift.iterrows():
    row_class = "high-drift" if row['Mean_Diff_%'] > 10 else ("medium-drift" if row['Mean_Diff_%'] > 5 else "")
    drift_class = "drift-yes" if row['Drift'] == 'OUI' else "drift-no"
    drift_text = "YES" if row['Drift'] == 'OUI' else "NO"
    
    # Drift score in [0, 1]: percentage difference divided by 100, capped at 1
    drift_score = min(row['Mean_Diff_%'] / 100, 1.0)
    
    # Build SVG histograms of both distributions from the actual data
    feature_name = row['Feature']
    
    # Check that the feature exists in both datasets
    if feature_name in X_train_encoded.columns and feature_name in X_test_encoded.columns:
        ref_data = X_train_encoded[feature_name]
        curr_data = X_test_encoded[feature_name]
        
        ref_graph = create_histogram_svg(ref_data, color="#3498db")
        curr_graph = create_histogram_svg(curr_data, color="#e74c3c")
    else:
        ref_graph = '<svg width="140" height="70"><text x="10" y="30" font-size="10" fill="#e74c3c">N/A</text></svg>'
        curr_graph = '<svg width="140" height="70"><text x="10" y="30" font-size="10" fill="#e74c3c">N/A</text></svg>'
    
    html_content += f"""
                        <tr class="{row_class}">
                            <td><strong>{row['Feature']}</strong></td>
                            <td style="text-align: center;"><span style="background-color: #e3f2fd; padding: 4px 8px; border-radius: 4px; font-size: 12px; font-weight: bold;">Numeric</span></td>
                            <td style="text-align: center; padding: 4px; width: 150px;">{ref_graph}</td>
                            <td style="text-align: center; padding: 4px; width: 150px;">{curr_graph}</td>
                            <td class="numeric"><span class="percent {'ok' if row['Mean_Diff_%'] <= 5 else ''}">{row['Mean_Diff_%']:>6.2f}%</span></td>
                            <td style="text-align: center;"><span style="background-color: #f3e5f5; padding: 4px 8px; border-radius: 4px; font-size: 12px;">Mean diff %</span></td>
                            <td class="numeric"><strong style="color: {'#e74c3c' if drift_score > 0.1 else '#27ae60'};">{drift_score:.4f}</strong></td>
                            <td style="text-align: center;"><span class="{drift_class}">{drift_text}</span></td>
                        </tr>
"""

html_content += """
                    </tbody>
                </table>
            </div>
            
            <!-- LEGEND -->
            <div class="legend">
                <div class="legend-item">
                    <strong>📌 Drift Threshold:</strong>
                    Mean difference > 5% is considered significant drift
                </div>
                <div class="legend-item">
                    <strong>📍 High Alert:</strong>
                    Mean difference > 10% (highlighted rows)
                </div>
                <div class="legend-item">
                    <strong>🔍 Detection Method:</strong>
                    Percentage difference from reference distribution
                </div>
            </div>
        </div>
        
        <div class="footer">
            <p><strong>Data Drift Analysis Report - feat_lgb30 Model</strong></p>
            <p>Generated: """ + str(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")) + """</p>
            <p>Reference: Training Dataset | Current: Test/Production Dataset</p>
        </div>
    </div>
</body>
</html>
"""

# Save the custom HTML report
html_file_custom = 'data_drift_analysis_feat30_table.html'
with open(html_file_custom, 'w', encoding='utf-8') as f:
    f.write(html_content)

print(f"\n✅ Custom HTML table created: {html_file_custom}")

# Print summary statistics
print(f"\n{'─'*100}")
print("📈 DATA DRIFT SUMMARY")
print(f"{'─'*100}")

print("\n✅ Analysis complete!")
print(f"  ├─ Total features: {len(df_drift)}")
print(f"  ├─ Features with drift (> 5%): {(df_drift['Drift'] == 'OUI').sum()}")
# Look up the row with the largest difference explicitly instead of assuming df_drift is sorted
max_drift_row = df_drift.sort_values('Mean_Diff_%', ascending=False).iloc[0]
print(f"  ├─ Max drift: {max_drift_row['Mean_Diff_%']:.2f}% - {max_drift_row['Feature']}")
print("  └─ Generated files:")
print("     ├─ data_drift_analysis_feat30.html (Evidently)")
print("     └─ data_drift_analysis_feat30_table.html (custom)")

# List the features flagged as drifted
print("\n🔴 FEATURES WITH SIGNIFICANT DRIFT (> 5%):")
drift_features = df_drift[df_drift['Drift'] == 'OUI'].sort_values('Mean_Diff_%', ascending=False)
if len(drift_features) > 0:
    for idx, row in drift_features.iterrows():
        print(f"  • {row['Feature']:<40s} - Diff: {row['Mean_Diff_%']:>6.2f}%")
else:
    print("  ✅ No significant drift detected!")

print("\n✅ DATA DRIFT ANALYSIS HTML TABLES GENERATED SUCCESSFULLY!")


In [None]:
# ============================================================================
# 🎯 CELL 76: EXTRACT THE LIGHT DATAFRAME (20 FEATURES + SK_ID_CURR AS INDEX)
# ============================================================================

print("\n" + "="*100)
print("🎯 EXTRACTING THE LIGHT DATAFRAME (TOP 20 FEATURES + SK_ID_CURR AS INDEX)")
print("="*100)

# ═══════════════════════════════════════════════════════════════════════════
# STEP 1: UNDERSTAND THE INDEX STRUCTURE
# ═══════════════════════════════════════════════════════════════════════════

print("\n📌 STEP 1: Inspecting the structure of the original dataframe...")

print("\n   📊 Structure of dataframe df:")
print(f"  ├─ Index name: {df.index.name}")
print(f"  ├─ Index type: {type(df.index).__name__}")
print(f"  ├─ Index values: {df.index[:5].tolist()} ...")
print(f"  ├─ Columns (total {len(df.columns)}): {df.columns[:5].tolist()} ...")
print(f"  ├─ Shape: {df.shape}")
print("  └─ Preview:")
print(f"\n{df.head(3).to_string()}\n")

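# Report the index role from the data rather than asserting it unconditionally
if df.index.name == 'SK_ID_CURR':
    print("   ✅ SK_ID_CURR is the dataframe INDEX (not a column)")
    print("   💡 The index is shown on the LEFT, before all columns")
else:
    print(f"   ⚠️  The index is named {df.index.name!r}, not SK_ID_CURR")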

# ═══════════════════════════════════════════════════════════════════════════
# STEP 2: EXTRACT THE TOP 20 FEATURES OF THE LIGHTGBM MODEL
# ═══════════════════════════════════════════════════════════════════════════

print("\n📊 STEP 2: Extracting the TOP 20 FEATURES of the LightGBM model...")

top_20_features = []

# Option 1: From the LightGBM model held in memory
if 'best_model' in globals() and best_model is not None:
    try:
        # Handle pipelines: the estimator is expected under the 'classifier' step
        if hasattr(best_model, 'named_steps'):
            classifier = best_model.named_steps.get('classifier')
        else:
            classifier = best_model
        
        # Extract feature importances
        if hasattr(classifier, 'feature_importances_'):
            importances = classifier.feature_importances_
            
            # Retrieve the feature names
            if 'X_test_clean' in globals():
                feature_names = X_test_clean.columns.tolist()
            else:
                feature_names = list(range(len(importances)))
            
            # Build the importance dataframe
            imp_df = pd.DataFrame({
                'Feature': feature_names,
                'Importance': importances
            }).sort_values('Importance', ascending=False).reset_index(drop=True)
            
            top_20_features = imp_df.head(20)['Feature'].tolist()
            
            print(f"✅ {len(top_20_features)} top features extracted from the LightGBM model")
            print("\n   🔝 TOP 20 FEATURES (ranked by importance):")
            for i, (feat, imp) in enumerate(zip(imp_df.head(20)['Feature'], 
                                                imp_df.head(20)['Importance']), 1):
                # str() keeps the format spec valid when names fall back to integers
                print(f"   {i:2d}. {str(feat):45s} (Importance: {imp:.6f})")
        else:
            print("⚠️  Model has no feature_importances_ attribute")
            
    except Exception as e:
        print(f"❌ Feature extraction error: {str(e)}")

# Option 2: Fallback - use the predefined features from cell 68
if len(top_20_features) == 0:
    print("\n⚠️  Fallback: using the first 20 of the 30 features from cell 68...")
    
    if 'available_cols' in globals() and len(available_cols) > 0:
        top_20_features = list(available_cols[:20])
        print(f"✅ {len(top_20_features)} features selected")
    else:
        print("⚠️  Variable 'available_cols' not found")
        # Last-resort fallback: take the first numeric columns
        numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
        top_20_features = numeric_cols[:20]
        print(f"✅ {len(top_20_features)} numeric features selected")

print(f"\n✅ Top features selected: {len(top_20_features)}")

# ═══════════════════════════════════════════════════════════════════════════
# STEP 3: BUILD THE LIGHT DATAFRAME (WITH SK_ID_CURR AS INDEX)
# ═══════════════════════════════════════════════════════════════════════════

print("\n🎯 STEP 3: Building the light dataframe...")

try:
    # Check that every selected feature exists in df
    missing_features = [col for col in top_20_features if col not in df.columns]
    
    if missing_features:
        print(f"⚠️  Missing features: {missing_features}")
        top_20_features = [col for col in top_20_features if col in df.columns]
        print(f"✅ Using the {len(top_20_features)} available features")
    
    # Build the light dataframe; column selection keeps the SK_ID_CURR index
    df_light = df[top_20_features].copy()
    
    print("\n✅ Light dataframe created successfully!")
    print(f"  ├─ Shape: {df_light.shape}")
    print(f"  ├─ Columns (features): {df_light.shape[1]}")
    print(f"  ├─ Rows: {df_light.shape[0]:,}")
    print(f"  ├─ Index: SK_ID_CURR ({len(df_light.index)} values)")
    print(f"  ├─ Memory: {df_light.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Memory reduction relative to the full dataframe
    df_original_size = df.memory_usage(deep=True).sum() / 1024**2
    df_light_size = df_light.memory_usage(deep=True).sum() / 1024**2
    reduction_pct = (1 - (df_light_size / df_original_size)) * 100 if df_original_size > 0 else 0
    
    print(f"  └─ Memory reduction: {reduction_pct:.1f}% (saved: {df_original_size - df_light_size:.2f} MB)")
    
    # List the columns
    print(f"\n📌 Light dataframe columns ({len(df_light.columns)} features):")
    for i, col in enumerate(df_light.columns, 1):
        dtype = df_light[col].dtype
        null_count = df_light[col].isna().sum()
        print(f"  {i:2d}. {col:45s} (dtype: {str(dtype):10s}, NaN: {null_count:,})")
    
    # Data quality checks
    print("\n✅ Data quality checks:")
    nan_total = df_light.isna().sum().sum()
    print(f"  ├─ Total NaN: {nan_total}")
    print("  ├─ SK_ID_CURR index kept: ✅ Yes")
    print(f"  ├─ Index type: {type(df_light.index).__name__}")
    print(f"  ├─ First SK_ID_CURR values: {df_light.index[:5].tolist()}")
    print(f"  └─ Complete data (no NaN): {nan_total == 0}")
    
    # Preview including the index
    print("\n📊 Light dataframe preview (with SK_ID_CURR as index):")
    print(f"\n{df_light.head(10).to_string()}\n")
    
except KeyError as e:
    print(f"❌ Error: missing column - {str(e)}")
except Exception as e:
    print(f"❌ Error while creating the light dataframe: {str(e)}")
    import traceback
    traceback.print_exc()

# ═══════════════════════════════════════════════════════════════════════════
# STEP 4: SAVE THE LIGHT DATAFRAME
# ═══════════════════════════════════════════════════════════════════════════

print("\n💾 STEP 4: Saving the light dataframe...")

try:
    if 'df_light' in globals() and len(df_light) > 0:
        # Save as CSV (SK_ID_CURR index written as the first column)
        csv_path = 'data_light_features.csv'
        df_light.to_csv(csv_path, index=True)
        print(f"✅ CSV: {csv_path}")
        print("   ├─ SK_ID_CURR index: included in the CSV")
        print("   └─ Format: SK_ID_CURR, feature_1, feature_2, ...")
        
        # Save as Parquet (SK_ID_CURR index preserved)
        parquet_path = 'data_light_features.parquet'
        df_light.to_parquet(parquet_path, index=True, compression='gzip')
        print(f"✅ PARQUET: {parquet_path}")
        print("   ├─ SK_ID_CURR index: preserved")
        print("   └─ Compression: GZIP")
        
        # Save an Excel version for previewing
        excel_path = 'data_light_features.xlsx'
        # For Excel, reset the index so SK_ID_CURR appears as a regular column
        df_light_excel = df_light.reset_index()
        df_light_excel.to_excel(excel_path, index=False, sheet_name='Data Light')
        print(f"✅ EXCEL: {excel_path}")
        print("   ├─ First column: SK_ID_CURR (index reset)")
        # Report the actual row count: the full dataframe is written, not a 1000-row sample
        print(f"   └─ Rows exported: {len(df_light_excel):,}")
        
        print("\n📦 Saved files:")
        print(f"  ├─ {csv_path} (CSV with index)")
        print(f"  ├─ {parquet_path} (PARQUET with index)")
        print(f"  └─ {excel_path} (XLSX for previewing)")
    else:
        print("⚠️  Light dataframe empty or not found")
        
except Exception as e:
    print(f"⚠️  Error while saving: {str(e)}")

# ═══════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY
# ═══════════════════════════════════════════════════════════════════════════

print("\n" + "="*100)
print("✨ SUMMARY - LIGHT DATAFRAME")
print("="*100)

if 'df_light' in globals() and len(df_light) > 0:
    print("\n✅ Light dataframe ready for production!")
    print("\n  📐 STRUCTURE:")
    print(f"  ├─ INDEX: SK_ID_CURR ({len(df_light.index):,} values)")
    print(f"  ├─ COLUMNS: {df_light.shape[1]} top features according to LightGBM")
    print(f"  ├─ SHAPE: {df_light.shape[0]:,} rows × {df_light.shape[1]} columns")
    print(f"  └─ MEMORY: {df_light_size:.2f} MB (reduction: {reduction_pct:.1f}%)")
    
    print("\n  📁 AVAILABLE FORMATS:")
    print("  ├─ CSV: data_light_features.csv (index included)")
    print("  ├─ PARQUET: data_light_features.parquet (index preserved)")
    print("  └─ EXCEL: data_light_features.xlsx (for previewing)")
    
    print("\n  🎯 USAGE:")
    print("  ├─ Streamlit: load with pd.read_parquet() or pd.read_csv(index_col=0)")
    print("  ├─ API: fast client lookup by SK_ID_CURR")
    print(f"  └─ Model: lighter inference input with {df_light.shape[1]} features instead of {df.shape[1]}")
    
    print("\n  💡 TIP:")
    print("  To load the light dataframe in Streamlit or the API:")
    print("  >> df = pd.read_parquet('data_light_features.parquet')")
    print("  >> # SK_ID_CURR is restored as the index")
    print("  >> df.loc[100002]  # Direct access by SK_ID_CURR")
else:
    print("\n⚠️  Light dataframe not created. Check the previous steps.")

print("="*100 + "\n")
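
The usage tip printed above suggests reloading the light dataframe and looking clients up by `SK_ID_CURR`. The sketch below illustrates that pattern on a toy frame, round-tripping through an in-memory CSV so that it runs without any file on disk. The names `df_light_demo` and `get_client_features`, the column names, and all values are made up for this example and are not taken from the real dataset.

```python
import io

import pandas as pd

# Toy stand-in for df_light: made-up values, SK_ID_CURR as the index
df_light_demo = pd.DataFrame(
    {"feature_1": [0.5, 0.25], "feature_2": [10.0, 20.0]},
    index=pd.Index([100002, 100003], name="SK_ID_CURR"),
)

# Round-trip through CSV in memory; index=True writes SK_ID_CURR as the first column
buffer = io.StringIO()
df_light_demo.to_csv(buffer, index=True)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="SK_ID_CURR")


def get_client_features(frame, client_id):
    """Return the feature row for one client, or None if the id is unknown."""
    if client_id not in frame.index:
        return None
    return frame.loc[client_id]


known = get_client_features(reloaded, 100002)
unknown = get_client_features(reloaded, 999999)
print(known)
print(unknown)
```

Returning `None` for an unknown id is one possible convention. An API layer might instead prefer to raise an error and map it to a "not found" response.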