# Comparative Modeling - Multiple Strategies

## Objective
Implement and compare 7 modeling strategies to maximize prediction accuracy on customer segments.

### Models tested
1. **LightGBM Simple** - Baseline
2. **LightGBM Lookback** - Uses the history of IDs already seen in the train set
3. **LightGBM Optimized** - Optuna tuning
4. **LightGBM Feature Selection** - Regularized + 9 best features according to the feature-importance plot
5. **Logistic Regression** - Linear model (rows with NaN dropped)
6. **SVM RBF** - Non-linear model (rows with NaN dropped)
7. **Ensemble Voting** - Weighted soft-voting combination

## 1. Imports and Configuration


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

from lightgbm import LGBMClassifier
import optuna

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import os
os.makedirs('../outputs_final', exist_ok=True)

## 2. Loading the Data


In [2]:
train_df = pd.read_csv('../Data/Train.csv')
test_df = pd.read_csv('../Data/Test.csv')

print(f'Train: {train_df.shape}')
print(f'Test: {test_df.shape}')
print(f'IDs communs: {len(set(train_df["ID"]).intersection(set(test_df["ID"])))}')

Train: (8068, 11)
Test: (2627, 10)
IDs communs: 2332


## 3. Utility Functions

In [3]:
# Feature preparation for LightGBM: label-encode the categorical columns and drop ID

def prepare_features_lgbm(df, label_encoders=None, fit=True):
    df_prep = df.copy()
    categorical_cols = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1']
    
    if fit:
        label_encoders = {}
        for col in categorical_cols:
            le = LabelEncoder()
            df_prep[col] = le.fit_transform(df_prep[col].astype(str))
            label_encoders[col] = le
    else:
        for col in categorical_cols:
            df_prep[col] = label_encoders[col].transform(df_prep[col].astype(str))
    
    if 'ID' in df_prep.columns:
        df_prep = df_prep.drop('ID', axis=1)
    return df_prep, label_encoders

# Feature preparation for the SVM and logistic regression models: drop NaN rows, one-hot encode, scale numeric columns

def prepare_features_classical(df, scaler=None, fit=True):
    df_prep = df.copy()
    df_prep = df_prep.dropna()
    
    if 'ID' in df_prep.columns:
        df_prep = df_prep.drop('ID', axis=1)
    
    categorical_cols = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1']
    df_prep = pd.get_dummies(df_prep, columns=categorical_cols, drop_first=True)
    
    numeric_cols = ['Age', 'Work_Experience', 'Family_Size']
    numeric_cols = [col for col in numeric_cols if col in df_prep.columns]
    
    if fit:
        scaler = StandardScaler()
        df_prep[numeric_cols] = scaler.fit_transform(df_prep[numeric_cols])
    else:
        df_prep[numeric_cols] = scaler.transform(df_prep[numeric_cols])
    
    return df_prep, scaler

# Plot the confusion matrix for each model

def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    labels = sorted(np.unique(y_true))
    
    fig = go.Figure(data=go.Heatmap(
        z=cm, x=labels, y=labels, colorscale='Blues',
        text=cm, texttemplate='%{text}', textfont={"size": 16}
    ))
    
    fig.update_layout(title=title, xaxis_title='Prédiction', yaxis_title='Réalité', height=500, width=600)
    fig.show()
    return cm

# Plot the feature importances to see which features matter most for each model
def plot_feature_importance(model, feature_names, title):
    importance = model.feature_importances_
    feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importance}).sort_values('Importance', ascending=False).head(15)
    fig = px.bar(feat_imp_df, x='Importance', y='Feature', orientation='h', title=title, height=500)
    fig.update_layout(yaxis={'categoryorder':'total ascending'})
    fig.show()

def evaluate_model(model, X_train, y_train, X_val, y_val, model_name):
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)
    
    train_acc = accuracy_score(y_train, y_train_pred)
    val_acc = accuracy_score(y_val, y_val_pred)
    
    print(f'\n{"="*60}')
    print(f'{model_name}')
    print(f'{"="*60}')
    print(f'Train Accuracy: {train_acc:.4f}')
    print(f'Validation Accuracy: {val_acc:.4f}')
    print(f'\nClassification Report (Validation):')
    print(classification_report(y_val, y_val_pred))
    
    plot_confusion_matrix(y_val, y_val_pred, f'{model_name} - Matrice de Confusion')
    
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f'\nCross-Validation (5-fold):')
    print(f'Mean CV: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})')
    
    return {
        'model_name': model_name,
        'train_acc': train_acc,
        'val_acc': val_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }


## 4. Train/Validation Data Preparation

In [4]:

# Split the data into a training set and a validation set.
# stratify=y makes the split stratified:
# if class 'A' makes up 20% of the full dataset, for example, it will also make up ~20% of each split.


X = train_df.drop('Segmentation', axis=1)
y = train_df['Segmentation']

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y 
)

print(f'Train set: {X_train.shape}')
print(f'Validation set: {X_val.shape}')
print(f'\nDistribution target train:\n{y_train.value_counts(normalize=True).sort_index()}')
print(f'\nDistribution target validation:\n{y_val.value_counts(normalize=True).sort_index()}')


Train set: (6454, 10)
Validation set: (1614, 10)

Distribution target train:
Segmentation
A    0.244500
B    0.230245
C    0.244190
D    0.281066
Name: proportion, dtype: float64

Distribution target validation:
Segmentation
A    0.244114
B    0.230483
C    0.244114
D    0.281289
Name: proportion, dtype: float64


---
# MODEL 1: LightGBM Simple (Baseline)
Performance reference with default hyperparameters

In [5]:
X_train_lgbm, label_encoders_lgbm = prepare_features_lgbm(X_train, fit=True)
X_val_lgbm, _ = prepare_features_lgbm(X_val, label_encoders_lgbm, fit=False)

le_target = LabelEncoder()
y_train_encoded = le_target.fit_transform(y_train)
y_val_encoded = le_target.transform(y_val)

print(f'Features après preprocessing: {X_train_lgbm.shape[1]}')

Features après preprocessing: 9


In [6]:
model_lgbm_simple = LGBMClassifier(random_state=42, verbose=-1)
model_lgbm_simple.fit(X_train_lgbm, y_train_encoded)

results_lgbm_simple = evaluate_model(model_lgbm_simple, X_train_lgbm, y_train_encoded, X_val_lgbm, y_val_encoded, 'MODÈLE 1: LightGBM Simple')

plot_feature_importance(model_lgbm_simple, X_train_lgbm.columns, 'MODÈLE 1: Importance des Features')


MODÈLE 1: LightGBM Simple
Train Accuracy: 0.7165
Validation Accuracy: 0.5347

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.44      0.43      0.43       394
           1       0.42      0.35      0.38       372
           2       0.58      0.56      0.57       394
           3       0.64      0.76      0.69       454

    accuracy                           0.53      1614
   macro avg       0.52      0.52      0.52      1614
weighted avg       0.52      0.53      0.53      1614




Cross-Validation (5-fold):
Mean CV: 0.5143 (+/- 0.0059)


In [7]:
test_lgbm_simple, _ = prepare_features_lgbm(test_df, label_encoders_lgbm, fit=False)
test_pred_simple = model_lgbm_simple.predict(test_lgbm_simple)
test_pred_simple_labels = le_target.inverse_transform(test_pred_simple)

submission_simple = pd.DataFrame({'ID': test_df['ID'], 'Segmentation': test_pred_simple_labels})
submission_simple.to_csv('../outputs_final/submission_lgbm_simple.csv', index=False)

Validation accuracy is about 53% (51% on average in 5-fold cross-validation), against more than 70% on the train set. This beats the 25% chance level for 4 classes, but the gap shows overfitting; we will apply a regularization strategy by tuning the LightGBM hyperparameters in model 3.

After uploading the submission CSV to Kaggle, we get 32.8% accuracy, which is far below the validation accuracy and only about 8 points above chance. The following models test whether this can be improved.

One of the main sources of error is strong confusion between classes B and C (see the confusion matrix). The feature-importance analysis shows that age is the most discriminative feature of the model, yet the age distributions of classes B and C largely overlap, as observed during the exploratory analysis.
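To make the "most confused classes" reading of a confusion matrix reproducible, the largest off-diagonal cells can be listed directly. The sketch below is illustrative only: `top_confusions` is a hypothetical helper, and the matrix is made-up data, not the matrix produced by model 1.

```python
import numpy as np

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = np.asarray(cm).copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flattened indices of the cells, sorted by decreasing count
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(labels[r], labels[c], int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Hypothetical 4-class confusion matrix (rows = true class, columns = predicted class)
example_cm = [
    [50, 20, 10, 5],
    [18, 40, 30, 2],
    [6, 25, 55, 4],
    [9, 3, 2, 80],
]
pairs = top_confusions(example_cm, ['A', 'B', 'C', 'D'], k=2)
print(pairs)
```

With the real data, `cm` would be the array returned by `plot_confusion_matrix` and `labels` the classes of `le_target`.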

---
# MODEL 2: LightGBM with Lookback
Uses the history of the 88.8% of test IDs that are already present in the train set

In [8]:
common_ids = set(train_df['ID']).intersection(set(test_df['ID']))
print(f'IDs communs: {len(common_ids)} ({len(common_ids)/len(test_df)*100:.1f}%)')

id_to_segment = train_df.set_index('ID')['Segmentation'].to_dict()

def add_lookback_features(df, id_to_segment):
    df_with_lookback = df.copy()
    df_with_lookback['Has_History'] = df_with_lookback['ID'].isin(id_to_segment.keys()).astype(int)
    
    for seg in ['A', 'B', 'C', 'D']:
        df_with_lookback[f'Hist_Seg_{seg}'] = df_with_lookback['ID'].apply(lambda x: 1 if id_to_segment.get(x) == seg else 0)
    
    return df_with_lookback

X_train_lookback = add_lookback_features(X_train, id_to_segment)
X_val_lookback = add_lookback_features(X_val, id_to_segment)

IDs communs: 2332 (88.8%)


In [9]:
X_train_lookback_prep, label_encoders_lookback = prepare_features_lgbm(X_train_lookback, fit=True)
X_val_lookback_prep, _ = prepare_features_lgbm(X_val_lookback, label_encoders_lookback, fit=False)

model_lgbm_lookback = LGBMClassifier(random_state=42, verbose=-1)
model_lgbm_lookback.fit(X_train_lookback_prep, y_train_encoded)

results_lgbm_lookback = evaluate_model(model_lgbm_lookback, X_train_lookback_prep, y_train_encoded, X_val_lookback_prep, y_val_encoded, 'MODÈLE 2: LightGBM Lookback')

plot_feature_importance(model_lgbm_lookback, X_train_lookback_prep.columns, 'MODÈLE 2: Importance des Features')


MODÈLE 2: LightGBM Lookback
Train Accuracy: 1.0000
Validation Accuracy: 1.0000

Classification Report (Validation):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       394
           1       1.00      1.00      1.00       372
           2       1.00      1.00      1.00       394
           3       1.00      1.00      1.00       454

    accuracy                           1.00      1614
   macro avg       1.00      1.00      1.00      1614
weighted avg       1.00      1.00      1.00      1614




Cross-Validation (5-fold):
Mean CV: 1.0000 (+/- 0.0000)


Train and validation accuracy are not meaningful in this lookback setting: `id_to_segment` is built from the full train set, so for every train and validation row the `Hist_Seg_*` columns encode that row's own label, which explains the 100% scores. Only the score on the test set counts. There, accuracy reaches only 31%, which shows that the assumption of a constant segment for a given ID does not hold. This is surprising, because the exploratory analysis showed that only age and work experience varied between the two datasets. Either these variables weigh very heavily on the prediction, or the data contain inconsistencies or errors, a possibility raised several times during data exploration.
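The 100% train and validation scores can be reproduced on a toy example. The sketch below rebuilds the `Hist_Seg_*` columns the same way as `add_lookback_features`, on a made-up miniature train set (the `toy_train` frame and its values are assumptions), and shows that the label can be read straight back from the lookback columns.

```python
import pandas as pd

# Hypothetical miniature train set; the real one is loaded from ../Data/Train.csv
toy_train = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Age': [25, 40, 33, 58],
    'Segmentation': ['A', 'B', 'C', 'D'],
})
id_to_segment_toy = toy_train.set_index('ID')['Segmentation'].to_dict()

# Same construction as add_lookback_features, applied to rows taken from the train set
hist = pd.DataFrame({
    f'Hist_Seg_{seg}': (toy_train['ID'].map(id_to_segment_toy) == seg).astype(int)
    for seg in ['A', 'B', 'C', 'D']
})

# Decoding the one-hot columns recovers the target exactly
decoded = hist.idxmax(axis=1).str.replace('Hist_Seg_', '', regex=False)
leak_rate = (decoded == toy_train['Segmentation']).mean()
print(f'Share of rows whose label is readable from the lookback features: {leak_rate:.0%}')
```

On the test set the same columns only carry the segment recorded in the train set for that ID, which is why the test score is the only informative one.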

In [10]:
test_lookback = add_lookback_features(test_df, id_to_segment)
test_lookback_prep, _ = prepare_features_lgbm(test_lookback, label_encoders_lookback, fit=False)
test_pred_lookback = model_lgbm_lookback.predict(test_lookback_prep)
test_pred_lookback_labels = le_target.inverse_transform(test_pred_lookback)

submission_lookback = pd.DataFrame({'ID': test_df['ID'], 'Segmentation': test_pred_lookback_labels})
submission_lookback.to_csv('../outputs_final/submission_lgbm_lookback.csv', index=False)
print('Submission MODÈLE 2 sauvegardée')


Submission MODÈLE 2 sauvegardée


---
# MODEL 3: LightGBM Optimized (Optuna)
Hyperparameter tuning with Optuna

In [11]:
def objective_optuna(trial):
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 20, 50),
        'max_depth': trial.suggest_int('max_depth', 5, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'min_child_samples': trial.suggest_int('min_child_samples', 20, 100),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        'random_state': 42,
        'verbose': -1
    }
    
    model = LGBMClassifier(**params)
    cv_scores = cross_val_score(model, X_train_lgbm, y_train_encoded, cv=5, scoring='accuracy')
    return cv_scores.mean()

print('Démarrage optimisation Optuna (100 trials)...')

Démarrage optimisation Optuna (100 trials)...


In [12]:
study = optuna.create_study(direction='maximize')
study.optimize(objective_optuna, n_trials=100, show_progress_bar=True)

print(f'\nMeilleur score CV: {study.best_value:.4f}')
print(f'Meilleurs hyperparamètres:')
for key, value in study.best_params.items():
    print(f'  {key}: {value}')

[I 2025-11-03 14:17:43,924] A new study created in memory with name: no-name-ae1a3f5e-af9d-4dc7-b5a5-4ba76f3fa5f0


[I 2025-11-03 14:17:55,131] Trial 0 finished with value: 0.5313011366706897 and parameters: {'num_leaves': 37, 'max_depth': 5, 'learning_rate': 0.024977176078620614, 'n_estimators': 127, 'min_child_samples': 33, 'subsample': 0.7849154735897075, 'colsample_bytree': 0.7370921851767983, 'reg_alpha': 1.8626786916300608, 'reg_lambda': 5.346188291684069}. Best is trial 0 with value: 0.5313011366706897.
[I 2025-11-03 14:18:07,703] Trial 1 finished with value: 0.5316107338221077 and parameters: {'num_leaves': 45, 'max_depth': 7, 'learning_rate': 0.05837616786279376, 'n_estimators': 149, 'min_child_samples': 88, 'subsample': 0.8731105206912708, 'colsample_bytree': 0.7049902850990257, 'reg_alpha': 7.592296594453107, 'reg_lambda': 1.949696028756217}. Best is trial 1 with value: 0.5316107338221077.
[I 2025-11-03 14:18:24,974] Trial 2 finished with value: 0.5336232353983151 and parameters: {'num_leaves': 36, 'max_depth': 14, 'learning_rate': 0.08643659521105458, 'n_estimators': 130, 'min_child_samp

In [13]:
best_params = study.best_params.copy()
best_params['random_state'] = 42
best_params['verbose'] = -1

model_lgbm_optimized = LGBMClassifier(**best_params)
model_lgbm_optimized.fit(X_train_lgbm, y_train_encoded)

results_lgbm_optimized = evaluate_model(model_lgbm_optimized, X_train_lgbm, y_train_encoded, X_val_lgbm, y_val_encoded, 'MODÈLE 3: LightGBM Optimisé')

plot_feature_importance(model_lgbm_optimized, X_train_lgbm.columns, 'MODÈLE 3: Importance des Features')


MODÈLE 3: LightGBM Optimisé
Train Accuracy: 0.6055
Validation Accuracy: 0.5341

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.43      0.42      0.43       394
           1       0.42      0.30      0.35       372
           2       0.56      0.58      0.57       394
           3       0.63      0.78      0.70       454

    accuracy                           0.53      1614
   macro avg       0.51      0.52      0.51      1614
weighted avg       0.52      0.53      0.52      1614




Cross-Validation (5-fold):
Mean CV: 0.5375 (+/- 0.0136)


In [14]:
test_pred_optimized = model_lgbm_optimized.predict(test_lgbm_simple)
test_pred_optimized_labels = le_target.inverse_transform(test_pred_optimized)

submission_optimized = pd.DataFrame({'ID': test_df['ID'], 'Segmentation': test_pred_optimized_labels})
submission_optimized.to_csv('../outputs_final/submission_lgbm_optimized.csv', index=False)
print(' Submission MODÈLE 3 sauvegardée')


 Submission MODÈLE 3 sauvegardée


Although the validation score barely changed (0.5341 versus 0.5347), the added regularization reduced overfitting compared with model 1: train accuracy dropped from 0.7165 to 0.6055. On the test set, accuracy remains around 30%, with no significant improvement.

---
# MODEL 4: LightGBM Feature Selection
Regularization + selection of the 9 best features


In [15]:
# 9 features retained (all input columns except ID)
selected_features_original = [
    'Age',
    'Profession',
    'Family_Size',
    'Var_1',
    'Work_Experience',
    'Spending_Score',
    'Graduated',
    'Gender',
    'Ever_Married'
]


In [16]:
X_train_selected = X_train[selected_features_original].copy()
X_val_selected = X_val[selected_features_original].copy()

X_train_selected_prep, label_encoders_selected = prepare_features_lgbm(X_train_selected, fit=True)
X_val_selected_prep, _ = prepare_features_lgbm(X_val_selected, label_encoders_selected, fit=False)


In [17]:
model_lgbm_selection = LGBMClassifier(
    n_estimators=200,
    num_leaves=15,
    max_depth=5,
    learning_rate=0.01,
    min_child_samples=100,
    min_child_weight=0.001,
    reg_alpha=0.1,
    reg_lambda=0.1,
    subsample=0.7,
    colsample_bytree=0.7,
    subsample_freq=1,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

model_lgbm_selection.fit(X_train_selected_prep, y_train_encoded)

results_lgbm_selection = evaluate_model(
    model_lgbm_selection, 
    X_train_selected_prep, 
    y_train_encoded, 
    X_val_selected_prep, 
    y_val_encoded, 
    'MODÈLE 4: LightGBM Feature Selection'
)

plot_feature_importance(model_lgbm_selection, X_train_selected_prep.columns, 'MODÈLE 4: Importance des Features')



MODÈLE 4: LightGBM Feature Selection
Train Accuracy: 0.5598
Validation Accuracy: 0.5458

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.45      0.45      0.45       394
           1       0.47      0.30      0.36       372
           2       0.56      0.61      0.58       394
           3       0.63      0.78      0.70       454

    accuracy                           0.55      1614
   macro avg       0.53      0.53      0.52      1614
weighted avg       0.53      0.55      0.53      1614




Cross-Validation (5-fold):
Mean CV: 0.5302 (+/- 0.0165)


In [18]:
test_selected = test_df[selected_features_original].copy()
test_selected_prep, _ = prepare_features_lgbm(test_selected, label_encoders_selected, fit=False)

test_pred_selection = model_lgbm_selection.predict(test_selected_prep)
test_pred_selection_labels = le_target.inverse_transform(test_pred_selection)

submission_selection = pd.DataFrame({'ID': test_df['ID'], 'Segmentation': test_pred_selection_labels})
submission_selection.to_csv('../outputs_final/submission_lgbm_selection.csv', index=False)
print('Submission MODÈLE 4 sauvegardée')
print(submission_selection['Segmentation'].value_counts())


Submission MODÈLE 4 sauvegardée
Segmentation
D    899
A    685
C    668
B    375
Name: count, dtype: int64


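This is the best test-set accuracy so far, at 33%. Note that the 9 retained features are all the columns available once ID is dropped, so the difference from model 1 comes from the stronger regularization rather than from a smaller feature set. We keep this model for now, since overfitting is also limited (a gap of about 1.4 points between train and validation).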

---
# MODEL 5: Logistic Regression
Linear model (rows with NaN dropped)

In [19]:
X_train_classical = pd.concat([X_train, y_train], axis=1).dropna()
X_val_classical = pd.concat([X_val, y_val], axis=1).dropna()

y_train_classical = X_train_classical['Segmentation']
y_val_classical = X_val_classical['Segmentation']

X_train_classical = X_train_classical.drop('Segmentation', axis=1)
X_val_classical = X_val_classical.drop('Segmentation', axis=1)

print(f'Train après suppression NaN: {X_train_classical.shape} (perte: {(1 - len(X_train_classical)/len(X_train))*100:.1f}%)')

Train après suppression NaN: (5351, 10) (perte: 17.1%)


In [20]:
X_train_logreg, scaler_logreg = prepare_features_classical(X_train_classical, fit=True)
X_val_logreg, _ = prepare_features_classical(X_val_classical, scaler=scaler_logreg, fit=False)

y_train_logreg_encoded = le_target.transform(y_train_classical)
y_val_logreg_encoded = le_target.transform(y_val_classical)

model_logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)
model_logreg.fit(X_train_logreg, y_train_logreg_encoded)

results_logreg = evaluate_model(model_logreg, X_train_logreg, y_train_logreg_encoded, X_val_logreg, y_val_logreg_encoded, 'MODÈLE 5: Régression Logistique')


MODÈLE 5: Régression Logistique
Train Accuracy: 0.5184
Validation Accuracy: 0.5274

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.42      0.47      0.44       308
           1       0.44      0.25      0.32       320
           2       0.53      0.63      0.57       343
           3       0.66      0.74      0.70       343

    accuracy                           0.53      1314
   macro avg       0.51      0.52      0.51      1314
weighted avg       0.52      0.53      0.51      1314




Cross-Validation (5-fold):
Mean CV: 0.5087 (+/- 0.0141)


In [21]:
test_classical_clean = test_df.dropna()
test_logreg, _ = prepare_features_classical(test_classical_clean, scaler=scaler_logreg, fit=False)
test_pred_logreg = model_logreg.predict(test_logreg)
test_pred_logreg_labels = le_target.inverse_transform(test_pred_logreg)

submission_logreg = pd.DataFrame({'ID': test_classical_clean['ID'].values, 'Segmentation': test_pred_logreg_labels})

missing_ids = set(test_df['ID']) - set(test_classical_clean['ID'])
submission_logreg_full = submission_simple.copy()
for idx, row in submission_logreg.iterrows():
    submission_logreg_full.loc[submission_logreg_full['ID'] == row['ID'], 'Segmentation'] = row['Segmentation']

submission_logreg_full.to_csv('../outputs_final/submission_logreg.csv', index=False)
print(f'Submission MODÈLE 5 sauvegardée ({len(missing_ids)} IDs complétés par LGBM)')


Submission MODÈLE 5 sauvegardée (473 IDs complétés par LGBM)


Same issues as LightGBM: about 53% accuracy on validation and 30% on test, with confusion between classes A, B and C.
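The submission for this model uses the logistic regression prediction where the row has no missing value and falls back to the LightGBM baseline prediction elsewhere. A minimal sketch of that fallback without a row-by-row loop is given below; `fill_with_fallback` is a hypothetical helper, the toy frames are made-up data, and IDs are assumed to be unique within each submission.

```python
import pandas as pd

def fill_with_fallback(primary, fallback):
    """Use primary predictions where an ID has one, fallback predictions elsewhere."""
    # Assumes each ID appears at most once in `primary`
    primary_map = primary.set_index('ID')['Segmentation']
    merged = fallback.copy()
    from_primary = merged['ID'].map(primary_map)  # NaN where the ID is absent from primary
    merged['Segmentation'] = from_primary.fillna(merged['Segmentation'])
    return merged

# Hypothetical toy submissions
fallback_toy = pd.DataFrame({'ID': [10, 11, 12], 'Segmentation': ['A', 'B', 'C']})
primary_toy = pd.DataFrame({'ID': [10, 12], 'Segmentation': ['D', 'D']})
full_toy = fill_with_fallback(primary_toy, fallback_toy)
print(full_toy)
```

In the notebook, `primary` would be `submission_logreg` and `fallback` would be `submission_simple`.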

---
# MODEL 6: SVM with RBF kernel
Non-linear model (rows with NaN dropped)

In [22]:
model_svm = SVC(kernel='rbf', C=10, gamma='scale', random_state=42, probability=True)

model_svm.fit(X_train_logreg, y_train_logreg_encoded)

results_svm = evaluate_model(model_svm, X_train_logreg, y_train_logreg_encoded, X_val_logreg, y_val_logreg_encoded, 'MODÈLE 6: SVM RBF')


MODÈLE 6: SVM RBF
Train Accuracy: 0.6632
Validation Accuracy: 0.5327

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.45      0.47      0.46       308
           1       0.43      0.31      0.36       320
           2       0.55      0.60      0.57       343
           3       0.65      0.73      0.69       343

    accuracy                           0.53      1314
   macro avg       0.52      0.53      0.52      1314
weighted avg       0.52      0.53      0.52      1314




Cross-Validation (5-fold):
Mean CV: 0.5098 (+/- 0.0190)


In [23]:
test_pred_svm = model_svm.predict(test_logreg)
test_pred_svm_labels = le_target.inverse_transform(test_pred_svm)

submission_svm = pd.DataFrame({'ID': test_classical_clean['ID'].values, 'Segmentation': test_pred_svm_labels})

submission_svm_full = submission_simple.copy()
for idx, row in submission_svm.iterrows():
    submission_svm_full.loc[submission_svm_full['ID'] == row['ID'], 'Segmentation'] = row['Segmentation']

submission_svm_full.to_csv('../outputs_final/submission_svm.csv', index=False)
print(' Submission MODÈLE 6 sauvegardée')


 Submission MODÈLE 6 sauvegardée


---
# MODEL 7: Ensemble Voting
Soft-voting combination of LightGBM, logistic regression and SVM
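As a reminder of what soft voting computes, the sketch below averages class probabilities with the weights used in this section (0.5, 0.25, 0.25) and takes the argmax. The probability vectors are made-up values for a single sample, not outputs of the fitted models.

```python
import numpy as np

# Hypothetical class probabilities for one sample from three models (classes A, B, C, D)
proba_lgbm = np.array([0.10, 0.40, 0.35, 0.15])
proba_logreg = np.array([0.20, 0.25, 0.45, 0.10])
proba_svm = np.array([0.15, 0.30, 0.45, 0.10])

weights = np.array([0.5, 0.25, 0.25])  # LightGBM, logistic regression, SVM
stacked = np.vstack([proba_lgbm, proba_logreg, proba_svm])

# Soft voting: weighted average of the probabilities, then argmax
avg_proba = np.average(stacked, axis=0, weights=weights)
predicted_class = ['A', 'B', 'C', 'D'][int(np.argmax(avg_proba))]
print(avg_proba, predicted_class)
```

In this toy case LightGBM alone would pick B, but the two other models shift the weighted average toward C.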

In [24]:
X_train_clean_lgbm, le_clean = prepare_features_lgbm(X_train_classical, fit=True)
X_val_clean_lgbm, _ = prepare_features_lgbm(X_val_classical, le_clean, fit=False)

model_lgbm_for_ensemble = LGBMClassifier(**best_params)
model_lgbm_for_ensemble.fit(X_train_clean_lgbm, y_train_logreg_encoded)



LGBMClassifier (displayed parameters, output truncated):
  boosting_type: 'gbdt'
  num_leaves: 32
  max_depth: 11
  learning_rate: 0.07036777137167918
  n_estimators: 179
  subsample_for_bin: 200000
  objective: (empty)
  class_weight: (empty)
  min_split_gain: 0.0
  min_child_weight: 0.001


In [25]:
ensemble = VotingClassifier(
    estimators=[('lgbm', model_lgbm_for_ensemble), ('logreg', model_logreg), ('svm', model_svm)],
    voting='soft',
    weights=[0.5, 0.25, 0.25]
)

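# Note: VotingClassifier.fit clones and refits every estimator on X_train_logreg,
# so the LightGBM member is trained on the one-hot encoded, scaled features.
print('Entraînement Ensemble...')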
ensemble.fit(X_train_logreg, y_train_logreg_encoded)

results_ensemble = evaluate_model(ensemble, X_train_logreg, y_train_logreg_encoded, X_val_logreg, y_val_logreg_encoded, 'MODÈLE 7: Ensemble Voting')

Entraînement Ensemble...

MODÈLE 7: Ensemble Voting
Train Accuracy: 0.6094
Validation Accuracy: 0.5571

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.47      0.52      0.49       308
           1       0.49      0.31      0.38       320
           2       0.57      0.62      0.59       343
           3       0.65      0.76      0.70       343

    accuracy                           0.56      1314
   macro avg       0.55      0.55      0.54      1314
weighted avg       0.55      0.56      0.55      1314




Cross-Validation (5-fold):
Mean CV: 0.5350 (+/- 0.0184)


In [26]:
test_pred_ensemble = ensemble.predict(test_logreg)
test_pred_ensemble_labels = le_target.inverse_transform(test_pred_ensemble)

submission_ensemble = pd.DataFrame({'ID': test_classical_clean['ID'].values, 'Segmentation': test_pred_ensemble_labels})

submission_ensemble_full = submission_optimized.copy()
for idx, row in submission_ensemble.iterrows():
    submission_ensemble_full.loc[submission_ensemble_full['ID'] == row['ID'], 'Segmentation'] = row['Segmentation']

submission_ensemble_full.to_csv('../outputs_final/submission_ensemble.csv', index=False)
print('Submission MODÈLE 7 sauvegardée')


Submission MODÈLE 7 sauvegardée


---
# FINAL COMPARISON TABLE

In [27]:
# Model 2 (lookback) is left out: its train/validation scores are not meaningful.
all_results = [results_lgbm_simple, results_lgbm_optimized, results_lgbm_selection, results_logreg, results_svm, results_ensemble]

comparison_df = pd.DataFrame(all_results).round(4)

print('\nComparaison des modèles:')
print(comparison_df.to_string(index=False))

best_idx = comparison_df['val_acc'].idxmax()
best_model = comparison_df.loc[best_idx, 'model_name']
best_acc = comparison_df.loc[best_idx, 'val_acc']

print(f'\nMeilleur modèle: {best_model} (Val Accuracy: {best_acc:.4f})')


Comparaison des modèles:
                          model_name  train_acc  val_acc  cv_mean  cv_std
           MODÈLE 1: LightGBM Simple     0.7165   0.5347   0.5143  0.0059
         MODÈLE 3: LightGBM Optimisé     0.6055   0.5341   0.5375  0.0136
MODÈLE 4: LightGBM Feature Selection     0.5598   0.5458   0.5302  0.0165
     MODÈLE 5: Régression Logistique     0.5184   0.5274   0.5087  0.0141
                   MODÈLE 6: SVM RBF     0.6632   0.5327   0.5098  0.0190
           MODÈLE 7: Ensemble Voting     0.6094   0.5571   0.5350  0.0184

Meilleur modèle: MODÈLE 7: Ensemble Voting (Val Accuracy: 0.5571)


In [28]:
fig = go.Figure()

fig.add_trace(go.Bar(x=comparison_df['model_name'], y=comparison_df['train_acc'], name='Train Accuracy', marker_color='lightblue'))
fig.add_trace(go.Bar(x=comparison_df['model_name'], y=comparison_df['val_acc'], name='Validation Accuracy', marker_color='darkblue'))
fig.add_trace(go.Scatter(x=comparison_df['model_name'], y=comparison_df['cv_mean'], name='CV Mean', mode='markers+lines', marker=dict(size=12, color='red', symbol='diamond')))

fig.update_layout(title='Comparaison des Performances de tous les Modèles', xaxis_title='Modèle', yaxis_title='Accuracy', barmode='group', height=600, xaxis_tickangle=-45)
fig.show()

# Conclusions
 - The models tested (LightGBM, logistic regression and SVM) perform about the same overall.
 - It is nonetheless surprising to see accuracy fall sharply between validation (~50%) and test (~30%), even though 88.8% of the test IDs also appear in the train set. Unfortunately, the true segmentation of the test data is not available, so the source of this performance gap cannot be analyzed more precisely.
 - My hypothesis is that the test-set segmentation contains errors, especially since many data-quality problems were already observed during the initial exploration.
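One way to probe the train/test gap without the test labels is to measure, for the IDs present in both files, how often each feature differs between the two versions of the same customer. The sketch below is a possible starting point, not the analysis done in the exploration notebook: `changed_columns_for_common_ids` is a hypothetical helper, the toy frames are made-up data, and IDs are assumed to be unique in each file.

```python
import pandas as pd

def changed_columns_for_common_ids(train, test, id_col='ID', target='Segmentation'):
    """For IDs present in both frames, return the share of rows where each shared feature differs."""
    features = [c for c in test.columns if c not in (id_col, target) and c in train.columns]
    merged = train.merge(test, on=id_col, suffixes=('_train', '_test'))  # inner join on common IDs
    rates = {}
    for col in features:
        a, b = merged[f'{col}_train'], merged[f'{col}_test']
        # Two missing values count as equal
        differs = ~((a == b) | (a.isna() & b.isna()))
        rates[col] = differs.mean()
    return pd.Series(rates).sort_values(ascending=False)

# Hypothetical toy data
toy_train = pd.DataFrame({'ID': [1, 2, 3], 'Age': [30, 41, 52],
                          'Gender': ['M', 'F', 'F'], 'Segmentation': ['A', 'B', 'C']})
toy_test = pd.DataFrame({'ID': [2, 3, 4], 'Age': [42, 52, 20], 'Gender': ['F', 'F', 'M']})
rates = changed_columns_for_common_ids(toy_train, toy_test)
print(rates)
```

Applied to `train_df` and `test_df`, a high rate on a column would point to features that change between the two snapshots of the same ID.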
