# **Classification Problem: Predicting the Paddy (Rice) Variety**

---



**📋 Problem Definition**

**1. Agricultural Context**

In paddy (rice) cultivation in India (districts such as Cuddalore, Kurinjipadi, etc.), the choice of seed variety (e.g., CO_43, ponmani, delux ponni) is critical for maximizing yield under local conditions:

- Influencing factors: soil type (alluvial, clay); weather conditions (rainfall per period, e.g. 30DRain; min/max temperatures per growth phase; wind; humidity); cultivation practices (Nursery: dry/wet; inputs such as DAP_20days, Urea_40Days, pesticides); agricultural block (Agriblock); area (Hectares); and observed yield (Paddy yield(in Kg)).
- Practical challenge: farmers must select the best variety before planting, based on historical and environmental data, to optimize yield and resistance (e.g., ponmani tolerates wet clay better, while CO_43 suits dry alluvial soil).

**2. ML Problem**

Multi-class classification to predict the paddy variety ('CO_43', 'ponmani', 'delux ponni') from agronomic and weather features.

- Objective: recommend the best-suited variety (reduce risk, with a target yield gain of 10-20% on average).
- Impact: a decision-support tool for farmers (e.g., a mobile app: soil/rainfall input → variety output).

In [None]:
import os
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from scipy.stats import kurtosis, skew
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, f1_score,
    mean_absolute_error, mean_squared_error, precision_score, r2_score,
    recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from xgboost import XGBClassifier, XGBRegressor

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Plot configuration
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_style("whitegrid")
sns.set_palette("husl")

In [None]:
# Load the original file
df = pd.read_csv('data/paddy_dataset_fe.csv', sep=',', encoding='utf-8', low_memory=False)

In [None]:
# Preview the first rows
print("DATA PREVIEW:")
print("-" * 30)
print(df.head())
print()

In [None]:
# Work on a copy so that the original df is left unmodified
df_class = df.copy()

In [None]:
# Split X and y (after One-Hot Encoding with drop_first=True)

# Automatically identify the One-Hot columns created for Variety
variety_onehot_cols = [col for col in df.columns if col.startswith('Variety_')]

print("One-Hot columns detected for the target 'Variety':")
print(variety_onehot_cols)

# Features / target split
X = df.drop(variety_onehot_cols, axis=1)   # every column except the Variety One-Hot columns
y = df[variety_onehot_cols]                # y = the One-Hot columns (multi-column format)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
print(f"\nNumber of features: {X.shape[1]}")
print("\nFeature list:")
print(X.columns.tolist()[:30])  # show the first 30

## 2.3 Split Train/Test

In [None]:
# Split the data

# Convert one-hot (n, 3) → integer labels (n,)
# Caution: argmax assumes one column per class. If Variety itself was encoded
# with drop_first=True, rows of the dropped class are all zeros and argmax
# would silently assign them to class 0.
y_labels = np.argmax(y.values, axis=1)

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_labels,
    test_size=0.2,
    random_state=42,
    stratify=y_labels
)

print("Data split (80% train - 20% test):")
print("="*80)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape : {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape : {y_test.shape}")

# Variety names
variety_names = [col.replace('Variety_', '') for col in y.columns]

print("\nDistribution of the target variable (Variety):")
print("-"*80)
for i, variety in enumerate(variety_names):
    train_count = np.sum(y_train == i)
    test_count  = np.sum(y_test == i)

    print(f"Train - {variety:12} : {train_count:4d} ({train_count/len(y_train)*100:5.2f}%)")
    print(f"Test  - {variety:12} : {test_count:4d} ({test_count/len(y_test)*100:5.2f}%)")
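The `np.argmax` conversion above assumes one indicator column per class. The following is a minimal sketch, not the notebook's method: the helper name `onehot_to_labels` and its `dropped_label` argument are hypothetical. It shows one way to map a frame encoded with `drop_first=True` back to integer labels without merging the dropped class into class 0.

```python
import numpy as np
import pandas as pd


def onehot_to_labels(y_onehot: pd.DataFrame, dropped_label: str = "dropped"):
    """Return (integer labels, class names) from one-hot columns.

    Rows that are all zeros are treated as the class removed by
    drop_first=True and receive their own label (index 0).
    Hypothetical helper, written for illustration only.
    """
    values = y_onehot.to_numpy()
    is_dropped = values.sum(axis=1) == 0
    names = [c.replace("Variety_", "") for c in y_onehot.columns]
    if not is_dropped.any():
        # Full one-hot encoding: plain argmax is enough
        return values.argmax(axis=1), names
    # Shift the real columns by one so that index 0 is the dropped class
    labels = np.where(is_dropped, 0, values.argmax(axis=1) + 1)
    return labels, [dropped_label] + names


# Toy example: three varieties, the first one removed by drop_first=True
toy = pd.Series(["CO_43", "ponmani", "delux ponni", "CO_43"], name="Variety")
encoded = pd.get_dummies(toy, prefix="Variety", drop_first=True)
labels, names = onehot_to_labels(encoded, dropped_label="CO_43")
print(labels, names)
```

If the dataset keeps all three `Variety_` columns, the helper reduces to the plain `argmax` used in the notebook.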

## 2.5 Data Normalization

In [None]:
# Standardize the features (fit on the training set only, to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Standardization done with StandardScaler")
print("="*80)
print("\nStatistics after standardization (X_train):")
print(f"Mean: {X_train_scaled.mean(axis=0).mean():.6f}")
print(f"Standard deviation: {X_train_scaled.std(axis=0).mean():.6f}")

# Convert to DataFrames for readability (keeping the original row index)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
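To make the standardization step concrete, here is a small sketch on synthetic data (the arrays `train_demo` and `test_demo` are invented for illustration). It checks that `StandardScaler` applies the z-score `(x - mean) / std` using the training statistics, which is why the test set ends up close to, but not exactly at, zero mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler().fit(train_demo)

# Manual z-score with the *training* mean and population std (ddof=0)
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)  # NumPy default ddof=0, as in StandardScaler
manual_test = (test_demo - mu) / sigma

scaled_test = demo_scaler.transform(test_demo)
print("Matches manual z-score:", np.allclose(scaled_test, manual_test))
print("Test-set column means after scaling:", scaled_test.mean(axis=0))
```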

---
# 3. Modeling & Evaluation
---

In [None]:
def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """
    Generic evaluation function (binary and multi-class).
    """

    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    n_classes = len(np.unique(y_test))

    # ROC-AUC (only for models that expose class probabilities)
    roc_auc = None
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test)

        if n_classes == 2:
            roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
        else:
            roc_auc = roc_auc_score(
                y_test,
                y_pred_proba,
                multi_class='ovr',
                average='weighted'
            )

    # Metrics (weighted averages account for class imbalance)
    results = {
        'Mod√®le': name,
        'Accuracy Train': accuracy_score(y_train, y_pred_train),
        'Accuracy Test': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test, average='weighted', zero_division=0),
        'Recall': recall_score(y_test, y_pred_test, average='weighted', zero_division=0),
        'F1-Score': f1_score(y_test, y_pred_test, average='weighted', zero_division=0),
        'ROC-AUC': roc_auc
    }

    # Print the results
    print(f"\n{'='*80}")
    print(f"MODEL: {name}")
    print(f"{'='*80}")

    for metric, value in results.items():
        if metric != 'Mod√®le':
            if value is not None:
                print(f"{metric:20s}: {value:.4f}")
            else:
                print(f"{metric:20s}: N/A")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred_test)
    class_labels = np.unique(y_test)

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_labels,
                yticklabels=class_labels)
    plt.title(f'Confusion Matrix - {name}', fontsize=14, fontweight='bold')
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
    plt.tight_layout()
    plt.show()

    # Classification report
    print("\nClassification report:")
    print(classification_report(y_test, y_pred_test, zero_division=0))

    return results
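The evaluation function reports precision, recall, and F1 with `average='weighted'`. As a short sketch on invented labels (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the dataset), the weighted score is the per-class score averaged with each class's support as weight, whereas the macro score is the plain mean.

```python
import numpy as np
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 dominates
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 0, 0, 0, 1, 2, 1, 1, 0, 2])

per_class_f1 = f1_score(y_true_demo, y_pred_demo, average=None)
support = np.bincount(y_true_demo)  # number of true samples per class

weighted_manual = np.average(per_class_f1, weights=support)
weighted_sklearn = f1_score(y_true_demo, y_pred_demo, average='weighted')
macro_sklearn = f1_score(y_true_demo, y_pred_demo, average='macro')

print("Per-class F1 :", per_class_f1)
print("Weighted F1  :", weighted_manual, weighted_sklearn)
print("Macro F1     :", macro_sklearn)
```

On imbalanced data the weighted average is dominated by the largest class, so reporting the macro average alongside it can make weak minority-class performance easier to spot.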

## 3.2 K-Nearest Neighbors (KNN)

In [None]:
# Train the KNN model
print("Training the K-Nearest Neighbors model...")
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Evaluation
results_knn = evaluate_model('K-Nearest Neighbors', knn,
                             X_train_scaled, X_test_scaled,
                             y_train, y_test)

## 3.3 Logistic Regression

In [None]:
# Train the Logistic Regression model
print("Training the Logistic Regression model...")
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)

# Evaluation
results_lr = evaluate_model('Logistic Regression', lr,
                            X_train_scaled, X_test_scaled,
                            y_train, y_test)

## 3.4 Decision Tree with Visualization

In [None]:
# Train the Decision Tree model
print("Training the Decision Tree model...")
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train_scaled, y_train)

# Evaluation
results_dt = evaluate_model('Decision Tree', dt,
                           X_train_scaled, X_test_scaled,
                           y_train, y_test)

In [None]:
# Visualize the decision tree
plt.figure(figsize=(22, 12))

plot_tree(
    dt,                                        # the trained model
    feature_names=list(X.columns),             # feature names (plain list, accepted by all scikit-learn versions)
    class_names=[str(c) for c in dt.classes_], # actual class labels
    filled=True,
    rounded=True,
    fontsize=9
)

plt.title("Decision Tree Visualization", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
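A depth-5 tree can be hard to read as a figure. As a possible complement, scikit-learn's `export_text` prints the same split rules as plain text. The sketch below uses the bundled iris dataset and a hypothetical `demo_tree`, only to keep the example self-contained; in the notebook the same call would take `dt` and the notebook's feature names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Small illustrative tree (not the notebook's model)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```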

## 3.5 Random Forest with Feature Importance

In [None]:
# Train the Random Forest model
print("Training the Random Forest model...")
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_scaled, y_train)

# Evaluation
results_rf = evaluate_model('Random Forest', rf,
                           X_train_scaled, X_test_scaled,
                           y_train, y_test)

In [None]:
# Build the importance table from the trained Random Forest
# (feature_importance was used below without being defined)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
})

# Sort by decreasing importance
feature_importance_sorted = feature_importance.sort_values('Importance', ascending=False)

# Select the top N most important features (e.g. top 5)
top_n = 5
top_features = feature_importance_sorted.head(top_n)

# Horizontal bar chart
plt.figure(figsize=(8, 6))
colors = plt.cm.viridis(np.linspace(0, 1, len(top_features)))
plt.barh(top_features['Feature'][::-1], top_features['Importance'][::-1],
         color=colors, edgecolor='black', alpha=0.8)
plt.xlabel('Importance', fontsize=12, fontweight='bold')
plt.ylabel('Features', fontsize=12, fontweight='bold')
plt.title(f'Top {top_n} Feature Importance - Random Forest', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.8)

# Add the values next to the bars
for i, v in enumerate(top_features['Importance'][::-1]):
    plt.text(v + 0.001, i, f'{v:.4f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print(f"\nTop {top_n} most important features:")
print(top_features.to_string(index=False))

## 3.6 XGBoost

In [None]:
# Train the XGBoost model
print("Training the XGBoost model...")
# 'mlogloss' is the multi-class log loss, matching the 3-class target
xgb = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1, eval_metric='mlogloss')
xgb.fit(X_train_scaled, y_train)

# Evaluation
results_xgb = evaluate_model('XGBoost', xgb,
                            X_train_scaled, X_test_scaled,
                            y_train, y_test)

## 3.7 Baseline Model Comparison

In [None]:
# Compare the results
comparison_df = pd.DataFrame([results_knn, results_lr, results_dt, results_rf, results_xgb])
comparison_df = comparison_df.set_index('Mod√®le')

print("\n" + "="*100)
print("BASELINE MODEL COMPARISON")
print("="*100)
print(comparison_df.to_string())

# Comparative visualization: one panel per metric
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy Test', 'Precision', 'Recall', 'F1-Score']
colors_palette = sns.color_palette('husl', len(comparison_df))

for idx, metric in enumerate(metrics):
    row = idx // 2
    col = idx % 2

    comparison_df[metric].plot(kind='bar', ax=axes[row, col],
                               color=colors_palette, alpha=0.8, edgecolor='black')
    axes[row, col].set_title(f'Comparison - {metric}', fontsize=13, fontweight='bold')
    axes[row, col].set_ylabel('Score', fontsize=11)
    axes[row, col].set_xlabel('')
    axes[row, col].grid(axis='y', alpha=0.3)
    axes[row, col].set_xticklabels(axes[row, col].get_xticklabels(), rotation=45, ha='right')

    # Add the values above the bars
    for i, v in enumerate(comparison_df[metric]):
        axes[row, col].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
print("\n=== MODELS WITHOUT PCA ===")

# ==============================
# Decision Trees
# ==============================
dt_gini = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    random_state=42
)

dt_entropy = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=5,
    random_state=42
)

dt_gini.fit(X_train_scaled, y_train)
dt_entropy.fit(X_train_scaled, y_train)

y_pred_dt_gini = dt_gini.predict(X_test_scaled)
y_pred_dt_entropy = dt_entropy.predict(X_test_scaled)

print("\nDecision Tree - Gini")
print(classification_report(y_test, y_pred_dt_gini))
print("Accuracy :", accuracy_score(y_test, y_pred_dt_gini))
print("F1-score :", f1_score(y_test, y_pred_dt_gini, average='weighted'))

print("\nDecision Tree - Entropy")
print(classification_report(y_test, y_pred_dt_entropy))
print("Accuracy :", accuracy_score(y_test, y_pred_dt_entropy))
print("F1-score :", f1_score(y_test, y_pred_dt_entropy, average='weighted'))


# ==============================
# Random Forest
# ==============================
rf_gini = RandomForestClassifier(
    n_estimators=100,
    criterion='gini',
    random_state=42
)

rf_entropy = RandomForestClassifier(
    n_estimators=100,
    criterion='entropy',
    random_state=42
)

rf_gini.fit(X_train_scaled, y_train)
rf_entropy.fit(X_train_scaled, y_train)

y_pred_rf_gini = rf_gini.predict(X_test_scaled)
y_pred_rf_entropy = rf_entropy.predict(X_test_scaled)

print("\nRandom Forest - Gini")
print(classification_report(y_test, y_pred_rf_gini))
print("Accuracy :", accuracy_score(y_test, y_pred_rf_gini))
print("F1-score :", f1_score(y_test, y_pred_rf_gini, average='weighted'))

print("\nRandom Forest - Entropy")
print(classification_report(y_test, y_pred_rf_entropy))
print("Accuracy :", accuracy_score(y_test, y_pred_rf_entropy))
print("F1-score :", f1_score(y_test, y_pred_rf_entropy, average='weighted'))

Gini and entropy serve the same purpose in decision trees and random forests:

`Find the best possible split at each node so that the child nodes are as pure as possible.`

In other words:

- Each branch should contain examples that are as homogeneous as possible (e.g., all from the same class).

- Gini and entropy are simply two different ways of measuring the impurity, or disorder, of a node.

- The tree uses this measure to decide which variable and which split value to choose at each step.
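To make the two impurity measures concrete, here is a minimal sketch using their standard definitions: Gini = 1 - Σ p_k² and entropy = -Σ p_k log₂ p_k, where p_k is the share of class k in the node. The function names and the toy nodes are illustrative; scikit-learn computes these internally.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy_impurity(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


pure_node = [0, 0, 0, 0]   # a single class: no impurity
mixed_node = [0, 0, 1, 1]  # two classes in equal shares
three_way = [0, 1, 2]      # three classes in equal shares

for name, node in [("pure", pure_node), ("mixed", mixed_node), ("three-way", three_way)]:
    print(f"{name:10s} gini={gini_impurity(node):.4f} entropy={entropy_impurity(node):.4f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often lead to similar trees.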

---
# 4. Fine-tuning the Best Models
---

## 4.1 Selecting the Best Models

In [None]:
# Select the 3 best models by F1-Score
best_models_idx = comparison_df['F1-Score'].nlargest(3).index
print("Top 3 models selected for fine-tuning:")
print("="*80)
for i, model_name in enumerate(best_models_idx, 1):
    print(f"{i}. {model_name} - F1-Score: {comparison_df.loc[model_name, 'F1-Score']:.4f}")

## 4.2 Fine-tuning with GridSearchCV

In [None]:
# Dictionary of models and hyperparameters to tune
models_to_tune = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000, random_state=42),
        'params': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'penalty': ['l2'],
            'solver': ['lbfgs', 'liblinear']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42, n_jobs=-1),
        'params': {
            'n_estimators': [100, 200, 300],
            'max_depth': [5, 10, 15, 20, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'XGBoost': {
        'model': XGBClassifier(random_state=42, n_jobs=-1, eval_metric='mlogloss'),
        'params': {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 5, 7, 9],
            'learning_rate': [0.01, 0.1, 0.3],
            'subsample': [0.8, 0.9, 1.0],
            'colsample_bytree': [0.8, 0.9, 1.0]
        }
    }
}

# Storage for the tuned models
tuned_models = {}
tuned_results = []

for model_name, config in models_to_tune.items():
    print(f"\n{'='*80}")
    print(f"FINE-TUNING: {model_name}")
    print(f"{'='*80}")

    # GridSearchCV
    # scoring='f1' is defined for binary targets only; with three classes an
    # averaged variant such as 'f1_weighted' is required.
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        cv=5,
        scoring='f1_weighted',
        n_jobs=-1,
        verbose=1
    )

    grid_search.fit(X_train_scaled, y_train)

    print("\nBest parameters found:")
    for param, value in grid_search.best_params_.items():
        print(f"  {param}: {value}")

    print(f"\nBest cross-validation score (weighted F1): {grid_search.best_score_:.4f}")

    # Store the best model
    tuned_models[model_name] = grid_search.best_estimator_

    # Evaluate on the test set
    results = evaluate_model(f'{model_name} (Tuned)',
                            grid_search.best_estimator_,
                            X_train_scaled, X_test_scaled,
                            y_train, y_test)
    tuned_results.append(results)
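For a multi-class target, the plain `'f1'` scorer is not applicable because it assumes a binary problem; scikit-learn provides averaged variants such as `'f1_weighted'` and `'f1_macro'`. The sketch below runs a tiny grid search on synthetic 3-class data (the data and the names `X_demo`, `y_demo`, `demo_search` are made up for illustration) to show a configuration that yields finite cross-validation scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem, only for demonstration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)

demo_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1, 10]},
    cv=5,
    scoring='f1_weighted',  # multi-class compatible F1
)
demo_search.fit(X_demo, y_demo)

print("Best C:", demo_search.best_params_['C'])
print("Best weighted F1 (CV):", round(demo_search.best_score_, 4))
```

Choosing `'f1_macro'` instead would give every class equal weight, which may be preferable if the rarer varieties matter as much as the common one.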

## 4.3 Before/After Fine-tuning Comparison

In [None]:
# Compare the results
tuned_comparison_df = pd.DataFrame(tuned_results).set_index('Mod√®le')

print("\n" + "="*100)
print("MODEL COMPARISON AFTER FINE-TUNING")
print("="*100)
print(tuned_comparison_df.to_string())

# Comparative visualization
fig, ax = plt.subplots(figsize=(14, 6))

x = np.arange(len(tuned_comparison_df))
width = 0.15

metrics = ['Accuracy Test', 'Precision', 'Recall', 'F1-Score']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for i, (metric, color) in enumerate(zip(metrics, colors)):
    ax.bar(x + i*width, tuned_comparison_df[metric], width,
           label=metric, color=color, alpha=0.8, edgecolor='black')

ax.set_xlabel('Models', fontsize=12, fontweight='bold')
ax.set_ylabel('Score', fontsize=12, fontweight='bold')
ax.set_title('Metric Comparison - Tuned Models', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(tuned_comparison_df.index, rotation=15, ha='right')
ax.legend(loc='lower right', fontsize=10)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

**Overall performance:**

- XGBoost has the best accuracy, precision, recall, F1-score, and ROC-AUC, which indicates that it both identifies the classes well and generalizes well.

**Overfitting:**

- Random Forest reaches 100% accuracy on the training set but drops to 0.844 on the test set, which points to overfitting.

- XGBoost stays strong on the test set (0.950) with a training accuracy just below 1 (0.958), so its train/test gap is small.

**ROC-AUC:**

- XGBoost reaches 0.982, which is excellent and indicates that it separates each class from the others very well (one-vs-rest).

> Conclusion: for this dataset, XGBoost (Tuned) is the model to retain for predictions.

## 4.4 ROC Curves of the Tuned Models

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import numpy as np

colors = ['#e74c3c', '#3498db', '#2ecc71']

# Convert y_test to 1D integer labels if needed
y_test_arr = np.asarray(y_test)
if y_test_arr.ndim > 1 and y_test_arr.shape[1] > 1:
    y_test_labels = np.argmax(y_test_arr, axis=1)  # one-hot → 1D labels
else:
    y_test_labels = y_test_arr.ravel()  # already 1D

# Loop over each tuned model
for model_name, model in tuned_models.items():
    y_pred_proba = model.predict_proba(X_test_scaled)

    # Multi-class ROC-AUC (One-vs-Rest): pass the full probability matrix,
    # not a single column, which is only valid for binary targets
    roc_auc = roc_auc_score(y_test_labels, y_pred_proba,
                            multi_class='ovr', average='weighted')
    print(f"{model_name:25s}: ROC-AUC = {roc_auc:.4f}")

    # Plot one ROC curve per class (class i vs. the rest)
    plt.figure(figsize=(8,5))
    for i, color in zip(range(y_pred_proba.shape[1]), colors):
        fpr, tpr, _ = roc_curve(y_test_labels == i, y_pred_proba[:, i])
        plt.plot(fpr, tpr, color=color, lw=2.5, label=f'Class {i}')

    plt.plot([0,1], [0,1], 'k--', lw=2, label='Random (AUC=0.5)')
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.title(f"Multi-class ROC - {model_name}")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()


Model comparison (multi-class ROC-AUC)
- XGBoost: curves are very high and close to the top-left corner → very good separation of the 3 classes.
- Random Forest: good curves, slightly less sharp than XGBoost.
- Logistic Regression: low curves, especially the red one (class 0) → difficulty separating the classes (the model is too simple).
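The per-class curves and the single multi-class ROC-AUC number are related: with `multi_class='ovr'`, scikit-learn averages the one-vs-rest AUC of each class. The following sketch on synthetic data (all names such as `X_demo`, `proba`, and `per_class_auc` are illustrative) checks that relationship for the macro and the weighted average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data, only for demonstration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
proba = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)

# One binary AUC per class: "class k" vs. "everything else"
per_class_auc = np.array(
    [roc_auc_score(yte == k, proba[:, k]) for k in range(proba.shape[1])]
)
class_share = np.bincount(yte) / len(yte)

macro_ovr = roc_auc_score(yte, proba, multi_class='ovr', average='macro')
weighted_ovr = roc_auc_score(yte, proba, multi_class='ovr', average='weighted')

print("Per-class AUC:", per_class_auc)
print("Macro OvR    :", macro_ovr, "vs mean:", per_class_auc.mean())
print("Weighted OvR :", weighted_ovr, "vs weighted mean:", np.sum(per_class_auc * class_share))
```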

---
# 5. Feature Selection with SelectKBest
---

## 5.1 Selecting the K Best Features

In [None]:
# SelectKBest with different values of K
k_values = [5, 7, 10]
feature_selection_results = []

for k in k_values:
    print(f"\n{'='*80}")
    print(f"SELECTING THE {k} BEST FEATURES")
    print(f"{'='*80}")

    # SelectKBest (fit on the training set only)
    selector = SelectKBest(score_func=f_classif, k=k)
    X_train_selected = selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = selector.transform(X_test_scaled)

    # Names of the selected features
    selected_features = X.columns[selector.get_support()].tolist()

    print(f"\nSelected features ({k}):")
    for i, feature in enumerate(selected_features, 1):
        print(f"  {i}. {feature}")

    # Scores of the selected features
    feature_scores = pd.DataFrame({
        'Feature': X.columns[selector.get_support()],
        'Score': selector.scores_[selector.get_support()]
    }).sort_values('Score', ascending=False)

    print("\nFeature scores:")
    print(feature_scores.to_string(index=False))

    # Test with different models
    models_test = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss')
    }

    for model_name, model in models_test.items():
        model.fit(X_train_selected, y_train)
        y_pred = model.predict(X_test_selected)

        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        feature_selection_results.append({
            'K': k,
            'Mod√®le': model_name,
            'Accuracy': accuracy,
            'F1-Score': f1
        })

        print(f"\n{model_name} with {k} features:")
        print(f"  Accuracy: {accuracy:.4f}")
        print(f"  F1-Score: {f1:.4f}")

**Logistic Regression**

- Very stable: accuracy = 0.7581 for every selection (5, 7, or 10 features).

- F1-score stable at ~0.654.

- The model is insensitive to the additional features on this dataset. Note that these are also the values a classifier that always predicts the majority class would produce if that class makes up 75.81% of the test set, so the stability may reflect a majority-class baseline rather than robustness.

**Random Forest**

- Accuracy drops when more features are added (10 features → accuracy 0.6720).

- The F1-score peaks with 7 features (0.6805), which suggests that some of the additional features introduce noise for this model.

**XGBoost**

- With 5 features: accuracy 0.7545, F1 0.6520 → close to RF and LR.

- With 7 features: F1 improves slightly (0.6608) while accuracy drops (0.7186).

- With 10 features: best F1 (0.6726), with accuracy 0.7276, still below the 5-feature accuracy.

- This indicates that XGBoost's F1-score benefits from a few more features, at some cost in accuracy.

**Top features by SelectKBest score (f_classif)**

- Wind Direction_D61_D90_sw → score 3.58

- Wind Direction_D61_D90_se → score 2.43

- Wind Direction_D91_D120_nw → score 1.98

- Inst Wind Speed_D31_D60 → score 1.79

- Max temp_D91_D120 → score 1.79

> These weather features are the ones most associated with the target (the paddy variety) according to the ANOVA F-test. The F-values are small, so the association is modest. An agronomic link is plausible, since wind direction and temperature affect the crop, but the scores alone do not establish it.
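The Logistic Regression figures reported above (accuracy 0.7581, weighted F1 ≈ 0.654) can be compared with a majority-class baseline. The sketch below builds a hypothetical test set of 10,000 rows in which the largest class has a 75.81% share (the class counts are invented to match that share) and scores a predictor that always outputs that class. It does not prove how the notebook's model behaves; counting the predicted classes in the notebook would confirm or refute it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: majority class 0 has a 75.81% share
y_true_demo = np.array([0] * 7581 + [1] * 1210 + [2] * 1209)
y_pred_demo = np.zeros_like(y_true_demo)  # always predict the majority class

acc = accuracy_score(y_true_demo, y_pred_demo)
f1_w = f1_score(y_true_demo, y_pred_demo, average='weighted', zero_division=0)

# Closed form: the majority class has precision p and recall 1, so its
# F1 is 2p / (1 + p); the other classes score 0; the weight of the class is p.
p = 0.7581
closed_form = p * (2 * p / (1 + p))

print(f"accuracy={acc:.4f} weighted_f1={f1_w:.4f} closed_form={closed_form:.4f}")
```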

## 5.2 Visualizing the SelectKBest Scores

In [None]:
# Score every feature, then keep the 10 highest-scoring ones for the plot
top_n = 10
selector_all = SelectKBest(score_func=f_classif, k='all')
selector_all.fit(X_train_scaled, y_train)

all_feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'Score': selector_all.scores_
}).sort_values('Score', ascending=True)
top_features = all_feature_scores.sort_values('Score', ascending=False).head(top_n)

# Horizontal bar chart
plt.figure(figsize=(10, 6))
colors = plt.cm.plasma(np.linspace(0, 1, len(top_features)))
plt.barh(top_features['Feature'][::-1], top_features['Score'][::-1],  # reversed so the highest score is at the top
         color=colors, edgecolor='black', alpha=0.8)
plt.xlabel('F-value score', fontsize=12, fontweight='bold')
plt.ylabel('Features', fontsize=12, fontweight='bold')
plt.title(f'Top {top_n} Features - SelectKBest (f_classif)', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)

# Add the values next to the bars (offset relative to the largest score,
# since a fixed offset of 2 is too large for scores of this magnitude)
offset = 0.01 * top_features['Score'].max()
for i, v in enumerate(top_features['Score'][::-1]):
    plt.text(v + offset, i, f'{v:.2f}', va='center', fontsize=9)

plt.tight_layout()
plt.show()
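The scores plotted above come from `f_classif`, which computes a one-way ANOVA F-statistic per feature: how much the feature's mean differs between classes relative to its spread within classes. As a sketch on the bundled iris dataset (used only to keep the example self-contained), the same values can be reproduced with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# F-statistic and p-value for every feature at once
f_sklearn, p_sklearn = f_classif(X_iris, y_iris)

# Same statistic, one feature at a time: group the values by class
f_scipy = np.array([
    f_oneway(*[X_iris[y_iris == k, j] for k in np.unique(y_iris)]).statistic
    for j in range(X_iris.shape[1])
])

for name, f_val, p_val in zip(iris.feature_names, f_sklearn, p_sklearn):
    print(f"{name:20s} F={f_val:10.2f} p={p_val:.3g}")
```

Because the F-test looks at one feature at a time and only at differences in class means, a low score does not rule out that a feature is useful in combination with others.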

## 5.3 Performance Comparison for Different K

In [None]:
# Compare the results
fs_results_df = pd.DataFrame(feature_selection_results)

print("\n" + "="*100)
print("FEATURE SELECTION RESULTS")
print("="*100)
print(fs_results_df.to_string(index=False))

# Visualization: one panel per metric, one line per model
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for ax, metric in zip(axes, ['Accuracy', 'F1-Score']):
    for model_name in fs_results_df['Mod√®le'].unique():
        model_data = fs_results_df[fs_results_df['Mod√®le'] == model_name]
        ax.plot(model_data['K'], model_data[metric],
                marker='o', linewidth=2.5, markersize=10, label=model_name)

    ax.set_xlabel('Number of Features (K)', fontsize=12, fontweight='bold')
    ax.set_ylabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'{metric} vs Number of Features', fontsize=13, fontweight='bold')
    ax.legend(fontsize=10)
    ax.grid(alpha=0.3)
    ax.set_xticks(k_values)

plt.tight_layout()
plt.show()

**Interpretation**

1. Logistic Regression

- Stable at 0.758 accuracy regardless of the number of features K.

- F1-score is stable as well (~0.654).

- Insensitive to additional features → the model is simple and linear. Identical scores across K are also consistent with the model predicting a single class for every sample.

2. Random Forest

- F1-score rises slightly with 7 features (0.6805) but falls with 10 features (0.6516).

- This indicates that some additional features add noise and degrade performance.

3. XGBoost

- Its F1-score is best with 10 features (0.6726, accuracy 0.7276) → it may handle interactions between features better.

- Less stable with few features: the F1-score increases with K, while accuracy does not follow the same trend.

In [None]:
# Create the output folder if needed
os.makedirs("classification", exist_ok=True)

# Find the best (model, K) pair by F1-Score
best_result = max(feature_selection_results, key=lambda x: x['F1-Score'])
best_k = best_result['K']
best_model_name = best_result['Mod√®le']

print(f"Best model: {best_model_name} with K={best_k} features, F1-Score={best_result['F1-Score']:.4f}")

# Re-create and train the corresponding model for saving
selector_best = SelectKBest(score_func=f_classif, k=best_k)
X_train_best = selector_best.fit_transform(X_train_scaled, y_train)
X_test_best = selector_best.transform(X_test_scaled)

selected_features_best = X.columns[selector_best.get_support()].tolist()
print(f"Features selected for the best model: {selected_features_best}")

# Create the model
if best_model_name == 'XGBoost':
    model_best = XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss')
elif best_model_name == 'Random Forest':
    model_best = RandomForestClassifier(n_estimators=100, random_state=42)
else:
    model_best = LogisticRegression(max_iter=1000, random_state=42)

# Training
model_best.fit(X_train_best, y_train)

# Save the model (the fitted scaler and selector_best are also needed at prediction time)
joblib.dump(model_best, f"classification/best_model_{best_model_name}_K{best_k}.pkl")
print(f"Model {best_model_name} saved successfully!")
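The saved file contains only the classifier, while predictions on new data also require the fitted scaler and feature selector. One common option, sketched here on synthetic data under assumed names (`demo_pipeline`, `X_demo`, `y_demo`, and the temporary file path are all hypothetical), is to wrap the three steps in a scikit-learn `Pipeline` and persist that single object.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the paddy features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, random_state=42,
)

# Scaling, feature selection, and the classifier in one fitted object
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=7)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Round trip through joblib: the reloaded pipeline accepts raw features
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), reloaded.predict(X_demo)
)
print("Reloaded pipeline reproduces predictions:", same_predictions)
```

With this layout a downstream application passes unscaled feature rows to `predict` and cannot accidentally skip or mismatch a preprocessing step.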