# SAE-82 – ML Model Comparison (SVM, RF, NB)

**As an** ML engineer  
**I want** to test several ML models  
**So that** I can find the best one for classification

## Objectives

- Train 4 models: Logistic Regression, LinearSVC, Random Forest, Naive Bayes
- Compare performance (accuracy, F1, precision, recall, training time)
- Save the best model (highest weighted F1 on the test set)

## Input / Output

- **Input**: `data/cleaned/reviews_text_cleaned.parquet` + `outputs/models/tfidf_vectorizer.pkl`
- **Output**: `outputs/ml_models_comparison.csv` + best model `outputs/models/best_ml_model.pkl`

In [None]:
import pandas as pd
import numpy as np
import pickle
import time
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
import warnings
warnings.filterwarnings('ignore')

print("Imports OK")

## 1. Loading and Preparing the Data

In [None]:
# Paths
DATA_PATH = Path('../../data/cleaned/reviews_text_cleaned.parquet')
TFIDF_MODEL_PATH = Path('../../outputs/models/tfidf_vectorizer.pkl')
OUTPUT_CSV = Path('../../outputs/ml_models_comparison.csv')
BEST_MODEL_PATH = Path('../../outputs/models/best_ml_model.pkl')

# Load the data
df = pd.read_parquet(DATA_PATH)
df['text_cleaned'] = df['text_cleaned'].fillna('')
print(f"Data loaded: {len(df):,} reviews")

# Load the fitted TF-IDF vectorizer
with open(TFIDF_MODEL_PATH, 'rb') as f:
    tfidf = pickle.load(f)
print(f"TF-IDF loaded: {len(tfidf.vocabulary_):,} terms")

# Transform the text into TF-IDF features
X = tfidf.transform(df['text_cleaned'])
y = df['stars']
print(f"X shape: {X.shape}")

# Class distribution
print("\nStar distribution:")
print(y.value_counts().sort_index())

In [None]:
# Stratified train/test split (80/20), preserving the star distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape[0]:,} | Test: {X_test.shape[0]:,}")
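The split above passes `stratify=y`, which should keep the share of each star rating the same in the train and test sets. The sketch below illustrates this on a small, hypothetical label list (the `labels`, `features`, and `class_shares` names are made up for the example and are not part of the notebook's data).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for the `stars` column (hypothetical data).
labels = [1] * 10 + [3] * 20 + [5] * 70
features = [[i] for i in range(len(labels))]

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

def class_shares(y):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(y)
    return {k: counts[k] / len(y) for k in sorted(counts)}

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("train:", train_shares)
print("test: ", test_shares)
```

Without `stratify`, a rare class could end up under-represented in the test set, which would make per-class metrics noisier.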

## 2. Model Definitions

In [None]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs'),
    'LinearSVC (SVM)': LinearSVC(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Naive Bayes': MultinomialNB()
}

print(f"Models to train: {len(models)}")
for name in models:
    print(f"  - {name}")

## 3. Training and Evaluation

In [None]:
results = []
trained_models = {}

for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"  {name}")
    print(f"{'='*60}")
    
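    # Training (perf_counter is a monotonic clock suited to timing)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Metrics, weighted by class support; zero_division=0 scores a never-predicted class as 0
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    rec = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)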
    
    results.append({
        'Model': name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1': f1,
        'Train Time (s)': round(train_time, 2)
    })
    
    trained_models[name] = model
    
    print(f"  Accuracy:  {acc:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  Recall:    {rec:.4f}")
    print(f"  F1:        {f1:.4f}")
    print(f"  Time:      {train_time:.2f}s")

print("\n\nAll models trained!")
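The loop above reports metrics with `average='weighted'`, so each class contributes in proportion to its support. As a rough illustration of what that implies on imbalanced ratings, the sketch below compares weighted and macro F1 on hypothetical toy labels (not the notebook's data), using a degenerate classifier that always predicts the majority class.

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels: 8 five-star reviews and 2 one-star reviews.
y_true = [5] * 8 + [1] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = [5] * 10

# Weighted F1 averages per-class F1 by class support;
# macro F1 gives every class the same weight.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"weighted F1: {weighted_f1:.4f}")
print(f"macro F1:    {macro_f1:.4f}")
```

The weighted score stays well above the macro score even though one class is never predicted, so it may be worth reading the per-class reports alongside the ranking.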

## 4. Comparison Table

In [None]:
# Build the comparison table, ranked by weighted F1
results_df = pd.DataFrame(results).sort_values('F1', ascending=False)
results_df.index = range(1, len(results_df) + 1)
results_df.index.name = 'Rank'

print("MODEL COMPARISON TABLE")
print("=" * 80)
display(results_df.style.format({
    'Accuracy': '{:.4f}',
    'Precision': '{:.4f}',
    'Recall': '{:.4f}',
    'F1': '{:.4f}',
    'Train Time (s)': '{:.2f}'
}).background_gradient(cmap='YlGn', subset=['F1', 'Accuracy']))

In [None]:
# Save to CSV
OUTPUT_CSV.parent.mkdir(parents=True, exist_ok=True)
results_df.to_csv(OUTPUT_CSV, index=False)
print(f"Table saved: {OUTPUT_CSV}")

## 5. Detailed Classification Report per Model

In [None]:
for name, model in trained_models.items():
    y_pred = model.predict(X_test)
    print(f"\n{'='*60}")
    print(f"  Classification Report: {name}")
    print(f"{'='*60}")
    print(classification_report(y_test, y_pred))

## 6. Saving the Best Model

In [None]:
# Identify the best model (first row after sorting by F1)
best_row = results_df.iloc[0]
best_name = best_row['Model']
best_model = trained_models[best_name]

print(f"Best model: {best_name}")
print(f"  F1: {best_row['F1']:.4f}")
print(f"  Accuracy: {best_row['Accuracy']:.4f}")

# Save
BEST_MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(BEST_MODEL_PATH, 'wb') as f:
    pickle.dump(best_model, f)

print(f"\nModel saved: {BEST_MODEL_PATH}")
print(f"  Size: {BEST_MODEL_PATH.stat().st_size / 1024:.1f} KB")

In [None]:
# Check that the saved model reloads correctly
with open(BEST_MODEL_PATH, 'rb') as f:
    loaded = pickle.load(f)

y_check = loaded.predict(X_test[:5])
print("Model reloaded OK")
print(f"  Predictions: {y_check}")
print(f"  Actual:      {y_test.iloc[:5].values}")
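The saved model expects TF-IDF features, so anyone reloading it also needs `tfidf_vectorizer.pkl`. One possible alternative, sketched below under assumptions, is to wrap the fitted vectorizer and the fitted classifier in a scikit-learn `Pipeline` so that a single pickle accepts raw text. The tiny corpus and the `bundle` name are hypothetical; in the notebook the steps would be the existing `tfidf` and `best_model` objects.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus standing in for the cleaned reviews.
texts = [
    "great food loved it",
    "terrible service never again",
    "loved the great staff",
    "never again terrible food",
]
stars = [5, 1, 5, 1]

# Fit the two steps separately, as the notebook does.
vectorizer = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(vectorizer.transform(texts), stars)

# Wrap the already-fitted steps; constructing a Pipeline does not refit them.
bundle = Pipeline([("tfidf", vectorizer), ("clf", clf)])

# Round-trip through pickle in memory and predict on raw text.
restored = pickle.loads(pickle.dumps(bundle))
sample = ["great staff", "terrible food"]
original_pred = bundle.predict(sample).tolist()
restored_pred = restored.predict(sample).tolist()
print(original_pred, restored_pred)
```

This keeps the vectorizer and the classifier in sync, at the cost of a larger file than the classifier alone.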

## 7. Summary

### Conclusions

- The comparison table in section 4 shows the performance of each model, ranked by weighted F1
- The best model is saved to `outputs/models/best_ml_model.pkl`
- The CSV table is saved to `outputs/ml_models_comparison.csv`
- The extreme classes (1★ and 5★) are generally classified better than the intermediate ones (2★ to 4★); the per-class reports in section 5 show by how much
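
To check the last point on a given model, per-class recall can be read directly rather than from the full report. The sketch below uses hypothetical toy labels restricted to three ratings; in the notebook one would presumably pass `y_test` and `model.predict(X_test)` instead.

```python
from sklearn.metrics import recall_score

# Hypothetical toy labels (not results from the notebook).
labels = [1, 3, 5]
y_true = [1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5]
y_pred = [1, 1, 1, 3, 1, 3, 5, 3, 5, 5, 5, 5]

# average=None returns one recall value per label, in the order of `labels`.
recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class = {label: float(r) for label, r in zip(labels, recalls)}

for label, value in per_class.items():
    print(f"{label} stars: recall = {value:.2f}")
```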

### Models Tested

| Model | Description |
|-------|-------------|
| Logistic Regression | Baseline (SAE-81) |
| LinearSVC | Linear SVM (liblinear solver) |
| Random Forest | Ensemble of 100 trees |
| Naive Bayes | MultinomialNB |