# SAE-81 – Classification Baseline – Logistic Regression

**As an** ML engineer  
**I want** a baseline classification model  
**So that** I have a simple reference point

## Objectives

- Train/test split (80/20, stratified)
- Logistic Regression on TF-IDF features (from SAE-76)
- Metrics: Accuracy, Precision, Recall, F1
- Classification report
- Save the model

## Input / Output

- **Input**: `data/cleaned/reviews_text_cleaned.parquet` + `outputs/models/tfidf_vectorizer.pkl`
- **Output**: `outputs/models/lr_baseline.pkl`

In [None]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
import warnings
warnings.filterwarnings('ignore')

print("Imports OK")

## 1. Loading the Data and the TF-IDF Vectorizer

In [None]:
# Path configuration
DATA_PATH = Path('../../data/cleaned/reviews_text_cleaned.parquet')
TFIDF_MODEL_PATH = Path('../../outputs/models/tfidf_vectorizer.pkl')
LR_MODEL_PATH = Path('../../outputs/models/lr_baseline.pkl')

# Create the output directory if needed
LR_MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)

print(f"Data: {DATA_PATH}")
print(f"TF-IDF model: {TFIDF_MODEL_PATH}")
print(f"LR output: {LR_MODEL_PATH}")

In [None]:
# Load the data
df = pd.read_parquet(DATA_PATH)
print(f"Data loaded: {len(df):,} reviews")
print(f"Columns: {list(df.columns)}")

# Check that the required columns are present
assert 'text_cleaned' in df.columns, "Missing column 'text_cleaned'!"
assert 'stars' in df.columns, "Missing column 'stars'!"

# Replace missing texts with empty strings
df['text_cleaned'] = df['text_cleaned'].fillna('')

# Preview
print("\nStar rating distribution:")
print(df['stars'].value_counts().sort_index())
df[['text_cleaned', 'stars']].head()

In [None]:
# Load the pre-fitted TF-IDF vectorizer (SAE-76)
with open(TFIDF_MODEL_PATH, 'rb') as f:
    tfidf_vectorizer = pickle.load(f)

print("TF-IDF vectorizer loaded")
print(f"  Type: {type(tfidf_vectorizer).__name__}")
print(f"  Params: ngram_range={tfidf_vectorizer.ngram_range}, max_features={tfidf_vectorizer.max_features}")
print(f"  Vocabulary: {len(tfidf_vectorizer.vocabulary_):,} terms")

## 2. Preparing the TF-IDF Features

In [None]:
# Transform the texts into TF-IDF features
X = tfidf_vectorizer.transform(df['text_cleaned'])
y = df['stars']

print("TF-IDF features generated")
print(f"  X shape: {X.shape}")
print(f"  y shape: {y.shape}")
print(f"  Sparsity: {(1 - X.nnz / (X.shape[0] * X.shape[1])) * 100:.2f}%")
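The cell above relies on two behaviours of a fitted vectorizer: `transform()` only counts terms already in the fitted vocabulary, and an empty string (such as those produced by `fillna('')`) becomes an all-zero row. The sketch below illustrates both on a toy corpus. The texts and the `ngram_range=(1, 2)` setting are assumptions made for the illustration, not the SAE-76 configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not the project's reviews)
train_texts = ["great food great service", "terrible food", "service was slow"]
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # assumed setting
toy_vectorizer.fit(train_texts)

# transform() reuses the fitted vocabulary: unseen tokens are dropped,
# and an empty string becomes an all-zero row.
new_texts = ["great service", "", "zzz qqq"]
X_toy = toy_vectorizer.transform(new_texts)

n_rows, n_cols = X_toy.shape
toy_sparsity = 1 - X_toy.nnz / (n_rows * n_cols)
nonzeros_per_row = X_toy.getnnz(axis=1)

print(f"shape={X_toy.shape}, sparsity={toy_sparsity:.2%}")
print(f"non-zero features per row: {nonzeros_per_row.tolist()}")
```

Rows with no non-zero feature carry no signal for the classifier, so counting them with `X.getnnz(axis=1) == 0` on the real matrix may be a useful sanity check.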

## 3. Train / Test Split (80/20)

In [None]:
# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train/test split done")
print(f"  Train: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(df)*100:.0f}%)")
print(f"  Test:  {X_test.shape[0]:,} samples ({X_test.shape[0]/len(df)*100:.0f}%)")
print("\nTrain distribution:")
print(y_train.value_counts().sort_index())
print("\nTest distribution:")
print(y_test.value_counts().sort_index())
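To show what `stratify=y` contributes, here is a minimal sketch on synthetic labels with an 80/20 class imbalance. The label values and sizes are made up for the illustration. With stratification the test set keeps the original class proportions, while a plain random split only matches them approximately.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 one-star and 20 five-star reviews (illustrative)
y_toy = np.array([1] * 80 + [5] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split parameters as the notebook, with and without stratification
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

strat_counts = Counter(y_test_strat.tolist())
plain_counts = Counter(y_test_plain.tolist())

print(f"stratified test counts: {dict(strat_counts)}")
print(f"plain test counts:      {dict(plain_counts)}")
```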

## 4. Training – Logistic Regression

In [None]:
# Train the model
import time

print("Training the Logistic Regression...")
start = time.time()

lr = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs'
)
lr.fit(X_train, y_train)

duration = time.time() - start
print(f"Model trained in {duration:.2f}s")
print(f"  Classes: {lr.classes_}")
print(f"  Number of iterations: {lr.n_iter_}")
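Because the import cell silences all warnings, a `ConvergenceWarning` from the solver would not be shown. One way to check convergence explicitly is to compare `n_iter_` with `max_iter`, as sketched below on synthetic data. The helper `has_converged` is hypothetical and the test is only a heuristic: a solver that stops exactly at `max_iter` most likely ran out of iterations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def has_converged(model: LogisticRegression) -> bool:
    """Heuristic (hypothetical helper): True if no solver run hit max_iter."""
    return bool(np.max(model.n_iter_) < model.max_iter)


# Synthetic 3-class data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
demo_lr = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=42)
demo_lr.fit(X_demo, y_demo)

converged = has_converged(demo_lr)
print(f"iterations={demo_lr.n_iter_}, converged={converged}")
```

If the check fails on the real model, raising `max_iter` is the usual first step.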

## 5. Predictions and Evaluation

In [None]:
# Predictions
y_pred = lr.predict(X_test)

# Global metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("=" * 50)
print("GLOBAL METRICS")
print("=" * 50)
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f} (weighted)")
print(f"  Recall:    {recall:.4f} (weighted)")
print(f"  F1-score:  {f1:.4f} (weighted)")
print("=" * 50)
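A note on `average='weighted'`: each class score is weighted by its support, so frequent classes dominate the result. For recall in particular, the weighted average is mathematically equal to accuracy, since $\sum_c \frac{n_c}{N} \cdot \frac{TP_c}{n_c} = \frac{\sum_c TP_c}{N}$. The sketch below checks this on made-up labels and contrasts it with `average='macro'`, which gives every class the same weight and is more sensitive to poorly predicted minority classes.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: class 5 is frequent, classes 1 and 3 are rare
y_true_toy = [5, 5, 5, 5, 5, 5, 1, 1, 3, 3]
y_pred_toy = [5, 5, 5, 5, 5, 5, 5, 1, 5, 3]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
weighted_recall = recall_score(y_true_toy, y_pred_toy, average="weighted")
macro_recall = recall_score(y_true_toy, y_pred_toy, average="macro")

print(f"accuracy:        {toy_accuracy:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
print(f"macro recall:    {macro_recall:.4f}")
```

With imbalanced star ratings, reporting the macro averages next to the weighted ones may give a fairer picture of the rare classes.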

In [None]:
# Detailed classification report
print("\nCLASSIFICATION REPORT")
print("=" * 60)
report = classification_report(y_test, y_pred)
print(report)

In [None]:
# Classification report as a DataFrame, for analysis
report_dict = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report_dict).T
print("Per-class metrics:")
display(report_df.style.format('{:.3f}').background_gradient(cmap='YlGn', subset=['f1-score']))

## 6. Saving the Model

In [None]:
# Save the model
with open(LR_MODEL_PATH, 'wb') as f:
    pickle.dump(lr, f)

print(f"Model saved: {LR_MODEL_PATH}")
print(f"  Size: {LR_MODEL_PATH.stat().st_size / 1024:.1f} KB")

In [None]:
# Check that the model can be reloaded
with open(LR_MODEL_PATH, 'rb') as f:
    loaded_lr = pickle.load(f)

# Test prediction with the reloaded model
y_pred_check = loaded_lr.predict(X_test[:5])
print("Model reloaded and working")
print(f"  Test predictions: {y_pred_check}")
print(f"  True values:      {y_test.iloc[:5].values}")
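The reload check above inspects five predictions by eye. A stricter variant, sketched here on a synthetic model, compares the coefficients and every prediction before and after the pickle round trip. The data, the temporary directory and the file name `lr_demo.pkl` are assumptions made for the illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and model, for illustration only
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "lr_demo.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should be numerically identical to the original
same_coef = np.array_equal(model.coef_, restored.coef_)
same_preds = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f"same coefficients: {same_coef}, same predictions: {same_preds}")
```

Applied to the notebook, the equivalent check would be `np.array_equal(lr.predict(X_test), loaded_lr.predict(X_test))`.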

## 7. Summary

### Baseline Results

| Metric | Value |
|--------|-------|
| Accuracy | see above |
| Precision (weighted) | see above |
| Recall (weighted) | see above |
| F1-score (weighted) | see above |

### Key points

- **Model**: Logistic Regression (solver=lbfgs, max_iter=1000)
- **Features**: TF-IDF bigrams (5,000 features, from SAE-76)
- **Split**: 80% train / 20% test (stratified, random_state=42)
- **Saved to**: `outputs/models/lr_baseline.pkl`

This model will serve as the **baseline** against which more advanced models (SVM, Random Forest, BERT, etc.) are compared.