# SAE-76 - Optimized TF-IDF

**As a** ML engineer  
**I want to** tune the TF-IDF parameters  
**So that** the quality of the text representation improves

## Objectives

- Explore the parameters (`max_features`, `min_df`, `max_df`, `ngram_range`)
- Compare different configurations
- Select the best configuration
- Save the optimal model

## Input/Output

- **Input**: `data/cleaned/reviews_text_cleaned.parquet`
- **Output**: `outputs/models/tfidf_vectorizer.pkl`

In [None]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
import time

pd.set_option('display.max_columns', None)
print("✅ Imports successful")

## 1. Loading the Data

In [None]:
# Path configuration
DATA_PATH = Path('../../data/cleaned/reviews_text_cleaned.parquet')
MODEL_PATH = Path('../../outputs/models/tfidf_vectorizer.pkl')

# Create the output folder if needed
MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)

print(f"📂 Loading: {DATA_PATH}")

In [None]:
# Load the data
df = pd.read_parquet(DATA_PATH)
print(f"✅ Data loaded: {len(df):,} reviews")

# Check that the text_cleaned column exists
if 'text_cleaned' not in df.columns:
    raise KeyError("❌ Column 'text_cleaned' is missing!")

# Replace any NaN values with empty strings
df['text_cleaned'] = df['text_cleaned'].fillna('')

# Preview
df[['text_cleaned']].head()

## 2. Exploring and Comparing Configurations

We will test 3 configurations:
1. **Default**: default parameters
2. **Limited**: limited vocabulary (max_features=5000, min_df=5, max_df=0.8)
3. **Bigrams**: unigrams + bigrams with a limited vocabulary (max_features=5000, min_df=5)

In [None]:
# Define the vectorizers to compare
vectorizers = {
    "Default": TfidfVectorizer(),
    "Limited": TfidfVectorizer(
        max_features=5000,
        min_df=5,
        max_df=0.8
    ),
    "Bigrams": TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=5
    )
}

results = []

print("🚀 Starting comparison...")

for name, vect in vectorizers.items():
    print(f"\n⏳ Processing config: {name}...")
    start_time = time.perf_counter()

    # Fit the vectorizer and build the document-term matrix
    X = vect.fit_transform(df['text_cleaned'])
    # Stop the timer here so the metric computations below are not counted
    duration = time.perf_counter() - start_time

    # Metrics: share of zero cells, vocabulary size, memory footprint
    sparsity = (1 - X.nnz / (X.shape[0] * X.shape[1])) * 100
    vocab_size = len(vect.get_feature_names_out())
    # A CSR matrix stores three arrays (data, indices, indptr); count all of them
    memory_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / (1024 * 1024)

    print(f"   ✅ Done in {duration:.2f}s")
    print(f"   📊 Shape: {X.shape}")
    print(f"   📚 Vocab size: {vocab_size:,}")
    print(f"   🕸️ Sparsity: {sparsity:.4f}%")

    results.append({
        'Configuration': name,
        'Vocab Size': vocab_size,
        'Sparsity (%)': sparsity,
        'Memory (MB)': memory_mb,
        'Time (s)': duration
    })
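
The sparsity and memory metrics can be checked by hand on a small matrix. The sketch below uses a hypothetical 3 x 4 document-term matrix (the weights are made up for illustration) and assumes the matrix is stored in SciPy's CSR format, which is what `fit_transform` returns.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3 x 4 document-term matrix with 4 non-zero weights
dense = np.array([
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
X_toy = sparse.csr_matrix(dense)

# Sparsity: percentage of cells that are zero
n_cells = X_toy.shape[0] * X_toy.shape[1]
sparsity_pct = (1 - X_toy.nnz / n_cells) * 100

# CSR keeps three arrays: the non-zero values, their column indices,
# and the row pointers. A full memory estimate sums all three.
data_bytes = X_toy.data.nbytes
csr_bytes = data_bytes + X_toy.indices.nbytes + X_toy.indptr.nbytes

print(f"Non-zero cells: {X_toy.nnz} of {n_cells}")
print(f"Sparsity: {sparsity_pct:.2f}%")
print(f"Values only: {data_bytes} bytes, full CSR: {csr_bytes} bytes")
```

Counting only the `data` array underestimates the footprint, since the index arrays are stored for every non-zero entry as well.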

In [None]:
# Display the comparison table
results_df = pd.DataFrame(results)
print("\n🏆 Comparative results:")
display(results_df)

## 3. Analysis and Choice

### Analysis of the results
- **Default**: very large vocabulary, with a risk of noise and overfitting. Slow.
- **Limited**: controlled vocabulary, faster, focused on frequent words.
- **Bigrams**: captures context (e.g. "not good"), with the vocabulary capped by `max_features`.

### Final Choice
We choose the **Bigrams** configuration because it captures more meaning (negation, compound superlatives) while keeping the dimensionality manageable (5,000 features).

## 4. Saving the Best Model

In [None]:
best_model_name = "Bigrams"
best_vectorizer = vectorizers[best_model_name]

print(f"💾 Saving model '{best_model_name}' to {MODEL_PATH}...")

with open(MODEL_PATH, 'wb') as f:
    pickle.dump(best_vectorizer, f)

print("✅ Model saved successfully!")

In [None]:
# Check that the model reloads correctly
with open(MODEL_PATH, 'rb') as f:
    loaded_vect = pickle.load(f)

print(f"🔍 Reload check: {type(loaded_vect)}")
print(f"   Params: ngram_range={loaded_vect.ngram_range}, max_features={loaded_vect.max_features}")
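
Checking the type and parameters confirms that the file loads, but not that the fitted vocabulary survived. A stricter round-trip check could look like the sketch below. It is self-contained, so it uses a hypothetical toy corpus and a temporary file rather than `MODEL_PATH`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned reviews
corpus = ["good food", "not good food", "slow service", "good service"]
vect = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)

# Save and reload through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tfidf_vectorizer.pkl"
    with open(path, "wb") as f:
        pickle.dump(vect, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored vectorizer should map unseen text to the same vectors
new_reviews = ["not good service"]
before = vect.transform(new_reviews)
after = restored.transform(new_reviews)

same_vocab = vect.vocabulary_ == restored.vocabulary_
same_output = np.allclose(before.toarray(), after.toarray())
print(f"Same vocabulary: {same_vocab}, same output: {same_output}")
```

Pickle files should only be loaded from trusted sources, and a pickled vectorizer is best reloaded with the same scikit-learn version that saved it.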