# Stopword Removal

Objective: remove common words (stopwords) while keeping the words that matter for context, such as negations.

Part of story **SAE-72**.

In [1]:
import sys
import os
import pandas as pd
import nltk

# Add src to path
sys.path.append(os.path.abspath(os.path.join('../..', 'src')))

from text_preprocessing import tokenize_text, remove_stopwords

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\melou\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\melou\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Loading the Data

In [2]:
# Load cleaned reviews
reviews_path = '../../data/cleaned/reviews_clean.parquet'
if os.path.exists(reviews_path):
    reviews = pd.read_parquet(reviews_path)
    print(f"Loaded {len(reviews)} reviews")
    
    # Use a sample for speed
    reviews = reviews.head(1000).copy()
    print("Using sample of 1000 reviews")
else:
    print("Data file not found. Creating dummy data.")
    # Create text with negations to test our features
    reviews = pd.DataFrame({'text': [
        "The food is not good at all.",
        "This is no doubt the best place.",
        "I will not be returning.",
        "Service was great and fast."
    ]})

# Apply tokenization first (SAE-71).
# fillna('') keeps missing reviews from being tokenized as the literal string 'nan'.
reviews['tokens'] = reviews['text'].fillna('').astype(str).apply(tokenize_text)

Loaded 999985 reviews
Using sample of 1000 reviews


## Application: Standard Removal vs. Negation Preservation

In [3]:
# Standard removal
reviews['tokens_clean_standard'] = reviews['tokens'].apply(remove_stopwords)

# Removal preserving negation
negation_words = {'no', 'not', 'nor', 'neither'}
reviews['tokens_clean_negation'] = reviews['tokens'].apply(lambda x: remove_stopwords(x, exclude=negation_words))

print("Stopwords removal applied.")

Stopwords removal applied.
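
The source of `remove_stopwords` is not shown in this notebook. The following is a minimal sketch of how a function with an `exclude` parameter could behave, assuming case-insensitive matching against a stopword set. The name `remove_stopwords_sketch` and the tiny `DEMO_STOPWORDS` list are hypothetical and stand in for the real implementation and for the full NLTK English list.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical, deliberately tiny stopword list for illustration only.
# The real function presumably uses NLTK's English stopword list (assumption).
DEMO_STOPWORDS: Set[str] = {
    "the", "is", "at", "all", "i", "will", "be",
    "this", "was", "and", "no", "not", "nor",
}


def remove_stopwords_sketch(
    tokens: Iterable[str], exclude: Optional[Iterable[str]] = None
) -> List[str]:
    """Drop stopwords from tokens, keeping any word listed in exclude."""
    protected = {word.lower() for word in (exclude or ())}
    active = DEMO_STOPWORDS - protected  # stopwords that will actually be removed
    # Compare in lowercase so "The" and "the" are treated the same way.
    return [tok for tok in tokens if tok.lower() not in active]


tokens = ["The", "food", "is", "not", "good", "at", "all", "."]
standard = remove_stopwords_sketch(tokens)
keep_neg = remove_stopwords_sketch(tokens, exclude={"no", "not", "nor", "neither"})

print(standard)  # ['food', 'good', '.']
print(keep_neg)  # ['food', 'not', 'good', '.']
```

With standard removal, "not good" collapses to "good", which reverses the sentiment a downstream model would see. Excluding the negation words from the stopword set avoids that at the cost of a few extra tokens.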


## Comparison

In [4]:
# Find examples containing negation words.
# Word boundaries (\b) prevent false matches inside words such as "nothing" or "another".
neg_examples = reviews[reviews['text'].str.contains(r'\b(?:not|no)\b', case=False, na=False)].head(5)

for idx, row in neg_examples.iterrows():
    print(f"\nOriginal Text : {row['text']}")
    print(f"Standard Clean: {row['tokens_clean_standard']}")
    print(f"Keep Negation : {row['tokens_clean_negation']}")


Original Text : Went for lunch and found that my burger was meh. What was obvious was that the focus of the burgers is the amount of different and random crap they can pile on it and not the flavor of the meat. My burger patty seemed steamed and appeared to be a preformed patty, contrary to what is stated on the menu. I can get ground beef from Kroger and make a burger that blows them out of the water.
Standard Clean: ['Went', 'lunch', 'found', 'burger', 'meh', '.', 'obvious', 'focus', 'burgers', 'amount', 'different', 'random', 'crap', 'pile', 'flavor', 'meat', '.', 'burger', 'patty', 'seemed', 'steamed', 'appeared', 'preformed', 'patty', ',', 'contrary', 'stated', 'menu', '.', 'get', 'ground', 'beef', 'Kroger', 'make', 'burger', 'blows', 'water', '.']
Keep Negation : ['Went', 'lunch', 'found', 'burger', 'meh', '.', 'obvious', 'focus', 'burgers', 'amount', 'different', 'random', 'crap', 'pile', 'not', 'flavor', 'meat', '.', 'burger', 'patty', 'seemed', 'steamed', 'appeared', 'preform

## Reduction Statistics

In [5]:
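# Token counts before and after stopword removal.
# The "clean" count uses the negation-preserving variant.
reviews['count_original'] = reviews['tokens'].apply(len)
reviews['count_clean'] = reviews['tokens_clean_negation'].apply(len)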

avg_orig = reviews['count_original'].mean()
avg_clean = reviews['count_clean'].mean()
reduction = (avg_orig - avg_clean) / avg_orig * 100

print(f"Average token count (original): {avg_orig:.1f}")
print(f"Average token count (after stopword removal): {avg_clean:.1f}")
print(f"Size reduction: {reduction:.1f}%")

Average token count (original): 118.3
Average token count (after stopword removal): 68.1
Size reduction: 42.4%
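
The reduction above is computed from the two averages (a ratio of means), so long reviews weigh more than short ones. The sketch below first re-checks the printed figures, then uses two made-up reviews (hypothetical counts, not taken from the dataset) to show how this differs from averaging the per-review reduction.

```python
from statistics import mean

# Re-check the printed figures: (118.3 - 68.1) / 118.3 * 100
reported = (118.3 - 68.1) / 118.3 * 100
print(f"{reported:.1f}%")  # 42.4%

# Hypothetical token counts for two reviews (illustration only).
count_original = [10, 100]
count_clean = [9, 50]

# Ratio of means: what the cell above computes.
ratio_of_means = (
    (mean(count_original) - mean(count_clean)) / mean(count_original) * 100
)

# Mean of ratios: average of each review's own reduction.
per_review = [(o - c) / o for o, c in zip(count_original, count_clean)]
mean_of_ratios = mean(per_review) * 100

print(f"Ratio of means: {ratio_of_means:.1f}%")  # 46.4%
print(f"Mean of ratios: {mean_of_ratios:.1f}%")  # 30.0%
```

Both measures are reasonable. The ratio of means describes the shrinkage of the corpus as a whole, while the mean of ratios describes what happens to a typical review.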
