# SAE-70 - Basic Text Cleaning

**As an** NLP engineer  
**I want to** clean the preprocessed text  
**So that** it is ready for NLP analysis

## ⚠️ Prerequisites

- SAE-97 (Reviews Data Cleaning) must have been run
- Source file: `data/cleaned/reviews_clean.parquet`

## Acceptance criteria

- [x] Load reviews_clean.parquet
- [x] Lowercase (convert everything to lowercase)
- [x] Remove punctuation
- [x] Remove URLs
- [x] Remove emails
- [x] Remove digits (optional, as needed)
- [x] Collapse repeated whitespace
- [x] Reusable function created
- [x] `text_cleaned` column added to the DataFrame

---
## 1. Imports and Configuration

In [1]:
import re
import pandas as pd
from pathlib import Path

# Path configuration
DATA_PATH = Path('../data/cleaned')
INPUT_FILE = DATA_PATH / 'reviews_clean.parquet'
OUTPUT_FILE = DATA_PATH / 'reviews_text_cleaned.parquet'

print(f"📁 Source file: {INPUT_FILE}")
print(f"📁 Destination file: {OUTPUT_FILE}")

📁 Source file: ..\data\cleaned\reviews_clean.parquet
📁 Destination file: ..\data\cleaned\reviews_text_cleaned.parquet


---
## 2. Loading the Data

In [2]:
# Check that the source file exists
if not INPUT_FILE.exists():
    raise FileNotFoundError(
        f"❌ The file {INPUT_FILE} does not exist.\n"
        f"⚠️ Run SAE-97 (Reviews Data Cleaning) first"
    )

# Load the data
df_reviews = pd.read_parquet(INPUT_FILE)

print(f"✅ Data loaded: {len(df_reviews):,} reviews")
print(f"📊 Columns: {list(df_reviews.columns)}")
df_reviews.head()

✅ Data loaded: 999,985 reviews
📊 Columns: ['review_id', 'user_id', 'business_id', 'stars', 'useful', 'funny', 'cool', 'text', 'date']


| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|-----------|---------|-------------|-------|--------|-------|------|------|------|
| 0 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 |
| 1 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 5 | 0 | 0 | 0 | I needed a new tires for my wife's car. They h... | 2020-05-24 12:22:14 |
| 2 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 5 | 0 | 0 | 0 | Jim Woltman who works at Goleta Honda is 5 sta... | 2019-02-14 03:47:48 |
| 3 | U9-43s8YUl6GWBFCpxUGEw | PAxc0qpqt5c2kA0rjDFFAg | XwepyB7KjJ-XGJf0vKc6Vg | 4 | 0 | 0 | 0 | Been here a few times to get some shrimp. They... | 2013-04-27 01:55:49 |
| 4 | 8T8EGa_4Cj12M6w8vRgUsQ | BqPR1Dp5Rb_QYs9_fz9RiA | prm5wvpp0OHJBlrvTj9uOg | 5 | 0 | 0 | 0 | This is one fantastic place to eat whether you... | 2019-05-15 18:29:25 |


In [3]:
# Check that the text column exists
if 'text' not in df_reviews.columns:
    raise KeyError("❌ The 'text' column does not exist in the DataFrame")

# Preview of the text before cleaning
print("📝 Sample texts before cleaning:")
print("="*50)
for i, text in enumerate(df_reviews['text'].head(3).values):
    text = str(text)  # guard against non-string values before slicing
    preview = text[:200] + "..." if len(text) > 200 else text
    print(f"\n[{i+1}] {preview}")

📝 Sample texts before cleaning:

[1] Went for lunch and found that my burger was meh. What was obvious was that the focus of the burgers is the amount of different and random crap they can pile on it and not the flavor of the meat. My bu...

[2] I needed a new tires for my wife's car. They had to special order it and had it the next day, I dropped it off in the morning before work and they called a few hours later and the car was ready. It wa...

[3] Jim Woltman who works at Goleta Honda is 5 stars!! He is knowledgeable, helpful, and so personable. He did a fantastic job on my Honda. Thank you Jim!!! And thank you Honda for having such a fabulous ...


---
## 3. Text Cleaning Function

In [4]:
def clean_text(text: str, remove_digits: bool = False) -> str:
    """
    Clean the text for NLP analysis.
    
    Operations performed:
    - Conversion to lowercase
    - Removal of URLs
    - Removal of emails
    - Removal of punctuation
    - Removal of digits (optional)
    - Collapsing of repeated whitespace
    
    Parameters
    ----------
    text : str
        The text to clean
    remove_digits : bool, default=False
        If True, digits are removed as well
    
    Returns
    -------
    str
        The cleaned text
        
    Examples
    --------
    >>> clean_text("The food was GREAT!!! 😋 Visit http://example.com")
    'the food was great visit'
    
    >>> clean_text("Price: $25.99", remove_digits=True)
    'price'
    """
    # Handle missing values (None, NaN); test None first so pd.isna
    # only ever receives a scalar
    if text is None or pd.isna(text):
        return ""
    
    # Convert to string if necessary
    text = str(text)
    
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove URLs ('http\S+' already covers 'https...')
    text = re.sub(r'http\S+|www\.\S+', '', text)
    
    # 3. Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    
    # 4. Replace punctuation with a space
    #    (\w keeps letters, digits and underscores; \s keeps whitespace)
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # 5. Remove digits (optional)
    if remove_digits:
        text = re.sub(r'\d+', '', text)
    
    # 6. Collapse repeated whitespace and trim
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

print("✅ clean_text() function defined")

✅ clean_text() function defined
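
The order of the steps in `clean_text()` matters: URLs and emails are removed before punctuation, because stripping punctuation first would break them into ordinary-looking words that the URL pattern no longer matches. The sketch below illustrates this with two hypothetical helpers (`urls_then_punct` and `punct_then_urls` are names introduced here for illustration, using simplified versions of the patterns above).

```python
import re

# Simplified patterns in the spirit of clean_text() (illustrative only)
URL_RE = re.compile(r'http\S+|www\.\S+')
PUNCT_RE = re.compile(r'[^\w\s]')
SPACE_RE = re.compile(r'\s+')


def urls_then_punct(text: str) -> str:
    """Remove URLs first, then punctuation (the order used in clean_text)."""
    text = URL_RE.sub('', text.lower())
    text = PUNCT_RE.sub(' ', text)
    return SPACE_RE.sub(' ', text).strip()


def punct_then_urls(text: str) -> str:
    """Hypothetical reversed order: punctuation first, then URLs."""
    text = PUNCT_RE.sub(' ', text.lower())
    text = URL_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()


sample = "Visit http://example.com for more!"
print(urls_then_punct(sample))  # visit for more
print(punct_then_urls(sample))  # visit http example com for more
```

With the reversed order, the URL fragments `http`, `example` and `com` survive as separate tokens, which is why the URL and email steps come first.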


### 3.1 Testing the Function

In [5]:
# Tests for the cleaning function
test_cases = [
    ("The food was GREAT!!! 😋", "the food was great"),
    ("Visit http://example.com for more!", "visit for more"),
    ("Contact: test@email.com", "contact"),
    ("Price: $25.99 only!", "price 25 99 only"),
    ("   Multiple   spaces   here   ", "multiple spaces here"),
    (None, ""),
    ("", ""),
]

print("🧪 Tests for clean_text():")
print("="*60)

all_passed = True
for i, (input_text, expected) in enumerate(test_cases, 1):
    result = clean_text(input_text)
    status = "✅" if result == expected else "❌"
    if result != expected:
        all_passed = False
    print(f"{status} Test {i}:")
    print(f"   Input:    {repr(input_text)}")
    print(f"   Expected: {repr(expected)}")
    print(f"   Got:      {repr(result)}")
    print()

if all_passed:
    print("\n🎉 All tests passed!")
else:
    print("\n⚠️ Some tests failed")

🧪 Tests for clean_text():
✅ Test 1:
   Input:    'The food was GREAT!!! 😋'
   Expected: 'the food was great'
   Got:      'the food was great'

✅ Test 2:
   Input:    'Visit http://example.com for more!'
   Expected: 'visit for more'
   Got:      'visit for more'

✅ Test 3:
   Input:    'Contact: test@email.com'
   Expected: 'contact'
   Got:      'contact'

✅ Test 4:
   Input:    'Price: $25.99 only!'
   Expected: 'price 25 99 only'
   Got:      'price 25 99 only'

✅ Test 5:
   Input:    '   Multiple   spaces   here   '
   Expected: 'multiple spaces here'
   Got:      'multiple spaces here'

✅ Test 6:
   Input:    None
   Expected: ''
   Got:      ''

✅ Test 7:
   Input:    ''
   Expected: ''
   Got:      ''


🎉 All tests passed!
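
The test cases above do not cover what the character class `[^\w\s]` keeps. In Python 3, `\w` on `str` patterns is Unicode-aware and also matches the underscore, so accented letters and underscores survive the punctuation step. A minimal sketch of that step in isolation (`strip_punct` is a hypothetical helper, not part of the notebook):

```python
import re


def strip_punct(text: str) -> str:
    """Lowercase, replace non-word/non-space characters with a space, collapse whitespace."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


print(strip_punct("snake_case"))  # snake_case  (underscore is kept by \w)
print(strip_punct("Caf\u00e9!"))  # café        (accented letter is kept)
print(strip_punct("room #12"))    # room 12     (digits are kept, '#' is dropped)
```

If underscores should count as punctuation for the downstream analysis, the pattern would need to exclude them explicitly; the notebook as written keeps them.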


---
## 4. Applying the Cleaning

In [6]:
%%time

# Apply the cleaning function
print("🔄 Cleaning text...")

# Create the text_cleaned column
df_reviews['text_cleaned'] = df_reviews['text'].apply(clean_text)

print(f"✅ Cleaning finished for {len(df_reviews):,} reviews")

🔄 Cleaning text...
✅ Cleaning finished for 999,985 reviews
CPU times: total: 1min 19s
Wall time: 1min 20s


### 4.1 Before/After Comparison

In [7]:
# Before/after comparison
print("📊 Before/After Cleaning Comparison:")
print("="*70)

for i in range(min(5, len(df_reviews))):
    original = str(df_reviews['text'].iloc[i])[:150]
    cleaned = str(df_reviews['text_cleaned'].iloc[i])[:150]
    
    print(f"\n🔹 Review {i+1}:")
    print(f"   BEFORE: {original}...")
    print(f"   AFTER:  {cleaned}...")

📊 Before/After Cleaning Comparison:

🔹 Review 1:
   BEFORE: Went for lunch and found that my burger was meh. What was obvious was that the focus of the burgers is the amount of different and random crap they ca...
   AFTER:  went for lunch and found that my burger was meh what was obvious was that the focus of the burgers is the amount of different and random crap they can...

🔹 Review 2:
   BEFORE: I needed a new tires for my wife's car. They had to special order it and had it the next day, I dropped it off in the morning before work and they cal...
   AFTER:  i needed a new tires for my wife s car they had to special order it and had it the next day i dropped it off in the morning before work and they calle...

🔹 Review 3:
   BEFORE: Jim Woltman who works at Goleta Honda is 5 stars!! He is knowledgeable, helpful, and so personable. He did a fantastic job on my Honda. Thank you Jim!...
   AFTER:  jim woltman who works at goleta honda is 5 stars he is knowledgeable hel

---
## 5. Cleaning Statistics

In [8]:
# Cleaning statistics
print("📈 Cleaning Statistics:")
print("="*50)

# Average length before/after
avg_len_before = df_reviews['text'].str.len().mean()
avg_len_after = df_reviews['text_cleaned'].str.len().mean()
reduction = ((avg_len_before - avg_len_after) / avg_len_before) * 100

print(f"\n📏 Average length:")
print(f"   Before: {avg_len_before:.0f} characters")
print(f"   After:  {avg_len_after:.0f} characters")
print(f"   Reduction: {reduction:.1f}%")

# Empty texts after cleaning
empty_count = (df_reviews['text_cleaned'] == '').sum()
empty_pct = (empty_count / len(df_reviews)) * 100

print(f"\n🔍 Empty texts after cleaning:")
print(f"   Count: {empty_count:,} ({empty_pct:.2f}%)")

# Average word count
avg_words_before = df_reviews['text'].str.split().str.len().mean()
avg_words_after = df_reviews['text_cleaned'].str.split().str.len().mean()

print(f"\n📝 Average word count:")
print(f"   Before: {avg_words_before:.1f} words")
print(f"   After:  {avg_words_after:.1f} words")

📈 Cleaning Statistics:

📏 Average length:
   Before: 565 characters
   After:  550 characters
   Reduction: 2.6%

🔍 Empty texts after cleaning:
   Count: 1 (0.00%)

📝 Average word count:
   Before: 104.8 words
   After:  107.0 words
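
The average word count goes up after cleaning (104.8 to 107.0) even though the text gets shorter. A likely explanation is that punctuation is replaced by a space rather than deleted, so tokens such as `wife's` or `25.99` are split in two. The following sketch reproduces the effect on an invented sentence (`replace_punct` and the sample text are assumptions made for illustration):

```python
import re


def replace_punct(text: str) -> str:
    """Same punctuation and whitespace steps as clean_text(), isolated."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


def word_count(text: str) -> int:
    """Whitespace-delimited token count, as in the statistics cell."""
    return len(text.split())


before = "I needed tires for my wife's car. It cost $25.99!"
after = replace_punct(before)

print(after)               # i needed tires for my wife s car it cost 25 99
print(word_count(before))  # 10
print(word_count(after))   # 12
```

Here `wife's` becomes `wife s` and `$25.99!` becomes `25 99`, adding two tokens while the character count shrinks.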


---
## 6. Saving the Data

In [9]:
# Save the DataFrame with the new column
print("💾 Saving data...")

# Option 1: new file (recommended for traceability)
df_reviews.to_parquet(OUTPUT_FILE, index=False)
print(f"✅ Saved: {OUTPUT_FILE}")

# Option 2: overwrite the original file (uncomment if desired)
# df_reviews.to_parquet(INPUT_FILE, index=False)
# print(f"✅ Original file updated: {INPUT_FILE}")

# Check the file size
file_size = OUTPUT_FILE.stat().st_size / (1024 * 1024)
print(f"📊 File size: {file_size:.2f} MB")

💾 Saving data...
✅ Saved: ..\data\cleaned\reviews_text_cleaned.parquet
📊 File size: 729.60 MB


---
## 7. Final Verification

In [10]:
# Verify the saved file
df_check = pd.read_parquet(OUTPUT_FILE)

print("✅ Final verification:")
print(f"   - Number of reviews: {len(df_check):,}")
print(f"   - Columns: {list(df_check.columns)}")
print(f"   - text_cleaned column present: {'text_cleaned' in df_check.columns}")

df_check[['text', 'text_cleaned']].head()

✅ Final verification:
   - Number of reviews: 999,985
   - Columns: ['review_id', 'user_id', 'business_id', 'stars', 'useful', 'funny', 'cool', 'text', 'date', 'text_cleaned']
   - text_cleaned column present: True


| | text | text_cleaned |
|---|------|--------------|
| 0 | Went for lunch and found that my burger was me... | went for lunch and found that my burger was me... |
| 1 | I needed a new tires for my wife's car. They h... | i needed a new tires for my wife s car they ha... |
| 2 | Jim Woltman who works at Goleta Honda is 5 sta... | jim woltman who works at goleta honda is 5 sta... |
| 3 | Been here a few times to get some shrimp. They... | been here a few times to get some shrimp they ... |
| 4 | This is one fantastic place to eat whether you... | this is one fantastic place to eat whether you... |
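
Beyond checking that the column exists, the cleaned text can be checked against the properties the cleaning is meant to guarantee. The sketch below is one possible way to do this; `find_violations` is a hypothetical helper, and the list of properties is an assumption derived from the acceptance criteria rather than something the notebook defines.

```python
import re
from typing import List


def find_violations(cleaned: str) -> List[str]:
    """Return the cleaning properties that a supposedly cleaned string breaks."""
    problems = []
    if cleaned != cleaned.lower():
        problems.append("uppercase")
    if cleaned != cleaned.strip():
        problems.append("leading/trailing whitespace")
    if re.search(r'\s{2,}', cleaned):
        problems.append("repeated whitespace")
    if re.search(r'[^\w\s]', cleaned):
        problems.append("punctuation")
    return problems


print(find_violations("went for lunch and found that my burger was meh"))  # []
print(find_violations(" Hello,  world"))
# ['uppercase', 'leading/trailing whitespace', 'repeated whitespace', 'punctuation']
```

In the notebook, a check of this kind could be mapped over a sample of `df_check['text_cleaned']` to confirm that no row reports a violation.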


---
## ✅ SAE-70 Summary

### Acceptance criteria validated:

| Criterion | Status |
|-----------|--------|
| Load reviews_clean.parquet | ✅ |
| Lowercase | ✅ |
| Remove punctuation | ✅ |
| Remove URLs | ✅ |
| Remove emails | ✅ |
| Remove digits (optional) | ✅ |
| Collapse repeated whitespace | ✅ |
| Reusable function created | ✅ |
| text_cleaned column added | ✅ |

### Files:
- **Input**: `data/cleaned/reviews_clean.parquet`
- **Output**: `data/cleaned/reviews_text_cleaned.parquet`