# Tokenization with NLTK

Objective: Tokenize the reviews, splitting them into individual words with NLTK.

Part of story **SAE-71**.

In [1]:
import sys
import os
import pandas as pd
import nltk

# Add src to path
sys.path.append(os.path.abspath(os.path.join('../..', 'src')))

from text_preprocessing import tokenize_text

# Download NLTK resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\melou\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Loading the Data

In [2]:
# Load cleaned reviews
reviews_path = '../../data/cleaned/reviews_clean.parquet'
if os.path.exists(reviews_path):
    reviews = pd.read_parquet(reviews_path)
    print(f"Loaded {len(reviews)} reviews")
    # Use a sample for speed
    reviews = reviews.head(1000).copy()
    print("Using sample of 1000 reviews for tokenization demonstration")
else:
    print("Data file not found. Creating dummy data for testing.")
    reviews = pd.DataFrame({'text': ["Don't hesitate to visit! It's great.", "Food was amazing, service... okay."]})

Loaded 999985 reviews
Using sample of 1000 reviews for tokenization demonstration


## Applying Tokenization

In [3]:
# Apply tokenization
reviews['tokens'] = reviews['text'].apply(str).apply(tokenize_text)

# Show results
reviews[['text', 'tokens']].head()

| | text | tokens |
|---|---|---|
| 0 | Went for lunch and found that my burger was me... | [Went, for, lunch, and, found, that, my, burge... |
| 1 | I needed a new tires for my wife's car. They h... | [I, needed, a, new, tires, for, my, wife, 's, ... |
| 2 | Jim Woltman who works at Goleta Honda is 5 sta... | [Jim, Woltman, who, works, at, Goleta, Honda, ... |
| 3 | Been here a few times to get some shrimp. They... | [Been, here, a, few, times, to, get, some, shr... |
| 4 | This is one fantastic place to eat whether you... | [This, is, one, fantastic, place, to, eat, whe... |
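
One caveat with `.apply(str)`: a missing review (`NaN` or `None`) is converted to the literal string `'nan'` or `'None'` and then tokenized as if it were a word. The sketch below shows one possible guard. `safe_tokenize` and the stand-in `simple_tokenizer` are hypothetical names that are not part of `text_preprocessing`, and whether the cleaned dataset contains missing texts at all is not established here.

```python
import math


def simple_tokenizer(text):
    # Stand-in for the project's tokenize_text (assumption): whitespace split only.
    return text.split()


def safe_tokenize(value, tokenizer=simple_tokenizer):
    """Return an empty list for missing values instead of tokenizing 'nan' or 'None'."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    return tokenizer(str(value))


# str() turns a missing float into an ordinary-looking word.
print(str(float("nan")))
print(safe_tokenize(float("nan")))
print(safe_tokenize("Food was amazing"))
```

In the notebook, this could be applied as `reviews['text'].apply(lambda v: safe_tokenize(v, tokenize_text))`, so that empty reviews yield an empty token list and a token count of 0.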


## Statistics

In [4]:
reviews['token_count'] = reviews['tokens'].apply(len)
mean_tokens = reviews['token_count'].mean()
print(f"Mean number of tokens per review: {mean_tokens:.1f}")

Mean number of tokens per review: 118.3
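
The mean alone can hide a skewed length distribution, where a few very long reviews pull the average up. As a sketch, the median and maximum can be reported alongside it. The token lists below are made-up illustrative data, not taken from the review dataset.

```python
import statistics

# Hypothetical token lists for illustration only (not from the review data).
token_lists = [
    ["Great", "food", "!"],
    ["Service", "was", "slow", "."],
    ["Loved", "it"],
    ["The", "staff", "were", "friendly", "and", "the", "room", "was", "clean", "."],
]

# Same idea as reviews['tokens'].apply(len) in the notebook.
token_counts = [len(tokens) for tokens in token_lists]

summary = {
    "mean": statistics.mean(token_counts),
    "median": statistics.median(token_counts),
    "max": max(token_counts),
}
print(summary)
```

With the DataFrame from the notebook, `reviews['token_count'].describe()` gives a comparable summary in one call.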


## Comparison: Simple Split vs NLTK

In [5]:
test_text = "Don't split this. It's tricky!"

print("Original:", test_text)
print("Simple split:", test_text.split())
print("NLTK tokenize:", tokenize_text(test_text))

Original: Don't split this. It's tricky!
Simple split: ["Don't", 'split', 'this.', "It's", 'tricky!']
NLTK tokenize: ['Do', "n't", 'split', 'this', '.', 'It', "'s", 'tricky', '!']
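
The difference comes from how punctuation and contractions are handled: `split()` cuts only on whitespace, while the NLTK output above separates punctuation marks and splits contractions into clitics (`n't`, `'s`). The regular expression below is a rough approximation written to reproduce that behaviour on this one example. It is not NLTK's algorithm, and `approx_tokenize` is a hypothetical name.

```python
import re

# Alternatives are tried left to right at each position:
#   \w+(?=n't)  word stem directly followed by "n't" (e.g. "Do" in "Don't")
#   n't         the negation clitic itself
#   '\w+        other clitics such as 's, 're, 'll
#   \w+         ordinary words
#   [^\w\s]     any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def approx_tokenize(text):
    """Split text into word, clitic, and punctuation tokens (approximation only)."""
    return TOKEN_PATTERN.findall(text)


print(approx_tokenize("Don't split this. It's tricky!"))
```

This sketch treats every punctuation character as its own token, so an ellipsis comes out as three separate dots, and it does not handle upper-case contractions such as `DON'T`. It is meant to illustrate the idea, not to replace `tokenize_text`.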
