# Text preprocessing: BBC News dataset

## Objective
Clean and normalize the texts so that they can be used
for TF-IDF vectorization and K-Means clustering.


# Cell 2: Library imports

In [22]:
import pandas as pd
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


# Cell 3: Downloading the NLTK resources

In [23]:
nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
import nltk
import os

# Create a local nltk_data folder (no error if it already exists)
nltk_data_dir = os.path.join(os.getcwd(), "nltk_data")
os.makedirs(nltk_data_dir, exist_ok=True)

# Tell NLTK to search this folder first
nltk.data.path.insert(0, nltk_data_dir)

# Download the required resources into the local folder
nltk.download("punkt_tab", download_dir=nltk_data_dir)
nltk.download("punkt", download_dir=nltk_data_dir)
nltk.download("stopwords", download_dir=nltk_data_dir)
nltk.download("wordnet", download_dir=nltk_data_dir)


[nltk_data] Downloading package punkt_tab to
[nltk_data]     c:\Users\Dell\Desktop\Mon_Projet_Py\notebooks\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     c:\Users\Dell\Desktop\Mon_Projet_Py\notebooks\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     c:\Users\Dell\Desktop\Mon_Projet_Py\notebooks\nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     c:\Users\Dell\Desktop\Mon_Projet_Py\notebooks\nltk_dat
[nltk_data]     a...
[nltk_data]   Package wordnet is already up-to-date!


True

# Cell 4: Loading the dataset

In [25]:
df = pd.read_csv("data/bbc-text.csv")
df.head()

,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


# Cell 5: Initializing the NLP tools

In [26]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Cell 6: Text cleaning function

In [27]:
def clean_text(text):
    """Return a lowercased, filtered and lemmatized version of `text`."""
    # 1. Convert to lowercase
    text = text.lower()

    # 2. Remove digits
    text = re.sub(r"\d+", "", text)

    # 3. Remove punctuation (characters are deleted, not replaced by spaces)
    text = text.translate(str.maketrans("", "", string.punctuation))

    # 4. Tokenize
    tokens = word_tokenize(text)

    # 5. Remove stop words and tokens of two characters or fewer
    tokens = [
        word for word in tokens
        if word not in stop_words and len(word) > 2
    ]

    # 6. Lemmatize (WordNetLemmatizer treats each token as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # 7. Rebuild the text as a single space-separated string
    return " ".join(tokens)
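
One side effect of step 3 is worth checking: deleting punctuation outright fuses hyphenated words, so `high-definition` becomes `highdefinition`. The sketch below uses only the standard library to compare the deletion used in `clean_text` with a hypothetical alternative, `strip_punct_with_spaces` (not part of this notebook), that replaces each punctuation character with a space. Whether the alternative is preferable depends on the vocabulary wanted downstream; it is shown only as an option to evaluate.

```python
import re
import string

def strip_punct_delete(text):
    """Same approach as step 3 of clean_text: delete punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def strip_punct_with_spaces(text):
    """Hypothetical alternative: map punctuation to spaces, then collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.translate(table)).strip()

sample = "plasma high-definition tvs and set-top boxes"
print(strip_punct_delete(sample))       # plasma highdefinition tvs and settop boxes
print(strip_punct_with_spaces(sample))  # plasma high definition tvs and set top boxes
```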

# Cell 7: Applying the preprocessing

In [28]:
from nltk.tokenize import word_tokenize

df["clean_text"] = df["text"].apply(clean_text)
df[["text", "clean_text"]].head()

,text,clean_text
0,tv future in the hands of viewers with home th...,future hand viewer home theatre system plasma ...
1,worldcom boss left books alone former worldc...,worldcom bos left book alone former worldcom b...
2,tigers wary of farrell gamble leicester say ...,tiger wary farrell gamble leicester say rushed...
3,yeading face newcastle in fa cup premiership s...,yeading face newcastle cup premiership side ne...
4,ocean s twelve raids box office ocean s twelve...,ocean twelve raid box office ocean twelve crim...
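
The output above shows that the `len(word) > 2` filter also drops short but meaningful tokens: `tv` disappears from the first row and `fa` from the fourth. The sketch below reproduces that filter on a whitespace-split sample, using a tiny stand-in stop word set instead of NLTK's list. The helper `filter_tokens` and its `min_len` parameter are hypothetical names introduced for this illustration.

```python
# Tiny stand-in for NLTK's English stop word list (assumption, illustration only).
toy_stop_words = {"in", "the", "of", "with"}

def filter_tokens(tokens, stop_words, min_len=3):
    """Keep tokens that are not stop words and have at least min_len characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

tokens = "tv future in the hands of viewers".split()

# min_len=3 mirrors the notebook's `len(word) > 2` condition.
print(filter_tokens(tokens, toy_stop_words))             # ['future', 'hands', 'viewers']

# min_len=2 would keep two-letter tokens such as "tv".
print(filter_tokens(tokens, toy_stop_words, min_len=2))  # ['tv', 'future', 'hands', 'viewers']
```

Lowering the threshold would also let through more noise, so this is a trade-off to check against the clustering results rather than a recommended change.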


# Cell 8: Checking for empty texts

In [29]:
# Number of texts that are empty after cleaning
(df["clean_text"].str.len() == 0).sum()

np.int64(0)

# Cell 9: Comparing length before / after cleaning

In [30]:
df["length_before"] = df["text"].apply(lambda x: len(x.split()))
df["length_after"] = df["clean_text"].apply(lambda x: len(x.split()))

df[["length_before", "length_after"]].describe()

,length_before,length_after
count,2225.0,2225.0
mean,390.295281,209.914607
std,241.753128,121.168297
min,90.0,47.0
25%,250.0,135.0
50%,337.0,184.0
75%,479.0,260.0
max,4492.0,2128.0
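
To read the table in relative terms, the figures can be turned into a reduction ratio. The short calculation below hard-codes the mean and median (the 50% row) from the `describe()` output above; the variable names are illustrative only.

```python
# Values copied from the describe() output above.
mean_before, mean_after = 390.295281, 209.914607
median_before, median_after = 337.0, 184.0

# Fraction of whitespace-separated tokens removed by the cleaning step.
mean_reduction = 1 - mean_after / mean_before
median_reduction = 1 - median_after / median_before

print(f"Mean reduction:   {mean_reduction:.1%}")
print(f"Median reduction: {median_reduction:.1%}")
```

Both values come out at roughly 45 to 46 percent, so the cleaning step removes a little under half of the tokens in a typical article.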


# Cell 10: Example before / after preprocessing

In [31]:
print("Original text:\n")
print(df["text"].iloc[0])

print("\nText after preprocessing:\n")
print(df["clean_text"].iloc[0])

Original text:

tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being

In [32]:
# Select the useful columns
df_clean = df[["clean_text", "category"]].copy()

# Save the cleaned dataset
df_clean.to_csv("data/bbc-text-cleaned.csv", index=False)

print("Cleaned dataset saved successfully.")


Cleaned dataset saved successfully.


## Preprocessing conclusion

- The texts were cleaned and normalized (lowercasing, removal of digits and punctuation, lemmatization).
- Stop words and tokens of two characters or fewer were removed.
- The texts are now ready for TF-IDF vectorization.
- The `category` column was not used in this step; it is only carried over to the saved file.
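
As a rough illustration of what TF-IDF does with text in this cleaned form, the sketch below computes textbook weights (raw term count times `log(N / df)`) for three toy documents using only the standard library. The toy documents and the helper `tfidf` are made up for this example and are not part of the notebook. A library implementation such as scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its numbers would differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Toy documents shaped like the clean_text column (hypothetical content).
docs = [
    "viewer home theatre system",
    "tiger face newcastle cup",
    "viewer watch cup final",
]
for doc_weights in tfidf(docs):
    print({t: round(v, 3) for t, v in doc_weights.items()})
```

Terms that occur in fewer documents receive higher weights, and a term present in every document gets a weight of zero under this formula.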
