# Notebook 3: Data Cleaning

Raw text is full of 'garbage' like emojis, URLs, and extra spaces. In this notebook, we perform **"Digital Janitor Work"** to prepare the data for our AI.

![Text Cleaning](assets/data_cleaning_visual.png)

### Our Cleaning List:
1. **Cleaning**: Lowercase the text and remove emojis, punctuation, and special symbols.
2. **Lemmatizing**: Reduce each word to its base form, so 'running' and 'ran' both become 'run'.
3. **Stopwords**: Remove boring filler words like 'the', 'is', and 'at' so the AI can focus on important words like 'broken' or 'amazing'.
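
To make steps 2 and 3 concrete, here is a minimal sketch that uses tiny hand-written lookup tables. The names `TOY_LEMMAS`, `TOY_STOPWORDS`, `lemmatize_sketch`, and `remove_stopwords_sketch` are hypothetical and are not the functions in `src/preprocessing`. A real pipeline would rely on a full lemmatizer and stopword list from an NLP library.

```python
# Toy lookup tables, for illustration only (assumed, not taken from src/preprocessing).
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run"}
TOY_STOPWORDS = {"the", "is", "at"}


def lemmatize_sketch(text: str) -> str:
    """Replace each word with its base form when the toy table knows it."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in text.split())


def remove_stopwords_sketch(text: str) -> str:
    """Drop every word that appears in the toy stopword set."""
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)


sample = "the dog ran and the cat is running at night"
lemmatized = lemmatize_sketch(sample)
print(lemmatized)                           # the dog run and the cat is run at night
print(remove_stopwords_sketch(lemmatized))  # dog run and cat run night
```

Lemmatizing before removing stopwords mirrors the order used in the mass-processing cell of this notebook.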

In [1]:
import pandas as pd
import sys
from pathlib import Path

# Import our custom cleaning tools from the 'src' folder
sys.path.append("..")
from src.preprocessing import clean_text, lemmatize, remove_stopwords

input_file = Path("../data/raw/raw_reviews.csv")
output_file = Path("../data/processed/clean_reviews.csv")

print("Cleaning tools are plugged in and ready!")

Cleaning tools are plugged in and ready!


## 1. The "Before & After"

Let's see the cleaning process in action on a single 'noisy' review.

In [2]:
dirty_example = "I LOVE this shop!!! 😍 Visit http://shopease.com"
clean_example = clean_text(dirty_example)

print(f"Dirty: {dirty_example}")
print(f"Clean: {clean_example}")

Dirty: I LOVE this shop!!! 😍 Visit http://shopease.com
Clean: i love this shop visit httpshopeasecom
Clean: i love this shop visit httpshopeasecom
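
Notice that the link was not dropped in the output above: its punctuation was stripped, which left the token `httpshopeasecom` behind. If links should disappear entirely, one option is to remove them before stripping punctuation. The sketch below is a hypothetical variant (`clean_text_sketch` and `URL_PATTERN` are assumed names, not the project's `clean_text`) built only on Python's standard `re` module.

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace (assumed pattern).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text_sketch(text: str) -> str:
    """Lowercase, remove URLs, strip symbols, and collapse extra spaces."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)     # remove links first, while "://" is still intact
    text = re.sub(r"[^\w\s]", " ", text)  # keep only letters, digits, underscores, whitespace
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


print(clean_text_sketch("I LOVE this shop!!! 😍 Visit http://shopease.com"))
# i love this shop visit
```

Because `\w` is Unicode-aware in Python 3, accented letters such as 'è' survive this step, while emojis and punctuation are removed.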


## 2. Mass Processing

Now we apply this to all 1,000 reviews at once. We also turn each rating into a **'Label'** so the AI knows the answer key: 0 for Sad (ratings 1 and 2), 1 for Okay (rating 3), and 2 for Happy (every other rating).

In [3]:
df = pd.read_csv(input_file)

print("Working... (Cleaning 1,000 reviews)")
df['clean_text'] = df['review'].apply(clean_text)
df['lemma_text'] = df['clean_text'].apply(lemmatize)
df['final_text'] = df['lemma_text'].apply(remove_stopwords)

# Create the 'Answer Key' for the AI
df['label'] = df['rating'].apply(lambda r: 0 if r in (1, 2) else (1 if r == 3 else 2))

df[['review', 'final_text', 'label']].head()

Working... (Cleaning 1,000 reviews)


|   | review | final_text | label |
|---|--------|------------|-------|
| 0 | ¡Me encanta este producto! lo mejor que he com... | I encanta este producto lo mejor que comprado | 1 |
| 1 | Très bon produit, fonctionne parfaitement. MER... | très bon produit fonctionne parfaitement merci | 2 |
| 2 | Lieferung war langsam, aber Produkt ist GUT. | lieferung war langsam aber produkt ist gut | 2 |
| 3 | Très bon produit, fonctionne parfaitement. MER... | très bon produit fonctionne parfaitement merci | 2 |
| 4 | Lieferung war langsam, aber Produkt ist GUT. | lieferung war langsam aber produkt ist gut | 1 |
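
The rating-to-label rule in the cell above sends every rating other than 1, 2, or 3 to label 2, and that includes missing or out-of-range values. A stricter alternative is sketched below, assuming ratings use a 1 to 5 scale (the notebook does not state the scale). `LABEL_BY_RATING` and `rating_to_label` are hypothetical names.

```python
# Assumed 1-5 rating scale: 1-2 -> 0 (Sad), 3 -> 1 (Okay), 4-5 -> 2 (Happy).
LABEL_BY_RATING = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}


def rating_to_label(rating) -> int:
    """Map a rating to its label, failing loudly on anything unexpected."""
    try:
        return LABEL_BY_RATING[rating]
    except KeyError:
        raise ValueError(f"Unexpected rating: {rating!r}") from None


print([rating_to_label(r) for r in (1, 2, 3, 4, 5)])  # [0, 0, 1, 2, 2]
```

It could be applied the same way as the lambda, for example `df['label'] = df['rating'].apply(rating_to_label)`, so that a bad rating stops the run instead of quietly becoming 'Happy'.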


## Summary
The data is now 'standardized': each review has been cleaned and lowercased, lemmatized, and stripped of stopwords, and every row carries a numeric label. Time to build the models!
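
One step the cells above leave implicit is writing the result to disk: the first cell defines `output_file`, but no cell shown here saves to it. A minimal sketch of that step follows, assuming the next notebook only needs the columns passed in. `save_processed` is a hypothetical helper, not part of `src`.

```python
from pathlib import Path

import pandas as pd


def save_processed(df: pd.DataFrame, path: Path) -> None:
    """Write the processed reviews to CSV, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so reloading it
    # does not produce an extra 'Unnamed: 0' column.
    df.to_csv(path, index=False)
```

In this notebook the call would look like `save_processed(df[['review', 'final_text', 'label']], output_file)`.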