# Preprocessing Experiments: xLSTM Paper-Style Light vs Aggressive Cleaning

This notebook implements and compares two text preprocessing pipelines:

- **Light preprocessing (xLSTM-inspired)**: minimal normalization consistent with the xLSTM paper.
- **Aggressive preprocessing**: stronger normalization (legacy pipeline) aimed at reducing noise but potentially removing useful signals.

We export both processed datasets to `data/processed/` for downstream models (TF-IDF, Naive Bayes, Transformer).


## Why preprocessing?

Preprocessing can:
- reduce noise (URLs, emails, repeated chars, formatting artifacts),
- normalize lexical variants (lemmatization),
- reduce vocabulary size (stopword removal),
- improve generalization for classical models (TF-IDF / Naive Bayes).

However, aggressive cleaning may also remove predictive signals (e.g., slang, misspellings, emoji cues).
Therefore we compare **light vs aggressive** preprocessing as an ablation study.


In [3]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from pathlib import Path
from tqdm import tqdm

# NLP tools for light preprocessing
import spacy
from nltk.corpus import stopwords

# Import our preprocessing functions
import sys
sys.path.append("../src")
from preprocessing import preprocess_light_xlstm, preprocess  # adjust name if using preprocess_aggressive

# Initialize spaCy and stopword list
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

# Paths for saving processed datasets
processed_dir = Path("../data/processed")
processed_dir.mkdir(parents=True, exist_ok=True)


This cell loads the required libraries, initializes the spaCy model and the NLTK stopword list, and imports the two preprocessing functions:  
- `preprocess_light_xlstm()` implements the light pipeline: lowercasing, URL/email removal, number replacement (`<NUM>`), lemmatization, stopword removal, and whitespace normalization.  
- `preprocess()` (or `preprocess_aggressive()`, depending on the name used) implements the aggressive cleaning pipeline already implemented in `src/preprocessing.py`.

We also ensure that a `data/processed/` directory exists for saving the cleaned datasets.


In [None]:
# Load Civil Comments dataset
dataset = load_dataset("google/civil_comments")
train_ds = dataset["train"]
valid_ds = dataset["validation"]
test_ds  = dataset["test"]

# Convert to Pandas for easier manipulation
df_train = train_ds.to_pandas()
df_valid = valid_ds.to_pandas()
df_test  = test_ds.to_pandas()

# Show shapes
print("Train size:", df_train.shape)
print("Validation size:", df_valid.shape)
print("Test size:", df_test.shape)

# --- Subsample for development ---
SUBSAMPLE_FRAC = 0.1  # 10%

df_train_small = df_train.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()
df_valid_small = df_valid.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()
df_test_small  = df_test.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()

print(len(df_train_small), len(df_valid_small), len(df_test_small))



Train size: (1804874, 8)
Validation size: (97320, 8)
Test size: (97320, 8)
180487 9732 9732


## Target definition

Civil Comments labels are continuous scores in [0, 1].
For classification baselines, we convert them to binary targets using a threshold `THRESH`.

We will:
- create binary labels for each toxicity dimension,
- optionally create a global label `is_toxic = 1(toxicity > THRESH)` for single-label baselines.


In [None]:
THRESH = 0.5
label_cols = [
    "toxicity", "severe_toxicity", "obscene",
    "threat", "insult", "identity_attack"
]

for col in label_cols:
    df_train_small[col + "_bin"] = (df_train_small[col] > THRESH).astype(int)
    df_valid_small[col + "_bin"] = (df_valid_small[col] > THRESH).astype(int)
    df_test_small[col + "_bin"]  = (df_test_small[col] > THRESH).astype(int)
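The optional global label described above can be derived the same way. The following is a minimal sketch on a toy frame, not the notebook's own code: the frame `toy` and the column `any_label` are hypothetical names introduced only for illustration.

```python
import pandas as pd

THRESH = 0.5

# Toy scores standing in for two of the Civil Comments label columns.
toy = pd.DataFrame({
    "toxicity": [0.0, 0.5, 0.83],
    "insult":   [0.1, 0.7, 0.20],
})

# Global label: 1 when the toxicity score is strictly above the threshold.
toy["is_toxic"] = (toy["toxicity"] > THRESH).astype(int)

# Hypothetical alternative: 1 when any listed dimension exceeds the threshold.
toy["any_label"] = (toy[["toxicity", "insult"]] > THRESH).any(axis=1).astype(int)

print(toy)
```

Because the comparison is strict (`>`), a score of exactly `0.5` maps to 0. Whether the boundary should be inclusive is a design choice worth fixing once and reusing across all splits.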


## Light preprocessing (xLSTM-inspired)

Steps:
1. Lowercase
2. Remove URLs and emails
3. Replace numeric sequences with `<NUM>`
4. Lemmatization with spaCy
5. Stopword removal with NLTK
6. Whitespace normalization
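
The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the code in `src/preprocessing.py`: the regular expressions, the tiny stopword set `TOY_STOPWORDS`, and the per-token `lemmatize` hook are all assumptions. The real pipeline uses spaCy for lemmatization and the NLTK English stopword list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's list.
TOY_STOPWORDS = {"a", "an", "the", "is", "at", "to", "of", "and"}

def light_sketch(text, lemmatize=None, stopwords=TOY_STOPWORDS):
    """Approximate the six light-preprocessing steps on one string."""
    text = text.lower()                      # 1. lowercase
    text = URL_RE.sub(" ", text)             # 2. remove URLs ...
    text = EMAIL_RE.sub(" ", text)           #    ... and emails
    text = NUM_RE.sub(" <NUM> ", text)       # 3. numbers -> <NUM>
    tokens = text.split()
    if lemmatize is not None:                # 4. lemmatization (optional hook)
        tokens = [lemmatize(tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in stopwords]  # 5. stopwords
    return " ".join(tokens)                  # 6. whitespace normalization

print(light_sketch("Check https://example.com and email bob@example.com about the 42 cats"))
```

Lowercasing runs before the number replacement so that the `<NUM>` placeholder keeps its casing.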


Advantages:

- preserves sentence structure and many important signals (e.g. obfuscated insults, writing style)

- often better suited for Transformers and models capable of learning complex patterns

- lower risk of over-cleaning the text

Drawbacks:

- larger vocabulary, which can lead to higher sparsity for TF-IDF or Naive Bayes models

- retains more noise (typos, repetitions, emojis, etc.)

## Aggressive preprocessing (legacy)

This pipeline applies stronger normalization (e.g., slang expansion, emoji conversion, spelling correction).
It may reduce noise further but can also discard useful information.

We use it as a comparison baseline to quantify how much cleaning is beneficial.


Implemented steps:

- lowercasing

- removal of HTML tags and HTML entities

- removal of URLs and email addresses

- Unicode and accent normalization

- removal of punctuation and special characters

- normalization of excessive character repetitions (e.g. sooooo → soo)

- stopword removal

- lemmatization
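
Most of the steps listed above can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than the legacy `preprocess()` function: the regular expressions and the name `aggressive_sketch` are hypothetical, and stopword removal and lemmatization are left out because they depend on NLTK and spaCy.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times

def aggressive_sketch(text):
    """Approximate the aggressive cleaning steps on one string."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)          # strip HTML tags
    text = html.unescape(text)            # decode HTML entities
    text = URL_RE.sub(" ", text)          # remove URLs
    text = EMAIL_RE.sub(" ", text)        # remove email addresses
    # Accent normalization: decompose, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = NON_ALNUM_RE.sub(" ", text)    # drop punctuation and special characters
    text = REPEAT_RE.sub(r"\1\1", text)   # sooooo -> soo
    return " ".join(text.split())         # collapse whitespace

print(aggressive_sketch("<p>Caf&eacute; is sooooo good!!!</p>"))
```

Collapsing repetitions to two characters rather than one keeps legitimate double letters such as "good" intact.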

Advantages:

- strongly reduces vocabulary size

- beneficial for Naive Bayes and TF-IDF models by reducing sparsity

- produces more standardized and stable text representations

Drawbacks:

- may remove subtle toxic signals such as creative spellings or obfuscations

- can overly normalize the original text and reduce stylistic information

## Apply preprocessing and compare

We compare:
- average length in tokens (whitespace-separated words),
- fraction of empty outputs,
- vocabulary size proxy,
- qualitative examples (same raw text processed by both pipelines).


In [None]:
from preprocessing import preprocess_light_xlstm, preprocess

def light(text):
    return preprocess_light_xlstm(text, nlp=nlp, stopwords=stop_words)

def aggressive(text):
    return preprocess(text)


In [8]:
# --- Sanity check: very small sample ---
N_SANITY = 100

df_sanity = df_train.sample(N_SANITY, random_state=42).copy()

df_sanity["text_light"] = df_sanity["text"].apply(light)
df_sanity["text_aggressive"] = df_sanity["text"].apply(aggressive)

df_sanity[["text", "text_light", "text_aggressive"]].head(5)


| | text | text_light | text_aggressive |
|---|---|---|---|
| 286892 | What a breathe of fresh air to have someone wh... | breathe fresh air someone embrace common sense... | breathe fresh air someone embrace common sense... |
| 419218 | Your jewish friends were the ones who told you... | jewish friend one tell zionist control canada ... | jewish friend one tell zionists control canada... |
| 1055330 | Possible collusion by Trump and his affiliates... | possible collusion trump affiliate debunk , st... | possible collusion trump affiliate debunk stat... |
| 1382764 | Exactly. We need a % of GDP spending cap at t... | exactly . need % gdp spend cap federal level (... | exactly need gap spending cap federal level ei... |
| 256049 | By your own comment, even if some of them vote... | comment , even vote ndp pq , trudeau demonstra... | comment even vote nip pm trudeau demonstrably ... |


In [11]:
#df_train["text_light"] = df_train["text"].apply(light)
#df_train["text_aggressive"] = df_train["text"].apply(aggressive)

df_train_small["text_light"] = df_train_small["text"].apply(light)
df_train_small["text_aggressive"] = df_train_small["text"].apply(aggressive)

KeyboardInterrupt: 

In [None]:
# Apply both pipelines to each split's own text column
df_test_small["text_light"] = df_test_small["text"].apply(light)
df_test_small["text_aggressive"] = df_test_small["text"].apply(aggressive)
df_valid_small["text_light"] = df_valid_small["text"].apply(light)
df_valid_small["text_aggressive"] = df_valid_small["text"].apply(aggressive)

## Comparison of preprocessing strategies

We compare the two pipelines using text length statistics and qualitative examples.


In [None]:
df_train_small["len_light"] = df_train_small["text_light"].str.split().apply(len)
df_train_small["len_aggr"]  = df_train_small["text_aggressive"].str.split().apply(len)

df_train_small[["len_light", "len_aggr"]].describe()
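
The length summary covers only one of the comparison criteria. The fraction of empty outputs and a vocabulary size proxy could be computed along these lines. This is a minimal sketch: the helper name `corpus_stats` is hypothetical, and tokens are assumed to be whitespace-separated, matching the `str.split()` call used for the length columns.

```python
def corpus_stats(texts):
    """Summarize a list of preprocessed strings."""
    token_lists = [text.split() for text in texts]
    n_docs = len(token_lists)
    if n_docs == 0:
        return {"mean_tokens": 0.0, "empty_frac": 0.0, "vocab_size": 0}
    n_tokens = sum(len(tokens) for tokens in token_lists)
    n_empty = sum(1 for tokens in token_lists if not tokens)
    vocab = {token for tokens in token_lists for token in tokens}
    return {
        "mean_tokens": n_tokens / n_docs,   # average length in tokens
        "empty_frac": n_empty / n_docs,     # share of outputs with no tokens
        "vocab_size": len(vocab),           # number of distinct tokens
    }

# Toy input; in the notebook one would pass e.g.
# df_train_small["text_light"].tolist() and df_train_small["text_aggressive"].tolist().
print(corpus_stats(["a b c", "", "a d"]))
```

Calling the helper once per pipeline on the same rows gives directly comparable numbers.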


## Qualitative comparison: raw vs light vs aggressive preprocessing

To better understand the impact of preprocessing, we display a concrete example
of a raw comment alongside its light (xLSTM-style) and aggressive cleaned versions.
This qualitative comparison highlights how different preprocessing choices
transform the same input text.


In [None]:
# Select one example comment
example_text = df_train.loc[0, "text"]

# Apply both preprocessing pipelines
example_light = light(example_text)
example_aggressive = aggressive(example_text)

print("RAW COMMENT:")
print(example_text)

print("\n" + "="*100 + "\n")

print("LIGHT PREPROCESSING (xLSTM-style):")
print(example_light)

print("\n" + "="*100 + "\n")

print("AGGRESSIVE PREPROCESSING:")
print(example_aggressive)


## Export processed datasets

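The export cell writes one CSV file per split:
- `train_cleaned.csv`
- `valid_cleaned.csv`
- `test_cleaned.csv`

Each file is meant to carry both the `text_light` and `text_aggressive` columns alongside the raw text and labels, so the two variants stay aligned row by row.
The files are stored in `data/processed/` and excluded from Git versioning.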


In [None]:
df_train_small.to_csv("../data/processed/train_cleaned.csv", index=False)
df_valid_small.to_csv("../data/processed/valid_cleaned.csv", index=False)
df_test_small.to_csv("../data/processed/test_cleaned.csv", index=False)


## Conclusion

This notebook compared light and aggressive preprocessing strategies.
The resulting datasets will be used to train and evaluate multiple models,
allowing us to assess the impact of preprocessing choices on performance.
