# Preprocessing Experiments: xLSTM Paper-Style Light vs Aggressive Cleaning

This notebook implements and compares two text preprocessing pipelines:

- **Light preprocessing (xLSTM-inspired)**: minimal normalization consistent with the xLSTM paper.
- **Aggressive preprocessing**: stronger normalization (legacy pipeline) aimed at reducing noise but potentially removing useful signals.

We export both processed datasets to `data/processed/` for downstream models (TF-IDF, Naive Bayes, Transformer).


## Why preprocessing?

Preprocessing can:
- reduce noise (URLs, emails, repeated chars, formatting artifacts),
- normalize lexical variants (lemmatization),
- reduce vocabulary size (stopword removal),
- improve generalization for classical models (TF-IDF / Naive Bayes).

However, aggressive cleaning may also remove predictive signals (e.g., slang, misspellings, emoji cues).
Therefore we compare **light vs aggressive** preprocessing as an ablation study.


In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from pathlib import Path
from tqdm import tqdm

# NLP tools for light preprocessing
import spacy
from nltk.corpus import stopwords

# Import our preprocessing functions
import sys
sys.path.append("../src")
# `preprocess` is the aggressive pipeline (adjust the name if it is exposed as `preprocess_aggressive`)
from preprocessing import (
    preprocess_light_xlstm,
    preprocess,
    preprocess_light_xlstm_from_doc,
    preprocess_from_doc,
)

# Initialize spaCy and stopword list
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

# Paths for saving processed datasets
processed_dir = Path("../data/processed")
processed_dir.mkdir(parents=True, exist_ok=True)




This cell loads the required libraries, initializes the spaCy model and the NLTK stopword list, and imports the preprocessing functions from `src/preprocessing.py`.  
- `preprocess_light_xlstm()` implements the light pipeline: lowercasing, URL/email removal, number replacement (`<NUM>`), lemmatization, stopword removal, and whitespace normalization.  
- `preprocess()` (or `preprocess_aggressive()`) implements the aggressive cleaning pipeline already coded in `src/preprocessing.py`.  
- The `_from_doc` variants take an already parsed spaCy `Doc` instead of raw text, so they can be combined with `nlp.pipe` for batch processing.

We also ensure that a `data/processed/` directory exists for saving the cleaned datasets.


In [2]:
# Load the Jigsaw Toxic Comment Classification Challenge dataset
# (alternative: google/civil_comments)
dataset = load_dataset("thesofakillers/jigsaw-toxic-comment-classification-challenge")

# Split into train (80%) / validation (10%) / test (10%)
train_test = dataset["train"].train_test_split(
    test_size=0.2,
    seed=42
)

valid_test = train_test["test"].train_test_split(
    test_size=0.5,
    seed=42
)

train_ds = train_test["train"]
valid_ds = valid_test["train"]
test_ds  = valid_test["test"]


# Convert to Pandas for easier manipulation
df_train = train_ds.to_pandas().rename(columns={"comment_text": "text"}).drop(columns=["id"])
df_valid = valid_ds.to_pandas().rename(columns={"comment_text": "text"}).drop(columns=["id"])
df_test  = test_ds.to_pandas().rename(columns={"comment_text": "text"}).drop(columns=["id"])

# Show shapes
print("Train size:", df_train.shape)
print("Validation size:", df_valid.shape)
print("Test size:", df_test.shape)

# --- Subsample for development ---
SUBSAMPLE_FRAC = 1.0  # 1.0 keeps every row (rows are still shuffled); lower it for faster iteration

df_train_small = df_train.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()
df_valid_small = df_valid.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()
df_test_small  = df_test.sample(frac=SUBSAMPLE_FRAC, random_state=42).copy()

print(len(df_train_small), len(df_valid_small), len(df_test_small))



Train size: (127656, 7)
Validation size: (15957, 7)
Test size: (15958, 7)
127656 15957 15958
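
The split sizes above can be checked by hand. The sketch below reproduces the arithmetic of the two-stage split, assuming the held-out portion is rounded up at each stage (an assumption that matches the printed shapes, not a statement about the library's internals). The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_total, test_frac=0.2, holdout_frac=0.5):
    """Return (train, valid, test) sizes for a two-stage split.

    Assumption: the held-out size is rounded up at each stage.
    """
    n_holdout = math.ceil(n_total * test_frac)    # 20% held out from training
    n_train = n_total - n_holdout
    n_test = math.ceil(n_holdout * holdout_frac)  # half of the held-out rows
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test

# Total number of rows across the three printed splits
n_total = 127656 + 15957 + 15958
print(split_sizes(n_total))  # (127656, 15957, 15958)
```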


## Light preprocessing (xLSTM-inspired)

Steps:
1. Lowercase
2. Remove URLs and emails
3. Replace numeric sequences with `<NUM>`
4. Lemmatization with spaCy
5. Stopword removal with NLTK
6. Whitespace normalization
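
The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the code in `src/preprocessing.py`: the regular expressions, the function name `light_clean_sketch`, and the tiny `DEMO_STOPWORDS` set are all assumptions, and lemmatization (step 4) is left out because it requires spaCy.

```python
import re

# Assumed patterns; the real pipeline may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Hypothetical stopword list for illustration; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "an", "is", "at", "to", "me", "and"}

def light_clean_sketch(text, stopwords=DEMO_STOPWORDS):
    text = text.lower()                  # 1. lowercase
    text = URL_RE.sub(" ", text)         # 2. remove URLs ...
    text = EMAIL_RE.sub(" ", text)       #    ... and emails
    text = NUM_RE.sub(" <NUM> ", text)   # 3. replace numeric sequences
    # 4. lemmatization omitted in this sketch (needs spaCy)
    tokens = [t for t in text.split() if t not in stopwords]  # 5. stopwords
    return " ".join(tokens)              # 6. whitespace normalization

print(light_clean_sketch("Email me at bob@example.com or visit https://example.org  in 2024!"))
# email or visit in <NUM> !
```

Lowercasing runs before number replacement so that the `<NUM>` placeholder keeps its uppercase form.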


Advantages:

- keeps punctuation and more of the original surface form, which preserves many important signals (e.g. obfuscated insults, writing style)

- often better suited for Transformers and models capable of learning complex patterns

- lower risk of over-cleaning the text

Drawbacks:

- larger vocabulary, which can lead to higher sparsity for TF-IDF or Naive Bayes models

- retains more noise (typos, repetitions, emojis, etc.)

## Aggressive preprocessing

This pipeline applies stronger normalization (e.g., slang expansion, emoji conversion, spelling correction).
It may reduce noise further but can also discard useful information.

We use it as a comparison baseline to quantify how much cleaning is beneficial.


Implemented steps:

- lowercasing

- removal of URLs and email addresses

- Unicode and accent normalization

- removal of HTML artifacts

- removal of punctuation and special characters

- normalization of excessive character repetitions (e.g. sooooo → soo)

- stopword removal

- lemmatization
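
As a rough illustration of the character-level steps listed above, the following standard-library sketch chains them together. It is an assumption about how such a pipeline could look, not the implementation in `src/preprocessing.py`; the function name `aggressive_clean_sketch` and the regular expressions are hypothetical, and stopword removal and lemmatization are omitted because they depend on NLTK and spaCy.

```python
import html
import re
import unicodedata

def aggressive_clean_sketch(text):
    text = html.unescape(text)                          # HTML entities, e.g. &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = text.lower()                                 # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # email addresses
    # Unicode/accent normalization: decompose, then drop combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # sooooo -> soo
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # punctuation and special characters
    return " ".join(text.split())                       # collapse whitespace

# "\u00ef" is the accented character in "naive" written with a diaeresis
print(aggressive_clean_sketch("<b>Sooooo</b> na\u00efve!!! See http://x.co &amp; reply"))
# soo naive see reply
```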

Advantages:

- strongly reduces vocabulary size

- beneficial for Naive Bayes and TF-IDF models by reducing sparsity

- produces more standardized and stable text representations

Drawbacks:

- may remove subtle toxic signals such as creative spellings or obfuscations

- can overly normalize the original text and reduce stylistic information

## Apply preprocessing and compare

We compare:
- average length in tokens (words),
- fraction of empty outputs,
- vocabulary size proxy,
- qualitative examples (same raw text processed by both pipelines).


In [3]:
# Thin wrappers that bind the shared spaCy model and stopword list
# (the preprocessing functions were already imported in the first cell)

def light(text):
    return preprocess_light_xlstm(text, nlp=nlp, stopwords=stop_words)

def aggressive(text):
    return preprocess(text, nlp=nlp, stopwords=stop_words)


In [4]:
# --- Sanity check: very small sample ---
N_SANITY = 100

df_sanity = df_train.sample(N_SANITY, random_state=42).copy()

df_sanity["text_light"] = df_sanity["text"].apply(light)
df_sanity["text_aggressive"] = df_sanity["text"].apply(aggressive)

df_sanity[["text", "text_light", "text_aggressive"]].head(5)


Unnamed: 0,text,text_light,text_aggressive
89061,That would be welcome. I don't seem to have an...,would welcome . I seem anything would pd . one...,would welcome I seem anything would pd one ima...
29562,", You taking money from /r/gamerghazi and cont...",", take money /r / gamerghazi continue edit gam...",take money r gamerghazi continue edit gamergat...
3615,"ROMANIAN-AMERICANS, AGAIN YOU!!!!\nWhat is aga...","romanian - americans , ! ! ! ! problem list ro...",romanian americans problem list romanian ameri...
27958,you're a tyrant \n\nyou're only deleting my ed...,tyrant delete edit angelique 's surname carrin...,tyrant delete edit angelique s surname carring...
118482,"""Hi Paul\n\nGiven your previous edits of this ...",""" hi paul give previous edit article I wary ne...",hi paul give previous edit article I wary neut...


In [5]:
### --- Apply light preprocessing to datasets --- (running time: about 25 minutes)
for split_name, df in [
    ("train", df_train_small),
    ("validation", df_valid_small),
    ("test", df_test_small),
]:
    # Parse the texts in parallel batches, then clean each parsed Doc
    docs = nlp.pipe(df["text"].tolist(), batch_size=512, n_process=4)

    df["text_light"] = [
        preprocess_light_xlstm_from_doc(doc, stop_words)
        for doc in docs
    ]

    print(f"Finished preprocessing {split_name} set.")

Finished preprocessing train set.
Finished preprocessing validation set.
Finished preprocessing test set.


In [None]:
### --- Apply aggressive preprocessing to datasets --- (running time: about 27 minutes)
for split_name, df in [
    ("train", df_train_small),
    ("validation", df_valid_small),
    ("test", df_test_small),
]:
    # Parse the texts in parallel batches, then clean each parsed Doc
    docs = nlp.pipe(df["text"].tolist(), batch_size=512, n_process=4)

    df["text_aggressive"] = [
        preprocess_from_doc(doc, stop_words)
        for doc in docs
    ]

    print(f"Finished preprocessing {split_name} set.")

Finished preprocessing train set.
Finished preprocessing validation set.
Finished preprocessing test set.


## Export processed datasets

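We export one Parquet file per split:
- `train_cleaned.parquet`
- `valid_cleaned.parquet`
- `test_cleaned.parquet`

Each file keeps the raw `text` alongside the `text_light` and `text_aggressive` columns, so both pipelines can be read from the same file. The files are stored in `data/processed/` and excluded from Git versioning.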


In [13]:
# Reuse the output directory created in the first cell
df_train_small.to_parquet(processed_dir / "train_cleaned.parquet", index=False)
df_valid_small.to_parquet(processed_dir / "valid_cleaned.parquet", index=False)
df_test_small.to_parquet(processed_dir / "test_cleaned.parquet", index=False)

## Comparison of preprocessing strategies

We compare the two pipelines using text length statistics and qualitative examples.


In [8]:
df_train_small["len_light"] = df_train_small["text_light"].str.split().apply(len)
df_train_small["len_aggr"]  = df_train_small["text_aggressive"].str.split().apply(len)

df_train_small[["len_light", "len_aggr"]].describe()


Unnamed: 0,len_light,len_aggr
count,127656.0,127656.0
mean,49.585801,37.934629
std,77.583008,58.908626
min,1.0,0.0
25%,13.0,10.0
50%,26.0,20.0
75%,54.0,41.0
max,4568.0,1381.0
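
The minimum of 0 in `len_aggr` shows that the aggressive pipeline empties at least one comment. The fraction of empty outputs and a vocabulary-size proxy are not computed in the table above; a minimal sketch of how they could be measured is given below. The helper `corpus_stats` is hypothetical and works on a plain list of strings.

```python
def corpus_stats(texts):
    """Summarize a list of cleaned texts (whitespace-tokenized)."""
    if not texts:
        raise ValueError("texts must contain at least one document")
    vocab = set()
    n_tokens = 0
    n_empty = 0
    for text in texts:
        tokens = text.split()
        n_tokens += len(tokens)
        n_empty += not tokens      # counts documents with no tokens left
        vocab.update(tokens)
    n_docs = len(texts)
    return {
        "mean_tokens": n_tokens / n_docs,
        "empty_frac": n_empty / n_docs,
        "vocab_size": len(vocab),  # number of distinct tokens
    }

demo = ["miss champion article", "", "weird champion"]
print(corpus_stats(demo))
```

In the notebook, it could be called as `corpus_stats(df_train_small["text_light"].tolist())` and likewise for `text_aggressive`.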


## Qualitative comparison: raw vs light vs aggressive preprocessing

To better understand the impact of preprocessing, we display a concrete example
of a raw comment alongside its light (xLSTM-style) and aggressive cleaned versions.
This qualitative comparison highlights how different preprocessing choices
transform the same input text.


In [9]:
# Select one example comment
example_text = df_train.loc[0, "text"]

# Apply both preprocessing pipelines
example_light = light(example_text)
example_aggressive = aggressive(example_text)

print("RAW COMMENT:")
print(example_text)

print("\n" + "="*100 + "\n")

print("LIGHT PREPROCESSING (xLSTM-style):")
print(example_light)

print("\n" + "="*100 + "\n")

print("AGGRESSIVE PREPROCESSING:")
print(example_aggressive)


RAW COMMENT:
Missing Champions 

This article should have ALL the TNA World Heavyweight Champions, from Ken Shamrock. Weird that it doesn't.


LIGHT PREPROCESSING (xLSTM-style):
miss champion article tna world heavyweight champion , ken shamrock . weird .


AGGRESSIVE PREPROCESSING:
miss champion article tna world heavyweight champion ken shamrock weird
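
To make the difference between the two outputs explicit, the tokens that survive the light pipeline but not the aggressive one can be counted. The sketch below uses a multiset difference; the helper name `removed_tokens` is hypothetical, and the two strings are copied from the output above.

```python
from collections import Counter

def removed_tokens(light_text, aggressive_text):
    """Tokens present in the light output but missing from the aggressive one."""
    light_counts = Counter(light_text.split())
    aggr_counts = Counter(aggressive_text.split())
    return light_counts - aggr_counts  # keeps only positive count differences

light_out = "miss champion article tna world heavyweight champion , ken shamrock . weird ."
aggr_out = "miss champion article tna world heavyweight champion ken shamrock weird"
print(removed_tokens(light_out, aggr_out))
```

For this comment, the only difference is punctuation: one comma and two periods.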


## Conclusion

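This notebook compared light and aggressive preprocessing strategies.
On the training split, the light pipeline keeps about 50 tokens per comment on average, versus about 38 for the aggressive pipeline, which also reduces at least one comment to an empty string.
The resulting datasets will be used to train and evaluate multiple models,
allowing us to assess the impact of preprocessing choices on performance.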
