# Notebook 1 - Data Cleaning & Noise Removal
**LLM Data Processing Pipeline · Stage 1 of 3**

This notebook covers the foundational steps that make raw text usable for model training:
- Removing duplicates
- Removing empty and whitespace-only entries
- Stripping noise (ads, symbols, corrupted text)
- Correcting spelling errors


## 1.1 Setup & Sample Data

We start with a small corpus of raw text that mimics what you'd scrape from the web —
duplicates, noise, typos and all.


In [1]:
import re
import pandas as pd

# -------------------------------------------------------------------
# Sample raw corpus — intentionally messy
# -------------------------------------------------------------------
raw_texts = [
    "The quikc brown fox jumpd over the lazy dog.",
    "LLMs learn from massive volumes of text data.",
    "CLICK HERE!!! Buy now >>> ad.example.com <<<",
    "LLMs learn from massive volumes of text data.",   # duplicate
    "Neural networks are the backbone of modern AI.",
    "##$$$random symbols%%^^&&** corrupted entry ###",
    "Transformers changed natural language processing forever.",
    "",                                                 # empty string
    "   ",                                              # whitespace only
    "Deep learning models require large datasets for training.",
    "Visit our site: spam-site.xyz — FREE OFFERS!!!",
]

df = pd.DataFrame({"text": raw_texts})
print(f"Raw corpus size: {len(df)} entries")
df


Raw corpus size: 11 entries


Unnamed: 0,text
0,The quikc brown fox jumpd over the lazy dog.
1,LLMs learn from massive volumes of text data.
2,CLICK HERE!!! Buy now >>> ad.example.com <<<
3,LLMs learn from massive volumes of text data.
4,Neural networks are the backbone of modern AI.
5,##$$$random symbols%%^^&&** corrupted entry ###
6,Transformers changed natural language processi...
7,
8,
9,Deep learning models require large datasets fo...
10,Visit our site: spam-site.xyz — FREE OFFERS!!!


## 1.2 Remove Duplicates

Duplicates inflate the apparent dataset size and can cause the model to over-weight
certain patterns — a subtle but real training problem.


In [2]:
# -------------------------------------------------------------------
# Drop exact duplicate rows and reset the index
# -------------------------------------------------------------------
df_deduped = df.drop_duplicates(subset="text").reset_index(drop=True)

removed = len(df) - len(df_deduped)
print(f"Removed {removed} duplicate(s). Remaining: {len(df_deduped)} entries")
df_deduped


Removed 1 duplicate(s). Remaining: 10 entries


Unnamed: 0,text
0,The quikc brown fox jumpd over the lazy dog.
1,LLMs learn from massive volumes of text data.
2,CLICK HERE!!! Buy now >>> ad.example.com <<<
3,Neural networks are the backbone of modern AI.
4,##$$$random symbols%%^^&&** corrupted entry ###
5,Transformers changed natural language processi...
6,
7,
8,Deep learning models require large datasets fo...
9,Visit our site: spam-site.xyz — FREE OFFERS!!!
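
Exact matching only catches byte-identical copies. The sketch below shows one possible way to also catch copies that differ only in letter case or whitespace. It is an illustration under stated assumptions, not part of the notebook: `dedup_key` and `drop_normalized_duplicates` are hypothetical helper names, and the choice of normalization (lowercasing and collapsing whitespace) is an assumption that may be too aggressive for some corpora.

```python
import re

def dedup_key(text: str) -> str:
    """Build a comparison key: collapse whitespace runs, trim, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()

def drop_normalized_duplicates(texts):
    """Keep the first occurrence of each text, comparing by dedup_key."""
    seen = set()
    kept = []
    for text in texts:
        key = dedup_key(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)  # keep the original text, not the key
    return kept

sample = [
    "LLMs learn from massive volumes of text data.",
    "llms  learn from massive volumes of text data. ",  # differs in case and spacing
    "Neural networks are the backbone of modern AI.",
]
unique = drop_normalized_duplicates(sample)
print(unique)
```

With pandas, the same key could be applied as `df[~df["text"].map(dedup_key).duplicated()]`, which keeps the first row for each key.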


## 1.3 Remove Empty & Whitespace-Only Entries

Blank rows contribute nothing to learning and can cause tokenizer errors.


In [3]:
# -------------------------------------------------------------------
# Strip whitespace, then filter rows that are empty after stripping
# -------------------------------------------------------------------
df_deduped["text"] = df_deduped["text"].str.strip()
df_clean = df_deduped[df_deduped["text"].str.len() > 0].reset_index(drop=True)

print(f"After removing empties: {len(df_clean)} entries")
df_clean


After removing empties: 8 entries


Unnamed: 0,text
0,The quikc brown fox jumpd over the lazy dog.
1,LLMs learn from massive volumes of text data.
2,CLICK HERE!!! Buy now >>> ad.example.com <<<
3,Neural networks are the backbone of modern AI.
4,##$$$random symbols%%^^&&** corrupted entry ###
5,Transformers changed natural language processi...
6,Deep learning models require large datasets fo...
7,Visit our site: spam-site.xyz — FREE OFFERS!!!


## 1.4 Noise Removal

Noise here means advertisements, random symbols, corrupted text, and spam URLs.
We flag rows with regex rules and drop every row that matches any rule. In production
you'd combine regex with a trained classifier for higher precision.


In [4]:
# -------------------------------------------------------------------
# Define noise patterns
# -------------------------------------------------------------------
NOISE_PATTERNS = [
    r"https?://\S+",                               # URLs
    r"\w+\.(?:xyz|com|net|org)\b",                 # bare domain names
    r"(?:CLICK HERE|BUY NOW|FREE OFFER|>>>|<<<)",  # ad phrases
    r"[#$%^&*]{3,}",                               # clusters of special characters
    r"!{3,}",                                      # excessive exclamation marks
]

def is_noisy(text):
    """Return True if any noise pattern matches (case-insensitive)."""
    return any(re.search(pat, text, re.IGNORECASE) for pat in NOISE_PATTERNS)

df_clean["is_noise"] = df_clean["text"].apply(is_noisy)

print("Flagged as noise:")
print(df_clean[df_clean["is_noise"]]["text"].tolist())

df_no_noise = df_clean[~df_clean["is_noise"]].drop(columns="is_noise").reset_index(drop=True)
print(f"\nAfter noise removal: {len(df_no_noise)} entries")


Flagged as noise:
['CLICK HERE!!! Buy now >>> ad.example.com <<<', '##$$$random symbols%%^^&&** corrupted entry ###', 'Visit our site: spam-site.xyz — FREE OFFERS!!!']

After noise removal: 5 entries
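
When tuning regex rules it helps to see which rule fired for each row, not just a True/False flag. The following is a minimal sketch of that idea, assuming the same kinds of patterns as above. The rule names and the `matched_rules` helper are hypothetical labels introduced for illustration.

```python
import re

# Rule names are illustrative labels, not part of the original notebook.
NAMED_NOISE_RULES = {
    "url": re.compile(r"https?://\S+", re.IGNORECASE),
    "bare_domain": re.compile(r"\w+\.(?:xyz|com|net|org)\b", re.IGNORECASE),
    "ad_phrase": re.compile(r"CLICK HERE|BUY NOW|FREE OFFER|>>>|<<<", re.IGNORECASE),
    "symbol_cluster": re.compile(r"[#$%^&*]{3,}"),
    "exclamation_run": re.compile(r"!{3,}"),
}

def matched_rules(text):
    """Return the names of all rules that match, in definition order."""
    return [name for name, pattern in NAMED_NOISE_RULES.items() if pattern.search(text)]

examples = [
    "CLICK HERE!!! Buy now >>> ad.example.com <<<",
    "##$$$random symbols%%^^&&** corrupted entry ###",
    "Neural networks are the backbone of modern AI.",
]
for text in examples:
    print(matched_rules(text), "<-", text)
```

A row with an empty list is kept. Rows that trigger only a single weak rule are good candidates for manual review before a rule is trusted at scale.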


## 1.5 Spelling Correction

Typos fragment the vocabulary — "quikc" and "quick" become separate tokens,
wasting model capacity. We correct common misspellings using `pyspellchecker`.

> **Note:** Spell-checking is expensive at scale. In practice, sub-word tokenizers
> like BPE partially mitigate the impact of typos, so this step is dataset-dependent.


In [5]:
# -------------------------------------------------------------------
# Spell-check word by word, correcting known misspellings
# Install: pip install pyspellchecker
# -------------------------------------------------------------------
try:
    from spellchecker import SpellChecker
    spell = SpellChecker()

    def correct_spelling(text):
        words = text.split()
        corrected = []
        for word in words:
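            # Only touch purely alphabetic tokens; tokens with digits or attached
            # punctuation (e.g. "dog.") are left as-is. Acronyms such as "LLMs"
            # still pass this check and may be altered.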
            if word.isalpha():
                correction = spell.correction(word)
                corrected.append(correction if correction else word)
            else:
                corrected.append(word)
        return " ".join(corrected)

    df_no_noise["text_corrected"] = df_no_noise["text"].apply(correct_spelling)

    # Show changes
    changed = df_no_noise[df_no_noise["text"] != df_no_noise["text_corrected"]]
    print("Spelling corrections made:")
    for _, row in changed.iterrows():
        print(f"  BEFORE: {row['text']}")
        print(f"  AFTER : {row['text_corrected']}")

    df_final = df_no_noise[["text_corrected"]].rename(columns={"text_corrected": "text"})

except ImportError:
    print("pyspellchecker not installed — skipping spell correction.")
    print("Run: pip install pyspellchecker")
    df_final = df_no_noise[["text"]]

df_final = df_final.reset_index(drop=True)
df_final


Spelling corrections made:
  BEFORE: The quikc brown fox jumpd over the lazy dog.
  AFTER : The quick brown fox jump over the lazy dog.
  BEFORE: LLMs learn from massive volumes of text data.
  AFTER : alms learn from massive volumes of text data.


Unnamed: 0,text
0,The quick brown fox jump over the lazy dog.
1,alms learn from massive volumes of text data.
2,Neural networks are the backbone of modern AI.
3,Transformers changed natural language processi...
4,Deep learning models require large datasets fo...
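
The output above shows two risks of word-level correction: "LLMs" was rewritten to "alms", and "jumpd" became "jump" rather than "jumped". The sketch below shows one possible guard, under stated assumptions: tokens containing an uppercase letter after the first character are treated as acronyms or names and skipped, and punctuation attached to a word is set aside and restored. `correct_token`, `correct_text`, and `toy_correct` are hypothetical names. `toy_correct` is a stand-in that mirrors the suggestions seen in the output above, so the example runs without `pyspellchecker`; in the notebook you would pass `spell.correction` instead.

```python
import re

# Leading punctuation, a purely alphabetic core, trailing punctuation.
TOKEN_RE = re.compile(r"^(\W*)([A-Za-z]+)(\W*)$")

def correct_token(token, correct):
    """Correct one token using `correct`, a callable returning a word or None."""
    m = TOKEN_RE.match(token)
    if not m:
        return token  # digits, hyphenated words, etc. are left untouched
    lead, core, trail = m.groups()
    # Uppercase after the first letter suggests an acronym or name (LLMs, AI).
    if any(ch.isupper() for ch in core[1:]):
        return token
    fixed = correct(core.lower()) or core
    if core[0].isupper():
        fixed = fixed.capitalize()  # restore sentence-initial capital
    return f"{lead}{fixed}{trail}"

def correct_text(text, correct):
    """Apply correct_token to every whitespace-separated token."""
    return " ".join(correct_token(tok, correct) for tok in text.split())

# Stand-in for spell.correction; mirrors the suggestions in the output above.
TOY_FIXES = {"quikc": "quick", "jumpd": "jump", "llms": "alms"}

def toy_correct(word):
    return TOY_FIXES.get(word)

print(correct_text("The quikc brown fox jumpd over the lazy dog.", toy_correct))
print(correct_text("LLMs learn from massive volumes of text data.", toy_correct))
```

The guard protects "LLMs" but cannot fix a wrong suggestion such as "jump", so reviewing the before/after pairs remains necessary.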


## 1.6 Summary

| Step | Input rows | Output rows | Removed |
|------|-----------|-------------|---------|
| Raw data | 11 | 11 | none |
| Deduplication | 11 | 10 | 1 duplicate |
| Empty removal | 10 | 8 | 2 blanks |
| Noise removal | 8 | 5 | 3 noisy rows |
| Spell correction | 5 | 5 | none (text edited in place) |

The cleaned corpus is ready to be passed to **Notebook 2 - Text Normalization & Handling Missing Data**. Review the spelling step's output first: in this run it changed "LLMs" to "alms".
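
Row counts like those in the table are easy to get wrong when typed by hand. As a rough sketch, the counts can be generated by the pipeline itself. The version below uses plain Python lists rather than pandas so it runs on its own; `run_stages` and the stage functions are hypothetical helpers, and the single combined `NOISE` regex is an assumption that merges the notebook's patterns.

```python
import re

raw_texts = [
    "The quikc brown fox jumpd over the lazy dog.",
    "LLMs learn from massive volumes of text data.",
    "CLICK HERE!!! Buy now >>> ad.example.com <<<",
    "LLMs learn from massive volumes of text data.",
    "Neural networks are the backbone of modern AI.",
    "##$$$random symbols%%^^&&** corrupted entry ###",
    "Transformers changed natural language processing forever.",
    "",
    "   ",
    "Deep learning models require large datasets for training.",
    "Visit our site: spam-site.xyz — FREE OFFERS!!!",
]

# All noise rules merged into one case-insensitive pattern.
NOISE = re.compile(
    r"https?://\S+|\w+\.(?:xyz|com|net|org)\b|CLICK HERE|BUY NOW|FREE OFFER"
    r"|>>>|<<<|[#$%^&*]{3,}|!{3,}",
    re.IGNORECASE,
)

def dedupe(texts):
    return list(dict.fromkeys(texts))  # keeps first occurrence, preserves order

def drop_empty(texts):
    return [t.strip() for t in texts if t.strip()]

def drop_noise(texts):
    return [t for t in texts if not NOISE.search(t)]

def run_stages(texts, stages):
    """Run each stage in order and record (name, rows_in, rows_out)."""
    report = []
    for name, stage in stages:
        before = len(texts)
        texts = stage(texts)
        report.append((name, before, len(texts)))
    return texts, report

cleaned, report = run_stages(
    raw_texts,
    [("Deduplication", dedupe), ("Empty removal", drop_empty), ("Noise removal", drop_noise)],
)
for name, before, after in report:
    print(f"| {name} | {before} | {after} | {before - after} |")
```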
