# 📓 Notebook 2: Normalization, Missing Data & Bias Removal
**LLM Data Processing Pipeline · Stage 2 of 3**

Picks up after cleaning. This notebook covers:
- Text normalization (lowercase, punctuation, contractions)
- Detecting and handling missing / incomplete data
- Identifying and mitigating dataset bias


## 2.1 Setup & Cleaned Input

We simulate the output from Notebook 1: a small, clean corpus ready for normalization.


In [1]:
import re
import pandas as pd

# -------------------------------------------------------------------
# Simulated output from Notebook 1 (cleaned corpus)
# -------------------------------------------------------------------
cleaned_texts = [
    "The quick brown fox jumped over the lazy dog.",
    "LLMs learn from massive volumes of text data.",
    "Neural networks are the backbone of modern AI.",
    "Transformers changed natural language processing forever.",
    "Deep learning models require large datasets for training.",
    "Don't underestimate the importance of data quality!",
    "I can't believe how fast AI has progressed.",
    None,                   # simulated missing entry
    "It's a great time to be working in machine learning.",
]

df = pd.DataFrame({"text": cleaned_texts})
print(f"Input corpus size: {len(df)} entries")
df


Input corpus size: 9 entries


|   | text |
|---|------|
| 0 | The quick brown fox jumped over the lazy dog. |
| 1 | LLMs learn from massive volumes of text data. |
| 2 | Neural networks are the backbone of modern AI. |
| 3 | Transformers changed natural language processi... |
| 4 | Deep learning models require large datasets fo... |
| 5 | Don't underestimate the importance of data qua... |
| 6 | I can't believe how fast AI has progressed. |
| 7 | None |
| 8 | It's a great time to be working in machine lea... |


## 2.2 Handling Missing & Incomplete Data

Missing values (`None`, `NaN`) and very short fragments must be detected
and either corrected or removed before normalization. Otherwise, functions
applied row by row (such as `re.sub`) raise a `TypeError` on non-string input.


In [2]:
# -------------------------------------------------------------------
# Step 1: Identify missing values
# -------------------------------------------------------------------
print("Missing values per column:")
print(df.isnull().sum())

# Step 2: Remove rows with missing text (can't impute free text reliably)
df_no_missing = df.dropna(subset=["text"]).reset_index(drop=True)
print(f"\nDropped {len(df) - len(df_no_missing)} missing row(s).")

# Step 3: Flag and remove very short fragments (< 10 chars after stripping)
df_no_missing["length"] = df_no_missing["text"].str.strip().str.len()
too_short = df_no_missing[df_no_missing["length"] < 10]
if not too_short.empty:
    print(f"Removed {len(too_short)} fragment(s) that are too short.")

df_complete = df_no_missing[df_no_missing["length"] >= 10].drop(columns="length").reset_index(drop=True)
print(f"Complete corpus: {len(df_complete)} entries")
df_complete


Missing values per column:
text    1
dtype: int64

Dropped 1 missing row(s).
Complete corpus: 8 entries


|   | text |
|---|------|
| 0 | The quick brown fox jumped over the lazy dog. |
| 1 | LLMs learn from massive volumes of text data. |
| 2 | Neural networks are the backbone of modern AI. |
| 3 | Transformers changed natural language processi... |
| 4 | Deep learning models require large datasets fo... |
| 5 | Don't underestimate the importance of data qua... |
| 6 | I can't believe how fast AI has progressed. |
| 7 | It's a great time to be working in machine lea... |
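
The `dropna` call above only catches `None`/`NaN`; empty or whitespace-only strings pass that check and are removed only by the later length filter. As a rough sketch, a single predicate can cover all three cases at once. The helper name `is_usable` and its `min_chars` parameter are hypothetical and not part of the notebook; the default of 10 simply mirrors the cutoff used above.

```python
import math

def is_usable(value, min_chars=10):
    """Return True if value is a string with at least min_chars characters after stripping.

    Hypothetical helper for illustration; the threshold mirrors the
    10-character cutoff used in the cell above.
    """
    if value is None:
        return False
    # NaN is a float, so it has to be caught before any string method is called.
    if isinstance(value, float) and math.isnan(value):
        return False
    if not isinstance(value, str):
        return False
    return len(value.strip()) >= min_chars

samples = [
    "Deep learning models require large datasets.",
    None,
    float("nan"),
    "   ",          # whitespace only
    "Too short",    # 9 characters
]
kept = [s for s in samples if is_usable(s)]
print(kept)
```

On a DataFrame, the same predicate could be used as a boolean mask, for example `df[df["text"].apply(is_usable)]`.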


## 2.3 Expanding Contractions

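Contractions like "don't" map to different token sequences than "do not".
Expanding them reduces vocabulary fragmentation. Some contractions are ambiguous
("it's" can mean "it is" or "it has"); the dictionary below uses the more common reading.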


In [3]:
# -------------------------------------------------------------------
# Contraction expansion dictionary (extend as needed)
# -------------------------------------------------------------------
CONTRACTIONS = {
    r"can't":    "cannot",
    r"won't":    "will not",
    r"don't":    "do not",
    r"doesn't":  "does not",
    r"didn't":   "did not",
    r"it's":     "it is",
    r"i'm":      "i am",
    r"i've":     "i have",
    r"we're":    "we are",
    r"they're":  "they are",
    r"there's":  "there is",
    r"that's":   "that is",
}

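def expand_contractions(text):
    """Replace each known contraction with its expanded form (case-insensitive)."""
    for pattern, replacement in CONTRACTIONS.items():
        # \b anchors keep a pattern from matching inside a longer word
        # (e.g. "it's" inside "merit's"). Replacements are lowercase, so
        # "Don't" becomes "do not"; Section 2.4 lowercases everything anyway.
        text = re.sub(rf"\b{pattern}\b", replacement, text, flags=re.IGNORECASE)
    return text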

df_complete["text"] = df_complete["text"].apply(expand_contractions)
print("After contraction expansion:")
df_complete


After contraction expansion:


|   | text |
|---|------|
| 0 | The quick brown fox jumped over the lazy dog. |
| 1 | LLMs learn from massive volumes of text data. |
| 2 | Neural networks are the backbone of modern AI. |
| 3 | Transformers changed natural language processi... |
| 4 | Deep learning models require large datasets fo... |
| 5 | do not underestimate the importance of data qu... |
| 6 | I cannot believe how fast AI has progressed. |
| 7 | it is a great time to be working in machine le... |
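
The output above shows that sentence-initial capitals are lost ("Don't" becomes "do not"). That is harmless here because lowercasing follows, but a pipeline that skips lowercasing may want to keep them. The sketch below is one possible variant, not the notebook's method: it compiles a single alternation, maps the typographic apostrophe (U+2019) to the ASCII one, and re-capitalizes the expansion. `CONTRACTION_MAP` and `expand_contractions_keep_case` are hypothetical names, and the map is only a subset of the dictionary above.

```python
import re

# Small subset of the dictionary above, keyed by lowercase contraction.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}

# One alternation with word boundaries on both sides, longest keys first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_keep_case(text):
    # Map the typographic apostrophe (U+2019) to the ASCII one first.
    text = text.replace("\u2019", "'")

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Re-capitalize when the original contraction started with a capital.
        if found[0].isupper():
            return expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)

print(expand_contractions_keep_case("Don\u2019t worry, it's fine."))
```

A single compiled pattern also scans each string once instead of once per dictionary entry, which may matter on larger corpora.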


## 2.4 Lowercasing

Case folding ensures "AI", "Ai", and "ai" are treated as the same token.
(Skip this if you need case-sensitive information, e.g., for NER tasks.)


In [4]:
# -------------------------------------------------------------------
# Lowercase all text
# -------------------------------------------------------------------
df_complete["text"] = df_complete["text"].str.lower()
print("After lowercasing:")
df_complete


After lowercasing:


|   | text |
|---|------|
| 0 | the quick brown fox jumped over the lazy dog. |
| 1 | llms learn from massive volumes of text data. |
| 2 | neural networks are the backbone of modern ai. |
| 3 | transformers changed natural language processi... |
| 4 | deep learning models require large datasets fo... |
| 5 | do not underestimate the importance of data qu... |
| 6 | i cannot believe how fast ai has progressed. |
| 7 | it is a great time to be working in machine le... |
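
For English-only text `str.lower()` is sufficient. If the corpus may contain other languages, Python's built-in `str.casefold()` applies a more aggressive folding intended for caseless comparison. The short comparison below uses illustrative sample strings that are not part of the notebook's corpus.

```python
samples = ["AI", "Ai", "ai", "Straße", "STRASSE"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print(lowered)  # the two German spellings remain distinct
print(folded)   # the two German spellings collapse to the same string
```

pandas exposes the same operation as `Series.str.casefold()`, so it can stand in for the `.str.lower()` call in the cell above.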


## 2.5 Punctuation Removal

For some tasks (e.g. bag-of-words classification), punctuation adds noise.
For sequence tasks (translation, summarization, generation), it is usually preserved.
Here we show both approaches.


In [5]:
# -------------------------------------------------------------------
# Option A: Remove all punctuation
# -------------------------------------------------------------------
def remove_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

# Option B: Keep sentence-ending punctuation (. ! ?)
def remove_non_terminal_punctuation(text):
    return re.sub(r"[^\w\s.!?]", "", text)

# We'll use Option B (preserve sentence boundaries)
df_complete["text"] = df_complete["text"].apply(remove_non_terminal_punctuation)
print("After punctuation removal (terminal punctuation kept):")
df_complete


After punctuation removal (terminal punctuation kept):


|   | text |
|---|------|
| 0 | the quick brown fox jumped over the lazy dog. |
| 1 | llms learn from massive volumes of text data. |
| 2 | neural networks are the backbone of modern ai. |
| 3 | transformers changed natural language processi... |
| 4 | deep learning models require large datasets fo... |
| 5 | do not underestimate the importance of data qu... |
| 6 | i cannot believe how fast ai has progressed. |
| 7 | it is a great time to be working in machine le... |
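
The table above is unchanged because, after contraction expansion, the corpus contains only sentence-final marks. To see the difference between the two options, the sketch below runs both functions on an illustrative string that is not part of the corpus. It also shows two side effects of these patterns: `\w` includes the underscore, so `_` is never removed, and Option B keeps every `.`, `!` and `?`, not only those at the end of a sentence.

```python
import re

def remove_punctuation(text):
    # Option A: drop everything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]", "", text)

def remove_non_terminal_punctuation(text):
    # Option B: as above, but keep '.', '!' and '?' wherever they occur.
    return re.sub(r"[^\w\s.!?]", "", text)

sample = "wait, really? the file is data_v2.csv (final)!"

option_a = remove_punctuation(sample)
option_b = remove_non_terminal_punctuation(sample)

print(option_a)
print(option_b)
```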


## 2.6 Whitespace Normalization

Multiple spaces, tabs, and newlines left over from HTML scraping should be collapsed.


In [6]:
# -------------------------------------------------------------------
# Collapse all whitespace runs to a single space, strip edges
# -------------------------------------------------------------------
df_complete["text"] = df_complete["text"].apply(lambda t: re.sub(r"\s+", " ", t).strip())
print("After whitespace normalization:")
df_complete


After whitespace normalization:


|   | text |
|---|------|
| 0 | the quick brown fox jumped over the lazy dog. |
| 1 | llms learn from massive volumes of text data. |
| 2 | neural networks are the backbone of modern ai. |
| 3 | transformers changed natural language processi... |
| 4 | deep learning models require large datasets fo... |
| 5 | do not underestimate the importance of data qu... |
| 6 | i cannot believe how fast ai has progressed. |
| 7 | it is a great time to be working in machine le... |
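
The simulated corpus has no stray whitespace, so the table above is identical to the previous one. The effect is easier to see on a messier string. In the sketch below, `normalize_whitespace` is a hypothetical named version of the lambda used above, and the input is an illustrative example.

```python
import re

def normalize_whitespace(text):
    # For str patterns in Python 3, \s matches Unicode whitespace:
    # spaces, tabs, newlines and also the non-breaking space (U+00A0).
    return re.sub(r"\s+", " ", text).strip()

raw = "  neural\tnetworks\n\nare\u00a0the   backbone  "
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```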


## 2.7 Bias Detection

Bias in training data leads to biased model outputs. A simple first pass:
flag sentences containing a curated list of potentially biased terms.
In production, dedicated fairness tools (e.g. `fairlearn`, custom classifiers)
are used for richer analysis.


In [7]:
# -------------------------------------------------------------------
# Simplified bias lexicon: extend with domain-specific terms
# -------------------------------------------------------------------
BIAS_INDICATORS = [
    "always", "never", "all women", "all men", "obviously", "everyone knows",
    "of course", "naturally", "simply", "just a",
]

def flag_bias(text):
    text_lower = text.lower()
    hits = [term for term in BIAS_INDICATORS if term in text_lower]
    return hits if hits else None

df_complete["bias_flags"] = df_complete["text"].apply(flag_bias)

flagged = df_complete[df_complete["bias_flags"].notna()]
print(f"{len(flagged)} sentence(s) flagged for potential bias:")
if not flagged.empty:
    for _, row in flagged.iterrows():
        print(f"  TEXT : {row['text']}")
        print(f"  FLAGS: {row['bias_flags']}")
else:
    print("  None flagged in this small corpus.")

# For the pipeline, we keep flagged items but mark them for human review
df_complete["needs_review"] = df_complete["bias_flags"].notna()


0 sentence(s) flagged for potential bias:
  None flagged in this small corpus.
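
Because `flag_bias` uses substring matching, a term such as "never" would also match inside "nevertheless", and "just a" would match "just an". One possible refinement, sketched here under the assumption that whole-word matches are what is wanted, is to compile the lexicon into a single pattern with word boundaries. `BIAS_TERMS` is a shortened copy of the list above, and `flag_bias_whole_words` is a hypothetical name.

```python
import re

# Shortened copy of the lexicon above, for illustration only.
BIAS_TERMS = ["always", "never", "all women", "all men", "obviously", "just a"]

_BIAS_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in BIAS_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def flag_bias_whole_words(text):
    """Return matched lexicon terms in order of appearance (empty list if none)."""
    return [m.group(0).lower() for m in _BIAS_PATTERN.finditer(text)]

no_hits = flag_bias_whole_words("nevertheless, the results held.")
hits = flag_bias_whole_words("Obviously, all men never ask for directions.")

print(no_hits)
print(hits)
```

Unlike `flag_bias`, this variant returns an empty list instead of `None` when nothing matches, so a review mask would be built from the list length rather than from `notna()`.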


## 2.8 Summary

| Step | Notes |
|------|-------|
| Missing data | Drop `None`/`NaN`; remove fragments < 10 chars |
| Contractions | Expand "don't" → "do not" etc. |
| Lowercase | Reduce vocabulary size |
| Punctuation | Remove non-terminal marks |
| Whitespace | Collapse to single spaces |
| Bias flags | Mark sentences for human review |
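
The transforming steps in the table can be combined into one function, roughly as sketched below. `normalize_text` is a hypothetical name, its default contraction map is a small subset of the dictionary from Section 2.3, and bias flagging is left out because it annotates rows rather than changing the text.

```python
import re

def normalize_text(text, contractions=None):
    """Apply the table's steps in order; return None for unusable input."""
    contractions = contractions or {
        "can't": "cannot",
        "don't": "do not",
        "it's": "it is",
    }
    # Missing data: reject non-strings and fragments under 10 characters.
    if not isinstance(text, str) or len(text.strip()) < 10:
        return None
    # Contractions: whole-word, case-insensitive replacement.
    for short, full in contractions.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    # Lowercase.
    text = text.lower()
    # Punctuation: keep only '.', '!' and '?'.
    text = re.sub(r"[^\w\s.!?]", "", text)
    # Whitespace: collapse runs to one space and strip the edges.
    return re.sub(r"\s+", " ", text).strip()

result = normalize_text("Don't   underestimate the importance of data quality!")
print(result)
print(normalize_text(None))
```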

The normalized corpus is ready for **Notebook 3: Tokenization, Labeling & Data Splitting**.
