Normalization standardizes messy tokens from tokenization, making text consistent for ML models by reducing variations like case or word forms. 

## Why Normalize?
Raw tokens have noise (e.g., "Home Run" vs "home run"). Normalization cuts variations, shrinks data size, boosts model efficiency, and handles human quirks like spelling errors. Key goal: Turn diverse text into a "standard" form before feature extraction (tokens → numbers).

## Overview: Normalization Techniques

Common steps (apply after tokenization):
- **Case folding:** All lowercase (HOME → home).
- **Numbers:** Convert to words ("5" → "five") or remove.
- **Punctuation/special chars:** Remove symbols such as $, @, #; strip accents from letters (café → cafe).
- **Whitespace:** Trim extras.
- **Contractions/abbreviations:** Expand ("can't" → "cannot").
- **Stop words:** Drop uninformative ones ("the", "is", "and").
- **Canonicalizing:** Map equivalent forms to one representation (e.g., write every date in a single format).
- **Stemming/Lemmatization:** Reduce to root forms.
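
A few of the simpler steps above (case folding, accent stripping, punctuation removal, whitespace trimming) can be sketched with the standard library alone. This is a minimal illustration, not a complete normalizer; the helper names `strip_accents`, `normalize_token`, and `normalize_tokens` are made up for this example.

```python
import re
import unicodedata

# Matches anything that is not a letter, digit, or whitespace (underscore included).
_NON_WORD = re.compile(r"[^\w\s]|_")

def strip_accents(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_token(token):
    """Case-fold, strip accents, remove punctuation/symbols, trim whitespace."""
    token = token.casefold()
    token = strip_accents(token)
    token = _NON_WORD.sub("", token)
    return token.strip()

def normalize_tokens(tokens):
    """Normalize every token and drop the ones that end up empty."""
    cleaned = (normalize_token(t) for t in tokens)
    return [t for t in cleaned if t]

print(normalize_tokens(["HOME", "Café", "$5", "  run ", "#hashtag", "!!!"]))
# ['home', 'cafe', '5', 'run', 'hashtag']
```

Note that removing punctuation this bluntly also turns "can't" into "cant", which is one reason to expand contractions before this step.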

## Normalization

The operations below are applied in sequence to a sample movie review.

### 1. POS Tagging (`pos_tag()`)
Part-of-speech (POS) tagging assigns each word a label for its role in the sentence (noun, verb, adjective, etc.). This exposes the grammatical structure and helps disambiguate meaning.
Examples: "The" = determiner (DT), "movie" = noun (NN), "was" = past-tense verb (VBD). Names like "Tom Cruise" = proper nouns (NNP); PRP$ = possessive pronoun.
To look up what a tag means, call `nltk.help.upenn_tagset()`.
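
A tagger's output is a list of `(token, tag)` pairs, which makes it easy to filter by grammatical role. The sketch below uses a hand-written list in that shape rather than calling `pos_tag()` (so it runs without NLTK data); the sentence and the helper `tokens_with_tag` are assumptions for illustration.

```python
# Hand-written (token, tag) pairs in the shape a POS tagger returns.
tagged = [
    ("Tom", "NNP"), ("Cruise", "NNP"), ("was", "VBD"),
    ("in", "IN"), ("the", "DT"), ("movie", "NN"),
]

def tokens_with_tag(tagged_tokens, prefix):
    """Return the tokens whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

print(tokens_with_tag(tagged, "NNP"))  # ['Tom', 'Cruise']
print(tokens_with_tag(tagged, "VB"))   # ['was']
```

Matching on a prefix is a deliberate choice: Penn Treebank tags share prefixes within a word class (for example, verb tags all start with `VB`), so one prefix covers the whole class.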

### 2. Named Entity Recognition (NER) (`ne_chunk()`)
Named entity recognition identifies entities in text and classifies them into predefined categories (names, organizations, locations, dates, etc.). `ne_chunk()` takes the POS-tagged tokens as input.
Tags real-world items: PERSON (Tom Cruise), ORGANIZATION.  
Use: Remove or mask entities for privacy (loses specifics but keeps sentiment).
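
The privacy use case can be sketched as a masking pass over entity-labelled tokens. The example assumes the tokens are already available as `(token, pos, iob)` triples, the format that `nltk.chunk.tree2conlltags()` produces from a chunk tree; the triples here are hand-written, and `mask_entities` is a hypothetical helper, not an NLTK function.

```python
def mask_entities(iob_triples):
    """Replace each named-entity span with a single <LABEL> placeholder."""
    masked = []
    prev = "O"
    for word, _pos, iob in iob_triples:
        if iob == "O":
            masked.append(word)  # ordinary token: keep as-is
        else:
            label = iob[2:]  # "B-PERSON" -> "PERSON"
            starts_new = iob.startswith("B-") or prev == "O" or prev[2:] != label
            if starts_new:
                masked.append(f"<{label}>")  # one placeholder per entity span
        prev = iob
    return masked

# Hand-written triples: token, POS tag, IOB entity tag.
triples = [
    ("Tom", "NNP", "B-PERSON"), ("Cruise", "NNP", "I-PERSON"),
    ("was", "VBD", "O"), ("great", "JJ", "O"),
]
print(mask_entities(triples))  # ['<PERSON>', 'was', 'great']
```

Keeping a placeholder instead of deleting the span preserves sentence structure, which later steps may still rely on.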

### 3. Lowercase + Remove Stop Words
- Lowercase: "i enjoyed 'minority report'..."
- Stop words (`nltk.corpus.stopwords.words('english')`): Dropping "i", "was", "it", etc. cuts the review from 36 to 19 tokens (about 47% fewer).

The result reads telegram-style, but the sentiment stays intact.
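
As a rough sketch of this step, the snippet below lowercases tokens and filters them against a stop list. The small `STOP_WORDS` set is hand-picked for illustration; NLTK's English list is considerably longer, and the sample tokens are made up.

```python
# Small hand-picked stop list for illustration only.
STOP_WORDS = {"i", "the", "was", "it", "is", "and", "a", "in", "of", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Lowercase each token, then drop those found in the stop list."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

tokens = ["I", "enjoyed", "the", "movie", "and", "it", "was", "fun"]
kept = remove_stop_words(tokens)
print(kept)                                   # ['enjoyed', 'movie', 'fun']
print(f"{len(tokens)} -> {len(kept)} tokens")  # 8 -> 3 tokens
```

Lowercasing comes first because stop lists are typically stored in lowercase, so "I" would otherwise slip through.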

### 4. Stemming (`PorterStemmer`)
Stemming reduces a word to a base form by stripping affixes. `PorterStemmer` is rule-based and chops suffixes, so the resulting stems may not be real words.
- joy/joyful/joyfully → joy
- joyous → "joyou", geese → "gees" (OK for ML; models learn from consistent patterns, not dictionary validity).

Stemming shrinks the review's vocabulary further, which suits tasks like sentiment analysis.
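
The suffix-chopping idea can be illustrated with a toy stemmer. This is deliberately much simpler than Porter's algorithm and will not match `PorterStemmer` output in general; the suffix list, the `min_stem` threshold, and the name `toy_stem` are all assumptions made for this sketch.

```python
# Longer suffixes come first so "fully" is tried before "ful" and "ly".
SUFFIXES = ("fully", "ful", "ness", "ing", "ly", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["joy", "joyful", "joyfully", "joyous", "enjoyed"]:
    print(w, "->", toy_stem(w))
# joy -> joy
# joyful -> joy
# joyfully -> joy
# joyous -> joyou
# enjoyed -> enjoy
```

Even this crude version shows the trade-off: related forms collapse to one string, but blind rules also produce non-words such as "joyou".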

### 5. Lemmatization (`WordNetLemmatizer`)
Lemmatization reduces a word to its dictionary form (lemma), taking the word’s part of speech into account. Unlike stemming, it relies on a vocabulary and morphological analysis rather than suffix rules.
`WordNetLemmatizer` looks words up in the WordNet database, so it returns real words only; it treats a word as a noun unless a POS is passed.
- geese → goose (correct)
- joy/joyful/joyfully/joyous → unchanged (each is already a dictionary form).

Better for grammar-heavy tasks like summarization.
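
To show why the POS matters, here is a toy dictionary-based lemmatizer. The tiny `TOY_LEMMAS` table is hand-written and stands in for a real lexical database such as WordNet; `toy_lemmatize` is a hypothetical helper, and the default of `"n"` (noun) is chosen to mirror how `WordNetLemmatizer.lemmatize()` behaves when no POS is given.

```python
# Hand-written (word, pos) -> lemma table; n = noun, v = verb, a = adjective.
TOY_LEMMAS = {
    ("geese", "n"): "goose",
    ("movies", "n"): "movie",
    ("was", "v"): "be",
    ("saw", "v"): "see",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("geese"))        # goose
print(toy_lemmatize("saw", "v"))     # see
print(toy_lemmatize("saw"))          # saw  (as a noun, it is already a lemma)
print(toy_lemmatize("joyful", "a"))  # joyful
```

The two "saw" calls are the point of the example: the same surface form maps to different lemmas depending on its part of speech, which a suffix-stripping stemmer cannot capture.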

**Stem vs Lemma:** Stem = faster/cruder; Lemma = accurate/slower.

> Normalization gives the model cleaner, more consistent tokens, though each step also discards some detail.