# Natural Language Processing (NLP)

<p>Before 2010, if you wanted to build a spam filter or a chatbot, you didn't use a "Brain." You used a "Calculator."</p>
<p>Most of the time was spent on the data cleaning part.</p>

## 1. The Preprocessing Flow

<p>Input: "The 2 dogs were running fast!!"</p>

| Step | Action | Result (current state) |
|:-----|:-------|:-----------------------|
| 1. Lowercasing | Normalize case | "the 2 dogs were running fast!!" |
| 2. Noise Removal | Remove symbols and digits | "the dogs were running fast" |
| 3. Tokenization | Split into list | ["the", "dogs", "were", "running", "fast"] |
| 4. Stop Words | Remove "the", "were" | ["dogs", "running", "fast"] |
| 5. Stemming | Chop suffixes | ["dog", "run", "fast"] |
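
The five steps in the table can be sketched end to end in plain Python. This is a minimal illustration, not a production pipeline: `STOP_WORDS` is a tiny hand-picked set and `toy_stem` is a hypothetical two-rule stemmer that happens to cover this sentence. A real project would more likely use a library stemmer and stop-word list.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "were", "a", "an", "is", "are", "was"}


def toy_stem(word):
    """Hypothetical two-rule stemmer, just enough for the example sentence."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling: "runn" -> "run"
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def preprocess(text, stem=toy_stem):
    text = text.lower()                                  # 1. Lowercasing
    text = re.sub(r"[^a-z\s]", "", text)                 # 2. Noise removal
    tokens = text.split()                                # 3. Tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. Stop words
    return [stem(t) for t in tokens]                     # 5. Stemming


print(preprocess("The 2 dogs were running fast!!"))
# ['dog', 'run', 'fast']
```

The `stem` parameter is there so the toy stemmer can be swapped for a real one without touching the other steps.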


----

1. **Stemming**

- Stemming is a rule-based approach.
- It doesn't know English; it just knows patterns. It uses a list of if/else rules to strip suffixes.

**Rules**
- Rule: If a word ends in "ing", remove "ing".
- Rule: If a word ends in "ed", remove "ed".
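
Written as code, the two rules above are just string checks. This is a simplified sketch of the idea, not the Porter algorithm (which has many more rules and conditions); `naive_stem` is a hypothetical name used only for illustration.

```python
def naive_stem(word):
    """Apply the two suffix rules literally, with no knowledge of English."""
    word = word.lower()
    if word.endswith("ing"):
        return word[:-3]
    elif word.endswith("ed"):
        return word[:-2]
    return word


for w in ["jumping", "walked", "running", "sing", "red"]:
    print(w, "->", repr(naive_stem(w)))
# jumping -> 'jump'
# walked -> 'walk'
# running -> 'runn'
# sing -> 's'
# red -> 'r'
```

The last three results show why practical stemmers add conditions, such as a minimum remaining length and special handling for doubled consonants.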

**Examples of failure (over-chopping)**
- Input: "Universities"
- Stemmer: "univers" (not a real word).
- Input: "Ponies"
- Stemmer: "poni" (not a real word).

**Why do we use it?**

It is fast and works well for search engines. If you search for "Fishing," the stemmer reduces the query to "fish," so it matches documents containing "Fish." The stem does not have to be identical to the query, or even a real word; it only has to be the same for related words.
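
One way to picture this matching is an index keyed by stems: documents and queries go through the same stemmer, so different surface forms land on the same key. The sketch below is a toy under stated assumptions: `stem` is a hypothetical one-rule stemmer, and `build_index` and `search` are illustrative names, not part of any search library.

```python
from collections import defaultdict


def stem(word):
    """Hypothetical one-rule stemmer: lowercase and strip a trailing 'ing'."""
    word = word.lower()
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way the documents were stemmed."""
    return sorted(index.get(stem(query), set()))


docs = {
    1: "Fish swim upstream",
    2: "Fishing is relaxing",
    3: "Dogs bark loudly",
}
index = build_index(docs)
print(search(index, "Fishing"))  # [1, 2]
```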


2. **Lemmatization**

- Lemmatization is a linguistic approach. It performs a morphological analysis of the word.
- It looks at the word, determines its part of speech (noun, verb, adjective), and then looks up the root form (the lemma) in a database such as WordNet.
- Lemma: the base (dictionary) form of a word. Examples:

| Word variation (inflection) | Lemma |
|:----------------------------|:------|
| Running, Ran, Runs | Run |
| Better | Good |
| Mice | Mouse |
| Corpora | Corpus |
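
The lookup step can be pictured as a dictionary keyed by word and part of speech. The sketch below uses a tiny hand-written table as a stand-in for a real lexical database such as WordNet. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the part of speech is passed in by hand rather than detected.

```python
# Assumption: a tiny hand-written table standing in for a real lexical database.
# Keys are (word, part_of_speech); 'n' = noun, 'v' = verb, 'a' = adjective.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
    ("corpora", "n"): "corpus",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if not found."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("Ran", "v"))     # run
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better", "v"))  # better (no entry for that part of speech)
print(toy_lemmatize("Mice", "n"))    # mouse
```

The same word can map to different lemmas depending on its part of speech, which is why a lemmatizer needs that information before the lookup.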


In [2]:
# Dataset preprocessing

import re

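def simple_preprocess(text):
    """
    Input: "The Dog run!!"
    Output: ['the', 'dog', 'run']
    """
    # 1. Lowercase
    text = text.lower()
    # 2. Remove everything except lowercase letters and whitespace
    #    (this drops punctuation and digits)
    text = re.sub(r'[^a-z\s]', '', text)
    # 3. Tokenize (split on whitespace)
    tokens = text.split()
    return tokens

# Avoid the name `input`, which would shadow the Python built-in
sample_text = "The Dog run!!"
print(simple_preprocess(sample_text))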

['the', 'dog', 'run']


In [6]:
# 05_preprocessing_deep_dive.ipynb

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the WordNet data (run this once)
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Initialize the tools
stemmer = PorterStemmer()         # stemming tool
lemmatizer = WordNetLemmatizer()  # lemmatization tool

# --- Test Data ---
words = ["running", "flies", "universities", "better", "ate"]

print(f"{'Original':<15} | {'Stemming':<20} | {'Lemmatization':<20}")
print("-" * 65)

for word in words:
    # 1. Stemming: rule-based suffix stripping, no part of speech needed
    stem = stemmer.stem(word)

    # 2. Lemmatization: for this demo every word is treated as a verb (pos='v').
    # 'ate' becomes 'eat' only because it is looked up as a verb.
    # 'universities' and 'better' come back unchanged under pos='v'; they
    # would need pos='n' and pos='a' to map to 'university' and 'good'.
    lemma = lemmatizer.lemmatize(word, pos='v')

    print(f"{word:<15} | {stem:<20} | {lemma:<20}")

Original        | Stemming             | Lemmatization       
-----------------------------------------------------------------
running         | run                  | run                 
flies           | fli                  | fly                 
universities    | univers              | universities        
better          | better               | better              
ate             | ate                  | eat                 
