# 🌟 [**Lemmatization**](https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/) 🚀

## 📌 **1. What is Lemmatization?**
Lemmatization is a process in **Natural Language Processing (NLP)** that finds the **base or dictionary form** of a word, called a **lemma**, by analyzing its **context** and **part of speech (POS)**.

✅ **Why is Lemmatization Important?**  
- It ensures words are in their correct base form.  
- It improves text analysis by maintaining proper meaning.  
- It is widely used in **search engines, AI chatbots, and sentiment analysis**.

---

## 🏷 **2. What is Part of Speech (POS)?**
**Part of Speech (POS)** refers to the **category** a word belongs to based on its grammatical function in a sentence.

🔹 Words are classified into different parts of speech:

| 🏷 **POS**      | 📖 **Definition**                | 🔍 **Example**            |
|--------------|------------------------------|------------------------|
| **Noun**     | A person, place, or thing.    | **dog, city, book**    |
| **Verb**     | An action or state of being.  | **run, eat, is**       |
| **Adjective**| Describes a noun.            | **big, happy, red**    |
| **Adverb**   | Describes a verb or adjective. | **quickly, very, well** |
| **Pronoun**  | Replaces a noun.              | **he, she, they**      |
| **Preposition** | Shows relationships.      | **on, under, at**      |
| **Conjunction** | Connects words/phrases.   | **and, but, because**  |
| **Interjection** | Expresses emotion.       | **Wow!, Oh!, Oops!**   |

---

## 🛠 **3. How Does Lemmatization Work?**
Lemmatization follows a **structured** approach:

1️⃣ **Tokenization** → Break text into words.  
2️⃣ **POS Tagging** → Identify each word's role.  
3️⃣ **Apply Lemmatization** → Convert words to base form using a **linguistic database** (like WordNet).  
4️⃣ **Output** → Get meaningful and standardized words.  

📌 **Example:**

| 🔤 **Word**  | 📝 **Lemmatized Form** |
|-------------|------------------|
| **Running** | **Run** |
| **Studies** | **Study** |
| **Happily** | **Happily** (the adjective "happy" is a separate lemma) |
| **Better**  | **Good** |
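
The steps above can be sketched with a toy pipeline. This is only an illustration: `TOY_POS` and `TOY_LEMMAS` are hypothetical hand-written tables standing in for a real POS tagger and a lexical database such as WordNet, and they cover just the words in this one sentence.

```python
import re

# Hypothetical stand-ins for a real POS tagger and a WordNet-style database.
TOY_POS = {"the": "DET", "children": "NOUN", "were": "VERB",
           "running": "VERB", "faster": "ADV"}
TOY_LEMMAS = {("children", "NOUN"): "child", ("were", "VERB"): "be",
              ("running", "VERB"): "run", ("faster", "ADV"): "fast"}


def tokenize(text):
    """Step 1: split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pos_tag(tokens):
    """Step 2: look up each token's POS (unknown words default to NOUN)."""
    return [(tok, TOY_POS.get(tok, "NOUN")) for tok in tokens]


def lemmatize(tagged):
    """Step 3: map (word, POS) to its lemma; unknown pairs are kept as-is."""
    return [TOY_LEMMAS.get((tok, pos), tok) for tok, pos in tagged]


# Step 4: the standardized output.
tokens = tokenize("The children were running faster.")
lemmas = lemmatize(pos_tag(tokens))
print(lemmas)  # ['the', 'child', 'be', 'run', 'fast']
```

Keying the lemma table on `(word, POS)` rather than on the word alone is the point of step 2: the same surface form can have different lemmas depending on its grammatical role.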

---

# 🎯 **4. Lemmatization vs. Stemming: Key Differences**
Both techniques reduce words to their root forms, but **lemmatization is more precise!**  

| ⚡ **Feature**      | 🔍 **Lemmatization** | 🔧 **Stemming** |
|----------------|-----------------|-------------|
| **Definition**  | Converts words to their dictionary form, considering **context** and **POS**. | Removes **suffixes** without considering meaning. |
| **Accuracy**    | ✅ **High** – produces valid words. | ❌ **Lower** – may produce non-words. |
| **Speed**       | ❌ **Slower** (needs a dictionary lookup and POS information). | ✅ **Faster** (applies simple suffix-stripping rules). |
| **Use Case**    | **Chatbots, text understanding, AI.** | **Search engines, keyword matching.** |
| **Example**     | "better" → **"good"**, "running" → **"run"** | "better" → **"better"** (the Porter stemmer leaves it unchanged; more aggressive stemmers may cut it to "bet"), "running" → **"run"** |

📌 **Comparison Example:**

| 🔤 **Word**  | 📝 **Lemmatization** | 🔧 **Stemming (Porter)** |
|-------------|------------------|-------------|
| **Running** | **Run** | **Run** |
| **Studies** | **Study** | **Studi** |
| **Happily** | **Happily** | **Happili** |
| **Better**  | **Good**  | **Better** |

✅ **Lemmatization = More accurate, preserves meaning.**  
❌ **Stemming = Faster, but may distort words.**
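
To make the contrast concrete, here is a minimal sketch that puts a naive suffix stripper next to a dictionary lookup. Neither is a real algorithm: `naive_stem` is a hypothetical four-rule stemmer (not Porter), and `TOY_LEMMAS` is a hand-written table standing in for a lexical database.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ly", ""), ("s", ""))

# Hypothetical lemma table; a real lemmatizer would consult a dictionary.
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}


def naive_stem(word):
    """Strip the first matching suffix, with no dictionary and no POS."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            stem += replacement
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are kept."""
    return TOY_LEMMAS.get(word.lower(), word.lower())


print(naive_stem("Running"), toy_lemmatize("Running"))  # run run
print(naive_stem("Studies"), toy_lemmatize("Studies"))  # studi study
print(naive_stem("Better"), toy_lemmatize("Better"))    # better good
```

The stemmer never sees that "better" is related to "good", because no suffix rule can express that link; only a lookup can.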

---

# 🧐 **5. Should I Use Both Stemming and Lemmatization?**
Follow these **5 simple steps** to decide which method fits your NLP task.  

### **Step 1: Define Your Needs**
🔹 **What is your goal?**  
- Need to group **variants of the same word** (e.g., "run", "runs", "running")?  
- Need to **preserve exact meaning** for **AI embeddings**?  

🔹 **What matters more: speed or accuracy?**  
- **Large dataset?** → **Speed is crucial.**  
- **Precise meaning?** → **Accuracy is more important.**  

---

### **Step 2: Consider the Trade-Offs**
| ⚡ **Feature**      | 🔧 **Stemming** | 📝 **Lemmatization** |
|----------------|-------------|----------------|
| **Speed**     | ✅ **Fast** | ❌ **Slower** |
| **Accuracy**  | ❌ May distort words | ✅ Preserves meaning |
| **Output**    | ❌ Can create non-words | ✅ Produces real words |
| **Best Use**  | **Search engines, large datasets** | **AI, chatbots, sentiment analysis** |

---

### **Step 3: Choose Based on Your Needs**
✅ **Use Stemming if:**  
- You need **fast processing**.  
- You can tolerate **some loss of meaning**.  

✅ **Use Lemmatization if:**  
- **Accuracy** is essential.  
- You need grammatically correct base words.  

---

### **Step 4: Experiment with Both**  
If unsure, try both on a **small dataset** and compare:  
- **Does stemming distort words too much?**  
- **Does lemmatization slow down processing significantly?**  
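
One way to run such an experiment is a small harness that applies each normalizer to the same sample and records both the outputs and the time taken. The sketch below assumes each normalizer is a plain function from word to word; `crude_stem` and `toy_lemma` are hypothetical placeholders to be swapped for a real stemmer and lemmatizer.

```python
import timeit


def compare_normalizers(words, normalizers, repeats=1000):
    """Run each normalizer over `words`; collect outputs and total seconds."""
    report = {}
    for name, func in normalizers.items():
        outputs = [func(w) for w in words]
        seconds = timeit.timeit(lambda: [func(w) for w in words], number=repeats)
        report[name] = {"outputs": outputs, "seconds": seconds}
    return report


def crude_stem(word):
    """Placeholder stemmer (assumption): keep the first five letters."""
    return word.lower()[:5]


TOY_LEMMAS = {"studies": "study", "running": "run"}  # hypothetical table


def toy_lemma(word):
    """Placeholder lemmatizer (assumption): dictionary lookup."""
    return TOY_LEMMAS.get(word.lower(), word.lower())


sample = ["Studies", "running", "report"]
report = compare_normalizers(
    sample, {"stem": crude_stem, "lemma": toy_lemma}, repeats=100
)
for name, result in report.items():
    print(f"{name}: {result['outputs']} in {result['seconds']:.4f}s")
```

Reading the two output lists side by side answers the first question (how badly are words distorted?), and the timings answer the second.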

---

### **Step 5: Refine Your Approach**  
- **If speed is critical**, use **stemming**.  
- **If accuracy is more important**, use **lemmatization**.  
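- **Hybrid approach?** Apply **lemmatization first**, while the words are still valid dictionary forms, and **stem the lemmas** only if you need further reduction. Stemming first produces non-words that a lemmatizer cannot look up.  

💡 **Final Tip:** The best choice depends on your specific **NLP task**! 🚀  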


## **📌 Real-World Use Cases**
## **1. Sentiment Analysis (Product Reviews, Social Media Monitoring)**

### ✅ **Best Choice: Lemmatization**  

- **Why?** Lemmatization helps **normalize words** while maintaining their correct meaning.  
- **Example:**  
  - If a user writes: **"This product is better than I expected!"**  
  - Lemmatization converts **"better" → "good"**, helping the sentiment model recognize it as **positive sentiment**.  

---

### 💡 **Example Sentiment Analysis:**

| 📝 **Raw Text**                     | 🎯 **Lemmatized Text**         | 😃 **Sentiment**  |
|--------------------------------------|--------------------------------|------------------|
| "I loved the movies"                 | "I love the movie"            | **Positive** 👍  |
| "This phone is worse than before"    | "This phone be bad than before" | **Negative** 👎  |
| "Running is exhausting"              | "Run be exhaust"              | **Neutral** 😐  |

---

### **Why Use Lemmatization for Sentiment Analysis?**
✔ **Maintains correct word meaning** (e.g., "better" → "good")  
✔ **Reduces variations of words** for better text processing  
✔ **More accurate sentiment classification**  
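
A rough sketch of why this helps: with lemmas, a sentiment lexicon only needs entries for base forms. Both tables below are hypothetical and tiny; a real system would use a proper lemmatizer and a much larger lexicon or a trained model.

```python
import re

# Hypothetical lemma table and polarity lexicon, for illustration only.
TOY_LEMMAS = {"better": "good", "loved": "love", "worse": "bad", "movies": "movie"}
TOY_POLARITY = {"good": 1, "love": 1, "bad": -1}


def sentiment_score(text):
    """Sum the polarity of each token's lemma; unknown lemmas count as 0."""
    score = 0
    for token in re.findall(r"[a-z]+", text.lower()):
        lemma = TOY_LEMMAS.get(token, token)
        score += TOY_POLARITY.get(lemma, 0)
    return score


print(sentiment_score("This product is better than I expected!"))  # 1
print(sentiment_score("This phone is worse than before"))          # -1
print(sentiment_score("I loved the movies"))                       # 1
```

Without the lemma step, the lexicon would need separate entries for "better", "loved", "worse", and every other inflected form.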

----

# **2. 📂 Search Engine for a Company’s Internal Reports**  

**Scenario:** A company with thousands of internal reports (e.g., project updates, meeting notes) needs a search tool to help employees find documents quickly. Speed is critical because employees need results in seconds, and the database is large.

## ✅ **Best Choice: Stemming**  

### **🚀 Why Stemming?**  
- **Speed is critical** – Employees need results **instantly**.  
- **Large database** – Processing must be **fast and efficient**.  
- **Employees use different word forms** – Stemming helps match them quickly.  

---

## **💡 Example: Searching for "meeting"**
| **Search Query**     | **Stemming Applied**  | **Matches Found**        |
|----------------------|----------------------|--------------------------|
| "meeting notes"     | "meet"               | ✅ "meeting notes"       |
| "meetings summary"  | "meet"               | ✅ "meetings summary"    |
| "met with team"     | "met" (unchanged)    | ❌ Not matched: the irregular form "met" is only mapped to "meet" by lemmatization |

📌 **Stemming helps match related words, making searches broader and faster!**  
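
A minimal sketch of how such a search tool might use stems, assuming an in-memory inverted index. The report IDs, the sample texts, and the three-rule `naive_stem` function are all hypothetical; a production system would use a real stemmer and a search engine.

```python
import re
from collections import defaultdict


def naive_stem(word):
    """Hypothetical stemmer: strip one of a few common suffixes."""
    for suffix in ("ings", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Lowercase, tokenize, and stem a piece of text."""
    return [naive_stem(t) for t in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Map each stem to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for stem in stems(text):
            index[stem].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every stem in the query."""
    query_stems = stems(query)
    if not query_stems:
        return set()
    return set.intersection(*(index.get(s, set()) for s in query_stems))


docs = {
    "r1": "Meeting notes from Monday",
    "r2": "Meetings summary for June",
    "r3": "Met with team about budget",
}
index = build_index(docs)
print(sorted(search(index, "meeting")))  # ['r1', 'r2']
print(sorted(search(index, "met")))      # ['r3']
```

Stemming both the documents and the query is what lets "meeting" find "Meetings summary". In this sketch the irregular form "met" stays as "met", so it is indexed separately from "meet".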

---

## **❌ Why Not Lemmatization?**  
- **Slower processing** → Lemmatization requires **linguistic analysis**, increasing search time.  
- **Overkill for simple searches** → Employees need **fast** keyword-based results, not deep semantic understanding.  

---

## **🔹 Final Decision: Use Stemming**  
✔ **Fast & efficient for large internal databases**  
✔ **Broadens search results by matching word variations**  
✔ **Ideal when speed matters more than perfect accuracy**  


### [Lemmatization using NLTK](https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258)

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Basic lemmatization: pos defaults to noun ("n"), so pass the POS explicitly ("v" = verb, "a" = adjective)
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))

run
good


### [Lemmatization using Spacy](https://spacy.io/api/lemmatizer)

In [None]:
import spacy

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("She is running better than her friend.")

# Print lemmatized words
for token in doc:
    print(token.text, "→", token.lemma_)

She → she
is → be
running → run
better → well
than → than
her → her
friend → friend
. → .


# **📌 Expanding Contractions in NLP**  

## **🚀 What Are Contractions?**  
**Contractions** are **shortened word forms** created by combining two words and replacing the **omitted letters** with an **apostrophe (`'`)**.  

🔹 **Example:**  
- `"I'm"` → `"I am"`  
- `"Don't"` → `"Do not"`  
- `"It's"` → `"It is"`  

---

## **📌 Why Expand Contractions in NLP?** 🤖  
Contractions give the same words **several surface forms** (`"don't"` vs. `"do not"`), and tokenizers split them inconsistently, which makes text harder for **Natural Language Processing (NLP) models** to handle.  
Expanding them **standardizes the text** during preprocessing, so later steps such as tokenization, punctuation removal, and lemmatization see the full words.  


---

## **📌 Common Contractions & Their Expansions** 🔄  

| **Contraction** | **Expanded Form**  |  
|---------------|----------------|  
| I'm          | I am           |  
| You're       | You are        |  
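| It's         | It is (or "It has", depending on context) |  
| He's         | He is (or "He has", depending on context) |  
| She's        | She is (or "She has", depending on context) |  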
| We're        | We are         |  
| They're      | They are       |  
| Isn't        | Is not         |  
| Aren't       | Are not        |  
| Can't        | Cannot        |  
| Won't        | Will not       |  


---



**Install the Library**

In [None]:
%%capture
!pip install contractions

In [None]:
import contractions

text = "I'm learning NLP, but I won't give up!"
expanded_text = contractions.fix(text)

print(f"Before: {text}")
print(f"After : {expanded_text}")


Before: I'm learning NLP, but I won't give up!
After : I am learning NLP, but I will not give up!


# **Remove Punctuation**

In [None]:
import string
#This is a constant in Python's string module that contains all standard punctuation characters.
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

[string_maketrans](https://www.w3schools.com/python/ref_string_maketrans.asp)

In [None]:
#The maketrans() method returns a mapping table that can be used with the translate() method to replace specified characters.
txt = "Hello Sam!"
mytable = str.maketrans("S", "P")
print(mytable)

{83: 80}


In [None]:
# Use a mapping table to replace many characters
txt = "Hi Sam!"
x = "mSa"
y = "eJo"
mytable = str.maketrans(x, y)
print(txt.translate(mytable))

Hi Joe!


In [None]:
# The third parameter in the mapping table describes characters that you want to remove from the string
txt = "Good night Sam!"
x = "mSa"
y = "eJo"
z = "odnght"
mytable = str.maketrans(x, y, z)
print(txt.translate(mytable))

G i Joe!


In [None]:
import string

text = "Hello, World! How's everything?"
translator = str.maketrans('', '', string.punctuation)
clean_text = text.translate(translator)

print(f'Original: {text}')
print(f'Cleaned : {clean_text}')


Original: Hello, World! How's everything?
Cleaned : Hello World Hows everything
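
Deleting every punctuation character also deletes the apostrophe, which is why `How's` becomes `Hows` above. If that is not wanted, one option is to exclude the apostrophe from the deletion set, or to replace punctuation with spaces so that hyphenated words are not glued together. The sketch below shows both variants; the function names are illustrative, not part of any library.

```python
import string

# Every punctuation character except the apostrophe.
PUNCT_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")

# Variant 1: delete punctuation but keep apostrophes.
DELETE_TABLE = str.maketrans("", "", PUNCT_EXCEPT_APOSTROPHE)

# Variant 2: turn punctuation into spaces instead of deleting it.
SPACE_TABLE = str.maketrans({ch: " " for ch in PUNCT_EXCEPT_APOSTROPHE})


def strip_punctuation(text):
    """Remove punctuation while leaving apostrophes in place."""
    return text.translate(DELETE_TABLE)


def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse repeated spaces."""
    return " ".join(text.translate(SPACE_TABLE).split())


print(strip_punctuation("Hello, World! How's everything?"))
# Hello World How's everything
print(punctuation_to_spaces("well-known, state-of-the-art"))
# well known state of the art
```

Another common ordering is to expand contractions first and remove punctuation afterwards, so no apostrophes are left to worry about.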
