# Stemming vs Lemmatization

Stemming and lemmatization are text normalization techniques used in NLP to reduce inflected word forms to a common base form. They let a model or search index treat related forms such as "play", "played", and "playing" as the same term.

**1. Stemming**
Stemming reduces a word to a root form by stripping affixes (with the Porter stemmer, suffixes only) using simple rewrite rules, with no dictionary lookup. The resulting stem may not be a real word.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer strips suffixes with a fixed sequence of rewrite rules
stemmer = PorterStemmer()
words = ["running", "played", "studies", "better"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
```

Output:

```
['run', 'play', 'studi', 'better']
```


**studi** is not a real word, and **better** is left unchanged: stemming applies suffix rules without regard to grammar or meaning.
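To make the idea of "simple rules" concrete, here is a minimal sketch of a suffix stripper in plain Python. It is an illustration only and not the Porter algorithm: the rule list `SUFFIX_RULES`, the helper `toy_stem`, and the minimum stem length of 3 are all hypothetical choices made for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# SUFFIX_RULES is a hypothetical, hand-picked rule list; order matters because
# the first matching rule wins.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # running -> runn
    ("ed", ""),    # played  -> play
    ("s", ""),     # cats    -> cat
]


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least 3 stem characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["running", "played", "studies", "better"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)  # ['runn', 'play', 'studi', 'better']
```

The toy version leaves "running" as "runn", while the Porter stemmer above returns "run". Real stemmers add further rules (such as handling doubled consonants) on top of plain suffix removal, but the approach is the same: pattern matching on the spelling, with no knowledge of what the word means.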

**2. Lemmatization**
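Lemmatization reduces a word to its dictionary base form (lemma) using a vocabulary and the word's part of speech (POS). The result is a valid dictionary word whenever the lemmatizer knows the input. NLTK's `WordNetLemmatizer` returns words it cannot find unchanged, and it treats every word as a noun unless the `pos` argument says otherwise.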

```python
import nltk

# One-time download of the WordNet data used by the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "played", "studies", "better"]

# pos='v' tells the lemmatizer to treat every word as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
```

Output:

```
['run', 'play', 'study', 'better']
```
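The result for "better" depends on the part of speech passed in: as a verb it is already a base form, which is why it comes back unchanged above. The sketch below illustrates that dependency with a toy lookup table in plain Python. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the hand-written table only stands in for a real lexical resource such as WordNet.

```python
# Toy dictionary-based lemmatizer. LEMMA_TABLE is a hypothetical, hand-written
# stand-in for a real lexical resource; keys are (word, part-of-speech) pairs.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("played", "v"): "play",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",    # comparative adjective
    ("better", "v"): "better",  # verb, as in "to better oneself"
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged if there is no entry."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


as_verb = toy_lemmatize("better", pos="v")
as_adjective = toy_lemmatize("better", pos="a")
unknown = toy_lemmatize("blorptastic", pos="n")
print(as_verb, as_adjective, unknown)  # better good blorptastic
```

The same surface form maps to different lemmas depending on its tag, and an unknown word falls through unchanged. This is why lemmatization is usually paired with a POS tagger in practice, and why it costs more than stemming.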


## When to Use What

### Use Stemming when:

* Speed is more important than accuracy
* Building simple search or keyword-matching systems

### Use Lemmatization when:

* Meaning and correctness matter
* Building NLP pipelines, chatbots, or search systems where result quality outweighs speed

---

## Important Note for LLMs

Modern **Large Language Models (LLMs)** do **not** rely on stemming or lemmatization.
They use **subword tokenization** (for example BPE or WordPiece), which splits words into smaller units learned from data and makes these techniques mostly unnecessary in LLM pipelines.

However, understanding stemming and lemmatization is still important for **NLP fundamentals**.
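For contrast with the word-level techniques above, subword tokenization can be sketched as a greedy longest-match split against a fixed vocabulary. This is a toy that is WordPiece-like in spirit only: `VOCAB` is a hypothetical, hand-picked set (real models learn theirs from data), and `toy_subword_tokenize` is an illustrative helper, not any library's API.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-like in spirit only).
# VOCAB is a hypothetical, hand-picked vocabulary; real models learn theirs.
VOCAB = {"run", "play", "stud", "ning", "ing", "ed", "ies", "y", "s"}


def toy_subword_tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right.

    Characters not covered by any vocabulary piece are emitted one at a time.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


tokenized = {w: toy_subword_tokenize(w) for w in ["running", "played", "studies", "study"]}
print(tokenized)
# {'running': ['run', 'ning'], 'played': ['play', 'ed'],
#  'studies': ['stud', 'ies'], 'study': ['stud', 'y']}
```

Nothing is discarded here: "studies" and "study" share the piece "stud" but keep their endings as separate tokens, so the model sees both the shared root and the inflection without a separate normalization step.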

---

## Key Takeaway

* **Stemming** is fast but rough: stems may not be real words
* **Lemmatization** is slower but more accurate: it returns dictionary forms
* **LLMs** sidestep both by using **subword tokenization**
