# 📝 NLP Notes: Stemming

## 🔹 What is Stemming?
Stemming reduces words to a base or root form, typically by stripping suffixes with heuristic rules rather than consulting a dictionary. For example, "playing", "played", and "plays" all reduce to the stem "play".

Stemming is used to:
- Normalize words
- Improve text matching and information retrieval
- Reduce the dimensionality of text data

⚠️ Note: The result may not always be a valid English word (e.g., "studies" → "studi").
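
As a rough illustration of the dimensionality point above, the sketch below counts distinct tokens before and after stemming. The token list is made up for this example, and the code assumes NLTK is installed.

```python
from nltk.stem import PorterStemmer

# Hypothetical token list, chosen only to illustrate the idea
tokens = ["playing", "played", "plays", "play", "studies", "studying"]

stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Sets keep one copy of each distinct string
raw_vocab = set(tokens)
stemmed_vocab = set(stems)

print(f"Distinct tokens before stemming: {len(raw_vocab)}")
print(f"Distinct tokens after stemming:  {len(stemmed_vocab)}")
print(sorted(stemmed_vocab))
```

Fewer distinct tokens means fewer features in a bag-of-words style representation, at the cost of stems that may not be real words.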

## ✅ 1. Porter Stemmer (Most Common)
Uses a series of rules to remove common suffixes from words in English.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["python", "pythoning", "pythoned", "pythonly", "studies", "studying"]

for word in words:
    print(f"{word} → {ps.stem(word)}")
```

Output:

```
python → python
pythoning → python
pythoned → python
pythonly → pythonli
studies → studi
studying → studi
```


## ✅ 2. Lancaster Stemmer
More aggressive than Porter: it tends to strip more of each word, so stems are often shorter and can be harder to read.

```python
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
# Reuses the `words` list defined in the Porter example
for word in words:
    print(f"{word} → {ls.stem(word)}")
```

Output:

```
python → python
pythoning → python
pythoned → python
pythonly → python
studies → study
studying → study
```


## ✅ 3. Regexp Stemmer
Uses regular expressions to remove specific suffixes from words.

You can define your own rules using the `RegexpStemmer` class.

```python
from nltk.stem import RegexpStemmer

# Remove common suffixes like -ing, -ed, -s
regexp_stemmer = RegexpStemmer('ing$|ed$|s$', min=4)
# Reuses the `words` list defined in the Porter example
for word in words:
    print(f"{word} → {regexp_stemmer.stem(word)}")
```

Output:

```
python → python
pythoning → python
pythoned → python
pythonly → pythonly
studies → studie
studying → study
```


### 📘 RegexpStemmer Class Explanation:
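`RegexpStemmer` works by deleting every part of a word that matches a regular expression, so anchor the pattern with `$` to restrict it to suffixes.

- **Syntax:** `RegexpStemmer(regexp, min=0)`
    - `regexp`: A regex pattern (string or compiled pattern) matching the text to remove
    - `min`: Minimum length a word must have to be stemmed; shorter words are returned unchanged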

**Example:** To remove -ing, -ly, or -ed:
```python
RegexpStemmer("ing$|ly$|ed$", min=3)
```
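
To make the role of `min` concrete, here is a minimal pure-Python sketch of the same idea. The helper `regexp_stem` and its `min_length` parameter are hypothetical names used for illustration. This is not NLTK's source code, only an approximation of the behavior described above.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Illustrative regex stemmer (hypothetical helper, not NLTK's implementation)."""
    # Words shorter than min_length are returned unchanged
    if len(word) < min_length:
        return word
    # Otherwise every match of the pattern is deleted from the word
    return re.sub(pattern, "", word)


pattern = "ing$|ly$|ed$"
for word in ["sing", "red", "quickly", "jumped"]:
    print(f"{word} → {regexp_stem(word, pattern, min_length=4)}")
```

In this sketch "red" is left alone because it is shorter than four characters, while "sing" is cut down to "s": the length check applies to the input word, not to the stem that remains.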

## ✅ 4. Snowball Stemmer
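Supports multiple languages. Its English stemmer (also known as Porter2) refines the original Porter algorithm and handles some cases better, such as "pythonly" in the output below.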

```python
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')
# Reuses the `words` list defined in the Porter example
for word in words:
    print(f"{word} → {snowball.stem(word)}")
```

Output:

```
python → python
pythoning → python
pythoned → python
pythonly → python
studies → studi
studying → studi
```
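
Since multi-language support is the main selling point, a short sketch of it may help. It assumes NLTK is installed. The sample words are arbitrary picks for illustration, and their stems are not listed here because they depend on each language's rules.

```python
from nltk.stem import SnowballStemmer

# Languages available in the installed NLTK version
print(SnowballStemmer.languages)

# Arbitrary sample words, one per language (chosen for illustration only)
samples = {"english": "running", "spanish": "corriendo", "german": "laufen"}

stems_by_language = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    stems_by_language[language] = stemmer.stem(word)
    print(f"{language}: {word} → {stems_by_language[language]}")
```

Each language gets its own stemmer instance, so a mixed-language corpus needs language detection or per-document metadata before stemming.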


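A single word can also be stemmed directly, here with the Porter stemmer `ps` defined earlier:

```python
ps.stem("pythoned")
```

Output:

```
'python'
```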

## 🧾 Summary Comparison Table

| Word       | Porter   | Lancaster | Regexp   | Snowball |
|------------|----------|-----------|----------|----------|
| studies    | studi    | study     | studie   | studi    |
| studying   | studi    | study     | study    | studi    |
| pythoned   | python   | python    | python   | python   |
| pythoning  | python   | python    | python   | python   |
| pythonly   | pythonli | python    | pythonly | python   |
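
A comparison like this can also be generated rather than filled in by hand, which avoids transcription slips. The sketch below assumes NLTK is installed and reuses the regex pattern from the Regexp section. The dictionary layout and column formatting are just one possible choice.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, RegexpStemmer, SnowballStemmer

# One instance per stemmer, keyed by the column name used in the table
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Regexp": RegexpStemmer("ing$|ed$|s$", min=4),
    "Snowball": SnowballStemmer("english"),
}
words = ["studies", "studying", "pythoned", "pythoning", "pythonly"]

# comparison[word][stemmer_name] -> stem
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

# Print a simple fixed-width table
header = ["Word", *stemmers]
print(" | ".join(f"{cell:<10}" for cell in header))
for word, row in comparison.items():
    print(" | ".join(f"{cell:<10}" for cell in [word, *row.values()]))
```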