# Stemming 
**Stemming** is a process used in Natural Language Processing (NLP) to reduce words to a root or base form, called the stem. The main idea is to strip suffixes (like *-ing*, *-ed*, *-ly*) so that different variations of a word can be treated as the same token. For example, **"playing," "played,"** and **"plays"** all reduce to the base word **"play"**. This lets a system treat those words as carrying the same core meaning, even though they look different. However, stemming doesn't always produce real words: "connection" is stemmed to "connect," but "easily" becomes "easili." It also misses irregular forms, so "better" and "gone" are left unchanged. It's a rough, rule-based approach to normalizing words and is especially helpful when building search engines, text classification systems, or chatbots, where matching related word forms is important.
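To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. The function `naive_stem`, its `min_stem_len` parameter, and the suffix list are hypothetical and chosen only for illustration; real stemmers apply many more rules than this.

```python
# Hypothetical toy stemmer: strips the first matching suffix and nothing more.
SUFFIXES = ("ing", "ed", "ly", "s")  # illustrative list, not taken from any library


def naive_stem(word, min_stem_len=3):
    """Remove one known suffix if at least `min_stem_len` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "play", "running"]:
    print(w, "---->", naive_stem(w))
```

The three forms of "play" collapse to the same stem, but "running" comes out as "runn": a rule this crude knows nothing about doubled consonants, which is one reason practical stemmers need more elaborate rules.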

In [3]:
# Sentiment analysis: checking whether product comments are positive or negative.
# Words in reviews can appear in different forms like eating, eaten, eat — or going, gone, goes.
# Every word gets converted into a vector (a numeric representation).
# Instead of treating all similar words separately, we simplify them to a common base — for example, "go" instead of "goes" or "gone".

In [2]:
words = ["running", "runs", "runner", "easily", "fairly", "played", "playing", "play", "better", "faster", "going", "gone", "ate", "eating","programming", "programs"]

# Porter Stemmer
The **Porter Stemmer** is one of the most widely used stemming algorithms in Natural Language Processing (NLP). It was developed by **Martin Porter** in 1980, and its main job is to **reduce words to their root or base form** by removing common suffixes like *-ing*, *-ed*, *-es*, *-ly*, etc.

---

### Simple Explanation:
When we analyze text, we often find many variations of the same word. For example:
- "connect", "connected", "connecting", "connection"

All of these words carry the same core idea — so instead of treating them as separate words, **Porter Stemmer** reduces them to a common root (in this case, **"connect"**), making analysis more consistent.
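The "connect" family can be checked directly with NLTK's `PorterStemmer`. This is a small sketch; the variable names `porter`, `family`, and `stems` are local to the example.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()  # local name used only in this sketch
family = ["connect", "connected", "connecting", "connection"]

# Map each variant to its stem so the shared root is easy to see.
stems = {word: porter.stem(word) for word in family}
for word, stem in stems.items():
    print(word, "---->", stem)
```

Every variant maps to the same stem, so a downstream model sees one token instead of four.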

In [4]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
stemming

<PorterStemmer>

In [11]:
for word in words:
    print(word,'---->',stemming.stem(word))

running ----> run
runs ----> run
runner ----> runner
easily ----> easili
fairly ----> fairli
played ----> play
playing ----> play
play ----> play
better ----> better
faster ----> faster
going ----> go
gone ----> gone
ate ----> ate
eating ----> eat
programming ----> program
programs ----> program


### ✅ **Advantages of Porter Stemmer:**
- Reduces different word forms to a common base (e.g., *playing → play*).
- Speeds up processing by reducing vocabulary size.
- Good for search engines and basic NLP tasks.
- Fast and easy to use.
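The vocabulary-size point can be illustrated by counting unique tokens before and after stemming. The sketch below reuses the word list from this notebook; `vocab_before` and `vocab_after` are names introduced here for illustration.

```python
from nltk.stem import PorterStemmer

words = ["running", "runs", "runner", "easily", "fairly", "played", "playing",
         "play", "better", "faster", "going", "gone", "ate", "eating",
         "programming", "programs"]

porter = PorterStemmer()

vocab_before = set(words)                       # distinct surface forms
vocab_after = {porter.stem(w) for w in words}   # distinct stems

print(len(vocab_before), "unique words ->", len(vocab_after), "unique stems")
```

On a real corpus the reduction is usually what matters for vector-based models, since each distinct token becomes its own feature.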

---

### ❌ **Disadvantages:**
- Sometimes gives incorrect stems (*easily → easili*).
- Doesn't always merge related words (*runner* is not reduced to *run*).
- Output may not be real words.
- Only works well for English.


In [13]:
stemming.stem('congratulations')
# The stem changes the form of the word: 'congratul' is not a real English word.

'congratul'

# RegexpStemmer
The **`RegexpStemmer`** (short for *Regular Expression Stemmer*) is a class in the **NLTK** library that lets you perform **custom stemming** using **regular expression (regex) rules**. Unlike Porter or Snowball, whose rules are fixed in advance, `RegexpStemmer` gives you full control over **which patterns to strip** from words.

---

### Simple Explanation:

Imagine you want to remove specific endings like `"ing"`, `"ly"`, or `"ed"` from words. Instead of relying on built-in stemmers, you can **define your own rules using regex** — and `RegexpStemmer` will apply those.

### Parameters:
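- `regexp` (e.g. `'ing$|ed$|ly$'`): Regex pattern whose matches are removed from the word. The `$` anchors each suffix to the end of the word.
- `min` (e.g. `min=4`): Minimum length a word must have **before** stemming. Shorter words are returned unchanged, which prevents over-stemming short words.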
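A quick way to see how `min` behaves is to stem a few short words. In NLTK's implementation the threshold is compared against the length of the input word, not the length of the result. The stemmer name `demo` and the sample words are choices made for this sketch.

```python
from nltk.stem import RegexpStemmer

# `demo` is a throwaway stemmer for this check.
demo = RegexpStemmer('ing$|ed$|ly$', min=4)

for w in ["red", "sing", "playing", "quickly"]:
    # repr() makes very short results easy to spot.
    print(w, "---->", repr(demo.stem(w)))
```

"red" has only three letters, so it is returned untouched, while "sing" meets the threshold and is cut down to "s". If short stems like that are a problem, raise `min` or make the regex itself stricter.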


### Use Case:
Best when you:
- Want **simple**, rule-based stemming.
- Need to **customize** what gets stripped off.
- Want **more control** than Porter or Snowball stemmers provide.

In [14]:
from nltk.stem import RegexpStemmer

In [27]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$',min=4)

In [28]:
reg_stemmer

<RegexpStemmer: 'ing$|s$|e$|able$'>

In [29]:
print(reg_stemmer.stem('eating'))
print(reg_stemmer.stem('eatable'))
print(reg_stemmer.stem('eats'))
print(reg_stemmer.stem('ingeating'))
print(reg_stemmer.stem('discrete'))

eat
eat
eat
ingeat
discret


In [30]:
# Without the `$` anchor, 'ing' is removed wherever it occurs, not only at the end of the word.
reg_stemmer = RegexpStemmer('ing|s$|e$|able$',min=4)
print(reg_stemmer.stem('ingeating'))

eat
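The difference between the two patterns comes from the regex itself, and it can be reproduced with Python's built-in `re` module. This sketch assumes only that the stemmer deletes every match of the pattern, which is consistent with the outputs shown in the cells of this section.

```python
import re

word = "ingeating"

# 'ing$' matches only at the end of the word.
anchored = re.sub(r"ing$", "", word)

# 'ing' matches anywhere, so the leading "ing" is removed as well.
unanchored = re.sub(r"ing", "", word)

print("anchored:  ", anchored)
print("unanchored:", unanchored)
```

Keeping the `$` anchor is usually the safer choice for suffix stripping, since an unanchored pattern can delete letters from the middle or start of a word.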


### ❄️ **Snowball Stemmer (also called Porter2 Stemmer)**

The **Snowball Stemmer** is an **improved version of the Porter Stemmer**, developed by **Martin Porter** himself. Its English algorithm is the one known as Porter2. It is generally more consistent than the original Porter stemmer and supports **multiple languages**.

---

### In Simple Words:
Snowball stemmer cuts words down to their **root form**, just like Porter stemmer, but it does it **more cleanly** and with **fewer mistakes**.

### ✅ Advantages:
- More accurate than Porter Stemmer
- Supports many languages (French, German, Spanish, etc.)
- Produces more consistent results

### ❌ Disadvantages:
- Slightly slower than Porter stemmer
- Still may not return real dictionary words
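Multi-language support can be explored through the `languages` attribute on NLTK's `SnowballStemmer` class. The sketch below lists the available languages and builds a Spanish stemmer; the sample Spanish words are picked for illustration, and no particular output is claimed for them.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor.
print(SnowballStemmer.languages)

spanish = SnowballStemmer("spanish")
for w in ["corriendo", "corremos"]:  # illustrative sample words
    print(w, "--->", spanish.stem(w))
```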


In [32]:
from nltk.stem import SnowballStemmer

In [35]:
snowballstemmer = SnowballStemmer(language='english')

In [36]:
snowballstemmer

<nltk.stem.snowball.SnowballStemmer at 0x267b3c78dd0>

In [39]:
for word in words:
    print(word,'--->',snowballstemmer.stem(word))

running ---> run
runs ---> run
runner ---> runner
easily ---> easili
fairly ---> fair
played ---> play
playing ---> play
play ---> play
better ---> better
faster ---> faster
going ---> go
gone ---> gone
ate ---> ate
eating ---> eat
programming ---> program
programs ---> program


In [44]:
print('with porter stemmer\t',stemming.stem('fairly'))
print('-'*100)
print('with snowball stemmer\t',snowballstemmer.stem('fairly'))

with porter stemmer	 fairli
----------------------------------------------------------------------------------------------------
with snowball stemmer	 fair


In [48]:
print('Neither stemmer handles every word well\n')

print('with porter stemmer\t',stemming.stem('goes'))
print('-'*100)
print('with snowball stemmer\t',snowballstemmer.stem('goes'))

Neither stemmer handles every word well

with porter stemmer	 goe
----------------------------------------------------------------------------------------------------
with snowball stemmer	 goe
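Rather than comparing words one at a time, the two stemmers can be run over the whole list and only the disagreements printed. This is a sketch; `differences` is a name introduced here, and the stemmers are re-created so the snippet runs on its own.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

words = ["running", "runs", "runner", "easily", "fairly", "played", "playing",
         "play", "better", "faster", "going", "gone", "ate", "eating",
         "programming", "programs"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Keep only the words on which the two stemmers disagree.
differences = {
    w: (porter.stem(w), snowball.stem(w))
    for w in words
    if porter.stem(w) != snowball.stem(w)
}

for w, (p, s) in differences.items():
    print(f"{w}: porter={p}, snowball={s}")
```

For this word list the two agree almost everywhere, which matches the outputs above: Snowball refines Porter's rules rather than replacing them.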
