# Stemming and Its Types - Text Preprocessing

## Introduction to Stemming

Stemming is a process in Natural Language Processing (NLP) that reduces words to a base or root form (the stem), typically by stripping suffixes. The stem does not have to be a valid dictionary word. The primary goal of stemming is to ensure that different inflected forms of the same word are treated as a single item.

For example:
- "running" → "run"
- "played" → "play"
- "playing" → "play"

Stemming is often used in information retrieval, text classification, and other NLP tasks where reducing variations of words helps improve model performance.

## Types of Stemming

There are several types of stemming algorithms, each with different levels of complexity and approaches. The most commonly used stemming algorithms are:

1. **Porter Stemmer**
2. **Snowball Stemmer**
3. **Lancaster Stemmer**
4. **RegexpStemmer**

Let’s go over these one by one.

### 1. Porter Stemmer

The Porter Stemmer algorithm is one of the most widely used stemming techniques. It works by applying a fixed sequence of rules that strip or rewrite suffixes. It is efficient, but it sometimes removes more than necessary and often produces stems that are not dictionary words.

#### Example:

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

words = ["running", "runner", "runs", "easily"]
stemmed_words = [porter_stemmer.stem(word) for word in words]

print(stemmed_words)
```

Output:

```
['run', 'runner', 'run', 'easili']
```
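To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the `RULES` table, the `toy_stem` function, and the minimum-stem-length condition are all hypothetical, chosen only to show how ordered rules with a guard condition behave.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Each rule is (suffix, replacement, minimum length of the remaining stem).
RULES = [
    ("ies", "i", 2),
    ("ing", "", 3),
    ("ly", "", 3),
    ("s", "", 3),
]


def toy_stem(word: str) -> str:
    """Apply the first rule whose suffix matches and whose length guard holds."""
    word = word.lower()
    for suffix, replacement, min_stem_len in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= min_stem_len:
                return stem + replacement
            # Guard failed: keep looking at the remaining rules.
    return word


print([toy_stem(w) for w in ["playing", "plays", "sing", "quickly", "ponies"]])
# ['play', 'play', 'sing', 'quick', 'poni']
```

The length guard is what keeps "sing" intact while "playing" loses its suffix. Real stemmers such as Porter use more refined conditions, but the overall shape (ordered rules plus guards) is similar.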


### 2. Snowball Stemmer
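The Snowball Stemmer is an improvement over the Porter Stemmer; its English variant is also known as Porter2. It is generally more consistent than the original Porter Stemmer, and NLTK's implementation supports several languages besides English. On the small word list used here, it produces the same output as the Porter Stemmer.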

```python
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")

words = ["running", "runner", "runs", "easily"]
stemmed_words = [snowball_stemmer.stem(word) for word in words]

print(stemmed_words)
```

Output:

```
['run', 'runner', 'run', 'easili']
```


### 3. Lancaster Stemmer
The Lancaster Stemmer is more aggressive than the Porter Stemmer. It uses a more extensive set of rules and applies them iteratively, which often produces much shorter stems (for example, "runner" becomes "run").

```python
from nltk.stem import LancasterStemmer

lancaster_stemmer = LancasterStemmer()

words = ["running", "runner", "runs", "easily"]
stemmed_words = [lancaster_stemmer.stem(word) for word in words]

print(stemmed_words)
```

Output:

```
['run', 'run', 'run', 'easy']
```


### 4. RegexpStemmer
The RegexpStemmer class uses regular expressions (regex) to identify and remove suffixes from words. This stemmer allows for custom rules to be defined, providing more flexibility compared to standard stemmers.

#### How it Works
The RegexpStemmer works by applying regular expression patterns to match suffixes and then removing them from the input word.
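The mechanism can be sketched with the standard `re` module alone. This is a rough approximation under stated assumptions, not NLTK's source: `regex_stem` and its `min_length` parameter are hypothetical names (NLTK's `RegexpStemmer` takes a comparable `min` argument that leaves short words untouched).

```python
import re


def regex_stem(word: str, pattern: str, min_length: int = 0) -> str:
    """Delete whatever `pattern` matches; words shorter than min_length are kept as-is."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


pattern = r"(ing|er|ly)$"  # match one of these suffixes at the end of the word

print(regex_stem("running", pattern))             # runn
print(regex_stem("sing", pattern))                # s
print(regex_stem("sing", pattern, min_length=5))  # sing
```

Because the pattern is applied blindly, a short word like "sing" is cut down to "s" unless a minimum length is set. This is the main trade-off of regex-based stemming: full control over the rules, but no built-in safeguards.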

#### Example:

```python
from nltk.stem import RegexpStemmer

# Define a combined pattern for stemming
pattern = r'(?i)(ing|er|ly)$'

# Create a RegexpStemmer object with the pattern
regexp_stemmer = RegexpStemmer(pattern)

# Words to stem
words = ["running", "runner", "runs", "easily"]

# Apply stemming
stemmed_words = [regexp_stemmer.stem(word) for word in words]

print(stemmed_words)
```

Output:

```
['runn', 'runn', 'runs', 'easi']
```


## Conclusion
Stemming is an essential part of text preprocessing in NLP. The choice of stemming algorithm depends on the task at hand. Here's a summary of the stemmers discussed:

1. **Porter Stemmer**: Widely used and efficient, but sometimes strips too much and often yields non-word stems.
2. **Snowball Stemmer**: A refinement of the Porter Stemmer that is more consistent and supports multiple languages; often preferred for general use.
3. **Lancaster Stemmer**: More aggressive than the Porter Stemmer and produces even shorter stems.
4. **RegexpStemmer**: Highly customizable, allowing you to define your own rules using regular expressions.

## Shortcomings
- **Over-Stemming**: Too much of a word is removed, so distinct words can collapse into one stem or the stem no longer makes sense (e.g., an aggressive stemmer may reduce "better" to "bet").
- **Loss of Meaning**: Stemming strips affixes without looking at context (e.g., "running" → "run" whether it is used as a noun or a verb), sometimes losing important nuances in word meanings.
- **Irregular Words**: Words that don’t follow regular suffix patterns may be stemmed incorrectly (e.g., "goose" becomes "goos", which also fails to link it to "geese").
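
One way to spot these problems in practice is to group a vocabulary by stem and inspect which words end up together. The sketch below is a minimal illustration: `group_by_stem` is a hypothetical helper that accepts any stemming function, and `truncate_stem` is a deliberately crude stand-in (it just keeps the first seven characters) so the example runs without extra dependencies. A real stemmer's `stem` method could be passed in instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


def truncate_stem(word: str, length: int = 7) -> str:
    """Crude stand-in stemmer (an assumption for this demo): keep the first `length` characters."""
    return word[:length]


words = ["universe", "university", "universal", "goose", "geese"]
for stem, members in group_by_stem(words, truncate_stem).items():
    print(stem, members)
# univers ['universe', 'university', 'universal']
# goose ['goose']
# geese ['geese']
```

Here three words with different meanings share the stem "univers" (over-stemming), while the irregular pair "goose" and "geese" stays split. Reviewing such groups on a sample of your own data is a quick check on whether a given stemmer suits the task.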