# Stemming and Its Types - Text Preprocessing in NLP

### Stemming Overview
Stemming is a text preprocessing technique in Natural Language Processing (NLP) that reduces words to a root or base form, known as the "stem." The stem does not have to be a valid dictionary word. The goal is to group the different inflected forms of a word so they can be analyzed as a single item. This shrinks the vocabulary, and with it the dimensionality of the text features, and can improve performance on NLP tasks such as information retrieval, text classification, and sentiment analysis.

##### Stemming in a Classification Problem
Example: classifying product review comments as positive (+ve) or negative (-ve).
Reviews: eating, eaten, eats -> eat; liked, liking, likes -> like
In the above example, different forms of the words "eat" and "like" are reduced to their root forms "eat" and "like," respectively. This allows the NLP model to treat all these variations as the same word, simplifying the analysis and improving efficiency.
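The effect on a bag-of-words representation can be sketched in a few lines. This is a minimal illustration, not a real stemmer: the `STEMS` lookup table, the `bag_of_words` helper, and the two sample reviews are all hypothetical and stand in for whichever stemmer and data are actually used.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer (illustration only).
STEMS = {
    "eating": "eat", "eats": "eat",
    "liked": "like", "liking": "like", "likes": "like",
}

def bag_of_words(text, stem=False):
    """Count tokens, optionally mapping each token to its stem first."""
    tokens = text.lower().split()
    if stem:
        tokens = [STEMS.get(tok, tok) for tok in tokens]
    return Counter(tokens)

# Made-up reviews used only for this sketch.
reviews = ["She likes eating here", "He liked what he eats here"]

raw_vocab = set()
stemmed_vocab = set()
for review in reviews:
    raw_vocab |= set(bag_of_words(review))
    stemmed_vocab |= set(bag_of_words(review, stem=True))

# Number of distinct features without and with stemming.
print(len(raw_vocab), len(stemmed_vocab))
```

With stemming, "likes" and "liked" (and "eating" and "eats") fall into the same feature, so the model sees fewer, denser columns.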

In [24]:
words = ['eats', 'eaten', 'eating', 'liked', 'liking', 'likes', 'programming', 'programmed', 'programmer', 'running', 'runner', 'ran', 'better', 'best', 'good', 'history']

### Porter Stemmer
The Porter Stemmer is one of the most widely used stemming algorithms. It was developed by Martin Porter in 1980. The algorithm applies a series of rules to iteratively remove common morphological and inflectional endings from words in the English language. The Porter Stemmer is known for its simplicity and effectiveness, making it a popular choice for many NLP applications.
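To give a feel for what "a series of rules" means, here is a sketch of only the first rule group (step 1a) of the original Porter algorithm, which handles plural endings. The function name `porter_step_1a` is made up for this illustration. The full algorithm has several more steps with conditions on the remaining stem, and NLTK's implementation adds its own refinements, so its output can differ from this fragment.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm (plural endings).

    Rules are tried in order and the first matching suffix wins.
    """
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The ordering matters: checking "ss" before "s" is what keeps "caress" from being cut down to "cares".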

In [25]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [26]:
for word in words:
    print(word +" ----> "+ stemming.stem(word))

eats ----> eat
eaten ----> eaten
eating ----> eat
liked ----> like
liking ----> like
likes ----> like
programming ----> program
programmed ----> program
programmer ----> programm
running ----> run
runner ----> runner
ran ----> ran
better ----> better
best ----> best
good ----> good
history ----> histori


In [None]:
stemming.stem("congratulations")  # the resulting stem is not a meaningful word

'congratul'

In [28]:
stemming.stem("sitting")

'sit'

### Regex Stemmer Class
The Regex Stemmer uses regular expressions (regex) to identify and remove specific patterns from words in order to reduce them to their root form. This allows custom stemming rules tailored to the text data being processed. In NLTK's `RegexpStemmer`, the `min` argument sets a minimum word length: words shorter than `min` are returned unchanged.


In [29]:
from nltk.stem import RegexpStemmer

reg_stemming = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [30]:
reg_stemming.stem('eating')

'eat'

In [31]:
reg_stemming.stem('ingeating')

'ingeat'

In [32]:
# Without the `$` anchor after "ing", "ing" is removed wherever it occurs, not only at the end of the word
reg_stemming = RegexpStemmer('ing|s$|e$|able$', min=4)
reg_stemming.stem('ingeating')

'eat'
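The behavior seen in the cells above can be reproduced with the standard `re` module. This is a sketch under the assumption that the stemmer simply deletes every match of the pattern unless the word is shorter than the minimum length; `regex_stem` and `min_length` are names invented here, not part of NLTK.

```python
import re

def regex_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`.

    Words shorter than `min_length` are returned unchanged
    (assumed to mirror the `min` argument used above).
    """
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

anchored = r"ing$|s$|e$|able$"    # "ing" only at the end of the word
unanchored = r"ing|s$|e$|able$"   # "ing" anywhere in the word

print(regex_stem("eating", anchored, 4))       # eat
print(regex_stem("ingeating", anchored, 4))    # ingeat
print(regex_stem("ingeating", unanchored, 4))  # eat
print(regex_stem("is", anchored, 4))           # is (too short, left alone)
```

The `$` anchor is what restricts a rule to suffixes. Dropping it turns the rule into "remove this substring everywhere", which is rarely what a stemmer should do.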

### Snowball Stemmer Class
Snowball is a stemming framework created by Martin Porter. Its English stemmer, also known as the Porter2 Stemmer, is an improved version of the original Porter Stemmer. Snowball defines a separate rule set for each supported language, which makes it suitable for a wider range of NLP applications. In NLTK, the language is chosen when constructing `SnowballStemmer`, for example `SnowballStemmer('english')`.

In [34]:
from nltk.stem import SnowballStemmer

snowball_stem = SnowballStemmer('english')

In [35]:
for word in words:
    print(word+" ---> "+snowball_stem.stem(word))

eats ---> eat
eaten ---> eaten
eating ---> eat
liked ---> like
liking ---> like
likes ---> like
programming ---> program
programmed ---> program
programmer ---> programm
running ---> run
runner ---> runner
ran ---> ran
better ---> better
best ---> best
good ---> good
history ---> histori


In [37]:
stemming.stem('fairly'), stemming.stem('sportingly')

('fairli', 'sportingli')

In [38]:
snowball_stem.stem('fairly'), snowball_stem.stem('sportingly')

('fair', 'sport')

In [39]:
snowball_stem.stem('goes'), stemming.stem('goes')

('goe', 'goe')

### Disadvantages of Stemming
1. Overstemming: Stemming algorithms may sometimes reduce words to the same stem even when they have different meanings, leading to loss of information.
2. Understemming: Conversely, stemming may fail to reduce words that should be grouped together, resulting in missed connections between related terms.
3. Language Dependency: Stemming algorithms are often language-specific, and a stemmer designed for one language may not work well for another.
4. Lack of Context: Stemming does not consider the context in which a word is used, which can lead to incorrect interpretations of the stemmed words.
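Understemming is visible in the Porter output shown earlier in this notebook. The sketch below groups the original words by the stems that were printed there; the `porter_stems` dictionary is copied by hand from that output rather than recomputed, so it only illustrates the grouping step.

```python
from collections import defaultdict

# Porter stems copied from the output printed earlier in this notebook.
porter_stems = {
    "eats": "eat", "eaten": "eaten", "eating": "eat",
    "liked": "like", "liking": "like", "likes": "like",
    "programming": "program", "programmed": "program", "programmer": "programm",
    "running": "run", "runner": "runner", "ran": "ran",
    "better": "better", "best": "best", "good": "good",
    "history": "histori",
}

# Invert the mapping: stem -> list of words that share it.
groups = defaultdict(list)
for word, stem in porter_stems.items():
    groups[stem].append(word)

for stem, members in groups.items():
    print(f"{stem:10} <- {members}")

print(len(porter_stems), "words ->", len(groups), "stems")
```

The 16 words collapse to 12 stems, but "running", "runner", and "ran" still end up in three separate groups, as do "good", "better", and "best". A rule-based suffix stripper has no way to connect irregular forms like these.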

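Lemmatization addresses several of these problems: it maps each word to its dictionary form (the lemma), so the result is always a valid word.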