## Preprocessing: Stemming and Its Types

Stemming and Lemmatization are two common techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. This process helps in normalizing the text and improving the performance of various NLP tasks such as text classification, sentiment analysis, and information retrieval.

1. Stemming:
   Stemming is the process of reducing a word to its base or root form by removing suffixes or prefixes. The resulting stem may not necessarily be a valid word in the language. For example:
   - "running" -> "run"
   - "happiness" -> "happi"

   Stemming algorithms, such as the Porter Stemmer or Snowball Stemmer, are commonly used for this purpose.

2. Lemmatization:
   Lemmatization, on the other hand, reduces a word to its base or dictionary form (lemma) by considering the context and the part of speech. The resulting lemma is always a valid word in the language. For example:
   - "running" -> "run"
   - "better" -> "good"

   Lemmatization is generally more accurate than stemming but also more computationally intensive. It often requires the use of a vocabulary and morphological analysis of words.

In summary, both stemming and lemmatization aim to reduce words to their base form, but they differ in their approach and the quality of the output. Stemming is faster and simpler, while lemmatization is more accurate and context-aware.
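The contrast can be sketched with a toy example. This is only an illustration under stated assumptions: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented here, and real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only; not how any real library implements these steps.
SUFFIXES = ("ing", "ness", "es", "s")  # hypothetical suffix list
LEMMA_TABLE = {"better": "good", "running": "run", "mice": "mouse"}  # hypothetical mini dictionary


def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary and fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["running", "happiness", "better", "mice"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer is cheap but blind: it cuts "happiness" down to a non-word and leaves "better" untouched, while the dictionary lookup returns valid words at the cost of needing a vocabulary.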

In [None]:
'''
Stemming:

Stemming is the process of reducing a word to its stem: the part that remains after affixes 
(suffixes and prefixes) are stripped. The stem approximates the root of the word but, unlike a lemma, 
need not be a valid dictionary word.

Stemming is important in natural language understanding (NLU) and Natural Language Processing (NLP) tasks, 
as it helps in reducing the dimensionality of the text data and improving the efficiency of various algorithms.
'''

## PorterStemmer

In [1]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [5]:
# stemming sample words list
words = ['running', 'eating', 'eat','eaten', 'runner', 'run', 'eats', 'eatingly', 'eatenly', 
         'programming', 'program', 'finally', 'finalized','writes','writing']


In [6]:
for word in words:
    print(word+"-->"+stemming.stem(word))

running-->run
eating-->eat
eat-->eat
eaten-->eaten
runner-->runner
run-->run
eats-->eat
eatingly-->eatingli
eatenly-->eatenli
programming-->program
program-->program
finally-->final
finalized-->final
writes-->write
writing-->write


In [None]:
'''
disadvantages of stemming:
1. Stemming can produce non-words (e.g., "eatingly" -> "eatingli", "eatenly" -> "eatenli").
2. It does not account for the context in which a word is used, which can lead to loss of meaning.
3. Words with different meanings may be reduced to the same stem (over-stemming), 
   e.g., "universe" and "university" both become "univers" with the Porter Stemmer.
4. Irregular forms are not handled: "went" stays "went" instead of being mapped to "go".
'''
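The points about distinct words collapsing onto one stem and irregular forms being left alone can be checked with a small helper. This is a minimal sketch: `group_by_stem` and `crude_stem` are hypothetical names, and `crude_stem` is a stand-in suffix stripper rather than a real stemming algorithm.

```python
from collections import defaultdict


def group_by_stem(stem_fn, words):
    """Map each stem to the list of input words that collapse onto it."""
    groups = defaultdict(list)
    for w in words:
        groups[stem_fn(w)].append(w)
    return dict(groups)


def crude_stem(word):
    # Hypothetical stand-in stemmer: strips one of a few hard-coded suffixes.
    for suffix in ("ity", "al", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


groups = group_by_stem(crude_stem, ["universe", "university", "universal", "went", "go"])

# Stems shared by more than one word are candidates for over-stemming.
collisions = {stem: ws for stem, ws in groups.items() if len(ws) > 1}
print(collisions)
```

Because `group_by_stem` accepts any callable that maps a string to a string, the `stem` method of an NLTK stemmer (for example `stemming.stem` from the cell above) can be passed in place of `crude_stem`.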

## RegexpStemmer Class

In [None]:
'''
NLTK has a RegexpStemmer class that allows for stemming based on regular expressions.
It takes a single regular expression pattern and removes every substring of a word that matches it.
Words shorter than `min` characters are returned unchanged.

'''

In [7]:
from nltk.stem import RegexpStemmer

In [18]:
regexp_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [19]:
regexp_stemmer.stem('running')  # the trailing 'ing' matches 'ing$' and is removed

'runn'

In [20]:
regexp_stemmer.stem('ingrunning')  # '$' anchors the match to the end, so the leading 'ing' is kept

'ingrunn'
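The two results above can be reproduced with the standard `re` module, which helps show why the `$` anchor and the minimum length matter. This is a sketch of what the class appears to do, not NLTK's own code; `regexp_stem` and `min_len` are hypothetical names.

```python
import re


def regexp_stem(word, pattern, min_len=0):
    """Delete every match of `pattern`, unless the word is shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


anchored = r"ing$|s$|e$|able$"

print(regexp_stem("running", anchored, min_len=4))     # trailing suffix removed
print(regexp_stem("ingrunning", anchored, min_len=4))  # leading "ing" kept because of "$"
print(regexp_stem("ingrunning", r"ing", min_len=4))    # unanchored: every "ing" removed
print(regexp_stem("was", anchored, min_len=4))         # too short, returned unchanged
```

Dropping the anchor makes the pattern match anywhere in the word, which is rarely what a suffix stemmer should do.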

## Snowball Stemmer

In [None]:
'''
Snowball Stemmer:
The Snowball Stemmer is a revised version of the Porter algorithm (its English variant is also known as Porter2) 
and supports multiple languages. Like Porter, it is rule-based and strips suffixes in a sequence of steps, 
but its refined rules handle cases such as adverbs ending in "-ly" better.
'''

In [21]:
from nltk.stem import SnowballStemmer

In [22]:
snowball_stemmer = SnowballStemmer(language='english')

In [23]:
for word in words:
    print(word + "-->" + snowball_stemmer.stem(word))

running-->run
eating-->eat
eat-->eat
eaten-->eaten
runner-->runner
run-->run
eats-->eat
eatingly-->eat
eatenly-->eaten
programming-->program
program-->program
finally-->final
finalized-->final
writes-->write
writing-->write


In [None]:
stemming.stem('fairly'), stemming.stem('sportingly')

('fairli', 'sportingli')

In [25]:
snowball_stemmer.stem('fairly'), snowball_stemmer.stem('sportingly')

('fair', 'sport')
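To see where the two stemmers disagree without reading both output lists side by side, the comparison can be wrapped in a small helper. This is a sketch that assumes NLTK is installed; `stem_differences` and `sample` are hypothetical names introduced for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def stem_differences(words):
    """Return {word: (porter_stem, snowball_stem)} for words where the two stemmers disagree."""
    diffs = {}
    for w in words:
        p, s = porter.stem(w), snowball.stem(w)
        if p != s:
            diffs[w] = (p, s)
    return diffs


sample = ["running", "eatingly", "fairly", "sportingly", "program"]
for word, (p, s) in stem_differences(sample).items():
    print(f"{word}: porter={p}, snowball={s}")
```

Only the words on which the stemmers differ are printed, which makes it easier to judge whether the choice of stemmer matters for a given vocabulary.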

In [None]:
'''
Which is better:
1. For English, the Snowball Stemmer is generally preferable to the Porter Stemmer: it applies a refined rule set 
   and, as shown above, reduces words such as "fairly" and "sportingly" to "fair" and "sport".
2. The Snowball Stemmer also supports languages other than English, whereas the Porter Stemmer targets English only.

An even better option is often lemmatization, which uses context and part of speech to convert words to their dictionary form.
'''