In [None]:
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

## Stemming

Stemming is a Natural Language Processing (NLP) technique that reduces words to a base or root form, known as the stem, usually by stripping suffixes (and, in some stemmers, prefixes). It is useful for text normalization because it groups different variations of a word (e.g., "running" and "runs") so they can be treated as a single term ("run") in applications like search engines and text analysis. Because stemmers apply rules rather than consult a dictionary, irregular forms such as "ran" are typically left unchanged.
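As a rough illustration of how stems group word variants, the sketch below buckets a few tokens by their Porter stem. The helper name `group_by_stem` and the sample list are made up for this example; only `PorterStemmer` comes from NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(tokens):
    """Map each stem to the list of original tokens that reduce to it."""
    porter = PorterStemmer()
    groups = defaultdict(list)
    for token in tokens:
        groups[porter.stem(token)].append(token)
    return dict(groups)


# Hypothetical sample; any list of lowercase tokens works.
sample = ["running", "runs", "ran", "connected", "connection", "connects"]
for stem, members in group_by_stem(sample).items():
    print(f"{stem}: {members}")
```

A search index built on these keys would treat "running" and "runs" as the same term, while "ran" lands in its own bucket.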

In [None]:
words = [  
    "running", "runs", "ran", "runner",                             
    "jumps", "jumping", "jumped",                                   
    "studies", "studying", "study", "student",    
    "cries", "crying", "cried",   
    "happily", "happiness", "happy",                                                
    "organization", "organize", "organizing", "organizer",    
    "agreement", "agreed", "agreeing", "agree",    
    "computation", "computing", "computed", "computer",                 
    "policy", "policies",                                               
    "connection", "connected", "connecting", "connects",                                       
    "history", "historical",   
    "finally", "final", "finalize",    
    "cats", "cat",   
    "dogs", "dog",    
    "ponies", "pony",    
    "trouble", "troubling", "troubled",     
    "playing", "plays", "played",                                               
    "corpora", "corpus",                                                    
    "better", "best", "good",     
    "wolves", "wolf",     
    "beautiful", "beautifully",     
    "generous", "generously", "generation",      
]                                                 

## Porter Stemmer
The Porter Stemmer, proposed by Martin Porter in 1980, is one of the most widely used and influential stemming algorithms, particularly for English.
* Mechanism: It is a suffix-stripping algorithm that applies several dozen rules across five sequential steps to remove common morphological and inflectional suffixes from English words.
* Characteristics: It is known for its simplicity and speed. It is considered a gentle or less aggressive stemmer, meaning it tends to produce stems that stay close to a real word, although the resulting stem is not always an actual dictionary word (e.g., "argues" $\rightarrow$ "argu").
* Limitation: It is designed only for English.
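To make the suffix-stripping idea concrete, here is a minimal sketch of just one small piece of the algorithm: the plural-handling rules of step 1a as described in Porter's original paper. The function name `step_1a` is an assumption for this example, and NLTK's `PorterStemmer` may differ in details, so treat this as an illustration rather than the library's implementation.

```python
def step_1a(word):
    """Apply the plural rules of Porter's step 1a; the first matching suffix wins."""
    # Rules are ordered longest suffix first so that "sses" is tried before "s".
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (unchanged)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats", "jump"]:
    print(w, "->", step_1a(w))
```

The later steps follow the same pattern but add conditions on what remains of the word before a suffix may be removed.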

In [None]:
# PorterStemmer
stemming = PorterStemmer()

for word in words: 
    print(word+"--->"+stemming.stem(word)) 

running--->run
runs--->run
ran--->ran
runner--->runner
jumps--->jump
jumping--->jump
jumped--->jump
studies--->studi
studying--->studi
study--->studi
student--->student
cries--->cri
crying--->cri
cried--->cri
happily--->happili
happiness--->happi
happy--->happi
organization--->organ
organize--->organ
organizing--->organ
organizer--->organ
agreement--->agreement
agreed--->agre
agreeing--->agre
agree--->agre
computation--->comput
computing--->comput
computed--->comput
computer--->comput
policy--->polici
policies--->polici
connection--->connect
connected--->connect
connecting--->connect
connects--->connect
history--->histori
historical--->histor
finally--->final
final--->final
finalize--->final
cats--->cat
cat--->cat
dogs--->dog
dog--->dog
ponies--->poni
pony--->poni
trouble--->troubl
troubling--->troubl
troubled--->troubl
playing--->play
plays--->play
played--->play
corpora--->corpora
corpus--->corpu
better--->better
best--->best
good--->good
wolves--->wolv
wolf--->wolf
beautiful--->beau

In [5]:
# RegexpStemmer: a simple way to build regular-expression stemmers. It takes a single regex and removes every part of the word that matches it (a prefix, a suffix, or anything in between)

In [9]:
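# $ anchors each alternative to the end of the word; min=4 leaves words shorter than 4 characters untouched
regexstem = RegexpStemmer('ing$|s$|e$|able$', min=4)

# No anchor: "ing" is removed wherever it occurs in the word
st = RegexpStemmer("ing")

regexstem.stem("eating")     # 'eat' (not echoed; the notebook only displays the last expression)
regexstem.stem('constable')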

'const'

In [10]:
st.stem("ingeating")

'eat'
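The behaviour shown above can be approximated with the standard `re` module. This is a sketch under the assumption that the stemmer simply deletes every regex match once the word meets a minimum length; `regex_stem` and `min_length` are names invented for this example, not part of NLTK.

```python
import re


def regex_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`; words shorter than `min_length` pass through."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


suffix_pattern = r"ing$|s$|e$|able$"
print(regex_stem("eating", suffix_pattern, min_length=4))     # eat
print(regex_stem("constable", suffix_pattern, min_length=4))  # const
print(regex_stem("ingeating", "ing"))                         # eat
print(regex_stem("was", suffix_pattern, min_length=4))        # was (too short to stem)
```

Without the `$` anchor the pattern matches anywhere, which is why "ingeating" loses both occurrences of "ing".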

## Snowball Stemmer (Porter2 Stemmer)

The **Snowball Stemmer**, also developed by Martin Porter, is an **enhanced successor** to the original Porter Stemmer. Its English algorithm is often referred to as the **Porter2 Stemmer**.

*   **Mechanism:** It operates based on a set of rules and algorithms, similar to the Porter Stemmer, but with improved and more comprehensive rules that address some of the original stemmer's limitations.
    
*   **Characteristics:**
    
    *   The most significant difference is its **multilingual support**. It can stem text in numerous languages besides English (e.g., Dutch, German, French, Russian), as it is built using the Snowball language, a string processing language for creating stemming algorithms.
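A short sketch of the multilingual support, assuming NLTK is installed. The sample words are illustrative choices for this example, and no outputs are shown because they depend on each language's rules.

```python
from nltk.stem import SnowballStemmer

# Languages supported by the installed NLTK version.
print(SnowballStemmer.languages)

# Hypothetical sample words, one per language.
samples = {"english": "running", "german": "laufen", "french": "courir"}
for language, word in samples.items():
    lang_stemmer = SnowballStemmer(language)
    print(f"{language}: {word} -> {lang_stemmer.stem(word)}")
```

The language name is passed as a lowercase string, and an unsupported name raises an error, so checking `SnowballStemmer.languages` first is a reasonable precaution.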

In [None]:
# Snowball Stemmer
stemmer = SnowballStemmer("english")

for word in words: 
    print(f"Word: {word}\t Stem: {stemmer.stem(word)}") 

Word: running	 Stem: run
Word: runs	 Stem: run
Word: ran	 Stem: ran
Word: runner	 Stem: runner
Word: jumps	 Stem: jump
Word: jumping	 Stem: jump
Word: jumped	 Stem: jump
Word: studies	 Stem: studi
Word: studying	 Stem: studi
Word: study	 Stem: studi
Word: student	 Stem: student
Word: cries	 Stem: cri
Word: crying	 Stem: cri
Word: cried	 Stem: cri
Word: happily	 Stem: happili
Word: happiness	 Stem: happi
Word: happy	 Stem: happi
Word: organization	 Stem: organ
Word: organize	 Stem: organ
Word: organizing	 Stem: organ
Word: organizer	 Stem: organ
Word: agreement	 Stem: agreement
Word: agreed	 Stem: agre
Word: agreeing	 Stem: agre
Word: agree	 Stem: agre
Word: computation	 Stem: comput
Word: computing	 Stem: comput
Word: computed	 Stem: comput
Word: computer	 Stem: comput
Word: policy	 Stem: polici
Word: policies	 Stem: polici
Word: connection	 Stem: connect
Word: connected	 Stem: connect
Word: connecting	 Stem: connect
Word: connects	 Stem: connect
Word: history	 Stem: histori
Word: hist

In [None]:
# Comparison between the Porter and Snowball stemmers

# Porter
print(stemming.stem("fairly"), stemming.stem("sportingly"))

# Snowball
print(stemmer.stem("fairly"), stemmer.stem("sportingly"))

fairli sportingli
fair sport
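To look for further differences, a small helper can run both stemmers over a list and keep only the words they treat differently. The function name `stem_differences` and the token list are assumptions for this sketch.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def stem_differences(tokens):
    """Return (token, porter_stem, snowball_stem) for tokens the two stemmers treat differently."""
    rows = []
    for token in tokens:
        p, s = porter.stem(token), snowball.stem(token)
        if p != s:
            rows.append((token, p, s))
    return rows


for token, p, s in stem_differences(["fairly", "sportingly", "running", "jumps"]):
    print(f"{token}: Porter={p}, Snowball={s}")
```

Passing the full `words` list from earlier in the notebook is a quick way to see how often the two algorithms actually disagree.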


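Stemming has its limitations: stems are often not real words (e.g., "happili", "studi"), and irregular forms such as "ran" or "better" are not mapped to their base word. For preprocessing text to train chatbots, lemmatization and other techniques are generally better suited.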