### Stemming
- Stemming is a text processing technique used in Natural Language Processing (NLP) that reduces words to their root/base form by removing prefixes or suffixes.
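- e.g. 1: if corpus = [eat, eaten, eating], ideal word root = [eat]
- e.g. 2: if corpus = [go, going, gone, goes], ideal word root = [go]
- In practice, suffix-stripping stemmers miss irregular forms: PorterStemmer and SnowballStemmer both leave eaten unchanged.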
- Python classes used for Stemming (all in nltk.stem)
    - 1) PorterStemmer
    - 2) RegexpStemmer
        - This class takes a single regular expression and removes every substring that matches it. Anchor the pattern with ^ or $ to target prefixes or suffixes.
    - 3) SnowballStemmer
        - Generally gives better results than PorterStemmer (its English stemmer is the revised Porter2 algorithm)
        - Multi-language support (e.g. English, Arabic, German, French, etc.)
- **Purpose of Stemming**
    - Stemming maps similar words to one token, so vectorization creates a common vector instead of a separate vector for each of the similar words.
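
The effect on vectorization can be sketched without NLTK. The snippet below is a minimal illustration, not NLTK's implementation: `toy_stem` is a hypothetical helper that strips one suffix with a regular expression, and the token list is made up for the example. It shows the vocabulary (and so the vector length) shrinking once similar words share a stem.

```python
import re
from collections import Counter

# Hypothetical helper for illustration only: strips one trailing 'ing' or 's'.
# Real stemmers (e.g. NLTK's PorterStemmer) apply many more rules.
SUFFIXES = re.compile(r"(ing|s)$")

def toy_stem(word, min_length=4):
    """Lowercase the word and strip one suffix if the word is long enough."""
    word = word.lower()
    return SUFFIXES.sub("", word) if len(word) >= min_length else word

tokens = "eating eats eat walks walking walk".split()

raw_counts = Counter(tokens)                         # one feature per surface form
stem_counts = Counter(toy_stem(t) for t in tokens)   # one feature per stem

print(len(raw_counts), "features without stemming:", dict(raw_counts))
print(len(stem_counts), "features with stemming:", dict(stem_counts))
```

Six surface forms collapse into two stems here, so a bag-of-words vector built on the stems would have two dimensions instead of six.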


### Disadvantage of Stemming
- Stemming is a poor fit for chatbots because most stemmers do not return a valid root word for every input (e.g. congratulations -> congratul).
- Lemmatization is the better technique for chatbots.
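
A small sketch of the two failure modes behind this disadvantage. `strip_suffix` is a hypothetical one-rule stand-in for a stemmer (far cruder than Porter), and the word lists are made up; the point is only that suffix stripping can return non-words and can merge unrelated words.

```python
import re

def strip_suffix(word):
    """Hypothetical one-rule stemmer: drop a trailing 'ing' or 's'."""
    return re.sub(r"(ing|s)$", "", word.lower())

# Non-words: stems that could not be shown to a user as-is.
non_word_examples = {w: strip_suffix(w) for w in ["studies", "this"]}

# Over-stemming: two unrelated words collapse to the same stem.
collisions = {w: strip_suffix(w) for w in ["news", "new"]}

print(non_word_examples)
print(collisions)
```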

### Comparison of PorterStemmer vs RegexpStemmer vs Snowball Stemmer
<table border="1" cellpadding="8" cellspacing="0">
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Porter Stemmer</th>
      <th>Regexp Stemmer</th>
      <th>Snowball Stemmer</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Type</td>
      <td>Rule-based algorithmic stemmer</td>
      <td>Pattern-based stemmer</td>
      <td>Improved rule-based stemmer (Porter v2)</td>
    </tr>
    <tr>
      <td>Underlying Approach</td>
      <td>Applies a fixed sequence of linguistic rules and suffix stripping</td>
      <td>Uses user-defined regular expressions</td>
      <td>Applies optimized and standardized stemming rules</td>
    </tr>
    <tr>
      <td>Language Support</td>
      <td>English only</td>
      <td>Language-agnostic (regex dependent)</td>
      <td>Multiple languages</td>
    </tr>
    <tr>
      <td>Configurability</td>
      <td>Not configurable</td>
      <td>Highly configurable</td>
      <td>Limited (predefined per language)</td>
    </tr>
    <tr>
      <td>Accuracy</td>
      <td>Moderate</td>
      <td>Low to Moderate</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Consistency</td>
      <td>Reasonably consistent</td>
      <td>Depends on regex quality</td>
      <td>Very consistent</td>
    </tr>
    <tr>
      <td>Handling Edge Cases</td>
      <td>Average</td>
      <td>Weak</td>
      <td>Strong</td>
    </tr>
    <tr>
      <td>Risk of Over-Stemming</td>
      <td>Medium</td>
      <td>High</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>Risk of Under-Stemming</td>
      <td>Medium</td>
      <td>Medium</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>Output Quality</td>
      <td>May produce non-words</td>
      <td>Often produces non-words</td>
      <td>Cleaner stems</td>
    </tr>
    <tr>
      <td>Performance / Speed</td>
      <td>Fast</td>
      <td>Very fast</td>
      <td>Fast</td>
    </tr>
    <tr>
      <td>Ease of Use</td>
      <td>Easy</td>
      <td>Requires regex expertise</td>
      <td>Easy</td>
    </tr>
    <tr>
      <td>Best Use Case</td>
      <td>General-purpose English text preprocessing</td>
      <td>Custom domain-specific normalization</td>
      <td>Production-grade and multilingual NLP pipelines</td>
    </tr>
    <tr>
      <td>Example (studies)</td>
      <td>studi</td>
      <td>studi / study (depends on regex)</td>
      <td>studi</td>
    </tr>
     <tr>
      <td>Case</td>
      <td>Changes case to Lowercase</td>
      <td>Does not change Case</td>
      <td>Changes case to Lowercase</td>
    </tr>
  </tbody>
</table>


### Comparison of Stemming and Lemmatization
<table>
<tr>
    <td></td>
    <td>Stemming</td>
    <td>Lemmatization</td>
</tr>
<tr>
    <td>Definition</td>
    <td>A text normalization technique that reduces words to their root form by removing prefixes or suffixes</td>
    <td>A text normalization technique that reduces words to their dictionary (base) form, called a <b>lemma</b></td>
</tr>
<tr>
    <td>Linguistic Knowledge</td>
    <td>Uses heuristic suffix-stripping rules only; no vocabulary or morphological analysis</td>
    <td>Uses linguistic rules, morphology and vocabulary</td>
</tr>
<tr>
    <td>Output (words)</td>
    <td>May produce non-meaningful or invalid words</td>
    <td>Produces valid dictionary words (a word missing from the vocabulary is typically returned unchanged)</td>
</tr>
<tr>
    <td>Accuracy</td>
    <td>Lower accuracy</td>
    <td>Higher accuracy</td>
</tr>
<tr>
    <td>Context Awareness</td>
    <td>Context free</td>
    <td>Context aware (often uses POS tagging)</td>
</tr>
<tr>
    <td>Part of Speech (POS)</td>
    <td>POS not required</td>
    <td>POS tagging often required</td>
</tr>
<tr>
    <td>Speed / Performance</td>
    <td>Faster</td>
    <td>Slower due to linguistic analysis</td>
</tr>
<tr>
    <td>Example</td>
    <td>running -> run, studies -> studi</td>
    <td>running -> run (when tagged as a verb), studies -> study</td>
</tr>
<tr>
    <td>Handling irregular words</td>
    <td>Poor</td>
    <td>Excellent</td>
</tr>
<tr>
    <td>Language dependency</td>
    <td>Rules are written per language, but no dictionary is required</td>
    <td>Language dependent</td>
</tr>
<tr>
    <td>Use cases</td>
    <td>Search engines, indexing, quick text processing</td>
    <td>Chatbots, semantic analysis, NLP pipelines requiring precision, Q&A, Text summarization</td>
</tr>
<tr>
    <td>Common Libraries</td>
    <td>NLTK - PorterStemmer, SnowballStemmer</td>
    <td>NLTK - WordNetLemmatizer, spaCy</td>
</tr>
<tr>
    <td>Case</td>
    <td>PorterStemmer and SnowballStemmer convert the stemmed words to lowercase (preferred)</td>
    <td>WordNetLemmatizer keeps the original case of the lemmatized words</td>
</tr>
<tr>
    <td>Preferred Approach</td>
    <td>Stemming is not the preferred approach compared to lemmatization</td>
    <td>Lemmatization is the preferred approach mainly because it (1) has higher accuracy than stemming and (2) uses the context and vocabulary of the language (e.g. English)</td>
</tr>

</table>
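
The contrast in the table can be sketched with a toy dictionary lookup. This is not how WordNetLemmatizer is implemented: `LEMMAS` is a tiny hand-written vocabulary, and `toy_lemmatize` and `toy_stem` are hypothetical helpers. They are used only to show why a lemmatizer needs a vocabulary and a part-of-speech tag while a stemmer needs neither.

```python
import re

# Hand-written (word, part of speech) -> lemma entries, for illustration only.
LEMMAS = {
    ("went", "v"): "go",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up with its POS tag; return it unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

def toy_stem(word):
    """Context-free suffix stripping: no vocabulary, no POS tag."""
    return re.sub(r"(ing|s)$", "", word.lower())

for word, pos in [("went", "v"), ("studies", "n"), ("meeting", "n"), ("meeting", "v")]:
    print(f"{word} ({pos}): stem = {toy_stem(word)}, lemma = {toy_lemmatize(word, pos)}")
```

The irregular form went and the noun/verb split for meeting are resolved only by the lookup; the suffix rule gives the same stem regardless of the tag.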




```python
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer
```

```python
# 1) Stemming using PorterStemmer
# For some words PorterStemmer does not give a valid word_root,
# e.g. word = congratulations, word_root = congratul
corpus = ['eating', 'eats', 'eaten', 'writing', 'writes', 'programming', 'programs', 'congratulations']

porter_stemmer = PorterStemmer()
for word in corpus:
    word_root = porter_stemmer.stem(word)
    print(f"word = {word}, word_root = {word_root}")
```

```
word = eating, word_root = eat
word = eats, word_root = eat
word = eaten, word_root = eaten
word = writing, word_root = write
word = writes, word_root = write
word = programming, word_root = program
word = programs, word_root = program
word = congratulations, word_root = congratul
```


```python
# 2) Stemming using RegexpStemmer
# $ anchors the pattern at the end of the word (it is not a wildcard), so:
# - ing$ strips 'ing' only when the word ends with it
# min=4 leaves words shorter than 4 characters unchanged
regexp_stemmer = RegexpStemmer('ing$|s$|able$', min=4)

# 'ingeating' keeps its leading 'ing' because the pattern only matches at the end
for word in ['eating', 'ingeating']:
    word_root = regexp_stemmer.stem(word)
    print(f"word = {word}, word_root = {word_root}")
```

```
word = eating, word_root = eat
word = ingeating, word_root = ingeat
```
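
The role of the anchor can be checked with Python's `re` module alone. The sketch below assumes that stemming with a pattern amounts to deleting every match of it; the three patterns and the variable names are chosen only for illustration.

```python
import re

word = "ingeating"

# 'ing$' matches only at the end of the word (suffix).
suffix_only = re.sub(r"ing$", "", word)

# '^ing' matches only at the start of the word (prefix).
prefix_only = re.sub(r"^ing", "", word)

# Unanchored 'ing' matches anywhere, here at both ends.
unanchored = re.sub(r"ing", "", word)

print(suffix_only, prefix_only, unanchored)
```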


```python
# 3) Stemming using SnowballStemmer
corpus = ['eating', 'eats', 'eaten', 'writing', 'writes', 'programming', 'programs', 'congratulations']

snowball_stemmer = SnowballStemmer("english")
for word in corpus:
    word_root = snowball_stemmer.stem(word)
    print(f"word = {word}, word_root = {word_root}")
```

```
word = eating, word_root = eat
word = eats, word_root = eat
word = eaten, word_root = eaten
word = writing, word_root = write
word = writes, word_root = write
word = programming, word_root = program
word = programs, word_root = program
word = congratulations, word_root = congratul
```
