# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision [here](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.
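
To make the "chops off letters from the end" idea concrete, here is a minimal sketch of a naive suffix stripper. The `SUFFIXES` tuple and the `naive_stem` function are hypothetical names invented for this illustration; they are not part of spaCy or nltk, and the minimum-length check is an arbitrary assumption.

```python
# Hypothetical, deliberately tiny suffix list (an assumption for illustration only)
SUFFIXES = ("ing", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["boat", "boats", "boating", "boater", "running", "ran"]:
    print(w + " -> " + naive_stem(w))
```

The four "boat" variants all collapse to `boat`, but `running` becomes `runn` and the irregular form `ran` is left alone, which is the kind of exception that motivates more careful rule sets.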

Because spaCy has no stemmer, we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk, visit https://www.nltk.org/

## Porter Stemmer
One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined, such as:

<img src="../stemming1.png" width="400">

Within a given set of stemming rules, only one rule is applied: the one whose suffix S1 is the longest match for the word. Thus, `caresses` reduces to `caress` but not `cares`.
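
The longest-suffix selection can be sketched in a few lines. The four rules below are the step 1a rules from Porter's published definition; whether they match the figure above exactly is an assumption, and `STEP_1A_RULES` and `step_1a` are hypothetical names, not nltk's implementation.

```python
# (suffix, replacement) pairs for step 1a of Porter's algorithm
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step_1a(word):
    """Apply only the rule with the longest suffix that matches the word."""
    matches = [(s, r) for s, r in STEP_1A_RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda rule: len(rule[0]))
    return word[: -len(suffix)] + replacement


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w + " -> " + step_1a(w))
```

Because `sses` is longer than `s`, `caresses` becomes `caress` rather than `caresse`, and the identity rule `ss -> ss` protects `caress` from losing its final `s`.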

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:

<img src="../stemming2.png" width='500'>

Here `m>0` is a condition on the "measure" of the stem, which roughly counts the vowel-consonant sequences the stem contains. Requiring `m>0` means the rule is applied to all but the most basic stems.
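
As a rough sketch of how the measure works: Porter's definition writes every word as `[C](VC)^m[V]`, where `C` is a run of consonants, `V` is a run of vowels, and `m` is the measure. The helper names below (`is_consonant`, `measure`, `apply_eed_rule`) are hypothetical, lowercase input is assumed, and the `(m>0) EED -> EE` rule is taken from Porter's definition rather than from the figure above.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences in the stem, i.e. m in [C](VC)^m[V]."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Each switch from a vowel run to a consonant run contributes one "VC"
    return pattern.count("VC")


def apply_eed_rule(word):
    """(m>0) EED -> EE: only applied when the remaining stem has measure > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word


for w in ["tree", "trouble", "private"]:
    print(w, measure(w))

for w in ["agreed", "feed"]:
    print(w + " -> " + apply_eed_rule(w))
```

With this sketch, `agreed` becomes `agree` because the stem `agr` has measure 1, while `feed` is left alone because the stem `f` has measure 0.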

In [3]:
# Import NLTK and its Porter stemmer class
import nltk
from nltk.stem.porter import PorterStemmer

In [4]:
# Create a Porter stemmer instance
p_stemmer = PorterStemmer()

words = ['run', 'runner', 'ran', 'runs', 'running', 'easily', 'fairly']

# Print each word next to its Porter stem
for word in words:
    print(word + "--->" + p_stemmer.stem(word))


run--->run
runner--->runner
ran--->ran
runs--->run
running--->run
easily--->easili
fairly--->fairli


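<font color=blue>Note that "runner" is left unchanged. The stemmer has no notion of parts of speech; its rule for stripping "-er" simply does not fire on a stem as short as "runn", so "runner" is not reduced to "run". The irregular form "ran" is also untouched, since no suffix rule can relate it to "run". Finally, the adverbs "easily" and "fairly" are stemmed to the unusual roots "easili" and "fairli", because the algorithm rewrites the final "y" as "i".</font>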
___

## Snowball Algorithm
Snowball is the name of a stemming language also developed by Martin Porter.

The algorithm used here is more accurately called the "English Stemmer" or "Porter 2 Stemmer".

It offers a slight improvement over the original Porter Stemmer, both in logic and speed. In the output below, for example, it reduces "fairly" to "fair" rather than "fairli".


In [7]:
# Run the same words through the Snowball (English) stemmer
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')

for word in words:
    print(word + "---->" + s_stemmer.stem(word))

run---->run
runner---->runner
ran---->ran
runs---->run
running---->run
easily---->easili
fairly---->fair


In [8]:
words1 = ['generous','generation','generously','generate']
for word in words1:
    print(word + "---->" + s_stemmer.stem(word))

generous---->generous
generation---->generat
generously---->generous
generate---->generat
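
Returning to the search use case from the introduction, stems are typically used as keys that group word variants together. The sketch below assumes a hypothetical helper, `group_by_stem`, that accepts any stemming function. To stay runnable without nltk, it uses a lookup table (`recorded_stems`) hard-coded from the Snowball output above as a stand-in for `s_stemmer.stem`.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


# Stand-in for s_stemmer.stem, copied from the Snowball output shown above
recorded_stems = {
    "generous": "generous",
    "generation": "generat",
    "generously": "generous",
    "generate": "generat",
}

groups = group_by_stem(recorded_stems, recorded_stems.get)
for stem, members in groups.items():
    print(stem, members)
```

Passing `s_stemmer.stem` in place of `recorded_stems.get` should produce the same grouping for these four words: "generous" and "generously" share one key, "generation" and "generate" share another, and the two groups stay separate.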
