# WWC - Stemming and Lemmatization

**Stemming** is a technique for removing affixes from a word, ending up with a **stem**. For example, the stem of "cooking" is "cook".

Search engines commonly use stemming when indexing words. Instead of storing every form of a word, a search engine can store only the stems, which reduces the size of the index and lets a query match documents that use a different form of the same word.
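As a rough illustration of the indexing idea, the sketch below builds a tiny inverted index keyed by stem. The `build_stem_index` helper and the three-document corpus are hypothetical, made up for this example; real search engines tokenize and index far more carefully.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def build_stem_index(documents):
    """Map each stem to the set of document ids that contain it."""
    stemmer = PorterStemmer()
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenization, good enough for a toy corpus.
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


# Hypothetical toy corpus, for illustration only.
documents = {
    1: "cooking pasta",
    2: "she cooks daily",
    3: "he cooked rice",
}

index = build_stem_index(documents)

# The query is stemmed the same way before the lookup.
query_stem = PorterStemmer().stem("cooking")
print(query_stem, sorted(index[query_stem]))
```

Because "cooking", "cooks" and "cooked" all reduce to the same stem, a single index entry covers all three documents.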

## PorterStemmer

The **Porter Stemming Algorithm**, by Martin Porter, is designed to remove and replace well-known suffixes of English words. The PorterStemmer knows a number of regular word forms and suffixes, and uses that knowledge to transform the input word into a final stem through a series of steps.

The resulting stem is often a shorter word, or at least a common form of the word, that has the same root meaning. It is not guaranteed to be a dictionary word, as the 'cookeri' result below shows.
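To give a feel for what "a series of steps" means, here is a heavily simplified sketch of two rules from Porter's published algorithm: the plural rules of step 1a and the final-y rule of step 1c. It is not NLTK's implementation. The function names are made up for this example, the vowel test is cruder than the real one (which also treats "y" as a vowel in some positions), and all the other steps are omitted.

```python
def step_1a(word):
    """Plural-suffix rules; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def step_1c(word):
    """Replace a final 'y' with 'i' when the rest of the word contains a vowel."""
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1] + "i"
    return word


def toy_stem(word):
    # Rules run in sequence; the full algorithm has more steps between and after these.
    return step_1c(step_1a(word))


for word in ["caresses", "ponies", "cats", "cookery", "sky"]:
    print(word, "->", toy_stem(word))
```

The final-y rule is what turns "cookery" into "cookeri", while "sky" is left alone because nothing before the "y" is a vowel.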


In [1]:
from nltk.stem import PorterStemmer

In [2]:
stemmer = PorterStemmer()

In [3]:
stemmer.stem('cooking')

'cook'

In [4]:
stemmer.stem('cookery')

'cookeri'

## LancasterStemmer

The **Lancaster Stemming Algorithm** was developed at Lancaster University. NLTK includes it as the LancasterStemmer class.

In [5]:
from nltk.stem import LancasterStemmer

In [7]:
stemmer_LS = LancasterStemmer()

In [8]:
stemmer_LS.stem('cooking')

'cook'

In [9]:
stemmer_LS.stem('cookery')

'cookery'
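Since the two stemmers disagree on some words, it can help to run them side by side. The `compare_stemmers` helper below is a hypothetical convenience function written for this example, not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word and stemmer."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer()}
results = compare_stemmers(["cooking", "cookery"], stemmers)

for word, stems in results.items():
    print(word, stems)
```

Both stemmers agree on "cooking", but only the Porter stemmer rewrites "cookery".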

## RegexpStemmer

You can also build your own stemmer with the RegexpStemmer class. It takes a single regular expression and removes every part of the word that matches, wherever the match occurs: as a prefix, as a suffix, or in the middle of the word.

In [10]:
from nltk.stem import RegexpStemmer

In [12]:
stemmer_Rex = RegexpStemmer('ing')

In [13]:
stemmer_Rex.stem('cooking')

'cook'

In [15]:
stemmer_Rex.stem('cookery')

'cookery'

In [16]:
stemmer_Rex.stem('ingleside')

'leside'
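The 'ingleside' result shows that an unanchored pattern also strips matches at the start of a word. One way to restrict the stemmer to suffixes is to anchor the pattern with `$`. The sketch below also uses the constructor's `min` argument, which leaves words shorter than the given length untouched. The word list is just an example.

```python
from nltk.stem import RegexpStemmer

# '$' anchors the pattern to the end of the word;
# min=5 returns words shorter than five characters unchanged.
suffix_stemmer = RegexpStemmer('ing$', min=5)

for word in ['cooking', 'ingleside', 'sing']:
    print(word, '->', suffix_stemmer.stem(word))
```

With this pattern "cooking" still becomes "cook", while "ingleside" and the short word "sing" are returned unchanged.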

## SnowballStemmer

The SnowballStemmer supports 13 non-English languages, in addition to English. To use it, create an instance with the name of the language you are using, and then call the stem() method.

In [17]:
from nltk.stem import SnowballStemmer

In [18]:
SnowballStemmer.languages  # Supported languages

('danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [19]:
spanish_stemmer = SnowballStemmer('spanish')

In [20]:
spanish_stemmer.stem('hola')

'hol'
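Because the language is passed as a plain string, it can be worth checking it against `SnowballStemmer.languages` before creating an instance. The `make_stemmer` helper below is a hypothetical wrapper written for this example; returning `None` for an unknown language is a design choice of the sketch, not NLTK behavior.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language):
    """Return a SnowballStemmer, or None when the language is not supported."""
    if language not in SnowballStemmer.languages:
        return None
    return SnowballStemmer(language)


english_stemmer = make_stemmer('english')
print(english_stemmer.stem('cooking'))

# 'klingon' is a made-up language name used to show the unsupported case.
print(make_stemmer('klingon'))
```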