# Lemmatizing and Stemming
Finding the base form of a word is one of the most common operations in Natural
Language Processing, because it is useful in so many situations. Take the case of
a search engine that should return similar results for related queries
(e.g. “write book” / “writing books”). To do that, the search engine has to reduce
the words in each query to their base forms.
There are two distinct approaches to this task:
a. Stemming, which is an algorithmic method of shaving off prefixes and
suffixes from a given word form
b. Lemmatizing, which is a corpus-based method for finding the base form of a
word


There are fundamental differences between these two methods, each having advantages
and disadvantages. Here’s a short comparison:
- Both stemmers and lemmatizers bring inflected words to the same form, or at
least they try to
- Stemmers can’t guarantee a valid word as a result (in fact, the result usually
isn’t a valid word)
- Lemmatizers return a valid word (the dictionary form) for every word form they
know; unknown forms are returned unchanged
- Stemmers are faster
- Lemmatizers depend on a corpus (basically a dictionary) to perform the task


## How stemmers work
When stemming, you apply a series of rules to a word form until it is reduced to
its basic form and no further rule can be applied. This method is fast and is the
best choice for most real-world applications. The downside of stemming is the
uncertainty of the result, which usually isn’t a valid word. On the bright side,
two related words can resolve to the same stem. Let’s see an example:




```python
# Snowball Stemmer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# All resolve to `friend`
print(stemmer.stem('friend'))
print(stemmer.stem('friends'))
print(stemmer.stem('friendly'))

# Not working well with irregulars
print(stemmer.stem('drink'))  # `drink`
print(stemmer.stem('drunk'))  # `drunk`

# Not a proper word (`slowli`)
print(stemmer.stem('slowly'))

# Works with non-existing words
```

```
friend
friend
friend
drink
drunk
slowli
```
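
To make the rule-application loop concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. The rule list `RULES`, the `MIN_STEM_LENGTH` guard, and the function name `toy_stem` are all made up for illustration. This is not the Snowball algorithm, which uses a much larger and more careful rule set.

```python
# Hypothetical, hand-picked rules: (suffix, replacement).
# These are illustrative only and are NOT the Snowball rule set.
RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ly", ""),
    ("s", ""),
    ("e", ""),
]

# Assumed guard so that short words are not stripped down to nothing.
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Apply the first matching rule repeatedly until no rule applies."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in RULES:
            if not stem.endswith(suffix):
                continue
            candidate = stem[: len(stem) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                stem = candidate
                changed = True
                break  # restart from the first rule
    return stem


for word in ["friend", "friends", "friendly", "write", "writing", "books"]:
    print(word, "->", toy_stem(word))
```

With these toy rules, both “write book” and “writing books” reduce to the stems `writ` and `book`, which is the kind of match a search engine needs, even though `writ` is not the base form `write`.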


## How lemmatizers work
You can think of a lemmatizer as a huge Python dictionary in which the word forms
are the keys and the base forms are the values. Because the same form can belong
to words with different parts of speech, we need to specify the part of speech
(NLTK's `lemmatize` assumes a noun, `'n'`, when none is given).

*WordNet Lemmatizer*

```python
# Requires the WordNet data; run nltk.download('wordnet') once beforehand
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Different parts of speech with the same form
print(lemmatizer.lemmatize('haunting', 'n'))  # `haunting`
print(lemmatizer.lemmatize('haunting', 'v'))  # `haunt`

# Resolve to `friend`
print(lemmatizer.lemmatize('friend', 'n'))
print(lemmatizer.lemmatize('friends', 'n'))

# Resolving to `friendly` because it's a different part of speech
print(lemmatizer.lemmatize('friendly', 'a'))

# Working well with irregulars
print(lemmatizer.lemmatize('drink', 'v'))  # `drink`
print(lemmatizer.lemmatize('drunk', 'v'))  # `drink`

# Always a proper word (`slowly`)
print(lemmatizer.lemmatize('slowly', 'r'))

# Not working with non-existing words (`xyzing`)
print(lemmatizer.lemmatize('xyzing', 'v'))
```

```
haunting
haunt
friend
friend
friendly
drink
drink
slowly
xyzing
```
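
The dictionary analogy can be sketched directly in plain Python. The lexicon below is a hypothetical, hand-written table with a handful of entries, and `toy_lemmatize` is an illustrative name. A real lemmatizer such as NLTK's `WordNetLemmatizer` relies on the far larger WordNet database and is more involved than a single lookup.

```python
# Hypothetical miniature lexicon: (word form, part of speech) -> base form.
# The entries are hand-written for illustration, not taken from WordNet.
LEXICON = {
    ("friends", "n"): "friend",
    ("haunting", "n"): "haunting",
    ("haunting", "v"): "haunt",
    ("drunk", "v"): "drink",
    ("drank", "v"): "drink",
}


def toy_lemmatize(word, pos="n"):
    """Look up the base form; return the word unchanged if it is unknown."""
    return LEXICON.get((word.lower(), pos), word)


# The same form maps to different base forms depending on part of speech
print(toy_lemmatize("haunting", "n"))  # haunting
print(toy_lemmatize("haunting", "v"))  # haunt

# Irregular forms are just more entries in the table
print(toy_lemmatize("drunk", "v"))  # drink

# Unknown forms fall through untouched
print(toy_lemmatize("xyzing", "v"))  # xyzing
```

Keying the table on the pair (form, part of speech) is what lets `haunting` resolve in two different ways, and the fallback in `LEXICON.get` mirrors how an unknown form such as `xyzing` comes back unchanged.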
