# Stemming & Lemmatization

Both stemming and lemmatization are methods for normalizing documents at the syntactic level. The same word often appears in different forms for grammatical reasons. Consider the following two sentences:

- Dogs make the best friends.
- A dog makes a good friend.

Semantically, both sentences convey essentially the same message, but syntactically they are very different since the vocabulary differs: "dogs" vs. "dog", "make" vs. "makes", "friends" vs. "friend". This is a big problem when comparing documents or when searching for documents in a database. For example, when one uses "dog" as a search term, both sentences should be returned and not just the second one.
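The search problem can be made concrete with a minimal sketch. It assumes a naive exact-token match without any normalization; `naive_tokens` and `search` are hypothetical helper names used only for illustration and are not part of this notebook's `utils` package.

```python
import string

sentences = [
    "Dogs make the best friends.",
    "A dog makes a good friend.",
]

def naive_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace (no stemming or lemmatization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def search(term, docs):
    """Return the documents that contain `term` as an exact token."""
    return [doc for doc in docs if term in naive_tokens(doc)]

matches = search("dog", sentences)
print(matches)  # ['A dog makes a good friend.']
```

Only the second sentence matches, because "dogs" and "dog" are treated as unrelated tokens.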

While the goals of stemming and lemmatization are similar, there are basic differences:

 - **Stemming:** Usually just applies crude heuristics that chop off the ends of words. This may result in terms that are no longer proper words.
 - **Lemmatization:** Uses vocabularies and morphological analysis of words to derive the base form (lemma) of a term.
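The contrast between the two approaches can be sketched with toy versions of each. This is not how the NLTK stemmers or the WordNet lemmatizer work internally; `toy_stem`, `toy_lemmatize`, the suffix list, and the tiny vocabulary are all assumptions made up for illustration.

```python
# Toy "stemmer": blindly strips the first matching suffix from a short, hand-picked list.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least 3 characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy "lemmatizer": looks the word up in a tiny hand-made vocabulary.
TOY_LEMMAS = {
    "dogs": "dog",
    "makes": "make",
    "running": "run",
    "mice": "mouse",
    "went": "go",
}

def toy_lemmatize(word):
    # Words missing from the vocabulary are returned unchanged.
    return TOY_LEMMAS.get(word, word)

for word in ["dogs", "makes", "running", "mice", "went"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules turn "makes" into "mak" and "running" into "runn", neither of which is a proper word, and they cannot handle irregular forms such as "mice" or "went". The vocabulary lookup handles those, but only for words it knows.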

## Import all important packages

In [None]:
import string

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

from nltk.stem import WordNetLemmatizer

from utils.nlputil import remove_punctuation

In [None]:
print (remove_punctuation("Test, 123."))

## Stemming

We first define a few stemmers provided by NLTK.

For more stemmers, see http://www.nltk.org/api/nltk.stem.html

In [None]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Put all stemmers into a list to make their use easier
stemmer_list = [porter_stemmer, snowball_stemmer, lancaster_stemmer]

In [None]:
word_list = ['dogs', 'cats', 'running', 'phones', 'viewed', 'presumably', 'crying', 'went', 'packed', 'worse', 'best', 'mice', 'friends', 'makes']

In [None]:
for word in word_list:
    print (word + ':')
    for stemmer in stemmer_list:
        stemmed_word = stemmer.stem(word)
        print ('\t', stemmed_word)

## Lemmatization

The output of a lemmatizer generally depends on the type of word (noun, verb, or adjective). For example, when "running" is used as an adjective (e.g., "a running tap"), the word is already in its base form. However, when "running" is used as a verb (e.g., "he was running away"), the base form is "run".

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

In [None]:
word_type_list = ['n', 'v', 'a']

for word in word_list:
    print (word + ':')
    for word_type in word_type_list:
        lemmatized_word = wordnet_lemmatizer.lemmatize(word, pos=word_type) # default is 'n'
        print ('\t', word, '=[{}]=>'.format(word_type), lemmatized_word)

To show a complete example, we look ahead and use a Part-of-Speech (POS) tagger that tells us the type of each word in a sentence (see the follow-up tutorial for more details).

In [None]:
from nltk import word_tokenize
from nltk import pos_tag

In [None]:
sentence = "The newest study has shown that cats have a better sense of smell than dogs."
#sentence = "Dogs make the best friends."

In [None]:
# First, tokenize sentence
token_list = word_tokenize(sentence)

# Second, calculate POS tags for each token
pos_tag_list = pos_tag(token_list)

print (pos_tag_list)

The POS tagger distinguishes several dozen word types. However, we are only interested in whether a word is a noun, verb, or adjective. We therefore need to map the output of the POS tagger to the three options "n", "v", and "a".
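The mapping can also be written as a small reusable function. This is a minimal sketch that assumes Penn Treebank style tags (the tagset NLTK's `pos_tag` uses by default); `to_wordnet_pos` is a hypothetical helper name, and falling back to "n" for all other tags is an assumption that mirrors the lemmatizer's default.

```python
def to_wordnet_pos(penn_tag, default="n"):
    """Map a Penn Treebank tag (e.g., "VBD", "NNS", "JJR") to "n", "v", or "a"."""
    # Only the first letter of the tag is needed to tell nouns, verbs, and adjectives apart.
    mapping = {"N": "n", "V": "v", "J": "a"}
    return mapping.get(penn_tag[:1].upper(), default)

for tag in ["NNS", "VBD", "JJR", "DT"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Tags that are neither nouns, verbs, nor adjectives (such as "DT" for determiners) fall back to the default.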

In [None]:
print ('\nOutput of NLTK lemmatizer:\n')
for token, tag in pos_tag_list:
    word_type = 'n'
    tag_simple = tag[0].lower() # Converts, e.g., "VBD" to "v"
    if tag_simple in ['n', 'v']:
        # If the POS tag starts with "n" or "v", we know it's a noun or verb
        word_type = tag_simple 
    elif tag_simple in ['j']:
        # If the POS tag starts with a "j", we know it's an adjective
        word_type = 'a' 
    lemmatized_token = wordnet_lemmatizer.lemmatize(token.lower(), pos=word_type)
    print(token, '=[{}]==[{}]=>'.format(tag, word_type), lemmatized_token)

## Lemmatization with spaCy

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

spaCy already performs lemmatization by default when processing a document without any additional commands.

In [None]:
print ('\nOutput of spaCy lemmatizer:')
doc = nlp(sentence) # doc is an object, not just a simple list
for token in doc:
    print (token.text, '={}=>'.format(token.pos_), token.lemma_) # token is also an object, not a string


Notice that the spaCy lemmatizer, unlike the NLTK lemmatizer, does not convert "better" to "good", although the word is correctly identified as an adjective. On the other hand, "newest" gets converted to "new". The spaCy lemmatizer also converts all tokens to lowercase, which typically does not matter.

## Application use case: document similarity

The helper `preprocess_text()` from `utils.nlputil` takes a document as input and normalizes each word: passing a `stemmer` stems each word, and passing a `lemmatizer` lemmatizes each word. With `return_type='set'`, it returns a set of words (i.e., no duplicates). The helper simply puts together all the individual steps shown previously.

In [None]:
from utils.nlputil import preprocess_text

Print some example output for both variants.

In [None]:
# Show example output of preprocess_text() with a stemmer
print (preprocess_text(sentence, stemmer=porter_stemmer))

# Show example output of preprocess_text() with a lemmatizer
print (preprocess_text(sentence, lemmatizer=wordnet_lemmatizer))

To calculate the similarity between two documents, let's define a second sentence that is semantically similar to the first one, but not syntactically.

In [None]:
# sentence = "The newest study has shown that cats have a better sense of smell than dogs."
sentence_2 = "Some studies show that a cat can smell better than a dog."

For both sentences, we can calculate all 3 different word sets:
- naive (only simple tokenizing)
- stemmed
- lemmatized

In [None]:
naive_word_set_1 = set(word_tokenize(sentence.lower()))
naive_word_set_2 = set(word_tokenize(sentence_2.lower()))

stemmed_word_set_1 = preprocess_text(sentence, stemmer=porter_stemmer, return_type='set')
stemmed_word_set_2 = preprocess_text(sentence_2, stemmer=porter_stemmer, return_type='set')

lemmatized_word_set_1 = preprocess_text(sentence, lemmatizer=wordnet_lemmatizer, return_type='set')
lemmatized_word_set_2 = preprocess_text(sentence_2, lemmatizer=wordnet_lemmatizer, return_type='set')

print (naive_word_set_1)
print (stemmed_word_set_1)
print (lemmatized_word_set_1)

In [None]:
def jaccard_similarity(word_set_1, word_set_2):
    union_set = word_set_1.union(word_set_2)
    intersection_set = word_set_1.intersection(word_set_2)
    similarity = len(intersection_set) / len(union_set)
    return similarity
    

To quantify the similarity between two word sets A and B, we can use the *Jaccard similarity* J(A,B), defined as:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

Intuitively, if A and B are completely different, the size of the intersection $|A\cap B|$ is 0, making the similarity 0. If A and B are identical, the intersection and the union have the same size, making the similarity 1.0.
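A small worked example may help to check the formula by hand. The word sets below are made up for illustration, and `jaccard` is a hypothetical stand-alone variant of the function above; treating two empty sets as identical (similarity 1.0) is an assumption added here to avoid a division by zero.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)

a = {"cat", "dog", "smell"}
b = {"cat", "dog", "study", "show"}

# Intersection: {"cat", "dog"} (2 words); union: 5 words
print(jaccard(a, b))        # 0.4
print(jaccard(a, a))        # 1.0 (identical sets)
print(jaccard({"x"}, {"y"}))  # 0.0 (no overlap)
```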

In [None]:
print (jaccard_similarity(naive_word_set_1, naive_word_set_2))
print (jaccard_similarity(stemmed_word_set_1, stemmed_word_set_2))
print (jaccard_similarity(lemmatized_word_set_1, lemmatized_word_set_2))