## Day23 - NLP

**Stemming** is the process of reducing the inflected or derived variants of a word to a common root/base form, the stem. For example, ['likes', 'liked', 'liking'] all become 'like'. It is extremely useful for reducing the number of distinct words, so that a search for one form of a word also retrieves the others. The resulting stem is not necessarily a dictionary word. There are several algorithms; the most common ones are **Porter** and **Snowball**. The Porter stemmer is based on the idea that suffixes in English are made up of combinations of smaller, simpler suffixes, which it strips in a sequence of steps. The Snowball stemmer is considered the evolution of Porter's, and it can also be used for non-English words!
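To make the suffix-stripping idea concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm: the suffix list and the `toy_stem` helper are made up for illustration, and the minimum stem length of three characters is an arbitrary assumption.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The suffix list is hypothetical and hand-picked; longer suffixes come first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["likes", "liked", "liking"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['lik', 'lik', 'lik']
```

All three variants collapse to the same stem, which is the point of stemming. The toy version stops at 'lik', while a real stemmer such as Porter applies further rules (for instance, restoring a final 'e' in some cases).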

In [49]:
from nltk import word_tokenize, PorterStemmer, SnowballStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")
snowball_ita = SnowballStemmer(language="italian")

In [50]:
# First results on Google News
text = "NASA completes major test on rocket that could take humans back to the moon"
# English: "CEO wasn't enough: Elon Musk proclaims himself «Technoking» of Tesla"
text_ita = "Ceo non bastava: Elon Musk si autoproclama «Technoking» di Tesla"

tokens = word_tokenize(text)
tokens_ita = word_tokenize(text_ita)
porter_tokens = [porter.stem(t) for t in tokens]
snowball_tokens = [snowball.stem(t) for t in tokens]

print(porter_tokens)
print(snowball_tokens)

['nasa', 'complet', 'major', 'test', 'on', 'rocket', 'that', 'could', 'take', 'human', 'back', 'to', 'the', 'moon']
['nasa', 'complet', 'major', 'test', 'on', 'rocket', 'that', 'could', 'take', 'human', 'back', 'to', 'the', 'moon']


In [51]:
# Snowball with Italian rules vs. Porter, which only knows English suffixes
snowball_tokens_ita = [snowball_ita.stem(t) for t in tokens_ita]
porter_tokens_ita = [porter.stem(t) for t in tokens_ita]

print(porter_tokens_ita)
print(snowball_tokens_ita)

['ceo', 'non', 'bastava', ':', 'elon', 'musk', 'si', 'autoproclama', '«', 'technok', '»', 'di', 'tesla']
['ceo', 'non', 'bast', ':', 'elon', 'musk', 'si', 'autoproclam', '«', 'technoking', '»', 'di', 'tesl']
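To see which languages the Snowball stemmer can handle, NLTK exposes them through the `SnowballStemmer.languages` attribute. The snippet below is a small sketch that assumes `nltk` is installed; it needs no extra corpus download.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by SnowballStemmer(language=...)
supported = SnowballStemmer.languages
print(supported)

# Both stemmers used in this notebook are available
print("english" in supported, "italian" in supported)
```

Passing a language that is not in this tuple raises a `ValueError`, so checking membership first is a cheap safeguard.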


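**Lemmatization** is the process of grouping together the different inflected forms of a word under its dictionary form, the lemma. It is similar to stemming, but here we always obtain meaningful words. It is more precise; on the other hand, the computation is slower, because it relies on a vocabulary lookup (here WordNet) instead of simple suffix rules.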

In [54]:
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every token as a noun unless a pos argument is given
lt_tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(lt_tokens)

['NASA', 'completes', 'major', 'test', 'on', 'rocket', 'that', 'could', 'take', 'human', 'back', 'to', 'the', 'moon']
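In the output above, 'humans' became 'human' but 'completes' was left untouched. That is because `lemmatize()` looks each word up as a noun by default (`pos="n"`); passing `pos="v"` makes it treat the word as a verb. The sketch below mimics this behaviour with a toy lookup table in plain Python. The `LEMMAS` table and `toy_lemmatize` helper are hypothetical and hand-written for illustration; they do not reimplement WordNet.

```python
# Toy lemma table keyed on (word, part of speech).
# Hypothetical, hand-written entries; a real lemmatizer uses a full lexicon.
LEMMAS = {
    ("completes", "v"): "complete",
    ("humans", "n"): "human",
    ("took", "v"): "take",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary form if known for this POS, else the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


as_noun = toy_lemmatize("completes")           # looked up as a noun: no entry
as_verb = toy_lemmatize("completes", pos="v")  # looked up as a verb
print(as_noun, as_verb)  # completes complete
```

The same word gives different results depending on the part of speech, which is why lemmatization usually works best when combined with a POS tagger.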
