***
Welcome to this notebook, where we explore the stemming and lemmatization functions in NLTK. Stemming and lemmatization are techniques used in natural language processing to reduce words to a base or root form, so that different inflections of a word can be treated as the same token during analysis. We will look at how these functions work and why they matter in various NLP tasks. By the end, you should have a solid understanding of how stemming and lemmatization can support your text analysis.
<br>
***

### Index:

[1.1 - Stemming](#1.1---Stemming)
<br>
[1.2 - Lemmatization](#1.2---Lemmatization)

In [1]:
import nltk

Let's start by defining the same text that we saw in the first notebook:

In [2]:
eu_definition = '''
The European Union (EU) is a political and economic union of 27 member states that are located primarily 
in Europe. Its members have a combined area of 4,233,255.3 km2 (1,634,469.0 sq mi) and an estimated total 
population of about 447 million. The EU has developed an internal single market through a standardised system of 
laws that apply in all member states in those matters, and only those matters, where members have agreed to act as one. 
EU policies aim to ensure the free movement of people, goods, services and capital within the internal market;
enact legislation in justice and home affairs; and maintain common policies on trade, agriculture, 
fisheries and regional development. Passport controls have been abolished for travel within the Schengen Area.
A monetary union was established in 1999, coming into full force in 2002, and is composed of 19 EU 
member states which use the euro currency. The EU has often been described as a sui generis political entity 
(without precedent or comparison).
'''

### 1 - Stemming and Lemmatization

Stemming and lemmatization are important steps in many NLP pipelines. As we've discussed in the past, computers store strings as binary numeric objects, so words such as "run" and "running" are completely different to any system that compares them literally.
<br>
<br>
Sometimes this is convenient and sometimes it is not. With the rise of neural networks and powerful language models, some researchers argue that in the near future there won't be a need to stem or lemmatize.
<br>
<br>
Nevertheless, for some applications today it is still an important step of the NLP pipeline.
<br>
<br>

### 1.1 - Stemming

***
First, let's understand what stemming is. Stemming is a more "raw" technique than lemmatization: it "cuts" characters off the end of a word, following a set of rules, until only a shorter "stem" remains. This technique, called *suffix stripping*, is based on the written form of the word and not on its semantic root (meaning), so the resulting stem is not always a real word.

There are different stemmers out there - the most famous are:
- PorterStemmer;
- SnowballStemmer;
- LancasterStemmer;

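They are listed roughly in order of "aggressiveness", from least to most aggressive. The main problems that arise from stemming are **under-stemming** (related words are not reduced to the same stem) and **over-stemming** (unrelated words are reduced to the same stem).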
***
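To make suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter, Snowball or Lancaster algorithm: the function name `naive_suffix_strip`, the suffix list and the minimum stem length are all assumptions made up for illustration.

```python
def naive_suffix_strip(word, suffixes=("ing", "ed", "ly", "es", "s"), min_stem_len=3):
    """Toy suffix stripper (hypothetical helper, not a real NLTK stemmer).

    Removes the first suffix in `suffixes` that matches the end of the word,
    as long as at least `min_stem_len` characters remain.
    """
    lowered = word.lower()
    for suffix in suffixes:
        stem = lowered[: len(lowered) - len(suffix)]
        if lowered.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    # No suffix could be removed safely: return the word unchanged
    return lowered


for w in ["run", "running", "new", "news", "bus"]:
    print(w, "->", naive_suffix_strip(w))
```

With these toy rules, `run` and `running` end up with different stems (an example of under-stemming), while `new` and `news` collapse into the same stem (an example of over-stemming). Real stemmers use many more rules precisely to limit both kinds of error.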

`nltk` has some nice implementations of the three stemmers named above:

In [3]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

In [4]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

Let's compare the stemmers on a few different words:

In [5]:
word = 'controversial'

print(lancaster.stem(word))
print(snowball.stem(word))
print(porter.stem(word))

controvers
controversi
controversi


In [6]:
word = 'controversy'

print(lancaster.stem(word))
print(snowball.stem(word))
print(porter.stem(word))

controversy
controversi
controversi


In [7]:
word = 'extraordinaire'

print(lancaster.stem(word))
print(snowball.stem(word))
print(porter.stem(word))

extraordinair
extraordinair
extraordinair


In [8]:
word = 'extraordinary'

print(lancaster.stem(word))
print(snowball.stem(word))
print(porter.stem(word))

extraordin
extraordinari
extraordinari


The Lancaster stemmer usually over-stems words. There is rarely a big difference between the Snowball and Porter stemmers, although they do differ on certain words. A major difference is that the `nltk` implementation of Snowball supports multiple languages, while the Porter stemmer does not.
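As a quick sketch of the multi-language support, the snippet below lists the languages exposed by `SnowballStemmer.languages` and builds a stemmer for a few of them. The example words are arbitrary choices for illustration, and no particular output is claimed for them.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the SnowballStemmer constructor
print(SnowballStemmer.languages)

# Illustrative words only (picked for this sketch, not taken from the notebook)
examples = {"english": "running", "spanish": "corriendo", "german": "laufend"}

for language, word in examples.items():
    stemmer = SnowballStemmer(language)
    print(language, ":", word, "->", stemmer.stem(word))
```

Passing a language that is not in `SnowballStemmer.languages` raises an error, so checking the tuple first is a cheap safeguard.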

**Let's stem an entire sentence using Snowball:**

In [9]:
# Building on what we learned before: first split the text into word tokens
tokenized_eu_definition = nltk.tokenize.word_tokenize(eu_definition)

In [10]:
# Let's now apply our snowball stemmer to every token of eu_definition
stemmed_eu_definition = [snowball.stem(word) for word in tokenized_eu_definition]

**Our text gets a bit... funky:**

In [11]:
' '.join(stemmed_eu_definition)

'the european union ( eu ) is a polit and econom union of 27 member state that are locat primarili in europ . it member have a combin area of 4,233,255.3 km2 ( 1,634,469.0 sq mi ) and an estim total popul of about 447 million . the eu has develop an intern singl market through a standardis system of law that appli in all member state in those matter , and onli those matter , where member have agre to act as one . eu polici aim to ensur the free movement of peopl , good , servic and capit within the intern market ; enact legisl in justic and home affair ; and maintain common polici on trade , agricultur , fisheri and region develop . passport control have been abolish for travel within the schengen area . a monetari union was establish in 1999 , come into full forc in 2002 , and is compos of 19 eu member state which use the euro currenc . the eu has often been describ as a sui generi polit entiti ( without preced or comparison ) .'

Although our text is now almost unreadable for a human, this form may be easier for a computer to work with, particularly in simpler applications. Imagine you want to classify some text as "Positive" (a common case in sentiment analysis): words such as `positive`, `positivity` or `positiveness` might all have some impact on that classification, and you may want to consider them the same type of "expression".
<br>
<br>
We, as humans, can see that these words share the root "positiv", but a computer cannot, as it only sees 1's and 0's. Hence, stemming explicitly tells the computer that it should treat positive, positivity and positiveness as the same *word*.

This has some advantages and some disadvantages.
<br>
<br>
Here are some advantages of stemming:
- Word generalization;
- Increased Data Consistency;
- Reduced Vocabulary Size;

On the downside, stemming has the following drawbacks:
- It is simplistic and may produce errors;
- It causes loss of information;
- With the rise of better language models, it may become obsolete;
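The "reduced vocabulary size" advantage can be sketched by counting distinct words before and after stemming. The word list below is a small hand-picked sample from the EU text, and the variable names are assumptions made for this illustration.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# A few tokens taken from the EU text above
words = ["member", "members", "developed", "development", "policies"]

raw_vocabulary = set(words)
stemmed_vocabulary = {stemmer.stem(w) for w in words}

print(len(raw_vocabulary), "distinct words ->", len(stemmed_vocabulary), "distinct stems")
print(sorted(stemmed_vocabulary))
```

In the stemmed sentence shown earlier, `developed` and `development` both became `develop`, which is exactly the kind of merge that shrinks the vocabulary. The same merge is also where information gets lost: after stemming, we can no longer tell the verb from the noun.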

As a rough proxy, let's check how much of the original text is retained after stemming, measured in number of characters:

In [12]:
stemmed_number_characters = len(''.join(stemmed_eu_definition))
non_stemmed_number_characters = len(''.join(tokenized_eu_definition))

After stemming, we retain ~90% of the original characters of the description:

In [13]:
stemmed_number_characters/non_stemmed_number_characters

0.8956109134045077

Information loss is about **10%** (1 minus the fraction of characters retained).
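The character ratio is only one way to look at this. As a hedged sketch, the two helpers below (hypothetical names, not part of `nltk`) compute the same character ratio and a complementary measure, the share of tokens that stemming actually changed. The sample tokens and stems are copied from the start of the stemmed sentence above.

```python
def character_retention(original_tokens, stemmed_tokens):
    """Ratio of characters after stemming to characters before stemming."""
    return len("".join(stemmed_tokens)) / len("".join(original_tokens))


def share_of_tokens_changed(original_tokens, stemmed_tokens):
    """Fraction of tokens whose stem differs from the lowercased original."""
    changed = sum(o.lower() != s for o, s in zip(original_tokens, stemmed_tokens))
    return changed / len(original_tokens)


# Tokens and their Snowball stems, as they appear in the output above
original = ["political", "and", "economic", "union"]
stemmed = ["polit", "and", "econom", "union"]

print(character_retention(original, stemmed))      # 19 / 25 = 0.76
print(share_of_tokens_changed(original, stemmed))  # 2 / 4 = 0.5
```

The comparison is made on lowercased tokens because the Snowball output above is lowercase, so a change in case alone is not counted as a change made by stemming.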

### 1.2 - Lemmatization

Lemmatization is the process of obtaining the **lemma** of a word, that is, its dictionary (base) form. It differs from stemming in that it relies on the vocabulary and morphological characteristics of the word instead of simply stripping characters.
<br>
<br>
Another important point is that the lemma depends on the `POS` (part-of-speech) of the word. We'll understand what a POS really is in subsequent lectures.
<br>
Let's go for **lemmatization**:

In [14]:
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer() 

WordNet is a lexical database for the English language that establishes semantic relationships between words.

In [15]:
lemmatizer.lemmatize('creating')

'creating'

In [16]:
snowball.stem('creating')

'creat'

It seems weird: while our stemmer is able to produce a "root" for the word creating, **the lemmatizer seems to fail at that simple task.**
<br>
<br>
That's because the lemmatizer needs to know the class (or part-of-speech tag) of the word. When no tag is given, `lemmatize` treats the word as a noun. We need to give the right tag to the lemmatizer:

In [17]:
lemmatizer.lemmatize('creating', 'v')

'create'

A list of the POS classes we can pass to `lemmatize`:
- Adjective: `'a'`
- Noun: `'n'`
- Verb: `'v'`
- Adverb: `'r'`
<br>

To get a correct lemma, we have to map each word in the sentence to its POS class.
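If the tags come from a tagger that uses Penn Treebank style labels (an assumption here, since tagging itself is not covered in this notebook), the mapping to the letters above could look like the sketch below. The helper name `penn_to_wordnet_pos` and the noun fallback are choices made for illustration, not part of `nltk`.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank style tag to a WordNet POS letter.

    Hypothetical helper: only the first two characters of the tag are used,
    and any tag without a match falls back to `default` (noun).
    """
    prefix_to_pos = {
        "JJ": "a",  # adjectives: JJ, JJR, JJS
        "VB": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
        "NN": "n",  # nouns: NN, NNS, NNP, NNPS
        "RB": "r",  # adverbs: RB, RBR, RBS
    }
    return prefix_to_pos.get(tag[:2].upper(), default)


for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can then be passed as the `pos` argument of `lemmatize`. Falling back to noun mirrors what the lemmatizer does when no tag is given.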

An example of obtaining the *lemma* of an adjective, which is where lemmatization really shines:

In [18]:
lemmatizer.lemmatize("worse", pos="a")

'bad'

In [19]:
lemmatizer.lemmatize("better", pos="a")

'good'

And an example with plural nouns:

In [20]:
lemmatizer.lemmatize("cats", pos='n')

'cat'

***
Here we are passing the POS tag ourselves. Is there a way to obtain this POS class automatically?
<br>
Yes, there is! That's what we are going to learn in the next notebook!
***