# Lemmatization

Here we will use [spaCy](https://spacy.io/) to see the effect of lemmatizing words.

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 12.2 MB/s eta 0:00:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import spacy

# loading the small English model
nlp = spacy.load("en_core_web_sm")

In [3]:
text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
text

'At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction.'

Let's lemmatize all the tokens we can find.

In [4]:
lemmas = [token.lemma_ for token in nlp(text.lower())]
" ".join(lemmas)

'at first , historical linguistic serve as the cornerstone of comparative linguistic primarily as a tool for linguistic reconstruction.[5 ] scholar be concern chiefly with establish language family and reconstruct prehistoric proto - language , use the comparative method and internal reconstruction .'

You can see that "were" was correctly lemmatized to "be".

Note that the result is strongly affected by the quality of the tokenizer. For example, `reconstruction.[5` was badly tokenized: the period and the start of the citation marker `[5]` stayed attached to the preceding word. You can add an 's' at the end of "reconstruction" and see that it's not lemmatized correctly.
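One possible workaround is to strip bracketed citation markers before handing the text to the pipeline. The sketch below uses only the standard library; the `strip_citations` helper and the `CITATION_RE` pattern are hypothetical names introduced for illustration, and the pattern assumes that citations always look like `[5]` or `[12]`.

```python
import re

# Assumption: citation markers are purely numeric and bracketed, e.g. "[5]".
CITATION_RE = re.compile(r"\[\d+\]")


def strip_citations(text: str) -> str:
    """Replace citation markers with a space, then collapse repeated whitespace."""
    without_markers = CITATION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_markers).strip()


sample = "a tool for linguistic reconstruction.[5] Scholars were concerned"
cleaned = strip_citations(sample)
print(cleaned)
# a tool for linguistic reconstruction. Scholars were concerned
```

Passing the cleaned string to `nlp` instead of the raw one should give the tokenizer an ordinary sentence boundary to work with, although how well this helps depends on the text and the model.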

Here is another example, with "went":

In [5]:
import re

# keep only tokens made entirely of word characters (drops punctuation and whitespace)
re_word = re.compile(r"^\w+$")
text = " I went to the cinema"
lemmas = [token.lemma_ for token in nlp(text.lower()) if re_word.match(token.text)]
" ".join(lemmas)

'I go to the cinema'
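To make the effect of the `re_word` filter concrete, here is a small standard-library sketch. The `candidates` list is made up for illustration; it mixes ordinary words with token-like strings similar to those seen in the earlier output.

```python
import re

re_word = re.compile(r"^\w+$")

# Hypothetical token strings: words, punctuation, a badly tokenized chunk, a number.
candidates = ["went", "cinema", ",", "proto", "-", "reconstruction.[5", "]", " ", "2021"]

kept = [s for s in candidates if re_word.match(s)]
print(kept)
# ['went', 'cinema', 'proto', '2021']
```

Note that `\w` also matches digits and underscores, so numbers pass the filter, while anything containing punctuation (including the badly tokenized `reconstruction.[5`) is dropped.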

## Speed

Let's measure how long lemmatization takes on a full dataset, and compare the number of unique tokens with the number of unique lemmas.

In [6]:
from torchtext.datasets import PennTreebank
train, valid, test = PennTreebank()

In [7]:
from tqdm import tqdm

nb_unique_token = set()
nb_unique_lemma = set()
for text in tqdm(train, total=len(train)):
    for token in nlp(text):
        if re_word.match(token.text):
            nb_unique_token.add(token.text)
            nb_unique_lemma.add(token.lemma_)

100%|█████████████████████████████████████████████████████████| 42068/42068 [03:02<00:00, 230.90it/s]


In [8]:
print(f"nb unique token: {len(nb_unique_token)} vs nb unique lemma: {len(nb_unique_lemma)}")

nb unique token: 9616 vs nb unique lemma: 7786
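To put these numbers in perspective, the following sketch computes the relative vocabulary reduction and the approximate throughput from the figures reported above. The variable names are introduced here for illustration, and the elapsed time is read off the progress bar (`03:02`), so the throughput is only approximate.

```python
# Figures copied from the outputs above.
n_tokens = 9616        # unique tokens
n_lemmas = 7786        # unique lemmas
n_sentences = 42068    # lines in the training split
elapsed_s = 3 * 60 + 2 # "03:02" shown by tqdm, rounded to the second

# Fraction of the vocabulary removed by lemmatization.
reduction = (n_tokens - n_lemmas) / n_tokens
print(f"vocabulary reduction: {reduction:.1%}")
# vocabulary reduction: 19.0%

# Average number of sentences processed per second.
throughput = n_sentences / elapsed_s
print(f"throughput: {throughput:.0f} sentences/s")
# throughput: 231 sentences/s
```

In other words, lemmatization removes roughly one fifth of the distinct word forms in this run, at a cost of about three minutes of processing.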


Ask yourself the following questions:
* Why is it much slower than stemming?
* How come we have more unique lemmas than stems?

## Going further

spaCy provides [models](https://spacy.io/usage/models#languages) of different sizes for 18 languages (and two multilingual models). Some of these models support operations such as part-of-speech tagging and named entity recognition. You can learn more about the library by following their [interactive tutorial](https://course.spacy.io/en/) (though the tutorial uses spaCy 2, and not 3 yet).