<a href="https://colab.research.google.com/github/drpetros11111/NLP_Portilia/blob/NLP_Spacy_Basics_1/03_Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
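
To make the contrast concrete, here is a minimal pure-Python sketch. It is a toy under stated assumptions, not how spaCy or any real stemmer works: `naive_stem`, `toy_lemma`, and the `IRREGULAR_LEMMAS` table are hypothetical names introduced only for illustration, and the table holds just the two irregular forms mentioned above.

```python
# Toy illustration only: neither function reflects how spaCy or a real
# stemmer is implemented.

# Hypothetical vocabulary table built from the examples in the text above.
IRREGULAR_LEMMAS = {"was": "be", "mice": "mouse"}


def naive_stem(word):
    """Strip a few common suffixes without consulting any vocabulary."""
    for suffix in ("ing", "ed", "s"):
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the vocabulary table; fall back to the word itself."""
    return IRREGULAR_LEMMAS.get(word, word)


for word in ("was", "mice", "meeting"):
    print(f"{word:10} stem={naive_stem(word):10} lemma={toy_lemma(word)}")
```

The suffix rule can never turn `mice` into `mouse`, while the vocabulary lookup can. Neither toy function sees the sentence, so neither can decide between `meet` and `meeting`. That decision needs the part-of-speech information spaCy uses below.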

In [None]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


# SCONJ stands for Subordinating Conjunction.

A subordinating conjunction connects a subordinate (dependent) clause to a main (independent) clause.

Here are some examples of subordinating conjunctions:

    after
    although
    as
    because
    before
    even if
    even though
    if
    once
    since
    so that
    than
    that
    though
    unless
    until
    when
    whenever
    where
    whereas
    wherever
    while
For example, in the sentence "I will eat if I am hungry", "if" is a subordinating conjunction.


<font color=green>In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication.</font>
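
As a small sketch of what sharing a lemma hash means, the snippet below groups surface forms by the hash values printed above. It uses plain Python with rows copied by hand from that output rather than calling spaCy, and `forms_by_hash` is just an illustrative name.

```python
from collections import defaultdict

# (text, lemma hash, lemma string) rows copied from the printed output above.
rows = [
    ("runner", 12640964157389618806, "runner"),
    ("running", 12767647472892411841, "run"),
    ("run", 12767647472892411841, "run"),
    ("ran", 12767647472892411841, "run"),
]

# Group surface forms under the hash of their lemma.
forms_by_hash = defaultdict(list)
for text, lemma_hash, lemma in rows:
    forms_by_hash[lemma_hash].append(text)

for lemma_hash, forms in forms_by_hash.items():
    print(lemma_hash, forms)
```

The three verb forms land under one key, while `runner` keeps its own lemma and therefore its own hash.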

### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [None]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

# Neatly Displays Lemmas
This Python function show_lemmas(text) neatly displays the lemmas of each token in a given text using spaCy.

---------------------
Here's a breakdown:

## `def show_lemmas(text):`

This line defines a function named `show_lemmas` that takes a spaCy `Doc` object (`text`) as input.

## `for token in text:`

This line starts a loop that iterates through each token in the input `Doc` object.

    print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

This line prints information about each token using an f-string for formatting.

## `token.text:{12}`

The token's original text, padded to a minimum width of 12 characters (strings are left-aligned by default).

## `token.pos_:{6}`

The token's part-of-speech tag (e.g., 'NOUN', 'VERB'), padded to a minimum width of 6 characters.

## `token.lemma:<{22}`

The lemma's hash value, an integer ID, explicitly left-aligned in a field at least 22 characters wide.

## `token.lemma_`

The token's lemma (base form) as a readable string.

This function helps to visualize the key attributes of each token, making it easier to understand the lemmatization process.

For example, if you call show_lemmas with the following code:


    doc = nlp(u"I saw eighteen mice today!")
    show_lemmas(doc)

-----------------------
This would be the output:


    I            PRON   4690420944186131903    I
    saw          VERB   11925638236994514241   see
    eighteen     NUM    9609336664675087640    eighteen
    mice         NOUN   1384165645700560590    mouse
    today        NOUN   11042482332948150395   today
    !            PUNCT  17494803046312582752   !

-------------
## Summary
This function is particularly helpful for demonstrating how spaCy lemmatizes words, providing a clear comparison between the original word form, its part of speech, and its base form (lemma).

Here we're using an **f-string** to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.
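
The format specs can be tried without spaCy. The sketch below feeds the same f-string plain values copied from the `mice` row of the output; the variable names are illustrative only. It also shows why the `<` is needed: strings are left-aligned by default, but integers are right-aligned.

```python
# Values copied from the `mice` row of the output; no spaCy needed here.
text, pos, lemma_hash, lemma = "mice", "NOUN", 1384165645700560590, "mouse"

# Same format specs as show_lemmas: minimum widths of 12, 6 and 22.
line = f'{text:{12}} {pos:{6}} {lemma_hash:<{22}} {lemma}'
print(line)

# Default alignment depends on the value's type.
padded_str = f'{"ab":{6}}|'    # strings pad on the right
padded_int = f'{42:{6}}|'      # integers pad on the left
left_int = f'{42:<{6}}|'       # '<' forces left alignment
print(padded_str)
print(padded_int)
print(left_int)
```

Without the `<`, the lemma hashes would be pushed to the right edge of their 22-character field, and hashes of different lengths would no longer start in the same column.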

In [None]:
doc2 = nlp(u"I saw eighteen mice today!")

show_lemmas(doc2)

I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


<font color=green>Notice that the lemma of `saw` is `see`, `mice` is the plural form of `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`.</font>

In [None]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

show_lemmas(doc3)

I            PRON   4690420944186131903    I
am           AUX    10382539506755952630   be
meeting      VERB   6880656908171229526    meet
him          PRON   1655312771067108281    he
tomorrow     NOUN   3573583789758258062    tomorrow
at           ADP    11667289587015813222   at
the          DET    7425985699627899538    the
meeting      NOUN   14798207169164081740   meeting
.            PUNCT  12646065887601541794   .


<font color=green>Here the lemma of `meeting` is determined by its Part of Speech tag.</font>
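
One way to check this is to look for surface forms that receive different lemmas under different POS tags. The sketch below does so in plain Python over rows copied by hand from the `doc3` output above; `lemmas_by_text` and `ambiguous` are illustrative names, not spaCy attributes.

```python
# (text, POS, lemma) rows copied from the doc3 output above.
rows = [
    ("I", "PRON", "I"),
    ("am", "AUX", "be"),
    ("meeting", "VERB", "meet"),
    ("him", "PRON", "he"),
    ("tomorrow", "NOUN", "tomorrow"),
    ("at", "ADP", "at"),
    ("the", "DET", "the"),
    ("meeting", "NOUN", "meeting"),
    (".", "PUNCT", "."),
]

# Record the lemma seen for each surface form under each POS tag.
lemmas_by_text = {}
for text, pos, lemma in rows:
    lemmas_by_text.setdefault(text, {})[pos] = lemma

# Keep only surface forms whose lemma differs across POS tags.
ambiguous = {
    text: by_pos
    for text, by_pos in lemmas_by_text.items()
    if len(set(by_pos.values())) > 1
}
print(ambiguous)
```

Only `meeting` survives the filter: as a `VERB` its lemma is `meet`, and as a `NOUN` it is `meeting`.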

In [None]:
doc4 = nlp(u"That's an enormous automobile")

show_lemmas(doc4)

That         PRON   4380130941430378203    that
's           AUX    10382539506755952630   be
an           DET    15099054000809333061   an
enormous     ADJ    17917224542039855524   enormous
automobile   NOUN   7211811266693931283    automobile


<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>

We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.

## Next up: Stop Words