<a href="https://colab.research.google.com/github/dinuka-rp/L6-AI/blob/main/Prasan_Yapa/Day2-NLP1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization, Stemming, and Lemmatization

In [1]:
!pip install spacy



We will be using the English language model, which is used to perform a variety of NLP tasks.

In [2]:
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 27.0 MB/s
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Import the spaCy library.

In [3]:
import spacy

## Load the spaCy language model.

In [4]:
sp = spacy.load('en_core_web_sm')

## Create a small document using this model.
A document can be a sentence or a group of sentences and can have unlimited length. The following script creates a simple spaCy document.

In [5]:
sentence = sp('Manchester United is looking to sign a forward for $90 million')

A token is an individual unit of a sentence, such as a word, a number, or a punctuation mark, that carries some semantic value. Let's see
what tokens we have in our document.

In [6]:
for word in sentence:
  print(word.text)

Manchester
United
is
looking
to
sign
a
forward
for
$
90
million
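Each token also carries lexical flags that do not depend on a trained model. The following is a minimal sketch, assuming a tokenizer-only pipeline created with `spacy.blank("en")` (the name `nlp` is chosen here for illustration and is not part of the notebook). It prints a few of these flags for the same sentence.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Manchester United is looking to sign a forward for $90 million")

# Print a few lexical flags that spaCy computes for every token.
for token in doc:
    print(f"{token.text:<12} alpha={token.is_alpha!s:<6} "
          f"num={token.like_num!s:<6} currency={token.is_currency}")

# Collect the tokens that look like numbers or currency symbols.
number_like = [token.text for token in doc if token.like_num]
currency = [token.text for token in doc if token.is_currency]
print(number_like, currency)
```

Flags such as `like_num` are handy for filtering tokens before counting or normalizing them.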


We can also see the part of speech of each token using the `.pos_` attribute, as shown below.

*For example, "Manchester" is tagged as a proper noun.*

In [7]:
for word in sentence:
  print(word.text, word.pos_)

Manchester PROPN
United PROPN
is AUX
looking VERB
to PART
sign VERB
a DET
forward NOUN
for ADP
$ SYM
90 NUM
million NUM
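The tag names in the output above are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description for each of them; the `tags` list below is simply copied from that output.

```python
import spacy

# Tags copied from the part-of-speech output above.
tags = ["PROPN", "AUX", "VERB", "PART", "DET", "NOUN", "ADP", "SYM", "NUM"]

# spacy.explain looks the tag up in spaCy's glossary and needs no loaded model.
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag:<6} {description}")
```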


In addition to printing the words, you can also print sentences from a document.

In [8]:
document = sp('Hello all, welcome to Natural Language Processing class. What is the content for today?')
for sentence in document.sents:
  print(sentence)

Hello all, welcome to Natural Language Processing class.
What is the content for today?


## spaCy tokenization in detail.

Create a new document using the following script.

In [9]:
sentence1 = sp('"They\'re leaving U.K. for U.S.A."')
print(sentence1)

"They're leaving U.K. for U.S.A."


This sentence contains quotes at the beginning and at the end. It also contains
punctuation marks in the abbreviations "U.K." and "U.S.A."

In [10]:
for word in sentence1:
  print(word.text)

"
They
're
leaving
U.K.
for
U.S.A.
"


Here is another tokenization example.

In [11]:
sentence2 = sp("Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com")
print(sentence2)

for word in sentence2:
  print(word.text)

Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com
Hello
,
I
am
non
-
vegetarian
,
email
me
the
menu
at
abc-xyz@gmai.com


The output shows that spaCy detected the email address and did not split it, even though it contains a "-".
On the other hand, the word "non-vegetarian" was split into three tokens.
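To see why a given string was or was not split, the tokenizer can report which rule produced each token. The sketch below assumes a spaCy version that provides `Tokenizer.explain` (2.2.3 or later, as far as I know) and uses a tokenizer-only pipeline, so the rule names it prints may differ between versions.

```python
import spacy

# Tokenizer-only pipeline; the tokenization rules do not depend on a trained model.
nlp = spacy.blank("en")
text = "Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com"

# explain() returns (rule_name, token_text) pairs describing how each token arose.
explained = nlp.tokenizer.explain(text)
for rule, token_text in explained:
    print(f"{rule:<12} {token_text}")

# The explained tokens should match the normal tokenizer output.
tokens = [token.text for token in nlp(text)]
```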

In addition to tokenizing documents into words, you can also find out whether a word is an entity
such as a company, place, building, currency, or institution. To get the named entities from a
document, use the `ents` attribute. Let's retrieve the named entities from the sentence below.
Execute the following script.

For each entity, we print its text, its label, and an explanation of the label.

In [12]:
sentence3 = sp('Manchester United is looking to sign Harry Kane for $90 million')
for entity in sentence3.ents:
  print(entity.text + ' - ' + entity.label_ + ' - ' +
        str(spacy.explain(entity.label_)))

Manchester United - PERSON - People, including fictional
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
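Each item in `ents` is a span of tokens with a label and character offsets into the original text. The sketch below illustrates this without a trained model: it assumes a blank pipeline and sets the entities by hand, so the labels are my own annotations rather than model predictions.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Manchester United is looking to sign Harry Kane for $90 million")

# Hand-annotated entities (token start, token end, label); not predicted by a model.
doc.ents = [
    Span(doc, 0, 2, label="ORG"),
    Span(doc, 6, 8, label="PERSON"),
    Span(doc, 9, 12, label="MONEY"),
]

# Every entity exposes its text, its label, and its character offsets.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

The character offsets make it possible to highlight or replace entities in the original string.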


Noun phrases can also be detected, using the `noun_chunks` attribute.

A noun phrase can be a named entity as well, and vice versa.

In [13]:
sentence4 = sp('Latest Rumours: Manchester United is looking to sign Harry Kane for $90 million')
for noun in sentence4.noun_chunks:
  print(noun.text)

Manchester United
Harry Kane


## Stemming using NLTK

Stemming refers to reducing a word to its root form. While performing natural language
processing tasks, you will encounter various scenarios where different words share the
same root, for instance compute, computer, computing, and computed. You may want to
reduce such words to their root form for the sake of uniformity. This is where stemming comes
into play.

It might be surprising, but spaCy doesn't contain any function for stemming, as it relies
on lemmatization only. Therefore, in this section, we will use NLTK for stemming.
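To make the idea concrete, here is a toy suffix-stripping sketch in plain Python. The `SUFFIXES` tuple and the `toy_stem` helper are hypothetical names invented for this illustration; real stemmers such as the ones in NLTK apply far more elaborate rules.

```python
# Suffixes are tried in order, longest first.
SUFFIXES = ("ing", "ed", "er", "e")

def toy_stem(word):
    """Strip the first matching suffix (illustrative only, not the Porter algorithm)."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["compute", "computer", "computing", "computed"]
stems = [toy_stem(word) for word in words]
print(stems)  # ['comput', 'comput', 'comput', 'comput']
```

Because the rules only cut characters off the end of a word, the result does not have to be a dictionary word.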

This section uses two of the stemmers available in NLTK: the Porter stemmer and the Snowball stemmer.
They are implemented using different algorithms.

### Porter Stemmer

In [14]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
  print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that all four words have been reduced to "comput", which isn't actually a word.

Snowball Stemmer is a slightly improved version of the Porter Stemmer and is usually preferred
over the latter. Let's see Snowball Stemmer in action.

### Snowball Stemmer

In [15]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
  print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that the results are the same: we still get "comput" as the stem, which
again isn't a dictionary word.
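The two stemmers agree on this word family, but not on every word. The sketch below runs both on "generously", an example that also appears in the NLTK documentation; the variable names `porter` and `snowball` are chosen here for illustration.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# Compare both stemmers on one word from the notebook and one where they differ.
words = ["computing", "generously"]
for word in words:
    print(f"{word:<12} porter={porter.stem(word):<10} snowball={snowball.stem(word)}")
```

Running both stemmers on a sample of your own vocabulary is a quick way to decide which one suits a task.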

## Lemmatization

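Lemmatization reduces an inflected word to its dictionary form, called the lemma. For verbs, this means converting the second (past) and third (past participle) forms to the first (base) form.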

Though we could not perform stemming with spaCy, we can perform lemmatization using
spaCy. To do so, we need to use the `lemma_` attribute on each token of the spaCy document.

In [18]:
sentence5 = sp('compute computer computed computing')

for word in sentence5:
  print(word.text, word.lemma_)

compute compute
computer computer
computed compute
computing computing


You can see that unlike stemming where the root we got was "comput", the roots that we got
here are actual words in the dictionary.

In [19]:
sentence6 = sp('A letter has been written, asking him to be released')

for word in sentence6:
  print(word.text + ' ===>', word.lemma_)

A ===> a
letter ===> letter
has ===> have
been ===> be
written ===> write
, ===> ,
asking ===> ask
him ===> -PRON-
to ===> to
be ===> be
released ===> release


You can see from the output that words in the second and third forms, such as "written" and
"released", have been converted to the first form, i.e. "write" and "release". The pronoun "him" is shown as `-PRON-`, the placeholder lemma that spaCy 2.x assigns to pronouns.
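Irregular forms such as "written" cannot be reached by stripping a suffix, so a lemmatizer needs vocabulary knowledge. The toy sketch below uses a hand-written lookup table whose entries mirror the output above; `LEMMA_TABLE` and `toy_lemma` are hypothetical names, and this is not how spaCy is implemented internally.

```python
# Hypothetical lookup table; the entries mirror the spaCy output above.
LEMMA_TABLE = {
    "has": "have",
    "been": "be",
    "written": "write",
    "asking": "ask",
    "released": "release",
}

def toy_lemma(word):
    """Return the table entry for a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

words = "A letter has been written".split()
lemmas = [toy_lemma(word) for word in words]
print(list(zip(words, lemmas)))
```

Unlike a stemmer, a lookup of this kind always returns real words, at the cost of needing an entry for every inflected form.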