# Importing Libraries 

spaCy is one of the most popular NLP libraries, along with NLTK. 
NLTK offers a wide variety of algorithms for each problem, 
whereas spaCy generally provides a single, carefully chosen algorithm for each problem.

In [1]:
import spacy
import en_core_web_sm
sp = en_core_web_sm.load()

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from spacy import displacy

# Tokenization and Lemmatization

In [2]:
sentence = sp(u'Manchester United is looking to sign a forward for $90 million')
for word in sentence:
    print(word.text)

Manchester
United
is
looking
to
sign
a
forward
for
$
90
million


We can see the part of speech of each of these tokens using the .pos_ attribute, as shown below.

In [3]:
for word in sentence:
    print(word.text,  word.pos_)

Manchester PROPN
United PROPN
is VERB
looking VERB
to PART
sign VERB
a DET
forward NOUN
for ADP
$ SYM
90 NUM
million NUM


We can also print the individual sentences of a document using the .sents attribute.

In [4]:
document = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')
for sentence in document.sents:
    print(sentence)

Hello from Stackabuse.
The site with the best Python Tutorials.
What are you looking for?


## spaCy tokenization

In [5]:
sentence3 = sp(u'"They\'re leaving U.K. for U.S.A."')
print(sentence3)
for word in sentence3:
    print(word.text)

"They're leaving U.K. for U.S.A."
"
They
're
leaving
U.K.
for
U.S.A.
"


In [6]:
sentence4 = sp(u"Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com")
print(sentence4)
for word in sentence4:
    print(word.text)

Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com
Hello
,
I
am
non
-
vegetarian
,
email
me
the
menu
at
abc-xyz@gmai.com


The output shows that spaCy recognized the email address and kept it as a single token 
even though it contains a "-". 
The word "non-vegetarian", on the other hand, was split into three tokens at the hyphen.
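
The behaviour above can be imitated with a toy tokenizer that tries an email pattern first and only then falls back to splitting words and punctuation. This is a minimal sketch built on Python's re module; the patterns and the toy_tokenize name are assumptions made for illustration and are not spaCy's actual rules.

```python
import re

# Hypothetical, simplified patterns (not spaCy's real tokenizer rules)
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
WORD = r"\w+"
PUNCT = r"[^\w\s]"

# Alternatives are tried left to right, so an email wins over word/punctuation splitting
TOKEN_RE = re.compile(f"{EMAIL}|{WORD}|{PUNCT}")

def toy_tokenize(text):
    """Split text into emails, words, and single punctuation characters."""
    return TOKEN_RE.findall(text)

for token in toy_tokenize("Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com"):
    print(token)
```

Because the email pattern is checked before the generic word pattern, the hyphen inside the address survives, while the hyphen in "non-vegetarian" becomes its own token.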

Noun phrases can also be detected. To do so, the noun_chunks attribute is used.

In [7]:
sentence5 = sp(u'Latest Rumours: Manchester United is looking to sign Harry Kane for $90 million')  
for noun in sentence5.noun_chunks:
    print(noun.text)

Latest Rumours
Manchester United
Harry Kane


## Stemming 

Stemming refers to reducing a word to its root form.

#### Two commonly used stemmers in NLTK:
- Porter stemmer
- Snowball stemmer

In [8]:
stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


In [9]:
stemmer = SnowballStemmer(language='english')
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that all four words have been reduced to "comput", which isn't actually a word at all.
This is where lemmatization comes in handy.
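
To see why a stem need not be a dictionary word, here is a deliberately naive suffix stripper. It is a sketch only: the suffix list and the toy_stem function are hypothetical and far simpler than the Porter or Snowball algorithms.

```python
# Hypothetical suffix list, checked in order (longer suffixes first)
SUFFIXES = ("ing", "ed", "er", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for token in ["compute", "computer", "computed", "computing"]:
    print(token + " --> " + toy_stem(token))
```

The function only chops characters off the end of the word and never checks the result against a vocabulary, which is why all four words collapse to the same non-word.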

## Lemmatization 
Lemmatization reduces a word to its base form as it appears in the dictionary (its lemma).

In [10]:
sentence6 = sp(u'compute computer computed computing')
for word in sentence6:
    print(word.text,  word.lemma_)

compute compute
computer computer
computed compute
computing computing


##### Lemmatization converts words in their second or third forms (past tense or past participle) to their first (base) form

In [11]:
sentence7 = sp(u'A letter has been written, asking him to be released')
for word in sentence7:
    print(word.text + '  ===>', word.lemma_)

A  ===> a
letter  ===> letter
has  ===> have
been  ===> be
written  ===> write
,  ===> ,
asking  ===> ask
him  ===> -PRON-
to  ===> to
be  ===> be
released  ===> release
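
One simple way to think about lemmatization is as a lookup from inflected forms to base forms. The sketch below assumes a tiny hand-written table filled in from the output above; LEMMA_TABLE and toy_lemma are hypothetical names, and this is an illustration rather than a description of how spaCy's lemmatizer is implemented.

```python
# Hypothetical lookup table, filled in from the output above
LEMMA_TABLE = {
    "has": "have",
    "been": "be",
    "written": "write",
    "asking": "ask",
    "released": "release",
}

def toy_lemma(word):
    """Return the dictionary form if it is in the table, else the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for word in ["A", "letter", "has", "been", "written"]:
    print(word, "===>", toy_lemma(word))
```

Unlike a stemmer, a lookup of this kind always returns a real word, but it only covers forms that someone has put in the table.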


# Parts of Speech 

In [12]:
# Part-of-speech (POS) tagging assigns a part of speech to each word in a sentence.
# Tagging is performed at the token level.
sen = sp(u"I like to play football. I hated it in my childhood though")
print(sen.text)
print(sen[7])
print(sen[7].pos_)
print(sen[7].tag_)

I like to play football. I hated it in my childhood though
hated
VERB
VBD


In [13]:
#To see what VBD means, we can use spacy.explain()
print(spacy.explain(sen[7].tag_))

verb, past tense


In [14]:
#print the text, coarse-grained POS tags, fine-grained POS tags,
#and the explanation for the tags for all the words in the sentence
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival to
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           ADJ        PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       ADP        IN       conjunction, subordinating or preposition


In [15]:
# POS tagging can be really useful. The word "google" can be used as both a noun and a verb, depending on the context
sen = sp(u'Can you google it?')
word = sen[2]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       VERB       VB       verb, base form


In [16]:
sen = sp(u'Can you search it on google?')
word = sen[5]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       PROPN      NNP      noun, proper singular


In [17]:
#Finding the Number of POS Tags
sen = sp(u"I like to play football. I hated it in my childhood though")
num_pos = sen.count_by(spacy.attrs.POS)
num_pos

{96: 1, 99: 3, 84: 2, 83: 1, 91: 2, 93: 1, 94: 3}

In [18]:
for k,v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

83. ADJ     : 1
84. ADP     : 2
91. NOUN    : 2
93. PART    : 1
94. PRON    : 3
96. PUNCT   : 1
99. VERB    : 3
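
count_by returns integer IDs that have to be mapped back to names. If only readable counts are needed, collections.Counter gives the same totals directly. The sketch below uses (text, POS) pairs copied from the tagging output above, so it runs without a loaded model; the tagged list is an assumption standing in for the document.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the tagging output above
tagged = [
    ("I", "PRON"), ("like", "VERB"), ("to", "PART"), ("play", "VERB"),
    ("football", "NOUN"), (".", "PUNCT"), ("I", "PRON"), ("hated", "VERB"),
    ("it", "PRON"), ("in", "ADP"), ("my", "ADJ"), ("childhood", "NOUN"),
    ("though", "ADP"),
]

# Count how often each POS name occurs
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in sorted(pos_counts.items()):
    print(f"{pos:{8}}: {count}")
```

With a loaded model, the same pattern would be Counter(word.pos_ for word in sen).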


In [19]:
# Visualizing POS tags (style='dep' draws the dependency parse, with each token's POS tag shown beneath it)

sen = sp(u"I like to play football. I hated it in my childhood though")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

# Named Entity Recognition 

In [20]:
sp = en_core_web_sm.load()
sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million')

In [21]:
#To find the named entity we use the ents attribute
print(sen.ents)

(Manchester United, Harry Kane, $90 million)


In [22]:
# To see the details of each named entity, use its text and label_ attributes together with spacy.explain()
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


In [23]:
#Adding New Entities
sen = sp(u'HHHHH is setting up a new company in India')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

India - GPE - Countries, cities, states


In [24]:
# Now add "HHHHH" as an entity of type "ORG" to our document
from spacy.tokens import Span
ORG = sen.vocab.strings[u'ORG']
new_entity = Span(sen, 0, 1, label=ORG)
sen.ents = list(sen.ents) + [new_entity]

for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

HHHHH - ORG - Companies, agencies, institutions, etc.
India - GPE - Countries, cities, states
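
In the Span(sen, 0, 1, label=ORG) call above, 0 and 1 are token indices, and the end index is exclusive. For a name that spans several tokens, those indices have to be found first. The helper below is a hypothetical sketch (find_token_span is not part of spaCy) that works on a plain list of token strings.

```python
def find_token_span(tokens, phrase):
    """Return (start, end) token indices of the first match, end exclusive, or None."""
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            return start, start + n
    return None

# Token texts of the sentence used above
tokens = ["HHHHH", "is", "setting", "up", "a", "new", "company", "in", "India"]

print(find_token_span(tokens, ["HHHHH"]))           # (0, 1)
print(find_token_span(tokens, ["new", "company"]))  # (5, 7)
```

The returned pair can be passed on as the start and end arguments of Span.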


In [25]:
#Counting Entities
sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional
100 Million Dollars - MONEY - Monetary values, including unit


In [26]:
len([ent for ent in sen.ents if ent.label_=='ORG'])

1
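
Instead of filtering for one label at a time, the entities can be grouped by label in a single pass. This sketch uses (text, label) pairs copied from the entity output above so that it runs without a model; the entities list is an assumption standing in for sen.ents.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above
entities = [
    ("Manchester United", "ORG"),
    ("Harry Kane", "PERSON"),
    ("$90 million", "MONEY"),
    ("David", "PERSON"),
    ("100 Million Dollars", "MONEY"),
]

# Collect the entity texts under each label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, len(texts), texts)
```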

In [27]:
#Visualizing Named Entities

sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
displacy.render(sen, style='ent', jupyter=True)

In [28]:
# Show only ORG entities; named ent_options to avoid shadowing the built-in filter()
ent_options = {'ents': ['ORG']}
displacy.render(sen, style='ent', jupyter=True, options=ent_options)