
# Parts-of-Speech Tagging

In [1]:
import spacy

# load the spaCy language model.
sp = spacy.load('en_core_web_sm')

In [3]:
sentence1 = sp("I like to play football. I hated it in my childhood though")

print(sentence1.text)

I like to play football. I hated it in my childhood though


We will print the POS tag of the word "hated", which is the eighth token in the sentence (index 7, because token indices start at zero).

In [4]:
print(sentence1[7].pos_)

VERB
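spaCy documents are indexed from zero, like Python lists, and punctuation counts as a token. The sketch below illustrates this with a plain list rather than a spaCy document; the token list is copied by hand from how spaCy splits this sentence, and the names `tokens` and `index` are only illustrative.

```python
# Tokens of the sentence above, copied by hand (not produced by spaCy here).
# Note that the full stop after "football" is a token of its own.
tokens = ["I", "like", "to", "play", "football", ".",
          "I", "hated", "it", "in", "my", "childhood", "though"]

# Zero-based index of "hated", and its one-based (ordinal) position.
index = tokens.index("hated")
print(index, index + 1)
```

This is why `sentence1[7]` selects "hated" even though it is the eighth token.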


Next, we print the fine-grained POS tag for the word “hated”.

To see what the resulting tag, VBD, means, we can use the `spacy.explain()` function, as shown below.

In [5]:
print(sentence1[7].tag_)
print(spacy.explain(sentence1[7].tag_))

VBD
verb, past tense


Let's print the text, coarse-grained POS tags, fine-grained POS tags, and the explanation for
the tags for all the words in the sentence.

In [9]:
for word in sentence1:
  print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival "to"
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           DET        PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       SCONJ      IN       conjunction, subordinating or preposition


*In the script above, we improve readability by padding each column to a fixed width: the
token text to 12 characters, the coarse-grained POS tag to 10 characters, and the fine-grained
POS tag to 8 characters.*
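The width syntax used in that f-string can be tried without spaCy. The sketch below uses hard-coded values for one row (copied from the output above) to show that `{value:{12}}` pads the value to a minimum width of 12 characters; the variable names are illustrative.

```python
# One row of the table above, with values hard-coded for illustration.
word, pos, tag = "hated", "VERB", "VBD"

# {word:{12}} pads the string with trailing spaces to at least 12 characters
# (strings are left-aligned by default); it does not insert 12 extra spaces.
row = f"{word:{12}} {pos:{10}} {tag:{8}} verb, past tense"
print(row)

# For a literal width the inner braces are optional: {word:12} is equivalent.
same_row = f"{word:12} {pos:10} {tag:8} verb, past tense"
print(row == same_row)
```

The nested braces are only needed when the width comes from a variable, for example `f"{word:{width}}"`.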

## *Why Is POS Tagging Useful?*
POS tagging is particularly useful when a word or token can take more than one POS tag.
For instance, the word “google” can be used as both a noun and a verb, depending on the
context. When processing natural language, it is important to identify this difference.
Fortunately, spaCy's pre-trained models use the surrounding context to predict the correct
POS tag for a word.

In [10]:
sentence2 = sp('Can you google it?')
word = sentence2[2]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       VERB       VB       verb, base form


Here the word “google” is being used as a verb.

In [11]:
sentence3 = sp('Can you search it on google?')
word = sentence3[5]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       PROPN      NNP      noun, proper singular


In the script above, the word “google” is used as a noun; spaCy tags it as a proper noun (`PROPN`).

You can find the number of occurrences of each POS tag by calling the `count_by` method on the
spaCy document object. The method takes `spacy.attrs.POS` as its argument and returns a
dictionary keyed by integer tag IDs.

In [12]:
sentence4 = sp("I like to play football. I hated it in my childhood though")
num_pos = sentence4.count_by(spacy.attrs.POS)
print(num_pos)

{95: 3, 100: 3, 94: 1, 92: 2, 97: 1, 85: 1, 90: 1, 98: 1}
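The keys in this dictionary are integer IDs rather than tag names, which makes the result hard to read. The sketch below turns the IDs into names. To stay runnable without a model, it hard-codes the ID-to-name pairs for this sentence, inferred by matching the order of the keys to the order in which the tags first appear in the token table above; `id_to_name` and `readable` are illustrative names, not part of spaCy.

```python
# Output of sentence4.count_by(spacy.attrs.POS), copied from the cell above.
num_pos = {95: 3, 100: 3, 94: 1, 92: 2, 97: 1, 85: 1, 90: 1, 98: 1}

# Assumption: ID-to-name pairs inferred from the order of first appearance
# of each coarse-grained tag in the sentence (PRON, VERB, PART, NOUN, ...).
id_to_name = {95: "PRON", 100: "VERB", 94: "PART", 92: "NOUN",
              97: "PUNCT", 85: "ADP", 90: "DET", 98: "SCONJ"}

# Replace each integer key with its tag name.
readable = {id_to_name[pos_id]: count for pos_id, count in num_pos.items()}

for name, count in readable.items():
    print(f"{name:{8}} {count}")
```

With a live document there should be no need for a hand-written table: the name for an ID can be looked up in the vocabulary, for example `sentence4.vocab[pos_id].text`.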


## Named Entity Recognition
Named entity recognition is the task of identifying words or phrases in a sentence as entities,
e.g. the name of a person, place, or organization. Let's see how the spaCy library performs
named entity recognition.

In [13]:
sentence5 = sen = sp('Manchester United is looking to sign Harry Kane for $90 million')
print(sentence5.ents)

(Manchester United, Harry Kane, $90 million)


You can see that three named entities were identified.

To see the details of each named entity, you can use its `text` and `label_` attributes,
together with the `spacy.explain` function, which takes the entity's label as a
parameter.

In [14]:
for entity in sentence5.ents:
  print(f'{entity.text} - {entity.label_} - {spacy.explain(entity.label_)}')

Manchester United - PERSON - People, including fictional
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


You can also add new entities to an existing document. For instance, in the following example,
“Virtusa” is not identified as a company by the spaCy library.

In [15]:
sentence6 = sp('Virtusa is a setting up company in Sweden')
for entity in sentence6.ents:
  print(f'{entity.text} - {entity.label_} - {spacy.explain(entity.label_)}')

Sweden - GPE - Countries, cities, states


## Add a New Entity to a spaCy Document

Let's add “Virtusa” as an entity of type “ORG” to our document.

In [16]:
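from spacy.tokens import Span

# Look up the ID of the ORG label in the document's own vocabulary.
ORG = sentence6.vocab.strings['ORG']
# Token 0 ("Virtusa") becomes a one-token span; the end index is exclusive.
new_entity = Span(sentence6, 0, 1, label=ORG)
# Append the new span to the entities spaCy already found.
sentence6.ents = list(sentence6.ents) + [new_entity]

for entity in sentence6.ents:
  print(f'{entity.text} - {entity.label_} - {spacy.explain(entity.label_)}')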

Virtusa - ORG - Companies, agencies, institutions, etc.
Sweden - GPE - Countries, cities, states
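The two numbers passed to `Span` are token positions, not character offsets, and the end position is exclusive, as in a Python slice. The sketch below mimics this with a plain list standing in for the document. It assumes that splitting on whitespace matches spaCy's tokenization for this particular sentence (it contains no punctuation), and `span_text` is a hypothetical helper, not a spaCy function.

```python
# Stand-in for the tokens of sentence6 (assumption: whitespace splitting
# gives the same tokens as spaCy for this punctuation-free sentence).
tokens = "Virtusa is a setting up company in Sweden".split()

def span_text(tokens, start, end):
    """Join tokens[start:end]; `end` is exclusive, mirroring Span(doc, start, end)."""
    return " ".join(tokens[start:end])

first = span_text(tokens, 0, 1)  # same offsets as the Span created above
last = span_text(tokens, 7, 8)   # the one-token span covering the last word
print(first, "|", last)
```

An entity that covers two tokens would therefore need `end = start + 2`.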


In the case of POS tags, we could count the frequency of each tag in a document using the
document's `count_by` method. However, no such method exists for named entities, so we
count the frequency of each entity type manually. Suppose we have the following document
along with its entities.

In [18]:
sentence7 = sp('Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')

for entity in sentence7.ents:
  print(f'{entity.text} - {entity.label_} - {spacy.explain(entity.label_)}')

Manchester United - PERSON - People, including fictional
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional
100 Million Dollars - MONEY - Monetary values, including unit


To count the `PERSON` entities in the above document, we can use the following script.

In [19]:
print(len([ent for ent in sentence7.ents if ent.label_ == 'PERSON']))

3
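The same idea extends to counting every entity type at once. The sketch below uses `collections.Counter` from the standard library on (text, label) pairs copied by hand from the output for `sentence7` above, so it runs without a model; `entities` and `label_counts` are illustrative names.

```python
from collections import Counter

# (text, label) pairs copied from the printed entities of sentence7.
entities = [
    ("Manchester United", "PERSON"),
    ("Harry Kane", "PERSON"),
    ("$90 million", "MONEY"),
    ("David", "PERSON"),
    ("100 Million Dollars", "MONEY"),
]

# Count how many entities carry each label.
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(label, count)
```

With a live document, the equivalent would be `Counter(ent.label_ for ent in sentence7.ents)`. Note that the count of `PERSON` entities includes “Manchester United”, a football club that the model labels as a person in the output above.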
