In [1]:
!pip install nltk



### Named Entity Extraction in NLTK

In [3]:
# Tokenize the text with word_tokenize, then pass the tokens to pos_tag, which assigns a part-of-speech tag to each token.
# nltk.ne_chunk() is a pre-trained classifier that recognizes named entities from the POS-tagged tokens.
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')


from nltk import word_tokenize,pos_tag

text = "NASA awarded Elon Musk’s SpaceX a $2.9 billion contract to build the lunar lander."
tokens = word_tokenize(text)
tag=pos_tag(tokens)
print(tag)

ne_tree = nltk.ne_chunk(tag)
print(ne_tree)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[('NASA', 'NNP'), ('awarded', 'VBD'), ('Elon', 'NNP'), ('Musk', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('SpaceX', 'NNP'), ('a', 'DT'), ('$', '$'), ('2.9', 'CD'), ('billion', 'CD'), ('contract', 'NN'), ('to', 'TO'), ('build', 'VB'), ('the', 'DT'), ('lunar', 'NN'), ('lander', 'NN'), ('.', '.')]
(S
  (ORGANIZATION NASA/NNP)
  awarded/VBD
  (PERSON Elon/NNP Musk/NNP)
  ’/NNP
  s/VBD
  (ORGANIZATION SpaceX/NNP)
  a/DT
  $/$
  2.9/CD
  billion/CD
  contract/NN
  to/TO
  build/VB
  the/DT
  lunar/NN
  lander/NN
  ./.)


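Observation: For the input text above, ne_chunk correctly identified NASA and SpaceX as ORGANIZATION and Elon Musk as PERSON, and pos_tag tagged $ as a currency symbol and 2.9 as a cardinal number (CD), which is pretty decent tagging. One slip: the curly apostrophe in Musk’s was split off and tagged NNP, and the following s was tagged VBD.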
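
To use the recognized entities rather than just print the tree, one option is to walk the top level of the tree and collect the labeled chunks. The sketch below is a minimal illustration: `extract_entities` is a hypothetical helper (not part of NLTK), and the tree is rebuilt by hand from the first few nodes of the output above so the snippet runs without downloading any models.

```python
from nltk.tree import Tree

# Hand-built copy of the start of the chunk tree printed above (an assumption
# made so this sketch runs without the tagger and chunker models).
ne_tree = Tree("S", [
    Tree("ORGANIZATION", [("NASA", "NNP")]),
    ("awarded", "VBD"),
    Tree("PERSON", [("Elon", "NNP"), ("Musk", "NNP")]),
    ("’", "NNP"),
    ("s", "VBD"),
    Tree("ORGANIZATION", [("SpaceX", "NNP")]),
    ("a", "DT"),
])

def extract_entities(tree):
    """Return (entity text, label) pairs for each labeled chunk under the root."""
    entities = []
    for node in tree:
        # Named-entity chunks are Tree objects; plain tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            words = [word for word, _tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities

entities = extract_entities(ne_tree)
print(entities)
```

The same function should work on the `ne_tree` returned by `nltk.ne_chunk(tag)` in the cell above, since that call also returns a `Tree`.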

In [4]:
nltk.download('treebank')
sent = nltk.corpus.treebank.tagged_sents()

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


In [5]:
print(nltk.ne_chunk(sent[0]))

(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)


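Observation: Here ne_chunk was applied to a sentence from the Treebank corpus that already carries POS tags. Pierre was labeled PERSON, but Vinken was labeled ORGANIZATION even though Pierre Vinken is one person's name; 61 and 29 keep their CD (cardinal number) tags from the corpus.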

### NER using spaCy

In [6]:
# Install the required spacy dependency
!pip install spacy



In [9]:
# Upgrade typing-extensions and spaCy
!pip install --upgrade typing-extensions
!pip install --upgrade spacy



In [8]:
# Run the spaCy pipeline on the text and print, for each token, its text, its IOB entity annotation, and its entity type.
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("NASA awarded Elon Musk’s SpaceX a $2.9 billion contract to build the lunar lander.")
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)

NASA B ORG
awarded O 
Elon B PERSON
Musk I PERSON
’s I PERSON
SpaceX O 
a O 
$ B MONEY
2.9 I MONEY
billion I MONEY
contract O 
to O 
build O 
the O 
lunar O 
lander O 
. O 
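
In the output above, B marks the first token of an entity, I marks a token that continues it, and O marks a token outside any entity. The sketch below shows one way to merge those per-token rows back into entity spans. `group_entities` is a hypothetical helper written for illustration, and the rows are copied by hand from the output above rather than produced by running the model.

```python
def group_entities(rows):
    """Merge (text, iob, ent_type) rows into (entity text, ent_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for text, iob, ent_type in rows:
        # An I token extends the entity that is currently open.
        if iob == "I" and current_tokens:
            current_tokens.append(text)
            continue
        # Any other token closes the open entity, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A B token opens a new entity.
        if iob == "B":
            current_tokens, current_type = [text], ent_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Rows copied from the printed output above (truncated after "contract").
rows = [
    ("NASA", "B", "ORG"), ("awarded", "O", ""),
    ("Elon", "B", "PERSON"), ("Musk", "I", "PERSON"), ("’s", "I", "PERSON"),
    ("SpaceX", "O", ""), ("a", "O", ""),
    ("$", "B", "MONEY"), ("2.9", "I", "MONEY"), ("billion", "I", "MONEY"),
    ("contract", "O", ""),
]
spans = group_entities(rows)
print(spans)
```

Tokens are joined with single spaces here, so the span text can differ slightly from the original string. Grouping the rows also makes two things in this output easy to see: the possessive ’s was included in the PERSON span, and SpaceX received no entity label. With a real spaCy `Doc`, the same spans are available directly as `doc.ents`, each with `.text` and `.label_`.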


## **Exercise 2**

In [10]:
# The text below consists of two paragraphs taken from the book Kishor Satsang Pravesh (BAPS Sanstha).
# Tokenize the text with word_tokenize, then pass the tokens to pos_tag, which assigns a part-of-speech tag to each token.
# nltk.ne_chunk() is a pre-trained classifier that recognizes named entities from the POS-tagged tokens.

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')


from nltk import word_tokenize,pos_tag

text = """Ahimsa paramo dharmaha’ – Non-violence is the highest ethical code laid down in all the shastras. No one should harm any creature
by body, mind or speech; nor should anyone intentionally kill insects such as lice, bugs, etc. Even for the purpose of performing
yagnas, none should kill animals. This is because it is a sin to kill animals and offer them as sacrifices. Shriji Maharaj even refused to
pluck spinach leaves in Jagannathpuri on the grounds that it also has life. Even for women, wealth or kingdom, one should never, in
any way, harm or kill any person (11, 12, 13). King Uparicharvasu, even though he ruled the whole world, practiced ahimsa. Shriji
Maharaj has explained in the Vachanamrut that non-violence is the dharma by which one is led to liberation (Vachanamrut Gadhada
I-69). Harsh words which create mental pain also tantamount to himsa.
Always speak in a truthful, loving and beneficial manner. Never speak untruth. One should never tell a lie even for financial or other
gains. One should never utter truth which may cause danger to one’s life or to another’s. For example, if a butcher chasing a cow
to kill it asks, “Where has the cow gone?” one should tell a lie to save the life of the cow.
"""
tokens = word_tokenize(text)
tag=pos_tag(tokens)
print(tag)

ne_tree = nltk.ne_chunk(tag)
print(ne_tree)

[('Ahimsa', 'NNP'), ('paramo', 'NN'), ('dharmaha', 'NN'), ('’', 'NNP'), ('–', 'NNP'), ('Non-violence', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('highest', 'JJS'), ('ethical', 'JJ'), ('code', 'NN'), ('laid', 'VBD'), ('down', 'RB'), ('in', 'IN'), ('all', 'PDT'), ('the', 'DT'), ('shastras', 'NNS'), ('.', '.'), ('No', 'DT'), ('one', 'NN'), ('should', 'MD'), ('harm', 'VB'), ('any', 'DT'), ('creature', 'NN'), ('by', 'IN'), ('body', 'NN'), (',', ','), ('mind', 'NN'), ('or', 'CC'), ('speech', 'NN'), (';', ':'), ('nor', 'CC'), ('should', 'MD'), ('anyone', 'NN'), ('intentionally', 'RB'), ('kill', 'VB'), ('insects', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('lice', 'NN'), (',', ','), ('bugs', 'NNS'), (',', ','), ('etc', 'FW'), ('.', '.'), ('Even', 'RB'), ('for', 'IN'), ('the', 'DT'), ('purpose', 'NN'), ('of', 'IN'), ('performing', 'VBG'), ('yagnas', 'NNS'), (',', ','), ('none', 'NN'), ('should', 'MD'), ('kill', 'VB'), ('animals', 'NNS'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('because', 'IN'), ('

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
# The text below consists of two paragraphs taken from the book Kishor Satsang Pravesh (BAPS Sanstha).
# Run the spaCy pipeline on the text and print, for each token, its text, its IOB entity annotation, and its entity type.
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(""" Ahimsa paramo dharmaha’ – Non-violence is the highest ethical code laid down in all the shastras. No one should harm any creature
by body, mind or speech; nor should anyone intentionally kill insects such as lice, bugs, etc. Even for the purpose of performing
yagnas, none should kill animals. This is because it is a sin to kill animals and offer them as sacrifices. Shriji Maharaj even refused to
pluck spinach leaves in Jagannathpuri on the grounds that it also has life. Even for women, wealth or kingdom, one should never, in
any way, harm or kill any person (11, 12, 13). King Uparicharvasu, even though he ruled the whole world, practiced ahimsa. Shriji
Maharaj has explained in the Vachanamrut that non-violence is the dharma by which one is led to liberation (Vachanamrut Gadhada
I-69). Harsh words which create mental pain also tantamount to himsa.
Always speak in a truthful, loving and beneficial manner. Never speak untruth. One should never tell a lie even for financial or other
gains. One should never utter truth which may cause danger to one’s life or to another’s. For example, if a butcher chasing a cow
to kill it asks, “Where has the cow gone?” one should tell a lie to save the life of the cow.""")
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)

  O 
Ahimsa O 
paramo O 
dharmaha O 
’ O 
– O 
Non O 
- O 
violence O 
is O 
the O 
highest O 
ethical O 
code O 
laid O 
down O 
in O 
all O 
the O 
shastras O 
. O 
No O 
one O 
should O 
harm O 
any O 
creature O 

 O 
by O 
body O 
, O 
mind O 
or O 
speech O 
; O 
nor O 
should O 
anyone O 
intentionally O 
kill O 
insects O 
such O 
as O 
lice O 
, O 
bugs O 
, O 
etc O 
. O 
Even O 
for O 
the O 
purpose O 
of O 
performing O 

 O 
yagnas O 
, O 
none O 
should O 
kill O 
animals O 
. O 
This O 
is O 
because O 
it O 
is O 
a O 
sin O 
to O 
kill O 
animals O 
and O 
offer O 
them O 
as O 
sacrifices O 
. O 
Shriji B PERSON
Maharaj I PERSON
even O 
refused O 
to O 

 O 
pluck O 
spinach O 
leaves O 
in O 
Jagannathpuri B GPE
on O 
the O 
grounds O 
that O 
it O 
also O 
has O 
life O 
. O 
Even O 
for O 
women O 
, O 
wealth O 
or O 
kingdom O 
, O 
one O 
should O 
never O 
, O 
in O 

 O 
any O 
way O 
, O 
harm O 
or O 
kill O 
any O 
person O 
( O 
11 B CARDINAL
, O 
12 B DA

The code above performs the tokenization and prints each token's IOB entity annotation and entity type. It did a good job of tokenizing every word in the paragraph, but the model found it somewhat difficult to identify the entities in it. Still, it did decent work and gave reasonably good results.
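
A quick way to put a number on how many entities the model found is to count the B tags per entity type, since each entity starts with exactly one B token. The sketch below assumes rows in the same (text, IOB, type) shape as the printed output. `count_entity_types` is a hypothetical helper, and the rows are a hand-copied excerpt of the labeled tokens above, not the full output.

```python
from collections import Counter

def count_entity_types(rows):
    """Count entities per type by counting the B (begin) tokens."""
    counts = Counter()
    for _text, iob, ent_type in rows:
        if iob == "B":
            counts[ent_type] += 1
    return counts

# Excerpt copied by hand from the printed output above.
rows = [
    ("Shriji", "B", "PERSON"), ("Maharaj", "I", "PERSON"),
    ("even", "O", ""), ("refused", "O", ""),
    ("Jagannathpuri", "B", "GPE"),
    ("11", "B", "CARDINAL"), (",", "O", ""),
]
counts = count_entity_types(rows)
print(counts)
```

On a real `Doc`, the rows could be built with `[(t.text, t.ent_iob_, t.ent_type_) for t in doc]`, using the same token attributes printed in the cell above.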