* Named Entity Recognition (NER), sometimes referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text.

* An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category.

* For example, an NER machine learning (ML) model might detect the phrase "MITU Skillogies" in a text and classify it as a "Company".

#### Install the required packages

In [20]:
# !pip install nltk spacy

In [21]:
import nltk

In [22]:
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


True

In [23]:
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('words')

# Sample text
text = 'Sachin Tendulkar was born in Mumbai, India on April 24, 1974'

# Tokenize the text into words
tokens = word_tokenize(text)

# POS tagging
tagged_tokens = pos_tag(tokens)

# Perform Named Entity Recognition (NER) on the POS-tagged tokens
ner_tree = ne_chunk(tagged_tokens)

# Display the NER tree
print(ner_tree)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


(S
  (PERSON Sachin/NNP)
  (PERSON Tendulkar/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Mumbai/NNP)
  ,/,
  (GPE India/NNP)
  on/IN
  April/NNP
  24/CD
  ,/,
  1974/CD)


In [24]:
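# Call the method (with parentheses) to draw the tree; this opens a Tk window,
# so it needs a local display and may fail in a hosted notebook
ner_tree.draw()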

In [25]:
ner_tree.pos()

[(('Sachin', 'NNP'), 'PERSON'),
 (('Tendulkar', 'NNP'), 'PERSON'),
 (('was', 'VBD'), 'S'),
 (('born', 'VBN'), 'S'),
 (('in', 'IN'), 'S'),
 (('Mumbai', 'NNP'), 'GPE'),
 ((',', ','), 'S'),
 (('India', 'NNP'), 'GPE'),
 (('on', 'IN'), 'S'),
 (('April', 'NNP'), 'S'),
 (('24', 'CD'), 'S'),
 ((',', ','), 'S'),
 (('1974', 'CD'), 'S')]

In [26]:
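# Keep only the tokens whose POS tag marks a noun (NN, NNS, NNP, NNPS)
for row in ner_tree.pos():
  if row[0][1].startswith('NN'):
    print(row)


# Alternative: print only the tokens that sit inside a named-entity chunk
# ('S' is the root label given to tokens outside any entity)
# for row in ner_tree.pos():
#   if row[1] != 'S':
#       print(row[0][0], row[1])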

(('Sachin', 'NNP'), 'PERSON')
(('Tendulkar', 'NNP'), 'PERSON')
(('Mumbai', 'NNP'), 'GPE')
(('India', 'NNP'), 'GPE')
(('April', 'NNP'), 'S')
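The output above lists "Sachin" and "Tendulkar" as two separate PERSON tokens. A minimal sketch of one way to join such neighbours into a single entity is shown below. It works on a hand-copied list in the same shape as `ner_tree.pos()`, so it runs without NLTK. The names `merge_entities` and `outside_label` are hypothetical helpers, not part of NLTK, and the approach assumes that consecutive tokens with the same label belong to one entity, which will wrongly merge two distinct entities of the same type that happen to be adjacent.

```python
from itertools import groupby

# Same structure that ner_tree.pos() returns: ((token, pos_tag), chunk_label)
tagged = [
    (('Sachin', 'NNP'), 'PERSON'),
    (('Tendulkar', 'NNP'), 'PERSON'),
    (('was', 'VBD'), 'S'),
    (('born', 'VBN'), 'S'),
    (('in', 'IN'), 'S'),
    (('Mumbai', 'NNP'), 'GPE'),
    ((',', ','), 'S'),
    (('India', 'NNP'), 'GPE'),
]


def merge_entities(rows, outside_label='S'):
    """Join runs of consecutive tokens that share an entity label.

    Hypothetical helper: tokens labelled `outside_label` are skipped.
    """
    entities = []
    for label, group in groupby(rows, key=lambda row: row[1]):
        if label != outside_label:
            words = [token for (token, _pos), _label in group]
            entities.append((' '.join(words), label))
    return entities


entities = merge_entities(tagged)
print(entities)
```

In the notebook, the same helper could be called as `merge_entities(ner_tree.pos())`.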


#### NER using spaCy

In [27]:
import spacy

In [28]:
# !python3 -m spacy download en_core_web_sm

In [29]:
nlp = spacy.load('en_core_web_sm')

In [30]:
sent = nlp('''Mark Zuckerberg will meet Aditya Joshi in New York, USA on Monday 21, 2024 for a $3 Trillion deal.''')

In [31]:
sent

Mark Zuckerberg will meet Aditya Joshi in New York, USA on Monday 21, 2024 for a $3 Trillion deal.

In [32]:
# list the entities in the sentence
sent.ents

(Mark Zuckerberg, Aditya Joshi, New York, USA, Monday 21, 2024, $3 Trillion)

In [35]:
# Print each entity's text together with its label
for entity in sent.ents:
    print(entity.text, '->', entity.label_)


Mark Zuckerberg -> PERSON
Aditya Joshi -> PERSON
New York -> GPE
USA -> GPE
Monday 21, 2024 -> DATE
$3 Trillion -> MONEY
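Once entities and labels are available as pairs, it is often convenient to group them by label. The sketch below does this in plain Python on pairs copied from the output above, so it runs without spaCy. `group_by_label` is a hypothetical helper name, not a spaCy function.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the printed output; with a spaCy Doc they could be
# built as [(ent.text, ent.label_) for ent in sent.ents]
pairs = [
    ('Mark Zuckerberg', 'PERSON'),
    ('Aditya Joshi', 'PERSON'),
    ('New York', 'GPE'),
    ('USA', 'GPE'),
    ('Monday 21, 2024', 'DATE'),
    ('$3 Trillion', 'MONEY'),
]

by_label = group_by_label(pairs)
print(by_label['PERSON'])
print(by_label['GPE'])
```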


In [36]:
sent = nlp(text)

In [37]:
sent.ents

(Sachin Tendulkar, Mumbai, India, April 24, 1974)

In [38]:
for word in sent.ents:
  print(word.text, '->', word.label_)

Sachin Tendulkar -> PERSON
Mumbai -> GPE
India -> GPE
April 24, 1974 -> DATE


In [39]:
spacy.explain('PERSON')

'People, including fictional'

In [40]:
spacy.explain('MONEY')

'Monetary values, including unit'

In [41]:
spacy.explain('DATE')

'Absolute or relative dates or periods'

In [42]:
raw_text = '''Indigenous people have lived in Alaska for thousands of years, and it is widely believed that the region served as the entry point for the initial settlement of North America by way of the Bering land bridge. The Russian Empire was the first to actively colonize the area beginning in the 18th century, eventually establishing Russian America, which spanned most of the current state and promoted and maintained a native Alaskan Creole population.[7] The expense and logistical difficulty of maintaining this distant possession prompted its sale to the U.S. in 1867 for US$7.2 million (equivalent to $157 million in 2023). The area went through several administrative changes before becoming organized as a territory on May 11, 1912. It was admitted as the 49th state of the U.S. on January 3, 1959.[8]'''

In [43]:
raw_text

'Indigenous people have lived in Alaska for thousands of years, and it is widely believed that the region served as the entry point for the initial settlement of North America by way of the Bering land bridge. The Russian Empire was the first to actively colonize the area beginning in the 18th century, eventually establishing Russian America, which spanned most of the current state and promoted and maintained a native Alaskan Creole population.[7] The expense and logistical difficulty of maintaining this distant possession prompted its sale to the U.S. in 1867 for US$7.2 million (equivalent to $157 million in 2023). The area went through several administrative changes before becoming organized as a territory on May 11, 1912. It was admitted as the 49th state of the U.S. on January 3, 1959.[8]'

In [44]:
sent = nlp(raw_text)

In [45]:
sent.ents

(Alaska,
 thousands of years,
 North America,
 The Russian Empire,
 first,
 the 18th century,
 Russian America,
 Alaskan,
 U.S.,
 1867,
 US$7.2 million,
 $157 million,
 2023,
 May 11, 1912,
 49th,
 U.S.,
 January 3,
 1959.[8)

In [46]:
for word in sent.ents:
  print(word.text, '->', word.label_)

Alaska -> GPE
thousands of years -> DATE
North America -> LOC
The Russian Empire -> GPE
first -> ORDINAL
the 18th century -> DATE
Russian America -> LOC
Alaskan -> NORP
U.S. -> GPE
1867 -> DATE
US$7.2 million -> MONEY
$157 million -> MONEY
2023 -> DATE
May 11, 1912 -> DATE
49th -> ORDINAL
U.S. -> GPE
January 3 -> DATE
1959.[8 -> CARDINAL
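The last entity in the output above, `1959.[8`, picks up part of the citation marker `[8]` that was pasted in with the text. One possible cleanup, sketched here under the assumption that citation markers always look like a number in square brackets, is to strip them with a regular expression before passing the text to `nlp`. `strip_citations` is a hypothetical helper, and this sketch makes no claim about which entities the model returns afterwards.

```python
import re

# Matches bracketed numeric citation markers such as [7] or [12]
CITATION = re.compile(r'\[\d+\]')


def strip_citations(text):
    """Remove bracketed numeric citation markers from the text."""
    return CITATION.sub('', text)


sample = 'It was admitted as the 49th state of the U.S. on January 3, 1959.[8]'
cleaned = strip_citations(sample)
print(cleaned)
```

In the notebook, this could be applied as `sent = nlp(strip_citations(raw_text))`.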


In [47]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [48]:
spacy.explain('CARDINAL')

'Numerals that do not fall under another type'

In [49]:
spacy.explain('ORDINAL')

'"first", "second", etc.'

In [50]:
spacy.explain('LOC')

'Non-GPE locations, mountain ranges, bodies of water'

#### Display the NER results interactively

In [52]:
from spacy import displacy

In [53]:
displacy.render(sent, style='ent', jupyter=True)
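`displacy.render` needs a notebook or browser to show its highlighted output. Where only plain text is available, a rough fallback is to mark the entities inline. The sketch below is a plain-Python illustration, not a displaCy feature: `annotate` is a hypothetical helper that takes character-offset spans and assumes they do not overlap.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span as [entity text](LABEL).

    Hypothetical helper; assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f'[{text[start:end]}]({label})')  # the marked entity
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return ''.join(pieces)


sample = 'Sachin Tendulkar was born in Mumbai'

# Offsets are computed by hand here; with a spaCy Doc they could come from
# [(ent.start_char, ent.end_char, ent.label_) for ent in sent.ents]
spans = [
    (sample.index(phrase), sample.index(phrase) + len(phrase), label)
    for phrase, label in [('Sachin Tendulkar', 'PERSON'), ('Mumbai', 'GPE')]
]

marked = annotate(sample, spans)
print(marked)
```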