# 4. Named Entity Recognition (NER)


Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that identifies spans of text referring to entities and classifies them into predefined categories such as people, organizations, locations, and dates. NLTK ships a pre-trained chunker for this task that operates on tokenized, POS-tagged text.

In [1]:
!pip install nltk



In [2]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk


Download Required NLTK Data

In [3]:
# Download Required NLTK Data

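nltk.download('punkt')   # For word and sentence tokenization
nltk.download('maxent_ne_chunker')  # Pre-trained NER chunker
nltk.download('words')  # Word list for NER
nltk.download('averaged_perceptron_tagger')  # Part-of-speech tagger

# Note: NLTK 3.9 and later may raise a LookupError with the names above and ask for
# 'punkt_tab', 'maxent_ne_chunker_tab', and 'averaged_perceptron_tagger_eng' instead.
# If that happens, download the resource named in the error message.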


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [4]:
# Tokenize the Text

text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in Cupertino, California on April 1, 1976."

# Tokenize the text into words
tokens = word_tokenize(text)
print(tokens)


['Apple', 'Inc.', 'was', 'founded', 'by', 'Steve', 'Jobs', ',', 'Steve', 'Wozniak', ',', 'and', 'Ronald', 'Wayne', 'in', 'Cupertino', ',', 'California', 'on', 'April', '1', ',', '1976', '.']


In [5]:
# Part-of-Speech (POS) Tagging

# Before identifying named entities, tag each token with its part of speech (POS).
# ne_chunk expects POS-tagged tokens as input.

# Apply POS tagging
pos_tags = pos_tag(tokens)
print(pos_tags)


[('Apple', 'NNP'), ('Inc.', 'NNP'), ('was', 'VBD'), ('founded', 'VBN'), ('by', 'IN'), ('Steve', 'NNP'), ('Jobs', 'NNP'), (',', ','), ('Steve', 'NNP'), ('Wozniak', 'NNP'), (',', ','), ('and', 'CC'), ('Ronald', 'NNP'), ('Wayne', 'NNP'), ('in', 'IN'), ('Cupertino', 'NNP'), (',', ','), ('California', 'NNP'), ('on', 'IN'), ('April', 'NNP'), ('1', 'CD'), (',', ','), ('1976', 'CD'), ('.', '.')]


In [6]:
# Named Entity Recognition (NER)

# Run the pre-trained chunker; it returns an nltk.Tree whose labeled subtrees are entities
named_entities = ne_chunk(pos_tags)
print(named_entities)


(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Steve/NNP Jobs/NNP)
  ,/,
  (PERSON Steve/NNP Wozniak/NNP)
  ,/,
  and/CC
  (PERSON Ronald/NNP Wayne/NNP)
  in/IN
  (GPE Cupertino/NNP)
  ,/,
  (GPE California/NNP)
  on/IN
  April/NNP
  1/CD
  ,/,
  1976/CD
  ./.)
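
The nested tree printed above can also be flattened into one tag per token, which is often easier to store or compare. The sketch below uses `tree2conlltags` from `nltk.chunk` to produce IOB tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The `sample_tree` here is a small hand-built stand-in for `ne_chunk` output, an assumption made so that the snippet runs without downloading any models.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built stand-in for ne_chunk output (hypothetical, for illustration only).
sample_tree = Tree("S", [
    Tree("PERSON", [("Steve", "NNP"), ("Jobs", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("Cupertino", "NNP")]),
])

# Flatten the tree into (word, POS tag, IOB tag) triples.
iob_tags = tree2conlltags(sample_tree)

for word, pos, iob in iob_tags:
    print(f"{word}\t{pos}\t{iob}")
```

In the notebook itself, passing `named_entities` in place of `sample_tree` should give the same kind of triples for the full sentence.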


In [7]:
# Extract Named Entities

# Entity chunks are subtrees with a label; plain tokens are (word, tag) tuples without one
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        print(f"{chunk.label()}: {' '.join(c[0] for c in chunk)}")


PERSON: Apple
ORGANIZATION: Inc.
PERSON: Steve Jobs
PERSON: Steve Wozniak
PERSON: Ronald Wayne
GPE: Cupertino
GPE: California


Explanation of the Steps
  - Tokenization: We break the text into words, preparing it for further analysis.
  
  - POS Tagging: We tag each word with its part of speech (noun, verb, etc.). ne_chunk requires POS-tagged tokens as its input.

  - NER with ne_chunk: This function takes the POS-tagged tokens and applies NLTK's pre-trained maximum-entropy chunker (the maxent_ne_chunker model downloaded earlier), which groups tokens into labeled entity chunks such as PERSON, ORGANIZATION, and GPE.
  
  - Extracting Entities: Finally, we extract the named entities and their types from the Tree object returned by ne_chunk.
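
The extraction step can be wrapped in a small reusable helper. The following is a minimal sketch, not part of NLTK: `extract_entities` is a hypothetical function name, and `sample_tree` is a hand-built stand-in for `ne_chunk` output so that the snippet runs without any model downloads.

```python
from nltk import Tree


def extract_entities(tree):
    """Return (label, text) pairs for each labeled chunk directly under the root."""
    entities = []
    for node in tree:
        # Entity chunks are Tree objects; plain tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built stand-in for ne_chunk output (hypothetical, for illustration only).
sample_tree = Tree("S", [
    Tree("PERSON", [("Steve", "NNP"), ("Jobs", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("Cupertino", "NNP")]),
])

entities = extract_entities(sample_tree)
print(entities)
```

Returning a list of pairs instead of printing makes the result easy to filter, count, or pass to later processing steps.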


Applications of NER
  - Information Extraction: Automatically extracting information such as names, dates, and locations from large text corpora.

  - Document Categorization: Categorizing news articles or documents based on the entities they mention.

  - Knowledge Graph Construction: Building knowledge graphs that map relationships between entities.
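
As a rough illustration of how extracted entities feed these applications, the sketch below groups entities by type and builds naive co-occurrence edges between them. The `entities` list is copied from the notebook output above, including the chunker's labels for "Apple" and "Inc."; the names `by_label` and `edges` are assumptions made for this example, and a real knowledge graph would need relation labels rather than bare co-occurrence.

```python
from collections import defaultdict
from itertools import combinations

# (label, text) pairs copied from the extraction output shown earlier.
entities = [
    ("PERSON", "Apple"),
    ("ORGANIZATION", "Inc."),
    ("PERSON", "Steve Jobs"),
    ("PERSON", "Steve Wozniak"),
    ("PERSON", "Ronald Wayne"),
    ("GPE", "Cupertino"),
    ("GPE", "California"),
]

# Information extraction / categorization: group entity strings by type.
by_label = defaultdict(list)
for label, text in entities:
    by_label[label].append(text)

# Knowledge-graph sketch: connect every pair of entities found in the same sentence.
edges = list(combinations([text for _label, text in entities], 2))

print(dict(by_label))
print(len(edges), "co-occurrence edges")
```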

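NLTK’s NER pipeline lets you pull entities out of text in a few lines of code, which helps when processing large volumes of documents. Its output still needs checking: in the example above, the chunker splits "Apple Inc." into a PERSON ("Apple") and an ORGANIZATION ("Inc."), and it leaves the date "April 1, 1976" unlabeled.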






