# Lecture 70: Named Entity Recognition (NER)

This notebook demonstrates **Named Entity Recognition (NER)** using the `spaCy` library in Python. NER is an NLP task that identifies and classifies named entities in text, such as person names, locations, organizations, and more. We'll use a pre-trained spaCy model to process text, extract entities, and visualize them. The notebook covers:

- Setting up spaCy and loading a pre-trained model
- Processing text to extract named entities
- Visualizing entities using spaCy's visualization tools
- Handling custom text inputs for NER
- Exploring entity types and their applications

We'll use the `en_core_web_sm` model, a small English model suitable for general-purpose NER.

## Setup and Imports

Let's import the necessary libraries and load a pre-trained spaCy pipeline. If spaCy or the model is not installed yet, run `pip install spacy` followed by `python -m spacy download en_core_web_sm` first.

In [7]:
import spacy
from spacy import displacy
import pandas as pd

# Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Verify model loading
print("SpaCy model loaded successfully:", nlp.pipe_names)

SpaCy model loaded successfully: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
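
If `spacy.load` raises an `OSError`, the model package is most likely missing. spaCy models install as ordinary Python packages, so one way to check before loading is sketched below. `model_is_installed` is a hypothetical helper name used for illustration, not part of spaCy.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


model_name = "en_core_web_sm"
if model_is_installed(model_name):
    print(f"{model_name} is installed")
else:
    # Standard spaCy CLI command for fetching a pre-trained pipeline
    print(f"Missing model. Run: python -m spacy download {model_name}")
```

The check only confirms that the package exists; it does not verify that the model version is compatible with the installed spaCy version.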


## Sample Text for NER

We'll start with a sample text containing various named entities (e.g., persons, locations, organizations) to demonstrate NER.

In [8]:
# Sample text
sample_text = """
Apple Inc. is planning to open a new store in London next year.
Elon Musk, the CEO of Tesla, visited Shanghai to discuss new factory plans.
The United Nations will hold a climate conference in Paris in 2025.
"""

# Process the text with spaCy
doc = nlp(sample_text)

# Extract entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Display entities in a DataFrame
entities_df = pd.DataFrame(entities, columns=['Entity', 'Type'])
print("Named Entities:")
print(entities_df)

Named Entities:
               Entity    Type
0          Apple Inc.     ORG
1              London     GPE
2           next year    DATE
3           Elon Musk  PERSON
4               Tesla     ORG
5            Shanghai     GPE
6  The United Nations     ORG
7               Paris     GPE
8                2025    DATE
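
A flat table is convenient for display, but it is often useful to see which entities fall under each type. The sketch below groups the `(text, label)` pairs by label; the pairs are copied from the output above so the block runs on its own, and `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; in the notebook this
# is the `entities` list built from doc.ents.
entities = [
    ("Apple Inc.", "ORG"),
    ("London", "GPE"),
    ("next year", "DATE"),
    ("Elon Musk", "PERSON"),
    ("Tesla", "ORG"),
    ("Shanghai", "GPE"),
    ("The United Nations", "ORG"),
    ("Paris", "GPE"),
    ("2025", "DATE"),
]


def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(f"{label}: {', '.join(texts)}")
```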


## Visualizing Named Entities

spaCy's `displacy` module renders the text with each entity highlighted and labeled by type (e.g., PERSON, GPE, ORG). Passing `jupyter=True` displays the result inline in the notebook.

In [9]:
# Visualize entities
displacy.render(doc, style='ent', jupyter=True)
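
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below is one possible way to do that while restricting the highlighted types. It uses displaCy's manual mode with hand-built spans so that it runs without a model download; `make_span`, the chosen color, and the file name in the comment are illustrative assumptions.

```python
from spacy import displacy


def make_span(text: str, phrase: str, label: str) -> dict:
    """Build a displaCy entity dict from the first occurrence of `phrase`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


text = "Apple Inc. is planning to open a new store in London next year."

# Manual mode renders pre-computed spans. With a real Doc from the notebook,
# pass `doc` as the first argument and drop `manual=True`.
example = {
    "text": text,
    "ents": [
        make_span(text, "Apple Inc.", "ORG"),
        make_span(text, "London", "GPE"),
        make_span(text, "next year", "DATE"),
    ],
    "title": None,
}

# Highlight only ORG and GPE, and override the ORG color (arbitrary choice)
options = {"ents": ["ORG", "GPE"], "colors": {"ORG": "#ffd966"}}

html = displacy.render(
    example, style="ent", manual=True, jupyter=False, page=True, options=options
)

# `html` is a full HTML page; to view it in a browser, write it to disk, e.g.
# pathlib.Path("entities.html").write_text(html, encoding="utf-8")
print(len(html), "characters of HTML generated")
```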

## Common Entity Types

The spaCy model identifies various entity types. Here are some common ones:

- **PERSON**: Names of people (e.g., Elon Musk)
- **GPE**: Geopolitical entities, such as countries or cities (e.g., London, Shanghai)
- **ORG**: Organizations or companies (e.g., Apple Inc., Tesla)
- **DATE**: Dates or time periods (e.g., next year, 2025)
- **EVENT**: Named events (e.g., the Olympic Games)

For a full list, you can check spaCy's documentation or the model's labels.

In [10]:
# List all entity types supported by the model
print("Supported Entity Types:")
print(nlp.get_pipe('ner').labels)

Supported Entity Types:
('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
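
Some of these labels (for example `NORP` or `FAC`) are not self-explanatory. `spacy.explain` returns a short description for a label, so a glossary can be printed as sketched below. The label tuple is copied from the output above so the block does not need the model to be loaded.

```python
import spacy

# Labels copied from the output above
labels = (
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
)

for label in labels:
    # spacy.explain looks the label up in spaCy's built-in glossary
    print(f"{label:<12} {spacy.explain(label)}")
```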


## NER on Custom Text Inputs

The same steps work on any input string. Here we run the pipeline on three example texts and print and visualize the entities found in each. The small model is not always right: in the output below, "AI" is labeled as GPE.

In [11]:
# Custom text inputs
custom_texts = [
    "Barack Obama visited Tokyo to meet with Prime Minister Shigeru Ishiba.",
    "Google announced a new AI research lab in New York City.",
    "The Eiffel Tower in Paris is a popular tourist destination."
]

# Process and visualize each custom text
for i, text in enumerate(custom_texts, 1):
    print(f"\nCustom Text {i}: {text}")
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entities_df = pd.DataFrame(entities, columns=['Entity', 'Type'])
    print("Named Entities:")
    print(entities_df)
    print("Visualization:")
    displacy.render(doc, style='ent', jupyter=True)


Custom Text 1: Barack Obama visited Tokyo to meet with Prime Minister Shigeru Ishiba.
Named Entities:
           Entity    Type
0    Barack Obama  PERSON
1           Tokyo     GPE
2  Shigeru Ishiba  PERSON
Visualization:



Custom Text 2: Google announced a new AI research lab in New York City.
Named Entities:
          Entity Type
0         Google  ORG
1             AI  GPE
2  New York City  GPE
Visualization:



Custom Text 3: The Eiffel Tower in Paris is a popular tourist destination.
Named Entities:
             Entity Type
0  The Eiffel Tower  LOC
1             Paris  GPE
Visualization:
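
When there are many texts, calling `nlp(text)` in a loop can be replaced with `nlp.pipe`, which processes texts in batches. The function below is a minimal sketch of that idea; `extract_entities` is a hypothetical helper name, and the usage lines assume `nlp` and `custom_texts` are defined as in the cells above.

```python
def extract_entities(nlp, texts):
    """Return one list of (text, label) pairs per input text."""
    # nlp.pipe yields one Doc per input text, in the same order
    return [
        [(ent.text, ent.label_) for ent in doc.ents]
        for doc in nlp.pipe(texts)
    ]


# Usage in the notebook (assumes the setup cell has been run):
# for text, ents in zip(custom_texts, extract_entities(nlp, custom_texts)):
#     print(text, "->", ents)
```

Returning plain tuples rather than `Doc` objects keeps the result easy to store or convert to a DataFrame, at the cost of losing token-level information.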


## Handling a Larger Text Corpus

To demonstrate NER on a larger corpus, we'll process multiple sentences and aggregate entity statistics (e.g., frequency of entity types).

In [12]:
# Larger sample text corpus
corpus = """
Microsoft was founded by Bill Gates and Paul Allen in Redmond, Washington.
The World Health Organization is headquartered in Geneva, Switzerland.
Greta Thunberg spoke at the United Nations in New York in 2019.
Amazon is expanding its operations in India, led by Jeff Bezos.
The Olympic Games will be held in Los Angeles in 2028.
"""

# Process the corpus
doc = nlp(corpus)

# Aggregate entity statistics
entity_counts = {}
for ent in doc.ents:
    entity_counts[ent.label_] = entity_counts.get(ent.label_, 0) + 1

# Display entity statistics
print("Entity Type Frequencies:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")

# Visualize entities
displacy.render(doc, style='ent', jupyter=True)

Entity Type Frequencies:
ORG: 4
PERSON: 4
GPE: 7
DATE: 2
EVENT: 1
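
The manual dictionary count can also be expressed with `collections.Counter`, which adds sorting by frequency. The sketch below starts from the counts printed above so that it runs on its own; in the notebook, `Counter(ent.label_ for ent in doc.ents)` would build the same object directly.

```python
from collections import Counter

# Counts copied from the output above
entity_counts = Counter({"ORG": 4, "PERSON": 4, "GPE": 7, "DATE": 2, "EVENT": 1})
total = sum(entity_counts.values())

# most_common() sorts by count, highest first
for label, count in entity_counts.most_common():
    print(f"{label:<7} {count:>2}  {count / total:.1%}")
```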


## Explanation

- **NER Overview**: Named Entity Recognition identifies and classifies entities like PERSON, GPE, ORG, etc., in text, useful for information extraction and knowledge graph construction.
- **SpaCy Model**: Used `en_core_web_sm`, a lightweight English model with pre-trained NER capabilities.
- **Processing**: Applied the spaCy pipeline to extract entities from sample texts, custom inputs, and a larger corpus.
- **Visualization**: Used `displacy` to highlight entities inline in the text, which makes the model's output easy to inspect.
- **Applications**: Demonstrated NER on varied texts, showing its utility for extracting structured information.
- **Entity Statistics**: Aggregated entity type frequencies to analyze the distribution of entities in a corpus.

To extend this work, consider:
- Using a larger spaCy model (e.g., `en_core_web_lg` or the transformer-based `en_core_web_trf`), which generally improves accuracy at the cost of speed and memory
- Fine-tuning the NER model on a custom dataset for domain-specific entities
- Integrating NER with other NLP tasks (e.g., relation extraction)
- Handling multilingual texts with language-specific pipelines or spaCy's multi-language NER model (`xx_ent_wiki_sm`)
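
For domain-specific entities, a lighter-weight alternative to fine-tuning is to add hand-written rules with spaCy's `entity_ruler` component. Note that this is rule-based matching, not model training. The sketch below uses a blank English pipeline so that it runs without a model download, and the two patterns are hypothetical examples.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical components.
# With en_core_web_sm, nlp.add_pipe("entity_ruler", before="ner") would
# apply the rules first, and the NER component keeps entities already set.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical domain patterns: an exact phrase and a token-level pattern
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Model Y"},
    {"label": "ORG", "pattern": [{"LOWER": "tesla"}]},
])

doc = nlp("Tesla began Model Y deliveries in Shanghai.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the blank pipeline has no trained NER component, "Shanghai" is left unlabeled here; only the rule matches appear in `doc.ents`.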