# Named Entity Recognition (NER) Activity

This notebook demonstrates how to perform Named Entity Recognition (NER) with `spaCy`. We'll process a small sample text, extract entities such as persons, organizations, dates, and locations, and visualize the results. The exercise shows how NER works in practice and how it can be applied to analyze text data.

## Objectives
1. Learn how to load and process text data with `spaCy`.
2. Extract named entities such as persons, organizations, dates, and locations.
3. Visualize the extracted entities.
4. Reflect on the results with discussion questions.

## Step 1: Import Libraries
We'll start by importing the necessary libraries for this exercise.

In [1]:
# Import spaCy and its built-in visualizer
import spacy
from spacy import displacy

## Step 2: Create a Sample Dataset
Let's create a small sample of press-release-style text, stored as a single string, for analysis.

In [2]:
# Sample dataset
sample_dataset = """
The Secretary of State, John Doe, announced a new policy during a press conference held in Washington, D.C.,
on January 15th, 2023. The Environmental Protection Agency (EPA) is set to implement this policy immediately.
In another event, President Jane Smith visited New York City on February 3rd, 2023, to discuss infrastructure funding.
"""

# Display dataset
print("Sample Dataset:")
print(sample_dataset)

Sample Dataset:

The Secretary of State, John Doe, announced a new policy during a press conference held in Washington, D.C., 
on January 15th, 2023. The Environmental Protection Agency (EPA) is set to implement this policy immediately.
In another event, President Jane Smith visited New York City on February 3rd, 2023, to discuss infrastructure funding.



## Step 3: Load `spaCy` Language Model
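We'll load the pre-trained English language model (`en_core_web_sm`) to process the text. If the model is not installed, `spacy.load` raises an `OSError`; install it first with `python -m spacy download en_core_web_sm`.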

In [3]:
# Load spaCy language model
nlp = spacy.load("en_core_web_sm")
print("spaCy Model Loaded: en_core_web_sm")

spaCy Model Loaded: en_core_web_sm


## Step 4: Process the Dataset
We'll run the `spaCy` pipeline on the text. Calling `nlp` returns a `Doc` object whose `ents` attribute holds the recognized entity spans, each with its text and a label.

In [4]:
# Process the sample dataset
doc = nlp(sample_dataset)

# Extract and display named entities
print(f"{'Entity':<30} | {'Type':<15}")
print("="*50)
for ent in doc.ents:
    print(f"{ent.text:<30} | {ent.label_:<15}")

Entity                         | Type           
==================================================
State                          | ORG            
John Doe                       | PERSON         
Washington                     | GPE            
D.C.                           | GPE            
January 15th, 2023             | DATE           
The Environmental Protection Agency | ORG            
EPA                            | ORG            
Jane Smith                     | PERSON         
New York City                  | GPE            
February 3rd, 2023             | DATE           
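
In the output above, the row for `The Environmental Protection Agency` overflows the fixed 30-character column, so the table no longer lines up. A minimal sketch of one way around this is to size each column from its longest value. The entity/label pairs below are copied from the output above as plain tuples so the example runs without the model, and `format_entity_table` is a hypothetical helper, not part of `spaCy`.

```python
# Entity/label pairs copied from the notebook output, so no model is required.
entities = [
    ("State", "ORG"),
    ("John Doe", "PERSON"),
    ("Washington", "GPE"),
    ("D.C.", "GPE"),
    ("January 15th, 2023", "DATE"),
    ("The Environmental Protection Agency", "ORG"),
    ("EPA", "ORG"),
    ("Jane Smith", "PERSON"),
    ("New York City", "GPE"),
    ("February 3rd, 2023", "DATE"),
]


def format_entity_table(pairs):
    """Return table lines whose columns are as wide as their longest value."""
    pairs = list(pairs)
    text_width = max([len("Entity")] + [len(text) for text, _ in pairs])
    label_width = max([len("Type")] + [len(label) for _, label in pairs])

    lines = [f"{'Entity':<{text_width}} | {'Type':<{label_width}}"]
    # The separator spans both columns plus the 3-character " | " divider.
    lines.append("=" * (text_width + 3 + label_width))
    for text, label in pairs:
        lines.append(f"{text:<{text_width}} | {label:<{label_width}}")
    return lines


table_lines = format_entity_table(entities)
print("\n".join(table_lines))
```

With a live `doc`, the same helper could be called as `format_entity_table((ent.text, ent.label_) for ent in doc.ents)`.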


## Step 5: Organize Entities by Type
We'll group the extracted entities by label: `PERSON`, `ORG`, `DATE`, and `GPE` (geopolitical entities such as countries, cities, and states, which we treat here as locations).

In [5]:
# Organize entities by type
persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]

# Display categorized entities
print("\nPersons Detected:", persons)
print("Organizations Detected:", organizations)
print("Dates Detected:", dates)
print("Locations Detected:", locations)


Persons Detected: ['John Doe', 'Jane Smith']
Organizations Detected: ['State', 'The Environmental Protection Agency', 'EPA']
Dates Detected: ['January 15th, 2023', 'February 3rd, 2023']
Locations Detected: ['Washington', 'D.C.', 'New York City']
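
The four list comprehensions above each scan the entities once and only cover labels chosen in advance. As a hedged alternative, the sketch below groups every label in a single pass with `collections.defaultdict`. The pairs are copied from the Step 4 output so the example runs without the model, and `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# Entity/label pairs copied from the Step 4 output.
entities = [
    ("State", "ORG"),
    ("John Doe", "PERSON"),
    ("Washington", "GPE"),
    ("D.C.", "GPE"),
    ("January 15th, 2023", "DATE"),
    ("The Environmental Protection Agency", "ORG"),
    ("EPA", "ORG"),
    ("Jane Smith", "PERSON"),
    ("New York City", "GPE"),
    ("February 3rd, 2023", "DATE"),
]


def group_by_label(pairs):
    """Map each label to the list of entity texts carrying it, in order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```

With a live `doc`, the call would be `group_by_label((ent.text, ent.label_) for ent in doc.ents)`; any label the model emits then gets its own group without a dedicated comprehension.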


## Step 6: Visualize the Results
We'll use `displacy` from `spaCy` to render the named entities directly on the text.

In [6]:
# Render entities with displacy
displacy.render(doc, style="ent", jupyter=True)
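
`displacy` highlights entities in context. A complementary view is how often each label occurs. The sketch below is a plain-text summary built with `collections.Counter`, not a `displacy` feature; the labels are copied from the Step 4 output so it runs without the model, and `label_bars` is a hypothetical helper.

```python
from collections import Counter

# Labels in the order they appear in the Step 4 output.
labels = ["ORG", "PERSON", "GPE", "GPE", "DATE",
          "ORG", "ORG", "PERSON", "GPE", "DATE"]


def label_bars(label_list, symbol="#"):
    """Return one text bar per label, most frequent label first."""
    counts = Counter(label_list)
    width = max((len(label) for label in counts), default=0)
    return [
        f"{label:<{width}} {symbol * n} ({n})"
        for label, n in counts.most_common()
    ]


bars = label_bars(labels)
print("\n".join(bars))
```

With a live `doc`, the input could be `label_bars(ent.label_ for ent in doc.ents)`.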

## Step 7: Reflection Questions
1. What additional entities (other than persons, organizations, dates, and locations) could be extracted from this dataset?
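2. How could the quality of the extracted entities improve with larger models? For example, the small model labels `State` as an organization and splits `Washington, D.C.` into two entities.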
3. What challenges might arise when applying NER to multilingual datasets?

## Step 8: Save Results to a File
Finally, we'll save the categorized entities to a text file for later reference.

In [7]:
# Save results to a file
with open("ner_results.txt", "w", encoding="utf-8") as file:
    file.write("Named Entity Recognition (NER) Results\n")
    file.write("="*40 + "\n")
    file.write(f"Persons: {persons}\n")
    file.write(f"Organizations: {organizations}\n")
    file.write(f"Dates: {dates}\n")
    file.write(f"Locations: {locations}\n")

print("Results saved to 'ner_results.txt'.")

Results saved to 'ner_results.txt'.
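
The text file above stores each list as its Python `repr`, which is easy to read but awkward to load back into a program. If the results need to be reused, one option is JSON. This is a minimal sketch under stated assumptions: the file name `ner_results.json`, the temporary directory, and the helpers `save_results` and `load_results` are all hypothetical, and the lists are copied from the Step 5 output.

```python
import json
import tempfile
from pathlib import Path

# Categorized entities copied from the Step 5 output.
results = {
    "persons": ["John Doe", "Jane Smith"],
    "organizations": ["State", "The Environmental Protection Agency", "EPA"],
    "dates": ["January 15th, 2023", "February 3rd, 2023"],
    "locations": ["Washington", "D.C.", "New York City"],
}


def save_results(data, path):
    """Write the entity lists as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def load_results(path):
    """Read the entity lists back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical output location; a temp directory keeps the example side-effect free.
out_path = Path(tempfile.gettempdir()) / "ner_results.json"
save_results(results, out_path)
restored = load_results(out_path)
print(restored["persons"])
```

Because JSON round-trips lists and strings unchanged, `restored` compares equal to `results`, which makes the file usable as input for later analysis.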
