# Named Entity Recognition (NER)

---

## Table of Contents
1. [Introduction to NER](#introduction)
2. [Understanding Named Entities](#understanding)
3. [NER with NLTK](#nltk-ner)
4. [NER with spaCy](#spacy-ner)
5. [Entity Types](#entity-types)
6. [Applications of NER](#applications)
7. [Advanced Topics](#advanced)
8. [Real-World Examples](#real-world)

---

## 1. Introduction to NER <a id='introduction'></a>

**Named Entity Recognition (NER)** is the task of locating named entities in text and classifying each one into a predefined category.

### What is a Named Entity?

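A **named entity** is a real-world object that can be referred to by a proper name. In practice, NER systems also tag numeric and temporal expressions (dates, times, money, percentages), even though these are not names in the strict sense. Typical categories: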
- **Person**: Barack Obama, Marie Curie
- **Organization**: Google, United Nations
- **Location**: Paris, Mount Everest
- **Date**: January 1, 2024
- **Time**: 3:00 PM
- **Money**: $100, €50
- **Percentage**: 25%, 3.14%

### Example:

```
Text: "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."

Named Entities:
  - "Apple Inc." → ORGANIZATION
  - "Steve Jobs" → PERSON
  - "Cupertino" → LOCATION (GPE - Geopolitical Entity)
  - "1976" → DATE
```

### Why is NER Important?

1. **Information Extraction**
   - Extract structured information from unstructured text
   - Build knowledge graphs

2. **Question Answering**
   - "Who founded Apple?" → Need to identify person entities

3. **Content Classification**
   - Categorize articles by mentioned entities

4. **Search Enhancement**
   - Improve search by recognizing entity types

5. **Recommendation Systems**
   - Recommend articles mentioning similar entities

### NER Pipeline:

A classic feature-based pipeline, as used by NLTK, runs these stages in order:

```
Raw Text → Tokenization → POS Tagging → Entity Detection → Entity Classification
```

In [None]:
# Setup: Import necessary libraries
import nltk
import spacy
from spacy import displacy
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# NLTK imports
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)
nltk.download('words', quiet=True)

# Load spaCy's small English model
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

print("✓ All libraries imported successfully!")

## 2. Understanding Named Entities <a id='understanding'></a>

### Common Entity Types:

| Entity Type | Description | Examples |
|-------------|-------------|----------|
| **PERSON** | People's names | Barack Obama, Marie Curie |
| **ORGANIZATION (ORG)** | Companies, agencies, institutions | Google, NASA, Harvard |
| **GPE** | Geopolitical entities (countries, cities) | USA, Paris, California |
| **LOCATION (LOC)** | Non-GPE locations | Mount Everest, Pacific Ocean |
| **DATE** | Absolute or relative dates | January 1, yesterday, 2024 |
| **TIME** | Times shorter than a day | 3:00 PM, morning |
| **MONEY** | Monetary values | $100, €50, £30 |
| **PERCENT** | Percentages | 25%, 0.5% |
| **PRODUCT** | Products | iPhone, Windows |
| **EVENT** | Named events | Olympics, World War II |
| **LANGUAGE** | Languages | English, Spanish |
| **WORK_OF_ART** | Books, songs, etc. | "Mona Lisa", "Hamlet" |
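
Toolkits spell these labels differently: NLTK's chunker emits names such as `ORGANIZATION` and `LOCATION`, while spaCy uses `ORG` and `LOC`. Below is a minimal sketch of mapping both onto one scheme before comparing results. The `LABEL_ALIASES` table and the `normalize_label` helper are illustrative assumptions, not part of either library.

```python
# Hypothetical alias table: maps toolkit-specific spellings onto the
# short labels used in the table above. Extend it for your own data.
LABEL_ALIASES = {
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "PER": "PERSON",
    "PERCENTAGE": "PERCENT",
}

def normalize_label(label):
    """Return the canonical short form of an entity label."""
    label = label.upper()
    return LABEL_ALIASES.get(label, label)

print(normalize_label("ORGANIZATION"))  # ORG
print(normalize_label("gpe"))           # GPE (unknown labels pass through, upper-cased)
```

Keeping the mapping as data means a new spelling only needs a new dictionary entry.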

## 3. NER with NLTK <a id='nltk-ner'></a>

NLTK provides basic NER through `ne_chunk()`, which takes POS-tagged tokens and returns a tree (`nltk.Tree`) whose labeled subtrees mark the named entities.

In [None]:
# Basic NER with NLTK

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

print(f"Text: '{text}'\n")
print("="*80)

# Step 1: Tokenize
tokens = word_tokenize(text)
print(f"\n1. Tokens: {tokens}")

# Step 2: POS Tagging
pos_tags = pos_tag(tokens)
print(f"\n2. POS Tags: {pos_tags}")

# Step 3: Named Entity Recognition
# binary=False returns entity types (PERSON, ORGANIZATION, etc.)
# binary=True just marks whether it's a named entity or not
named_entities = ne_chunk(pos_tags, binary=False)

print(f"\n3. Named Entities (tree structure):")
print(named_entities)

In [None]:
# Extract entities in a more readable format

def extract_entities_nltk(text):
    """
    Extract named entities using NLTK.
    
    Returns:
        list: List of (entity_text, entity_type) tuples
    """
    # Tokenize and POS tag
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    
    # Named entity recognition
    chunks = ne_chunk(pos_tags, binary=False)
    
    entities = []
    for chunk in chunks:
        if hasattr(chunk, 'label'):
            # This is a named entity
            entity_text = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            entities.append((entity_text, entity_type))
    
    return entities

# Test on sample text
text = """Microsoft was founded by Bill Gates and Paul Allen in Seattle. 
The company is now headquartered in Redmond, Washington. 
Satya Nadella became CEO in February 2014."""

entities = extract_entities_nltk(text)

print("Extracted Named Entities (NLTK):\n")
print("="*80)
print(f"Text: {text}\n")
print("-"*80)
print(f"\n{'Entity':<30} {'Type':<20}")
print("-"*50)

for entity, entity_type in entities:
    print(f"{entity:<30} {entity_type:<20}")

In [None]:
# Alternative format using tree2conlltags

def extract_entities_with_iob(text):
    """
    Extract entities in IOB (Inside-Outside-Beginning) format.
    IOB format marks the boundaries of chunks.
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    chunks = ne_chunk(pos_tags)
    
    # Convert tree to IOB tags
    iob_tags = tree2conlltags(chunks)
    
    return iob_tags

# Example
text = "Barack Obama was born in Hawaii and became the 44th President of the United States."

iob_tags = extract_entities_with_iob(text)

print("IOB Format Tagging:\n")
print("="*80)
print(f"Text: {text}\n")
print(f"{'Word':<20} {'POS':<10} {'IOB Tag':<15}")
print("-"*50)

for word, pos, iob in iob_tags:
    print(f"{word:<20} {pos:<10} {iob:<15}")

print("\nIOB Tag Explanation:")
print("  O         = Outside (not a named entity)")
print("  B-TYPE    = Beginning of entity of TYPE")
print("  I-TYPE    = Inside (continuation) of entity of TYPE")
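
IOB tags label individual tokens, so multi-word entities have to be reassembled. The sketch below groups `(word, pos, iob_tag)` triples, the shape returned by `tree2conlltags`, back into entity spans. The helper name `iob_to_spans` is an assumption, and the sample tags are written by hand rather than taken from actual NLTK output.

```python
def iob_to_spans(tagged):
    """Group (word, pos, iob_tag) triples into (entity_text, entity_type) pairs."""
    spans = []
    words, current_type = [], None
    for word, _pos, tag in tagged:
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and etype == current_type:
            # Continuation of the entity that is currently open
            words.append(word)
            continue
        # Anything else closes the open entity, if there is one
        if words:
            spans.append((" ".join(words), current_type))
        if prefix in ("B", "I"):
            # "B-" opens an entity; a stray "I-" is treated as an opening too
            words, current_type = [word], etype
        else:
            words, current_type = [], None
    if words:
        spans.append((" ".join(words), current_type))
    return spans

# Hand-written tags for illustration
sample = [
    ("Barack", "NNP", "B-PERSON"), ("Obama", "NNP", "I-PERSON"),
    ("was", "VBD", "O"), ("born", "VBN", "O"), ("in", "IN", "O"),
    ("Hawaii", "NNP", "B-GPE"),
]
print(iob_to_spans(sample))  # [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```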

## 4. NER with spaCy <a id='spacy-ner'></a>

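spaCy runs NER as part of `nlp(text)` and exposes the results as `doc.ents`. The `en_core_web_sm` model loaded above is trained on OntoNotes, so it covers numeric and temporal types such as `MONEY`, `DATE`, and `PERCENT`, and it is usually more accurate than NLTK's chunker on news-style text.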

In [None]:
# Basic NER with spaCy

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

print(f"Text: '{text}'\n")
print("="*80)

# Process with spaCy
doc = nlp(text)

# Extract entities
print("\nNamed Entities (spaCy):\n")
print(f"{'Entity':<25} {'Type':<15} {'Description'}")
print("-"*70)

for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_)}")

print("\n" + "="*80)
print("\nEntity Details:")
for ent in doc.ents:
    print(f"  '{ent.text}' → {ent.label_} (characters {ent.start_char}-{ent.end_char})")
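
The character offsets printed above are enough to mark entities inline in the original string. The following sketch assumes non-overlapping spans with an exclusive end offset, as with `ent.start_char` and `ent.end_char`. The `annotate` helper is hypothetical, and the spans are written by hand so that the example runs without a model.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span in [brackets] with its label."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])              # untouched text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")  # the entity itself, tagged
        cursor = end
    pieces.append(text[cursor:])                       # remaining tail
    return "".join(pieces)

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Hand-written spans standing in for (ent.start_char, ent.end_char, ent.label_);
# offsets are computed with str.find so they stay correct if the text changes.
spans = []
for phrase, label in [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("Cupertino", "GPE")]:
    start = text.find(phrase)
    spans.append((start, start + len(phrase), label))

print(annotate(text, spans))
# [Apple Inc.|ORG] was founded by [Steve Jobs|PERSON] in [Cupertino|GPE].
```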

In [None]:
# More comprehensive example

text = """On January 15, 2024, Elon Musk announced that Tesla would invest 
$5 billion in a new factory in Austin, Texas. The stock price rose by 12.5% 
following the announcement. SpaceX, another company founded by Musk, also 
reported record revenues of €2.3 billion last quarter."""

doc = nlp(text)

print("Comprehensive NER Example:\n")
print("="*80)
print(f"Text: {text}\n")
print("-"*80)

# Group entities by type
entities_by_type = {}
for ent in doc.ents:
    if ent.label_ not in entities_by_type:
        entities_by_type[ent.label_] = []
    entities_by_type[ent.label_].append(ent.text)

# Display grouped entities
for entity_type, entities in sorted(entities_by_type.items()):
    print(f"\n{entity_type} ({spacy.explain(entity_type)}):")
    for entity in entities:
        print(f"  - {entity}")

In [None]:
# Compare NLTK vs spaCy

test_text = """Google CEO Sundar Pichai announced a $10 billion investment in AI 
research at the company's headquarters in Mountain View, California on March 1, 2024."""

print("NLTK vs spaCy Comparison:\n")
print("="*80)
print(f"Text: {test_text}\n")
print("-"*80)

# NLTK entities
nltk_entities = extract_entities_nltk(test_text)
print("\nNLTK Entities:")
for entity, etype in nltk_entities:
    print(f"  {entity:<30} → {etype}")

# spaCy entities
spacy_doc = nlp(test_text)
print("\nspaCy Entities:")
for ent in spacy_doc.ents:
    print(f"  {ent.text:<30} → {ent.label_}")

print("\n" + "="*80)
print("\nObservations:")
print("  - spaCy generally provides more accurate and detailed entity recognition")
print("  - spaCy recognizes more entity types (MONEY, DATE, etc.)")
print("  - NLTK is simpler but less comprehensive")
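
Observations like these are easier to trust when backed by numbers. One common way to compare systems is exact-match precision, recall, and F1 against hand-labeled gold entities. The sketch below assumes both sides use the same label scheme; the gold annotations and system output are hypothetical, written by hand for illustration, and `entity_prf` is not a library function.

```python
def entity_prf(predicted, gold):
    """Exact-match precision, recall and F1 over (text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical gold annotations and system output
gold = [("Google", "ORG"), ("Sundar Pichai", "PERSON"),
        ("Mountain View", "GPE"), ("California", "GPE")]
system_output = [("Google", "ORG"), ("Sundar Pichai", "PERSON"),
                 ("Mountain View", "ORG")]  # wrong label counts as an error

precision, recall, f1 = entity_prf(system_output, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

Because the pairs are stored in sets, repeated mentions of the same entity count once; a span-based variant would compare character offsets instead.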

## 5. Entity Types <a id='entity-types'></a>

Let's explore different entity types in detail.

In [None]:
# Examples of different entity types

examples = {
    'PERSON': "Albert Einstein, Marie Curie, and Isaac Newton were brilliant scientists.",
    'ORG': "Google, Microsoft, and Amazon are leading tech companies.",
    'GPE': "Paris, London, and Tokyo are major world cities.",
    'MONEY': "The house costs $500,000 or approximately €450,000.",
    'DATE': "The meeting is scheduled for January 15, 2024, next Monday.",
    'TIME': "We'll meet at 3:30 PM or around noon tomorrow.",
    'PERCENT': "The stock rose by 15.5% and fell by 2.3% the next day.",
    'PRODUCT': "I bought the new iPhone 15 Pro and a MacBook Air.",
}

print("Entity Type Examples:\n")
print("="*80)

for target_type, text in examples.items():
    doc = nlp(text)
    
    print(f"\n{target_type}:")
    print(f"Text: {text}")
    print("Entities found:")
    
    for ent in doc.ents:
        if ent.label_ == target_type or (target_type == 'GPE' and ent.label_ in ['GPE', 'LOC']):
            print(f"  ✓ '{ent.text}' → {ent.label_}")
    
    print("-"*80)

In [None]:
# List the entity types spaCy finds in a sample text (not the model's full label set)

# Process a diverse text
diverse_text = """On June 15, 2024, at 10:00 AM, President Joe Biden met with 
the Japanese Prime Minister at the White House in Washington, D.C. They discussed 
a $500 million trade agreement. The Nikkei 225 rose by 3.2% while the S&P 500 
gained 1.8%. Apple Inc. and Toyota Motor Corporation signed a partnership deal."""

doc = nlp(diverse_text)

# Get unique entity types
entity_types = set([ent.label_ for ent in doc.ents])

print("All Entity Types Found:\n")
print("="*80)

for etype in sorted(entity_types):
    explanation = spacy.explain(etype)
    entities = [ent.text for ent in doc.ents if ent.label_ == etype]
    print(f"\n{etype}: {explanation}")
    print(f"  Examples: {', '.join(entities)}")

## 6. Applications of NER <a id='applications'></a>

This section walks through three real-world applications of NER: news analysis, resume parsing, and entity-based summarization.

### Application 1: Information Extraction from News Articles

In [None]:
# Extract key information from a news article

news_article = """In a major development, OpenAI announced on December 10, 2024, that 
its latest language model has achieved unprecedented results. CEO Sam Altman stated 
that the company invested over $100 million in research and development. The 
announcement was made at the company's headquarters in San Francisco, California. 
Shares of Microsoft, a major investor in OpenAI, rose by 5.6% following the news. 
The technology is expected to be integrated into products by March 2025."""

doc = nlp(news_article)

print("News Article Analysis:\n")
print("="*80)
print(f"Article: {news_article[:100]}...\n")
print("-"*80)

# Extract structured information
info = {
    'Organizations': [],
    'People': [],
    'Locations': [],
    'Dates': [],
    'Money': [],
    'Percentages': []
}

for ent in doc.ents:
    if ent.label_ == 'ORG':
        info['Organizations'].append(ent.text)
    elif ent.label_ == 'PERSON':
        info['People'].append(ent.text)
    elif ent.label_ in ['GPE', 'LOC']:
        info['Locations'].append(ent.text)
    elif ent.label_ == 'DATE':
        info['Dates'].append(ent.text)
    elif ent.label_ == 'MONEY':
        info['Money'].append(ent.text)
    elif ent.label_ == 'PERCENT':
        info['Percentages'].append(ent.text)

# Display extracted information
print("\nExtracted Information:\n")
for category, items in info.items():
    if items:
        print(f"{category}:")
        for item in items:
            print(f"  - {item}")
        print()
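
The `if`/`elif` chain above can also be expressed as a lookup table, which keeps the routing rules in one place. This is a sketch of that design choice, not a change to the cell above. The names `LABEL_TO_CATEGORY` and `group_entities` are assumptions, and the input pairs are hand-written stand-ins for `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Same label-to-category routing as the if/elif chain, expressed as data
LABEL_TO_CATEGORY = {
    "ORG": "Organizations",
    "PERSON": "People",
    "GPE": "Locations",
    "LOC": "Locations",
    "DATE": "Dates",
    "MONEY": "Money",
    "PERCENT": "Percentages",
}

def group_entities(pairs):
    """Group (text, label) pairs by category, skipping unmapped labels."""
    grouped = defaultdict(list)
    for text, label in pairs:
        category = LABEL_TO_CATEGORY.get(label)
        if category is not None:
            grouped[category].append(text)
    return dict(grouped)

# Hand-written pairs for illustration; ORDINAL has no category and is dropped
pairs = [("OpenAI", "ORG"), ("Sam Altman", "PERSON"), ("San Francisco", "GPE"),
         ("5.6%", "PERCENT"), ("first", "ORDINAL")]
print(group_entities(pairs))
```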

### Application 2: Resume/CV Parsing

In [None]:
# Extract information from a resume

resume = """John Smith
Email: john.smith@email.com
Phone: +1-555-123-4567

EXPERIENCE:
Senior Data Scientist at Google (2020-2024)
- Led machine learning projects in Mountain View, California
- Managed a team of 5 engineers
- Increased model accuracy by 23%

Data Analyst at Microsoft (2018-2020)
- Analyzed user data in Redmond, Washington
- Worked with Python, SQL, and TensorFlow

EDUCATION:
Master of Science in Computer Science
Stanford University (2016-2018)
GPA: 3.9/4.0
"""

doc = nlp(resume)

print("Resume Parsing:\n")
print("="*80)

# Extract key entities
resume_info = {
    'Name': [],
    'Organizations': [],
    'Locations': [],
    'Dates': [],
    'Skills': [],  # Note: Would need custom NER for skills
}

for ent in doc.ents:
    if ent.label_ == 'PERSON' and len(ent.text.split()) >= 2:  # Full names
        resume_info['Name'].append(ent.text)
    elif ent.label_ == 'ORG':
        resume_info['Organizations'].append(ent.text)
    elif ent.label_ == 'GPE':
        resume_info['Locations'].append(ent.text)
    elif ent.label_ == 'DATE':
        resume_info['Dates'].append(ent.text)

# Display parsed information
for category, items in resume_info.items():
    if items:
        unique_items = list(dict.fromkeys(items))  # Remove duplicates, keeping first-seen order
        print(f"\n{category}:")
        for item in unique_items:
            print(f"  • {item}")
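
A pretrained model has no label for emails, phone numbers, or skills, so the `Skills` list above stays empty. Short of training a custom NER model, a rule-based pass can cover these fields. The sketch below uses regular expressions plus a small skill list. The patterns are deliberately simple, and `KNOWN_SKILLS` is a hypothetical vocabulary, not something spaCy provides.

```python
import re

# Deliberately simple patterns for illustration; real formats vary more
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# At least 10 characters, so year ranges like 2020-2024 are not matched
PHONE_RE = re.compile(r"\+?\d[\d ().-]{8,}\d")

# Hypothetical skill list (a "gazetteer"); normally a curated vocabulary
KNOWN_SKILLS = ["Python", "SQL", "TensorFlow", "Java"]

def extract_contact_and_skills(text):
    """Find emails, phone numbers and known skills with rules instead of a model."""
    skills = [skill for skill in KNOWN_SKILLS
              if re.search(rf"\b{re.escape(skill)}\b", text)]
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "skills": skills,
    }

sample = """Email: john.smith@email.com
Phone: +1-555-123-4567
- Worked with Python, SQL, and TensorFlow"""

print(extract_contact_and_skills(sample))
```

Rules like these tend to be precise but brittle, so they work best as a complement to the model's entities rather than a replacement.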

### Application 3: Entity-Based Text Summarization

In [None]:
# Create entity-based summary

def entity_summary(text):
    """
    Return the ten most frequently mentioned entities and a text-to-label lookup.
    """
    doc = nlp(text)
    
    # Count entity mentions
    entity_counts = Counter()
    entity_types = {}
    
    for ent in doc.ents:
        entity_counts[ent.text] += 1
        entity_types[ent.text] = ent.label_
    
    # Get most common entities
    top_entities = entity_counts.most_common(10)
    
    return top_entities, entity_types

# Example article
article = """Tesla CEO Elon Musk announced that Tesla will open a new manufacturing 
facility in Berlin, Germany. The facility will produce electric vehicles and batteries. 
Musk stated that Tesla has invested $5 billion in the project. The Berlin factory 
is expected to create 10,000 jobs. Tesla already operates factories in Fremont, 
California and Shanghai, China. The company aims to produce 500,000 vehicles 
annually at the Berlin facility by 2025."""

top_entities, entity_types = entity_summary(article)

print("Entity-Based Summary:\n")
print("="*80)
print(f"Article: {article[:100]}...\n")
print("-"*80)
print("\nKey Entities (by frequency):\n")
print(f"{'Entity':<25} {'Type':<15} {'Mentions'}")
print("-"*50)

for entity, count in top_entities:
    etype = entity_types[entity]
    print(f"{entity:<25} {etype:<15} {count}")

# Build the summary line from the top-ranked entities rather than hard-coding it
top_names = [entity for entity, _ in top_entities[:3]]
print(f"\nSummary: This article primarily discusses {', '.join(top_names)}.")
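
Counting by `ent.text` treats "Elon Musk" and "Musk" as two different entities, which can understate how often a person is mentioned. The sketch below folds a one-word mention into a longer name when exactly one full name ends with that word. It is a heuristic under that stated assumption, not coreference resolution, and `merge_partial_names` is a hypothetical helper.

```python
from collections import Counter

def merge_partial_names(mentions):
    """Fold one-word mentions into the unique longer name ending with that word."""
    full_names = [m for m in set(mentions) if " " in m]
    merged = Counter()
    for mention in mentions:
        # Candidate full names whose last token equals this mention
        matches = [name for name in full_names if name.split()[-1] == mention]
        # Merge only when the match is unambiguous
        target = matches[0] if len(matches) == 1 else mention
        merged[target] += 1
    return merged

# Hand-written PERSON mentions, as an article like the one above might yield
mentions = ["Elon Musk", "Musk", "Musk"]
print(merge_partial_names(mentions))  # Counter({'Elon Musk': 3})
```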

## 7. Advanced Topics <a id='advanced'></a>

### Entity Linking and Disambiguation

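The same surface form can refer to different things: "Apple" may be a company or a fruit, and "Washington" may be a person or a place. An NER model resolves the *type* from context. Deciding *which* real-world entity is meant is the separate task of entity linking.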

In [None]:
# Example of entity ambiguity

ambiguous_texts = [
    "I ate an apple for breakfast.",  # apple = fruit
    "Apple released a new iPhone today.",  # Apple = company
    "Washington led the troops to victory.",  # Washington = person
    "I visited Washington last summer.",  # Washington = place
]

print("Entity Disambiguation Examples:\n")
print("="*80)

for text in ambiguous_texts:
    doc = nlp(text)
    
    print(f"\nText: '{text}'")
    
    entities_found = [(ent.text, ent.label_) for ent in doc.ents]
    
    if entities_found:
        print("Entities:")
        for ent_text, ent_label in entities_found:
            print(f"  - '{ent_text}' → {ent_label}")
    else:
        print("  No entities detected (likely common noun)")
    
    print("-"*80)
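
NER assigns a type but does not say which real-world entity a mention refers to. A toy version of entity linking picks the knowledge-base candidate whose context words overlap most with the sentence. Everything in this sketch is an assumption made for illustration: the `KNOWLEDGE_BASE` contents, the identifiers, and the word-overlap scoring. Production linkers use much larger knowledge bases and learned scoring.

```python
# Hypothetical mini knowledge base: candidate entries per surface form,
# each with a few context words that hint at that reading.
KNOWLEDGE_BASE = {
    "Washington": [
        {"id": "George_Washington", "context": {"troops", "general", "president", "led"}},
        {"id": "Washington_State", "context": {"visited", "seattle", "state", "summer"}},
    ],
}

def link_entity(mention, sentence):
    """Pick the candidate whose context words overlap most with the sentence."""
    candidates = KNOWLEDGE_BASE.get(mention, [])
    if not candidates:
        return None
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    # On a tie, max() keeps the first candidate listed
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]

print(link_entity("Washington", "Washington led the troops to victory."))  # George_Washington
print(link_entity("Washington", "I visited Washington last summer."))      # Washington_State
```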

### Nested Entities

In [None]:
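# spaCy returns flat, non-overlapping spans, so this sentence yields
# two adjacent entities rather than a nested one

text = "The CEO of Apple Inc., Tim Cook, announced the new product."

doc = nlp(text)

print("Nested Entities Example:\n")
print("="*80)
print(f"Text: '{text}'\n")
print("Entities Found:")

for ent in doc.ents:
    print(f"  '{ent.text}' → {ent.label_} (chars {ent.start_char}-{ent.end_char})")

print("\nNote: 'Apple Inc.' (organization) and 'Tim Cook' (person) are adjacent, not nested.")
print("A nested entity is one whose span lies inside another entity's span;")
print("doc.ents cannot represent that because its spans never overlap.")
print("Linking Tim Cook to Apple Inc. as its CEO is relation extraction, a separate task.")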
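
A nested entity is one whose span sits entirely inside another entity's span. The sketch below represents entities as `(start, end, label, text)` tuples and reports such containment. The spans are annotated by hand, and `find_nested` and `span` are hypothetical helpers.

```python
def find_nested(spans):
    """Return (inner, outer) pairs where one span lies fully inside another."""
    nested = []
    for inner in spans:
        for outer in spans:
            if inner is outer:
                continue
            contained = outer[0] <= inner[0] and inner[1] <= outer[1]
            same_bounds = (inner[0], inner[1]) == (outer[0], outer[1])
            if contained and not same_bounds:
                nested.append((inner, outer))
    return nested

text = "She studied at the University of California."

def span(phrase, label):
    """Build a (start, end, label, text) tuple from the first match of phrase."""
    start = text.find(phrase)
    return (start, start + len(phrase), label, phrase)

# Hand-annotated spans; a flat tagger would return at most one of these two
spans = [span("University of California", "ORG"), span("California", "GPE")]
for inner, outer in find_nested(spans):
    print(f"'{inner[3]}' ({inner[2]}) is nested inside '{outer[3]}' ({outer[2]})")
# 'California' (GPE) is nested inside 'University of California' (ORG)
```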

## 8. Real-World Examples <a id='real-world'></a>

In [None]:
# Analyze a collection of news articles

articles = [
    """Amazon CEO Jeff Bezos announced a $2 billion investment in climate change 
    initiatives. The program will focus on renewable energy projects in Seattle.""",
    
    """Microsoft reported quarterly earnings of $51.7 billion, up 18% from last year. 
    CEO Satya Nadella credited cloud computing growth in the Q4 2024 results.""",
    
    """Tesla stock rose 7.2% after Elon Musk announced plans to build a factory in 
    Texas. The $1.1 billion facility will create 5,000 jobs by December 2025."""
]

print("News Article Collection Analysis:\n")
print("="*80)

all_entities = []
entity_by_article = {}

for i, article in enumerate(articles, 1):
    doc = nlp(article)
    
    article_entities = {}
    for ent in doc.ents:
        if ent.label_ not in article_entities:
            article_entities[ent.label_] = []
        article_entities[ent.label_].append(ent.text)
        all_entities.append((ent.text, ent.label_))
    
    entity_by_article[f"Article {i}"] = article_entities
    
    print(f"\nArticle {i}:")
    print(f"Text: {article[:80]}...")
    print("Entities:")
    for etype, entities in article_entities.items():
        print(f"  {etype}: {', '.join(entities)}")
    print("-"*80)

# Overall statistics
entity_counter = Counter([ent[1] for ent in all_entities])

print("\nOverall Entity Type Distribution:")
for etype, count in entity_counter.most_common():
    print(f"  {etype}: {count}")

In [None]:
# Visualize entity distribution

# Create DataFrame
df_entities = pd.DataFrame(entity_counter.most_common(), 
                           columns=['Entity Type', 'Count'])

# Plot
plt.figure(figsize=(12, 6))
plt.bar(df_entities['Entity Type'], df_entities['Count'], 
        color='steelblue', alpha=0.8)
plt.xlabel('Entity Type', fontsize=12, fontweight='bold')
plt.ylabel('Frequency', fontsize=12, fontweight='bold')
plt.title('Named Entity Distribution in News Articles', 
          fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nEntity Type Distribution:")
print(df_entities.to_string(index=False))

In [None]:
# Create a comprehensive NER analysis function

def analyze_text_entities(text, verbose=True):
    """
    Comprehensive named entity analysis.
    
    Args:
        text (str): Input text
        verbose (bool): Print detailed output
    
    Returns:
        dict: Analysis results
    """
    doc = nlp(text)
    
    # Collect all entities
    entities = [(ent.text, ent.label_, ent.start_char, ent.end_char) 
                for ent in doc.ents]
    
    # Group by type
    by_type = {}
    for ent_text, ent_label, _, _ in entities:
        if ent_label not in by_type:
            by_type[ent_label] = []
        by_type[ent_label].append(ent_text)
    
    # Count frequencies
    entity_freq = Counter([e[0] for e in entities])
    type_freq = Counter([e[1] for e in entities])
    
    # Calculate statistics
    stats = {
        'total_entities': len(entities),
        'unique_entities': len(set([e[0] for e in entities])),
        'entity_types': len(type_freq),
        'entities_by_type': by_type,
        'most_common_entity': entity_freq.most_common(1)[0] if entity_freq else None,
        'most_common_type': type_freq.most_common(1)[0] if type_freq else None,
    }
    
    if verbose:
        print("\nNER Analysis Results:")
        print("="*60)
        print(f"Total entities found: {stats['total_entities']}")
        print(f"Unique entities: {stats['unique_entities']}")
        print(f"Entity types: {stats['entity_types']}")
        
        if stats['most_common_entity']:
            entity, count = stats['most_common_entity']
            print(f"Most mentioned: '{entity}' ({count} times)")
        
        print("\nEntities by type:")
        for etype, entities_list in sorted(by_type.items()):
            print(f"  {etype}: {len(entities_list)} ({', '.join(set(entities_list))})")
    
    return stats

# Test the function
test_text = """On January 10, 2024, Microsoft CEO Satya Nadella met with 
OpenAI's Sam Altman in San Francisco to discuss a $10 billion partnership. 
The deal represents the largest AI investment to date. Shares of Microsoft 
rose 4.5% while the NASDAQ gained 1.2%."""

print("Testing Comprehensive NER Analysis:")
print("="*80)
print(f"Text: {test_text}\n")

results = analyze_text_entities(test_text)

## Summary

In this notebook, we covered:

✅ **Introduction to NER**: What named entities are and why they matter  
✅ **NER with NLTK**: Basic entity recognition using `ne_chunk()`  
✅ **NER with spaCy**: More accurate and comprehensive entity recognition  
✅ **Entity Types**: PERSON, ORG, GPE, MONEY, DATE, TIME, etc.  
✅ **Applications**: News analysis, resume parsing, text summarization  
✅ **Advanced Topics**: Entity disambiguation, nested entities  
✅ **Real-World Examples**: Analyzing collections of documents

### Key Takeaways:

1. **NER is essential** for extracting structured information from unstructured text
2. **spaCy is usually the stronger choice** over NLTK for production NER applications
   - Generally more accurate with its pretrained models
   - Recognizes more entity types (e.g. MONEY, DATE, PERCENT)
   - Better handling of complex entities
3. **Entity types vary** by domain and use case
4. **Challenges remain**:
   - Entity disambiguation ("Apple" = company vs. fruit)
   - Nested entities
   - Domain-specific entities
   - New/emerging entities
5. **Applications are diverse**:
   - Information extraction
   - Question answering
   - Content recommendation
   - Knowledge graph construction

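### Common NER Patterns:

These surface patterns are useful rules of thumb. Statistical models also rely on context, because many entities do not follow them.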

- **Person**: [First Name] [Last Name]
- **Organization**: [Company Name] [Inc./Corp./Ltd.]
- **Location**: [City], [State/Province]
- **Date**: [Month] [Day], [Year]
- **Money**: [$€£] [Number] [million/billion]
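
Two of the patterns above, dates and money, can be approximated with regular expressions. The sketch below covers only the exact shapes listed (English month names, a currency symbol before the number). The patterns `DATE_RE` and `MONEY_RE` are assumptions written for this example and will miss many real-world variants.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Date: [Month] [Day], [Year]
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
# Money: [$€£] [Number] [million/billion]
MONEY_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")

sample = ("On January 15, 2024, Tesla said it would invest $5 billion; "
          "revenue was €2.3 billion.")

print(DATE_RE.findall(sample))   # ['January 15, 2024']
print(MONEY_RE.findall(sample))  # ['$5 billion', '€2.3 billion']
```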

---

**Next Notebook**: `05_Word_Embeddings.ipynb` - Introduction to word embeddings and their benefits

---