# 2. Named Entity Recognition (NER)

**Estimated Time**: ~2 hours

**Prerequisites**: Notebook 1 (Fill-Mask) - understanding of tokenization, pipelines, and confidence scores

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Define** named entity recognition and list common entity types (PER, ORG, LOC, MISC)
2. **Use** the `ner` pipeline with `grouped_entities=True` to extract entities from text
3. **Understand** the BIO tagging scheme (Beginning, Inside, Outside)
4. **Handle** entity aggregation and incorrectly split entities
5. **Evaluate** NER output quality on different text types

## Setup

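Run this cell first. The NER checkpoints used in this notebook are downloaded on first use (the default model is BERT-large, so expect a download of over 1 GB); anything cached from Notebook 1 is reused only if the same checkpoint is loaded.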

In [None]:
# Core imports
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import torch
from collections import Counter, defaultdict

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")

---

# Part 1: Conceptual Foundation

## What is Named Entity Recognition?

**In plain English**: NER is the task of finding and categorizing "important things" in text - like people's names, company names, places, and dates.

**Technical definition**: NER is a sequence labeling task where each token in a text is assigned a label indicating whether it's part of a named entity and what type of entity it is.

### Visual Example

```
Input:  "Elon Musk founded SpaceX in California."
         ────┬────         ──┬───    ────┬─────
            PER             ORG         LOC
         (Person)          (Org)     (Location)

Output: [
    {"entity": "PER", "word": "Elon Musk"},
    {"entity": "ORG", "word": "SpaceX"},
    {"entity": "LOC", "word": "California"}
]
```

### Common Entity Types

Most NER models recognize these standard entity types:

| Type | Full Name | Examples |
|------|-----------|----------|
| **PER** | Person | Elon Musk, Marie Curie, John Smith |
| **ORG** | Organization | Google, United Nations, Harvard University |
| **LOC** | Location | Paris, Mount Everest, Pacific Ocean |
| **MISC** | Miscellaneous | English (language), Nobel Prize, COVID-19 |

Some specialized models include additional types:
- **DATE**: January 2024, next Monday
- **TIME**: 3:00 PM, noon
- **MONEY**: $500, 50 euros
- **PERCENT**: 25%, half

### How NER Models Work: Token Classification

Remember from Notebook 1 how we tokenized text? NER builds on that:

```
Text:    "Elon Musk works at SpaceX"
Tokens:  ["Elon", "Musk", "works", "at", "SpaceX"]
Labels:  [B-PER,  I-PER,   O,      O,    B-ORG  ]
```

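Instead of predicting masked words (fill-mask), the model predicts a **label for each token**. The example shows whole words for simplicity; BERT's WordPiece tokenizer can split rarer words into subword pieces (marked with `##`), and each piece receives its own label.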

### The BIO Tagging Scheme

NER uses a special tagging scheme to handle multi-word entities:

| Tag | Meaning | Example |
|-----|---------|--------|
| **B-XXX** | Beginning of entity type XXX | "Elon" → B-PER |
| **I-XXX** | Inside (continuation) of entity | "Musk" → I-PER |
| **O** | Outside any entity | "works", "at" → O |

This allows the model to handle:
- Multi-word entities: "New York City" → B-LOC, I-LOC, I-LOC
- Adjacent entities: "Google Microsoft" → B-ORG, B-ORG (two separate orgs)
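
To make the scheme concrete, here is a minimal sketch of how a BIO tag sequence can be decoded back into entities. The function name `decode_bio` and the word-level tokens are illustrative assumptions, not part of the `transformers` API; real pipelines also have to handle subword tokens.

```python
def decode_bio(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) tuples using BIO rules."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built (if any) and reset the state.
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, etype = tag.partition("-")
        # A B- tag always starts a new entity; an I- tag continues the current
        # one only if the type matches (otherwise it is treated as a new entity).
        if prefix == "B" or etype != current_type:
            flush()
            current_type = etype
        current_tokens.append(token)
    flush()
    return entities


print(decode_bio(["Elon", "Musk", "works", "at", "SpaceX"],
                 ["B-PER", "I-PER", "O", "O", "B-ORG"]))
# → [('PER', 'Elon Musk'), ('ORG', 'SpaceX')]
print(decode_bio(["Google", "Microsoft"], ["B-ORG", "B-ORG"]))
# → [('ORG', 'Google'), ('ORG', 'Microsoft')]
```

The second call shows why the B- prefix matters: without it, two adjacent organizations would be indistinguishable from one two-word organization.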

### Connection to Notebook 1: Same Architecture, Different Task

NER models use the **same BERT architecture** from Notebook 1, but with a different "head":

```
Fill-Mask (Notebook 1):         NER (This Notebook):
┌──────────────────┐            ┌──────────────────┐
│   BERT Encoder   │            │   BERT Encoder   │
│  (same weights)  │            │  (same weights)  │
└────────┬─────────┘            └────────┬─────────┘
         │                               │
         ▼                               ▼
┌──────────────────┐            ┌──────────────────┐
│  MLM Head        │            │  Token Class Head│
│  (predict word)  │            │  (predict label) │
└──────────────────┘            └──────────────────┘
```

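The pre-trained BERT knowledge transfers to help NER: the encoder starts from the pre-trained weights, and fine-tuning on labeled entity data trains the new head while further adjusting the encoder.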
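
As a rough illustration of what a token classification head computes, the sketch below applies one linear layer to each token's encoder output and picks the highest-scoring label. It is pure Python with made-up 2-dimensional vectors and weights; the names `token_classification_head`, `toy_hidden`, and `toy_weights` are hypothetical, and a real BERT head works on much larger hidden vectors.

```python
def token_classification_head(hidden_states, weights, bias):
    """Apply the same linear layer to every token vector; return per-token logits."""
    return [
        [sum(h * w for h, w in zip(vec, row)) + b for row, b in zip(weights, bias)]
        for vec in hidden_states
    ]


# Toy setup: 2-dimensional "encoder outputs" and 3 labels (all values made up).
head_labels = ["O", "PER", "ORG"]
head_tokens = ["Elon", "works", "SpaceX"]
toy_hidden = [[2.0, 0.1], [0.1, 0.1], [0.2, 3.0]]
toy_weights = [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]  # one row per label
toy_bias = [1.0, 0.0, 0.0]

toy_logits = token_classification_head(toy_hidden, toy_weights, toy_bias)
predicted = [head_labels[row.index(max(row))] for row in toy_logits]
print(list(zip(head_tokens, predicted)))
# → [('Elon', 'PER'), ('works', 'O'), ('SpaceX', 'ORG')]
```

The key point is that the head is small and applied independently at every position; the contextual understanding comes from the encoder underneath it.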

### Real-World Applications

NER powers many practical applications:

- **Information Extraction**: Pull structured data from unstructured text
- **Search Engines**: Understand queries like "restaurants near Eiffel Tower"
- **Customer Support**: Identify products, order numbers, and customer names
- **News Analysis**: Track mentions of companies, politicians, locations
- **Privacy/Redaction**: Find and mask PII (personally identifiable information)
- **Knowledge Graphs**: Build connections between entities

### Key Terminology

| Term | Definition |
|------|------------|
| **Named Entity** | A real-world object with a proper name (person, place, organization) |
| **Token Classification** | Assigning a label to each token in a sequence |
| **BIO Tagging** | Begin-Inside-Outside scheme for marking entity boundaries |
| **Entity Span** | The character positions where an entity starts and ends |
| **Entity Aggregation** | Combining B-/I- tags into complete entity strings |

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end):

1. What does NER stand for?
   - A) Neural Entity Recognition
   - B) Named Entity Recognition
   - C) Natural Entity Resolution

2. In BIO tagging, what does the "I" prefix mean?
   - A) Initial token of an entity
   - B) Inside (continuation) of an entity
   - C) Identifier for the entity

3. Which entity type would "Harvard University" be?
   - A) PER (Person)
   - B) LOC (Location)
   - C) ORG (Organization)

4. How does NER relate to the BERT model from Notebook 1?
   - A) They're completely different architectures
   - B) NER uses BERT with a different output head
   - C) NER doesn't use neural networks

---

# Part 2: Basic Implementation

## Your First NER Pipeline

Let's create an NER pipeline and extract entities from text:

In [None]:
# Create an NER pipeline
# The default model is dbmdz/bert-large-cased-finetuned-conll03-english
ner = pipeline("ner", grouped_entities=True)

# Extract entities from a sentence
text = "Elon Musk founded SpaceX in California and later acquired Twitter."
entities = ner(text)

print(f"Text: '{text}'\n")
print("Extracted entities:")
for entity in entities:
    print(f"  {entity['word']:20} → {entity['entity_group']:5} ({entity['score']:.2%})")

### Understanding the Output

Each entity in the result contains:

- `word`: The extracted entity text
- `entity_group`: The entity type (PER, ORG, LOC, MISC)
- `score`: Confidence score (like in Notebook 1!)
- `start`: Character position where entity begins
- `end`: Character position where entity ends

Let's examine an entity in detail:

In [None]:
# Examine the first entity in detail
first_entity = entities[0]

print("Detailed view of first entity:")
for key, value in first_entity.items():
    if key == 'score':
        print(f"  {key:12}: {value:.4f} ({value:.2%})")
    else:
        print(f"  {key:12}: {value}")

# Verify the span is correct
print(f"\nVerification: text[{first_entity['start']}:{first_entity['end']}] = '{text[first_entity['start']:first_entity['end']]}'")
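
The `start` and `end` offsets are what make NER output usable for tasks such as redaction. Below is a small sketch, assuming entity dicts shaped like the pipeline output described above; the `redact` helper and the hand-written sample entities are illustrative, not library code.

```python
def redact(text, entities, types=("PER",)):
    """Replace entity spans of the given types with [TYPE] placeholders."""
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in types:
            text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written entities in the same shape the pipeline returns (scores omitted).
sample_text = "Elon Musk founded SpaceX in California."
sample_entities = [
    {"entity_group": "PER", "word": "Elon Musk", "start": 0, "end": 9},
    {"entity_group": "ORG", "word": "SpaceX", "start": 18, "end": 24},
    {"entity_group": "LOC", "word": "California", "start": 28, "end": 38},
]
print(redact(sample_text, sample_entities))
# → [PER] founded SpaceX in California.
print(redact(sample_text, sample_entities, types=("PER", "ORG", "LOC")))
# → [PER] founded [ORG] in [LOC].
```

Slicing by offsets is more reliable than searching for `word` in the text, because the same string can appear more than once.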

### The Importance of `grouped_entities=True`

Without this parameter, you get raw BIO tags - one label per token. Let's compare:

In [None]:
# Without grouped_entities
ner_raw = pipeline("ner", grouped_entities=False)

text = "Elon Musk works at SpaceX."
raw_entities = ner_raw(text)

print("WITHOUT grouped_entities (raw BIO tags):")
print(f"Text: '{text}'\n")
for ent in raw_entities:
    print(f"  '{ent['word']:12}' → {ent['entity']:10} ({ent['score']:.2%})")

print("\n" + "="*50 + "\n")

# With grouped_entities
grouped_entities = ner(text)

print("WITH grouped_entities (aggregated):")
for ent in grouped_entities:
    print(f"  '{ent['word']:12}' → {ent['entity_group']:5} ({ent['score']:.2%})")

Notice:
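- Raw output: "Elon" and "Musk" are separate tokens, each with its own tag (and rarer words may be split further into subword pieces). Depending on the checkpoint, the first token of an entity may be tagged I-PER rather than B-PER, because some CoNLL-2003 models use B- only to separate adjacent entities of the same type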
- Grouped output: "Elon Musk" is combined into one PER entity

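For most applications, `grouped_entities=True` is what you want! Note that recent `transformers` releases deprecate this flag in favor of `aggregation_strategy="simple"`, which produces the same grouping.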
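
To see roughly what grouping involves, here is a simplified sketch that merges token-level predictions into entities using their tags and character offsets. It is not the library's actual implementation: `group_tokens` is a hypothetical helper, the raw tokens are hand-written (the subword split of "SpaceX" is assumed), and adjacency is approximated by a gap of at most one character.

```python
def group_tokens(raw_tokens, text):
    """Merge consecutive token-level predictions into entity-level groups."""
    groups = []
    for tok in raw_tokens:
        prefix, _, etype = tok["entity"].partition("-")
        last = groups[-1] if groups else None
        # Continue the previous group when the type matches, the tag is not a
        # fresh B-, and the token directly follows it (at most one space gap).
        if (last and last["entity_group"] == etype and prefix != "B"
                and tok["start"] - last["end"] <= 1):
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            groups.append({"entity_group": etype, "start": tok["start"],
                           "end": tok["end"], "scores": [tok["score"]]})
    # Recover the surface text from the offsets and average the token scores.
    return [
        {"entity_group": g["entity_group"],
         "word": text[g["start"]:g["end"]],
         "score": sum(g["scores"]) / len(g["scores"]),
         "start": g["start"], "end": g["end"]}
        for g in groups
    ]


raw_text = "Elon Musk works at SpaceX."
raw_example = [  # hand-written, in the shape of ungrouped pipeline output
    {"entity": "B-PER", "word": "Elon", "score": 0.99, "start": 0, "end": 4},
    {"entity": "I-PER", "word": "Musk", "score": 0.97, "start": 5, "end": 9},
    {"entity": "B-ORG", "word": "Space", "score": 0.95, "start": 19, "end": 24},
    {"entity": "I-ORG", "word": "##X", "score": 0.91, "start": 24, "end": 25},
]
grouped_example = group_tokens(raw_example, raw_text)
for g in grouped_example:
    print(g["entity_group"], repr(g["word"]), g["start"], g["end"])
```

Taking `word` from the original text by offset sidesteps the need to strip `##` markers from subword pieces.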

### Processing Different Types of Text

Let's see how NER performs on various text types:

In [None]:
# Test on different text types
test_texts = [
    # News headline
    "Apple CEO Tim Cook announced new iPhone features in Cupertino.",
    
    # Historical text
    "In 1969, Neil Armstrong became the first person to walk on the Moon.",
    
    # Sports news
    "The Los Angeles Lakers defeated the Boston Celtics at Madison Square Garden.",
    
    # Scientific text
    "Dr. Jane Goodall studied chimpanzees in Gombe Stream National Park in Tanzania.",
]

for text in test_texts:
    entities = ner(text)
    print(f"TEXT: '{text}'")
    if entities:
        for ent in entities:
            print(f"  → {ent['word']:25} [{ent['entity_group']:4}] ({ent['score']:.0%})")
    else:
        print("  → No entities found")
    print()
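
Eyeballing output is a good start, but comparing predictions against a small hand-labeled set gives a number you can track across text types. The sketch below computes exact-match precision, recall, and F1 over `(type, text)` pairs. The `entity_prf` helper and both example sets are hypothetical, and a fuller evaluation would also compare character positions and count repeated mentions.

```python
def entity_prf(predicted, gold):
    """Exact-match precision, recall and F1 over (type, text) pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Hand-labeled "gold" entities vs. a made-up prediction that misses one entity.
gold_example = [("ORG", "Apple"), ("PER", "Tim Cook"), ("LOC", "Cupertino"), ("MISC", "iPhone")]
predicted_example = [("ORG", "Apple"), ("PER", "Tim Cook"), ("LOC", "Cupertino")]

p, r, f1 = entity_prf(predicted_example, gold_example)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

To score real output, the predictions could be built as `[(e['entity_group'], e['word']) for e in ner(text)]`.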

---

## Exercise 1: News Headline Entity Extraction (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Extract entities from news headlines and categorize them by type.

### Step 1: Run NER on these headlines

In [None]:
# Sample news headlines
headlines = [
    "Microsoft acquires Activision Blizzard for $69 billion",
    "President Biden visits Tokyo for G7 summit",
    "Tesla opens new Gigafactory in Berlin, Germany",
    "NASA's Perseverance rover discovers water on Mars",
    "Amazon founder Jeff Bezos announces Blue Origin mission"
]

# Process each headline
all_entities = []
for headline in headlines:
    entities = ner(headline)
    print(f"Headline: '{headline}'")
    for ent in entities:
        print(f"  {ent['entity_group']:4}: {ent['word']}")
        all_entities.append(ent)
    print()

### Step 2: Count entities by type

In [None]:
# Count entities by type
entity_counts = Counter(ent['entity_group'] for ent in all_entities)

print("Entity type distribution:")
for entity_type, count in entity_counts.most_common():
    bar = '█' * count
    print(f"  {entity_type:5}: {count:2} {bar}")

### Step 3: Try your own headlines

Add 3 of your own news headlines and run NER on them:

In [None]:
# YOUR CODE HERE
# Add your own headlines
my_headlines = [
    # Replace with your own headlines
    "Your headline 1 here",
    "Your headline 2 here",
    "Your headline 3 here",
]

for headline in my_headlines:
    entities = ner(headline)
    print(f"Headline: '{headline}'")
    for ent in entities:
        print(f"  {ent['entity_group']:4}: {ent['word']} ({ent['score']:.0%})")
    print()

---

# Part 3: Intermediate Exploration

## Handling Edge Cases and Errors

NER isn't perfect. Let's explore common issues:

In [None]:
# Edge case 1: Ambiguous names (same name, different entity types)
ambiguous_cases = [
    "I love Apple products.",                    # Apple = ORG (company)
    "I ate an apple for breakfast.",             # apple = not an entity
    "Jordan is a great basketball player.",      # Jordan = PER (Michael Jordan)
    "Jordan is a country in the Middle East.",   # Jordan = LOC (country)
]

print("AMBIGUOUS NAMES:")
print("="*50)
for text in ambiguous_cases:
    entities = ner(text)
    print(f"'{text}'")
    if entities:
        for ent in entities:
            print(f"  → {ent['word']}: {ent['entity_group']} ({ent['score']:.0%})")
    else:
        print("  → No entities detected")
    print()

In [None]:
# Edge case 2: Multi-word entities that get split incorrectly
complex_entities = [
    "The United States of America signed the treaty.",
    "The University of California, Berkeley is renowned.",
    "Dr. Martin Luther King Jr. gave a famous speech.",
    "The New York Stock Exchange opened higher today.",
]

print("COMPLEX MULTI-WORD ENTITIES:")
print("="*50)
for text in complex_entities:
    entities = ner(text)
    print(f"'{text}'")
    for ent in entities:
        # Check if entity looks complete
        print(f"  → {ent['word']:35} [{ent['entity_group']:4}] ({ent['score']:.0%})")
    print()

### Confidence Score Analysis

Like in Notebook 1, confidence scores tell us how certain the model is. Low confidence often indicates potential errors:

In [None]:
# Analyze confidence scores across a longer text
long_text = """
Mark Zuckerberg, CEO of Meta, announced new AI features at their headquarters in Menlo Park.
The company, formerly known as Facebook, is competing with Google, Microsoft, and OpenAI.
Zuckerberg mentioned that Meta's Chief AI Scientist, Yann LeCun, has been instrumental in their research.
"""

entities = ner(long_text)

print("Entity confidence analysis:")
print("="*60)

# Sort by confidence
for ent in sorted(entities, key=lambda x: x['score'], reverse=True):
    confidence_bar = '█' * int(ent['score'] * 20)
    confidence_indicator = '✓' if ent['score'] > 0.9 else ('?' if ent['score'] > 0.7 else '⚠')
    print(f"{confidence_indicator} {ent['word']:20} [{ent['entity_group']:4}] {ent['score']:5.1%} {confidence_bar}")
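
One way to act on these scores is to sort predictions into buckets: accept, send for human review, or drop. The sketch below reuses the 0.9 and 0.7 cut-offs from the cell above; the `triage_by_confidence` helper and the sample scores are made up for illustration, and suitable thresholds depend on your data.

```python
def triage_by_confidence(entities, accept=0.9, review=0.7):
    """Split entities into accept / review / reject buckets by score."""
    buckets = {"accept": [], "review": [], "reject": []}
    for ent in entities:
        if ent["score"] > accept:
            buckets["accept"].append(ent)
        elif ent["score"] > review:
            buckets["review"].append(ent)
        else:
            buckets["reject"].append(ent)
    return buckets


# Hand-written entities with made-up scores.
sample_scored = [
    {"word": "Meta", "entity_group": "ORG", "score": 0.98},
    {"word": "Menlo Park", "entity_group": "LOC", "score": 0.85},
    {"word": "AI", "entity_group": "MISC", "score": 0.55},
]
buckets = triage_by_confidence(sample_scored)
for name, ents in buckets.items():
    print(name, [e["word"] for e in ents])
```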

### Using Different NER Models

Different models have different strengths. Let's compare a few:

In [None]:
# Load a different model - DistilBERT-based NER (faster, slightly less accurate)
print("Loading DistilBERT NER model...")
ner_distil = pipeline("ner", 
                       model="elastic/distilbert-base-cased-finetuned-conll03-english",
                       grouped_entities=True)
print("Model loaded!\n")

In [None]:
# Compare models on the same text
comparison_text = "Steve Jobs founded Apple in Cupertino with Steve Wozniak."

print(f"Text: '{comparison_text}'\n")

print("Default Model (BERT-large):")
for ent in ner(comparison_text):
    print(f"  {ent['word']:20} → {ent['entity_group']:4} ({ent['score']:.0%})")

print("\nDistilBERT Model:")
for ent in ner_distil(comparison_text):
    print(f"  {ent['word']:20} → {ent['entity_group']:4} ({ent['score']:.0%})")

---

## Exercise 2: Handling Split Entities (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Write a function that detects when entities might be incorrectly split and attempts to merge adjacent entities of the same type.

**Hints**:
1. Look at the `start` and `end` positions of entities
2. Adjacent entities have `end` of one close to `start` of the next
3. Consider only merging if they're the same entity type

In [None]:
# YOUR CODE HERE

def merge_adjacent_entities(entities, text, max_gap=2):
    """
    Merge adjacent entities of the same type.
    
    Args:
        entities: List of entity dicts from NER pipeline
        text: Original text
        max_gap: Maximum character gap to consider entities adjacent
    
    Returns:
        List of merged entities
    """
    if not entities:
        return entities
    
    # Sort entities by start position
    sorted_entities = sorted(entities, key=lambda x: x['start'])
    
    merged = []
    current = sorted_entities[0].copy()
    
    for next_ent in sorted_entities[1:]:
        # Check if entities should be merged
        gap = next_ent['start'] - current['end']
        same_type = next_ent['entity_group'] == current['entity_group']
        
        if same_type and gap <= max_gap:
            # Merge: extend current entity
            current['end'] = next_ent['end']
            current['word'] = text[current['start']:current['end']]
            # Average the scores
            current['score'] = (current['score'] + next_ent['score']) / 2
        else:
            # Don't merge: save current and start new
            merged.append(current)
            current = next_ent.copy()
    
    merged.append(current)
    return merged


# Test the function
test_text = "The New York Stock Exchange opened at 9:30 AM."
original_entities = ner(test_text)

print(f"Text: '{test_text}'\n")

print("Original entities:")
for ent in original_entities:
    print(f"  {ent['word']:25} [{ent['entity_group']}] (start: {ent['start']}, end: {ent['end']})")

print("\nMerged entities:")
merged_entities = merge_adjacent_entities(original_entities, test_text)
for ent in merged_entities:
    print(f"  {ent['word']:25} [{ent['entity_group']}] (start: {ent['start']}, end: {ent['end']})")

---

# Part 4: Advanced Topics

## Under the Hood: Token Classification

Let's see what the NER pipeline does internally, building on what we learned in Notebook 1:

In [None]:
# Load model and tokenizer separately
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Number of labels: {model.config.num_labels}")
print(f"Labels: {model.config.id2label}")

In [None]:
# Step-by-step NER
text = "Elon Musk founded SpaceX."

# STEP 1: Tokenization (same as Notebook 1!)
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
offsets = inputs['offset_mapping'][0].tolist()

print("STEP 1 - Tokenization:")
print(f"  Text: '{text}'")
print(f"  Tokens: {tokens}")

In [None]:
# STEP 2: Model inference
# Remove offset_mapping before passing to model
model_inputs = {k: v for k, v in inputs.items() if k != 'offset_mapping'}

with torch.no_grad():
    outputs = model(**model_inputs)

# Get predictions
predictions = torch.argmax(outputs.logits, dim=2)

print("STEP 2 - Model Inference:")
print(f"  Logits shape: {outputs.logits.shape}")
print(f"  (batch_size, sequence_length, num_labels)")

In [None]:
# STEP 3: Convert predictions to labels
print("STEP 3 - Token Labels:")
print(f"{'Token':<12} {'Label':<10} {'Offset'}")
print("-" * 35)

for token, pred_id, offset in zip(tokens, predictions[0], offsets):
    label = model.config.id2label[pred_id.item()]
    print(f"{token:<12} {label:<10} {offset}")

In [None]:
# STEP 4: Convert to entities with confidence scores
print("\nSTEP 4 - Convert to Entities with Confidence:")

# Get probabilities
probs = torch.softmax(outputs.logits, dim=2)

for token, pred_id, prob, offset in zip(tokens, predictions[0], probs[0], offsets):
    label = model.config.id2label[pred_id.item()]
    confidence = prob[pred_id].item()
    
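    # Skip O labels and special tokens ([CLS]/[SEP] have offset (0, 0)).
    # offsets came from .tolist(), so each one is a list; convert before comparing.
    if label != 'O' and tuple(offset) != (0, 0):
        print(f"{token:<12} → {label:<10} ({confidence:.2%})")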
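
The confidence values in Step 4 come from a softmax over each token's logits. Here is a dependency-free sketch of that calculation for a single token, using a hypothetical three-label set and made-up logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_labels = ["O", "B-PER", "I-PER"]   # hypothetical label set
example_logits = [0.5, 4.0, 1.0]           # made-up logits for one token
example_probs = softmax(example_logits)

# argmax picks the predicted label; its probability is the confidence score.
best = max(range(len(example_labels)), key=lambda i: example_probs[i])
print(example_labels[best], f"{example_probs[best]:.2%}")
```

Because softmax only normalizes the scores the model produced, a high confidence means the model strongly prefers one label, not that the label is guaranteed to be correct.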

### Performance Considerations

| Consideration | Recommendation |
|---------------|----------------|
| **Model size** | Use a distilled model such as DistilBERT for speed (much smaller and faster than BERT-large, usually slightly less accurate) |
| **Batch processing** | Process multiple texts at once |
| **Long texts** | Split into sentences or chunks (models have max token limits) |
| **Post-processing** | Use `grouped_entities=True` and validate outputs |
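
For the long-text recommendation, one possible approach is to split at sentence boundaries and remember each chunk's character offset so entity positions can be mapped back to the full document. This is a sketch under simplifying assumptions: `split_into_chunks` and `shift_entities` are hypothetical helpers, the sentence splitter is a naive regex, and `max_chars` is only a rough stand-in for the model's token limit.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries into (offset, chunk) pairs."""
    # Sentence boundaries: positions right after ., ! or ? plus trailing whitespace.
    boundaries = [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    chunks, start, prev = [], 0, 0
    for b in boundaries:
        # Close the current chunk if adding this sentence would exceed the budget.
        if b - start > max_chars and prev > start:
            chunks.append((start, text[start:prev]))
            start = prev
        prev = b
    if start < len(text):
        chunks.append((start, text[start:]))
    return chunks


def shift_entities(entities, offset):
    """Map chunk-relative start/end positions back to the full text."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset}
            for e in entities]


doc = "Apple is based in Cupertino. Google is based in Mountain View. Both compete on AI."
doc_chunks = split_into_chunks(doc, max_chars=40)
for offset, chunk in doc_chunks:
    print(offset, repr(chunk))
```

With the pipeline from this notebook, each chunk could then be processed as `shift_entities(ner(chunk), offset)`. A single sentence longer than `max_chars` is kept as one oversized chunk, so very long sentences would still need extra handling.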

In [None]:
# Batch processing example
texts = [
    "Google announced new AI features.",
    "Microsoft CEO Satya Nadella spoke in Seattle.",
    "The European Union passed new regulations."
]

# Process all texts in one call; pass batch_size=N to run them through the model in batches
batch_results = ner(texts)

for text, entities in zip(texts, batch_results):
    print(f"'{text}'")
    for ent in entities:
        print(f"  → {ent['word']}: {ent['entity_group']}")
    print()

### Limitations of NER

1. **Domain dependency**: Models trained on news may struggle with medical or legal text
2. **New entities**: Models may miss or mislabel names that did not exist when the training data was collected, although surrounding context often helps
3. **Context sensitivity**: Same name can be different entity types
4. **Language support**: Many popular models are English-only; multilingual models exist but may be less accurate for any given language

In [None]:
# Limitation example: Domain-specific entities
domain_texts = [
    # Medical (might struggle)
    "The patient was prescribed Lisinopril for hypertension.",
    
    # Legal (might struggle)
    "The defendant violated Section 230 of the Communications Decency Act.",
    
    # News (should work well - training domain)
    "President Biden met with Chancellor Scholz in Berlin.",
]

print("DOMAIN COMPARISON:")
print("="*50)
for text in domain_texts:
    entities = ner(text)
    print(f"'{text}'")
    if entities:
        for ent in entities:
            print(f"  {ent['word']}: {ent['entity_group']} ({ent['score']:.0%})")
    else:
        print("  No entities found")
    print()

---

## Exercise 3: Entity Frequency Counter (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Build a class that tracks entity frequencies across multiple texts.

**Requirements**:
1. Process multiple texts and track all entities
2. Count how many times each unique entity appears
3. Group by entity type
4. Provide a summary report

In [None]:
# YOUR CODE HERE

class EntityFrequencyTracker:
    """
    Tracks entity frequencies across multiple texts.
    """
    
    def __init__(self):
        """Initialize the tracker."""
        # TODO: Initialize data structures
        self.ner = pipeline("ner", grouped_entities=True)
        self.entity_counts = defaultdict(Counter)  # {entity_type: Counter({entity: count})}
        self.total_texts = 0
    
    def process_text(self, text):
        """Process a single text and update counts."""
        # TODO: Extract entities and update counts
        entities = self.ner(text)
        self.total_texts += 1
        
        for ent in entities:
            entity_type = ent['entity_group']
            entity_text = ent['word'].strip()
            self.entity_counts[entity_type][entity_text] += 1
        
        return entities
    
    def process_batch(self, texts):
        """Process multiple texts."""
        for text in texts:
            self.process_text(text)
    
    def get_top_entities(self, entity_type=None, n=10):
        """
        Get the most common entities.
        
        Args:
            entity_type: Filter by type (PER, ORG, LOC, MISC) or None for all
            n: Number of top entities to return
        """
        # TODO: Return top entities
        if entity_type:
            return self.entity_counts[entity_type].most_common(n)
        else:
            # Combine all counts
            all_counts = Counter()
            for type_counts in self.entity_counts.values():
                all_counts.update(type_counts)
            return all_counts.most_common(n)
    
    def get_summary(self):
        """Return a summary report."""
        # TODO: Create summary
        summary = []
        summary.append(f"Entity Frequency Report")
        summary.append(f"=" * 40)
        summary.append(f"Texts processed: {self.total_texts}")
        summary.append(f"")
        
        for entity_type in ['PER', 'ORG', 'LOC', 'MISC']:
            if entity_type in self.entity_counts:
                total = sum(self.entity_counts[entity_type].values())
                unique = len(self.entity_counts[entity_type])
                summary.append(f"{entity_type}: {total} mentions ({unique} unique)")
                for entity, count in self.entity_counts[entity_type].most_common(3):
                    summary.append(f"  - {entity}: {count}")
                summary.append("")
        
        return '\n'.join(summary)


# Test the tracker
tracker = EntityFrequencyTracker()

# Sample texts about tech companies
sample_texts = [
    "Apple CEO Tim Cook spoke at the Apple Park headquarters in Cupertino.",
    "Google and Microsoft are competing in the AI space.",
    "Satya Nadella, CEO of Microsoft, announced new Azure features.",
    "Tim Cook visited the new Apple Store in New York City.",
    "Google's Sundar Pichai discussed AI ethics in San Francisco.",
    "Microsoft acquired Activision Blizzard for gaming expansion.",
]

tracker.process_batch(sample_texts)
print(tracker.get_summary())

---

# Part 5: Mini-Project

## Project: News Article Entity Analyzer

**Scenario**: You're building a news aggregation tool that automatically tags articles with key entities for better search and categorization.

**Your goal**: Build a `NewsEntityAnalyzer` class that:
1. Takes a news article text
2. Extracts all entities and groups them by type
3. Identifies the most mentioned entities (likely the main subjects)
4. Creates a brief "entity summary" for the article

In [None]:
# MINI-PROJECT: News Article Entity Analyzer
# ==========================================

class NewsEntityAnalyzer:
    """
    Analyzes news articles to extract and summarize key entities.
    """
    
    def __init__(self):
        """Initialize the analyzer with NER pipeline."""
        self.ner = pipeline("ner", grouped_entities=True)
    
    def analyze(self, article_text):
        """
        Analyze an article and return structured entity information.
        
        Args:
            article_text: The full article text
            
        Returns:
            dict with entity analysis results
        """
        # Extract entities
        entities = self.ner(article_text)
        
        # Group by type
        by_type = defaultdict(list)
        for ent in entities:
            by_type[ent['entity_group']].append({
                'text': ent['word'],
                'score': ent['score'],
                'start': ent['start'],
                'end': ent['end']
            })
        
        # Count entity frequencies
        entity_freq = Counter(ent['word'] for ent in entities)
        
        # Identify main subjects: rank unique entities by mention count, then by confidence
        main_subjects = []
        seen = set()
        for ent in sorted(entities, key=lambda x: (-entity_freq[x['word']], -x['score'])):
            if ent['word'] not in seen:
                main_subjects.append({
                    'entity': ent['word'],
                    'type': ent['entity_group'],
                    'mentions': entity_freq[ent['word']],
                    'confidence': ent['score']
                })
                seen.add(ent['word'])
        
        return {
            'total_entities': len(entities),
            'unique_entities': len(set(e['word'] for e in entities)),
            'by_type': dict(by_type),
            'main_subjects': main_subjects[:5],  # Top 5
            'entity_frequency': dict(entity_freq)
        }
    
    def get_summary(self, article_text):
        """
        Generate a human-readable entity summary for an article.
        """
        analysis = self.analyze(article_text)
        
        lines = []
        lines.append("📰 Article Entity Analysis")
        lines.append("=" * 50)
        lines.append(f"Total entity mentions: {analysis['total_entities']}")
        lines.append(f"Unique entities: {analysis['unique_entities']}")
        lines.append("")
        
        # Main subjects
        lines.append("🎯 Main Subjects:")
        for subj in analysis['main_subjects']:
            mention_text = "mention" if subj['mentions'] == 1 else "mentions"
            lines.append(f"  • {subj['entity']} ({subj['type']}) - {subj['mentions']} {mention_text}")
        lines.append("")
        
        # By type breakdown
        lines.append("📊 Entities by Type:")
        type_icons = {'PER': '👤', 'ORG': '🏢', 'LOC': '📍', 'MISC': '🏷️'}
        for entity_type in ['PER', 'ORG', 'LOC', 'MISC']:
            if entity_type in analysis['by_type']:
                entities = analysis['by_type'][entity_type]
                unique_names = list(set(e['text'] for e in entities))
                icon = type_icons.get(entity_type, '•')
                lines.append(f"  {icon} {entity_type}: {', '.join(unique_names[:5])}")
                if len(unique_names) > 5:
                    lines.append(f"       ... and {len(unique_names) - 5} more")
        
        return '\n'.join(lines)
    
    def get_tags(self, article_text, max_tags=10):
        """
        Generate tags for the article based on entities.
        
        Returns:
            list of (tag, entity_type) tuples
        """
        analysis = self.analyze(article_text)
        
        tags = []
        # Note: analyze() keeps only the top 5 main subjects, so at most 5 tags are returned
        for subj in analysis['main_subjects'][:max_tags]:
            tags.append((subj['entity'], subj['type']))
        
        return tags


# Create the analyzer
analyzer = NewsEntityAnalyzer()

In [None]:
# Test with a sample news article
sample_article = """
Tech Giants Face New EU Regulations

Brussels - The European Union announced sweeping new regulations targeting major 
technology companies on Tuesday. European Commission President Ursula von der Leyen 
unveiled the Digital Markets Act, which will impose strict rules on companies like 
Apple, Google, Amazon, and Meta.

The regulations, developed in consultation with Margrethe Vestager, the EU's competition 
chief, aim to create a more level playing field in the digital economy. "These companies 
have become gatekeepers," von der Leyen said at a press conference in Brussels.

Tim Cook, Apple's CEO, expressed concerns about the new rules during a visit to Paris 
last week. Meanwhile, Sundar Pichai of Google and Mark Zuckerberg of Meta have indicated 
they are reviewing the regulations with their legal teams.

The United States Trade Representative has called the regulations potentially 
discriminatory against American companies. However, EU officials maintain that the 
rules apply equally to all companies operating in Europe, including European firms 
like Spotify and SAP.
"""

print(analyzer.get_summary(sample_article))

In [None]:
# Generate tags for the article
tags = analyzer.get_tags(sample_article)

print("\n🏷️ Suggested Article Tags:")
for tag, entity_type in tags:
    print(f"  #{tag.replace(' ', '')} ({entity_type})")

In [None]:
# Try with your own article
# Paste a news article here:
your_article = """
Paste your own news article here for analysis.
The analyzer will extract all entities and provide
a structured summary.
"""

# Uncomment to analyze:
# print(analyzer.get_summary(your_article))

### Extension Ideas

If you want to extend this project further:

1. **Entity linking**: Connect entities to Wikipedia or a knowledge base
2. **Relationship extraction**: Find connections between entities ("X works at Y")
3. **Sentiment per entity**: Determine if coverage of each entity is positive/negative
4. **Timeline extraction**: Build a timeline from DATE entities
5. **Cross-article tracking**: Track entities across multiple articles over time

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Named Entity Recognition** extracts and categorizes real-world entities (people, organizations, locations) from text

2. **BIO tagging** (Begin-Inside-Outside) is the scheme used to mark entity boundaries and handle multi-word entities

3. **Same BERT architecture** from Notebook 1, but with a different output head for token classification

4. **Use `grouped_entities=True`** to get aggregated entities instead of raw BIO tags

5. **Confidence scores matter** - low confidence often indicates ambiguous or incorrect predictions

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Forgetting `grouped_entities=True` | You get raw B-/I- tags instead of complete entities |
| Not handling split entities | Multi-word entities may be incorrectly separated |
| Using news-trained models on other domains | Medical/legal text needs specialized models |
| Ignoring confidence scores | Low-confidence predictions are often wrong |

## What's Next?

In **Notebook 3: Question Answering**, you'll learn:
- How to find answers within a given context (extractive QA)
- How models predict start and end positions of answer spans
- How this builds on NER: both tasks extract spans from text

The concepts of token classification and span extraction will directly apply!

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) Named Entity Recognition**
2. **B) Inside (continuation) of an entity**
3. **C) ORG (Organization)** - Universities are organizations
4. **B) NER uses BERT with a different output head**

### Exercise 2: Merge Adjacent Entities (Sample Solution)

In [None]:
# Sample solution for Exercise 2 is provided in the exercise itself
# The merge_adjacent_entities function handles:
# 1. Sorting entities by position
# 2. Checking for small gaps between entities
# 3. Only merging same entity types
# 4. Averaging confidence scores

# Test it with a challenging case:
test_text = "The University of California at Berkeley is in the Bay Area."
entities = ner(test_text)

print(f"Text: '{test_text}'")
print("\nOriginal entities:")
for ent in entities:
    print(f"  '{ent['word']}' [{ent['entity_group']}]")

print("\nMerged entities:")
merged = merge_adjacent_entities(entities, test_text, max_gap=3)
for ent in merged:
    print(f"  '{ent['word']}' [{ent['entity_group']}]")

---

## Additional Resources

- [Hugging Face NER Pipeline Docs](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline)
- [CoNLL-2003 Dataset](https://huggingface.co/datasets/conll2003) - The dataset many NER models are trained on
- [BERT Paper](https://arxiv.org/abs/1810.04805) - The original BERT paper, which includes CoNLL-2003 NER experiments
- [OntoNotes NER](https://catalog.ldc.upenn.edu/LDC2013T19) - Dataset with 18 entity types