# 3. Question Answering

**Estimated Time**: ~2 hours

**Prerequisites**: Notebooks 1-2 (understanding of tokenization, pipelines, confidence scores, and span extraction concepts from NER)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Distinguish** between extractive and generative question answering approaches
2. **Use** the `question-answering` pipeline to find answers within context
3. **Interpret** start/end positions and confidence scores in QA output
4. **Handle** unanswerable questions and low-confidence predictions
5. **Build** a practical document Q&A system

## Setup

Run this cell first. If you completed Notebooks 1-2, you already have the core packages ready.

In [None]:
# Core imports
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")

---

# Part 1: Conceptual Foundation

## What is Question Answering?

**In plain English**: Question Answering (QA) is the task of finding the answer to a question within a given text passage.

**Technical definition**: QA models take a question and a context, then identify the span (start and end positions) in the context that answers the question.

### Visual Example

```
Context: "Albert Einstein was born in Ulm, Germany, in 1879. He developed 
          the theory of relativity and won the Nobel Prize in Physics in 1921."

Question: "Where was Einstein born?"

Answer:   "Ulm, Germany" (extracted from context at character positions 28-40)
           ↑___________↑
           start       end
```
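
As a minimal sketch of what "start and end positions" mean at the character level, the snippet below locates the answer with plain string search (no model involved) and slices it back out. The variable names are illustrative; a real QA model predicts the span rather than searching for a known answer.

```python
# Character-level span lookup with plain Python (no model involved).
context = (
    "Albert Einstein was born in Ulm, Germany, in 1879. He developed "
    "the theory of relativity and won the Nobel Prize in Physics in 1921."
)
answer = "Ulm, Germany"

start = context.find(answer)   # index of the first character of the answer
end = start + len(answer)      # one past the last character (half-open range)

print(start, end)
print(repr(context[start:end]))  # slicing with [start:end] recovers the answer
```

Python slices are half-open, so `end` points just past the last character of the answer.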

### Two Types of Question Answering

| Type | How It Works | Output | Example Models |
|------|--------------|--------|----------------|
| **Extractive QA** | Finds the answer span within the given context | Exact substring from context | BERT, RoBERTa, DistilBERT |
| **Generative QA** | Generates an answer (may use context or not) | Newly generated text | GPT, T5, BART |

This notebook focuses on **Extractive QA** - the model extracts answers directly from the context.

```
EXTRACTIVE QA (This Notebook):
┌────────────────────────────────────────┐
│ Context: "The Eiffel Tower is 330m..." │
│ Question: "How tall is Eiffel Tower?"  │
└────────────────┬───────────────────────┘
                 │
                 ▼
          ┌────────────┐
          │  "330m"    │  ← Extracted directly
          └────────────┘

GENERATIVE QA (Different approach):
┌────────────────────────────────────────┐
│ Question: "How tall is Eiffel Tower?"  │
└────────────────┬───────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  "The Eiffel Tower is approximately 330     │  ← Generated new text
│   meters (1,083 feet) tall."                │
└─────────────────────────────────────────────┘
```

### How Extractive QA Works: Start/End Prediction

Remember from Notebook 2 how NER finds entity spans? QA works similarly - but instead of finding all entities, it finds the **single span** that answers the question.

```
Context tokens:  [CLS] The Eiffel Tower is 330 meters tall . [SEP]
                   0    1     2      3    4   5    6     7  8   9

Question: "How tall is the Eiffel Tower?"

Model predicts:
  Start position: 5 ("330")
  End position:   7 ("tall")

Answer: tokens 5-7 = "330 meters tall"
```

The model outputs **two probability distributions**:
- One for the start position of the answer
- One for the end position of the answer
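
To make the two distributions concrete, here is a small sketch in plain Python. The logit values are made up for illustration (they do not come from a real model); the point is only that each list of scores is turned into probabilities with softmax, and the highest-probability positions mark the span.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["[CLS]", "The", "Eiffel", "Tower", "is", "330", "meters", "tall", ".", "[SEP]"]

# Hypothetical logits, one per token (invented for this example)
start_logits = [0.1, 0.2, 0.3, 0.1, 0.5, 6.0, 1.0, 0.4, 0.0, 0.0]
end_logits   = [0.1, 0.0, 0.2, 0.3, 0.1, 1.5, 2.0, 5.5, 0.2, 0.0]

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Pick the most probable start and end positions
start_idx = max(range(len(tokens)), key=lambda i: start_probs[i])
end_idx = max(range(len(tokens)), key=lambda i: end_probs[i])

print("Answer:", " ".join(tokens[start_idx:end_idx + 1]))  # Answer: 330 meters tall
```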

### Connection to Notebook 2: Span Extraction

Both NER and QA are **span extraction** tasks:

```
NER (Notebook 2):                    QA (This Notebook):
┌──────────────────┐                 ┌──────────────────┐
│   BERT Encoder   │                 │   BERT Encoder   │
│ (same structure) │                 │ (same structure) │
└────────┬─────────┘                 └────────┬─────────┘
         │                                    │
         ▼                                    ▼
┌──────────────────┐                 ┌──────────────────┐
│  Token Class Head│                 │  Start/End Head  │
│  (B-PER, I-PER..)│                 │  (position pred) │
└──────────────────┘                 └──────────────────┘
         │                                    │
         ▼                                    ▼
  Multiple spans                       Single span
  (many entities)                      (one answer)
```

The key difference: NER finds **many** spans, QA finds **one** answer span.

### Real-World Applications

QA powers many practical applications:

- **Customer Support**: Automatically answer FAQs from documentation
- **Search Engines**: Extract direct answers from web pages ("featured snippets")
- **Legal/Medical Research**: Find specific information in long documents
- **Virtual Assistants**: Answer factual questions from knowledge bases
- **Education**: Auto-grade reading comprehension questions
- **Enterprise Search**: Find answers in company wikis and documents

### Key Terminology

| Term | Definition |
|------|------------|
| **Context** | The passage of text containing the answer |
| **Question** | What we're asking about the context |
| **Answer Span** | The substring from context that answers the question |
| **Start/End Position** | Where the answer begins and ends: token indices inside the model, character indices in the pipeline output |
| **Extractive QA** | Finding answers by extracting spans from context |
| **Generative QA** | Creating answers by generating new text |
| **Unanswerable** | Question cannot be answered from the given context |

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end):

1. What type of QA extracts answers directly from the given context?
   - A) Generative QA
   - B) Extractive QA
   - C) Creative QA

2. What does a QA model predict to find an answer?
   - A) The exact answer text
   - B) Start and end positions within the context
   - C) Whether the context is relevant

3. How does QA relate to NER from Notebook 2?
   - A) They're completely unrelated
   - B) Both are span extraction tasks using similar architectures
   - C) QA always uses NER as a preprocessing step

4. What happens when a question cannot be answered from the context?
   - A) The model always makes up an answer
   - B) The model returns an empty string
   - C) The model may return low confidence or empty/incorrect span

---

# Part 2: Basic Implementation

## Your First QA Pipeline

Let's create a QA pipeline and find answers within context:

In [None]:
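# Create a QA pipeline
# Naming the model explicitly keeps results reproducible; this checkpoint is the
# one the pipeline has used by default (DistilBERT fine-tuned on SQuAD 1.1)
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")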

# Define context and question
context = """
Albert Einstein was born in Ulm, Germany, on March 14, 1879. He is widely regarded 
as one of the greatest physicists of all time. Einstein developed the theory of 
relativity, one of the two pillars of modern physics. He received the Nobel Prize 
in Physics in 1921 for his explanation of the photoelectric effect.
"""

question = "Where was Einstein born?"

# Get the answer
result = qa(question=question, context=context)

print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.2%}")

### Understanding the Output

The QA pipeline returns a dictionary with:

- `answer`: The extracted text that answers the question
- `score`: Confidence score (0 to 1)
- `start`: Character position in the context where the answer begins
- `end`: Character position just past the end of the answer, so `context[start:end]` is the answer text

Let's examine the result in detail:

In [None]:
# Examine the result in detail
print("Detailed QA Result:")
print("="*40)
for key, value in result.items():
    if key == 'score':
        print(f"  {key:8}: {value:.4f} ({value:.2%})")
    else:
        print(f"  {key:8}: {value}")

# Verify the span is correct
print(f"\nVerification: context[{result['start']}:{result['end']}]")
print(f"  = '{context[result['start']:result['end']]}'")

### Asking Multiple Questions

Let's ask several questions about the same context:

In [None]:
# Multiple questions about Einstein
questions = [
    "Where was Einstein born?",
    "When was Einstein born?",
    "What did Einstein develop?",
    "When did Einstein receive the Nobel Prize?",
    "What was the Nobel Prize for?",
]

print(f"Context: {context.strip()[:100]}...\n")
print("Questions and Answers:")
print("="*60)

for q in questions:
    result = qa(question=q, context=context)
    confidence = "High" if result['score'] > 0.8 else "Medium" if result['score'] > 0.5 else "Low"
    print(f"Q: {q}")
    print(f"A: {result['answer']} ({result['score']:.0%} - {confidence})")
    print()

### Confidence Score Interpretation

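As in previous notebooks, the confidence score indicates how certain the model is. The ranges below are rules of thumb, not calibrated probabilities, and a model can be confidently wrong: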

| Score Range | Interpretation | Action |
|-------------|----------------|--------|
| **0.8 - 1.0** | High confidence | Trust the answer |
| **0.5 - 0.8** | Medium confidence | Verify if critical |
| **0.2 - 0.5** | Low confidence | Answer may be wrong |
| **< 0.2** | Very low | Question likely unanswerable |
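
The table can be turned into a small helper. This is a sketch only: `confidence_band` is a hypothetical name, and the cut-offs are the rule-of-thumb values from the table rather than thresholds tuned for any particular model.

```python
def confidence_band(score):
    """Map a QA confidence score (0 to 1) to the bands in the table above."""
    if score >= 0.8:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.2:
        return "Low"
    return "Very low"

# Example scores chosen by hand to hit each band
for s in (0.95, 0.62, 0.31, 0.04):
    print(f"{s:.2f} -> {confidence_band(s)}")
```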

In [None]:
# Demonstrate confidence levels with different question types
test_questions = [
    # Clear answer in context
    ("What year was Einstein born?", "Clearly answerable"),
    # Answer requires inference
    ("How old was Einstein when he got the Nobel Prize?", "Requires calculation"),
    # Not in context
    ("What was Einstein's favorite food?", "Not in context"),
]

print("Confidence Analysis:")
print("="*70)

for q, note in test_questions:
    result = qa(question=q, context=context)
    bar = "█" * int(result['score'] * 20)
    print(f"Q: {q}")
    print(f"   Type: {note}")
    print(f"   A: '{result['answer']}' | {result['score']:.0%} {bar}")
    print()

---

## Exercise 1: Wikipedia QA (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Ask questions about a Wikipedia-style passage and analyze the confidence scores.

### Step 1: Process this passage about the Moon

In [None]:
# Wikipedia-style passage about the Moon
moon_context = """
The Moon is Earth's only natural satellite. It orbits at an average distance of 
384,400 kilometers from Earth. The Moon's diameter is 3,474 kilometers, making it 
the fifth-largest satellite in the Solar System. The Moon was first visited by the 
Soviet spacecraft Luna 2 in 1959. The first crewed lunar landing was Apollo 11 in 
1969, when Neil Armstrong became the first person to walk on the Moon. The Moon's 
gravitational influence produces Earth's tides and slightly lengthens Earth's day.
"""

# Questions to ask
moon_questions = [
    "How far is the Moon from Earth?",
    "What is the Moon's diameter?",
    "When was the Moon first visited by spacecraft?",
    "Who was the first person to walk on the Moon?",
    "What does the Moon's gravity affect?",
]

print("Moon Q&A Session:")
print("="*60)

results = []
for q in moon_questions:
    result = qa(question=q, context=moon_context)
    results.append(result)
    print(f"Q: {q}")
    print(f"A: {result['answer']} ({result['score']:.0%})")
    print()

### Step 2: Analyze confidence distribution

In [None]:
# Analyze confidence distribution
print("Confidence Distribution:")
print("-"*40)

for q, r in zip(moon_questions, results):
    bar = "█" * int(r['score'] * 30)
    print(f"{r['score']:>4.0%} {bar} {q}")

avg_confidence = sum(r['score'] for r in results) / len(results)
print(f"\nAverage confidence: {avg_confidence:.1%}")

### Step 3: Try your own questions

Add 3 of your own questions about the Moon passage:

In [None]:
# YOUR CODE HERE
# Add your own questions about the Moon
my_questions = [
    # Replace with your own questions
    "Your question 1 here",
    "Your question 2 here",
    "Your question 3 here",
]

for q in my_questions:
    result = qa(question=q, context=moon_context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} ({result['score']:.0%})")
    print()

---

# Part 3: Intermediate Exploration

## Handling Unanswerable Questions

A major challenge in QA: what happens when the answer isn't in the context?

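The default model was fine-tuned on SQuAD 1.1, where every question has an answer, so it always returns some span. Other models are fine-tuned on SQuAD 2.0, which also includes unanswerable questions.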

In [None]:
# Test with unanswerable questions
context = """
Python is a high-level programming language created by Guido van Rossum and 
first released in 1991. It emphasizes code readability and allows programmers 
to express concepts in fewer lines of code than languages like C++ or Java.
"""

questions = [
    # Answerable
    ("Who created Python?", True),
    ("When was Python released?", True),
    # Not answerable from context
    ("What company does Guido work for?", False),
    ("What is the latest Python version?", False),
    ("How many Python developers exist?", False),
]

print("Testing Answerable vs Unanswerable Questions:")
print("="*60)

for q, is_answerable in questions:
    result = qa(question=q, context=context)
    status = "✓ Answerable" if is_answerable else "✗ Not in context"
    warning = "" if result['score'] > 0.3 else " ⚠️ LOW CONFIDENCE"
    
    print(f"[{status}]")
    print(f"  Q: {q}")
    print(f"  A: '{result['answer']}' ({result['score']:.0%}){warning}")
    print()

### Using a Model Trained on SQuAD 2.0

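Models fine-tuned on SQuAD 2.0 have seen unanswerable questions during training. By default the pipeline still returns its best span, so look for lower scores on the unanswerable questions rather than an empty answer: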

In [None]:
# Load a SQuAD 2.0 model (better at detecting unanswerable questions)
print("Loading SQuAD 2.0 model (better with unanswerable questions)...")
qa_squad2 = pipeline("question-answering", model="deepset/roberta-base-squad2")
print("Model loaded!\n")

# Test the same questions
print("Testing with SQuAD 2.0 Model:")
print("="*60)

for q, is_answerable in questions:
    result = qa_squad2(question=q, context=context)
    status = "✓ Answerable" if is_answerable else "✗ Not in context"
    
    print(f"[{status}]")
    print(f"  Q: {q}")
    print(f"  A: '{result['answer']}' ({result['score']:.0%})")
    print()
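
A common convention for SQuAD 2.0-style models is to treat the first token (`[CLS]` or `<s>`) as the "no answer" position: the score for an empty answer is the start logit plus the end logit at position 0, and it is compared with the best real span. The sketch below illustrates that comparison in plain Python with made-up logits; the helper name and `null_threshold` are assumptions for illustration. In the `transformers` pipeline, passing `handle_impossible_answer=True` asks it to consider the empty answer as a candidate.

```python
def best_span_or_null(start_logits, end_logits, max_answer_len=15, null_threshold=0.0):
    """Return (start, end) of the best span, or None if "no answer" scores higher.

    Position 0 is assumed to be the special first token used as the null answer.
    """
    null_score = start_logits[0] + end_logits[0]

    best_score, best_span = float("-inf"), None
    for start in range(1, len(start_logits)):
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)

    # Predict "no answer" when the null score beats the best span by the threshold
    if null_score - best_score > null_threshold:
        return None
    return best_span

# Hypothetical logits (invented for this example)
answerable = best_span_or_null([0.5, 0.1, 4.0, 0.2], [0.3, 0.2, 0.1, 3.5])
unanswerable = best_span_or_null([5.0, 0.1, 0.4, 0.2], [4.5, 0.2, 0.1, 0.3])
print(answerable, unanswerable)  # (2, 3) None
```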

### Strategy for Handling Low Confidence

A practical approach is to set a confidence threshold and decline to answer below it:

In [None]:
def answer_with_threshold(qa_pipeline, question, context, threshold=0.3):
    """
    Answer a question with a confidence threshold.
    
    Returns:
        dict with 'answer', 'confidence', and 'is_confident'
    """
    result = qa_pipeline(question=question, context=context)
    
    is_confident = result['score'] >= threshold
    
    return {
        'answer': result['answer'] if is_confident else "I cannot confidently answer this question.",
        'raw_answer': result['answer'],
        'confidence': result['score'],
        'is_confident': is_confident,
    }

# Test the threshold approach
test_questions = [
    "Who created Python?",
    "What company does Guido work for?",
]

print("Using Confidence Threshold (0.3):")
print("="*50)

for q in test_questions:
    result = answer_with_threshold(qa, q, context)
    status = "✓" if result['is_confident'] else "✗"
    print(f"{status} Q: {q}")
    print(f"  Answer: {result['answer']}")
    print(f"  (Raw: '{result['raw_answer']}' at {result['confidence']:.0%})")
    print()
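
The value 0.3 is a starting point, not a universal constant. If you have a handful of questions labelled as answerable or not, one way to choose a threshold is to try several candidates and keep the most accurate one. The sketch below does this in plain Python; `pick_threshold` is a hypothetical helper and the scores are invented for illustration, not produced by a model.

```python
def pick_threshold(scored_examples, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the candidate threshold with the highest accuracy on labelled examples.

    scored_examples: list of (score, is_answerable) pairs.
    """
    def accuracy(threshold):
        correct = sum((score >= threshold) == is_answerable
                      for score, is_answerable in scored_examples)
        return correct / len(scored_examples)

    # max() keeps the first candidate when several tie on accuracy
    return max(candidates, key=accuracy)

# Hypothetical validation scores (invented for this example, not model output)
validation = [
    (0.92, True), (0.71, True), (0.48, True),
    (0.35, False), (0.12, False), (0.05, False),
]
best = pick_threshold(validation)
print(f"Chosen threshold: {best}")  # Chosen threshold: 0.4
```

In practice you would collect the scores by running the pipeline on your own labelled questions, and use more than six examples.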

### Comparing QA Models

Different models have different strengths. Let's compare:

In [None]:
# We already have two models loaded:
# - qa: DistilBERT trained on SQuAD 1.1 (default, faster)
# - qa_squad2: RoBERTa trained on SQuAD 2.0 (better with unanswerable)

test_context = """
The Great Wall of China is a series of fortifications built along the northern 
borders of ancient Chinese states. Construction began in the 7th century BC and 
continued for centuries. The most well-known sections were built during the Ming 
Dynasty (1368-1644). The wall stretches over 21,000 kilometers.
"""

test_questions = [
    "How long is the Great Wall?",
    "When did construction begin?",
    "How many workers died building it?",  # Not in context
]

print("Model Comparison:")
print("="*70)

for q in test_questions:
    print(f"\nQ: {q}")
    
    result1 = qa(question=q, context=test_context)
    result2 = qa_squad2(question=q, context=test_context)
    
    print(f"  DistilBERT: '{result1['answer']}' ({result1['score']:.0%})")
    print(f"  RoBERTa:    '{result2['answer']}' ({result2['score']:.0%})")

---

## Exercise 2: Unanswerable Question Detection (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Write a function that classifies questions as answerable or unanswerable based on confidence scores and answer characteristics.

**Hints**:
1. Low confidence often indicates unanswerable questions
2. Very short answers might be suspicious
3. You might want to use multiple thresholds

In [None]:
# YOUR CODE HERE

def classify_answerability(qa_pipeline, question, context, 
                          confidence_threshold=0.3,
                          min_answer_length=1):
    """
    Classify whether a question is answerable from the given context.
    
    Args:
        qa_pipeline: The QA pipeline to use
        question: The question to ask
        context: The context to search in
        confidence_threshold: Minimum confidence to consider answerable
        min_answer_length: Minimum answer length (characters)
    
    Returns:
        dict with classification result and reasoning
    """
    result = qa_pipeline(question=question, context=context)
    
    # Gather evidence
    confidence = result['score']
    answer_length = len(result['answer'].strip())
    
    # Make decision with reasoning
    reasons = []
    
    # Check confidence
    if confidence < confidence_threshold:
        reasons.append(f"Low confidence ({confidence:.0%} < {confidence_threshold:.0%})")
    
    # Check answer length
    if answer_length < min_answer_length:
        reasons.append(f"Answer too short ({answer_length} chars)")
    
    # Determine answerability
    is_answerable = len(reasons) == 0
    
    return {
        'question': question,
        'answer': result['answer'],
        'confidence': confidence,
        'is_answerable': is_answerable,
        'reasons': reasons if reasons else ["Confidence above threshold"],
    }


# Test the classifier
test_context = """
Amazon Web Services (AWS) was launched in 2006, offering cloud computing services.
It provides services in computing, storage, and databases. AWS operates from data 
centers around the world. Jeff Bezos founded Amazon in 1994 in Seattle.
"""

test_questions = [
    "When was AWS launched?",            # Answerable
    "Who founded Amazon?",                # Answerable  
    "How much revenue does AWS generate?", # Not in context
    "What is the AWS CEO's name?",        # Not in context
    "Where is Amazon headquartered?",     # Partially (Seattle mentioned)
]

print("Answerability Classification:")
print("="*60)

for q in test_questions:
    result = classify_answerability(qa_squad2, q, test_context)
    status = "✓ ANSWERABLE" if result['is_answerable'] else "✗ UNANSWERABLE"
    
    print(f"\n{status}")
    print(f"  Q: {result['question']}")
    print(f"  A: '{result['answer']}' ({result['confidence']:.0%})")
    print(f"  Reasoning: {', '.join(result['reasons'])}")

---

# Part 4: Advanced Topics

## Under the Hood: Start and End Logits

Let's see what the QA model does internally:

In [None]:
# Load model and tokenizer separately
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Model type: {type(model).__name__}")

In [None]:
# Step-by-step QA
context = "Paris is the capital of France. It is known for the Eiffel Tower."
question = "What is the capital of France?"

# STEP 1: Tokenization
# QA models take both question and context together
inputs = tokenizer(
    question, 
    context, 
    return_tensors="pt",
    return_offsets_mapping=True,
    truncation=True,
    max_length=512,
)

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print("STEP 1 - Tokenization:")
print(f"  Question: '{question}'")
print(f"  Context: '{context}'")
print(f"\n  Tokens ({len(tokens)} total):")
for i, token in enumerate(tokens):
    print(f"    {i:2}: {token}")

In [None]:
# STEP 2: Model inference
# Remove offset_mapping before passing to model
model_inputs = {k: v for k, v in inputs.items() if k != 'offset_mapping'}

with torch.no_grad():
    outputs = model(**model_inputs)

print("STEP 2 - Model Inference:")
print(f"  Start logits shape: {outputs.start_logits.shape}")
print(f"  End logits shape: {outputs.end_logits.shape}")
print("  (Each has one score per token position)")

In [None]:
# STEP 3: Convert logits to probabilities and find best positions
start_probs = torch.softmax(outputs.start_logits, dim=1)
end_probs = torch.softmax(outputs.end_logits, dim=1)

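# Get best positions (taken independently here for simplicity; the pipeline
# additionally checks that the end does not come before the start)
start_idx = torch.argmax(start_probs).item()
end_idx = torch.argmax(end_probs).item()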

print("STEP 3 - Position Prediction:")
print(f"  Best start position: {start_idx} ('{tokens[start_idx]}')")
print(f"  Best end position: {end_idx} ('{tokens[end_idx]}')")
print(f"\n  Top 3 start positions:")
for idx in torch.topk(start_probs[0], 3).indices:
    print(f"    Position {idx.item():2}: '{tokens[idx.item()]}' ({start_probs[0][idx]:.2%})")
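
Taking the two argmax positions independently can occasionally yield an end position that comes before the start. A common remedy is to score every valid (start, end) pair and keep the best one. The following is a minimal sketch of that idea in plain Python with made-up probabilities; `best_valid_span` and `max_answer_len` are illustrative names, not part of this notebook's code.

```python
def best_valid_span(start_probs, end_probs, max_answer_len=15):
    """Return (start, end, score) maximising start_prob * end_prob with start <= end."""
    best = (0, 0, 0.0)
    for start, p_start in enumerate(start_probs):
        last = min(start + max_answer_len, len(end_probs))
        for end in range(start, last):
            score = p_start * end_probs[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# Hypothetical probabilities where the independent argmaxes conflict:
# the most likely start is position 3, but the most likely end is position 1.
start_probs = [0.05, 0.10, 0.15, 0.70]
end_probs   = [0.05, 0.50, 0.05, 0.40]

start, end, score = best_valid_span(start_probs, end_probs)
print(start, end, round(score, 2))  # 3 3 0.28
```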

In [None]:
# STEP 4: Extract the answer
answer_tokens = tokens[start_idx:end_idx+1]
answer = tokenizer.convert_tokens_to_string(answer_tokens)

# Calculate confidence (product of start and end probabilities)
confidence = start_probs[0][start_idx] * end_probs[0][end_idx]

print("STEP 4 - Answer Extraction:")
print(f"  Answer tokens: {answer_tokens}")
print(f"  Answer text: '{answer}'")
print(f"  Start prob: {start_probs[0][start_idx]:.2%}")
print(f"  End prob: {end_probs[0][end_idx]:.2%}")
print(f"  Combined confidence: {confidence:.2%}")
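
Joining tokens back into a string can alter spacing around punctuation compared with the original text. Tokenizers can also report, for each token, the character range it came from (the Step 1 cell requests this with `return_offsets_mapping=True`), which lets you slice the answer straight out of the context. The sketch below mimics that idea with a simple word-and-punctuation tokenizer written with `re`; it is a stand-in for illustration, not the model's real tokenizer, and the chosen token positions are hypothetical.

```python
import re

def tokenize_with_offsets(text):
    """Split text into word/punctuation tokens and record each token's (start, end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

context = "Paris is the capital of France. It is known for the Eiffel Tower."
tokens = tokenize_with_offsets(context)

# Suppose a model picked token positions 12 and 13 (0-based) as the answer span
start_idx, end_idx = 12, 13

char_start = tokens[start_idx][1]   # first character of the start token
char_end = tokens[end_idx][2]       # one past the last character of the end token

print(repr(context[char_start:char_end]))  # 'Eiffel Tower'
```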

In [None]:
# Visualize start and end probabilities
print("Position Probability Visualization:")
print("="*60)
print(f"{'Pos':>3} {'Token':>15} {'Start':>10} {'End':>10}")
print("-"*60)

for i, token in enumerate(tokens):
    start_bar = "█" * int(start_probs[0][i].item() * 30)
    end_bar = "█" * int(end_probs[0][i].item() * 30)
    
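    # Highlight the answer positions (start and end coincide for one-token answers)
    marker = ""
    if i == start_idx == end_idx:
        marker = " ← START/END"
    elif i == start_idx:
        marker = " ← START"
    elif i == end_idx:
        marker = " ← END"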
    
    if start_probs[0][i] > 0.05 or end_probs[0][i] > 0.05:
        print(f"{i:3} {token:>15} {start_probs[0][i]:>8.1%} {end_probs[0][i]:>8.1%}{marker}")

### Handling Long Contexts

BERT-style QA models accept a limited number of tokens per input (512 for most of them), and that budget covers the question plus the context. For longer documents, you need to:

1. **Split into chunks**: Divide the document into overlapping chunks
2. **Process each chunk**: Run QA on each chunk
3. **Merge results**: Take the answer with the highest confidence
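
The chunking step can be sketched as a sliding window. The example below counts whitespace-separated words as a rough proxy for model tokens, which is an assumption made to keep it dependency-free; `sliding_chunks` is a hypothetical helper. The `question-answering` pipeline applies a similar windowing internally (see its `max_seq_len` and `doc_stride` arguments), so manual chunking is mainly useful when you want control over where the splits fall.

```python
def sliding_chunks(words, chunk_size=8, overlap=3):
    """Split a list of words into chunks of `chunk_size` that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

text = ("The modern field of AI research was founded at a workshop "
        "at Dartmouth College in 1956")
words = text.split()

for i, chunk in enumerate(sliding_chunks(words)):
    print(i, " ".join(chunk))
```

The overlap means an answer that straddles a chunk boundary still appears whole in at least one chunk, provided it is no longer than the overlap.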

In [None]:
def qa_on_long_context(qa_pipeline, question, context, chunk_size=400, overlap=100):
    """
    Answer questions on long contexts by chunking.
    
    Args:
        qa_pipeline: The QA pipeline to use
        question: The question to ask
        context: The (possibly long) context
        chunk_size: Maximum characters per chunk
        overlap: Overlap between chunks
    
    Returns:
        Tuple of (best answer dict across all chunks, list of chunks)
    """
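    # Simple sentence-aware chunking (strip trailing periods so they are not doubled)
    sentences = [s.strip().rstrip('.') for s in context.replace('\n', ' ').split('. ')]
    sentences = [s for s in sentences if s]
    
    chunks = []
    current = []  # sentences in the chunk being built
    
    for sentence in sentences:
        # Close the chunk if adding this sentence would exceed chunk_size
        if current and len('. '.join(current)) + len(sentence) + 3 > chunk_size:
            chunks.append('. '.join(current) + '.')
            # Carry trailing sentences (up to `overlap` characters) into the next chunk
            carried = []
            while current and len('. '.join([current[-1]] + carried)) <= overlap:
                carried.insert(0, current.pop())
            current = carried
        current.append(sentence)
    
    if current:
        chunks.append('. '.join(current) + '.')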
    
    # Process each chunk
    best_result = None
    best_score = -1
    
    for i, chunk in enumerate(chunks):
        result = qa_pipeline(question=question, context=chunk)
        result['chunk_index'] = i
        result['chunk'] = chunk[:50] + "..."
        
        if result['score'] > best_score:
            best_score = result['score']
            best_result = result
    
    return best_result, chunks

# Test with a longer context
long_context = """
The history of artificial intelligence began in antiquity, with myths and stories 
of artificial beings endowed with intelligence. The modern field of AI research 
was founded at a workshop at Dartmouth College in 1956. The term "artificial 
intelligence" was coined by John McCarthy for this workshop.

Early AI research focused on symbolic methods and problem solving. In the 1960s 
and 1970s, researchers developed expert systems that could reason about specialized 
domains. However, progress was slower than expected, leading to periods of reduced 
funding known as "AI winters."

Machine learning emerged as a major approach in the 1990s. Deep learning, using 
neural networks with many layers, achieved breakthroughs starting in 2012 with 
AlexNet's victory in the ImageNet competition. This was developed by Alex Krizhevsky, 
Ilya Sutskever, and Geoffrey Hinton at the University of Toronto.

Today, AI is used in many applications including virtual assistants, autonomous 
vehicles, medical diagnosis, and language translation. Major AI research labs 
include OpenAI, Google DeepMind, and Anthropic. The development of large language 
models like GPT and Claude has enabled new capabilities in natural language processing.
"""

test_questions = [
    "When was the term artificial intelligence coined?",
    "Who developed AlexNet?",
    "What are AI winters?",
]

print("Long Context Q&A:")
print("="*60)

for q in test_questions:
    result, chunks = qa_on_long_context(qa, q, long_context)
    print(f"Q: {q}")
    print(f"A: '{result['answer']}' ({result['score']:.0%})")
    print(f"   Found in chunk {result['chunk_index']+1}/{len(chunks)}")
    print()

### Performance Considerations

| Consideration | Recommendation |
|---------------|----------------|
| **Model size** | DistilBERT for speed, RoBERTa-large for accuracy |
| **Long documents** | Chunk with overlap, merge by confidence |
| **Unanswerable questions** | Use SQuAD 2.0 models, set confidence threshold |
| **Multiple questions** | Batch process when possible |
| **Domain-specific** | Fine-tune on domain data if available |

### Limitations of Extractive QA

1. **Answer must be in context**: Cannot synthesize or infer answers
2. **Single span limitation**: Cannot combine information from multiple places
3. **No reasoning**: Cannot perform math, logic, or multi-hop reasoning
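4. **Context length limit**: Most BERT-style models accept at most 512 tokens per input, so longer documents must be chunked
5. **Verbatim answers only**: The answer is always copied from the context, so it cannot be rephrased, normalized, or expressed in words the context does not use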

In [None]:
# Demonstrate limitations
limitation_examples = [
    # Requires synthesis from multiple sentences
    {
        'context': "John was born in 1990. Mary was born in 1995.",
        'question': "Who is older, John or Mary?",
        'issue': "Requires comparison across sentences"
    },
    # Requires calculation
    {
        'context': "The company earned $10 million in Q1 and $15 million in Q2.",
        'question': "What was the total earnings for both quarters?",
        'issue': "Requires mathematical calculation"
    },
    # Multi-hop reasoning
    {
        'context': "Alice manages Bob. Bob manages Carol.",
        'question': "Who is Carol's manager's manager?",
        'issue': "Requires multi-hop reasoning"
    },
]

print("QA Limitations:")
print("="*60)

for ex in limitation_examples:
    result = qa(question=ex['question'], context=ex['context'])
    print(f"\nIssue: {ex['issue']}")
    print(f"Context: '{ex['context']}'")
    print(f"Q: {ex['question']}")
    print(f"A: '{result['answer']}' ({result['score']:.0%})")

---

## Exercise 3: Multi-Question Ranking (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Build a system that takes multiple questions and ranks them by how well they can be answered from a given context.

**Requirements**:
1. Accept a context and list of questions
2. Rank questions by answerability (confidence)
3. Categorize as "Easily Answered", "Partially Answered", "Cannot Answer"
4. Provide a summary report

In [None]:
# YOUR CODE HERE

class QuestionRanker:
    """
    Ranks questions by how well they can be answered from a given context.
    """
    
    def __init__(self, qa_pipeline=None):
        """Initialize with a QA pipeline."""
        self.qa = qa_pipeline or pipeline("question-answering")
        
        # Thresholds for categorization
        self.high_threshold = 0.7
        self.low_threshold = 0.3
    
    def rank_questions(self, context, questions):
        """
        Rank questions by answerability.
        
        Args:
            context: The text to answer from
            questions: List of questions
            
        Returns:
            List of dicts with question, answer, score, category
        """
        results = []
        
        for q in questions:
            answer = self.qa(question=q, context=context)
            
            # Categorize
            if answer['score'] >= self.high_threshold:
                category = "Easily Answered"
            elif answer['score'] >= self.low_threshold:
                category = "Partially Answered"
            else:
                category = "Cannot Answer"
            
            results.append({
                'question': q,
                'answer': answer['answer'],
                'score': answer['score'],
                'category': category,
            })
        
        # Sort by score (highest first)
        results.sort(key=lambda x: x['score'], reverse=True)
        
        return results
    
    def get_summary(self, context, questions):
        """
        Generate a summary report of question answerability.
        """
        results = self.rank_questions(context, questions)
        
        lines = []
        lines.append("Question Answerability Report")
        lines.append("=" * 50)
        lines.append(f"Total questions: {len(questions)}")
        lines.append("")
        
        # Group by category
        categories = {"Easily Answered": [], "Partially Answered": [], "Cannot Answer": []}
        for r in results:
            categories[r['category']].append(r)
        
        # Print summary
        icons = {"Easily Answered": "✓", "Partially Answered": "~", "Cannot Answer": "✗"}
        
        for category in ["Easily Answered", "Partially Answered", "Cannot Answer"]:
            count = len(categories[category])
            icon = icons[category]
            lines.append(f"{icon} {category}: {count} questions")
            
            for r in categories[category]:
                lines.append(f"    [{r['score']:.0%}] {r['question']}")
                lines.append(f"          → {r['answer']}")
            lines.append("")
        
        return '\n'.join(lines)


# Test the ranker
ranker = QuestionRanker(qa_squad2)

context = """
SpaceX is an American aerospace manufacturer founded in 2002 by Elon Musk. 
The company's headquarters is located in Hawthorne, California. SpaceX has 
developed the Falcon 9 rocket and the Dragon spacecraft. The Falcon 9 has 
become the most frequently launched rocket in the world. SpaceX's goal is 
to reduce space transportation costs and enable Mars colonization.
"""

questions = [
    "Who founded SpaceX?",
    "When was SpaceX founded?",
    "Where is SpaceX headquarters?",
    "What rockets has SpaceX developed?",
    "What is SpaceX's revenue?",
    "How many employees does SpaceX have?",
    "What is SpaceX's main goal?",
    "Who is the current CEO of SpaceX?",
]

print(ranker.get_summary(context, questions))

---

# Part 5: Mini-Project

## Project: Document Q&A System

**Scenario**: You're building a customer support tool that answers questions about product documentation.

**Your goal**: Build a `DocumentQA` class that:
1. Takes a document (like a product manual) as input
2. Answers user questions from the document
3. Handles cases where the answer isn't in the document
4. Provides confidence levels and source snippets

In [None]:
# MINI-PROJECT: Document Q&A System
# ==================================

class DocumentQA:
    """
    A document-based question answering system.
    """
    
    def __init__(self, use_squad2=True):
        """
        Initialize the Document QA system.
        
        Args:
            use_squad2: Whether to use a SQuAD 2.0 model (better for unanswerable questions)
        """
        if use_squad2:
            self.qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
        else:
            # Falls back to the pipeline's default QA model
            self.qa = pipeline("question-answering")
        
        self.document = None
        self.sections = []
        # Scores below this threshold are reported as "not_found"
        self.confidence_threshold = 0.3
    
    def load_document(self, document_text, section_separator="\n\n"):
        """
        Load a document for Q&A.
        
        Args:
            document_text: The full document text
            section_separator: How to split into sections
        """
        self.document = document_text
        self.sections = [s.strip() for s in document_text.split(section_separator) if s.strip()]
        return len(self.sections)
    
    def ask(self, question):
        """
        Ask a question about the loaded document.
        
        Returns:
            dict with answer, confidence, source, and status
        """
        if not self.document:
            return {'error': 'No document loaded'}
        
        # Try each section, keep best answer
        best_result = None
        best_score = -1
        best_section_idx = -1
        
        for i, section in enumerate(self.sections):
            if len(section) < 10:  # Skip very short sections
                continue
                
            result = self.qa(question=question, context=section)
            
            if result['score'] > best_score:
                best_score = result['score']
                best_result = result
                best_section_idx = i
        
        # Determine status
        if best_score >= 0.7:
            status = "confident"
            status_message = "I found a clear answer."
        elif best_score >= self.confidence_threshold:
            status = "uncertain"
            status_message = "I found a possible answer, but I'm not fully confident."
        else:
            status = "not_found"
            status_message = "I couldn't find a confident answer in the document."
        
        # Use the start of the best-matching section as the source snippet
        section_text = self.sections[best_section_idx] if best_section_idx >= 0 else ""
        source_snippet = (section_text[:200] + "...") if len(section_text) > 200 else section_text
        
        return {
            'question': question,
            'answer': best_result['answer'] if status != 'not_found' else None,
            'confidence': best_score,
            'status': status,
            'status_message': status_message,
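            # 1-based section number; None if no section was long enough to query
            'source_section': best_section_idx + 1 if best_section_idx >= 0 else None,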
            'source_snippet': source_snippet,
        }
    
    def ask_multiple(self, questions):
        """
        Ask multiple questions about the document.
        """
        return [self.ask(q) for q in questions]
    
    def format_response(self, result):
        """
        Format the response for display.
        """
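        # ask() returns an error dict (no 'status' key) when no document is loaded
        if 'error' in result:
            return f"Error: {result['error']}"

        lines = []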
        
        # Status indicator
        status_icons = {'confident': '✓', 'uncertain': '?', 'not_found': '✗'}
        icon = status_icons.get(result['status'], '•')
        
        lines.append(f"Question: {result['question']}")
        lines.append(f"{icon} Status: {result['status_message']}")
        
        if result['answer']:
            lines.append(f"Answer: {result['answer']}")
            lines.append(f"Confidence: {result['confidence']:.0%}")
            lines.append(f"Source: Section {result['source_section']}")
        
        return '\n'.join(lines)


# Create the Document QA system
doc_qa = DocumentQA(use_squad2=True)

In [None]:
# Load a sample product manual
product_manual = """
SmartWidget Pro User Manual - Version 2.0

GETTING STARTED
Thank you for purchasing the SmartWidget Pro. This device features a 5-inch 
touchscreen display, 8GB of storage, and WiFi connectivity. The battery lasts 
up to 12 hours on a single charge. To turn on the device, press and hold the 
power button for 3 seconds.

INITIAL SETUP
When you first turn on your SmartWidget Pro, you will be prompted to select 
your language and connect to a WiFi network. The setup wizard will guide you 
through creating an account and personalizing your settings. Setup typically 
takes about 5 minutes.

CHARGING
Use only the included USB-C charger to charge your SmartWidget Pro. A full 
charge takes approximately 2 hours. The LED indicator turns green when fully 
charged and amber when charging. Never use third-party chargers as they may 
damage the battery.

TROUBLESHOOTING
If your device freezes, hold the power button for 10 seconds to force restart. 
If the screen is unresponsive, ensure the battery is charged. For WiFi issues, 
try toggling WiFi off and on in the Settings menu. For warranty claims, contact 
support@smartwidget.example.com within 1 year of purchase.

WARRANTY INFORMATION
Your SmartWidget Pro comes with a 1-year limited warranty covering manufacturing 
defects. Water damage and physical damage are not covered. To register your 
warranty, visit www.smartwidget.example.com/register with your serial number.
"""

num_sections = doc_qa.load_document(product_manual)
print(f"Document loaded with {num_sections} sections.")
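
Because `load_document` splits on blank lines, the title line becomes its own section, so the sample manual should load as six sections rather than five. The sketch below is one way to inspect a split before querying it. `split_sections` is a hypothetical helper (not part of `DocumentQA`), and it assumes the first line of each section is its heading.

```python
def split_sections(document_text, section_separator="\n\n"):
    """Split text the same way DocumentQA.load_document does, then
    separate each section's first line (assumed heading) from its body."""
    sections = [s.strip() for s in document_text.split(section_separator) if s.strip()]
    parsed = []
    for number, section in enumerate(sections, start=1):
        # partition() splits on the first newline only
        heading, _, body = section.partition("\n")
        parsed.append({"number": number, "heading": heading.strip(), "body": body.strip()})
    return parsed


# Small made-up document for illustration
sample = """
Widget Manual

CHARGING
Use the included charger.
A full charge takes 2 hours.

WARRANTY
One year, limited.
"""

for s in split_sections(sample):
    print(f"Section {s['number']}: {s['heading']!r} ({len(s['body'])} body chars)")
```

A section with an empty body (like the title here) carries no answerable content, so it is a reasonable candidate to skip before calling the model.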

In [None]:
# Test with customer questions
customer_questions = [
    "How long does the battery last?",
    "How do I turn on the device?",
    "What should I do if the screen freezes?",
    "How long is the warranty?",
    "What charger should I use?",
]

print("Customer Support Q&A")
print("="*60)

for q in customer_questions:
    result = doc_qa.ask(q)
    print(doc_qa.format_response(result))
    print("-"*60)

In [None]:
# Test with questions NOT in the manual
unanswerable_questions = [
    "What is the price of the SmartWidget Pro?",
    "Can I use this device underwater?",
    "Does it support Bluetooth?",
]

print("Testing Unanswerable Questions")
print("="*60)

for q in unanswerable_questions:
    result = doc_qa.ask(q)
    print(doc_qa.format_response(result))
    print("-"*60)

In [None]:
# Interactive mode - try your own questions
# Uncomment to use:

# your_question = "Your question here"
# result = doc_qa.ask(your_question)
# print(doc_qa.format_response(result))

### Extension Ideas

If you want to extend this project further:

1. **Semantic search**: Use embeddings to find the most relevant section first
2. **Follow-up questions**: Track conversation context for follow-up questions
3. **Citation generation**: Return the exact quote from the document
4. **Multi-document QA**: Search across multiple documents
5. **Answer reformatting**: Use an LLM to make answers more conversational
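
As a starting point for the citation idea, here is a minimal sketch that expands an answer to the full sentence containing it. It relies on the `start` and `end` character offsets that the `question-answering` pipeline returns alongside `answer`. `quote_sentence` is a hypothetical helper, and its sentence splitting is deliberately naive (abbreviations such as "e.g." would confuse it).

```python
import re


def quote_sentence(context, start, end):
    """Return the full sentence(s) of `context` covering characters [start, end)."""
    # A sentence boundary is the position right after '.', '!' or '?' plus whitespace
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", context)] + [len(context)]
    left = max(b for b in boundaries if b <= start)
    right = min(b for b in boundaries if b >= end)
    return context[left:right].strip()


section = ("Use only the included USB-C charger. A full charge takes "
           "approximately 2 hours. The LED turns green when fully charged.")

# Stand-in for a pipeline result; a real one would come from the QA pipeline
answer = "approximately 2 hours"
start = section.index(answer)
result = {"answer": answer, "start": start, "end": start + len(answer)}

print(quote_sentence(section, result["start"], result["end"]))
```

In `DocumentQA.ask`, the same call could take `best_result['start']` and `best_result['end']` together with the winning section's text.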

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Extractive QA** finds answers by predicting start and end positions within a context - the answer is always a substring of the input

2. **Start/End logits** are per-token scores for where the answer begins and ends in the tokenized text; the predicted answer is the highest-scoring valid start/end pair

3. **Confidence scores** indicate reliability - low scores often mean the question is unanswerable from the context

4. **SQuAD 2.0 models** are specifically trained to handle unanswerable questions

5. **Long documents** require chunking with overlap and merging results by confidence
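
Takeaway 5 can be sketched in a few lines. The example below uses character-based chunks and a toy `fake_qa` function in place of a real pipeline so it runs without downloading a model. `chunk_text`, `answer_long_text`, and `fake_qa` are illustrative names, and the chunk sizes are arbitrary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Yield (offset, chunk) pairs of at most chunk_size characters,
    each sharing `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + chunk_size]


def answer_long_text(qa, question, text, chunk_size=200, overlap=50):
    """Run `qa` on every chunk and keep the highest-scoring result,
    shifting start/end so they index into the full text."""
    best = None
    for offset, chunk in chunk_text(text, chunk_size, overlap):
        result = qa(question=question, context=chunk)
        if best is None or result["score"] > best["score"]:
            best = {**result,
                    "start": result["start"] + offset,
                    "end": result["end"] + offset}
    return best


def fake_qa(question, context):
    """Toy stand-in for a QA pipeline: 'finds' a fixed target string if present."""
    target = "1889"
    pos = context.find(target)
    if pos == -1:
        return {"score": 0.01, "start": 0, "end": 0, "answer": ""}
    return {"score": 0.9, "start": pos, "end": pos + len(target), "answer": target}


long_text = ("Filler sentence. " * 30
             + "The tower was completed in 1889. "
             + "More filler. " * 30)
best = answer_long_text(fake_qa, "When was the tower completed?", long_text)
print(best["answer"], "->", long_text[best["start"]:best["end"]])
```

To try it with a real model, pass a pipeline object (for example `doc_qa.qa`) instead of `fake_qa`. The `question-answering` pipeline also applies its own token-level windowing, controlled by its `max_seq_len` and `doc_stride` arguments.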

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Ignoring confidence scores | The model will always return *something*, even for unanswerable questions |
| Using SQuAD 1.1 models for production | They were trained only on answerable questions, so they tend to return a plausible-looking span even when the context has no answer |
| Not handling long contexts | Answers get truncated or missed |
| Expecting reasoning | Extractive QA can't calculate, compare, or infer |
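
To see why the model always returns *something*, it helps to look at how a span is picked from start and end logits. The sketch below is a simplified version of that decoding step: it scores every valid span as P(start) × P(end) and returns the best one. The logits are made up for illustration, and real pipelines add details (masking question tokens, mapping tokens back to characters) that are omitted here.

```python
import math


def softmax(values):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens; its score is P(start) * P(end)."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up logits for a 5-token context (illustration only)
peaked_start = [0.1, 0.2, 6.0, 0.3, 0.1]
peaked_end = [0.1, 0.1, 0.2, 5.5, 0.3]
flat = [0.1, 0.2, 0.15, 0.1, 0.2]

for name, (s_logits, e_logits) in {"clear answer": (peaked_start, peaked_end),
                                   "no clear answer": (flat, flat)}.items():
    start, end, score = best_span(s_logits, e_logits)
    print(f"{name}: tokens {start}-{end}, score={score:.2f}")
```

Even with nearly flat logits, some span has the highest score, so a span is still returned. Only the low score signals that it should not be trusted.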

## What's Next?

In **Notebook 4: Text Summarization**, you'll learn:
- How to compress long text while preserving key information
- The difference between extractive and abstractive summarization
- Encoder-decoder architectures for text generation

This builds on QA: both tasks require understanding context, but abstractive summarization generates new text instead of extracting spans!

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) Extractive QA** - Extracts answers directly from the context
2. **B) Start and end positions within the context** - The model predicts where the answer begins and ends
3. **B) Both are span extraction tasks using similar architectures** - Same BERT encoder, different output heads
4. **C) The model may return low confidence or empty/incorrect span** - SQuAD 2.0 models are better at this

### Exercise 2: Answerability Classifier (Sample Solution)

In [None]:
# Sample solution for Exercise 2 is provided in the exercise itself
# Key insights:
# 1. Use multiple signals: confidence score, answer length, etc.
# 2. SQuAD 2.0 models give more reliable confidence for unanswerable questions
# 3. You can tune thresholds based on your use case (precision vs recall)

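# Additional test (assumes classify_answerability and qa_squad2 from
# Exercise 2 are already defined in this notebook session):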
test_context = """
The Eiffel Tower was completed in 1889 for the World's Fair. 
It stands 330 meters tall and is located in Paris, France.
"""

test_cases = [
    "When was the Eiffel Tower completed?",  # Answerable
    "How much did it cost to build?",        # Not answerable
]

for q in test_cases:
    result = classify_answerability(qa_squad2, q, test_context)
    print(f"Q: {result['question']}")
    print(f"   Answerable: {result['is_answerable']} ({result['confidence']:.0%})")
    print()
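
Insight 3 is easier to see with numbers. The sketch below sweeps a few thresholds over made-up `(confidence, is_answerable)` pairs and reports precision and recall for the "answerable" class. `precision_recall` is a hypothetical helper, and the data is invented purely for illustration.

```python
def precision_recall(scores_and_labels, threshold):
    """Treat score >= threshold as 'predicted answerable' and compare with
    the true labels (True means the question really is answerable)."""
    tp = sum(1 for score, label in scores_and_labels if score >= threshold and label)
    fp = sum(1 for score, label in scores_and_labels if score >= threshold and not label)
    fn = sum(1 for score, label in scores_and_labels if score < threshold and label)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up (confidence, is_answerable) pairs; not real model output
labeled = [(0.95, True), (0.80, True), (0.55, True), (0.40, False),
           (0.35, True), (0.20, False), (0.05, False)]

for threshold in (0.1, 0.3, 0.5, 0.7):
    p, r = precision_recall(labeled, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision: fewer wrong answers are shown, but more real answers are withheld. Which side matters more depends on the use case.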

---

## Additional Resources

- [Hugging Face QA Pipeline Docs](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.QuestionAnsweringPipeline)
- [SQuAD Dataset](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset
- [SQuAD 2.0 Paper](https://arxiv.org/abs/1806.03822) - Know What You Don't Know
- [BERT for QA](https://arxiv.org/abs/1810.04805) - Original BERT paper, Section 4.2
- [RoBERTa Paper](https://arxiv.org/abs/1907.11692) - Improved BERT pretraining; the basis of the `deepset/roberta-base-squad2` model used in this notebook