# HotpotQA Quickstart - Understanding the Data

This notebook will help you:
1. Load the HotpotQA dataset
2. Understand the data structure
3. Explore examples
4. Test the Mistral API on a few examples

**Goal**: Get familiar with the data before building a RAG pipeline

## Step 1: Load the Data (Simple Way)

In [1]:
import json
from pathlib import Path
import sys

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

Project root: /Users/vatsalpatel/hotpotqa


In [2]:
# Simple function to load data
def load_data_simple(split='dev'):
    """Load HotpotQA data - simple version"""
    if split == 'train':
        file_path = project_root / 'data/raw/hotpot_train_v1.1.json'
    else:
        file_path = project_root / 'data/raw/hotpot_dev_distractor_v1.json'
    
    print(f"Loading from: {file_path}")
    
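    # Read explicitly as UTF-8 so non-ASCII characters load the same way on every platform
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    print(f"✅ Loaded {len(data):,} examples")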
    return data

# Load dev data (smaller, faster to work with)
dev_data = load_data_simple('dev')

Loading from: /Users/vatsalpatel/hotpotqa/data/raw/hotpot_dev_distractor_v1.json
✅ Loaded 7,405 examples


## Step 2: Explore One Example

In [3]:
# Look at the first example
example = dev_data[0]

print("=" * 80)
print("EXAMPLE STRUCTURE")
print("=" * 80)
print(f"\n📝 ID: {example['_id']}")
print(f"\n❓ Question: {example['question']}")
print(f"\n✅ Answer: {example['answer']}")
print(f"\n🔍 Type: {example['type']}")
print(f"\n📊 Level: {example['level']}")
print(f"\n📚 Number of context paragraphs: {len(example['context'])}")
print(f"\n🎯 Number of supporting facts: {len(example['supporting_facts'])}")

EXAMPLE STRUCTURE

📝 ID: 5a8b57f25542995d1e6f1371

❓ Question: Were Scott Derrickson and Ed Wood of the same nationality?

✅ Answer: yes

🔍 Type: comparison

📊 Level: hard

📚 Number of context paragraphs: 10

🎯 Number of supporting facts: 2


In [4]:
# Look at the context structure
print("\n" + "=" * 80)
print("CONTEXT PARAGRAPHS (First 3)")
print("=" * 80)

for i, (title, sentences) in enumerate(example['context'][:3]):
    print(f"\n--- Paragraph {i+1} ---")
    print(f"Title: {title}")
    print(f"Number of sentences: {len(sentences)}")
    print(f"First sentence: {sentences[0][:200]}...")
    print()


CONTEXT PARAGRAPHS (First 3)

--- Paragraph 1 ---
Title: Ed Wood (film)
Number of sentences: 3
First sentence: Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood....


--- Paragraph 2 ---
Title: Scott Derrickson
Number of sentences: 3
First sentence: Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer....


--- Paragraph 3 ---
Title: Woodson, Arkansas
Number of sentences: 5
First sentence: Woodson is a census-designated place (CDP) in Pulaski County, Arkansas, in the United States....
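The two cells above show the shape of a record: seven top-level fields, a `context` made of `[title, sentences]` pairs, and `supporting_facts` made of `[title, sentence_index]` pairs. A minimal sketch of a sanity check for that shape follows. The `validate_example` helper and its rules are illustrative assumptions based only on the printed example, not part of any official HotpotQA tooling.

```python
from typing import Any, Dict, List

# Field names are taken from the example printed above; the checks themselves
# are an illustrative assumption, not official HotpotQA tooling.
REQUIRED_FIELDS = ("_id", "question", "answer", "type", "level",
                   "context", "supporting_facts")


def validate_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in example]
    if problems:
        return problems

    # Each context entry is expected to be [title, [sentence, ...]].
    for i, entry in enumerate(example["context"]):
        if len(entry) != 2 or not isinstance(entry[1], list):
            problems.append(f"context[{i}] is not [title, sentences]")

    # Each supporting fact is expected to point at a title present in context.
    titles = {entry[0] for entry in example["context"] if len(entry) == 2}
    for title, _sent_id in example["supporting_facts"]:
        if title not in titles:
            problems.append(f"supporting fact title not in context: {title}")
    return problems


# A trimmed copy of the first dev example (only the two gold paragraphs kept).
toy = {
    "_id": "5a8b57f25542995d1e6f1371",
    "question": "Were Scott Derrickson and Ed Wood of the same nationality?",
    "answer": "yes",
    "type": "comparison",
    "level": "hard",
    "context": [
        ["Scott Derrickson", ["Scott Derrickson (born July 16, 1966) is an "
                              "American director, screenwriter and producer."]],
        ["Ed Wood", ["Edward Davis Wood Jr. (October 10, 1924 – December 10, "
                     "1978) was an American filmmaker, actor, writer, "
                     "producer, and director."]],
    ],
    "supporting_facts": [["Scott Derrickson", 0], ["Ed Wood", 0]],
}

print(validate_example(toy))
```

Running the same check over `dev_data` before building anything on top of it is a cheap way to catch a truncated or mismatched download.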



In [5]:
# Look at supporting facts
print("=" * 80)
print("SUPPORTING FACTS (Gold Facts Needed to Answer)")
print("=" * 80)

for title, sent_id in example['supporting_facts']:
    # Find the actual sentence
    for para_title, sentences in example['context']:
        if para_title == title:
            if sent_id < len(sentences):
                sentence = sentences[sent_id]
                print(f"\n📌 {title} [sentence {sent_id}]:")
                print(f"   {sentence}")
            break

SUPPORTING FACTS (Gold Facts Needed to Answer)

📌 Scott Derrickson [sentence 0]:
   Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.

📌 Ed Wood [sentence 0]:
   Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.
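The lookup above scans every context paragraph for each supporting fact. One possible alternative, sketched here with a hypothetical helper named `resolve_supporting_facts`, builds a title-to-sentences dictionary once and returns the resolved sentences instead of printing them. It assumes titles are unique within one example's context.

```python
from typing import Dict, List, Tuple


def resolve_supporting_facts(example: dict) -> List[Tuple[str, int, str]]:
    """Map each (title, sentence_index) pair to the sentence text it names."""
    # Assumption: paragraph titles are unique within one example.
    sentences_by_title: Dict[str, List[str]] = {
        title: sentences for title, sentences in example["context"]
    }
    resolved = []
    for title, sent_id in example["supporting_facts"]:
        sentences = sentences_by_title.get(title, [])
        # Skip facts whose index is out of range, as the loop above does.
        if 0 <= sent_id < len(sentences):
            resolved.append((title, sent_id, sentences[sent_id]))
    return resolved


# Toy record: two gold paragraphs from the first dev example, plus one
# deliberately out-of-range fact to show that it is skipped.
toy_example = {
    "context": [
        ["Scott Derrickson", ["Scott Derrickson (born July 16, 1966) is an "
                              "American director, screenwriter and producer."]],
        ["Ed Wood", ["Edward Davis Wood Jr. (October 10, 1924 – December 10, "
                     "1978) was an American filmmaker, actor, writer, "
                     "producer, and director."]],
    ],
    "supporting_facts": [["Scott Derrickson", 0], ["Ed Wood", 0], ["Ed Wood", 7]],
}

for title, sent_id, sentence in resolve_supporting_facts(toy_example):
    print(f"{title} [sentence {sent_id}]: {sentence}")
```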


## Step 3: Analyze the Dataset

In [6]:
# Quick statistics
from collections import Counter

# Question types
types = Counter([ex['type'] for ex in dev_data])
levels = Counter([ex['level'] for ex in dev_data])

# Answer types
yes_no_count = sum(1 for ex in dev_data if ex['answer'].lower() in ['yes', 'no'])

print("=" * 80)
print("DATASET STATISTICS")
print("=" * 80)
print(f"\nTotal examples: {len(dev_data):,}")
print(f"\nQuestion Types:")
for qtype, count in types.items():
    pct = 100 * count / len(dev_data)
    print(f"  - {qtype}: {count:,} ({pct:.1f}%)")

print(f"\nDifficulty Levels:")
for level, count in levels.items():
    pct = 100 * count / len(dev_data)
    print(f"  - {level}: {count:,} ({pct:.1f}%)")

print(f"\nAnswer Types:")
pct_yesno = 100 * yes_no_count / len(dev_data)
print(f"  - Yes/No answers: {yes_no_count:,} ({pct_yesno:.1f}%)")
print(f"  - Span answers: {len(dev_data) - yes_no_count:,} ({100-pct_yesno:.1f}%)")

DATASET STATISTICS

Total examples: 7,405

Question Types:
  - comparison: 1,487 (20.1%)
  - bridge: 5,918 (79.9%)

Difficulty Levels:
  - hard: 7,405 (100.0%)

Answer Types:
  - Yes/No answers: 458 (6.2%)
  - Span answers: 6,947 (93.8%)
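The statistics above cover question types, levels, and answer types. Two further counts may be useful before building retrieval: how many context paragraphs each example carries, and how many distinct gold paragraphs its supporting facts touch. The helpers below are a sketch with hypothetical names, shown on toy records rather than on the real dev set, so no real counts are claimed.

```python
from collections import Counter
from typing import Iterable


def context_size_counts(data: Iterable[dict]) -> Counter:
    """Count examples by number of context paragraphs."""
    return Counter(len(ex["context"]) for ex in data)


def gold_paragraph_counts(data: Iterable[dict]) -> Counter:
    """Count examples by number of distinct titles in their supporting facts."""
    return Counter(
        len({title for title, _ in ex["supporting_facts"]}) for ex in data
    )


# Toy records (made up for illustration, not taken from the dataset).
toy_data = [
    {"context": [["A", ["a0"]], ["B", ["b0"]], ["C", ["c0"]]],
     "supporting_facts": [["A", 0], ["B", 0]]},
    {"context": [["D", ["d0", "d1"]], ["E", ["e0"]]],
     "supporting_facts": [["D", 0], ["D", 1], ["E", 0]]},
]

print(context_size_counts(toy_data))
print(gold_paragraph_counts(toy_data))
```

Calling both functions on `dev_data` would show whether every example really has the same number of paragraphs and gold titles.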


In [7]:
# Look at a few different examples
import random

print("=" * 80)
print("SAMPLE QUESTIONS (Random 5)")
print("=" * 80)

for ex in random.sample(dev_data, 5):
    print(f"\nQ: {ex['question']}")
    print(f"A: {ex['answer']}")
    print(f"Type: {ex['type']}, Level: {ex['level']}")

SAMPLE QUESTIONS (Random 5)

Q: Under Which Stanmore born prime minister was John Gorton a serving minister? 
A: Harold Holt
Type: bridge, Level: hard

Q: Which song by Last One Picked appeared in a 2004 American teen musical comedy film directed by Sara Sugarman?
A: Na Na
Type: bridge, Level: hard

Q: What former city, now the fourth-largest Russian city, was the Belarusian State Technological University evacuated to in 1941?
A: Sverdlovsk
Type: bridge, Level: hard

Q: Which super bowl that took place at the Miami Dolphins home stadium, featured the San Francisco 49ers defeating the San Diego Chargers?
A: Super Bowl XXIX
Type: bridge, Level: hard

Q: Which mountain, Masherbrum or Khunyang Chhish, is a taller mountain?
A: Khunyang Chhish
Type: comparison, Level: hard


## Step 4: Test Mistral API on a Simple Example

In [8]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv(project_root / '.env')

# Check if API key is loaded
if os.getenv('MISTRAL_API_KEY'):
    print("✅ Mistral API key loaded")
else:
    print("❌ No Mistral API key found. Set MISTRAL_API_KEY in .env file")

✅ Mistral API key loaded


In [9]:
# Initialize Mistral client
from mistralai import Mistral

client = Mistral(api_key=os.getenv('MISTRAL_API_KEY'))
model = "mistral-large-latest"

print(f"✅ Mistral client initialized with model: {model}")

✅ Mistral client initialized with model: mistral-large-latest


In [10]:
# Test on one example - WITH ALL CONTEXT (no retrieval yet)
test_example = dev_data[0]

# Format all context paragraphs
context_text = "\n\n".join([
    f"Title: {title}\n{' '.join(sentences)}"
    for title, sentences in test_example['context']
])

# Create prompt
prompt = f"""Answer the question based on the provided context. Give a concise answer.

Context:
{context_text}

Question: {test_example['question']}

Answer:"""

print("Sending request to Mistral...")

response = client.chat.complete(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)

predicted_answer = response.choices[0].message.content.strip()

print("\n" + "=" * 80)
print("MISTRAL PREDICTION (with all context)")
print("=" * 80)
print(f"\nQuestion: {test_example['question']}")
print(f"\n🤖 Predicted: {predicted_answer}")
print(f"\n✅ Ground Truth: {test_example['answer']}")
# Raw case-insensitive comparison: trailing punctuation (e.g. "Yes." vs "yes") counts as a mismatch
print(f"\n📊 Match: {'YES ✓' if predicted_answer.lower() == test_example['answer'].lower() else 'NO ✗'}")

Sending request to Mistral...

MISTRAL PREDICTION (with all context)

Question: Were Scott Derrickson and Ed Wood of the same nationality?

🤖 Predicted: Yes.

✅ Ground Truth: yes

📊 Match: NO ✗
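The run above reports a mismatch even though "Yes." and "yes" agree, because the comparison only lowercases the strings. A common remedy is to normalize both sides before comparing. The sketch below follows the normalization steps used by SQuAD-style evaluation scripts (lowercase, strip punctuation, drop English articles, collapse whitespace). It is an approximation written for this notebook, and the official HotpotQA evaluation script may differ in details.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    """Compare two answers after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)


# The prediction and ground truth from the cell above.
print(exact_match("Yes.", "yes"))
```

This only forgives surface differences. A verbose but correct answer such as a full sentence would still fail exact match.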


## Step 5: Test on Multiple Examples

In [11]:
# Test on 5 examples (be careful of API costs!)
def test_mistral(example, use_all_context=True):
    """Test Mistral on one example"""
    
    # Format context
    if use_all_context:
        context_text = "\n\n".join([
            f"Title: {title}\n{' '.join(sentences)}"
            for title, sentences in example['context']
        ])
    else:
        # Use only the first 3 paragraphs (faster/cheaper, but the gold paragraphs may be left out)
        context_text = "\n\n".join([
            f"Title: {title}\n{' '.join(sentences)}"
            for title, sentences in example['context'][:3]
        ])
    
    prompt = f"""Answer the question based on the context. Be concise.

Context:
{context_text}

Question: {example['question']}

Answer:"""
    
    response = client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    
    return response.choices[0].message.content.strip()

# Test on 5 random examples
print("Testing on 5 random examples...\n")

test_examples = random.sample(dev_data, 5)
correct = 0

for i, ex in enumerate(test_examples, 1):
    print(f"\n{'='*80}")
    print(f"Example {i}/5")
    print(f"{'='*80}")
    
    predicted = test_mistral(ex, use_all_context=False)  # Use only 3 paragraphs
    
    print(f"Q: {ex['question']}")
    print(f"\n🤖 Predicted: {predicted}")
    print(f"✅ Truth: {ex['answer']}")
    
    # Simple exact match check (case-insensitive only; trailing punctuation
    # such as "England." vs "England" still counts as a mismatch)
    if predicted.lower().strip() == ex['answer'].lower().strip():
        print("✓ CORRECT")
        correct += 1
    else:
        print("✗ WRONG")

print(f"\n\n{'='*80}")
print(f"Accuracy: {correct}/{len(test_examples)} ({100*correct/len(test_examples):.0f}%)")
print(f"{'='*80}")
print("\n⚠️ Note: This is just exact string matching. The official eval is more sophisticated.")

Testing on 5 random examples...


Example 1/5
Q: What Ruben Fleischer film did No One's Gonna Love You appear in?

🤖 Predicted: No connection exists in the context. *"No One's Gonna Love You"* is not mentioned in relation to Ruben Fleischer's film *30 Minutes or Less*.
✅ Truth: Zombieland
✗ WRONG

Example 2/5
Q: The city Charles Prince Airport is approximately 16 km northwest of was called Salisbury until what year?

🤖 Predicted: 1982
✅ Truth: 1982
✓ CORRECT

Example 3/5
Q: Which author dedicated a 1985 romance novel to the author who did in 2009 and wrote under the pen name Gwyneth Moore?

🤖 Predicted: None of the authors mentioned in the context.
✅ Truth: Eva Ibbotson
✗ WRONG

Example 4/5
Q: The Lance Todd Trophy is presented at a stadium located in what country?

🤖 Predicted: England.
✅ Truth: England
✗ WRONG

Example 5/5
Q: What is the name of this French former footballer whom football fans refer the football player Wilfred Bamnjo to as?

🤖 Predicted
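Exact string matching, as used in the loop above, gives no credit for partially correct answers. A token-overlap F1 is the usual softer measure. The sketch below is a simplified version written for this notebook: it lowercases, strips punctuation, and splits on whitespace, and it does not remove articles. The official evaluation may tokenize differently, and the long prediction in the usage line is made up for illustration.

```python
import string
from collections import Counter
from typing import List

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _tokens(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = _tokens(prediction)
    truth_tokens = _tokens(truth)
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("England.", "England"))
# Hypothetical verbose prediction scored against a short gold answer.
print(round(token_f1("Harold Holt was prime minister", "Harold Holt"), 3))
```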

## Step 6: Understand Multi-hop Reasoning

In [12]:
# Find a good multi-hop example
def find_multihop_example(data, min_hops=2):
    """Find an example with multiple reasoning hops"""
    for ex in data:
        # Count unique paragraph titles in supporting facts
        unique_titles = set([title for title, _ in ex['supporting_facts']])
        if len(unique_titles) >= min_hops:
            return ex, len(unique_titles)
    return None, 0

multihop_ex, num_hops = find_multihop_example(dev_data)

print("=" * 80)
print(f"MULTI-HOP EXAMPLE ({num_hops} hops)")
print("=" * 80)

print(f"\n❓ Question: {multihop_ex['question']}")
print(f"\n✅ Answer: {multihop_ex['answer']}")
print(f"\n🔍 Type: {multihop_ex['type']}")

print(f"\n\n🎯 SUPPORTING FACTS (Why it needs {num_hops} hops):")
print("=" * 80)

# Show supporting facts from different paragraphs
context_dict = {title: sentences for title, sentences in multihop_ex['context']}

for hop_num, (title, sent_id) in enumerate(multihop_ex['supporting_facts'], 1):
    if title in context_dict and sent_id < len(context_dict[title]):
        sentence = context_dict[title][sent_id]
        print(f"\nHop {hop_num}: From '{title}'")
        print(f"   → {sentence}")

print("\n" + "=" * 80)
print("💡 INSIGHT: The answer requires connecting facts from multiple paragraphs!")
print("   This is why simple retrieval might fail - you need multi-hop reasoning.")
print("=" * 80)

MULTI-HOP EXAMPLE (2 hops)

❓ Question: Were Scott Derrickson and Ed Wood of the same nationality?

✅ Answer: yes

🔍 Type: comparison


🎯 SUPPORTING FACTS (Why it needs 2 hops):

Hop 1: From 'Scott Derrickson'
   → Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.

Hop 2: From 'Ed Wood'
   → Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.

💡 INSIGHT: The answer requires connecting facts from multiple paragraphs!
   This is why simple retrieval might fail - you need multi-hop reasoning.


## Summary & Next Steps

### What You Learned:
1. ✅ HotpotQA data structure (questions, context, supporting facts)
2. ✅ Two types of questions: **bridge** and **comparison**
3. ✅ Multi-hop reasoning requires connecting facts from different paragraphs
4. ✅ Mistral can answer questions when given context, but strict string matching undercounts correct answers such as "Yes." vs "yes"

### The Challenge for RAG:
- In the distractor setting used here, each example has up to **10 paragraphs** (typically 2 gold + 8 distractors)
- You need to **retrieve the right paragraphs** before generating
- For multi-hop questions, you might need to **retrieve multiple times**
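One way to quantify "retrieving the right paragraphs" is to measure what fraction of the gold paragraph titles appear among the top-k retrieved titles. The sketch below uses a hypothetical helper, `gold_title_recall`. The ranked list is simply the first three context titles of the first dev example as printed earlier, standing in for a retriever's output.

```python
from typing import List, Sequence


def gold_title_recall(ranked_titles: Sequence[str],
                      supporting_facts: List[list],
                      k: int) -> float:
    """Fraction of gold paragraph titles found among the top-k ranked titles."""
    gold = {title for title, _ in supporting_facts}
    if not gold:
        return 0.0
    retrieved = set(ranked_titles[:k])
    return len(gold & retrieved) / len(gold)


# First three context titles of the first dev example, treated here as a
# stand-in ranking (an assumption for illustration, not a real retriever).
first_three = ["Ed Wood (film)", "Scott Derrickson", "Woodson, Arkansas"]
gold_facts = [["Scott Derrickson", 0], ["Ed Wood", 0]]

print(gold_title_recall(first_three, gold_facts, k=3))
```

Here only one of the two gold titles is covered, since "Ed Wood (film)" is a distractor for the gold paragraph "Ed Wood". For a two-hop question, any recall below 1.0 means the generator is missing a fact it needs.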

### Next Notebook:
- `02_simple_rag.ipynb` - Build a basic RAG system with retrieval

### What You'll Need for RAG:
1. **Retrieval**: Find relevant paragraphs (BM25, embeddings, or hybrid)
2. **Re-ranking**: Improve retrieved results
3. **Multi-hop**: Chain retrieval for complex questions
4. **Generation**: Use Mistral to generate answers
5. **Evaluation**: Measure EM, F1 scores on dev set
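As a preview of the retrieval step, here is a minimal pure-Python sketch of Okapi BM25 ranking over one example's context paragraphs. The function names, the regex tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration, and the next notebook may well use a library implementation instead.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(question: str,
              paragraphs: List[Tuple[str, List[str]]],
              k1: float = 1.5,
              b: float = 0.75) -> List[Tuple[str, float]]:
    """Score (title, sentences) paragraphs against a question, best first."""
    docs = [tokenize(title + " " + " ".join(sents)) for title, sents in paragraphs]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many paragraphs each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(question))

    scored = []
    for (title, _), doc in zip(paragraphs, docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalization: long paragraphs are penalized via b.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scored.append((title, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Three paragraphs from the first dev example (first sentence of each only).
paragraphs = [
    ("Ed Wood (film)", ["Ed Wood is a 1994 American biographical period "
                        "comedy-drama film directed and produced by Tim Burton, "
                        "and starring Johnny Depp as cult filmmaker Ed Wood."]),
    ("Scott Derrickson", ["Scott Derrickson (born July 16, 1966) is an American "
                          "director, screenwriter and producer."]),
    ("Woodson, Arkansas", ["Woodson is a census-designated place (CDP) in "
                           "Pulaski County, Arkansas, in the United States."]),
]
question = "Were Scott Derrickson and Ed Wood of the same nationality?"

ranked = bm25_rank(question, paragraphs)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

Lexical scoring of this kind cannot tell the distractor "Ed Wood (film)" from the gold paragraph "Ed Wood", which is part of the motivation for the re-ranking and multi-hop steps listed above.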