# Simple RAG System for HotpotQA

Now that you understand the data, let's build a **basic RAG (Retrieval-Augmented Generation) system**.

## Pipeline:
1. **Retrieve** relevant paragraphs from the 10 provided with each question
2. **Generate** an answer using Mistral with the retrieved context
3. **Evaluate** on the dev set

We'll start with simple retrieval and improve it step by step.

In [1]:
import json
from pathlib import Path
import sys
from dotenv import load_dotenv
import os

# Setup
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))
load_dotenv(project_root / '.env')

print(f"Project root: {project_root}")

Project root: /Users/vatsalpatel/hotpotqa


In [2]:
# Load data
def load_data(split='dev'):
    if split == 'train':
        file_path = project_root / 'data/raw/hotpot_train_v1.1.json'
    else:
        file_path = project_root / 'data/raw/hotpot_dev_distractor_v1.json'
    
    with open(file_path, 'r') as f:
        data = json.load(f)
    
    print(f"✅ Loaded {len(data):,} {split} examples")
    return data

dev_data = load_data('dev')

✅ Loaded 7,405 dev examples


In [3]:
def simple_keyword_retrieval(question, context_paragraphs, top_k=3):
    """
    Simple retrieval: score paragraphs by keyword overlap with question
    
    Args:
        question: Question string
        context_paragraphs: List of [title, sentences] pairs
        top_k: Number of paragraphs to retrieve
    
    Returns:
        List of top_k paragraphs as formatted strings
    """
    question_words = set(question.lower().split())
    
    # Score each paragraph
    scored = []
    for title, sentences in context_paragraphs:
        # Combine title and sentences
        text = title + ' ' + ' '.join(sentences)
        text_lower = text.lower()
        
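        # Count question words in paragraph.
        # Note: `in` on a string is a substring test, so "of" also matches
        # inside "professor", and "nationality?" keeps its question mark.
        score = sum(1 for word in question_words if word in text_lower)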
        
        scored.append((score, title, sentences, text))
    
    # Sort by score (highest first) and take top_k
    scored.sort(reverse=True, key=lambda x: x[0])
    
    # Format retrieved paragraphs
    retrieved = []
    for score, title, sentences, text in scored[:top_k]:
        retrieved.append(f"Title: {title}\n{' '.join(sentences)}")
    
    return retrieved

# Test on one example
test_ex = dev_data[0]
retrieved = simple_keyword_retrieval(test_ex['question'], test_ex['context'], top_k=3)

print("Question:", test_ex['question'])
print("\nRetrieved paragraphs:")
for i, para in enumerate(retrieved, 1):
    print(f"\n{i}. {para[:200]}...")

Question: Were Scott Derrickson and Ed Wood of the same nationality?

Retrieved paragraphs:

1. Title: Doctor Strange (2016 film)
Doctor Strange is a 2016 American superhero film based on the Marvel Comics character of the same name, produced by Marvel Studios and distributed by Walt Disney Stud...

2. Title: Tyler Bates
Tyler Bates (born June 5, 1965) is an American musician, music producer, and composer for films, television, and video games.  Much of his work is in the action and horror film genr...

3. Title: Deliver Us from Evil (2014 film)
Deliver Us from Evil is a 2014 American supernatural horror film directed by Scott Derrickson and produced by Jerry Bruckheimer.  The film is officially based o...

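The retrieved paragraphs above miss the pages for the two people named in the question. One likely contributor is that the scoring uses substring tests on whitespace-split words. The following is a minimal sketch of a token-based variant, not the notebook's method: `tokenize`, `token_overlap_retrieval`, and the small `STOPWORDS` set are illustrative names and choices, and the toy context is made up.

```python
import re

# Illustrative stopword list (an assumption, not part of the notebook)
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "is",
             "are", "was", "were", "by", "for", "with", "same", "what",
             "which", "who", "did", "do", "does"}

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_overlap_retrieval(question, context_paragraphs, top_k=3):
    """Rank paragraphs by how many distinct non-stopword question tokens they contain."""
    question_tokens = set(tokenize(question)) - STOPWORDS
    scored = []
    for title, sentences in context_paragraphs:
        paragraph_tokens = set(tokenize(title + " " + " ".join(sentences)))
        score = len(question_tokens & paragraph_tokens)
        scored.append((score, title, sentences))
    # list.sort is stable, so tied paragraphs keep their original order
    scored.sort(key=lambda item: item[0], reverse=True)
    return [f"Title: {title}\n{' '.join(sentences)}"
            for _, title, sentences in scored[:top_k]]

# Toy context (made up) in the same [title, sentences] shape as HotpotQA
toy_context = [
    ["Professor X", ["A professor offers samenesses."]],
    ["Ed Wood", ["Ed Wood was an American filmmaker."]],
    ["Scott Derrickson", ["Scott Derrickson is an American director."]],
]
question = "Were Scott Derrickson and Ed Wood of the same nationality?"
top = token_overlap_retrieval(question, toy_context, top_k=2)
for paragraph in top:
    print(paragraph.split("\n")[0])
```

Because the function takes the same arguments as `simple_keyword_retrieval`, it could be swapped into `SimpleRAG.retrieve` to compare the two on the same examples.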

## Build Complete RAG Class

In [4]:
from mistralai import Mistral

class SimpleRAG:
    """Simple RAG system for HotpotQA"""
    
    def __init__(self, api_key=None, model="mistral-large-latest"):
        self.client = Mistral(api_key=api_key or os.getenv('MISTRAL_API_KEY'))
        self.model = model
    
    def retrieve(self, question, context_paragraphs, top_k=3):
        """Retrieve top_k relevant paragraphs"""
        return simple_keyword_retrieval(question, context_paragraphs, top_k)
    
    def generate(self, question, context_paragraphs):
        """Generate answer using Mistral"""
        context_text = "\n\n".join(context_paragraphs)
        
        prompt = f"""Answer the question based on the provided context. Be concise and direct.

Context:
{context_text}

Question: {question}

Answer:"""
        
        response = self.client.chat.complete(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        
        return response.choices[0].message.content.strip()
    
    def answer(self, example, top_k=3, verbose=False):
        """
        Complete RAG pipeline for one example
        
        Returns:
            dict with 'answer' and 'retrieved_paragraphs'
        """
        # Step 1: Retrieve
        retrieved = self.retrieve(example['question'], example['context'], top_k)
        
        if verbose:
            print(f"Retrieved {len(retrieved)} paragraphs")
        
        # Step 2: Generate
        answer = self.generate(example['question'], retrieved)
        
        return {
            'answer': answer,
            'retrieved_paragraphs': retrieved
        }

# Initialize RAG
rag = SimpleRAG()
print("✅ RAG system initialized")

✅ RAG system initialized

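Since `generate` calls a remote API, a rate limit or transient network error can interrupt a longer run. Below is a minimal sketch of retrying with exponential backoff. `call_with_retries` and its parameters are hypothetical helpers, not part of the Mistral client, and the flaky function only simulates failures.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Simulated flaky call: fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the API call, for example `call_with_retries(lambda: rag.generate(question, retrieved))`. Catching every `Exception` is a simplification; narrowing it to the client's rate-limit or timeout errors would avoid retrying on genuine bugs.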

## Test on a Few Examples

In [None]:
# Test on 3 examples
import random

test_examples = random.sample(dev_data, 3)

for i, ex in enumerate(test_examples, 1):
    print("\n" + "="*80)
    print(f"Example {i}/3")
    print("="*80)
    
    result = rag.answer(ex, top_k=3, verbose=True)
    
    print(f"\nQuestion: {ex['question']}")
    print(f"\nü§ñ Predicted: {result['answer']}")
    print(f"‚úÖ Ground Truth: {ex['answer']}")
    print(f"\nüìä Match: {'YES ‚úì' if result['answer'].lower() == ex['answer'].lower() else 'NO ‚úó'}")

## Evaluate on Larger Sample

⚠️ **Warning**: This will use API calls. Start with 10-20 examples to test.

In [None]:
from tqdm import tqdm

def evaluate_rag(rag, examples, top_k=3):
    """
    Evaluate RAG on a set of examples
    
    Returns:
        predictions dict in HotpotQA format
    """
    predictions = {
        'answer': {},
        'sp': {}  # supporting facts (we'll skip for now)
    }
    
    correct = 0
    total = len(examples)
    
    for ex in tqdm(examples, desc="Evaluating"):
        try:
            result = rag.answer(ex, top_k=top_k)
            
            predictions['answer'][ex['_id']] = result['answer']
            predictions['sp'][ex['_id']] = []  # Empty for now
            
            # Simple exact match check
            if result['answer'].lower().strip() == ex['answer'].lower().strip():
                correct += 1
        
        except Exception as e:
            print(f"Error on {ex['_id']}: {e}")
            predictions['answer'][ex['_id']] = ""
            predictions['sp'][ex['_id']] = []
    
    accuracy = correct / total if total > 0 else 0
    
    print(f"\n✅ Simple Accuracy: {correct}/{total} ({100*accuracy:.1f}%)")
    
    return predictions

# Test on 20 examples first (adjust as needed)
sample_size = 20
print(f"Testing on {sample_size} examples...")
print("⚠️  This will make API calls. Cancel if you want to reduce sample size.\n")

sample_data = dev_data[:sample_size]
predictions = evaluate_rag(rag, sample_data, top_k=3)

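The simple check in `evaluate_rag` only lowercases and strips whitespace, so a prediction such as "Yes." does not match "yes". Here is a sketch of SQuAD-style answer normalization with exact match and token-level F1, which is the general approach used by extractive QA evaluation scripts. The function names are illustrative, and the official HotpotQA script additionally special-cases yes/no answers in F1, which this sketch omits.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes.", "yes"))
print(token_f1("American film director", "American"))
```

A verbose prediction still fails exact match but earns partial F1, which is why the two metrics can diverge sharply when the model answers in full sentences.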
## Save Predictions and Use Official Evaluation

In [None]:
# Save predictions
output_path = project_root / 'predictions_simple_rag.json'
with open(output_path, 'w') as f:
    json.dump(predictions, f)

print(f"✅ Predictions saved to {output_path}")

In [None]:
# Use official evaluation (for the subset we tested)
# We need to create a subset of dev data with only our tested examples

tested_ids = set(predictions['answer'].keys())
subset_dev = [ex for ex in dev_data if ex['_id'] in tested_ids]

subset_path = project_root / 'dev_subset.json'
with open(subset_path, 'w') as f:
    json.dump(subset_dev, f)

print(f"Created subset with {len(subset_dev)} examples")

In [None]:
# Run official evaluation
# Import under an alias so the built-in eval() is not shadowed
from evaluation.eval import eval as hotpot_eval

print("Running official HotpotQA evaluation...\n")
hotpot_eval(str(output_path), str(subset_path))

## Analyze Results: Where Does It Fail?

In [None]:
# Find examples where prediction was wrong
wrong_examples = []

for ex in sample_data:
    pred = predictions['answer'].get(ex['_id'], '')
    if pred.lower().strip() != ex['answer'].lower().strip():
        wrong_examples.append({
            'question': ex['question'],
            'predicted': pred,
            'truth': ex['answer'],
            'type': ex['type'],
            'supporting_facts': ex['supporting_facts']
        })

print(f"Found {len(wrong_examples)} wrong predictions\n")

# Show first 3 failures
for i, err in enumerate(wrong_examples[:3], 1):
    print(f"\n{'='*80}")
    print(f"Failure {i}")
    print(f"{'='*80}")
    print(f"Question: {err['question']}")
    print(f"Type: {err['type']}")
    print(f"Predicted: {err['predicted']}")
    print(f"Truth: {err['truth']}")
    print(f"Supporting facts needed: {len(err['supporting_facts'])}")

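A wrong answer can come from retrieval (the needed paragraphs were never shown to the model) or from generation. One way to separate the two is to check how many gold supporting titles appear among the retrieved paragraphs. The sketch below assumes `supporting_facts` entries are `[title, sentence_index]` pairs and that retrieved paragraphs are strings starting with `Title: <title>`, as produced by the retrieval function above. The helper names and toy data are made up for illustration.

```python
from collections import Counter

TITLE_PREFIX = "Title: "

def retrieved_titles(retrieved_paragraphs):
    """Extract the title from each formatted paragraph string."""
    titles = []
    for para in retrieved_paragraphs:
        first_line = para.split("\n", 1)[0]
        if first_line.startswith(TITLE_PREFIX):
            first_line = first_line[len(TITLE_PREFIX):]
        titles.append(first_line)
    return titles

def supporting_title_recall(supporting_facts, retrieved_paragraphs):
    """Fraction of gold supporting titles found among the retrieved paragraphs."""
    gold = {title for title, _sentence_idx in supporting_facts}
    if not gold:
        return 1.0
    return len(gold & set(retrieved_titles(retrieved_paragraphs))) / len(gold)

# Toy data (made up for illustration)
toy_supporting_facts = [["Paragraph A", 0], ["Paragraph B", 2]]
toy_retrieved = ["Title: Paragraph A\nSome text.", "Title: Paragraph C\nOther text."]
recall = supporting_title_recall(toy_supporting_facts, toy_retrieved)
print(f"Supporting-title recall: {recall:.2f}")

# Failure counts per question type, in the shape of `wrong_examples`
toy_errors = [{"type": "bridge"}, {"type": "comparison"}, {"type": "bridge"}]
errors_by_type = Counter(err["type"] for err in toy_errors)
print(errors_by_type.most_common())
```

Retrieval runs locally, so calling `rag.retrieve(ex['question'], ex['context'])` again for each failed example costs no API calls. A recall below 1.0 on a failure points at retrieval rather than the prompt.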
## Next Steps to Improve

### Current Limitations:
1. ❌ **Simple keyword matching** - doesn't understand semantics
2. ❌ **No multi-hop reasoning** - retrieves once, doesn't chain
3. ❌ **No re-ranking** - just uses simple word overlap
4. ❌ **Fixed top_k** - doesn't adapt to question complexity

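To make the multi-hop limitation concrete, here is a rough sketch of a two-hop variant. It picks the best paragraph for the question, adds that paragraph's tokens to the query, and then picks the best of the remaining paragraphs. This is only one simple way to chain retrieval, not an established HotpotQA method. The function names and the toy paragraphs are invented for illustration.

```python
import re

def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def paragraph_text(paragraph):
    title, sentences = paragraph
    return title + " " + " ".join(sentences)

def overlap(query_tokens, paragraph):
    """Number of query tokens that occur in the paragraph."""
    return len(query_tokens & tokenize(paragraph_text(paragraph)))

def two_hop_retrieval(question, context_paragraphs):
    """Hop 1: best match for the question. Hop 2: best match for question + hop-1 text."""
    query = tokenize(question)
    first = max(context_paragraphs, key=lambda p: overlap(query, p))
    expanded = query | tokenize(paragraph_text(first))
    remaining = [p for p in context_paragraphs if p is not first]
    if not remaining:
        return [first]
    second = max(remaining, key=lambda p: overlap(expanded, p))
    return [first, second]

# Toy bridge-style example (all content invented)
toy_context = [
    ["Moonlit Harbor", ["Moonlit Harbor is a film directed by Jane Roe."]],
    ["Harbor Seal", ["Harbor seals live along coasts."]],
    ["Jane Roe", ["Jane Roe is from Oslo."]],
]
question = "Where does the director of Moonlit Harbor come from?"
hops = two_hop_retrieval(question, toy_context)
print([title for title, _ in hops])
```

The second paragraph is reachable here only through the name introduced by the first, which is the pattern bridge questions rely on. Expanding the query with every token of the first paragraph is crude and can drift toward off-topic paragraphs on real data.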
### Improvements (in order of impact):

1. **Better Retrieval** (Notebook 3)
   - Use BM25 (classic IR)
   - Or use embedding-based retrieval (sentence-transformers)
   - Or hybrid (BM25 + embeddings)

2. **Multi-hop Retrieval** (Notebook 4)
   - Retrieve → Read → Retrieve again
   - Build reasoning chains

3. **Better Prompting** (Quick Win)
   - Add reasoning instructions
   - Use chain-of-thought
   - Extract supporting facts

4. **Re-ranking**
   - Re-score retrieved paragraphs with cross-encoder

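As a preview of the first improvement, the following is a compact pure-Python sketch of Okapi BM25 scoring over the paragraphs of one example. The `k1=1.5` and `b=0.75` defaults are common choices rather than tuned values, the IDF uses the non-negative `log(... + 1)` variant, and `bm25_rank` is an illustrative name. A maintained library implementation would likely be preferable in practice.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(question, context_paragraphs, k1=1.5, b=0.75):
    """Return (title, score) pairs sorted by BM25 score, highest first."""
    docs = [tokenize(title + " " + " ".join(sentences))
            for title, sentences in context_paragraphs]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs if n_docs else 0.0

    # Document frequency: in how many paragraphs each term occurs
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    query_terms = set(tokenize(question))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)

    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [(context_paragraphs[i][0], scores[i]) for i in order]

# Toy context (made up) in the [title, sentences] shape used above
toy_context = [
    ["Film A", ["The film is a horror film released in the year."]],
    ["Ed Wood", ["Ed Wood was an American filmmaker."]],
    ["Scott Derrickson", ["Scott Derrickson is an American director."]],
]
ranked = bm25_rank("Scott Derrickson director", toy_context)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

Compared with raw overlap counts, BM25 down-weights terms that occur in many paragraphs and normalizes for paragraph length, so common words contribute less without needing a stopword list.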
### Current Performance Baseline:
With keyword retrieval and the prompt above, you should see roughly:
- **EM**: 20-30% (exact match)
- **F1**: 30-40% (token overlap)

### Target Performance:
- **Good RAG system**: 50-60% EM, 60-70% F1
- **SOTA**: 70%+ EM, 80%+ F1