# HotpotQA with Full Wikipedia Dump RAG

This notebook demonstrates:
1. Loading the full Wikipedia dump (enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2)
2. Building a simple retrieval index using embeddings
3. Answering questions using RAG with Mistral

**Dataset**: HotpotQA with full Wikipedia as knowledge base

## Step 1: Setup and Imports

In [1]:
import json
import tarfile
import bz2
from pathlib import Path
import sys
import pickle
from typing import List, Dict, Tuple
import numpy as np
from tqdm.auto import tqdm

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

Project root: /Users/vatsalpatel/hotpotqa


## Step 2: Load Wikipedia Dump

The Wikipedia dump is a tar.bz2 archive containing many smaller .bz2 files, each with one JSON article per line. Each article contains:
- Title
- Text/Abstract
- Links to other articles
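
To make the record shape concrete, here is a minimal sketch of turning one JSON line into the flat `title`/`text` record used in the rest of the notebook. The field names follow the loader's docstring below; the sample values and the helper name `parse_article_line` are made up for illustration.

```python
import json
from typing import Dict


def parse_article_line(line: str) -> Dict[str, str]:
    """Parse one JSON line and flatten its sentence list into a single string."""
    record = json.loads(line)
    text = record.get("text", [])
    if isinstance(text, list):
        text = " ".join(text)  # sentences are stored as a list
    return {
        "id": str(record.get("id", "")),
        "title": record.get("title", ""),
        "text": str(text),
    }


# Hypothetical sample line, not taken from the real dump.
sample_line = json.dumps({
    "id": "0",
    "title": "Example Article",
    "text": ["First sentence.", "Second sentence."],
})
article = parse_article_line(sample_line)
print(article["title"], "->", article["text"])
```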

In [2]:
# Path to Wikipedia dump
wiki_dump_path = project_root / 'data/raw/enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2'

print(f"Wikipedia dump path: {wiki_dump_path}")
print(f"File exists: {wiki_dump_path.exists()}")

if wiki_dump_path.exists():
    file_size_gb = wiki_dump_path.stat().st_size / (1024**3)
    print(f"File size: {file_size_gb:.2f} GB")

Wikipedia dump path: /Users/vatsalpatel/hotpotqa/data/raw/enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2
File exists: True
File size: 1.45 GB


In [3]:
def load_wikipedia_dump(dump_path: Path, max_articles: int = None) -> List[Dict]:
    """
    Load Wikipedia articles from tar.bz2 dump.

    The dump contains multiple .bz2 files, each with one JSON object per line.
    Each JSON has: id, url, title, text (list of sentences), text_with_links, etc.

    Args:
        dump_path: Path to the tar.bz2 file
        max_articles: Maximum number of articles to load (for testing)

    Returns:
        List of article dictionaries with 'title' and 'text' keys
    """
    articles = []

    print(f"Opening Wikipedia dump: {dump_path.name}")

    with tarfile.open(dump_path, 'r:bz2') as tar:
        members = tar.getmembers()
        print(f"Total files in archive: {len(members)}")

        # Filter to only process .bz2 files (not directories)
        bz2_members = [m for m in members if m.name.endswith('.bz2') and m.isfile()]
        print(f"BZ2 files found: {len(bz2_members)}")

        if max_articles:
            # Estimate how many files to process based on max_articles
            # Assuming ~100-500 articles per file
            max_files = min(max(1, max_articles // 100), len(bz2_members))
            bz2_members = bz2_members[:max_files]
            print(f"Processing first {len(bz2_members)} files to get ~{max_articles} articles...")

        for member in tqdm(bz2_members, desc="Processing files"):
            if max_articles and len(articles) >= max_articles:
                break

            try:
                # Extract the compressed file
                f = tar.extractfile(member)
                if f is None:
                    continue

                # Decompress the bz2 content
                decompressed = bz2.decompress(f.read())

                # Each line is a separate JSON object
                for line in decompressed.decode('utf-8').strip().split('\n'):
                    if not line.strip():
                        continue

                    try:
                        article_data = json.loads(line)

                        # Extract title and text
                        title = article_data.get('title', '')

                        # Text is stored as a list of sentences
                        text_list = article_data.get('text', [])
                        if isinstance(text_list, list):
                            text = ' '.join(text_list)
                        else:
                            text = str(text_list)

                        if title and text:
                            articles.append({
                                'id': article_data.get('id', ''),
                                'title': title,
                                'text': text,
                                'url': article_data.get('url', '')
                            })

                        if max_articles and len(articles) >= max_articles:
                            break

                    except json.JSONDecodeError:
                        # Skip malformed JSON lines
                        continue

            except Exception as e:
                print(f"Error processing {member.name}: {e}")
                continue

    print(f"\n✅ Loaded {len(articles):,} Wikipedia articles")
    return articles

In [4]:
# Load a sample of Wikipedia articles for testing
# Set max_articles=None to load all articles (will take time!)
wiki_articles = load_wikipedia_dump(wiki_dump_path, max_articles=1000)

# Show first article
if wiki_articles:
    print("\n" + "="*80)
    print("SAMPLE ARTICLE")
    print("="*80)
    print(f"Title: {wiki_articles[0]['title']}")
    print(f"\nText preview (first 500 chars):")
    print(wiki_articles[0]['text'][:500] + "...")

Opening Wikipedia dump: enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2
Total files in archive: 15674
BZ2 files found: 15517
Processing first 10 files to get ~1000 articles...



✅ Loaded 1,000 Wikipedia articles

SAMPLE ARTICLE
Title: One Night Stand (1984 film)

Text preview (first 500 chars):
One Night Stand is a 1984 film directed by John Duigan....
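
`load_wikipedia_dump` calls `tar.getmembers()` and decompresses each member fully in memory, which may get slow or memory-hungry when `max_articles=None`. The following is a sketch of a streaming alternative, not a drop-in replacement: it iterates over the archive lazily and reads each member line by line. The generator name `iter_articles`, the demo archive, and the member name `AA/wiki_00.bz2` are illustrative assumptions.

```python
import bz2
import io
import json
import tarfile
from typing import Dict, Iterator


def iter_articles(tar: tarfile.TarFile) -> Iterator[Dict]:
    """Yield one parsed JSON record per non-empty line, member by member."""
    for member in tar:  # lazy iteration instead of tar.getmembers()
        if not (member.isfile() and member.name.endswith(".bz2")):
            continue
        raw = tar.extractfile(member)
        if raw is None:
            continue
        # bz2.open accepts a file object and decompresses incrementally.
        with bz2.open(raw, "rt", encoding="utf-8") as lines:
            for line in lines:
                if line.strip():
                    yield json.loads(line)


def make_demo_archive() -> bytes:
    """Build a tiny in-memory archive with the same nesting (tar.bz2 of .bz2 files)."""
    records = [{"title": "A", "text": ["One."]}, {"title": "B", "text": ["Two."]}]
    payload = bz2.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as tar:
        info = tarfile.TarInfo(name="AA/wiki_00.bz2")  # illustrative member name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


with tarfile.open(fileobj=io.BytesIO(make_demo_archive()), mode="r:bz2") as tar:
    demo_titles = [record["title"] for record in iter_articles(tar)]
print(demo_titles)
```

For the real dump, the same generator could be fed `tarfile.open(wiki_dump_path, 'r:bz2')` and stopped early with `itertools.islice`.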


## Step 3: Build Simple Retrieval Index

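We'll use sentence-transformers to embed every article once. At query time the question is embedded with the same model and compared against all article vectors.
This is a simple brute-force dense retrieval approach.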
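
The scoring step behind dense retrieval can be sketched with NumPy alone, using toy 2-dimensional vectors in place of real embeddings. The function name `cosine_top_k` is hypothetical, and the sketch assumes no vector has zero length.

```python
from typing import List, Tuple

import numpy as np


def cosine_top_k(doc_vectors: np.ndarray, query_vector: np.ndarray, k: int) -> List[Tuple[int, float]]:
    """Return (document index, cosine similarity) pairs for the k best documents."""
    # Normalize to unit length so the dot product equals cosine similarity.
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query = query_vector / np.linalg.norm(query_vector)
    scores = docs @ query
    k = min(k, len(scores))
    # argpartition finds the k best without sorting the whole array.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return [(int(i), float(scores[i])) for i in top]


toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]])
toy_query = np.array([1.0, 0.0])
top_two = cosine_top_k(toy_docs, toy_query, k=2)
print(top_two)
```

With 1,000 documents a full `argsort` is just as fast; `argpartition` starts to matter only when the index grows toward millions of rows.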

In [5]:
# Install sentence-transformers if not already installed
!pip install sentence-transformers -q

In [6]:
from sentence_transformers import SentenceTransformer
import torch

# Load embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"✅ Model loaded: {embedding_model.get_sentence_embedding_dimension()} dimensions")



Loading embedding model...
✅ Model loaded: 384 dimensions


In [7]:
def build_document_index(articles: List[Dict], batch_size: int = 32) -> Tuple[np.ndarray, List[str]]:
    """
    Build embeddings index for Wikipedia articles.
    
    Args:
        articles: List of article dictionaries
        batch_size: Batch size for embedding generation
    
    Returns:
        Tuple of (embeddings array, list of titles)
    """
    print(f"Building index for {len(articles):,} articles...")
    
    # Prepare text for embedding: combine title and text
    texts = [f"{article['title']}. {article['text'][:500]}" for article in articles]
    titles = [article['title'] for article in articles]
    
    # Generate embeddings in batches
    embeddings = embedding_model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    
    print(f"✅ Index built: {embeddings.shape}")
    return embeddings, titles

In [8]:
# Build the index
wiki_embeddings, wiki_titles = build_document_index(wiki_articles)

print(f"\nIndex statistics:")
print(f"  - Number of documents: {len(wiki_articles):,}")
print(f"  - Embedding dimensions: {wiki_embeddings.shape[1]}")
print(f"  - Total size in memory: {wiki_embeddings.nbytes / (1024**2):.2f} MB")

Building index for 1,000 articles...


✅ Index built: (1000, 384)

Index statistics:
  - Number of documents: 1,000
  - Embedding dimensions: 384
  - Total size in memory: 1.46 MB


## Step 4: Implement Retrieval Function

In [9]:
def retrieve_documents(query: str, top_k: int = 5) -> List[Tuple[str, str, float]]:
    """
    Retrieve most relevant documents for a query.
    
    Args:
        query: Query string
        top_k: Number of documents to retrieve
    
    Returns:
        List of (title, text, score) tuples
    """
    # Encode query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    
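    # Cosine similarity: dot product divided by the vector norms
    # (a harmless rescaling if the model already returns unit-length vectors)
    norms = np.linalg.norm(wiki_embeddings, axis=1) * np.linalg.norm(query_embedding)
    similarities = np.dot(wiki_embeddings, query_embedding.T).flatten() / norms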
    
    # Get top-k indices
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    # Return results
    results = []
    for idx in top_indices:
        results.append((
            wiki_articles[idx]['title'],
            wiki_articles[idx]['text'],
            float(similarities[idx])
        ))
    
    return results

In [10]:
# Test retrieval
test_query = "Who was the first president of the United States?"

print(f"Query: {test_query}")
print("\n" + "="*80)
print("RETRIEVED DOCUMENTS")
print("="*80)

retrieved = retrieve_documents(test_query, top_k=3)

for i, (title, text, score) in enumerate(retrieved, 1):
    print(f"\n{i}. {title} (score: {score:.4f})")
    print(f"   {text[:200]}...")

Query: Who was the first president of the United States?

RETRIEVED DOCUMENTS

1. William Everhart (score: 0.3182)
   William Everhart (May 17, 1785 – October 30, 1868) was an entrepreneur and wealthy businessman from Pennsylvania.  He was responsible for developing much of West Chester and stimulating its economic g...

2. Isaac Newton Evans (score: 0.3090)
   Isaac Evans (July 29, 1827 – December 3, 1901) was a Republican member of the U.S. House of Representatives from Pennsylvania....

3. Lyndon Hardy (score: 0.3008)
   Lyndon Mauriece Hardy is an American physicist, fantasy author, and business owner....


## Step 5: Load HotpotQA Question

In [11]:
# Load HotpotQA dev data
hotpotqa_path = project_root / 'data/raw/hotpot_dev_distractor_v1.json'

with open(hotpotqa_path, 'r') as f:
    hotpotqa_data = json.load(f)

print(f"✅ Loaded {len(hotpotqa_data):,} HotpotQA questions")

# Pick a test question
test_example = hotpotqa_data[0]

print("\n" + "="*80)
print("TEST QUESTION")
print("="*80)
print(f"Question: {test_example['question']}")
print(f"Answer: {test_example['answer']}")
print(f"Type: {test_example['type']}")

✅ Loaded 7,405 HotpotQA questions

TEST QUESTION
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Answer: yes
Type: comparison


## Step 6: Implement Simple RAG Pipeline

In [12]:
# Load Mistral API
from dotenv import load_dotenv
import os
from mistralai import Mistral

load_dotenv(project_root / '.env')

mistral_client = Mistral(api_key=os.getenv('MISTRAL_API_KEY'))
model = "mistral-large-latest"

print("✅ Mistral client initialized")

✅ Mistral client initialized


In [13]:
def answer_question_with_rag(question: str, top_k: int = 5, verbose: bool = True) -> str:
    """
    Answer a question using RAG: Retrieve + Generate.
    
    Args:
        question: Question to answer
        top_k: Number of documents to retrieve
        verbose: Print retrieval details
    
    Returns:
        Generated answer
    """
    # Step 1: Retrieve relevant documents
    if verbose:
        print("🔍 Retrieving relevant documents...")
    
    retrieved_docs = retrieve_documents(question, top_k=top_k)
    
    if verbose:
        print(f"\nRetrieved {len(retrieved_docs)} documents:")
        for i, (title, _, score) in enumerate(retrieved_docs, 1):
            print(f"  {i}. {title} (score: {score:.4f})")
    
    # Step 2: Format context
    context_parts = []
    for i, (title, text, _) in enumerate(retrieved_docs, 1):
        context_parts.append(f"Document {i}: {title}\n{text}")
    
    context = "\n\n".join(context_parts)
    
    # Step 3: Generate answer
    prompt = f"""Answer the question based on the provided documents. Be concise and direct.

Context:
{context}

Question: {question}

Answer:"""
    
    if verbose:
        print("\n🤖 Generating answer...")
    
    response = mistral_client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    
    answer = response.choices[0].message.content.strip()
    
    return answer

## Step 7: Test RAG on HotpotQA Question

In [14]:
# Answer the test question
print("="*80)
print("RAG PIPELINE - ANSWERING QUESTION")
print("="*80)
print(f"\nQuestion: {test_example['question']}\n")

predicted_answer = answer_question_with_rag(
    test_example['question'],
    top_k=5,
    verbose=True
)

print("\n" + "="*80)
print("RESULTS")
print("="*80)
print(f"\n🤖 Predicted Answer: {predicted_answer}")
print(f"✅ Ground Truth: {test_example['answer']}")

# Simple match check
is_correct = predicted_answer.lower().strip() == test_example['answer'].lower().strip()
print(f"\n{'✓ CORRECT' if is_correct else '✗ DIFFERENT'}")

RAG PIPELINE - ANSWERING QUESTION

Question: Were Scott Derrickson and Ed Wood of the same nationality?

🔍 Retrieving relevant documents...

Retrieved 5 documents:
  1. Kevin McDonald (footballer, born 1988) (score: 0.3916)
  2. Britt Woodman (score: 0.3907)
  3. Kenny Nolan (score: 0.3632)
  4. Robert Fisher Tomes (score: 0.3523)
  5. Sam Wilder (American football) (score: 0.3418)

🤖 Generating answer...

RESULTS

🤖 Predicted Answer: No relevant information about **Scott Derrickson** or **Ed Wood** is provided in the given documents. Cannot determine their nationalities.
✅ Ground Truth: yes

✗ DIFFERENT
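
None of the retrieved titles mention either person, which suggests the needed articles may simply be absent from the 1,000-article sample. One way to check this before blaming the retriever is to measure how many questions have all of their gold articles in the index. The sketch below assumes each HotpotQA example carries a `supporting_facts` list of `[title, sentence_index]` pairs; the helper names and toy data are illustrative.

```python
from typing import Dict, Iterable, List, Set


def gold_titles(example: Dict) -> Set[str]:
    """Titles of the articles that contain the supporting sentences."""
    return {title for title, _sentence_index in example["supporting_facts"]}


def index_coverage(examples: List[Dict], indexed_titles: Iterable[str]) -> float:
    """Fraction of examples whose gold articles are all present in the index."""
    indexed = set(indexed_titles)
    if not examples:
        return 0.0
    covered = sum(1 for example in examples if gold_titles(example) <= indexed)
    return covered / len(examples)


# Toy data for illustration only.
toy_examples = [
    {"supporting_facts": [["Scott Derrickson", 0], ["Ed Wood", 0]]},
    {"supporting_facts": [["One Night Stand (1984 film)", 0]]},
]
toy_index = ["One Night Stand (1984 film)", "Britt Woodman"]
coverage = index_coverage(toy_examples, toy_index)
print(f"Coverage: {coverage:.0%}")
```

In the notebook this could be called as `index_coverage(hotpotqa_data, wiki_titles)`.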


## Step 8: Test on Multiple Questions

In [15]:
import random

# Test on 3 random questions
test_questions = random.sample(hotpotqa_data, 3)

results = []

for i, example in enumerate(test_questions, 1):
    print("\n" + "="*80)
    print(f"QUESTION {i}/3")
    print("="*80)
    print(f"Q: {example['question']}")
    
    # Get answer (non-verbose)
    predicted = answer_question_with_rag(example['question'], top_k=3, verbose=False)
    
    print(f"\n🤖 Predicted: {predicted}")
    print(f"✅ Truth: {example['answer']}")
    
    is_match = predicted.lower().strip() == example['answer'].lower().strip()
    results.append(is_match)
    print(f"{'✓ MATCH' if is_match else '✗ DIFFERENT'}")

print("\n" + "="*80)
print(f"Exact matches: {sum(results)}/{len(results)}")
print("="*80)


QUESTION 1/3
Q: Which band was formed first, Lit or Adorable?

🤖 Predicted: Neither **Lit** nor **Adorable** is mentioned in the provided documents, so the answer cannot be determined from the given context.
✅ Truth: Adorable
✗ DIFFERENT

QUESTION 2/3
Q: Cave-In-Rock, Illinois was a stronghold for serial killer/bandit brothers who operated in which century?

🤖 Predicted: Cave-In-Rock, Illinois, was a stronghold for the **Harpe brothers**, serial killer/bandit brothers who operated in the **late 18th century** (1790s).
✅ Truth: who operated in Tennessee, Kentucky, Illinois, and Mississippi, in the late eighteenth century.
✗ DIFFERENT

QUESTION 3/3
Q: Whose works are more likely to be seen in an art gallery, Hovsep Pushman or Armen Chakmakian?

🤖 Predicted: Neither **Hovsep Pushman** nor **Armen Chakmakian** are mentioned in the provided documents. Therefore, I cannot determine whose works are more likely to be seen in an art gallery based on the given context.
✅ Trut

## Step 9: Save Index for Reuse (Optional)

In [16]:
# Save the index to avoid rebuilding
cache_dir = project_root / 'data/cache'
cache_dir.mkdir(parents=True, exist_ok=True)  # also create data/ if it is missing

index_path = cache_dir / 'wiki_index.pkl'

with open(index_path, 'wb') as f:
    pickle.dump({
        'articles': wiki_articles,
        'embeddings': wiki_embeddings,
        'titles': wiki_titles
    }, f)

print(f"✅ Index saved to {index_path}")
print(f"File size: {index_path.stat().st_size / (1024**2):.2f} MB")

✅ Index saved to /Users/vatsalpatel/hotpotqa/data/cache/wiki_index.pkl
File size: 1.82 MB


In [17]:
# To load the index later:
# with open(index_path, 'rb') as f:
#     data = pickle.load(f)
#     wiki_articles = data['articles']
#     wiki_embeddings = data['embeddings']
#     wiki_titles = data['titles']
# print(f"✅ Index loaded: {len(wiki_articles):,} articles")
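
Pickle files can execute arbitrary code when loaded, so a cache that may be shared is arguably safer in plain formats. As a sketch of one alternative (not what the notebook does), the embeddings can go into a `.npy` file with pickling disabled and the titles into JSON. The function names and file names here are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple

import numpy as np


def save_index(directory: Path, embeddings: np.ndarray, titles: List[str]) -> None:
    """Write embeddings as .npy and titles as JSON, without using pickle."""
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / "embeddings.npy", embeddings, allow_pickle=False)
    (directory / "titles.json").write_text(json.dumps(titles), encoding="utf-8")


def load_index(directory: Path) -> Tuple[np.ndarray, List[str]]:
    """Read back what save_index wrote."""
    embeddings = np.load(directory / "embeddings.npy", allow_pickle=False)
    titles = json.loads((directory / "titles.json").read_text(encoding="utf-8"))
    return embeddings, titles


# Round trip with toy data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "wiki_index"
    save_index(demo_dir, np.eye(3, dtype=np.float32), ["a", "b", "c"])
    loaded_embeddings, loaded_titles = load_index(demo_dir)
print(loaded_embeddings.shape, loaded_titles)
```

The article texts would need the same treatment, for example as one JSON object per line.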

## Summary

### What You Built:
1. ✅ Loaded Wikipedia dump from tar.bz2 file
2. ✅ Built dense retrieval index using sentence-transformers
3. ✅ Implemented simple RAG pipeline (Retrieve + Generate)
4. ✅ Tested on HotpotQA questions

### Performance Notes:
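- This is a **simple baseline**: one retrieval pass per question (single-hop) over a 1,000-article sample
- With so few articles indexed, the supporting articles for a question are likely missing from the index, which is consistent with the model reporting that the documents lack the needed information
- HotpotQA requires **multi-hop reasoning**
- May need to retrieve multiple times for bridge questions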
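
Retrieving multiple times for a bridge question can be sketched as a loop that expands the query with the text found so far. Everything in this example is illustrative: the toy corpus, the word-overlap retriever standing in for `retrieve_documents`, and the function name `multi_hop_retrieve`.

```python
import re
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, str]  # (title, text)
Retriever = Callable[[str, int], List[Doc]]


def multi_hop_retrieve(question: str, retrieve: Retriever, hops: int = 2, per_hop: int = 1) -> List[Doc]:
    """Collect documents over several rounds, expanding the query each round."""
    collected: Dict[str, str] = {}
    query = question
    for _ in range(hops):
        # Over-fetch so that titles collected earlier can be skipped.
        candidates = retrieve(query, per_hop + len(collected))
        fresh = [(t, x) for t, x in candidates if t not in collected][:per_hop]
        for title, text in fresh:
            collected[title] = text
        # The next round searches with the question plus everything found so far.
        query = question + " " + " ".join(collected.values())
    return list(collected.items())


TOY_CORPUS: List[Doc] = [
    ("Film A", "Film A was directed by Jane Roe."),
    ("Jane Roe", "Jane Roe was born in Canada."),
    ("Bananas", "Bananas are yellow."),
]


def toy_retrieve(query: str, top_k: int) -> List[Doc]:
    """Rank documents by the number of distinct words shared with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(doc: Doc) -> int:
        return len(query_words & set(re.findall(r"\w+", " ".join(doc).lower())))

    return sorted(TOY_CORPUS, key=overlap, reverse=True)[:top_k]


hop_titles = [title for title, _ in
              multi_hop_retrieve("Where was the director of Film A born?", toy_retrieve)]
print(hop_titles)
```

The first round finds the film article; only after its text names the director does the second round surface the article that holds the answer.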

### Improvements to Try:
1. **Better retrieval**: Use hybrid search (BM25 + dense)
2. **Multi-hop**: Iteratively retrieve based on previous context
3. **Re-ranking**: Add a cross-encoder for better ranking
4. **Chunking**: Split long articles into smaller chunks
5. **Index all Wikipedia**: Load full dump instead of sample
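
For the hybrid search idea, one common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion, which needs only the two ranked lists and no score calibration. This is a minimal sketch under that assumption. The rankings are toy data, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, title in enumerate(ranking, start=1):
            scores[title] += 1.0 / (k + rank)
    return sorted(scores, key=lambda title: scores[title], reverse=True)


# Toy rankings standing in for BM25 and dense retrieval results.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_c", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

`doc_b` wins because it sits near the top of both lists, even though it is first in only one of them.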

### Next Steps:
- Evaluate on full dev set
- Implement proper EM/F1 metrics
- Try different retrieval strategies
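
The strict string comparison used above marks verbose but correct answers as wrong. A minimal sketch of SQuAD-style exact match and token-level F1 is shown below: answers are lowercased, stripped of punctuation and articles, then compared. The official HotpotQA evaluation script may differ in details, so treat this as an approximation.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("**Yes.**", "yes"))
print(round(f1_score("late 18th century", "the late eighteenth century"), 3))
```

F1 gives partial credit when the prediction and the reference share some tokens, as in the second call, where exact match would give none.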