# Part 7.1: Retrieval-Augmented Generation (RAG) — The Formula 1 Edition

Language models are powerful, but they have a fundamental limitation: their knowledge is frozen at training time. Ask about yesterday's news, your company's internal docs, or a recently published paper, and the model either admits it doesn't know or, worse, hallucinates a plausible-sounding answer. **Retrieval-Augmented Generation (RAG)** addresses this by giving LLMs access to external knowledge at inference time.

**F1 analogy:** Think of RAG like a race engineer's access to the team's historical database. During a race, the engineer doesn't rely solely on memory — they query past race data for similar conditions (retrieval), then combine that historical context with live telemetry to make strategy calls (augmented generation). The vector store is the team's indexed archive of thousands of past races, searchable by situation similarity rather than date or keyword.

RAG is one of the most widely deployed LLM patterns in production today: it is the usual way enterprise search, document Q&A, and AI assistants work with custom data. Understanding RAG means understanding how to build practical AI systems.

## Learning Objectives

- [ ] Understand why RAG exists and when to use it vs. fine-tuning
- [ ] Implement document chunking strategies and understand their tradeoffs
- [ ] Build a vector store with similarity search from scratch
- [ ] Implement a complete RAG pipeline: chunk → embed → retrieve → generate
- [ ] Understand and implement reranking for improved retrieval quality
- [ ] Evaluate RAG systems with retrieval and generation metrics
- [ ] Recognize common RAG failure modes and how to fix them
- [ ] Connect RAG to the embedding concepts from Notebook 15

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict, Counter
import re
import hashlib
import torch
import torch.nn as nn
import torch.nn.functional as F

np.random.seed(42)
torch.manual_seed(42)

print("Part 7.1: Retrieval-Augmented Generation (RAG)")
print("=" * 50)

---

## 1. Why RAG?

### The Problem with Parametric Knowledge

LLMs store knowledge in their parameters (weights). This has three major limitations:

| Problem | Description | Example | F1 Parallel |
|---------|------------|----------|-------------|
| **Staleness** | Knowledge frozen at training cutoff | "Who won the 2025 Super Bowl?" | A strategy model trained on 2023 data doesn't know about 2024 regulation changes |
| **Hallucination** | Generates plausible but wrong facts | Citing non-existent papers | The model confidently predicting tire life based on data from a track it's never seen |
| **No private data** | Can't access your specific documents | Company policy questions | Can't access your team's proprietary telemetry or confidential race debriefs |

### RAG vs. Fine-Tuning

| Approach | When to Use | Pros | Cons |
|----------|------------|------|------|
| **RAG** | Dynamic knowledge, factual accuracy | No training needed, stays current as the index is updated, verifiable | Latency overhead, retrieval errors |
| **Fine-tuning** | Behavior change, domain style | Faster inference, no retrieval | Expensive, knowledge goes stale |
| **RAG + Fine-tuning** | Fresh knowledge and domain-specific behavior both matter | Accurate + well-behaved | Most complex |

### Intuitive Explanation

Think of RAG like an open-book exam: instead of memorizing everything (fine-tuning), you bring your textbooks (retrieved documents) and look up the answer. The LLM's job shifts from "know everything" to "read well and synthesize."

**F1 analogy:** Fine-tuning is like training a driver on a simulator until they memorize every corner — great until the track layout changes. RAG is like giving the driver a radio connection to the pit wall, where engineers look up relevant data from past races in real-time and relay it. The driver (LLM) still makes the final call, but with the latest information at hand.

### Visualization: The RAG Pipeline

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
ax.set_xlim(0, 14)
ax.set_ylim(0, 9)
ax.axis('off')
ax.set_title('The RAG Pipeline', fontsize=16, fontweight='bold')

# Indexing pipeline (top)
ax.text(7, 8.5, 'Indexing Pipeline (offline)', ha='center', fontsize=12,
        fontweight='bold', color='gray', style='italic')

index_steps = [
    (0.5, 7, 2.5, 1, 'Documents', '#95a5a6', 'Raw text, PDFs,\nweb pages'),
    (3.5, 7, 2.5, 1, 'Chunker', '#3498db', 'Split into\nmanageable pieces'),
    (6.5, 7, 2.5, 1, 'Embedder', '#9b59b6', 'Convert chunks\nto vectors'),
    (10, 7, 3, 1, 'Vector Store', '#2ecc71', 'Index vectors\nfor fast search'),
]

for x, y, w, h, label, color, desc in index_steps:
    box = mpatches.FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.2",
                                   facecolor=color, edgecolor='black', linewidth=2, alpha=0.9)
    ax.add_patch(box)
    ax.text(x + w/2, y + h/2, label, ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    ax.text(x + w/2, y - 0.4, desc, ha='center', va='center', fontsize=8, color='gray')

for i in range(len(index_steps) - 1):
    x1 = index_steps[i][0] + index_steps[i][2]
    x2 = index_steps[i+1][0]
    ax.annotate('', xy=(x2, 7.5), xytext=(x1, 7.5),
               arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Query pipeline (bottom)
ax.text(7, 5, 'Query Pipeline (online)', ha='center', fontsize=12,
        fontweight='bold', color='gray', style='italic')

query_steps = [
    (0.5, 3, 2.5, 1, 'User Query', '#e74c3c', '"What is...?"'),
    (3.5, 3, 2.5, 1, 'Embed Query', '#9b59b6', 'Same embedder\nas indexing'),
    (6.5, 3, 2.5, 1, 'Retrieve', '#2ecc71', 'Find top-k\nnearest chunks'),
    (10, 3, 3, 1, 'Generate', '#f39c12', 'LLM synthesizes\nanswer + context'),
]

for x, y, w, h, label, color, desc in query_steps:
    box = mpatches.FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.2",
                                   facecolor=color, edgecolor='black', linewidth=2, alpha=0.9)
    ax.add_patch(box)
    ax.text(x + w/2, y + h/2, label, ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    ax.text(x + w/2, y - 0.4, desc, ha='center', va='center', fontsize=8, color='gray')

for i in range(len(query_steps) - 1):
    x1 = query_steps[i][0] + query_steps[i][2]
    x2 = query_steps[i+1][0]
    ax.annotate('', xy=(x2, 3.5), xytext=(x1, 3.5),
               arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Connection from vector store to retrieve
ax.annotate('', xy=(7.75, 4), xytext=(11.5, 7),
           arrowprops=dict(arrowstyle='->', lw=2, color='#2ecc71',
                          connectionstyle='arc3,rad=-0.3'))

# Answer
box = mpatches.FancyBboxPatch((4, 0.5), 6, 1, boxstyle="round,pad=0.3",
                               facecolor='#2c3e50', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(7, 1, 'Grounded Answer (with citations)', ha='center', va='center',
        fontsize=11, fontweight='bold', color='white')
ax.annotate('', xy=(7, 1.5), xytext=(11.5, 3),
           arrowprops=dict(arrowstyle='->', lw=2, color='#f39c12'))

plt.tight_layout()
plt.show()

---

## 2. Document Chunking

Before we can search documents, we need to split them into chunks. Chunking strategy dramatically affects RAG quality.

### Why Chunk?

- LLMs have limited context windows
- Embedding models work best on shorter texts
- Retrieval is more precise with smaller, focused chunks

### Key Tradeoff

- **Too small**: Loses context, retrieves fragments without enough information
- **Too large**: Dilutes relevance, wastes context window space
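
To make the tradeoff concrete, the sketch below counts how many sliding windows a text of a given length needs for a few chunk sizes and overlaps. The helper name `count_chunks` and the example numbers are illustrative assumptions, not part of the notebook's pipeline; it assumes each window advances by `chunk_size - overlap` characters.

```python
import math

def count_chunks(text_len, chunk_size, overlap):
    """Number of sliding windows needed to cover text_len characters.

    Hypothetical helper: assumes each window advances by
    (chunk_size - overlap) characters and the last window may be short.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if text_len <= chunk_size:
        return 1 if text_len > 0 else 0
    stride = chunk_size - overlap
    return 1 + math.ceil((text_len - chunk_size) / stride)

text_len = 1000  # characters in an example document
for chunk_size, overlap in [(100, 0), (200, 50), (500, 100), (1000, 0)]:
    n = count_chunks(text_len, chunk_size, overlap)
    stored = n * chunk_size  # upper bound on characters held in the index
    print(f"size={chunk_size:4d} overlap={overlap:3d} -> {n:2d} chunks, <= {stored} chars indexed")
```

Smaller chunks mean more vectors to store and search, and overlap adds duplicated text on top of that; larger chunks mean fewer, less focused vectors.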

**F1 analogy:** Chunking is like breaking race reports into queryable sections. A full 50-page post-race debrief is too big to search effectively — you need it split into sections: "tire strategy," "weather conditions," "overtaking analysis," "pit stop timing." Too granular (individual sentences) and you lose context; too coarse (the entire report) and the retrieval becomes imprecise. The art is finding the right granularity, just like choosing how to organize the team's knowledge base so engineers can find what they need mid-race.

In [None]:
# Sample document corpus for our RAG system
DOCUMENTS = {
    "neural_networks": """Neural networks are computing systems inspired by biological neural networks. 
They consist of layers of interconnected nodes called neurons. Each connection has a weight that 
adjusts during training. The input layer receives data, hidden layers process it through weighted 
connections and activation functions, and the output layer produces predictions.

Backpropagation is the key algorithm for training neural networks. It computes gradients of the loss 
function with respect to each weight by applying the chain rule. These gradients tell us how to 
adjust weights to minimize the loss. Stochastic gradient descent (SGD) and its variants like Adam 
are commonly used optimizers.

Deep learning refers to neural networks with many hidden layers. Deep networks can learn hierarchical 
representations — early layers detect simple features like edges, while deeper layers combine them 
into complex patterns like faces or objects. This hierarchical feature learning is what makes deep 
learning so powerful for tasks like image recognition and natural language processing.""",

    "transformers": """The Transformer architecture was introduced in the 2017 paper 'Attention Is All 
You Need' by Vaswani et al. It replaced recurrent networks with self-attention mechanisms, enabling 
parallel processing of sequences. The key innovation is the attention mechanism, which allows each 
token to attend to all other tokens in the sequence.

Self-attention computes three matrices: Query (Q), Key (K), and Value (V) from the input. The 
attention score between positions is computed as the dot product of Q and K, scaled by the square 
root of the dimension, then softmaxed to get weights for the V matrix. Multi-head attention runs 
multiple attention operations in parallel, each learning different relationship patterns.

Transformers use positional encodings since they have no inherent notion of sequence order. The 
original paper used sinusoidal encodings, but modern models often use learned positional embeddings 
or relative position encodings like RoPE. Layer normalization and residual connections are critical 
for training stability in deep transformer models.""",

    "rlhf": """Reinforcement Learning from Human Feedback (RLHF) is the technique used to align 
language models with human preferences. The process has three stages: supervised fine-tuning (SFT) 
on human demonstrations, training a reward model on preference comparisons, and optimizing the 
policy with PPO using the reward model.

The reward model is trained using the Bradley-Terry preference model. Given pairs of responses 
where humans indicated a preference, the model learns to assign higher scores to preferred 
responses. The loss function is the negative log likelihood of the human preferences under 
the model's scoring.

PPO optimization with a KL penalty is critical to prevent reward hacking. Without the KL 
constraint, the model finds degenerate solutions that exploit the reward model's weaknesses. 
The KL penalty keeps the policy close to the SFT reference model, ensuring the model remains 
coherent and fluent while improving on the reward signal. Direct Preference Optimization (DPO) 
is a newer alternative that skips the reward model entirely.""",

    "embeddings": """Word embeddings represent words as dense vectors in a continuous space. 
Word2Vec introduced two key approaches: Skip-gram, which predicts context words from a target 
word, and CBOW, which predicts the target from context. These models learn that semantically 
similar words end up close together in the embedding space.

Modern embedding models like sentence transformers produce embeddings for entire sentences or 
paragraphs. These are crucial for RAG systems because they enable semantic search — finding 
documents by meaning rather than keyword matching. Cosine similarity is the standard metric 
for comparing embeddings.

Contextual embeddings from models like BERT produce different vectors for the same word 
depending on context. The word 'bank' gets different embeddings in 'river bank' versus 
'bank account'. This context-sensitivity makes them far more powerful than static embeddings 
for understanding natural language."""
}

print(f"Document corpus: {len(DOCUMENTS)} documents")
for name, text in DOCUMENTS.items():
    words = len(text.split())
    print(f"  {name}: {words} words")

In [None]:
class DocumentChunker:
    """Multiple chunking strategies for RAG."""
    
    @staticmethod
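    def fixed_size(text, chunk_size=200, overlap=50):
        """Split into fixed-size character chunks with overlap."""
        # A stride of zero or less would loop forever
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end].strip()
            if chunk:
                chunks.append(chunk)
            if end >= len(text):
                break  # Last window reached; skip a redundant tail chunk
            start += chunk_size - overlap
        return chunks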
    
    @staticmethod
    def sentence_based(text, max_sentences=3):
        """Split into groups of sentences."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        chunks = []
        for i in range(0, len(sentences), max_sentences):
            chunk = ' '.join(sentences[i:i + max_sentences]).strip()
            if chunk:
                chunks.append(chunk)
        return chunks
    
    @staticmethod
    def paragraph_based(text):
        """Split on paragraph boundaries."""
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        return paragraphs
    
    @staticmethod
    def semantic_window(text, window_size=3):
        """Sliding window over sentences with overlap."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        chunks = []
        for i in range(len(sentences)):
            window = sentences[max(0, i - window_size // 2):i + window_size // 2 + 1]
            chunk = ' '.join(window).strip()
            if chunk and chunk not in chunks:
                chunks.append(chunk)
        return chunks


# Demonstrate chunking strategies
sample_text = DOCUMENTS['transformers']
chunker = DocumentChunker()

strategies = {
    'Fixed (200 chars, 50 overlap)': chunker.fixed_size(sample_text, 200, 50),
    'Sentence-based (3 per chunk)': chunker.sentence_based(sample_text, 3),
    'Paragraph-based': chunker.paragraph_based(sample_text),
}

for name, chunks in strategies.items():
    print(f"\n{name}: {len(chunks)} chunks")
    for i, chunk in enumerate(chunks[:3]):
        preview = chunk[:80].replace('\n', ' ') + ('...' if len(chunk) > 80 else '')
        print(f"  [{i}] ({len(chunk)} chars) {preview}")

### Visualization: Chunking Strategy Comparison

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (name, chunks) in zip(axes, strategies.items()):
    sizes = [len(c) for c in chunks]
    bars = ax.bar(range(len(sizes)), sizes, color='#3498db', edgecolor='black', alpha=0.8)
    ax.set_xlabel('Chunk index', fontsize=11)
    ax.set_ylabel('Chunk size (chars)', fontsize=11)
    ax.set_title(f'{name}\n({len(chunks)} chunks)', fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Annotate mean
    mean_size = np.mean(sizes)
    ax.axhline(y=mean_size, color='red', linestyle='--', alpha=0.7)
    ax.text(len(sizes) - 1, mean_size + 10, f'μ={mean_size:.0f}', color='red', fontsize=9)

plt.suptitle('Chunk Size Distribution by Strategy', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Fixed-size: Uniform chunks but may split mid-sentence")
print("Sentence-based: Respects sentence boundaries, variable size")
print("Paragraph-based: Respects semantic boundaries, fewest chunks")

---

## 3. Embeddings and Vector Search

Once we have chunks, we convert them to vectors using an embedding model, then store them for similarity search. We covered embeddings in Notebook 15 — now we'll use them in practice.

**F1 analogy:** The vector store is the team's knowledge base of past races, indexed by situation similarity. Instead of filing race reports by date or circuit name, you encode each report section as a vector that captures its *meaning*. When the engineer asks "What happened last time we faced tire degradation on a hot street circuit?", the system finds the most semantically similar past situations — even if they used completely different words in the original report. This is the difference between keyword search ("tire degradation AND street circuit") and semantic search ("situations like this one").

### Building a Simple Embedding Model

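In production, you'd use a pre-trained model (e.g., OpenAI `text-embedding-3-small` or an open-source sentence transformer). Here we'll build a simplified TF-IDF + random projection embedder to demonstrate the mechanics. Nothing in it is learned: the projection matrix is drawn once from a Gaussian and then held fixed.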
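
A fixed random matrix is a reasonable stand-in because a Gaussian random projection approximately preserves angles between vectors, with the error shrinking as the output dimension grows (the Johnson-Lindenstrauss idea). The sketch below checks this empirically on synthetic sparse vectors. The sizes, seed, and helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_cosine_error(vectors, out_dim, rng):
    """Average |cos(original) - cos(projected)| over consecutive vector pairs."""
    in_dim = vectors.shape[1]
    projection = rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)
    projected = vectors @ projection
    errors = [
        abs(cosine(vectors[i], vectors[i + 1]) - cosine(projected[i], projected[i + 1]))
        for i in range(len(vectors) - 1)
    ]
    return float(np.mean(errors))

# Synthetic sparse, non-negative vectors standing in for TF-IDF rows
vocab_size, n_vectors = 1000, 200
mask = rng.random((n_vectors, vocab_size)) < 0.05
vectors = rng.random((n_vectors, vocab_size)) * mask

errors = {dim: mean_cosine_error(vectors, dim, rng) for dim in (16, 64, 256)}
for dim, err in errors.items():
    print(f"out_dim={dim:3d}  mean |cosine error| = {err:.3f}")
```

The error should fall as `out_dim` grows, which is the tradeoff behind picking an embedding dimension: fewer dimensions are cheaper to store and search but distort similarities more.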

In [None]:
class SimpleEmbedder:
    """TF-IDF based embedding model for demonstration.
    
    In production, use sentence-transformers or API-based embeddings.
    """
    
    def __init__(self, embedding_dim=64):
        self.embedding_dim = embedding_dim
        self.vocab = {}
        self.idf = {}
        self.projection = None  # Random projection for dimensionality reduction
    
    def _tokenize(self, text):
        """Lowercase the text and extract alphabetic tokens of 2+ letters."""
        return re.findall(r'\b[a-zA-Z]{2,}\b', text.lower())
    
    def fit(self, documents):
        """Build vocabulary and IDF scores from a corpus."""
        doc_freq = Counter()
        all_tokens = set()
        
        for doc in documents:
            tokens = set(self._tokenize(doc))
            for token in tokens:
                doc_freq[token] += 1
            all_tokens.update(tokens)
        
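        # Build vocabulary: every token, ordered by document frequency.
        # Ties are broken alphabetically so indices are reproducible across runs
        # (set iteration order for strings varies between Python processes).
        sorted_tokens = sorted(all_tokens, key=lambda t: (-doc_freq[t], t))
        self.vocab = {token: idx for idx, token in enumerate(sorted_tokens)}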
        
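        # Compute IDF. Every vocabulary token has doc_freq >= 1, so no +1 is
        # needed in the denominator; adding one would give negative weights
        # to tokens that appear in every document.
        n_docs = len(documents)
        self.idf = {token: np.log(n_docs / doc_freq[token])
                     for token in self.vocab}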
        
        # Random projection matrix for dimensionality reduction
        vocab_size = len(self.vocab)
        np.random.seed(42)
        self.projection = np.random.randn(vocab_size, self.embedding_dim) / np.sqrt(self.embedding_dim)
        
        print(f"Embedder fitted: {len(self.vocab)} tokens, {self.embedding_dim}d embeddings")
    
    def embed(self, text):
        """Convert text to a dense embedding vector."""
        tokens = self._tokenize(text)
        
        # TF-IDF vector
        tf = Counter(tokens)
        tfidf = np.zeros(len(self.vocab))
        for token, count in tf.items():
            if token in self.vocab:
                idx = self.vocab[token]
                tfidf[idx] = (count / len(tokens)) * self.idf.get(token, 0)
        
        # Project to lower dimension
        embedding = tfidf @ self.projection
        
        # L2 normalize
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm
        
        return embedding
    
    def embed_batch(self, texts):
        """Embed multiple texts."""
        return np.array([self.embed(text) for text in texts])


# Build embedder from our corpus
all_text = list(DOCUMENTS.values())
embedder = SimpleEmbedder(embedding_dim=64)
embedder.fit(all_text)

# Test it
test_embedding = embedder.embed("How does attention work in transformers?")
print(f"\nQuery embedding shape: {test_embedding.shape}")
print(f"L2 norm: {np.linalg.norm(test_embedding):.4f} (should be ~1.0)")

---

## 4. Vector Store

A vector store indexes embeddings for fast similarity search. The core operation is **nearest neighbor search**: given a query vector, find the $k$ most similar stored vectors.

**F1 analogy:** The vector store is like the team's race strategy database — thousands of past race situations encoded as vectors. When the race engineer types "We're on mediums, lap 25 of 55, gap to car ahead is closing," the vector store finds the three or four most similar historical situations. The cosine similarity score tells you *how* similar each past situation was. In production F1 teams, this kind of indexed retrieval needs to return results in milliseconds — you can't wait 10 seconds for an answer when racing at 300 km/h.

In production, you'd use FAISS, Pinecone, Weaviate, or Chroma. Here we build one from scratch to understand the mechanics.
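
The core operation fits in a few lines of NumPy. When stored vectors are L2-normalized, a single matrix-vector product yields every cosine similarity, and `np.argpartition` picks the top $k$ without sorting the whole score array. The function name `top_k_cosine` and the toy data are assumptions for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`.

    Assumes `query_vec` and every row of `matrix` are already L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = matrix @ query_vec
    k = min(k, len(scores))
    if k == 0:
        return np.array([], dtype=int), np.array([])
    # argpartition moves the k largest scores into the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, best first
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return order, scores[order]

rng = np.random.default_rng(42)
matrix = rng.standard_normal((1000, 64))
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

# Build a query that is a noisy copy of row 17, then normalize it
query_vec = matrix[17] + 0.1 * rng.standard_normal(64)
query_vec /= np.linalg.norm(query_vec)

indices, scores = top_k_cosine(query_vec, matrix, k=3)
print(indices, np.round(scores, 3))
```

This is still an exact, brute-force scan over every stored vector. Libraries such as FAISS add approximate indexes so that very large collections do not require a full scan per query.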

In [None]:
class VectorStore:
    """Simple vector store with cosine similarity search."""
    
    def __init__(self, embedder):
        self.embedder = embedder
        self.embeddings = []  # List of vectors
        self.documents = []   # Original text chunks
        self.metadata = []    # Source document info
    
    def add_documents(self, chunks, source_name=""):
        """Index a list of text chunks."""
        for chunk in chunks:
            embedding = self.embedder.embed(chunk)
            self.embeddings.append(embedding)
            self.documents.append(chunk)
            self.metadata.append({'source': source_name, 'length': len(chunk)})
    
    def search(self, query, top_k=3):
        """Find the top-k most similar chunks to a query."""
        query_embedding = self.embedder.embed(query)
        
        # Cosine similarity (embeddings are already normalized)
        similarities = np.array([
            np.dot(query_embedding, emb) for emb in self.embeddings
        ])
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                'chunk': self.documents[idx],
                'score': similarities[idx],
                'metadata': self.metadata[idx],
                'index': idx
            })
        
        return results
    
    def __len__(self):
        return len(self.documents)


# Build the vector store
store = VectorStore(embedder)

for doc_name, doc_text in DOCUMENTS.items():
    chunks = DocumentChunker.paragraph_based(doc_text)
    store.add_documents(chunks, source_name=doc_name)

print(f"Vector store: {len(store)} chunks indexed")
print(f"Sources: {set(m['source'] for m in store.metadata)}")

# Test search
query = "How does attention work?"
results = store.search(query, top_k=3)

print(f"\nQuery: '{query}'")
print("\nTop-3 results:")
for i, r in enumerate(results):
    preview = r['chunk'][:100].replace('\n', ' ') + '...'
    print(f"  [{i+1}] Score: {r['score']:.4f} | Source: {r['metadata']['source']}")
    print(f"       {preview}")

### Visualization: Embedding Space

In [None]:
# Visualize the embedding space with PCA
from numpy.linalg import svd

embeddings_matrix = np.array(store.embeddings)
# Simple PCA via SVD
mean = embeddings_matrix.mean(axis=0)
centered = embeddings_matrix - mean
U, S, Vt = svd(centered, full_matrices=False)
pca_2d = centered @ Vt[:2].T

# Color by source document
source_colors = {'neural_networks': '#3498db', 'transformers': '#e74c3c',
                 'rlhf': '#2ecc71', 'embeddings': '#f39c12'}

fig, ax = plt.subplots(1, 1, figsize=(10, 8))

for i, (point, meta) in enumerate(zip(pca_2d, store.metadata)):
    color = source_colors[meta['source']]
    ax.scatter(point[0], point[1], c=color, s=100, edgecolor='black', zorder=5)
    ax.annotate(f"{meta['source'][:5]}_{i}", (point[0], point[1]),
               textcoords="offset points", xytext=(5, 5), fontsize=7)

# Plot query point
query_emb = embedder.embed(query)
query_pca = (query_emb - mean) @ Vt[:2].T
ax.scatter(query_pca[0], query_pca[1], c='black', s=200, marker='*', zorder=10, label='Query')

# Draw lines to top results
for r in results[:3]:
    idx = r['index']
    ax.plot([query_pca[0], pca_2d[idx, 0]], [query_pca[1], pca_2d[idx, 1]],
            'k--', alpha=0.3, linewidth=1.5)

# Legend
for name, color in source_colors.items():
    ax.scatter([], [], c=color, s=80, label=name, edgecolor='black')
ax.legend(fontsize=10, loc='upper left')

ax.set_xlabel('PCA Component 1', fontsize=12)
ax.set_ylabel('PCA Component 2', fontsize=12)
ax.set_title('Chunk Embeddings in 2D (PCA)\nDashed lines = retrieved chunks',
             fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 5. The Complete RAG Pipeline

Now let's put it all together into a complete RAG system.

**F1 analogy:** The full RAG pipeline mirrors what happens on the pit wall during a race. The engineer's question (query) gets embedded and matched against the team's historical database (retrieval). The top-k most relevant past race situations are pulled together with the current live data (augmented context). Then the strategy model (LLM) synthesizes all of this into a recommendation: "Based on similar situations at Silverstone 2022 and Barcelona 2023, we recommend pitting on lap 32 for hards." The key insight: the model doesn't need to *remember* every past race — it just needs to *read* the relevant ones and reason well.

In [None]:
class RAGPipeline:
    """Complete Retrieval-Augmented Generation pipeline."""
    
    def __init__(self, vector_store, top_k=3):
        self.store = vector_store
        self.top_k = top_k
    
    def build_prompt(self, query, retrieved_chunks):
        """Build the prompt with retrieved context."""
        context = "\n\n".join([
            f"[Source: {r['metadata']['source']}] {r['chunk']}"
            for r in retrieved_chunks
        ])
        
        prompt = f"""Answer the question based on the provided context. If the context doesn't 
contain enough information, say so. Cite the source documents.

Context:
{context}

Question: {query}

Answer:"""
        return prompt
    
    def query(self, question):
        """Full RAG pipeline: retrieve + build prompt."""
        # Step 1: Retrieve relevant chunks
        results = self.store.search(question, top_k=self.top_k)
        
        # Step 2: Build augmented prompt
        prompt = self.build_prompt(question, results)
        
        # Step 3: In production, send to LLM. Here we return the prompt.
        return {
            'prompt': prompt,
            'retrieved': results,
            'n_chunks': len(results),
            'total_context_chars': sum(len(r['chunk']) for r in results),
        }


# Build and test the RAG pipeline
rag = RAGPipeline(store, top_k=3)

test_queries = [
    "How does backpropagation work?",
    "What is the attention mechanism?",
    "How is RLHF used to train language models?",
    "What are word embeddings?",
]

for query in test_queries:
    result = rag.query(query)
    print(f"\nQuery: '{query}'")
    print(f"  Retrieved {result['n_chunks']} chunks ({result['total_context_chars']} chars)")
    for r in result['retrieved']:
        print(f"    [{r['score']:.3f}] {r['metadata']['source']}")

In [None]:
# Show what the LLM would see
result = rag.query("What is the attention mechanism?")
print("=" * 60)
print("PROMPT SENT TO LLM:")
print("=" * 60)
print(result['prompt'][:1000])
if len(result['prompt']) > 1000:
    print(f"\n... ({len(result['prompt'])} total characters)")

---

## 6. Reranking: Improving Retrieval Quality

Embedding-based retrieval is fast but approximate. A **reranker** takes the top-k results and re-scores them using a more powerful (but slower) model. This two-stage approach is the standard in production RAG:

1. **Stage 1 (Retriever)**: Fast approximate search over millions of chunks → top-k candidates
2. **Stage 2 (Reranker)**: Accurate scoring of k candidates → final ranked list

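The reranker reads the query and chunk together, so its attention layers can compare query tokens directly against chunk tokens. The embedder, by contrast, encodes each text independently and never sees the pair side by side.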
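
The two-stage idea can be sketched with toy scorers. Everything below is a hypothetical stand-in: `cheap_score` is an order-insensitive word-overlap score playing the retriever, and `expensive_score` is a bigram-overlap score playing the reranker. Neither is a real embedding model or cross-encoder; the point is only the control flow and the number of expensive calls.

```python
def tokens(text):
    return text.lower().split()

def cheap_score(query, doc):
    """Stage 1 stand-in: fraction of query words present in the document."""
    q = set(tokens(query))
    return len(q & set(tokens(doc))) / len(q)

calls = {"expensive": 0}  # counts how often the slow scorer runs

def expensive_score(query, doc):
    """Stage 2 stand-in: shared word bigrams, which rewards matching phrases."""
    calls["expensive"] += 1
    def bigrams(text):
        t = tokens(text)
        return set(zip(t, t[1:]))
    return len(bigrams(query) & bigrams(doc))

def retrieve_then_rerank(query, corpus, n_candidates=3, top_k=1):
    # Stage 1: cheap score over the whole corpus, keep a short candidate list
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: expensive score over the candidates only
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:top_k]

corpus = [
    "model reward hacking is prevented by the KL penalty",
    "the reward model assigns higher scores to preferred responses",
    "positional encodings give transformers a notion of order",
    "backpropagation computes gradients with the chain rule",
    "word embeddings place similar words close together",
]

best = retrieve_then_rerank("reward model", corpus, n_candidates=3, top_k=1)
print(best[0])
print(f"expensive scorer calls: {calls['expensive']} of {len(corpus)} documents")
```

The first two documents tie in stage 1 because both contain "reward" and "model". Only the stage 2 scorer, which looks at the query and document jointly, separates the phrase "reward model" from "model reward", and it runs on three candidates rather than the full corpus.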

**F1 analogy:** This is exactly how F1 teams filter data during a race. Stage 1 is the quick database scan: "Pull all past situations involving tire degradation on a hot track." That might return 50 results. Stage 2 is the senior strategist reviewing those 50 and saying, "Actually, only these 5 are truly comparable — the others were wet races or used different tire compounds." The reranker applies deeper analysis to a smaller candidate set, just like the experienced engineer applying judgment to narrow down the relevant precedents.

In [None]:
class CrossEncoderReranker:
    """Simplified cross-encoder reranker.
    
    In production, use a fine-tuned cross-encoder like ms-marco-MiniLM.
    Here we simulate one with a hand-weighted relevance scorer (nothing is learned).
    """
    
    def __init__(self, embedder):
        self.embedder = embedder
    
    def score(self, query, chunk):
        """Score query-chunk relevance.
        
        Simulates a cross-encoder by computing multiple similarity features
        and combining them.
        """
        q_emb = self.embedder.embed(query)
        c_emb = self.embedder.embed(chunk)
        
        # Feature 1: Cosine similarity (same as retriever)
        cosine_sim = np.dot(q_emb, c_emb)
        
        # Feature 2: Term overlap (keyword matching)
        q_tokens = set(re.findall(r'\b[a-zA-Z]{2,}\b', query.lower()))
        c_tokens = set(re.findall(r'\b[a-zA-Z]{2,}\b', chunk.lower()))
        if len(q_tokens) > 0:
            term_overlap = len(q_tokens & c_tokens) / len(q_tokens)
        else:
            term_overlap = 0
        
        # Feature 3: Chunk length penalty (prefer focused chunks)
        length_penalty = 1.0 / (1.0 + len(chunk) / 1000)
        
        # Combined score (weighted)
        score = 0.5 * cosine_sim + 0.35 * term_overlap + 0.15 * length_penalty
        return score
    
    def rerank(self, query, results):
        """Rerank a list of retrieval results."""
        scored = []
        for r in results:
            new_score = self.score(query, r['chunk'])
            scored.append({**r, 'original_score': r['score'], 'rerank_score': new_score})
        
        scored.sort(key=lambda x: x['rerank_score'], reverse=True)
        return scored


# Demonstrate reranking
reranker = CrossEncoderReranker(embedder)

query = "How does the reward model work in RLHF?"
initial_results = store.search(query, top_k=6)
reranked_results = reranker.rerank(query, initial_results)

print(f"Query: '{query}'\n")
print(f"{'Rank':<5} {'Original':>10} {'Reranked':>10} {'Source':<20} {'Preview'}")
print("-" * 80)
for i, r in enumerate(reranked_results):
    preview = r['chunk'][:50].replace('\n', ' ') + '...'
    print(f"{i+1:<5} {r['original_score']:10.4f} {r['rerank_score']:10.4f} "
          f"{r['metadata']['source']:<20} {preview}")

---

## 7. Evaluating RAG Systems

RAG evaluation has two components:

### Retrieval Metrics
- **Precision@k**: Of the k retrieved chunks, how many are relevant?
- **Recall@k**: Of all relevant chunks, how many were retrieved?
- **MRR** (Mean Reciprocal Rank): Where does the first relevant result appear, averaged over all queries?

### Generation Metrics
- **Faithfulness**: Does the answer stick to the retrieved context? (No hallucination)
- **Relevance**: Does the answer address the question?
- **Groundedness**: Can every claim be traced to a source chunk?
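
Stated as formulas, the retrieval metrics look as follows. The notation is introduced here for illustration: $R$ is the set of relevant items for a query, $\text{Ret}_k$ is the list of top-$k$ retrieved items, and $\text{rank}_q$ is the position of the first relevant result for query $q$ in a query set $Q$.

$$
\text{Precision@}k = \frac{|R \cap \text{Ret}_k|}{|\text{Ret}_k|}, \qquad
\text{Recall@}k = \frac{|R \cap \text{Ret}_k|}{|R|}, \qquad
\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}
$$

A query with no relevant result contributes $0$ to the MRR sum. As a worked example, suppose a query has $|R| = 2$ relevant items and the top 3 results contain exactly one of them, at position 2. Then $\text{Precision@}3 = 1/3$, $\text{Recall@}3 = 1/2$, and the reciprocal rank for that query is $1/2$.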

**F1 analogy:** Evaluating a RAG system is like evaluating the pit wall's information delivery. **Precision@k**: Of the 3 past race reports the system pulled up, how many were actually relevant to the current situation? **Recall@k**: Were there other critical past situations the system missed? **MRR**: Was the most relevant historical race the first one shown, or buried at position 5? On the generation side, **faithfulness** asks: did the strategy recommendation actually follow from the retrieved data, or did the model hallucinate a recommendation not supported by the evidence?

In [None]:
class RAGEvaluator:
    """Evaluate RAG retrieval quality."""
    
    @staticmethod
    def precision_at_k(retrieved_sources, relevant_sources, k=None):
        """Fraction of retrieved items that are relevant."""
        if k:
            retrieved_sources = retrieved_sources[:k]
        if not retrieved_sources:
            return 0.0
        relevant_count = sum(1 for s in retrieved_sources if s in relevant_sources)
        return relevant_count / len(retrieved_sources)
    
    @staticmethod
    def recall_at_k(retrieved_sources, relevant_sources, k=None):
        """Fraction of relevant items that were retrieved."""
        if k:
            retrieved_sources = retrieved_sources[:k]
        if not relevant_sources:
            return 0.0
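        # Count each relevant source once: several retrieved chunks can share
        # a source, and counting duplicates would push recall above 1.0.
        found = set(retrieved_sources) & set(relevant_sources)
        return len(found) / len(relevant_sources)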
    
    @staticmethod
    def mrr(retrieved_sources, relevant_sources):
        """Reciprocal rank for one query: 1/rank of the first relevant result.

        Averaging this value over all queries gives the Mean Reciprocal Rank.
        """
        for i, source in enumerate(retrieved_sources):
            if source in relevant_sources:
                return 1.0 / (i + 1)
        return 0.0
    
    @staticmethod
    def context_relevance(query, chunks, embedder):
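        """Average query-chunk similarity (cosine if embeddings are unit-length)."""
        if not chunks:
            return 0.0  # np.mean of an empty list would be NaN
        q_emb = embedder.embed(query)
        # Dot product equals cosine similarity only for unit-normalized embeddings
        sims = [np.dot(q_emb, embedder.embed(chunk)) for chunk in chunks]
        return float(np.mean(sims))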


# Create evaluation dataset (query, expected relevant sources)
eval_dataset = [
    ("How does backpropagation work?", {"neural_networks"}),
    ("What is self-attention?", {"transformers"}),
    ("How is RLHF used?", {"rlhf"}),
    ("What are word embeddings?", {"embeddings"}),
    ("How does PPO optimize language models?", {"rlhf"}),
    ("What is the transformer architecture?", {"transformers"}),
    ("How do neural networks learn?", {"neural_networks"}),
    ("What is cosine similarity?", {"embeddings"}),
]

evaluator = RAGEvaluator()

# Evaluate retrieval
all_p1, all_p3, all_mrr = [], [], []

for query, relevant in eval_dataset:
    results = store.search(query, top_k=5)
    retrieved = [r['metadata']['source'] for r in results]
    
    p1 = evaluator.precision_at_k(retrieved, relevant, k=1)
    p3 = evaluator.precision_at_k(retrieved, relevant, k=3)
    mrr = evaluator.mrr(retrieved, relevant)
    
    all_p1.append(p1)
    all_p3.append(p3)
    all_mrr.append(mrr)

print("RAG Retrieval Evaluation:")
print(f"  Precision@1: {np.mean(all_p1):.3f}")
print(f"  Precision@3: {np.mean(all_p3):.3f}")
print(f"  MRR:         {np.mean(all_mrr):.3f}")

In [None]:
# Visualize per-query retrieval performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Per-query metrics
ax = axes[0]
x = np.arange(len(eval_dataset))
width = 0.25
ax.bar(x - width, all_p1, width, label='P@1', color='#3498db', alpha=0.8)
ax.bar(x, all_p3, width, label='P@3', color='#2ecc71', alpha=0.8)
ax.bar(x + width, all_mrr, width, label='MRR', color='#e74c3c', alpha=0.8)
ax.set_xticks(x)
ax.set_xticklabels([q[0][:20] + '...' for q in eval_dataset], rotation=45, ha='right', fontsize=8)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Per-Query Retrieval Metrics', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

# Summary metrics
ax = axes[1]
metrics = {'P@1': np.mean(all_p1), 'P@3': np.mean(all_p3), 'MRR': np.mean(all_mrr)}
colors = ['#3498db', '#2ecc71', '#e74c3c']
bars = ax.bar(metrics.keys(), metrics.values(), color=colors, edgecolor='black', alpha=0.8)
for bar, val in zip(bars, metrics.values()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{val:.3f}', ha='center', fontsize=12, fontweight='bold')
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Average Retrieval Metrics', fontsize=13, fontweight='bold')
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

---

## 8. Common RAG Failure Modes

Understanding how RAG fails is as important as understanding how it works.

| Failure Mode | Cause | Fix | F1 Parallel |
|-------------|-------|-----|-------------|
| **Missing context** | Relevant chunk not retrieved | Better chunking, more chunks, hybrid search | The database didn't have data from a comparable race — need to expand the archive |
| **Wrong context** | Irrelevant chunks retrieved | Reranking, better embeddings, metadata filters | Pulling up a Monaco comparison when the current race is Monza — wrong track type entirely |
| **Lost in the middle** | LLM ignores middle context | Put important context first/last | The critical data point was buried in the third report — the strategist only read the first and last |
| **Hallucination** | LLM generates beyond context | Stronger prompting, faithfulness checks | The model recommends a three-stop strategy based on data that only supports two stops |
| **Chunk boundary** | Answer spans two chunks | Overlapping chunks, larger windows | Tire degradation data is in one chunk, weather data in another — the full picture requires both |
| **Embedding mismatch** | Query and doc use different vocabulary | Query expansion, hypothetical document embeddings | Engineer asks about "graining" but the report used "front tire surface degradation" |
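
The "lost in the middle" fix from the table (put important context first/last) can be sketched as a reordering step applied just before prompt construction. This is a minimal illustration, assuming the chunks arrive sorted best-first. The function name `reorder_for_edges` is hypothetical and not part of the notebook's classes.

```python
def reorder_for_edges(ranked_chunks):
    """Place the highest-ranked chunks at the start and end of the context.

    Assumes `ranked_chunks` is sorted best-first. Items at even ranks fill
    the front, items at odd ranks fill the back (reversed), so the weakest
    chunks end up in the middle of the prompt.
    """
    front, back = [], []
    for rank, chunk in enumerate(ranked_chunks):
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["best", "second", "third", "fourth", "worst"]
print(reorder_for_edges(ranked))
# → ['best', 'third', 'worst', 'fourth', 'second']
```

The best chunk stays first, the second-best moves to the end, and the lowest-ranked chunk lands in the middle, where the model is least likely to attend to it.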

In [None]:
# Demonstrate the "chunk boundary" problem
print("FAILURE MODE: Chunk Boundary Problem")
print("=" * 50)

# Information split across chunks
text = """The transformer uses multi-head attention with 8 heads. Each head has dimension 64.

The total dimension is therefore 8 * 64 = 512, which is the model's hidden size."""

chunks = DocumentChunker.paragraph_based(text)
print(f"\nOriginal text answers: 'What is the total dimension?'")
print(f"But paragraph chunking splits it:")
for i, chunk in enumerate(chunks):
    print(f"  Chunk {i}: '{chunk.strip()}'")
print(f"\nChunk 0 has the components, Chunk 1 has the answer.")
print(f"Retrieving only one chunk misses the complete picture.")

# Fix: a larger sentence window keeps the components and the total in one chunk
print("\nFIX: Sentence-based chunking with a larger window (max_sentences=4)")
chunks_fixed = DocumentChunker.sentence_based(text, max_sentences=4)
for i, chunk in enumerate(chunks_fixed):
    print(f"  Chunk {i}: '{chunk.strip()[:100]}'")
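
The overlapping-chunks fix from the failure-mode table can be sketched as a standalone sliding window over sentences. This does not use the notebook's `DocumentChunker`. The function `sliding_sentence_chunks` and its `window` and `overlap` parameters are assumptions for illustration, and the sentence splitter is deliberately naive.

```python
import re

def sliding_sentence_chunks(text, window=2, overlap=1):
    """Split text into chunks of `window` sentences that share `overlap` sentences."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks

text = ("The transformer uses multi-head attention with 8 heads. "
        "Each head has dimension 64. "
        "The total dimension is therefore 8 * 64 = 512.")
for i, chunk in enumerate(sliding_sentence_chunks(text)):
    print(i, chunk)
```

With `window=2` and `overlap=1`, every sentence boundary is covered by some chunk, so a fact and the sentence that follows it are always retrievable together. The cost is roughly twice as many chunks to embed and store.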

In [None]:
# Demonstrate hybrid search (keyword + semantic)
class HybridSearch:
    """Combine keyword (BM25-like) and semantic search."""
    
    def __init__(self, vector_store, alpha=0.5):
        self.store = vector_store
        self.alpha = alpha  # 1.0 = pure semantic, 0.0 = pure keyword
    
    def keyword_score(self, query, chunk):
        """Simple keyword-match score with BM25-style term-frequency saturation.

        Unlike full BM25, this omits IDF weighting and document-length normalization.
        """
        q_tokens = set(re.findall(r'\b[a-zA-Z]{3,}\b', query.lower()))
        c_counts = Counter(re.findall(r'\b[a-zA-Z]{3,}\b', chunk.lower()))
        
        if not q_tokens:
            return 0.0
        
        # Saturating TF: each matched term contributes count / (count + 1) < 1
        tf_score = 0.0
        for token in q_tokens:
            count = c_counts[token]  # 0 for terms absent from the chunk
            tf_score += count / (count + 1.0)
        
        return tf_score / len(q_tokens)
    
    def search(self, query, top_k=3):
        """Hybrid search combining semantic and keyword scores."""
        # Get all semantic scores
        q_emb = self.store.embedder.embed(query)
        
        scored = []
        for i, (emb, doc, meta) in enumerate(zip(
            self.store.embeddings, self.store.documents, self.store.metadata
        )):
            semantic = np.dot(q_emb, emb)
            keyword = self.keyword_score(query, doc)
            combined = self.alpha * semantic + (1 - self.alpha) * keyword
            scored.append({
                'chunk': doc, 'score': combined,
                'semantic_score': semantic, 'keyword_score': keyword,
                'metadata': meta, 'index': i
            })
        
        scored.sort(key=lambda x: x['score'], reverse=True)
        return scored[:top_k]


# Compare semantic vs hybrid
hybrid = HybridSearch(store, alpha=0.6)

query = "Bradley-Terry preference model"
semantic_results = store.search(query, top_k=3)
hybrid_results = hybrid.search(query, top_k=3)

print(f"Query: '{query}'\n")
print("Semantic search:")
for r in semantic_results:
    print(f"  [{r['score']:.3f}] {r['metadata']['source']}")

print("\nHybrid search:")
for r in hybrid_results:
    print(f"  [{r['score']:.3f}] sem={r['semantic_score']:.3f} kw={r['keyword_score']:.3f} "
          f"{r['metadata']['source']}")
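
In `HybridSearch.search`, the semantic and keyword scores are mixed directly even though they live on different scales: cosine similarities often cluster in a narrow band, while keyword scores can spread from 0 toward 1. One common option, sketched here under that assumption, is to min-max normalize each score list over the candidate set before applying `alpha`. The helpers `min_max` and `combine_scores` are hypothetical names, not part of the notebook.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant or empty list maps to zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine_scores(semantic, keyword, alpha=0.5):
    """Weighted sum of normalized semantic and keyword scores per candidate."""
    sem_n, kw_n = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem_n, kw_n)]

semantic = [0.82, 0.80, 0.78]   # cosine similarities, tightly clustered
keyword = [0.0, 0.5, 0.0]       # only the second chunk matches the query terms
combined = combine_scores(semantic, keyword, alpha=0.6)

ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

After normalization, `alpha` weights the two signals as intended, and the chunk with the exact keyword match can overtake a marginally closer semantic neighbor. Min-max scaling is sensitive to outliers, so rank-based fusion is another option worth comparing.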

---

## 9. Advanced RAG Patterns

Production RAG systems go beyond basic retrieve-and-generate:

| Pattern | Description | When to Use | F1 Parallel |
|---------|------------|-------------|-------------|
| **Query expansion** | Rewrite query for better retrieval | Ambiguous or short queries | Engineer says "tires" — expand to "tire degradation compound temperature graining blistering" |
| **HyDE** | Generate hypothetical answer, embed that | Query-document vocabulary mismatch | Generate what a good race report *would* say, then find actual reports like it |
| **Multi-step RAG** | Retrieve → reason → retrieve again | Complex multi-hop questions | "Find races with similar weather, then from those find ones with similar tire strategy outcomes" |
| **Parent-child chunking** | Retrieve small chunks, return parent | Need precision + context | Find the exact sentence about tire life, but return the full strategy section for context |
| **Metadata filtering** | Filter by date, source, category | Large corpora with structure | Only search races from this season, or only search data from this specific circuit |
| **Agentic RAG** | Agent decides when/what to retrieve | Open-ended tasks | The strategy model decides whether it needs historical data or can answer from current telemetry alone |
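
Parent-child chunking from the table can be sketched in a few lines: score small sentence-level chunks for precision, then hand the enclosing section to the generator for context. This is a simplified illustration. The function names are hypothetical, and `keyword_overlap` is a stand-in scorer where a real system would use embedding similarity.

```python
import re

def build_parent_child_index(sections):
    """Map each sentence-level child chunk to the index of its parent section."""
    children = []  # list of (child_text, parent_index)
    for parent_idx, section in enumerate(sections):
        for sentence in re.split(r"(?<=[.!?])\s+", section.strip()):
            if sentence:
                children.append((sentence, parent_idx))
    return children

def keyword_overlap(query, text):
    """Stand-in scorer (an assumption): number of shared lowercase tokens."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    t = set(re.findall(r"[a-z]+", text.lower()))
    return len(q & t)

def retrieve_parent(query, sections):
    """Score the small child chunks, then return the best child's full parent."""
    children = build_parent_child_index(sections)
    if not children:
        return None
    best_child, parent_idx = max(children, key=lambda c: keyword_overlap(query, c[0]))
    return best_child, sections[parent_idx]

sections = [
    "Word embeddings map tokens to dense vectors. Cosine similarity compares them.",
    "The reward model scores responses. PPO then updates the policy against it.",
]
child, parent = retrieve_parent("How does PPO update the policy?", sections)
print("Matched child:", child)
print("Returned parent:", parent)
```

The match is made on the single PPO sentence, but the generator receives the whole section, including the sentence about the reward model that the PPO sentence depends on.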

In [None]:
# Implement HyDE (Hypothetical Document Embeddings)
class HyDE:
    """Hypothetical Document Embeddings.
    
    Instead of embedding the query directly, generate a hypothetical answer
    and embed THAT. The hypothetical answer will use vocabulary closer to
    the actual documents.
    """
    
    def __init__(self, vector_store):
        self.store = vector_store
    
    def generate_hypothetical(self, query):
        """Simulate generating a hypothetical answer.
        In production, this would use an LLM."""
        # Simple simulation: expand query with related terms
        expansions = {
            'attention': 'self-attention computes query key value matrices dot product softmax weights',
            'backprop': 'backpropagation chain rule gradients loss function weights update',
            'rlhf': 'reinforcement learning human feedback reward model PPO policy optimization',
            'embedding': 'word embeddings dense vectors semantic similarity cosine distance',
        }
        
        hypothetical = query
        for key, expansion in expansions.items():
            if key in query.lower():
                hypothetical = f"{query} {expansion}"
                break
        
        return hypothetical
    
    def search(self, query, top_k=3):
        """Search using hypothetical document embedding."""
        hypothetical = self.generate_hypothetical(query)
        
        # Embed the hypothetical answer instead of the raw query
        query_emb = self.store.embedder.embed(hypothetical)
        
        similarities = np.array([
            np.dot(query_emb, emb) for emb in self.store.embeddings
        ])
        
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = []
        for idx in top_indices:
            results.append({
                'chunk': self.store.documents[idx],
                'score': similarities[idx],
                'metadata': self.store.metadata[idx],
            })
        return results


hyde = HyDE(store)
query = "How does attention work?"

normal_results = store.search(query, top_k=3)
hyde_results = hyde.search(query, top_k=3)

print(f"Query: '{query}'\n")
print("Standard retrieval:")
for r in normal_results:
    print(f"  [{r['score']:.3f}] {r['metadata']['source']}")
print("\nHyDE retrieval:")
for r in hyde_results:
    print(f"  [{r['score']:.3f}] {r['metadata']['source']}")

---

## Exercises

### Exercise 1: Chunking Strategy Comparison — Race Report Optimization

Imagine you're building a RAG system for an F1 team's race report archive. Run the full RAG evaluation with each chunking strategy (fixed, sentence, paragraph). Which strategy gives the best retrieval precision? Think about how this applies to chunking race reports: would you split by fixed character count (potentially cutting mid-sentence about tire compounds), by sentence (natural boundaries), or by paragraph (one topic per chunk, like "pit stop analysis" or "weather impact")? Does the best chunking strategy depend on the query type?

In [None]:
# Exercise 1: Your code here
# Hint: Build separate vector stores for each chunking strategy,
# run the same eval queries, and compare P@1, P@3, MRR


### Exercise 2: Multi-Hop RAG — Cross-Referencing Race Data

Implement a two-step RAG system where the first retrieval provides context that helps reformulate the query for a second retrieval. This mirrors how an F1 strategist might first look up "races with similar tire degradation patterns" and then use those results to refine their search to "what pit stop strategies worked in those specific races?" Test with the question: "What algorithm from NB20 uses the mechanism from NB12?"

In [None]:
# Exercise 2: Your code here
# Hint: Retrieve once, extract key terms from results,
# reformulate query, retrieve again


### Exercise 3: RAG with Metadata Filtering — Circuit-Specific Search

Extend the VectorStore to support metadata filtering (e.g., only search within `source='transformers'`). In F1 terms, this is like filtering your race database to only search within a specific circuit or season — "only show me data from Silverstone" or "only recent regulation era." Show how filtering improves precision when the user specifies a topic.

In [None]:
# Exercise 3: Your code here
# Hint: Add a filter parameter to VectorStore.search()
# that skips chunks not matching the filter criteria


---

## Summary

### Key Concepts

| Concept | Definition | F1 Parallel |
|---------|-----------|-------------|
| **RAG** | Augments LLMs with external knowledge at inference time — no retraining needed | Race engineer querying the team's historical database mid-race instead of relying on memory |
| **Chunking strategy** | Splitting documents into searchable pieces — too small loses context, too large dilutes relevance | Breaking race reports into sections: tire strategy, weather, pit stops — right granularity matters |
| **Vector stores** | Index embeddings for fast similarity search using cosine similarity | The team's knowledge base indexed by situation similarity, not just date or circuit name |
| **Reranking** | Rescoring candidates with a cross-encoder for improved precision | Senior strategist reviewing the initial search results and filtering to the truly relevant ones |
| **Hybrid search** | Combining semantic + keyword search often beats either alone, especially for rare terms and exact names | Matching both by meaning ("similar conditions") and keywords ("medium compound, lap 25") |
| **Retrieval metrics** | P@k, Recall@k, MRR quantify how well the right chunks are found | Measuring if the pit wall is pulling up the right historical data when it matters |
| **Advanced patterns** | HyDE, multi-hop RAG, and agentic RAG handle complex queries | Multi-step data lookups: find similar weather, then find strategies that worked in that weather |

### Fundamental Insight

RAG transforms LLMs from closed-book test-takers into open-book researchers. By separating knowledge storage (vector store) from reasoning (LLM), RAG systems stay current, reduce hallucination, and work with private data — making them the most practical pattern for production AI systems. In F1 terms, you're giving the strategy model access to the full team archive instead of asking it to memorize everything from training. The model's job is to *reason* over the retrieved data, not to *store* all the data in its weights.

---

## Next Steps

RAG retrieves knowledge, but sometimes an AI system needs to **take actions** — search the web, call APIs, write code, or interact with tools. In **Notebook 22: AI Agents & Tool Use**, we'll build agents that reason about *what to do* and *when to do it*, using the ReAct pattern and multi-step planning. In F1 terms, we're moving from the strategist who *looks up data* to the race engineer who *makes decisions and executes them* — querying the weather API, running the tire model, adjusting the fuel calculator, and radioing the driver with the call.