# Module 11: Advanced RAG - Evaluation, Optimization & Production

This notebook covers advanced techniques to improve RAG systems beyond the basic pipeline:

1. **RAG Evaluation** - Measuring retrieval and generation quality
2. **Query Transformation** - HyDE and query decomposition
3. **Re-ranking** - Cross-encoder re-ranking for better precision
4. **Hybrid Search** - Combining BM25 (keyword) with vector search
5. **Failure Analysis** - Debugging RAG systems

**Prerequisites:** Modules 9 (RAG Foundations) and 10 (RAG Pipeline)

## 1. Setup

In [None]:
!pip install -q chromadb openai python-dotenv sentence-transformers rank-bm25 matplotlib numpy

In [None]:
import os
import json
import math
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

from dotenv import load_dotenv
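# Loads OPENAI_API_KEY from a .env file found from the working directory upward;
# pass an explicit path to load_dotenv() if your .env lives elsewhere
load_dotenv()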

from openai import OpenAI
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi

client = OpenAI()
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("Setup complete!")

## 2. Build the Knowledge Base

Let's create a knowledge base of AI/ML documents to work with throughout this notebook.

In [None]:
# Knowledge base: paragraphs about AI/ML topics with metadata
documents = [
    {"id": "doc_0", "text": "Neural networks are computing systems inspired by biological neural networks in the brain. They consist of layers of interconnected nodes (neurons) that process information. Each connection has a weight that adjusts during training. Deep learning refers to neural networks with many hidden layers, enabling them to learn hierarchical representations of data.", "category": "deep_learning", "topic": "neural_networks"},
    {"id": "doc_1", "text": "Convolutional Neural Networks (CNNs) are specialized for processing grid-like data such as images. They use convolutional layers that apply filters to detect features like edges, textures, and patterns. Pooling layers reduce spatial dimensions while preserving important features. CNNs have revolutionized computer vision, achieving superhuman performance on image classification tasks like ImageNet.", "category": "deep_learning", "topic": "cnn"},
    {"id": "doc_2", "text": "Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that captures information from previous time steps. Long Short-Term Memory (LSTM) networks address the vanishing gradient problem with gating mechanisms. GRUs are a simplified variant. RNNs have been largely superseded by Transformers for most NLP tasks due to parallelization advantages.", "category": "deep_learning", "topic": "rnn"},
    {"id": "doc_3", "text": "The Transformer architecture, introduced in 'Attention Is All You Need' (2017), relies entirely on self-attention mechanisms. It processes all positions in parallel, unlike RNNs. The architecture consists of encoder and decoder stacks, each with multi-head attention and feed-forward layers. Transformers are the foundation of modern LLMs like GPT, BERT, and T5.", "category": "deep_learning", "topic": "transformers"},
    {"id": "doc_4", "text": "Transfer learning allows a model trained on one task to be fine-tuned for another. In NLP, pre-trained language models like BERT are fine-tuned on downstream tasks such as sentiment analysis, named entity recognition, and question answering. This approach dramatically reduces the amount of labeled data needed and improves performance on specialized tasks.", "category": "nlp", "topic": "transfer_learning"},
    {"id": "doc_5", "text": "BERT (Bidirectional Encoder Representations from Transformers) uses masked language modeling for pre-training, randomly masking 15% of tokens and predicting them. Unlike GPT, BERT processes text bidirectionally, making it excellent for understanding tasks. BERT-base has 110M parameters and BERT-large has 340M parameters.", "category": "nlp", "topic": "bert"},
    {"id": "doc_6", "text": "GPT (Generative Pre-trained Transformer) models use autoregressive (left-to-right) language modeling. GPT-3 has 175 billion parameters and demonstrated strong few-shot learning abilities. GPT-4 is multimodal, accepting both text and image inputs. The GPT family uses decoder-only transformer architecture with causal attention masking.", "category": "nlp", "topic": "gpt"},
    {"id": "doc_7", "text": "Retrieval-Augmented Generation (RAG) combines a retriever that finds relevant documents with a generator that produces answers. RAG addresses LLM limitations including knowledge cutoff, hallucination, and lack of domain-specific knowledge. The retriever typically uses dense embeddings and vector similarity search to find relevant passages.", "category": "rag", "topic": "rag_overview"},
    {"id": "doc_8", "text": "Vector databases store high-dimensional embedding vectors and support efficient similarity search. Popular options include ChromaDB (lightweight, local), Pinecone (cloud-native, managed), FAISS (Facebook's library for billion-scale search), and Weaviate (open-source, hybrid search). The choice depends on scale, deployment model, and feature requirements.", "category": "rag", "topic": "vector_databases"},
    {"id": "doc_9", "text": "Chunking strategies significantly impact RAG quality. Fixed-size chunking splits text at character boundaries with overlap. Sentence-based chunking preserves semantic units. Recursive character splitting tries multiple separators (paragraphs, sentences, words). Semantic chunking groups sentences by embedding similarity. The optimal chunk size balances context (larger) with precision (smaller), typically 200-500 tokens.", "category": "rag", "topic": "chunking"},
    {"id": "doc_10", "text": "Embedding models convert text into dense vector representations that capture semantic meaning. Sentence-BERT and its successors (E5, GTE, BGE) are trained with contrastive learning for semantic similarity. OpenAI's text-embedding-ada-002 and text-embedding-3-small provide high-quality embeddings via API. The choice of embedding model significantly affects retrieval quality.", "category": "rag", "topic": "embeddings"},
    {"id": "doc_11", "text": "Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences. The process involves supervised fine-tuning on demonstration data, training a reward model on human preference comparisons, and optimizing the LLM policy using PPO. RLHF was crucial in making ChatGPT helpful and safe. DPO (Direct Preference Optimization) offers a simpler alternative.", "category": "training", "topic": "rlhf"},
    {"id": "doc_12", "text": "Prompt engineering is the practice of designing effective inputs for LLMs. Key techniques include zero-shot prompting, few-shot prompting with examples, chain-of-thought reasoning for complex problems, and role prompting to set behavior. Structured output formats like JSON can be enforced through careful prompt design and system messages.", "category": "nlp", "topic": "prompt_engineering"},
    {"id": "doc_13", "text": "AI agents are autonomous systems that use LLMs to reason, plan, and take actions through tools. The ReAct pattern interleaves reasoning (Thought) with actions (Action/Observation). Multi-agent systems coordinate multiple specialized agents for complex tasks. LangGraph provides a framework for building agent workflows as state machines with conditional routing.", "category": "agents", "topic": "ai_agents"},
    {"id": "doc_14", "text": "Fine-tuning adapts a pre-trained model to specific tasks or domains. Full fine-tuning updates all parameters but requires significant compute. Parameter-efficient methods like LoRA (Low-Rank Adaptation) freeze the base model and train small adapter matrices, reducing compute by 10-100x. QLoRA combines quantization with LoRA for even greater efficiency.", "category": "training", "topic": "fine_tuning"},
    {"id": "doc_15", "text": "Attention mechanisms allow models to focus on relevant parts of the input. Self-attention computes query-key-value interactions within a sequence. Cross-attention allows one sequence to attend to another (e.g., decoder attending to encoder). Multi-head attention runs multiple attention computations in parallel, each learning different relationship types.", "category": "deep_learning", "topic": "attention"},
    {"id": "doc_16", "text": "Tokenization converts text into numerical tokens for model processing. Subword tokenization methods like BPE (Byte Pair Encoding) and WordPiece balance vocabulary size with coverage. SentencePiece provides language-agnostic tokenization. Typical vocabulary sizes range from 30K (BERT) to 100K+ (GPT-4). Tokenization affects model performance, multilingual capability, and inference cost.", "category": "nlp", "topic": "tokenization"},
    {"id": "doc_17", "text": "Evaluation metrics for RAG systems include retrieval metrics (precision@k, recall@k, MRR) and generation metrics (faithfulness, relevance, answer correctness). The RAGAS framework provides automated evaluation combining context relevance, answer faithfulness, and answer relevance. Human evaluation remains the gold standard for assessing RAG system quality.", "category": "rag", "topic": "evaluation"},
    {"id": "doc_18", "text": "Cosine similarity measures the angle between two vectors, ranging from -1 to 1. It is the most common distance metric for comparing embeddings because it is invariant to vector magnitude. Dot product is faster but affected by magnitude. Euclidean distance measures absolute distance. For normalized vectors, cosine similarity equals dot product.", "category": "rag", "topic": "distance_metrics"},
    {"id": "doc_19", "text": "Generative Adversarial Networks (GANs) consist of a generator and discriminator trained adversarially. The generator creates synthetic data while the discriminator distinguishes real from fake. GANs produce high-quality images but suffer from training instability and mode collapse. Diffusion models have largely replaced GANs for image generation due to better training stability and quality.", "category": "deep_learning", "topic": "gans"}
]

# Embed and store in ChromaDB
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="ai_knowledge_base", metadata={"hnsw:space": "cosine"})

texts = [doc["text"] for doc in documents]
ids = [doc["id"] for doc in documents]
metadatas = [{"category": doc["category"], "topic": doc["topic"]} for doc in documents]

embeddings = embedding_model.encode(texts).tolist()

collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=ids,
    metadatas=metadatas
)

print(f"Knowledge base loaded: {collection.count()} documents")
print(f"Categories: {set(doc['category'] for doc in documents)}")

## 3. RAG Evaluation Metrics

Before improving RAG, we need to **measure** how good it is. There are two types of metrics:

### Retrieval Metrics
- **Precision@k**: Of the k documents retrieved, how many are relevant?
- **Recall@k**: Of all relevant documents, how many did we retrieve?
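- **MRR (Mean Reciprocal Rank)**: The average, over all queries, of 1/rank of the first relevant result (1.0 means a relevant document is always ranked first; 0 is used when none is retrieved)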

### Generation Metrics
- **Faithfulness**: Is the answer supported by the retrieved context?
- **Relevance**: Does the answer address the question?

In [None]:
# Create an evaluation dataset: questions with known relevant document IDs
eval_dataset = [
    {
        "question": "How do transformers work?",
        "relevant_ids": ["doc_3", "doc_15"]
    },
    {
        "question": "What is RAG and why is it needed?",
        "relevant_ids": ["doc_7", "doc_8"]
    },
    {
        "question": "How are GPT models trained?",
        "relevant_ids": ["doc_6", "doc_11"]
    },
    {
        "question": "What are the best chunking strategies for RAG?",
        "relevant_ids": ["doc_9"]
    },
    {
        "question": "What is the difference between BERT and GPT?",
        "relevant_ids": ["doc_5", "doc_6"]
    },
    {
        "question": "How do embedding models work for search?",
        "relevant_ids": ["doc_10", "doc_18"]
    },
    {
        "question": "What is fine-tuning and LoRA?",
        "relevant_ids": ["doc_14", "doc_4"]
    },
    {
        "question": "How do AI agents use tools?",
        "relevant_ids": ["doc_13"]
    }
]

print(f"Evaluation dataset: {len(eval_dataset)} questions")
for item in eval_dataset:
    print(f"  Q: {item['question']}")
    print(f"     Relevant: {item['relevant_ids']}")
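
Because every retrieval metric below depends on the `relevant_ids` labels, a mistyped ID silently lowers recall. A minimal sanity check, sketched here on toy data, can catch that before any evaluation runs. The helper name `find_unknown_ids` and the toy lists are illustrative assumptions, not part of the notebook's pipeline.

```python
def find_unknown_ids(eval_data, known_ids):
    """Return {question: [ids not present in known_ids]} for entries with bad labels."""
    known = set(known_ids)
    problems = {}
    for item in eval_data:
        missing = [doc_id for doc_id in item["relevant_ids"] if doc_id not in known]
        if missing:
            problems[item["question"]] = missing
    return problems

# Toy data (hypothetical): the second entry contains a mistyped ID
toy_known_ids = ["doc_0", "doc_1", "doc_2"]
toy_eval = [
    {"question": "ok question", "relevant_ids": ["doc_0", "doc_2"]},
    {"question": "typo question", "relevant_ids": ["doc_1", "doc_21"]},
]

print(find_unknown_ids(toy_eval, toy_known_ids))
```

In the notebook itself, the same check could be run as `find_unknown_ids(eval_dataset, ids)`; an empty dict means every label points at a real document.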

In [None]:
# Implement retrieval evaluation metrics

def precision_at_k(retrieved_ids, relevant_ids, k):
    """What fraction of the top-k retrieved documents are relevant?"""
    top_k = retrieved_ids[:k]
    relevant_in_top_k = len(set(top_k) & set(relevant_ids))
    return relevant_in_top_k / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """What fraction of all relevant documents appear in the top-k?"""
    top_k = retrieved_ids[:k]
    relevant_in_top_k = len(set(top_k) & set(relevant_ids))
    return relevant_in_top_k / len(relevant_ids) if relevant_ids else 0

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant document."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

# Demo with a sample query
sample_retrieved = ["doc_3", "doc_2", "doc_15", "doc_5", "doc_6"]
sample_relevant = ["doc_3", "doc_15"]

for k in [1, 3, 5]:
    p = precision_at_k(sample_retrieved, sample_relevant, k)
    r = recall_at_k(sample_retrieved, sample_relevant, k)
    print(f"k={k}: Precision@{k}={p:.3f}, Recall@{k}={r:.3f}")

rr = reciprocal_rank(sample_retrieved, sample_relevant)
print(f"Reciprocal Rank: {rr:.3f} (first relevant doc at position 1)")
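
The reciprocal rank above is computed for a single query; MRR is the mean of that value over all queries. The sketch below works through three toy queries by hand. The function name `first_hit_reciprocal_rank` and the toy runs are illustrative assumptions, chosen so they do not clash with the notebook's own helpers.

```python
from statistics import mean

def first_hit_reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

# Each toy run is (retrieved_ids, relevant_ids)
toy_runs = [
    (["doc_3", "doc_2", "doc_15"], ["doc_3", "doc_15"]),  # first hit at rank 1
    (["doc_1", "doc_7", "doc_8"], ["doc_7"]),             # first hit at rank 2
    (["doc_4", "doc_5", "doc_6"], ["doc_9"]),             # no relevant document retrieved
]

rr_values = [first_hit_reciprocal_rank(ret, rel) for ret, rel in toy_runs]
mrr = mean(rr_values)
print(f"Reciprocal ranks: {rr_values}, MRR: {mrr:.3f}")
```

Note that a query with no relevant hit contributes 0.0 to the mean rather than being skipped, which is why a single miss lowers MRR noticeably on a small evaluation set.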

In [None]:
def retrieve(query, collection, k=5):
    """Retrieve top-k documents from ChromaDB."""
    query_embedding = embedding_model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k
    )
    return results["ids"][0], results["documents"][0], results["distances"][0]

def evaluate_retrieval(eval_data, collection, k=5):
    """Evaluate retrieval across all questions."""
    precisions = []
    recalls = []
    mrr_scores = []
    
    for item in eval_data:
        retrieved_ids, _, _ = retrieve(item["question"], collection, k)
        
        p = precision_at_k(retrieved_ids, item["relevant_ids"], k)
        r = recall_at_k(retrieved_ids, item["relevant_ids"], k)
        rr = reciprocal_rank(retrieved_ids, item["relevant_ids"])
        
        precisions.append(p)
        recalls.append(r)
        mrr_scores.append(rr)
    
    return {
        f"precision@{k}": np.mean(precisions),
        f"recall@{k}": np.mean(recalls),
        "mrr": np.mean(mrr_scores)
    }

# Baseline evaluation
baseline_metrics = evaluate_retrieval(eval_dataset, collection, k=3)
print("Baseline Retrieval Metrics (k=3):")
for metric, value in baseline_metrics.items():
    print(f"  {metric}: {value:.3f}")

## Exercise 1: Build a RAG Evaluation Pipeline

**Your Task:** Implement a comprehensive evaluation pipeline that measures both retrieval quality and a simple faithfulness check.

**Steps:**
1. Run retrieval for each question in the eval dataset
2. Compute precision@k and MRR
3. For faithfulness: check whether the key terms in the generated answer also appear in the retrieved docs

**Hints:**
- Use the `evaluate_retrieval()` function as a starting point
- For faithfulness, a simple keyword overlap ratio works as a rough approximation

In [None]:
def evaluate_rag_pipeline(eval_data, collection, k=3):
    """
    Evaluate both retrieval and generation quality.
    
    Returns:
        dict with precision@k, recall@k, mrr, and avg_faithfulness
    """
    # TODO: Implement evaluation pipeline
    
    results = []
    
    for item in eval_data:
        # 1. Retrieve documents
        retrieved_ids, retrieved_docs, _ = None, None, None  # Your code
        
        # 2. Compute retrieval metrics
        p_at_k = None  # Your code
        rr = None  # Your code
        
        # 3. Generate answer using retrieved context
        context = None  # Your code: join retrieved_docs
        answer = None  # Your code: call LLM with context + question
        
        # 4. Simple faithfulness: fraction of answer keywords found in the context
        faithfulness = None  # Your code
        
        results.append({"p_at_k": p_at_k, "rr": rr, "faithfulness": faithfulness})
    
    return {
        f"precision@{k}": np.mean([r["p_at_k"] for r in results]),
        "mrr": np.mean([r["rr"] for r in results]),
        "avg_faithfulness": np.mean([r["faithfulness"] for r in results])
    }

# Test
# metrics = evaluate_rag_pipeline(eval_dataset[:3], collection, k=3)
# print(metrics)

### Solution for Exercise 1

In [None]:
def generate_answer(question, context):
    """Generate an answer using OpenAI given context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question based on the provided context. Be concise."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content

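def simple_faithfulness(context, answer):
    """Rough faithfulness proxy: fraction of answer keywords that also appear in the context."""
    # Strip surrounding punctuation so "layers." matches "layers"
    strip_chars = ".,;:!?()[]{}\"'"
    context_words = {w.strip(strip_chars) for w in context.lower().split()}
    answer_words = {w.strip(strip_chars) for w in answer.lower().split()}
    # Keep longer words only (length > 4) as a crude stopword filter
    answer_keywords = {w for w in answer_words if len(w) > 4}
    if not answer_keywords:
        return 1.0
    # Normalize by the answer, not the context: a concise but grounded answer should score high
    overlap = answer_keywords & context_words
    return len(overlap) / len(answer_keywords)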

def evaluate_rag_pipeline(eval_data, collection, k=3):
    """Evaluate both retrieval and generation quality - SOLUTION."""
    results = []
    
    for item in eval_data:
        # 1. Retrieve documents
        retrieved_ids, retrieved_docs, _ = retrieve(item["question"], collection, k)
        
        # 2. Retrieval metrics
        p_at_k = precision_at_k(retrieved_ids, item["relevant_ids"], k)
        rr = reciprocal_rank(retrieved_ids, item["relevant_ids"])
        
        # 3. Generate answer
        context = "\n\n".join(retrieved_docs)
        answer = generate_answer(item["question"], context)
        
        # 4. Faithfulness
        faith = simple_faithfulness(context, answer)
        
        results.append({"p_at_k": p_at_k, "rr": rr, "faithfulness": faith,
                       "question": item["question"], "answer": answer[:100]})
        print(f"Q: {item['question']}")
        print(f"  P@{k}={p_at_k:.2f}, RR={rr:.2f}, Faith={faith:.2f}")
        print(f"  A: {answer[:100]}...\n")
    
    return {
        f"precision@{k}": np.mean([r["p_at_k"] for r in results]),
        "mrr": np.mean([r["rr"] for r in results]),
        "avg_faithfulness": np.mean([r["faithfulness"] for r in results])
    }

metrics = evaluate_rag_pipeline(eval_dataset[:4], collection, k=3)
print("\n" + "="*50)
print("Overall Metrics:")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")
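
A keyword-overlap score can be normalized by the answer's keywords or by the context's keywords, and the two choices measure different things. The sketch below contrasts them on a toy example: dividing by the answer's keywords tracks the faithfulness definition ("is the answer supported by the context?"), while dividing by the context's keywords measures how much of the context the answer repeats. The names `keywords`, `answer_support`, and `context_coverage` are illustrative assumptions, and this remains a crude proxy rather than a real faithfulness metric.

```python
import string

def keywords(text, min_len=5):
    """Lowercase, strip surrounding punctuation, keep words of at least min_len characters."""
    cleaned = (w.strip(string.punctuation) for w in text.lower().split())
    return {w for w in cleaned if len(w) >= min_len}

def answer_support(context, answer):
    """Fraction of answer keywords that also occur in the context."""
    ans = keywords(answer)
    return len(ans & keywords(context)) / len(ans) if ans else 1.0

def context_coverage(context, answer):
    """Fraction of context keywords that the answer repeats."""
    ctx = keywords(context)
    return len(ctx & keywords(answer)) / len(ctx) if ctx else 1.0

context = "Pooling layers reduce spatial dimensions while preserving important features."
grounded = "Pooling layers reduce spatial dimensions."
ungrounded = "Pooling layers increase training accuracy dramatically."

for label, answer in [("grounded", grounded), ("ungrounded", ungrounded)]:
    print(f"{label}: support={answer_support(context, answer):.2f}, "
          f"coverage={context_coverage(context, answer):.2f}")
```

The grounded answer is fully supported yet covers only part of the context, so a coverage-style score would penalize it for being concise. With three retrieved documents as context, that effect becomes much stronger.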

---

## 4. Query Transformation: HyDE

**HyDE (Hypothetical Document Embeddings)** improves retrieval by:
1. Asking the LLM to generate a **hypothetical answer** to the question
2. Embedding that hypothetical answer (instead of the raw question)
3. Using the hypothetical answer's embedding for retrieval

**Intuition:** The hypothetical answer is stylistically closer to the actual documents in the knowledge base than a short question would be. This bridges the "query-document gap."

```
Standard:  "How do transformers work?" → embed question → search
HyDE:      "How do transformers work?" → LLM generates paragraph → embed paragraph → search
```

In [None]:
def hyde_retrieve(question, collection, k=5):
    """
    HyDE: Generate hypothetical document, embed it, use for retrieval.
    """
    # Step 1: Generate hypothetical answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a detailed paragraph answering the question. Write as if this is from a textbook. Do not say 'I don't know'."},
            {"role": "user", "content": question}
        ],
        temperature=0.0
    )
    hypothetical_doc = response.choices[0].message.content
    
    # Step 2: Embed the hypothetical document (not the question!)
    hyde_embedding = embedding_model.encode([hypothetical_doc]).tolist()
    
    # Step 3: Search with hypothetical document's embedding
    results = collection.query(
        query_embeddings=hyde_embedding,
        n_results=k
    )
    
    return results["ids"][0], results["documents"][0], results["distances"][0], hypothetical_doc

# Compare standard vs HyDE retrieval
test_question = "What approaches exist for making LLMs generate better outputs?"

print("=" * 60)
print("STANDARD RETRIEVAL")
print("=" * 60)
std_ids, std_docs, std_dists = retrieve(test_question, collection, k=3)
for i, (doc_id, doc, dist) in enumerate(zip(std_ids, std_docs, std_dists)):
    print(f"\n{i+1}. [{doc_id}] (dist={dist:.3f})")
    print(f"   {doc[:100]}...")

print("\n" + "=" * 60)
print("HyDE RETRIEVAL")
print("=" * 60)
hyde_ids, hyde_docs, hyde_dists, hypo = hyde_retrieve(test_question, collection, k=3)
print(f"\nHypothetical document:\n{hypo[:200]}...\n")
for i, (doc_id, doc, dist) in enumerate(zip(hyde_ids, hyde_docs, hyde_dists)):
    print(f"{i+1}. [{doc_id}] (dist={dist:.3f})")
    print(f"   {doc[:100]}...")

## Exercise 2: Implement HyDE and Compare Retrieval Quality

**Your Task:** Evaluate HyDE vs standard retrieval across the full evaluation dataset.

**Steps:**
1. Run both standard and HyDE retrieval for each question
2. Compute precision@3 and MRR for both
3. Compare the results

In [None]:
def compare_hyde_vs_standard(eval_data, collection, k=3):
    """
    Compare standard retrieval vs HyDE on the evaluation dataset.
    """
    # TODO: Implement comparison
    
    standard_results = []
    hyde_results = []
    
    for item in eval_data:
        # Standard retrieval
        std_ids = None  # Your code
        std_p = None    # Your code: precision@k
        std_rr = None   # Your code: reciprocal rank
        
        # HyDE retrieval
        hyde_ids = None  # Your code
        hyde_p = None    # Your code
        hyde_rr = None   # Your code
        
        standard_results.append({"p": std_p, "rr": std_rr})
        hyde_results.append({"p": hyde_p, "rr": hyde_rr})
    
    # Print comparison
    pass  # Your code

# compare_hyde_vs_standard(eval_dataset[:4], collection)

### Solution for Exercise 2

In [None]:
def compare_hyde_vs_standard(eval_data, collection, k=3):
    """Compare standard retrieval vs HyDE - SOLUTION."""
    standard_results = []
    hyde_results = []
    
    for item in eval_data:
        question = item["question"]
        relevant = item["relevant_ids"]
        
        # Standard retrieval
        std_ids, _, _ = retrieve(question, collection, k)
        std_p = precision_at_k(std_ids, relevant, k)
        std_rr = reciprocal_rank(std_ids, relevant)
        
        # HyDE retrieval
        hyde_ids, _, _, _ = hyde_retrieve(question, collection, k)
        hyde_p = precision_at_k(hyde_ids, relevant, k)
        hyde_rr = reciprocal_rank(hyde_ids, relevant)
        
        standard_results.append({"p": std_p, "rr": std_rr})
        hyde_results.append({"p": hyde_p, "rr": hyde_rr})
        
        better = "HyDE" if hyde_p > std_p else ("Standard" if std_p > hyde_p else "Tie")
        print(f"Q: {question}")
        print(f"  Standard P@{k}={std_p:.2f} | HyDE P@{k}={hyde_p:.2f} | Winner: {better}")
    
    # Summary
    print("\n" + "=" * 50)
    print(f"Average P@{k} - Standard: {np.mean([r['p'] for r in standard_results]):.3f}")
    print(f"Average P@{k} - HyDE:     {np.mean([r['p'] for r in hyde_results]):.3f}")
    print(f"Average MRR  - Standard: {np.mean([r['rr'] for r in standard_results]):.3f}")
    print(f"Average MRR  - HyDE:     {np.mean([r['rr'] for r in hyde_results]):.3f}")

compare_hyde_vs_standard(eval_dataset[:4], collection)

---

## 5. Re-ranking with Cross-Encoders

Retrieval uses **bi-encoders**, which encode the query and the documents separately; this is fast but less accurate. **Cross-encoders** process the query and a document together, which typically gives more accurate relevance scores at the cost of speed, since every (query, document) pair needs its own forward pass.

**Pipeline:** Bi-encoder retrieves top-20 candidates → Cross-encoder re-ranks → Return top-5

```
Stage 1 (fast):  Bi-encoder retrieves ~20 candidates from millions
Stage 2 (accurate):  Cross-encoder re-scores 20 candidates
Final: Return top-5 after re-ranking
```

In [None]:
# Load a cross-encoder model for re-ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, collection, initial_k=10, final_k=3):
    """
    Two-stage retrieval: bi-encoder retrieval + cross-encoder re-ranking.
    """
    # Stage 1: Bi-encoder retrieval (fast, broad)
    retrieved_ids, retrieved_docs, distances = retrieve(query, collection, k=initial_k)
    
    # Stage 2: Cross-encoder re-ranking (slow, accurate)
    # Cross-encoder takes (query, document) pairs and scores them
    pairs = [[query, doc] for doc in retrieved_docs]
    cross_scores = cross_encoder.predict(pairs)
    
    # Sort by cross-encoder score (higher is better)
    ranked_indices = np.argsort(cross_scores)[::-1]
    
    reranked_ids = [retrieved_ids[i] for i in ranked_indices[:final_k]]
    reranked_docs = [retrieved_docs[i] for i in ranked_indices[:final_k]]
    reranked_scores = [cross_scores[i] for i in ranked_indices[:final_k]]
    
    return reranked_ids, reranked_docs, reranked_scores

# Compare standard vs re-ranked results
query = "How do transformers use attention?"

print("BEFORE RE-RANKING (bi-encoder only):")
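# Use distinct names here: reassigning `ids` would overwrite the global list of
# all document IDs that the BM25 and hybrid search cells rely on later
bi_ids, bi_docs, bi_dists = retrieve(query, collection, k=5)
for i, (doc_id, dist) in enumerate(zip(bi_ids, bi_dists)):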
    topic = next(d["topic"] for d in documents if d["id"] == doc_id)
    print(f"  {i+1}. [{doc_id}] topic={topic}, dist={dist:.3f}")

print("\nAFTER RE-RANKING (bi-encoder + cross-encoder):")
ids_rr, docs_rr, scores_rr = retrieve_and_rerank(query, collection, initial_k=10, final_k=5)
for i, (doc_id, score) in enumerate(zip(ids_rr, scores_rr)):
    topic = next(d["topic"] for d in documents if d["id"] == doc_id)
    print(f"  {i+1}. [{doc_id}] topic={topic}, cross-score={score:.3f}")

## Exercise 3: Evaluate Re-ranking Impact

**Your Task:** Add cross-encoder re-ranking to the retrieval pipeline and measure the improvement.

**Steps:**
1. Run standard retrieval (bi-encoder only) on eval dataset
2. Run re-ranked retrieval on the same questions
3. Compare precision@3 and MRR for both

In [None]:
def evaluate_reranking(eval_data, collection, initial_k=10, final_k=3):
    """
    Compare retrieval with and without re-ranking.
    """
    # TODO: Implement evaluation
    
    for item in eval_data:
        # Standard retrieval
        std_ids = None  # Your code
        
        # Re-ranked retrieval
        reranked_ids = None  # Your code
        
        # Compare metrics
        pass  # Your code

# evaluate_reranking(eval_dataset[:4], collection)

### Solution for Exercise 3

In [None]:
def evaluate_reranking(eval_data, collection, initial_k=10, final_k=3):
    """Compare retrieval with and without re-ranking - SOLUTION."""
    std_metrics = {"precision": [], "mrr": []}
    rr_metrics = {"precision": [], "mrr": []}
    
    for item in eval_data:
        relevant = item["relevant_ids"]
        
        # Standard retrieval (top final_k only)
        std_ids, _, _ = retrieve(item["question"], collection, k=final_k)
        std_metrics["precision"].append(precision_at_k(std_ids, relevant, final_k))
        std_metrics["mrr"].append(reciprocal_rank(std_ids, relevant))
        
        # Re-ranked retrieval
        rr_ids, _, _ = retrieve_and_rerank(item["question"], collection, initial_k, final_k)
        rr_metrics["precision"].append(precision_at_k(rr_ids, relevant, final_k))
        rr_metrics["mrr"].append(reciprocal_rank(rr_ids, relevant))
        
        print(f"Q: {item['question']}")
        print(f"  Standard: P@{final_k}={std_metrics['precision'][-1]:.2f}")
        print(f"  Reranked: P@{final_k}={rr_metrics['precision'][-1]:.2f}")
    
    print("\n" + "=" * 50)
    print("SUMMARY")
    print(f"Standard - Avg P@{final_k}: {np.mean(std_metrics['precision']):.3f}, MRR: {np.mean(std_metrics['mrr']):.3f}")
    print(f"Reranked - Avg P@{final_k}: {np.mean(rr_metrics['precision']):.3f}, MRR: {np.mean(rr_metrics['mrr']):.3f}")

evaluate_reranking(eval_dataset, collection)

---

## 6. Hybrid Search: BM25 + Vector Search

**BM25** is a keyword-based ranking function that refines TF-IDF with term-frequency saturation and document-length normalization. It excels at exact keyword matches.
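
To make the scoring idea concrete, here is a minimal from-scratch sketch of a BM25 score. It assumes the common defaults `k1=1.5` and `b=0.75` and an IDF variant with `+1` inside the logarithm (which keeps weights non-negative), so its numbers may differ from what `rank_bm25`'s `BM25Okapi` produces. The function name `bm25_scores` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document in tokenized_corpus against query_tokens with BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents get a larger penalty in the denominator
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # tf saturates: each extra occurrence adds less than the previous one
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    ["vector", "database"],
    ["neural", "network"],
    ["vector", "search", "vector"],
]
toy_scores = bm25_scores(["vector"], toy_corpus)
print([round(s, 3) for s in toy_scores])
```

The document without the query term scores exactly zero, which is the behavior that makes BM25 strong on exact keywords and blind to paraphrases.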

**Vector search** excels at semantic meaning but can miss exact keywords.

**Hybrid search** runs both retrievers and fuses their rankings, so a document that either method ranks highly can still surface.

We combine rankings using **Reciprocal Rank Fusion (RRF)**:

$$\text{RRF}(d) = \sum_{r \in \text{rankers}} \frac{1}{k + \text{rank}_r(d)}$$

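where $\text{rank}_r(d)$ is the 1-based position of document $d$ in ranker $r$'s list, and $k$ is a smoothing constant (typically 60) that is unrelated to the $k$ in precision@k. A document missing from a ranker's list contributes nothing for that ranker.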
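
As a worked example of the formula, the sketch below fuses two toy rankings and prints each document's fused score. The function name `rrf_scores` and the IDs `doc_a` to `doc_d` are illustrative assumptions.

```python
def rrf_scores(rankings, k=60):
    """Return {doc_id: fused RRF score} for a list of rankings (best document first)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_c", "doc_d"]

scores = rrf_scores([vector_ranking, keyword_ranking])
fused = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

Here `doc_b` scores 1/62 + 1/61 and ends up first even though only one ranker put it on top, because it appears near the top of both lists. `doc_a`, ranked first by a single ranker, scores only 1/61.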

In [None]:
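# BM25 setup
import re

def tokenize(text):
    """Lowercase and split on non-word characters so 'embeddings?' matches 'embeddings'."""
    return re.findall(r"\w+", text.lower())

tokenized_docs = [tokenize(doc) for doc in texts]
bm25 = BM25Okapi(tokenized_docs)

def bm25_retrieve(query, k=5):
    """Retrieve using BM25 keyword matching."""
    tokenized_query = tokenize(query)
    scores = bm25.get_scores(tokenized_query)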
    top_k_indices = np.argsort(scores)[::-1][:k]
    return (
        [ids[i] for i in top_k_indices],
        [texts[i] for i in top_k_indices],
        [scores[i] for i in top_k_indices]
    )

def reciprocal_rank_fusion(rankings, k=60):
    """
    Combine multiple rankings using Reciprocal Rank Fusion.
    
    Args:
        rankings: list of lists of doc_ids (each list is a ranking)
        k: constant (typically 60)
    Returns:
        list of doc_ids sorted by fused score
    """
    fused_scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1.0 / (k + rank + 1)
    
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in sorted_docs]

def hybrid_search(query, collection, k=5):
    """Combine BM25 and vector search with RRF."""
    # Get both rankings
    vector_ids, _, _ = retrieve(query, collection, k=10)
    bm25_ids, _, _ = bm25_retrieve(query, k=10)
    
    # Fuse rankings
    fused_ids = reciprocal_rank_fusion([vector_ids, bm25_ids])[:k]
    fused_docs = [texts[ids.index(doc_id)] for doc_id in fused_ids]
    
    return fused_ids, fused_docs

# Compare all three approaches
query = "What vector databases are available for storing embeddings?"

print(f"Query: {query}\n")

print("BM25 (keyword):")
bm25_ids, _, bm25_scores = bm25_retrieve(query, k=3)
for i, (doc_id, score) in enumerate(zip(bm25_ids, bm25_scores)):
    print(f"  {i+1}. [{doc_id}] score={score:.2f}")

print("\nVector Search (semantic):")
vec_ids, _, vec_dists = retrieve(query, collection, k=3)
for i, (doc_id, dist) in enumerate(zip(vec_ids, vec_dists)):
    print(f"  {i+1}. [{doc_id}] dist={dist:.3f}")

print("\nHybrid (BM25 + Vector + RRF):")
hybrid_ids, _ = hybrid_search(query, collection, k=3)
for i, doc_id in enumerate(hybrid_ids):
    print(f"  {i+1}. [{doc_id}]")

## Exercise 4: Evaluate Hybrid Search

**Your Task:** Compare BM25-only, vector-only, and hybrid search across the evaluation dataset.

**Steps:**
1. Run all three retrieval methods on each question
2. Compute precision@3 and MRR for each
3. Create a bar chart comparing the methods

In [None]:
def compare_search_methods(eval_data, collection, k=3):
    """
    Compare BM25, vector search, and hybrid search.
    """
    # TODO: Implement comparison
    
    methods = {"BM25": [], "Vector": [], "Hybrid": []}
    
    for item in eval_data:
        relevant = item["relevant_ids"]
        
        # BM25
        bm25_result_ids = None  # Your code
        
        # Vector
        vec_result_ids = None  # Your code
        
        # Hybrid
        hybrid_result_ids = None  # Your code
        
        # Compute precision@k and reciprocal rank for each
        pass  # Your code
    
    # Plot comparison
    pass  # Your code: bar chart

# compare_search_methods(eval_dataset, collection)

### Solution for Exercise 4

In [None]:
def compare_search_methods(eval_data, collection, k=3):
    """Compare BM25, vector search, and hybrid search - SOLUTION."""
    methods = {
        "BM25": {"precision": [], "mrr": []},
        "Vector": {"precision": [], "mrr": []},
        "Hybrid": {"precision": [], "mrr": []}
    }
    
    for item in eval_data:
        relevant = item["relevant_ids"]
        
        # BM25
        bm25_result_ids, _, _ = bm25_retrieve(item["question"], k=k)
        methods["BM25"]["precision"].append(precision_at_k(bm25_result_ids, relevant, k))
        methods["BM25"]["mrr"].append(reciprocal_rank(bm25_result_ids, relevant))
        
        # Vector
        vec_result_ids, _, _ = retrieve(item["question"], collection, k=k)
        methods["Vector"]["precision"].append(precision_at_k(vec_result_ids, relevant, k))
        methods["Vector"]["mrr"].append(reciprocal_rank(vec_result_ids, relevant))
        
        # Hybrid
        hybrid_result_ids, _ = hybrid_search(item["question"], collection, k=k)
        methods["Hybrid"]["precision"].append(precision_at_k(hybrid_result_ids, relevant, k))
        methods["Hybrid"]["mrr"].append(reciprocal_rank(hybrid_result_ids, relevant))
    
    # Print results
    print(f"{'Method':<10} {'Avg P@'+str(k):<12} {'Avg MRR':<10}")
    print("-" * 32)
    for method, scores in methods.items():
        avg_p = np.mean(scores["precision"])
        avg_mrr = np.mean(scores["mrr"])
        print(f"{method:<10} {avg_p:<12.3f} {avg_mrr:<10.3f}")
    
    # Bar chart
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    method_names = list(methods.keys())
    colors = ['#ff9999', '#66b3ff', '#99ff99']
    
    for ax, metric_name in zip(axes, ["precision", "mrr"]):
        values = [np.mean(methods[m][metric_name]) for m in method_names]
        bars = ax.bar(method_names, values, color=colors)
        ax.set_title(f"Average {metric_name.upper()}" if metric_name == "mrr" else f"Average Precision@{k}")
        ax.set_ylim(0, 1)
        for bar, val in zip(bars, values):
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                   f"{val:.3f}", ha='center', fontsize=11)
    
    plt.suptitle("Search Method Comparison", fontsize=14)
    plt.tight_layout()
    plt.show()

compare_search_methods(eval_dataset, collection)

---

## 7. Failure Analysis & Debugging RAG

RAG systems can fail at multiple points. Here's a systematic debugging approach:

### Common Failure Modes

| Failure | Symptom | Root Cause | Fix |
|---------|---------|------------|-----|
| Wrong chunks retrieved | Answer about wrong topic | Embedding model mismatch, poor chunking | Better embeddings, adjust chunk size |
| Right chunks, wrong answer | Plausible but incorrect answer | LLM hallucination, prompt issue | Better prompt, lower temperature |
| Chunks too large | Answer misses details | Large chunks dilute signal | Smaller chunks with overlap |
| Chunks too small | Missing context | Small chunks lose coherence | Larger chunks, parent-child strategy |
| Query mismatch | No relevant results | Query wording differs from document wording | Query expansion, HyDE |
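
The "smaller chunks with overlap" fix from the table can be sketched as follows. This is a minimal word-based splitter for illustration. The function name `chunk_words` and its default sizes are assumptions, not values used elsewhere in this notebook, and production pipelines often split on sentences or tokens instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny example: 12 words, chunks of 5 words, 2 words shared between neighbors
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.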

### Debugging Checklist

1. **Check retrieval first** - Are the right documents being found?
2. **Check chunk quality** - Are chunks coherent and informative?
3. **Check the prompt** - Is the context + question formatted well?
4. **Check the answer** - Is the LLM following instructions?

In [None]:
def debug_rag_query(question, collection, k=3):
    """Debug a single RAG query step by step."""
    print("DEBUGGING RAG QUERY")
    print(f"Question: {question}")
    print("=" * 60)
    
    # Step 1: Check retrieval
    print("\n[1] RETRIEVAL RESULTS:")
    retrieved_ids, retrieved_docs, distances = retrieve(question, collection, k)
    for i, (doc_id, doc, dist) in enumerate(zip(retrieved_ids, retrieved_docs, distances)):
        meta = next(d for d in documents if d["id"] == doc_id)
        print(f"  #{i+1} [{doc_id}] cat={meta['category']}, topic={meta['topic']}, dist={dist:.3f}")
        print(f"      {doc[:80]}...")
    
    # Step 2: Check context quality
    print("\n[2] CONTEXT QUALITY:")
    context = "\n\n".join(retrieved_docs)
    print(f"  Total context length: {len(context)} chars, {len(context.split())} words")
    
    # Step 3: Generate answer
    print("\n[3] GENERATED ANSWER:")
    answer = generate_answer(question, context)
    print(f"  {answer}")
    
    # Step 4: Check faithfulness
    print("\n[4] FAITHFULNESS CHECK:")
    faith = simple_faithfulness(context, answer)
    print(f"  Keyword overlap: {faith:.2f}")
    if faith < 0.1:
        print("  WARNING: Low faithfulness - answer may not be grounded in context!")
    
    return {"ids": retrieved_ids, "answer": answer, "faithfulness": faith}

# Debug a query that might fail
debug_rag_query("What is the difference between GANs and diffusion models for image generation?", collection)
print("\n" + "#" * 60 + "\n")
debug_rag_query("How does quantum computing relate to machine learning?", collection)
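
For out-of-domain questions like the second one, one possible extra check is to look at the retrieval distances before trusting the answer. The sketch below assumes that lower distance means a closer match, as in the `dist` values printed above. The helper name `flag_weak_retrieval` and the `max_distance` default are hypothetical; a usable threshold depends on the embedding model and distance metric and would need tuning on your own data.

```python
def flag_weak_retrieval(distances, max_distance=1.0):
    """Return True when even the closest retrieved chunk is farther than max_distance."""
    if not distances:
        return True  # nothing retrieved at all
    return min(distances) > max_distance

# Hypothetical distance lists for illustration
print(flag_weak_retrieval([0.4, 0.9, 1.1]))  # closest match is within the threshold
print(flag_weak_retrieval([1.3, 1.5, 1.6]))  # every match is beyond the threshold
```

When the flag is raised, the system could decline to answer or say that the knowledge base does not cover the topic, rather than generating from weakly related context.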

---

## 8. Summary

### Key Takeaways

1. **Evaluation first**: Always measure before optimizing. Use precision@k, recall@k, and MRR for retrieval. Use faithfulness for generation.

2. **HyDE**: Generate a hypothetical answer and embed that instead of the raw query. This narrows the gap between how questions and documents are phrased.

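3. **Cross-encoder re-ranking**: Two-stage retrieval (bi-encoder → cross-encoder) can substantially improve precision. Because the cross-encoder scores only the short list of candidates from the first stage, the added latency stays bounded.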

4. **Hybrid search**: Combining BM25 (keyword) with vector search (semantic) via Reciprocal Rank Fusion catches both exact-term matches and paraphrased matches that either method alone can miss.

5. **Debug systematically**: Check retrieval → chunk quality → prompt → answer, in that order.

### Techniques Summary

| Technique | When to Use | Complexity | Impact |
|-----------|------------|------------|--------|
| HyDE | When queries are short/vague | Low | Medium |
| Cross-encoder re-ranking | When precision matters | Medium | High |
| Hybrid search (BM25+vector) | When users use specific terms | Medium | High |
| Query decomposition | Complex multi-part questions | Low | Medium |

### References

- Paper: Gao et al. "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE, 2022)
- Framework: RAGAS (ragas.io) for automated RAG evaluation
- Paper: Nogueira & Cho "Passage Re-ranking with BERT" (2019)
- Library: rank-bm25 for BM25 implementation
- Library: sentence-transformers CrossEncoder for re-ranking