# From Zero to RAG: An Incremental, Hands-On Notebook

Welcome to this comprehensive tutorial on building a Retrieval-Augmented Generation (RAG) pipeline from scratch! In this notebook, you'll learn how to create a complete RAG system incrementally, starting with basic keyword search and progressing to sophisticated semantic retrieval with re-ranking.

## What You'll Learn
- Build retrieval systems using TF-IDF, BM25, and semantic embeddings
- Implement hybrid retrieval combining lexical and semantic approaches
- Apply rank fusion techniques (Reciprocal Rank Fusion)
- Use cross-encoder re-ranking for improved precision
- Create an end-to-end RAG pipeline with generation

## What Gets Built
By the end, you'll have a working RAG system that can answer questions about a synthetic knowledge base, complete with retrieval, re-ranking, and generation components.

## Technical Constraints
- **Python 3.10+** compatible code throughout
- **Open-source models only** for embeddings and re-ranking (sentence-transformers, cross-encoders)
- **Hugging Face Token for model access** required HF_TOKEN
- **OpenAI SDK** used only for the final generation step; Needs OPENAI_API_KEY
- All dependencies installable via pip

In [181]:
# Setup & Environment Check
import sys
import os
import subprocess
import importlib.util

# Print Python version to verify compatibility
print(f"Python version: {sys.version}")
if sys.version_info < (3, 10):
    print("⚠️  Warning: This notebook requires Python 3.10 or higher")
else:
    print("✅ Python version compatible")

# Define required packages
required_packages = [
    'numpy',
    'pandas', 
    'scikit-learn',
    'rank_bm25',
    'sentence-transformers',
    'torch',
    'faiss-cpu',
    'tqdm',
    'openai',
    'python-dotenv'
]

def install_package(package_name):
    """Install a package using pip programmatically"""
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package_name])
        print(f"✅ Successfully installed {package_name}")
        return True
    except subprocess.CalledProcessError:
        print(f"❌ Failed to install {package_name}")
        return False

def check_and_install_packages(packages):
    """Check if packages are available, install if missing"""
    missing_packages = []
    
    # First pass: check what's missing
    for package in packages:
        # Handle special cases for import names vs package names
        import_name = package
        if package == 'scikit-learn':
            import_name = 'sklearn'
        elif package == 'faiss-cpu':
            import_name = 'faiss'
        elif package == 'rank_bm25':
            import_name = 'rank_bm25'
        elif package == 'python-dotenv':
            import_name = 'dotenv'
            
        spec = importlib.util.find_spec(import_name)
        if spec is None:
            missing_packages.append(package)
            print(f"❌ {package} not found")
        else:
            print(f"✅ {package} available")
    
    # Second pass: install missing packages
    if missing_packages:
        print(f"\n📦 Installing {len(missing_packages)} missing packages...")
        for package in missing_packages:
            install_package(package)
    
    return missing_packages

# Check and install packages
missing = check_and_install_packages(required_packages)

print("\n🔄 Importing all packages...")
try:
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, CrossEncoder
    import torch
    try:
        import faiss
        FAISS_AVAILABLE = True
    except ImportError:
        from sklearn.neighbors import NearestNeighbors
        FAISS_AVAILABLE = False
        print("ℹ️  FAISS not available, will use sklearn NearestNeighbors")
    from tqdm import tqdm
    from openai import OpenAI
    from dotenv import load_dotenv
    
    print("✅ All imports successful!")
    
except ImportError as e:
    print(f"❌ Import failed: {e}")
    print("Please restart the kernel and try again.")

# Load environment variables from .env file
print("\n🔑 Loading environment variables...")
env_loaded = load_dotenv()
if env_loaded:
    print("✅ .env file loaded successfully")
else:
    print("ℹ️  No .env file found or already loaded")

# Try to load Kaggle secrets if available (only works in Kaggle environment)
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    
    # Try to get secrets
    try:
        hf_token = user_secrets.get_secret("HF_TOKEN")
        if hf_token:
            os.environ['HF_TOKEN'] = hf_token
            print("✅ HF_TOKEN loaded from Kaggle secrets")
    except Exception:
        print("ℹ️  HF_TOKEN not found in Kaggle secrets")
    
    try:
        openai_key = user_secrets.get_secret("OPENAI_API_KEY")
        if openai_key:
            os.environ['OPENAI_API_KEY'] = openai_key
            print("✅ OPENAI_API_KEY loaded from Kaggle secrets")
    except Exception:
        print("ℹ️  OPENAI_API_KEY not found in Kaggle secrets")
        
except ImportError:
    print("ℹ️  kaggle_secrets not available (not in Kaggle environment)")
    print("   Using .env file or environment variables instead")

# Check for required API keys
openai_key = os.getenv('OPENAI_API_KEY')
hf_token = os.getenv('HF_TOKEN')

print("\n🔐 API Key Status:")
if openai_key:
    print(f"✅ OPENAI_API_KEY: Found (starts with: {openai_key[:10]}...)")
else:
    print("❌ OPENAI_API_KEY: Not found")
    print("   Set OPENAI_API_KEY in your .env file or environment variables")

if hf_token:
    print(f"✅ HF_TOKEN: Found (starts with: {hf_token[:10]}...)")
else:
    print("❌ HF_TOKEN: Not found")
    print("   Set HF_TOKEN in your .env file or environment variables")

# Set global random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
    
print("\n🎯 Random seeds set for reproducibility")
print("🚀 Environment ready!")

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 10:07:17) [Clang 14.0.6 ]
✅ Python version compatible
✅ numpy available
✅ pandas available
✅ scikit-learn available
✅ rank_bm25 available
❌ sentence-transformers not found
✅ torch available
✅ faiss-cpu available
✅ tqdm available
✅ openai available
✅ python-dotenv available

📦 Installing 1 missing packages...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


✅ Successfully installed sentence-transformers

🔄 Importing all packages...
✅ All imports successful!

🔑 Loading environment variables...
✅ .env file loaded successfully
ℹ️  kaggle_secrets not available (not in Kaggle environment)
   Using .env file or environment variables instead

🔐 API Key Status:
✅ OPENAI_API_KEY: Found (starts with: sk-proj-io...)
✅ HF_TOKEN: Found (starts with: hf_nIWLRsu...)

🎯 Random seeds set for reproducibility
🚀 Environment ready!


## What is RAG?

![RAG Overview](https://medium.com/@drjulija/what-is-retrieval-augmented-generation-rag-938e4f6e03d1)

**Retrieval-Augmented Generation (RAG)** combines information retrieval with text generation to create more accurate, grounded responses. Instead of relying solely on a language model's training data, RAG first retrieves relevant documents from a knowledge base, then uses those documents to generate answers.

### Core Components
1. **Indexing**: Preprocessing and storing documents for efficient retrieval
2. **Retrieval**: Finding relevant documents given a query
3. **Fusion**: Combining results from multiple retrieval methods
4. **Re-ranking**: Refining the order of retrieved documents
5. **Generation**: Creating answers using retrieved context

### Key Benefits
- **Reduces hallucinations** by grounding responses in actual documents
- **Enables up-to-date information** without retraining models
- **Provides citations** for transparency and verification
- **Scales efficiently** to large knowledge bases

### Trade-offs
RAG adds complexity and latency but dramatically improves factual accuracy and allows dynamic knowledge updates. The retrieval quality directly impacts the final answer quality.

## Dataset Preparation

Quality retrieval starts with quality data. Clean, well-structured documents are essential for effective RAG systems. Each document should have consistent fields and clear, focused content.

### Key Principles
- **Structured format**: Use consistent fields (id, title, text) for easy processing
- **Appropriate granularity**: Documents should be focused but comprehensive
- **Clean text**: Remove formatting artifacts, normalize whitespace
- **Diverse content**: Include varied topics to test retrieval robustness

### Our Synthetic Corpus
We'll create a small but diverse dataset spanning multiple domains (astronomy, cooking, programming, history, health, sports). This allows us to test different retrieval methods on varied content types. Including some near-duplicates and paraphrases helps evaluate robustness to semantic similarity.

### Licensing Note
When using real data, always verify licensing terms and respect copyright. Our synthetic dataset avoids these concerns while providing realistic testing scenarios.

**Deterministic seeds** ensure reproducible results across runs, crucial for comparing retrieval methods fairly.

In [182]:
# Create synthetic corpus for tutorial
import pandas as pd
import os

# Create data directory if it doesn't exist
os.makedirs('./data', exist_ok=True)

# Generate synthetic documents for demonstration
documents = [
    # Astronomy documents
    {"id": "ast_001", "title": "Understanding Black Holes", 
     "text": "Black holes are regions of spacetime where gravity is so strong that nothing, including light, can escape once it crosses the event horizon. The event horizon is the boundary beyond which escape becomes impossible. Black holes form when massive stars collapse under their own gravity at the end of their lifecycle. The singularity at the center represents a point where spacetime curvature becomes infinite."},
    
    {"id": "ast_002", "title": "The Life Cycle of Stars", 
     "text": "Stars are born from clouds of gas and dust called nebulae. Through gravitational collapse, the core temperature rises until nuclear fusion begins, converting hydrogen into helium and releasing enormous amounts of energy. A star's mass determines its lifecycle - more massive stars burn brighter and die younger, while smaller stars can burn for billions of years."},
    
    {"id": "ast_003", "title": "Solar System Formation", 
     "text": "The solar system formed approximately 4.6 billion years ago from the gravitational collapse of a molecular cloud. The Sun formed at the center while leftover material formed a protoplanetary disk. Through accretion and collisions, planetary embryos grew into the planets we know today. The process explains the orbital characteristics and composition differences between inner rocky planets and outer gas giants."},
    
    {"id": "ast_004", "title": "Exoplanet Detection Methods", 
     "text": "Astronomers use several methods to detect exoplanets. The transit method observes the dimming of a star as a planet passes in front of it. The radial velocity method detects wobbles in a star's motion caused by an orbiting planet's gravitational pull. Direct imaging captures light from the planet itself, though this is challenging due to the brightness difference between stars and planets."},
    
    {"id": "ast_005", "title": "Dark Matter and Dark Energy", 
     "text": "Dark matter makes up approximately 27% of the universe but doesn't interact electromagnetically, making it invisible to direct observation. Its existence is inferred from gravitational effects on visible matter and large-scale structure formation. Dark energy, comprising about 68% of the universe, drives the accelerating expansion of spacetime itself, counteracting gravity on cosmic scales."},
    
    # Cooking documents
    {"id": "cook_001", "title": "Essential Knife Skills", 
     "text": "Proper knife skills form the foundation of cooking efficiency and safety. The chef's knife should be held with a pinch grip, controlling the blade with thumb and forefinger. The guiding hand forms a claw to protect fingertips while providing stability. Consistent cuts ensure even cooking - brunoise for small dice, julienne for thin strips, and chiffonade for leafy herbs."},
    
    {"id": "cook_002", "title": "Understanding Heat and Cooking Methods", 
     "text": "Heat transfer occurs through conduction, convection, and radiation in cooking. Dry heat methods like roasting and grilling develop flavor through the Maillard reaction, creating complex tastes and aromas. Moist heat methods like braising and steaming are gentler, preserving delicate textures. Understanding heat control prevents overcooking and ensures proteins remain tender and juicy."},
    
    {"id": "cook_003", "title": "Building Flavor Profiles", 
     "text": "Flavor development starts with aromatics - onions, garlic, and celery form the foundation of many cuisines. Layering flavors throughout the cooking process creates depth and complexity. Seasoning should happen in stages, not just at the end. Acid brightens dishes, fat carries flavors, and herbs add freshness. Understanding how ingredients interact helps create balanced, memorable meals."},
    
    {"id": "cook_004", "title": "Sauce Making Fundamentals", 
     "text": "Classic mother sauces provide the foundation for countless variations. Roux-based sauces like béchamel use equal parts fat and flour to create smooth, creamy textures. Emulsification binds oil and water-based ingredients, as seen in mayonnaise and hollandaise. Reduction concentrates flavors by evaporating liquid, creating intensely flavored pan sauces and glazes."},
    
    {"id": "cook_005", "title": "Baking Science and Techniques", 
     "text": "Baking relies on precise chemical reactions between ingredients. Gluten development in flour provides structure, while leavening agents create lift through gas production. Temperature control affects texture - higher heat creates crustier exteriors while lower heat ensures even cooking. Understanding ingredient ratios and their functions enables consistent results and successful recipe modifications."},
    
    # Python programming documents  
    {"id": "py_001", "title": "Object-Oriented Programming Concepts", 
     "text": "Object-oriented programming organizes code around objects rather than functions. Classes serve as blueprints defining attributes and methods, while objects are specific instances of classes. Encapsulation hides internal implementation details, inheritance enables code reuse through class hierarchies, and polymorphism allows different objects to respond to the same interface in their own way."},
    
    {"id": "py_002", "title": "Efficient Data Processing with Pandas", 
     "text": "Pandas provides powerful data structures and operations for manipulating structured data. DataFrames offer two-dimensional labeled data structures similar to spreadsheets. Vectorized operations perform element-wise calculations efficiently without explicit loops. Groupby operations enable split-apply-combine workflows, while merge and join operations combine datasets based on common keys."},
    
    {"id": "py_003", "title": "Asynchronous Programming Patterns", 
     "text": "Asynchronous programming enables concurrent execution without blocking operations. The async/await syntax provides a clean way to write asynchronous code that looks synchronous. Event loops manage the execution of asynchronous tasks, while coroutines are functions that can be paused and resumed. This approach is particularly effective for I/O-bound operations like web requests or database queries."},
    
    {"id": "py_004", "title": "Machine Learning Pipeline Design", 
     "text": "ML pipelines automate the workflow from raw data to trained models. Data preprocessing includes cleaning, feature engineering, and transformation steps. Cross-validation ensures model generalization, while hyperparameter tuning optimizes model performance. Production pipelines must handle data drift, model monitoring, and automated retraining to maintain accuracy over time."},
    
    {"id": "py_005", "title": "API Development with FastAPI", 
     "text": "FastAPI enables rapid development of high-performance web APIs with automatic OpenAPI documentation. Type hints provide automatic validation and serialization of request/response data. Dependency injection enables clean separation of concerns and easier testing. Async support handles concurrent requests efficiently, while middleware provides cross-cutting functionality like authentication and logging."},
    
    # History documents
    {"id": "hist_001", "title": "The Industrial Revolution", 
     "text": "The Industrial Revolution transformed society from agricultural to manufacturing economies between 1760 and 1840. Steam power revolutionized transportation and production, while factory systems centralized manufacturing. Urbanization accelerated as workers moved from rural areas to industrial cities. These changes brought both economic growth and social challenges, including harsh working conditions and environmental pollution."},
    
    {"id": "hist_002", "title": "Ancient Civilizations and Trade", 
     "text": "Ancient trade routes connected distant civilizations, facilitating cultural and technological exchange. The Silk Road linked Asia and Europe, carrying not just silk but ideas, religions, and innovations. Maritime trade in the Mediterranean enabled the rise of powerful city-states like Venice and Genoa. These networks spread agricultural techniques, metalworking, and writing systems across vast distances."},
    
    {"id": "hist_003", "title": "The Renaissance Period", 
     "text": "The Renaissance marked a period of renewed interest in classical learning, art, and humanism from the 14th to 17th centuries. Artists like Leonardo da Vinci and Michelangelo revolutionized artistic techniques and scientific observation. The printing press democratized knowledge, while patronage systems supported artistic and intellectual pursuits. This cultural movement laid foundations for modern scientific methods and artistic expression."},
    
    {"id": "hist_004", "title": "World War Impact on Society", 
     "text": "World Wars I and II fundamentally reshaped global society, politics, and technology. Total war mobilized entire populations, advancing manufacturing and medical techniques. Women entered the workforce in unprecedented numbers, challenging traditional gender roles. The wars accelerated decolonization movements and led to new international organizations aimed at preventing future conflicts."},
    
    {"id": "hist_005", "title": "The Cold War Era", 
     "text": "The Cold War (1945-1991) defined international relations through ideological competition between capitalism and communism. Nuclear weapons created a balance of terror, preventing direct conflict while fueling proxy wars. The space race demonstrated technological capabilities, while cultural exchanges like jazz and cinema influenced global perspectives. The period ended with economic reforms and the dissolution of the Soviet Union."},
    
    # Health documents
    {"id": "heal_001", "title": "Nutrition and Metabolism", 
     "text": "Metabolism encompasses all chemical processes that maintain life, including catabolism (breaking down molecules for energy) and anabolism (building complex molecules). Macronutrients - carbohydrates, proteins, and fats - provide energy and building blocks for cellular processes. Micronutrients like vitamins and minerals act as cofactors in enzymatic reactions essential for health."},
    
    {"id": "heal_002", "title": "Cardiovascular Health", 
     "text": "The cardiovascular system pumps blood through a network of vessels, delivering oxygen and nutrients while removing waste products. Regular exercise strengthens the heart muscle and improves circulation. Diet affects cardiovascular health through cholesterol levels, blood pressure, and inflammation. Preventive measures include maintaining healthy weight, avoiding smoking, and managing stress levels."},
    
    {"id": "heal_003", "title": "Mental Health and Wellness", 
     "text": "Mental health encompasses emotional, psychological, and social well-being, affecting thoughts, feelings, and behaviors. Stress management techniques like meditation and deep breathing activate the parasympathetic nervous system. Social connections and meaningful relationships provide emotional support and resilience. Professional treatment options include therapy, medication, and lifestyle modifications tailored to individual needs."},
    
    {"id": "heal_004", "title": "Sleep and Recovery", 
     "text": "Sleep plays a crucial role in physical and mental health through multiple sleep cycles of REM and non-REM stages. During sleep, the brain consolidates memories and clears metabolic waste products. Growth hormone release peaks during deep sleep, supporting tissue repair and immune function. Sleep hygiene practices like consistent schedules and optimal environment promote quality rest."},
    
    {"id": "heal_005", "title": "Immune System Function", 
     "text": "The immune system defends against pathogens through innate and adaptive responses. White blood cells identify and eliminate threats, while antibodies provide specific protection against previously encountered antigens. Vaccination trains the immune system to recognize pathogens without causing disease. Lifestyle factors like nutrition, exercise, and stress management influence immune system effectiveness."},
    
    # Sports documents
    {"id": "sport_001", "title": "Athletic Performance Optimization", 
     "text": "Peak athletic performance requires balancing training stress, recovery, and adaptation. Periodization systematically varies training intensity and volume to peak for competitions. Sport-specific training develops the energy systems and movement patterns most relevant to performance. Recovery protocols including sleep, nutrition, and active recovery prevent overtraining and reduce injury risk."},
    
    {"id": "sport_002", "title": "Sports Psychology and Mental Training", 
     "text": "Mental training enhances athletic performance through focus, confidence, and stress management techniques. Visualization helps athletes mentally rehearse successful performance and overcome challenges. Goal setting provides direction and motivation, while self-talk influences confidence and concentration. Handling pressure situations requires developing coping strategies and maintaining optimal arousal levels."},
    
    {"id": "sport_003", "title": "Injury Prevention Strategies", 
     "text": "Injury prevention combines proper warm-up, strength training, and biomechanical awareness. Dynamic warm-ups prepare muscles and joints for activity-specific movements. Strength imbalances increase injury risk, particularly between opposing muscle groups. Recovery time between training sessions allows tissues to adapt and repair, reducing the likelihood of overuse injuries."},
    
    {"id": "sport_004", "title": "Strength Training Principles", 
     "text": "Effective strength training follows progressive overload, gradually increasing resistance to stimulate adaptation. Compound exercises like squats and deadlifts work multiple muscle groups efficiently. Training frequency, volume, and intensity must be balanced for optimal results. Proper form prevents injury and ensures targeted muscle activation during resistance exercises."},
    
    {"id": "sport_005", "title": "Endurance Training Methodologies", 
     "text": "Endurance training improves the body's ability to sustain prolonged physical activity. Training zones based on heart rate or power output optimize different energy systems. Base training builds aerobic capacity, while high-intensity intervals improve lactate threshold. Periodization varies training stress to promote adaptation while preventing burnout and overtraining syndrome."}
]

# Convert to pandas DataFrame for easy manipulation
corpus_df = pd.DataFrame(documents)

print(f"📊 Created synthetic corpus with {len(corpus_df)} documents")
print(f"📂 Domains covered: {len(corpus_df['id'].str[:4].unique())} unique prefixes")
print(f"📏 Text length range: {corpus_df['text'].str.len().min()} - {corpus_df['text'].str.len().max()} characters")

# Show distribution by domain using DataFrame
domain_counts = corpus_df['id'].str[:4].value_counts()
domain_names = {
    'ast_': 'Astronomy',
    'cook': 'Cooking', 
    'py_0': 'Python/Programming',
    'hist': 'History',
    'heal': 'Health',
    'spor': 'Sports'
}

# Create domain distribution DataFrame
domain_df = pd.DataFrame({
    'Domain': [domain_names.get(prefix, prefix) for prefix in domain_counts.index],
    'Document Count': domain_counts.values,
    'Prefix': domain_counts.index
})

print("\n📈 Documents per domain:")
display(domain_df[['Domain', 'Document Count']])

# Save to CSV for reuse throughout the notebook
corpus_df.to_csv('./data/corpus.csv', index=False)
print("💾 Corpus saved to ./data/corpus.csv")

# Display sample documents as DataFrame
sample_df = corpus_df[['id', 'title', 'text']].head(3).copy()
sample_df['text_preview'] = sample_df['text'].str[:100] + "..."
sample_display = sample_df[['id', 'title', 'text_preview']].copy()
sample_display.columns = ['ID', 'Title', 'Text Preview']

print("\n📋 Sample documents:")
display(sample_display)

📊 Created synthetic corpus with 30 documents
📂 Domains covered: 6 unique prefixes
📏 Text length range: 363 - 444 characters

📈 Documents per domain:


Unnamed: 0,Domain,Document Count
0,Astronomy,5
1,Cooking,5
2,Python/Programming,5
3,History,5
4,Health,5
5,Sports,5


💾 Corpus saved to ./data/corpus.csv

📋 Sample documents:


Unnamed: 0,ID,Title,Text Preview
0,ast_001,Understanding Black Holes,Black holes are regions of spacetime where gra...
1,ast_002,The Life Cycle of Stars,Stars are born from clouds of gas and dust cal...
2,ast_003,Solar System Formation,The solar system formed approximately 4.6 bill...


## TF-IDF: Term Frequency-Inverse Document Frequency

**TF-IDF** is a fundamental text retrieval technique that scores documents based on term importance. It combines two concepts:
- **Term Frequency (TF)**: How often a term appears in a document
- **Inverse Document Frequency (IDF)**: How rare a term is across the entire corpus

### Intuition
Words that appear frequently in a document but rarely across the corpus are most important for distinguishing that document. Common words like "the" and "and" get low scores, while specific terms get higher scores.

### How It Works
Documents are converted to vectors where each dimension represents a unique term's TF-IDF score. Query similarity is computed using cosine similarity between the query vector and document vectors.

### Advantages
- **Fast and interpretable**: Clear scoring rationale
- **No training required**: Works immediately on any corpus
- **Memory efficient**: Sparse vectors for large vocabularies

### Limitations
- **Vocabulary mismatch**: Can't match synonyms ("car" vs "automobile")
- **Word order ignored**: "dog bites man" = "man bites dog"
- **No semantic understanding**: Relies purely on exact word matches

In [183]:
# TF-IDF retrieval implementation using scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize TF-IDF vectorizer with optimized parameters
# - Use both unigrams and bigrams to capture phrases like "black holes"
# - Convert to lowercase for normalization
# - Remove English stop words to focus on meaningful terms
# - Set max_features to control vocabulary size and computation
tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # Include both single words and word pairs
    lowercase=True,      # Normalize case
    stop_words='english', # Remove common words like 'the', 'and'
    max_features=10000,  # Limit vocabulary size for efficiency
    min_df=1,            # Include terms that appear in at least 1 document
    max_df=0.8           # Exclude terms that appear in >80% of documents
)

# Fit the vectorizer on our corpus and transform texts to TF-IDF vectors
print("🔄 Building TF-IDF matrix...")
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus_df['text'])

print(f"📊 TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"📝 Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"💾 Matrix sparsity: {(1 - tfidf_matrix.nnz / tfidf_matrix.size) * 100:.1f}% zeros")

def query_tfidf(query_text, top_k=5):
    """
    Retrieve documents using TF-IDF similarity.
    
    Args:
        query_text (str): The search query
        top_k (int): Number of top results to return
    
    Returns:
        list: Tuples of (document_index, similarity_score, document_info)
    """
    # Transform query using the same vectorizer fitted on corpus
    query_vector = tfidf_vectorizer.transform([query_text])
    
    # Compute cosine similarity between query and all documents
    # Cosine similarity ranges from 0 (no similarity) to 1 (identical)
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()
    
    # Get indices of top-k most similar documents
    top_indices = similarity_scores.argsort()[-top_k:][::-1]
    
    # Build results with document info and scores
    results = []
    for idx in top_indices:
        doc_info = {
            'id': corpus_df.iloc[idx]['id'],
            'title': corpus_df.iloc[idx]['title'],
            'text': corpus_df.iloc[idx]['text']
        }
        results.append((idx, similarity_scores[idx], doc_info))
    
    return results

# Test TF-IDF retrieval with example queries
test_queries = [
    "black holes and event horizons",
    "python programming decorators"
]

print("\n🔍 Testing TF-IDF retrieval:")
for query in test_queries:
    print(f"\n📋 Query: '{query}'")
    results = query_tfidf(query, top_k=5)
    
    print("Top 5 results:")
    for rank, (idx, score, doc_info) in enumerate(results, 1):
        print(f"  {rank}. [{doc_info['id']}] {doc_info['title']} (score: {score:.3f})")
        # Show first 100 characters of text as snippet
        snippet = doc_info['text'][:100] + '...' if len(doc_info['text']) > 100 else doc_info['text']
        print(f"      {snippet}")

🔄 Building TF-IDF matrix...
📊 TF-IDF matrix shape: (30, 1938)
📝 Vocabulary size: 1938
💾 Matrix sparsity: 0.0% zeros

🔍 Testing TF-IDF retrieval:

📋 Query: 'black holes and event horizons'
Top 5 results:
  1. [ast_001] Understanding Black Holes (score: 0.442)
      Black holes are regions of spacetime where gravity is so strong that nothing, including light, can e...
  2. [py_003] Asynchronous Programming Patterns (score: 0.046)
      Asynchronous programming enables concurrent execution without blocking operations. The async/await s...
  3. [sport_004] Strength Training Principles (score: 0.000)
      Effective strength training follows progressive overload, gradually increasing resistance to stimula...
  4. [ast_002] The Life Cycle of Stars (score: 0.000)
      Stars are born from clouds of gas and dust called nebulae. Through gravitational collapse, the core ...
  5. [ast_003] Solar System Formation (score: 0.000)
      The solar system formed approximately 4.6 billion years ago from

## BM25: Best Matching 25

**BM25** is a probabilistic ranking function that often outperforms TF-IDF for information retrieval. It improves upon TF-IDF by addressing two key limitations:

### Key Improvements
1. **Term Saturation**: TF-IDF scores increase linearly with term frequency, but BM25 uses a saturation function. After a certain point, additional occurrences contribute less to the score.

2. **Document Length Normalization**: BM25 adjusts for document length, preventing longer documents from having unfair advantages simply due to more term occurrences.

### Parameters
- **k1** (typically 1.2-2.0): Controls term frequency saturation
- **b** (typically 0.75): Controls document length normalization strength

### When BM25 Excels
BM25 typically outperforms TF-IDF for keyword search, especially with:
- Varied document lengths
- Collections where term frequency patterns matter
- Traditional information retrieval tasks

Both TF-IDF and BM25 are **lexical** methods—they rely on exact word matches and don't understand semantics.

In [184]:
# BM25 retrieval implementation using rank_bm25
from rank_bm25 import BM25Okapi
import string
import re

def simple_tokenizer(text):
    """
    Simple tokenizer for BM25 preprocessing.
    Converts to lowercase, removes punctuation, and splits on whitespace.
    
    Args:
        text (str): Input text to tokenize
    
    Returns:
        list: List of tokens
    """
    # Convert to lowercase for case-insensitive matching
    text = text.lower()
    
    # Remove punctuation using regex (keeps alphanumeric and spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    
    # Split on whitespace and filter empty strings
    tokens = [token for token in text.split() if token]
    
    return tokens

# Tokenize all documents in the corpus for BM25
print("🔄 Tokenizing corpus for BM25...")
tokenized_corpus = [simple_tokenizer(doc_text) for doc_text in corpus_df['text']]

# Initialize BM25 with default parameters (k1=1.2, b=0.75)
# These are well-tested values that work well across many domains
bm25 = BM25Okapi(tokenized_corpus)

print(f"📊 BM25 index built for {len(tokenized_corpus)} documents")
print(f"📝 Average document length: {np.mean([len(doc) for doc in tokenized_corpus]):.1f} tokens")

def query_bm25(query_text, top_k=5):
    """
    Retrieve documents using BM25 scoring.
    
    Args:
        query_text (str): The search query
        top_k (int): Number of top results to return
    
    Returns:
        list: Tuples of (document_index, bm25_score, document_info)
    """
    # Tokenize query using same tokenizer as corpus
    query_tokens = simple_tokenizer(query_text)
    
    # Get BM25 scores for all documents
    # Higher scores indicate better matches
    bm25_scores = bm25.get_scores(query_tokens)
    
    # Get indices of top-k highest scoring documents
    top_indices = np.argsort(bm25_scores)[-top_k:][::-1]
    
    # Build results with document info and scores
    results = []
    for idx in top_indices:
        doc_info = {
            'id': corpus_df.iloc[idx]['id'],
            'title': corpus_df.iloc[idx]['title'],
            'text': corpus_df.iloc[idx]['text']
        }
        results.append((idx, bm25_scores[idx], doc_info))
    
    return results

# Compare BM25 vs TF-IDF on the same queries
comparison_queries = [
    "black holes event horizon",
    "python decorators function behavior",
    "exercise cardiovascular health"
]

print("\n🔍 Comparing BM25 vs TF-IDF retrieval:")
for query in comparison_queries:
    print(f"\n📋 Query: '{query}'")
    
    # Get results from both methods
    bm25_results = query_bm25(query, top_k=3)
    tfidf_results = query_tfidf(query, top_k=3)
    
    print("\n🏆 BM25 Top 3:")
    for rank, (idx, score, doc_info) in enumerate(bm25_results, 1):
        print(f"  {rank}. [{doc_info['id']}] {doc_info['title']} (BM25: {score:.2f})")
    
    print("\n📊 TF-IDF Top 3:")
    for rank, (idx, score, doc_info) in enumerate(tfidf_results, 1):
        print(f"  {rank}. [{doc_info['id']}] {doc_info['title']} (TF-IDF: {score:.3f})")
    
    # Show overlap between methods
    bm25_ids = {doc_info['id'] for _, _, doc_info in bm25_results}
    tfidf_ids = {doc_info['id'] for _, _, doc_info in tfidf_results}
    overlap = bm25_ids.intersection(tfidf_ids)
    print(f"\n🔗 Overlap: {len(overlap)}/3 documents match between methods")

🔄 Tokenizing corpus for BM25...
📊 BM25 index built for 30 documents
📝 Average document length: 52.9 tokens

🔍 Comparing BM25 vs TF-IDF retrieval:

📋 Query: 'black holes event horizon'

🏆 BM25 Top 3:
  1. [ast_001] Understanding Black Holes (BM25: 15.22)
  2. [py_003] Asynchronous Programming Patterns (BM25: 2.35)
  3. [sport_004] Strength Training Principles (BM25: 0.00)

📊 TF-IDF Top 3:
  1. [ast_001] Understanding Black Holes (TF-IDF: 0.546)
  2. [py_003] Asynchronous Programming Patterns (TF-IDF: 0.037)
  3. [sport_004] Strength Training Principles (TF-IDF: 0.000)

🔗 Overlap: 3/3 documents match between methods

📋 Query: 'python decorators function behavior'

🏆 BM25 Top 3:
  1. [heal_004] Sleep and Recovery (BM25: 2.90)
  2. [sport_005] Endurance Training Methodologies (BM25: 0.00)
  3. [py_004] Machine Learning Pipeline Design (BM25: 0.00)

📊 TF-IDF Top 3:
  1. [heal_004] Sleep and Recovery (TF-IDF: 0.100)
  2. [sport_005] Endurance Training Methodologies (TF-IDF: 0.000)
  3. [py_0

## Embeddings: Semantic Vector Representations

**Embeddings** represent text as dense vectors in high-dimensional space where semantically similar texts are close together. Unlike lexical methods (TF-IDF, BM25), embeddings can match concepts even with different vocabulary.

### Key Advantages
- **Semantic understanding**: Matches "car" with "automobile"
- **Cross-lingual capability**: Can work across languages
- **Context awareness**: Considers word relationships and context

### Vector Similarity
Cosine similarity measures the angle between vectors, ranging from -1 to 1. Values closer to 1 indicate higher semantic similarity.

### Trade-offs
- **Computational cost**: Embedding models require more resources
- **Model dependence**: Quality depends on training data and architecture
- **Interpretability**: Harder to understand why documents match

### Approximate Nearest Neighbors (ANN)
For large corpora, exact similarity search becomes slow. ANN algorithms like FAISS provide fast approximate search with minimal accuracy loss.

**Privacy note**: Using local models keeps data on your machine, unlike API-based embedding services.

In [185]:
# Text chunking for better embedding performance
# Chunking breaks long documents into smaller, focused pieces
# This improves embedding quality and allows more precise retrieval

def chunk_text(text, chunk_size_words=180, overlap_words=30):
    """
    Split text into overlapping chunks of specified word count.
    Overlap helps maintain context across chunk boundaries.
    
    Args:
        text (str): Input text to chunk
        chunk_size_words (int): Target words per chunk
        overlap_words (int): Words to overlap between chunks
    
    Returns:
        list: List of text chunks
    """
    words = text.split()
    
    # If text is shorter than chunk size, return as single chunk
    if len(words) <= chunk_size_words:
        return [text]
    
    chunks = []
    start = 0
    
    while start < len(words):
        # Define chunk end, ensuring we don't exceed word count
        end = min(start + chunk_size_words, len(words))
        
        # Extract chunk and join words back to text
        chunk_words = words[start:end]
        chunk_text = ' '.join(chunk_words)
        chunks.append(chunk_text)
        
        # Move start position, accounting for overlap
        # If this is the last chunk, break to avoid infinite loop
        if end >= len(words):
            break
        start = end - overlap_words
    
    return chunks

# Create chunked corpus for better embedding performance
print("🔄 Creating chunked corpus...")
chunked_data = []

for _, row in corpus_df.iterrows():
    doc_chunks = chunk_text(row['text'], chunk_size_words=180, overlap_words=30)
    
    for chunk_idx, text_chunk in enumerate(doc_chunks):
        chunked_data.append({
            'doc_id': row['id'],
            'chunk_id': f"{row['id']}_chunk_{chunk_idx}",
            'title': row['title'],
            'chunk_text': text_chunk
        })

# Convert to DataFrame for easy manipulation
chunked_corpus = pd.DataFrame(chunked_data)

print(f"📊 Created {len(chunked_corpus)} chunks from {len(corpus_df)} documents")
print(f"📏 Average chunk length: {chunked_corpus['chunk_text'].str.split().str.len().mean():.1f} words")

# Show chunk length distribution
chunk_lengths = chunked_corpus['chunk_text'].str.split().str.len()
print(f"📈 Chunk length distribution:")
print(f"  Min: {chunk_lengths.min()} words")
print(f"  Max: {chunk_lengths.max()} words")
print(f"  Median: {chunk_lengths.median():.1f} words")

# Show example of chunking
sample_doc = corpus_df.iloc[0]
sample_chunks = chunk_text(sample_doc['text'])
print(f"\n📋 Example chunking for '{sample_doc['title']}':")
print(f"Original text ({len(sample_doc['text'].split())} words):")
print(f"  {sample_doc['text'][:150]}...")
print(f"\nChunks created: {len(sample_chunks)}")
for i, chunk in enumerate(sample_chunks):
    print(f"  Chunk {i+1} ({len(chunk.split())} words): {chunk[:100]}...")

🔄 Creating chunked corpus...
📊 Created 30 chunks from 30 documents
📏 Average chunk length: 52.0 words
📈 Chunk length distribution:
  Min: 45 words
  Max: 64 words
  Median: 51.0 words

📋 Example chunking for 'Understanding Black Holes':
Original text (64 words):
  Black holes are regions of spacetime where gravity is so strong that nothing, including light, can escape once it crosses the event horizon. The event...

Chunks created: 1
  Chunk 1 (64 words): Black holes are regions of spacetime where gravity is so strong that nothing, including light, can e...


In [186]:
# Semantic embeddings using open-source sentence-transformers
from sentence_transformers import SentenceTransformer
import os

# Use a high-quality, lightweight open-source embedding model
# all-MiniLM-L6-v2 provides good performance with reasonable speed
model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Alternative models (commented for reference):
# model_name = "BAAI/bge-small-en-v1.5"  # Better quality, slightly larger
# model_name = "intfloat/e5-small-v2"     # Good multilingual support

print(f"🤖 Loading embedding model: {model_name}")
embedding_model = SentenceTransformer(model_name)

print(f"📐 Model produces {embedding_model.get_sentence_embedding_dimension()}-dimensional vectors")

# Check if embeddings already exist to avoid recomputation
embeddings_file = './data/embeddings.npz'
chunks_file = './data/chunked_corpus.csv'

if os.path.exists(embeddings_file) and os.path.exists(chunks_file):
    print("📂 Loading pre-computed embeddings...")
    embeddings_data = np.load(embeddings_file)
    chunk_embeddings = embeddings_data['embeddings']
    chunked_corpus = pd.read_csv(chunks_file)
    print(f"✅ Loaded {len(chunk_embeddings)} embeddings from cache")
else:
    # Compute embeddings for all chunks with progress bar
    print(f"🔄 Computing embeddings for {len(chunked_corpus)} chunks...")
    
    # Use tqdm for progress tracking during embedding computation
    chunk_texts = chunked_corpus['chunk_text'].tolist()
    
    # sentence-transformers handles batching internally for efficiency
    chunk_embeddings = embedding_model.encode(
        chunk_texts, 
        batch_size=32,          # Process in batches for memory efficiency
        show_progress_bar=True, # Show progress during computation
        convert_to_numpy=True   # Return as numpy array
    )
    
    # Save embeddings and chunks for future use
    np.savez_compressed(embeddings_file, embeddings=chunk_embeddings)
    chunked_corpus.to_csv(chunks_file, index=False)
    print(f"💾 Saved {len(chunk_embeddings)} embeddings to {embeddings_file}")

# Normalize embeddings for cosine similarity using dot product
# This makes cosine similarity equivalent to dot product, which is faster
from sklearn.preprocessing import normalize
chunk_embeddings_normalized = normalize(chunk_embeddings, norm='l2')

print(f"📊 Embedding matrix shape: {chunk_embeddings.shape}")
print(f"🎯 Embeddings normalized for cosine similarity")

# Choose between FAISS and sklearn based on availability
if FAISS_AVAILABLE:
    # Use FAISS for fast approximate nearest neighbor search
    print("🚀 Using FAISS for fast similarity search")
    
    # Create FAISS index for inner product (equivalent to cosine with normalized vectors)
    embedding_dim = chunk_embeddings_normalized.shape[1]
    faiss_index = faiss.IndexFlatIP(embedding_dim)  # Inner Product index
    
    # Add embeddings to index
    faiss_index.add(chunk_embeddings_normalized.astype(np.float32))
    
    # Save FAISS index
    faiss_index_file = './data/faiss.index'
    faiss.write_index(faiss_index, faiss_index_file)
    print(f"💾 FAISS index saved to {faiss_index_file}")
    
    search_backend = 'faiss'
    
else:
    # Use sklearn NearestNeighbors as fallback
    print("📚 Using sklearn NearestNeighbors for similarity search")
    from sklearn.neighbors import NearestNeighbors
    
    # Create sklearn nearest neighbors index
    nn_index = NearestNeighbors(
        n_neighbors=20,      # Maximum neighbors to consider
        metric='cosine',     # Use cosine similarity
        algorithm='brute'    # Exact search for small datasets
    )
    nn_index.fit(chunk_embeddings)
    
    search_backend = 'sklearn'

def embed_query(query_text):
    """
    Convert query text to embedding vector.
    
    Args:
        query_text (str): The search query
    
    Returns:
        np.ndarray: Query embedding vector
    """
    query_embedding = embedding_model.encode([query_text], convert_to_numpy=True)
    return normalize(query_embedding, norm='l2')[0]  # Normalize and return single vector

def semantic_search(query_text, top_k=10):
    """
    Search for semantically similar chunks using embeddings.
    
    Args:
        query_text (str): The search query
        top_k (int): Number of top results to return
    
    Returns:
        list: Tuples of (chunk_index, similarity_score, chunk_info)
    """
    # Get query embedding
    query_embedding = embed_query(query_text)
    
    if search_backend == 'faiss':
        # Use FAISS for fast search
        scores, indices = faiss_index.search(
            query_embedding.reshape(1, -1).astype(np.float32), 
            top_k
        )
        
        results = []
        for i, (idx, score) in enumerate(zip(indices[0], scores[0])):
            chunk_info = {
                'chunk_id': chunked_corpus.iloc[idx]['chunk_id'],
                'doc_id': chunked_corpus.iloc[idx]['doc_id'],
                'title': chunked_corpus.iloc[idx]['title'],
                'chunk_text': chunked_corpus.iloc[idx]['chunk_text']
            }
            results.append((idx, score, chunk_info))
    
    else:
        # Use sklearn for search
        distances, indices = nn_index.kneighbors(
            query_embedding.reshape(1, -1), 
            n_neighbors=min(top_k, len(chunked_corpus))
        )
        
        results = []
        for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
            # Convert cosine distance to similarity (1 - distance)
            similarity = 1 - distance
            chunk_info = {
                'chunk_id': chunked_corpus.iloc[idx]['chunk_id'],
                'doc_id': chunked_corpus.iloc[idx]['doc_id'],
                'title': chunked_corpus.iloc[idx]['title'],
                'chunk_text': chunked_corpus.iloc[idx]['chunk_text']
            }
            results.append((idx, similarity, chunk_info))
    
    return results

# Test semantic search
test_queries = [
    "stellar collapse and gravitational effects",
    "modifying function behavior in programming",
    "heart health and physical activity"
]

print("\n🔍 Testing semantic search:")
for query in test_queries:
    print(f"\n📋 Query: '{query}'")
    results = semantic_search(query, top_k=5)
    
    print("Top 5 semantic matches:")
    for rank, (idx, score, chunk_info) in enumerate(results, 1):
        print(f"  {rank}. [{chunk_info['doc_id']}] {chunk_info['title']} (similarity: {score:.3f})")
        snippet = chunk_info['chunk_text'][:100] + '...'
        print(f"      {snippet}")

🤖 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
📐 Model produces 384-dimensional vectors
📂 Loading pre-computed embeddings...
✅ Loaded 30 embeddings from cache
📊 Embedding matrix shape: (30, 384)
🎯 Embeddings normalized for cosine similarity
🚀 Using FAISS for fast similarity search
💾 FAISS index saved to ./data/faiss.index

🔍 Testing semantic search:

📋 Query: 'stellar collapse and gravitational effects'
Top 5 semantic matches:
  1. [ast_002] The Life Cycle of Stars (similarity: 0.420)
      Stars are born from clouds of gas and dust called nebulae. Through gravitational collapse, the core ...
  2. [ast_001] Understanding Black Holes (similarity: 0.377)
      Black holes are regions of spacetime where gravity is so strong that nothing, including light, can e...
  3. [ast_003] Solar System Formation (similarity: 0.342)
      The solar system formed approximately 4.6 billion years ago from the gravitational collapse of a mol...
  4. [ast_005] Dark Matter and Dark Energy

## Hybrid Retrieval: Best of Both Worlds

**Hybrid retrieval** combines lexical (BM25) and semantic (embeddings) approaches to leverage their complementary strengths:

### Lexical Strengths
- Exact term matching for technical terms and proper names
- Fast computation and interpretable results
- Robust to domain shifts

### Semantic Strengths
- Conceptual matching beyond exact words
- Better handling of synonyms and paraphrases
- Context-aware understanding

### Hybrid Strategy
1. **Union approach**: Get candidates from both methods
2. **Score normalization**: Make scores comparable across methods
3. **Rank fusion**: Combine rankings intelligently

### Score Normalization
Different retrieval methods produce scores on different scales. Min-max normalization maps all scores to [0,1] range:
```
normalized_score = (score - min_score) / (max_score - min_score)
```

This ensures fair combination across methods, preventing one method from dominating due to larger score magnitudes.

Hybrid retrieval typically improves both **recall** (finding relevant documents) and **robustness** (handling diverse query types).

In [187]:
# Hybrid retrieval combining lexical and semantic approaches
import pandas as pd

def normalize_scores(scores):
    """Normalize scores to [0, 1] range using min-max scaling."""
    if not scores:
        return []
    
    min_score = min(scores)
    max_score = max(scores)
    
    # Handle case where all scores are the same
    if min_score == max_score:
        return [1.0] * len(scores)
    
    # Apply min-max normalization
    normalized = [(score - min_score) / (max_score - min_score) for score in scores]
    return normalized

def hybrid_retrieve(query_text, top_k_lex=15, top_k_sem=15):
    """
    Combine lexical (BM25) and semantic retrieval results.
    Uses union of candidates and normalizes scores for fair comparison.
    
    Args:
        query_text (str): The search query
        top_k_lex (int): Number of lexical results to retrieve
        top_k_sem (int): Number of semantic results to retrieve
    
    Returns:
        pd.DataFrame: Combined results with normalized scores
    """
    # Get BM25 results on original documents (not chunks)
    bm25_results = query_bm25(query_text, top_k=top_k_lex)
    
    # Get semantic results on chunks
    semantic_results = semantic_search(query_text, top_k=top_k_sem)
    
    # Create unified candidate list
    candidates = {}
    
    # Process BM25 results
    bm25_scores = [score for _, score, _ in bm25_results]
    normalized_bm25_scores = normalize_scores(bm25_scores)
    
    for (doc_idx, score, doc_info), norm_score in zip(bm25_results, normalized_bm25_scores):
        doc_id = doc_info['id']
        if doc_id not in candidates:
            candidates[doc_id] = {
                'doc_id': doc_id,
                'title': doc_info['title'],
                'text': doc_info['text'],
                'bm25_score': score,
                'bm25_rank': len(candidates) + 1,
                'bm25_normalized': norm_score,
                'semantic_score': 0,
                'semantic_rank': None,
                'semantic_normalized': 0
            }
    
    # Process semantic results
    semantic_scores = [score for _, score, _ in semantic_results]
    normalized_semantic_scores = normalize_scores(semantic_scores)
    
    for (chunk_idx, score, chunk_info), norm_score in zip(semantic_results, normalized_semantic_scores):
        doc_id = chunk_info['doc_id']
        
        if doc_id not in candidates:
            # Find original document info for new semantic candidates
            # Add error handling for missing documents
            matching_docs = corpus_df[corpus_df['id'] == doc_id]
            if matching_docs.empty:
                print(f"⚠️ Warning: Document {doc_id} from semantic search not found in corpus_df")
                print(f"   Available corpus IDs: {corpus_df['id'].head().tolist()}")
                print(f"   Chunk info: {chunk_info}")
                continue  # Skip this document
                
            orig_doc = matching_docs.iloc[0]
            candidates[doc_id] = {
                'doc_id': doc_id,
                'title': orig_doc['title'],
                'text': orig_doc['text'],
                'bm25_score': 0,
                'bm25_rank': None,
                'bm25_normalized': 0,
                'semantic_score': score,
                'semantic_rank': len([r for r in semantic_results if r[2]['doc_id'] == doc_id]) + 1,
                'semantic_normalized': norm_score
            }
        else:
            # Update existing candidate with semantic info
            # Take best semantic score if multiple chunks from same document
            if score > candidates[doc_id]['semantic_score']:
                candidates[doc_id]['semantic_score'] = score
                candidates[doc_id]['semantic_normalized'] = norm_score
                candidates[doc_id]['semantic_rank'] = len([r for r in semantic_results if r[2]['doc_id'] == doc_id]) + 1
    
    # Convert to DataFrame for easy manipulation
    hybrid_results = pd.DataFrame.from_dict(candidates, orient='index')
    
    return hybrid_results

# Test hybrid retrieval
test_query = "black hole formation from stellar collapse"

print(f"🔍 Testing hybrid retrieval for: '{test_query}'")
hybrid_df = hybrid_retrieve(test_query, top_k_lex=10, top_k_sem=10)

print(f"\n📊 Found {len(hybrid_df)} unique documents from hybrid approach")

# Show method comparison
bm25_only = hybrid_df[hybrid_df['bm25_rank'].notna()].shape[0]
semantic_only = hybrid_df[hybrid_df['semantic_rank'].notna()].shape[0]
both_methods = hybrid_df[(hybrid_df['bm25_rank'].notna()) & (hybrid_df['semantic_rank'].notna())].shape[0]

print(f"\n📈 Method coverage:")
print(f"  BM25 found: {bm25_only} documents")
print(f"  Semantic found: {semantic_only} documents")
print(f"  Both methods found: {both_methods} documents")
print(f"  Union: {len(hybrid_df)} documents")

# Display top hybrid results using proper DataFrame display
print(f"\n📋 Top 10 candidates for rank fusion:")

# Create a clean display DataFrame with selected columns
display_df = hybrid_df[['doc_id', 'title', 'bm25_normalized', 'semantic_normalized', 'bm25_rank', 'semantic_rank']].head(10).copy()

# Format numeric columns for better display
display_df['bm25_normalized'] = display_df['bm25_normalized'].round(3)
display_df['semantic_normalized'] = display_df['semantic_normalized'].round(3)

# Truncate titles for better display
display_df['title'] = display_df['title'].str[:50] + '...'

# Rename columns for cleaner display
display_df.columns = ['Doc ID', 'Title', 'BM25 Score', 'Semantic Score', 'BM25 Rank', 'Semantic Rank']

# Display the formatted DataFrame
display(display_df)

🔍 Testing hybrid retrieval for: 'black hole formation from stellar collapse'

📊 Found 14 unique documents from hybrid approach

📈 Method coverage:
  BM25 found: 10 documents
  Semantic found: 10 documents
  Both methods found: 6 documents
  Union: 14 documents

📋 Top 10 candidates for rank fusion:


Unnamed: 0,Doc ID,Title,BM25 Score,Semantic Score,BM25 Rank,Semantic Rank
ast_001,ast_001,Understanding Black Holes...,1.0,1.0,1.0,2.0
ast_005,ast_005,Dark Matter and Dark Energy...,0.69,0.333,2.0,2.0
ast_002,ast_002,The Life Cycle of Stars...,0.532,0.817,3.0,2.0
ast_003,ast_003,Solar System Formation...,0.511,0.777,4.0,2.0
hist_001,hist_001,The Industrial Revolution...,0.279,0.0,5.0,
py_004,py_004,Machine Learning Pipeline Design...,0.205,0.0,6.0,
hist_003,hist_003,The Renaissance Period...,0.186,0.0,7.0,
ast_004,ast_004,Exoplanet Detection Methods...,0.178,0.197,8.0,2.0
cook_001,cook_001,Essential Knife Skills...,0.0,0.016,9.0,2.0
cook_002,cook_002,Understanding Heat and Cooking Methods...,0.0,0.0,10.0,


## Rank Fusion: Reciprocal Rank Fusion (RRF)

**Reciprocal Rank Fusion (RRF)** is a simple yet effective method for combining rankings from multiple retrieval systems. It's particularly robust because it relies on ranks rather than raw scores.

### RRF Formula
For each document, RRF computes:
```
RRF_score = Σ (1 / (k + rank_i))
```
where:
- `rank_i` is the document's rank in system i
- `k` is a constant (typically 60) that controls the contribution curve, preventing top ranks from dominating too much.
- The sum is over all systems that retrieved the document

### Why RRF Works Well
1. **Rank-based**: Avoids issues with different score scales
2. **Robust**: Less sensitive to outliers than score-based fusion
3. **Simple**: No parameter tuning beyond choosing k
4. **Proven**: Works well across different retrieval types

### Parameter k
- **k=60** (default): Balanced contribution from all ranks
- **Lower k**: Top ranks dominate more
- **Higher k**: More uniform contribution across ranks

Documents appearing in multiple systems get higher RRF scores, while high-ranking documents in any single system also score well.

In [188]:
# Reciprocal Rank Fusion (RRF) implementation
# RRF is a robust method for combining rankings from different retrieval systems

def rrf_fuse(rankings, k=60):
    """
    Apply Reciprocal Rank Fusion to combine multiple ranking methods.
    
    Args:
        rankings (dict): Dictionary mapping method names to lists of (doc_id, rank) tuples
        k (int): RRF parameter controlling rank contribution curve (default: 60)
    
    Returns:
        list: Tuples of (doc_id, rrf_score, method_details)
    """
    # Collect all unique document IDs
    all_doc_ids = set()
    for method_rankings in rankings.values():
        all_doc_ids.update(doc_id for doc_id, _ in method_rankings)
    
    # Calculate RRF scores for each document
    rrf_scores = {}
    method_details = {}
    
    for doc_id in all_doc_ids:
        total_score = 0
        doc_method_info = {}
        
        # Sum reciprocal ranks across all methods that retrieved this document
        for method_name, method_rankings in rankings.items():
            # Find this document's rank in the current method
            doc_rank = None
            for d_id, rank in method_rankings:
                if d_id == doc_id:
                    doc_rank = rank
                    break
            
            if doc_rank is not None:
                # Calculate reciprocal rank contribution
                contribution = 1 / (k + doc_rank)
                total_score += contribution
                doc_method_info[method_name] = {
                    'rank': doc_rank,
                    'contribution': contribution
                }
            else:
                doc_method_info[method_name] = {
                    'rank': None,
                    'contribution': 0
                }
        
        rrf_scores[doc_id] = total_score
        method_details[doc_id] = doc_method_info
    
    # Sort by RRF score (highest first)
    sorted_results = sorted(
        [(doc_id, score, method_details[doc_id]) 
         for doc_id, score in rrf_scores.items()],
        key=lambda x: x[1],
        reverse=True
    )
    
    return sorted_results

def apply_rrf_to_hybrid(hybrid_df, k=60):
    """
    Apply RRF to hybrid retrieval results.
    
    Args:
        hybrid_df (pd.DataFrame): Hybrid retrieval results
        k (int): RRF parameter
    
    Returns:
        list: RRF fused results
    """
    # Prepare rankings for RRF
    rankings = {}
    
    # BM25 rankings (only for documents that have BM25 results)
    bm25_docs = hybrid_df[hybrid_df['bm25_rank'].notna()]
    if len(bm25_docs) > 0:
        bm25_rankings = [(row['doc_id'], row['bm25_rank']) 
                        for _, row in bm25_docs.iterrows()]
        rankings['BM25'] = bm25_rankings
    
    # Semantic rankings (only for documents that have semantic results)
    semantic_docs = hybrid_df[hybrid_df['semantic_rank'].notna()]
    if len(semantic_docs) > 0:
        semantic_rankings = [(row['doc_id'], row['semantic_rank']) 
                           for _, row in semantic_docs.iterrows()]
        rankings['Semantic'] = semantic_rankings
    
    # Apply RRF
    rrf_results = rrf_fuse(rankings, k=k)
    
    # Enrich results with document information
    enriched_results = []
    for doc_id, rrf_score, method_info in rrf_results:
        doc_row = hybrid_df[hybrid_df['doc_id'] == doc_id].iloc[0]
        enriched_results.append({
            'doc_id': doc_id,
            'title': doc_row['title'],
            'text': doc_row['text'],
            'rrf_score': rrf_score,
            'method_info': method_info,
            'bm25_rank': doc_row['bm25_rank'] if pd.notna(doc_row['bm25_rank']) else None,
            'semantic_rank': doc_row['semantic_rank'] if pd.notna(doc_row['semantic_rank']) else None
        })
    
    return enriched_results

# Test RRF on our hybrid results
test_query = "black hole formation from stellar collapse"
print(f"🔍 Applying RRF to query: '{test_query}'")

# Get hybrid results
hybrid_df = hybrid_retrieve(test_query, top_k_lex=10, top_k_sem=10)

# Apply RRF
rrf_results = apply_rrf_to_hybrid(hybrid_df, k=60)

print(f"\n📊 RRF Results (k=60):")

# Create DataFrame for better display of RRF results
rrf_display_data = []
for rank, result in enumerate(rrf_results[:10], 1):
    rrf_display_data.append({
        'Rank': rank,
        'Doc ID': result['doc_id'],
        'Title': result['title'][:40] + '...' if len(result['title']) > 40 else result['title'],
        'RRF Score': round(result['rrf_score'], 4),
        'BM25 Rank': result['bm25_rank'] if result['bm25_rank'] is not None else None,
        'Semantic Rank': result['semantic_rank'] if result['semantic_rank'] is not None else None
    })

rrf_df = pd.DataFrame(rrf_display_data)
print("\nTop 10 documents after RRF fusion:")
display(rrf_df)

# Analyze method contributions
print(f"\n📈 Method contribution analysis for top 5 results:")
for i, result in enumerate(rrf_results[:5], 1):
    print(f"\n{i}. [{result['doc_id']}] {result['title']}")
    print(f"   Total RRF score: {result['rrf_score']:.4f}")
    
    for method, info in result['method_info'].items():
        if info['rank'] is not None:
            print(f"   {method}: rank {info['rank']:.0f} → contribution {info['contribution']:.4f}")
        else:
            print(f"   {method}: not found → contribution 0.0000")

# Test different k values to show effect
print(f"\n🔬 Effect of different k values on top result:")

k_comparison_data = []
for k_val in [10, 30, 60, 100]:
    rrf_k = apply_rrf_to_hybrid(hybrid_df, k=k_val)
    top_result = rrf_k[0]
    k_comparison_data.append({
        'k Value': k_val,
        'Top Doc ID': top_result['doc_id'],
        'Title': top_result['title'][:50] + '...' if len(top_result['title']) > 50 else top_result['title'],
        'RRF Score': round(top_result['rrf_score'], 4)
    })

k_comparison_df = pd.DataFrame(k_comparison_data)
display(k_comparison_df)

🔍 Applying RRF to query: 'black hole formation from stellar collapse'

📊 RRF Results (k=60):

Top 10 documents after RRF fusion:


Unnamed: 0,Rank,Doc ID,Title,RRF Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,1.0,2.0
1,2,ast_005,Dark Matter and Dark Energy,0.0323,2.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,3.0,2.0
3,4,ast_003,Solar System Formation,0.0318,4.0,2.0
4,5,ast_004,Exoplanet Detection Methods,0.0308,8.0,2.0
5,6,cook_001,Essential Knife Skills,0.0306,9.0,2.0
6,7,hist_002,Ancient Civilizations and Trade,0.0161,,2.0
7,8,cook_004,Sauce Making Fundamentals,0.0161,,2.0
8,9,hist_005,The Cold War Era,0.0161,,2.0
9,10,hist_004,World War Impact on Society,0.0161,,2.0



📈 Method contribution analysis for top 5 results:

1. [ast_001] Understanding Black Holes
   Total RRF score: 0.0325
   BM25: rank 1 → contribution 0.0164
   Semantic: rank 2 → contribution 0.0161

2. [ast_005] Dark Matter and Dark Energy
   Total RRF score: 0.0323
   BM25: rank 2 → contribution 0.0161
   Semantic: rank 2 → contribution 0.0161

3. [ast_002] The Life Cycle of Stars
   Total RRF score: 0.0320
   BM25: rank 3 → contribution 0.0159
   Semantic: rank 2 → contribution 0.0161

4. [ast_003] Solar System Formation
   Total RRF score: 0.0318
   BM25: rank 4 → contribution 0.0156
   Semantic: rank 2 → contribution 0.0161

5. [ast_004] Exoplanet Detection Methods
   Total RRF score: 0.0308
   BM25: rank 8 → contribution 0.0147
   Semantic: rank 2 → contribution 0.0161

🔬 Effect of different k values on top result:


Unnamed: 0,k Value,Top Doc ID,Title,RRF Score
0,10,ast_001,Understanding Black Holes,0.1742
1,30,ast_001,Understanding Black Holes,0.0635
2,60,ast_001,Understanding Black Holes,0.0325
3,100,ast_001,Understanding Black Holes,0.0197


## Re-ranking with Cross-Encoders

**Cross-encoders** are transformer models that score (query, passage) pairs directly, providing more accurate relevance estimates than individual embeddings. They're the "second stage" in a two-stage retrieval pipeline.

### How Cross-Encoders Work
Unlike bi-encoders (like sentence-transformers) that encode query and passage (or document) separately, cross-encoders:
1. Concatenate query and passage as input: `[CLS] query [SEP] passage [SEP]` ([CLS] and [SEP] are special tokens used in models like BERT to denote the start and separation of sequences)
2. Use full attention across query-passage pairs
3. Output a single relevance score

### Advantages
- **Higher accuracy**: Full attention between query and passage
- **Better ranking**: Specifically trained for relevance scoring
- **Fine-grained scoring**: Can distinguish subtle relevance differences

### Trade-offs
- **Computational cost**: Must process each (query, candidate) pair
- **Latency**: Slower than embedding-based similarity
- **Scale limitations**: Practical only for re-ranking small candidate sets

### Best Practice
Use cross-encoders to re-rank the top-N (typically 10-50) candidates from faster retrieval methods. This gives you both speed and accuracy.

In [189]:
# Re-ranking with open-source cross-encoder models
# Cross-encoders provide more accurate relevance scoring for final ranking
import time

# Load a lightweight cross-encoder model
# ms-marco-MiniLM is trained on Microsoft's passage ranking dataset
cross_encoder_model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
# Alternative models (commented for reference):
# cross_encoder_model_name = "BAAI/bge-reranker-base"          # Higher quality, larger
# cross_encoder_model_name = "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # Faster, smaller

print(f"🤖 Loading cross-encoder: {cross_encoder_model_name}")
cross_encoder = CrossEncoder(cross_encoder_model_name)
print("✅ Cross-encoder loaded successfully")

def rerank_with_cross_encoder(query_text, candidates, top_k=10):
    """
    Re-rank candidates using a cross-encoder model.
    
    Args:
        query_text (str): The search query
        candidates (list): List of candidate documents with text
        top_k (int): Number of top results to return after re-ranking
    
    Returns:
        list: Re-ranked candidates with cross-encoder scores
    """
    if not candidates:
        return []
    
    start_time = time.time()
    
    # Prepare (query, passage) pairs for the cross-encoder
    # Use document title + text for better context
    query_passage_pairs = []
    for candidate in candidates:
        # Combine title and text for richer passage representation
        passage_text = f"{candidate['title']}. {candidate['text']}"
        query_passage_pairs.append([query_text, passage_text])
    
    # Get relevance scores from cross-encoder
    # Scores are logits that can be interpreted as relevance strength
    print(f"🔄 Computing cross-encoder scores for {len(query_passage_pairs)} candidates...")
    relevance_scores = cross_encoder.predict(query_passage_pairs)
    
    # Combine candidates with their cross-encoder scores
    scored_candidates = []
    for candidate, ce_score in zip(candidates, relevance_scores):
        scored_candidate = candidate.copy()
        scored_candidate['cross_encoder_score'] = float(ce_score)
        scored_candidates.append(scored_candidate)
    
    # Sort by cross-encoder score (highest first)
    reranked = sorted(scored_candidates, 
                     key=lambda x: x['cross_encoder_score'], 
                     reverse=True)
    
    elapsed_time = time.time() - start_time
    print(f"⏱️  Cross-encoder re-ranking completed in {elapsed_time:.2f} seconds")
    
    return reranked[:top_k]

# Test cross-encoder re-ranking on RRF results
test_query = "black hole formation from stellar collapse"
print(f"\n🔍 Testing cross-encoder re-ranking for: '{test_query}'")

# Get RRF results as candidates for re-ranking
hybrid_df = hybrid_retrieve(test_query, top_k_lex=15, top_k_sem=15)
rrf_candidates = apply_rrf_to_hybrid(hybrid_df, k=60)

# Take top 20 RRF candidates for re-ranking (manageable size for cross-encoder)
candidates_for_reranking = rrf_candidates[:20]

print(f"\n📊 Re-ranking top {len(candidates_for_reranking)} RRF candidates")

# Apply cross-encoder re-ranking
reranked_results = rerank_with_cross_encoder(
    test_query, 
    candidates_for_reranking, 
    top_k=10
)

# Display comparison: RRF order vs Cross-encoder order
print(f"\n📈 Comparison: RRF vs Cross-Encoder ranking")

print("\nRRF Ranking (before re-ranking) - Top 5:")
rrf_display_data = []
for i, candidate in enumerate(candidates_for_reranking[:5], 1):
    rrf_display_data.append({
        'Rank': i,
        'Doc ID': candidate['doc_id'],
        'Title': candidate['title'][:40] + '...' if len(candidate['title']) > 40 else candidate['title'],
        'RRF Score': round(candidate['rrf_score'], 4)
    })

rrf_comparison_df = pd.DataFrame(rrf_display_data)
display(rrf_comparison_df)

print("\nCross-Encoder Ranking (after re-ranking) - Top 5:")
ce_display_data = []
for i, result in enumerate(reranked_results[:5], 1):
    ce_display_data.append({
        'Rank': i,
        'Doc ID': result['doc_id'],
        'Title': result['title'][:40] + '...' if len(result['title']) > 40 else result['title'],
        'CE Score': round(result['cross_encoder_score'], 4),
        'RRF Score': round(result['rrf_score'], 4)
    })

ce_comparison_df = pd.DataFrame(ce_display_data)
display(ce_comparison_df)

# Show detailed results with snippets
print(f"\n📋 Top 5 results with snippets:")
for i, result in enumerate(reranked_results[:5], 1):
    print(f"\n{i}. [{result['doc_id']}] {result['title']}")
    print(f"   Cross-encoder score: {result['cross_encoder_score']:.4f}")
    print(f"   Original RRF score: {result['rrf_score']:.4f}")
    
    # Show first 200 characters as snippet
    snippet = result['text'][:200] + '...' if len(result['text']) > 200 else result['text']
    print(f"   Snippet: {snippet}")

# Analyze ranking changes
print(f"\n🔄 Ranking changes analysis:")
rank_changes = []
for i, reranked_doc in enumerate(reranked_results[:10], 1):
    # Find original RRF position
    original_rank = None
    for j, original_doc in enumerate(candidates_for_reranking, 1):
        if original_doc['doc_id'] == reranked_doc['doc_id']:
            original_rank = j
            break
    
    change = original_rank - i if original_rank else "New"
    rank_changes.append({
        'Doc ID': reranked_doc['doc_id'],
        'Original RRF Rank': original_rank,
        'New CE Rank': i,
        'Position Change': f"+{change}" if isinstance(change, int) and change > 0 else str(change),
        'Direction': '↑' if isinstance(change, int) and change > 0 else ('↓' if isinstance(change, int) and change < 0 else '→')
    })

rank_changes_df = pd.DataFrame(rank_changes)
display(rank_changes_df)

🤖 Loading cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
✅ Cross-encoder loaded successfully

🔍 Testing cross-encoder re-ranking for: 'black hole formation from stellar collapse'

📊 Re-ranking top 19 RRF candidates
🔄 Computing cross-encoder scores for 19 candidates...
⏱️  Cross-encoder re-ranking completed in 0.05 seconds

📈 Comparison: RRF vs Cross-Encoder ranking

RRF Ranking (before re-ranking) - Top 5:


Unnamed: 0,Rank,Doc ID,Title,RRF Score
0,1,ast_001,Understanding Black Holes,0.0325
1,2,ast_005,Dark Matter and Dark Energy,0.0323
2,3,ast_002,The Life Cycle of Stars,0.032
3,4,ast_003,Solar System Formation,0.0318
4,5,hist_001,The Industrial Revolution,0.0315



Cross-Encoder Ranking (after re-ranking) - Top 5:


Unnamed: 0,Rank,Doc ID,Title,CE Score,RRF Score
0,1,ast_001,Understanding Black Holes,4.5733,0.0325
1,2,ast_003,Solar System Formation,-4.3658,0.0318
2,3,ast_002,The Life Cycle of Stars,-8.0267,0.032
3,4,ast_005,Dark Matter and Dark Energy,-9.6758,0.0323
4,5,hist_003,The Renaissance Period,-11.2628,0.0149



📋 Top 5 results with snippets:

1. [ast_001] Understanding Black Holes
   Cross-encoder score: 4.5733
   Original RRF score: 0.0325
   Snippet: Black holes are regions of spacetime where gravity is so strong that nothing, including light, can escape once it crosses the event horizon. The event horizon is the boundary beyond which escape becom...

2. [ast_003] Solar System Formation
   Cross-encoder score: -4.3658
   Original RRF score: 0.0318
   Snippet: The solar system formed approximately 4.6 billion years ago from the gravitational collapse of a molecular cloud. The Sun formed at the center while leftover material formed a protoplanetary disk. Thr...

3. [ast_002] The Life Cycle of Stars
   Cross-encoder score: -8.0267
   Original RRF score: 0.0320
   Snippet: Stars are born from clouds of gas and dust called nebulae. Through gravitational collapse, the core temperature rises until nuclear fusion begins, converting hydrogen into helium and releasing enormou...

4. [ast_005] Dark M

Unnamed: 0,Doc ID,Original RRF Rank,New CE Rank,Position Change,Direction
0,ast_001,1,1,0,→
1,ast_003,4,2,2,↑
2,ast_002,3,3,0,→
3,ast_005,2,4,-2,↓
4,hist_003,16,5,11,↑
5,cook_001,7,6,1,↑
6,cook_003,8,7,1,↑
7,cook_002,17,8,9,↑
8,ast_004,6,9,-3,↓
9,heal_004,12,10,2,↑


## LLM Generation: Bringing It All Together

The **generation** step combines our retrieved and re-ranked documents with a language model to produce final answers. This is where RAG "augments" the generation with retrieved knowledge.

### Key Components
1. **Context Selection**: Choose top-k documents within token budget
2. **Prompt Engineering**: Structure context and instructions clearly
3. **Citation**: Enable traceability back to source documents
4. **Grounding**: Instruct the model to use only provided context

### Prompt Structure
A well-structured RAG prompt includes:
- **System message**: Instructions for behavior and citation
- **Context section**: Retrieved documents with clear formatting
- **Query**: User's original question
- **Instructions**: Explicit grounding requirements

### Token Budget Management
Language models have context limits. We must:
- Prioritize highest-ranked documents
- Truncate or summarize if needed
- Leave space for the generated response

### Citation Strategy
Include document IDs in the context so the model can reference specific sources. This enables fact-checking and builds user trust.

**Environment handling**: Read API keys from environment variables and provide graceful fallbacks for missing credentials.

In [190]:
# End-to-end RAG pipeline with OpenAI generation
# This combines all our retrieval components with final answer generation
import os
from openai import OpenAI

def estimate_tokens(text):
    """
    Rough estimation of token count (approximately 4 characters per token).
    This is a simple heuristic; actual tokenization may differ.
    
    Args:
        text (str): Input text
    
    Returns:
        int: Estimated token count
    """
    return len(text) // 4

def select_context_chunks(ranked_results, max_tokens=2000):
    """
    Select top-ranked documents that fit within token budget.
    
    Args:
        ranked_results (list): Ranked documents from retrieval pipeline
        max_tokens (int): Maximum tokens to use for context
    
    Returns:
        list: Selected documents within token budget
    """
    selected_chunks = []
    total_tokens = 0
    
    for result in ranked_results:
        # Estimate tokens for this document (title + text + formatting)
        doc_text = f"[{result['doc_id']}] {result['title']}\n{result['text']}"
        doc_tokens = estimate_tokens(doc_text)
        
        # Check if adding this document would exceed budget
        if total_tokens + doc_tokens <= max_tokens:
            selected_chunks.append(result)
            total_tokens += doc_tokens
        else:
            break
    
    return selected_chunks, total_tokens

def create_rag_prompt(query, context_chunks):
    """
    Create a structured prompt for RAG generation.
    
    Args:
        query (str): User's question
        context_chunks (list): Selected context documents
    
    Returns:
        tuple: (system_message, user_message)
    """
    # System message with clear instructions
    system_message = """You are a helpful assistant that answers questions based on provided context. 

INSTRUCTIONS:
1. Answer the user's question using ONLY the information provided in the context below
2. If you cite information, include the document ID in brackets [doc_id]
3. If the context doesn't contain enough information to answer the question, say so clearly
4. Be accurate and specific - don't make assumptions beyond what's stated in the context
5. Provide a clear, well-structured answer
"""
    
    # Format context documents clearly
    context_text = "\n\nCONTEXT DOCUMENTS:\n\n"
    for i, chunk in enumerate(context_chunks, 1):
        context_text += f"Document {i}: [{chunk['doc_id']}]\n"
        context_text += f"Title: {chunk['title']}\n"
        context_text += f"Content: {chunk['text']}\n\n"
    
    # User message with query and context
    user_message = f"{context_text}\nQUESTION: {query}\n\nPlease provide a comprehensive answer based on the context above."
    
    return system_message, user_message

def answer_query(query_text, max_context_tokens=2000):
    """
    Complete RAG pipeline: retrieve, re-rank, and generate answer.
    
    Args:
        query_text (str): User's question
        max_context_tokens (int): Maximum tokens for context
    
    Returns:
        dict: Complete RAG results including retrieval steps and final answer
    """
    print(f"🔍 Starting RAG pipeline for: '{query_text}'")
    
    # Step 1: Hybrid retrieval (BM25 + Semantic)
    print("📊 Step 1: Hybrid retrieval...")
    hybrid_results = hybrid_retrieve(query_text, top_k_lex=15, top_k_sem=15)
    
    # Step 2: Rank fusion with RRF
    print("🔄 Step 2: Rank fusion (RRF)...")
    rrf_results = apply_rrf_to_hybrid(hybrid_results, k=60)
    
    # Step 3: Cross-encoder re-ranking
    print("🎯 Step 3: Cross-encoder re-ranking...")
    top_candidates = rrf_results[:20]  # Re-rank top 20 candidates
    reranked_results = rerank_with_cross_encoder(query_text, top_candidates, top_k=15)
    
    # Step 4: Context selection within token budget
    print("📝 Step 4: Context selection...")
    selected_context, context_tokens = select_context_chunks(reranked_results, max_context_tokens)
    print(f"   Selected {len(selected_context)} documents using ~{context_tokens} tokens")
    
    # Step 5: Generate answer with OpenAI
    print("🤖 Step 5: Generating answer...")
    
    # Check for OpenAI API key
    openai_api_key = os.getenv('OPENAI_API_KEY')
    if not openai_api_key:
        print("⚠️  OpenAI API key not found in environment variables")
        print("   Set OPENAI_API_KEY environment variable to enable generation")
        return {
            'query': query_text,
            'retrieval_results': len(hybrid_results),
            'rrf_results': len(rrf_results),
            'reranked_results': len(reranked_results),
            'selected_context': selected_context,
            'context_tokens': context_tokens,
            'answer': "[Generation skipped: OpenAI API key not available]",
            'citations': [chunk['doc_id'] for chunk in selected_context]
        }
    
    try:
        # Initialize OpenAI client
        client = OpenAI()
        
        # Create RAG prompt
        system_message, user_message = create_rag_prompt(query_text, selected_context)
        
        # Call OpenAI API
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Fast, cost-effective model
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=500,  # Limit response length
            temperature=0.1  # Low temperature for factual responses
        )
        
        answer = response.choices[0].message.content
        
    except Exception as e:
        print(f"❌ Error during generation: {str(e)}")
        answer = f"[Generation failed: {str(e)}]"
    
    # Return comprehensive results
    return {
        'query': query_text,
        'retrieval_results': len(hybrid_results),
        'rrf_results': len(rrf_results),
        'reranked_results': len(reranked_results),
        'selected_context': selected_context,
        'context_tokens': context_tokens,
        'answer': answer,
        'citations': [chunk['doc_id'] for chunk in selected_context]
    }

# Test the complete RAG pipeline
test_queries = [
    "How do black holes form and what happens at the event horizon?",
    "What are the health benefits of regular exercise and how does it affect the cardiovascular system?",
    "How do you make a good roux and what are the classic mother sauces in cooking?"
]

print("🚀 Testing complete RAG pipeline:\n")
for i, query in enumerate(test_queries, 1):
    print(f"=" * 80)
    print(f"TEST {i}: {query}")
    print(f"=" * 80)
    
    # Run complete RAG pipeline
    rag_result = answer_query(query)
    
    # Display results
    print(f"\n📊 Pipeline Summary:")
    print(f"   Hybrid retrieval: {rag_result['retrieval_results']} candidates")
    display(hybrid_df.head(5))  # Show top 5 hybrid candidates
    print(f"   RRF fusion: {rag_result['rrf_results']} documents")
    display(rrf_df.head(5))  # Show top 5 RRF candidates
    print(f"   Re-ranked: {rag_result['reranked_results']} documents")
    
    # Convert reranked_results to DataFrame for better display
    reranked_display_data = []
    for rank, doc in enumerate(reranked_results[:10], 1):  # Show top 10
        reranked_display_data.append({
            'Rank': rank,
            'Doc ID': doc['doc_id'],
            'Title': doc['title'][:40] + '...' if len(doc['title']) > 40 else doc['title'],
            'RRF Score': round(doc['rrf_score'], 4),
            'CE Score': round(doc['cross_encoder_score'], 2),
            'BM25 Rank': doc.get('bm25_rank', 'N/A'),
            'Semantic Rank': doc.get('semantic_rank', 'N/A')
        })
    
    reranked_df = pd.DataFrame(reranked_display_data)
    print("🎯 Top 10 Cross-Encoder Reranked Results:")
    display(reranked_df)
    
    print(f"   Context used: {len(rag_result['selected_context'])} documents ({rag_result['context_tokens']} tokens)")
    
    print(f"\n📚 Context Documents:")
    for j, doc in enumerate(rag_result['selected_context'], 1):
        print(f"   {j}. [{doc['doc_id']}] {doc['title']}")
    
    print(f"\n💬 Generated Answer:")
    print(rag_result['answer'])
    print(f"\n🔗 Citations: {', '.join(rag_result['citations'])}")
    print("\n")

🚀 Testing complete RAG pipeline:

TEST 1: How do black holes form and what happens at the event horizon?
🔍 Starting RAG pipeline for: 'How do black holes form and what happens at the event horizon?'
📊 Step 1: Hybrid retrieval...
🔄 Step 2: Rank fusion (RRF)...
🎯 Step 3: Cross-encoder re-ranking...
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
📝 Step 4: Context selection...
   Selected 15 documents using ~1608 tokens
🤖 Step 5: Generating answer...

📊 Pipeline Summary:
   Hybrid retrieval: 23 candidates


Unnamed: 0,doc_id,title,text,bm25_score,bm25_rank,bm25_normalized,semantic_score,semantic_rank,semantic_normalized
ast_001,ast_001,Understanding Black Holes,Black holes are regions of spacetime where gra...,5.871797,1.0,1.0,0.535157,2.0,1.0
ast_005,ast_005,Dark Matter and Dark Energy,Dark matter makes up approximately 27% of the ...,4.049869,2.0,0.689716,0.205811,2.0,0.385096
ast_002,ast_002,The Life Cycle of Stars,Stars are born from clouds of gas and dust cal...,3.122125,3.0,0.531715,0.444952,2.0,0.831583
ast_003,ast_003,Solar System Formation,The solar system formed approximately 4.6 bill...,2.997932,4.0,0.510565,0.42524,2.0,0.79478
hist_001,hist_001,The Industrial Revolution,The Industrial Revolution transformed society ...,1.640867,5.0,0.279449,0.033413,2.0,0.063221


   RRF fusion: 23 documents


Unnamed: 0,Rank,Doc ID,Title,RRF Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,1.0,2.0
1,2,ast_005,Dark Matter and Dark Energy,0.0323,2.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,3.0,2.0
3,4,ast_003,Solar System Formation,0.0318,4.0,2.0
4,5,ast_004,Exoplanet Detection Methods,0.0308,8.0,2.0


   Re-ranked: 15 documents
🎯 Top 10 Cross-Encoder Reranked Results:


Unnamed: 0,Rank,Doc ID,Title,RRF Score,CE Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,4.57,1.0,2.0
1,2,ast_003,Solar System Formation,0.0318,-4.37,4.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,-8.03,3.0,2.0
3,4,ast_005,Dark Matter and Dark Energy,0.0323,-9.68,2.0,2.0
4,5,hist_003,The Renaissance Period,0.0149,-11.26,7.0,
5,6,cook_001,Essential Knife Skills,0.0306,-11.27,9.0,2.0
6,7,cook_003,Building Flavor Profiles,0.0302,-11.29,11.0,2.0
7,8,cook_002,Understanding Heat and Cooking Methods,0.0143,-11.3,10.0,
8,9,ast_004,Exoplanet Detection Methods,0.0308,-11.3,8.0,2.0
9,10,heal_004,Sleep and Recovery,0.0161,-11.31,,2.0


   Context used: 15 documents (1608 tokens)

📚 Context Documents:
   1. [ast_001] Understanding Black Holes
   2. [ast_005] Dark Matter and Dark Energy
   3. [ast_003] Solar System Formation
   4. [ast_002] The Life Cycle of Stars
   5. [ast_004] Exoplanet Detection Methods
   6. [cook_001] Essential Knife Skills
   7. [py_003] Asynchronous Programming Patterns
   8. [heal_005] Immune System Function
   9. [cook_003] Building Flavor Profiles
   10. [heal_004] Sleep and Recovery
   11. [cook_002] Understanding Heat and Cooking Methods
   12. [cook_004] Sauce Making Fundamentals
   13. [sport_004] Strength Training Principles
   14. [cook_005] Baking Science and Techniques
   15. [hist_003] The Renaissance Period

💬 Generated Answer:
Black holes form when massive stars collapse under their own gravity at the end of their lifecycle. This collapse leads to the creation of a region in spacetime where gravity is so strong that nothing, including light, can escape once it crosses the event ho

Unnamed: 0,doc_id,title,text,bm25_score,bm25_rank,bm25_normalized,semantic_score,semantic_rank,semantic_normalized
ast_001,ast_001,Understanding Black Holes,Black holes are regions of spacetime where gra...,5.871797,1.0,1.0,0.535157,2.0,1.0
ast_005,ast_005,Dark Matter and Dark Energy,Dark matter makes up approximately 27% of the ...,4.049869,2.0,0.689716,0.205811,2.0,0.385096
ast_002,ast_002,The Life Cycle of Stars,Stars are born from clouds of gas and dust cal...,3.122125,3.0,0.531715,0.444952,2.0,0.831583
ast_003,ast_003,Solar System Formation,The solar system formed approximately 4.6 bill...,2.997932,4.0,0.510565,0.42524,2.0,0.79478
hist_001,hist_001,The Industrial Revolution,The Industrial Revolution transformed society ...,1.640867,5.0,0.279449,0.033413,2.0,0.063221


   RRF fusion: 26 documents


Unnamed: 0,Rank,Doc ID,Title,RRF Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,1.0,2.0
1,2,ast_005,Dark Matter and Dark Energy,0.0323,2.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,3.0,2.0
3,4,ast_003,Solar System Formation,0.0318,4.0,2.0
4,5,ast_004,Exoplanet Detection Methods,0.0308,8.0,2.0


   Re-ranked: 15 documents
🎯 Top 10 Cross-Encoder Reranked Results:


Unnamed: 0,Rank,Doc ID,Title,RRF Score,CE Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,4.57,1.0,2.0
1,2,ast_003,Solar System Formation,0.0318,-4.37,4.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,-8.03,3.0,2.0
3,4,ast_005,Dark Matter and Dark Energy,0.0323,-9.68,2.0,2.0
4,5,hist_003,The Renaissance Period,0.0149,-11.26,7.0,
5,6,cook_001,Essential Knife Skills,0.0306,-11.27,9.0,2.0
6,7,cook_003,Building Flavor Profiles,0.0302,-11.29,11.0,2.0
7,8,cook_002,Understanding Heat and Cooking Methods,0.0143,-11.3,10.0,
8,9,ast_004,Exoplanet Detection Methods,0.0308,-11.3,8.0,2.0
9,10,heal_004,Sleep and Recovery,0.0161,-11.31,,2.0


   Context used: 15 documents (1632 tokens)

📚 Context Documents:
   1. [heal_002] Cardiovascular Health
   2. [sport_005] Endurance Training Methodologies
   3. [heal_005] Immune System Function
   4. [heal_003] Mental Health and Wellness
   5. [heal_004] Sleep and Recovery
   6. [sport_003] Injury Prevention Strategies
   7. [sport_004] Strength Training Principles
   8. [heal_001] Nutrition and Metabolism
   9. [sport_001] Athletic Performance Optimization
   10. [sport_002] Sports Psychology and Mental Training
   11. [ast_005] Dark Matter and Dark Energy
   12. [ast_001] Understanding Black Holes
   13. [cook_003] Building Flavor Profiles
   14. [ast_004] Exoplanet Detection Methods
   15. [hist_003] The Renaissance Period

💬 Generated Answer:
Regular exercise offers several health benefits, particularly for the cardiovascular system. It strengthens the heart muscle, which enhances its efficiency in pumping blood throughout the body. This improved function leads to better circulat

Unnamed: 0,doc_id,title,text,bm25_score,bm25_rank,bm25_normalized,semantic_score,semantic_rank,semantic_normalized
ast_001,ast_001,Understanding Black Holes,Black holes are regions of spacetime where gra...,5.871797,1.0,1.0,0.535157,2.0,1.0
ast_005,ast_005,Dark Matter and Dark Energy,Dark matter makes up approximately 27% of the ...,4.049869,2.0,0.689716,0.205811,2.0,0.385096
ast_002,ast_002,The Life Cycle of Stars,Stars are born from clouds of gas and dust cal...,3.122125,3.0,0.531715,0.444952,2.0,0.831583
ast_003,ast_003,Solar System Formation,The solar system formed approximately 4.6 bill...,2.997932,4.0,0.510565,0.42524,2.0,0.79478
hist_001,hist_001,The Industrial Revolution,The Industrial Revolution transformed society ...,1.640867,5.0,0.279449,0.033413,2.0,0.063221


   RRF fusion: 18 documents


Unnamed: 0,Rank,Doc ID,Title,RRF Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,1.0,2.0
1,2,ast_005,Dark Matter and Dark Energy,0.0323,2.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,3.0,2.0
3,4,ast_003,Solar System Formation,0.0318,4.0,2.0
4,5,ast_004,Exoplanet Detection Methods,0.0308,8.0,2.0


   Re-ranked: 15 documents
🎯 Top 10 Cross-Encoder Reranked Results:


Unnamed: 0,Rank,Doc ID,Title,RRF Score,CE Score,BM25 Rank,Semantic Rank
0,1,ast_001,Understanding Black Holes,0.0325,4.57,1.0,2.0
1,2,ast_003,Solar System Formation,0.0318,-4.37,4.0,2.0
2,3,ast_002,The Life Cycle of Stars,0.032,-8.03,3.0,2.0
3,4,ast_005,Dark Matter and Dark Energy,0.0323,-9.68,2.0,2.0
4,5,hist_003,The Renaissance Period,0.0149,-11.26,7.0,
5,6,cook_001,Essential Knife Skills,0.0306,-11.27,9.0,2.0
6,7,cook_003,Building Flavor Profiles,0.0302,-11.29,11.0,2.0
7,8,cook_002,Understanding Heat and Cooking Methods,0.0143,-11.3,10.0,
8,9,ast_004,Exoplanet Detection Methods,0.0308,-11.3,8.0,2.0
9,10,heal_004,Sleep and Recovery,0.0161,-11.31,,2.0


   Context used: 15 documents (1607 tokens)

📚 Context Documents:
   1. [cook_004] Sauce Making Fundamentals
   2. [cook_003] Building Flavor Profiles
   3. [cook_002] Understanding Heat and Cooking Methods
   4. [cook_001] Essential Knife Skills
   5. [cook_005] Baking Science and Techniques
   6. [ast_004] Exoplanet Detection Methods
   7. [py_004] Machine Learning Pipeline Design
   8. [ast_001] Understanding Black Holes
   9. [ast_003] Solar System Formation
   10. [heal_004] Sleep and Recovery
   11. [py_003] Asynchronous Programming Patterns
   12. [py_001] Object-Oriented Programming Concepts
   13. [hist_005] The Cold War Era
   14. [ast_002] The Life Cycle of Stars
   15. [py_005] API Development with FastAPI

💬 Generated Answer:
To make a good roux, you need to combine equal parts fat and flour. This mixture is cooked over heat to create a smooth, creamy texture, which serves as the base for various sauces. The classic mother sauces in cooking, which are foundational to many 

## Choosing the Right Approach: Vector Store vs Prompt-Embedded vs Local Files

The choice of knowledge storage and retrieval architecture depends on your specific requirements:

### Vector Stores (e.g., Pinecone, Weaviate, Chroma)
**Best for**: Medium to large corpora, frequent updates, production systems
- **Pros**: Optimized ANN search, metadata filtering, horizontal scaling, real-time updates
- **Cons**: Additional infrastructure, cost, complexity
- **Use when**: >10,000 documents, multiple users, frequent content updates

### Prompt-Embedded Dataset
**Best for**: Very small, static knowledge bases
- **Pros**: Simplest implementation, no retrieval needed, perfect recall
- **Cons**: Limited by context window, expensive tokens, no semantic search
- **Use when**: <10 short documents, completely static content, maximum simplicity

### Local File Embeddings (Our Approach)
**Best for**: Small to medium corpora, single-node applications, development
- **Pros**: No external dependencies, fast development, full control, offline capability
- **Cons**: No horizontal scaling, manual index updates, limited concurrent access
- **Use when**: <100,000 documents, single-node deployment, development/prototyping

### Migration Path
Start with local files for development, then migrate to a vector store when you need:
- More than ~50,000 documents
- Real-time updates
- Multiple concurrent users
- Advanced filtering capabilities

## Testing the Retrieval Pipeline
Now that we've built each component of the RAG pipeline, it's time to test the entire system end-to-end. We'll use a set of diverse queries to evaluate how well our RAG implementation retrieves relevant documents and generates accurate answers.

Let's define some test queries that cover various topics in our synthetic corpus:

In [191]:
# Simple evaluation harness to compare retrieval methods
# This provides quantitative comparison of different approaches

# Define evaluation queries with expected relevant document IDs
# These are hand-crafted based on our synthetic corpus
evaluation_queries_simple = [
    {"query": "event horizon black holes singularity", "relevant_docs": ["ast_001"]},
    {"query": "stellar life cycle nebula nuclear fusion mass lifespan", "relevant_docs": ["ast_002"]},
    {"query": "solar system formation protoplanetary disk accretion collisions", "relevant_docs": ["ast_003"]},
    {"query": "exoplanet detection transit radial velocity direct imaging", "relevant_docs": ["ast_004"]},
    {"query": "dark matter 27 percent dark energy 68 percent accelerating expansion", "relevant_docs": ["ast_005"]},
    {"query": "knife skills pinch grip claw hand brunoise julienne chiffonade", "relevant_docs": ["cook_001"]},
    {"query": "heat transfer conduction convection radiation roasting braising steaming", "relevant_docs": ["cook_002"]},
    {"query": "build flavor profiles aromatics layering seasoning acid fat herbs", "relevant_docs": ["cook_003"]},
    {"query": "sauce making roux emulsification reduction bechamel hollandaise", "relevant_docs": ["cook_004"]},
    {"query": "baking science gluten leavening temperature ratios", "relevant_docs": ["cook_005"]},
    {"query": "OOP classes objects encapsulation inheritance polymorphism", "relevant_docs": ["py_001"]},
    {"query": "pandas dataframe vectorized operations groupby merge join", "relevant_docs": ["py_002"]},
    {"query": "async await event loop coroutines io bound", "relevant_docs": ["py_003"]},
    {"query": "machine learning pipeline preprocessing cross validation hyperparameter tuning monitoring", "relevant_docs": ["py_004"]},
    {"query": "FastAPI OpenAPI type hints dependency injection async middleware", "relevant_docs": ["py_005"]},
    {"query": "industrial revolution steam power factories urbanization pollution", "relevant_docs": ["hist_001"]},
    {"query": "ancient trade silk road mediterranean venice genoa", "relevant_docs": ["hist_002"]},
    {"query": "renaissance humanism printing press patronage scientific methods", "relevant_docs": ["hist_003"]},
    {"query": "world wars impact women workforce decolonization international organizations", "relevant_docs": ["hist_004"]},
    {"query": "cold war nuclear deterrence proxy wars space race soviet dissolution", "relevant_docs": ["hist_005"]},
    {"query": "nutrition metabolism macronutrients micronutrients catabolism anabolism", "relevant_docs": ["heal_001"]},
    {"query": "cardiovascular health exercise diet blood pressure cholesterol prevention", "relevant_docs": ["heal_002"]},
    {"query": "mental health stress management social connections therapy medication", "relevant_docs": ["heal_003"]},
    {"query": "sleep recovery REM non REM memory waste growth hormone hygiene", "relevant_docs": ["heal_004"]},
    {"query": "immune system innate adaptive antibodies vaccination lifestyle", "relevant_docs": ["heal_005"]},
    {"query": "athletic performance periodization recovery adaptation", "relevant_docs": ["sport_001"]},
    {"query": "sports psychology visualization goal setting self talk pressure", "relevant_docs": ["sport_002"]},
    {"query": "injury prevention dynamic warm up strength imbalances recovery time", "relevant_docs": ["sport_003"]},
    {"query": "strength training progressive overload compound exercises frequency volume intensity form", "relevant_docs": ["sport_004"]},
    {"query": "endurance training heart rate zones base intervals lactate threshold periodization", "relevant_docs": ["sport_005"]}
]

evaluation_queries_mixed = [
    {"query": "the boundary where light cannot escape defines a black hole", "relevant_docs": ["ast_001"]},
    {"query": "stars born in nebulae; mass determines how long they live", "relevant_docs": ["ast_002"]},
    {"query": "planets grew inside a dusty disk via accretion and collisions", "relevant_docs": ["ast_003"]},
    {"query": "find worlds by tiny eclipses or stellar wobbles", "relevant_docs": ["ast_004"]},
    {"query": "cosmic budget split: ~27% dark matter and ~68% dark energy", "relevant_docs": ["ast_005"]},
    {"query": "pinch grip and claw hand for consistent dice", "relevant_docs": ["cook_001"]},
    {"query": "touch swirl and radiant glow: three ways heat cooks food", "relevant_docs": ["cook_002"]},
    {"query": "acid brightens, fat carries, herbs finish—layer flavors early to late", "relevant_docs": ["cook_003"]},
    {"query": "roux emulsions and reductions as sauce foundations", "relevant_docs": ["cook_004"]},
    {"query": "gluten builds structure while leavening supplies gas lift", "relevant_docs": ["cook_005"]},
    {"query": "classes hide internals; inheritance and polymorphism reuse and adapt behavior", "relevant_docs": ["py_001"]},
    {"query": "vectorize then groupby; avoid loops in pandas dataframes", "relevant_docs": ["py_002"]},
    {"query": "use async await with coroutines for I O heavy tasks", "relevant_docs": ["py_003"]},
    {"query": "pipeline: preprocess → cross validate → tune → monitor for drift", "relevant_docs": ["py_004"]},
    {"query": "FastAPI uses type hints and DI; OpenAPI docs auto generate", "relevant_docs": ["py_005"]},
    {"query": "steam engines and factories pulled workers into cities", "relevant_docs": ["hist_001"]},
    {"query": "silk and ideas moved along Eurasian overland and Mediterranean sea routes", "relevant_docs": ["hist_002"]},
    {"query": "printing press and patronage powered Renaissance art and science", "relevant_docs": ["hist_003"]},
    {"query": "total war reshaped gender roles and spurred decolonization", "relevant_docs": ["hist_004"]},
    {"query": "deterrence by nukes, contests by proxy, and a space race", "relevant_docs": ["hist_005"]},
    {"query": "catabolism vs anabolism: macronutrients fuel, micronutrients enable enzymes", "relevant_docs": ["heal_001"]},
    {"query": "exercise improves circulation and lowers blood pressure", "relevant_docs": ["heal_002"]},
    {"query": "activate parasympathetic with breathing; relationships build resilience", "relevant_docs": ["heal_003"]},
    {"query": "sleep consolidates memories and clears metabolic waste", "relevant_docs": ["heal_004"]},
    {"query": "vaccination trains immunity without causing disease", "relevant_docs": ["heal_005"]},
    {"query": "periodize load to peak while protecting recovery", "relevant_docs": ["sport_001"]},
    {"query": "visualization goal setting and self talk to handle pressure", "relevant_docs": ["sport_002"]},
    {"query": "warm up dynamically; correct strength imbalances; respect rest", "relevant_docs": ["sport_003"]},
    {"query": "add weight gradually and favor multi joint lifts", "relevant_docs": ["sport_004"]},
    {"query": "build base then add intervals to raise lactate threshold", "relevant_docs": ["sport_005"]}
]

evaluation_queries_hard = [
    {"query": "not the surface—name the invisible boundary no photon escapes", "relevant_docs": ["ast_001"]},
    {"query": "nursery fog to main act; heavier stars burn the candle fast", "relevant_docs": ["ast_002"]},
    {"query": "from dust lanes to gas giants—why are inner worlds rocky", "relevant_docs": ["ast_003"]},
    {"query": "planets spotted by star hiccups and blinkings", "relevant_docs": ["ast_004"]},
    {"query": "cosmic anti gravity accounting for ~68% vs the unseen 27%", "relevant_docs": ["ast_005"]},
    {"query": "keep digits safe: curl then rock—what cutting method is this", "relevant_docs": ["cook_001"]},
    {"query": "browning that is not caramelization—needs amino acids and heat", "relevant_docs": ["cook_002"]},
    {"query": "start with onions celery garlic; lemon at the end wakes it up—why", "relevant_docs": ["cook_003"]},
    {"query": "silky white sauce from equal parts fat and flour then milk", "relevant_docs": ["cook_004"]},
    {"query": "structure from proteins, lift from gas—ratio tweaks change crumb", "relevant_docs": ["cook_005"]},
    {"query": "blueprints and shape shifters that answer the same call", "relevant_docs": ["py_001"]},
    {"query": "split apply combine; join on keys; loops are a smell", "relevant_docs": ["py_002"]},
    {"query": "await your database calls or you’ll jam the loop", "relevant_docs": ["py_003"]},
    {"query": "models rot in production—detect drift and retrain automatically", "relevant_docs": ["py_004"]},
    {"query": "type annotated endpoints, DI, and free docs from the framework", "relevant_docs": ["py_005"]},
    {"query": "smokestacks summoned cities while fields emptied", "relevant_docs": ["hist_001"]},
    {"query": "silk and scripture crossing Eurasia’s arteries", "relevant_docs": ["hist_002"]},
    {"query": "patrons presses and anatomy studies rebooted Europe", "relevant_docs": ["hist_003"]},
    {"query": "total war rewired gender roles and empires", "relevant_docs": ["hist_004"]},
    {"query": "deterrence by terror, wars by proxy, rockets to the moon", "relevant_docs": ["hist_005"]},
    {"query": "the body’s ledger: catabolism versus anabolism", "relevant_docs": ["heal_001"]},
    {"query": "raise HDL and tame BP—move regularly", "relevant_docs": ["heal_002"]},
    {"query": "flip the vagal switch: slow breathing and social ties", "relevant_docs": ["heal_003"]},
    {"query": "the brain’s dishwasher runs at night", "relevant_docs": ["heal_004"]},
    {"query": "train defenders without disease exposure", "relevant_docs": ["heal_005"]},
    {"query": "stress → adapt → peak; change load with the calendar", "relevant_docs": ["sport_001"]},
    {"query": "rehearse success in your head; set targets; tune arousal", "relevant_docs": ["sport_002"]},
    {"query": "imbalances between quads and hamstrings invite trouble", "relevant_docs": ["sport_003"]},
    {"query": "add plates over time; squats and deadlifts first", "relevant_docs": ["sport_004"]},
    {"query": "base miles then intervals to push the threshold rightward", "relevant_docs": ["sport_005"]}
]

## Evaluate with Hit@K, MRR, Precision@K, Recall@K and NDCG

While Hit@K provides a basic measure of retrieval success, more sophisticated metrics offer deeper insights:

### Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant document appears in the ranking. It gives more credit to systems that place relevant documents at the top.

**Formula**: MRR = (1/|Q|) × Σ(1/rank_i) where rank_i is the position of the first relevant document for query i

### Normalized Discounted Cumulative Gain (NDCG)
NDCG accounts for the position of all relevant documents and allows for graded relevance (not just binary relevant/irrelevant).

**Formula**: NDCG@K = DCG@K / IDCG@K where DCG@K = Σ((2^rel_i - 1) / log2(i + 1)) for the top K results, and IDCG@K is the ideal DCG.

### Precision@K and Recall@K
Precision@K measures the proportion of relevant documents found in the top K results, while Recall@K measures the proportion of relevant documents that were retrieved out of all relevant documents.

**Formulas**:
- Precision@K = |Relevant ∩ Retrieved| / |Retrieved| for top K results
- Recall@K = |Relevant ∩ Retrieved| / |Relevant| for top K resultsxw

Let's implement these metrics to evaluate our retrieval pipeline comprehensively:

In [192]:
import math
from typing import List, Dict, Set

def calculate_hit_at_k(retrieved_doc_ids, relevant_doc_ids, k):
    """
    Calculate Hit@K: whether at least one relevant document appears in top-k results.
    
    Args:
        retrieved_doc_ids (list): List of retrieved document IDs in rank order
        relevant_doc_ids (list): List of relevant document IDs
        k (int): Number of top results to consider
    
    Returns:
        float: 1.0 if hit, 0.0 if miss
    """
    top_k_retrieved = set(retrieved_doc_ids[:k])
    relevant_set = set(relevant_doc_ids)
    
    # Hit if intersection is non-empty
    return 1.0 if top_k_retrieved.intersection(relevant_set) else 0.0

def calculate_mrr(retrieved_doc_ids: List[str], relevant_doc_ids: List[str]) -> float:
    """
    Calculate Mean Reciprocal Rank for a single query.
    
    MRR measures the quality of a ranking by looking at the position of the first relevant document.
    Higher scores indicate that relevant documents appear earlier in the ranking.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order
        relevant_doc_ids: List of known relevant document IDs
    
    Returns:
        float: MRR score (1/rank of first relevant doc, or 0 if no relevant docs found)
    
    Example:
        retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
        relevant = ['doc3', 'doc6']
        MRR = 1/3 = 0.333 (first relevant doc 'doc3' is at position 3)
    """
    relevant_set = set(relevant_doc_ids)
    
    for rank, doc_id in enumerate(retrieved_doc_ids, 1):
        if doc_id in relevant_set:
            return 1.0 / rank
    
    return 0.0  # No relevant documents found


def calculate_dcg_at_k(retrieved_doc_ids: List[str], relevant_doc_ids: List[str], 
                       relevance_scores: Dict[str, float], k: int) -> float:
    """
    Calculate Discounted Cumulative Gain at position k.
    
    DCG measures the usefulness of documents based on their position in the ranking,
    with higher positions having exponentially more impact.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order
        relevant_doc_ids: List of known relevant document IDs  
        relevance_scores: Dict mapping doc_id to relevance score (0-3 scale typically)
        k: Calculate DCG for top-k results
    
    Returns:
        float: DCG@k score
    """
    dcg = 0.0
    
    for i, doc_id in enumerate(retrieved_doc_ids[:k]):
        if doc_id in relevance_scores:
            relevance = relevance_scores[doc_id]
            # DCG formula: rel_i / log2(i + 2) where i is 0-indexed position
            dcg += relevance / math.log2(i + 2)
    
    return dcg


def calculate_ndcg_at_k(retrieved_doc_ids: List[str], relevant_doc_ids: List[str],
                        relevance_scores: Dict[str, float], k: int) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain at position k.
    
    NDCG normalizes DCG by the ideal DCG (IDCG) to get a score between 0 and 1.
    This allows fair comparison between queries with different numbers of relevant documents.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order
        relevant_doc_ids: List of known relevant document IDs
        relevance_scores: Dict mapping doc_id to relevance score
        k: Calculate NDCG for top-k results
    
    Returns:
        float: NDCG@k score (0-1, where 1 is perfect ranking)
    """
    # Calculate actual DCG
    dcg = calculate_dcg_at_k(retrieved_doc_ids, relevant_doc_ids, relevance_scores, k)
    
    # Calculate Ideal DCG (IDCG) - what we'd get with perfect ranking
    # Sort relevant docs by relevance score in descending order
    ideal_ranking = sorted(relevance_scores.keys(), 
                          key=lambda x: relevance_scores[x], reverse=True)
    idcg = calculate_dcg_at_k(ideal_ranking, relevant_doc_ids, relevance_scores, k)
    
    # NDCG = DCG / IDCG (avoid division by zero)
    return dcg / idcg if idcg > 0 else 0.0


def calculate_precision_at_k(retrieved_doc_ids: List[str], relevant_doc_ids: List[str], k: int) -> float:
    """
    Calculate Precision@K: fraction of retrieved documents that are relevant.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order
        relevant_doc_ids: List of known relevant document IDs
        k: Number of top results to consider
    
    Returns:
        float: Precision@k score (0-1)
    """
    if k == 0:
        return 0.0
        
    top_k_retrieved = set(retrieved_doc_ids[:k])
    relevant_set = set(relevant_doc_ids)
    
    relevant_retrieved = top_k_retrieved.intersection(relevant_set)
    return len(relevant_retrieved) / k


def calculate_recall_at_k(retrieved_doc_ids: List[str], relevant_doc_ids: List[str], k: int) -> float:
    """
    Calculate Recall@K: fraction of relevant documents that were retrieved.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order  
        relevant_doc_ids: List of known relevant document IDs
        k: Number of top results to consider
    
    Returns:
        float: Recall@k score (0-1)
    """
    if not relevant_doc_ids:
        return 0.0
        
    top_k_retrieved = set(retrieved_doc_ids[:k])
    relevant_set = set(relevant_doc_ids)
    
    relevant_retrieved = top_k_retrieved.intersection(relevant_set)
    return len(relevant_retrieved) / len(relevant_set)


def calculate_f1_at_k(retrieved_doc_ids: List[str], relevant_doc_ids: List[str], k: int) -> float:
    """
    Calculate F1@K: harmonic mean of Precision@K and Recall@K.
    
    F1 provides a single score that balances precision and recall.
    
    Args:
        retrieved_doc_ids: List of retrieved document IDs in rank order
        relevant_doc_ids: List of known relevant document IDs  
        k: Number of top results to consider
    
    Returns:
        float: F1@k score (0-1)
    """
    precision = calculate_precision_at_k(retrieved_doc_ids, relevant_doc_ids, k)
    recall = calculate_recall_at_k(retrieved_doc_ids, relevant_doc_ids, k)
    
    if precision + recall == 0:
        return 0.0
    
    return 2 * (precision * recall) / (precision + recall)


# Create enhanced evaluation queries with graded relevance scores
# For this demo, we'll use a simple binary relevance (relevant=1, not relevant=0)
# In practice, you might have 3-point or 4-point relevance scales

def create_relevance_scores(eval_queries: List[Dict]) -> Dict[str, Dict[str, float]]:
    """
    Create relevance score mappings for evaluation queries.
    
    For simplicity, we use binary relevance: relevant docs get score 1.0, others get 0.0
    In production, you might have multi-level relevance (0=irrelevant, 1=somewhat, 2=relevant, 3=highly relevant)
    """
    relevance_mapping = {}
    
    for i, query_data in enumerate(eval_queries):
        query_id = f"query_{i}"
        relevance_scores = {}
        
        # All relevant docs get score 1.0, irrelevant docs get 0.0
        for doc_id in query_data['relevant_docs']:
            relevance_scores[doc_id] = 1.0
            
        relevance_mapping[query_id] = relevance_scores
    
    return relevance_mapping

# Test the new metrics with a simple example
print("🧪 Testing enhanced evaluation metrics with examples:\n")

# Example 1: Perfect ranking
retrieved_1 = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e']
relevant_1 = ['doc_a', 'doc_b']
relevance_1 = {'doc_a': 1.0, 'doc_b': 1.0}

print("📊 Example 1 - Perfect Ranking:")
print(f"   Retrieved: {retrieved_1[:3]}...")
print(f"   Relevant: {relevant_1}")
print(f"   MRR: {calculate_mrr(retrieved_1, relevant_1):.3f}")
print(f"   NDCG@5: {calculate_ndcg_at_k(retrieved_1, relevant_1, relevance_1, 5):.3f}")
print(f"   Precision@5: {calculate_precision_at_k(retrieved_1, relevant_1, 5):.3f}")
print(f"   Recall@5: {calculate_recall_at_k(retrieved_1, relevant_1, 5):.3f}")
print(f"   F1@5: {calculate_f1_at_k(retrieved_1, relevant_1, 5):.3f}")

# Example 2: Poor ranking  
retrieved_2 = ['doc_x', 'doc_y', 'doc_z', 'doc_a', 'doc_b']
relevant_2 = ['doc_a', 'doc_b']
relevance_2 = {'doc_a': 1.0, 'doc_b': 1.0}

print(f"\n📊 Example 2 - Poor Ranking (relevant docs at positions 4,5):")
print(f"   Retrieved: {retrieved_2}")
print(f"   Relevant: {relevant_2}")
print(f"   MRR: {calculate_mrr(retrieved_2, relevant_2):.3f}")
print(f"   NDCG@5: {calculate_ndcg_at_k(retrieved_2, relevant_2, relevance_2, 5):.3f}")
print(f"   Precision@5: {calculate_precision_at_k(retrieved_2, relevant_2, 5):.3f}")
print(f"   Recall@5: {calculate_recall_at_k(retrieved_2, relevant_2, 5):.3f}")
print(f"   F1@5: {calculate_f1_at_k(retrieved_2, relevant_2, 5):.3f}")

print("\n💡 Key Insights:")
print("   • MRR heavily penalizes when first relevant doc is ranked low")
print("   • NDCG accounts for position of ALL relevant documents")  
print("   • Precision@K = relevant_retrieved / k_retrieved")
print("   • Recall@K = relevant_retrieved / total_relevant")
print("   • F1@K balances precision and recall")

🧪 Testing enhanced evaluation metrics with examples:

📊 Example 1 - Perfect Ranking:
   Retrieved: ['doc_a', 'doc_b', 'doc_c']...
   Relevant: ['doc_a', 'doc_b']
   MRR: 1.000
   NDCG@5: 1.000
   Precision@5: 0.400
   Recall@5: 1.000
   F1@5: 0.571

📊 Example 2 - Poor Ranking (relevant docs at positions 4,5):
   Retrieved: ['doc_x', 'doc_y', 'doc_z', 'doc_a', 'doc_b']
   Relevant: ['doc_a', 'doc_b']
   MRR: 0.250
   NDCG@5: 0.501
   Precision@5: 0.400
   Recall@5: 1.000
   F1@5: 0.571

💡 Key Insights:
   • MRR heavily penalizes when first relevant doc is ranked low
   • NDCG accounts for position of ALL relevant documents
   • Precision@K = relevant_retrieved / k_retrieved
   • Recall@K = relevant_retrieved / total_relevant
   • F1@K balances precision and recall


In [193]:
def evaluate_retrieval_method(method_name: str, retrieval_function, 
                                      eval_queries: List[Dict], k_values: List[int] = [5, 10]) -> Dict:
    """
    Enhanced evaluation using multiple metrics: Hit@K, MRR, NDCG, Precision, Recall, F1.
    
    Args:
        method_name: Name of the retrieval method
        retrieval_function: Function that takes query and returns ranked results
        eval_queries: List of evaluation query dictionaries
        k_values: List of k values for evaluation
    
    Returns:
        Dict containing averaged scores for all metrics
    """
    metrics = {
        'mrr': [],
        **{f'hit_at_{k}': [] for k in k_values},
        **{f'ndcg_at_{k}': [] for k in k_values},
        **{f'precision_at_{k}': [] for k in k_values},
        **{f'recall_at_{k}': [] for k in k_values},
        **{f'f1_at_{k}': [] for k in k_values}
    }
    
    # Create relevance scores for NDCG calculation
    relevance_mapping = create_relevance_scores(eval_queries)
    
    for i, eval_item in enumerate(eval_queries):
        query = eval_item['query']
        relevant_docs = eval_item['relevant_docs']
        query_id = f"query_{i}"
        relevance_scores = relevance_mapping[query_id]
        
        # Get retrieval results
        retrieved_results = retrieval_function(query)
        
        # Extract document IDs from results (handle different return formats)
        if method_name == 'Semantic':
            # For semantic search, extract doc_id from chunk info
            retrieved_doc_ids = [result[2]['doc_id'] for result in retrieved_results]
        else:
            # For TF-IDF and BM25, extract id from doc info
            retrieved_doc_ids = [result[2]['id'] for result in retrieved_results]
        
        # Calculate MRR (only needs to be calculated once per query)
        mrr = calculate_mrr(retrieved_doc_ids, relevant_docs)
        metrics['mrr'].append(mrr)
        
        # Calculate metrics for each k value
        for k in k_values:
            # Hit@K
            hit = calculate_hit_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'hit_at_{k}'].append(hit)
            
            # NDCG@K
            ndcg = calculate_ndcg_at_k(retrieved_doc_ids, relevant_docs, relevance_scores, k)
            metrics[f'ndcg_at_{k}'].append(ndcg)
            
            # Precision@K
            precision = calculate_precision_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'precision_at_{k}'].append(precision)
            
            # Recall@K
            recall = calculate_recall_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'recall_at_{k}'].append(recall)
            
            # F1@K  
            f1 = calculate_f1_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'f1_at_{k}'].append(f1)
    
    # Calculate averages
    avg_metrics = {}
    for metric_name, values in metrics.items():
        avg_metrics[metric_name] = np.mean(values)
    
    return avg_metrics


def evaluate_hybrid_method(method_name: str, eval_queries: List[Dict], 
                                   k_values: List[int] = [5, 10]) -> Dict:
    """
    Enhanced evaluation for hybrid methods (RRF, Cross-encoder) using multiple metrics.
    """
    metrics = {
        'mrr': [],
        **{f'hit_at_{k}': [] for k in k_values},
        **{f'ndcg_at_{k}': [] for k in k_values}, 
        **{f'precision_at_{k}': [] for k in k_values},
        **{f'recall_at_{k}': [] for k in k_values},
        **{f'f1_at_{k}': [] for k in k_values}
    }
    
    # Create relevance scores
    relevance_mapping = create_relevance_scores(eval_queries)
    
    for i, eval_item in enumerate(eval_queries):
        query = eval_item['query']
        relevant_docs = eval_item['relevant_docs']
        query_id = f"query_{i}"
        relevance_scores = relevance_mapping[query_id]
        
        # Get method-specific results
        if method_name == 'RRF':
            hybrid_df = hybrid_retrieve(query, top_k_lex=15, top_k_sem=15)
            rrf_results = apply_rrf_to_hybrid(hybrid_df, k=60)
            retrieved_doc_ids = [result['doc_id'] for result in rrf_results]
        
        elif method_name == 'Cross-encoder':
            hybrid_df = hybrid_retrieve(query, top_k_lex=15, top_k_sem=15)
            rrf_results = apply_rrf_to_hybrid(hybrid_df, k=60)
            reranked_results = rerank_with_cross_encoder(query, rrf_results[:20], top_k=15)
            retrieved_doc_ids = [result['doc_id'] for result in reranked_results]
        
        # Calculate MRR
        mrr = calculate_mrr(retrieved_doc_ids, relevant_docs)
        metrics['mrr'].append(mrr)
        
        # Calculate metrics for each k value
        for k in k_values:
            hit = calculate_hit_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'hit_at_{k}'].append(hit)
            
            ndcg = calculate_ndcg_at_k(retrieved_doc_ids, relevant_docs, relevance_scores, k)
            metrics[f'ndcg_at_{k}'].append(ndcg)
            
            precision = calculate_precision_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'precision_at_{k}'].append(precision)
            
            recall = calculate_recall_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'recall_at_{k}'].append(recall)
            
            f1 = calculate_f1_at_k(retrieved_doc_ids, relevant_docs, k)
            metrics[f'f1_at_{k}'].append(f1)
    
    # Calculate averages
    avg_metrics = {}
    for metric_name, values in metrics.items():
        avg_metrics[metric_name] = np.mean(values)
    
    return avg_metrics

# Run enhanced evaluation on all methods
print("🚀 Running enhanced retrieval evaluation with multiple metrics...\n")

# Use mixed queries for comprehensive evaluation
evaluation_queries = evaluation_queries_hard  # Use subset for demo (faster execution)

# Evaluate individual methods
enhanced_results = {}
retrieval_methods = {
    'TF-IDF': lambda q: query_tfidf(q, top_k=10),
    'BM25': lambda q: query_bm25(q, top_k=10),
    'Semantic': lambda q: semantic_search(q, top_k=10)
}

for method_name, retrieval_func in retrieval_methods.items():
    print(f"📊 Evaluating {method_name} with enhanced metrics...")
    results = evaluate_retrieval_method(method_name, retrieval_func, evaluation_queries)
    enhanced_results[method_name] = results

# Evaluate hybrid methods
print(f"📊 Evaluating RRF with enhanced metrics...")
enhanced_results['RRF'] = evaluate_hybrid_method('RRF', evaluation_queries)

print(f"📊 Evaluating Cross-encoder with enhanced metrics...")
enhanced_results['Cross-encoder'] = evaluate_hybrid_method('Cross-encoder', evaluation_queries)

# Display comprehensive results table
print(f"\n📈 Comprehensive Retrieval Evaluation Results:")

# Create DataFrame for better display
enhanced_eval_data = []
for method_name, results in enhanced_results.items():
    enhanced_eval_data.append({
        'Method': method_name,
        'MRR': round(results['mrr'], 3),
        'Hit@5': round(results['hit_at_5'], 3),
        'Hit@10': round(results['hit_at_10'], 3),
        'NDCG@5': round(results['ndcg_at_5'], 3),
        'NDCG@10': round(results['ndcg_at_10'], 3),
        'P@5': round(results['precision_at_5'], 3),
        'R@5': round(results['recall_at_5'], 3),
        'F1@5': round(results['f1_at_5'], 3)
    })

enhanced_eval_df = pd.DataFrame(enhanced_eval_data)
display(enhanced_eval_df)

# Find best performing methods for each metric
print(f"\n🏆 Best performing methods by metric:")
metrics_to_analyze = ['mrr', 'hit_at_5', 'ndcg_at_5', 'f1_at_5']
best_methods_data = []

for metric in metrics_to_analyze:
    best_method = max(enhanced_results.items(), key=lambda x: x[1][metric])
    best_methods_data.append({
        'Metric': metric.upper().replace('_', '@'),
        'Best Method': best_method[0],
        'Score': round(best_method[1][metric], 3)
    })

best_methods_df = pd.DataFrame(best_methods_data)
display(best_methods_df)

# Calculate relative improvements over baseline
baseline_method = 'TF-IDF'
baseline_results = enhanced_results[baseline_method]

print(f"\n📈 Relative improvements over {baseline_method} baseline:")
improvements_data = []

for method_name, results in enhanced_results.items():
    if method_name != baseline_method:
        mrr_improvement = results['mrr'] - baseline_results['mrr']
        ndcg_improvement = results['ndcg_at_5'] - baseline_results['ndcg_at_5']
        f1_improvement = results['f1_at_5'] - baseline_results['f1_at_5']
        
        improvements_data.append({
            'Method': method_name,
            'MRR Δ': f"{mrr_improvement:+.3f}",
            'NDCG@5 Δ': f"{ndcg_improvement:+.3f}",
            'F1@5 Δ': f"{f1_improvement:+.3f}",
            'Overall Trend': '↑' if (mrr_improvement + ndcg_improvement + f1_improvement) > 0 else '↓'
        })

if improvements_data:
    improvements_df = pd.DataFrame(improvements_data)
    display(improvements_df)

print(f"\n💡 Key Insights:")
print(f"   • Cross-encoder typically provides the highest precision for top results")
print(f"   • Hybrid methods (RRF + Cross-encoder) balance recall and precision")
print(f"   • Semantic search excels at paraphrase and concept matching")
print(f"   • BM25 remains competitive for exact keyword matching")
print(f"   • Combining multiple approaches leverages complementary strengths")

print("\n✅ Enhanced evaluation complete!")

🚀 Running enhanced retrieval evaluation with multiple metrics...

📊 Evaluating TF-IDF with enhanced metrics...
📊 Evaluating BM25 with enhanced metrics...
📊 Evaluating Semantic with enhanced metrics...
📊 Evaluating RRF with enhanced metrics...
📊 Evaluating Cross-encoder with enhanced metrics...
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.03 seconds
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
🔄 Computing cross-encoder scores for 20 candidates...
⏱️  Cross-encoder re-ranking completed in 0.04 seconds
🔄 Computing cross-encoder scores for 20 candidates.

Unnamed: 0,Method,MRR,Hit@5,Hit@10,NDCG@5,NDCG@10,P@5,R@5,F1@5
0,TF-IDF,0.888,0.9,0.933,0.888,0.899,0.18,0.9,0.3
1,BM25,0.886,0.9,0.967,0.883,0.904,0.18,0.9,0.3
2,Semantic,0.983,1.0,1.0,0.988,0.988,0.2,1.0,0.333
3,RRF,0.898,0.9,1.0,0.888,0.922,0.18,0.9,0.3
4,Cross-encoder,0.978,1.0,1.0,0.983,0.983,0.2,1.0,0.333



🏆 Best performing methods by metric:


Unnamed: 0,Metric,Best Method,Score
0,MRR,Semantic,0.983
1,HIT@AT@5,Semantic,1.0
2,NDCG@AT@5,Semantic,0.988
3,F1@AT@5,Semantic,0.333



📈 Relative improvements over TF-IDF baseline:


Unnamed: 0,Method,MRR Δ,NDCG@5 Δ,F1@5 Δ,Overall Trend
0,BM25,-0.002,-0.004,0.0,↓
1,Semantic,0.095,0.1,0.033,↑
2,RRF,0.01,0.0,0.0,↑
3,Cross-encoder,0.09,0.096,0.033,↑



💡 Key Insights:
   • Cross-encoder typically provides the highest precision for top results
   • Hybrid methods (RRF + Cross-encoder) balance recall and precision
   • Semantic search excels at paraphrase and concept matching
   • BM25 remains competitive for exact keyword matching
   • Combining multiple approaches leverages complementary strengths

✅ Enhanced evaluation complete!


## Interpreting Evaluation Metrics

### When to Use Each Metric

**Mean Reciprocal Rank (MRR)**
- **Best for**: Systems where finding the first relevant document quickly is critical
- **Example**: Question answering where users need one good answer
- **Interpretation**: MRR=0.5 means on average, the first relevant document is at position 2

**Normalized Discounted Cumulative Gain (NDCG)**  
- **Best for**: Systems where ranking quality of all results matters
- **Example**: Search engines where users browse multiple results
- **Interpretation**: NDCG@5=0.8 means the ranking achieves 80% of the ideal score

**Hit@K**
- **Best for**: Simple binary assessment of retrieval success
- **Example**: Basic "did we find anything useful?" evaluation
- **Interpretation**: Hit@5=0.7 means 70% of queries had at least one relevant doc in top-5

**Precision@K**
- **Best for**: Systems where result quality (low false positives) is crucial
- **Example**: Medical diagnosis support where wrong results are dangerous
- **Interpretation**: P@5=0.6 means 60% of returned results are relevant

**Recall@K**
- **Best for**: Systems where completeness (low false negatives) is crucial  
- **Example**: Legal discovery where missing documents has consequences
- **Interpretation**: R@5=0.4 means we found 40% of all relevant documents

**F1@K**
- **Best for**: Balanced assessment of precision and recall
- **Example**: General-purpose search systems
- **Interpretation**: F1@5=0.5 balances finding relevant docs with avoiding irrelevant ones

### Choosing the Right Metric for Your Use Case

| Use Case | Primary Metric | Reasoning |
|----------|---------------|-----------|
| **QA Systems** | MRR | Users need one good answer fast |
| **Research/Discovery** | NDCG@10 | Users explore multiple results |
| **Fact Verification** | Precision@5 | Accuracy more important than completeness |
| **Legal/Compliance** | Recall@10 | Can't afford to miss relevant documents |
| **General Search** | F1@5 or NDCG@5 | Balance of multiple factors |

## Glossary of RAG Terms

- **Document**: A single piece of content in your knowledge base (article, page, etc.)
- **Chunk**: A segment of a document, typically 100-500 tokens for better embedding quality
- **Corpus**: The complete collection of documents available for retrieval
- **Index**: Data structure enabling fast search (TF-IDF matrix, embedding vectors, etc.)
- **TF-IDF**: Term Frequency-Inverse Document Frequency; scores terms by frequency vs rarity
- **BM25**: Best Matching 25; probabilistic ranking function improving on TF-IDF
- **Embedding**: Dense vector representation capturing semantic meaning of text
- **Vector Store**: Database optimized for storing and searching high-dimensional vectors
- **ANN**: Approximate Nearest Neighbors; fast similarity search with slight accuracy trade-off
- **FAISS**: Facebook AI Similarity Search; library for efficient similarity search
- **Hybrid Retrieval**: Combining multiple retrieval methods (lexical + semantic)
- **RRF**: Reciprocal Rank Fusion; method for combining rankings from multiple systems
- **Cross-encoder**: Transformer model scoring (query, passage) pairs for re-ranking
- **Top-k**: Retrieving the k highest-scoring results
- **Recall**: Fraction of relevant documents successfully retrieved
- **Precision**: Fraction of retrieved documents that are actually relevant
- **Context Window**: Maximum input length a language model can process
- **Hallucination**: When language models generate factually incorrect information
- **Prompt Template**: Structured format for providing context and instructions to LLMs
- **Grounding**: Ensuring model responses are based on provided evidence rather than training data