# SQuAD FAISS Demo - Python API

This notebook demonstrates the **complete RAGDiff v2.0 workflow** using the Python API, from data preparation to comparison.

## Overview

This example compares two FAISS-based RAG systems using different embedding models:

1. **faiss-small**: `paraphrase-MiniLM-L3-v2` (17MB, 3 layers, fast)
2. **faiss-large**: `all-MiniLM-L12-v2` (120MB, 12 layers, more accurate)

## What You'll Learn

**Part 1: Data Preparation**
- Download and prepare the SQuAD dataset
- Build FAISS indices with different embedding models
- Generate query sets

**Part 2: RAGDiff API Usage**
- Execute queries against providers programmatically
- Compare results using LLM evaluation
- Analyze and export results

# Part 1: Data Preparation

First, we'll set up the demo data. This only needs to be run once.

## Prerequisites

Before running this notebook, ensure you have a uv environment set up:

```bash
# Create virtual environment (if not already created)
uv venv

# Activate the environment
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate  # On Windows

# Install RAGDiff in editable mode
uv pip install -e .

# Start Jupyter
uv run jupyter notebook
```

Once Jupyter is running, you can proceed with the cells below.

## Step 1: Install Dependencies

In [1]:
# Install required packages using uv
import subprocess

# Map each pip package name to the module name it is imported as.
# faiss-cpu is imported as "faiss", so a plain "-" to "_" swap would never find it.
packages = {
    "datasets": "datasets",  # HuggingFace datasets for SQuAD
    "faiss-cpu": "faiss",  # FAISS for vector search
    "sentence-transformers": "sentence_transformers",  # Embedding models
    "numpy": "numpy",  # Numerical operations
}

print("Installing required packages with uv...")
for package, module_name in packages.items():
    try:
        __import__(module_name)
        print(f"✓ {package} already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call(["uv", "pip", "install", "-q", package])
        print(f"✓ {package} installed")

print("\nAll dependencies installed!")

Installing required packages with uv...
✓ datasets already installed
Installing faiss-cpu...
✓ faiss-cpu installed
✓ sentence-transformers already installed
✓ numpy already installed

All dependencies installed!


## Step 2: Download and Prepare SQuAD Dataset

In [None]:
import json
from pathlib import Path

from datasets import load_dataset

# Set up paths
data_dir = Path("examples/squad-demo/data")
data_dir.mkdir(parents=True, exist_ok=True)

output_file = data_dir / "documents.jsonl"
raw_file = data_dir / "squad_raw.json"

# Check if already prepared
if output_file.exists() and raw_file.exists():
    print(f"✓ Data already prepared at {data_dir}/")
    with open(output_file, encoding="utf-8") as f:
        num_docs = sum(1 for _ in f)
    print(f"  - {num_docs} documents")
else:
    print("Loading SQuAD v2.0 dataset from HuggingFace...")
    dataset = load_dataset("squad_v2", split="validation")
    print(f"Loaded {len(dataset)} examples")

    # Extract unique contexts
    print("Extracting unique context paragraphs...")
    contexts_seen = set()
    documents = []

    for idx, example in enumerate(dataset):
        context = example["context"]
        if context in contexts_seen:
            continue
        contexts_seen.add(context)

        documents.append(
            {
                "id": f"squad_{len(documents)}",
                "text": context,
                "source": "SQuAD v2.0",
                "metadata": {
                    "title": example.get("title", "Unknown"),
                    "original_index": idx,
                },
            }
        )

    print(f"Found {len(documents)} unique context paragraphs")

    # Write documents
    print(f"Writing documents to {output_file}...")
    with open(output_file, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    # Save raw dataset for query generation
    print(f"Saving raw dataset to {raw_file}...")
    raw_data = {
        "examples": [
            {
                "id": ex["id"],
                "question": ex["question"],
                "context": ex["context"],
                "answers": ex["answers"],
                "title": ex.get("title", "Unknown"),
            }
            for ex in dataset
        ]
    }

    with open(raw_file, "w", encoding="utf-8") as f:
        json.dump(raw_data, f, indent=2, ensure_ascii=False)

    print("\n✓ Dataset preparation complete!")
    print(f"  Documents: {len(documents)}")
    print(f"  Q&A pairs: {len(dataset)}")
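
Before building indices, it can help to confirm that the JSONL file has the shape the cell above writes: one JSON object per line with `id`, `text`, `source`, and `metadata` keys, and no repeated ids. The following is a minimal sketch under that assumption; `check_documents` is a hypothetical helper and not part of RAGDiff.

```python
import json

# Fields written for every document by the preparation cell
REQUIRED_KEYS = {"id", "text", "source", "metadata"}


def check_documents(path):
    """Return (count, problems) for a documents.jsonl file.

    Hypothetical helper: it only checks the fields written by the
    preparation cell and that document ids are unique.
    """
    seen_ids = set()
    problems = []
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            count += 1
            doc = json.loads(line)
            missing = REQUIRED_KEYS - doc.keys()
            if missing:
                problems.append(f"line {line_no}: missing {sorted(missing)}")
            doc_id = doc.get("id")
            if doc_id in seen_ids:
                problems.append(f"line {line_no}: duplicate id {doc_id!r}")
            seen_ids.add(doc_id)
    return count, problems
```

Calling `check_documents("examples/squad-demo/data/documents.jsonl")` after the cell above has run should return the document count and an empty problem list.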

## Step 3: Build FAISS Indices

Build two FAISS indices with different embedding models:

In [None]:
import json
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

data_dir = Path("examples/squad-demo/data")
documents_file = data_dir / "documents.jsonl"

# Load documents
print("Loading documents...")
documents = []
with open(documents_file, encoding="utf-8") as f:
    for line in f:
        documents.append(json.loads(line.strip()))
print(f"Loaded {len(documents)} documents")

texts = [doc["text"] for doc in documents]

# Build small model index (fast but less accurate)
small_index_file = data_dir / "faiss_small.index"
if small_index_file.exists():
    print(f"\n✓ Small model index already exists at {small_index_file}")
else:
    print("\n" + "=" * 60)
    print("Building FAISS index with SMALL model (paraphrase-MiniLM-L3-v2)")
    print("=" * 60)

    print("Loading embedding model (small/fast)...")
    model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

    print("Generating embeddings...")
    embeddings = model.encode(
        texts, show_progress_bar=True, batch_size=32, convert_to_numpy=True
    )
    embeddings = np.array(embeddings, dtype="float32")

    print(
        f"Generated {len(embeddings)} embeddings with dimension {embeddings.shape[1]}"
    )

    print("Building FAISS index with L2 distance...")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    print(f"Saving index to {small_index_file}...")
    faiss.write_index(index, str(small_index_file))

    print(f"✓ Small model index created ({index.ntotal} vectors, {index.d} dims)")
    print("  - 17MB model, 3 layers, fast but less accurate")

# Build large model index (slower but more accurate)
large_index_file = data_dir / "faiss_large.index"
if large_index_file.exists():
    print(f"\n✓ Large model index already exists at {large_index_file}")
else:
    print("\n" + "=" * 60)
    print("Building FAISS index with LARGE model (all-MiniLM-L12-v2)")
    print("=" * 60)

    print("Loading embedding model (larger/better quality)...")
    model = SentenceTransformer("all-MiniLM-L12-v2")

    print("Generating embeddings...")
    embeddings = model.encode(
        texts, show_progress_bar=True, batch_size=8, convert_to_numpy=True
    )
    embeddings = np.array(embeddings, dtype="float32")

    print(
        f"Generated {len(embeddings)} embeddings with dimension {embeddings.shape[1]}"
    )

    print("Building FAISS index with L2 distance...")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    print(f"Saving index to {large_index_file}...")
    faiss.write_index(index, str(large_index_file))

    print(f"✓ Large model index created ({index.ntotal} vectors, {index.d} dims)")
    print("  - 120MB model, 12 layers, slower but more accurate")

print("\n✓ All FAISS indices ready!")
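
`faiss.IndexFlatL2` is a flat (exhaustive) index: each search compares the query against every stored vector and ranks by squared Euclidean distance, so a lower value means a closer match. The pure-Python sketch below illustrates that ranking on toy 2-D vectors. It is for intuition only and says nothing about how FAISS is implemented internally; `l2_search` is a hypothetical name.

```python
def l2_search(vectors, query, k=3):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns up to k (distance, position) pairs, closest first.
    """
    scored = []
    for pos, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, pos))
    scored.sort()  # smallest distance first; ties broken by position
    return scored[:k]


# Toy "embeddings": four documents in a 2-D space
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(l2_search(docs, [0.9, 0.1], k=2))
```

The real indices do the same kind of comparison over 384-dimensional embeddings, which is why both providers report `dimensions: 384` and differ only in the model that produced the vectors.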

## Step 4: Generate Query Sets

In [None]:
import json
import random
from pathlib import Path

data_dir = Path("examples/squad-demo/data")
raw_file = data_dir / "squad_raw.json"

query_sets_dir = Path("examples/squad-demo/domains/squad/query-sets")
query_sets_dir.mkdir(parents=True, exist_ok=True)

test_queries_file = query_sets_dir / "test-queries.txt"

if test_queries_file.exists():
    with open(test_queries_file, encoding="utf-8") as f:
        num_queries = sum(1 for _ in f)
    print(f"✓ Query set already exists at {test_queries_file}")
    print(f"  - {num_queries} queries")
else:
    print("Generating query sets...")

    # Load raw dataset
    with open(raw_file, encoding="utf-8") as f:
        data = json.load(f)

    examples = data["examples"]

    # Filter answerable questions
    answerable = [ex for ex in examples if ex["answers"]["text"]]
    print(f"Found {len(answerable)} answerable questions")

    # Sample 100 questions
    random.seed(42)
    sampled = random.sample(answerable, min(100, len(answerable)))

    # Write test queries
    with open(test_queries_file, "w", encoding="utf-8") as f:
        for ex in sampled:
            f.write(ex["question"] + "\n")

    print(f"✓ Created {test_queries_file} ({len(sampled)} queries)")
    print("\nExample questions:")
    for i, ex in enumerate(sampled[:3], 1):
        print(f"  {i}. {ex['question']}")

print("\n✓ Data preparation complete! Ready to use RAGDiff API.")
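
The cell above seeds the module-level random generator, which also affects any later code that uses `random`. One alternative, sketched here as an assumption rather than as the notebook's method, is to sample with a local `random.Random` instance so the result stays reproducible without touching global state. `sample_questions` is a hypothetical helper.

```python
import random


def sample_questions(questions, n=100, seed=42):
    """Pick up to n questions reproducibly using a private RNG."""
    rng = random.Random(seed)  # local generator; the global one is untouched
    return rng.sample(questions, min(n, len(questions)))


# Stand-in data; in the notebook this would be the answerable questions
questions = [f"question {i}" for i in range(500)]
first = sample_questions(questions)
second = sample_questions(questions)
print(first == second)
```

Because each call builds its own generator from the same seed, repeated calls return the same sample regardless of what other code does with `random` in between.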

# Part 2: RAGDiff API Usage

Now let's use the RAGDiff Python API to compare our providers.

## Setup: Import RAGDiff

In [None]:
import json
from datetime import datetime
from pathlib import Path

# Rich for pretty printing
from rich.console import Console
from rich.table import Table

# RAGDiff v2.0 API
from ragdiff.comparison import compare_runs
from ragdiff.core.loaders import load_domain, load_provider, load_query_set
from ragdiff.execution import execute_run

console = Console()

# Configuration
domain = "squad"  # Directory name in domains/
domain_dir = Path("examples/squad-demo/domains/squad")
domains_dir = domain_dir.parent
providers = ["faiss-small", "faiss-large"]
query_set_name = "test-queries"

print(f"Domain: {domain}")
print(f"Providers: {providers}")
print(f"Query set: {query_set_name}")

## 1. Explore Configurations

In [6]:
# Load domain configuration
domain_config = load_domain(domain, domains_dir)

print("=== Domain Configuration ===")
print(f"Name: {domain_config.name}")
print(f"Description: {domain_config.description}")
print(f"\nEvaluator Model: {domain_config.evaluator.model}")

# Load provider configurations
for provider_name in providers:
    provider_config = load_provider(domain, provider_name, domains_dir)
    print(f"\n=== Provider: {provider_name} ===")
    print(f"Tool: {provider_config.tool}")
    print(f"Config: {provider_config.config}")

=== Domain Configuration ===
Name: squad-demo
Description: Example RAG comparison using SQuAD dataset with FAISS providers

Evaluator Model: anthropic/claude-sonnet-4-5

=== Provider: faiss-small ===
Tool: faiss
Config: {'index_path': 'examples/squad-demo/data/faiss_small.index', 'documents_path': 'examples/squad-demo/data/documents.jsonl', 'embedding_service': 'sentence-transformers', 'embedding_model': 'paraphrase-MiniLM-L3-v2', 'dimensions': 384}

=== Provider: faiss-large ===
Tool: faiss
Config: {'index_path': 'examples/squad-demo/data/faiss_large.index', 'documents_path': 'examples/squad-demo/data/documents.jsonl', 'embedding_service': 'sentence-transformers', 'embedding_model': 'all-MiniLM-L12-v2', 'dimensions': 384}


## 2. Load Query Set

In [None]:
queries = load_query_set(domain, query_set_name, domains_dir)

print(f"Query Set: {query_set_name}")
print(f"Total queries: {len(queries.queries)}")
print("\nFirst 5 queries:")
for i, query in enumerate(queries.queries[:5], 1):
    print(f"{i}. {query.text}")

## 3. Execute Runs

Execute queries against both providers:

In [None]:
def progress_callback(current, total, successes, failures):
    """Progress indicator"""
    if current % 10 == 0 or current == total:
        print(
            f"Progress: {current}/{total} queries ({successes} ok, {failures} failed)"
        )


runs = {}

for provider_name in providers:
    print(f"\n{'='*60}")
    print(f"Executing run: {provider_name}")
    print(f"{'='*60}\n")

    run = execute_run(
        domain=domain,
        provider=provider_name,
        query_set=query_set_name,
        label=f"{provider_name}-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
        concurrency=10,
        per_query_timeout=30.0,
        progress_callback=progress_callback,
        domains_dir=domains_dir,
    )

    runs[provider_name] = run

    print(f"\n✓ Run completed: {run.label}")
    print(f"  Status: {run.status.value}")
    print(f"  Successes: {run.metadata.get('successes', 0)}")
    print(f"  Duration: {run.metadata.get('duration_seconds', 0):.2f}s")
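
The callback above only needs to accept `(current, total, successes, failures)`, so it can carry extra state through a closure. The sketch below adds a rough queries-per-second rate. `make_progress_callback` and its `clock` and `emit` parameters are hypothetical names introduced for illustration, and whether `execute_run` invokes the callback from more than one thread is not something this notebook shows, so treat the rate as approximate.

```python
import time


def make_progress_callback(every=10, clock=time.monotonic, emit=print):
    """Build a callback with the (current, total, successes, failures)
    signature used above, reporting a rough throughput as well."""
    start = clock()

    def callback(current, total, successes, failures):
        # Report every `every` queries and once more at the end
        if current % every != 0 and current != total:
            return
        elapsed = clock() - start
        rate = current / elapsed if elapsed > 0 else 0.0
        emit(
            f"Progress: {current}/{total} "
            f"({successes} ok, {failures} failed, {rate:.1f} q/s)"
        )

    return callback
```

Passing `progress_callback=make_progress_callback()` to `execute_run` would use it in place of the plain function; injecting `clock` and `emit` keeps the helper easy to test without real timing.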

## 4. Compare Runs

Use LLM evaluation to compare the results:

In [None]:
print(f"\n{'='*60}")
print(f"Comparing runs: {providers[0]} vs {providers[1]}")
print(f"{'='*60}\n")

comparison = compare_runs(
    domain=domain,
    run_labels=[run.label for run in runs.values()],
    concurrency=10,
    domains_dir=domains_dir,
)

print("\n✓ Comparison completed")
print(f"  Duration: {comparison.metadata.get('duration_seconds', 0):.2f}s")

## 5. Analyze Results

In [None]:
# Create summary table
table = Table(title="Comparison Results")
table.add_column("Provider", style="cyan")
table.add_column("Wins", style="green")
table.add_column("Losses", style="red")
table.add_column("Ties", style="yellow")
table.add_column("Avg Score", style="blue")
table.add_column("Avg Latency", style="magenta")

# Calculate stats
stats = {
    provider: {"wins": 0, "losses": 0, "ties": 0, "scores": [], "latencies": []}
    for provider in providers
}

for eval_result in comparison.evaluations:
    winner = eval_result.winner
    if winner == "tie":
        for provider in providers:
            stats[provider]["ties"] += 1
    else:
        stats[winner]["wins"] += 1
        for provider in providers:
            if provider != winner:
                stats[provider]["losses"] += 1

    for provider in providers:
        score = eval_result.scores.get(provider, 0)
        stats[provider]["scores"].append(score)

# Get latencies
for provider, run in runs.items():
    for result in run.results:
        if result.latency_ms:
            stats[provider]["latencies"].append(result.latency_ms)

# Add rows
for provider in providers:
    avg_score = (
        sum(stats[provider]["scores"]) / len(stats[provider]["scores"])
        if stats[provider]["scores"]
        else 0
    )
    avg_latency = (
        sum(stats[provider]["latencies"]) / len(stats[provider]["latencies"])
        if stats[provider]["latencies"]
        else 0
    )

    table.add_row(
        provider,
        str(stats[provider]["wins"]),
        str(stats[provider]["losses"]),
        str(stats[provider]["ties"]),
        f"{avg_score:.1f}",
        f"{avg_latency:.1f}ms",
    )

console.print(table)
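
The tallying loop above assumes every `winner` is either `"tie"` or one of the provider names; any other value would raise a `KeyError`. The sketch below shows the same counting as a standalone function that collects unexpected winners separately instead of failing. `Evaluation` is a hypothetical stand-in for the items in `comparison.evaluations`, and `tally` is an illustrative name, not a RAGDiff function.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Hypothetical stand-in for one item of comparison.evaluations."""

    winner: str  # a provider name or "tie"


def tally(evaluations, providers):
    """Count wins, losses and ties per provider.

    Winners that are neither "tie" nor a known provider are returned
    in a separate counter rather than raising KeyError.
    """
    stats = {p: Counter() for p in providers}
    unknown = Counter()
    for ev in evaluations:
        if ev.winner == "tie":
            for p in providers:
                stats[p]["ties"] += 1
        elif ev.winner in stats:
            for p in providers:
                stats[p]["wins" if p == ev.winner else "losses"] += 1
        else:
            unknown[ev.winner] += 1
    return stats, unknown


sample = [
    Evaluation("faiss-large"),
    Evaluation("faiss-small"),
    Evaluation("faiss-large"),
    Evaluation("tie"),
    Evaluation("other"),
]
stats, unknown = tally(sample, ["faiss-small", "faiss-large"])
print(dict(stats["faiss-large"]), dict(unknown))
```

A non-empty `unknown` counter is a hint that the evaluator returned labels the summary table does not account for.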

## 6. Export Results

In [None]:
# Export to JSON
output_file = Path("comparison_results.json")
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(comparison.model_dump(mode="json"), f, indent=2, default=str)

print(f"✓ Comparison exported to: {output_file}")
print(f"  File size: {output_file.stat().st_size / 1024:.1f} KB")

## Summary

This notebook demonstrated the complete RAGDiff v2.0 workflow:

**Part 1: Data Preparation**
- ✓ Downloaded SQuAD dataset from HuggingFace
- ✓ Built FAISS indices with different embedding models
- ✓ Generated query sets for testing

**Part 2: RAGDiff API**
- ✓ Executed queries against multiple providers
- ✓ Compared results using LLM evaluation
- ✓ Analyzed and exported results

### Key Takeaways

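- The **large model** (all-MiniLM-L12-v2) typically wins more comparisons but takes longer to embed documents and queries
- The **small model** (paraphrase-MiniLM-L3-v2) is faster but tends to retrieve less relevant passages
- Together they illustrate the classic **quality vs. speed tradeoff** in embedding models; check the summary table from your own run for the actual win counts and latencies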

### Next Steps

- Create custom query sets for your domain
- Try different embedding models (e.g., all-mpnet-base-v2)
- Adjust concurrency for faster execution
- Experiment with different LLM evaluators

For more information, see the [RAGDiff documentation](../../CLAUDE.md).