# SQuAD FAISS Demo - Python API

This notebook demonstrates how to use the **RAGDiff v2.0 Python API** to compare RAG providers programmatically, instead of using the CLI.

## Overview

This example compares two FAISS-based RAG systems using different embedding models:

1. **faiss-small**: `paraphrase-MiniLM-L3-v2` (17MB, 3 layers, fast)
2. **faiss-large**: `all-MiniLM-L12-v2` (120MB, 12 layers, more accurate)

We'll demonstrate:
- Executing query sets against providers programmatically
- Comparing results using LLM evaluation
- Analyzing and exporting results

## Setup

In [None]:
# Import required modules
from pathlib import Path
from datetime import datetime
import json

# RAGDiff v2.0 API
from ragdiff.execution import execute_run
from ragdiff.comparison import compare_runs
from ragdiff.core.loaders import load_domain, load_provider_config, load_query_set
from ragdiff.core.storage import load_run, list_runs
from ragdiff.core.models_v2 import Run

# Rich for pretty printing
from rich import print as rprint
from rich.table import Table
from rich.console import Console

console = Console()

## Configuration

Set up paths to the domain directory:

In [None]:
# Domain configuration
domain_dir = Path("domains/squad")
domains_dir = domain_dir.parent  # "domains": the directory that holds every domain
domain = domain_dir.name  # "squad": assumed to match the directory name under domains_dir

# Providers to compare
providers = ["faiss-small", "faiss-large"]

# Query set to use
query_set_name = "test-queries"

print(f"Domain: {domain}")
print(f"Domain directory: {domain_dir}")
print(f"Providers: {providers}")
print(f"Query set: {query_set_name}")
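
The loaders used below take a domain name together with the directory that contains all domains. As a minimal sketch, assuming a domain is resolved as `domains_dir / domain` (an assumption about the loaders, not something this notebook confirms), both values can be derived from a single path so they cannot drift apart. The names `example_dir`, `example_name`, and `example_root` are illustrative only.

```python
from pathlib import PurePosixPath

# A single source of truth for the domain location (illustrative path).
example_dir = PurePosixPath("domains/squad")

# The final path component is the domain name; the parent holds all domains.
example_name = example_dir.name
example_root = example_dir.parent

# Joining the two pieces gives back the original directory.
rebuilt = example_root / example_name
print(example_name, example_root, rebuilt == example_dir)
```

If the name passed as `domain` differs from the directory's final component, a lookup of this shape would point at a directory that does not exist.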

## 1. Explore Domain Configuration

Let's first load and inspect the domain configuration:

In [None]:
# Load domain configuration
domain_config = load_domain(domain, domains_dir)

print("\n=== Domain Configuration ===")
print(f"Name: {domain_config.name}")
print(f"Description: {domain_config.description}")
print("\nEvaluator:")
print(f"  Model: {domain_config.evaluator.model}")
print(f"  Temperature: {domain_config.evaluator.temperature}")
print(f"  Prompt template (first 100 chars): {domain_config.evaluator.prompt_template[:100]}...")

## 2. Explore Provider Configurations

Load and inspect the provider configurations:

In [None]:
# Load provider configurations
for provider_name in providers:
    provider_config = load_provider_config(domain, provider_name, domains_dir)
    
    print(f"\n=== Provider: {provider_name} ===")
    print(f"Tool: {provider_config.tool}")
    print(f"Description: {provider_config.description}")
    print("\nConfiguration:")
    for key, value in provider_config.config.items():
        print(f"  {key}: {value}")

## 3. Load Query Set

Load the test queries:

In [None]:
# Load query set
queries = load_query_set(domain, query_set_name, domains_dir)

print(f"\n=== Query Set: {query_set_name} ===")
print(f"Total queries: {len(queries)}")
print("\nFirst 5 queries:")
for i, query in enumerate(queries[:5], 1):
    print(f"{i}. {query.text}")
    if query.reference:
        print(f"   Reference: {query.reference[:100]}...")

## 4. Execute Runs

Execute the query set against both providers. This is the programmatic equivalent of:

```bash
ragdiff run -d domains/squad -p faiss-small -q test-queries
ragdiff run -d domains/squad -p faiss-large -q test-queries
```

In [None]:
# Progress callback to show execution status
def progress_callback(current, total, successes, failures):
    """Simple progress indicator"""
    if current % 10 == 0 or current == total:
        print(f"Progress: {current}/{total} queries ({successes} ok, {failures} failed)")

# Execute runs for each provider
runs = {}

for provider_name in providers:
    print(f"\n{'='*60}")
    print(f"Executing run: {provider_name}")
    print(f"{'='*60}\n")
    
    # Execute the run
    run = execute_run(
        domain=domain,
        provider=provider_name,
        query_set=query_set_name,
        label=f"{provider_name}-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
        concurrency=10,  # Run 10 queries in parallel
        per_query_timeout=30.0,
        progress_callback=progress_callback,
        domains_dir=domains_dir,
    )
    
    runs[provider_name] = run
    
    # Display run summary
    print(f"\n✓ Run completed: {run.label}")
    print(f"  Run ID: {run.id}")
    print(f"  Status: {run.status.value}")
    print(f"  Total Queries: {run.metadata.get('total_queries', 0)}")
    print(f"  Successes: {run.metadata.get('successes', 0)}")
    print(f"  Failures: {run.metadata.get('failures', 0)}")
    print(f"  Duration: {run.metadata.get('duration_seconds', 0):.2f}s")
    print(f"  Avg latency: {run.metadata.get('avg_latency_ms', 0):.2f}ms")
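
To make the `concurrency` and `progress_callback` arguments more concrete, here is a rough sketch of how a concurrent runner could drive a callback with the same four-argument shape as `progress_callback` above. This is not RAGDiff's implementation: `run_queries` and `fake_search` are hypothetical stand-ins, and per-query timeouts are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_queries(queries, fn, concurrency=10, progress_callback=None):
    """Hypothetical runner: apply fn to each query using a thread pool."""
    results = [None] * len(queries)
    successes = failures = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Remember each query's position so results keep the input order.
        futures = {pool.submit(fn, q): i for i, q in enumerate(queries)}
        for done, future in enumerate(as_completed(futures), 1):
            index = futures[future]
            try:
                results[index] = ("ok", future.result())
                successes += 1
            except Exception as exc:
                results[index] = ("error", str(exc))
                failures += 1
            if progress_callback:
                progress_callback(done, len(queries), successes, failures)
    return results


def fake_search(query):
    """Stand-in for a provider call; fails on an empty query."""
    if not query:
        raise ValueError("empty query")
    return query.upper()


calls = []
demo_results = run_queries(
    ["a", "", "c"],
    fake_search,
    concurrency=2,
    progress_callback=lambda *args: calls.append(args),
)
print(demo_results)
```

Queries finish in whatever order the threads complete, so the sketch stores each result by its original index. That is what makes indexing results by query position, as in the next section, meaningful.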

## 5. Inspect Run Results

Let's look at some individual query results:

In [None]:
# Show results for first query from both providers
query_index = 0

print(f"\n=== Query {query_index + 1} ===")
print(f"Query: {queries[query_index].text}\n")

for provider_name, run in runs.items():
    result = run.results[query_index]
    
    print(f"\n--- {provider_name} ---")
    print(f"Status: {result.status}")
    print(f"Latency: {result.latency_ms:.2f}ms")
    print(f"Chunks retrieved: {len(result.chunks)}")
    
    if result.chunks:
        print(f"\nTop result (score: {result.chunks[0].score:.4f}):")
        print(f"  {result.chunks[0].content[:200]}...")
    
    if result.error:
        print(f"Error: {result.error}")

## 6. Compare Runs

Now let's compare the two runs using LLM evaluation. This is the programmatic equivalent of:

```bash
ragdiff compare -d domains/squad -r <run-1> -r <run-2>
```

In [None]:
# Progress callback for comparison
def comparison_progress_callback(current, total):
    """Simple progress indicator for comparison"""
    if current % 10 == 0 or current == total:
        print(f"Evaluating: {current}/{total} queries")

print(f"\n{'='*60}")
print(f"Comparing runs: {providers[0]} vs {providers[1]}")
print(f"{'='*60}\n")

# Compare the runs
comparison = compare_runs(
    domain=domain,
    run_labels=[run.label for run in runs.values()],
    concurrency=10,  # Evaluate 10 queries in parallel
    progress_callback=comparison_progress_callback,
    domains_dir=domains_dir,
)

print(f"\n✓ Comparison completed: {comparison.id}")
print(f"  Status: {comparison.metadata.get('status', 'completed')}")
print(f"  Duration: {comparison.metadata.get('duration_seconds', 0):.2f}s")

## 7. Analyze Comparison Results

Summarize wins, losses, ties, average score, and average latency per provider in a table:

In [None]:
# Create summary table
table = Table(title="Comparison Results")
table.add_column("Provider", style="cyan")
table.add_column("Wins", style="green")
table.add_column("Losses", style="red")
table.add_column("Ties", style="yellow")
table.add_column("Avg Score", style="blue")
table.add_column("Avg Latency", style="magenta")

# Count wins/losses/ties for each provider
stats = {provider: {"wins": 0, "losses": 0, "ties": 0, "scores": [], "latencies": []} 
         for provider in providers}

for eval_result in comparison.evaluations:
    winner = eval_result.winner
    
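    if winner == "tie":
        for provider in providers:
            stats[provider]["ties"] += 1
    elif winner in stats:
        stats[winner]["wins"] += 1
        for provider in providers:
            if provider != winner:
                stats[provider]["losses"] += 1
    else:
        # Skip evaluations whose winner is missing or is not one of the compared providers
        print(f"Skipping evaluation with unexpected winner: {winner!r}")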
    
    # Collect scores
    for provider in providers:
        score = eval_result.scores.get(provider, 0)
        stats[provider]["scores"].append(score)

# Get latencies from runs
for provider, run in runs.items():
    for result in run.results:
        if result.latency_ms:
            stats[provider]["latencies"].append(result.latency_ms)

# Add rows to table
for provider in providers:
    avg_score = sum(stats[provider]["scores"]) / len(stats[provider]["scores"]) if stats[provider]["scores"] else 0
    avg_latency = sum(stats[provider]["latencies"]) / len(stats[provider]["latencies"]) if stats[provider]["latencies"] else 0
    
    table.add_row(
        provider,
        str(stats[provider]["wins"]),
        str(stats[provider]["losses"]),
        str(stats[provider]["ties"]),
        f"{avg_score:.1f}",
        f"{avg_latency:.1f}ms",
    )

console.print(table)
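
The tallying logic above can be tried without running a comparison. The sketch below applies the same win/loss/tie rules to a few hand-written records. The `Evaluation` dataclass is a hypothetical stand-in that mirrors only the `winner` and `scores` attributes used in this notebook, and the sample values are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    """Hypothetical stand-in for one evaluation record."""
    winner: str
    scores: dict = field(default_factory=dict)


def tally(evaluations, providers):
    """Count wins, losses, and ties per provider."""
    counts = {p: Counter() for p in providers}
    for ev in evaluations:
        for p in providers:
            if ev.winner == "tie":
                counts[p]["ties"] += 1
            elif ev.winner == p:
                counts[p]["wins"] += 1
            elif ev.winner in counts:
                counts[p]["losses"] += 1
            # Any other winner value is ignored rather than raising KeyError.
    return counts


# Made-up sample data, not real results.
sample = [
    Evaluation("faiss-large", {"faiss-small": 60, "faiss-large": 85}),
    Evaluation("tie", {"faiss-small": 70, "faiss-large": 70}),
    Evaluation("faiss-small", {"faiss-small": 80, "faiss-large": 75}),
    Evaluation("faiss-large", {"faiss-small": 55, "faiss-large": 90}),
]
summary = tally(sample, ["faiss-small", "faiss-large"])
print(summary)
```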

## 8. Examine Individual Evaluations

Look at some specific query evaluations:

In [None]:
# Show first 3 evaluations
print("\n=== Sample Evaluations ===")

for i, eval_result in enumerate(comparison.evaluations[:3], 1):
    print(f"\n--- Query {i} ---")
    print(f"Query: {eval_result.query}")
    print(f"\nWinner: {eval_result.winner}")
    print("\nScores:")
    for provider, score in eval_result.scores.items():
        print(f"  {provider}: {score}/100")
    print(f"\nReasoning: {eval_result.reasoning}")
    print(f"{'─'*80}")

## 9. Export Comparison Results

Export the comparison to JSON for further analysis:

In [None]:
# Export to JSON
output_file = Path("comparison_results.json")

with open(output_file, "w") as f:
    json.dump(comparison.model_dump(mode="json"), f, indent=2, default=str)

print(f"\n✓ Comparison exported to: {output_file}")
print(f"  File size: {output_file.stat().st_size / 1024:.1f} KB")
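
Once exported, the JSON file can be read back with the standard library alone. The sketch below round-trips a small stand-in payload through a temporary file. The keys `evaluations`, `winner`, and `scores` are assumed to mirror the attribute names used earlier in this notebook, and the real export may contain additional fields.

```python
import json
import tempfile
from pathlib import Path

# Stand-in payload; a real export comes from comparison.model_dump(mode="json").
payload = {
    "id": "demo",
    "evaluations": [
        {"winner": "tie", "scores": {"faiss-small": 70, "faiss-large": 70}},
        {"winner": "faiss-large", "scores": {"faiss-small": 60, "faiss-large": 85}},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "comparison_results.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # Read the file back, as a later analysis script would.
    loaded = json.loads(path.read_text(encoding="utf-8"))

winners = [entry["winner"] for entry in loaded["evaluations"]]
print(winners)
```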

## 10. Generate Markdown Report

Create a markdown report summarizing the comparison:

In [None]:
# Generate markdown report
report_file = Path("comparison_report.md")

# Write as UTF-8 so non-ASCII characters in the LLM reasoning are saved reliably
with open(report_file, "w", encoding="utf-8") as f:
    f.write("# RAG Comparison Report\n\n")
    f.write(f"**Domain**: {domain}\n\n")
    f.write(f"**Date**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    f.write(f"**Query Set**: {query_set_name} ({len(queries)} queries)\n\n")
    
    f.write("## Providers\n\n")
    for provider in providers:
        f.write(f"- **{provider}**: {runs[provider].metadata.get('avg_latency_ms', 0):.2f}ms avg latency\n")
    
    f.write("\n## Results\n\n")
    f.write("| Provider | Wins | Losses | Ties | Avg Score | Avg Latency |\n")
    f.write("|----------|------|--------|------|-----------|-------------|\n")
    
    for provider in providers:
        avg_score = sum(stats[provider]["scores"]) / len(stats[provider]["scores"]) if stats[provider]["scores"] else 0
        avg_latency = sum(stats[provider]["latencies"]) / len(stats[provider]["latencies"]) if stats[provider]["latencies"] else 0
        
        f.write(f"| {provider} | {stats[provider]['wins']} | {stats[provider]['losses']} | "
                f"{stats[provider]['ties']} | {avg_score:.1f} | {avg_latency:.1f}ms |\n")
    
    f.write("\n## Analysis\n\n")
    
    # Determine overall winner
    winner = max(providers, key=lambda p: stats[p]["wins"])
    f.write(f"**Overall Winner**: {winner}\n\n")
    
    # Quality vs Speed tradeoff
    quality_winner = max(providers, key=lambda p: sum(stats[p]["scores"]) / len(stats[p]["scores"]) if stats[p]["scores"] else 0)
    speed_winner = min(providers, key=lambda p: sum(stats[p]["latencies"]) / len(stats[p]["latencies"]) if stats[p]["latencies"] else float('inf'))
    
    f.write(f"- **Best Quality**: {quality_winner}\n")
    f.write(f"- **Fastest**: {speed_winner}\n\n")
    
    f.write("## Sample Evaluations\n\n")
    for i, eval_result in enumerate(comparison.evaluations[:5], 1):
        f.write(f"### Query {i}\n\n")
        f.write(f"**Query**: {eval_result.query}\n\n")
        f.write(f"**Winner**: {eval_result.winner}\n\n")
        f.write("**Scores**:\n")
        for provider, score in eval_result.scores.items():
            f.write(f"- {provider}: {score}/100\n")
        f.write(f"\n**Reasoning**: {eval_result.reasoning}\n\n")
        f.write("---\n\n")

print(f"\n✓ Report exported to: {report_file}")
print(f"  File size: {report_file.stat().st_size / 1024:.1f} KB")
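
The report cell builds its results table by writing the pipe-delimited rows by hand. If you generate several reports, a small helper keeps the header, separator, and rows in sync. `markdown_table` below is a hypothetical helper written for this sketch, not part of RAGDiff, and the row values are made up.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        # Convert every cell to text so numbers can be passed directly.
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


table_md = markdown_table(
    ["Provider", "Wins", "Losses", "Ties"],
    [["faiss-small", 1, 2, 1], ["faiss-large", 2, 1, 1]],
)
print(table_md)
```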

## 11. Load Previous Runs (Optional)

You can also load and analyze runs that were previously executed:

In [None]:
# List all available runs for this domain
all_runs = list_runs(domain, domains_dir)

print("\n=== Available Runs ===")
print(f"Total runs found: {len(all_runs)}\n")

# Show recent runs
for run_info in sorted(all_runs, key=lambda x: x['started_at'], reverse=True)[:5]:
    print(f"Label: {run_info['label']}")
    print(f"  Provider: {run_info['provider']}")
    print(f"  Query Set: {run_info['query_set']}")
    print(f"  Status: {run_info['status']}")
    print(f"  Started: {run_info['started_at']}")
    print(f"  ID: {run_info['id']}")
    print()
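
A common follow-up is to pick the most recent run for each provider. The sketch below does this on stand-in dictionaries with the same keys printed above (`label`, `provider`, `started_at`). It assumes `started_at` values compare in chronological order, which holds for `datetime` objects and ISO 8601 strings. The sample entries are made up.

```python
def latest_run_per_provider(run_infos):
    """Return the newest run-info dict for each provider."""
    latest = {}
    for info in run_infos:
        current = latest.get(info["provider"])
        if current is None or info["started_at"] > current["started_at"]:
            latest[info["provider"]] = info
    return latest


# Made-up entries shaped like the run summaries printed above.
sample_runs = [
    {"label": "faiss-small-20240115-120000", "provider": "faiss-small",
     "started_at": "2024-01-15T12:00:00"},
    {"label": "faiss-small-20240116-090000", "provider": "faiss-small",
     "started_at": "2024-01-16T09:00:00"},
    {"label": "faiss-large-20240115-121500", "provider": "faiss-large",
     "started_at": "2024-01-15T12:15:00"},
]
newest = latest_run_per_provider(sample_runs)
print({provider: info["label"] for provider, info in newest.items()})
```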

In [None]:
# Load a specific run by label
# run_label = "faiss-small-20240115-120000"  # Replace with actual label
# loaded_run = load_run(domain, run_label, domains_dir)
# print(f"Loaded run: {loaded_run.label}")
# print(f"  Status: {loaded_run.status}")
# print(f"  Total queries: {loaded_run.metadata.get('total_queries', 0)}")
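
Labels created in section 4 follow the pattern `<provider>-<YYYYMMDD>-<HHMMSS>`, so a stored label can be split back into its provider name and timestamp. The helper below is a hypothetical convenience for labels built that way and will raise `ValueError` for labels in any other format.

```python
from datetime import datetime


def parse_run_label(label):
    """Split '<provider>-<YYYYMMDD>-<HHMMSS>' into (provider, datetime)."""
    # Split from the right, because provider names may contain hyphens.
    provider, date_part, time_part = label.rsplit("-", 2)
    started = datetime.strptime(f"{date_part}-{time_part}", "%Y%m%d-%H%M%S")
    return provider, started


parsed_provider, parsed_time = parse_run_label("faiss-small-20240115-120000")
print(parsed_provider, parsed_time.isoformat())
```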

## Summary

This notebook demonstrated the complete RAGDiff v2.0 Python API workflow:

1. ✓ **Loading configurations**: Domain, provider configs, and query sets
2. ✓ **Executing runs**: Running queries against multiple providers in parallel
3. ✓ **Comparing results**: Using LLM evaluation to compare provider performance
4. ✓ **Analyzing results**: Building a summary table and inspecting individual evaluations
5. ✓ **Exporting data**: Saving results to JSON and markdown reports
6. ✓ **Loading previous runs**: Accessing historical run data

### Key API Functions

- `execute_run()`: Execute a query set against a provider
- `compare_runs()`: Compare multiple runs using LLM evaluation
- `load_domain()`, `load_provider_config()`, `load_query_set()`: Load configurations
- `load_run()`, `list_runs()`: Access run history

### Next Steps

- Create custom query sets for your domain
- Add new providers by creating YAML configs
- Adjust concurrency for faster execution
- Experiment with different LLM models for evaluation
- Build custom analysis and visualization tools

For more information, see the [RAGDiff documentation](../../CLAUDE.md).