# RAG Experimentation & Evaluation Framework

This notebook evaluates RAG pipeline configurations against a golden Q&A dataset to find the best-performing parameters.
We start from the current baseline, then change one group of parameters at a time and measure the effect on retrieval metrics (MRR, Recall@k).

**Papers in corpus:** ~30 scientific papers on the ATLAS ITk Pixel Detector, covering readout chips (RD53/ITkPixV2), DAQ (FELIX, YARR), optoelectronics, and module assembly.

---
## Section 1 — Setup & Baseline

In [1]:
import sys
import os
import logging

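# Add project root to path. The guard keeps PROJECT_ROOT stable if this cell is
# re-run after the os.chdir() below, which would otherwise climb one more level.
if "PROJECT_ROOT" not in globals():
    PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))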
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

os.chdir(PROJECT_ROOT)

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

import pandas as pd
import matplotlib.pyplot as plt

from eval.experiment_runner import (
    load_golden_dataset,
    run_experiment,
    compare_experiments,
    plot_comparison,
    plot_radar,
    ExperimentResult,
)

plt.rcParams["figure.dpi"] = 100
pd.set_option("display.precision", 3)

golden = load_golden_dataset()
print(f"Loaded {len(golden)} golden Q&A pairs")
print(f"Sample: {golden[0]['question']}")

Loaded 25 golden Q&A pairs
Sample: What CMOS technology node is the ITkPixV2 chip fabricated in?


In [2]:
# Run baseline with current configuration
baseline = run_experiment(
    name="baseline",
    config_overrides={},  # use current defaults
    golden_dataset=golden,
    reingest=False,
)

all_results = [baseline]
print(f"\nBaseline Results:")
print(f"  MRR: {baseline.metrics['mrr']:.3f}")
print(f"  Recall@4: {baseline.metrics.get('recall@4', 'N/A')}")
print(f"  Duration: {baseline.duration_seconds:.1f}s")

INFO: Loading faiss.
INFO: Successfully loaded faiss.
INFO: BM25 index loaded from vectorstore/bm25_index.pkl
INFO: Loading cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2
INFO: HTTP Request: HEAD https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
INFO: HTTP Request: HEAD https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
INFO: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/cross-encoder/ms-marco-MiniLM-L6-v2/c5ee24cb16019beea0893ab7796b1df96625c6b8/config.json "HTTP/1.1 200 OK"


Loading weights:   0%|          | 0/105 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: cross-encoder/ms-marco-MiniLM-L-6-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Baseline Results:
  MRR: 0.860
  Recall@4: 0.63
  Duration: 154.5s


In [3]:
# View per-question baseline results
baseline_df = pd.DataFrame(baseline.per_question)
baseline_df

Unnamed: 0,question,answer,mrr,recall@4,sources
0,What CMOS technology node is the ITkPixV2 chip...,"According to the paper ""The ITkPixV2 chip"" (Pa...",0.5,0.5,"[Göttingen_RD53B_Seminar_06-21.pdf, 2502.0509..."
1,What is the pixel size of ITkPixV2?,"According to Table 1 in the paper ""ATLAS ITk P...",1.0,0.333,"[2502.05097v1.pdf, 1_ATLAS ITk Pixel Detector ..."
2,How many modules make up the ITk Pixel detecto...,"According to the provided papers, the ITk Pixe...",1.0,0.333,"[6_ATLAS ITk pixel detector overview.pdf, Möb..."
3,What is the trigger rate requirement for the I...,"According to Section 0.4 GHz/cm2, the design r...",1.0,0.333,"[2502.05097v1.pdf, 2502.05097v1.pdf, 2502.0509..."
4,What are the specifications of the FELIX FLX-7...,"According to the provided papers, the FELIX FL...",1.0,1.0,[2_FELIX_the_Detector_Interface_for_the_ATLAS_...
5,How many Optoboards are required for the ITk P...,"According to the provided papers, there are a ...",1.0,1.0,"[2_AT2_IP_MG_0010_v2.7.pdf, 1_ACES20200528_Pos..."
6,What radiation tolerance is required for the i...,According to Chapter 15 of the provided paper ...,1.0,0.75,"[4_20230905_TIPP2023.pdf, introduction_guide.p..."
7,What powering scheme do RD53 chips use and why?,The RD53 chips use a novel serial powering sch...,1.0,0.333,"[introduction_guide.pdf, Loddo_PSD13.pdf, intr..."
8,What are the differences between the 3D and pl...,"According to the provided papers, the main dif...",1.0,0.667,"[6_ATLAS ITk pixel detector overview.pdf, 1_AT..."
9,What threshold and noise performance was measu...,"According to the paper ""Threshold tuning"" by M...",1.0,1.0,"[ITkPixV2_Mironova.pdf, Mironova_PSD13.pdf, Go..."


### Baseline Error Analysis

Inspect complete misses (MRR = 0), partial hits (0 < MRR < 1), and LLM failures (correct retrieval but a refusal-style answer) to decide which experiments to prioritize.
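
The MRR thresholds used here follow from how per-question scores are commonly defined. The sketch below is an illustration only, not the code in `eval.experiment_runner`: `reciprocal_rank` and `recall_at_k` are hypothetical helper names, the file names are placeholders, and sources are assumed to be compared by exact file name.

```python
def reciprocal_rank(retrieved, expected):
    """Return 1/rank of the first retrieved source that is expected, else 0."""
    expected_set = set(expected)
    for rank, source in enumerate(retrieved, start=1):
        if source in expected_set:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved, expected, k):
    """Return the fraction of expected sources found in the top-k results."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    hits = expected_set & set(retrieved[:k])
    return len(hits) / len(expected_set)


# Placeholder file names, not taken from the corpus
retrieved = ["paper_a.pdf", "paper_b.pdf", "paper_c.pdf", "paper_d.pdf"]
expected = ["paper_b.pdf", "paper_x.pdf"]

rr = reciprocal_rank(retrieved, expected)
rec = recall_at_k(retrieved, expected, k=4)
print(f"RR={rr:.2f}, Recall@4={rec:.2f}")  # RR=0.50, Recall@4=0.50
```

Under these definitions, a score of 0 means no expected source was retrieved at all, a score of 1 means an expected source was ranked first, and anything in between means the right source was found but ranked lower. MRR for an experiment is the mean of the per-question values.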

In [None]:
# Load expected sources for comparison
expected_map = {item["question"]: item["expected_sources"] for item in golden}

error_df = baseline_df.copy()
error_df["expected"] = error_df["question"].map(expected_map)

# Complete misses
print("=== COMPLETE MISSES (MRR = 0.0) ===\n")
misses = error_df[error_df["mrr"] == 0.0]
for _, row in misses.iterrows():
    print(f"Q: {row['question'][:80]}...")
    print(f"  Expected: {row['expected']}")
    print(f"  Got:      {row['sources']}")
    print()

# Partial hits (correct source found but not ranked first)
print("=== PARTIAL HITS (0 < MRR < 1) ===\n")
partial = error_df[(error_df["mrr"] > 0) & (error_df["mrr"] < 1)]
for _, row in partial.iterrows():
    print(f"Q: {row['question'][:80]}...")
    print(f"  MRR={row['mrr']:.2f}, Recall@4={row['recall@4']:.2f}")
    print(f"  Expected: {row['expected']}")
    print(f"  Got:      {row['sources']}")
    print()

# LLM failures (retrieval ok but answer says "I don't have enough")
print("=== LLM FAILURES (good retrieval, bad answer) ===\n")
llm_fails = error_df[
    (error_df["mrr"] >= 1.0) &
    error_df["answer"].str.contains("don't have enough", case=False, na=False)
]
for _, row in llm_fails.iterrows():
    print(f"Q: {row['question'][:80]}...")
    print(f"  Recall@4={row['recall@4']:.2f}")
    print(f"  Sources: {row['sources']}")
    print()

print(f"Summary: {len(misses)} misses, {len(partial)} partial, {len(llm_fails)} LLM failures out of {len(error_df)} questions")

---
## Section 2 — Experiment: Chunk Size

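Test different chunk sizes to balance retrieval granularity (small chunks match a query more precisely) against completeness (large chunks keep more surrounding context).
The overlap is scaled with the chunk size (about one seventh) so the overlap ratio stays roughly constant. Each chunk size changes the stored chunks, so it requires re-ingestion.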
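
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an assumption for illustration: `chunk_text` is a hypothetical helper that splits on raw character offsets, whereas the project's ingestion step may use a separator-aware or token-based splitter.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']

# Overlap used for each chunk size when overlap = chunk_size // 7
for cs in [500, 1000, 1500, 2000, 3000]:
    print(cs, cs // 7)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.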

In [None]:
chunk_sizes = [500, 1000, 1500, 2000, 3000]
chunk_results = []

for cs in chunk_sizes:
    overlap = cs // 7  # keep roughly similar overlap ratio
    result = run_experiment(
        name=f"chunk_{cs}",
        config_overrides={
            "CHUNK_SIZE": cs,
            "CHUNK_OVERLAP": overlap,
        },
        golden_dataset=golden,
        reingest=True,
    )
    chunk_results.append(result)
    print(f"  chunk_size={cs}: MRR={result.metrics['mrr']:.3f}")

all_results.extend(chunk_results)

In [None]:
chunk_df = compare_experiments(chunk_results)
display(chunk_df)

fig = plot_comparison(chunk_df, title="Chunk Size Comparison")
plt.show()

---
## Section 3 — Experiment: Retrieval Weights

Vary the BM25/dense weight ratio. No re-ingestion is needed because only the retriever's score weighting changes, not the stored chunks or embeddings.
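
One common way to combine a sparse and a dense retriever is a weighted sum of normalized scores, sketched below. This is an assumption about the general technique, not a description of this project's retriever, which may fuse ranks instead of scores. The names `min_max` and `fuse` and all scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25_scores, dense_scores, bm25_weight, dense_weight):
    """Rank documents by a weighted sum of normalized BM25 and dense scores."""
    bm25_n = min_max(bm25_scores)
    dense_n = min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {
        doc: bm25_weight * bm25_n.get(doc, 0.0) + dense_weight * dense_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: BM25 favours doc "a", the dense retriever favours doc "b"
bm25 = {"a": 10.0, "b": 2.0}
dense = {"a": 0.2, "b": 0.9, "c": 0.5}

for bm25_w, dense_w in [(1.0, 0.0), (0.3, 0.7), (0.0, 1.0)]:
    top_doc, _score = fuse(bm25, dense, bm25_w, dense_w)[0]
    print(f"bm25={bm25_w}, dense={dense_w}: top document is {top_doc}")
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let BM25 dominate regardless of the weights.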

In [None]:
weight_configs = [
    ("dense_only", 0.0, 1.0),
    ("bm25_only", 1.0, 0.0),
    ("bm25_0.1", 0.1, 0.9),
    ("bm25_0.3", 0.3, 0.7),  # current default
    ("bm25_0.5", 0.5, 0.5),
    ("bm25_0.7", 0.7, 0.3),
]

weight_results = []
for name, bm25_w, dense_w in weight_configs:
    result = run_experiment(
        name=name,
        config_overrides={
            "BM25_WEIGHT": bm25_w,
            "DENSE_WEIGHT": dense_w,
        },
        golden_dataset=golden,
        reingest=False,
    )
    weight_results.append(result)
    print(f"  {name}: MRR={result.metrics['mrr']:.3f}")

all_results.extend(weight_results)

In [None]:
weight_df = compare_experiments(weight_results)
display(weight_df)

fig = plot_comparison(weight_df, title="Retrieval Weight Comparison")
plt.show()

---
## Section 4 — Experiment: Top-K and Reranking

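Vary TOP_K (the number of documents kept after reranking) and TOP_K_CANDIDATES (the size of the pre-reranking pool). Recall@k is reported at each run's own k, so only MRR is directly comparable across these configurations.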

In [None]:
topk_configs = [
    ("k2_c10", 2, 10),
    ("k2_c20", 2, 20),
    ("k4_c10", 4, 10),
    ("k4_c20", 4, 20),  # current default
    ("k4_c40", 4, 40),
    ("k6_c20", 6, 20),
    ("k6_c40", 6, 40),
    ("k8_c20", 8, 20),
    ("k8_c40", 8, 40),
]

topk_results = []
for name, k, candidates in topk_configs:
    result = run_experiment(
        name=name,
        config_overrides={
            "TOP_K": k,
            "TOP_K_CANDIDATES": candidates,
        },
        golden_dataset=golden,
        reingest=False,
    )
    topk_results.append(result)
    recall_key = f"recall@{k}"
    print(f"  {name}: MRR={result.metrics['mrr']:.3f}, {recall_key}={result.metrics[recall_key]:.3f}")

all_results.extend(topk_results)

In [None]:
topk_df = compare_experiments(topk_results)
display(topk_df)

fig = plot_comparison(topk_df, metrics=["mrr"], title="Top-K / Candidates Comparison")
plt.show()

---
## Section 4b — Experiment: Candidate Pool Size (fixed TOP_K=4)

Isolate the effect of widening the pre-reranking candidate pool. More candidates means
the cross-encoder sees more documents, increasing the chance of surfacing niche papers.
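
The effect of the pool size can be illustrated with a small retrieve-then-rerank sketch. It is an assumption for illustration: `retrieve_then_rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is a toy stand-in for the cross-encoder so the example runs without downloading a model.

```python
def retrieve_then_rerank(query, ranked_docs, score_fn, top_k, top_k_candidates):
    """Rescore the first top_k_candidates documents and keep the best top_k."""
    candidates = ranked_docs[:top_k_candidates]
    scores = score_fn(query, candidates)
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _score in reranked[:top_k]]


def overlap_score(query, docs):
    """Toy relevance score: number of query words that appear in each document."""
    query_words = set(query.lower().split())
    return [len(query_words & set(doc.lower().split())) for doc in docs]


# First-stage ranking (made-up); the relevant document sits at position 4
first_stage = [
    "module assembly overview",
    "felix readout card",
    "optoboard layout",
    "itkpixv2 pixel size table",
]
query = "itkpixv2 pixel size"

narrow = retrieve_then_rerank(query, first_stage, overlap_score, top_k=1, top_k_candidates=2)
wide = retrieve_then_rerank(query, first_stage, overlap_score, top_k=1, top_k_candidates=4)
print(narrow)  # ['module assembly overview']
print(wide)    # ['itkpixv2 pixel size table']
```

The reranker can only promote documents that made it into the pool, which is why a wider pool can recover a relevant document the first stage ranked low. The cost is one scoring pass per candidate, so latency grows with the pool size.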

In [None]:
candidate_pools = [10, 20, 30, 40, 60, 80]
candidate_results = []

for c in candidate_pools:
    result = run_experiment(
        name=f"candidates_{c}",
        config_overrides={
            "TOP_K": 4,
            "TOP_K_CANDIDATES": c,
        },
        golden_dataset=golden,
        reingest=False,
    )
    candidate_results.append(result)
    print(f"  candidates={c}: MRR={result.metrics['mrr']:.3f}, Recall@4={result.metrics['recall@4']:.3f}")

all_results.extend(candidate_results)

In [None]:
cand_df = compare_experiments(candidate_results)
display(cand_df)

fig = plot_comparison(cand_df, title="Candidate Pool Size (fixed TOP_K=4)")
plt.show()

---
## Section 5 — Experiment: Embedding Models

Compare different Ollama embedding models. Requires re-ingestion per model.

**Note:** Make sure models are pulled in Ollama before running (`ollama pull <model>`).

In [None]:
embedding_models = [
    "nomic-embed-text",      # current default
    "mxbai-embed-large",
]

embedding_results = []
for model in embedding_models:
    result = run_experiment(
        name=f"embed_{model}",
        config_overrides={"EMBEDDING_MODEL": model},
        golden_dataset=golden,
        reingest=True,
    )
    embedding_results.append(result)
    print(f"  {model}: MRR={result.metrics['mrr']:.3f}")

all_results.extend(embedding_results)

In [None]:
embed_df = compare_experiments(embedding_results)
display(embed_df)

fig = plot_comparison(embed_df, title="Embedding Model Comparison")
plt.show()

---
## Section 6 — Experiment: LLM Models

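Compare generation quality across different LLMs. Retrieval is identical across these runs, so MRR and Recall@k should stay essentially unchanged; compare the answers themselves and any generation metrics the runner reports (for example answer relevancy or faithfulness).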

**Note:** Pull models first with `ollama pull <model>`.

In [None]:
llm_models = [
    "llama3.1:8b",   # current default
    # Add more models as available in your Ollama instance, e.g.:
    # "mistral:7b",
    # "gemma2:9b",
]

llm_results = []
for model in llm_models:
    result = run_experiment(
        name=f"llm_{model}",
        config_overrides={"LLM_MODEL": model},
        golden_dataset=golden,
        reingest=False,
    )
    llm_results.append(result)
    print(f"  {model}: MRR={result.metrics['mrr']:.3f}")

all_results.extend(llm_results)

In [None]:
llm_df = compare_experiments(llm_results)
display(llm_df)

fig = plot_comparison(llm_df, title="LLM Model Comparison")
plt.show()

---
## Section 7 — Experiment: Parent-Document Retrieval

Use small child chunks for retrieval precision, but expand to larger parent chunks when
passing context to the LLM. This helps when retrieval finds the right spot but the chunk
is too small for the LLM to synthesize a full answer (e.g. Q14 YARR processing times).

**Requires re-ingestion** to create child→parent mappings.
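
The child-to-parent mapping can be sketched as follows. This is a simplified illustration, not the project's ingestion code: `build_child_index` and `expand_to_parents` are hypothetical names, children are cut on raw character offsets, and a keyword match stands in for the real retriever.

```python
def build_child_index(parents, child_size, child_overlap):
    """Return (child_text, parent_id) pairs, i.e. the child-to-parent mapping."""
    step = child_size - child_overlap
    if step <= 0:
        raise ValueError("child_overlap must be smaller than child_size")
    index = []
    for parent_id, parent_text in enumerate(parents):
        for start in range(0, len(parent_text), step):
            index.append((parent_text[start:start + child_size], parent_id))
    return index


def expand_to_parents(child_hits, parents):
    """Map child hits back to their parent chunks, dropping duplicates."""
    seen = set()
    expanded = []
    for _child_text, parent_id in child_hits:
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded


# Two made-up parent chunks
parents = ["serial powering chain details", "threshold tuning procedure"]
index = build_child_index(parents, child_size=10, child_overlap=2)

# Stand-in for retrieval: keep child chunks that contain the keyword
hits = [pair for pair in index if "tuning" in pair[0]]
print(hits)                              # [('d tuning p', 1)]
print(expand_to_parents(hits, parents))  # ['threshold tuning procedure']
```

The match happens on the small child chunk, but the LLM receives the whole parent, which is the behaviour this section tests. Deduplication matters because several children of the same parent often match the same query.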

In [None]:
parent_configs = [
    ("parent_off", False, 400, 50),         # baseline (no parent retrieval)
    ("parent_c300", True, 300, 50),          # smaller child chunks
    ("parent_c400", True, 400, 50),          # default child size
    ("parent_c200", True, 200, 30),          # very small child chunks
]

parent_results = []
for name, enabled, child_size, child_overlap in parent_configs:
    result = run_experiment(
        name=name,
        config_overrides={
            "ENABLE_PARENT_RETRIEVAL": enabled,
            "CHILD_CHUNK_SIZE": child_size,
            "CHILD_CHUNK_OVERLAP": child_overlap,
        },
        golden_dataset=golden,
        reingest=True,
    )
    parent_results.append(result)
    print(f"  {name}: MRR={result.metrics['mrr']:.3f}, Recall@4={result.metrics['recall@4']:.3f}")

all_results.extend(parent_results)

In [None]:
parent_df = compare_experiments(parent_results)
display(parent_df)

fig = plot_comparison(parent_df, title="Parent-Document Retrieval Comparison")
plt.show()

---
## Section 8 — Combined Best Configuration

Take the winning parameters from each experiment above and combine them into a single run.
Update the overrides below based on your results from previous sections.

In [None]:
# Fill in the best values from your experiments above
best_config = {
    # "CHUNK_SIZE": 1500,           # update with best from Section 2
    # "CHUNK_OVERLAP": 200,         # update with best from Section 2
    # "BM25_WEIGHT": 0.3,           # update with best from Section 3
    # "DENSE_WEIGHT": 0.7,          # update with best from Section 3
    # "TOP_K": 4,                   # update with best from Section 4
    # "TOP_K_CANDIDATES": 20,       # update with best from Section 4/4b
    # "EMBEDDING_MODEL": "nomic-embed-text",  # update with best from Section 5
    # "LLM_MODEL": "llama3.1:8b",   # update with best from Section 6
    # "ENABLE_PARENT_RETRIEVAL": True,  # update with best from Section 7
    # "CHILD_CHUNK_SIZE": 400,      # update with best from Section 7
    # "CHILD_CHUNK_OVERLAP": 50,    # update with best from Section 7
}

# Settings that change how chunks or embeddings are built require re-ingestion
REINGEST_KEYS = {
    "CHUNK_SIZE", "CHUNK_OVERLAP", "EMBEDDING_MODEL",
    "ENABLE_PARENT_RETRIEVAL", "CHILD_CHUNK_SIZE", "CHILD_CHUNK_OVERLAP",
}

# Uncomment the lines above and set values, then run:
if best_config:
    needs_reingest = any(k in best_config for k in REINGEST_KEYS)
    combined = run_experiment(
        name="combined_best",
        config_overrides=best_config,
        golden_dataset=golden,
        reingest=needs_reingest,
    )
    all_results.append(combined)
    print(f"\nCombined Best Results:")
    print(f"  MRR: {combined.metrics['mrr']:.3f}")
    print(f"  Recall@4: {combined.metrics.get('recall@4', 'N/A')}")
    print(f"  Duration: {combined.duration_seconds:.1f}s")

    # Per-question comparison vs baseline, joined on the question text so rows
    # line up even if the two runs return questions in a different order
    combined_df = pd.DataFrame(combined.per_question)
    comparison = baseline_df[["question", "mrr", "recall@4"]].merge(
        combined_df[["question", "mrr", "recall@4"]],
        on="question",
        suffixes=("_baseline", "_combined"),
    )
    comparison["mrr_delta"] = comparison["mrr_combined"] - comparison["mrr_baseline"]
    display(comparison[comparison["mrr_delta"] != 0].sort_values("mrr_delta"))
else:
    print("Uncomment and fill in best_config values from your experiments above.")

---
## Section 9 — Results Dashboard

Combined comparison of all experiments.

In [None]:
# Combined comparison table
full_df = compare_experiments(all_results)
display(full_df.sort_values("mrr", ascending=False))

In [None]:
# Best configuration summary
if "mrr" in full_df.columns:
    best_idx = full_df["mrr"].idxmax()
    print(f"Best experiment by MRR: {best_idx}")
    print(f"  MRR = {full_df.loc[best_idx, 'mrr']:.3f}")
    print(f"\nFull config:")
    display(full_df.loc[best_idx])

In [None]:
# Heatmap of all retrieval metrics across experiments
metric_cols = [c for c in full_df.columns if c in {
    "mrr", "recall@2", "recall@4", "recall@6", "recall@8",
    "context_precision", "context_recall", "answer_relevancy", "faithfulness",
}]

if metric_cols:
    fig, ax = plt.subplots(figsize=(10, max(6, len(full_df) * 0.4)))
    heatmap_data = full_df[metric_cols].apply(pd.to_numeric, errors="coerce")
    im = ax.imshow(heatmap_data.values, cmap="YlGn", aspect="auto", vmin=0, vmax=1)
    ax.set_xticks(range(len(metric_cols)))
    ax.set_xticklabels([m.replace('_', ' ').title() for m in metric_cols], rotation=45, ha="right")
    ax.set_yticks(range(len(heatmap_data)))
    ax.set_yticklabels(heatmap_data.index)
    # Add text annotations
    for i in range(len(heatmap_data)):
        for j in range(len(metric_cols)):
            val = heatmap_data.iloc[i, j]
            if pd.notna(val):
                ax.text(j, i, f"{val:.2f}", ha="center", va="center", fontsize=8)
    plt.colorbar(im, ax=ax, label="Score")
    ax.set_title("All Experiments — Metrics Heatmap", fontsize=14, fontweight="bold")
    plt.tight_layout()
    plt.show()

In [None]:
# Radar chart for top experiments
# Select top 5 experiments by MRR for readability
if "mrr" in full_df.columns and len(full_df) > 2:
    top5 = full_df.nlargest(5, "mrr")
    fig = plot_radar(top5, title="Top 5 Experiments — Radar Chart")
    if fig:
        plt.show()