# RAG Retrieval Evaluation with Golden Chunks

This notebook evaluates chunk-level retrieval quality against a golden dataset of 25 queries with ground-truth chunk IDs.

**Pipeline:** embed query → hybrid retrieval (FAISS + BM25 with RRF) → cross-encoder reranking → compare against golden chunks
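
The fusion step in the pipeline above can be sketched as weighted Reciprocal Rank Fusion (RRF). This is a minimal illustration, not the repo's actual implementation: the function name `weighted_rrf`, the toy chunk IDs, and the constant `k=60` (a common RRF default) are assumptions. The weights 0.7 and 0.3 are the dense and BM25 weights reported in Section 1.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse several ranked ID lists into one ranking (hypothetical helper).

    Each list contributes weight / (k + rank) to every ID it contains,
    where rank starts at 1. IDs missing from a list contribute nothing.
    """
    scores = defaultdict(float)
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (made-up IDs), best match first.
dense_ranking = ["c3", "c1", "c7"]
bm25_ranking = ["c1", "c9", "c3"]

fused = weighted_rrf([dense_ranking, bm25_ranking], weights=[0.7, 0.3])
print(fused)
```

Because RRF uses only ranks, the dense similarity scores and BM25 scores never need to be put on a common scale. The fused candidates would then go to the cross-encoder for reranking.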

**Prerequisites:**
- Run `python ingest.py` to build the vectorstore
- Run `python eval/generate_chunk_labels.py` to ensure chunk IDs are current
- Start Ollama at `localhost:11434` (only needed for Section 7, end-to-end generation)

## 1. Setup & Configuration

In [1]:
import sys
import json
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

# Resolve repo root: use the working directory, or its parent if the notebook runs from a subfolder
REPO_ROOT = Path.cwd().resolve()
if not (REPO_ROOT / "config.py").exists():
    REPO_ROOT = REPO_ROOT.parent
assert (REPO_ROOT / "config.py").exists(), f"Cannot find repo root (tried {REPO_ROOT})"

if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from config import CHUNK_SIZE, CHUNK_OVERLAP, TOP_K, TOP_K_CANDIDATES, LLM_MODEL, EMBEDDING_MODEL, RERANK_MODEL

print(f"Repo root: {REPO_ROOT}")
print(f"Chunk size: {CHUNK_SIZE} tokens, overlap: {CHUNK_OVERLAP}")
print(f"Retrieval: TOP_K={TOP_K}, candidates={TOP_K_CANDIDATES}")
print(f"Embedding: {EMBEDDING_MODEL}")
print(f"Reranker: {RERANK_MODEL}")
print(f"LLM: {LLM_MODEL}")

Repo root: /Users/dennispudzha/github_projects/rag_science
Chunk size: 500 tokens, overlap: 50
Retrieval: TOP_K=4, candidates=50
Embedding: nomic-embed-text
Reranker: BAAI/bge-reranker-v2-m3
LLM: llama3.1:8b


In [2]:
from retriever import load_vectorstore, build_retriever

vs = load_vectorstore()
retriever = build_retriever(vs)

print(f"FAISS vectors: {vs.index.ntotal:,}")
print(f"BM25 docs: {len(retriever.bm25_docs):,}")
print(f"Weights: dense={retriever.dense_weight}, bm25={retriever.bm25_weight}")

FAISS vectors: 5,892
BM25 docs: 5,892
Weights: dense=0.7, bm25=0.3


## 2. Golden Dataset Overview

In [3]:
golden_path = REPO_ROOT / "eval" / "golden_dataset.json"
golden_dataset = json.loads(golden_path.read_text())

print(f"Queries: {len(golden_dataset)}")
print(f"Fields: {list(golden_dataset[0].keys())}")

# Show how many golden chunks per query
chunk_counts = [len(q.get("expected_chunk_ids", [])) for q in golden_dataset]
print(f"Golden chunks per query: min={min(chunk_counts)}, max={max(chunk_counts)}, mean={sum(chunk_counts)/len(chunk_counts):.1f}")

Queries: 25
Fields: ['question', 'ideal_answer', 'expected_sources', 'expected_pages', 'expected_chunk_ids']
Golden chunks per query: min=1, max=3, mean=2.4
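
Before running the evaluation it can help to sanity-check each golden entry. The sketch below assumes the five fields printed above are all required and that chunk IDs follow the `source|p<page>|<hash>` pattern seen in this dataset. `validate_entry` and the sample entries are hypothetical and not part of the repo.

```python
REQUIRED_FIELDS = {
    "question",
    "ideal_answer",
    "expected_sources",
    "expected_pages",
    "expected_chunk_ids",
}


def validate_entry(entry):
    """Return a list of human-readable problems found in one golden entry."""
    problems = []
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    chunk_ids = entry.get("expected_chunk_ids", [])
    if not chunk_ids:
        problems.append("no golden chunk IDs")
    for cid in chunk_ids:
        # Assumed format: at least two '|' separators (source, page, hash).
        if cid.count("|") < 2:
            problems.append(f"malformed chunk ID: {cid}")
    return problems


# Illustrative entries with made-up values, not taken from the real dataset.
good_entry = {
    "question": "What is X?",
    "ideal_answer": "X is Y.",
    "expected_sources": ["paper.pdf"],
    "expected_pages": [1],
    "expected_chunk_ids": ["paper.pdf|p1|abc123"],
}
bad_entry = {"question": "What is Z?", "expected_chunk_ids": ["paper.pdf-p1"]}

print(validate_entry(good_entry))
print(validate_entry(bad_entry))
```

Running `validate_entry` over every item of `golden_dataset` and printing the non-empty results would surface entries that silently score zero because of a stale or malformed ID.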


In [4]:
# Display first 5 entries
df_golden = pd.DataFrame(golden_dataset)
df_golden[["question", "ideal_answer", "expected_chunk_ids"]].head()

| | question | ideal_answer | expected_chunk_ids |
|---|---|---|---|
| 0 | What CMOS technology node is the ITkPixV2 chip... | ITkPixV2 is fabricated in 65nm CMOS technology. | [2502.05097v1.pdf\|p2\|23b357c95116, 1_ATLAS ITk... |
| 1 | What is the pixel size of ITkPixV2? | ITkPixV2 has a pixel size of 50 by 50 micromet... | [2502.05097v1.pdf\|p2\|23b357c95116, rd53cATLAS1... |
| 2 | How many modules make up the ITk Pixel detecto... | The ITk Pixel detector consists of 9716 module... | [6_ATLAS ITk pixel detector overview.pdf\|p1\|23... |
| 3 | What is the trigger rate requirement for the I... | The trigger rate requirement for ITkPixV2 is 1... | [2502.05097v1.pdf\|p2\|23b357c95116, 6_ATLAS ITk... |
| 4 | What are the specifications of the FELIX FLX-7... | The FLX-712 is a PCIe card with a 16-lane PCIe... | [2_FELIX_the_Detector_Interface_for_the_ATLAS_... |


## 3. Single Query Demo

Walk through one query end-to-end: retrieve chunks, inspect content, compare against golden chunks.

In [5]:
from eval.evaluate import doc_chunk_id, extract_retrieved_chunk_ids

QUERY_IDX = 0
q = golden_dataset[QUERY_IDX]
docs = retriever.invoke(q["question"])
expected = set(q.get("expected_chunk_ids", []))

print(f"Question: {q['question']}")
print(f"Golden answer: {q['ideal_answer']}")
print(f"\nExpected chunks ({len(expected)}):")
for cid in q.get("expected_chunk_ids", []):
    print(f"  {cid}")

print(f"\nRetrieved chunks ({len(docs)}):")
for doc in docs:
    cid = doc_chunk_id(doc)
    tag = "MATCH" if cid in expected else "miss "
    print(f"  [{tag}] {cid}")
    print(f"          {doc.page_content[:120]}...")

Question: What CMOS technology node is the ITkPixV2 chip fabricated in?
Golden answer: ITkPixV2 is fabricated in 65nm CMOS technology.

Expected chunks (3):
  2502.05097v1.pdf|p2|23b357c95116
  1_ATLAS ITk Pixel Detector Overview.pdf|p3|fd78f7da6afe
  2502.05097v1.pdf|p2|4fa71090f2d2

Retrieved chunks (4):
  [miss ] 2502.05097v1.pdf|p1|c6811c3baa8f
          [Paper: Prepared for submission to JINST]
Prepared for submission to JINST

## Page 1

Prepared for submission to JINST
...
  [miss ] Göttingen_RD53B_Seminar_06-21.pdf|p20|0bdcdbefb22b
          [Paper: PARTICLE PHYSICS SEMINAR,]
large size transistors are basically insensitive

to TID

−

For digital design modif...
  [miss ] Mironova_PSD13.pdf|p1|0c2f22d6974b
          [Paper: First results from the ATLAS ITkPixV2]
First results from the ATLAS ITkPixV2

## Page 1

First results from the ...
  [miss ] 6_ATLAS ITk pixel detector overview.pdf|p3|fdb404f03cd8
          [Paper: PoS(ICHEP2020)878] [Section: 3.

Pixel modules]
as well 

## 4. Chunk-Level Retrieval Evaluation

For all 25 queries, retrieve top-K chunks and compare against golden chunk IDs.

**Metrics:**
- **Chunk MRR**: reciprocal rank of the first retrieved chunk that is a golden chunk
- **Chunk Precision@K**: fraction of retrieved chunks that are golden
- **Chunk Recall@K**: fraction of golden chunks that were retrieved
- **Source MRR / Recall@K**: MRR and Recall@K computed at the source (paper) level instead of the chunk level
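
The metric definitions above can be illustrated on plain lists of chunk IDs. This is a minimal sketch with hypothetical function names. The notebook's own helpers in `eval.evaluate` take retrieved documents rather than ID lists and may handle edge cases differently. Returning 0 when nothing matches is an assumption based on the usual definitions.

```python
def reciprocal_rank_ids(expected_ids, retrieved_ids):
    """1 / rank of the first retrieved ID that is golden, or 0.0 if none is."""
    expected = set(expected_ids)
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0


def precision_at_k_ids(expected_ids, retrieved_ids, k):
    """Fraction of the top-k retrieved IDs that are golden."""
    expected = set(expected_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(cid in expected for cid in top_k) / len(top_k)


def recall_at_k_ids(expected_ids, retrieved_ids, k):
    """Fraction of golden IDs that appear in the top-k retrieved IDs."""
    expected = set(expected_ids)
    if not expected:
        return 0.0
    return len(expected & set(retrieved_ids[:k])) / len(expected)


# Toy example with made-up IDs: 3 golden chunks, 4 retrieved chunks.
expected = ["a", "b", "c"]
retrieved = ["x", "b", "a", "y"]

rr = reciprocal_rank_ids(expected, retrieved)             # first hit at rank 2
precision = precision_at_k_ids(expected, retrieved, k=4)  # 2 of 4 retrieved are golden
recall = recall_at_k_ids(expected, retrieved, k=4)        # 2 of 3 golden were retrieved
print(rr, precision, recall)
```

Since each query here has at most 3 golden chunks and `TOP_K=4`, Precision@4 cannot exceed 0.75 for any query, so the precision and recall columns are best read together.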

In [None]:
from eval.evaluate import (
    reciprocal_rank_chunks,
    precision_at_k_chunks,
    recall_at_k_chunks,
    reciprocal_rank,
    recall_at_k,
)

K = retriever.k
rows = []

for q in golden_dataset:
    expected_chunks = q.get("expected_chunk_ids", [])
    expected_sources = q.get("expected_sources", [])
    docs = retriever.invoke(q["question"])

    rows.append({
        "question": q["question"],
        "chunk_mrr": reciprocal_rank_chunks(expected_chunks, docs),
        "chunk_prec": precision_at_k_chunks(expected_chunks, docs, K),
        "chunk_recall": recall_at_k_chunks(expected_chunks, docs, K),
        "src_mrr": reciprocal_rank(expected_sources, docs),
        "src_recall": recall_at_k(expected_sources, docs, K),
    })

df_eval = pd.DataFrame(rows)
df_eval[["question", "chunk_mrr", "chunk_prec", "chunk_recall", "src_mrr", "src_recall"]]

## 5. Aggregate Metrics Summary

In [None]:
metrics = {
    "Chunk MRR": df_eval["chunk_mrr"].mean(),
    f"Chunk Precision@{K}": df_eval["chunk_prec"].mean(),
    f"Chunk Recall@{K}": df_eval["chunk_recall"].mean(),
    "Source MRR": df_eval["src_mrr"].mean(),
    f"Source Recall@{K}": df_eval["src_recall"].mean(),
}

print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for name, val in metrics.items():
    print(f"{name:<25} {val:>8.3f}")

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
names = list(metrics.keys())
values = list(metrics.values())
colors = ["#2196F3"] * 3 + ["#4CAF50"] * 2  # blue for chunk, green for source

bars = ax.barh(names, values, color=colors)
ax.set_xlim(0, 1)
ax.set_xlabel("Score")
ax.set_title("Retrieval Evaluation Metrics")

for bar, val in zip(bars, values):
    ax.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height() / 2,
            f"{val:.3f}", va="center", fontsize=10)

ax.invert_yaxis()
plt.tight_layout()
plt.show()

## 6. Error Analysis

Inspect the five queries with the lowest chunk recall to see where retrieval failed.
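
One way to make the failure mode visible is to grade each miss by how close it came: the right page, the right paper, or an unrelated paper. The sketch below assumes chunk IDs have the form `source|p<page>|<hash>`, as the IDs printed in Section 3 suggest. `split_chunk_id` and `classify_retrieval` are hypothetical helpers, not part of `eval.evaluate`. The sample IDs are the expected and retrieved chunks from the Section 3 demo.

```python
def split_chunk_id(chunk_id):
    """Split 'source|p<page>|<hash>' into (source, page, hash). Format is assumed."""
    source, page, digest = chunk_id.rsplit("|", 2)
    return source, page, digest


def classify_retrieval(expected_ids, retrieved_ids):
    """Label a query by the closest match between golden and retrieved chunks."""
    if set(expected_ids) & set(retrieved_ids):
        return "chunk_hit"
    expected_parts = [split_chunk_id(cid) for cid in expected_ids]
    retrieved_parts = [split_chunk_id(cid) for cid in retrieved_ids]
    expected_pages = {(source, page) for source, page, _ in expected_parts}
    expected_sources = {source for source, _, _ in expected_parts}
    if any((source, page) in expected_pages for source, page, _ in retrieved_parts):
        return "same_page"
    if any(source in expected_sources for source, _, _ in retrieved_parts):
        return "same_source"
    return "wrong_source"


# Expected and retrieved IDs from the single-query demo (QUERY_IDX = 0).
expected = [
    "2502.05097v1.pdf|p2|23b357c95116",
    "1_ATLAS ITk Pixel Detector Overview.pdf|p3|fd78f7da6afe",
    "2502.05097v1.pdf|p2|4fa71090f2d2",
]
retrieved = [
    "2502.05097v1.pdf|p1|c6811c3baa8f",
    "Göttingen_RD53B_Seminar_06-21.pdf|p20|0bdcdbefb22b",
    "Mironova_PSD13.pdf|p1|0c2f22d6974b",
    "6_ATLAS ITk pixel detector overview.pdf|p3|fdb404f03cd8",
]

label = classify_retrieval(expected, retrieved)
print(label)
```

A `same_source` or `same_page` label points at chunk boundaries or reranking, while `wrong_source` points at the candidate retrieval stage.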

In [None]:
# Sort by chunk recall (ascending) to show worst performers first
df_worst = df_eval.sort_values("chunk_recall").head(5)

for _, row in df_worst.iterrows():
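    # Look up the full golden entry for this row by its question text
    q = next(item for item in golden_dataset if item["question"] == row["question"])
    # Re-run retrieval so the returned chunk IDs can be inspected
    docs = retriever.invoke(q["question"])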
    retrieved_ids = extract_retrieved_chunk_ids(docs)
    expected_ids = q.get("expected_chunk_ids", [])

    print(f"Q: {q['question'][:80]}")
    print(f"  Chunk MRR={row['chunk_mrr']:.2f}  Recall={row['chunk_recall']:.2f}  Prec={row['chunk_prec']:.2f}")
    print(f"  Expected:  {expected_ids[:3]}")
    print(f"  Retrieved: {retrieved_ids[:3]}")
    print()

## 7. End-to-End Generation

Run a few queries through the full pipeline (retrieve + generate) and compare generated answers to golden answers.

**Requires Ollama running.**
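
The cell below compares answers by eye. For a rough numeric comparison, one option is token-level F1 between the generated and golden answer. This is a swapped-in technique, not something the notebook computes. The tokenizer (lowercased alphanumeric runs) and the function names are assumptions, and the "generated" string is a made-up example, not model output.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


golden = "ITkPixV2 is fabricated in 65nm CMOS technology."
generated = "The chip is fabricated in 65nm CMOS."  # hypothetical answer

score = token_f1(generated, golden)
print(f"token F1 = {score:.3f}")
```

Token overlap rewards shared wording rather than correctness, so a wrong number with otherwise similar phrasing still scores high. It is best treated as a coarse filter before manual review.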

In [None]:
from langchain_ollama import ChatOllama
from config import OLLAMA_BASE_URL

llm = ChatOllama(model=LLM_MODEL, base_url=OLLAMA_BASE_URL, temperature=0)


def generate_answer(question: str, docs) -> str:
    """Stuff retrieved chunks into a prompt and generate an answer."""
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the provided context. "
        "Be precise and cite specific numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )
    return llm.invoke(prompt).content


SAMPLE_N = 5
for q in golden_dataset[:SAMPLE_N]:
    docs = retriever.invoke(q["question"])
    answer = generate_answer(q["question"], docs)
    retrieved_ids = extract_retrieved_chunk_ids(docs)
    expected_ids = set(q.get("expected_chunk_ids", []))
    chunk_hits = sum(1 for cid in retrieved_ids if cid in expected_ids)

    print(f"Q: {q['question']}")
    print(f"Golden:    {q['ideal_answer'][:150]}")
    print(f"Generated: {answer[:150]}")
    print(f"Chunks: {chunk_hits}/{len(expected_ids)} golden chunks retrieved")
    print("-" * 80)