# Simple RAG with Golden Chunks — Real Papers

This notebook demonstrates a minimal RAG evaluation workflow using the **actual ingested scientific papers** and the **real golden dataset** with ground-truth chunk IDs.

**Evaluation approach:**
1. Load the real FAISS vectorstore and BM25 index (from `ingest.py`)
2. Load the golden dataset with `expected_chunk_ids` (ground-truth evidence)
3. For each query, retrieve chunks and compare against golden chunks
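4. Compute chunk-level Precision@K, Recall@K, and MRR, plus source-level Recall@K and MRR for comparison
5. Run generation with Ollama and display each answer alongside the golden answer and the number of golden chunks retrieved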

**Prerequisites:** Run `python ingest.py` first and ensure Ollama is running.

In [1]:
import sys
import json
import hashlib
from pathlib import Path

import pandas as pd

# Resolve repo root and add to path
REPO_ROOT = Path.cwd().resolve()
if not (REPO_ROOT / "config.py").exists():
    REPO_ROOT = REPO_ROOT.parent
assert (REPO_ROOT / "config.py").exists(), f"Cannot find repo root (tried {REPO_ROOT})"

if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

print(f"Repo root: {REPO_ROOT}")

Repo root: /Users/dennispudzha/github_projects/rag_science


In [2]:
# Load golden dataset
golden_path = REPO_ROOT / "eval" / "golden_dataset.json"
golden_dataset = json.loads(golden_path.read_text())

print(f"Golden dataset: {len(golden_dataset)} queries")
print(f"Fields per entry: {list(golden_dataset[0].keys())}")

df_golden = pd.DataFrame(golden_dataset)
df_golden[["question", "expected_sources", "expected_chunk_ids"]].head()

Golden dataset: 25 queries
Fields per entry: ['question', 'ideal_answer', 'expected_sources', 'expected_pages', 'expected_chunk_ids']


index,question,expected_sources,expected_chunk_ids
0,What CMOS technology node is the ITkPixV2 chip...,"[2502.05097v1.pdf, 1_ATLAS ITk Pixel Detector ...","[2502.05097v1.pdf|p2|9c1b459136ea, 1_ATLAS ITk..."
1,What is the pixel size of ITkPixV2?,"[2502.05097v1.pdf, rd53cATLAS1v92.pdf, 6_ATLAS...","[2502.05097v1.pdf|p2|9c1b459136ea, 6_ATLAS ITk..."
2,How many modules make up the ITk Pixel detecto...,"[1_Alkakhi_2024_J._Inst._19_C11013.pdf, 6_ATLA...",[6_ATLAS ITk pixel detector overview.pdf|p1|52...
3,What is the trigger rate requirement for the I...,"[2502.05097v1.pdf, introduction_guide.pdf, 6_A...","[2502.05097v1.pdf|p2|dc807300811f, 2502.05097v..."
4,What are the specifications of the FELIX FLX-7...,[2_FELIX_the_Detector_Interface_for_the_ATLAS_...,[2_FELIX_the_Detector_Interface_for_the_ATLAS_...
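
Before evaluating, it can help to sanity-check that every golden entry carries the five fields printed above and that its chunk IDs follow the `source|pN|hash` pattern visible in the table. The sketch below is a hypothetical helper (`validate_entry` is not part of the repo). It assumes chunk IDs always have three `|`-separated parts with a page part of the form `p<number>`. The `expected_pages` value in the example entry is illustrative only.

```python
REQUIRED_FIELDS = {
    "question", "ideal_answer", "expected_sources",
    "expected_pages", "expected_chunk_ids",
}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one golden entry (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for cid in entry.get("expected_chunk_ids", []):
        # Assumed layout: <source file>|p<page number>|<content hash>.
        # rsplit keeps any '|' inside the file name in the first part.
        parts = cid.rsplit("|", 2)
        page_ok = len(parts) == 3 and parts[1].startswith("p") and parts[1][1:].isdigit()
        if not page_ok:
            problems.append(f"unexpected chunk ID format: {cid!r}")
    return problems

# Illustrative entry; the expected_pages value is made up for this example.
example = {
    "question": "What is the pixel size of ITkPixV2?",
    "ideal_answer": "ITkPixV2 has a pixel size of 50 by 50 micrometers.",
    "expected_sources": ["2502.05097v1.pdf"],
    "expected_pages": [2],
    "expected_chunk_ids": ["2502.05097v1.pdf|p2|9c1b459136ea"],
}
print(validate_entry(example))
```

Running it over `golden_dataset` and printing only the non-empty results would flag malformed entries before they distort the metrics.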


## 1) Load Real Vectorstore and Build Retriever

Load the FAISS index, BM25 index, and cross-encoder — the same components used by `query.py`.

In [3]:
from query import load_vectorstore, build_retriever

vs = load_vectorstore()
retriever = build_retriever(vs)

# Show vectorstore stats
n_vectors = vs.index.ntotal
print(f"FAISS vectors: {n_vectors}")
print(f"BM25 docs: {len(retriever.bm25_docs)}")
print(f"TOP_K = {retriever.k}, TOP_K_CANDIDATES = {retriever.k_candidates}")
print(f"Weights: dense={retriever.dense_weight}, bm25={retriever.bm25_weight}")

BertForSequenceClassification LOAD REPORT from: cross-encoder/ms-marco-MiniLM-L-6-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


FAISS vectors: 2946
BM25 docs: 2946
TOP_K = 4, TOP_K_CANDIDATES = 50
Weights: dense=0.7, bm25=0.3
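
The printed weights suggest that dense and BM25 scores are blended before the top `k` results are kept. How `build_retriever` actually combines them is not shown in this notebook, so the sketch below is only one common approach: min-max normalize each score set, then take a weighted sum. The function names are hypothetical, scores are assumed to be "higher is better", and the cross-encoder reranking step is left out.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {cid: (s - lo) / span if span else 0.0 for cid, s in scores.items()}

def fuse(dense: dict[str, float], bm25: dict[str, float],
         dense_weight: float = 0.7, bm25_weight: float = 0.3,
         k: int = 4) -> list[str]:
    """Weighted-sum fusion of two score maps keyed by chunk ID (assumed scheme)."""
    d, b = min_max(dense), min_max(bm25)
    combined = {
        cid: dense_weight * d.get(cid, 0.0) + bm25_weight * b.get(cid, 0.0)
        for cid in set(d) | set(b)
    }
    # Sort by descending fused score; break ties by chunk ID for determinism.
    return sorted(combined, key=lambda cid: (-combined[cid], cid))[:k]

# Toy scores for four candidate chunks (illustrative values).
dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25_scores = {"b": 12.0, "c": 3.0, "d": 8.0}
print(fuse(dense_scores, bm25_scores, k=2))
```

A chunk that appears in only one candidate list receives 0.0 from the other retriever, which is a design choice of this sketch rather than a property of the repo's retriever.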


## 2) Chunk-Level Retrieval Evaluation

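For each golden query, retrieve the top-K chunks and compare their IDs against `expected_chunk_ids`. Source-level MRR and Recall@K are computed alongside as a coarser reference point. This is the core evaluation: **if retrieval misses the evidence, generation has nothing reliable to ground its answer on.**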
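
As a reference for what the chunk-level metrics measure, here is a minimal sketch that works on plain lists of chunk IDs. These helpers are hypothetical stand-ins, not the functions imported from `eval.evaluate`, which take retrieved documents rather than IDs and may use different denominators or edge-case handling.

```python
def precision_at_k_ids(expected: list[str], retrieved: list[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are golden IDs."""
    if k <= 0:
        return 0.0
    expected_set = set(expected)
    hits = sum(1 for cid in retrieved[:k] if cid in expected_set)
    return hits / k

def recall_at_k_ids(expected: list[str], retrieved: list[str], k: int) -> float:
    """Fraction of golden IDs that appear in the top-k retrieved IDs."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(expected_set & top_k) / len(expected_set)

def reciprocal_rank_ids(expected: list[str], retrieved: list[str]) -> float:
    """1 / rank of the first golden ID in the retrieved list, or 0.0 if none."""
    expected_set = set(expected)
    for rank, cid in enumerate(retrieved, start=1):
        if cid in expected_set:
            return 1.0 / rank
    return 0.0

# Toy example: two of three golden chunks are retrieved, the first at rank 2.
expected = ["a", "b", "c"]
retrieved = ["x", "b", "y", "a"]
print(precision_at_k_ids(expected, retrieved, 4))  # 2 hits / 4 retrieved
print(recall_at_k_ids(expected, retrieved, 4))     # 2 hits / 3 golden
print(reciprocal_rank_ids(expected, retrieved))    # first hit at rank 2
```

MRR in the summary table is the mean of the per-query reciprocal rank.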

In [4]:
from eval.evaluate import (
    doc_chunk_id,
    extract_retrieved_chunk_ids,
    reciprocal_rank_chunks,
    precision_at_k_chunks,
    recall_at_k_chunks,
    reciprocal_rank,
    recall_at_k,
)

K = retriever.k
rows = []

for q in golden_dataset:
    question = q["question"]
    expected_chunks = q.get("expected_chunk_ids", [])
    expected_sources = q.get("expected_sources", [])

    docs = retriever.invoke(question)
    retrieved_ids = extract_retrieved_chunk_ids(docs)
    retrieved_sources = [d.metadata.get("source", "?") for d in docs]

    chunk_mrr = reciprocal_rank_chunks(expected_chunks, docs)
    chunk_prec = precision_at_k_chunks(expected_chunks, docs, K)
    chunk_rec = recall_at_k_chunks(expected_chunks, docs, K)
    src_mrr = reciprocal_rank(expected_sources, docs)
    src_recall = recall_at_k(expected_sources, docs, K)

    rows.append({
        "question": question[:70],
        "chunk_mrr": chunk_mrr,
        f"chunk_prec@{K}": chunk_prec,
        f"chunk_recall@{K}": chunk_rec,
        "src_mrr": src_mrr,
        f"src_recall@{K}": src_recall,
        "retrieved_chunks": retrieved_ids[:3],
        "expected_chunks": expected_chunks[:3],
    })

df_eval = pd.DataFrame(rows)
df_eval

index,question,chunk_mrr,chunk_prec@4,chunk_recall@4,src_mrr,src_recall@4,retrieved_chunks,expected_chunks
0,What CMOS technology node is the ITkPixV2 chip...,0.0,0.0,0.0,1.0,0.5,"[2502.05097v1.pdf|p1|c6811c3baa8f, ATL-ITK-SLI...","[2502.05097v1.pdf|p2|9c1b459136ea, 1_ATLAS ITk..."
1,What is the pixel size of ITkPixV2?,0.0,0.0,0.0,0.0,0.0,[1_ATLAS ITk Pixel Detector Overview.pdf|p3|f4...,"[2502.05097v1.pdf|p2|9c1b459136ea, 6_ATLAS ITk..."
2,How many modules make up the ITk Pixel detecto...,0.0,0.0,0.0,1.0,0.666667,[6_ATLAS ITk pixel detector overview.pdf|p2|6a...,[6_ATLAS ITk pixel detector overview.pdf|p1|52...
3,What is the trigger rate requirement for the I...,0.0,0.0,0.0,1.0,0.666667,"[2502.05097v1.pdf|p2|23b357c95116, introductio...","[2502.05097v1.pdf|p2|dc807300811f, 2502.05097v..."
4,What are the specifications of the FELIX FLX-7...,0.0,0.0,0.0,1.0,1.0,[2_FELIX_the_Detector_Interface_for_the_ATLAS_...,[2_FELIX_the_Detector_Interface_for_the_ATLAS_...
5,How many Optoboards are required for the ITk P...,0.0,0.0,0.0,1.0,0.5,[1_ACES20200528_Poster_MC_v1.pdf|p1|6a8e87b65e...,[1_ACES20200528_Poster_MC_v1.pdf|p1|0becaab802...
6,What radiation tolerance is required for the i...,0.0,0.0,0.0,1.0,0.75,"[introduction_guide.pdf|p6|481a7cdd3f66, 2502....",[6_ATLAS ITk pixel detector overview.pdf|p1|52...
7,What powering scheme do RD53 chips use and why?,0.0,0.0,0.0,1.0,0.333333,"[introduction_guide.pdf|p7|17087fb1a710, Loddo...","[introduction_guide.pdf|p6|b0b703129bb2, intro..."
8,What are the differences between the 3D and pl...,0.0,0.0,0.0,1.0,0.666667,[6_ATLAS ITk pixel detector overview.pdf|p3|fd...,[6_ATLAS ITk pixel detector overview.pdf|p2|ce...
9,What threshold and noise performance was measu...,0.0,0.0,0.0,0.0,0.0,"[Mironova_PSD13.pdf|p10|7eb7903d1a9f, Mironova...","[ITkPixV2_Mironova.pdf|p9|8e08ac4b4f9c, ITkPix..."


## 3) Aggregate Metrics Summary

In [5]:
metrics = {
    "Chunk MRR": df_eval["chunk_mrr"].mean(),
    f"Chunk Precision@{K}": df_eval[f"chunk_prec@{K}"].mean(),
    f"Chunk Recall@{K}": df_eval[f"chunk_recall@{K}"].mean(),
    "Source MRR": df_eval["src_mrr"].mean(),
    f"Source Recall@{K}": df_eval[f"src_recall@{K}"].mean(),
}

print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for name, val in metrics.items():
    print(f"{name:<25} {val:>8.3f}")

# Highlight queries where chunk retrieval completely missed
misses = df_eval[df_eval["chunk_mrr"] == 0.0]
if len(misses) > 0:
    print(f"\nChunk-level misses: {len(misses)}/{len(df_eval)}")
    for _, row in misses.iterrows():
        print(f"  - {row['question']}")

Metric                       Value
-----------------------------------
Chunk MRR                    0.000
Chunk Precision@4            0.000
Chunk Recall@4               0.000
Source MRR                   0.660
Source Recall@4              0.497

Chunk-level misses: 25/25
  - What CMOS technology node is the ITkPixV2 chip fabricated in?
  - What is the pixel size of ITkPixV2?
  - How many modules make up the ITk Pixel detector and how are they arran
  - What is the trigger rate requirement for the ITkPixV2 chip and how doe
  - What are the specifications of the FELIX FLX-712 PCIe card used in the
  - How many Optoboards are required for the ITk Pixel Detector and what d
  - What radiation tolerance is required for the inner pixel layers at the
  - What powering scheme do RD53 chips use and why?
  - What are the differences between the 3D and planar sensor technologies
  - What threshold and noise performance was measured for ITkPixV2?
  - What is the total pixel array size of the ATLAS
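
Source-level metrics are non-zero while every chunk-level metric is zero, so it is worth checking at which granularity the match breaks down. The sketch below splits IDs on the `source|pN|hash` layout seen in the tables and reports the finest level at which an expected and a retrieved ID agree. `split_chunk_id` and `match_level` are hypothetical helpers, and the layout itself is an assumption read off the printed IDs. The example pairs are taken from rows 3 and 0 of the evaluation table.

```python
def split_chunk_id(cid: str) -> tuple[str, str, str]:
    """Split '<source>|p<page>|<hash>' into its three parts (assumed layout)."""
    source, page, digest = cid.rsplit("|", 2)
    return source, page, digest

def match_level(expected_id: str, retrieved_id: str) -> str:
    """Return the finest granularity at which two chunk IDs agree."""
    if expected_id == retrieved_id:
        return "chunk"
    e_source, e_page, _ = split_chunk_id(expected_id)
    r_source, r_page, _ = split_chunk_id(retrieved_id)
    if e_source == r_source and e_page == r_page:
        return "page"
    if e_source == r_source:
        return "source"
    return "none"

# Row 3: same file and page, different hash.
print(match_level("2502.05097v1.pdf|p2|dc807300811f",
                  "2502.05097v1.pdf|p2|23b357c95116"))
# Row 0: same file, different page.
print(match_level("2502.05097v1.pdf|p2|9c1b459136ea",
                  "2502.05097v1.pdf|p1|c6811c3baa8f"))
```

If many pairs agree at the page level but never at the chunk level, one possible explanation is that the golden IDs were produced by a different chunking run than the current index. That would need to be confirmed against `ingest.py`.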

## 4) Detailed Chunk Comparison for a Single Query

Pick one query and inspect exactly which chunks were retrieved vs expected — useful for debugging retrieval misses.

In [6]:
QUERY_IDX = 0  # Change this to inspect different queries

q = golden_dataset[QUERY_IDX]
docs = retriever.invoke(q["question"])
expected_ids = q.get("expected_chunk_ids", [])
expected_set = set(expected_ids)  # set for fast membership checks

print(f"Question: {q['question']}\n")
print("Expected chunk IDs:")
for cid in expected_ids:
    print(f"  {cid}")

print("\nRetrieved chunk IDs:")
for doc in docs:
    cid = doc_chunk_id(doc)
    # Mark retrieved chunks whose ID exactly matches a golden chunk ID
    match = "MATCH" if cid in expected_set else "     "
    print(f"  [{match}] {cid}")
    print(f"          {doc.page_content[:120]}...")
    print()

Question: What CMOS technology node is the ITkPixV2 chip fabricated in?

Expected chunk IDs:
  2502.05097v1.pdf|p2|9c1b459136ea
  1_ATLAS ITk Pixel Detector Overview.pdf|p3|175a786d050a
  1_ATLAS ITk Pixel Detector Overview.pdf|p3|d7601bc75775

Retrieved chunk IDs:
  [     ] 2502.05097v1.pdf|p1|c6811c3baa8f
          [Paper: Prepared for submission to JINST]
          Prepared for submission to JINST

          ## Page 1

          Prepared for submission to JINST
          ...

  [     ] ATL-ITK-SLIDE-2022-287.pdf|p9|9ccf7b057cc2
          [Paper: ATLAS ITk Pixel Detector]
          test finalized for all sites

          • Most of flip chip bonding made in companies: SnAg and ...

  [     ] Loddo_PSD13.pdf|p17|29f09f4b0864
          [Paper: RD53 pixel chips for the ATLAS and CMS]
          3-8, 2023

          F.Loddo INFN-Bari

          17

          •

          The readout chips for the ATLAS and...

  [     ] Loddo_PSD13.pdf|p1|ecad74c146f3
          [Paper: RD53 pixel chips for the ATLAS and CMS]
          RD53 pixel chips for the ATLAS and CMS

          ## Page 1

          RD53 pixel chips for ...



## 5) End-to-End Generation with Ollama

Run the full pipeline: retrieve chunks, then generate an answer using the local LLM. Compare against the golden `ideal_answer`.
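
The comparison against `ideal_answer` in the cell below is done by eye. For a rough numeric signal, one option is a token-overlap F1 between the generated and golden answers. This is a hedged sketch, not something the notebook or repo computes: `token_f1` is a hypothetical helper, tokenization is a simple lowercase alphanumeric split, and lexical overlap does not capture paraphrases or factual correctness.

```python
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(generated: str, golden: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    gen, gold = tokens(generated), tokens(golden)
    if not gen or not gold:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(gen) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Golden and generated answers for the first sample query in this notebook.
golden = "ITkPixV2 is fabricated in 65nm CMOS technology."
generated = "The ITkPixV2 chip is fabricated in 65nm feature size CMOS technology."
print(round(token_f1(generated, golden), 3))
```

A score like this is best treated as a screening aid for spotting clearly off-target answers, with manual review for anything borderline.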

In [7]:
from langchain_ollama import ChatOllama
from config import LLM_MODEL, OLLAMA_BASE_URL

llm = ChatOllama(model=LLM_MODEL, base_url=OLLAMA_BASE_URL, temperature=0)

def generate_answer(question: str, docs) -> str:
    """Simple RAG generation: stuff retrieved chunks into a prompt."""
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the provided context. "
        "Be precise and cite specific numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )
    return llm.invoke(prompt).content

# Run on a few sample queries
SAMPLE_N = 5
for q in golden_dataset[:SAMPLE_N]:
    docs = retriever.invoke(q["question"])
    answer = generate_answer(q["question"], docs)
    retrieved_ids = extract_retrieved_chunk_ids(docs)
    expected_ids = set(q.get("expected_chunk_ids", []))
    chunk_hits = sum(1 for cid in retrieved_ids if cid in expected_ids)

    print(f"Q: {q['question']}")
    print(f"Golden: {q['ideal_answer'][:150]}")
    print(f"Generated: {answer[:150]}")
    print(f"Chunks: {chunk_hits}/{len(expected_ids)} golden chunks retrieved")
    print("-" * 80)

Q: What CMOS technology node is the ITkPixV2 chip fabricated in?
Golden: ITkPixV2 is fabricated in 65nm CMOS technology.
Generated: The ITkPixV2 chip is fabricated in 65nm feature size CMOS technology. (Source: Page 2 of the provided context)
Chunks: 0/3 golden chunks retrieved
--------------------------------------------------------------------------------
Q: What is the pixel size of ITkPixV2?
Golden: ITkPixV2 has a pixel size of 50 by 50 micrometers.
Generated: The context does not explicitly mention "ITkPixV2", but it mentions "Pixel size" in two tables. 

In both tables, the pixel size is mentioned as 50 × 
Chunks: 0/3 golden chunks retrieved
--------------------------------------------------------------------------------
Q: How many modules make up the ITk Pixel detector and how are they arranged?
Golden: The ITk Pixel detector consists of 9716 modules arranged in 5 cylindrical barrel layers, covering approximately 13 square meters of instrumented area 
Generated: According to th