# 02. Evaluate Retrieval (Invest RAG)

Measure retrieval quality **before** adding generation:
- **Recall@k**, **MRR@k**
- Compare **Vector baseline** vs **LLM rerank (Top-1 promotion)**

> This notebook is orchestration-only. All logic lives under `src/`.


## Evaluation Design

- Corpus: 5 companies (10-K filings)
- ~4,000 chunks
- 40 evaluation questions
- Labels:
  - `gold_doc_ids` (section-level)
  - `gold_chunk_ids` (chunk-level)

We evaluate retrieval at two levels:
- **Doc-level** (routing correctness)
- **Chunk-level** (fine-grained grounding precision)
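
The two metrics can be sketched as follows. `run_eval_suite` is not shown in this notebook, so this is an assumption about its behavior rather than its actual code: Recall@k is treated as a hit rate (a question counts as a hit if any gold id appears in the top k), which matches the any-match test used by the failure logger and the fact that Recall@1 equals MRR@1 in the result tables. The helper names `hit_at_k`, `rr_at_k`, and `mean_metric` are hypothetical.

```python
from typing import Iterable, Sequence


def hit_at_k(pred_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """1.0 if any gold id appears in the top-k predictions, else 0.0."""
    gold = set(gold_ids)
    return float(any(p in gold for p in pred_ids[:k]))


def rr_at_k(pred_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Reciprocal rank of the first gold id within the top-k, else 0.0."""
    gold = set(gold_ids)
    for rank, p in enumerate(pred_ids[:k], start=1):
        if p in gold:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, runs, k: int) -> float:
    """Average a per-question metric over (pred_ids, gold_ids) pairs."""
    return sum(metric(pred, gold, k) for pred, gold in runs) / len(runs)


# Toy data with made-up ids: two questions.
runs = [
    (["a", "b", "c"], ["b"]),  # gold id found at rank 2
    (["x", "y", "z"], ["q"]),  # gold id not retrieved
]
recall_at_3 = mean_metric(hit_at_k, runs, 3)  # 0.5
mrr_at_3 = mean_metric(rr_at_k, runs, 3)      # 0.25
```

Under this definition Recall@k can only grow with k, while MRR@k also rewards placing the gold id earlier in the list.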

## Environment Setup

Resolve the project root and define the index and evaluation paths.

In [2]:
# Standard library
from pathlib import Path
import json

# Third-party
import pandas as pd

# Local modules
from src.data_pipeline.io_utils import read_jsonl
from src.llm.embedding import embed_query
from src.retrieval.vector_store import VectorStore
from src.eval.search_wrappers import (
    make_vectorstore_search_fn,
    make_llm_rerank_search_fn,
)
from src.eval.retrieval_eval import run_eval_suite

In [3]:
PROJECT_ROOT = Path.cwd()
assert (PROJECT_ROOT / "src").exists(), (
    f"Run this notebook from project root (invest-rag/). Current cwd={PROJECT_ROOT}"
)

INDEX_DIR  = PROJECT_ROOT / "indexes" / "faiss"
INDEX_PATH = INDEX_DIR / "index.bin"
META_PATH  = INDEX_DIR / "meta.jsonl"
EVAL_PATH  = PROJECT_ROOT / "eval" / "questions.jsonl"

print("PROJECT_ROOT:", PROJECT_ROOT)

PROJECT_ROOT: c:\Users\CG\Desktop\invest-rag


## Load Evaluation Data & Index

Load:
- evaluation questions
- FAISS vector index
- metadata mapping

In [None]:
# Load data + index
questions = read_jsonl(EVAL_PATH)
vs = VectorStore.load(index_path=INDEX_PATH, meta_path=META_PATH)

print("n_questions:", len(questions))


n_questions: 40


## Build `search_fn` (injection)

The evaluator controls the cutoff **k** by calling `search_fn(query, k)`. Each search function is injected as a callable that returns a ranked list of result dicts.
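
A minimal sketch of that contract, assuming each result is a dict with `doc_id`, `chunk_id`, and `score` keys (the keys that appear in the failure-log previews). `SearchFn` and `make_toy_search_fn` are hypothetical names for illustration and are not part of `src/`.

```python
from typing import Callable, Dict, List

# Hypothetical type alias for the injected callable.
SearchFn = Callable[[str, int], List[Dict]]


def make_toy_search_fn(ranked: List[Dict]) -> SearchFn:
    """Build a search_fn that ignores the query and serves a fixed ranking."""
    def search_fn(query: str, k: int) -> List[Dict]:
        # The caller decides the cutoff; the function only truncates.
        return ranked[:k]
    return search_fn


toy_search_fn = make_toy_search_fn([
    {"doc_id": "d1", "chunk_id": "d1_c0", "score": 0.9},
    {"doc_id": "d1", "chunk_id": "d1_c1", "score": 0.8},
    {"doc_id": "d2", "chunk_id": "d2_c0", "score": 0.7},
])
top2 = toy_search_fn("any question", 2)
```

Because the evaluator only depends on this call shape, the vector baseline and the reranker can be swapped without changing the evaluation code.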


In [None]:
# Baseline: vector search
vector_search_fn = make_vectorstore_search_fn(
    vs,
    embed_query=embed_query,
    normalize=True,   # keep consistent with index build
)

# Rerank v1: vector candidates -> LLM chooses best -> promote to rank #1
rerank_search_fn = make_llm_rerank_search_fn(
    vector_search_fn,
    k_vec=10,
    rerank_model="gpt-4.1-mini", 
)
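
The Top-1 promotion step can be illustrated in isolation. This is a sketch under stated assumptions, not the code behind `make_llm_rerank_search_fn`: the LLM call is replaced by a fixed `best_index`, and `promote_to_top` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def promote_to_top(candidates: Sequence[T], best_index: int) -> List[T]:
    """Move the chosen candidate to rank 1 and keep the rest in vector order."""
    if not 0 <= best_index < len(candidates):
        # Invalid choice (e.g. an unparseable LLM answer): keep the baseline order.
        return list(candidates)
    rest = [c for i, c in enumerate(candidates) if i != best_index]
    return [candidates[best_index]] + rest


baseline = ["c0", "c1", "c2", "c3"]
reranked = promote_to_top(baseline, 2)  # ["c2", "c0", "c1", "c3"]
```

Promotion only reorders the candidate list, so the set of retrieved ids is the same before and after reranking.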


## Evaluation Helpers

Utility functions for:
- switching between doc-level and chunk-level evaluation
- normalizing question format for the evaluator
- performing lightweight sanity checks on retrieval outputs

In [None]:
KS_DOC   = (1, 3, 5, 10)
KS_CHUNK = (1, 3, 5, 10)   


def prepare_questions_for_level(questions, level: str):
    """Return (questions_eval, id_key, dedupe) for doc- or chunk-level evaluation."""
    if level == "doc":
        questions_eval = []
        for q in questions:
            qq = dict(q)
            qq["question"] = q["query"]   # alias: downstream code reads "question"
            questions_eval.append(qq)
        return questions_eval, "doc_id", True

    if level == "chunk":
        questions_eval = []
        for q in questions:
            qq = dict(q)
            qq["question"] = q["query"]
            # Gold labels are read from "gold_doc_ids", so chunk-level labels go there.
            qq["gold_doc_ids"] = q.get("gold_chunk_ids", [])
            questions_eval.append(qq)
        return questions_eval, "chunk_id", False

    raise ValueError(f"Unknown level: {level}")


def quick_sanity_check(search_fn, q_item, id_key: str, n=5):
    # q_item uses standardized key: "question"
    sample = search_fn(q_item["question"], n)
    got = [r.get(id_key, None) for r in sample]
    print(f"[Sanity] qid={q_item.get('qid')} | id_key={id_key} | sample_ids={got}")
    if any(x is None for x in got):
        print(f"⚠️ WARNING: Some results missing '{id_key}'. Eval may be broken.")
    return sample
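
`prepare_questions_for_level` returns `dedupe=True` only for the doc level. A plausible reason, stated here as an assumption because `run_eval_suite` is not shown, is that several retrieved chunks can share one `doc_id`, so repeated ids would otherwise use up rank positions. The sketch below shows order-preserving deduplication with a hypothetical helper `dedupe_keep_order`.

```python
from typing import Hashable, Iterable, List


def dedupe_keep_order(ids: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated ids while keeping the first occurrence of each."""
    seen = set()
    out = []
    for x in ids:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


# doc_ids of five retrieved chunks (made-up values).
doc_ids = ["d1", "d1", "d2", "d1", "d3"]
unique_docs = dedupe_keep_order(doc_ids)  # ["d1", "d2", "d3"]
```

Without deduplication the top 3 of this list covers only `d1` and `d2`. With it, `d3` moves up to rank 3.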

## Run Evaluation (Doc / Chunk)

Run the same evaluation protocol for both levels and save results to `eval/`.

In [None]:
def run_level_suite(*, questions, level: str, ks, vector_search_fn, rerank_search_fn, out_dir: Path):
    questions_eval, id_key, dedupe = prepare_questions_for_level(questions, level)

    print(f"[Eval] level={level} | id_key={id_key} | dedupe={dedupe} | n={len(questions_eval)} | ks={ks}")

    # --- sanity check 
    _ = quick_sanity_check(vector_search_fn, questions_eval[0], id_key=id_key, n=3)
    _ = quick_sanity_check(rerank_search_fn, questions_eval[0], id_key=id_key, n=3)

    out_vec = out_dir / f"results_vector_{level}.json"
    out_rr  = out_dir / f"results_rerank_llm_{level}.json"

    suite_vec = run_eval_suite(
        questions_eval,
        ks=ks,
        search_fn=vector_search_fn,
        out_path=out_vec,
        id_key=id_key,
        dedupe=dedupe,
    )

    suite_rr = run_eval_suite(
        questions_eval,
        ks=ks,
        search_fn=rerank_search_fn,
        out_path=out_rr,
        id_key=id_key,
        dedupe=dedupe,
    )

    return {
        "level": level,
        "id_key": id_key,
        "dedupe": dedupe,
        "ks": ks,
        "suite_vec": suite_vec,
        "suite_rr": suite_rr,
        "out_vec": str(out_vec),
        "out_rr": str(out_rr),
    }

## Failure Logs

Save failed queries as JSONL for quick inspection:
- k=10 → coverage failures (no gold id anywhere in the top 10)
- k=1 → top-1 misses (ranking failures, plus any coverage failures)

In [None]:
def save_fail_cases(*, questions, level: str, k: int, search_fn, out_path: Path):
    questions_eval, id_key, _dedupe = prepare_questions_for_level(questions, level)

    fails = []
    for q in questions_eval:
        gold = set(map(str, q.get("gold_doc_ids", [])))
        if not gold:
            continue

        results = search_fn(q["question"], k) 
        pred = [str(r.get(id_key, "")) for r in results if r.get(id_key) is not None]

        if not any(p in gold for p in pred):
            fails.append({
                "qid": q.get("qid"),
                "tier": q.get("tier"),
                "type": q.get("type"),
                "question": q.get("question"),
                "notes": q.get("notes"),
                "level": level,
                "k": k,
                "id_key": id_key,
                "gold_ids": sorted(gold)[:50],  # sorted so the truncation is deterministic
                "pred_ids": pred,
                "top_results_preview": [
                    {
                        id_key: r.get(id_key),
                        "doc_id": r.get("doc_id"),
                        "chunk_id": r.get("chunk_id"),
                        "score": r.get("score"),
                    }
                    for r in results[:5]
                ],
            })

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for row in fails:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    print(f"[FailLog] saved={len(fails)} -> {out_path}")
    return fails
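
The two logs can be combined to separate pure ranking failures from coverage failures. The sketch below assumes both logs come from the same search function and level, and that every row carries a non-null `qid` (as written by `save_fail_cases`). `split_failures` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def split_failures(
    fails_k1: List[Dict], fails_k10: List[Dict]
) -> Tuple[List[Dict], List[Dict]]:
    """Split top-1 misses into (ranking_failures, coverage_failures)."""
    # A question that still misses at k=10 is a coverage failure.
    coverage_qids = {row["qid"] for row in fails_k10}
    ranking = [row for row in fails_k1 if row["qid"] not in coverage_qids]
    coverage = [row for row in fails_k1 if row["qid"] in coverage_qids]
    return ranking, coverage


# Toy rows with made-up qids.
fails_k1 = [{"qid": "q1"}, {"qid": "q2"}, {"qid": "q3"}]
fails_k10 = [{"qid": "q3"}]
ranking_fails, coverage_fails = split_failures(fails_k1, fails_k10)
```

Here `q1` and `q2` are ranking failures (missed at rank 1 but found by rank 10), and `q3` is a coverage failure.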

## Metric Comparison Table

Summarize Recall@k and MRR@k for:

- **Vector baseline**
- **Vector + LLM rerank**

We also report:
- ΔRecall = Rerank − Vector
- ΔMRR = Rerank − Vector

To avoid re-running expensive LLM reranking, we load cached evaluation results from `eval/`.

In [None]:
from IPython.display import display

EVAL_DIR = PROJECT_ROOT / "eval"

def load_json(path: Path) -> dict:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def suite_to_df(suite: dict, label: str):
    rows = []
    for k in suite["ks"]:
        r = suite["results"][str(k)]
        rows.append({
            "model": label,
            "k": int(k),
            "n": r.get("n"),
            "recall_at_k": r.get("recall_at_k"),
            "mrr_at_k": r.get("mrr_at_k"),
            "n_fail": r.get("n_fail"),
        })
    return pd.DataFrame(rows).sort_values(["k"])

def compare_table(suite_vec: dict, suite_rr: dict):
    dfv = suite_to_df(suite_vec, "Vector")
    dfr = suite_to_df(suite_rr, "Rerank")

    df = dfv.merge(dfr, on="k", suffixes=("_vec", "_rr"))
    df["ΔRecall"] = df["recall_at_k_rr"] - df["recall_at_k_vec"]
    df["ΔMRR"]    = df["mrr_at_k_rr"] - df["mrr_at_k_vec"]

    out = df[[
        "k",
        "recall_at_k_vec", "mrr_at_k_vec",
        "recall_at_k_rr",  "mrr_at_k_rr",
        "ΔRecall", "ΔMRR"
    ]].copy()

    for c in ["recall_at_k_vec","mrr_at_k_vec","recall_at_k_rr","mrr_at_k_rr","ΔRecall","ΔMRR"]:
        out[c] = out[c].astype(float).round(4)

    return out

def load_and_display(level: str):
    vec_path = EVAL_DIR / f"results_vector_{level}.json"
    rr_path  = EVAL_DIR / f"results_rerank_llm_{level}.json"

    if not vec_path.exists() or not rr_path.exists():
        raise FileNotFoundError(
            f"Missing results for level='{level}'.\n"
            f"- {vec_path}\n- {rr_path}\n"
            "Run the evaluation cell first."
        )

    suite_vec = load_json(vec_path)
    suite_rr  = load_json(rr_path)

    print(f"\n=== {level.upper()}-LEVEL ===")
    display(compare_table(suite_vec, suite_rr))

# Show both
load_and_display("doc")
load_and_display("chunk")


=== DOC-LEVEL ===


| k | recall_at_k_vec | mrr_at_k_vec | recall_at_k_rr | mrr_at_k_rr | ΔRecall | ΔMRR |
|---|---|---|---|---|---|---|
| 1 | 0.75 | 0.75 | 0.875 | 0.875 | 0.125 | 0.125 |
| 3 | 0.85 | 0.8 | 0.925 | 0.9 | 0.075 | 0.1 |
| 5 | 0.9 | 0.8188 | 0.95 | 0.9125 | 0.05 | 0.0938 |
| 10 | 0.975 | 0.8375 | 0.975 | 0.9208 | 0.0 | 0.0833 |



=== CHUNK-LEVEL ===


| k | recall_at_k_vec | mrr_at_k_vec | recall_at_k_rr | mrr_at_k_rr | ΔRecall | ΔMRR |
|---|---|---|---|---|---|---|
| 1 | 0.15 | 0.15 | 0.35 | 0.35 | 0.2 | 0.2 |
| 3 | 0.425 | 0.2792 | 0.525 | 0.4375 | 0.1 | 0.1583 |
| 5 | 0.525 | 0.3017 | 0.575 | 0.4488 | 0.05 | 0.1471 |
| 10 | 0.625 | 0.3135 | 0.625 | 0.4544 | 0.0 | 0.1408 |


## Results Interpretation

### Doc-level Retrieval

At the document level, vector retrieval already achieves high coverage by k=10 (R@10 ≈ 0.98), indicating that relevant sections are almost always present in the candidate set.

LLM reranking improves:
- **R@1** (0.75 → 0.875, i.e. 5 more of the 40 questions answered at rank 1)
- **MRR** at every k (+0.08 to +0.13)

This suggests that the primary bottleneck at the doc level is **ranking quality rather than candidate generation**.  
Reranking promotes the correct document to the top when it is already within the candidate pool.

---

### Chunk-level Retrieval

At the chunk level, recall remains substantially lower even at k=10 (R@10 ≈ 0.63), indicating incomplete candidate coverage.

However, reranking still produces clear improvements in:
- **R@1** (0.15 → 0.35)
- **MRR** at every k (+0.14 to +0.20)

This shows that when the correct chunk is present in the candidate set, reranking moves it closer to the top.  
Because the reranker only reorders the top-10 vector candidates (`k_vec=10`), its R@1 cannot exceed the vector R@10 of 0.625, so the main bottleneck at the chunk level appears to be **candidate generation (coverage), not ranking**.

---

### Key Takeaways

- **Doc-level:** Candidate generation is strong; ranking refinement provides measurable gains.
- **Chunk-level:** Coverage remains limited at k=10; increasing retrieval depth or improving chunking strategy may further improve performance.
- LLM reranking consistently improves early precision (R@1) and ranking quality (MRR), even at k=10, where Recall@k is unchanged because reranking only reorders the same 10 candidates.
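
One way to separate coverage from ranking is to divide R@1 by R@10, which estimates how often the gold item is ranked first given that it is in the top 10. The sketch below uses the values from the tables above and assumes Recall@k is a hit rate, so every rank-1 hit is also a top-10 hit. `top1_given_covered` is a hypothetical helper.

```python
# (Recall@1, Recall@10) copied from the comparison tables.
metrics = {
    ("doc", "vector"): (0.75, 0.975),
    ("doc", "rerank"): (0.875, 0.975),
    ("chunk", "vector"): (0.15, 0.625),
    ("chunk", "rerank"): (0.35, 0.625),
}


def top1_given_covered(r_at_1: float, r_at_10: float) -> float:
    """Share of covered questions whose gold item is ranked first."""
    return r_at_1 / r_at_10


conditional = {
    key: round(top1_given_covered(r1, r10), 3)
    for key, (r1, r10) in metrics.items()
}
# doc:   vector 0.769 -> rerank 0.897
# chunk: vector 0.24  -> rerank 0.56
```

At the chunk level reranking more than doubles this conditional rate, while the 0.625 coverage ceiling stays fixed.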

## Failure Analysis

We inspect failure cases to distinguish between:

- **Coverage failures** (k=10 miss): correct item not retrieved in the top 10
- **Ranking failures** (k=1 miss but k=10 hit): correct item retrieved but not ranked first

Representative examples are shown below.

In [None]:
def preview_fail(path, n=3):
    with open(path, "r", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    return rows[:n]

preview_fail(EVAL_DIR / "fails_vector_chunk_k10.jsonl", n=3)

[{'question': "Which types of workloads are explicitly listed as examples where the company's accelerated computing stack is used? Provide at least three from the text.",
  'level': 'chunk',
  'k': 10,
  'id_key': 'chunk_id',
  'gold_ids': ['nvidia_2024_item_1_business_c02_085ed035dcb2'],
  'pred_ids': ['nvidia_2024_item_1_business_c22_44d1d4a16ddb',
   'nvidia_2024_item_1_business_c03_18f3bf09b826',
   'nvidia_2024_item_1_business_c23_09f6a7d3148f',
   'nvidia_2024_item_1_business_c81_5257a3ad726c',
   'nvidia_2024_item_1_business_c50_393fc5c3a9ed',
   'nvidia_2024_item_1_business_c35_494c72a9061d',
   'nvidia_2024_item_1a_risk_factors_c20_91bcdbc9d8d1',
   'nvidia_2024_item_1_business_c21_b61f1b73881b',
   'nvidia_2024_item_1_business_c39_f414782c55e9',
   'nvidia_2024_item_1_business_c38_112a1f913afe'],
  'top_results_preview': [{'chunk_id': 'nvidia_2024_item_1_business_c22_44d1d4a16ddb',
    'doc_id': 'nvidia_2024_item_1_business',
    'score': 0.5790205001831055},
   {'chunk_id': 

**Observation**

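This failure occurs at the chunk level despite retrieving the correct document: nine of the ten predicted chunks come from the gold section `nvidia_2024_item_1_business`.
The model retrieves semantically related chunks within the same section, including `c03`, which by its id is adjacent to the gold chunk `c02`, but not the gold chunk itself.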

This suggests:
- The evidence is distributed across multiple nearby chunks.
- Chunk boundary granularity may affect strict chunk-level recall.
- Increasing retrieval depth or using larger overlapping chunks could improve coverage.
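
To make the overlapping-chunk idea concrete, here is a minimal sliding-window sketch. It splits on whitespace tokens for simplicity and is not the chunker used to build this index. `chunk_with_overlap`, `size`, and `overlap` are hypothetical names.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, size=4, overlap=2)
# [['a','b','c','d'], ['c','d','e','f'], ['e','f','g','h'], ['g','h','i','j']]
```

With overlap, a sentence that straddles a boundary appears whole in at least one chunk. The trade-off is a larger index, and gold chunk labels would need to be regenerated for the new boundaries.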