# 02. Evaluate Retrieval (Invest RAG)

Measure retrieval quality **before** adding generation:
- **Recall@k**, **MRR@k**
- Compare **Vector baseline** vs **LLM rerank (Top-1 promotion)**

> This notebook is orchestration-only. All logic lives under `src/`.
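
For reference, the two metrics can be sketched as below. This is a minimal illustration under common definitions (recall as the fraction of gold IDs found in the top k, MRR as the reciprocal rank of the first gold hit); the function names and toy IDs are assumptions, and the actual implementation in `src/eval/retrieval_eval.py` may differ in details.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Fraction of gold IDs that appear among the top-k ranked IDs."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)


def mrr_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first gold hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Toy example (hypothetical IDs): two gold documents for one question.
gold = {"a", "b"}
miss_first = ["x", "a", "b"]   # first gold hit at rank 2
hit_first = ["a", "x", "b"]    # first gold hit at rank 1

print(recall_at_k(miss_first, gold, 1), mrr_at_k(miss_first, gold, 1))  # 0.0 0.0
print(recall_at_k(miss_first, gold, 3), mrr_at_k(miss_first, gold, 3))  # 1.0 0.5
print(recall_at_k(hit_first, gold, 1), mrr_at_k(hit_first, gold, 1))    # 0.5 1.0
```

The last line shows why Recall@1 can sit well below MRR@1: with two gold documents, a perfect top-1 hit still covers only half of the gold set.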


In [5]:
from pathlib import Path

# Robust project root detection (repo root should contain /src)
CWD = Path.cwd().resolve()
PROJECT_ROOT = CWD if (CWD / "src").exists() else CWD.parent

INDEX_DIR = PROJECT_ROOT / "indexes" / "faiss"
INDEX_PATH = INDEX_DIR / "index.bin"
META_PATH  = INDEX_DIR / "meta.jsonl"
EVAL_PATH  = PROJECT_ROOT / "eval" / "questions.jsonl"

assert INDEX_PATH.exists(), f"Missing: {INDEX_PATH}"
assert META_PATH.exists(), f"Missing: {META_PATH}"
assert EVAL_PATH.exists(), f"Missing: {EVAL_PATH}"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("INDEX_PATH  :", INDEX_PATH)
print("META_PATH   :", META_PATH)
print("EVAL_PATH   :", EVAL_PATH)


PROJECT_ROOT: C:\Users\CG\Desktop\invest-rag
INDEX_PATH  : C:\Users\CG\Desktop\invest-rag\indexes\faiss\index.bin
META_PATH   : C:\Users\CG\Desktop\invest-rag\indexes\faiss\meta.jsonl
EVAL_PATH   : C:\Users\CG\Desktop\invest-rag\eval\questions.jsonl


In [7]:
from src.data_pipeline.io_utils import read_jsonl
from src.llm.embedding import embed_query
from src.retrieval.vector_store import VectorStore
from src.eval.search_wrappers import make_vectorstore_search_fn, make_llm_rerank_search_fn
from src.eval.retrieval_eval import run_eval_suite

# Load data + index
questions = read_jsonl(EVAL_PATH)
vs = VectorStore.load(index_path=INDEX_PATH, meta_path=META_PATH)

print("n_questions:", len(questions))


n_questions: 20


## Build `search_fn` (injection)

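The evaluator controls the cutoff **k** by calling `search_fn(query, k)`, so any retriever that exposes this signature can be evaluated without changing the evaluator.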
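
The injection pattern can be sketched with a toy retriever. Everything here is hypothetical (the word-overlap scorer, `make_toy_search_fn`, `hit_rate`, and the `query` / `gold_ids` field names are assumptions for illustration); the only thing taken from the notebook is the `search_fn(query, k)` call shape.

```python
from typing import Callable, Dict, List

# A search function takes (query, k) and returns up to k ranked hits.
SearchFn = Callable[[str, int], List[Dict]]


def make_toy_search_fn(corpus: List[Dict]) -> SearchFn:
    """Build a search_fn that ranks documents by word overlap with the query."""
    def search_fn(query: str, k: int) -> List[Dict]:
        q_words = set(query.lower().split())
        ranked = sorted(
            corpus,
            key=lambda d: len(q_words & set(d["text"].lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return search_fn


def hit_rate(questions: List[Dict], search_fn: SearchFn, k: int) -> float:
    """Share of questions with at least one gold doc in the top k."""
    hits = 0
    for q in questions:
        top_ids = {r["doc_id"] for r in search_fn(q["query"], k)}
        hits += bool(top_ids & set(q["gold_ids"]))
    return hits / len(questions)


corpus = [
    {"doc_id": "d1", "text": "HBM capacity expansion plan"},
    {"doc_id": "d2", "text": "R&D investment plan"},
]
toy_questions = [
    {"query": "HBM expansion", "gold_ids": ["d1"]},
    {"query": "investment in R&D", "gold_ids": ["d2"]},
]

toy_search_fn = make_toy_search_fn(corpus)
print(hit_rate(toy_questions, toy_search_fn, k=1))  # 1.0
```

Because the evaluator only depends on the callable, swapping the vector baseline for a reranker is a one-argument change.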


In [8]:
# Baseline: vector search
vector_search_fn = make_vectorstore_search_fn(
    vs,
    embed_query=embed_query,
    normalize=True,   # keep consistent with index build
)

# Rerank v1: vector candidates -> LLM chooses best -> promote to rank #1
rerank_search_fn = make_llm_rerank_search_fn(
    vector_search_fn,
    k_vec=10,
    rerank_model="gpt-4.1-mini",  # or None to use default inside reranker
)
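
The "promote to rank #1" step can be sketched as follows. This is not the implementation of `make_llm_rerank_search_fn`; `make_top1_promotion_fn` and `choose_best` are hypothetical names, and the LLM call is replaced by a stub that returns the index of the preferred candidate.

```python
from typing import Callable, Dict, List

SearchFn = Callable[[str, int], List[Dict]]
# Hypothetical chooser: returns the index of the best candidate.
ChooseFn = Callable[[str, List[Dict]], int]


def make_top1_promotion_fn(
    base_search_fn: SearchFn, choose_best: ChooseFn, k_vec: int = 10
) -> SearchFn:
    """Wrap a search_fn so the chosen candidate is moved to rank 1."""
    def search_fn(query: str, k: int) -> List[Dict]:
        # Fetch a candidate pool at least as large as the requested cutoff.
        candidates = base_search_fn(query, max(k, k_vec))
        if not candidates:
            return []
        best = choose_best(query, candidates)
        if not 0 <= best < len(candidates):
            best = 0  # invalid choice: fall back to the original order
        # Promote the chosen item; keep the others in their original order.
        reranked = [candidates[best]] + candidates[:best] + candidates[best + 1:]
        return reranked[:k]
    return search_fn


def toy_base(query: str, k: int) -> List[Dict]:
    return [{"doc_id": f"d{i}"} for i in range(1, 6)][:k]


promote_third = make_top1_promotion_fn(toy_base, lambda q, c: 2, k_vec=5)
print([r["doc_id"] for r in promote_third("q", 3)])  # ['d3', 'd1', 'd2']
```

Under this scheme the candidate pool is unchanged, so recall at k = `k_vec` cannot improve; only the order within the pool does.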


## Run evaluation (baseline vs rerank)

In [10]:
KS = (1, 3, 5, 10)

suite_vec = run_eval_suite(
    questions,
    ks=KS,
    search_fn=vector_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_vector.json",
    id_key="doc_id",  # switch to "chunk_id" if labels/results are chunk-level
    dedupe=True,      # doc-level eval: True, chunk-level eval: False
)

suite_rr = run_eval_suite(
    questions,
    ks=KS,
    search_fn=rerank_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_rerank_llm.json",
    id_key="doc_id",
    dedupe=True,
)

suite_vec, suite_rr


({'ks': [1, 3, 5, 10],
  'results': {'1': {'k': 1,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.5416666666666667,
    'mrr_at_k': 0.85,
    'n_fail': 3},
   '3': {'k': 3,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.7083333333333333,
    'mrr_at_k': 0.8916666666666666,
    'n_fail': 1},
   '5': {'k': 5,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.85,
    'mrr_at_k': 0.8916666666666666,
    'n_fail': 1},
   '10': {'k': 10,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.925,
    'mrr_at_k': 0.9,
    'n_fail': 0}}},
 {'ks': [1, 3, 5, 10],
  'results': {'1': {'k': 1,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.6416666666666667,
    'mrr_at_k': 1.0,
    'n_fail': 0},
   '3': {'k': 3,
    'n': 20,
    'n_scored': 20,
    'skipped_no_gold': 0,
    'recall_at_k': 0.8083333333333332,
    'mrr_at_k': 1.0,
    'n_fail': 0},


## Results

Compact comparison table with per-k deltas (rerank minus vector).

In [11]:
def _row(suite, k: int):
    m = suite["results"][str(k)]
    return m["recall_at_k"], m["mrr_at_k"], m.get("n_fail", None)

def print_compare(suite_a, suite_b, ks=KS, name_a="Vector", name_b="Rerank"):
    print(f"{'k':>2} | {name_a:>10} (R@k / MRR) | {name_b:>10} (R@k / MRR) | ΔRecall | ΔMRR")
    print("-"*78)
    for k in ks:
        ra, ma, _ = _row(suite_a, k)
        rb, mb, _ = _row(suite_b, k)
        print(f"{k:>2} | {ra:>6.4f} / {ma:>6.4f}     | {rb:>6.4f} / {mb:>6.4f}     | {rb-ra:+.4f} | {mb-ma:+.4f}")

print_compare(suite_vec, suite_rr)


 k |     Vector (R@k / MRR) |     Rerank (R@k / MRR) | ΔRecall | ΔMRR
------------------------------------------------------------------------------
 1 | 0.5417 / 0.8500     | 0.6417 / 1.0000     | +0.1000 | +0.1500
 3 | 0.7083 / 0.8917     | 0.8083 / 1.0000     | +0.1000 | +0.1083
 5 | 0.8500 / 0.8917     | 0.9000 / 1.0000     | +0.0500 | +0.1083
10 | 0.9250 / 0.9000     | 0.9250 / 1.0000     | +0.0000 | +0.1000


## Results — Standard Set

Reranking consistently improves top-ranked retrieval quality.

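- **Recall@1** improves by 10pp (0.54 → 0.64) and **MRR@1** rises from 0.85 to 1.00: the reranker resolves all three k=1 failures of the vector baseline (`n_fail` 3 → 0).
- Recall gains shrink as k increases (+0.10 at k=1 and k=3, +0.05 at k=5, 0.00 at k=10), which is consistent with the reranker only reordering the `k_vec=10` vector candidates: it refines ranking rather than expanding coverage.
- MRR is relatively high because evaluation is performed at the document level (`doc_id`), which is less strict than chunk-level evaluation. Recall@1 sits well below MRR@1, which is consistent with questions having more than one gold document, so a single retrieved document covers only part of the gold set.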
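
To illustrate why document-level scoring is more lenient, here is a minimal sketch of collapsing chunk-level hits to unique documents. `dedupe_by_doc` and the toy IDs are assumptions; how `run_eval_suite` applies `dedupe=True` internally is not shown in this notebook.

```python
from typing import Dict, List


def dedupe_by_doc(hits: List[Dict], id_key: str = "doc_id") -> List[Dict]:
    """Keep only the first (highest-ranked) hit for each document."""
    seen = set()
    unique = []
    for hit in hits:
        if hit[id_key] in seen:
            continue
        seen.add(hit[id_key])
        unique.append(hit)
    return unique


# Toy chunk-level ranking: two chunks of doc "a" outrank the gold doc "b".
chunk_hits = [
    {"chunk_id": "a_c00", "doc_id": "a"},
    {"chunk_id": "a_c01", "doc_id": "a"},
    {"chunk_id": "b_c00", "doc_id": "b"},
]

doc_hits = dedupe_by_doc(chunk_hits)
print([h["doc_id"] for h in doc_hits])  # ['a', 'b']
```

In this toy case the gold document moves from position 3 in the chunk list to position 2 after deduplication, so its reciprocal rank rises from 1/3 to 1/2.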


## Hard Questions (Stress Test)

We evaluate the same retrieval pipeline on a harder query set
to test robustness under increased ambiguity and reduced lexical overlap.

Same setup:
- Same FAISS index
- Same search functions (vector / rerank)
- Same evaluator (Recall@k, MRR@k)

Only the question set changes:
`questions.jsonl` → `questions_hard.jsonl`

We expect:
- Overall performance to drop
- Reranking to mainly improve top-ranked precision (Recall@1 / MRR)

In [12]:
# Use a separate name so the standard-set EVAL_PATH is not overwritten
EVAL_PATH_HARD = PROJECT_ROOT / "eval" / "questions_hard.jsonl"

questions_hard = read_jsonl(EVAL_PATH_HARD)
print("n_questions:", len(questions_hard))

KS = (1, 3, 5, 10)

suite_vec_hard = run_eval_suite(
    questions_hard,
    ks=KS,
    search_fn=vector_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_vector_hard.json",
    id_key="doc_id",  # switch to "chunk_id" if labels/results are chunk-level
    dedupe=True,      # doc-level eval: True, chunk-level eval: False
)

suite_rr_hard = run_eval_suite(
    questions_hard,
    ks=KS,
    search_fn=rerank_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_rerank_llm_hard.json",
    id_key="doc_id",
    dedupe=True,
)

print_compare(suite_vec_hard, suite_rr_hard)

n_questions: 20
 k |     Vector (R@k / MRR) |     Rerank (R@k / MRR) | ΔRecall | ΔMRR
------------------------------------------------------------------------------
 1 | 0.6917 / 0.9500     | 0.7417 / 1.0000     | +0.0500 | +0.0500
 3 | 0.9083 / 0.9667     | 0.9333 / 1.0000     | +0.0250 | +0.0333
 5 | 0.9750 / 0.9667     | 0.9750 / 1.0000     | +0.0000 | +0.0333
10 | 0.9750 / 0.9667     | 0.9750 / 1.0000     | +0.0000 | +0.0333


## Results — Hard Set (Stress Test)

On the hard query set (n=20), retrieval performance does not drop as expected;
it is higher than on the standard set.

The vector baseline reaches Recall@1 = 0.69 and MRR@1 = 0.95,
versus 0.54 and 0.85 on the standard set.
This suggests that embedding-based retrieval may favor queries that closely
resemble document phrasing, even if they appear conceptually more complex.

Because evaluation is conducted at the document level (`doc_id`),
matching any relevant document is sufficient for success,
which may further amplify this effect.

Reranking continues to improve top-1 precision
(Recall@1 0.69 → 0.74, MRR@1 0.95 → 1.00),
while Recall gains vanish from k=5 onward (ΔRecall = 0.00), indicating that
reranking primarily refines ranking rather than expanding coverage.

Chunk-level evaluation may provide a stricter and more fine-grained assessment.

## Failure Analysis (k=1)

We inspect k=1 failures to understand where the retriever struggles.

Representative failure patterns:

1. **Lexical Mismatch**  
   The query uses different terminology from the document, reducing embedding similarity.

2. **Ambiguity / Underspecified Query**  
   The query is broad, leading the retriever to select a plausible but incorrect document.

3. **Ranking Limitation (Doc-level Evaluation)**  
   The correct document may appear within top-k, but not at rank 1.

These observations suggest potential improvements in query rewriting,
more expressive reranking, and chunk-level evaluation.
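
One way to separate pattern 3 from patterns 1 and 2 is to check whether a gold document appears lower in the logged results. The sketch below assumes each failure record carries `gold_ids` (or `gold_doc_ids`) and a ranked `top5` list of hits with a `doc_id`, as in the failure log printed later in this notebook; `first_gold_rank`, `bucket_failure`, and the toy records are hypothetical.

```python
from typing import Dict, Optional


def first_gold_rank(record: Dict, id_key: str = "doc_id") -> Optional[int]:
    """1-based rank of the first gold document in the logged hits, or None."""
    gold = set(record.get("gold_ids") or record.get("gold_doc_ids") or [])
    for rank, hit in enumerate(record.get("top5", []), start=1):
        if hit[id_key] in gold:
            return rank
    return None


def bucket_failure(record: Dict) -> str:
    """Label a k=1 failure as a ranking miss or a retrieval miss."""
    rank = first_gold_rank(record)
    if rank is None:
        return "retrieval_miss"   # no gold doc anywhere in the logged hits
    return "ranking_miss"         # gold doc retrieved, but below rank 1


# Toy failure records (hypothetical IDs).
toy_fails = [
    {"gold_ids": ["g1"], "top5": [{"doc_id": "x"}, {"doc_id": "g1"}]},
    {"gold_ids": ["g2"], "top5": [{"doc_id": "x"}, {"doc_id": "y"}]},
]

for rec in toy_fails:
    print(first_gold_rank(rec), bucket_failure(rec))
# 2 ranking_miss
# None retrieval_miss
```

Ranking misses are the cases a reranker can fix; retrieval misses point toward query rewriting or changes to the candidate stage.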

In [13]:
KS = (1, 3, 5, 10)

suite_vec = run_eval_suite(
    questions,
    ks=KS,
    search_fn=vector_search_fn,
    id_key="doc_id",
    dedupe=True,
)

suite_rr = run_eval_suite(
    questions,
    ks=KS,
    search_fn=rerank_search_fn,
    id_key="doc_id",
    dedupe=True,
)

print_compare(suite_vec, suite_rr)

 k |     Vector (R@k / MRR) |     Rerank (R@k / MRR) | ΔRecall | ΔMRR
------------------------------------------------------------------------------
 1 | 0.5417 / 0.8500     | 0.6417 / 1.0000     | +0.1000 | +0.1500
 3 | 0.7083 / 0.8917     | 0.8083 / 1.0000     | +0.1000 | +0.1083
 5 | 0.8500 / 0.8917     | 0.9000 / 1.0000     | +0.0500 | +0.1083
10 | 0.9250 / 0.9000     | 0.9250 / 1.0000     | +0.0000 | +0.1000


In [14]:
suite_vec_hard = run_eval_suite(
    questions_hard,
    ks=KS,
    search_fn=vector_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_vector_hard.json",
    id_key="doc_id",
    dedupe=True,
)

suite_rr_hard = run_eval_suite(
    questions_hard,
    ks=KS,
    search_fn=rerank_search_fn,
    out_path=PROJECT_ROOT / "eval" / "results_rerank_llm_hard.json",  # same file name as the earlier hard-set run
    id_key="doc_id",
    dedupe=True,
)

print_compare(suite_vec_hard, suite_rr_hard)

 k |     Vector (R@k / MRR) |     Rerank (R@k / MRR) | ΔRecall | ΔMRR
------------------------------------------------------------------------------
 1 | 0.6917 / 0.9500     | 0.7417 / 1.0000     | +0.0500 | +0.0500
 3 | 0.9083 / 0.9667     | 0.9333 / 1.0000     | +0.0250 | +0.0333
 5 | 0.9750 / 0.9667     | 0.9750 / 1.0000     | +0.0000 | +0.0333
10 | 0.9750 / 0.9667     | 0.9750 / 1.0000     | +0.0000 | +0.0333


In [15]:
from src.eval.retrieval_eval import evaluate_retrieval

RESULTS_DIR = PROJECT_ROOT / "eval" / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

def save_fails_k1(questions, tag: str):
    m_vec, _ = evaluate_retrieval(
        questions,
        k=1,
        search_fn=vector_search_fn,
        id_key="doc_id",
        dedupe=True,
        save_fail_path=RESULTS_DIR / f"fails_vector_{tag}_k1.jsonl",
    )
    m_rr, _ = evaluate_retrieval(
        questions,
        k=1,
        search_fn=rerank_search_fn,
        id_key="doc_id",
        dedupe=True,
        save_fail_path=RESULTS_DIR / f"fails_rerank_{tag}_k1.jsonl",
    )

    print(f"[{tag}] saved fail logs (k=1)")
    print(" -", RESULTS_DIR / f"fails_vector_{tag}_k1.jsonl")
    print(" -", RESULTS_DIR / f"fails_rerank_{tag}_k1.jsonl")
    print(" vector:", f"Recall@1={m_vec['recall_at_k']:.4f}, MRR@1={m_vec['mrr_at_k']:.4f}, n_fail={m_vec['n_fail']}")
    print(" rerank:", f"Recall@1={m_rr['recall_at_k']:.4f}, MRR@1={m_rr['mrr_at_k']:.4f}, n_fail={m_rr['n_fail']}")

save_fails_k1(questions, tag="standard")
save_fails_k1(questions_hard, tag="hard")

[standard] saved fail logs (k=1)
 - C:\Users\CG\Desktop\invest-rag\eval\results\fails_vector_standard_k1.jsonl
 - C:\Users\CG\Desktop\invest-rag\eval\results\fails_rerank_standard_k1.jsonl
 vector: Recall@1=0.5417, MRR@1=0.8500, n_fail=3
 rerank: Recall@1=0.6417, MRR@1=1.0000, n_fail=0
[hard] saved fail logs (k=1)
 - C:\Users\CG\Desktop\invest-rag\eval\results\fails_vector_hard_k1.jsonl
 - C:\Users\CG\Desktop\invest-rag\eval\results\fails_rerank_hard_k1.jsonl
 vector: Recall@1=0.6917, MRR@1=0.9500, n_fail=1
 rerank: Recall@1=0.7417, MRR@1=1.0000, n_fail=0


In [16]:
from src.data_pipeline.io_utils import read_jsonl

fails = read_jsonl(PROJECT_ROOT / "eval/results/fails_vector_standard_k1.jsonl")

print("n_fails:", len(fails))
print("-" * 80)

for r in fails[:3]:
    print("Query:", r["query"])
    print("Gold:", r.get("gold_ids") or r.get("gold_doc_ids"))
    print("Top1:", r["top5"][0])
    print("-" * 80)

n_fails: 3
--------------------------------------------------------------------------------
Query: What are the key variables affecting NSIL's 3Q shipments and margins?
Gold: ['news_0001', 'report_0001']
Top1: {'rank': 1, 'score': 0.47367626428604126, 'chunk_id': 'disc_0004_c00_00ae883d7406', 'doc_id': 'disc_0004', 'source': 'disclosure_note', 'date': '2025-10-02', 'company': 'NeuroSilicon', 'ticker': 'NSIL', 'sector': 'Semiconductor/AI', 'title': 'R&D investment expansion plan (summary)', 'tags': ['R&D', 'investment'], 'text': 'NSIL disclosed a plan to expand R&D investment for next-generation architecture development. It is to be executed in phases through 2026, with details to be released later.'}
--------------------------------------------------------------------------------
Query: What is the investment burden related to QuantaMemory's HBM capacity expansion?
Gold: ['report_0006', 'news_0005']
Top1: {'rank': 1, 'score': 0.5867177248001099, 'chunk_id': 'disc_0002_c00_e61a698b5841', 'doc_id': 'disc_0002', 'source': 'disclosure_note', 'date': '2025-09-08', 'company': 'QuantaMemory', 'ticker': 'QMEM', 'sector': 'Semiconductor/AI', 'title': 'HBM production line expansion plan (summary)', 'tags': ['HBM', 'capaci