# 02 – Multi-Stage Evaluation Loop (Stage A/B/C)

This notebook implements Step 2 from the blog as a three-stage evaluation loop:

- **Stage A** – Pre-retrieval checks (query and routing)
- **Stage B** – Post-retrieval checks (relevance, coverage, redundancy)
- **Stage C** – Post-generation checks (intrinsic quality / grounding)

It uses:

- `rag_eval.metrics` for intrinsic/extrinsic/behavioral metrics
- `rag_eval.stages` for stage-level evaluations

The goal is to show how a vague complaint like:

> "Answers are not consistent anymore."

can be converted into **localized failures**: a specific stage of a structured evaluation loop that fails, with diagnostics explaining why.

In [None]:
import os
import sys

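# Assumes the notebook is run from a directory one level below the project root,
# with the package source under src/.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
SRC_PATH = os.path.join(PROJECT_ROOT, "src")

# Make the rag_eval package importable without installing it.
if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)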

from rag_eval.metrics import compute_all_metrics
from rag_eval.stages import (
    stage_a_pre_retrieval,
    stage_b_post_retrieval,
    stage_c_post_generation,
    evaluate_all_stages,
)

PROJECT_ROOT, SRC_PATH

## Example 1: Healthy Pipeline

We start with a case where everything works as expected:

- Query is valid
- Retrieval returns enough unique documents
- Generated answer aligns well with the reference
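
As a rough illustration of the first bullet, a query validity check could look like the sketch below. This is not the `rag_eval.stages.stage_a_pre_retrieval` implementation: the function name `check_query`, the whitespace tokenization, the default budget, and the result shape are all assumptions made for illustration. Only the `max_query_tokens` key comes from the metadata used in this notebook.

```python
def check_query(query: str, metadata: dict) -> dict:
    """Hypothetical pre-retrieval check: non-empty query within a token budget."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    tokens = query.split()
    max_tokens = metadata.get("max_query_tokens", 50)  # assumed default
    diagnostics = {
        "token_count": len(tokens),
        "max_query_tokens": max_tokens,
        "is_empty": len(tokens) == 0,
        "too_long": len(tokens) > max_tokens,
    }
    passed = not diagnostics["is_empty"] and not diagnostics["too_long"]
    return {"passed": passed, "diagnostics": diagnostics}


result = check_query(
    "What changed in the AI auditing policy in 2024?",
    {"domain": "policy", "max_query_tokens": 50},
)
print(result["passed"], result["diagnostics"]["token_count"])
```

Returning the diagnostics alongside the pass/fail flag keeps the reason for a failure available to later steps, rather than only the verdict.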


In [None]:
query = "What changed in the AI auditing policy in 2024?"
metadata = {"domain": "policy", "max_query_tokens": 50}

retrieved_docs = [
    {"id": "doc1", "content": "Policy updated in 2024 to include AI auditing guidelines."},
    {"id": "doc2", "content": "Details of AI auditing process introduced in 2024."},
    {"id": "doc3", "content": "Background on the AI auditing requirements."},
]

reference = "The policy was updated in 2024 to include new AI auditing guidelines."
output = "The policy was updated in 2024 with new guidelines for AI auditing and oversight."

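# Compute metrics for this single response from the reference, the output,
# and the timing/token measurements.
metrics = compute_all_metrics(
    reference=reference,
    output=output,
    latency_ms=120,
    token_count=90,
    retrieval_ms=30,
)

# Run Stage A, B, and C in one call; only the intrinsic metrics are passed
# through to the post-generation check.
stage_results = evaluate_all_stages(
    query=query,
    metadata=metadata,
    retrieved_docs=retrieved_docs,
    gen_metrics={"intrinsic": metrics.intrinsic},
)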

stage_results

In this case, all three stages should pass, so `overall_passed` should be `True`.

The diagnostics dictionary for each stage is what we will later feed into
controllers and feedback loops.
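
To make the idea of feeding diagnostics onward more concrete, the sketch below flattens nested per-stage results into one flat row per stage, which is a convenient form for logging. The result shape used here (`stage_a`/`stage_b`/`stage_c` entries holding `passed` and `diagnostics`, plus a top-level `overall_passed`) is an assumption and may not match what `evaluate_all_stages` actually returns. The helper `flatten_stage_results` and the example values are hypothetical.

```python
def flatten_stage_results(stage_results: dict) -> list[dict]:
    """Turn nested per-stage results into flat rows suitable for logging."""
    rows = []
    for name, stage in stage_results.items():
        if not isinstance(stage, dict):
            continue  # skip scalar entries such as overall_passed
        row = {"stage": name, "passed": stage.get("passed")}
        # Prefix diagnostic keys so they cannot collide with "stage"/"passed".
        for key, value in stage.get("diagnostics", {}).items():
            row[f"diag_{key}"] = value
        rows.append(row)
    return rows


# Illustrative values only, not real output from evaluate_all_stages.
example_results = {
    "stage_a": {"passed": True, "diagnostics": {"token_count": 9}},
    "stage_b": {"passed": True, "diagnostics": {"num_unique_docs": 3}},
    "stage_c": {"passed": True, "diagnostics": {}},
    "overall_passed": True,
}

for row in flatten_stage_results(example_results):
    print(row)
```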

## Example 2: Retrieval Drift (Stage B Fails)

Now we simulate a case where retrieval no longer returns enough
distinct documents: a single, unrelated document comes back instead of
three relevant ones. This often happens when indexing or filtering
changes unintentionally.

In [None]:
query = "What changed in the AI auditing policy in 2024?"
metadata = {"domain": "policy", "max_query_tokens": 50}

retrieved_docs_drifted = [
    {"id": "doc1", "content": "Some unrelated content."},
]

reference = "The policy was updated in 2024 to include new AI auditing guidelines."
output = "The policy changed recently, but details are unclear."

metrics_drifted = compute_all_metrics(
    reference=reference,
    output=output,
    latency_ms=130,
    token_count=70,
    retrieval_ms=25,
)

stage_results_drifted = evaluate_all_stages(
    query=query,
    metadata=metadata,
    retrieved_docs=retrieved_docs_drifted,
    gen_metrics={"intrinsic": metrics_drifted.intrinsic},
)

stage_results_drifted

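Stage B should now fail (`stage_b["passed"] == False`), while Stage A should still pass because the query and metadata are unchanged. Stage C may fail as well, since the vague answer shares little with the reference.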

This is exactly the localization behavior we want: instead of “the system
is bad,” we can say “retrieval is not returning enough high-quality context.”
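
A check for "enough distinct documents" can be sketched as follows. This is an illustration under stated assumptions, not the `rag_eval.stages.stage_b_post_retrieval` implementation: the function name `check_retrieval`, the `min_unique_docs` threshold of 2, and the redundancy formula are all hypothetical. It covers only document count and redundancy, and does not attempt to score relevance.

```python
def check_retrieval(docs: list[dict], min_unique_docs: int = 2) -> dict:
    """Hypothetical post-retrieval check: require enough distinct documents."""
    # Documents whose content is identical after lowercasing and whitespace
    # normalization are treated as duplicates.
    unique_contents = {" ".join(d.get("content", "").lower().split()) for d in docs}
    unique_contents.discard("")  # empty documents do not count as context
    n_unique = len(unique_contents)
    diagnostics = {
        "num_docs": len(docs),
        "num_unique_docs": n_unique,
        # Share of retrieved documents that added nothing new.
        "redundancy": 0.0 if not docs else 1.0 - n_unique / len(docs),
        "min_unique_docs": min_unique_docs,
    }
    return {"passed": n_unique >= min_unique_docs, "diagnostics": diagnostics}


healthy_docs = [
    {"id": "doc1", "content": "Policy updated in 2024 to include AI auditing guidelines."},
    {"id": "doc2", "content": "Details of AI auditing process introduced in 2024."},
    {"id": "doc3", "content": "Background on the AI auditing requirements."},
]
drifted_docs = [{"id": "doc1", "content": "Some unrelated content."}]

print(check_retrieval(healthy_docs)["passed"])
print(check_retrieval(drifted_docs)["passed"])
```

With the documents from the two examples, the healthy set clears the assumed threshold and the drifted set does not.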

## Example 3: Reasoning / Grounding Degradation (Stage C Fails)

Here we keep the query and retrieval healthy, but generate an answer that
has low intrinsic overlap with the reference (simulating hallucination or
poor grounding).

In [None]:
query = "What changed in the AI auditing policy in 2024?"
metadata = {"domain": "policy", "max_query_tokens": 50}

retrieved_docs_ok = retrieved_docs  # reuse from Example 1 (run that cell first)

reference = "The policy was updated in 2024 to include new AI auditing guidelines."
output_bad = "The policy introduced new marketing requirements unrelated to AI."

metrics_bad = compute_all_metrics(
    reference=reference,
    output=output_bad,
    latency_ms=115,
    token_count=75,
    retrieval_ms=28,
)

stage_results_bad = evaluate_all_stages(
    query=query,
    metadata=metadata,
    retrieved_docs=retrieved_docs_ok,
    gen_metrics={"intrinsic": metrics_bad.intrinsic},
)

stage_results_bad

Stage C should now fail, while Stages A and B should still pass because the query and retrieved documents are the same as in the healthy example.
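
To show what "low intrinsic overlap" can mean in practice, the sketch below scores an answer against the reference with a token-level F1 and applies a threshold. It is a stand-in, not the metric that `rag_eval.metrics` computes: the helpers `tokenize`, `overlap_f1`, and `check_generation`, and the `min_f1` threshold of 0.6, are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a simplification of real tokenization."""
    return re.findall(r"\w+", text.lower())


def overlap_f1(reference: str, output: str) -> float:
    """Token-level F1 between reference and output (multiset overlap)."""
    ref_tokens, out_tokens = tokenize(reference), tokenize(output)
    if not ref_tokens or not out_tokens:
        return 0.0
    shared = sum((Counter(ref_tokens) & Counter(out_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(out_tokens)
    recall = shared / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def check_generation(reference: str, output: str, min_f1: float = 0.6) -> dict:
    """Hypothetical post-generation check; min_f1 is an assumed threshold."""
    score = overlap_f1(reference, output)
    return {"passed": score >= min_f1, "diagnostics": {"overlap_f1": round(score, 3)}}


ref_text = "The policy was updated in 2024 to include new AI auditing guidelines."
good = "The policy was updated in 2024 with new guidelines for AI auditing and oversight."
bad = "The policy introduced new marketing requirements unrelated to AI."

print(check_generation(ref_text, good))
print(check_generation(ref_text, bad))
```

Common words such as "the" and "to" still count as overlap here, so even the off-topic answer scores well above zero. A real metric would likely weight or filter such tokens, and the threshold would need tuning on actual data.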

In the next notebook, we will use these stage outputs as inputs to an
evaluation controller that decides which corrective action to apply: fix
the query, adjust retrieval, or adjust the prompt.
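
As a preview of that idea, a controller can be as simple as a lookup from the earliest failing stage to an action. The sketch below is hypothetical: the action names, the `choose_action` helper, and the assumed result shape (a `passed` flag under each of `stage_a`, `stage_b`, `stage_c`) are illustrative and may differ from what the next notebook uses.

```python
# Hypothetical mapping from the failing stage to a corrective action.
ACTIONS = {
    "stage_a": "fix_query",
    "stage_b": "adjust_retrieval",
    "stage_c": "adjust_prompt",
}


def choose_action(stage_results: dict) -> str:
    """Return the action for the earliest failing stage, or 'no_action'."""
    for stage in ("stage_a", "stage_b", "stage_c"):
        # Stages missing from the results are skipped rather than treated as failures.
        if not stage_results.get(stage, {}).get("passed", True):
            return ACTIONS[stage]
    return "no_action"


# Illustrative inputs only; real stage results carry richer diagnostics.
retrieval_failure = {
    "stage_a": {"passed": True},
    "stage_b": {"passed": False},
    "stage_c": {"passed": False},
}
print(choose_action(retrieval_failure))  # adjust_retrieval
```

Acting on the earliest failing stage reflects the pipeline order: a generation failure that follows a retrieval failure is often a symptom of it, so the upstream problem is addressed first.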