# 01 – Metrics and Signals for RAG Evaluation

This notebook demonstrates the three metric families from the blog:

- **Intrinsic metrics** – output quality (simple lexical overlap in this demo)
- **Extrinsic metrics** – operational behavior (latency, token usage, retrieval time)
- **Behavioral metrics** – reasoning behavior (verbosity)

It uses the functions implemented in `src/rag_eval/metrics.py`.

You can extend this notebook with your own queries, reference answers,
and simulated drift scenarios.

In [None]:
import os
import sys

# Assumes the notebook runs from a directory one level below the project root.
# Adjust this path if you run it from a different working directory.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
SRC_PATH = os.path.join(PROJECT_ROOT, "src")

if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)

from rag_eval.metrics import (
    MetricsResult,
    compute_intrinsic_metrics,
    compute_extrinsic_metrics,
    compute_behavioral_metrics,
    compute_all_metrics,
)

PROJECT_ROOT, SRC_PATH

## Single Example (Blog Snippet)

We start with the same reference and output pair used in the blog to
highlight how intrinsic, extrinsic, and behavioral metrics are computed.

In [None]:
reference = "The policy was updated in 2024 to include new AI auditing guidelines."
output = "The new guidelines were introduced recently for AI auditing."

result: MetricsResult = compute_all_metrics(
    reference=reference,
    output=output,
    latency_ms=120,
    token_count=85,
    retrieval_ms=30,
)

print("Intrinsic score:", result.intrinsic)
print("Extrinsic metrics:", result.extrinsic)
print("Behavioral metrics:", result.behavioral)
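
To make the intrinsic score more concrete, here is a minimal sketch of one possible lexical overlap measure, applied to the same reference/output pair. It is not the implementation in `src/rag_eval/metrics.py`: the function name `lexical_overlap`, the regex tokenizer, and the choice to normalize by the number of unique reference tokens are all assumptions made for illustration, so the value may differ from `result.intrinsic`.

```python
import re


def lexical_overlap(reference: str, output: str) -> float:
    """Hypothetical helper: share of unique reference tokens found in the output."""

    def tokenize(text: str) -> set:
        # Lowercase and keep alphanumeric runs only, so punctuation is ignored.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    ref_tokens = tokenize(reference)
    out_tokens = tokenize(output)
    if not ref_tokens:
        # An empty reference has nothing to overlap with.
        return 0.0
    return len(ref_tokens & out_tokens) / len(ref_tokens)


demo_reference = "The policy was updated in 2024 to include new AI auditing guidelines."
demo_output = "The new guidelines were introduced recently for AI auditing."

overlap_score = lexical_overlap(demo_reference, demo_output)
print(f"Lexical overlap: {overlap_score:.2f}")
```

A measure like this rewards shared words, not shared meaning, so a faithful paraphrase can score lower than a fluent but incorrect answer that reuses the reference vocabulary.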

## Multiple Examples

Now we simulate several reference/output pairs, with different levels of
alignment and different latency/usage patterns.

This mimics what you would see if you logged evaluation metrics for
real RAG answers over time.

In [None]:
examples = [
    {
        "id": "policy_guidelines_good",
        "reference": "The policy was updated in 2024 to include new AI auditing guidelines.",
        "output": "The policy was updated in 2024 with new guidelines for AI auditing.",
        "latency_ms": 110,
        "token_count": 80,
        "retrieval_ms": 25,
    },
    {
        "id": "refund_policy_paraphrased",
        "reference": "Refund requests must be submitted within 30 days of purchase.",
        "output": "Customers can ask for a refund within a month of buying.",
        "latency_ms": 95,
        "token_count": 64,
        "retrieval_ms": 20,
    },
    {
        "id": "incorrect_statement",
        "reference": "The model does not support bulk exports in the free tier.",
        "output": "Bulk exports are available in all subscription tiers.",
        "latency_ms": 140,
        "token_count": 72,
        "retrieval_ms": 35,
    },
]

import pandas as pd

# Compute all three metric families per example and flatten them into one row each.
rows = []
for ex in examples:
    metrics = compute_all_metrics(
        reference=ex["reference"],
        output=ex["output"],
        latency_ms=ex["latency_ms"],
        token_count=ex["token_count"],
        retrieval_ms=ex["retrieval_ms"],
    )
    rows.append({
        "id": ex["id"],
        "intrinsic": metrics.intrinsic,
        "latency_ms": metrics.extrinsic["latency_ms"],
        "token_count": metrics.extrinsic["token_count"],
        # .get() returns None if retrieval time is missing from the result.
        "retrieval_ms": metrics.extrinsic.get("retrieval_ms"),
        "length": metrics.behavioral["length"],
    })

df = pd.DataFrame(rows)
df

You can use this table as input to later stages:

- Stage A/B/C evaluation (next notebook)
- Drift detection over time
- Visualization in a dashboard
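
As a rough illustration of the drift detection item, the sketch below compares the mean of a recent window against a baseline window for a single metric. The helper names `relative_shift` and `has_drifted`, the 20% threshold, and the "recent" latency values are assumptions invented for this example; only the baseline latencies come from the examples above.

```python
from statistics import mean
from typing import Sequence


def relative_shift(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Hypothetical helper: relative change of the recent mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative shift is undefined")
    return (mean(recent) - base) / base


def has_drifted(
    baseline: Sequence[float],
    recent: Sequence[float],
    threshold: float = 0.2,  # assumed tolerance: flag shifts larger than 20%
) -> bool:
    """Flag drift when the mean moves by more than `threshold` in either direction."""
    return abs(relative_shift(baseline, recent)) > threshold


baseline_latency_ms = [110, 95, 140]  # latencies from the three examples above
recent_latency_ms = [150, 160, 170]   # made-up values simulating slower responses

shift = relative_shift(baseline_latency_ms, recent_latency_ms)
drifted = has_drifted(baseline_latency_ms, recent_latency_ms)
print(f"Relative latency shift: {shift:+.1%}, drifted: {drifted}")
```

The same check could be run per column of the table (intrinsic score, token count, length). A mean comparison is deliberately simple; with more data, a statistical test or a rolling window would be less sensitive to outliers.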

At this point, we have a basic but concrete representation of the
three metric families discussed in the blog.