# 04 – End-to-End Continuous Evaluation Simulation

This notebook combines all components:

- metrics (Step 1)
- multi-stage evaluation loop (Step 2)
- evaluation controller (Step 3)
- feedback loops (Step 4)
- drift scenarios (Step 5 / scenarios.py)

The goal is to simulate how a continuous evaluation pipeline detects and
responds to different kinds of drift in a RAG system.

In [None]:
import os
import sys

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
SRC_PATH = os.path.join(PROJECT_ROOT, "src")
CONFIG_PATH = os.path.join(PROJECT_ROOT, "config")

if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)

from rag_eval.metrics import compute_all_metrics
from rag_eval.stages import evaluate_all_stages
from rag_eval.controller import EvaluationController
from rag_eval.feedback_loops import OfflineFeedbackLoop, OnlineFeedbackLoop
from rag_eval.scenarios import load_scenarios

PROJECT_ROOT, SRC_PATH, CONFIG_PATH

## Load Drift Scenarios

We load synthetic scenarios from `config/demo_scenarios.yaml`.

Each scenario describes a different kind of drift:

- retrieval quality degradation
- reasoning / grounding drift
- benign changes (no failure)

You can customize these scenarios to reflect your own system.
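As a rough guide for customization, the sketch below shows the shape a scenario appears to need, inferred from how `run_scenario` reads it later in this notebook: `query`, `reference`, and `output` are accessed directly, while `metadata` and `retrieved_docs` are optional. The `DemoScenario` class and every value in it are hypothetical illustrations, not the actual objects returned by `load_scenarios` or the contents of `demo_scenarios.yaml`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class DemoScenario:
    """Hypothetical stand-in for the objects returned by load_scenarios."""

    name: str
    description: str
    parameters: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only; none of this comes from demo_scenarios.yaml.
retrieval_drift = DemoScenario(
    name="retrieval_degradation_example",
    description="Retriever returns an off-topic document.",
    parameters={
        "query": "What is the refund window?",
        "metadata": {"domain": "support", "latency_ms": 140},
        "retrieved_docs": ["Shipping times vary by region."],
        "reference": "Refunds are accepted within 30 days.",
        "output": "Shipping usually takes five days.",
    },
)

# Keys that run_scenario reads with params[...] and therefore must exist.
REQUIRED_KEYS = {"query", "reference", "output"}
missing = REQUIRED_KEYS - set(retrieval_drift.parameters)
print("missing keys:", missing)
```

Checking for the required keys before running a custom scenario surfaces a malformed definition early, rather than as a `KeyError` in the middle of the loop.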

In [None]:
scenario_file = os.path.join(CONFIG_PATH, "demo_scenarios.yaml")
scenarios = load_scenarios(scenario_file)

[(s.name, s.description) for s in scenarios]

## Set Up Controller and Feedback Loops

We will:

- run each scenario
- compute metrics
- evaluate stages
- ask the controller for an action
- record results in offline and online feedback loops


In [None]:
controller = EvaluationController()
offline = OfflineFeedbackLoop()
online = OnlineFeedbackLoop()
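To make the controller's role concrete, here is a minimal sketch of an object with the same two methods this notebook calls, `choose_action` and `execute`. It is an assumption for illustration only: the class `SketchController`, the action names, and the returned dictionary are invented here, and the real `EvaluationController` may use a different policy and return type. The only detail taken from this notebook is that stage results carry an `overall_passed` flag.

```python
from typing import Any, Dict


class SketchController:
    """Hypothetical stand-in; the real EvaluationController may differ."""

    def choose_action(self, stage_results: Dict[str, Any]) -> str:
        # Only the "overall_passed" key is known from this notebook;
        # the action names below are invented for illustration.
        if stage_results.get("overall_passed", False):
            return "no_action"
        return "flag_for_review"

    def execute(self, action_name: str) -> Dict[str, Any]:
        # Describe the correction instead of performing one.
        return {"action": action_name, "applied": action_name != "no_action"}


sketch = SketchController()
action = sketch.choose_action({"overall_passed": False})
print(action, sketch.execute(action))
```

Separating the decision (`choose_action`) from its effect (`execute`) lets the chosen action be logged even when the correction itself is a no-op.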

## Helper: Run a Single Scenario

Each scenario is responsible for describing:

- query
- metadata (domain, thresholds, etc.)
- retrieved_docs
- reference
- output

In a real system, these would come from logs. Here we simulate them.


In [None]:
def run_scenario(scenario):
    """Run one scenario through metrics, stages, controller, and feedback loops.

    Relies on the module-level `controller`, `offline`, and `online` objects.
    """
    params = scenario.parameters

    # query, reference, and output are required; the other fields are optional.
    query = params["query"]
    metadata = params.get("metadata", {})
    retrieved_docs = params.get("retrieved_docs", [])
    reference = params["reference"]
    output = params["output"]

    # Fall back to fixed demo values when the scenario omits timing metadata.
    metrics = compute_all_metrics(
        reference=reference,
        output=output,
        latency_ms=metadata.get("latency_ms", 120),
        token_count=metadata.get("token_count", 80),
        retrieval_ms=metadata.get("retrieval_ms", 30),
    )

    stage_results = evaluate_all_stages(
        query=query,
        metadata=metadata,
        retrieved_docs=retrieved_docs,
        gen_metrics={"intrinsic": metrics.intrinsic},
    )

    # The controller maps stage results to an action, then executes it.
    action_name = controller.choose_action(stage_results)
    correction = controller.execute(action_name)

    record = {
        "scenario": scenario.name,
        "description": scenario.description,
        "metrics_intrinsic": metrics.intrinsic,
        "stage_results": stage_results,
        "action_name": action_name,
        "correction": correction,
    }

    # Every run is recorded offline; only failing runs are reported online.
    offline.record(record)

    if not stage_results["overall_passed"]:
        online.report_issue({
            "scenario": scenario.name,
            "action": action_name,
        })

    return record

## Run All Scenarios

We now simulate the full continuous evaluation loop for each scenario.

In [None]:
results = [run_scenario(s) for s in scenarios]

len(results), results[0]

## Inspect Offline Summary

The offline summary aggregates every recorded run, which makes it useful for regression testing and historical analysis.

In [None]:
offline.summarize()
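For intuition about what such a summary could contain, the sketch below aggregates records shaped like the ones `run_scenario` builds. `SketchOfflineLoop` and the summary fields it returns are assumptions made for illustration; the actual `OfflineFeedbackLoop.summarize()` may report different statistics.

```python
from collections import Counter
from typing import Any, Dict, List


class SketchOfflineLoop:
    """Hypothetical stand-in for OfflineFeedbackLoop."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def record(self, record: Dict[str, Any]) -> None:
        # Keep every run, passing or failing, for later analysis.
        self.records.append(record)

    def summarize(self) -> Dict[str, Any]:
        total = len(self.records)
        failed = sum(
            1 for r in self.records if not r["stage_results"]["overall_passed"]
        )
        actions = Counter(r["action_name"] for r in self.records)
        return {
            "total_runs": total,
            "failed_runs": failed,
            "failure_rate": failed / total if total else 0.0,
            "actions": dict(actions),
        }


# Invented records that mimic the fields set in run_scenario.
demo_offline = SketchOfflineLoop()
demo_offline.record(
    {"stage_results": {"overall_passed": True}, "action_name": "no_action"}
)
demo_offline.record(
    {"stage_results": {"overall_passed": False}, "action_name": "flag_for_review"}
)
summary = demo_offline.summarize()
print(summary)
```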

## Check Online Drift Indicator

The online loop tracks only whether the current system appears to be drifting, based on the issues reported for failing runs.

In [None]:
online.has_drift(), online.live_issues[:5]
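One simple way such an indicator could work is a count-based threshold over reported issues, sketched below. The names `report_issue`, `has_drift`, and `live_issues` mirror how this notebook uses the online loop, but `SketchOnlineLoop` and its `drift_threshold` parameter are hypothetical; the real `OnlineFeedbackLoop` may decide drift differently.

```python
from typing import Any, Dict, List


class SketchOnlineLoop:
    """Hypothetical stand-in for OnlineFeedbackLoop."""

    def __init__(self, drift_threshold: int = 1) -> None:
        # drift_threshold is an assumed knob: how many live issues
        # must accumulate before drift is signalled.
        self.drift_threshold = drift_threshold
        self.live_issues: List[Dict[str, Any]] = []

    def report_issue(self, issue: Dict[str, Any]) -> None:
        self.live_issues.append(issue)

    def has_drift(self) -> bool:
        return len(self.live_issues) >= self.drift_threshold


demo_online = SketchOnlineLoop(drift_threshold=2)
demo_online.report_issue({"scenario": "example_a", "action": "flag_for_review"})
before = demo_online.has_drift()
demo_online.report_issue({"scenario": "example_b", "action": "flag_for_review"})
after = demo_online.has_drift()
print(before, after)
```

A threshold above one avoids signalling drift on a single isolated failure, at the cost of reacting more slowly.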

From here, you can:

- export `results` as JSON Lines (one record per line) to `data/logs/sample_evaluation_runs.jsonl`
- feed them into a dashboard (see `dashboards/examples/`)
- refine thresholds and actions based on real behavior
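The export step can be sketched with the standard library alone. The helper `write_jsonl` and the sample records are hypothetical, and the example writes to a temporary directory rather than `data/logs/`. It also assumes the records are JSON-serializable: `default=str` is a fallback that stringifies anything `json` cannot encode, which may not be the representation you want for custom metric objects.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            # default=str stringifies values json cannot encode natively.
            fh.write(json.dumps(record, default=str) + "\n")


# Invented records standing in for the notebook's `results` list.
sample_records = [
    {"scenario": "example_a", "action_name": "no_action"},
    {"scenario": "example_b", "action_name": "flag_for_review"},
]

out_path = os.path.join(tempfile.mkdtemp(), "sample_evaluation_runs.jsonl")
write_jsonl(sample_records, out_path)

# Read the file back to confirm each line parses as its own JSON object.
with open(out_path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded), "records written")
```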

This completes the end-to-end simulation of the continuous evaluation
pipeline described in the blog.