# Evaluation Pipeline — Interactive Notebook

This notebook walks through the full **generate traces → evaluate → upload to Foundry portal** pipeline.

**Two-step workflow:**
1. **Generate traces** — run the real agent against gold queries with configurable model/params
2. **Evaluate & upload** — score traces (deterministic + AI judge) and push results to Foundry

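Later cells reuse imports and variables defined in earlier ones (for example, Section 3a reads the `results` object produced in Section 3), so edit the configuration cells, then run the notebook top-to-bottom.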

---
## 0. Environment Setup

In [None]:
import json
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Ensure project root is importable
PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

load_dotenv(PROJECT_ROOT / ".env")

print(f"Project root : {PROJECT_ROOT}")
print(f"Model (env)  : {os.getenv('AZURE_OPENAI_MODEL')}")
print(f"Endpoint     : {os.getenv('AZURE_OPENAI_ENDPOINT', '')[:50]}...")
print(f"API version  : {os.getenv('AZURE_OPENAI_API_VERSION')}")
print(f"SQL conn     : {'set' if os.getenv('AZURE_SQL_CONNECTIONSTRING') else 'MISSING'}")
print(f"Foundry conn : {'set' if os.getenv('AZURE_AI_PROJECT_CONNECTION_STRING') else 'MISSING'}")

---
## 1. Review Configuration

The pipeline draws from three configuration sources:
- **System prompts** — `config/prompts/system.yaml` (one prompt per chat profile)
- **Tool descriptions** — `config/prompts/tools.yaml` (single source of truth for all tool guidance)
- **Gold dataset** — `eval/datasets/sql_agent_gold_starter.jsonl` (test queries + expected tools)

### 1a. System Prompts

In [None]:
import yaml

with open("config/prompts/system.yaml", "r") as f:
    system_config = yaml.safe_load(f)

for profile_key, profile in system_config["profiles"].items():
    prompt_preview = profile["text"][:200].replace("\n", " ")
    print(f"\n{'='*60}")
    print(f"Profile : {profile_key}")
    print(f"ID      : {profile['id']}")
    print(f"Version : {profile['version']}")
    print(f"Preview : {prompt_preview}...")

### 1b. Tool Definitions

In [None]:
with open("config/prompts/tools.yaml", "r") as f:
    tools_config = yaml.safe_load(f)

for tool_def in tools_config["tools"]:
    name = tool_def["tool_name"]
    version = tool_def["version"]
    profiles = tool_def.get("enabled_profiles", [])
    desc_preview = tool_def["description"].strip().split("\n")[0]
    rules_count = len(tool_def.get("usage_rules", []))
    print(f"  {name:<20s}  v{version}  profiles={profiles}  rules={rules_count}")
    print(f"    {desc_preview}")

### 1c. Gold Dataset

In [None]:
gold_path = Path("eval/datasets/sql_agent_gold_starter.jsonl")
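# Skip blank lines so a stray empty line in the JSONL file does not break json.loads
gold_rows = [json.loads(line) for line in gold_path.read_text(encoding="utf-8").splitlines() if line.strip()]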

print(f"Total gold cases: {len(gold_rows)}\n")
print(f"{'Case':<10} {'Profile':<25} {'Intent':<22} {'Expected Tools'}")
print("-" * 90)
for row in gold_rows:
    print(
        f"{row['case_id']:<10} "
        f"{row['chat_profile']:<25} "
        f"{row.get('intent_class', ''):<22} "
        f"{row['expected_tools']}"
    )

---
## 2. Generate Traces (Step 1)

This runs the **real agent** against every gold query using the current system prompts and tool configs.

### Configuration

Edit the cell below to change model, temperature, and top-p before generating.

In [None]:
# ── Trace generation config ──────────────────────────────────────────────────
# Change these values to experiment with different models and parameters.

TRACE_MODEL       = None       # None = use AZURE_OPENAI_MODEL from .env (gpt-4o)
                                # Set to "gpt-4o-mini" or another deployment name to override

TRACE_TEMPERATURE = None       # None = model default. Range: 0.0 (deterministic) to 2.0 (creative)

TRACE_TOP_P       = None       # None = model default. Range: 0.0 to 1.0 (nucleus sampling)

TRACE_DELAY       = 3.0        # Seconds between queries (rate-limit courtesy for S0 tier)

PROFILE_FILTER    = None       # None = all profiles. Set to "Tactical Readiness AI" etc. for one profile

GOLD_DATASET      = "eval/datasets/sql_agent_gold_starter.jsonl"
TRACES_OUTPUT     = "eval/traces/agent_runs.jsonl"

# ─────────────────────────────────────────────────────────────────────────────
print("Trace generation config:")
print(f"  Model       : {TRACE_MODEL or os.getenv('AZURE_OPENAI_MODEL', '(not set)')}")
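# Compare against None explicitly: 0.0 is a valid setting but is falsy, so `or` would hide it.
print(f"  Temperature : {TRACE_TEMPERATURE if TRACE_TEMPERATURE is not None else '(model default)'}")
print(f"  Top-P       : {TRACE_TOP_P if TRACE_TOP_P is not None else '(model default)'}")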
print(f"  Delay       : {TRACE_DELAY}s")
print(f"  Profile     : {PROFILE_FILTER or '(all)'}")
print(f"  Gold dataset: {GOLD_DATASET}")
print(f"  Output      : {TRACES_OUTPUT}")
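
The ranges quoted in the config comments can be checked before a long run starts. The following is a minimal sketch, not part of the pipeline: `validate_sampling` is a hypothetical helper, and the bounds are simply the ones stated in the comments above (0.0 to 2.0 for temperature, 0.0 to 1.0 for top-p).

```python
def validate_sampling(temperature=None, top_p=None):
    """Return a list of problems; an empty list means the values look usable.

    Hypothetical helper. None means "use the model default" and is always accepted.
    """
    problems = []
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append(f"temperature {temperature} is outside 0.0-2.0")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append(f"top_p {top_p} is outside 0.0-1.0")
    return problems


for problem in validate_sampling(temperature=0.0, top_p=1.5):
    print("Config problem:", problem)
```

Calling it with `TRACE_TEMPERATURE` and `TRACE_TOP_P` right after the config cell would surface a typo before any query is sent.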

In [None]:
# ── Run trace generation ─────────────────────────────────────────────────────
# This calls generate_traces.py's generate() function directly (no subprocess).
# Top-level `await` works in Jupyter/IPython; in a plain script, wrap the call
# in asyncio.run(...) instead.
# Takes ~2-5 minutes depending on query count and rate limits.

from eval.generate_traces import generate

await generate(
    gold_path=Path(GOLD_DATASET),
    output_path=Path(TRACES_OUTPUT),
    profile_filter=PROFILE_FILTER,
    inter_query_delay=TRACE_DELAY,
    model_override=TRACE_MODEL,
    temperature=TRACE_TEMPERATURE,
    top_p=TRACE_TOP_P,
)

print("\n✓ Traces written to:", TRACES_OUTPUT)

### 2a. Inspect Generated Traces

In [None]:
traces_path = Path(TRACES_OUTPUT)
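# Skip blank lines so a stray empty line in the JSONL file does not break json.loads
traces = [json.loads(line) for line in traces_path.read_text(encoding="utf-8").splitlines() if line.strip()]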

# ── Run-level config from first trace ──
if traces:
    _t0 = traces[0]
    print("Run Config:")
    print(f"  Model       : {_t0.get('model', 'n/a')}")
    _temp = _t0.get("temperature")
    _topp = _t0.get("top_p")
    print(f"  Temperature : {_temp if _temp is not None else '(model default)'}")
    print(f"  Top-P       : {_topp if _topp is not None else '(model default)'}")
    _manifest = _t0.get("prompt_manifest", {})
    if _manifest:
        print(f"  Prompts     : {_manifest}")
    print()

print(f"Traces loaded: {len(traces)}\n")
print(f"{'Case':<10} {'Profile':<25} {'Tools Used':<45} {'Response Preview'}")
print("-" * 120)

for t in traces:
    tools_used = [e["name"] for e in t.get("tool_events", []) if e.get("name")]
    resp = (t.get("output") or "")[:50].replace("\n", " ")
    print(
        f"{t.get('case_id', '?'):<10} "
        f"{t.get('chat_profile', '?'):<25} "
        f"{str(tools_used):<45} "
        f"{resp}..."
    )

---
## 3. Run Evaluation & Upload to Portal (Step 2)

This step scores every trace and uploads the results:
- **Deterministic metrics** — required/forbidden tools, sequence, F1
- **AI judge metrics** — IntentResolution, TaskAdherence, ToolCallAccuracy
- **Portal upload** — results pushed to Azure AI Foundry for tracking
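
To make the deterministic metrics concrete, here is one plausible way they could be computed for a single trace. This is a sketch under stated assumptions, not the logic in `eval/foundry/run_eval.py`: it assumes F1 is computed over sets of tool names and that the sequence check requires the expected tools to appear in order (other calls may be interleaved). The function name `score_tools` is hypothetical; the output keys mirror the ones Section 3c reads.

```python
def score_tools(actual, expected, required, forbidden):
    """Score one trace's tool calls against its gold case (illustrative only)."""
    actual_set, expected_set = set(actual), set(expected)

    # Set-based precision/recall over tool names (assumption: duplicates are ignored).
    overlap = len(actual_set & expected_set)
    precision = overlap / len(actual_set) if actual_set else 0.0
    recall = overlap / len(expected_set) if expected_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Subsequence check: each expected tool must be found, in order, in the
    # actual calls. `tool in remaining` consumes the iterator up to the match.
    remaining = iter(actual)
    sequence_ok = all(tool in remaining for tool in expected)

    return {
        "required_tools_pass": set(required) <= actual_set,
        "forbidden_tools_pass": not (set(forbidden) & actual_set),
        "expected_sequence_pass": sequence_ok,
        "tool_f1": f1,
    }


print(score_tools(
    actual=["read_query", "semantic_search"],
    expected=["list_views", "read_query"],
    required=["read_query"],
    forbidden=["semantic_search"],
))
```

In this example the required tool was called, but a forbidden tool was also used and `list_views` was skipped, so only the required-tools check passes.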

In [None]:
import logging
import sys
from importlib import reload

# Suppress noisy SDK logging that floods notebook output and breaks
# VS Code's per-cell output routing during "Run All".
_noisy_loggers = ["httpx", "azure", "promptflow", "execution", "execution.bulk"]
_previous_levels = {name: logging.getLogger(name).level for name in _noisy_loggers}
for _name in _noisy_loggers:
    logging.getLogger(_name).setLevel(logging.WARNING)

import eval.foundry.run_eval as _run_eval_mod
reload(_run_eval_mod)  # pick up edits to run_eval.py without restarting the kernel
from eval.foundry.run_eval import run

config_path = Path("eval/foundry/config.yaml")
try:
    results = run(config_path)
finally:
    # Restore the previous levels so later cells can use logging normally,
    # even if the evaluation raised.
    for _name, _level in _previous_levels.items():
        logging.getLogger(_name).setLevel(_level)

sys.stdout.flush()
sys.stderr.flush()

print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)
print(f"Rows evaluated : {results.get('counts', {}).get('evaluated_rows', '?')}")
print(f"Output saved   : {results.get('paths', {}).get('output_json', 'n/a')}")

### 3a. Summary Metrics

In [None]:
import sys; sys.stdout.flush(); sys.stderr.flush()

summary = results.get("summary", {})
counts = results.get("counts", {})

print("Counts:")
for k, v in counts.items():
    print(f"  {k:<25} {v}")

print("\nDeterministic Metrics:")
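for k, v in summary.items():
    if k == "row_count":
        continue
    if isinstance(v, float) and k.endswith("_rate"):
        print(f"  {k:<35} {v:.1%}")  # pass rates read best as percentages
    elif isinstance(v, float):
        print(f"  {k:<35} {v:.2f}")  # other floats (e.g. avg_tool_f1) stay on a 0-1 scale
    else:
        print(f"  {k:<35} {v}")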

### 3b. Portal Results

In [None]:
import sys; sys.stdout.flush(); sys.stderr.flush()

portal = results.get("portal_evaluation", {})

print(f"Status : {portal.get('status', 'unknown')}")
print(f"Name   : {portal.get('evaluation_name', 'n/a')}")

# ── Config metadata from traces ──
print("\nRun Configuration (visible as Tags in Foundry portal):")
_traces_path = Path(results.get("paths", {}).get("traces_jsonl", TRACES_OUTPUT))
if _traces_path.exists():
    _first = json.loads(_traces_path.read_text().strip().splitlines()[0])
    print(f"  Model           : {_first.get('model', 'n/a')}")
    _temp = _first.get("temperature")
    _topp = _first.get("top_p")
    print(f"  Temperature     : {_temp if _temp is not None else '(model default)'}")
    print(f"  Top-P           : {_topp if _topp is not None else '(model default)'}")
    _manifest = _first.get("prompt_manifest", {})
    if _manifest:
        print("  Prompt versions :")
        for pk, pv in _manifest.items():
            print(f"    {pk}: {pv}")

studio_url = portal.get("studio_url")
if studio_url:
    print("\nView in Foundry Portal:")
    print(f"   {studio_url}")
else:
    print("\nNo studio URL returned.")
    if portal.get("error"):
        print(f"Error: {portal['error']}")

portal_metrics = portal.get("metrics")
if portal_metrics:
    print("\nPortal AI Judge Metrics:")
    for k, v in sorted(portal_metrics.items()):
        if isinstance(v, float):
            print(f"  {k:<45} {v:.4f}")
        else:
            print(f"  {k:<45} {v}")

### 3c. Per-Row Detail

In [None]:
import sys; sys.stdout.flush(); sys.stderr.flush()

for row in results.get("rows", []):
    case_id = row["case_id"]
    det = row.get("deterministic_metrics", {})
    ai = row.get("ai_judge_metrics", {})

    req_pass = "PASS" if det.get("required_tools_pass") else "FAIL"
    forb_pass = "PASS" if det.get("forbidden_tools_pass") else "FAIL"
    seq_pass = "PASS" if det.get("expected_sequence_pass") else "FAIL"
    f1 = det.get("tool_f1", 0)

    intent = ai.get("intent_resolution", {})
    task = ai.get("task_adherence", {})
    tool_acc = ai.get("tool_call_accuracy", {})

    print(f"\n{'─'*60}")
    print(f"{case_id}: {row['query'][:70]}")
    print(f"  Deterministic: req={req_pass}  forb={forb_pass}  seq={seq_pass}  F1={f1:.2f}")
    print(f"  Actual tools : {det.get('actual_tools', [])}")

    if det.get("required_missing"):
        print(f"  Missing required: {det['required_missing']}")
    if det.get("forbidden_hit"):
        print(f"  Forbidden used  : {det['forbidden_hit']}")

    # AI judge scores
    intent_score = intent.get("intent_resolution", "n/a")
    task_score = task.get("task_adherence", "n/a")
    tool_score = tool_acc.get("tool_call_accuracy", "n/a")
    print(f"  AI Judge     : intent={intent_score}  task={task_score}  tool_acc={tool_score}")

---
## 4. Compare Experiments

Load and compare multiple result files side-by-side.  
After each run, results are saved to `eval/results/foundry_eval_latest.json`.  
To preserve a baseline, copy the file before re-running:

```powershell
Copy-Item eval/results/foundry_eval_latest.json eval/results/baseline_gpt4o_t0.json
```
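
On systems without PowerShell, the same snapshot can be taken from a notebook cell. This is a minimal sketch: `snapshot_results` is a hypothetical helper, and it refuses to overwrite an existing baseline, which is a design choice of this example rather than documented pipeline behavior.

```python
import shutil
from pathlib import Path


def snapshot_results(label, latest="eval/results/foundry_eval_latest.json"):
    """Copy the latest results file to <label>.json in the same directory."""
    src = Path(latest)
    dst = src.with_name(f"{label}.json")
    if dst.exists():
        # Guard against silently replacing an earlier baseline.
        raise FileExistsError(f"{dst} already exists; pick another label")
    shutil.copy2(src, dst)  # copy2 also preserves the file's timestamps
    return dst


# Example (requires that an evaluation has already been run):
# snapshot_results("baseline_gpt4o_t0")
```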

In [None]:
# ── List available result files ──────────────────────────────────────────────
results_dir = Path("eval/results")
result_files = sorted(results_dir.glob("*.json"))

print("Available result files:")
for f in result_files:
    print(f"  {f.name}")

In [None]:
# ── Compare two result files ──────────────────────────────────────────────────
# Edit these paths to point to the two runs you want to compare.

FILE_A = "eval/results/foundry_eval_latest.json"  # e.g., baseline
FILE_B = None  # Set to a second file path to compare, e.g.:
# FILE_B = "eval/results/baseline_gpt4o_t0.json"


def _load_result(path_str):
    if not path_str:
        return None
    p = Path(path_str)
    if not p.exists():
        print(f"  File not found: {p}")
        return None
    return json.loads(p.read_text())


def _print_run(label, run_data, file_name):
    """Print config + metrics for a single run."""
    print(f"{label}: {file_name}")
    print(f"  Timestamp  : {run_data.get('timestamp_utc', 'n/a')}")
    print(f"  Rows       : {run_data.get('counts', {}).get('evaluated_rows', '?')}")

    # Config metadata from traces
    traces_file = run_data.get("paths", {}).get("traces_jsonl")
    if traces_file and Path(traces_file).exists():
        _t0 = json.loads(Path(traces_file).read_text().strip().splitlines()[0])
        _temp = _t0.get("temperature")
        _topp = _t0.get("top_p")
        print(f"  Model      : {_t0.get('model', 'n/a')}")
        print(f"  Temperature: {_temp if _temp is not None else '(default)'}")
        print(f"  Top-P      : {_topp if _topp is not None else '(default)'}")
        _manifest = _t0.get("prompt_manifest", {})
        if _manifest:
            for pk, pv in _manifest.items():
                print(f"  prompt.{pk}: {pv}")

    s = run_data.get("summary", {})
    print(f"  Req pass   : {s.get('required_tools_pass_rate', 0):.1%}")
    print(f"  Forb pass  : {s.get('forbidden_tools_pass_rate', 0):.1%}")
    print(f"  Seq pass   : {s.get('expected_sequence_pass_rate', 0):.1%}")
    print(f"  Avg F1     : {s.get('avg_tool_f1', 0):.2f}")

    portal = run_data.get("portal_evaluation", {})
    pm = portal.get("metrics", {})
    if pm:
        for k in sorted(pm):
            print(f"  {k}: {pm[k]}")


run_a = _load_result(FILE_A)
run_b = _load_result(FILE_B)

if run_a:
    _print_run("Run A", run_a, Path(FILE_A).name)

if run_b:
    print()
    _print_run("Run B", run_b, Path(FILE_B).name)

if run_a and run_b:
    print("\n" + "=" * 60)
    print("DELTA (B - A):")
    sa = run_a.get("summary", {})
    sb = run_b.get("summary", {})
    for key in ["required_tools_pass_rate", "forbidden_tools_pass_rate", "expected_sequence_pass_rate", "avg_tool_f1"]:
        va = sa.get(key, 0)
        vb = sb.get(key, 0)
        delta = vb - va
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "="
        print(f"  {key:<35} {delta:+.4f}  {arrow}")
elif not run_b:
    print("\nSet FILE_B to a second result file to see a comparison.")
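
Aggregate deltas hide which cases moved. A per-case view can be sketched as follows, assuming each result file has a `rows` list whose entries carry `case_id` and `deterministic_metrics`, as the Section 3c cell reads them. The helper name `changed_cases` is hypothetical.

```python
def changed_cases(run_a, run_b, metric="required_tools_pass"):
    """Map case_id -> (value_in_a, value_in_b) for cases whose metric differs."""

    def by_case(run_data):
        return {
            row["case_id"]: row.get("deterministic_metrics", {}).get(metric)
            for row in run_data.get("rows", [])
        }

    a, b = by_case(run_a), by_case(run_b)
    # Only compare cases present in both runs.
    return {cid: (a[cid], b[cid]) for cid in sorted(a.keys() & b.keys()) if a[cid] != b[cid]}


demo_a = {"rows": [{"case_id": "SQL-001", "deterministic_metrics": {"required_tools_pass": True}}]}
demo_b = {"rows": [{"case_id": "SQL-001", "deterministic_metrics": {"required_tools_pass": False}}]}
print(changed_cases(demo_a, demo_b))
```

With real data, passing the two loaded result dictionaries lists every case that flipped between runs for the chosen metric.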

---
## 5. Quick-Reference: Editing Prompts & Tools

### System Prompts (`config/prompts/system.yaml`)

Each profile has a `text` field with the system prompt and a `version` field.  
To iterate:
1. Edit the `text` under the profile you want to change
2. Bump the `version` (e.g., `1.0.0` → `1.1.0`)
3. Re-run **Section 2** (generate traces) then **Section 3** (evaluate)
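
The version bump in step 2 can be scripted if you iterate often. This sketch assumes versions are plain three-part `major.minor.patch` strings, as in the example above; `bump_version` is a hypothetical helper and does not edit the YAML file itself.

```python
def bump_version(version, part="minor"):
    """Return the next version string, resetting the lower-order parts to 0."""
    major, minor, patch = (int(n) for n in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")


print(bump_version("1.0.0"))
```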

### Tool Descriptions (`config/prompts/tools.yaml`)

Each tool has `description`, `usage_rules`, and `examples`.  
These are the **single source of truth** — changes here flow to both the agent at runtime and the evaluator.
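
After editing `tools.yaml`, it can help to confirm which tools a given profile will see. The sketch below works on the parsed YAML and uses the `tool_name` and `enabled_profiles` keys that Section 1b reads. Two assumptions are mine: that `enabled_profiles` holds the same identifiers you pass in, and that a tool with no `enabled_profiles` entry is enabled for no profile. The helper name is hypothetical.

```python
def tools_for_profile(tools_config, profile):
    """List tool names whose enabled_profiles include the given profile."""
    return [
        tool["tool_name"]
        for tool in tools_config.get("tools", [])
        if profile in tool.get("enabled_profiles", [])
    ]


# Inline stand-in for the parsed YAML; the profile identifiers are made up.
demo_config = {
    "tools": [
        {"tool_name": "read_query", "enabled_profiles": ["profile_a", "profile_b"]},
        {"tool_name": "semantic_search", "enabled_profiles": ["profile_b"]},
    ]
}
print(tools_for_profile(demo_config, "profile_a"))
```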

### Gold Dataset (`eval/datasets/sql_agent_gold_starter.jsonl`)

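One JSON object per line, one line per test case; `intent_class` is optional and only used for display in Section 1c. Required fields (pretty-printed here for readability; in the file each case stays on a single line):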
```json
{
  "case_id": "SQL-011",
  "chat_profile": "Tactical Readiness AI",
  "query": "Your test question",
  "expected_tools": ["list_views", "describe_table", "read_query"],
  "required_tools": ["read_query"],
  "forbidden_tools": ["semantic_search"]
}
```
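
A new case can be checked before it is appended to the dataset. This is a minimal sketch, assuming the six fields shown above are all required and that the three tool fields are lists of tool names; `check_gold_line` is a hypothetical helper, not something the pipeline provides.

```python
import json

REQUIRED_KEYS = {"case_id", "chat_profile", "query",
                 "expected_tools", "required_tools", "forbidden_tools"}
LIST_KEYS = ("expected_tools", "required_tools", "forbidden_tools")


def check_gold_line(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        case = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(case, dict):
        return ["line is not a JSON object"]

    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - case.keys())]
    for key in LIST_KEYS:
        if key in case and not isinstance(case[key], list):
            problems.append(f"{key} must be a list")

    # A tool cannot sensibly be both required and forbidden.
    required, forbidden = case.get("required_tools"), case.get("forbidden_tools")
    if isinstance(required, list) and isinstance(forbidden, list):
        clash = sorted(set(required) & set(forbidden))
        if clash:
            problems.append(f"tools both required and forbidden: {clash}")
    return problems


print(check_gold_line('{"case_id": "SQL-011", "query": "Your test question"}'))
```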

### Model Parameters

Edit `TRACE_MODEL`, `TRACE_TEMPERATURE`, and `TRACE_TOP_P` in the **Section 2** config cell.  
Each trace records which model/params produced it, so you can correlate scores with configuration.
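
The inspection cells in this notebook read the configuration from the first trace only, which assumes every trace in the file came from the same run. A quick way to check that assumption is to count distinct configurations. This sketch uses the `model`, `temperature`, and `top_p` trace fields read elsewhere in the notebook; `config_counts` is a hypothetical helper.

```python
from collections import Counter


def config_counts(traces):
    """Count traces per (model, temperature, top_p) combination."""
    return Counter(
        (t.get("model"), t.get("temperature"), t.get("top_p")) for t in traces
    )


demo_traces = [
    {"model": "gpt-4o", "temperature": None, "top_p": None},
    {"model": "gpt-4o", "temperature": None, "top_p": None},
    {"model": "gpt-4o-mini", "temperature": 0.0, "top_p": None},
]
counts = config_counts(demo_traces)
if len(counts) > 1:
    print("Mixed configurations in one trace file:", dict(counts))
```

More than one key in the result means scores from that file should not be attributed to a single configuration.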

### Tracking in Foundry Portal

Every evaluation run creates a uniquely named entry (e.g., `foundry-eval-20260223-211910`).  
In the portal under **Build → Evaluation**, select multiple runs to compare side-by-side.

---
## 6. File Map

| File | Purpose |
|------|--------|
| `config/prompts/system.yaml` | System prompts per profile |
| `config/prompts/tools.yaml` | Tool descriptions & usage rules (single source of truth) |
| `eval/datasets/sql_agent_gold_starter.jsonl` | Gold test cases with expected tools |
| `eval/generate_traces.py` | Headless agent runner (Step 1) |
| `eval/foundry/run_eval.py` | Evaluation scorer + portal upload (Step 2) |
| `eval/foundry/config.yaml` | Eval runner config (paths, metrics, Azure settings) |
| `eval/traces/agent_runs.jsonl` | Generated traces (output of Step 1) |
| `eval/results/foundry_eval_latest.json` | Full local results (output of Step 2) |