# Azure OpenAI Migration - Evaluation Guide

This notebook demonstrates how to **evaluate your model migration** using two complementary approaches:

| Approach | When to use | Speed | Where results live |
|----------|-------------|-------|-------------------|
| **Local (SDK)** | Quick prototyping, small test sets | Fast (seconds) | In this notebook |
| **Cloud (Foundry)** | Full A/B testing, CI/CD, final decision | ~1 min per model | Azure AI Foundry portal |

**Per scenario, we run:**
1. **Quick local eval**: the `azure-ai-evaluation` SDK on the first 3 test cases for fast feedback
2. **Full cloud eval**: the Foundry Evals API on all test cases for the official comparison

> **Microsoft recommendation**: *"Run local evaluations on small test data to assess prototypes, then move into predeployment testing to run evaluations on a large dataset."*

## Pre-built scenarios

| Scenario | Use Case | Key Metrics |
|----------|----------|-------------|
| **RAG** | Q&A over documents | Groundedness, Relevance, Coherence |
| **Tool Calling** | Function/tool selection | Tool Accuracy, Parameter Accuracy |
| **Translation** | Multilingual translation | Fluency, Coherence, Relevance |
| **Classification** | Sentiment / categorization | Accuracy, Consistency |

## 1. Setup

In [None]:
# Install dependencies
# Quote each requirement so the shell does not treat ">" as an output redirect
# %pip install "openai>=1.40.0" "azure-identity>=1.15.0" "python-dotenv>=1.0.0" -q
# %pip install "azure-ai-evaluation>=1.0.0" "azure-ai-projects>=2.0.0b3" -q

In [None]:
import sys
import os
import logging

# Add project root to path so we can import src/
sys.path.insert(0, os.path.abspath('..'))
if os.path.basename(os.getcwd()) != 'AOAI-models-migration':
    sys.path.insert(0, os.getcwd())

from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, Markdown

# Suppress verbose SDK logs (promptflow execution.bulk, Run Summary, etc.)
# Must be set BEFORE importing azure.ai.evaluation so child processes inherit it.
os.environ["PF_LOGGING_LEVEL"] = "CRITICAL"
for _name in ("azure.ai.evaluation", "promptflow", "execution", "execution.bulk",
              "azure.core.pipeline.policies.http_logging_policy"):
    logging.getLogger(_name).setLevel(logging.CRITICAL)

# Verify configuration
from src.config import load_config, MODEL_REGISTRY, MIGRATION_PATHS
config = load_config()
display(Markdown(f"**Endpoint:** `{config['endpoint'][:50]}...`"))
display(Markdown(f"**Deployments:** `{list(config['deployments'].keys())}`"))

# SDK model config for local evaluation (same evaluators as Foundry cloud)
from src.evaluate.local_eval import get_model_config, compare_local
model_config = get_model_config()
display(Markdown(f"**Local eval model:** `{model_config['azure_deployment']}`"))

# Foundry client for cloud evaluation
from src.evaluate.foundry import FoundryEvalsClient, FOUNDRY_AVAILABLE
foundry = None
if FOUNDRY_AVAILABLE and os.getenv("AZURE_AI_PROJECT_ENDPOINT"):
    foundry = FoundryEvalsClient()
    display(Markdown(f"**Foundry endpoint:** `{os.getenv('AZURE_AI_PROJECT_ENDPOINT')[:60]}...`"))
else:
    display(Markdown("**Foundry** not configured — cloud evaluation cells will be skipped"))


def foundry_compare(source_items, target_items, metrics,
                    scenario="eval",
                    source_label="gpt-4o", target_label="gpt-4.1"):
    """Run Foundry cloud eval on source vs target items, display summary.

    Both runs share the same timestamp so they can be identified as a pair
    in the Foundry portal. Run names follow the pattern:
        {scenario}_{model}_{timestamp}
    Example: rag_gpt-4o_20260217_103700 / rag_gpt-4.1_20260217_103700
    """
    from datetime import datetime

    # Shared timestamp to link the pair of runs
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    pair_id = f"{scenario}_{source_label}_vs_{target_label}_{ts}"

    display(Markdown(f"**Comparison pair:** `{pair_id}`"))

    src_name = f"{scenario}_{source_label}_{ts}"
    tgt_name = f"{scenario}_{target_label}_{ts}"

    display(Markdown(f"#### Source: {source_label} ({len(source_items)} items) — `{src_name}`"))
    source_result = foundry.evaluate_items(
        source_items, metrics=metrics, eval_name=src_name,
    )

    display(Markdown(f"#### Target: {target_label} ({len(target_items)} items) — `{tgt_name}`"))
    target_result = foundry.evaluate_items(
        target_items, metrics=metrics, eval_name=tgt_name,
    )

    # Summary as markdown table
    lines = [
        f"### Foundry Comparison: {source_label} vs {target_label} (`{scenario}`)\n",
        f"| Metric | {source_label} | {target_label} | Delta |",
        "|--------|-------:|-------:|------:|",
    ]
    for metric in metrics:
        src = [s[metric]["score"] for s in source_result["scores"] if s.get(metric, {}).get("score") is not None]
        tgt = [s[metric]["score"] for s in target_result["scores"] if s.get(metric, {}).get("score") is not None]
        # A metric with no scores on either side would otherwise average to a misleading 0.00
        if not src or not tgt:
            lines.append(f"| {metric} | n/a | n/a | n/a |")
            continue
        src_avg = sum(src) / len(src)
        tgt_avg = sum(tgt) / len(tgt)
        delta = tgt_avg - src_avg  # positive delta means the target model scored higher
        lines.append(f"| {metric} | {src_avg:.2f} | {tgt_avg:.2f} | {delta:+.2f} |")

    portal_src = source_result.get('report_url', 'N/A')
    portal_tgt = target_result.get('report_url', 'N/A')
    lines.append(f"\n[Source portal]({portal_src}) | [Target portal]({portal_tgt})")
    display(Markdown("\n".join(lines)))

    return {"source": source_result, "target": target_result, "pair_id": pair_id}

## 2. Scenario 1: RAG (Retrieval-Augmented Generation)

Tests whether the new model generates responses that are **grounded in the provided context**.

**Key question**: Does the new model hallucinate more or less than the old one?

### Pre-built examples include:
- Company policy documents (remote work, data retention)
- Technical documentation (API configuration, webhooks)
- Financial reports
- Questions NOT in context (tests hallucination resistance)

In [None]:
from src.evaluate.scenarios.rag import RAG_TEST_CASES, create_rag_evaluator

# Preview the test data
lines = [f"**RAG test cases: {len(RAG_TEST_CASES)}**\n"]
for i, tc in enumerate(RAG_TEST_CASES[:3]):
    lines.append(f"---\n**Test {i+1}**")
    lines.append(f"- **Question:** {tc.prompt}")
    lines.append(f"- **Context:** `{tc.context[:100]}...`")
    lines.append(f"- **Expected:** `{tc.expected_output[:100]}...`\n")
display(Markdown("\n".join(lines)))

In [None]:
# Step 1: Collect responses from both models (all test cases)
from src.evaluate.scenarios.rag import RAG_TEST_CASES, create_rag_evaluator

rag_evaluator = create_rag_evaluator(
    source_model="gpt-4o",
    target_model="gpt-4.1",
    source_deployment=os.getenv("GPT4O_DEPLOYMENT"),
    target_deployment=os.getenv("GPT41_DEPLOYMENT"),
)

rag_source_all, rag_target_all = rag_evaluator.collect()

# Step 2: Quick local eval on first 3 test cases (SDK azure-ai-evaluation)
display(Markdown(f"### Quick local eval (3 / {len(RAG_TEST_CASES)} test cases)"))
rag_local = compare_local(
    rag_source_all[:3], rag_target_all[:3],
    metrics=["coherence", "fluency", "relevance", "groundedness"],
    model_config=model_config,
    source_label="gpt-4o",
    target_label="gpt-4.1",
)

In [None]:
# Full cloud eval — RAG (all 8 test cases scored in Foundry)
if foundry:
    rag_cloud = foundry_compare(
        rag_source_all, rag_target_all,
        metrics=["coherence", "fluency", "relevance", "groundedness"],
        scenario="rag",
        source_label="gpt-4o", target_label="gpt-4.1",
    )
else:
    display(Markdown("**Foundry not configured** — skipping cloud evaluation.  \nSet `AZURE_AI_PROJECT_ENDPOINT` in `.env` and install `azure-ai-projects`."))

## 3. Scenario 2: Tool Calling

Tests whether the new model **selects the correct tools** and **extracts parameters accurately**.

### Pre-built examples include:
- Weather queries (simple tool selection)
- Product search (multiple parameter extraction)
- Calendar events (complex parameters with attendees, location)
- Email composition (content generation + tool call)
- Stock inventory check (product and warehouse extraction)
- Ambiguous requests (tests tool selection logic)
- No-tool-needed queries (should respond directly)
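
The deterministic `tool_accuracy` and `param_accuracy` scores shown in the next cells are computed inside `src/`, which this notebook does not display. As a rough illustration of what such a check can look like, here is a minimal sketch. The function `score_tool_call`, the 0.0 to 1.0 scale, the exact-match rule, and the `get_weather` tool with its parameters are all assumptions made for this example, not the project's actual implementation.

```python
from typing import Any, Optional


def score_tool_call(
    expected_tool: Optional[str],
    expected_params: dict[str, Any],
    actual_tool: Optional[str],
    actual_params: dict[str, Any],
) -> dict[str, float]:
    """Score one tool call on a 0.0-1.0 scale (hypothetical scoring rule)."""
    # Tool accuracy: exact match on the tool name. None means "no tool call",
    # so a direct answer counts as correct when no tool was expected.
    tool_accuracy = 1.0 if actual_tool == expected_tool else 0.0

    # Parameter accuracy: fraction of expected parameters reproduced exactly.
    # It is forced to 0.0 when the wrong tool was selected.
    if tool_accuracy == 0.0:
        param_accuracy = 0.0
    elif not expected_params:
        param_accuracy = 1.0
    else:
        matched = sum(
            1 for key, value in expected_params.items()
            if actual_params.get(key) == value
        )
        param_accuracy = matched / len(expected_params)

    return {"tool_accuracy": tool_accuracy, "param_accuracy": param_accuracy}


# Hypothetical example: right tool, one of two parameters wrong.
scores = score_tool_call(
    expected_tool="get_weather",
    expected_params={"city": "Paris", "unit": "celsius"},
    actual_tool="get_weather",
    actual_params={"city": "Paris", "unit": "fahrenheit"},
)
print(scores)  # {'tool_accuracy': 1.0, 'param_accuracy': 0.5}
```

Exact matching is strict on purpose: it is cheap, reproducible, and needs no judge model. It can under-score equivalent values such as `"Paris"` and `"paris"`, so some normalization may be worth adding for your own data.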

In [None]:
from src.evaluate.scenarios.tool_calling import (
    TOOL_CALLING_TEST_CASES, SAMPLE_TOOLS, create_tool_calling_evaluator
)

# Preview available tools
lines = ["**Available tools:**\n"]
for tool in SAMPLE_TOOLS:
    name = tool["function"]["name"]
    desc = tool["function"]["description"]
    lines.append(f"- `{name}`: {desc}")

lines.append(f"\n**Test cases: {len(TOOL_CALLING_TEST_CASES)}**\n")
for i, tc in enumerate(TOOL_CALLING_TEST_CASES[:3]):
    lines.append(f"{i+1}. {tc.prompt[:60]}... — expected: `{tc.metadata['expected_tool']}`")
display(Markdown("\n".join(lines)))

In [None]:
# Step 1: Collect responses from both models (all test cases, with tools)
from src.evaluate.scenarios.tool_calling import (
    TOOL_CALLING_TEST_CASES, create_tool_calling_evaluator
)

tc_evaluator = create_tool_calling_evaluator(
    source_model="gpt-4o",
    target_model="gpt-4.1",
    source_deployment=os.getenv("GPT4O_DEPLOYMENT"),
    target_deployment=os.getenv("GPT41_DEPLOYMENT"),
)

tc_source_all, tc_target_all = tc_evaluator.collect()

# Step 2: Show deterministic scores (tool_accuracy, param_accuracy) on first 3
lines = [
    f"### Deterministic scores (3 / {len(TOOL_CALLING_TEST_CASES)} test cases)\n",
    "| Test | Metric | Source | Target |",
    "|------|--------|-------:|-------:|",
]
for i in range(min(3, len(tc_source_all))):
    desc = TOOL_CALLING_TEST_CASES[i].metadata.get("description", "")[:45]
    for metric in ["tool_accuracy", "param_accuracy"]:
        s = tc_source_all[i]["_scores"].get(metric, 0)
        t = tc_target_all[i]["_scores"].get(metric, 0)
        lines.append(f"| {desc} | {metric} | {s:.1f} | {t:.1f} |")
        desc = ""
display(Markdown("\n".join(lines)))

# Step 3: Quick local eval with SDK (text quality on first 3)
display(Markdown("### SDK local eval (text quality)"))
tc_local = compare_local(
    tc_source_all[:3], tc_target_all[:3],
    metrics=["coherence", "relevance"],
    model_config=model_config,
    source_label="gpt-4o",
    target_label="gpt-4.1",
)

In [None]:
# Full cloud eval — Tool Calling (all 8 test cases scored in Foundry)
if foundry:
    tc_cloud = foundry_compare(
        tc_source_all, tc_target_all,
        metrics=["coherence", "relevance", "tool_call_accuracy"],
        scenario="tool_calling",
        source_label="gpt-4o", target_label="gpt-4.1",
    )
else:
    display(Markdown("**Foundry not configured** — skipping cloud evaluation.  \nSet `AZURE_AI_PROJECT_ENDPOINT` in `.env` and install `azure-ai-projects`."))

## 4. Scenario 3: Translation

Tests whether the new model maintains **translation quality** across languages.

### Pre-built examples include:
- FR→EN: Business, legal, idiomatic expressions, cultural references
- EN→FR: Technical docs, marketing, UI text, medical
- EN→DE: Informal/conversational
- Tricky cases: homographs, ambiguous sentences

In [None]:
from src.evaluate.scenarios.translation import TRANSLATION_TEST_CASES, create_translation_evaluator

# Preview test cases
lines = [f"**Translation test cases: {len(TRANSLATION_TEST_CASES)}**\n",
         "| # | Direction | Domain | Difficulty | Source |",
         "|---|-----------|--------|------------|--------|"]
for i, tc in enumerate(TRANSLATION_TEST_CASES):
    meta = tc.metadata
    lines.append(f"| {i+1} | {meta.get('direction', '?')} | {meta.get('domain', '?')} | {meta.get('difficulty', '?')} | {tc.prompt[:50]}... |")
display(Markdown("\n".join(lines)))

In [None]:
# Step 1: Collect responses from both models (all test cases)
from src.evaluate.scenarios.translation import TRANSLATION_TEST_CASES, create_translation_evaluator

trans_evaluator = create_translation_evaluator(
    source_model="gpt-4o",
    target_model="gpt-4.1",
    source_deployment=os.getenv("GPT4O_DEPLOYMENT"),
    target_deployment=os.getenv("GPT41_DEPLOYMENT"),
)

trans_source_all, trans_target_all = trans_evaluator.collect()

# Step 2: Quick local eval on first 3 test cases (SDK azure-ai-evaluation)
display(Markdown(f"### Quick local eval (3 / {len(TRANSLATION_TEST_CASES)} test cases)"))
trans_local = compare_local(
    trans_source_all[:3], trans_target_all[:3],
    metrics=["coherence", "fluency", "relevance"],
    model_config=model_config,
    source_label="gpt-4o",
    target_label="gpt-4.1",
)

In [None]:
# Full cloud eval — Translation (all 10 test cases scored in Foundry)
if foundry:
    trans_cloud = foundry_compare(
        trans_source_all, trans_target_all,
        metrics=["coherence", "fluency", "relevance"],
        scenario="translation",
        source_label="gpt-4o", target_label="gpt-4.1",
    )
else:
    display(Markdown("**Foundry not configured** — skipping cloud evaluation.  \nSet `AZURE_AI_PROJECT_ENDPOINT` in `.env` and install `azure-ai-projects`."))

## 5. Scenario 4: Classification / Sentiment Analysis

Tests whether the new model maintains **classification accuracy and consistency**.

### Pre-built examples include:
- Sentiment analysis (positive/negative/neutral, sarcasm detection)
- Support ticket classification (billing, technical, account, shipping)
- Intent classification (flight booking, cancellation, baggage)
- IT incident priority (P1-P4)
- Edge cases: mixed sentiment, ambiguous tickets
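
The evaluator below is created with `consistency_runs=3`, meaning each prompt is classified several times. How `src/` turns those repeated runs into a consistency score is not shown in this notebook. One plausible definition, sketched here as an assumption, is the share of runs that agree with the most common label. The function name `consistency` and the lowercase normalization are illustrative choices.

```python
from collections import Counter


def consistency(predictions: list[str]) -> float:
    """Share of runs that agree with the most common label (1.0 = fully stable)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    # Normalize so that "Negative" and "negative" count as the same label.
    normalized = [p.strip().lower() for p in predictions]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# Hypothetical labels from three runs of the same prompt.
runs = ["negative", "Negative", "neutral"]
print(round(consistency(runs), 2))  # 0.67
```

A model can be accurate on average and still unstable on individual prompts, which is why consistency is worth tracking next to accuracy when comparing the source and target deployments.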

In [None]:
from src.evaluate.scenarios.classification import (
    CLASSIFICATION_TEST_CASES, create_classification_evaluator
)

# Preview test cases by task type
from collections import Counter
tasks = Counter(tc.metadata.get('task', '?') for tc in CLASSIFICATION_TEST_CASES)

lines = [f"**Classification test cases: {len(CLASSIFICATION_TEST_CASES)}**",
         f"By task: `{dict(tasks)}`\n",
         "| # | Task | Expected | Prompt |",
         "|---|------|----------|--------|"]
for i, tc in enumerate(CLASSIFICATION_TEST_CASES[:5]):
    lines.append(f"| {i+1} | {tc.metadata.get('task')} | `{tc.ground_truth_label}` | {tc.prompt[:60]}... |")
display(Markdown("\n".join(lines)))

In [None]:
# Step 1: Collect responses from both models (all test cases)
from src.evaluate.scenarios.classification import (
    CLASSIFICATION_TEST_CASES, create_classification_evaluator
)

cls_evaluator = create_classification_evaluator(
    source_model="gpt-4o",
    target_model="gpt-4.1",
    source_deployment=os.getenv("GPT4O_DEPLOYMENT"),
    target_deployment=os.getenv("GPT41_DEPLOYMENT"),
    consistency_runs=3,
)

cls_source_all, cls_target_all = cls_evaluator.collect()

# Step 2: Deterministic accuracy on first 3 test cases
lines = [
    f"### Classification accuracy (3 / {len(CLASSIFICATION_TEST_CASES)} test cases)\n",
    "| Test | Expected | Source | Target |",
    "|------|----------|--------|--------|",
]
for i in range(min(3, len(cls_source_all))):
    prompt = CLASSIFICATION_TEST_CASES[i].prompt[:45]
    expected = CLASSIFICATION_TEST_CASES[i].ground_truth_label or "?"
    src_pred = cls_source_all[i]["_prediction"]
    tgt_pred = cls_target_all[i]["_prediction"]
    src_ok = "ok" if cls_source_all[i]["_accuracy"] >= 4.0 else "MISS"
    tgt_ok = "ok" if cls_target_all[i]["_accuracy"] >= 4.0 else "MISS"
    lines.append(f"| {prompt}... | `{expected}` | {src_ok} `{src_pred}` | {tgt_ok} `{tgt_pred}` |")
display(Markdown("\n".join(lines)))

# Step 3: F1 score via SDK (token-level match against ground truth)
display(Markdown("### SDK local eval (F1 score)"))
cls_local = compare_local(
    cls_source_all[:3], cls_target_all[:3],
    metrics=["f1_score"],
    model_config=model_config,
    source_label="gpt-4o",
    target_label="gpt-4.1",
)

In [None]:
# Full cloud eval: Classification (all test cases)
# Note: Foundry doesn't have a built-in classification accuracy metric.
# We use deterministic accuracy (computed locally) + relevance for cloud comparison.

# Full deterministic accuracy table (all test cases).
# It needs no Foundry access, so it runs even when Foundry is not configured.
lines = [
    f"### Full deterministic accuracy ({len(cls_source_all)} test cases)\n",
    "| # | Task | Expected | Source | Target |",
    "|---|------|----------|--------|--------|",
]
src_correct = tgt_correct = 0
for i in range(len(cls_source_all)):
    task = CLASSIFICATION_TEST_CASES[i].metadata.get("task", "?")
    expected = CLASSIFICATION_TEST_CASES[i].ground_truth_label or "?"
    src_pred = cls_source_all[i]["_prediction"]
    tgt_pred = cls_target_all[i]["_prediction"]
    src_ok = cls_source_all[i]["_accuracy"] >= 4.0
    tgt_ok = cls_target_all[i]["_accuracy"] >= 4.0
    src_correct += src_ok
    tgt_correct += tgt_ok
    src_mark = "ok" if src_ok else "**MISS**"
    tgt_mark = "ok" if tgt_ok else "**MISS**"
    lines.append(f"| {i+1} | {task} | `{expected}` | {src_mark} `{src_pred}` | {tgt_mark} `{tgt_pred}` |")

n = len(cls_source_all)
lines.append(f"\n**Accuracy:** source={src_correct}/{n} ({100*src_correct/n:.0f}%), target={tgt_correct}/{n} ({100*tgt_correct/n:.0f}%)")
display(Markdown("\n".join(lines)))

if foundry:
    cls_cloud = foundry_compare(
        cls_source_all, cls_target_all,
        metrics=["relevance"],
        scenario="classification",
        source_label="gpt-4o", target_label="gpt-4.1",
    )
else:
    display(Markdown("**Foundry not configured** — skipping cloud evaluation.  \nSet `AZURE_AI_PROJECT_ENDPOINT` in `.env` and install `azure-ai-projects`."))

## 6. Decision Guide

### Local vs Cloud evaluation

| | Local (SDK) | Cloud (Foundry) |
|---|---|---|
| **Speed** | Seconds | ~1 minute per model |
| **Results** | In notebook only | In Foundry portal, shareable |
| **Best for** | Quick iteration, prototyping | Final decision, CI/CD gates |
| **Metrics** | Same algorithms as Foundry | Authoritative scores |
| **Custom metrics** | Deterministic only (tool accuracy, etc.) | Built-in evaluators only |

### When is migration safe?

| Result | Action |
|--------|--------|
| **0 regressions** | Safe to migrate. Proceed with confidence. |
| **< 10% regressions** | Caution. Review flagged cases. Consider prompt tuning. |
| **≥ 10% regressions** | Risky. Investigate root causes before proceeding. |
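
The table above depends on a regression rate, which the notebook does not define explicitly. A minimal sketch under stated assumptions: a test case counts as a regression when the target score falls below the source score by more than a tolerance, and a rate of exactly 10% is treated as risky. The function names, the tolerance parameter, and the sample scores are hypothetical.

```python
def regression_rate(source_scores, target_scores, tolerance=0.0):
    """Fraction of paired test cases where the target scores worse than the source."""
    if len(source_scores) != len(target_scores):
        raise ValueError("source and target must cover the same test cases")
    if not source_scores:
        return 0.0
    regressions = sum(
        1 for s, t in zip(source_scores, target_scores) if t < s - tolerance
    )
    return regressions / len(source_scores)


def migration_action(rate):
    """Map a regression rate to the three tiers of the decision table."""
    if rate == 0:
        return "safe"
    if rate < 0.10:
        return "caution"
    return "risky"  # assumption: exactly 10% falls in the risky tier


# Illustrative per-test-case scores for one metric (made-up numbers).
source = [5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4]
target = [5, 4, 4, 4, 5, 5, 4, 5, 5, 4, 5, 4]

rate = regression_rate(source, target)
print(f"{rate:.1%} -> {migration_action(rate)}")  # 8.3% -> caution
```

With small test sets a single regression moves the rate a lot (1 of 8 cases is already 12.5%), so the tiers are most meaningful on the full cloud evaluation rather than the 3-case local run.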

### Recommended regression thresholds

| Metric | Acceptable delta (target - source) | Action if exceeded |
|--------|-----------------|-------------------|
| Coherence | -0.5 | Review response structure |
| Fluency | -0.5 | Check language quality |
| Relevance | -1.0 | Review prompt/system message |
| Groundedness | -0.5 | Critical for RAG — investigate hallucinations |
| Tool Accuracy | Any drop | Critical — wrong tool = wrong action |
| Classification Accuracy | Any drop | Review system prompt clarity |
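
These thresholds can be applied mechanically to the per-metric averages of a comparison. The sketch below assumes the delta is computed as target minus source (as in `foundry_compare`), reads "Any drop" as a floor of 0.0, and treats a delta exactly equal to the floor as acceptable. The metric keys, `flag_regressions`, and the sample averages are hypothetical.

```python
# Lowest acceptable delta (target - source) per metric, taken from the table above.
THRESHOLDS = {
    "coherence": -0.5,
    "fluency": -0.5,
    "relevance": -1.0,
    "groundedness": -0.5,
    "tool_accuracy": 0.0,            # "Any drop"
    "classification_accuracy": 0.0,  # "Any drop"
}


def flag_regressions(source_avg, target_avg, thresholds=THRESHOLDS):
    """Return {metric: delta} for every metric whose delta falls below its floor."""
    flagged = {}
    for metric, floor in thresholds.items():
        # Only compare metrics that were measured for both models.
        if metric not in source_avg or metric not in target_avg:
            continue
        delta = target_avg[metric] - source_avg[metric]
        if delta < floor:
            flagged[metric] = round(delta, 2)
    return flagged


# Illustrative averages (made-up numbers, not real evaluation results).
source_avg = {"coherence": 4.6, "relevance": 4.4, "groundedness": 4.8}
target_avg = {"coherence": 4.5, "relevance": 4.2, "groundedness": 4.0}

print(flag_regressions(source_avg, target_avg))  # {'groundedness': -0.8}
```

An empty result means every measured metric stayed within its threshold. A check like this could serve as a CI/CD gate, failing the pipeline when the returned dictionary is non-empty.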

### How to adapt with your own data

1. **Pick the closest scenario** from the 4 above
2. **Replace test data** with your production prompts:

```python
from src.evaluate.core import TestCase

my_tests = [
    TestCase(
        prompt="Your question here",
        context="Your document context...",
        expected_output="Expected answer",
    ),
]
```

3. **Create evaluator** with your test cases: `create_rag_evaluator("gpt-4o", "gpt-4.1", test_cases=my_tests)`
4. **Run the same flow**: `collect()` → `compare_local()` → `foundry_compare()`
5. **Iterate** until regressions are within acceptable thresholds