# Complaint Agent: Model Comparison & Selection

This notebook demonstrates how to compare LLMs for an agentic application using MLflow's evaluation and judge alignment capabilities.

**Scenario:** You've built a prototype complaint triage agent but aren't sure which model to use. Databricks is model-agnostic, so you can easily test models from different providers.

**What you'll do:**
1. Define a lightweight agent for inner-loop development (no deployment required)
2. Create an evaluation dataset with diverse complaint scenarios
3. Run a baseline evaluation on one model using Agent-as-a-Judge and Guidelines scorers
4. Review results and add human feedback via MLflow UI
5. Align the judge with human feedback to improve accuracy
6. Compare additional models with the aligned judge and visualize results
7. Register the winning model configuration

**Prerequisites:** This demo assumes UC tools have been created by running `stages/complaint_agent.ipynb` first.

## 1. Setup & Prerequisites

Install required packages and verify that UC tools exist. This is inner-loop development, so no deployment or ResponsesAgent wrapper is needed.

In [None]:
%pip install -U -qqqq mlflow[databricks] langgraph==0.3.4 databricks-langchain plotly
dbutils.library.restartPython()

In [None]:
# Create widget for catalog name (will be populated by bundle parameter)
dbutils.widgets.text("CATALOG", "dlcaspers", "Catalog Name")

In [None]:
CATALOG = dbutils.widgets.get("CATALOG")
if not CATALOG:
    raise ValueError("Please provide a CATALOG name in the widget above")
print(f"Using catalog: {CATALOG}")

In [None]:
# Verify UC tools exist
print("Checking for required UC functions...")

required_functions = ['get_order_overview', 'get_order_timing', 'get_location_timings']
missing_functions = []

for func_name in required_functions:
    try:
        spark.sql(f"DESCRIBE FUNCTION {CATALOG}.ai.{func_name}").collect()
        print(f"✅ {func_name} found")
    except Exception as e:
        print(f"❌ {func_name} not found")
        missing_functions.append(func_name)

if missing_functions:
    raise RuntimeError(
        f"Missing UC functions: {', '.join(missing_functions)}. "
        f"Run ../../stages/complaint_agent.ipynb first to create these tools."
    )

print("\n✅ All required UC functions found")

In [None]:
import mlflow

# Create experiment for model comparison evaluations
experiment_name = f"/Shared/{CATALOG}_model_comparison"
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id

print(f"✅ Using experiment: {experiment_name}")
print(f"   Experiment ID: {experiment_id}")

## 2. Define Inner-Loop Agent

Create a lightweight version of the complaint agent that takes `model_endpoint` as a parameter. This version:
- Uses the same UC tools and system prompt as the production version
- Returns results as a dict (no ResponsesAgent wrapper)
- Can be quickly instantiated with different models for comparison

This is perfect for rapid iteration during model selection.

In [None]:
import json
from typing import Annotated, Any, Optional, Sequence, TypedDict, Literal, cast
from uuid import uuid4

from databricks_langchain import (
    ChatDatabricks,
    DatabricksFunctionClient,
    UCFunctionToolkit,
    set_uc_function_client,
)
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.runnables import RunnableConfig, RunnableLambda
from langchain_core.tools import BaseTool
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt.tool_node import ToolNode

# Set up UC function client
client = DatabricksFunctionClient()
set_uc_function_client(client)

# Define response schema (same as production)
class ComplaintResponse(TypedDict):
    order_id: str
    complaint_category: Literal["delivery_delay", "missing_items", "food_quality", "service_issue", "billing", "other"]
    decision: Literal["suggest_credit", "escalate"]
    credit_amount: Optional[float]
    confidence: Optional[Literal["high", "medium", "low"]]
    priority: Optional[Literal["standard", "urgent"]]
    rationale: str

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]

# System prompt (same as production)
SYSTEM_PROMPT = """You are the Complaint Triage Agent for Casper's Kitchens, a ghost kitchen network. Your job is to analyze customer complaints and recommend actions for internal customer service staff.

PROCESS:
1. Extract the order_id from the complaint text
2. Call get_order_overview(order_id) to get order details and items
3. Call get_order_timing(order_id) to get delivery timing
4. For delivery delay complaints, call get_location_timings(location) to get percentile benchmarks
5. Analyze the data and make a recommendation

DECISION FRAMEWORK:

SUGGEST_CREDIT - Use when you can make a data-backed recommendation:

Delivery delays (high confidence when data-backed):
- Compare actual delivery time to location percentiles
- >P75 but <P90: Suggest 15% of order total
- >P90 but <P99: Suggest 25% of order total  
- >P99: Suggest 50% of order total
- On-time delivery: Suggest $0 credit (low confidence - might be other issues)

Missing items:
- Parse items_json from get_order_overview to find actual item prices
- Use real item costs when available for accurate refund
- Check if claimed missing item actually makes sense for this order
- If item plausibly missing but can't verify price: estimate $8-12 per item (medium confidence)
- If no evidence of missing items: $0 credit (low confidence)

Food quality issues:
- If specific details provided (cold, soggy, wrong preparation): $10-15 (medium confidence)
- If vague ("bad", "gross"): escalate with priority="standard" instead
- If food quality complaint but order shows delivery delay >P90: consider combined credit

ESCALATE - Use when human judgment is needed:
- priority="standard": Vague complaints, missing data, edge cases, billing issues, service complaints
- priority="urgent": Legal threats, health/safety concerns, suspected fraud, abusive language

OUTPUT RULES:
- For suggest_credit: Include credit_amount (can be $0) and confidence, set priority to null
- For escalate: Include priority, set credit_amount and confidence to null
- Always include a data-focused rationale citing specific numbers (delivery times, percentiles, order details, item verification)
- Rationale is for internal staff - be factual, not apologetic

RATIONALE GUIDELINES:
- Rationale should clearly articulate the justification for the decision
- All evidence used to support the decision should be cited in the rationale. Rationale MUST cite evidence.
- Rationale should be detailed and specific, but no longer than 150 words
- Rationale should clearly justify the decision, credit amount (if applicable), confidence level, and priority (if applicable)
- If a refund is suggested, the rationale should clearly justify the credit amount, based on the evidence provided.

CONFIDENCE GUIDELINES:
- high: Strong data support (delivery >P90, item prices from order data, clear verification)
- medium: Reasonable inference (delivery P75-P90, estimated item costs, plausible but unverified)
- low: Weak or contradictory evidence ($0 credits when data doesn't support claim, can't verify complaint details)

IMPORTANT NOTES:
- If order_id not found or data unavailable: escalate with priority="standard"
- Round credit amounts to nearest $0.50
- When in doubt between two options, prefer escalate with priority="standard"
- A suggest_credit with $0 and low confidence is valid when complaint seems unfounded
"""

RESPONSE_FIELDS = {"order_id", "complaint_category", "decision", "rationale"}

def parse_structured_response(obj) -> ComplaintResponse:
    """Parse ComplaintResponse from AIMessage or dict"""
    if isinstance(obj, dict):
        candidate = obj
    else:
        parsed = obj.additional_kwargs.get("parsed_structured_output")
        if isinstance(parsed, dict):
            candidate = parsed
        else:
            content = obj.content
            if isinstance(content, str):
                raw = content
            elif isinstance(content, list):
                raw = "".join(part.get("text", "") if isinstance(part, dict) else str(part) for part in content)
            else:
                raise ValueError("Unsupported message content type")
            candidate = json.loads(raw)

    missing = RESPONSE_FIELDS.difference(candidate.keys())
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")

    decision = candidate.get("decision")
    if decision == "suggest_credit":
        if candidate.get("credit_amount") is None:
            raise ValueError("suggest_credit requires credit_amount")
        if candidate.get("confidence") is None:
            raise ValueError("suggest_credit requires confidence")
        if candidate.get("priority") is not None:
            candidate["priority"] = None
    elif decision == "escalate":
        if candidate.get("priority") is None:
            raise ValueError("escalate requires priority")
        if candidate.get("credit_amount") is not None:
            candidate["credit_amount"] = None
        if candidate.get("confidence") is not None:
            candidate["confidence"] = None

    return cast(ComplaintResponse, candidate)


class ComplaintsAgentCore:
    """Lightweight complaint agent for model comparison (no ResponsesAgent wrapper)"""
    
    def __init__(self, model_endpoint: str, catalog: str):
        self.model_endpoint = model_endpoint
        self.catalog = catalog
        
        # Initialize LLM
        self.llm = ChatDatabricks(endpoint=model_endpoint)
        
        # Load UC tools
        uc_tool_names = [
            f"{catalog}.ai.get_order_overview",
            f"{catalog}.ai.get_order_timing",
            f"{catalog}.ai.get_location_timings",
        ]
        uc_toolkit = UCFunctionToolkit(function_names=uc_tool_names)
        self.tools = uc_toolkit.tools
        
        # Build agent
        self.agent = self._create_agent()
    
    def _create_agent(self):
        """Create LangGraph workflow"""
        tool_model = self.llm.bind_tools(self.tools, tool_choice="auto")
        structured_model = tool_model.with_structured_output(ComplaintResponse)

        def should_continue(state: AgentState):
            messages = state["messages"]
            last_message = messages[-1]
            if isinstance(last_message, AIMessage) and last_message.tool_calls:
                return "continue"
            return "end"

        preprocessor = RunnableLambda(
            lambda state: [{"role": "system", "content": SYSTEM_PROMPT}] + state["messages"]
        )

        tool_runnable = preprocessor | tool_model
        structured_runnable = preprocessor | structured_model

        def call_model(state: AgentState, config: RunnableConfig):
            response = tool_runnable.invoke(state, config)
            if not isinstance(response, AIMessage):
                raise ValueError(f"Expected AIMessage, received {type(response)}")

            if response.tool_calls:
                return {"messages": [response]}

            try:
                parsed = parse_structured_response(response)
            except (json.JSONDecodeError, ValueError):
                structured = structured_runnable.invoke(state, config)
                parsed = parse_structured_response(structured)

            structured_message = AIMessage(
                id=response.id or str(uuid4()),
                content=json.dumps(parsed),
                additional_kwargs={"parsed_structured_output": parsed},
            )
            return {"messages": [structured_message]}

        workflow = StateGraph(AgentState)
        workflow.add_node("agent", RunnableLambda(call_model))
        workflow.add_node("tools", ToolNode(self.tools))
        workflow.set_entry_point("agent")
        workflow.add_conditional_edges(
            "agent",
            should_continue,
            {"continue": "tools", "end": END},
        )
        workflow.add_edge("tools", "agent")
        return workflow.compile()
    
    def invoke(self, complaint: str) -> dict:
        """Process a complaint and return structured response"""
        result = self.agent.invoke({
            "messages": [{"role": "user", "content": complaint}]
        })
        
        # Extract final response
        final_message = result["messages"][-1]
        return parse_structured_response(final_message)

print("✅ ComplaintsAgentCore class defined")
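
The delivery-delay tiers in the system prompt reduce to a small piece of arithmetic, which is handy as a reference when checking an agent response by hand. The following is a minimal sketch, not part of the agent: the function names are hypothetical, and it assumes that a delivery time exactly equal to a percentile falls into the lower tier, since the prompt does not define the boundary cases.

```python
import math


def round_to_half_dollar(amount: float) -> float:
    """Round to the nearest $0.50, rounding halves up."""
    return math.floor(amount * 2 + 0.5) / 2


def suggest_delay_credit(order_total, delivery_minutes, p75, p90, p99):
    """Return (credit_amount, confidence) for a delivery-delay complaint.

    Assumption: a delivery time equal to a percentile stays in the lower tier.
    """
    if delivery_minutes > p99:
        rate, confidence = 0.50, "high"
    elif delivery_minutes > p90:
        rate, confidence = 0.25, "high"
    elif delivery_minutes > p75:
        rate, confidence = 0.15, "medium"
    else:
        rate, confidence = 0.0, "low"  # on-time delivery: $0 credit, low confidence
    return round_to_half_dollar(order_total * rate), confidence


# Hypothetical order: $42.30 total, delivered in 52 minutes,
# with location percentiles P75=35, P90=45, P99=70 minutes.
credit, confidence = suggest_delay_credit(42.30, 52, p75=35, p90=45, p99=70)
print(credit, confidence)  # → 10.5 high
```

Here 25% of $42.30 is $10.575, which rounds to $10.50. A response from the agent that cites a different amount for the same inputs is a candidate for closer review.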

### Test the Agent

Before creating the full evaluation dataset, let's validate that the agent works correctly with a single test case.

In [None]:
# Get a sample order ID for testing
sample_order = spark.sql(f"""
    SELECT order_id 
    FROM {CATALOG}.lakeflow.all_events 
    WHERE event_type='delivered'
    LIMIT 1
""").collect()[0]['order_id']

print(f"Using test order ID: {sample_order}")

In [None]:
import mlflow

# Enable autologging for LangChain
mlflow.langchain.autolog()

# Create agent instance with a test model
test_agent = ComplaintsAgentCore(
    model_endpoint="databricks-llama-4-maverick",
    catalog=CATALOG
)

# Test with a delivery delay complaint
test_complaint = f"My order took forever to arrive! Order ID: {sample_order}"
print(f"\nTest complaint: {test_complaint}")
print("\nProcessing...")

result = test_agent.invoke(test_complaint)

print("\n" + "="*80)
print("AGENT RESPONSE")
print("="*80)
import json
print(json.dumps(result, indent=2))
print("="*80)

print("\n✅ Agent test successful! Ready to proceed with evaluation.")

## 3. Create Evaluation Dataset

Generate ~20 diverse complaint scenarios to thoroughly test model performance across different situations:
- Delivery delays (with varying severity)
- Missing items (verifiable vs suspicious)
- Food quality complaints (specific vs vague)
- Service issues (should escalate)
- Health/safety concerns (urgent escalation)
- Edge cases (missing order ID, invalid format)

A diverse dataset helps reveal model strengths and weaknesses across the decision space.

In [None]:
# Get real order IDs for realistic eval scenarios
all_order_ids = [
    row['order_id'] for row in spark.sql(f"""
        SELECT DISTINCT order_id 
        FROM {CATALOG}.lakeflow.all_events 
        WHERE event_type='delivered'
        LIMIT 30
    """).collect()
]

print(f"✅ Retrieved {len(all_order_ids)} sample order IDs")

In [None]:
import random

# Create diverse complaint scenarios
complaints = []

# Delivery delays (8 examples)
for oid in all_order_ids[:4]:
    complaints.extend([
        f"My order took forever to arrive! Order ID: {oid}",
        f"Order {oid} arrived late and cold",
    ])

# Food quality - specific (4 examples)
for oid in all_order_ids[4:6]:
    complaints.extend([
        f"My falafel was completely soggy and inedible. Order: {oid}",
        f"The gyro meat was overcooked and dry. Order: {oid}",
    ])

# Food quality - vague (should escalate) (2 examples)
for oid in all_order_ids[6:8]:
    complaints.append(f"Everything tasted bad. Order {oid}")

# Missing items (4 examples)
for oid in all_order_ids[8:10]:
    complaints.extend([
        f"My entire falafel bowl was missing from the order! Order: {oid}",
        f"No drinks in my order {oid}",
    ])

# Service issues (should escalate) (3 examples)
for oid in all_order_ids[10:13]:
    complaints.append(f"Your driver was extremely rude to me. Order: {oid}")

# Health/safety (urgent escalation) (2 examples)
for oid in all_order_ids[13:15]:
    complaints.append(f"This food made me sick, possible food poisoning. Order: {oid}")

# Multiple issues (2 examples)
for oid in all_order_ids[15:17]:
    complaints.append(f"Order {oid} was late AND missing items AND cold!")

# Edge cases (3 examples)
complaints.extend([
    "My order was really late and the food was cold!",  # Missing order ID
    "Order ABC123 never arrived",  # Invalid order ID
    f"Not satisfied with order {all_order_ids[17]}",  # Vague
])

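# The scenarios above add up to 28 complaints. Shuffle with a fixed seed before
# capping at 20 so the cap does not drop the last categories wholesale.
random.seed(42)
random.shuffle(complaints)
complaints = complaints[:20]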

# Format for mlflow.genai.evaluate: one record per complaint, shaped as {"inputs": {...}}
eval_data = [{"inputs": {"complaint": c}} for c in complaints]
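
Because the scenario groups add up to more than 20 complaints, it is worth checking which categories survive any cap on the list. The sketch below is an illustration only: the category labels are hypothetical names mirroring the comments in the cell above, and it shows what a head-of-list cap of 20 would keep if the complaints stay in build order.

```python
from collections import Counter

# Category sizes as written in the scenario comments above, in build order.
category_sizes = [
    ("delivery_delay", 8),
    ("food_quality_specific", 4),
    ("food_quality_vague", 2),
    ("missing_items", 4),
    ("service_issue", 3),
    ("health_safety", 2),
    ("multiple_issues", 2),
    ("edge_case", 3),
]

# One label per complaint, in the same order the complaints list is built.
labels = [name for name, size in category_sizes for _ in range(size)]


def coverage_after_cap(labels, cap):
    """Count surviving examples per category and list categories lost entirely."""
    kept = Counter(labels[:cap])
    dropped = [name for name in dict.fromkeys(labels) if kept[name] == 0]
    return kept, dropped


kept, dropped = coverage_after_cap(labels, cap=20)
print(len(labels), dict(kept))
print("dropped:", dropped)
```

In build order, a cap of 20 keeps only two of the three service complaints and none of the health/safety, multiple-issue, or edge-case scenarios, so a check like this is a cheap guard before running an evaluation.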

## 4. Define Scorers

We use two complementary scorers with no overlap:

1. **Evidence Groundedness (Agent-as-a-Judge, Pass/Fail)**: Inspects execution traces to verify the final decision aligns with evidence actually gathered from tool calls. Uses trace inspection to check if the agent's conclusions match the data it retrieved. This scorer will be aligned with human feedback.

2. **Rationale Sufficiency (Guidelines, Pass/Fail)**: Validates the output only: can a human follow the logic from rationale to decision? Checks self-consistency without needing trace data. An LLM judge checks each response against a fixed list of natural-language guidelines.

In [None]:
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import Guidelines

# Scorer 1: Evidence Groundedness (Agent-as-a-Judge, Binary Pass/Fail)
# Uses {{ trace }} to enable agentic evaluation with tool access
evidence_groundedness_judge = make_judge(
    name="evidence_groundedness",
    instructions="""
Evaluate whether the agent's decision is grounded in evidence from the execution {{ trace }}.

Investigation checklist:
1. Find spans where tools were called (get_order_overview, get_order_timing, get_location_timings)
2. Extract the actual outputs returned by these tool calls
3. Compare tool outputs against claims made in the agent's rationale
4. Verify that credit amounts or escalation decisions match the tool data

For delivery complaints, check:
- Does the rationale's delivery time match what get_order_timing returned?
- Does the rationale's percentile comparison match get_location_timings output?
- Is the credit calculation based on the actual order total from get_order_overview?

For missing item complaints, check:
- Does get_order_overview show the items mentioned in the rationale?
- Is the credit amount based on actual item prices from the tool output?

Customer complaint: {{ inputs }}
Agent's final output: {{ outputs }}

Rate as PASS if rationale claims match tool outputs, FAIL if there are contradictions or unsupported claims.
""",
    model="databricks:/databricks-claude-sonnet-4-5"
)

# Scorer 2: Rationale Sufficiency (Guidelines, Pass/Fail)
rationale_sufficiency = Guidelines(
    name="rationale_sufficiency",
    guidelines=[
        "If decision='suggest_credit', rationale must contain numerical justification (dollar amount, percentile, or item value)",
        "If decision='suggest_credit', credit_amount must be present and >= 0 (a $0 credit is valid when the data does not support the claim)",
        "If decision='escalate', rationale must explain why (e.g., health/safety, missing data, service issue)",
        "If decision='escalate' and priority='urgent', rationale must mention a legal threat, health/safety concern, suspected fraud, or abusive language",
        "Rationale must be at least 20 characters long",
        "Decision must be either 'suggest_credit' or 'escalate'"
    ]
)

print("✅ Scorers defined:")
print("   - evidence_groundedness (Agent-as-a-Judge, inspects traces, will be aligned)")
print("   - rationale_sufficiency (Guidelines, output-only validation)")
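
Several of the guidelines above are mechanical enough to check without a model, which can help when debugging a surprising verdict. The following is a rough sketch of such a pre-check, not a replacement for either scorer. The function name is hypothetical, and it follows the system prompt in treating a $0 credit as valid and in capping the rationale at 150 words.

```python
import re

ALLOWED_DECISIONS = {"suggest_credit", "escalate"}


def precheck_response(response: dict) -> list:
    """Return a list of human-readable violations (an empty list means pass)."""
    violations = []
    decision = response.get("decision")
    rationale = response.get("rationale") or ""

    if decision not in ALLOWED_DECISIONS:
        violations.append("decision must be 'suggest_credit' or 'escalate'")
    if len(rationale) < 20:
        violations.append("rationale shorter than 20 characters")
    if len(rationale.split()) > 150:
        violations.append("rationale longer than 150 words")

    if decision == "suggest_credit":
        amount = response.get("credit_amount")
        if amount is None or amount < 0:
            violations.append("suggest_credit needs a non-negative credit_amount")
        if not re.search(r"\d", rationale):
            violations.append("suggest_credit rationale cites no numbers")
    elif decision == "escalate":
        if response.get("priority") not in {"standard", "urgent"}:
            violations.append("escalate needs priority 'standard' or 'urgent'")
    return violations


# Hypothetical responses for illustration only.
good = {
    "decision": "suggest_credit",
    "credit_amount": 10.5,
    "rationale": "Delivered in 52 min, above the location P90 of 45 min; 25% of $42.30.",
}
bad = {"decision": "escalate", "priority": None, "rationale": "Too vague."}

print(precheck_response(good))
print(precheck_response(bad))
```

A response that fails this pre-check would likely fail the Guidelines scorer as well, but passing it says nothing about whether the cited numbers are correct; that is what the trace-based judge is for.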

## 5. Run Initial Evaluation (Single Model)

Run an initial evaluation with one model to validate your eval setup and generate traces for judge alignment.

This creates the baseline traces that you'll review and use to align the judge before comparing multiple models.

In [None]:
# Use one model for initial evaluation and judge alignment
initial_model_name = "llama-3-3-70b"
initial_model_endpoint = "databricks-meta-llama-3-3-70b-instruct"

print(f"Initial model for baseline evaluation: {initial_model_name}")

In [None]:
import mlflow

# Enable autologging for detailed trace capture
mlflow.langchain.autolog()

print(f"Running {initial_model_name} on 20 test cases to generate traces...\n")

# Create agent instance
agent = ComplaintsAgentCore(model_endpoint=initial_model_endpoint, catalog=CATALOG)

# Step 1: Run agent to generate traces (tag them for later retrieval)
baseline_tag = f"baseline_{initial_model_name}"
with mlflow.start_run(run_name=baseline_tag) as run:
    experiment_id = run.info.experiment_id
    run_id = run.info.run_id
    
    # Invoke agent on each complaint to generate traces
    # eval_data is a list of {"inputs": {...}} dicts, so iterate it directly
    for row in eval_data:
        complaint = row["inputs"]["complaint"]
        result = agent.invoke(complaint)
        
        # Tag the trace for later retrieval
        trace_id = mlflow.get_last_active_trace_id()
        mlflow.set_trace_tag(trace_id, "eval_group", baseline_tag)
        mlflow.set_trace_tag(trace_id, "model", initial_model_name)

print(f"✅ Generated {len(eval_data)} traces. Run ID: {run_id}")

# Step 2: Retrieve traces for evaluation
print(f"\nRetrieving traces for evaluation...")
baseline_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=20
)

print(f"Retrieved {len(baseline_traces)} traces")

# Step 3: Evaluate the traces
print(f"\nEvaluating traces with scorers...")
baseline_result = mlflow.genai.evaluate(
    data=baseline_traces,
    scorers=[evidence_groundedness_judge, rationale_sufficiency]
)

print(f"\n✅ Baseline evaluation complete:")
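# Format defensively: applying :.1f to a missing metric would raise
eg = baseline_result.metrics.get('evidence_groundedness/percentage')
rs = baseline_result.metrics.get('rationale_sufficiency/percentage')
print(f"   Evidence Groundedness: {eg:.1f}% pass" if eg is not None else "   Evidence Groundedness: N/A")
print(f"   Rationale Sufficiency: {rs:.1f}% pass" if rs is not None else "   Rationale Sufficiency: N/A")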
print(f"\nExperiment ID: {experiment_id}")
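
Both scorers produce a Pass/Fail verdict per trace, and the summary reports the share that passed. As an illustration of that arithmetic (not MLflow's own aggregation code), a pass rate over a list of verdicts can be computed as follows. The `"pass"`/`"fail"` strings and the function name are assumptions made for this sketch.

```python
def pass_rate(verdicts):
    """Percentage of 'pass' verdicts; None when there is nothing to score."""
    scored = [v.strip().lower() for v in verdicts if v]
    if not scored:
        return None
    return 100.0 * sum(v == "pass" for v in scored) / len(scored)


# Hypothetical verdicts for a 20-example run.
verdicts = ["pass"] * 17 + ["fail"] * 3
rate = pass_rate(verdicts)
print(f"{rate:.1f}% pass")  # → 85.0% pass
```

With only 20 examples, each trace moves the pass rate by five percentage points, so small differences between models should be read with caution.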

## 6. Human Review Instructions

Now it's time to add human feedback to improve the judge's accuracy. The MLflow UI makes this easy:

### How to Add Feedback:

1. **Open MLflow UI**: Navigate to the experiment (link printed below)
2. **Review traces**: Click into evaluation runs to see individual predictions
3. **Add feedback**: For each trace, record your own verdict on the `evidence_groundedness` assessment, so the feedback lines up with the judge being aligned
   - ✅ Mark as pass if the rationale is supported by the tool outputs
   - ❌ Mark as fail if it contradicts them or makes unsupported claims
   - Add comments explaining your reasoning

### What to Look For:
- Did the agent call the right tools for the complaint type?
- Is the credit amount justified by the data retrieved?
- Are escalations appropriate for the severity?
- Does the rationale cite specific numbers from tool outputs?

### Tips:
- Focus on borderline cases: traces where the judge's pass/fail verdict looks uncertain or its rationale is thin
- You need at least 10 feedback entries for alignment to work well
- Look for patterns where the judge disagreed with your assessment
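
To find the patterns mentioned in the last tip, it helps to list the traces where your verdict and the judge's differ. The sketch below assumes you have copied both sets of verdicts into plain dictionaries keyed by trace ID; the IDs, the `"pass"`/`"fail"` values, and the function name are hypothetical.

```python
def find_disagreements(judge_verdicts: dict, human_verdicts: dict):
    """Compare verdicts keyed by trace ID; only human-reviewed traces count."""
    reviewed = [t for t in human_verdicts if t in judge_verdicts]
    disagreements = [t for t in reviewed if judge_verdicts[t] != human_verdicts[t]]
    agreement = (len(reviewed) - len(disagreements)) / len(reviewed) if reviewed else None
    return disagreements, agreement


# Hypothetical verdicts: the judge scored four traces, a human reviewed three.
judge = {"tr-1": "pass", "tr-2": "pass", "tr-3": "fail", "tr-4": "pass"}
human = {"tr-1": "pass", "tr-2": "fail", "tr-3": "fail"}

disagreements, agreement = find_disagreements(judge, human)
print(disagreements, f"{agreement:.0%}")
```

The disagreement list is a natural reading order for review: those are the traces whose comments will tell the alignment step the most.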

In [None]:
import urllib.parse

# Get workspace URL
workspace_url = spark.conf.get("spark.databricks.workspaceUrl")
experiment_path = urllib.parse.quote(experiment_name)

mlflow_ui_url = f"https://{workspace_url}/#mlflow/experiments/{experiment_id}"

print("="*80)
print("👤 HUMAN REVIEW REQUIRED")
print("="*80)
print(f"\n1. Open MLflow UI: {mlflow_ui_url}")
print("\n2. Review evaluation traces and add feedback:")
print("   - Click into individual runs to see predictions")
print("   - For each trace, rate whether the agent's decision was correct")
print("   - Add comments explaining your reasoning")
print("\n3. Focus on:")
print("   - Tool usage (did it call the right functions?)")
print("   - Data justification (does rationale cite specific numbers?)")
print("   - Escalation appropriateness (correct priority level?)")
print("   - Credit accuracy (reasonable amount based on data?)")
print("\n4. Aim for at least 10-15 feedback entries for good alignment")
print("\n5. Once done, proceed to the next cell for judge alignment")
print("="*80)

## 7. Judge Alignment

After collecting human feedback, we can align the judge to better match human judgment. The SIMBA optimizer analyzes disagreements between the judge and human assessments, then generates improved instructions.

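The aim is to reduce the judge's false positives and false negatives. This notebook's working estimate is a 30-50% reduction relative to the initial judge; confirm the actual figure against your own feedback data.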
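
One way to check whether alignment helped is to compute the judge's false-positive and false-negative rates against the human labels, before and after. The following is a minimal sketch with made-up verdicts; it treats the human label as ground truth, and the function name is hypothetical.

```python
def error_rates(judge_verdicts, human_verdicts):
    """False-positive and false-negative rates, treating human labels as truth.

    False positive: judge says pass, human says fail.
    False negative: judge says fail, human says pass.
    """
    pairs = list(zip(judge_verdicts, human_verdicts))
    human_fail = sum(h == "fail" for _, h in pairs)
    human_pass = sum(h == "pass" for _, h in pairs)
    fp = sum(j == "pass" and h == "fail" for j, h in pairs)
    fn = sum(j == "fail" and h == "pass" for j, h in pairs)
    fp_rate = fp / human_fail if human_fail else 0.0
    fn_rate = fn / human_pass if human_pass else 0.0
    return fp_rate, fn_rate


# Made-up verdicts for eight traces, for illustration only.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
before = ["pass", "fail", "pass", "pass", "pass", "fail", "pass", "fail"]
after = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "fail"]

print("before:", error_rates(before, human))  # → before: (0.5, 0.25)
print("after: ", error_rates(after, human))   # → after:  (0.25, 0.0)
```

Ideally the "after" rates are measured on traces that were not used for alignment, since a judge tuned on a set of labels will tend to agree with those same labels.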

In [None]:
from mlflow.genai.optimizers import SIMBAAlignmentOptimizer

# Get the baseline traces (the ones reviewed in the MLflow UI) for alignment
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=20,
    return_type="list",
)

# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    
    # Run alignment - shows minimal progress by default:
    # INFO: Starting SIMBA optimization with 15 examples (set logging to DEBUG for detailed output)
    # INFO: SIMBA optimization completed
    aligned_judge = evidence_groundedness_judge.align(optimizer, traces_for_alignment)
    
    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("✅ Judge aligned successfully with human feedback")
else:
    # Later cells depend on aligned_judge, so stop here with a clear message
    raise RuntimeError(
        f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}"
    )

In [None]:
# Register the aligned judge under a descriptive name for reuse
aligned_judge.register(name=f"{CATALOG}.ai.evidence_groundedness_judge_aligned")

print(f"✅ Aligned judge registered as: {CATALOG}.ai.evidence_groundedness_judge_aligned")

## 8. Model Comparison with Aligned Judge

Now that the judge is aligned with human feedback, compare multiple models to find the best one for your use case.

We'll evaluate two models and compare them against the Llama 3.3 70B baseline from Section 5:
- **GPT OSS 20B**: Cost-effective open model
- **Claude 3.7 Sonnet**: Strong reasoning capabilities

A third candidate, **DBRX Instruct** (Databricks' own model), is included in the code but commented out for now.

In [None]:
# Define models to compare
models_to_compare = {
    "gpt-oss": "databricks-gpt-oss-20b",
    "sonnet": "databricks-claude-3-7-sonnet"
    # "dbrx-instruct": "databricks-dbrx-instruct"
}

comparison_results = {}

for model_name, endpoint in models_to_compare.items():
    print(f"\nRunning {model_name} to generate traces...")
    
    agent = ComplaintsAgentCore(model_endpoint=endpoint, catalog=CATALOG)
    
    # Step 1: Generate traces with tags
    comparison_tag = f"comparison_{model_name}"
    with mlflow.start_run(run_name=comparison_tag) as run:
        # Invoke agent on each complaint to generate traces
        # eval_data is a list of {"inputs": {...}} dicts, so iterate it directly
        for row in eval_data:
            complaint = row["inputs"]["complaint"]
            result = agent.invoke(complaint)
            
            # Tag the trace for later retrieval
            trace_id = mlflow.get_last_active_trace_id()
            mlflow.set_trace_tag(trace_id, "eval_group", comparison_tag)
            mlflow.set_trace_tag(trace_id, "model", model_name)
    
    print(f"  Generated {len(eval_data)} traces for {model_name}")
    
    # Step 2: Retrieve traces
    traces = mlflow.search_traces(
        experiment_ids=[experiment_id],
        filter_string=f"tags.eval_group = '{comparison_tag}'",
        max_results=20
    )
    
    # Step 3: Evaluate traces
    print(f"  Evaluating {len(traces)} traces with aligned judge...")
    result = mlflow.genai.evaluate(
        data=traces,
        scorers=[aligned_judge, rationale_sufficiency]
    )
    comparison_results[model_name] = result
    
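    # Format defensively: applying :.1f to a missing metric would raise
    eg = result.metrics.get('evidence_groundedness/percentage')
    rs = result.metrics.get('rationale_sufficiency/percentage')
    eg_text = f"{eg:.1f}% pass" if eg is not None else "N/A"
    rs_text = f"{rs:.1f}% pass" if rs is not None else "N/A"
    print(f"  ✅ {model_name}: evidence_groundedness={eg_text}, rationale_sufficiency={rs_text}")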

print("\n✅ Model comparison complete!")

## 9. Results Analysis

Compare the models to identify the best one for your use case:
- Which model has the highest evidence groundedness (decisions align with tool data)?
- Which model has the most sufficient rationales (clear logic from rationale to decision)?
- What are the latency trade-offs between models?
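
When two models land on the same pass rate, latency is a reasonable tie-breaker. The sketch below ranks models by average pass rate and then by P50 latency. The model names and numbers are invented for illustration, and the function name is hypothetical; the dictionary keys match the columns built in the next cell.

```python
def rank_models(rows):
    """Sort by average pass rate (descending), then by P50 latency (ascending)."""

    def average(row):
        return (row["evidence_groundedness"] + row["rationale_sufficiency"]) / 2

    return sorted(rows, key=lambda r: (-average(r), r["latency_p50_ms"]))


# Invented results for three placeholder models.
rows = [
    {"model": "model-a", "evidence_groundedness": 80.0, "rationale_sufficiency": 90.0, "latency_p50_ms": 4200},
    {"model": "model-b", "evidence_groundedness": 85.0, "rationale_sufficiency": 85.0, "latency_p50_ms": 2600},
    {"model": "model-c", "evidence_groundedness": 70.0, "rationale_sufficiency": 95.0, "latency_p50_ms": 1800},
]

ranked = [r["model"] for r in rank_models(rows)]
print(ranked)  # → ['model-b', 'model-a', 'model-c']
```

Here `model-a` and `model-b` both average 85%, and the faster `model-b` ranks first. Whether latency or cost should outweigh a few points of pass rate is a judgment call for your use case.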

In [None]:
import pandas as pd

# Build comparison dataframe
comparison_data = []

# Add baseline result for reference
comparison_data.append({
    'model': f"{initial_model_name} (baseline)",
    'evidence_groundedness': baseline_result.metrics.get('evidence_groundedness/percentage', 0),
    'rationale_sufficiency': baseline_result.metrics.get('rationale_sufficiency/percentage', 0),
    'latency_p50_ms': baseline_result.metrics.get('latency/p50', 0) * 1000 if baseline_result.metrics.get('latency/p50') else 0,
})

# Add comparison results
for model_name, result in comparison_results.items():
    comparison_data.append({
        'model': model_name,
        'evidence_groundedness': result.metrics.get('evidence_groundedness/percentage', 0),
        'rationale_sufficiency': result.metrics.get('rationale_sufficiency/percentage', 0),
        'latency_p50_ms': result.metrics.get('latency/p50', 0) * 1000 if result.metrics.get('latency/p50') else 0,
    })

comparison_df = pd.DataFrame(comparison_data)

print("\nModel Comparison Results:")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)
print("\nNote: Baseline used initial judge; other models used aligned judge")

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create visualization with multiple charts
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Evidence Groundedness (% Pass)',
        'Rationale Sufficiency (% Pass)',
        'Latency Comparison (P50)',
        'Average Pass Rate (%)'
    ),
    specs=[[{'type': 'bar'}, {'type': 'bar'}],
           [{'type': 'bar'}, {'type': 'bar'}]]
)

# Chart 1: Evidence Groundedness
fig.add_trace(
    go.Bar(x=comparison_df['model'], y=comparison_df['evidence_groundedness'],
           marker_color='darkblue', showlegend=False),
    row=1, col=1
)

# Chart 2: Rationale Sufficiency
fig.add_trace(
    go.Bar(x=comparison_df['model'], y=comparison_df['rationale_sufficiency'],
           marker_color='green', showlegend=False),
    row=1, col=2
)

# Chart 3: Latency
fig.add_trace(
    go.Bar(x=comparison_df['model'], y=comparison_df['latency_p50_ms'],
           marker_color='orange', showlegend=False),
    row=2, col=1
)

# Chart 4: Average pass rate
# Both scorers return 0-100 percentages, so just average them
comparison_df['average_pass_rate'] = (
    (comparison_df['evidence_groundedness'] + comparison_df['rationale_sufficiency']) / 2
)

fig.add_trace(
    go.Bar(x=comparison_df['model'], y=comparison_df['average_pass_rate'],
           marker_color='purple', showlegend=False),
    row=2, col=2
)

# Update layout
fig.update_layout(
    height=800,
    title_text="Model Comparison Dashboard",
    showlegend=True
)

fig.update_yaxes(title_text="Pass rate (%)", row=1, col=1)
fig.update_yaxes(title_text="Pass rate (%)", row=1, col=2)
fig.update_yaxes(title_text="Latency (ms)", row=2, col=1)
fig.update_yaxes(title_text="Average pass rate (%)", row=2, col=2)

fig.show()

print("\n📊 Visualization complete. Review charts above.")

## 10. Register Winner & Next Steps

Identify the best-performing model and log its configuration for deployment.

In [None]:
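# Identify winner based on combined score
winner_idx = comparison_df['average_pass_rate'].idxmax()
winner_name = comparison_df.loc[winner_idx, 'model']

# Map display names back to endpoints; the baseline row carries a "(baseline)" suffix
endpoint_by_model = {
    f"{initial_model_name} (baseline)": initial_model_endpoint,
    **models_to_compare,
}
winner_endpoint = endpoint_by_model[winner_name]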

print("="*80)
print("🏆 WINNING MODEL")
print("="*80)
print(f"\nModel: {winner_name}")
print(f"Endpoint: {winner_endpoint}")
print(f"\nPerformance:")
print(f"  - Evidence Groundedness: {comparison_df.loc[winner_idx, 'evidence_groundedness']:.1f}% pass")
print(f"  - Rationale Sufficiency: {comparison_df.loc[winner_idx, 'rationale_sufficiency']:.1f}% pass")
print(f"  - Latency P50: {comparison_df.loc[winner_idx, 'latency_p50_ms']:.0f}ms")
print(f"  - Average Pass Rate: {comparison_df.loc[winner_idx, 'average_pass_rate']:.1f}%")
print("="*80)

In [None]:
# Log winning configuration
with mlflow.start_run(run_name=f"winning_config_{winner_name}") as run:
    mlflow.log_param("winning_model", winner_name)
    mlflow.log_param("model_endpoint", winner_endpoint)
    mlflow.log_param("catalog", CATALOG)
    
    # Log all metrics
    mlflow.log_metric("evidence_groundedness", comparison_df.loc[winner_idx, 'evidence_groundedness'])
    mlflow.log_metric("rationale_sufficiency", comparison_df.loc[winner_idx, 'rationale_sufficiency'])
    mlflow.log_metric("latency_p50_ms", comparison_df.loc[winner_idx, 'latency_p50_ms'])
    mlflow.log_metric("average_pass_rate", comparison_df.loc[winner_idx, 'average_pass_rate'])
    
    # Save comparison df as artifact
    comparison_df.to_csv("model_comparison.csv", index=False)
    mlflow.log_artifact("model_comparison.csv")
    
    winning_run_id = run.info.run_id

print(f"\n✅ Winning configuration logged to MLflow (run_id: {winning_run_id})")

In [None]:
print("\n" + "="*80)
print("📋 NEXT STEPS")
print("="*80)
print(f"\n1. Use the winning model in your deployment:")
print(f"   - Model: {winner_name}")
print(f"   - Endpoint: {winner_endpoint}")
print(f"\n2. Deploy to production:")
print(f"   - Open: ../../stages/complaint_agent.ipynb")
print(f"   - Update LLM_MODEL widget to: {winner_endpoint}")
print(f"   - Run full deployment workflow (log, register, deploy to Model Serving)")
print(f"\n3. Continue monitoring:")
print(f"   - Use the aligned judge ({CATALOG}.ai.evidence_groundedness_judge_aligned)")
print(f"   - Monitor production traffic with same scorers")
print(f"   - Collect more human feedback to further improve the judge")
print(f"\n4. Iterate:")
print(f"   - As new models become available, re-run this comparison notebook")
print(f"   - Update evaluation dataset with new complaint patterns")
print(f"   - Refine scorers based on production learnings")
print("="*80)

print(f"\n🎯 You've successfully compared {len(comparison_df)} models and identified the best one!")
print(f"   The {winner_name} model achieved an average pass rate of {comparison_df.loc[winner_idx, 'average_pass_rate']:.1f}%")

## Summary

This notebook demonstrated:
- ✅ Lightweight agent definition for rapid iteration (no deployment needed)
- ✅ Systematic model comparison using MLflow evaluation
- ✅ Agent-as-a-Judge for sophisticated trace-based evaluation
- ✅ Human-in-the-loop feedback collection via MLflow UI
- ✅ Judge alignment to improve evaluation accuracy
- ✅ Clear visualization of model trade-offs
- ✅ Structured decision-making for model selection

**Key Takeaways:**
1. Inner-loop development (without deployment) enables fast model comparison
2. Agent-as-a-Judge can inspect execution traces for deeper evaluation
3. Human feedback can substantially improve judge accuracy (this notebook's rough estimate is 30-50%; verify on your own data)
4. Different models have different trade-offs (accuracy vs latency vs cost)
5. Once you've selected a model, deployment is straightforward using the stages notebook

**What's Next:**
Deploy your winning model using the full production workflow in `../../stages/complaint_agent.ipynb`