# Choose the Best Model for Your Agent

Databricks provides native access to a wide range of major AI model families, including GPT, Claude, Gemini, Llama, and more. MLflow on Databricks supports sophisticated model evaluation and comparison workflows, including trace-aware, human-aligned agentic evaluation.

In this notebook, we will explore some of the ways Databricks and MLflow empower you to rigorously iterate on the quality of your agent and to choose the best model for your use case. In particular, we will show how to use a trace-aware agentic judge and a human-aligned template-based judge to evaluate and compare different models.

**Scenario:** You've built a prototype complaint triage agent for Casper's Kitchens, a ghost kitchen network. You want to know which AI model should power your production agent.

**What you'll do:**
1. Define a prototype agent for customer complaint triage
2. Create an evaluation dataset with diverse complaint scenarios
3. Evaluate different models using agentic and template-based scorers
4. Review results and add human feedback via MLflow UI
5. Align the judge with human feedback to improve accuracy
6. Re-evaluate and visualize results
7. Register the winning model configuration

**Prerequisites:** This demo assumes UC tools have been created by running `stages/complaint_agent.ipynb` and that you have run `stages/raw_data.ipynb` and `stages/lakeflow.ipynb` to start the raw data stream. See the [README](../../../README.md) for details on setting up the Casper's environment.

## 1. Setup & Prerequisites

Install required packages and verify that UC tools exist.

In [None]:
%pip install -U -qqqq mlflow[databricks] langgraph==0.3.4 databricks-langchain plotly dspy
dbutils.library.restartPython()

In [None]:
CATALOG = dbutils.widgets.get("CATALOG")
if not CATALOG:
    raise ValueError("Please provide a CATALOG name in the widget above")
print(f"Using catalog: {CATALOG}")

In [None]:
# Verify UC tools exist
print("Checking for required UC functions...")

required_functions = ['get_order_overview', 'get_order_timing', 'get_location_timings']

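for func_name in required_functions:
    try:
        spark.sql(f"DESCRIBE FUNCTION {CATALOG}.ai.{func_name}").collect()
    except Exception as exc:
        # Name the missing function and keep the original error for debugging
        raise RuntimeError(
            f"UC function {CATALOG}.ai.{func_name} not found. "
            "Run ../../stages/complaint_agent.ipynb first to create the UC tools."
        ) from exc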

print("\n✅ All required UC functions found")

In [None]:
# Configure MLflow experiment

import mlflow

experiment_name = f"/Shared/{CATALOG}_model_comparison"
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id

## 2. Define Agent

The cell below defines a version of the complaint agent used in the [complaint_agent.ipynb](../../stages/complaint_agent.ipynb) notebook. The agent can call on a set of Unity Catalog functions enabling it to retrieve order details, delivery timing, and location timings.

In this scenario, imagine that we are earlier in the prototyping phase and we want to quickly iterate on the agent's design. We are making fundamental design choices like which models to use and how to structure the agent's system prompt.

This version of the complaint agent takes `model_endpoint` as a parameter, enabling us to easily swap between different models for comparison. Beyond that, the specific implementation details are largely unimportant for the purposes of this demo: you can use the same general evaluation and comparison principles regardless of the underlying agent implementation.

In [None]:
import json
from typing import Annotated, Any, Optional, Sequence, TypedDict, Literal, cast
from uuid import uuid4

from databricks_langchain import (
    ChatDatabricks,
    DatabricksFunctionClient,
    UCFunctionToolkit,
    set_uc_function_client,
)
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.runnables import RunnableConfig, RunnableLambda
from langchain_core.tools import BaseTool
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt.tool_node import ToolNode

# Set up UC function client
client = DatabricksFunctionClient()
set_uc_function_client(client)

# Define response schema (same as production)
class ComplaintResponse(TypedDict):
    order_id: str
    complaint_category: Literal["delivery_delay", "missing_items", "food_quality", "service_issue", "billing", "other"]
    decision: Literal["suggest_credit", "escalate"]
    credit_amount: Optional[float]
    confidence: Optional[Literal["high", "medium", "low"]]
    priority: Optional[Literal["standard", "urgent"]]
    rationale: str

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]

# System prompt (same as production)
SYSTEM_PROMPT = """You are the Complaint Triage Agent for Casper's Kitchens, a ghost kitchen network. Your job is to analyze customer complaints and recommend actions for internal customer service staff.

PROCESS:
1. Extract the order_id from the complaint text
2. Call get_order_overview(order_id) to get order details and items
3. Call get_order_timing(order_id) to get delivery timing
4. For delivery delay complaints, call get_location_timings(location) to get percentile benchmarks
5. Analyze the data and make a recommendation

DECISION FRAMEWORK:

SUGGEST_CREDIT - Use when you can make a data-backed recommendation:

Delivery delays (high confidence when data-backed):
- Compare actual delivery time to location percentiles
- >P75 but <P90: Suggest 15% of order total
- >P90 but <P99: Suggest 25% of order total  
- >P99: Suggest 50% of order total
- On-time delivery: Suggest $0 credit (low confidence - might be other issues)

Missing items:
- Parse items_json from get_order_overview to find actual item prices
- Use real item costs when available for accurate refund
- Check if claimed missing item actually makes sense for this order
- If item plausibly missing but can't verify price: estimate $8-12 per item (medium confidence)
- If no evidence of missing items: $0 credit (low confidence)

Food quality issues:
- If specific details provided (cold, soggy, wrong preparation): $10-15 (medium confidence)
- If vague ("bad", "gross"): escalate with priority="standard" instead
- If food quality complaint but order shows delivery delay >P90: consider combined credit

ESCALATE - Use when human judgment is needed:
- priority="standard": Vague complaints, missing data, edge cases, billing issues, service complaints
- priority="urgent": Legal threats, health/safety concerns, suspected fraud, abusive language

OUTPUT RULES:
- For suggest_credit: Include credit_amount (can be $0) and confidence, set priority to null
- For escalate: Include priority, set credit_amount and confidence to null
- Always include a data-focused rationale citing specific numbers (delivery times, percentiles, order details, item verification)
- Rationale is for internal staff - be factual, not apologetic

RATIONALE GUIDELINES:
- Rationale should clearly articulate the justification for the decision
- All evidence used to support the decision should be cited in the rationale. Rationale MUST cite evidence.
- Rationale should be detailed and specific, but no longer than 150 words
- Rationale should clearly justify the decision, credit amount (if applicable), confidence level, and priority (if applicable)
- If a refund is suggested, the rationale should clearly justify the credit amount, based on the evidence provided.

CONFIDENCE GUIDELINES:
- high: Strong data support (delivery >P90, item prices from order data, clear verification)
- medium: Reasonable inference (delivery P75-P90, estimated item costs, plausible but unverified)
- low: Weak or contradictory evidence ($0 credits when data doesn't support claim, can't verify complaint details)

IMPORTANT NOTES:
- If order_id not found or data unavailable: escalate with priority="standard"
- Round credit amounts to nearest $0.50
- When in doubt between two options, prefer escalate with priority="standard"
- A suggest_credit with $0 and low confidence is valid when complaint seems unfounded
"""

RESPONSE_FIELDS = {"order_id", "complaint_category", "decision", "rationale"}

def parse_structured_response(obj) -> ComplaintResponse:
    """Parse ComplaintResponse from AIMessage or dict"""
    if isinstance(obj, dict):
        candidate = obj
    else:
        parsed = obj.additional_kwargs.get("parsed_structured_output")
        if isinstance(parsed, dict):
            candidate = parsed
        else:
            content = obj.content
            if isinstance(content, str):
                raw = content
            elif isinstance(content, list):
                raw = "".join(part.get("text", "") if isinstance(part, dict) else str(part) for part in content)
            else:
                raise ValueError("Unsupported message content type")
            candidate = json.loads(raw)

    missing = RESPONSE_FIELDS.difference(candidate.keys())
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")

    decision = candidate.get("decision")
    if decision == "suggest_credit":
        if candidate.get("credit_amount") is None:
            raise ValueError("suggest_credit requires credit_amount")
        if candidate.get("confidence") is None:
            raise ValueError("suggest_credit requires confidence")
        if candidate.get("priority") is not None:
            candidate["priority"] = None
    elif decision == "escalate":
        if candidate.get("priority") is None:
            raise ValueError("escalate requires priority")
        if candidate.get("credit_amount") is not None:
            candidate["credit_amount"] = None
        if candidate.get("confidence") is not None:
            candidate["confidence"] = None

    return cast(ComplaintResponse, candidate)


class ComplaintsAgentCore:
    """Lightweight complaint agent for model comparison (no ResponsesAgent wrapper)"""
    
    def __init__(self, model_endpoint: str, catalog: str):
        self.model_endpoint = model_endpoint
        self.catalog = catalog
        
        # Initialize LLM
        self.llm = ChatDatabricks(endpoint=model_endpoint)
        
        # Load UC tools
        uc_tool_names = [
            f"{catalog}.ai.get_order_overview",
            f"{catalog}.ai.get_order_timing",
            f"{catalog}.ai.get_location_timings",
        ]
        uc_toolkit = UCFunctionToolkit(function_names=uc_tool_names)
        self.tools = uc_toolkit.tools
        
        # Build agent
        self.agent = self._create_agent()
    
    def _create_agent(self):
        """Create LangGraph workflow"""
        tool_model = self.llm.bind_tools(self.tools, tool_choice="auto")
        structured_model = tool_model.with_structured_output(ComplaintResponse)

        def should_continue(state: AgentState):
            messages = state["messages"]
            last_message = messages[-1]
            if isinstance(last_message, AIMessage) and last_message.tool_calls:
                return "continue"
            return "end"

        preprocessor = RunnableLambda(
            lambda state: [{"role": "system", "content": SYSTEM_PROMPT}] + state["messages"]
        )

        tool_runnable = preprocessor | tool_model
        structured_runnable = preprocessor | structured_model

        def call_model(state: AgentState, config: RunnableConfig):
            response = tool_runnable.invoke(state, config)
            if not isinstance(response, AIMessage):
                raise ValueError(f"Expected AIMessage, received {type(response)}")

            if response.tool_calls:
                return {"messages": [response]}

            try:
                parsed = parse_structured_response(response)
            except (json.JSONDecodeError, ValueError):
                structured = structured_runnable.invoke(state, config)
                parsed = parse_structured_response(structured)

            structured_message = AIMessage(
                id=response.id or str(uuid4()),
                content=json.dumps(parsed),
                additional_kwargs={"parsed_structured_output": parsed},
            )
            return {"messages": [structured_message]}

        workflow = StateGraph(AgentState)
        workflow.add_node("agent", RunnableLambda(call_model))
        workflow.add_node("tools", ToolNode(self.tools))
        workflow.set_entry_point("agent")
        workflow.add_conditional_edges(
            "agent",
            should_continue,
            {"continue": "tools", "end": END},
        )
        workflow.add_edge("tools", "agent")
        return workflow.compile()
    
    def invoke(self, complaint: str) -> dict:
        """Process a complaint and return structured response"""
        result = self.agent.invoke({
            "messages": [{"role": "user", "content": complaint}]
        })
        
        # Extract final response
        final_message = result["messages"][-1]
        return parse_structured_response(final_message)

print("✅ ComplaintsAgentCore class defined")

### Test the Agent

Before creating the full evaluation dataset, let's validate that the agent works correctly with a single test case. We will:
1. Retrieve a real order ID from the `all_events` table
2. Query the agent with a delivery delay complaint and the order ID

We will also enable MLflow autologging so we can review the agent's execution trace in the UI.

In [None]:
# Get a sample order ID for testing
sample_order = spark.sql(f"""
    SELECT order_id 
    FROM {CATALOG}.lakeflow.all_events 
    WHERE event_type='delivered'
    LIMIT 1
""").collect()[0]['order_id']

print(f"Using test order ID: {sample_order}")

In [None]:
import mlflow

# Enable autologging for LangChain
mlflow.langchain.autolog()

# Create agent instance with a test model
test_agent = ComplaintsAgentCore(
    model_endpoint="databricks-gpt-5",
    catalog=CATALOG
)

# Test with a delivery delay complaint
test_complaint = f"My order took forever to arrive! Order ID: {sample_order}"

result = test_agent.invoke(test_complaint)

print(result)

In our run, the agent analyzed the complaint and returned a structured response. For this delivery delay complaint, the agent:
- Categorized it as `delivery_delay`
- Suggested a credit of $1.00 (15% of order total, rounded)
- Assigned `medium` confidence
- Provided detailed rationale citing specific delivery time (31.38 min), location percentiles (P50: 26.13, P75: 31.05), and calculation logic

This demonstrates the agent's ability to retrieve order data, compare against benchmarks, and make data-driven credit recommendations.
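
The arithmetic behind that credit can be sketched in a few lines. This is only an illustration of the delivery-delay tiers in the system prompt, not code the agent runs (the LLM applies the rules itself). The function name `suggest_delay_credit`, the P90 and P99 values, and the $6.60 order total are hypothetical; only the delivery time (31.38 min) and P75 (31.05) come from the run described above. How exact ties at a percentile boundary are treated is also an assumption.

```python
import math


def suggest_delay_credit(delivery_min: float, order_total: float,
                         p75: float, p90: float, p99: float) -> float:
    """Return a credit in dollars following the prompt's delay tiers."""
    if delivery_min > p99:
        rate = 0.50
    elif delivery_min > p90:
        rate = 0.25
    elif delivery_min > p75:
        rate = 0.15
    else:
        rate = 0.0  # on-time delivery: $0 credit
    # Round half up to the nearest $0.50, per the prompt's IMPORTANT NOTES
    return math.floor(order_total * rate * 2 + 0.5) / 2


# Delivery time and P75 are from the run above; P90, P99, and the total are made up.
credit = suggest_delay_credit(31.38, order_total=6.60, p75=31.05, p90=40.0, p99=55.0)
print(f"${credit:.2f}")  # prints $1.00
```

A deterministic check like this can also serve as a code-based scorer to cross-check the credit amounts the agent proposes.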

## 3. Create Evaluation Dataset

Now we will generate 15 diverse complaint scenarios to thoroughly test model performance across different situations:
- Delivery delays (with varying severity)
- Missing items (verifiable vs suspicious)
- Food quality complaints (specific vs vague)
- Service issues (should escalate)
- Health/safety concerns (urgent escalation)
- Edge cases (multiple issues in one complaint, wrong order received, order marked delivered but never arrived)

A diverse dataset helps reveal model strengths and weaknesses across the decision space.

In [None]:
# Get real order IDs for realistic eval scenarios
all_order_ids = [
    row['order_id'] for row in spark.sql(f"""
        SELECT DISTINCT order_id 
        FROM {CATALOG}.lakeflow.all_events 
        WHERE event_type='delivered'
        LIMIT 15
    """).collect()
]

In [None]:
# Create 15 diverse complaint scenarios (exactly one per order ID)
templates = [
    "My order took forever to arrive! Order ID: {oid}",
    "Order {oid} arrived late and cold",
    "My falafel was completely soggy and inedible. Order: {oid}",
    "The gyro meat was overcooked and dry. Order: {oid}",
    "Everything tasted bad. Order {oid}",
    "My entire falafel bowl was missing from the order! Order: {oid}",
    "No drinks in my order {oid}",
    "Your driver was extremely rude to me. Order: {oid}",
    "This food made me sick, possible food poisoning. Order: {oid}",
    "Order {oid} was late AND missing items AND cold!",
    "The packaging was torn open when it arrived. Order: {oid}",
    "The fries were completely soggy by the time they got here. Order: {oid}",
    "I received someone else’s order by mistake. Order: {oid}",
    "The sauce containers leaked all over the bag. Order: {oid}",
    "The order was marked delivered but never showed up. Order: {oid}",
]

# Map 1:1 (use each of the 15 IDs exactly once)
complaints = [tmpl.format(oid=all_order_ids[i]) for i, tmpl in enumerate(templates)]

# Format for mlflow.genai.evaluate
eval_data = [{"inputs": {"complaint": c}} for c in complaints]
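
One fragile spot in the cell above is the positional lookup `all_order_ids[i]`: if the query returns fewer than 15 delivered orders, it raises an `IndexError` partway through. A minimal sketch of a stricter pairing follows. `build_eval_data` is a hypothetical helper, and the templates and order IDs in the demo are placeholders rather than real data.

```python
def build_eval_data(templates: list[str], order_ids: list[str]) -> list[dict]:
    """Pair each template with one order ID and wrap it for evaluation."""
    if len(order_ids) < len(templates):
        raise ValueError(
            f"Need {len(templates)} order IDs but only found {len(order_ids)}"
        )
    return [
        {"inputs": {"complaint": tmpl.format(oid=oid)}}
        for tmpl, oid in zip(templates, order_ids)
    ]


# Placeholder templates and IDs, for illustration only
demo = build_eval_data(
    ["Order {oid} arrived late and cold", "No drinks in my order {oid}"],
    ["A1", "B2"],
)
print(demo[0])
```

Failing early with a clear message is preferable here because a partially built dataset would make pass rates incomparable across models.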

## 4. Define [MLflow Scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/)

We use two complementary LLM judges with no overlap:

1. **Evidence Groundedness (Agent-as-a-Judge with `{{ trace }}`)**: Inspects execution traces to verify decisions align with tool outputs. Uses MCP tools (GetSpan, ListSpans, etc.) to examine what data was actually returned by tool calls and whether the agent's rationale matches that evidence. Shows off agentic evaluation capabilities. **Not aligned** (alignment not yet supported for trace-aware judges).

2. **Rationale Sufficiency (Template-based Judge)**: Evaluates if a human could understand the decision logic from the rationale alone. Uses `{{ inputs }}` and `{{ outputs }}` templates to assess clarity, completeness, and logical flow. **Will be aligned** with human feedback using SIMBA optimization.

Both of these scorers use the `mlflow.genai.judges.make_judge` function to create [template-based scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/make-judge/). There are several other ways to define scorers in MLflow, including a variety of [pre-defined scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/), [guideline-based LLM scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/guidelines/), and [code-based scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/custom/).

Note that, just like we can use different models for our agent, we can also use different models for our scorers!

In [None]:
from mlflow.genai.judges import make_judge

# Scorer 1: Evidence Groundedness (Agent-as-a-Judge, Binary Pass/Fail)
# Uses {{ trace }} to enable agentic evaluation with tool access
evidence_groundedness_judge = make_judge(
    name="evidence_groundedness",
    instructions="""
Evaluate whether the agent's decision is grounded in evidence from the execution {{ trace }}.

Investigation checklist:
1. Find spans where tools were called (get_order_overview, get_order_timing, get_location_timings)
2. Extract the actual outputs returned by these tool calls
3. Compare tool outputs against claims made in the agent's rationale
4. Verify that credit amounts or escalation decisions match the tool data

For delivery complaints, check:
- Does the rationale's delivery time match what get_order_timing returned?
- Does the rationale's percentile comparison match get_location_timings output?
- Is the credit calculation based on the actual order total from get_order_overview?

For missing item complaints, check:
- Does get_order_overview show the items mentioned in the rationale?
- Is the credit amount based on actual item prices from the tool output?

Customer complaint: {{ inputs }}
Agent's final output: {{ outputs }}

Rate as PASS if rationale claims match tool outputs, FAIL if there are contradictions or unsupported claims.
""",
    model="databricks:/databricks-claude-sonnet-4-5"
)

# Scorer 2: Rationale Sufficiency (Template Judge, Pass/Fail)
# This judge will be aligned with human feedback
rationale_sufficiency_judge = make_judge(
    name="rationale_sufficiency",
    instructions="""
Evaluate whether the agent's rationale is sufficient to explain and justify the decision.

Customer complaint: {{ inputs }}
Agent's output: {{ outputs }}

Check if a human reading the rationale can clearly understand:
1. What decision was made (suggest_credit or escalate)
2. Why that decision was appropriate for this complaint
3. How any credit amount was determined (if applicable)

For credit decisions:
- Does the rationale cite specific numbers (delivery time, percentiles, dollar amounts)?
- Is there a clear logical connection between the evidence mentioned and the credit amount?
- Would a human understand why this amount is fair?

For escalations:
- Does the rationale explain why escalation is needed?
- Is the priority level (standard/urgent) justified?
- Would a human reviewer know what to investigate?

Rate as PASS if the rationale is clear, complete, and logically connects evidence to decision.
Rate as FAIL if the rationale is vague, missing key information, or logic is unclear.
""",
    model="databricks:/databricks-claude-sonnet-4-5"
)

## 5. Run Baseline Evaluation (Single Model)

We will start by running an initial evaluation with one model to validate our eval setup and generate traces. Note that we run the evaluation in two phases:
1. Run the agent to generate traces
2. Retrieve the traces and evaluate them with the scorers

We can then use the traces and initial evaluation results to align the `rationale_sufficiency` judge with human feedback.

There are different approaches you can take to running evaluations, including passing a prediction function to `mlflow.genai.evaluate`. Working with a pre-generated trace dataset is simple and flexible and allows you to iterate on your judges without re-running the agent.

We will tag the traces with the baseline model name so we can easily retrieve them later.
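
The tag-then-filter pattern can be sketched without MLflow. The helper names below are hypothetical; the filter syntax mirrors the `tags.eval_group = '...'` string used in this notebook. Slugifying the model name is an added precaution (an assumption, not something the notebook does) so that display names containing spaces remain easy to quote.

```python
import re


def make_eval_tag(prefix: str, model_name: str) -> str:
    """Build a tag value such as 'baseline_llama-3-3-70b'."""
    # Collapse anything other than letters, digits, dots, and hyphens into '-'
    slug = re.sub(r"[^A-Za-z0-9.-]+", "-", model_name.strip()).strip("-")
    return f"{prefix}_{slug}"


def make_trace_filter(tag_value: str) -> str:
    """Build the filter string used to retrieve traces by eval_group tag."""
    if "'" in tag_value:
        raise ValueError("Tag value must not contain single quotes")
    return f"tags.eval_group = '{tag_value}'"


tag = make_eval_tag("baseline", "llama-3-3-70b")
print(make_trace_filter(tag))  # prints tags.eval_group = 'baseline_llama-3-3-70b'
```

Keeping tag construction in one place means the value written at tagging time and the value used at retrieval time cannot drift apart.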

In [None]:
# Use one model for initial evaluation and judge alignment
initial_model_name = "llama-3-3-70b"
initial_model_endpoint = "databricks-meta-llama-3-3-70b-instruct"
baseline_tag = f"baseline_{initial_model_name}"

In [None]:
import mlflow

# Enable autologging for detailed trace capture
mlflow.langchain.autolog()

print(f"Running {initial_model_name} on test cases to generate traces...\n")

# Create agent instance
agent = ComplaintsAgentCore(model_endpoint=initial_model_endpoint, catalog=CATALOG)

# Step 1: Run agent to generate traces (tag them for later retrieval)
with mlflow.start_run(run_name=baseline_tag) as run:
    experiment_id = run.info.experiment_id
    run_id = run.info.run_id
    
    # Invoke agent on each complaint to generate traces
    for row in eval_data:
        complaint = row['inputs']['complaint']
        result = agent.invoke(complaint)
        
        # Tag the trace for later retrieval
        trace_id = mlflow.get_last_active_trace_id()
        mlflow.set_trace_tag(trace_id, "eval_group", baseline_tag)
        mlflow.set_trace_tag(trace_id, "model", initial_model_name)

print(f"✅ Generated {len(eval_data)} traces. Run ID: {run_id}")

# Step 2: Retrieve traces for evaluation
baseline_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=15
)


# Step 3: Evaluate the traces
print(f"\nEvaluating traces with scorers...")
baseline_result = mlflow.genai.evaluate(
    data=baseline_traces,
    scorers=[evidence_groundedness_judge, rationale_sufficiency_judge]
)

In our run, the baseline evaluation:
- Generated 15 traces tagged with `baseline_llama-3-3-70b`
- Ran both judges (evidence_groundedness and rationale_sufficiency) on all traces
- Completed successfully with results viewable in the MLflow UI Evaluation tab

You can now review these traces and add human feedback to improve judge accuracy.

## 6. Provide Human Assessments for Judge Alignment

Now it's time to add human feedback to improve the judge's accuracy. To do so, navigate to the Evaluation tab in the MLflow UI and find the traces under the run generated above. You can add your feedback by clicking into each trace, then clicking the "+" next to the `rationale_sufficiency` scorer. Fill in your feedback for each trace (even if you agree with the initial assessment) along with your rationale, if needed.


## 7. Judge Alignment

After collecting human feedback, we can align the judge to better match human judgment. The SIMBA optimizer analyzes disagreements between the judge and human assessments, then generates improved instructions.

In [None]:
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

baseline_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=15,
    return_type="list"
)


# Get traces with human feedback for alignment
traces_for_alignment = baseline_traces

# Align the rationale_sufficiency judge with human feedback
# Note: We align rationale_sufficiency (not evidence_groundedness) because alignment
# is not yet supported for trace-aware judges using {{ trace }} 
# (source: https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/alignment/#quick-start-align-your-first-judge)


aligned_judge = rationale_sufficiency_judge.align(traces_for_alignment)

During alignment, the SIMBA optimizer analyzed disagreements between the initial judge and human assessments, then generated improved instructions. The aligned judge now includes specific guidance learned from human feedback:

**For missing item complaints:**
- Focus solely on the specifics of that complaint
- Clearly state what item is missing
- Provide the item's price from the order
- Explain how the credit amount is calculated based on that price
- Avoid discussing irrelevant factors like delivery time unless they directly impact the decision

**For all complaints:**
- Ensure all statements are consistent and logically support the decision
- Clarify why a credit is appropriate in the context of the complaint
- Connect evidence explicitly to decisions

The alignment process took approximately 2-3 minutes. The resulting aligned judge is better calibrated to human judgment and more focused on relevant complaint details.
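
One way to check whether alignment helped is to compare each judge's verdicts with the human labels on the same traces. A minimal sketch follows; the function name and the verdict lists are hypothetical placeholders, not results from this notebook.

```python
def agreement_rate(judge_verdicts: list[str], human_labels: list[str]) -> float:
    """Fraction of traces where the judge and the human reviewer agree."""
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("Verdict lists must have the same length")
    if not human_labels:
        raise ValueError("Need at least one labeled trace")
    matches = sum(
        j.strip().upper() == h.strip().upper()
        for j, h in zip(judge_verdicts, human_labels)
    )
    return matches / len(human_labels)


# Made-up verdicts for five traces
human = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]
initial = ["PASS", "PASS", "PASS", "FAIL", "FAIL"]
aligned = ["PASS", "FAIL", "PASS", "FAIL", "FAIL"]

print(f"initial judge: {agreement_rate(initial, human):.0%}")  # prints initial judge: 60%
print(f"aligned judge: {agreement_rate(aligned, human):.0%}")  # prints aligned judge: 80%
```

Agreement measured on the same traces used for alignment tends to be optimistic, so a few held-out labeled traces give a fairer estimate.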

## 8. Model Comparison with Aligned Judge

Now that the judge is aligned with human feedback, compare multiple models to find the best one for your use case. Check out the list of models in the Databricks Model Serving tab to see the available models.

Suppose we are interested in comparing the following models:
- **databricks-claude-sonnet-4-5**
- **databricks-gpt-5-mini**

In [None]:
# Define models to compare
models_to_compare = {
    "Claude Sonnet 4.5": "databricks-claude-sonnet-4-5",
    "GPT-5 mini": "databricks-gpt-5-mini"
}

comparison_results = {}

for model_name, endpoint in models_to_compare.items():
    print(f"\nRunning {model_name} to generate traces...")
    
    agent = ComplaintsAgentCore(model_endpoint=endpoint, catalog=CATALOG)
    
    # Step 1: Generate traces with tags
    comparison_tag = f"comparison_{model_name}"
    with mlflow.start_run(run_name=comparison_tag) as run:
        # Invoke agent on each complaint to generate traces
        for row in eval_data:
            complaint = row['inputs']['complaint']
            result = agent.invoke(complaint)
            
            # Tag the trace for later retrieval
            trace_id = mlflow.get_last_active_trace_id()
            mlflow.set_trace_tag(trace_id, "eval_group", comparison_tag)
            mlflow.set_trace_tag(trace_id, "model", model_name)
    
    print(f"  Generated {len(eval_data)} traces for {model_name}")
    
    # Step 2: Retrieve traces
    traces = mlflow.search_traces(
        experiment_ids=[experiment_id],
        filter_string=f"tags.eval_group = '{comparison_tag}'",
        max_results=15
    )
    
    # Step 3: Evaluate traces
    print(f"  Evaluating {len(traces)} traces with aligned judge...")
    result = mlflow.genai.evaluate(
        data=traces,
        scorers=[evidence_groundedness_judge, aligned_judge]
    )
    comparison_results[model_name] = result

For each model in the comparison:
- Generated 15 new traces (one per complaint scenario)
- Tagged traces for easy retrieval and organization
- Evaluated with both the aligned rationale_sufficiency judge and the trace-aware evidence_groundedness judge
- Results are available in MLflow UI for detailed analysis

The comparison enables side-by-side evaluation of evidence groundedness, rationale sufficiency, and latency.

## 9. Results Analysis

Compare the models to identify the best one for your use case:
- Which model has the highest evidence groundedness (decisions align with tool data)?
- Which model has the most sufficient rationales (clear logic from rationale to decision)?
- What are the latency trade-offs between models?
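
For pass/fail judges, the headline quality number is the share of traces rated PASS. A sketch of that aggregation follows, with the function name and sample verdicts as hypothetical placeholders.

```python
from collections import Counter


def pass_rate(verdicts: list[str]) -> float:
    """Share of verdicts equal to PASS, ignoring case and whitespace."""
    if not verdicts:
        return 0.0
    counts = Counter(v.strip().upper() for v in verdicts)
    return counts["PASS"] / len(verdicts)


# Made-up verdicts for 15 traces from one model
sample = ["PASS"] * 12 + ["FAIL"] * 3
print(f"{pass_rate(sample):.1%}")  # prints 80.0%
```

With only 15 traces, a single verdict moves the pass rate by about 6.7 percentage points, so small gaps between models should be read with caution.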

In [None]:
import pandas as pd

# Build comparison dataframe
comparison_data = []

# Add baseline result for reference
comparison_data.append({
    'model': f"{initial_model_name} (baseline)",
    'evidence_groundedness': baseline_result.metrics.get('evidence_groundedness/percentage', 0),
    'rationale_sufficiency': baseline_result.metrics.get('rationale_sufficiency/percentage', 0),
    'latency_p50_ms': baseline_result.metrics.get('latency/p50', 0) * 1000 if baseline_result.metrics.get('latency/p50') else 0,
})

# Add comparison results
for model_name, result in comparison_results.items():
    comparison_data.append({
        'model': model_name,
        'evidence_groundedness': result.metrics.get('evidence_groundedness/percentage', 0),
        'rationale_sufficiency': result.metrics.get('rationale_sufficiency/percentage', 0),
        'latency_p50_ms': result.metrics.get('latency/p50', 0) * 1000 if result.metrics.get('latency/p50') else 0,
    })

comparison_df = pd.DataFrame(comparison_data)

print("\nModel Comparison Results:")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)
print("\nNote: Baseline used initial judge; other models used aligned judge")

## Summary

- Built a parameterized complaint triage agent that uses UC tools.
- Generated a concise eval set and ran a baseline to create traces.
- Evaluated with two judges: evidence_groundedness (trace-aware, unaligned) and rationale_sufficiency (template judge, aligned via SIMBA with human feedback).
- Compared multiple serving endpoints using the aligned judge and reviewed accuracy and latency trade-offs.

**Key takeaways:**
1. Databricks lets you use the models you prefer and provides first-class tools to build agents with UC functions, capture traces, and evaluate them rigorously with MLflow.
2. Trace-aware judging verifies decisions against actual tool outputs; template judges can be aligned with human feedback for better agreement.
3. Choose the endpoint that best balances quality, latency, and cost.
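
As a rough illustration of that last takeaway, one simple selection rule is to keep only models within a latency budget and then pick the one with the best average judge score. The function name, the budget, and every number below are made up for illustration; the dictionary keys mirror the columns of `comparison_df`. Cost is left out because this notebook does not measure it.

```python
def pick_model(results: list[dict], max_latency_ms: float) -> str:
    """Pick the model with the best mean judge score within a latency budget."""
    eligible = [r for r in results if r["latency_p50_ms"] <= max_latency_ms]
    if not eligible:
        raise ValueError("No model meets the latency budget")
    best = max(
        eligible,
        key=lambda r: (r["evidence_groundedness"] + r["rationale_sufficiency"]) / 2,
    )
    return best["model"]


# Made-up scores, for illustration only
results = [
    {"model": "model-a", "evidence_groundedness": 0.93,
     "rationale_sufficiency": 0.87, "latency_p50_ms": 9000},
    {"model": "model-b", "evidence_groundedness": 0.80,
     "rationale_sufficiency": 0.80, "latency_p50_ms": 4000},
]
print(pick_model(results, max_latency_ms=5000))  # prints model-b
```

Relaxing the budget to 10000 ms would select `model-a` instead, which makes the quality and latency trade-off explicit.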