# Choose the Best Model for Your Agent

Databricks provides native access to a wide range of major AI model families, including GPT, Claude, Gemini, Llama, and more. MLflow on Databricks supports sophisticated model evaluation and comparison workflows, including trace-aware, human-aligned agentic evaluation.

In this notebook, we will explore some of the ways Databricks and MLflow empower you to rigorously iterate on the quality of your agent and to choose the best model for your use case. In particular, we will show how to use a trace-aware agentic judge and a human-aligned template-based judge to evaluate and compare different models.

**Scenario:** You've built a prototype complaint triage agent for Casper's Kitchens, a ghost kitchen network. You want to know which AI model should power your production agent.

**What you'll do:**
1. Define a prototype agent for customer complaint triage
2. Create an evaluation dataset with diverse complaint scenarios
3. Evaluate different models using agentic- and template-based scorers
4. Review results and add human feedback via MLflow UI
5. Align the judge with human feedback to improve accuracy
6. Re-evaluate candidate models with the aligned judge
7. Choose the model endpoint that best balances quality, latency, and cost

**Prerequisites:** This demo assumes UC tools have been created by running `stages/complaint_agent.ipynb` and that you have run `stages/raw_data.ipynb` and `stages/lakeflow.ipynb` to start the raw data stream. See the [README](../../../README.md) for details on setting up the Casper's environment.

## 1. Setup & Prerequisites

Install required packages and verify that UC tools exist.

In [None]:
%pip install -U -qqqq mlflow[databricks] dspy-ai unitycatalog-openai[databricks] pydantic
%restart_python

In [None]:
# Create widget for catalog parameter
dbutils.widgets.text("CATALOG", "caspersdev", "UC Catalog")

CATALOG = dbutils.widgets.get("CATALOG")
if not CATALOG:
    raise ValueError("Please provide a CATALOG name in the widget above")
print(f"Using catalog: {CATALOG}")

In [None]:
# Verify UC tools exist
print("Checking for required UC functions...")

required_functions = ['get_order_overview', 'get_order_timing', 'get_location_timings']

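for func_name in required_functions:
    try:
        spark.sql(f"DESCRIBE FUNCTION {CATALOG}.ai.{func_name}").collect()
    except Exception as exc:
        # Name the missing function and keep the original error for debugging
        raise RuntimeError(
            f"UC function {CATALOG}.ai.{func_name} not found. "
            "Run ../../stages/complaint_agent.ipynb first to create these tools."
        ) from exc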

print("\n✅ All required UC functions found")

In [None]:
# Configure MLflow experiment

import mlflow

experiment_name = f"/Shared/{CATALOG}_model_comparison"
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id

## 2. Define Agent

The cell below defines a DSPy version of the complaint agent used in the [complaint_agent.ipynb](../../stages/complaint_agent.ipynb) notebook. The agent uses DSPy's ReAct module for automatic tool orchestration and can call Unity Catalog functions to retrieve order details, delivery timing, and location timings.

In this scenario, imagine that we are earlier in the prototyping phase and we want to quickly iterate on the agent's design. We are making fundamental design choices like which models to use and how to structure the agent's prompt.

This version of the complaint agent takes `model_endpoint` as a parameter, enabling us to easily swap between different models for comparison. Beyond that, the specific implementation details are largely unimportant for the purposes of this demo: you can use the same general evaluation and comparison principles regardless of the underlying agent implementation.

In [None]:
from typing import Optional, Literal
from pydantic import BaseModel, Field, field_validator, ValidationError

import dspy
from unitycatalog.ai.core.base import get_uc_function_client

# Initialize UC function client
uc_client = get_uc_function_client()

# Define response schema (same as production)
class ComplaintResponse(BaseModel):
    """Structured output for complaint triage decisions."""
    order_id: str
    complaint_category: Literal["delivery_delay", "missing_items", "food_quality", "service_issue", "billing", "other"] = Field(
        description="Exactly ONE primary complaint category"
    )
    decision: Literal["suggest_credit", "escalate"]
    credit_amount: Optional[float] = None
    confidence: Optional[Literal["high", "medium", "low"]] = None
    priority: Optional[Literal["standard", "urgent"]] = None
    rationale: str
    
    @field_validator('complaint_category', mode='before')
    @classmethod
    def parse_category(cls, v):
        """Extract first valid category if multiple provided."""
        if not isinstance(v, str):
            return v
            
        valid_categories = ["delivery_delay", "missing_items", "food_quality", "service_issue", "billing", "other"]
        v_lower = v.lower().strip()
        
        # Exact match
        if v_lower in valid_categories:
            return v_lower
        
        # Find first valid category in string
        for cat in valid_categories:
            if cat in v_lower:
                return cat
        
        return "other"
    
    @field_validator('confidence', mode='before')
    @classmethod
    def parse_confidence(cls, v):
        """Ensure valid confidence value."""
        if v is None or (isinstance(v, str) and v.lower() == "null"):
            return None
        if isinstance(v, str):
            v_lower = v.lower().strip()
            if v_lower in ["high", "medium", "low"]:
                return v_lower
            return "medium"
        return v
    
    @field_validator('priority', mode='before')
    @classmethod
    def parse_priority(cls, v):
        """Ensure valid priority value."""
        if v is None or (isinstance(v, str) and v.lower() == "null"):
            return None
        if isinstance(v, str):
            v_lower = v.lower().strip()
            if v_lower in ["standard", "urgent"]:
                return v_lower
            return "standard"
        return v


class ComplaintTriage(dspy.Signature):
    """Analyze customer complaints for Casper's Kitchens and recommend triage actions.
    
    Process:
    1. Extract order_id from complaint
    2. Use get_order_overview(order_id) for order details and items
    3. Use get_order_timing(order_id) for delivery timing
    4. For delays, use get_location_timings(location) for percentile benchmarks
    5. Make data-backed decision
    
    Decision Framework:
    
    SUGGEST_CREDIT (with credit_amount and confidence):
    - Delivery delays: Compare actual delivery time to location percentiles
      * <P75: Suggest $0 credit (low confidence - on-time or minimal delay)
      * P75-P99: Suggest 15% of order total (medium to high confidence)
      * >P99: Suggest 25% of order total (high confidence)
    - Missing items: Use actual item prices from order data when available
      * Verify claimed item exists in order (affects confidence)
      * Use real costs from order data, or estimate $8-12 per item if unavailable
    - Food quality: 20-40% of order total based on severity
      * Minor issues (slightly cold, minor preparation issue): 20% (medium confidence)
      * Major issues (completely inedible, wrong preparation, health concern): 40% (high confidence)
      * Vague complaints ("bad", "gross"): escalate instead
    
    ESCALATE (with priority):
    - priority="standard": Vague complaints, missing data, billing issues, service complaints
    - priority="urgent": Legal threats, health/safety concerns, suspected fraud, abusive language
    
    Output Requirements:
    - For suggest_credit: credit_amount is REQUIRED and must be a number (can be 0.0 if no credit warranted), confidence is REQUIRED, priority must be null
    - For escalate: priority is REQUIRED, credit_amount and confidence must be null
    - complaint_category: Choose EXACTLY ONE category (the primary one)
    - Rationale must cite specific evidence (delivery times, percentiles, item verification, order total)
    - Rationale should be detailed but under 150 words
    - Round credit amounts to nearest $0.50
    - Confidence: high (strong data), medium (reasonable inference), low (weak/contradictory)
    """
    
    complaint: str = dspy.InputField(desc="Customer complaint text")
    order_id: str = dspy.OutputField(desc="Extracted order ID")
    complaint_category: str = dspy.OutputField(desc="EXACTLY ONE category: delivery_delay, missing_items, food_quality, service_issue, billing, or other")
    decision: str = dspy.OutputField(desc="EXACTLY ONE: suggest_credit or escalate")
    credit_amount: str = dspy.OutputField(desc="If suggest_credit: MUST be a number (e.g., 0.0, 10.5). If escalate: null")
    confidence: str = dspy.OutputField(desc="If suggest_credit: EXACTLY ONE of high, medium, low. If escalate: null")
    priority: str = dspy.OutputField(desc="If escalate: EXACTLY ONE of standard or urgent. If suggest_credit: null")
    rationale: str = dspy.OutputField(desc="Data-focused justification citing specific evidence")


# Unity Catalog tool wrappers
def get_order_overview(order_id: str) -> str:
    """Get order details including items, location, and customer info."""
    result = uc_client.execute_function(
        f"{CATALOG}.ai.get_order_overview",
        {"oid": order_id}
    )
    return str(result.value)


def get_order_timing(order_id: str) -> str:
    """Get timing information for a specific order."""
    result = uc_client.execute_function(
        f"{CATALOG}.ai.get_order_timing",
        {"oid": order_id}
    )
    return str(result.value)


def get_location_timings(location: str) -> str:
    """Get delivery time percentiles for a specific location."""
    result = uc_client.execute_function(
        f"{CATALOG}.ai.get_location_timings",
        {"loc": location}
    )
    return str(result.value)


class ComplaintTriageModule(dspy.Module):
    """DSPy module for complaint triage with tool calling."""
    
    def __init__(self, model_endpoint: str):
        super().__init__()
        # Configure DSPy with the specified model endpoint
        lm = dspy.LM(f'databricks/{model_endpoint}', max_tokens=2000)
        dspy.configure(lm=lm)
        
        self.react = dspy.ReAct(
            signature=ComplaintTriage,
            tools=[get_order_overview, get_order_timing, get_location_timings],
            max_iters=10
        )
    
    def forward(self, complaint: str, max_retries: int = 2) -> ComplaintResponse:
        """Process complaint and return structured triage decision with retry on validation failure."""
        
        for attempt in range(max_retries + 1):
            try:
                result = self.react(complaint=complaint)
                
                # Parse credit_amount
                credit_amount = None
                if result.credit_amount and result.credit_amount.lower() != "null":
                    try:
                        credit_amount = float(result.credit_amount)
                    except (ValueError, TypeError):
                        if result.decision == "suggest_credit":
                            # Raise ValueError: pydantic's ValidationError cannot be built from a plain message
                            raise ValueError("suggest_credit requires valid numeric credit_amount")
                
                # Validate business rules before Pydantic construction
                if result.decision == "suggest_credit" and credit_amount is None:
                    raise ValueError("suggest_credit requires credit_amount to be a number (can be 0.0)")
                
                # Construct Pydantic model - field validators run here
                return ComplaintResponse(
                    order_id=result.order_id,
                    complaint_category=result.complaint_category,
                    decision=result.decision,
                    credit_amount=credit_amount,
                    confidence=result.confidence,
                    priority=result.priority,
                    rationale=result.rationale
                )
                
            except (ValidationError, ValueError) as e:
                if attempt < max_retries:
                    # Retry - DSPy will regenerate with potentially different output
                    continue
                else:
                    # Final attempt failed - re-raise
                    raise


class ComplaintsAgentCore:
    """Lightweight complaint agent for model comparison using DSPy"""
    
    def __init__(self, model_endpoint: str, catalog: str):
        self.model_endpoint = model_endpoint
        self.catalog = catalog
        global CATALOG
        CATALOG = catalog
        
        # Build DSPy agent
        self.agent = ComplaintTriageModule(model_endpoint=model_endpoint)
    
    def invoke(self, complaint: str) -> dict:
        """Process a complaint and return structured response"""
        result = self.agent(complaint=complaint)
        return result.model_dump()

print("✅ ComplaintsAgentCore class defined")
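
One detail of the `parse_category` validator is easy to miss: when the model returns several categories in one string, the validator keeps the first match in the order of `valid_categories`, not the first one that appears in the text. The sketch below mirrors that logic as a plain function so the behavior can be checked without Pydantic or a model call. The name `normalize_category` is hypothetical and is not part of the agent code.

```python
# Same category list and order as in ComplaintResponse.parse_category
VALID_CATEGORIES = [
    "delivery_delay", "missing_items", "food_quality",
    "service_issue", "billing", "other",
]


def normalize_category(raw: str) -> str:
    """Hypothetical stand-alone mirror of the parse_category validator."""
    value = raw.lower().strip()

    # Exact match after lowercasing and trimming
    if value in VALID_CATEGORIES:
        return value

    # Substring scan follows VALID_CATEGORIES order, not order of appearance
    for category in VALID_CATEGORIES:
        if category in value:
            return category

    # Anything unrecognized falls back to "other"
    return "other"


print(normalize_category(" Food_Quality "))
print(normalize_category("missing_items, delivery_delay"))
print(normalize_category("refund request"))
```

The second call returns `delivery_delay` even though `missing_items` comes first in the text, which is worth keeping in mind when reading category results for multi-issue complaints.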

### Test the Agent

Before creating the full evaluation dataset, let's validate that the agent works correctly with a single test case. We will:
1. Retrieve a real order ID from the `all_events` table
2. Query the agent with a delivery delay complaint and the order ID

We will also enable MLflow autologging so we can review the agent's execution trace in the UI.

In [None]:
# Get a sample order ID for testing
sample_order = spark.sql(f"""
    SELECT order_id 
    FROM {CATALOG}.lakeflow.all_events 
    WHERE event_type='delivered'
    LIMIT 1
""").collect()[0]['order_id']

print(f"Using test order ID: {sample_order}")

In [None]:
import mlflow

# Enable autologging for DSPy
mlflow.dspy.autolog()

# Create agent instance with a test model
test_agent = ComplaintsAgentCore(
    model_endpoint="databricks-gpt-5",
    catalog=CATALOG
)

# Test with a delivery delay complaint
test_complaint = f"My order took forever to arrive! Order ID: {sample_order}"

result = test_agent.invoke(test_complaint)

print(result)

The DSPy agent successfully analyzed the complaint and returned a structured response. For this particular delivery delay complaint, the agent:
- Categorized it as `delivery_delay`
- Made a data-driven decision using the ReAct workflow to call UC tools
- Provided detailed rationale citing specific evidence from tool outputs

This demonstrates DSPy's ability to automatically orchestrate tool calls, retrieve order data, compare against benchmarks, and make data-driven credit recommendations using the ReAct pattern.
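
To sanity-check a credit amount in the agent's output, it can help to work the delivery-delay rule from the `ComplaintTriage` prompt by hand. The sketch below encodes that rule as plain arithmetic: no credit below P75, 15% of the order total between P75 and P99, 25% above P99, rounded to the nearest $0.50. The function names and the sample numbers are hypothetical, and the agent itself relies on the model's reasoning rather than on code like this.

```python
def round_to_half_dollar(amount: float) -> float:
    """Round to the nearest $0.50 (Python rounds exact ties to the even value)."""
    return round(amount * 2) / 2


def suggest_delay_credit(
    delivery_minutes: float,
    p75_minutes: float,
    p99_minutes: float,
    order_total: float,
) -> float:
    """Hypothetical helper: delivery-delay credit per the prompt's thresholds."""
    if delivery_minutes < p75_minutes:
        rate = 0.0    # on time or minimal delay
    elif delivery_minutes <= p99_minutes:
        rate = 0.15   # P75 to P99
    else:
        rate = 0.25   # slower than P99
    return round_to_half_dollar(order_total * rate)


# Made-up example: P75 = 40 min, P99 = 70 min, order total = $42.00
for minutes in (30, 55, 90):
    print(minutes, suggest_delay_credit(minutes, 40, 70, 42.00))
```

For the 55-minute delivery, 15% of $42.00 is $6.30, which rounds to $6.50.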

## 3. Create Evaluation Dataset

Now we will generate 15 diverse complaint scenarios to thoroughly test model performance across different situations:
- Delivery delays (with varying severity)
- Missing items (verifiable vs suspicious)
- Food quality complaints (specific vs vague)
- Service issues (should escalate)
- Health/safety concerns (urgent escalation)
- Edge cases (multiple issues in one complaint, wrong order delivered, order marked delivered but never received)

A diverse dataset helps reveal model strengths and weaknesses across the decision space.

In [None]:
# Get real order IDs for realistic eval scenarios
all_order_ids = [
    row['order_id'] for row in spark.sql(f"""
        SELECT DISTINCT order_id 
        FROM {CATALOG}.lakeflow.all_events 
        WHERE event_type='delivered'
        LIMIT 15
    """).collect()
]

In [None]:
# Create 15 diverse complaint scenarios (exactly one per order ID)
templates = [
    "My order took forever to arrive! Order ID: {oid}",
    "Order {oid} arrived late and cold",
    "My falafel was completely soggy and inedible. Order: {oid}",
    "The gyro meat was overcooked and dry. Order: {oid}",
    "Everything tasted bad. Order {oid}",
    "My entire falafel bowl was missing from the order! Order: {oid}",
    "No drinks in my order {oid}",
    "Your driver was extremely rude to me. Order: {oid}",
    "This food made me sick, possible food poisoning. Order: {oid}",
    "Order {oid} was late AND missing items AND cold!",
    "The packaging was torn open when it arrived. Order: {oid}",
    "The fries were completely soggy by the time they got here. Order: {oid}",
    "I received someone else’s order by mistake. Order: {oid}",
    "The sauce containers leaked all over the bag. Order: {oid}",
    "The order was marked delivered but never showed up. Order: {oid}",
]

# Map 1:1 (use each of the 15 IDs exactly once)
complaints = [tmpl.format(oid=all_order_ids[i]) for i, tmpl in enumerate(templates)]

# Format for mlflow.genai.evaluate
eval_data = [{"inputs": {"complaint": c}} for c in complaints]
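
The 1:1 mapping above assumes the query returned at least as many order IDs as there are templates; with fewer, the index lookup raises an `IndexError`. A minimal sketch of the same construction with an explicit guard is shown below. `build_eval_data` and the `ORDER-00x` IDs are hypothetical placeholders, not values from the Casper's dataset.

```python
def build_eval_data(templates: list[str], order_ids: list[str]) -> list[dict]:
    """Hypothetical helper: pair each template with one order ID."""
    if len(order_ids) < len(templates):
        raise ValueError(
            f"Need {len(templates)} order IDs but only found {len(order_ids)}"
        )
    # zip pairs templates and IDs positionally, one ID per template
    complaints = [t.format(oid=oid) for t, oid in zip(templates, order_ids)]
    # Each record nests the agent's keyword arguments under "inputs"
    return [{"inputs": {"complaint": c}} for c in complaints]


demo_templates = [
    "Order {oid} arrived late and cold",
    "No drinks in my order {oid}",
]
demo_order_ids = ["ORDER-001", "ORDER-002"]  # placeholder IDs

demo_eval_data = build_eval_data(demo_templates, demo_order_ids)
print(demo_eval_data[0])
```

The `complaint` key matches the parameter name of `ComplaintsAgentCore.invoke`, which keeps the records usable if you later pass a prediction function to `mlflow.genai.evaluate` instead of pre-generated traces.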

## 4. Define [MLflow Scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/)

We use two complementary LLM judges with no overlap:

1. **Evidence Groundedness (Agent-as-a-Judge with `{{ trace }}`)**: Inspects execution traces to verify that decisions align with tool outputs. The judge uses MCP tools (GetSpan, ListSpans, etc.) to examine what data the tool calls actually returned and whether the agent's rationale matches that evidence, which demonstrates agentic evaluation. **Not aligned** (alignment is not yet supported for trace-aware judges).

2. **Rationale Sufficiency (Template-based Judge)**: Evaluates whether a human could understand the decision logic from the rationale alone. Uses `{{ inputs }}` and `{{ outputs }}` templates to assess clarity, completeness, and logical flow. **Will be aligned** with human feedback using SIMBA optimization.

Both of these scorers use the `mlflow.genai.judges.make_judge` function to create [template-based scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/make-judge/). There are several other ways to define scorers in MLflow, including a variety of [pre-defined scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/), [guideline-based LLM scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/guidelines/), and [code-based scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/custom/).

Note that, just like we can use different models for our agent, we can also use different models for our scorers!

In [None]:
from mlflow.genai.judges import make_judge

# Scorer 1: Evidence Groundedness (Agent-as-a-Judge, Binary Pass/Fail)
# Uses {{ trace }} to enable agentic evaluation with tool access
evidence_groundedness_judge = make_judge(
    name="evidence_groundedness",
    instructions="""
Evaluate whether the agent's decision is grounded in evidence from the execution {{ trace }}.

Investigation checklist:
1. Find spans where tools were called (get_order_overview, get_order_timing, get_location_timings)
2. Extract the actual outputs returned by these tool calls
3. Compare tool outputs against claims made in the agent's rationale
4. Verify that credit amounts or escalation decisions match the tool data

For delivery complaints, check:
- Does the rationale's delivery time match what get_order_timing returned?
- Does the rationale's percentile comparison match get_location_timings output?
- Is the credit calculation based on the actual order total from get_order_overview?

For missing item complaints, check:
- Does get_order_overview show the items mentioned in the rationale?
- Is the credit amount based on actual item prices from the tool output?

Customer complaint: {{ inputs }}
Agent's final output: {{ outputs }}

Rate as PASS if rationale claims match tool outputs, FAIL if there are contradictions or unsupported claims.
""",
    model="databricks:/databricks-gpt-5-mini"
)

# Scorer 2: Rationale Sufficiency (Template Judge, Pass/Fail)
# This judge will be aligned with human feedback
rationale_sufficiency_judge = make_judge(
    name="rationale_sufficiency",
    instructions="""
Evaluate whether the agent's rationale is sufficient to explain and justify the decision.

Customer complaint: {{ inputs }}
Agent's output: {{ outputs }}

Check if a human reading the rationale can clearly understand:
1. What decision was made (suggest_credit or escalate)
2. Why that decision was appropriate for this complaint
3. How any credit amount was determined (if applicable)

For credit decisions:
- Does the rationale cite specific numbers (delivery time, percentiles, dollar amounts)?
- Is there a clear logical connection between the evidence mentioned and the credit amount?
- Would a human understand why this amount is fair?

For escalations:
- Does the rationale explain why escalation is needed?
- Is the priority level (standard/urgent) justified?
- Would a human reviewer know what to investigate?

Rate as PASS if the rationale is clear, complete, and logically connects evidence to decision.
Rate as FAIL if the rationale is vague, missing key information, or logic is unclear.
""",
    model="databricks:/databricks-gemini-2-5-pro"
)
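
The two LLM judges cover groundedness and rationale quality. The structural rules in the prompt's "Output Requirements" are deterministic, so a code-based check is a natural complement. The function below is a minimal sketch of such a check over the agent's output dictionary. The name `output_contract_ok` is hypothetical; one possible way to use it as a scorer would be to wrap it with the `@scorer` decorator from `mlflow.genai.scorers`, which is left out here so the example runs without MLflow.

```python
def output_contract_ok(outputs: dict) -> bool:
    """Hypothetical check of the suggest_credit / escalate field rules."""
    decision = outputs.get("decision")
    credit = outputs.get("credit_amount")
    confidence = outputs.get("confidence")
    priority = outputs.get("priority")

    if decision == "suggest_credit":
        # credit_amount must be numeric (0.0 allowed), confidence set, priority null
        credit_is_number = isinstance(credit, (int, float)) and not isinstance(credit, bool)
        return credit_is_number and confidence is not None and priority is None

    if decision == "escalate":
        # priority set, credit_amount and confidence null
        return priority is not None and credit is None and confidence is None

    # Any other decision value violates the schema
    return False


good = {"decision": "suggest_credit", "credit_amount": 0.0, "confidence": "low", "priority": None}
bad = {"decision": "escalate", "credit_amount": 5.0, "confidence": None, "priority": "urgent"}
print(output_contract_ok(good), output_contract_ok(bad))
```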

## 5. Run Baseline Evaluation (Single Model)

We will start by running an initial evaluation with one model to validate our eval setup and generate traces. Note that we run the evaluation in two phases:
1. Run the agent to generate traces
2. Retrieve the traces and evaluate them with the scorers

We can then use the traces and initial evaluation results to align the `rationale_sufficiency` judge with human feedback.

There are different approaches you can take to running evaluations, including passing a prediction function to `mlflow.genai.evaluate`. Working with a pre-generated trace dataset is simple and flexible and allows you to iterate on your judges without re-running the agent.

We will tag the traces with the baseline model name so we can easily retrieve them later.

In [None]:
# Use one model for initial evaluation and judge alignment
initial_model_name = "llama-3-3-70b"
initial_model_endpoint = "databricks-meta-llama-3-3-70b-instruct"
baseline_tag = f"baseline_{initial_model_name}"

In [None]:
import mlflow

# Enable autologging for detailed trace capture
mlflow.dspy.autolog()

print(f"Running {initial_model_name} on test cases to generate traces...\n")

# Create agent instance
agent = ComplaintsAgentCore(model_endpoint=initial_model_endpoint, catalog=CATALOG)

# Step 1: Run agent to generate traces (tag them for later retrieval)
with mlflow.start_run(run_name=baseline_tag) as run:
    experiment_id = run.info.experiment_id
    run_id = run.info.run_id
    
    # Invoke agent on each complaint to generate traces
    for row in eval_data:
        complaint = row['inputs']['complaint']
        result = agent.invoke(complaint)
        
        # Tag the trace for later retrieval
        trace_id = mlflow.get_last_active_trace_id()
        mlflow.set_trace_tag(trace_id, "eval_group", baseline_tag)
        mlflow.set_trace_tag(trace_id, "model", initial_model_name)

print(f"✅ Generated {len(eval_data)} traces. Run ID: {run_id}")

# Step 2: Retrieve traces for evaluation
baseline_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=15
)


# Step 3: Evaluate the traces
print(f"\nEvaluating traces with scorers...")
baseline_result = mlflow.genai.evaluate(
    data=baseline_traces,
    scorers=[evidence_groundedness_judge, rationale_sufficiency_judge]
)

In our run, the baseline evaluation:
- Generated 15 traces tagged with `baseline_llama-3-3-70b`
- Ran both judges (evidence_groundedness and rationale_sufficiency) on all traces
- Completed successfully with results viewable in the MLflow UI Evaluation tab

You can now review these traces and add human feedback to improve judge accuracy.

## 6. Provide Human Assessments for Judge Alignment

Now it's time to add human feedback to improve the judge's accuracy. To do so, navigate to the Evaluation tab in the MLflow UI and find the traces under the run generated above. You can add your feedback by clicking into each trace, then clicking the "+" next to the `rationale_sufficiency` scorer. Fill in your feedback for each trace (even if you agree with the initial assessment) along with your rationale, if needed.


## 7. Judge Alignment

After collecting human feedback, we can align the judge to better match human judgment. The SIMBA optimizer analyzes disagreements between the judge and human assessments, then generates improved instructions.

In [None]:
baseline_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"tags.eval_group = '{baseline_tag}'",
    max_results=15,
    return_type="list"
)


# Get traces with human feedback for alignment
traces_for_alignment = baseline_traces

# Align the rationale_sufficiency judge with human feedback
# Note: We align rationale_sufficiency (not evidence_groundedness) because alignment
# is not yet supported for trace-aware judges using {{ trace }}
# (source: https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/alignment/#quick-start-align-your-first-judge)


aligned_judge = rationale_sufficiency_judge.align(traces_for_alignment)

During alignment, the SIMBA optimizer analyzed disagreements between the initial judge and human assessments, then generated improved instructions. The aligned judge now includes specific guidance learned from human feedback:

**For missing item complaints:**
- Focus solely on the specifics of that complaint
- Clearly state what item is missing
- Provide the item's price from the order
- Explain how the credit amount is calculated based on that price
- Avoid discussing irrelevant factors like delivery time unless they directly impact the decision

**For all complaints:**
- Ensure all statements are consistent and logically support the decision
- Clarify why a credit is appropriate in the context of the complaint
- Connect evidence explicitly to decisions

The alignment process took approximately 2-3 minutes. The resulting aligned judge is better calibrated to human judgment and more focused on relevant complaint details.
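
One simple way to check whether alignment helped is to compare how often the judge agrees with the human labels before and after. The sketch below computes that agreement rate over paired verdict lists. The function name and all verdicts are made-up placeholders for illustration, not results from this notebook; in practice you would read the judge and human assessments from the traces.

```python
def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of traces where the judge and the human gave the same verdict."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("Verdict lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


# Placeholder verdicts for five traces (illustrative only)
human = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]
judge_before = ["PASS", "PASS", "PASS", "FAIL", "FAIL"]
judge_after = ["PASS", "FAIL", "PASS", "FAIL", "FAIL"]

print(f"Before alignment: {agreement_rate(judge_before, human):.0%}")
print(f"After alignment:  {agreement_rate(judge_after, human):.0%}")
```

With only 15 labeled traces, a change of one or two verdicts moves this number noticeably, so treat it as a rough signal rather than a precise measurement.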

## 8. Model Comparison with Aligned Judge

Now that the judge is aligned with human feedback, compare multiple models to find the best one for your use case. Check out the list of models in the Databricks Model Serving tab to see the available models.

Suppose we are interested in comparing the following models:
- **databricks-claude-sonnet-4-5**
- **databricks-gpt-5-mini**

In [None]:
# Define models to compare
models_to_compare = {
    "Claude Sonnet 4.5": "databricks-claude-sonnet-4-5",
    "GPT-5 mini": "databricks-gpt-5-mini"
}


for model_name, endpoint in models_to_compare.items():
    print(f"\nRunning {model_name} to generate traces...")
    
    agent = ComplaintsAgentCore(model_endpoint=endpoint, catalog=CATALOG)
    
    # Step 1: Generate traces with tags
    comparison_tag = f"comparison_{model_name}"
    with mlflow.start_run(run_name=comparison_tag) as run:
        # Invoke agent on each complaint to generate traces
        for row in eval_data:
            complaint = row['inputs']['complaint']
            result = agent.invoke(complaint)
            
            # Tag the trace for later retrieval
            trace_id = mlflow.get_last_active_trace_id()
            mlflow.set_trace_tag(trace_id, "eval_group", comparison_tag)
            mlflow.set_trace_tag(trace_id, "model", model_name)
    
    print(f"  Generated {len(eval_data)} traces for {model_name}")
    
    # Step 2: Retrieve traces
    traces = mlflow.search_traces(
        experiment_ids=[experiment_id],
        filter_string=f"tags.eval_group = '{comparison_tag}'",
        max_results=15
    )
    
    # Step 3: Evaluate traces
    print(f"  Evaluating {len(traces)} traces with aligned judge...")
    result = mlflow.genai.evaluate(
        data=traces,
        scorers=[aligned_judge, evidence_groundedness_judge]
    )

For each model in the comparison:
- Generated 15 new traces (one per complaint scenario)
- Tagged traces for easy retrieval and organization
- Evaluated with both the aligned rationale_sufficiency judge and the trace-aware evidence_groundedness judge
- Results are available in MLflow UI for detailed analysis

The comparison enables side-by-side evaluation of model performance on evidence groundedness and rationale quality; because every run is traced, you can also compare latency across models.
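
Outside the MLflow UI, the per-model comparison boils down to a pass rate per judge. The sketch below shows that aggregation over PASS/FAIL verdicts. The model names `model_a` and `model_b`, the helper name, and every verdict are hypothetical placeholders, not results for the endpoints compared above.

```python
def pass_rates(verdicts_by_judge: dict[str, list[str]]) -> dict[str, float]:
    """Share of PASS verdicts for each judge."""
    return {
        judge: sum(v == "PASS" for v in verdicts) / len(verdicts)
        for judge, verdicts in verdicts_by_judge.items()
    }


# Placeholder verdicts (illustrative only)
verdicts = {
    "model_a": {
        "evidence_groundedness": ["PASS", "PASS", "FAIL", "PASS"],
        "rationale_sufficiency": ["PASS", "FAIL", "FAIL", "PASS"],
    },
    "model_b": {
        "evidence_groundedness": ["PASS", "FAIL", "FAIL", "PASS"],
        "rationale_sufficiency": ["PASS", "PASS", "PASS", "PASS"],
    },
}

summary = {model: pass_rates(by_judge) for model, by_judge in verdicts.items()}

for model, rates in summary.items():
    formatted = ", ".join(f"{judge}={rate:.0%}" for judge, rate in rates.items())
    print(f"{model}: {formatted}")
```

Keeping the two judges separate, rather than averaging them, makes it visible when one model is better grounded while another writes clearer rationales.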

## Summary

- Built a parameterized complaint triage agent using DSPy's ReAct module with UC tools.
- Generated a concise eval set and ran a baseline to create traces.
- Evaluated with two judges: evidence_groundedness (trace-aware, unaligned) and rationale_sufficiency (template judge, aligned via SIMBA with human feedback).
- Compared multiple serving endpoints using the aligned judge to weigh response quality against latency.

**Key takeaways:**
1. Databricks lets you use the models you prefer and provides first-class tools to build agents with UC functions, capture traces, and evaluate them rigorously with MLflow.
2. DSPy's ReAct module provides automatic tool orchestration and structured outputs for building production-ready agents.
3. Trace-aware judging verifies decisions against actual tool outputs; template judges can be aligned with human feedback for better agreement.
4. Choose the endpoint that best balances quality, latency, and cost.