# Model Comparison Eval Harness: Tau2-Bench Airline

This notebook compares different models on airline customer service scenarios using tau2-bench natural language evaluation.

**Models being compared:**
- Claude 4 Sonnet (AnthropicPolicy)
- GPT-4.1 (OpenAIPolicy; configured but commented out of the run below)
- Kimi K2 (FireworksPolicy)

**Evaluation Framework:** tau2-bench with natural language assertions


In [None]:
# Install required packages
!pip install eval-protocol anthropic fireworks-ai tau2-bench pytest-asyncio
!pip install firectl  # For sharing results


In [None]:
import asyncio
import json
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any, Tuple
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from litellm import cost_per_token
from loguru import logger

# Import eval protocol and tau2-bench
import eval_protocol as rk
from eval_protocol import reward_function, EvaluateResult
from eval_protocol.models import LLMUsageStats

from examples.tau2_mcp.tests.test_tau2_e2e import MCPServerManager

from vendor.tau2.evaluator.evaluator_nl_assertions import NLAssertionsEvaluator
from vendor.tau2.data_model.message import (
    SystemMessage,
    AssistantMessage,
    UserMessage,
    ToolMessage,
)

print("✅ All imports successful!")

logging.basicConfig(level=logging.WARNING, force=True)

logger.remove()  # Remove default handler
logger.add(lambda _: None, level="ERROR")

✅ All imports successful!



## 1. Set Up Evaluation Benchmark

First, let's load the evaluation dataset we want to benchmark our models on.

In [None]:
with open("datasets/airline.json", "r") as f:
    tau2_eval_dataset = json.load(f)
    # TODO: something here is broken

print(f"✅ Loaded airline dataset with {len(tau2_eval_dataset)} scenarios")


✅ Loaded airline dataset with 50 scenarios
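
Later cells read `dataset_item["id"]` and `dataset_item["assertions"]` from every scenario, so a quick structural check before spending API budget can be worthwhile. The sketch below is one way to do that; `find_invalid_scenarios` and the sample rows are hypothetical and not part of tau2-bench or eval_protocol. It also flags empty assertion lists, because `all([])` is `True` in Python and such a scenario would pass vacuously.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "assertions")  # the keys later cells in this notebook read


def find_invalid_scenarios(dataset: List[Dict[str, Any]]) -> List[str]:
    """Describe every scenario that lacks a required key or has no assertions."""
    problems = []
    for index, scenario in enumerate(dataset):
        missing = [key for key in REQUIRED_KEYS if key not in scenario]
        if missing:
            problems.append(f"scenario #{index}: missing {missing}")
        elif not scenario["assertions"]:
            problems.append(f"scenario {scenario['id']}: no assertions")
    return problems


# Hypothetical rows for illustration only; real rows come from datasets/airline.json.
sample_dataset = [
    {"id": "example_task_a", "assertions": ["Agent refuses the cancellation."]},
    {"id": "example_task_b", "assertions": []},
    {"assertions": ["Agent offers a refund."]},
]

problems = find_invalid_scenarios(sample_dataset)
print(problems)
```

Running `find_invalid_scenarios(tau2_eval_dataset)` on the loaded data should return an empty list if every scenario has the expected shape.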


## 2. Evaluation Function: Tau2-Bench

Now, let's implement the actual evaluation function (also called a reward function), based on Tau2-Bench. If you haven't heard of Tau2-Bench, it's a customer support benchmark from Sierra AI. Check out more information here: https://github.com/sierra-research/tau2-bench

In [None]:
@reward_function
async def airline_eval(messages: List[Any], nl_assertions: List[str] = None, **kwargs) -> EvaluateResult:
    """
    Evaluate airline conversation using tau2-bench NLAssertionsEvaluator.

    Args:
        messages: Conversation between agent and customer
        nl_assertions: List of natural language assertions to evaluate
        **kwargs: Additional parameters

    Returns:
        EvaluateResult with binary pass/fail and detailed assertion breakdown
    """
    # Default assertions if none provided
    if nl_assertions is None:
        nl_assertions = ["The agent handled the customer request appropriately according to airline policy"]

    # Convert Message objects directly to tau2-bench message objects
    trajectory_objects = []
    for msg in messages:
        role = msg.role
        content = msg.content

        if role == "system":
            trajectory_objects.append(SystemMessage(role=role, content=content))
        elif role == "assistant":
            trajectory_objects.append(AssistantMessage(role=role, content=content))
        elif role == "user":
            trajectory_objects.append(UserMessage(role=role, content=content))
        elif role == "tool":
            tool_id = msg.tool_call_id
            trajectory_objects.append(ToolMessage(id=tool_id, role=role, content=content))

    # Run the synchronous tau2-bench evaluation in a thread pool to avoid blocking
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
    nl_assertions_checks = await loop.run_in_executor(
        None, 
        NLAssertionsEvaluator.evaluate_nl_assertions,
        trajectory_objects, 
        nl_assertions
    )

    all_expectations_met = all(result.met for result in nl_assertions_checks)
    reward = 1.0 if all_expectations_met else 0.0

    # Build reason string
    if all_expectations_met:
        reason = f"All {len(nl_assertions)} natural language assertions passed"
    else:
        failed_assertions = [nl_assertions[i] for i, result in enumerate(nl_assertions_checks) if not result.met]
        reason = f"Failed assertions: {failed_assertions}"

    return EvaluateResult(
        score=reward,
        reason=reason,
        metrics={},
    )
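
The scoring rule in `airline_eval` is all-or-nothing: a single unmet assertion turns the whole scenario into a 0.0. The sketch below isolates that rule so it can be tried without API calls. `AssertionCheck` is a hypothetical stand-in for the objects returned by `NLAssertionsEvaluator`; the only thing assumed about them is a boolean `met` attribute, which is what the function above reads.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AssertionCheck:
    """Hypothetical stand-in for one tau2-bench check result."""
    nl_assertion: str
    met: bool


def score_checks(checks: List[AssertionCheck]) -> Tuple[float, str]:
    """Return (score, reason) using the same all-or-nothing rule as airline_eval."""
    failed = [check.nl_assertion for check in checks if not check.met]
    if not failed:
        return 1.0, f"All {len(checks)} natural language assertions passed"
    return 0.0, f"Failed assertions: {failed}"


checks = [
    AssertionCheck("Agent verifies the booking before making changes", met=True),
    AssertionCheck("Agent does not issue a refund", met=False),
]
score, reason = score_checks(checks)
print(score, reason)
```

A partial-credit variant (fraction of assertions met) would be a different metric from the one reported in this notebook.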

## 3. Set Up Model Policies

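Configure the models we want to compare: Claude 4 Sonnet, GPT-4.1, and Kimi K2. GPT-4.1 is configured but commented out of `models_to_test`, so only Claude 4 Sonnet and Kimi K2 run below.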


In [None]:
# Check for required API keys (set these as environment variables)
# Example: export ANTHROPIC_API_KEY=your-key-here

required_keys = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "FIREWORKS_API_KEY"]
missing_keys = [key for key in required_keys if not os.getenv(key)]

if missing_keys:
    print(f"⚠️  Missing API keys: {missing_keys}")
    print("Please set these environment variables:")
    for key in missing_keys:
        print(f"  export {key}='your-key-here'")
else:
    print("✅ All required API keys are set")


✅ All required API keys are set


In [None]:
# Create model policies
openai_policy = rk.OpenAIPolicy(
    model_id="gpt-4.1",
    temperature=0.1,
    max_tokens=4096,
)

anthropic_policy = rk.AnthropicPolicy(
    model_id="claude-sonnet-4-20250514",
    temperature=0.1,
    max_tokens=4096,
)

kimi_policy = rk.FireworksPolicy(
    model_id="accounts/fireworks/models/kimi-k2-instruct",
    temperature=0.1,
    max_tokens=4096,
)

models_to_test = {
    # "gpt-4.1": {
    #     "policy": openai_policy,
    #     "name": "GPT-4.1",
    #     "provider": "OpenAI"
    # },
    "claude-sonnet-4": {
        "policy": anthropic_policy,
        "name": "Claude 4 Sonnet",
        "provider": "Anthropic"
    },
    "kimi-k2": {
        "policy": kimi_policy,
        "name": "Kimi K2", 
        "provider": "Fireworks"
    }
}

print("✅ Model policies created:")
for model_id, model_info in models_to_test.items():
    print(f"  - {model_info['name']} ({model_info['provider']})")


✅ Model policies created:
  - Claude 4 Sonnet (Anthropic)
  - Kimi K2 (Fireworks)


## 4. Run Evaluations

Now we'll run the airline evaluation on both models and compare their performance.

First, we need a way to manage our MCP server, which the MCP tools call during rollouts. The `MCPServerManager` imported earlier handles this inside the evaluation function below.

Before we get into the main logic, we also want to track quality and cost across the models, so here is some setup for cost tracking. For Kimi K2 we enter the pricing from Fireworks' website by hand, since litellm doesn't include it.

In [None]:
MANUAL_PRICING = {
    "accounts/fireworks/models/kimi-k2-instruct": {
        "input_cost_per_1m": 0.60,  # USD per 1M input tokens, entered by hand from Fireworks pricing
        "output_cost_per_1m": 2.50,  # USD per 1M output tokens, entered by hand from Fireworks pricing
    }
}

def calculate_evaluation_cost(model_id: str, llm_usage_summary: LLMUsageStats) -> Dict[str, Any]:
    input_tokens = llm_usage_summary.prompt_tokens or 0
    output_tokens = llm_usage_summary.completion_tokens or 0
    total_tokens = llm_usage_summary.total_tokens or (input_tokens + output_tokens)
 
    if model_id in MANUAL_PRICING:
        pricing = MANUAL_PRICING[model_id]
        
        input_cost = input_tokens * pricing["input_cost_per_1m"] / 1000000
        output_cost = output_tokens * pricing["output_cost_per_1m"] / 1000000
        total_cost = input_cost + output_cost
        
        cost_source = "manual_pricing"

    else:
        input_cost, output_cost = cost_per_token(
            model=model_id,
            prompt_tokens=input_tokens,
            completion_tokens=output_tokens
        )
        total_cost = input_cost + output_cost
        
        cost_source = "litellm"
        
    return {
        "total_cost": total_cost,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_tokens": total_tokens,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model_id": model_id,
        "cost_source": cost_source,
    }
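
To make the manual-pricing branch concrete, here is the same arithmetic as a standalone sketch. The helper name `cost_from_per_million` is hypothetical; the prices are the Kimi K2 values from `MANUAL_PRICING`, and the token counts are borrowed from one of the usage lines logged further down purely as an illustration.

```python
def cost_from_per_million(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1m: float,
    output_cost_per_1m: float,
) -> dict:
    """Convert token counts to USD using per-million-token prices."""
    input_cost = input_tokens * input_cost_per_1m / 1_000_000
    output_cost = output_tokens * output_cost_per_1m / 1_000_000
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


# 10,394 prompt tokens and 348 completion tokens at $0.60 / $2.50 per 1M tokens.
cost = cost_from_per_million(10_394, 348, 0.60, 2.50)
print(f"${cost['total_cost']:.4f}")
```

The input side contributes about $0.0062 and the output side about $0.0009, so prompt tokens dominate the bill for this kind of multi-turn, tool-heavy conversation.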

Below is our core logic for running the Tau2-Bench eval for a single model. We use the eval protocol framework: `rk.make()` creates the environment sessions and `rk.rollout()` runs the conversations.

In [None]:
async def run_model_evaluation(model_id: str, model_info: Dict, dataset: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """
    Run evaluation for a single model on the airline dataset.
    
    Returns:
        Tuple of (evaluation_results, evaluation_records)
    """
    print(f"\n🧪 Starting evaluation for {model_info['name']}...")

    # Use context manager for automatic cleanup even on exceptions
    with MCPServerManager("../examples/tau2_mcp/server.py", port=8000, domain="airline") as server:
        policy = model_info["policy"]
        
        envs = rk.make(
            "http://localhost:8000/mcp/",
            dataset=dataset, 
            model_id=policy.model_id,
        )
        
        print(f"📊 Created {len(envs.sessions)} environment sessions")
        
        start_time = time.time()
        evaluation_rows = await rk.rollout(envs, policy=policy, steps=30, max_concurrent_rollouts=8)
        duration = time.time() - start_time
        
        print(f"✅ Completed {len(evaluation_rows)} evaluation rows in {duration:.2f}s")
        
        # Create a helper function to process each evaluation row
        async def process_evaluation_row(i: int, eval_row, dataset_item):
            nl_assertions = dataset_item["assertions"]
            
            # Run tau2-bench evaluation (now async and parallelizable!)
            eval_result = await airline_eval(eval_row.messages, nl_assertions)
            
            # Calculate cost using existing LLMUsageStats and LiteLLM/manual pricing
            llm_usage = eval_row.llm_usage_summary
            print(f"  📊 LLM Usage for {dataset_item['id']}: {llm_usage}")  # Debug: show actual usage
            cost_info = calculate_evaluation_cost(policy.model_id, llm_usage)

            num_assertions = len(nl_assertions)

            # Create evaluation result
            result = {
                "scenario_id": dataset_item["id"],
                "model_id": policy.model_id,
                "score": eval_result.score,
                "num_assertions": num_assertions,
                "cost_info": cost_info,  # Include cost information in results
            }
            
            # Create comprehensive evaluation record
            evaluation_record = {
                "model_id": policy.model_id,
                "scenario_id": dataset_item["id"],
                "conversation_history": eval_row.messages,
                "evaluation": {
                    "score": eval_result.score,
                    "num_assertions": num_assertions,
                    "reason": eval_result.reason,
                    "assertions": [
                        {
                            "assertion": assertion,
                            "passed": eval_result.score > 0  # All pass or all fail for this simple implementation
                        }
                        for assertion in nl_assertions
                    ]
                },
                "cost_info": cost_info,  # Add cost information to evaluation record
                "timestamp": datetime.now().isoformat(),
            }
            
            print(f"  📋 {result['scenario_id']}: {result['score']:.1f} (total {result['num_assertions']} assertions)")
            return result, evaluation_record
            
        # Process all evaluation rows in parallel using asyncio.gather
        print(f"🚀 Processing {len(evaluation_rows)} evaluation row evaluations in parallel...")
        eval_start_time = time.time()
        
        tasks = [
            process_evaluation_row(i, eval_row, dataset[i]) 
            for i, eval_row in enumerate(evaluation_rows)
        ]
        
        # Run all evaluations concurrently
        results_and_records = await asyncio.gather(*tasks)
        
        eval_duration = time.time() - eval_start_time
        print(f"✅ Completed parallel evaluations in {eval_duration:.2f}s")
        
        # Separate results and evaluation records
        results = []
        evaluation_records = []
        for result, evaluation_record in results_and_records:
            results.append(result)
            evaluation_records.append(evaluation_record)
        
        await envs.close()
        # Server cleanup happens automatically via context manager
        
        return results, evaluation_records
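
The function above relies on `asyncio.gather` returning results in the order the awaitables were passed in, regardless of which one finishes first. The sketch below demonstrates that property and shows one possible way to cap concurrency with a semaphore, which may help if the evaluator's LLM calls hit rate limits. `fake_evaluate` and `evaluate_all` are hypothetical names used only for this illustration.

```python
import asyncio
from typing import List


async def fake_evaluate(index: int) -> int:
    """Hypothetical stand-in for an evaluator call; later items finish sooner."""
    await asyncio.sleep(0.01 * (3 - index))
    return index


async def evaluate_all(count: int, max_concurrent: int = 2) -> List[int]:
    """Run `count` evaluations with at most `max_concurrent` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(index: int) -> int:
        async with semaphore:
            return await fake_evaluate(index)

    # gather() preserves input order, so results[i] belongs to item i.
    return await asyncio.gather(*(bounded(i) for i in range(count)))


# In a script use asyncio.run(); inside a notebook cell, `await evaluate_all(3)` instead.
results = asyncio.run(evaluate_all(3))
print(results)
```

The log lines printed by each task can still appear out of order, since each task prints as soon as it completes.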

In [None]:
all_results = []
all_evaluation_records = []

for model_id, model_info in models_to_test.items():
    model_results, evaluation_records = await run_model_evaluation(model_id, model_info, tau2_eval_dataset)
    all_results.extend(model_results)
    all_evaluation_records.extend(evaluation_records)

print(f"\n✅ Completed evaluations for {len(models_to_test)} models")
print(f"📊 Total results: {len(all_results)}")
print(f"📊 Total evaluation records: {len(all_evaluation_records)}")


🧪 Starting evaluation for Claude 4 Sonnet...
✅ Server started successfully on port 8000
📊 Created 50 environment sessions




🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
✅ Completed 50 trajectories in 438.92s
🚀 Processing 50 trajectory evaluations in parallel...
  📊 LLM Usage for airline_task_6: {'prompt_tokens': 11809, 'completion_tokens': 439, 'total_tokens': 12248}
  📋 airline_task_6: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_1: {'prompt_tokens': 48521, 'completion_tokens': 465, 'total_tokens': 48986}
  📋 airline_task_1: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_0: {'prompt_tokens': 18067, 'completion_tokens': 255, 'total_tokens': 18322}
  📋 airline_task_0: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_10: {'prompt_tokens': 70113, 'completion_tokens': 1132, 'total_tokens': 71245}
  📋 airline_task_10: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_13: {'prompt_tokens': 73350, 'completion_tokens': 1136, 'total_tokens': 74486}
  📋 airline_task_13: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_5: {'prompt_tokens': 31643, 'completion_tokens': 



🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
✅ Completed 50 trajectories in 373.16s
🚀 Processing 50 trajectory evaluations in parallel...
  📊 LLM Usage for airline_task_0: {'prompt_tokens': 10394, 'completion_tokens': 348, 'total_tokens': 10742}
  📋 airline_task_0: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_1: {'prompt_tokens': 42103, 'completion_tokens': 192, 'total_tokens': 42295}
  📋 airline_task_1: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_6: {'prompt_tokens': 4932, 'completion_tokens': 38, 'total_tokens': 4970}
  📋 airline_task_6: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_13: {'prompt_tokens': 38788, 'completion_tokens': 663, 'total_tokens': 39451}
  📋 airline_task_13: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_10: {'prompt_tokens': 43693, 'completion_tokens': 366, 'total_tokens': 44059}
  📋 airline_task_10: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_5: {'prompt_tokens': 33828, 'completion_tokens': 479, 

## 5. Analyze Results

Let's summarize success rate and cost for the models we ran, Claude 4 Sonnet and Kimi K2.


In [34]:
model_id_to_config = {}
for config_key, model_info in models_to_test.items():
    actual_model_id = model_info["policy"].model_id
    model_id_to_config[actual_model_id] = model_info

print(f"\n📈 Summary Statistics:")
total_cost = 0.0
for actual_model_id, model_info in model_id_to_config.items():
    model_results_subset = [r for r in all_results if r['model_id'] == actual_model_id]
    avg_score = sum(r['score'] for r in model_results_subset) / len(model_results_subset) if model_results_subset else 0
    
    # Calculate total cost for this model
    model_total_cost = sum(r['cost_info']['total_cost'] for r in model_results_subset if 'cost_info' in r)
    total_cost += model_total_cost
    
    # Show cost source info
    cost_sources = [r['cost_info'].get('cost_source', 'unknown') for r in model_results_subset if 'cost_info' in r]
    cost_source_summary = f" (via {cost_sources[0]})" if cost_sources else ""
    
    print(f"   {model_info['name']}: {avg_score:.2%} success rate ({sum(r['score'] for r in model_results_subset)}/{len(model_results_subset)}) - Cost: ${model_total_cost:.2f}{cost_source_summary}")

print(f"\n💰 Total evaluation cost: ${total_cost:.2f}")
print(f"📊 Cost calculation uses actual API usage data from LLMUsageStats")


📈 Summary Statistics:
   Claude 4 Sonnet: 54.00% success rate (27.0/50) - Cost: $8.79 (via litellm)
   Kimi K2: 46.00% success rate (23.0/50) - Cost: $1.14 (via manual_pricing)

💰 Total evaluation cost: $9.93
📊 Cost calculation uses actual API usage data from LLMUsageStats
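
Success rate and total cost can be combined into a cost-per-success figure. From the totals printed above this comes to roughly $0.33 per passed scenario for Claude 4 Sonnet ($8.79 / 27) and roughly $0.05 for Kimi K2 ($1.14 / 23). The sketch below computes the same quantities from rows shaped like `all_results`; `summarize_by_model`, the model names, and the numbers in `sample_results` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_by_model(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Aggregate pass rate and cost per success for each model_id."""
    totals = defaultdict(lambda: {"passed": 0.0, "count": 0, "cost": 0.0})
    for row in results:
        bucket = totals[row["model_id"]]
        bucket["passed"] += row["score"]
        bucket["count"] += 1
        bucket["cost"] += row["cost_info"]["total_cost"]

    summary = {}
    for model_id, bucket in totals.items():
        summary[model_id] = {
            "pass_rate": bucket["passed"] / bucket["count"],
            # None rather than 0 when nothing passed, so it is not read as "free".
            "cost_per_success": bucket["cost"] / bucket["passed"] if bucket["passed"] else None,
        }
    return summary


# Hypothetical rows with the same keys as the entries in all_results.
sample_results = [
    {"model_id": "model-a", "score": 1.0, "cost_info": {"total_cost": 0.25}},
    {"model_id": "model-a", "score": 0.0, "cost_info": {"total_cost": 0.75}},
    {"model_id": "model-b", "score": 0.0, "cost_info": {"total_cost": 0.50}},
]

summary = summarize_by_model(sample_results)
print(summary)
```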


In [None]:
def save_results_jsonl(evaluation_records: List[Dict], output_file: str = "evaluation_outputs/all_evaluations.jsonl"):
    """Save all evaluation records in JSONL format (one JSON object per line)."""
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)  # also create nested directories
    
    with open(output_path, 'w') as f:
        for record in evaluation_records:
            json.dump(record, f, default=str)
            f.write('\n')
    
    print(f"📄 Saved JSONL file: {output_path}")
    return output_path

save_results_jsonl(all_evaluation_records)

📄 Saved JSONL file: trajectory_outputs/all_trajectories.jsonl


PosixPath('trajectory_outputs/all_trajectories.jsonl')
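
To analyze a saved run later, the JSONL file can be read back one line at a time. The sketch below is a minimal round trip using a temporary file; `load_jsonl` and the demo records are hypothetical. Because the records are written with `default=str`, any value that is not natively JSON-serializable comes back as a plain string rather than as its original object.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical records; real ones come from all_evaluation_records.
demo_records = [
    {"model_id": "model-a", "scenario_id": "example_task_a", "evaluation": {"score": 1.0}},
    {"model_id": "model-a", "scenario_id": "example_task_b", "evaluation": {"score": 0.0}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in demo_records:
            f.write(json.dumps(record, default=str) + "\n")
    loaded = load_jsonl(demo_path)

print(len(loaded), loaded[0]["scenario_id"])
```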

In [None]:
def save_evaluation_files(evaluation_records: List[Dict], output_dir: str = "evaluation_outputs"):
    """Save evaluation records to individual files and create summary."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # also create nested directories
    
    # Save individual evaluation files
    for record in evaluation_records:
        # Sanitize model_id for filename (replace slashes with underscores)
        safe_model_id = record['model_id'].replace('/', '_').replace('\\', '_')
        filename = f"{safe_model_id}_{record['scenario_id']}_evaluation.json"
        filepath = output_path / filename
        
        with open(filepath, 'w') as f:
            json.dump(record, f, indent=2, default=str)
    
    # Create summary file
    summary = {
        "evaluation_summary": {
            "total_evaluations": len(evaluation_records),
            "models_evaluated": list(set(r['model_id'] for r in evaluation_records)),
            "scenarios_evaluated": list(set(r['scenario_id'] for r in evaluation_records)),
            "timestamp": datetime.now().isoformat(),
        },
        "model_performance": {},
        "scenario_difficulty": {}
    }
    
    # Calculate model performance
    for model_id in summary["evaluation_summary"]["models_evaluated"]:
        model_records = [r for r in evaluation_records if r['model_id'] == model_id]
        total_score = sum(r['evaluation']['score'] for r in model_records)
        avg_score = total_score / len(model_records) if model_records else 0
        
        # Calculate cost metrics
        total_cost = sum(r.get('cost_info', {}).get('total_cost', 0) for r in model_records)
        total_tokens = sum(r.get('cost_info', {}).get('total_tokens', 0) for r in model_records)
        avg_cost_per_scenario = total_cost / len(model_records) if model_records else 0
        
        summary["model_performance"][model_id] = {
            "total_scenarios": len(model_records),
            "total_score": total_score,
            "average_score": avg_score,
            "pass_rate": avg_score,  # Since scores are 0 or 1
            "total_cost": total_cost,
            "average_cost_per_scenario": avg_cost_per_scenario,
            "total_tokens": total_tokens,
            "cost_per_success": total_cost / total_score if total_score > 0 else 0
        }
    
    # Calculate scenario difficulty
    for scenario_id in summary["evaluation_summary"]["scenarios_evaluated"]:
        scenario_records = [r for r in evaluation_records if r['scenario_id'] == scenario_id]
        total_score = sum(r['evaluation']['score'] for r in scenario_records)
        avg_score = total_score / len(scenario_records) if scenario_records else 0
        
        summary["scenario_difficulty"][scenario_id] = {
            "models_tested": len(scenario_records),
            "total_score": total_score,
            "average_score": avg_score,
            "difficulty": "easy" if avg_score > 0.8 else "medium" if avg_score > 0.5 else "hard"
        }
    
    # Save summary
    summary_path = output_path / "evaluation_summary.json"
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    
    print(f"\n📁 Saved evaluation files to: {output_path}")
    print(f"   - {len(evaluation_records)} individual evaluation files")
    print(f"   - 1 evaluation summary file")
    
    return output_path

save_evaluation_files(all_evaluation_records)


📁 Saved trajectory files to: trajectory_outputs
   - 100 individual trajectory files
   - 1 evaluation summary file


PosixPath('trajectory_outputs')

## 6. Share Results with Firectl

Finally, let's create a dataset with our evaluation results to share using `firectl create dataset`.


In [None]:
# TODO

## Summary

This notebook provides a complete eval harness for comparing models using tau2-bench airline evaluation with proper dataset structure:

1. **Dataset Structure**: Following tau2-bench pattern with separate JSON datasets and markdown system prompts
2. **Models**: Configured Claude 4 Sonnet (AnthropicPolicy) and Kimi K2 (FireworksPolicy)
3. **Evaluation**: Used tau2-bench NLAssertionsEvaluator for objective scoring with EvaluationRow format
4. **Analysis**: Compared performance across multiple dimensions
5. **Sharing**: Saved results as JSONL and per-scenario JSON files, ready to share via `firectl create dataset` (the upload cell is still a TODO)

### Key Features:
- **Clean Dataset Structure**: Separate JSON data and markdown prompts like the tau2 examples
- **Natural Language Evaluation**: Uses human-readable assertions instead of code-based metrics
- **Multi-Model Comparison**: Easy to add more models for comparison
- **Comprehensive Analysis**: Performance, accuracy, and efficiency metrics with cost tracking
- **EvaluationRow Support**: Updated to work with the new EvaluationRow format from eval_protocol
- **Reproducible**: Results can be shared and reproduced via firectl

### Next Steps:
1. Set your API keys as environment variables:
   ```bash
   export ANTHROPIC_API_KEY="your-anthropic-key-here"
   export OPENAI_API_KEY="your-openai-key-here"
   export FIREWORKS_API_KEY="your-fireworks-key-here"
   ```
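2. If you want to run the tau2 MCP server by hand: `cd examples/tau2_mcp && python server.py --port 8000`. The evaluation cell starts its own server on port 8000 through `MCPServerManager`, so this is only needed when working outside that cell.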
3. Run the evaluation cells
4. Share results with the community using `firectl create dataset` once the sharing cell is filled in

### Expected Results:
Because tau2-bench scores each conversation against natural language assertions, we expect models to differ in pass rate according to how well they follow airline policy and customer service protocols.

This structure uses the updated EvaluationRow format and provides comprehensive cost analysis across different model providers.