# Model Comparison Eval Harness: Tau2-Bench Airline

This notebook compares different models on airline customer service scenarios using tau2-bench natural language evaluation.

**Models being compared:**
- Claude 4 Opus (AnthropicPolicy)
- GPT-4.1 (OpenAIPolicy)
- Kimi K2 (FireworksPolicy)

**Evaluation Framework:** tau2-bench with natural language assertions


In [None]:
# Install required packages
!pip install eval-protocol anthropic fireworks-ai tau2-bench pytest-asyncio
!pip install firectl  # For sharing results


In [1]:
import asyncio
import json
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any, Tuple
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from litellm import cost_per_token
from loguru import logger

# Import eval protocol and tau2-bench
import eval_protocol as rk
from eval_protocol import reward_function, EvaluateResult

from tau2.evaluator.evaluator_nl_assertions import NLAssertionsEvaluator
from tau2.data_model.message import (
    SystemMessage,
    AssistantMessage,
    UserMessage,
    ToolMessage,
)

print("✅ All imports successful!")

logging.basicConfig(level=logging.WARNING)

logger.remove()  # Remove default handler
logger.add(lambda _: None, level="ERROR")

2025-07-26 01:29:46.187 | INFO     | tau2.utils.utils:<module>:27 - Using data directory from source: /Users/derekxu/Documents/code/python-sdk/.venv/lib/python3.12/data
2025-07-26 01:29:46.198 | INFO     | tau2.utils.llm_utils:<module>:65 - LiteLLM: Cache is disabled


✅ All imports successful!



## 1. Set Up Evaluation Benchmark

First, let's load the evaluation dataset we want to benchmark our models on.

In [2]:
with open("datasets/airline.json", "r") as f:
    tau2_eval_dataset = json.load(f)

print(f"✅ Loaded airline dataset with {len(tau2_eval_dataset)} scenarios")


✅ Loaded airline dataset with 50 scenarios
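
The evaluation code later in this notebook reads an `id` and a list of `assertions` from each scenario. A quick sanity check on that shape can catch a malformed file before any model calls are made. The sketch below is a minimal version of such a check; the `summarize_scenarios` helper and the sample records are illustrative assumptions, not part of tau2-bench or the real `airline.json`.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_scenarios(dataset: List[Dict[str, Any]]) -> Counter:
    """Count scenarios by number of assertions, failing fast on malformed entries."""
    counts: Counter = Counter()
    for index, scenario in enumerate(dataset):
        # The harness looks up these two keys on every scenario.
        for key in ("id", "assertions"):
            if key not in scenario:
                raise KeyError(f"scenario at index {index} is missing '{key}'")
        counts[len(scenario["assertions"])] += 1
    return counts


# Hypothetical records shaped like the fields the harness reads.
sample_dataset = [
    {"id": "example_task_a", "assertions": ["Agent refuses to cancel the flight."]},
    {"id": "example_task_b", "assertions": ["Agent issues a refund.", "Agent confirms the amount."]},
]

scenario_summary = summarize_scenarios(sample_dataset)
print(dict(scenario_summary))
```

Calling `summarize_scenarios(tau2_eval_dataset)` on the loaded file would give the distribution of assertion counts, which is useful context for the all-or-nothing scoring used below.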


## 2. Evaluation Function: Tau2-Bench

Now, let's implement the evaluation function (also called a reward function) on top of Tau2-Bench. If you haven't heard of Tau2-Bench, it's a customer support benchmark from Sierra AI. For more information, see https://github.com/sierra-research/tau2-bench

In [3]:
@reward_function
async def airline_eval(messages: List[Dict[str, Any]], nl_assertions: List[str] = None, **kwargs) -> EvaluateResult:
    """
    Evaluate airline conversation using tau2-bench NLAssertionsEvaluator.

    Args:
        messages: Conversation between agent and customer
        nl_assertions: List of natural language assertions to evaluate
        **kwargs: Additional parameters

    Returns:
        EvaluateResult with a binary pass/fail score and a reason listing any failed assertions
    """
    # Default assertions if none provided
    if nl_assertions is None:
        nl_assertions = ["The agent handled the customer request appropriately according to airline policy"]

    # Convert messages to tau2-bench message objects based on role
    trajectory_objects = []
    for msg in messages:
        role = msg["role"]
        content = msg["content"]

        if role == "system":
            trajectory_objects.append(SystemMessage(role=role, content=content))
        elif role == "assistant":
            trajectory_objects.append(AssistantMessage(role=role, content=content))
        elif role == "user":
            trajectory_objects.append(UserMessage(role=role, content=content))
        elif role == "tool":
            tool_id = msg.get("tool_call_id")
            trajectory_objects.append(ToolMessage(id=tool_id, role=role, content=content))

    # Run the synchronous tau2-bench evaluation in a thread pool to avoid blocking
    loop = asyncio.get_running_loop()  # Preferred inside a coroutine over get_event_loop()
    nl_assertions_checks = await loop.run_in_executor(
        None, 
        NLAssertionsEvaluator.evaluate_nl_assertions,
        trajectory_objects, 
        nl_assertions
    )

    all_expectations_met = all(result.met for result in nl_assertions_checks)
    reward = 1.0 if all_expectations_met else 0.0

    # Build reason string
    if all_expectations_met:
        reason = f"All {len(nl_assertions)} natural language assertions passed"
    else:
        failed_assertions = [nl_assertions[i] for i, result in enumerate(nl_assertions_checks) if not result.met]
        reason = f"Failed assertions: {failed_assertions}"

    return EvaluateResult(
        score=reward,
        reason=reason,
        metrics={},
    )
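
The scoring rule above is all-or-nothing: a scenario earns 1.0 only if every assertion is met. The sketch below isolates that rule so it can be run without calling the evaluator. `AssertionCheck` is a hypothetical stand-in for the objects returned by the evaluator (only a `met` attribute is assumed), and `pass_fraction` is an extra diagnostic that the notebook does not compute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssertionCheck:
    """Hypothetical stand-in for one evaluator result; only `met` is used."""
    nl_assertion: str
    met: bool


def binary_reward(checks: List[AssertionCheck]) -> float:
    """Return 1.0 only when every assertion is met, mirroring the rule above."""
    return 1.0 if all(check.met for check in checks) else 0.0


def pass_fraction(checks: List[AssertionCheck]) -> float:
    """Share of assertions met; a softer diagnostic, not used for the score."""
    if not checks:
        return 0.0
    return sum(check.met for check in checks) / len(checks)


checks = [
    AssertionCheck("Agent verifies the booking.", True),
    AssertionCheck("Agent does not offer compensation.", False),
]
print(binary_reward(checks), pass_fraction(checks))
```

One consequence of the binary rule is that scenarios with more assertions are harder to pass, so pass rates are not directly comparable across scenarios with different assertion counts.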

## 3. Set Up Model Policies

Configure the three models we want to compare: Claude 4 Opus, GPT-4.1, and Kimi K2.


In [10]:
# Check for required API keys (set these as environment variables before starting the notebook)
# Example: export ANTHROPIC_API_KEY=your-key-here
# Do not hardcode keys in the notebook; read them from the environment instead.

required_keys = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "FIREWORKS_API_KEY"]
missing_keys = [key for key in required_keys if not os.getenv(key)]

if missing_keys:
    print(f"⚠️  Missing API keys: {missing_keys}")
    print("Please set these environment variables:")
    for key in missing_keys:
        print(f"  export {key}='your-key-here'")
else:
    print("✅ All required API keys are set")


✅ All required API keys are set
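
The same check can be written as a small function that takes the environment as an argument, which makes it testable without touching real credentials. This is a sketch only; `find_missing_keys` and `fake_env` are hypothetical names introduced here.

```python
import os
from typing import Iterable, List, Mapping


def find_missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# A fake environment keeps the example independent of real credentials.
fake_env = {"ANTHROPIC_API_KEY": "placeholder", "OPENAI_API_KEY": ""}
missing = find_missing_keys(
    ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "FIREWORKS_API_KEY"], fake_env
)
print(missing)
```

An empty string counts as missing here, matching the `not os.getenv(key)` test in the cell above.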


In [11]:
# Create model policies
openai_policy = rk.OpenAIPolicy(
    model_id="gpt-4.1",
    temperature=0.1,
    max_tokens=4096,
)

anthropic_policy = rk.AnthropicPolicy(
    model_id="claude-opus-4-20250514",
    temperature=0.1,
    max_tokens=4096,
)

kimi_policy = rk.FireworksPolicy(
    model_id="accounts/fireworks/models/kimi-k2-instruct",  # Kimi K2
    temperature=0.1,
    max_tokens=4096,
)

models_to_test = {
    "gpt-4.1": {
        "policy": openai_policy,
        "name": "GPT-4.1",
        "provider": "OpenAI"
    },
    "claude-opus-4": {
        "policy": anthropic_policy,
        "name": "Claude 4 Opus",
        "provider": "Anthropic"
    },
    "kimi-k2": {
        "policy": kimi_policy,
        "name": "Kimi K2", 
        "provider": "Fireworks"
    }
}

print("✅ Model policies created:")
for model_id, model_info in models_to_test.items():
    print(f"  - {model_info['name']} ({model_info['provider']})")


INFO:eval_protocol.mcp.execution.policy:✅ Initialized OpenAI client: gpt-4.1


INFO:eval_protocol.mcp.execution.policy:✅ Initialized Anthropic client: claude-opus-4-20250514
INFO:eval_protocol.mcp.execution.policy:✅ Initialized Fireworks LLM: accounts/fireworks/models/kimi-k2-instruct (serverless)


✅ Model policies created:
  - GPT-4.1 (OpenAI)
  - Claude 4 Opus (Anthropic)
  - Kimi K2 (Fireworks)


## 4. Run Evaluations

Now we'll run the airline evaluation on all three models and compare their performance.

First, let's set up some code to manage our MCP server. We will start this server later so that our MCP tools have an endpoint to call.

In [12]:
import os
import subprocess
import time
import signal
import atexit
from pathlib import Path
from typing import Any, Dict, List, Optional
import json

class MCPServerManager:
    """Manages MCP server lifecycle for testing."""
    
    # Class-level tracking of all server instances
    _active_servers = []
    _cleanup_registered = False

    def __init__(self, server_script: str, port: int = 8000, domain: str = "airline"):
        self.server_script = server_script
        self.port = port
        self.domain = domain
        self.process: Optional[subprocess.Popen] = None
        self.base_dir = Path(".").resolve()
        self._log_file = None
        self._log_file_path = None
        
        # Register this server for cleanup
        MCPServerManager._active_servers.append(self)
        
        # Register cleanup handlers only once
        if not MCPServerManager._cleanup_registered:
            MCPServerManager._register_cleanup_handlers()
            MCPServerManager._cleanup_registered = True

    def start(self) -> None:
        """Start the MCP server."""
        if self.process:
            return

        # Set environment for server
        env = os.environ.copy()
        env["PORT"] = str(self.port)

        # Start server process (no domain argument needed for tau2_mcp server)
        cmd = ["python", self.server_script, "--port", str(self.port)]

        # Setup log file with cleanup
        log_file_path = os.path.join(self.base_dir, f"server_output_{self.domain}_{self.port}.log")
        if os.path.exists(log_file_path):
            os.remove(log_file_path)

        log_file = open(log_file_path, "w")

        self.process = subprocess.Popen(
            cmd,
            cwd=self.base_dir,
            env=env,
            stdout=log_file,
            stderr=log_file,
            text=True,
        )

        # Store log file reference for cleanup
        self._log_file = log_file
        self._log_file_path = log_file_path

        # Wait for server to start
        time.sleep(3)

        # Check if process is still running
        if self.process.poll() is not None:
            # stdout/stderr were redirected to the log file, so read it for diagnostics
            try:
                with open(self._log_file_path, "r") as f:
                    log_content = f.read()
            except OSError as e:
                log_content = f"<could not read log file: {e}>"
            print("❌ Server failed to start!")
            print(f"📋 Server log ({self._log_file_path}):")
            print("=" * 50)
            print(log_content)
            print("=" * 50)
            raise RuntimeError("Server failed to start. Check log above for details.")
        
        print(f"✅ Server started successfully on port {self.port}")

    def stop(self) -> None:
        """Stop the MCP server."""
        if self.process:
            print(f"🛑 Stopping server on port {self.port}...")
            self.process.terminate()
            try:
                self.process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                print(f"⚡ Force killing server on port {self.port}...")
                self.process.kill()
                self.process.wait()
            self.process = None
            
        # Clean up log file
        if self._log_file:
            try:
                self._log_file.close()
            except Exception:
                pass
            self._log_file = None
            
        if self._log_file_path and os.path.exists(self._log_file_path):
            try:
                os.remove(self._log_file_path)
                print(f"🧹 Cleaned up log file: {self._log_file_path}")
            except OSError:
                pass
            self._log_file_path = None
        
        # Remove from active servers list
        if self in MCPServerManager._active_servers:
            MCPServerManager._active_servers.remove(self)

    @classmethod
    def _cleanup_all_servers(cls):
        """Clean up all active servers on exit"""
        print(f"\n🧹 Cleaning up {len(cls._active_servers)} active servers...")
        for server in cls._active_servers.copy():
            try:
                server.stop()
            except Exception as e:
                print(f"⚠️  Error stopping server: {e}")
        cls._active_servers.clear()
    
    @classmethod
    def _signal_handler(cls, signum, frame):
        """Handle interrupt signals"""
        print(f"\n🛑 Received signal {signum}, cleaning up...")
        cls._cleanup_all_servers()
        raise SystemExit(1)  # exit() is a helper intended for the interactive shell
    
    @classmethod
    def _register_cleanup_handlers(cls):
        """Register cleanup handlers - called only once"""
        atexit.register(cls._cleanup_all_servers)
        signal.signal(signal.SIGINT, cls._signal_handler)  # Ctrl+C
        signal.signal(signal.SIGTERM, cls._signal_handler)  # Termination signal
    
    def __enter__(self):
        """Context manager entry"""
        self.start()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit - ensures cleanup even on exceptions"""
        self.stop()
        if exc_type:
            print(f"⚠️  Server cleanup after exception: {exc_type.__name__}")
        return False  # Don't suppress exceptions
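
`start()` above waits a fixed three seconds before checking whether the process is still alive. One alternative is to poll the port until it accepts a TCP connection, which returns sooner on fast machines and tolerates slow ones. The sketch below is an assumption about how that could look, not part of eval-protocol or tau2-bench; `wait_for_port` is a hypothetical helper, and the demo uses a throwaway listener so it runs without the MCP server.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Demo against a throwaway listener so the example runs without the MCP server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # Port 0 asks the OS for any free port.
listener.listen(1)
free_port = listener.getsockname()[1]

ready = wait_for_port("127.0.0.1", free_port, timeout=2.0)
listener.close()
not_ready = wait_for_port("127.0.0.1", free_port, timeout=0.3)
print(ready, not_ready)
```

A port that accepts connections does not guarantee that the MCP endpoint itself is ready, so the `poll()` check after the wait is still worth keeping.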


Before we get into the main logic, we set up cost tracking so that we can compare both quality and cost across the models. For Kimi K2, we use the official pricing from Fireworks' website, since litellm doesn't include it.

In [13]:
MANUAL_PRICING = {
    "accounts/fireworks/models/kimi-k2-instruct": {
        "input_cost_per_1m": 0.60,  # USD per 1M input tokens; check against current Fireworks pricing
        "output_cost_per_1m": 2.50,  # USD per 1M output tokens; check against current Fireworks pricing
    }
}

def calculate_trajectory_cost(model_id: str, llm_usage_summary: Dict) -> Dict[str, Any]:
    input_tokens = llm_usage_summary.get('prompt_tokens', 0)
    output_tokens = llm_usage_summary.get('completion_tokens', 0)
    total_tokens = llm_usage_summary.get('total_tokens', input_tokens + output_tokens)
    
    if model_id in MANUAL_PRICING:
        pricing = MANUAL_PRICING[model_id]
        
        input_cost = input_tokens * pricing["input_cost_per_1m"] / 1000000
        output_cost = output_tokens * pricing["output_cost_per_1m"] / 1000000
        total_cost = input_cost + output_cost
        
        cost_source = "manual_pricing"

    else:
        input_cost, output_cost = cost_per_token(
            model=model_id,
            prompt_tokens=input_tokens,
            completion_tokens=output_tokens
        )
        total_cost = input_cost + output_cost
        
        cost_source = "litellm"
        
    return {
        "total_cost": total_cost,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_tokens": total_tokens,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model_id": model_id,
        "cost_source": cost_source,
    }
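
To make the manual-pricing arithmetic concrete, here is a minimal worked example. The token counts are made up for illustration, and `cost_from_per_million` is a hypothetical helper that mirrors the manual branch of `calculate_trajectory_cost`; the rates are the ones entered in `MANUAL_PRICING` above.

```python
from typing import Dict


def cost_from_per_million(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1m: float,
    output_cost_per_1m: float,
) -> Dict[str, float]:
    """Convert token counts and per-1M-token prices into costs."""
    input_cost = input_tokens * input_cost_per_1m / 1_000_000
    output_cost = output_tokens * output_cost_per_1m / 1_000_000
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


# Illustrative token counts: 10,000 prompt tokens and 500 completion tokens.
example_cost = cost_from_per_million(10_000, 500, 0.60, 2.50)
print({name: round(value, 6) for name, value in example_cost.items()})
```

With these numbers the input side is 10,000 × 0.60 / 1,000,000 = 0.006 and the output side is 500 × 2.50 / 1,000,000 = 0.00125, for a total of 0.00725. Agent trajectories tend to be dominated by prompt tokens, so the input price usually drives the total.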

Below is our core logic for running the Tau2-Bench eval for a single model. We use the eval protocol framework: rk.make() creates the environment sessions and rk.rollout() runs the model against them to produce trajectories.

In [14]:
async def run_model_evaluation(model_id: str, model_info: Dict, dataset: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """
    Run evaluation for a single model on the airline dataset.
    
    Returns:
        Tuple of (evaluation_results, trajectory_records)
    """
    print(f"\n🧪 Starting evaluation for {model_info['name']}...")

    # Use context manager for automatic cleanup even on exceptions
    with MCPServerManager("../examples/tau2_mcp/server.py", port=8000, domain="airline") as server:
        policy = model_info["policy"]
        
        envs = rk.make(
            "http://localhost:8000/mcp/",
            dataset=dataset, 
            model_id=policy.model_id,
        )
        
        print(f"📊 Created {len(envs.sessions)} environment sessions")
        
        start_time = time.time()
        trajectories = await rk.rollout(envs, policy=policy, steps=30, max_concurrent_rollouts=8)
        duration = time.time() - start_time
        
        print(f"✅ Completed {len(trajectories)} trajectories in {duration:.2f}s")
        
        # Create a helper function to process each trajectory
        async def process_trajectory(i: int, trajectory, dataset_item):
            conversation_history = trajectory.conversation_history
            nl_assertions = dataset_item["assertions"]
            
            # Run tau2-bench evaluation (now async and parallelizable!)
            eval_result = await airline_eval(conversation_history, nl_assertions)
            
            # Calculate cost using existing LLMUsageStats and LiteLLM/manual pricing
            llm_usage = getattr(trajectory, 'llm_usage_summary', {})
            print(f"  📊 LLM Usage for {dataset_item['id']}: {llm_usage}")  # Debug: show actual usage
            cost_info = calculate_trajectory_cost(policy.model_id, llm_usage)

            num_assertions = len(nl_assertions)

            # Create evaluation result
            result = {
                "scenario_id": dataset_item["id"],
                "model_id": policy.model_id,
                "score": eval_result.score,
                "num_assertions": num_assertions,
                "cost_info": cost_info,  # Include cost information in results
            }
            
            # Create comprehensive trajectory record
            trajectory_record = {
                "model_id": policy.model_id,
                "scenario_id": dataset_item["id"],
                "conversation_history": conversation_history,
                "evaluation": {
                    "score": eval_result.score,
                    "num_assertions": num_assertions,
                    "reason": eval_result.reason,
                    "assertions": [
                        {
                            "assertion": assertion,
                            "passed": eval_result.score > 0  # All pass or all fail for this simple implementation
                        }
                        for assertion in nl_assertions
                    ]
                },
                "cost_info": cost_info,  # Add cost information to trajectory record
                "timestamp": datetime.now().isoformat(),
            }
            
            print(f"  📋 {result['scenario_id']}: {result['score']:.1f}, total {result['num_assertions']} assertions)")
            return result, trajectory_record
            
        # Process all trajectories in parallel using asyncio.gather
        print(f"🚀 Processing {len(trajectories)} trajectory evaluations in parallel...")
        eval_start_time = time.time()
        
        tasks = [
            process_trajectory(i, trajectory, dataset[i]) 
            for i, trajectory in enumerate(trajectories)
        ]
        
        # Run all evaluations concurrently
        results_and_records = await asyncio.gather(*tasks)
        
        eval_duration = time.time() - eval_start_time
        print(f"✅ Completed parallel evaluations in {eval_duration:.2f}s")
        
        # Separate results and trajectory records
        results = []
        trajectory_records = []
        for result, trajectory_record in results_and_records:
            results.append(result)
            trajectory_records.append(trajectory_record)
        
        await envs.close()
        # Server cleanup happens automatically via context manager
        
        return results, trajectory_records
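
Each entry in `results` carries a `model_id`, a `score`, and a `cost_info` dict, so a per-model summary is a simple group-and-aggregate. The sketch below uses plain Python and made-up records with the same keys as the `result` dict built above; `summarize_by_model` is a hypothetical helper, not part of eval-protocol.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_by_model(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate scenario count, pass rate, and total cost per model_id."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for result in results:
        grouped[result["model_id"]].append(result)

    summary: Dict[str, Dict[str, float]] = {}
    for model_id, rows in grouped.items():
        summary[model_id] = {
            "scenarios": len(rows),
            "pass_rate": sum(row["score"] for row in rows) / len(rows),
            "total_cost": sum(row["cost_info"]["total_cost"] for row in rows),
        }
    return summary


# Hypothetical results using the same keys as the `result` dict above.
sample_results = [
    {"scenario_id": "a", "model_id": "model-x", "score": 1.0, "cost_info": {"total_cost": 0.02}},
    {"scenario_id": "b", "model_id": "model-x", "score": 0.0, "cost_info": {"total_cost": 0.04}},
    {"scenario_id": "a", "model_id": "model-y", "score": 1.0, "cost_info": {"total_cost": 0.01}},
]
model_summary = summarize_by_model(sample_results)
print(model_summary)
```

Because scores are binary, the mean score per model is the fraction of scenarios in which every assertion passed.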

In [15]:
all_results = []
all_trajectory_records = []

for model_id, model_info in models_to_test.items():
    model_results, trajectory_records = await run_model_evaluation(model_id, model_info, tau2_eval_dataset)
    all_results.extend(model_results)
    all_trajectory_records.extend(trajectory_records)

print(f"\n✅ Completed evaluations for {len(models_to_test)} models")
print(f"📊 Total results: {len(all_results)}")
print(f"📊 Total trajectories: {len(all_trajectory_records)}")


🧪 Starting evaluation for GPT-4.1...


INFO:eval_protocol.mcp.execution.manager:🚀 Live mode: No recording/playback
INFO:eval_protocol.mcp.execution.manager:🧵 Starting 50 rollouts with max 8 concurrent threads...
INFO:mcp.client.streamable_http:Received session ID: e83d02a29977443e82c275530bb22e7c
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18

✅ Server started successfully on port 8000
📊 Created 50 environment sessions


INFO:eval_protocol.mcp.client.connection:Session c19bd8175c5f9d906c9e5900b51ac4db: ✅ Successfully fetched session-aware initial state from control plane endpoint
INFO:LiteLLM:
LiteLLM completion() model= gpt-4.1; provider = openai
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
INFO:eval_protocol.mcp.execution.manager:🎯 Starting rollout 0 in thread MainThread

🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
✅ Completed 50 trajectories in 351.64s
🚀 Processing 50 trajectory evaluations in parallel...


INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai

  📊 LLM Usage for airline_task_6: {'prompt_tokens': 3445, 'completion_tokens': 21, 'total_tokens': 3466}
  📋 airline_task_6: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_0: {'prompt_tokens': 3399, 'completion_tokens': 20, 'total_tokens': 3419}
  📋 airline_task_0: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_1: {'prompt_tokens': 12794, 'completion_tokens': 159, 'total_tokens': 12953}
  📋 airline_task_1: 0.0, total 1 assertions)


[92m01:39:05 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model=

  📊 LLM Usage for airline_task_10: {'prompt_tokens': 27536, 'completion_tokens': 268, 'total_tokens': 27804}
  📋 airline_task_10: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_3: {'prompt_tokens': 3402, 'completion_tokens': 20, 'total_tokens': 3422}
  📋 airline_task_3: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_5: {'prompt_tokens': 17688, 'completion_tokens': 169, 'total_tokens': 17857}
  📋 airline_task_5: 0.0, total 2 assertions)


[92m01:39:05 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:05 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_13: {'prompt_tokens': 3723, 'completion_tokens': 51, 'total_tokens': 3774}
  📋 airline_task_13: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_12: {'prompt_tokens': 23070, 'completion_tokens': 203, 'total_tokens': 23273}
  📋 airline_task_12: 1.0, total 2 assertions)


[92m01:39:06 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_11: {'prompt_tokens': 12540, 'completion_tokens': 117, 'total_tokens': 12657}
  📋 airline_task_11: 0.0, total 3 assertions)


[92m01:39:06 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_4: {'prompt_tokens': 31953, 'completion_tokens': 191, 'total_tokens': 32144}
  📋 airline_task_4: 0.0, total 2 assertions)


[92m01:39:06 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_7: {'prompt_tokens': 17798, 'completion_tokens': 214, 'total_tokens': 18012}
  📋 airline_task_7: 0.0, total 4 assertions)


[92m01:39:06 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:06 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_19: {'prompt_tokens': 18352, 'completion_tokens': 199, 'total_tokens': 18551}
  📋 airline_task_19: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_16: {'prompt_tokens': 23906, 'completion_tokens': 214, 'total_tokens': 24120}
  📋 airline_task_16: 1.0, total 2 assertions)


[92m01:39:07 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:07 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:07 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:07 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_15: {'prompt_tokens': 26785, 'completion_tokens': 308, 'total_tokens': 27093}
  📋 airline_task_15: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_8: {'prompt_tokens': 40523, 'completion_tokens': 303, 'total_tokens': 40826}
  📋 airline_task_8: 0.0, total 4 assertions)


[92m01:39:07 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:07 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_2: {'prompt_tokens': 18534, 'completion_tokens': 344, 'total_tokens': 18878}
  📋 airline_task_2: 1.0, total 4 assertions)


[92m01:39:08 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_9: {'prompt_tokens': 12744, 'completion_tokens': 162, 'total_tokens': 12906}
  📋 airline_task_9: 0.0, total 4 assertions)


[92m01:39:08 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_14: {'prompt_tokens': 46074, 'completion_tokens': 318, 'total_tokens': 46392}
  📋 airline_task_14: 0.0, total 5 assertions)
  📊 LLM Usage for airline_task_20: {'prompt_tokens': 17905, 'completion_tokens': 282, 'total_tokens': 18187}
  📋 airline_task_20: 1.0, total 2 assertions)


[92m01:39:08 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:1239 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[92m01:39:08 - LiteLLM:INFO[0m: utils.py:3230 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = openai


  📊 LLM Usage for airline_task_17: {'prompt_tokens': 28545, 'completion_tokens': 352, 'total_tokens': 28897}
  📋 airline_task_17: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_26: {'prompt_tokens': 7017, 'completion_tokens': 127, 'total_tokens': 7144}
  📋 airline_task_26: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_21: {'prompt_tokens': 47435, 'completion_tokens': 308, 'total_tokens': 47743}
  📋 airline_task_21: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_22: {'prompt_tokens': 32419, 'completion_tokens': 250, 'total_tokens': 32669}
  📋 airline_task_22: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_25: {'prompt_tokens': 30721, 'completion_tokens': 632, 'total_tokens': 31353}
  📋 airline_task_25: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_28: {'prompt_tokens': 7387, 'completion_tokens': 91, 'total_tokens': 7478}
  📋 airline_task_28: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_29: {'prompt_tokens': 18498, 'completion_tokens': 203, 'total_tokens': 18701}
  📋 airline_task_29: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_31: {'prompt_tokens': 7197, 'completion_tokens': 96, 'total_tokens': 7293}
  📋 airline_task_31: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_27: {'prompt_tokens': 75931, 'completion_tokens': 347, 'total_tokens': 76278}
  📋 airline_task_27: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_18: {'prompt_tokens': 21160, 'completion_tokens': 607, 'total_tokens': 21767}
  📋 airline_task_18: 1.0, total 6 assertions)
  📊 LLM Usage for airline_task_36: {'prompt_tokens': 3582, 'completion_tokens': 20, 'total_tokens': 3602}
  📋 airline_task_36: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_32: {'prompt_tokens': 12841, 'completion_tokens': 201, 'total_tokens': 13042}
  📋 airline_task_32: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_30: {'prompt_tokens': 12376, 'completion_tokens': 126, 'total_tokens': 12502}
  📋 airline_task_30: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_34: {'prompt_tokens': 23378, 'completion_tokens': 237, 'total_tokens': 23615}
  📋 airline_task_34: 0.0, total 1 assertions)


  📊 LLM Usage for airline_task_24: {'prompt_tokens': 28502, 'completion_tokens': 430, 'total_tokens': 28932}
  📋 airline_task_24: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_40: {'prompt_tokens': 3832, 'completion_tokens': 45, 'total_tokens': 3877}
  📋 airline_task_40: 0.0, total 1 assertions)


  📊 LLM Usage for airline_task_33: {'prompt_tokens': 34924, 'completion_tokens': 286, 'total_tokens': 35210}
  📋 airline_task_33: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_42: {'prompt_tokens': 13650, 'completion_tokens': 254, 'total_tokens': 13904}
  📋 airline_task_42: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_48: {'prompt_tokens': 7684, 'completion_tokens': 41, 'total_tokens': 7725}
  📋 airline_task_48: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_47: {'prompt_tokens': 8090, 'completion_tokens': 104, 'total_tokens': 8194}
  📋 airline_task_47: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_38: {'prompt_tokens': 16734, 'completion_tokens': 89, 'total_tokens': 16823}
  📋 airline_task_38: 1.0, total 4 assertions)
  📊 LLM Usage for airline_task_37: {'prompt_tokens': 16027, 'completion_tokens': 268, 'total_tokens': 16295}
  📋 airline_task_37: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_43: {'prompt_tokens': 7185, 'completion_tokens': 162, 'total_tokens': 7347}
  📋 airline_task_43: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_46: {'prompt_tokens': 3600, 'completion_tokens': 43, 'total_tokens': 3643}
  📋 airline_task_46: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_49: {'prompt_tokens': 7317, 'completion_tokens': 42, 'total_tokens': 7359}
  📋 airline_task_49: 0.0, total 1 assertions)


  📊 LLM Usage for airline_task_39: {'prompt_tokens': 7587, 'completion_tokens': 176, 'total_tokens': 7763}
  📋 airline_task_39: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_45: {'prompt_tokens': 4190, 'completion_tokens': 21, 'total_tokens': 4211}
  📋 airline_task_45: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_35: {'prompt_tokens': 27958, 'completion_tokens': 329, 'total_tokens': 28287}
  📋 airline_task_35: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_23: {'prompt_tokens': 89802, 'completion_tokens': 704, 'total_tokens': 90506}
  📋 airline_task_23: 0.0, total 8 assertions)


  📊 LLM Usage for airline_task_44: {'prompt_tokens': 33793, 'completion_tokens': 877, 'total_tokens': 34670}
  📋 airline_task_44: 0.0, total 5 assertions)


  📊 LLM Usage for airline_task_41: {'prompt_tokens': 13785, 'completion_tokens': 238, 'total_tokens': 14023}
  📋 airline_task_41: 0.0, total 2 assertions)
✅ Completed parallel evaluations in 14.72s
🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
🛑 Stopping server on port 8000...
🧹 Cleaned up log file: /Users/derekxu/Documents/code/python-sdk/local_evals/server_output_airline_8000.log

🧪 Starting evaluation for Claude 4 Opus...


INFO:eval_protocol.mcp.execution.manager:🚀 Live mode: No recording/playback
INFO:eval_protocol.mcp.execution.manager:🧵 Starting 50 rollouts with max 8 concurrent threads...
INFO:mcp.client.streamable_http:Received session ID: 8091f2d6ad514f10933a053036fcb30c
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18
INFO:mcp.client.streamable_http:Received session ID: 315be2cea5314cfc8236c75ddeb05374
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18
INFO:mcp.client.streamable_http:Received session ID: 07856128c8a04e8fa977ee9de8138c0c
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18
INFO:mcp.client.streamable_http:Received session ID: e09624fcc5634c66bbde2dd842c4b2c3
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18
INFO:mcp.client.streamable_http:Received session ID: 3261f2f7357649a4a01212a7b44e8133
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18
INFO:mcp.client.streamable_http:Recei

✅ Server started successfully on port 8000
📊 Created 50 environment sessions


INFO:eval_protocol.mcp.client.connection:Session 6d25b3e69aee0055d9341f4ef05afd99: ✅ Successfully fetched session-aware initial state from control plane endpoint
INFO:eval_protocol.mcp.execution.manager:🎯 Starting rollout 0 in thread MainThread
INFO:eval_protocol.mcp.client.connection:Session da9c83d55218e8efa9342b068ac5f3fc: ✅ Successfully fetched session-aware initial state from control plane endpoint

🧹 Closing 50 MCP sessions...


INFO:eval_protocol.mcp.execution.manager:📊 Rollout complete: 0/50 reached goal
INFO:eval_protocol.mcp.execution.manager:🎛️  Control plane terminations: 0/50
INFO:eval_protocol.mcp.execution.manager:⏱️  Total duration: 677.19s
INFO:eval_protocol.mcp.execution.manager:🧵 Used 8 concurrent threads

✅ All MCP sessions closed.
✅ Completed 50 trajectories in 677.48s
🚀 Processing 50 trajectory evaluations in parallel...


  📊 LLM Usage for airline_task_6: {'prompt_tokens': 26931, 'completion_tokens': 559, 'total_tokens': 27490}
  📋 airline_task_6: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_1: {'prompt_tokens': 48246, 'completion_tokens': 837, 'total_tokens': 49083}
  📋 airline_task_1: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_0: {'prompt_tokens': 11680, 'completion_tokens': 171, 'total_tokens': 11851}
  📋 airline_task_0: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_13: {'prompt_tokens': 44504, 'completion_tokens': 640, 'total_tokens': 45144}
  📋 airline_task_13: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_10: {'prompt_tokens': 75094, 'completion_tokens': 1474, 'total_tokens': 76568}
  📋 airline_task_10: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_5: {'prompt_tokens': 59770, 'completion_tokens': 781, 'total_tokens': 60551}
  📋 airline_task_5: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_4: {'prompt_tokens': 41589, 'completion_tokens': 439, 'total_tokens': 42028}
  📋 airline_task_4: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_12: {'prompt_tokens': 25984, 'completion_tokens': 482, 'total_tokens': 26466}
  📋 airline_task_12: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_3: {'prompt_tokens': 18244, 'completion_tokens': 298, 'total_tokens': 18542}
  📋 airline_task_3: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_7: {'prompt_tokens': 49239, 'completion_tokens': 649, 'total_tokens': 49888}
  📋 airline_task_7: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_19: {'prompt_tokens': 17990, 'completion_tokens': 321, 'total_tokens': 18311}
  📋 airline_task_19: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_11: {'prompt_tokens': 48062, 'completion_tokens': 788, 'total_tokens': 48850}
  📋 airline_task_11: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_2: {'prompt_tokens': 73883, 'completion_tokens': 959, 'total_tokens': 74842}
  📋 airline_task_2: 0.0, total 4 assertions)


  📊 LLM Usage for airline_task_15: {'prompt_tokens': 55626, 'completion_tokens': 898, 'total_tokens': 56524}
  📋 airline_task_15: 1.0, total 2 assertions)


  📊 LLM Usage for airline_task_9: {'prompt_tokens': 32265, 'completion_tokens': 641, 'total_tokens': 32906}
  📋 airline_task_9: 0.0, total 4 assertions)


  📊 LLM Usage for airline_task_8: {'prompt_tokens': 32731, 'completion_tokens': 726, 'total_tokens': 33457}
  📋 airline_task_8: 1.0, total 4 assertions)
  📊 LLM Usage for airline_task_16: {'prompt_tokens': 34112, 'completion_tokens': 677, 'total_tokens': 34789}
  📋 airline_task_16: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_17: {'prompt_tokens': 66925, 'completion_tokens': 1204, 'total_tokens': 68129}
  📋 airline_task_17: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_26: {'prompt_tokens': 18053, 'completion_tokens': 580, 'total_tokens': 18633}
  📋 airline_task_26: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_14: {'prompt_tokens': 45843, 'completion_tokens': 735, 'total_tokens': 46578}
  📋 airline_task_14: 0.0, total 5 assertions)


  📊 LLM Usage for airline_task_31: {'prompt_tokens': 41322, 'completion_tokens': 597, 'total_tokens': 41919}
  📋 airline_task_31: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_22: {'prompt_tokens': 57927, 'completion_tokens': 917, 'total_tokens': 58844}
  📋 airline_task_22: 1.0, total 3 assertions)


  📊 LLM Usage for airline_task_21: {'prompt_tokens': 77706, 'completion_tokens': 1140, 'total_tokens': 78846}
  📋 airline_task_21: 0.0, total 3 assertions)


  📊 LLM Usage for airline_task_27: {'prompt_tokens': 91895, 'completion_tokens': 1067, 'total_tokens': 92962}
  📋 airline_task_27: 0.0, total 3 assertions)


  📊 LLM Usage for airline_task_25: {'prompt_tokens': 36461, 'completion_tokens': 1103, 'total_tokens': 37564}
  📋 airline_task_25: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_30: {'prompt_tokens': 26274, 'completion_tokens': 666, 'total_tokens': 26940}
  📋 airline_task_30: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_29: {'prompt_tokens': 64702, 'completion_tokens': 939, 'total_tokens': 65641}
  📋 airline_task_29: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_20: {'prompt_tokens': 27775, 'completion_tokens': 865, 'total_tokens': 28640}
  📋 airline_task_20: 0.0, total 2 assertions)


  📊 LLM Usage for airline_task_32: {'prompt_tokens': 34885, 'completion_tokens': 736, 'total_tokens': 35621}
  📋 airline_task_32: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_34: {'prompt_tokens': 33968, 'completion_tokens': 740, 'total_tokens': 34708}
  📋 airline_task_34: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_28: {'prompt_tokens': 12147, 'completion_tokens': 341, 'total_tokens': 12488}
  📋 airline_task_28: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_18: {'prompt_tokens': 263255, 'completion_tokens': 3155, 'total_tokens': 266410}
  📋 airline_task_18: 1.0, total 6 assertions)


  📊 LLM Usage for airline_task_24: {'prompt_tokens': 60156, 'completion_tokens': 1275, 'total_tokens': 61431}
  📋 airline_task_24: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_36: {'prompt_tokens': 11678, 'completion_tokens': 372, 'total_tokens': 12050}
  📋 airline_task_36: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_40: {'prompt_tokens': 11703, 'completion_tokens': 239, 'total_tokens': 11942}
  📋 airline_task_40: 1.0, total 1 assertions)


  📊 LLM Usage for airline_task_41: {'prompt_tokens': 65393, 'completion_tokens': 614, 'total_tokens': 66007}
  📋 airline_task_41: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_47: {'prompt_tokens': 11778, 'completion_tokens': 162, 'total_tokens': 11940}
  📋 airline_task_47: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_33: {'prompt_tokens': 70405, 'completion_tokens': 1216, 'total_tokens': 71621}
  📋 airline_task_33: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_35: {'prompt_tokens': 37194, 'completion_tokens': 862, 'total_tokens': 38056}
  📋 airline_task_35: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_45: {'prompt_tokens': 21223, 'completion_tokens': 325, 'total_tokens': 21548}
  📋 airline_task_45: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_42: {'prompt_tokens': 64766, 'completion_tokens': 768, 'total_tokens': 65534}
  📋 airline_task_42: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_37: {'prompt_tokens': 58092, 'completion_tokens': 1217, 'total_tokens': 59309}
  📋 airline_task_37: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_48: {'prompt_tokens': 18623, 'completion_tokens': 372, 'total_tokens': 18995}
  📋 airline_task_48: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_43: {'prompt_tokens': 55874, 'completion_tokens': 773, 'total_tokens': 56647}
  📋 airline_task_43: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_39: {'prompt_tokens': 107490, 'completion_tokens': 854, 'total_tokens': 108344}
  📋 airline_task_39: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_46: {'prompt_tokens': 5657, 'completion_tokens': 75, 'total_tokens': 5732}
  📋 airline_task_46: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_49: {'prompt_tokens': 18519, 'completion_tokens': 355, 'total_tokens': 18874}
  📋 airline_task_49: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_23: {'prompt_tokens': 111855, 'completion_tokens': 1747, 'total_tokens': 113602}
  📋 airline_task_23: 0.0, total 8 assertions)
  📊 LLM Usage for airline_task_38: {'prompt_tokens': 63787, 'completion_tokens': 808, 'total_tokens': 64595}
  📋 airline_task_38: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_44: {'prompt_tokens': 154163, 'completion_tokens': 2080, 'total_tokens': 156243}
  📋 airline_task_44: 0.0, total 5 assertions)
✅ Completed parallel evaluations in 13.23s
🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
🛑 Stopping server on port 8000...
🧹 Cleaned up log file: /Users/derekxu/Documents/code/python-sdk/local_evals/server_output_airline_8000.log

🧪 Starting evaluation for Kimi K2...


INFO:eval_protocol.mcp.execution.manager:🚀 Live mode: No recording/playback
INFO:eval_protocol.mcp.execution.manager:🧵 Starting 50 rollouts with max 8 concurrent threads...

✅ Server started successfully on port 8000
📊 Created 50 environment sessions



🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
✅ Completed 50 trajectories in 430.00s
🚀 Processing 50 trajectory evaluations in parallel...
  📊 LLM Usage for airline_task_6: {'prompt_tokens': 10418, 'completion_tokens': 299, 'total_tokens': 10717}
  📋 airline_task_6: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_1: {'prompt_tokens': 42424, 'completion_tokens': 273, 'total_tokens': 42697}
  📋 airline_task_1: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_0: {'prompt_tokens': 10380, 'completion_tokens': 286, 'total_tokens': 10666}
  📋 airline_task_0: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_10: {'prompt_tokens': 47224, 'completion_tokens': 660, 'total_tokens': 47884}
  📋 airline_task_10: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_5: {'prompt_tokens': 34071, 'completion_tokens': 487, 'total_tokens': 34558}
  📋 airline_task_5: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_3: {'prompt_tokens': 16041, 'completion_tokens': 132, 'total_tokens': 16173}
  📋 airline_task_3: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_13: {'prompt_tokens': 53099, 'completion_tokens': 839, 'total_tokens': 53938}
  📋 airline_task_13: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_4: {'prompt_tokens': 54131, 'completion_tokens': 608, 'total_tokens': 54739}
  📋 airline_task_4: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_7: {'prompt_tokens': 90184, 'completion_tokens': 729, 'total_tokens': 90913}
  📋 airline_task_7: 1.0, total 4 assertions)
  📊 LLM Usage for airline_task_19: {'prompt_tokens': 28338, 'completion_tokens': 153, 'total_tokens': 28491}
  📋 airline_task_19: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_2: {'prompt_tokens': 59960, 'completion_tokens': 686, 'total_tokens': 60646}
  📋 airline_task_2: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_12: {'prompt_tokens': 29495, 'completion_tokens': 363, 'total_tokens': 29858}
  📋 airline_task_12: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_16: {'prompt_tokens': 29999, 'completion_tokens': 399, 'total_tokens': 30398}
  📋 airline_task_16: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_8: {'prompt_tokens': 49656, 'completion_tokens': 518, 'total_tokens': 50174}
  📋 airline_task_8: 1.0, total 4 assertions)
  📊 LLM Usage for airline_task_15: {'prompt_tokens': 56760, 'completion_tokens': 550, 'total_tokens': 57310}
  📋 airline_task_15: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_17: {'prompt_tokens': 37550, 'completion_tokens': 268, 'total_tokens': 37818}
  📋 airline_task_17: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_11: {'prompt_tokens': 44221, 'completion_tokens': 475, 'total_tokens': 44696}
  📋 airline_task_11: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_9: {'prompt_tokens': 49844, 'completion_tokens': 438, 'total_tokens': 50282}
  📋 airline_task_9: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_26: {'prompt_tokens': 15622, 'completion_tokens': 385, 'total_tokens': 16007}
  📋 airline_task_26: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_22: {'prompt_tokens': 51690, 'completion_tokens': 350, 'total_tokens': 52040}
  📋 airline_task_22: 1.0, total 3 assertions)
  📊 LLM Usage for airline_task_31: {'prompt_tokens': 16339, 'completion_tokens': 382, 'total_tokens': 16721}
  📋 airline_task_31: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_20: {'prompt_tokens': 36300, 'completion_tokens': 685, 'total_tokens': 36985}
  📋 airline_task_20: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_21: {'prompt_tokens': 80264, 'completion_tokens': 837, 'total_tokens': 81101}
  📋 airline_task_21: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_28: {'prompt_tokens': 10402, 'completion_tokens': 364, 'total_tokens': 10766}
  📋 airline_task_28: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_25: {'prompt_tokens': 10126, 'completion_tokens': 83, 'total_tokens': 10209}
  📋 airline_task_25: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_27: {'prompt_tokens': 71868, 'completion_tokens': 360, 'total_tokens': 72228}
  📋 airline_task_27: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_34: {'prompt_tokens': 22861, 'completion_tokens': 380, 'total_tokens': 23241}
  📋 airline_task_34: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_18: {'prompt_tokens': 86104, 'completion_tokens': 1077, 'total_tokens': 87181}
  📋 airline_task_18: 0.0, total 6 assertions)
  📊 LLM Usage for airline_task_36: {'prompt_tokens': 10691, 'completion_tokens': 154, 'total_tokens': 10845}
  📋 airline_task_36: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_30: {'prompt_tokens': 16621, 'completion_tokens': 205, 'total_tokens': 16826}
  📋 airline_task_30: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_29: {'prompt_tokens': 37704, 'completion_tokens': 738, 'total_tokens': 38442}
  📋 airline_task_29: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_32: {'prompt_tokens': 22599, 'completion_tokens': 248, 'total_tokens': 22847}
  📋 airline_task_32: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_14: {'prompt_tokens': 59700, 'completion_tokens': 399, 'total_tokens': 60099}
  📋 airline_task_14: 0.0, total 5 assertions)
  📊 LLM Usage for airline_task_24: {'prompt_tokens': 47247, 'completion_tokens': 532, 'total_tokens': 47779}
  📋 airline_task_24: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_40: {'prompt_tokens': 10333, 'completion_tokens': 114, 'total_tokens': 10447}
  📋 airline_task_40: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_41: {'prompt_tokens': 48992, 'completion_tokens': 239, 'total_tokens': 49231}
  📋 airline_task_41: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_33: {'prompt_tokens': 69535, 'completion_tokens': 950, 'total_tokens': 70485}
  📋 airline_task_33: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_42: {'prompt_tokens': 76604, 'completion_tokens': 304, 'total_tokens': 76908}
  📋 airline_task_42: 1.0, total 2 assertions)
  📊 LLM Usage for airline_task_35: {'prompt_tokens': 31356, 'completion_tokens': 812, 'total_tokens': 32168}
  📋 airline_task_35: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_39: {'prompt_tokens': 94568, 'completion_tokens': 377, 'total_tokens': 94945}
  📋 airline_task_39: 0.0, total 4 assertions)
  📊 LLM Usage for airline_task_47: {'prompt_tokens': 10216, 'completion_tokens': 65, 'total_tokens': 10281}
  📋 airline_task_47: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_48: {'prompt_tokens': 10390, 'completion_tokens': 64, 'total_tokens': 10454}
  📋 airline_task_48: 0.0, total 1 assertions)
  📊 LLM Usage for airline_task_46: {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}
  📋 airline_task_46: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_38: {'prompt_tokens': 28433, 'completion_tokens': 419, 'total_tokens': 28852}
  📋 airline_task_38: 1.0, total 4 assertions)
  📊 LLM Usage for airline_task_37: {'prompt_tokens': 59797, 'completion_tokens': 509, 'total_tokens': 60306}
  📋 airline_task_37: 0.0, total 3 assertions)
  📊 LLM Usage for airline_task_43: {'prompt_tokens': 35081, 'completion_tokens': 199, 'total_tokens': 35280}
  📋 airline_task_43: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_45: {'prompt_tokens': 31911, 'completion_tokens': 793, 'total_tokens': 32704}
  📋 airline_task_45: 0.0, total 2 assertions)
  📊 LLM Usage for airline_task_23: {'prompt_tokens': 69608, 'completion_tokens': 1190, 'total_tokens': 70798}
  📋 airline_task_23: 0.0, total 8 assertions)
  📊 LLM Usage for airline_task_49: {'prompt_tokens': 16281, 'completion_tokens': 296, 'total_tokens': 16577}
  📋 airline_task_49: 1.0, total 1 assertions)
  📊 LLM Usage for airline_task_44: {'prompt_tokens': 42649, 'completion_tokens': 366, 'total_tokens': 43015}
  📋 airline_task_44: 0.0, total 5 assertions)
✅ Completed parallel evaluations in 18.21s
🧹 Closing 50 MCP sessions...
✅ All MCP sessions closed.
🛑 Stopping server on port 8000...
🧹 Cleaned up log file: /Users/derekxu/Documents/code/python-sdk/local_evals/server_output_airline_8000.log

✅ Completed evaluations for 3 models
📊 Total results: 150
📊 Total trajectories: 150


## 5. Analyze Results

Let's analyze and visualize the comparison between Claude 4 Opus, GPT-4.1, and Kimi K2.


In [18]:
# Map each policy's actual model_id back to its config entry so results can be grouped per model
model_id_to_config = {}
for config_key, model_info in models_to_test.items():
    actual_model_id = model_info["policy"].model_id
    model_id_to_config[actual_model_id] = model_info

print("\n📈 Summary Statistics:")
total_cost = 0.0
for actual_model_id, model_info in model_id_to_config.items():
    model_results_subset = [r for r in all_results if r['model_id'] == actual_model_id]
    avg_score = sum(r['score'] for r in model_results_subset) / len(model_results_subset) if model_results_subset else 0
    
    # Calculate total cost for this model
    model_total_cost = sum(r['cost_info']['total_cost'] for r in model_results_subset if 'cost_info' in r)
    total_cost += model_total_cost
    
    # Show cost source info
    cost_sources = [r['cost_info'].get('cost_source', 'unknown') for r in model_results_subset if 'cost_info' in r]
    cost_source_summary = f" (via {cost_sources[0]})" if cost_sources else ""
    
    print(f"   {model_info['name']}: {avg_score:.2%} success rate ({sum(r['score'] for r in model_results_subset)}/{len(model_results_subset)}) - Cost: ${model_total_cost:.2f}{cost_source_summary}")

print(f"\n💰 Total evaluation cost: ${total_cost:.2f}")
print("📊 Cost calculation uses actual API usage data from LLMUsageStats")


📈 Summary Statistics:
   GPT-4.1: 60.00% success rate (30.0/50) - Cost: $2.12 (via litellm)
   Claude 4 Opus: 60.00% success rate (30.0/50) - Cost: $41.17 (via litellm)
   Kimi K2: 48.00% success rate (24.0/50) - Cost: $1.24 (via manual_pricing)

💰 Total evaluation cost: $44.53
📊 Cost calculation uses actual API usage data from LLMUsageStats
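
The summary above reports success rate and total cost separately. As a rough follow-up (not part of the original harness), the two can be combined into a cost per successful scenario. This is a minimal sketch: the `summary` dict and the `cost_per_success` helper are hypothetical names introduced here, and the figures are copied by hand from the printed output above rather than read from `all_results`.

```python
# Figures copied from the summary output above:
# model name -> (successful scenarios, total scenarios, total cost in USD).
summary = {
    "GPT-4.1": (30, 50, 2.12),
    "Claude 4 Opus": (30, 50, 41.17),
    "Kimi K2": (24, 50, 1.24),
}


def cost_per_success(successes: int, total_cost: float) -> float:
    """Return USD spent per successful scenario (inf if nothing succeeded)."""
    return total_cost / successes if successes else float("inf")


for name, (successes, scenarios, cost) in summary.items():
    print(
        f"{name}: {successes}/{scenarios} passed, "
        f"${cost_per_success(successes, cost):.3f} per successful scenario"
    )
```

On these figures, GPT-4.1 and Claude 4 Opus pass the same number of scenarios, so the difference in cost per success comes entirely from the difference in total cost.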


In [19]:
def save_results_jsonl(trajectory_records: List[Dict], output_file: str = "trajectory_outputs/all_trajectories.jsonl"):
    """Save all trajectory records in JSONL format (one JSON object per line)."""
    output_path = Path(output_file)
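    # parents=True so a nested output path (e.g. "runs/2025/out.jsonl") also works
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    with open(output_path, 'w', encoding='utf-8') as f: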
        for record in trajectory_records:
            json.dump(record, f, default=str)
            f.write('\n')
    
    print(f"📄 Saved JSONL file: {output_path}")
    return output_path

save_results_jsonl(all_trajectory_records)

📄 Saved JSONL file: trajectory_outputs/all_trajectories.jsonl


PosixPath('trajectory_outputs/all_trajectories.jsonl')
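
To check the saved file, or to reuse it in a later session, the JSONL can be read back one line at a time. The sketch below is an assumption about how one might do that; `load_results_jsonl` is a hypothetical helper, and the demo writes a small temporary file so the block runs on its own. Because records are saved with `default=str`, any value that was not JSON-serializable comes back as a plain string.

```python
import json
import tempfile
from pathlib import Path


def load_results_jsonl(input_file):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    records = []
    with open(Path(input_file), "r") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Self-contained demo on a temporary file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        '{"model_id": "a", "scenario_id": 1}\n'
        '{"model_id": "b", "scenario_id": 2}\n'
    )
    demo_records = load_results_jsonl(demo_path)

print(f"Loaded {len(demo_records)} records")
```

In the notebook, the same helper could be pointed at `trajectory_outputs/all_trajectories.jsonl`.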

In [20]:
def save_trajectory_files(trajectory_records: List[Dict], output_dir: str = "trajectory_outputs"):
    """Save trajectory records to individual files and create summary."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # Also create missing intermediate directories
    
    # Save individual trajectory files
    for record in trajectory_records:
        # Sanitize model_id for filename (replace slashes with underscores)
        safe_model_id = record['model_id'].replace('/', '_').replace('\\', '_')
        filename = f"{safe_model_id}_{record['scenario_id']}_trajectory.json"
        filepath = output_path / filename
        
        with open(filepath, 'w') as f:
            json.dump(record, f, indent=2, default=str)
    
    # Create summary file
    summary = {
        "evaluation_summary": {
            "total_trajectories": len(trajectory_records),
            "models_evaluated": list(set(r['model_id'] for r in trajectory_records)),
            "scenarios_evaluated": list(set(r['scenario_id'] for r in trajectory_records)),
            "timestamp": datetime.now().isoformat(),
        },
        "model_performance": {},
        "scenario_difficulty": {}
    }
    
    # Calculate model performance
    for model_id in summary["evaluation_summary"]["models_evaluated"]:
        model_records = [r for r in trajectory_records if r['model_id'] == model_id]
        total_score = sum(r['evaluation']['score'] for r in model_records)
        avg_score = total_score / len(model_records) if model_records else 0
        
        # Calculate cost metrics
        total_cost = sum(r.get('cost_info', {}).get('total_cost', 0) for r in model_records)
        total_tokens = sum(r.get('cost_info', {}).get('total_tokens', 0) for r in model_records)
        avg_cost_per_scenario = total_cost / len(model_records) if model_records else 0
        
        summary["model_performance"][model_id] = {
            "total_scenarios": len(model_records),
            "total_score": total_score,
            "average_score": avg_score,
            "pass_rate": avg_score,  # Since scores are 0 or 1
            "total_cost": total_cost,
            "average_cost_per_scenario": avg_cost_per_scenario,
            "total_tokens": total_tokens,
            "cost_per_success": total_cost / total_score if total_score > 0 else 0
        }
    
    # Calculate scenario difficulty
    for scenario_id in summary["evaluation_summary"]["scenarios_evaluated"]:
        scenario_records = [r for r in trajectory_records if r['scenario_id'] == scenario_id]
        total_score = sum(r['evaluation']['score'] for r in scenario_records)
        avg_score = total_score / len(scenario_records) if scenario_records else 0
        
        summary["scenario_difficulty"][scenario_id] = {
            "models_tested": len(scenario_records),
            "total_score": total_score,
            "average_score": avg_score,
            "difficulty": "easy" if avg_score > 0.8 else "medium" if avg_score > 0.5 else "hard"
        }
    
    # Save summary
    summary_path = output_path / "evaluation_summary.json"
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    
    print(f"\n📁 Saved trajectory files to: {output_path}")
    print(f"   - {len(trajectory_records)} individual trajectory files")
    print(f"   - 1 evaluation summary file")
    
    return output_path

save_trajectory_files(all_trajectory_records)


📁 Saved trajectory files to: trajectory_outputs
   - 150 individual trajectory files
   - 1 evaluation summary file


PosixPath('trajectory_outputs')
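
The `difficulty` label in the summary comes from two thresholds on a scenario's average score across models. Pulled out as a standalone function (the name `classify_difficulty` is illustrative), the boundaries are easier to see. Since scores are 0 or 1 and three models were run, the average can only be 0, 1/3, 2/3, or 1:

```python
def classify_difficulty(avg_score):
    """Map an average score in [0, 1] to the labels used in the summary."""
    if avg_score > 0.8:
        return "easy"
    if avg_score > 0.5:
        return "medium"
    return "hard"


# Enumerate every possible outcome for three models with binary scores.
models_tested = 3
for passed in range(models_tested + 1):
    label = classify_difficulty(passed / models_tested)
    print(f"{passed}/{models_tested} models passed -> {label}")
```

With three models, a scenario is "easy" only when all three pass, "medium" when exactly two pass, and "hard" otherwise.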

## 7. Share Results with Firectl

Finally, let's create a dataset with our evaluation results to share using `firectl create dataset`.
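
A minimal sketch of how the upload command could be assembled from Python without running it. It assumes the positional form `firectl create dataset <dataset-id> <path>`, which should be confirmed with `firectl create dataset --help`. The helper `build_firectl_command` and the dataset ID `tau2-airline-eval-results` are hypothetical.

```python
import shlex
from pathlib import Path


def build_firectl_command(dataset_id, jsonl_path):
    """Return the argv list for uploading a JSONL file as a dataset."""
    path = Path(jsonl_path)
    if path.suffix != ".jsonl":
        raise ValueError(f"Expected a .jsonl file, got: {path}")
    # Assumption: firectl takes the dataset ID and file path as positional
    # arguments. Check `firectl create dataset --help` before relying on this.
    return ["firectl", "create", "dataset", dataset_id, str(path)]


command = build_firectl_command(
    "tau2-airline-eval-results",  # hypothetical dataset ID
    "trajectory_outputs/all_trajectories.jsonl",
)
print(shlex.join(command))
```

Once the argument form is confirmed, the list can be passed to `subprocess.run(command, check=True)`.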


In [None]:
# TODO

## Summary

This notebook provides a complete eval harness for comparing models using tau2-bench airline evaluation with proper dataset structure:

1. **Dataset Structure**: Following tau2-bench pattern with separate JSON datasets and markdown system prompts
2. **Models**: Configured Claude 4 Opus (AnthropicPolicy), GPT-4.1 (OpenAIPolicy), and Kimi K2 (FireworksPolicy)
3. **Evaluation**: Used tau2-bench NLAssertionsEvaluator for objective scoring
4. **Analysis**: Compared performance across multiple dimensions
5. **Sharing**: Saved results as JSONL and per-trajectory JSON files, ready to upload with `firectl create dataset`

### Key Features:
- **Clean Dataset Structure**: Separate JSON data and markdown prompts like the tau2 examples
- **Natural Language Evaluation**: Uses human-readable assertions instead of code-based metrics
- **Multi-Model Comparison**: Easy to add more models for comparison
- **Comprehensive Analysis**: Success rate, cost, and token usage per model, plus a per-scenario difficulty rating
- **Reproducible**: Results can be shared and reproduced via firectl

### Next Steps:
1. Set your API keys as environment variables:
   ```bash
   export ANTHROPIC_API_KEY="your-anthropic-key-here"
   export OPENAI_API_KEY="your-openai-key-here"
   export FIREWORKS_API_KEY="your-fireworks-key-here"
   ```
2. Start the tau2 MCP server: `cd examples/tau2_mcp && python server.py --port 8000`
3. Run the evaluation cells
4. Share results with the community by uploading the saved JSONL file with `firectl create dataset`

### Observed Results:
In this run, GPT-4.1 and Claude 4 Opus each passed 30 of 50 scenarios (60%), while Kimi K2 passed 24 of 50 (48%). The evaluation cost $2.12 for GPT-4.1, $41.17 for Claude 4 Opus, and $1.24 for Kimi K2, so the two top-scoring models tied on success rate at very different cost.

This structure mimics the tau2-bench pattern and eliminates the need for manual global variables.