## A/B Testing Models - Evaluating Agent Performance

Learn how to systematically evaluate multiple agent variants using different language models. This tutorial shows you how to run parallel evaluations, compare performance metrics, analyze cost-performance trade-offs, and generate data-driven model selection recommendations.

### What You'll Learn
- Load agent configurations from Part 1 and recreate agents
- Create diverse evaluation datasets from TauBench scenarios
- Run systematic evaluations on multiple agent variants
- Use five built-in evaluators for comprehensive assessment
- Compare evaluation scores side-by-side across models
- Calculate statistical measures (mean, standard deviation)
- Analyze cost-performance trade-offs
- Generate actionable model selection recommendations

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Advanced - Systematic model evaluation and comparison                         |
| Tutorial components | Multi-model evaluation, statistical analysis, cost-performance optimization   |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Advanced                                                                      |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding A/B Testing Evaluation

Systematic evaluation answers four questions: Which model produces the highest-quality responses? Which completes goals most reliably? Which selects tools most accurately? Which provides the best cost-performance ratio?

#### Five Dimensions of Evaluation

| Evaluator | Purpose |
|:----------|:--------|
| OutputEvaluator | Measures response quality, correctness, completeness |
| HelpfulnessEvaluator | Assesses practical value, clarity, actionability |
| GoalSuccessRateEvaluator | Tracks binary success/failure for task completion |
| ToolSelectionAccuracyEvaluator | Validates appropriate tool choices |
| ToolParameterAccuracyEvaluator | Checks parameter correctness and formatting |

#### Evaluation Methodology

| Requirement | Description |
|:------------|:------------|
| Same test dataset | All models tested on identical queries |
| Same evaluation criteria | Consistent scoring across variants |
| Multiple test cases | Statistical validity through volume |

**Analysis Flow**: Run agents → Collect scores → Calculate statistics → Compare side-by-side → Factor cost → Generate recommendations
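
Because every model is tested on identical queries, scores can also be compared case by case rather than only through averages. The following is a minimal sketch of such a paired comparison. The score lists are hypothetical illustrative values, not measured results, and `paired_summary` is a helper name introduced here for illustration.

```python
from typing import Dict, List

def paired_summary(scores_a: List[float], scores_b: List[float]) -> Dict[str, float]:
    """Compare two models case by case; both lists must cover the same cases in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both models must be scored on the same test cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "mean_difference": sum(diffs) / len(diffs),
        "a_wins": sum(1 for d in diffs if d > 0),
        "b_wins": sum(1 for d in diffs if d < 0),
        "ties": sum(1 for d in diffs if d == 0),
    }

# Hypothetical per-case scores for two models on the same four test cases
model_a_scores = [1.0, 0.5, 0.75, 1.0]
model_b_scores = [0.5, 0.5, 1.0, 0.75]

summary = paired_summary(model_a_scores, model_b_scores)
print(summary)
```

A positive `mean_difference` favors the first model. With only a handful of cases, treat the result as a rough signal rather than a conclusion.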

### Environment Setup

Configure AWS region and prepare for multi-model evaluation.

In [None]:
import boto3

# AWS Configuration
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'

### Setup and Imports

Import all necessary libraries for evaluation and statistical analysis.

In [None]:
import os
import sys
import json
import logging
from typing import Dict, List, Any
import statistics
from collections import defaultdict

# Add paths for airline tools (local data directory)
sys.path.append('./data/ma-bench/')
sys.path.append('./data/tau-bench/')

# Strands imports
from strands import Agent
from strands.models import BedrockModel

# Strands Evals imports
from strands_evals import Experiment, Case
from strands_evals.evaluators import (
    OutputEvaluator,
    HelpfulnessEvaluator,
    GoalSuccessRateEvaluator,
    ToolSelectionAccuracyEvaluator,
    ToolParameterAccuracyEvaluator
)

# Display utilities
from IPython.display import Markdown, display

# Disable verbose logging
logging.basicConfig(level=logging.CRITICAL)
for logger_name in ["strands", "graph", "event_loop", "registry", "sliding_window_conversation_manager", "bedrock", "streaming"]:
    logging.getLogger(logger_name).setLevel(logging.CRITICAL)

# Bypass tool consent
os.environ["BYPASS_TOOL_CONSENT"] = "true"

### Load Agent Configurations

Load the agent configurations saved in Part 1 (Tutorial 07a).

In [None]:
# Load configuration from Part 1
config_path = "./agent_configs.json"

with open(config_path, "r") as f:
    config = json.load(f)

print("Loaded agent configurations:")
print(f"  Models: {len(config['models'])}")
print(f"  Dataset path: {config['dataset_path']}")
print(f"  Tools: {config['num_tools']}")
print("\nModel variants:")
for key, model_info in config['models'].items():
    print(f"  {model_info['name']}: {model_info['model_id']}")
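
The cell above reads the keys `models`, `dataset_path`, and `num_tools`, and expects each model entry to carry `name` and `model_id`. As a sketch, assuming that is the full schema written by Part 1, a small check like the one below can surface a malformed file before any agents are built. `find_config_problems` is a hypothetical helper, and the example values are placeholders rather than real model IDs or paths.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = ("models", "dataset_path", "num_tools")
REQUIRED_MODEL_KEYS = ("name", "model_id")

def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for variant, info in config.get("models", {}).items():
        for key in REQUIRED_MODEL_KEYS:
            if key not in info:
                problems.append(f"model '{variant}' is missing '{key}'")
    return problems

# Placeholder config with two deliberate gaps (no num_tools, no model_id for sonnet)
example_config = {
    "models": {
        "haiku": {"name": "Claude Haiku", "model_id": "<haiku-model-id>"},
        "sonnet": {"name": "Claude Sonnet"},
    },
    "dataset_path": "<path-to-dataset.json>",
}

problems = find_config_problems(example_config)
for problem in problems:
    print(problem)
```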

### Import Airline Domain Tools

Recreate the airline tool environment for all three agent variants.

In [None]:
# Import all 14 airline tools
from mabench.environments.airline.tools.book_reservation import book_reservation
from mabench.environments.airline.tools.calculate import calculate
from mabench.environments.airline.tools.cancel_reservation import cancel_reservation
from mabench.environments.airline.tools.get_reservation_details import get_reservation_details
from mabench.environments.airline.tools.get_user_details import get_user_details
from mabench.environments.airline.tools.list_all_airports import list_all_airports
from mabench.environments.airline.tools.search_direct_flight import search_direct_flight
from mabench.environments.airline.tools.search_onestop_flight import search_onestop_flight
from mabench.environments.airline.tools.send_certificate import send_certificate
from mabench.environments.airline.tools.think import think
from mabench.environments.airline.tools.transfer_to_human_agents import transfer_to_human_agents
from mabench.environments.airline.tools.update_reservation_baggages import update_reservation_baggages
from mabench.environments.airline.tools.update_reservation_flights import update_reservation_flights
from mabench.environments.airline.tools.update_reservation_passengers import update_reservation_passengers

# Import airline policy
from tau_bench.envs.airline.wiki import WIKI

# Define tools list
AIRLINE_TOOLS = [
    book_reservation,
    calculate,
    cancel_reservation,
    get_reservation_details,
    get_user_details,
    list_all_airports,
    search_direct_flight,
    search_onestop_flight,
    send_certificate,
    think,
    transfer_to_human_agents,
    update_reservation_baggages,
    update_reservation_flights,
    update_reservation_passengers,
]

# Create system prompt
SYSTEM_PROMPT_TEMPLATE = """
You are a helpful assistant for a travel website. Help the user answer any questions.

<instructions>
- Remember to check if the airport city is in the state mentioned by the user. For example, Houston is in Texas.
- Infer about the U.S. state in which the airport city resides. For example, Houston is in Texas.
- You should not use made-up or placeholder arguments.
</instructions>

<policy>
{policy}
</policy>
"""

AIRLINE_SYSTEM_PROMPT = SYSTEM_PROMPT_TEMPLATE.replace("{policy}", WIKI)

print(f"Imported {len(AIRLINE_TOOLS)} airline tools")
print(f"System prompt created ({len(AIRLINE_SYSTEM_PROMPT)} characters)")

### Recreate All Three Agent Variants

Build the three agent variants (Haiku, Sonnet, Nova Lite) for evaluation.

In [None]:
# Create BedrockModels for each variant
model_haiku = BedrockModel(
    region_name=AWS_REGION,
    model_id=config['models']['haiku']['model_id']
)

model_sonnet = BedrockModel(
    region_name=AWS_REGION,
    model_id=config['models']['sonnet']['model_id']
)

model_nova_lite = BedrockModel(
    region_name=AWS_REGION,
    model_id=config['models']['nova_lite']['model_id']
)

# Create agents
agent_haiku = Agent(
    name="airline_assistant_haiku",
    model=model_haiku,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

agent_sonnet = Agent(
    name="airline_assistant_sonnet",
    model=model_sonnet,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

agent_nova_lite = Agent(
    name="airline_assistant_nova_lite",
    model=model_nova_lite,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

# Store in dictionary for easy iteration
agents = {
    "haiku": {"agent": agent_haiku, "name": "Claude Haiku"},
    "sonnet": {"agent": agent_sonnet, "name": "Claude Sonnet"},
    "nova_lite": {"agent": agent_nova_lite, "name": "Nova Lite"}
}

print("All three agent variants created:")
for key, info in agents.items():
    print(f"  {info['name']}: Ready")

### Load TauBench Dataset

Load the full TauBench airline dataset and select diverse test cases for evaluation.

In [None]:
# Load TauBench dataset
dataset_path = config['dataset_path']

with open(dataset_path, "r") as file:
    all_tasks = json.load(file)

print(f"Loaded {len(all_tasks)} total scenarios from TauBench")
print("\nDataset includes:")
print("  - Flight searches and bookings")
print("  - Reservation modifications")
print("  - Cancellations and refunds")
print("  - Customer service inquiries")
print("  - Policy-based decisions")

### Select Diverse Test Cases

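Choose diverse test cases that represent different complexity levels and scenario types. As written, the cell below selects only the first two scenarios; uncomment the remaining indices in `selected_indices` to evaluate all 12.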

In [None]:
# Select diverse test cases (indices chosen for variety)
selected_indices = [0, 1] #, 2, 4, 10, 15, 20, 25, 30, 35, 40, 45]
selected_tasks = [all_tasks[i] for i in selected_indices]

print(f"Selected {len(selected_tasks)} diverse test cases:")
print("\nTest Case Overview:")
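for i, task in enumerate(selected_tasks, 1):
    # Use 'question' if available, otherwise fall back to 'instruction'
    question = task.get('question', task.get('instruction', '')).replace('\n', ' ')
    # Truncate long questions so the overview stays readable
    question_preview = question[:100] + "..." if len(question) > 100 else question
    print(f"  {i}. {question_preview}")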
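
The indices above are picked by hand. One alternative, sketched below, is to spread the selection evenly across the dataset so that the choice is reproducible and not clustered at the start. This is an assumption about what counts as a reasonable spread, not the tutorial's selection method, and even spacing alone does not guarantee scenario diversity. `pick_spread_indices` and the dataset size of 50 are hypothetical.

```python
from typing import List

def pick_spread_indices(total: int, k: int) -> List[int]:
    """Return up to k distinct indices spread evenly over range(total)."""
    if total <= 0 or k <= 0:
        return []
    if k >= total:
        return list(range(total))
    if k == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic and free of rounding surprises
    return [i * (total - 1) // (k - 1) for i in range(k)]

# Hypothetical dataset of 50 scenarios, 12 test cases wanted
indices = pick_spread_indices(total=50, k=12)
print(indices)  # [0, 4, 8, 13, 17, 22, 26, 31, 35, 40, 44, 49]
```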

### Create Evaluation Dataset

Build the Strands Evals test cases, then create one `Experiment` per evaluator (five in total) over the same cases.

In [None]:
# Create Case objects for evaluation
test_cases = []
for i, task in enumerate(selected_tasks):
    # Use 'question' if available, otherwise fall back to 'instruction'
    user_query = task.get('question', task.get('instruction', ''))
    case = Case(
        name=f"Scenario_{i+1}_{task['user_id']}",
        input=user_query,  # ✅ Works for both schemas
        expected_output="A helpful, accurate response that addresses the user's airline service request."
    )
    test_cases.append(case)

# Create separate experiments for each evaluator
evaluator_list = [
    ("OutputEvaluator", OutputEvaluator(rubric="The output should provide a helpful, accurate response to the airline service request.")),
    ("HelpfulnessEvaluator", HelpfulnessEvaluator()),
    ("GoalSuccessRateEvaluator", GoalSuccessRateEvaluator()),
    ("ToolSelectionAccuracyEvaluator", ToolSelectionAccuracyEvaluator()),
    ("ToolParameterAccuracyEvaluator", ToolParameterAccuracyEvaluator())
]

dataset_output = Experiment(
    cases=test_cases,
    evaluator=evaluator_list[0][1]
)

dataset_helpfulness = Experiment(
    cases=test_cases,
    evaluator=evaluator_list[1][1]
)

dataset_goal = Experiment(
    cases=test_cases,
    evaluator=evaluator_list[2][1]
)

dataset_tool_selection = Experiment(
    cases=test_cases,
    evaluator=evaluator_list[3][1]
)

dataset_tool_parameter = Experiment(
    cases=test_cases,
    evaluator=evaluator_list[4][1]
)

print(f"Evaluation experiments created:")
print(f"  Test cases: {len(test_cases)}")
print(f"  Evaluators: {len(evaluator_list)} (one experiment per evaluator)")
print("\nEvaluator types:")
for evaluator_name, _ in evaluator_list:
    print(f"  - {evaluator_name}")


### Run Evaluation: Claude Haiku

Evaluate the Claude Haiku agent variant on all test cases.

In [None]:
# Define agent task function for Haiku
def agent_task_haiku(case: Case) -> str:
    """Execute Haiku agent with the given case input."""
    agent_haiku.messages = []
    response = agent_haiku(case.input)
    return str(response)

# Run evaluations for all 5 evaluators
print("[EVALUATING HAIKU]")
print("Running evaluation on Claude Haiku agent...")

print("  [1/5] OutputEvaluator...")
report_haiku_output = dataset_output.run_evaluations(agent_task_haiku)

print("  [2/5] HelpfulnessEvaluator...")
report_haiku_helpfulness = dataset_helpfulness.run_evaluations(agent_task_haiku)

print("  [3/5] GoalSuccessRateEvaluator...")
report_haiku_goal = dataset_goal.run_evaluations(agent_task_haiku)

print("  [4/5] ToolSelectionAccuracyEvaluator...")
report_haiku_tool_selection = dataset_tool_selection.run_evaluations(agent_task_haiku)

print("  [5/5] ToolParameterAccuracyEvaluator...")
report_haiku_tool_parameter = dataset_tool_parameter.run_evaluations(agent_task_haiku)

print("Haiku evaluation complete (5 evaluation rounds)")
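
The cell above, and the Sonnet and Nova Lite cells that follow, repeat the same five calls. A loop over a name-to-experiment dictionary is one way to reduce that repetition. The sketch below assumes only that each experiment exposes `run_evaluations(task_fn)` as used above. `run_all_evaluators` is a hypothetical helper, and `StubExperiment` is a stand-in so the snippet runs without Bedrock access; it is not part of Strands Evals.

```python
from typing import Any, Callable, Dict

def run_all_evaluators(experiments: Dict[str, Any], task_fn: Callable[[Any], str]) -> Dict[str, Any]:
    """Run task_fn through every experiment and collect the reports by evaluator name."""
    reports = {}
    total = len(experiments)
    for position, (name, experiment) in enumerate(experiments.items(), 1):
        print(f"  [{position}/{total}] {name}...")
        reports[name] = experiment.run_evaluations(task_fn)
    return reports

class StubExperiment:
    """Stand-in for an experiment: applies the task function to each case."""
    def __init__(self, cases):
        self.cases = cases

    def run_evaluations(self, task_fn):
        return [task_fn(case) for case in self.cases]

stub_experiments = {
    "OutputEvaluator": StubExperiment(["q1", "q2"]),
    "HelpfulnessEvaluator": StubExperiment(["q1", "q2"]),
}
stub_reports = run_all_evaluators(stub_experiments, lambda case: f"answer to {case}")
```

In the notebook, the dictionary would map evaluator names to `dataset_output`, `dataset_helpfulness`, and so on, and `task_fn` would be `agent_task_haiku`, `agent_task_sonnet`, or `agent_task_nova`.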

#### Haiku Results

Display evaluation results for Claude Haiku.

In [None]:
print("\n" + "=" * 80)
print("CLAUDE HAIKU RESULTS")
print("=" * 80)

print("\n[1/5] OUTPUT QUALITY")
print("-" * 80)
report_haiku_output.display()

print("\n[2/5] HELPFULNESS")
print("-" * 80)
report_haiku_helpfulness.display()

print("\n[3/5] GOAL SUCCESS RATE")
print("-" * 80)
report_haiku_goal.display()

print("\n[4/5] TOOL SELECTION ACCURACY")
print("-" * 80)
report_haiku_tool_selection.display()

print("\n[5/5] TOOL PARAMETER ACCURACY")
print("-" * 80)
report_haiku_tool_parameter.display()

### Run Evaluation: Claude Sonnet

Evaluate the Claude Sonnet agent variant on all test cases.

In [None]:
# Define agent task function for Sonnet
def agent_task_sonnet(case: Case) -> str:
    """Execute Sonnet agent with the given case input."""
    agent_sonnet.messages = []
    response = agent_sonnet(case.input)
    return str(response)

# Run evaluations for all 5 evaluators
print("[EVALUATING SONNET]")
print("Running evaluation on Claude Sonnet agent...")

print("  [1/5] OutputEvaluator...")
report_sonnet_output = dataset_output.run_evaluations(agent_task_sonnet)

print("  [2/5] HelpfulnessEvaluator...")
report_sonnet_helpfulness = dataset_helpfulness.run_evaluations(agent_task_sonnet)

print("  [3/5] GoalSuccessRateEvaluator...")
report_sonnet_goal = dataset_goal.run_evaluations(agent_task_sonnet)

print("  [4/5] ToolSelectionAccuracyEvaluator...")
report_sonnet_tool_selection = dataset_tool_selection.run_evaluations(agent_task_sonnet)

print("  [5/5] ToolParameterAccuracyEvaluator...")
report_sonnet_tool_parameter = dataset_tool_parameter.run_evaluations(agent_task_sonnet)

print("Sonnet evaluation complete (5 evaluation rounds)")

#### Sonnet Results

Display evaluation results for Claude Sonnet.

In [None]:
print("\n" + "=" * 80)
print("CLAUDE SONNET RESULTS")
print("=" * 80)

print("\n[1/5] OUTPUT QUALITY")
print("-" * 80)
report_sonnet_output.display()

print("\n[2/5] HELPFULNESS")
print("-" * 80)
report_sonnet_helpfulness.display()

print("\n[3/5] GOAL SUCCESS RATE")
print("-" * 80)
report_sonnet_goal.display()

print("\n[4/5] TOOL SELECTION ACCURACY")
print("-" * 80)
report_sonnet_tool_selection.display()

print("\n[5/5] TOOL PARAMETER ACCURACY")
print("-" * 80)
report_sonnet_tool_parameter.display()

### Run Evaluation: Nova Lite

Evaluate the Nova Lite agent variant on all test cases.

In [None]:
# Define agent task function for Nova Lite
def agent_task_nova(case: Case) -> str:
    """Execute Nova Lite agent with the given case input."""
    agent_nova_lite.messages = []
    response = agent_nova_lite(case.input)
    return str(response)

# Run evaluations for all 5 evaluators
print("[EVALUATING NOVA LITE]")
print("Running evaluation on Nova Lite agent...")

print("  [1/5] OutputEvaluator...")
report_nova_output = dataset_output.run_evaluations(agent_task_nova)

print("  [2/5] HelpfulnessEvaluator...")
report_nova_helpfulness = dataset_helpfulness.run_evaluations(agent_task_nova)

print("  [3/5] GoalSuccessRateEvaluator...")
report_nova_goal = dataset_goal.run_evaluations(agent_task_nova)

print("  [4/5] ToolSelectionAccuracyEvaluator...")
report_nova_tool_selection = dataset_tool_selection.run_evaluations(agent_task_nova)

print("  [5/5] ToolParameterAccuracyEvaluator...")
report_nova_tool_parameter = dataset_tool_parameter.run_evaluations(agent_task_nova)

print("Nova Lite evaluation complete (5 evaluation rounds)")

#### Nova Lite Results

Display evaluation results for Nova Lite.

In [None]:
print("\n" + "=" * 80)
print("NOVA LITE RESULTS")
print("=" * 80)

print("\n[1/5] OUTPUT QUALITY")
print("-" * 80)
report_nova_output.display()

print("\n[2/5] HELPFULNESS")
print("-" * 80)
report_nova_helpfulness.display()

print("\n[3/5] GOAL SUCCESS RATE")
print("-" * 80)
report_nova_goal.display()

print("\n[4/5] TOOL SELECTION ACCURACY")
print("-" * 80)
report_nova_tool_selection.display()

print("\n[5/5] TOOL PARAMETER ACCURACY")
print("-" * 80)
report_nova_tool_parameter.display()

In [None]:
# DEBUG: Inspect report structure to understand how to extract scores
print("=" * 80)
print("DEBUG: Inspecting report_haiku_output structure")
print("=" * 80)

print("\n1. Report type:")
print(f"   {type(report_haiku_output)}")

print("\n2. Report attributes:")
report_attrs = [attr for attr in dir(report_haiku_output) if not attr.startswith('_')]
print(f"   {report_attrs}")

print("\n3. Checking for 'detailed_results' attribute:")
if hasattr(report_haiku_output, 'detailed_results'):
    print(f"   Type: {type(report_haiku_output.detailed_results)}")
    if report_haiku_output.detailed_results:
        print(f"   Length: {len(report_haiku_output.detailed_results)}")
        print(f"   First element type: {type(report_haiku_output.detailed_results[0])}")
        print(f"   First element: {report_haiku_output.detailed_results[0]}")
else:
    print("   'detailed_results' not found!")

print("\n4. Checking for 'results' attribute:")
if hasattr(report_haiku_output, 'results'):
    print(f"   Type: {type(report_haiku_output.results)}")
    if report_haiku_output.results:
        print(f"   Length: {len(report_haiku_output.results)}")
        print(f"   First element type: {type(report_haiku_output.results[0])}")
        print(f"   First element attributes: {[attr for attr in dir(report_haiku_output.results[0]) if not attr.startswith('_')]}")
        print(f"   First element: {report_haiku_output.results[0]}")
        
        # Check if first element has score
        if hasattr(report_haiku_output.results[0], 'score'):
            print(f"   First element score: {report_haiku_output.results[0].score}")
else:
    print("   'results' not found!")

print("\n5. All non-private attributes with values:")
for attr in report_attrs[:10]:  # Show first 10 to avoid too much output
    try:
        value = getattr(report_haiku_output, attr)
        if not callable(value):
            print(f"   {attr}: {type(value)} = {str(value)[:100]}")
    except Exception:
        pass

In [None]:
# Helper function to extract scores from report
def extract_scores(report, evaluator_name: str) -> Dict[str, List[float]]:
    """Extract numeric scores organized by evaluator type.
    
    Args:
        report: EvaluationReport object containing detailed_results
        evaluator_name: Name of the evaluator (e.g., 'OutputEvaluator', 'HelpfulnessEvaluator')
    
    Returns:
        Dictionary mapping evaluator name to list of scores
    """
    scores = defaultdict(list)
    
    # detailed_results is a list of lists of EvaluationOutput objects
    # Each test case has its own list of EvaluationOutput objects
    for test_case_results in report.detailed_results:
        # test_case_results is a list of EvaluationOutput objects for one test case
        for eval_output in test_case_results:
            # eval_output is an EvaluationOutput object with fields: score, test_pass, reason, label
            if hasattr(eval_output, 'score'):
                scores[evaluator_name].append(float(eval_output.score))
    
    return dict(scores)

# Extract scores from all evaluation reports (3 models x evaluators)
scores_haiku = {}
scores_haiku.update(extract_scores(report_haiku_output, 'OutputEvaluator'))
scores_haiku.update(extract_scores(report_haiku_helpfulness, 'HelpfulnessEvaluator'))
scores_haiku.update(extract_scores(report_haiku_goal, 'GoalSuccessRateEvaluator'))
scores_haiku.update(extract_scores(report_haiku_tool_selection, 'ToolSelectionAccuracyEvaluator'))
scores_haiku.update(extract_scores(report_haiku_tool_parameter, 'ToolParameterAccuracyEvaluator'))

scores_sonnet = {}
scores_sonnet.update(extract_scores(report_sonnet_output, 'OutputEvaluator'))
scores_sonnet.update(extract_scores(report_sonnet_helpfulness, 'HelpfulnessEvaluator'))
scores_sonnet.update(extract_scores(report_sonnet_goal, 'GoalSuccessRateEvaluator'))
scores_sonnet.update(extract_scores(report_sonnet_tool_selection, 'ToolSelectionAccuracyEvaluator'))
scores_sonnet.update(extract_scores(report_sonnet_tool_parameter, 'ToolParameterAccuracyEvaluator'))

scores_nova = {}
scores_nova.update(extract_scores(report_nova_output, 'OutputEvaluator'))
scores_nova.update(extract_scores(report_nova_helpfulness, 'HelpfulnessEvaluator'))
scores_nova.update(extract_scores(report_nova_goal, 'GoalSuccessRateEvaluator'))
scores_nova.update(extract_scores(report_nova_tool_selection, 'ToolSelectionAccuracyEvaluator'))
scores_nova.update(extract_scores(report_nova_tool_parameter, 'ToolParameterAccuracyEvaluator'))

# Calculate totals dynamically
num_evaluators = len(scores_haiku)
num_models = 3
total_reports = num_models * num_evaluators

print(f"Scores extracted from all {total_reports} evaluation reports")
print(f"  Haiku: {num_evaluators} evaluation rounds")
print(f"  Sonnet: {num_evaluators} evaluation rounds")
print(f"  Nova: {num_evaluators} evaluation rounds")
print(f"\nEvaluator types found:")
for evaluator_name in scores_haiku.keys():
    print(f"  - {evaluator_name}")
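
To make the nested structure concrete, here is a minimal sketch that mimics a report with stub objects. It assumes, as the comments in `extract_scores` state, that `detailed_results` is a list with one inner list of outputs per test case. The `SimpleNamespace` objects and their score values are hypothetical stand-ins, not real evaluation output.

```python
from types import SimpleNamespace
from typing import List

def flatten_scores(report) -> List[float]:
    """Collect every numeric score from a report's nested detailed_results."""
    return [
        float(output.score)
        for case_results in report.detailed_results
        for output in case_results
        if hasattr(output, "score")
    ]

# Stub report: two test cases, one evaluation output each (illustrative values)
stub_report = SimpleNamespace(
    detailed_results=[
        [SimpleNamespace(score=1.0, test_pass=True)],
        [SimpleNamespace(score=0.5, test_pass=False)],
    ]
)

flat_scores = flatten_scores(stub_report)
print(flat_scores)  # [1.0, 0.5]
```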


### Calculate Statistical Measures

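Compute the mean and sample standard deviation of each evaluator's scores, separately for each model. When an evaluator has only one score, the standard deviation is reported as 0.0.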

In [None]:
# Calculate statistics for each model
def calculate_statistics(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Calculate mean and standard deviation for each evaluator."""
    stats = {}
    
    for evaluator_name, score_list in scores.items():
        if len(score_list) > 0:
            mean_score = statistics.mean(score_list)
            std_dev = statistics.stdev(score_list) if len(score_list) > 1 else 0.0
            stats[evaluator_name] = {
                "mean": mean_score,
                "std_dev": std_dev,
                "count": len(score_list)
            }
    
    return stats

# Calculate for all models
stats_haiku = calculate_statistics(scores_haiku)
stats_sonnet = calculate_statistics(scores_sonnet)
stats_nova = calculate_statistics(scores_nova)
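
As a small worked example of what these statistics mean, the sketch below computes them for three hypothetical scores (not measured results). It also shows the standard error of the mean, an addition not used elsewhere in this tutorial, which indicates how much the mean itself could shift with so few test cases.

```python
import math
import statistics

# Hypothetical scores for one evaluator on three test cases
example_scores = [0.8, 0.6, 1.0]

mean_score = statistics.mean(example_scores)
# statistics.stdev is the sample standard deviation (divides by n - 1)
sample_std = statistics.stdev(example_scores)
# Standard error of the mean: sample standard deviation divided by sqrt(n)
standard_error = sample_std / math.sqrt(len(example_scores))

print(f"mean={mean_score:.3f} std_dev={sample_std:.3f} std_err={standard_error:.3f}")
# prints: mean=0.800 std_dev=0.200 std_err=0.115
```

With two or three cases per evaluator, differences between models smaller than the standard error are hard to distinguish from noise.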

### Side-by-Side Comparison

Display evaluation scores side-by-side for direct comparison across models.

In [None]:
# Create comparison table
comparison_md = "## Model Performance Comparison\n\n"
comparison_md += "### Mean Scores by Evaluator\n\n"
comparison_md += "| Evaluator | Claude Haiku | Claude Sonnet | Nova Lite |\n"
comparison_md += "|:----------|-------------:|--------------:|----------:|\n"

# Get all evaluator names
all_evaluators = set(stats_haiku.keys()) | set(stats_sonnet.keys()) | set(stats_nova.keys())

for evaluator_name in sorted(all_evaluators):
    haiku_mean = stats_haiku.get(evaluator_name, {}).get('mean', 0.0)
    sonnet_mean = stats_sonnet.get(evaluator_name, {}).get('mean', 0.0)
    nova_mean = stats_nova.get(evaluator_name, {}).get('mean', 0.0)
    
    # Highlight best score
    max_score = max(haiku_mean, sonnet_mean, nova_mean)
    
    haiku_str = f"**{haiku_mean:.3f}**" if haiku_mean == max_score else f"{haiku_mean:.3f}"
    sonnet_str = f"**{sonnet_mean:.3f}**" if sonnet_mean == max_score else f"{sonnet_mean:.3f}"
    nova_str = f"**{nova_mean:.3f}**" if nova_mean == max_score else f"{nova_mean:.3f}"
    
    comparison_md += f"| {evaluator_name} | {haiku_str} | {sonnet_str} | {nova_str} |\n"

comparison_md += "\n*Bold indicates best performance for that evaluator*\n"

display(Markdown(comparison_md))

### Statistical Variability Analysis

Examine the standard deviation to understand consistency across test cases.

In [None]:
# Create variability table
variability_md = "## Performance Variability (Standard Deviation)\n\n"
variability_md += "| Evaluator | Claude Haiku | Claude Sonnet | Nova Lite |\n"
variability_md += "|:----------|-------------:|--------------:|----------:|\n"

for evaluator_name in sorted(all_evaluators):
    haiku_std = stats_haiku.get(evaluator_name, {}).get('std_dev', 0.0)
    sonnet_std = stats_sonnet.get(evaluator_name, {}).get('std_dev', 0.0)
    nova_std = stats_nova.get(evaluator_name, {}).get('std_dev', 0.0)
    
    # Highlight lowest variability (most consistent)
    min_std = min(haiku_std, sonnet_std, nova_std) if any([haiku_std, sonnet_std, nova_std]) else 0
    
    haiku_str = f"**{haiku_std:.3f}**" if haiku_std == min_std and min_std > 0 else f"{haiku_std:.3f}"
    sonnet_str = f"**{sonnet_std:.3f}**" if sonnet_std == min_std and min_std > 0 else f"{sonnet_std:.3f}"
    nova_str = f"**{nova_std:.3f}**" if nova_std == min_std and min_std > 0 else f"{nova_std:.3f}"
    
    variability_md += f"| {evaluator_name} | {haiku_str} | {sonnet_str} | {nova_str} |\n"

variability_md += "\n*Bold indicates lowest variability (most consistent performance)*\n"
variability_md += "\n**Lower standard deviation = More consistent performance across test cases**\n"

display(Markdown(variability_md))

### Cost-Performance Analysis

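Analyze cost-performance trade-offs based on approximate Amazon Bedrock pricing. Relative cost here is the sum of each model's input and output per-1K-token prices, normalized to Claude Haiku.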

In [None]:
# Approximate pricing (as of 2024, per 1K tokens)
# Note: Actual pricing may vary by region and over time
pricing = {
    "haiku": {"input": 0.00025, "output": 0.00125},  # Haiku pricing
    "sonnet": {"input": 0.003, "output": 0.015},    # Sonnet pricing
    "nova_lite": {"input": 0.00006, "output": 0.00024}  # Nova Lite pricing
}

# Create cost-performance summary
cost_md = "## Cost-Performance Trade-offs\n\n"
cost_md += "### Approximate Pricing (per 1K tokens)\n\n"
cost_md += "| Model | Input Cost | Output Cost | Relative Cost |\n"
cost_md += "|:------|:-----------|:------------|:--------------|\n"

# Calculate relative costs (normalized to Haiku)
haiku_total = pricing["haiku"]["input"] + pricing["haiku"]["output"]
sonnet_total = pricing["sonnet"]["input"] + pricing["sonnet"]["output"]
nova_total = pricing["nova_lite"]["input"] + pricing["nova_lite"]["output"]

cost_md += f"| Claude Haiku | ${pricing['haiku']['input']:.5f} | ${pricing['haiku']['output']:.5f} | 1.0x (baseline) |\n"
cost_md += f"| Claude Sonnet | ${pricing['sonnet']['input']:.5f} | ${pricing['sonnet']['output']:.5f} | {sonnet_total/haiku_total:.1f}x |\n"
cost_md += f"| Nova Lite | ${pricing['nova_lite']['input']:.6f} | ${pricing['nova_lite']['output']:.6f} | {nova_total/haiku_total:.2f}x |\n"

cost_md += "\n*Note: Pricing is approximate and may vary by region*\n"

display(Markdown(cost_md))
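
Per-1K-token prices become easier to reason about when turned into a cost per request. The sketch below reuses the approximate prices from the cell above. The token counts (2,000 input, 500 output) are hypothetical, and `estimate_request_cost` is a helper introduced here for illustration. Real agent runs with tool calls may consume considerably more tokens.

```python
from typing import Dict

# Approximate per-1K-token prices, copied from the pricing cell above
PRICING_PER_1K: Dict[str, Dict[str, float]] = {
    "haiku": {"input": 0.00025, "output": 0.00125},
    "sonnet": {"input": 0.003, "output": 0.015},
    "nova_lite": {"input": 0.00006, "output": 0.00024},
}

def estimate_request_cost(model_key: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request from its token counts."""
    rates = PRICING_PER_1K[model_key]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

# Hypothetical request size: 2,000 input tokens and 500 output tokens
request_costs = {
    key: estimate_request_cost(key, input_tokens=2000, output_tokens=500)
    for key in PRICING_PER_1K
}
for key, cost in request_costs.items():
    print(f"{key}: ${cost:.6f} per request")
```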

### Model Selection Recommendations

Generate actionable recommendations based on evaluation results and cost analysis.

In [None]:
# Calculate overall performance score (simple average of means)
def overall_score(stats: Dict[str, Dict[str, float]]) -> float:
    """Calculate simple average of all mean scores."""
    means = [stat['mean'] for stat in stats.values()]
    return statistics.mean(means) if means else 0.0

overall_haiku = overall_score(stats_haiku)
overall_sonnet = overall_score(stats_sonnet)
overall_nova = overall_score(stats_nova)

# Create recommendations
recommendations_md = "## Model Selection Recommendations\n\n"
recommendations_md += "### Overall Performance Scores\n\n"
recommendations_md += f"- **Claude Haiku**: {overall_haiku:.3f}\n"
recommendations_md += f"- **Claude Sonnet**: {overall_sonnet:.3f}\n"
recommendations_md += f"- **Nova Lite**: {overall_nova:.3f}\n"
recommendations_md += "\n### Use Case Recommendations\n\n"

# Determine winner (use full display names so Nova Lite is not labeled as a Claude model)
best_model = max(
    [("Claude Haiku", overall_haiku), ("Claude Sonnet", overall_sonnet), ("Nova Lite", overall_nova)],
    key=lambda x: x[1]
)

recommendations_md += f"**Best Overall Performance**: {best_model[0]}\n\n"
display(Markdown(recommendations_md))
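
The cell above picks the model with the highest overall score and does not yet factor in cost. One simple way to combine the two, sketched below, is to choose the cheapest model whose overall score clears a quality threshold. The relative costs follow from the approximate pricing shown earlier. The overall scores and the 0.75 threshold are hypothetical illustrative values, not results from this evaluation, and `cheapest_meeting_threshold` is a helper introduced here.

```python
from typing import Dict, Optional

def cheapest_meeting_threshold(
    overall_scores: Dict[str, float],
    relative_costs: Dict[str, float],
    min_score: float,
) -> Optional[str]:
    """Return the lowest-cost model whose score is at least min_score, or None if none qualify."""
    eligible = [name for name, score in overall_scores.items() if score >= min_score]
    if not eligible:
        return None
    return min(eligible, key=lambda name: relative_costs[name])

# Relative costs normalized to Claude Haiku, derived from the approximate pricing above
relative_costs = {"Claude Haiku": 1.0, "Claude Sonnet": 12.0, "Nova Lite": 0.2}

# Hypothetical overall scores, for illustration only
example_overall = {"Claude Haiku": 0.80, "Claude Sonnet": 0.90, "Nova Lite": 0.70}

choice = cheapest_meeting_threshold(example_overall, relative_costs, min_score=0.75)
print(choice)  # Claude Haiku
```

Raising the threshold shifts the choice toward higher-scoring, more expensive models, and a threshold that no model meets returns `None`.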

### Summary

You've learned how to run a systematic A/B evaluation of agent models. You now know how to:

- Load agent configurations and recreate multiple variants
- Create diverse evaluation datasets from real-world scenarios
- Run systematic evaluations with multiple built-in evaluators
- Extract and aggregate evaluation scores across models
- Calculate statistical measures (mean, standard deviation)
- Compare model performance side-by-side
- Analyze cost-performance trade-offs
- Generate actionable model selection recommendations

This evaluation methodology enables data-driven decision making for model selection in production systems. By combining performance metrics with cost analysis, you can optimize for both quality and efficiency based on your specific requirements.