## A/B Testing Models - Building Agent Variants

Learn how to build multiple versions of the same agent using different models for A/B testing purposes. This tutorial shows you how to create three variants of a ReAct airline assistant agent with identical capabilities but powered by different language models.

### What You'll Learn
- Build ReAct agents with custom orchestration for airline customer service
- Create multiple agent variants using different models (Haiku, Sonnet, Nova Lite)
- Configure identical toolsets and prompts across agent variants
- Test each agent independently with the same queries
- Understand the agent-as-tool pattern with comprehensive airline tools
- Prepare agent configurations for systematic A/B testing evaluation

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Advanced - Building multiple agent variants for model comparison              |
| Tutorial components | ReAct orchestration, TauBench tools, multi-model agent configuration          |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Advanced                                                                      |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding A/B Testing for Agents

**A/B testing** here means comparing agent variants that share identical tools and prompts but use different models. Every variant is run on the same queries, so the model is the only variable that changes.

#### Why A/B Test Models?

| Factor | Impact |
|:-------|:-------|
| Response quality | Different models produce different quality outputs |
| Cost efficiency | Per-token pricing can differ by an order of magnitude or more between models |
| Latency | Smaller models typically respond faster |
| Capability fit | Some models excel at specific tasks |
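
To make the cost row concrete, the sketch below computes a per-query cost from token counts. The `Pricing` class, the `cost_per_query` helper, the prices, and the token counts are all hypothetical placeholders chosen for easy arithmetic. They are not real Bedrock prices, so substitute current published pricing before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per 1,000 tokens. The values used below are placeholders, not real prices."""
    input_per_1k: float
    output_per_1k: float


def cost_per_query(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call given its input and output token counts."""
    return (input_tokens / 1000) * pricing.input_per_1k + (
        output_tokens / 1000
    ) * pricing.output_per_1k


# Placeholder prices in arbitrary currency units (assumption, for illustration only).
small = Pricing(input_per_1k=1.0, output_per_1k=4.0)
large = Pricing(input_per_1k=10.0, output_per_1k=40.0)

# Hypothetical token counts: a long policy prompt plus tool schemas dominates the input side.
small_cost = cost_per_query(small, input_tokens=8000, output_tokens=500)
large_cost = cost_per_query(large, input_tokens=8000, output_tokens=500)

print(f"small: {small_cost:.2f}, large: {large_cost:.2f}, ratio: {large_cost / small_cost:.1f}x")
```

Because an agent may call the model several times per query (once per reasoning step), the per-query cost in practice is this figure multiplied by the number of steps.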

#### Three Models for Comparison

| Model | ID | Characteristics | Use Case |
|:------|:---|:----------------|:---------|
| Claude Haiku | `us.anthropic.claude-3-5-haiku-20241022-v1:0` | Fastest, lowest cost | High-volume, straightforward queries |
| Claude Sonnet | `us.anthropic.claude-3-7-sonnet-20250219-v1:0` | Balanced speed/capability | Complex reasoning, reasonable cost |
| Nova Lite | `us.amazon.nova-lite-v1:0` | AWS-native, cost-effective | AWS ecosystem, budget-conscious |

#### Two-Part Tutorial Structure

| Part | Focus |
|:-----|:------|
| Part 1 (This Notebook) | Build three agent variants, test independently, save configurations |
| Part 2 (Next Notebook) | Run evaluations, compare scores, analyze trade-offs, generate recommendations |

### ReAct Agent Architecture

**ReAct (Reasoning and Acting)** interleaves reasoning with tool use: the agent thinks about the next step, acts by executing a tool, observes the result, and repeats until the task is complete.

#### Agent Flow

```
User Query → Think → Act (tools) → Observe → Repeat until complete → Final Response
```
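
The flow above can be sketched in plain Python. This is a minimal illustration of the loop, not how Strands implements its event loop: `TOOLS`, `fake_model`, and `react_loop` are hypothetical names, and the "model" is a scripted stub so the example runs without AWS access.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: tool name -> callable taking a string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculate": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


def fake_model(history: List[str]) -> dict:
    """Scripted stand-in for an LLM: request a tool once, then answer."""
    observations = [h for h in history if h.startswith("observation:")]
    if not observations:
        return {"type": "act", "tool": "calculate", "input": "120+35"}
    result = observations[-1].split(":", 1)[1].strip()
    return {"type": "final", "text": f"The total is {result}."}


def react_loop(query: str, model=fake_model, max_steps: int = 5) -> str:
    """Run think -> act -> observe until the model returns a final answer."""
    history = [f"user: {query}"]
    for _ in range(max_steps):
        step = model(history)                        # Think
        if step["type"] == "final":
            return step["text"]                      # Final response
        output = TOOLS[step["tool"]](step["input"])  # Act
        history.append(f"observation: {output}")     # Observe
    return "Stopped: step limit reached."


print(react_loop("What is 120 plus 35?"))
```

The `max_steps` cap is a design choice worth keeping in any real loop: without it, a model that keeps requesting tools would never terminate.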

#### Airline Assistant Tools (14 total)

| Category | Tools |
|:---------|:------|
| Flight Management | `search_direct_flight`, `search_onestop_flight`, `list_all_airports` |
| Booking Operations | `book_reservation`, `cancel_reservation`, `get_reservation_details` |
| Reservation Updates | `update_reservation_flights`, `update_reservation_passengers`, `update_reservation_baggages` |
| Customer Service | `get_user_details`, `send_certificate`, `transfer_to_human_agents` |
| Utilities | `calculate`, `think` |

### Environment Setup

Configure AWS region and define the three model IDs we'll test.

In [None]:
import boto3

# AWS Configuration
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'

# Define three models for A/B testing
MODEL_HAIKU = 'us.anthropic.claude-3-5-haiku-20241022-v1:0'
MODEL_SONNET = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'
MODEL_NOVA_LITE = 'us.amazon.nova-lite-v1:0'

print(f"AWS Region: {AWS_REGION}")
print(f"\nModels for A/B Testing:")
print(f"  1. Haiku:     {MODEL_HAIKU}")
print(f"  2. Sonnet:    {MODEL_SONNET}")
print(f"  3. Nova Lite: {MODEL_NOVA_LITE}")

### Setup and Imports

Import all necessary libraries and configure the airline environment.

In [None]:
import os
import sys
import json
import logging
from typing import Dict, List, Any

# Add paths for airline tools (local data directory)
sys.path.append('./data/ma-bench/')
sys.path.append('./data/tau-bench/')

# Strands imports
from strands import Agent
from strands.models import BedrockModel

# Disable verbose logging
logging.basicConfig(level=logging.CRITICAL)
for logger_name in ["strands", "graph", "event_loop", "registry", "sliding_window_conversation_manager", "bedrock", "streaming"]:
    logging.getLogger(logger_name).setLevel(logging.CRITICAL)

# Bypass tool consent for automated execution
os.environ["BYPASS_TOOL_CONSENT"] = "true"

print("All imports successful")

### Import Airline Domain Tools

Import the comprehensive suite of 14 airline tools from MAbench and TauBench. These tools provide real functionality for flight booking, reservation management, and customer service operations.

In [None]:
# Import all 14 airline tools
from mabench.environments.airline.tools.book_reservation import book_reservation
from mabench.environments.airline.tools.calculate import calculate
from mabench.environments.airline.tools.cancel_reservation import cancel_reservation
from mabench.environments.airline.tools.get_reservation_details import get_reservation_details
from mabench.environments.airline.tools.get_user_details import get_user_details
from mabench.environments.airline.tools.list_all_airports import list_all_airports
from mabench.environments.airline.tools.search_direct_flight import search_direct_flight
from mabench.environments.airline.tools.search_onestop_flight import search_onestop_flight
from mabench.environments.airline.tools.send_certificate import send_certificate
from mabench.environments.airline.tools.think import think
from mabench.environments.airline.tools.transfer_to_human_agents import transfer_to_human_agents
from mabench.environments.airline.tools.update_reservation_baggages import update_reservation_baggages
from mabench.environments.airline.tools.update_reservation_flights import update_reservation_flights
from mabench.environments.airline.tools.update_reservation_passengers import update_reservation_passengers

# Import airline policy
from tau_bench.envs.airline.wiki import WIKI

# Define tools list
AIRLINE_TOOLS = [
    book_reservation,
    calculate,
    cancel_reservation,
    get_reservation_details,
    get_user_details,
    list_all_airports,
    search_direct_flight,
    search_onestop_flight,
    send_certificate,
    think,
    transfer_to_human_agents,
    update_reservation_baggages,
    update_reservation_flights,
    update_reservation_passengers,
]

print(f"Imported {len(AIRLINE_TOOLS)} airline tools")
print(f"Loaded airline policy document ({len(WIKI)} characters)")

### Define Airline Agent System Prompt

Create the specialized system prompt that will be used consistently across all three agent variants. This prompt includes:
- Airline policy from TauBench wiki
- Geographic inference instructions
- Tool usage guidelines

In [None]:
# Define system prompt template
SYSTEM_PROMPT_TEMPLATE = """
You are a helpful assistant for a travel website. Help the user answer any questions.

<instructions>
- Remember to check if the airport city is in the state mentioned by the user. For example, Houston is in Texas.
- Infer about the U.S. state in which the airport city resides. For example, Houston is in Texas.
- You should not use made-up or placeholder arguments.
</instructions>

<policy>
{policy}
</policy>
"""

# Create full system prompt with policy
AIRLINE_SYSTEM_PROMPT = SYSTEM_PROMPT_TEMPLATE.replace("{policy}", WIKI)

print(f"System prompt created ({len(AIRLINE_SYSTEM_PROMPT)} characters)")
print("\nPrompt includes:")
print("  - Airline policy and rules")
print("  - Geographic inference guidelines")
print("  - Tool usage instructions")

### Build Agent Variant 1: Claude Haiku

Create the first agent variant powered by Claude Haiku. This model is optimized for speed and cost efficiency, making it ideal for high-volume deployments with straightforward queries.

In [None]:
# Create BedrockModel for Haiku
bedrock_model_haiku = BedrockModel(
    region_name=AWS_REGION,
    model_id=MODEL_HAIKU
)

# Create Haiku agent
agent_haiku = Agent(
    name="airline_assistant_haiku",
    model=bedrock_model_haiku,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

print("Agent Variant 1: Claude Haiku")
print(f"  Model: {MODEL_HAIKU}")
print(f"  Tools: {len(AIRLINE_TOOLS)} airline tools")
print(f"  Status: Ready")

### Build Agent Variant 2: Claude Sonnet

Create the second agent variant powered by Claude Sonnet. This model offers balanced performance with strong reasoning capabilities, suitable for complex multi-step tasks.

In [None]:
# Create BedrockModel for Sonnet
bedrock_model_sonnet = BedrockModel(
    region_name=AWS_REGION,
    model_id=MODEL_SONNET
)

# Create Sonnet agent
agent_sonnet = Agent(
    name="airline_assistant_sonnet",
    model=bedrock_model_sonnet,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

print("Agent Variant 2: Claude Sonnet")
print(f"  Model: {MODEL_SONNET}")
print(f"  Tools: {len(AIRLINE_TOOLS)} airline tools")
print(f"  Status: Ready")

### Build Agent Variant 3: Nova Lite

Create the third agent variant powered by Amazon Nova Lite. This AWS-native model provides cost-effective performance with seamless AWS ecosystem integration.

In [None]:
# Create BedrockModel for Nova Lite
bedrock_model_nova_lite = BedrockModel(
    region_name=AWS_REGION,
    model_id=MODEL_NOVA_LITE
)

# Create Nova Lite agent
agent_nova_lite = Agent(
    name="airline_assistant_nova_lite",
    model=bedrock_model_nova_lite,
    tools=AIRLINE_TOOLS,
    system_prompt=AIRLINE_SYSTEM_PROMPT
)

print("Agent Variant 3: Nova Lite")
print(f"  Model: {MODEL_NOVA_LITE}")
print(f"  Tools: {len(AIRLINE_TOOLS)} airline tools")
print(f"  Status: Ready")
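
The three build cells above differ only in the model ID and the agent name. One way to guarantee that nothing else drifts between variants is to route all of them through a single construction function. The sketch below assumes a hypothetical `build_variants` helper and uses a stub builder that returns plain dictionaries so it runs without AWS credentials.

```python
from typing import Any, Callable, Dict


def build_variants(
    model_ids: Dict[str, str],
    build_agent: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Build one agent per model ID using a single shared construction path."""
    return {key: build_agent(key, model_id) for key, model_id in model_ids.items()}


def stub_builder(key: str, model_id: str) -> dict:
    """Stand-in builder (assumption): returns a dict instead of a real agent."""
    return {"name": f"airline_assistant_{key}", "model_id": model_id}


variants = build_variants(
    {
        "haiku": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
        "sonnet": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
        "nova_lite": "us.amazon.nova-lite-v1:0",
    },
    stub_builder,
)

for key, variant in variants.items():
    print(key, "->", variant["name"])
```

In the notebook, the builder would instead wrap the same `BedrockModel(...)` and `Agent(...)` calls used in the cells above, passing `AIRLINE_TOOLS` and `AIRLINE_SYSTEM_PROMPT` unchanged for every variant.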

### Verify Agent Configuration

Confirm all three agents are configured identically except for the model.

In [None]:
# Store agent configurations for verification
agent_configs = {
    "haiku": {
        "agent": agent_haiku,
        "model_id": MODEL_HAIKU,
        "name": "Claude Haiku",
        "characteristics": "Fast, cost-effective, optimized for simple queries"
    },
    "sonnet": {
        "agent": agent_sonnet,
        "model_id": MODEL_SONNET,
        "name": "Claude Sonnet",
        "characteristics": "Balanced performance, strong reasoning capabilities"
    },
    "nova_lite": {
        "agent": agent_nova_lite,
        "model_id": MODEL_NOVA_LITE,
        "name": "Nova Lite",
        "characteristics": "AWS-native, cost-effective, multimodal"
    }
}

print("Agent Configuration Summary:")
print("=" * 80)
for variant_key, config in agent_configs.items():
    print(f"\n{config['name']}:")
    print(f"  Model ID: {config['model_id']}")
    print(f"  Tools: {len(config['agent'].tool_names)} airline tools")
    print(f"  Prompt: {len(config['agent'].system_prompt)} characters")
    print(f"  Characteristics: {config['characteristics']}")

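# Compare every variant's tools and prompt against the Haiku agent
reference = agent_configs["haiku"]["agent"]
all_match = all(
    sorted(config["agent"].tool_names) == sorted(reference.tool_names)
    and config["agent"].system_prompt == reference.system_prompt
    for config in agent_configs.values()
)

print("\n" + "=" * 80)
print(f"Configuration Verification: identical prompts and tools across all agents = {all_match}")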

### Load Test Dataset

Load the TauBench airline dataset containing real customer service scenarios. We'll use a subset of these queries to test each agent variant.

In [None]:
# Load TauBench dataset (local data directory)
dataset_path = "./data/tau-bench/tau_bench/envs/airline/tasks_singleturn.json"

with open(dataset_path, "r") as file:
    tasks = json.load(file)

print(f"Loaded {len(tasks)} test scenarios from TauBench dataset")
print("\nDataset includes scenarios for:")
print("  - Flight searches and bookings")
print("  - Reservation modifications")
print("  - Cancellations and refunds")
print("  - Customer service inquiries")
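
Later cells index into `tasks` by position and read the `question` and `user_id` fields. A small guard can turn a silent mismatch (a shorter dataset, a renamed field) into a clear error. The sketch below is one possible approach: `get_task` and `REQUIRED_KEYS` are hypothetical names, and the sample records are made up to keep the example self-contained.

```python
from typing import Any, Dict, List

# The fields this notebook reads from each task.
REQUIRED_KEYS = ("question", "user_id")


def get_task(tasks: List[Dict[str, Any]], index: int) -> Dict[str, Any]:
    """Return tasks[index] after checking the index and the expected fields."""
    if not 0 <= index < len(tasks):
        raise IndexError(f"Task index {index} out of range (dataset has {len(tasks)} tasks)")
    task = tasks[index]
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise KeyError(f"Task {index} is missing fields: {missing}")
    return task


# Made-up records standing in for the loaded dataset (assumption, not TauBench data).
sample_tasks = [
    {"question": "Can I change my flight?", "user_id": "user_a"},
    {"question": "Please cancel my booking.", "user_id": "user_b"},
    {"question": "A record without a user id"},
]

print(get_task(sample_tasks, 1)["user_id"])
```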

### Test Example 1: Simple Flight Search

Test all three agents with a straightforward flight search query. This tests basic tool usage and information retrieval.

In [None]:
# Select test query
test_query_1 = "I want to search for direct flights from JFK to LAX on 2024-06-15. Can you help?"

print("Test Query 1: Simple Flight Search")
print("=" * 80)
print(f"Query: {test_query_1}")
print("\nExpected behavior: Agent should use search_direct_flight tool")
print("=" * 80)

In [None]:
# Test Haiku
print("\n[TESTING HAIKU]")
print("-" * 80)
agent_haiku.messages = []  # Reset conversation
response_haiku_1 = agent_haiku(test_query_1)
print(f"Response: {response_haiku_1}")
print("-" * 80)

In [None]:
# Test Sonnet
print("\n[TESTING SONNET]")
print("-" * 80)
agent_sonnet.messages = []  # Reset conversation
response_sonnet_1 = agent_sonnet(test_query_1)
print(f"Response: {response_sonnet_1}")
print("-" * 80)

In [None]:
# Test Nova Lite
print("\n[TESTING NOVA LITE]")
print("-" * 80)
agent_nova_lite.messages = []  # Reset conversation
response_nova_lite_1 = agent_nova_lite(test_query_1)
print(f"Response: {response_nova_lite_1}")
print("-" * 80)
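
The per-agent cells above repeat the same three steps: reset the conversation, send the query, print the response. The sketch below folds them into one loop and also records wall-clock time per variant. `run_across_variants` and `EchoAgent` are hypothetical names. The stub only mimics the call shape used in this notebook (a callable object with a `messages` attribute), so the example runs without any model access.

```python
import time
from typing import Any, Dict


def run_across_variants(agents: Dict[str, Any], query: str) -> Dict[str, dict]:
    """Send the same query to every agent, starting each from an empty conversation."""
    results = {}
    for key, agent in agents.items():
        agent.messages = []  # reset conversation, as in the cells above
        start = time.perf_counter()
        response = agent(query)
        elapsed = time.perf_counter() - start
        results[key] = {"response": str(response), "seconds": elapsed}
    return results


class EchoAgent:
    """Stub agent (assumption): echoes the query instead of calling a model."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.messages = ["stale message from an earlier run"]

    def __call__(self, query: str) -> str:
        self.messages.append(query)
        return f"[{self.label}] {query}"


demo_agents = {"haiku": EchoAgent("haiku"), "sonnet": EchoAgent("sonnet")}
demo_results = run_across_variants(demo_agents, "Find direct flights from JFK to LAX")

for key, result in demo_results.items():
    print(f"{key}: {result['response']} ({result['seconds']:.4f}s)")
```

Wall-clock time measured this way includes network latency and every tool call the agent makes, so treat it as a rough signal rather than a pure model-latency figure.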

### Test Example 2: Complex Reservation Modification

Test all three agents with a complex multi-step query involving reservation lookup and modification. This tests reasoning and multi-tool coordination.

In [None]:
# Select complex test query from dataset
task_20 = tasks[20]
test_query_2 = task_20["question"]

print("Test Query 2: Complex Reservation Modification")
print("=" * 80)
print(f"Query: {test_query_2}")
print(f"\nUser ID: {task_20['user_id']}")
print("\nExpected behavior: Agent should:")
print("  1. Look up reservation details")
print("  2. Search for alternative flights")
print("  3. Address baggage policy")
print("=" * 80)

In [None]:
# Test Haiku
print("\n[TESTING HAIKU]")
print("-" * 80)
agent_haiku.messages = []  # Reset conversation
response_haiku_2 = agent_haiku(test_query_2)
print(f"Response: {response_haiku_2}")
print("-" * 80)

In [None]:
# Test Sonnet
print("\n[TESTING SONNET]")
print("-" * 80)
agent_sonnet.messages = []  # Reset conversation
response_sonnet_2 = agent_sonnet(test_query_2)
print(f"Response: {response_sonnet_2}")
print("-" * 80)

In [None]:
# Test Nova Lite
print("\n[TESTING NOVA LITE]")
print("-" * 80)
agent_nova_lite.messages = []  # Reset conversation
response_nova_lite_2 = agent_nova_lite(test_query_2)
print(f"Response: {response_nova_lite_2}")
print("-" * 80)

### Test Example 3: Policy-Based Query

Test all three agents with a query requiring policy knowledge and proper escalation. This tests understanding of business rules.

In [None]:
# Select policy-based test query from dataset
task_48 = tasks[48]
test_query_3 = task_48["question"]

print("Test Query 3: Policy-Based Query")
print("=" * 80)
print(f"Query: {test_query_3}")
print(f"\nUser ID: {task_48['user_id']}")
print("\nExpected behavior: Agent should:")
print("  1. Look up user details")
print("  2. Apply cancellation policy rules")
print("  3. Process cancellation or explain restrictions")
print("=" * 80)

In [None]:
# Test Haiku
print("\n[TESTING HAIKU]")
print("-" * 80)
agent_haiku.messages = []  # Reset conversation
response_haiku_3 = agent_haiku(test_query_3)
print(f"Response: {response_haiku_3}")
print("-" * 80)

In [None]:
# Test Sonnet
print("\n[TESTING SONNET]")
print("-" * 80)
agent_sonnet.messages = []  # Reset conversation
response_sonnet_3 = agent_sonnet(test_query_3)
print(f"Response: {response_sonnet_3}")
print("-" * 80)

In [None]:
# Test Nova Lite
print("\n[TESTING NOVA LITE]")
print("-" * 80)
agent_nova_lite.messages = []  # Reset conversation
response_nova_lite_3 = agent_nova_lite(test_query_3)
print(f"Response: {response_nova_lite_3}")
print("-" * 80)

### Save Agent Configurations for Evaluation

Store the agent configurations so they can be easily loaded in Part 2 (Tutorial 07b) for systematic evaluation.

In [None]:
# Create configuration dictionary for Part 2
evaluation_configs = {
    "models": {
        "haiku": {
            "model_id": MODEL_HAIKU,
            "name": "Claude Haiku",
            "description": "Fast, cost-effective, optimized for simple queries"
        },
        "sonnet": {
            "model_id": MODEL_SONNET,
            "name": "Claude Sonnet",
            "description": "Balanced performance, strong reasoning capabilities"
        },
        "nova_lite": {
            "model_id": MODEL_NOVA_LITE,
            "name": "Nova Lite",
            "description": "AWS-native, cost-effective, multimodal"
        }
    },
    "dataset_path": dataset_path,
    "num_tools": len(AIRLINE_TOOLS),
    "prompt_length": len(AIRLINE_SYSTEM_PROMPT)
}

# Save configuration
config_output_path = "./agent_configs.json"
with open(config_output_path, "w") as f:
    json.dump(evaluation_configs, f, indent=2)

print("Agent configurations saved for evaluation:")
print(f"  Output path: {config_output_path}")
print(f"  Models configured: {len(evaluation_configs['models'])}")
print(f"  Dataset path: {evaluation_configs['dataset_path']}")
print("\nThese configurations will be loaded in Tutorial 07b for systematic evaluation.")
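
On the loading side, it helps to fail early if the saved file is missing a field that the evaluation expects. The sketch below shows one way to do that. `load_evaluation_config` is a hypothetical helper (how Part 2 actually loads the file is not shown here), and the round trip uses a temporary file with a trimmed-down config so the example is self-contained.

```python
import json
import os
import tempfile

# Top-level keys written by the save cell above.
REQUIRED_TOP_LEVEL = ("models", "dataset_path", "num_tools", "prompt_length")


def load_evaluation_config(path: str) -> dict:
    """Load the saved config and raise if expected fields are missing."""
    with open(path, "r") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    if missing:
        raise KeyError(f"Config is missing keys: {missing}")
    for variant, entry in config["models"].items():
        if "model_id" not in entry:
            raise KeyError(f"Variant '{variant}' has no model_id")
    return config


# Trimmed-down config for the demonstration.
sample = {
    "models": {"nova_lite": {"model_id": "us.amazon.nova-lite-v1:0", "name": "Nova Lite"}},
    "dataset_path": "./data/tau-bench/tau_bench/envs/airline/tasks_singleturn.json",
    "num_tools": 14,
    "prompt_length": 0,  # placeholder; the notebook stores len(AIRLINE_SYSTEM_PROMPT)
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "agent_configs.json")
    with open(path, "w") as f:
        json.dump(sample, f, indent=2)
    loaded = load_evaluation_config(path)

print(loaded["models"]["nova_lite"]["model_id"])
```

Note that the saved file records model IDs and metadata only. The agents themselves, along with the tools and the full system prompt, must be rebuilt in the next notebook.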

### Agent Testing Summary

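Render a summary of the agents built and the scenarios tested. The text in this cell is static rather than generated from the responses above, so check each statement against the actual outputs.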

In [None]:
# Generate testing summary
summary = """
# Agent Variant Testing Summary

## Agents Built

Successfully created three agent variants:

1. **Claude Haiku** - Fast and cost-effective variant
2. **Claude Sonnet** - Balanced performance variant
3. **Nova Lite** - AWS-native variant

## Configuration Consistency

All three agents configured with:
- **Identical system prompt** (includes airline policy and guidelines)
- **Identical toolset** (14 airline domain tools)
- **Same orchestration** (ReAct pattern)
- **Only difference**: Language model

## Testing Results

### Test 1: Simple Flight Search
- **Query complexity**: Low
- **Tools required**: search_direct_flight
- **All agents**: Successfully executed search

### Test 2: Complex Reservation Modification
- **Query complexity**: High
- **Tools required**: get_reservation_details, search_direct_flight, policy application
- **All agents**: Demonstrated multi-tool coordination

### Test 3: Policy-Based Query
- **Query complexity**: Medium
- **Tools required**: get_user_details, cancel_reservation, policy reasoning
- **All agents**: Applied business rules

## Observations

**Differences observed across models:**
- Response style and verbosity
- Tool selection sequencing
- Policy interpretation nuances
- Response completeness

**These differences will be quantified in Part 2 through:**
- Output quality evaluation
- Tool selection accuracy
- Policy compliance scoring
- Cost-performance analysis

## Next Steps

**Tutorial 07b will provide:**
1. Systematic evaluation on full test dataset
2. Side-by-side score comparison
3. Statistical significance analysis
4. Cost-performance trade-off analysis
5. Model selection recommendations

## Configuration Saved

Agent configurations exported to: `agent_configs.json`

Ready for evaluation in Tutorial 07b.
"""

from IPython.display import Markdown, display
display(Markdown(summary))
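
Since the summary above is fixed text, one option is to build the results section from notes recorded while reviewing each response. The sketch below renders such notes as a Markdown table. `results_to_markdown` is a hypothetical helper, and the entries in `example` are placeholders, not measured results.

```python
from typing import Dict, List


def results_to_markdown(results: Dict[str, Dict[str, str]]) -> str:
    """Render {test name: {variant: observation}} as a Markdown table."""
    variants: List[str] = sorted({v for obs in results.values() for v in obs})
    lines = [
        "| Test | " + " | ".join(variants) + " |",
        "|:-----|" + "|".join(":---" for _ in variants) + "|",
    ]
    for test_name, obs in results.items():
        # Variants with no recorded note for this test are marked explicitly.
        cells = [obs.get(v, "not run") for v in variants]
        lines.append(f"| {test_name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder observations for illustration only; these are not measured results.
example = {
    "Test 1": {"haiku": "placeholder note", "sonnet": "placeholder note"},
    "Test 2": {"haiku": "placeholder note"},
}
table = results_to_markdown(example)
print(table)
```

The returned string can be passed to `display(Markdown(...))` in the same way as the static summary.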

### Summary

You've successfully learned how to build multiple agent variants for A/B testing. You now understand:

- **ReAct agent architecture**: Reasoning-Acting pattern with iterative tool execution
- **Agent variant creation**: Building identical agents with different models
- **Configuration consistency**: Ensuring fair comparison through identical prompts and tools
- **Three model types**: Haiku (fast), Sonnet (balanced), Nova Lite (AWS-native)
- **Testing methodology**: Running same queries across all variants
- **TauBench integration**: Using real airline domain tools and datasets
- **Configuration export**: Saving agent configs for systematic evaluation

This notebook established the foundation for A/B testing by creating three agent variants that differ only in their underlying model. In Tutorial 07b, you'll evaluate these agents systematically to determine which model provides the best cost-performance trade-off for your use case.