## Multi-turn Evaluation using Actor Simulator - Realistic Conversational Agent Testing

This tutorial demonstrates how to use ActorSimulator to evaluate conversational agents through realistic multi-turn interactions. ActorSimulator creates AI-powered user personas that engage with your agent naturally, helping you test complex conversation flows, handle diverse user behaviors, and ensure robust agent performance across different scenarios.

### What You'll Learn
- Create realistic user simulations with the ActorSimulator API
- Design actor profiles with traits, context, and goals
- Implement goal completion detection with stop tokens
- Test agents with diverse personas (polite, demanding, confused)
- Build automated multi-turn evaluation pipelines
- Implement a dev-to-prod workflow with metric comparison
- Scale testing with datasets and automated simulation

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Advanced - Realistic multi-turn conversation testing                         |
| Tutorial components | ActorSimulator, Personal Assistant agent, multi-turn conversations           |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Advanced                                                                      |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding Multi-turn Evaluation

Multi-turn evaluation tests conversational agents through realistic interactions, verifying context maintenance and handling of diverse user behaviors.

#### Why Multi-turn Evaluation Matters

| Single-turn Limitations | Multi-turn Benefits |
|:------------------------|:--------------------|
| Can't test context maintenance | Verifies conversation memory |
| Misses follow-up handling | Tests clarification flows |
| Same user style only | Evaluates diverse personas |
| No edge case discovery | Finds conversation-specific issues |

#### ActorSimulator Architecture

ActorSimulator creates AI-powered user personas that:
- Understand goals and know what they're trying to achieve
- Communicate with specific personality traits
- Maintain context and remember conversation history
- Signal completion with a `<stop/>` token when satisfied
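
Because the completion signal is plain text inside the actor's message, the driving loop has to look for it explicitly. A minimal sketch of that check follows. `split_stop_token` is a hypothetical helper introduced here for illustration (it is not part of Strands Evals); it reports whether the token is present and returns the message with the token removed.

```python
from typing import Tuple

STOP_TOKEN = "<stop/>"  # completion marker described in this tutorial


def split_stop_token(message: str, token: str = STOP_TOKEN) -> Tuple[bool, str]:
    """Return (done, text): whether the token appears, and the message without it."""
    done = token in message
    text = message.replace(token, "").strip()
    return done, text


done, text = split_stop_token("Perfect, that's all I needed. <stop/>")
print(done, repr(text))
```

Separating detection from display keeps the raw token out of transcripts shown to reviewers while still recording that the goal was reached.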

#### Actor Profile Components

| Component | Purpose |
|:----------|:--------|
| Traits | Personality characteristics (polite, demanding, confused) |
| Context | Background information and current situation |
| Goals | What the actor wants to accomplish |
| Task description | Clear success criteria for goal completion |
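
To make the four components concrete, here is a minimal sketch of a profile as plain data, rendered into a persona prompt. `ActorProfile` and `render_prompt` are hypothetical names used only for illustration; they are not the Strands Evals classes, and the library may structure profiles differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActorProfile:
    """Illustrative container for the four profile components (hypothetical)."""
    traits: List[str]
    context: str
    goals: List[str]
    task_description: str

    def render_prompt(self) -> str:
        """Render the profile as instructions for a simulated user."""
        return "\n".join([
            f"You are a user who is {', '.join(self.traits)}.",
            f"Situation: {self.context}",
            "Goals: " + "; ".join(self.goals),
            f"You are done when: {self.task_description}",
        ])


profile = ActorProfile(
    traits=["polite", "professional"],
    context="Has a toothache and a busy week",
    goals=["Schedule a dentist appointment"],
    task_description="The appointment is created with date, time, and location confirmed",
)
print(profile.render_prompt())
```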

### Environment Setup

Configure the AWS region and model settings for this tutorial.

In [None]:
import boto3

# AWS Configuration (inline - no config.py)
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'
DEFAULT_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

print(f"AWS Region: {AWS_REGION}")
print(f"Model: {DEFAULT_MODEL}")

### Setup and Imports

Import all necessary libraries for agent creation, actor simulation, and evaluation.

In [None]:
# Standard imports
import json
import sqlite3
import uuid
from datetime import datetime
from typing import List, Dict, Any

# Strands imports
from strands import Agent, tool
from strands.models import BedrockModel

# Strands Evals imports
from strands_evals import Dataset, Case
from strands_evals.simulation import ActorSimulator

# Display utilities
from IPython.display import Markdown, display

### Personal Assistant Agent System

We'll use a simplified Personal Assistant agent for multi-turn evaluation. This agent helps users manage their calendar with appointment scheduling and querying capabilities.

#### Agent Architecture

```
Personal Assistant
  └── Calendar Tools
        ├── create_appointment
        ├── list_appointments
        └── get_today_date
```

#### Agent Code

Agent code adapted from: /strands-samples/02-samples/05-personal-assistant/

### Calendar Tools Implementation

Define the calendar management tools that the Personal Assistant will use.

In [None]:
# Initialize SQLite database for appointments
def init_calendar_db():
    """Initialize the appointments database."""
    conn = sqlite3.connect("appointments.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS appointments (
            id TEXT PRIMARY KEY,
            date TEXT,
            location TEXT,
            title TEXT,
            description TEXT
        )
    """)
    conn.commit()
    conn.close()

init_calendar_db()

In [None]:
@tool
def create_appointment(date: str, location: str, title: str, description: str) -> str:
    """
    Create a new personal appointment in the database.

    Args:
        date (str): Date and time of the appointment (format: YYYY-MM-DD HH:MM).
        location (str): Location of the appointment.
        title (str): Title of the appointment.
        description (str): Description of the appointment.

    Returns:
        str: Formatted confirmation of the newly created appointment.
    """
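    try:
        # Validate, then normalize to zero-padded form so stored dates sort chronologically as text
        date = datetime.strptime(date, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:%M")
    except ValueError:
        return "Error: Date must be in format 'YYYY-MM-DD HH:MM'"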

    appointment_id = str(uuid.uuid4())

    conn = sqlite3.connect("appointments.db")
    cursor = conn.cursor()

    cursor.execute(
        "INSERT INTO appointments (id, date, location, title, description) VALUES (?, ?, ?, ?, ?)",
        (appointment_id, date, location, title, description),
    )

    conn.commit()
    conn.close()

    time_part = date.split(" ")[1] if " " in date else "No time specified"
    date_part = date.split(" ")[0] if " " in date else date
    confirmation = [
        "Appointment Created Successfully!",
        f"Date: {date_part}",
        f"Time: {time_part}",
        f"Location: {location}",
        f"Title: {title}",
        f"Description: {description}",
        f"ID: {appointment_id}"
    ]
    return "\n".join(confirmation)


@tool
def list_appointments() -> str:
    """
    List all appointments in the calendar.

    Returns:
        str: Formatted list of all appointments.
    """
    conn = sqlite3.connect("appointments.db")
    cursor = conn.cursor()

    cursor.execute("SELECT id, date, location, title, description FROM appointments ORDER BY date")
    appointments = cursor.fetchall()
    conn.close()

    if not appointments:
        return "No appointments found."

    result = ["Your Appointments:"]
    for apt in appointments:
        apt_id, date, location, title, description = apt
        result.append(f"\n[{apt_id}]")
        result.append(f"Title: {title}")
        result.append(f"Date: {date}")
        result.append(f"Location: {location}")
        result.append(f"Description: {description}")
    
    return "\n".join(result)


@tool
def get_today_date() -> str:
    """
    Get today's date in YYYY-MM-DD format.

    Returns:
        str: Today's date.
    """
    return datetime.now().strftime("%Y-%m-%d")

### Create Personal Assistant Agent

Build the Personal Assistant agent with calendar management capabilities.

In [None]:
# Create the Personal Assistant agent
model = BedrockModel(
    model_id=DEFAULT_MODEL,
)

personal_assistant = Agent(
    name="PersonalAssistant",
    model=model,
    system_prompt="""You are a helpful personal assistant specializing in calendar management.
    
You can help users:
- Create new appointments
- List existing appointments
- Check today's date

Always be helpful, clear, and concise. When creating appointments, confirm all details back to the user.
For dates, use the format YYYY-MM-DD HH:MM.""",
    tools=[create_appointment, list_appointments, get_today_date]
)

### Test the Personal Assistant

Before running multi-turn evaluation, let's verify the agent works correctly.

In [None]:
# Test the agent with a simple request
test_input = "What's today's date?"
personal_assistant(test_input)

### Scenario 1: Basic User Simulation (Single Persona)

Start with a simple scenario: a single polite user trying to schedule an appointment. This demonstrates the basic ActorSimulator API and conversation flow.

#### Scenario Details
- **User type**: Polite professional
- **Goal**: Schedule a dentist appointment
- **Expected behavior**: Clear communication, provides all details, confirms completion
- **Success criteria**: Appointment created with correct details

In [None]:
# Define test case for basic user simulation
basic_case = Case(
    name="Basic Appointment Scheduling",
    input="Hi, I need to schedule a dentist appointment for next Monday at 2pm at Downtown Dental Clinic.",
    metadata={
        "task_description": "User successfully schedules a dentist appointment with all required details",
        "actor_traits": ["polite", "clear", "professional"],
        "expected_turns": 3
    }
)

In [None]:
# Create ActorSimulator from test case
basic_actor = ActorSimulator.from_case_for_user_simulator(
    case=basic_case,
    max_turns=5
)

print(f"Goal: {basic_case.metadata['task_description']}")

In [None]:
# Run the multi-turn conversation
conversation_history = []
user_message = basic_case.input
turn = 0

while basic_actor.has_next():
    turn += 1
    print(f"\nTURN {turn}")
    print("-"*80)
    
    # Display user message
    print(f"\nUSER: {user_message}")
    conversation_history.append({"role": "user", "content": user_message})
    
    # Agent responds
    agent_response = personal_assistant(user_message)
    print(f"\nAGENT: {agent_response}")
    conversation_history.append({"role": "agent", "content": str(agent_response)})
    
    # Actor responds
    actor_result = basic_actor.act(str(agent_response))
    user_message = str(actor_result.structured_output.message)
    
    # Check for completion
    if "<stop/>" in user_message:
        print(f"\nUSER: {user_message.replace('<stop/>', '[GOAL COMPLETED]')}")
        conversation_history.append({"role": "user", "content": user_message})
        break

### Conversation Analysis

Analyze the conversation to understand actor behavior and goal completion.
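
Transcript text is only indirect evidence: the agent can say an appointment was created even if the tool call did not succeed. A complementary check is to query the database directly. The sketch below assumes the `appointments` schema defined earlier and uses an in-memory database with one sample row so that it runs on its own. `appointment_exists` is a hypothetical helper, not part of the tutorial's agent.

```python
import sqlite3


def appointment_exists(conn: sqlite3.Connection, date_prefix: str, keyword: str) -> bool:
    """True if an appointment on the given date mentions keyword in its title or location."""
    pattern = f"%{keyword.lower()}%"
    row = conn.execute(
        """
        SELECT 1 FROM appointments
        WHERE date LIKE ? AND (lower(title) LIKE ? OR lower(location) LIKE ?)
        LIMIT 1
        """,
        (f"{date_prefix}%", pattern, pattern),
    ).fetchone()
    return row is not None


# In-memory stand-in for appointments.db, using the same columns as the tutorial
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments (id TEXT PRIMARY KEY, date TEXT, location TEXT, title TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO appointments VALUES (?, ?, ?, ?, ?)",
    ("demo-1", "2025-06-02 14:00", "Downtown Dental Clinic", "Dentist appointment", "Checkup"),
)

print(appointment_exists(conn, "2025-06-02", "dentist"))
print(appointment_exists(conn, "2025-06-03", "dentist"))
```

Against the real database, pass `sqlite3.connect("appointments.db")` instead of the in-memory connection.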

In [None]:
# Analyze the conversation
analysis = f"""
## Basic User Simulation Analysis

**Conversation Metrics:**
- Total turns: {turn}
- Total messages: {len(conversation_history)}
- Expected turns: {basic_case.metadata['expected_turns']}
- Efficiency: {'On target' if turn <= basic_case.metadata['expected_turns'] else 'Exceeded expected'}

**Actor Behavior:**
- Persona: {', '.join(basic_case.metadata['actor_traits'])}
- Goal completion: {'Success' if '<stop/>' in user_message else 'Incomplete'}
- Communication style: Clear and professional

**Expected Behavior (verify against the transcript above):**
1. Actor maintains a consistent personality throughout the conversation
2. Goal completion token is emitted once the appointment is confirmed
3. Conversation flows naturally with appropriate follow-ups
4. Agent handles the request and confirms the details

**Success Criteria:**
- Appointment created: {'Yes' if any(msg['role'] == 'agent' and 'appointment created' in msg['content'].lower() for msg in conversation_history) else 'No'}
- Details confirmed: {'Yes' if any(msg['role'] == 'agent' and 'dentist' in msg['content'].lower() for msg in conversation_history) else 'No'}
- User satisfied: {'Yes' if '<stop/>' in user_message else 'No'}
"""

display(Markdown(analysis))

### Scenario 2: Diverse Personas

Test the agent with three different user personas to evaluate robustness:
1. **Polite User**: Courteous and clear communication
2. **Demanding User**: Impatient and expects immediate results
3. **Confused User**: Unclear requests and needs clarification

This scenario demonstrates how ActorSimulator can generate varied user behaviors to test agent resilience. Use this approach when you need to validate that your agent can handle different communication styles and personalities that you'll encounter in production.

In [None]:
# Define test cases for diverse personas
persona_cases = [
    Case(
        name="Polite User",
        input="Good morning! I'd like to schedule a team meeting for Thursday at 10am in Conference Room A. It's a project planning session.",
        metadata={
            "task_description": "User politely schedules a team meeting and confirms details",
            "persona": "polite",
            "traits": ["courteous", "professional", "patient"]
        }
    ),
    Case(
        name="Demanding User",
        input="I need a meeting scheduled NOW. Tomorrow, 3pm, Board Room. Client presentation. Make it happen.",
        metadata={
            "task_description": "User demands immediate scheduling and expects quick confirmation",
            "persona": "demanding",
            "traits": ["impatient", "direct", "urgent"]
        }
    ),
    Case(
        name="Confused User",
        input="Um, I think I need to set up something... maybe a meeting? Or was it a call? I'm not sure about the time...",
        metadata={
            "task_description": "User is confused but eventually provides enough details to schedule an appointment",
            "persona": "confused",
            "traits": ["uncertain", "vague", "needs-guidance"]
        }
    )
]

In [None]:
# Run multi-turn conversations for all personas
persona_results = []

for case in persona_cases:
    print("\n" + "="*80)
    print(f"PERSONA: {case.name.upper()}")
    print("="*80)
    
    # Create actor for this persona
    actor = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=5
    )
    
    conversation = []
    user_message = case.input
    turn = 0
    
    while actor.has_next() and turn < 5:
        turn += 1
        print(f"\n[Turn {turn}]")
        
        # User message
        print(f"USER: {user_message}")
        conversation.append({"role": "user", "content": user_message})
        
        # Agent response
        agent_response = personal_assistant(user_message)
        print(f"AGENT: {agent_response}")
        conversation.append({"role": "agent", "content": str(agent_response)})
        
        # Actor response
        actor_result = actor.act(str(agent_response))
        user_message = str(actor_result.structured_output.message)
        
        # Check for completion
        if "<stop/>" in user_message:
            print("\n[Goal Completed]")
            conversation.append({"role": "user", "content": user_message})
            break
    
    # Store results
    persona_results.append({
        "case": case,
        "conversation": conversation,
        "turns": turn,
        "completed": "<stop/>" in user_message
    })

### Persona Comparison Analysis

Compare how the agent handled different user personas.

In [None]:
# Create comparison table
comparison_md = """
## Persona Evaluation Comparison

| Persona | Turns | Completed | Avg Message Length | Traits |
|:--------|:------|:----------|:-------------------|:-------|
"""

for result in persona_results:
    case = result['case']
    avg_length = sum(len(msg['content']) for msg in result['conversation']) / len(result['conversation'])
    traits = ', '.join(case.metadata['traits'])
    completed_icon = "Yes" if result['completed'] else "No"
    
    comparison_md += f"| {case.name} | {result['turns']} | {completed_icon} | {avg_length:.0f} chars | {traits} |\n"

comparison_md += """

### Key Insights (Typical Patterns)

**Polite User:**
- Most efficient conversation flow
- Clear communication reduces back-and-forth
- Agent responds naturally to courteous tone
- Fastest goal completion

**Demanding User:**
- Direct communication style
- Agent maintains professionalism despite urgency
- Similar efficiency to polite user
- Demonstrates agent's ability to handle pressure

**Confused User:**
- Requires more clarification turns
- Agent successfully guides user to provide details
- More messages exchanged due to vagueness
- Tests agent's patience and clarification abilities

### Agent Resilience Assessment

**Strengths:**
- Handles diverse communication styles effectively
- Maintains professionalism across all personas
- Successfully guides confused users to goal completion
- Adapts tone appropriately to user style

**Observations (confirm against the table above):**
- Turn count varies by persona (2-5 turns typical)
- The Completed column shows which personas reached their goal
- Any persona marked "No" points to a conversation worth reviewing manually
"""

display(Markdown(comparison_md))

### Scenario 3: Dataset + Simulation Pipeline

Scale up evaluation by creating a dataset of test cases and running automated simulations. This demonstrates production-ready evaluation workflows.

#### Pipeline Architecture

```
Dataset (multiple cases)
  → For each case: Create ActorSimulator → Run conversation → Collect metrics
  → Aggregate Results (success rate, avg turns, error analysis)
```
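
Before spending model calls, the flow in this diagram can be dry-run with deterministic stand-ins. The sketch below is an illustration under stated assumptions: `ScriptedActor` and `echo_agent` are hypothetical stubs that only mimic the `has_next()` / `act()` shape used in this tutorial, so the control flow and metric aggregation can be checked offline.

```python
from typing import Callable, Dict, List


class ScriptedActor:
    """Hypothetical stub: replays fixed user replies instead of calling a model."""

    def __init__(self, replies: List[str], max_turns: int = 5):
        self.replies = list(replies)
        self.max_turns = max_turns
        self.turns_taken = 0

    def has_next(self) -> bool:
        return self.turns_taken < self.max_turns and bool(self.replies)

    def act(self, agent_message: str) -> str:
        self.turns_taken += 1
        return self.replies.pop(0)


def echo_agent(user_message: str) -> str:
    """Hypothetical stub agent that simply acknowledges the request."""
    return f"Done: {user_message}"


def run_case(first_message: str, actor: ScriptedActor, agent: Callable[[str], str]) -> Dict:
    """Run one conversation and collect per-case metrics."""
    user_message, turns, completed = first_message, 0, False
    while actor.has_next():
        turns += 1
        reply = agent(user_message)
        user_message = actor.act(reply)
        if "<stop/>" in user_message:
            completed = True
            break
    return {"turns": turns, "completed": completed}


results = [
    run_case("Book a dentist visit", ScriptedActor(["Monday at 2pm", "Thanks <stop/>"]), echo_agent),
    run_case("Set something up", ScriptedActor(["Not sure", "Maybe later"]), echo_agent),
]
success_rate = 100 * sum(r["completed"] for r in results) / len(results)
avg_turns = sum(r["turns"] for r in results) / len(results)
print(f"success rate: {success_rate:.1f}%, avg turns: {avg_turns:.1f}")
```

Once the loop and aggregation behave as expected with stubs, the same structure can be pointed at the real agent and actor.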

In [None]:
# Create evaluation dataset
evaluation_cases = [
    Case(
        name="Schedule Doctor Appointment",
        input="I need to book a doctor's appointment for next Friday at 9am at City Medical Center.",
        metadata={
            "task_description": "Schedule a medical appointment with specific date, time, and location",
            "category": "healthcare",
            "complexity": "simple"
        }
    ),
    Case(
        name="Check Existing Appointments",
        input="Can you show me what appointments I have scheduled?",
        metadata={
            "task_description": "List all existing appointments in the calendar",
            "category": "query",
            "complexity": "simple"
        }
    ),
    Case(
        name="Schedule Multiple Appointments",
        input="I need to schedule two appointments: a dentist visit on Monday at 2pm and a haircut on Wednesday at 11am.",
        metadata={
            "task_description": "Schedule multiple appointments in a single conversation",
            "category": "multiple-tasks",
            "complexity": "medium"
        }
    ),
    Case(
        name="Vague Appointment Request",
        input="I think I need to set something up sometime next week...",
        metadata={
            "task_description": "Handle vague request and guide user to provide necessary details",
            "category": "clarification",
            "complexity": "medium"
        }
    ),
    Case(
        name="Check Date and Schedule",
        input="What's today's date? I want to schedule something for three days from now.",
        metadata={
            "task_description": "Provide current date and schedule appointment with relative date",
            "category": "multi-step",
            "complexity": "medium"
        }
    )
]

dataset = Dataset(
    cases=evaluation_cases
)

In [None]:
# Define multi-turn evaluation function
def run_multi_turn_evaluation(case: Case, max_turns: int = 5) -> Dict[str, Any]:
    """
    Run a multi-turn conversation evaluation for a single case.
    
    Args:
        case: Test case to evaluate
        max_turns: Maximum conversation turns
    
    Returns:
        Dictionary containing evaluation results
    """
    # Create actor
    actor = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=max_turns
    )
    
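    # Start from a clean slate so earlier cases don't leak into this conversation.
    # Assumption: the shared Strands agent keeps its history in the `messages` list.
    personal_assistant.messages.clear()

    # Run conversation
    conversation = []
    user_message = case.input
    turn = 0
    goal_completed = False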
    
    while actor.has_next() and turn < max_turns:
        turn += 1
        
        # User message
        conversation.append({"role": "user", "content": user_message})
        
        # Agent response
        try:
            agent_response = personal_assistant(user_message)
            conversation.append({"role": "agent", "content": str(agent_response)})
        except Exception as e:
            conversation.append({"role": "agent", "content": f"Error: {str(e)}"})
            break
        
        # Actor response
        actor_result = actor.act(str(agent_response))
        user_message = str(actor_result.structured_output.message)
        
        # Check for completion
        if "<stop/>" in user_message:
            goal_completed = True
            conversation.append({"role": "user", "content": user_message})
            break
    
    # Calculate metrics
    total_messages = len(conversation)
    user_messages = [msg for msg in conversation if msg["role"] == "user"]
    agent_messages = [msg for msg in conversation if msg["role"] == "agent"]
    
    return {
        "case_name": case.name,
        "category": case.metadata["category"],
        "complexity": case.metadata["complexity"],
        "turns": turn,
        "total_messages": total_messages,
        "goal_completed": goal_completed,
        "conversation": conversation,
        "user_message_count": len(user_messages),
        "agent_message_count": len(agent_messages)
    }

In [None]:
# Run automated evaluation pipeline
pipeline_results = []

for idx, case in enumerate(evaluation_cases, 1):
    print(f"\n[{idx}/{len(evaluation_cases)}] Evaluating: {case.name}")
    print("-"*80)
    
    result = run_multi_turn_evaluation(case, max_turns=5)
    pipeline_results.append(result)
    
    print(f"  Category: {result['category']}")
    print(f"  Complexity: {result['complexity']}")
    print(f"  Turns: {result['turns']}")
    print(f"  Completed: {'Yes' if result['goal_completed'] else 'No'}")

### Pipeline Results Analysis

Aggregate and analyze results from the automated evaluation pipeline.

In [None]:
# Calculate aggregate metrics
total_cases = len(pipeline_results)
completed_cases = sum(1 for r in pipeline_results if r['goal_completed'])
success_rate = (completed_cases / total_cases) * 100
avg_turns = sum(r['turns'] for r in pipeline_results) / total_cases
avg_messages = sum(r['total_messages'] for r in pipeline_results) / total_cases

# Group by complexity
by_complexity = {}
for result in pipeline_results:
    complexity = result['complexity']
    if complexity not in by_complexity:
        by_complexity[complexity] = []
    by_complexity[complexity].append(result)

# Create results table
results_md = f"""
## Automated Evaluation Pipeline Results

### Overall Metrics

- **Total test cases**: {total_cases}
- **Completed successfully**: {completed_cases}
- **Success rate**: {success_rate:.1f}%
- **Average turns per conversation**: {avg_turns:.1f}
- **Average messages per conversation**: {avg_messages:.1f}

### Results by Case

| Case | Category | Complexity | Turns | Completed |
|:-----|:---------|:-----------|:------|:----------|
"""

for result in pipeline_results:
    completed_icon = "Yes" if result['goal_completed'] else "No"
    results_md += f"| {result['case_name']} | {result['category']} | {result['complexity']} | {result['turns']} | {completed_icon} |\n"

results_md += """

### Performance by Complexity

"""

for complexity, results in sorted(by_complexity.items()):
    count = len(results)
    completed = sum(1 for r in results if r['goal_completed'])
    rate = (completed / count) * 100
    avg_t = sum(r['turns'] for r in results) / count
    
    results_md += f"""
**{complexity.upper()} Complexity:**
- Cases: {count}
- Success rate: {rate:.1f}%
- Average turns: {avg_t:.1f}

"""

display(Markdown(results_md))

### Dev-to-Prod Workflow

Move from development evaluation to production deployment with metric comparison.

#### Workflow Stages

| Stage | Focus | Key Activities |
|:------|:------|:---------------|
| Development | Small dataset, detailed logging | Comprehensive scenario coverage, manual review |
| Staging | Scale up dataset, track metrics | Statistical analysis, anomaly detection |
| Production | Monitor live metrics | Compare with baseline, alert on degradation |
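
The staging row mentions statistical analysis. One reason it matters is that a success rate measured on a handful of cases carries wide uncertainty. As a sketch, the helper below computes an approximate 95% Wilson score interval for a success rate. The Wilson interval is a standard choice that this tutorial does not itself prescribe, and `wilson_interval` is a name introduced here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


for successes, total in [(5, 5), (95, 100)]:
    low, high = wilson_interval(successes, total)
    print(f"{successes}/{total}: {low:.1%} to {high:.1%}")
```

With 5 of 5 cases passing, the lower bound of the interval is still below 60%, which is a reason to grow the dataset before treating development numbers as a baseline.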

In [None]:
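# Development metrics from the pipeline run above
dev_metrics = {
    "environment": "development",
    "test_cases": total_cases,
    "success_rate": success_rate,
    "avg_turns": avg_turns,
    "avg_messages": avg_messages,
    "goal_completion_rate": (completed_cases / total_cases) * 100,
    # Share of conversations where the agent call raised; run_multi_turn_evaluation
    # records these as an agent message starting with "Error: "
    "error_rate": 100 * sum(
        1 for r in pipeline_results
        if any(m["role"] == "agent" and m["content"].startswith("Error: ") for m in r["conversation"])
    ) / total_cases,
    "dataset_size": len(evaluation_cases)
}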

# Placeholder production metrics for illustration only (in practice, collect these from the live system)
prod_metrics = {
    "environment": "production",
    "conversations": 1000,
    "success_rate": 92.5,
    "avg_turns": 3.2,
    "avg_messages": 6.8,
    "goal_completion_rate": 94.0,
    "error_rate": 2.5,
    "avg_response_time_ms": 450,
    "cost_per_conversation": 0.012
}

In [None]:
# Create dev-to-prod comparison
comparison_md = """
## Development vs Production Metrics Comparison

### Metric Comparison Table

| Metric | Development | Production | Difference | Status |
|:-------|:------------|:-----------|:-----------|:-------|
"""

# Compare common metrics
metrics_to_compare = [
    ("Success Rate", "success_rate", "%", "higher_better"),
    ("Avg Turns", "avg_turns", "turns", "lower_better"),
    ("Avg Messages", "avg_messages", "messages", "lower_better"),
    ("Goal Completion", "goal_completion_rate", "%", "higher_better"),
    ("Error Rate", "error_rate", "%", "lower_better")
]

for metric_name, metric_key, unit, direction in metrics_to_compare:
    dev_val = dev_metrics.get(metric_key, 0)
    prod_val = prod_metrics.get(metric_key, 0)
    diff = prod_val - dev_val
    
    if direction == "higher_better":
        status = "Better" if diff > 0 else "Worse" if diff < 0 else "Same"
    else:
        status = "Better" if diff < 0 else "Worse" if diff > 0 else "Same"
    
    comparison_md += f"| {metric_name} | {dev_val:.1f}{unit} | {prod_val:.1f}{unit} | {diff:+.1f}{unit} | {status} |\n"

comparison_md += f"""

### Production-Only Metrics

- **Average Response Time**: {prod_metrics['avg_response_time_ms']}ms
- **Cost per Conversation**: ${prod_metrics['cost_per_conversation']:.4f}
- **Total Conversations**: {prod_metrics['conversations']:,}

### Analysis

**Development Phase:**
- Evaluated with {dev_metrics['test_cases']} carefully crafted test cases
- Success rate: {dev_metrics['success_rate']:.1f}%
- Comprehensive coverage of scenarios
- Error rate: {dev_metrics['error_rate']:.1f}% in a controlled environment

**Production Phase:**
- Real users with diverse behaviors
- Success rate: {prod_metrics['success_rate']:.1f}% (edge cases often pull this below the development rate)
- Error rate ({prod_metrics['error_rate']:.1f}%) typically comes from unexpected inputs
- Turn counts can fall as users learn the system

**Key Insights:**

1. **Success Rate**: A production rate close to the development rate suggests good test coverage
2. **Conversation Efficiency**: Compare average turns; fewer turns in production means users complete tasks faster
3. **Error Handling**: A small production error rate is expected; investigate sustained increases
4. **Goal Completion**: A high production completion rate supports the agent design

**Recommendations:**

1. **Monitor Production**: Continuously track metrics to detect degradation
2. **Update Test Cases**: Add production edge cases to development dataset
3. **Optimize Performance**: Focus on reducing response time further
4. **Cost Management**: Monitor cost per conversation for budget planning
5. **Error Analysis**: Investigate and fix causes of production errors

### Dev-to-Prod Checklist

Before deploying to production:

- [ ] Success rate > 90% in development
- [ ] All critical scenarios tested
- [ ] Error handling implemented for common failures
- [ ] Performance benchmarks established
- [ ] Monitoring and alerting configured
- [ ] Rollback plan prepared
- [ ] Cost estimates validated
- [ ] User feedback mechanism in place
"""

display(Markdown(comparison_md))

## Best Practices for Multi-turn Evaluation

### When to Use ActorSimulator

| Use When | Don't Use When |
|:---------|:---------------|
| Testing conversational agents | Single-turn Q&A is sufficient |
| Evaluating context maintenance | Non-conversational systems |
| Testing diverse user personas | Deterministic conversation flow |
| Automating conversation testing at scale | |

### Designing Effective Actor Profiles

| Element | Good Practice |
|:--------|:--------------|
| Goals | Specific, measurable success criteria |
| Traits | Realistic behaviors matching real users |
| Context | Enough background for natural conversation |
| Completion | Well-defined conditions for `<stop/>` token |

### Scaling Evaluation

| Scale | Approach |
|:------|:---------|
| Small (< 10 cases) | Run individually, full logging, manual review |
| Large (100+ cases) | Batch process, aggregate metrics, sample review |
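
For the large-scale row, "sample review" can be made reproducible by sampling per category with a fixed seed. The following is a minimal sketch, assuming result dictionaries shaped like those returned by `run_multi_turn_evaluation` (only the `category` and `goal_completed` keys are used). `sample_for_review` is a hypothetical helper that always includes failures and samples a few successes from each category.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def sample_for_review(
    results: List[Dict[str, Any]], per_category: int = 2, seed: int = 0
) -> List[Dict[str, Any]]:
    """Pick every failed conversation plus up to per_category successes from each category."""
    rng = random.Random(seed)  # fixed seed keeps the review sample reproducible
    failures = [r for r in results if not r["goal_completed"]]
    successes_by_category = defaultdict(list)
    for r in results:
        if r["goal_completed"]:
            successes_by_category[r["category"]].append(r)
    sampled = []
    for category in sorted(successes_by_category):
        group = successes_by_category[category]
        sampled.extend(rng.sample(group, min(per_category, len(group))))
    return failures + sampled


# Synthetic results for illustration only
demo_results = [
    {"case_name": f"case-{i}", "category": "query" if i % 2 else "healthcare", "goal_completed": i % 10 != 0}
    for i in range(1, 101)
]
review = sample_for_review(demo_results)
print(len(review), "conversations selected for manual review")
```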

### Production Monitoring

Track: success rate, average turns, completion rate, error rate, response latency, cost per conversation.

Alert on: success rate drops, error spikes, turn-count increases, and latency exceeding limits.
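
These alert conditions can be expressed as threshold checks against a baseline. This is a sketch under assumptions: the metric names mirror the dictionaries used in this tutorial, the tolerance values are arbitrary placeholders rather than recommendations, and `check_alerts` is a hypothetical helper. Latency could be added to the table in the same way.

```python
from typing import Dict, List

# Placeholder tolerances (assumptions): allowed change from baseline before alerting
TOLERANCES = {
    "success_rate": -5.0,  # alert if it drops by more than 5 points
    "error_rate": 2.0,     # alert if it rises by more than 2 points
    "avg_turns": 1.0,      # alert if conversations get more than 1 turn longer
}


def check_alerts(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Return a human-readable alert for each metric that moved past its tolerance."""
    alerts = []
    for metric, tolerance in TOLERANCES.items():
        delta = current[metric] - baseline[metric]
        # Negative tolerance guards against drops, positive tolerance against rises
        breached = delta < tolerance if tolerance < 0 else delta > tolerance
        if breached:
            alerts.append(f"{metric}: {baseline[metric]:.1f} -> {current[metric]:.1f} ({delta:+.1f})")
    return alerts


baseline = {"success_rate": 92.5, "error_rate": 2.5, "avg_turns": 3.2}
current = {"success_rate": 85.0, "error_rate": 3.0, "avg_turns": 4.5}
for alert in check_alerts(baseline, current):
    print("ALERT", alert)
```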

### Common Pitfalls

| Pitfall | Solution |
|:--------|:---------|
| Insufficient persona diversity | Include demanding, confused, edge-case personas |
| Overly strict success criteria | Define clear, achievable goal-completion conditions |
| Too few test cases | Aim for 20+ cases covering diverse situations |
| Ignoring production metrics | Update tests based on production insights |

## Summary

You've learned how to implement multi-turn evaluation using ActorSimulator. You now understand:

- **ActorSimulator API**: How to create realistic user simulations with `from_case_for_user_simulator`
- **Actor profiles**: Designing personas with traits, context, goals, and success criteria
- **Goal completion**: Using `<stop/>` tokens to signal when actor goals are achieved
- **Multi-turn conversations**: Running natural conversation flows between actors and agents
- **Diverse personas**: Testing agents with polite, demanding, and confused user types
- **Automated pipelines**: Scaling evaluation with datasets and batch processing
- **Dev-to-prod workflow**: Moving from development testing to production deployment
- **Metric comparison**: Tracking and comparing development vs production performance
- **Best practices**: When to use ActorSimulator, how to design profiles, and scaling strategies

Multi-turn evaluation with ActorSimulator helps you build conversational agents that handle real-world user interactions. By simulating diverse user behaviors and conversation patterns, you can identify issues early, tune agent performance, and check production readiness before deployment.