## Multi-Agent Evaluation - Evaluating Collaborative Agent Systems

This tutorial demonstrates how to evaluate multi-agent systems where multiple specialized agents collaborate to complete complex tasks. You'll learn to assess individual agent performance, collective system outcomes, and the quality of agent coordination and handoffs.

### What You'll Learn
- Understand multi-agent architectures (agent-as-tool pattern)
- Evaluate individual agent performance separately
- Assess collective system performance as a whole
- Measure coordination quality using InteractionsEvaluator
- Analyze agent handoffs and collaboration patterns
- Generate comprehensive multi-agent evaluation reports

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Advanced - Evaluating complex multi-agent collaborative systems               |
| Tutorial components | Multi-agent orchestrator, InteractionsEvaluator, coordination analysis        |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Advanced                                                                      |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding Multi-Agent Systems

A **multi-agent system** consists of multiple specialized AI agents that collaborate to solve complex problems, coordinated by an orchestrator.

#### Agent-as-Tool Pattern

| Component | Role |
|:----------|:-----|
| Orchestrator Agent | Routes user queries to appropriate specialists |
| Specialist Agents | Wrapped as callable tools with focused expertise |
| Coordination | Orchestrator decides which specialist(s) to invoke |
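
The pattern can be sketched without any SDK. In the minimal, hypothetical example below, each specialist is a plain function, and keyword matching stands in for the routing decision that the orchestrator's model makes in the real system. The names `SPECIALISTS`, `ROUTING_KEYWORDS`, and `route` are illustrative assumptions and are not part of Strands.

```python
from typing import Callable, Dict, List


# Hypothetical specialists: in the real system each one wraps an LLM agent.
def technical_support(query: str) -> str:
    return f"[technical_support] troubleshooting: {query}"


def billing_support(query: str) -> str:
    return f"[billing_support] checking account: {query}"


# Registry of specialists exposed to the orchestrator as callable tools.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "technical_support": technical_support,
    "billing_support": billing_support,
}

# Keyword rules are an assumption standing in for the model's routing choice.
ROUTING_KEYWORDS: Dict[str, List[str]] = {
    "technical_support": ["crash", "error", "bug"],
    "billing_support": ["charge", "payment", "refund"],
}


def route(query: str) -> str:
    """Pick a specialist by keyword and return its answer, or answer directly."""
    lowered = query.lower()
    for name, keywords in ROUTING_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return SPECIALISTS[name](query)
    return "[orchestrator] How can I help you today?"


print(route("My app keeps crashing at login"))
print(route("I was charged twice this month"))
```

The orchestrator only sees each specialist's name and call signature, so a specialist can be swapped out or tested in isolation without changing the routing layer.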

#### Architecture

```
User Query → Orchestrator Agent
                 ├──> Technical Support Agent
                 ├──> Billing Support Agent
                 ├──> Product Info Agent
                 └──> Returns & Exchanges Agent
```

#### Why Evaluate Multi-Agent Systems?

| Concern | Impact |
|:--------|:-------|
| Individual agent quality | Affects overall system performance |
| Poor coordination | Leads to incorrect or inefficient outcomes |
| Handoff quality | Impacts user experience |
| Emergent behaviors | May differ from individual agent capabilities |

#### Three Evaluation Dimensions

| Dimension | Focus | Metrics |
|:----------|:------|:--------|
| Individual Performance | Each specialist agent's task competence | Tool selection, response quality, completeness |
| Collective Performance | Overall system output | Final quality, task completion, user satisfaction |
| Coordination Quality | Agent collaboration | Routing accuracy, information passing, handoff smoothness |

### Environment Setup

Configure AWS region and model settings for this tutorial.

In [None]:
import boto3

# AWS Configuration (inline - no config.py)
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'
DEFAULT_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

print(f"AWS Region: {AWS_REGION}")
print(f"Model: {DEFAULT_MODEL}")

### Setup and Imports

Import all necessary libraries for multi-agent creation, evaluation, and coordination analysis.

In [None]:
import os
from typing import Dict, List, Any
import uuid

# Strands imports
from strands import Agent, tool

# Strands Evals imports
from strands_evals import Dataset, Case
from strands_evals.evaluators import OutputEvaluator, ToolSelectionAccuracyEvaluator, InteractionsEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_evals.types import Interaction
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

# Display utilities
from IPython.display import Markdown, display

# Bypass tool consent for automated execution
os.environ["BYPASS_TOOL_CONSENT"] = "true"

# Setup telemetry for trace-based evaluators
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

print("All imports successful")

### Multi-Agent Customer Support System

We'll build a customer support system with an orchestrator and four specialized agents. This demonstrates the agent-as-tool pattern.

#### Agent Code

Multi-agent code adapted from: /strands-samples/01-tutorials/02-multi-agent-systems/01-agent-as-tool/

#### Step 1: Create Fake Database

Create a simple database to simulate customer data, products, and orders.

In [None]:
# Fake database for customer support
FAKE_DATABASE = {
    "customers": {
        "user123": {
            "name": "John Doe",
            "email": "john@example.com",
            "subscription": "Pro",
            "last_payment": "2024-01-15",
        },
        "user456": {
            "name": "Jane Smith",
            "email": "jane@example.com",
            "subscription": "Basic",
            "last_payment": "2024-01-10",
        },
    },
    "products": {
        "pro_plan": {
            "name": "Pro Plan",
            "price": 29.99,
            "features": ["Advanced analytics", "Priority support", "Custom integrations"],
        },
        "basic_plan": {
            "name": "Basic Plan",
            "price": 9.99,
            "features": ["Basic analytics", "Email support"]
        },
    },
    "orders": {
        "order789": {
            "customer_id": "user123",
            "product": "pro_plan",
            "status": "shipped",
            "date": "2024-01-20"
        },
        "order101": {
            "customer_id": "user456",
            "product": "basic_plan",
            "status": "processing",
            "date": "2024-01-22"
        },
    },
    "tickets": [],
}

print("Fake database created with customers, products, orders, and tickets")

#### Step 2: Define Database Tools

Create tools for looking up customer, product, and order information, and creating support tickets.

In [None]:
@tool
def lookup_customer(customer_id: str) -> str:
    """Look up customer information by ID."""
    customer = FAKE_DATABASE["customers"].get(customer_id)
    if customer:
        return f"Customer {customer_id}: {customer['name']} ({customer['email']}) - {customer['subscription']} plan, last payment: {customer['last_payment']}"
    return f"Customer {customer_id} not found"


@tool
def lookup_product(product_id: str) -> str:
    """Look up product information by ID."""
    product = FAKE_DATABASE["products"].get(product_id)
    if product:
        features = ", ".join(product["features"])
        return f"Product {product_id}: {product['name']} - ${product['price']}/month. Features: {features}"
    return f"Product {product_id} not found"


@tool
def lookup_order(order_id: str) -> str:
    """Look up order information by ID."""
    order = FAKE_DATABASE["orders"].get(order_id)
    if order:
        return f"Order {order_id}: Customer {order['customer_id']}, Product {order['product']}, Status: {order['status']}, Date: {order['date']}"
    return f"Order {order_id} not found"


@tool
def create_ticket(customer_id: str, issue_type: str, description: str) -> str:
    """Create a support ticket."""
    import datetime
    
    ticket_id = f"ticket{len(FAKE_DATABASE['tickets']) + 1}"
    ticket = {
        "id": ticket_id,
        "customer_id": customer_id,
        "issue_type": issue_type,
        "description": description,
        "status": "open",
        "created": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }
    FAKE_DATABASE["tickets"].append(ticket)
    return f"Created {ticket_id} for customer {customer_id}: {issue_type} - {description}"


print("Database tools defined: lookup_customer, lookup_product, lookup_order, create_ticket")

#### Step 3: Define Specialist Agents as Tools

Create four specialized support agents, each wrapped as a tool that can be called by the orchestrator.

In [None]:
@tool
def technical_support(query: str) -> str:
    """Handle technical issues, bugs, and troubleshooting."""
    agent = Agent(
        name="technical_support",
        model=DEFAULT_MODEL,
        system_prompt="You are a technical support specialist. Help users troubleshoot technical issues, bugs, and product functionality problems. Use available tools to look up customer info and create tickets.",
        tools=[lookup_customer, create_ticket],
    )
    return str(agent(query))


@tool
def billing_support(query: str) -> str:
    """Handle billing, payments, and account issues."""
    agent = Agent(
        name="billing_support",
        model=DEFAULT_MODEL,
        system_prompt="You are a billing specialist. Help with payment issues, subscription questions, refunds, and account billing problems. Use tools to look up customer information.",
        tools=[lookup_customer],
    )
    return str(agent(query))


@tool
def product_info(query: str) -> str:
    """Provide product information and feature explanations."""
    agent = Agent(
        name="product_info",
        model=DEFAULT_MODEL,
        system_prompt="You are a product specialist. Explain product features, capabilities, and help users understand how to use products effectively. Use tools to look up product details.",
        tools=[lookup_product],
    )
    return str(agent(query))


@tool
def returns_exchanges(query: str) -> str:
    """Handle returns, exchanges, and any order issues."""
    agent = Agent(
        name="returns_exchanges",
        model=DEFAULT_MODEL,
        system_prompt="You are a returns specialist. Help with product returns, exchanges, order modifications, and shipping issues. Use tools to look up order and customer information.",
        tools=[lookup_order, lookup_customer],
    )
    return str(agent(query))


print("Specialist agents created: technical_support, billing_support, product_info, returns_exchanges")

#### Step 4: Create Orchestrator Agent

The orchestrator routes customer queries to the appropriate specialist agent.

In [None]:
# Define orchestrator system prompt with clear routing guidance
ORCHESTRATOR_PROMPT = """
You are a customer support router. Direct queries to the appropriate specialist:
- Technical issues, bugs, errors → technical_support
- Billing, payments, subscriptions → billing_support  
- Product questions, features → product_info
- Returns, exchanges, orders → returns_exchanges
- Simple greetings → answer directly
"""

# Create the orchestrator agent
orchestrator = Agent(
    name="orchestrator",
    model=DEFAULT_MODEL,
    system_prompt=ORCHESTRATOR_PROMPT,
    tools=[technical_support, billing_support, product_info, returns_exchanges],
)

print("Orchestrator agent created")
print("\nMulti-agent customer support system ready!")

### Test the Multi-Agent System

Before evaluating, let's test the system to see how the orchestrator routes queries to specialists.

In [None]:
# Test query
test_query = "My app keeps crashing when I try to login. My customer ID is user123."

print(f"Test Query: {test_query}\n")
print("="*80)

# Execute the query
response = orchestrator(test_query)

print(f"\nResponse: {response}\n")
print("="*80)

# Show which tools were called.
# Assumes Strands stores each message as a dict whose content blocks
# carry tool calls under the "toolUse" key.
print("\nTools called during execution:")
for msg in orchestrator.messages:
    for content_block in msg.get("content", []):
        tool_use = content_block.get("toolUse")
        if tool_use:
            print(f"  - {tool_use.get('name', 'tool')}")

### Dimension 1: Individual Agent Performance

First, we'll evaluate how well each specialist agent performs its specific task when called directly.

#### Create Test Cases for Individual Agents

Define test cases targeting each specialist agent's domain.

In [None]:
# Test cases for individual agent evaluation
individual_test_cases = [
    Case(
        name="Technical Support - App Crash",
        input="My app keeps crashing when I try to login for user123",
        expected_output="Technical support should look up customer info and create a ticket for the crash issue"
    ),
    Case(
        name="Billing Support - Double Charge",
        input="I was charged twice this month, my customer ID is user456",
        expected_output="Billing support should look up customer payment history and address the double charge concern"
    ),
    Case(
        name="Product Info - Features",
        input="What features are included in the pro_plan?",
        expected_output="Product specialist should look up pro_plan details and explain its features"
    ),
    Case(
        name="Returns - Order Status",
        input="I want to check the status of order789",
        expected_output="Returns specialist should look up the order and provide status information"
    ),
]

print(f"Created {len(individual_test_cases)} test cases for individual agent evaluation")

#### Create Individual Agent Evaluators

Set up evaluators to assess output quality and tool selection for each specialist.

In [None]:
# Output evaluator for response quality
output_evaluator = OutputEvaluator(
    rubric="Evaluate the quality and accuracy of the specialist agent's response for their specific domain.",
    model=DEFAULT_MODEL
)

# Tool selection evaluator for correct tool usage
tool_selection_evaluator = ToolSelectionAccuracyEvaluator(
    system_prompt="Score 1.0 if the agent uses the appropriate tools for the query (e.g., lookup_customer for billing, lookup_product for product info, create_ticket for technical issues). Score 0.5 if partially correct. Score 0.0 if wrong tools used.",
    model=DEFAULT_MODEL
)

#### Evaluate Specialist Agents Individually

Run evaluation on each specialist agent when called directly (bypassing the orchestrator).

In [None]:
# Define direct agent tasks (bypassing orchestrator) with telemetry capture
def technical_support_task(case: Case) -> dict:
    """Call technical support agent directly with telemetry capture."""
    telemetry.in_memory_exporter.clear()
    session_id = str(uuid.uuid4())
    agent = Agent(
        name="technical_support",
        model=DEFAULT_MODEL,
        system_prompt="You are a technical support specialist. Help users troubleshoot technical issues, bugs, and product functionality problems. Use available tools to look up customer info and create tickets.",
        tools=[lookup_customer, create_ticket],
        trace_attributes={"session.id": session_id},
        callback_handler=None
    )
    response = agent(case.input)
    telemetry.tracer_provider.force_flush()
    finished_spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=session_id)
    return {"output": str(response), "trajectory": session}

def billing_support_task(case: Case) -> dict:
    """Call billing support agent directly with telemetry capture."""
    telemetry.in_memory_exporter.clear()
    session_id = str(uuid.uuid4())
    agent = Agent(
        name="billing_support",
        model=DEFAULT_MODEL,
        system_prompt="You are a billing specialist. Help with payment issues, subscription questions, refunds, and account billing problems. Use tools to look up customer information.",
        tools=[lookup_customer],
        trace_attributes={"session.id": session_id},
        callback_handler=None
    )
    response = agent(case.input)
    telemetry.tracer_provider.force_flush()
    finished_spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=session_id)
    return {"output": str(response), "trajectory": session}

def product_info_task(case: Case) -> dict:
    """Call product info agent directly with telemetry capture."""
    telemetry.in_memory_exporter.clear()
    session_id = str(uuid.uuid4())
    agent = Agent(
        name="product_info",
        model=DEFAULT_MODEL,
        system_prompt="You are a product specialist. Explain product features, capabilities, and help users understand how to use products effectively. Use tools to look up product details.",
        tools=[lookup_product],
        trace_attributes={"session.id": session_id},
        callback_handler=None
    )
    response = agent(case.input)
    telemetry.tracer_provider.force_flush()
    finished_spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=session_id)
    return {"output": str(response), "trajectory": session}

def returns_exchanges_task(case: Case) -> dict:
    """Call returns agent directly with telemetry capture."""
    telemetry.in_memory_exporter.clear()
    session_id = str(uuid.uuid4())
    agent = Agent(
        name="returns_exchanges",
        model=DEFAULT_MODEL,
        system_prompt="You are a returns specialist. Help with product returns, exchanges, order modifications, and shipping issues. Use tools to look up order and customer information.",
        tools=[lookup_order, lookup_customer],
        trace_attributes={"session.id": session_id},
        callback_handler=None
    )
    response = agent(case.input)
    telemetry.tracer_provider.force_flush()
    finished_spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=session_id)
    return {"output": str(response), "trajectory": session}

# Create 8 datasets (4 agents x 2 evaluators each)
tech_output_dataset = Dataset(
    cases=[individual_test_cases[0]],
    evaluator=output_evaluator
)

tech_tool_dataset = Dataset(
    cases=[individual_test_cases[0]],
    evaluator=tool_selection_evaluator
)

billing_output_dataset = Dataset(
    cases=[individual_test_cases[1]],
    evaluator=output_evaluator
)

billing_tool_dataset = Dataset(
    cases=[individual_test_cases[1]],
    evaluator=tool_selection_evaluator
)

product_output_dataset = Dataset(
    cases=[individual_test_cases[2]],
    evaluator=output_evaluator
)

product_tool_dataset = Dataset(
    cases=[individual_test_cases[2]],
    evaluator=tool_selection_evaluator
)

returns_output_dataset = Dataset(
    cases=[individual_test_cases[3]],
    evaluator=output_evaluator
)

returns_tool_dataset = Dataset(
    cases=[individual_test_cases[3]],
    evaluator=tool_selection_evaluator
)

print("Evaluating individual agent performance...\n")

# Run 8 separate evaluations
tech_output_report = tech_output_dataset.run_evaluations(technical_support_task)
tech_tool_report = tech_tool_dataset.run_evaluations(technical_support_task)
billing_output_report = billing_output_dataset.run_evaluations(billing_support_task)
billing_tool_report = billing_tool_dataset.run_evaluations(billing_support_task)
product_output_report = product_output_dataset.run_evaluations(product_info_task)
product_tool_report = product_tool_dataset.run_evaluations(product_info_task)
returns_output_report = returns_output_dataset.run_evaluations(returns_exchanges_task)
returns_tool_report = returns_tool_dataset.run_evaluations(returns_exchanges_task)

print("Individual agent evaluations complete")

#### Display Individual Performance Results

In [None]:
print("\n" + "="*80)
print("DIMENSION 1: INDIVIDUAL AGENT PERFORMANCE")
print("="*80 + "\n")

print("Technical Support Agent:")
print("-" * 40)
print("\nOutput Quality:")
tech_output_report.run_display()
print("\nTool Selection:")
tech_tool_report.run_display()

print("\n" + "="*80 + "\n")
print("Billing Support Agent:")
print("-" * 40)
print("\nOutput Quality:")
billing_output_report.run_display()
print("\nTool Selection:")
billing_tool_report.run_display()

print("\n" + "="*80 + "\n")
print("Product Info Agent:")
print("-" * 40)
print("\nOutput Quality:")
product_output_report.run_display()
print("\nTool Selection:")
product_tool_report.run_display()

print("\n" + "="*80 + "\n")
print("Returns & Exchanges Agent:")
print("-" * 40)
print("\nOutput Quality:")
returns_output_report.run_display()
print("\nTool Selection:")
returns_tool_report.run_display()

## Dimension 2: Collective System Performance

Next, we'll evaluate the complete multi-agent system's performance when the orchestrator routes queries.

### Create Test Cases for System-Level Evaluation

In [None]:
# Test cases for collective system evaluation
system_test_cases = [
    Case(
        name="Technical Query - App Crash",
        input="My app keeps crashing when I try to login for user123",
        expected_output="System should route to technical support and provide troubleshooting help"
    ),
    Case(
        name="Billing Query - Double Charge",
        input="I was charged twice this month, my customer ID is user456",
        expected_output="System should route to billing support and investigate the charge issue"
    ),
    Case(
        name="Product Query - Features",
        input="What features are included in the pro_plan?",
        expected_output="System should route to product info and explain pro_plan features"
    ),
    Case(
        name="Order Query - Status Check",
        input="I want to check the status of order789",
        expected_output="System should route to returns specialist and provide order status"
    ),
]

print(f"Created {len(system_test_cases)} test cases for system-level evaluation")

### Create System-Level Evaluators

In [None]:
# Output evaluator for final system response quality
system_output_evaluator = OutputEvaluator(
    rubric="Evaluate the quality and completeness of the multi-agent system's final response.",
    model=DEFAULT_MODEL
)

print("System-level evaluator created")

### Evaluate Complete System Performance

In [None]:
# Define system task (using orchestrator)
def system_task(case: Case) -> str:
    """Execute query through the complete multi-agent system."""
    # Reset orchestrator messages for clean execution
    orchestrator.messages = []
    response = orchestrator(case.input)
    return str(response)

# Create dataset for system evaluation
system_dataset = Dataset(
    cases=system_test_cases,
    evaluator=system_output_evaluator
)

print("Evaluating complete system performance...\n")

# Run evaluation
system_report = system_dataset.run_evaluations(system_task)

print("System evaluation complete")

### Display System Performance Results

In [None]:
print("\n" + "="*80)
print("DIMENSION 2: COLLECTIVE SYSTEM PERFORMANCE")
print("="*80 + "\n")

system_report.run_display()

## Dimension 3: Coordination Quality

Finally, we'll evaluate how well the orchestrator coordinates with specialists by analyzing its routing decisions and the quality of each handoff.

### Understanding InteractionsEvaluator

**InteractionsEvaluator** is designed to evaluate multi-agent interactions. Each interaction it scores is described by three fields:

- **Node Name**: Which agent handled the interaction
- **Dependencies**: What information was passed to the agent
- **Messages**: The output or result from the agent

This evaluator assesses:
- Whether the right agent was selected
- If information was passed correctly
- Whether the agent's response was appropriate
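
As a rough illustration of that structure, the sketch below converts tool-call records into interaction entries using plain dictionaries. The `InteractionRecord` type and the `to_interactions` helper are hypothetical stand-ins, not library classes, and the record keys (`name`, `input`, `tool_result`) are assumed to match what this tutorial's coordination task reads from the extractor.

```python
from typing import Any, Dict, List, TypedDict


class InteractionRecord(TypedDict):
    """Hypothetical stand-in holding node name, dependencies, and messages."""

    node_name: str
    dependencies: List[Any]
    messages: List[Any]


def to_interactions(tool_calls: List[Dict[str, Any]]) -> List[InteractionRecord]:
    """Map each tool call to one interaction: who ran, what it received, what it returned."""
    return [
        InteractionRecord(
            node_name=call["name"],
            dependencies=[call["input"]],
            messages=[call["tool_result"]],
        )
        for call in tool_calls
    ]


# Illustrative record, shaped like one specialist call made by the orchestrator.
sample_calls = [
    {
        "name": "billing_support",
        "input": {"query": "I was charged twice, my ID is user456"},
        "tool_result": "Found customer user456 on the Basic plan.",
    }
]

interactions = to_interactions(sample_calls)
print(interactions[0]["node_name"])
```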

### Create Coordination Evaluation Task

In [None]:
# Define coordination task that captures interactions
def coordination_task(case: Case) -> Dict[str, Any]:
    """
    Execute query and capture interaction data for coordination evaluation.
    
    Returns:
        Dictionary with output, trajectory, and interactions
    """
    # Reset orchestrator messages
    orchestrator.messages = []
    
    # Execute query
    response = orchestrator(case.input)
    
    # Extract tools used from messages
    tools_used = tools_use_extractor.extract_agent_tools_used_from_messages(orchestrator.messages)
    
    # Build interactions list
    interactions = []
    for tool_used in tools_used:
        interactions.append(
            Interaction(
                node_name=tool_used["name"],
                dependencies=[tool_used["input"]],
                messages=[tool_used["tool_result"]]
            )
        )
    
    return {
        "output": str(response),
        "trajectory": tools_used,
        "interactions": interactions
    }

print("Coordination task function defined")

### Create InteractionsEvaluator

In [None]:
# Create interaction evaluator with custom rubric
coordination_rubric = """
Evaluate the orchestrator's routing decision and the specialist agent's response quality:

Score 1.0 if:
- The orchestrator routed to the correct specialist for the query type
- The specialist used appropriate tools for their domain
- The specialist provided a complete, helpful response

Score 0.7 if:
- Correct specialist selected
- Most appropriate tools used
- Response is adequate but could be more complete

Score 0.4 if:
- Marginally correct specialist or could have chosen better
- Some tool usage issues
- Response partially addresses the query

Score 0.0 if:
- Wrong specialist selected for the query type
- Inappropriate tool usage
- Response does not address the query
"""

interaction_evaluator = InteractionsEvaluator(
    rubric=coordination_rubric,
    model=DEFAULT_MODEL
)

# Get tool descriptions for context
test_execution = coordination_task(system_test_cases[0])
tool_description = tools_use_extractor.extract_tools_description(orchestrator)
interaction_evaluator.update_interaction_description(tool_description)

print("InteractionsEvaluator created with custom rubric")

### Evaluate Coordination Quality

In [None]:
# Create dataset for coordination evaluation
coordination_dataset = Dataset(
    cases=system_test_cases,
    evaluator=interaction_evaluator
)

print("Evaluating agent coordination quality...\n")

# Run evaluation
coordination_report = coordination_dataset.run_evaluations(coordination_task)

print("Coordination evaluation complete")

### Display Coordination Results

In [None]:
print("\n" + "="*80)
print("DIMENSION 3: COORDINATION QUALITY")
print("="*80 + "\n")

coordination_report.run_display(include_actual_interactions=True)

## Multi-Agent Evaluation Report

Let's compile all three evaluation dimensions into a comprehensive multi-agent evaluation report.

In [None]:
# Compile comprehensive evaluation report
def calculate_average_score(report):
    """Calculate average score from evaluation report."""
    if report.scores:
        return sum(report.scores) / len(report.scores)
    return 0.0

# Calculate metrics for each dimension
tech_output_score = calculate_average_score(tech_output_report)
tech_tool_score = calculate_average_score(tech_tool_report)
billing_output_score = calculate_average_score(billing_output_report)
billing_tool_score = calculate_average_score(billing_tool_report)
product_output_score = calculate_average_score(product_output_report)
product_tool_score = calculate_average_score(product_tool_report)
returns_output_score = calculate_average_score(returns_output_report)
returns_tool_score = calculate_average_score(returns_tool_report)

tech_score = (tech_output_score + tech_tool_score) / 2
billing_score = (billing_output_score + billing_tool_score) / 2
product_score = (product_output_score + product_tool_score) / 2
returns_score = (returns_output_score + returns_tool_score) / 2
individual_avg = (tech_score + billing_score + product_score + returns_score) / 4

system_score = calculate_average_score(system_report)
coordination_score = calculate_average_score(coordination_report)

# Create comprehensive report
comprehensive_report = f"""
# Multi-Agent System Evaluation Report

## Executive Summary

This report presents a comprehensive evaluation of the customer support multi-agent system across three critical dimensions: individual agent performance, collective system performance, and coordination quality.

## Overall Metrics

| Dimension | Score | Status |
|:----------|:------|:-------|
| **Individual Performance** | {individual_avg:.2f} | {'✅ Good' if individual_avg >= 0.7 else '❌ Needs Improvement'} |
| **Collective Performance** | {system_score:.2f} | {'✅ Good' if system_score >= 0.7 else '❌ Needs Improvement'} |
| **Coordination Quality** | {coordination_score:.2f} | {'✅ Good' if coordination_score >= 0.7 else '❌ Needs Improvement'} |
| **Overall System Score** | {(individual_avg + system_score + coordination_score) / 3:.2f} | {'✅ Production Ready' if (individual_avg + system_score + coordination_score) / 3 >= 0.7 else '❌ Requires Optimization'} |

## Dimension 1: Individual Agent Performance

### Specialist Agent Scores

| Agent | Score | Assessment |
|:------|:------|:-----------|
| Technical Support | {tech_score:.2f} | {'Strong performance' if tech_score >= 0.7 else 'Needs improvement'} |
| Billing Support | {billing_score:.2f} | {'Strong performance' if billing_score >= 0.7 else 'Needs improvement'} |
| Product Info | {product_score:.2f} | {'Strong performance' if product_score >= 0.7 else 'Needs improvement'} |
| Returns & Exchanges | {returns_score:.2f} | {'Strong performance' if returns_score >= 0.7 else 'Needs improvement'} |

**Key Findings:**
- Each specialist agent demonstrates competence in their domain
- Tool selection accuracy varies by agent and query complexity
- Response quality is generally consistent across specialists

## Dimension 2: Collective System Performance

**System Score:** {system_score:.2f}

**Key Findings:**
- The complete system produces appropriate responses when routing correctly
- End-to-end query handling meets quality standards
- User-facing output is coherent and helpful

## Dimension 3: Coordination Quality

**Coordination Score:** {coordination_score:.2f}

**Key Findings:**
- Orchestrator routing decisions are generally accurate
- Handoffs between orchestrator and specialists are smooth
- Information is passed correctly between agents
- Interaction quality supports effective collaboration

## Coordination Metrics

### Routing Accuracy Analysis

**Expected Routing:**
- Technical issues -> technical_support
- Billing questions -> billing_support
- Product queries -> product_info
- Order issues -> returns_exchanges

### Handoff Quality Indicators

- **Context Preservation**: Query context maintained during handoffs
- **Information Completeness**: Specialists receive all necessary information
- **Response Integration**: Final responses integrate specialist outputs effectively

## Recommendations

### Strengths
1. Clear separation of concerns among specialists
2. Effective orchestrator routing logic
3. Appropriate tool usage by domain experts

### Areas for Improvement
1. **Individual Agents**: {'Optimize tool selection for edge cases' if individual_avg < 0.85 else 'Maintain current performance levels'}
2. **System Level**: {'Improve orchestrator prompts for better routing' if system_score < 0.85 else 'Consider expanding capabilities'}
3. **Coordination**: {'Enhance handoff protocols between agents' if coordination_score < 0.85 else 'Document successful patterns'}

### Next Steps
1. Monitor production performance with real user queries
2. Expand test coverage to include edge cases and complex scenarios
3. Implement continuous evaluation pipeline
4. Gather user feedback to validate evaluation metrics

## Conclusion

The multi-agent customer support system demonstrates {'strong' if (individual_avg + system_score + coordination_score) / 3 >= 0.7 else 'moderate'} performance across all evaluation dimensions. The combination of specialized agents with effective orchestration creates a {'production-ready' if (individual_avg + system_score + coordination_score) / 3 >= 0.7 else 'promising'} system for handling diverse customer support queries.

**Overall Assessment:** {'READY FOR DEPLOYMENT' if (individual_avg + system_score + coordination_score) / 3 >= 0.7 else 'REQUIRES ADDITIONAL OPTIMIZATION'}
"""

display(Markdown(comprehensive_report))
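
The report repeats the overall-score expression and its cut-off in several places, which makes it easy for thresholds to drift apart. One possible way to avoid that, sketched here under assumptions, is to compute the overall score once and derive every label from a single constant. `GOOD_THRESHOLD`, `average_score`, and `status_label` are hypothetical names, and the scores are placeholder values rather than evaluation results.

```python
from statistics import mean
from typing import Sequence

# Assumed cut-off, matching the 0.7 used in the report's status columns.
GOOD_THRESHOLD = 0.7


def average_score(scores: Sequence[float]) -> float:
    """Mean of the scores, or 0.0 when no scores were recorded."""
    return mean(scores) if scores else 0.0


def status_label(score: float, threshold: float = GOOD_THRESHOLD) -> str:
    """Translate a score into the label shown in the report tables."""
    return "Good" if score >= threshold else "Needs Improvement"


# Placeholder per-dimension scores for illustration only.
dimension_scores = {"individual": 0.80, "collective": 0.75, "coordination": 0.60}

# Compute the overall score once and reuse it everywhere.
overall = average_score(list(dimension_scores.values()))

for name, score in dimension_scores.items():
    print(f"{name}: {score:.2f} ({status_label(score)})")
print(f"overall: {overall:.2f} ({status_label(overall)})")
```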

## Best Practices for Multi-Agent Evaluation

### Always Evaluate All Three Dimensions

| Dimension | Purpose | Key Tests |
|:----------|:--------|:----------|
| Individual Performance | Ensure specialist competence | Isolation tests, tool selection, response quality |
| Collective Performance | Validate end-to-end quality | Complete workflows, final output, task completion |
| Coordination Quality | Evaluate collaboration | Routing decisions, handoff quality, information flow |

### When to Use InteractionsEvaluator

Use InteractionsEvaluator when you are evaluating multi-agent systems, analyzing collaboration patterns, debugging coordination issues, or validating information flow between agents.

### Interpreting Results

| Pattern | Diagnosis |
|:--------|:----------|
| High individual, low collective | Orchestration issues |
| High collective, low coordination | Correct outcomes despite poor process |
| Low individual, any collective | Specialist agents need improvement |
| High all dimensions | System well-designed and functioning |
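
The table can be read as a small decision procedure. The sketch below is one possible encoding: `diagnose` is a hypothetical helper, and the 0.7 boundary between "high" and "low" is an assumption borrowed from the report's status columns.

```python
def diagnose(
    individual: float,
    collective: float,
    coordination: float,
    threshold: float = 0.7,  # assumed boundary between "low" and "high"
) -> str:
    """Return the diagnosis from the table for a set of dimension scores."""
    # Low individual scores dominate, whatever the collective score is.
    if individual < threshold:
        return "Specialist agents need improvement"
    # Capable specialists but weak end-to-end results point at orchestration.
    if collective < threshold:
        return "Orchestration issues"
    # Good outcomes reached through poor routing or handoffs.
    if coordination < threshold:
        return "Correct outcomes despite poor process"
    return "System well-designed and functioning"


print(diagnose(individual=0.9, collective=0.5, coordination=0.8))
```

Checking the individual dimension first mirrors the "low individual, any collective" row: there is little value in tuning orchestration while the specialists themselves underperform.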

### Production Monitoring

In production, track routing accuracy rate, average handoffs per query, specialist utilization, end-to-end latency, and user satisfaction.
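
Three of these metrics (routing accuracy, average handoffs, and specialist utilization) can be computed from simple per-query logs. The sketch below assumes a hypothetical log format in which each record stores the expected specialist and the ordered list of specialists actually called. The field names and sample records are illustrative. Latency and user satisfaction need timing and feedback data and are not covered.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical log records, one per user query.
query_logs = [
    {"expected": "billing_support", "called": ["billing_support"]},
    {"expected": "technical_support", "called": ["technical_support", "billing_support"]},
    {"expected": "product_info", "called": ["returns_exchanges"]},
    {"expected": "returns_exchanges", "called": ["returns_exchanges"]},
]


def routing_accuracy(logs: List[dict]) -> float:
    """Fraction of queries whose first handoff went to the expected specialist."""
    if not logs:
        return 0.0
    correct = sum(
        1 for log in logs if log["called"] and log["called"][0] == log["expected"]
    )
    return correct / len(logs)


def average_handoffs(logs: List[dict]) -> float:
    """Mean number of specialist calls per query."""
    if not logs:
        return 0.0
    return sum(len(log["called"]) for log in logs) / len(logs)


def specialist_utilization(logs: List[dict]) -> Dict[str, float]:
    """Share of all handoffs handled by each specialist."""
    counts = Counter(name for log in logs for name in log["called"])
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


print(f"Routing accuracy: {routing_accuracy(query_logs):.2f}")
print(f"Average handoffs per query: {average_handoffs(query_logs):.2f}")
print(f"Utilization: {specialist_utilization(query_logs)}")
```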

## Summary

You've successfully learned how to evaluate multi-agent systems using Strands Evals. You now understand:

- **Multi-agent architectures**: Agent-as-tool pattern with orchestrator and specialists
- **Three evaluation dimensions**:
  - Individual Performance: Each agent evaluated separately
  - Collective Performance: Complete system evaluated as a whole
  - Coordination Quality: Agent collaboration and handoffs
- **InteractionsEvaluator**: Specialized evaluator for multi-agent coordination
- **Interaction structure**: Node names, dependencies, and messages
- **Comprehensive reporting**: Combining multiple evaluation dimensions
- **Coordination metrics**: Routing accuracy, handoff quality, information flow
- **Best practices**: When and how to evaluate multi-agent systems effectively

Multi-agent evaluation is essential for building reliable, efficient, and collaborative AI systems. Evaluating individual components, collective outcomes, and coordination quality together shows where a failure originates: in a specialist, in the orchestrator's routing, or in the handoff between them.