# Strands Agent Evaluation Framework

## Overview
This notebook demonstrates comprehensive evaluation techniques for Strands agents using built-in observability features and custom evaluation metrics.

### Strands Evaluation Dimensions:
- **Agent Performance**: Measuring accuracy using ground truth datasets
- **Tool Execution**: Analyzing tool selection and execution success rates
- **Resource Efficiency**: Token usage, latency, and cycle duration analysis
- **Agent Reliability**: Consistency across multiple test scenarios

## 1. Dependencies Installation

In [None]:
!pip install strands-agents strands-agents-tools
!pip install ddgs beautifulsoup4 requests pandas

## 2. Imports

In [None]:
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup
from strands import Agent, tool
from strands.models import BedrockModel

## 3. Ground Truth Dataset Loading

In [None]:
# Load the ground truth CSV: city, state, population, and land area in square miles, as of 2024.
gold_standard_city_pop = pd.read_csv('city_pop.csv')
# Clean the dataset once on load; the numbers from Wikipedia contain thousands separators.
gold_standard_city_pop['population'] = gold_standard_city_pop['population'].astype(str).str.replace(',', '').astype(float)
gold_standard_city_pop['land_area_mi2'] = gold_standard_city_pop['land_area_mi2'].astype(str).str.replace(',', '').astype(float)

# Show the first 3 rows as a reference
print(gold_standard_city_pop.head(3))

## 4. Core Evaluation Function

In [None]:
def evaluate_city_guess(city, state, chatbot_response, dataset):
    """
    Evaluate population and area guesses against the gold standard dataset.
    
    Parameters:
    - city: str, city name
    - state: str, state abbreviation (e.g., 'NY', 'CA')
    - chatbot_response: Strands AgentResult object to be evaluated
    - dataset: pandas DataFrame, the gold standard dataset
    
    Returns:
    - dict with percent errors for population and area, and total tokens, execution time, and tool calls.
    
    Raises:
    - ValueError if city/state combination not found
    """
    
    # Clean the city name for matching
    city_clean = city.strip()

    
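    # Use regex to pull the final numeric answers out of the <pop> and <area> tags
    final_msg = chatbot_response.message['content'][0]['text']
    try:
        guessed_pop = int(re.search(r'<pop>(.*?)</pop>', final_msg).group(1))
        guessed_area = float(re.search(r'<area>(.*?)</area>', final_msg).group(1))
    except (AttributeError, ValueError) as e:
        # AttributeError: a tag is missing; ValueError: the tag content is not a number
        raise ValueError("XML tags missing or non-numeric in reply") from e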

    
    # Extract agent loop metrics
    total_tokens = chatbot_response.metrics.accumulated_usage['totalTokens']
    total_time = sum(chatbot_response.metrics.cycle_durations)

    # Sum call counts across every tool the agent used
    tool_calls = sum(m.call_count for m in chatbot_response.metrics.tool_metrics.values())

    
    # Find the city in the dataset
    # Use case-insensitive matching and handle potential annotations
    mask = (dataset['city'].str.replace(r'\[.*\]', '', regex=True).str.strip().str.lower() == city_clean.lower()) & \
           (dataset['state'].str.upper() == state.upper())
    
    matching_rows = dataset[mask]
    
    if len(matching_rows) == 0:
        raise ValueError(f"City '{city}' in state '{state}' not found in dataset")
    
    if len(matching_rows) > 1:
        print(f"Warning: Multiple matches found for {city}, {state}. Using first match.")
    
    # Get the actual values
    actual_pop = matching_rows.iloc[0]['population']
    actual_area = matching_rows.iloc[0]['land_area_mi2']
    
    # Calculate percent error: |actual - guess| / actual * 100
    pop_error = abs(actual_pop - guessed_pop) / actual_pop * 100
    area_error = abs(actual_area - guessed_area) / actual_area * 100
    
    return {
        'city': matching_rows.iloc[0]['city'],
        'state': matching_rows.iloc[0]['state'],
        'actual_population': actual_pop,
        'guessed_population': guessed_pop,
        'population_error_percent': round(pop_error, 2),
        'actual_area': actual_area,
        'guessed_area': guessed_area,
        'area_error_percent': round(area_error, 2),
        'total_tokens': total_tokens,
        'total_time': total_time,
        'tool_calls': tool_calls
    }
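
The two core steps of `evaluate_city_guess`, tag extraction and percent error, can be tried in isolation without an agent. The sketch below is a minimal standalone version; the helper names `parse_tagged_answer` and `percent_error` are hypothetical, and the reply text and reference values are made up for illustration.

```python
import re


def parse_tagged_answer(text):
    """Return (population, area) parsed from <pop> and <area> tags."""
    pop_match = re.search(r"<pop>(.*?)</pop>", text)
    area_match = re.search(r"<area>(.*?)</area>", text)
    if pop_match is None or area_match is None:
        raise ValueError("XML tags not found in reply")
    return int(pop_match.group(1)), float(area_match.group(1))


def percent_error(actual, guess):
    """Absolute percent error: |actual - guess| / actual * 100."""
    return abs(actual - guess) / actual * 100


# Made-up reply and reference values, for illustration only
reply = "The town has about 900 residents. <pop>900</pop> <area>52.0</area>"
pop, area = parse_tagged_answer(reply)
print(round(percent_error(1000, pop), 2))   # 10.0
print(round(percent_error(50.0, area), 2))  # 4.0
```

Checking for a missing match before calling `.group()` keeps the "tags not found" failure separate from a "tag content is not a number" failure, which `int()` and `float()` report on their own.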

## 5. Strands Agent Tools

In [None]:
@tool
def web_search(topic: str) -> str:
    """Search DuckDuckGo for a given topic."""
    try:
        from ddgs import DDGS
        results = DDGS().text(topic, max_results=5)
        
        if not results:
            return "No search results found"
        
        result_string = ""
        for i, result in enumerate(results):
            result_string += f"Result {i+1}: {result.get('title', 'No title')}\nURL: {result.get('href', 'No URL')}\nSnippet: {result.get('body', 'No description')}\n\n"
        
        return result_string
        
    except Exception as e:
        return f"Search error: {str(e)}"
    
@tool
def get_page(url: str) -> str:
    """Fetch a URL and return the raw text of that page.
    Use it to get more detail on a web search result."""
    # Set a timeout so a slow site cannot hang the agent loop
    response = requests.get(url, timeout=20)
    response.raise_for_status()
    bs = BeautifulSoup(response.text,'html.parser')
    return bs.text

## 6. AWS Bedrock Configuration

In [None]:
from botocore.config import Config

# A custom Bedrock config that only allows short connections; for this demo we expect all calls to be fast.
# Retries are turned off, and reads time out after 20 seconds.
quick_config = Config(
    connect_timeout=5,
    read_timeout=20,
    retries={"max_attempts": 0}
)

# A more forgiving alternative for slower models (not used in the cells below).
longer_config = Config(
    connect_timeout=10,
    read_timeout=60,
    retries={"max_attempts": 1}
)

## 7. Single Agent Baseline Test

In [None]:
# Create the chatbot. We use Nova Micro to optimize for latency, cost, and capacity.
chatbot_model_name = "us.amazon.nova-micro-v1:0"
# Apply the custom timeout config so the model call cannot hang or retry too much.
chatbot_model = BedrockModel(
    model_id=chatbot_model_name,
    boto_client_config=quick_config
)
chatbot = Agent(tools=[web_search, get_page], model=chatbot_model)
# Call the chatbot with a simple request.
prompt = """How many people live in New York, and what's the area of the city in square miles?
After you respond, also include your answer in 'pop' and 'area' XML tags, for programmatic processing.
The values in the XML tags should only be numbers, no words or commas."""
chatbot_response = chatbot(prompt)

In [None]:
# Evaluate against the same city the prompt asked about
result = evaluate_city_guess("New York", "NY", chatbot_response, gold_standard_city_pop)
print(f"Population error: {result['population_error_percent']}%")
print(f"Area error: {result['area_error_percent']}%")
print(f"Total Tokens: {result['total_tokens']} tokens")
print(f"Total Time: {result['total_time']:.2f} seconds")
print(f"Tool Calls: {result['tool_calls']}")

## 8. Single Model Evaluation Tool

In [None]:
@tool
def eval_model(model_name: str) -> str:
    """Start an evaluator for a particular model.
    model_name is the model endpoint to be evaluated.
    Returns a string containing the evaluation metrics for this model.
    """
    #add custom timeout for the model, to keep the tool from hanging or retrying too much.
    chatbot_model = BedrockModel(
        model_id=model_name,
        boto_client_config=quick_config    
    )
    
    # callback_handler=None suppresses the sub-agent's printed output
    chatbot = Agent(tools=[web_search, get_page], model=chatbot_model, callback_handler=None)
    # Call the chatbot with a simple request.
    prompt = """How many people live in Phoenix, AZ, and what's the area of the city in square miles?
    After you respond, also include your answer in 'pop' and 'area' XML tags, for programmatic processing.
    The values in the XML tags should only be numbers, no words or commas."""
    chatbot_response = chatbot(prompt)
    result = evaluate_city_guess("Phoenix", "AZ", chatbot_response, gold_standard_city_pop)
    result_string = "\n".join([
        f"Population error: {result['population_error_percent']}%",
        f"Area error: {result['area_error_percent']}%",
        f"Total Tokens: {result['total_tokens']} tokens",
        f"Total Time: {result['total_time']:.2f} seconds",
        f"Tool Calls: {result['tool_calls']}",
    ])
    print(result_string)
    return result_string

## 9. Multi-Model Comparison

In [None]:
evaluator_prompt = """
Use the eval_model tool to evaluate these models:
Nova Micro: "us.amazon.nova-micro-v1:0",
Nova Lite: "us.amazon.nova-lite-v1:0",
Nova Pro: "us.amazon.nova-pro-v1:0",
Claude 3 Haiku: "us.anthropic.claude-3-haiku-20240307-v1:0",
Claude 3 Sonnet: "us.anthropic.claude-3-sonnet-20240229-v1:0"
Provide a comparison table of the results, and include columns for all evaluation data points, including the number of tool calls and the number of times the model failed to evaluate and had to be retried.
Do not include the endpoint names in the table, only the model names, to save space.
If a model fails to evaluate, you should retry it up to 3 times.
"""

In [None]:
evaluator = Agent(tools=[eval_model], model=chatbot_model)
evaluator_response = evaluator(evaluator_prompt)
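
The prompt above leaves retrying and failure counting to the evaluator agent. When an exact retry count matters, the same policy can be enforced in plain Python. This is a minimal sketch, not part of the Strands API: `run_with_retries` and `flaky_eval` are hypothetical names, and `flaky_eval` is a stand-in for a real evaluation call.

```python
def run_with_retries(fn, max_retries=3):
    """Call fn() and retry on failure, up to max_retries extra attempts.

    Returns (result, failures), where failures counts the failed attempts.
    Re-raises the last exception once the retries are used up.
    """
    failures = 0
    while True:
        try:
            return fn(), failures
        except Exception:
            failures += 1
            if failures > max_retries:
                raise


# Stand-in for an evaluation call that fails twice, then succeeds
attempts = {"count": 0}


def flaky_eval():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("XML tags not found in reply")
    return "Population error: 1.5%"


result, failures = run_with_retries(flaky_eval, max_retries=3)
print(result, "| failed attempts:", failures)
```

Counting failures in code gives a retry column that does not depend on the evaluator model reporting its own retries correctly.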

## 10. Multi-City Evaluation Framework

### Expanding Evaluation to Multiple Data Points

Next, we'll expand our evaluator to check more than one data point. We also add a calculator tool to assist.

In [None]:
@tool
def calculate(expression: str) -> str:
    """Evaluate mathematical expressions safely. Use for calculations like population density."""
    try:
        allowed_chars = set('0123456789+-*/()., ')
        if not all(c in allowed_chars for c in expression):
            return "Error: Invalid characters"
        return str(eval(expression))
    except Exception:
        return "Error: Invalid calculation"
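
The character allowlist above still hands the string to `eval`. One alternative is to parse the expression with Python's `ast` module and evaluate only the node types that are explicitly permitted. The sketch below swaps in that technique; the function name `safe_arithmetic` is hypothetical, and it deliberately supports only `+`, `-`, `*`, `/`, unary signs, and parentheses.

```python
import ast
import operator

# Arithmetic operators the evaluator accepts
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arithmetic(expression):
    """Evaluate +, -, *, / and parentheses over numeric literals only."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Anything else (names, calls, ** and so on) is rejected
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_arithmetic("(15 * 4) + 9"))  # 69
print(safe_arithmetic("500 / 25"))      # 20.0
```

Because exponentiation is not in the operator table, an input such as `9**9**9` is rejected instead of being computed.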

### Multi-City Evaluation Function

In [None]:
import statistics
import random

def evaluate_multiple_cities(model_name, num_cities=3):
    """Multi-city evaluation using original evaluate_city_guess function"""
    MAJOR_CITIES = [
        ("New York", "NY"), ("Los Angeles", "CA"), ("Chicago", "IL"),
        ("Houston", "TX"), ("Phoenix", "AZ"), ("Philadelphia", "PA")
    ]
    
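    # Clamp the sample size; random.sample raises ValueError if it exceeds the list length
    test_cities = random.sample(MAJOR_CITIES, min(num_cities, len(MAJOR_CITIES)))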
    results = []
    
    for city, state in test_cities:
        try:
            chatbot_model = BedrockModel(model_id=model_name, boto_client_config=quick_config)
            chatbot = Agent(tools=[web_search, get_page, calculate], model=chatbot_model, callback_handler=None)
            
            prompt = f"""How many people live in {city}, {state}, and what's the area of the city in square miles?
After you respond, also include your answer in 'pop' and 'area' XML tags, for programmatic processing.
The values in the XML tags should only be numbers, no words or commas."""
            
            response = chatbot(prompt)
            result = evaluate_city_guess(city, state, response, gold_standard_city_pop)
            results.append(result)
            print(f"✓ {city}, {state}")
            
        except Exception as e:
            print(f"✗ Failed {city}, {state}: {e}")
            continue
    
    if results:
        return {
            'cities_tested': len(results),
            'avg_population_error': round(statistics.mean([r['population_error_percent'] for r in results]), 2),
            'avg_area_error': round(statistics.mean([r['area_error_percent'] for r in results]), 2),
            'total_tokens': sum([r['total_tokens'] for r in results]),
            'avg_time_per_city': round(statistics.mean([r['total_time'] for r in results]), 2),
            'total_tool_calls': sum([r['tool_calls'] for r in results]),
            'individual_results': results
        }
    return None
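
The averages returned above hide how much the error varies from city to city. As a rough illustration of a consistency measure, the sketch below adds the sample standard deviation and the worst case for one metric. `summarize_errors` is a hypothetical helper and the per-city values are made up.

```python
import statistics


def summarize_errors(results, key):
    """Return mean, standard deviation, and worst case for one error metric."""
    values = [r[key] for r in results]
    return {
        "mean": round(statistics.mean(values), 2),
        # statistics.stdev needs at least two data points
        "stdev": round(statistics.stdev(values), 2) if len(values) > 1 else 0.0,
        "max": max(values),
    }


# Made-up per-city results, for illustration only
sample_results = [
    {"population_error_percent": 1.0},
    {"population_error_percent": 2.0},
    {"population_error_percent": 6.0},
]
print(summarize_errors(sample_results, "population_error_percent"))
# {'mean': 3.0, 'stdev': 2.65, 'max': 6.0}
```

A low mean with a high standard deviation suggests the model is accurate on some cities and far off on others, which a single average would not show.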

### Multi-City Evaluation Tool

In [None]:
@tool
def eval_model_multi(model_name: str, num_cities: int = 3) -> str:
    """Multi-city version of eval_model using existing evaluation logic"""
    results = evaluate_multiple_cities(model_name, num_cities)
    
    if results:
        result_string = f"Cities tested: {results['cities_tested']}\n"
        result_string += f"Avg population error: {results['avg_population_error']}%\n"
        result_string += f"Avg area error: {results['avg_area_error']}%\n"
        result_string += f"Total tokens: {results['total_tokens']}\n"
        result_string += f"Avg time per city: {results['avg_time_per_city']:.2f} seconds\n"
        result_string += f"Total tool calls: {results['total_tool_calls']}"
        print(result_string)
        return result_string
    else:
        return "Evaluation failed - no cities successfully processed"

### Test Multi-City Evaluation

In [None]:
# Test it
results = evaluate_multiple_cities("us.amazon.nova-micro-v1:0", 3)
if results:
    print(f"Cities tested: {results['cities_tested']}")
    print(f"Avg population error: {results['avg_population_error']}%")
    print(f"Avg area error: {results['avg_area_error']}%")

In [None]:
multi_evaluator = Agent(tools=[eval_model_multi], model=chatbot_model)

multi_prompt = """
Use eval_model_multi to test these models on 3 cities each:
- "us.amazon.nova-lite-v1:0"
- "us.anthropic.claude-3-haiku-20240307-v1:0"

Create a comparison table with all metrics.
"""

multi_response = multi_evaluator(multi_prompt)

## 11. Tool Call Evaluation Framework

### Tool Selection Accuracy Testing

In [None]:
import json

dataset = [
  { "id": 1, "input": "What is 234 + 876?", "expected_tool": "calculator", "expected_output": "1110" },
  { "id": 2, "input": "Multiply 45 by 19.", "expected_tool": "calculator", "expected_output": "855" },
  { "id": 3, "input": "What is (15 * 4) + 9?", "expected_tool": "calculator", "expected_output": "69" },
  { "id": 4, "input": "Read the contents of notes.txt", "expected_tool": "file_read", "expected_output": "File contents of notes.txt" },
  { "id": 5, "input": "Open and show me what's inside data.csv", "expected_tool": "file_read", "expected_output": "CSV content from data.csv" },
  { "id": 6, "input": "Display everything in todo.md", "expected_tool": "file_read", "expected_output": "Markdown content of todo.md" },
  { "id": 7, "input": "Write 'Hello World' into hello.txt", "expected_tool": "file_write", "expected_output": "File hello.txt created with 'Hello World'" },
  { "id": 8, "input": "Save the text 'AgentCore Rocks!' into core.txt", "expected_tool": "file_write", "expected_output": "File core.txt created with text" },
  { "id": 9, "input": "Create a file log.txt that contains 'run successful'", "expected_tool": "file_write", "expected_output": "File log.txt written" },
  { "id": 10, "input": "Run Python code: print(2+3)", "expected_tool": "code_interpreter", "expected_output": "5" },
  { "id": 11, "input": "Execute Python code: for i in range(3): print(i)", "expected_tool": "code_interpreter", "expected_output": "0\n1\n2" },
  { "id": 12, "input": "Run a Python snippet to calculate factorial of 5", "expected_tool": "code_interpreter", "expected_output": "120" },
  { "id": 13, "input": "What is the capital of France?", "expected_tool": "none", "expected_output": "Paris" },
  { "id": 14, "input": "Who is the CEO of Amazon?", "expected_tool": "none", "expected_output": "Andy Jassy" },
  { "id": 15, "input": "Divide 500 by 25.", "expected_tool": "calculator", "expected_output": "20" },
  { "id": 16, "input": "Square root of 144?", "expected_tool": "calculator", "expected_output": "12" },
  { "id": 17, "input": "Show me what's inside config.yaml", "expected_tool": "file_read", "expected_output": "YAML file content" },
  { "id": 18, "input": "Write 'Done for today' in status.txt", "expected_tool": "file_write", "expected_output": "status.txt written" },
  { "id": 19, "input": "Execute Python: sum([10,20,30])", "expected_tool": "code_interpreter", "expected_output": "60" },
  { "id": 20, "input": "What is 99 * 99?", "expected_tool": "calculator", "expected_output": "9801" }
]

with open('dataset.json', 'w') as f:
    json.dump(dataset, f, indent=2)

In [None]:
from strands import Agent
from strands_tools import calculator, file_read, current_time, file_write, code_interpreter
# Create agent with multiple tools
agent = Agent(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    tools=[calculator, file_read, current_time, file_write, code_interpreter],
    record_direct_tool_call=True
)

# Run each test case from the dataset above and track tool usage
tool_usage_results = []
for case in dataset:
    response = agent(case["input"])

    # Extract used tools from the response metrics
    used_tools = []
    if hasattr(response, 'metrics') and hasattr(response.metrics, 'tool_metrics'):
        for tool_name, tool_metric in response.metrics.tool_metrics.items():
            if tool_metric.call_count > 0:
                used_tools.append(tool_name)

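    # A case that expects no tool is correct only when the agent called none
    if case["expected_tool"] == "none":
        correct_tool_used = not used_tools
    else:
        correct_tool_used = case["expected_tool"] in used_tools

    tool_usage_results.append({
        "query": case["input"],
        "expected_tool": case["expected_tool"],
        "used_tools": used_tools,
        "correct_tool_used": correct_tool_used
    })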

# Analyze tool usage accuracy
correct_usage_count = sum(1 for result in tool_usage_results if result["correct_tool_used"])
accuracy = correct_usage_count / len(tool_usage_results)
print('\n Results:\n')
print(f"Tool selection accuracy: {accuracy:.2%}")
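
A single accuracy figure does not show which kinds of request go wrong. The sketch below groups results by expected tool, assuming records shaped like the entries in `tool_usage_results`. `accuracy_by_expected_tool` is a hypothetical helper and the sample records are made up; a case that expects `"none"` counts as correct only when no tool was called.

```python
from collections import defaultdict


def accuracy_by_expected_tool(results):
    """Group tool-selection results by expected tool and compute accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        expected = r["expected_tool"]
        totals[expected] += 1
        if expected == "none":
            ok = not r["used_tools"]
        else:
            ok = expected in r["used_tools"]
        correct[expected] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


# Made-up results, for illustration only
sample = [
    {"expected_tool": "calculator", "used_tools": ["calculator"]},
    {"expected_tool": "calculator", "used_tools": []},
    {"expected_tool": "none", "used_tools": []},
    {"expected_tool": "file_read", "used_tools": ["file_read", "calculator"]},
]
for tool_name, acc in accuracy_by_expected_tool(sample).items():
    print(f"{tool_name}: {acc:.0%}")
```

The last sample record counts as correct even though an extra tool was called; a stricter check could require `used_tools` to equal exactly the expected tool.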

## 12. LLM As a Judge Evaluation

### Using Stronger Models to Evaluate Agent Responses

In [None]:
from strands import Agent
import json

# Create the agent to evaluate
agent = Agent(model="us.anthropic.claude-3-5-sonnet-20241022-v2:0")

# Create an evaluator agent with a stronger model
evaluator = Agent(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt="""
    You are an expert AI evaluator. Your job is to assess the quality of AI responses based on:
    1. Accuracy - factual correctness of the response
    2. Relevance - how well the response addresses the query
    3. Completeness - whether all aspects of the query are addressed
    4. Tool usage - appropriate use of available tools

    Score each criterion from 1-5, where 1 is poor and 5 is excellent.
    Provide an overall score and brief explanation for your assessment.
    """
)


# Load test cases
with open("dataset.json", "r") as f:
    test_cases = json.load(f)

# Run evaluations
evaluation_results = []
for case in test_cases:
    # Get agent response
    print(case)
    agent_response = agent(case['input'])

    # Create evaluation prompt
    eval_prompt = f"""
    Query: {case['input']}

    Response to evaluate:
    {agent_response}

    Expected response (if available):
    {case.get('expected_output', 'Not provided')}

    Please evaluate the response based on accuracy, relevance, completeness, and tool usage.
    """

    # Get evaluation
    evaluation = evaluator(eval_prompt)

    # Store results
    evaluation_results.append({
        "test_id": case.get("id", ""),
        "query": case["input"],
        "agent_response": str(agent_response),
        "evaluation": evaluation.message['content']
    })

# Save evaluation results
with open("evaluation_results.json", "w") as f:
    json.dump(evaluation_results, f, indent=2)
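
The judge's output is saved as free text, so the scores still have to be extracted before they can be aggregated. The system prompt above does not fix an output format, so the sketch below assumes the judge writes each criterion as `Name: N` or `Name: N/5`; that format, the helper name `extract_scores`, and the sample text are all assumptions for illustration.

```python
import re

CRITERIA = ("Accuracy", "Relevance", "Completeness", "Tool usage")


def extract_scores(evaluation_text):
    """Pull 1-5 scores out of free-form judge text.

    Assumes each criterion appears as 'Name: N' or 'Name: N/5'. Criteria the
    judge phrased differently are simply absent from the result.
    """
    scores = {}
    for name in CRITERIA:
        pattern = rf"{re.escape(name)}\s*:\s*([1-5])(?:\s*/\s*5)?\b"
        match = re.search(pattern, evaluation_text, re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


# Made-up judge output, for illustration only
sample = "Accuracy: 5/5\nRelevance: 4\nCompleteness: 4/5\nTool usage: 2/5 (no tool was called)"
scores = extract_scores(sample)
print(scores)
print(sum(scores.values()) / len(scores))  # 3.75
```

In practice it is more reliable to ask the judge for a fixed structure in its system prompt, much as the city prompts request XML tags, and then parse that structure.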

## 13. Advanced Strands Metrics Analysis

### Detailed Performance Metrics Using Strands Observability

In [None]:
result = agent("What is the square root of 144?")

def display_metrics(result):
    summary = result.metrics.get_summary()
    
    print(" Agent Performance Summary")
    print("=" * 50)
    
    # Core metrics
    print(f" Execution: {summary['total_cycles']} cycles in {summary['total_duration']:.2f}s")
    print(f" Average cycle time: {summary['average_cycle_time']:.2f}s")
    
    # Tool usage
    print(f"\n Tool Performance:")
    for tool, data in summary['tool_usage'].items():
        stats = data['execution_stats']
        print(f"   {tool}: {stats['call_count']} calls | {stats['success_rate']:.0%} success | {stats['average_time']*1000:.1f}ms avg")
    
    # Token usage
    usage = summary['accumulated_usage']
    print(f"\n Token Usage:")
    print(f"   Input: {usage['inputTokens']:,} | Output: {usage['outputTokens']:,} | Total: {usage['totalTokens']:,}")
    
    # Latency
    print(f" Total latency: {summary['accumulated_metrics']['latencyMs']:,}ms")
    
    # Cycle breakdown
    print(f"\n Cycle Details:")
    for i, trace in enumerate(summary['traces'], 1):
        if trace['duration']:
            print(f"   Cycle {i}: {trace['duration']:.2f}s")


display_metrics(result)
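
Token counts become easier to compare across models once they are converted to an estimated cost. The sketch below assumes a usage dict with the same `inputTokens` and `outputTokens` keys printed above. `estimate_cost` is a hypothetical helper, and the prices are placeholders rather than real Bedrock pricing.

```python
def estimate_cost(usage, input_price_per_1k, output_price_per_1k):
    """Estimate request cost in dollars from a token usage dict."""
    input_cost = usage["inputTokens"] / 1000 * input_price_per_1k
    output_cost = usage["outputTokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost


# Made-up usage and placeholder prices, for illustration only
sample_usage = {"inputTokens": 2000, "outputTokens": 500, "totalTokens": 2500}
cost = estimate_cost(sample_usage, input_price_per_1k=0.001, output_price_per_1k=0.004)
print(f"Estimated cost: ${cost:.4f}")  # Estimated cost: $0.0040
```

Input and output tokens are priced separately because most providers charge different rates for each.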

## 14. Conclusions and Key Findings

### Strands Agent Evaluation Framework Summary

This comprehensive evaluation framework demonstrates multiple approaches for assessing Strands agent performance:

**🎯 Accuracy Assessment:**
- Ground truth validation using structured outputs (XML tags)
- Population and area estimation error calculations
- Multi-city consistency testing

**⚡ Performance Monitoring:**
- Strands `AgentResult.metrics` for comprehensive analysis
- Token usage tracking for cost optimization
- Execution time and cycle duration measurement
- Tool call frequency and success rate analysis

**🔧 Tool Effectiveness:**
- Tool selection accuracy across different task types
- Multi-tool coordination assessment
- Tool execution success rate monitoring

**📊 Advanced Evaluation Techniques:**
- LLM-as-a-Judge for qualitative assessment
- Batch evaluation for consistency analysis
- Comparative model performance analysis

### Key Insights:

**Model Performance Patterns:**
- Larger models (Claude, Nova Pro) show better accuracy but higher token costs
- Smaller models (Nova Micro) are faster but less reliable with complex instructions
- Tool selection accuracy varies significantly between model families

**Strands Framework Benefits:**
- Built-in observability provides comprehensive performance metrics
- Agent cycle tracking enables detailed execution analysis
- Tool metrics facilitate optimization of agent capabilities
- Structured evaluation supports production monitoring

**Evaluation Methodology Learnings:**
- Structured output requirements (XML tags) are crucial for automated evaluation
- Multi-city testing reveals consistency issues not apparent in single-case tests
- LLM-as-a-Judge provides valuable qualitative insights
- Tool call efficiency is as important as accuracy for production deployments

### Production Recommendations:

1. **Implement structured outputs** for automated evaluation pipelines
2. **Use Strands metrics** for continuous performance monitoring
3. **Establish accuracy baselines** using ground truth datasets
4. **Monitor tool success rates** for reliability assessment
5. **Track token efficiency** for cost optimization
6. **Deploy LLM-as-a-Judge** for qualitative response evaluation

### Future Enhancements:

**Expanded Test Coverage:**
- Additional domains beyond city demographics
- More complex multi-step reasoning tasks
- Real-time data accuracy validation

**Advanced Metrics:**
- Semantic similarity scoring for text outputs
- Confidence calibration analysis
- Error pattern classification

**Automation Improvements:**
- Continuous evaluation pipelines
- A/B testing frameworks
- Performance regression detection
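
As a rough illustration of performance regression detection, the sketch below compares a current summary against a stored baseline and flags any metric that got worse by more than a tolerance. `find_regressions` is a hypothetical helper, the numbers are made up, and every metric is assumed to be lower-is-better, as with the error, token, and time fields returned by `evaluate_multiple_cities`.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics where current exceeds baseline by more than tolerance.

    All metrics are treated as lower-is-better (errors, tokens, seconds).
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            continue  # metric missing from the current run; nothing to compare
        if new_value > base_value * (1 + tolerance):
            regressions[name] = (base_value, new_value)
    return regressions


# Made-up summaries, for illustration only
baseline = {"avg_population_error": 2.0, "total_tokens": 9000, "avg_time_per_city": 4.0}
current = {"avg_population_error": 2.1, "total_tokens": 12000, "avg_time_per_city": 4.2}
print(find_regressions(baseline, current))
# {'total_tokens': (9000, 12000)}
```

The tolerance allows for run-to-run noise, which is considerable here because the cities are sampled at random and web search results change over time.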

---

**📚 Reference**: This evaluation framework follows Strands documentation best practices for agent observability and performance measurement. For more details, see: https://strandsagents.com/latest/documentation/docs/user-guide/observability-evaluation/evaluation/

**🔗 Repository**: Save this notebook and datasets for reproducible evaluations and comparative analysis across different model versions and configurations.