# City Search Agent Evaluation Framework

## Overview

This notebook evaluates the city search agent running on Amazon Bedrock AgentCore Runtime.

### Prerequisite
Run 05-04-01-Agentic-Metrics-AgentCore.ipynb before this notebook.

### Evaluation Strategy

- **Multi-dimensional Quality Assessment**: Helpfulness, accuracy, clarity, professionalism, completeness
- **Tool Usage Analysis**: web_search tool usage patterns
- **Performance Metrics**: Response times and success rates
- **LLM-as-Judge**: Claude Sonnet for objective evaluation

In [1]:
!pip install boto3 requests


## Let's retrieve the citysearch agent ARN stored earlier and set up the evaluator model to be used as the judge LLM

In [2]:
# Configuration
%store -r citysearch_agent_arn
AGENT_NAME = "citysearch" 
EVALUATOR_MODEL = "us.anthropic.claude-sonnet-4-20250514-v1:0"
AGENT_ENDPOINT = citysearch_agent_arn
AGENT_QUALIFIER="DEFAULT"
print(f"✅ Configured for agent {AGENT_NAME}, with endpoint {AGENT_ENDPOINT}")

✅ Configured for agent citysearch, with endpoint arn:aws:bedrock-agentcore:us-east-1:XXXXXXXXXXXX:runtime/citysearch-583XaH96wT


In [3]:
import asyncio
import json
import time
import uuid
import boto3
import requests
from dataclasses import dataclass, asdict
from typing import Dict, List, Any, Optional
from datetime import datetime


# AWS clients
bedrock = boto3.client('bedrock-runtime')
print("✅ Dependencies loaded successfully")

✅ Dependencies loaded successfully


## Step 1) Let's define classes for test cases and evaluation results

Three specific tests for a city search agent:

- **Basic greeting**: tests politeness; expects no tools
- **Population query**: tests factual lookup; expects the web_search tool
- **Area query**: tests measurement data; expects the web_search tool

Each test defines which tools should be used and which criteria make a good response, enabling automated evaluation of agent performance.

In [4]:
@dataclass
class TestCase:
    id: str
    query: str
    category: str
    expected_tools: List[str]
    expected_criteria: Dict[str, Any]
    description: str

@dataclass
class EvaluationResult:
    test_case_id: str
    query: str
    response: str
    metrics: Dict[str, float]
    response_time: float
    success: bool
    error_message: Optional[str] = None
    tool_calls: Optional[List[str]] = None  # None when the invocation failed before tools were inspected
    
    def to_dict(self):
        # Convert to dict manually to avoid serialization issues
        return {
            "test_case_id": self.test_case_id,
            "query": self.query,
            "response": str(self.response),  # Ensure string conversion
            "metrics": dict(self.metrics),
            "response_time": self.response_time,
            "success": self.success,
            "error_message": self.error_message,
            "tool_calls": list(self.tool_calls) if self.tool_calls else []
        }

# Test cases for city search agent
TEST_CASES = [
    TestCase(
        id="basic_greeting",
        query="Hi, I need help with finding information about cities",
        category="basic_inquiry",
        expected_tools=[],
        expected_criteria={"should_be_polite": True, "should_ask_for_details": True},
        description="Basic greeting and help request"
    ),
    TestCase(
        id="city_population_search",
        query="What is the population of Seattle?",
        category="population_inquiry",
        expected_tools=["web_search"],
        expected_criteria={"should_provide_population": True, "should_be_accurate": True},
        description="City population information request"
    ),
    TestCase(
        id="city_area_search",
        query="How large is Los Angeles in square miles?",
        category="area_inquiry",
        expected_tools=["web_search"],
        expected_criteria={"should_provide_area": True, "should_be_clear": True},
        description="City area information request"
    )
]

print(f"✅ Loaded {len(TEST_CASES)} test cases")

✅ Loaded 3 test cases


## Step 2) Define the methods to invoke the agent and capture its response and tool use
Tool use is detected from the AgentCore observability logs in CloudWatch: the implementation filters the runtime's log group by the session ID and collects the toolUse entries found in the logged output messages.
The implementation below shows the agent invocation and the tool-use extraction.
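The log group queried for each session is derived from the agent ARN and the endpoint qualifier. A minimal sketch of that naming convention, using the masked ARN printed earlier in this notebook (the helper name `log_group_for_runtime` is illustrative, not part of any AWS SDK):

```python
def log_group_for_runtime(agent_arn: str, qualifier: str,
                          prefix: str = "/aws/bedrock-agentcore/runtimes") -> str:
    """Build the CloudWatch log group name from the runtime ARN and endpoint qualifier."""
    # The runtime ID is the last path segment of the ARN
    runtime_id = agent_arn.split("/")[-1]
    return f"{prefix}/{runtime_id}-{qualifier}"

# Masked ARN as printed by the configuration cell
arn = "arn:aws:bedrock-agentcore:us-east-1:XXXXXXXXXXXX:runtime/citysearch-583XaH96wT"
print(log_group_for_runtime(arn, "DEFAULT"))
# /aws/bedrock-agentcore/runtimes/citysearch-583XaH96wT-DEFAULT
```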

In [5]:
# AgentCore client using bedrock-agentcore service
agentcore_client = boto3.client('bedrock-agentcore', region_name='us-east-1')

async def invoke_agent(query: str) -> Dict[str, Any]:
    """Invoke the agent using AgentCore Runtime"""
    start_time = time.time()
    
    try:
        payload = json.dumps({"prompt": query})
        session_id = f"eval-session-{uuid.uuid4()}"
        
        response = agentcore_client.invoke_agent_runtime(
            agentRuntimeArn=AGENT_ENDPOINT,
            runtimeSessionId=session_id,
            payload=payload,
            qualifier=AGENT_QUALIFIER
        )
        
        print("AgentCore Response keys:", list(response.keys()))
        
        # Extract response text from StreamingBody
        response_text = ""
        if isinstance(response, dict) and 'response' in response:
            streaming_body = response['response']
            if hasattr(streaming_body, 'read'):
                response_text = streaming_body.read().decode('utf-8')
                print(f"Extracted response text: {response_text[:200]}...")
            else:
                response_text = str(streaming_body)
        else:
            response_text = str(response)
            
        # Extract tool calls from the CloudWatch logs recorded for this session
        log_group_name = extract_agent_log_name(AGENT_ENDPOINT)
        tool_calls = extract_tool_calls_from_agentcore_observability(session_id, log_group_name, AGENT_QUALIFIER) 
        
        return {
            "response": response_text,
            "success": True,
            "tool_calls": tool_calls,
            "response_time": time.time() - start_time,
            "session_id": session_id
        }
        
    except Exception as e:
        error_msg = f"Error invoking agent: {str(e)}"
        print(error_msg)
        return {
            "response": error_msg,
            "success": False,
            "tool_calls": [],
            "response_time": time.time() - start_time
        }

def extract_agent_log_name(arn):
    return arn.split('/')[-1]


# The functions below read tool calls from CloudWatch Logs
def extract_tool_calls_from_agentcore_observability(session_id, log_group_name, agent_qualifier, log_group_prefix='/aws/bedrock-agentcore/runtimes'):
    logs_client = boto3.client('logs')
    log_group_name = f"{log_group_prefix}/{log_group_name}-{agent_qualifier}"
    print("Log group name", log_group_name)
    response = logs_client.filter_log_events(
        logGroupName=log_group_name,
        filterPattern=session_id
    )
    logs_list = [event['message'] for event in response['events']]
    return extract_logs_for_session(logs_list)

def extract_tooluse_from_log(log_message):
    tools = []
    try:
        log_data = json.loads(log_message)
        
        # Navigate to output messages
        messages = log_data.get('body', {}).get('output', {}).get('messages', [])
        
        for message in messages:
            content = message.get('content', {}).get('content', '')
            
            # Parse the content JSON string
            if content:
                content_data = json.loads(content)
                # Extract toolUse from each item
                for item in content_data:
                    if 'toolUse' in item:
                        tool_name = item['toolUse']['name']
                        tools.append(tool_name)
                        
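    except (json.JSONDecodeError, KeyError, AttributeError, TypeError) as e:
        # Also covers log lines that are valid JSON but not in the expected nested shape
        print(f"Error parsing log: {e}")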
    
    return tools

def extract_logs_for_session(logs_list):
    all_tools = []
    for log in logs_list:
        tools = extract_tooluse_from_log(log)
        all_tools.extend(tools)
    return list(set(all_tools))

print("✅ Agent invocation functions defined")

✅ Agent invocation functions defined
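To make the parsing logic concrete, here is a small sketch run against a hand-written log record. The record's shape (`body.output.messages[].content.content` holding a JSON-encoded list of items) is inferred from the parser above, and the text and tool input are made up; real AgentCore log events may carry additional fields. The function name `tools_in_log` is hypothetical.

```python
import json

def tools_in_log(log_message: str) -> list:
    """Return tool names found in one log message (mirrors extract_tooluse_from_log)."""
    tools = []
    try:
        log_data = json.loads(log_message)
        messages = log_data.get("body", {}).get("output", {}).get("messages", [])
        for message in messages:
            content = message.get("content", {}).get("content", "")
            if content:
                # The inner content is itself a JSON string holding a list of items
                for item in json.loads(content):
                    if "toolUse" in item:
                        tools.append(item["toolUse"]["name"])
    except (json.JSONDecodeError, KeyError, AttributeError, TypeError):
        pass  # not a structured log line; treat it as "no tools"
    return tools

# Hypothetical log record built to match the shape the parser expects
sample_log = json.dumps({
    "body": {"output": {"messages": [
        {"content": {"content": json.dumps([
            {"text": "Looking that up."},
            {"toolUse": {"name": "web_search", "input": {"query": "Seattle population"}}},
        ])}}
    ]}}
})

print(tools_in_log(sample_log))         # ['web_search']
print(tools_in_log("plain text line"))  # []
```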


## Step 3) Next, let's configure the prompt that asks the evaluator model to score the generated response against the expected criteria. The evaluator model scores the LLM response; a separate function scores tool use against the expected tools

In [6]:
async def evaluate_response_quality(query: str, response: str, criteria: Dict[str, Any]) -> Dict[str, float]:
    """Evaluate response quality using Claude as judge"""
    
    evaluation_prompt = f"""
    You are an expert evaluator for city search AI agents. Evaluate the following response on a scale of 1-5 for each metric.

    Customer Query: {query}
    Agent Response: {response}

    Evaluate on these metrics (1=Poor, 2=Below Average, 3=Average, 4=Good, 5=Excellent):

    1. HELPFULNESS: Does the response address the user's needs and provide useful information?
    2. ACCURACY: Is the information provided factually correct and reliable?
    3. CLARITY: Is the response clear, well-structured, and easy to understand?
    4. PROFESSIONALISM: Does the response maintain appropriate tone and professionalism?
    5. COMPLETENESS: Does the response fully address all aspects of the query?

    Expected criteria: {json.dumps(criteria, indent=2)}

    Respond with ONLY a JSON object in this format:
    {{
        "helpfulness": <score>,
        "accuracy": <score>,
        "clarity": <score>,
        "professionalism": <score>,
        "completeness": <score>,
        "reasoning": "Brief explanation of scores"
    }}
    """
    
    try:
        response_obj = bedrock.invoke_model(
            modelId=EVALUATOR_MODEL,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1000,
                "messages": [
                    {"role": "user", "content": evaluation_prompt}
                ]
            })
        )
        
        result = json.loads(response_obj['body'].read())
        content = result['content'][0]['text']
        
        # Extract JSON from response
        start_idx = content.find('{')
        end_idx = content.rfind('}') + 1
        json_str = content[start_idx:end_idx]
        
        scores = json.loads(json_str)
        return {k: v for k, v in scores.items() if k != "reasoning"}
        
    except Exception as e:
        print(f"Error in quality evaluation: {e}")
        return {
            "helpfulness": 0.0,
            "accuracy": 0.0,
            "clarity": 0.0,
            "professionalism": 0.0,
            "completeness": 0.0
        }

def evaluate_tool_usage(expected_tools: List[str], actual_tools: List[str]) -> float:
    """Evaluate tool usage effectiveness"""
    if not expected_tools:
        return 5.0 if not actual_tools else 3.0
    
    if not actual_tools:
        print(f"Expected tools {expected_tools}, but no tools were called")
        return 0.0
    
    expected_set = set(expected_tools)
    actual_set = set(actual_tools)
    
    precision = len(expected_set.intersection(actual_set)) / len(actual_set) if actual_set else 0
    recall = len(expected_set.intersection(actual_set)) / len(expected_set) if expected_set else 0
    
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return f1 * 5  # Scale to 0-5

print("✅ Evaluation functions defined")

✅ Evaluation functions defined
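As a worked example of the tool-usage score, the sketch below re-implements the same F1 rule and applies it to three cases. The function name `tool_usage_score` and the extra tool name `calculator` are illustrative only; the agent in this notebook is only expected to call web_search.

```python
from typing import List

def tool_usage_score(expected: List[str], actual: List[str]) -> float:
    """F1 of expected vs. actual tool sets, scaled to 0-5 (same rules as evaluate_tool_usage)."""
    if not expected:
        # No tools expected: full marks if none were called, partial otherwise
        return 5.0 if not actual else 3.0
    if not actual:
        return 0.0
    expected_set, actual_set = set(expected), set(actual)
    hits = len(expected_set & actual_set)
    precision = hits / len(actual_set)
    recall = hits / len(expected_set)
    if precision + recall == 0:
        return 0.0
    return 5 * 2 * precision * recall / (precision + recall)

# Exact match: precision = recall = 1
print(tool_usage_score(["web_search"], ["web_search"]))  # 5.0
# One extra tool: precision = 0.5, recall = 1, F1 = 2/3
print(round(tool_usage_score(["web_search"], ["web_search", "calculator"]), 2))  # 3.33
# Tool called on a query that expected none
print(tool_usage_score([], ["web_search"]))  # 3.0
```

Because the score works on sets, calling the same tool several times counts once; it measures which tools were chosen, not how often.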


## Step 4) Let's test a single test case end to end

In [7]:
async def evaluate_test_case(test_case: TestCase) -> EvaluationResult:
    """Evaluate a single test case"""
    print(f"🔍 Evaluating: {test_case.id} - {test_case.description}")
    
    # Invoke agent
    agent_result = await invoke_agent(test_case.query)
    
    if not agent_result["success"]:
        return EvaluationResult(
            test_case_id=test_case.id,
            query=test_case.query,
            response="",
            metrics={},
            response_time=agent_result["response_time"],
            success=False,
            error_message=agent_result.get("response", "Unknown error")
        )
    
    # Evaluate response quality
    quality_scores = await evaluate_response_quality(
        test_case.query,
        agent_result["response"],
        test_case.expected_criteria
    )
    
    # Evaluate tool usage
    tool_score = evaluate_tool_usage(
        test_case.expected_tools,
        agent_result["tool_calls"]
    )
    
    # Combine all metrics
    metrics = {
        **quality_scores,
        "tool_usage": tool_score,
        "response_time": agent_result["response_time"]
    }
    
    return EvaluationResult(
        test_case_id=test_case.id,
        query=test_case.query,
        response=agent_result["response"],
        metrics=metrics,
        response_time=agent_result["response_time"],
        success=True,
        tool_calls=agent_result["tool_calls"]
    )

print("✅ Test case evaluation function defined")

✅ Test case evaluation function defined


In [8]:
# Test a single case first
demo_test = TEST_CASES[1]  # Population search
demo_result = await evaluate_test_case(demo_test)

print(f"\n📊 Demo Result for '{demo_test.id}':")
print(f"Query: {demo_result.query}")
response_str = str(demo_result.response)
print(f"Response: {response_str[:200]}..." if len(response_str) > 200 else f"Response: {response_str}")
print(f"Response Time: {demo_result.response_time:.3f}s")
print(f"Tool Calls: {demo_result.tool_calls}")
print(f"Success: {demo_result.success}")
print(f"Metrics: {demo_result.metrics}")

🔍 Evaluating: city_population_search - City population information request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "The population of Seattle, according to the most recent data from the US Census Bureau's population estimates, is approximately 755,078 as of 2023. \n\nHere is the information in XML format:\n\n```xm...
Log group name /aws/bedrock-agentcore/runtimes/citysearch-583XaH96wT-DEFAULT

📊 Demo Result for 'city_population_search':
Query: What is the population of Seattle?
Response: "The population of Seattle, according to the most recent data from the US Census Bureau's population estimates, is approximately 755,078 as of 2023. \n\nHere is the information in XML format:\n\n```xm...
Response Time: 5.349s
Tool Calls: ['web_search']
Success: True
Metrics: {'helpfulness': 5, 'accuracy': 4, 'clarity': 5, 'professionalism': 5, 'completeness': 5, 'tool_usage': 5.0, 'response

## Step 5) Now that the evaluation works end to end for a single test case, let's run through all the test cases

In [9]:
async def run_full_evaluation(test_cases: List[TestCase]) -> Dict[str, Any]:
    """Run evaluation on all test cases"""
    print(f"🚀 Starting evaluation of {len(test_cases)} test cases...")
    
    results = []
    for i, test_case in enumerate(test_cases, 1):
        print(f"\n[{i}/{len(test_cases)}] Processing: {test_case.id}")
        result = await evaluate_test_case(test_case)
        results.append(result)
        
        # Brief pause between tests
        await asyncio.sleep(1)
    
    # Calculate summary statistics
    summary = calculate_summary(results)
    
    return {
        "agent_name": AGENT_NAME,
        "total_test_cases": len(test_cases),
        "results": [result.to_dict() for result in results],
        "summary": summary,
        "timestamp": datetime.now().isoformat()
    }

def calculate_summary(results: List[EvaluationResult]) -> Dict[str, Any]:
    """Calculate summary statistics"""
    successful_results = [r for r in results if r.success]
    
    if not successful_results:
        return {"error": "No successful test cases"}
    
    # Average scores
    metrics = ["helpfulness", "accuracy", "clarity", "professionalism", "completeness", "tool_usage"]
    avg_scores = {}
    
    for metric in metrics:
        scores = [r.metrics.get(metric, 0) for r in successful_results if metric in r.metrics]
        avg_scores[metric] = sum(scores) / len(scores) if scores else 0
    
    # Response time statistics
    response_times = sorted([r.response_time for r in successful_results])
    n = len(response_times)
    
    percentiles = {
        "p50": response_times[n//2] if n > 0 else 0,
        "p90": response_times[int(n*0.9)] if n > 0 else 0,
        "p95": response_times[int(n*0.95)] if n > 0 else 0,
        "p99": response_times[int(n*0.99)] if n > 0 else 0,
    }
    
    return {
        "success_rate": len(successful_results) / len(results),
        "average_scores": avg_scores,
        "overall_score": sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0,
        "response_time_percentiles": percentiles,
        "total_successful": len(successful_results),
        "total_failed": len(results) - len(successful_results)
    }

print("✅ Full evaluation functions defined")

✅ Full evaluation functions defined
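The percentile calculation picks the element at index `int(n * fraction)` of the sorted response times. A short sketch with made-up timings shows what that means for a small sample; the helper name `percentile_by_index` is hypothetical.

```python
def percentile_by_index(sorted_values, fraction):
    """Pick the element at index int(n * fraction), as calculate_summary does."""
    n = len(sorted_values)
    return sorted_values[int(n * fraction)] if n else 0

# Hypothetical response times in seconds, one per test case
times = sorted([5.3, 2.1, 7.8])
for label, fraction in [("p50", 0.5), ("p90", 0.9), ("p95", 0.95), ("p99", 0.99)]:
    print(label, percentile_by_index(times, fraction))
# p50 5.3
# p90 7.8
# p95 7.8
# p99 7.8
```

With only three test cases, p90, p95 and p99 all resolve to the slowest response, so the upper percentiles become informative only once the test suite is considerably larger.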


In [10]:
# Run full evaluation
evaluation_results = await run_full_evaluation(TEST_CASES)

print("\n" + "="*60)
print("📊 EVALUATION COMPLETE")
print("="*60)

# Display results
summary = evaluation_results.get("summary", {})
print(f"\n🤖 Agent: {evaluation_results['agent_name']}")
print(f"📝 Total Test Cases: {evaluation_results['total_test_cases']}")
print(f"✅ Success Rate: {summary.get('success_rate', 0):.1%}")
print(f"🎯 Overall Score: {summary.get('overall_score', 0):.2f}/5.0")

print("\n📈 QUALITY METRICS (1-5 scale):")
avg_scores = summary.get("average_scores", {})
for metric, score in avg_scores.items():
    if metric != "response_time":
        emoji = "🟢" if score >= 4.0 else "🟡" if score >= 3.0 else "🔴"
        print(f"  {emoji} {metric.title()}: {score:.2f}")

print("\n⏱️  RESPONSE TIME PERCENTILES:")
percentiles = summary.get("response_time_percentiles", {})
for p, time_val in percentiles.items():
    print(f"  {p.upper()}: {time_val:.3f}s")

🚀 Starting evaluation of 3 test cases...

[1/3] Processing: basic_greeting
🔍 Evaluating: basic_greeting - Basic greeting and help request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "Of course, I'd be happy to help you find information about cities! Whether you're looking for details about a specific city's demographics, history, culture, geography, or any other aspect, I can gui...
Log group name /aws/bedrock-agentcore/runtimes/citysearch-583XaH96wT-DEFAULT

[2/3] Processing: city_population_search
🔍 Evaluating: city_population_search - City population information request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "<thinking>\nAccording to the latest data from the 'web_search' tool, Seattle's population has surpassed 800,000 people, as indicated by the state's Office 

In [11]:
# Save results to file
output_file = f"evaluation_results_{AGENT_NAME}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

with open(output_file, 'w') as f:
    json.dump(evaluation_results, f, indent=2)

print(f"💾 Results saved to: {output_file}")

💾 Results saved to: evaluation_results_citysearch_20250918_210208.json
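The saved JSON can be reloaded later to spot weak areas. The sketch below assumes the file layout produced by `run_full_evaluation` (a `results` list whose entries carry `test_case_id` and `metrics`); the sample scores are invented, the helper name `low_scoring` is hypothetical, and the 4.0 threshold simply reuses the "green" cutoff from the summary display.

```python
import json
import os
import tempfile

def low_scoring(results_path: str, threshold: float = 4.0) -> dict:
    """Map test_case_id -> metrics scoring below threshold (response_time excluded)."""
    with open(results_path) as f:
        data = json.load(f)
    flagged = {}
    for result in data["results"]:
        low = {m: s for m, s in result["metrics"].items()
               if m != "response_time" and s < threshold}
        if low:
            flagged[result["test_case_id"]] = low
    return flagged

# Hypothetical results in the shape written by the cell above
sample = {"results": [
    {"test_case_id": "basic_greeting",
     "metrics": {"helpfulness": 5, "tool_usage": 5.0, "response_time": 2.4}},
    {"test_case_id": "city_area_search",
     "metrics": {"helpfulness": 4, "accuracy": 3, "tool_usage": 0.0, "response_time": 6.1}},
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "evaluation_results_sample.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    print(low_scoring(path))
# {'city_area_search': {'accuracy': 3, 'tool_usage': 0.0}}
```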


## Conclusion

In this notebook, we:
1. Defined test cases that compare the agent's actual responses and tool use against expectations.
2. Used an LLM as judge to score the agent's responses.
3. Queried the agent's CloudWatch log group, filtered by the current session ID, to determine tool invocations.
