# City Search Agent Evaluation Framework

## Overview

This notebook evaluates the city search agent running on Amazon Bedrock AgentCore Runtime.

### Prerequisite
Run 05-04-01-Agentic-Metrics-AgentCore.ipynb before this notebook.

### Evaluation Strategy

- **Multi-dimensional Quality Assessment**: Helpfulness, accuracy, clarity, professionalism, completeness
- **Tool Usage Analysis**: web_search tool usage patterns
- **Performance Metrics**: Response times and success rates
- **LLM-as-Judge**: Claude Sonnet 4 scores each response on a 1-5 rubric

In [27]:
!pip install boto3 requests


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Let's retrieve the citysearch agent ARN stored by the prerequisite notebook and set up the evaluator model to be used as the judge LLM

In [28]:
# Configuration
%store -r citysearch_agent_arn
AGENT_NAME = "citysearch" 
EVALUATOR_MODEL = "us.anthropic.claude-sonnet-4-20250514-v1:0"
AGENT_ENDPOINT = citysearch_agent_arn
print(f"✅ Configured for agent {AGENT_NAME}, with endpoint {AGENT_ENDPOINT}")

✅ Configured for agent citysearch, with endpoint arn:aws:bedrock-agentcore:us-east-1:146666888814:runtime/citysearch-583XaH96wT


In [29]:
import asyncio
import json
import time
import uuid
import boto3
import requests
from dataclasses import dataclass, asdict
from typing import Dict, List, Any, Optional
from datetime import datetime


# AWS clients
bedrock = boto3.client('bedrock-runtime')
print("✅ Dependencies loaded successfully")

✅ Dependencies loaded successfully


## Step 1) Let's define classes for test cases and evaluation results

Three test cases are defined for the city search agent:

- **Basic greeting**: Tests politeness; expects no tools
- **Population query**: Tests factual lookup; expects the web_search tool
- **Area query**: Tests measurement data; expects the web_search tool

Each test defines what tools should be used and what criteria make a good response, enabling automated evaluation of agent performance.

In [30]:
@dataclass
class TestCase:
    id: str
    query: str
    category: str
    expected_tools: List[str]
    expected_criteria: Dict[str, Any]
    description: str

@dataclass
class EvaluationResult:
    test_case_id: str
    query: str
    response: str
    metrics: Dict[str, float]
    response_time: float
    success: bool
    error_message: Optional[str] = None
    tool_calls: Optional[List[str]] = None
    
    def to_dict(self):
        # Convert to dict manually to avoid serialization issues
        return {
            "test_case_id": self.test_case_id,
            "query": self.query,
            "response": str(self.response),  # Ensure string conversion
            "metrics": dict(self.metrics),
            "response_time": self.response_time,
            "success": self.success,
            "error_message": self.error_message,
            "tool_calls": list(self.tool_calls) if self.tool_calls else []
        }

# Test cases for city search agent
TEST_CASES = [
    TestCase(
        id="basic_greeting",
        query="Hi, I need help with finding information about cities",
        category="basic_inquiry",
        expected_tools=[],
        expected_criteria={"should_be_polite": True, "should_ask_for_details": True},
        description="Basic greeting and help request"
    ),
    TestCase(
        id="city_population_search",
        query="What is the population of Seattle?",
        category="population_inquiry",
        expected_tools=["web_search"],
        expected_criteria={"should_provide_population": True, "should_be_accurate": True},
        description="City population information request"
    ),
    TestCase(
        id="city_area_search",
        query="How large is Los Angeles in square miles?",
        category="area_inquiry",
        expected_tools=["web_search"],
        expected_criteria={"should_provide_area": True, "should_be_clear": True},
        description="City area information request"
    )
]

print(f"✅ Loaded {len(TEST_CASES)} test cases")

✅ Loaded 3 test cases


## Step 2) Define the functions that invoke the agent, capturing both the LLM response and tool use
Tool use is detected primarily through AWS X-Ray traces: the implementation looks for gen_ai.tool.name annotations on trace segments and their direct subsegments. If no trace data is found, it falls back to keyword analysis of the response text. The fallback is a heuristic, so it can report web_search for any response that merely contains phrases such as "based on" or "population of".
The implementation below shows the agent invocation and the tool detection.

In [31]:
# AgentCore client using bedrock-agentcore service
agentcore_client = boto3.client('bedrock-agentcore', region_name='us-east-1')

async def invoke_agent(query: str) -> Dict[str, Any]:
    """Invoke the agent using AgentCore Runtime"""
    start_time = time.time()
    
    try:
        payload = json.dumps({"prompt": query})
        session_id = f"eval-session-{uuid.uuid4()}"
        
        response = agentcore_client.invoke_agent_runtime(
            agentRuntimeArn=AGENT_ENDPOINT,
            runtimeSessionId=session_id,
            payload=payload,
            qualifier="DEFAULT"
        )
        
        print("AgentCore Response keys:", list(response.keys()))
        
        # Extract response text from StreamingBody
        response_text = ""
        if isinstance(response, dict) and 'response' in response:
            streaming_body = response['response']
            if hasattr(streaming_body, 'read'):
                response_text = streaming_body.read().decode('utf-8')
                print(f"Extracted response text: {response_text[:200]}...")
            else:
                response_text = str(streaming_body)
        else:
            response_text = str(response)
            
        # Extract tool calls from response metadata or content
        tool_calls = extract_tool_calls_from_agentcore_observability(response, response_text, session_id)
        
        return {
            "response": response_text,
            "success": True,
            "tool_calls": tool_calls,
            "response_time": time.time() - start_time,
            "session_id": session_id
        }
        
    except Exception as e:
        error_msg = f"Error invoking agent: {str(e)}"
        print(error_msg)
        return {
            "response": error_msg,
            "success": False,
            "tool_calls": [],
            "response_time": time.time() - start_time
        }

def extract_tool_calls_from_agentcore_observability(response_obj, response_text, session_id) -> List[str]:
    """Extract tool calls using AgentCore observability gen_ai.tool.name and tool.status."""
    tools = []
    
    # Query X-Ray for gen_ai.tool.name spans
    if session_id:
        try:
            xray_client = boto3.client('xray')
            print("Session id", session_id)
            # Get traces with gen_ai annotations
            response = xray_client.get_trace_summaries(
                TimeRangeType='Service',
                StartTime=time.time() - 300,
                EndTime=time.time(),
                FilterExpression=f'annotation.session_id = "{session_id}"'
            )
            
            for trace_summary in response.get('TraceSummaries', []):
                trace_response = xray_client.batch_get_traces(TraceIds=[trace_summary['Id']])
                
                for trace in trace_response.get('Traces', []):
                    for segment in trace.get('Segments', []):
                        segment_doc = json.loads(segment['Document'])
                        
                        # Check for gen_ai.tool.name in annotations
                        annotations = segment_doc.get('annotations', {})
                        if 'gen_ai.tool.name' in annotations:
                            tool_name = annotations['gen_ai.tool.name']
                            tool_status = annotations.get('tool.status', 'success')
                            
                            # Only include successful tool calls
                            if tool_status in ['success', 'completed']:
                                tools.append(tool_name)
                        
                        # Check subsegments
                        for subsegment in segment_doc.get('subsegments', []):
                            sub_annotations = subsegment.get('annotations', {})
                            if 'gen_ai.tool.name' in sub_annotations:
                                tool_name = sub_annotations['gen_ai.tool.name']
                                tool_status = sub_annotations.get('tool.status', 'success')
                                
                                if tool_status in ['success', 'completed']:
                                    tools.append(tool_name)
                                    
        except Exception as e:
            print(f"X-Ray observability extraction failed: {e}")
    
    # Enhanced fallback to content analysis if no observability data
    if not tools and response_text:
        response_lower = str(response_text).lower()
        
        # Look for web search indicators
        web_search_indicators = [
            "gathered from reliable sources", "based on", "web search", "search results",
            "according to", "information shows", "data indicates", "results show",
            "found that", "research shows", "sources indicate", "data suggests",
            "population of", "area of", "square miles", "residents", "million people",
            "thousand people", "sq mi", "km²", "census data", "demographic",
            "approximately", "estimated", "as of", "current population", "latest data"
        ]
        
        if any(phrase in response_lower for phrase in web_search_indicators):
            tools.append("web_search")
            print("Tool detected via content analysis: web_search")
    
    return list(set(tools))

print("✅ Agent invocation functions defined")

✅ Agent invocation functions defined
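
The X-Ray lookup above checks each segment and its direct subsegments only. Segment documents can nest subsegments more deeply, so a recursive walk is one possible way to cover every level. The sketch below is not part of the original notebook: `collect_tool_names` and the sample segment document are hypothetical names and data introduced for illustration, and the status rule (a missing tool.status counts as success) simply mirrors the code above.

```python
from typing import Any, Dict, List


def collect_tool_names(node: Dict[str, Any]) -> List[str]:
    """Recursively gather gen_ai.tool.name values from a segment document."""
    found: List[str] = []
    annotations = node.get("annotations", {})
    if "gen_ai.tool.name" in annotations:
        # Same rule as the notebook: a missing tool.status is treated as success
        status = annotations.get("tool.status", "success")
        if status in ("success", "completed"):
            found.append(annotations["gen_ai.tool.name"])
    # Descend into nested subsegments at any depth
    for child in node.get("subsegments", []):
        found.extend(collect_tool_names(child))
    return found


# Hypothetical segment document, for illustration only
sample_segment = {
    "name": "agent",
    "subsegments": [
        {
            "name": "model_call",
            "subsegments": [
                {"name": "tool", "annotations": {"gen_ai.tool.name": "web_search"}},
                {
                    "name": "tool",
                    "annotations": {
                        "gen_ai.tool.name": "calculator",
                        "tool.status": "error",
                    },
                },
            ],
        }
    ],
}

print(collect_tool_names(sample_segment))  # ['web_search']
```

Inside the extraction function, a helper like this could be called once per parsed segment document in place of the two separate annotation checks.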


## Step 3) Next, let's configure the prompt that asks the evaluator model to score the generated response against the expected criteria. The evaluator model scores response quality, and a separate function scores tool use

In [32]:
async def evaluate_response_quality(query: str, response: str, criteria: Dict[str, Any]) -> Dict[str, float]:
    """Evaluate response quality using Claude as judge"""
    
    evaluation_prompt = f"""
    You are an expert evaluator for city search AI agents. Evaluate the following response on a scale of 1-5 for each metric.

    Customer Query: {query}
    Agent Response: {response}

    Evaluate on these metrics (1=Poor, 2=Below Average, 3=Average, 4=Good, 5=Excellent):

    1. HELPFULNESS: Does the response address the user's needs and provide useful information?
    2. ACCURACY: Is the information provided factually correct and reliable?
    3. CLARITY: Is the response clear, well-structured, and easy to understand?
    4. PROFESSIONALISM: Does the response maintain appropriate tone and professionalism?
    5. COMPLETENESS: Does the response fully address all aspects of the query?

    Expected criteria: {json.dumps(criteria, indent=2)}

    Respond with ONLY a JSON object in this format:
    {{
        "helpfulness": <score>,
        "accuracy": <score>,
        "clarity": <score>,
        "professionalism": <score>,
        "completeness": <score>,
        "reasoning": "Brief explanation of scores"
    }}
    """
    
    try:
        response_obj = bedrock.invoke_model(
            modelId=EVALUATOR_MODEL,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1000,
                "messages": [
                    {"role": "user", "content": evaluation_prompt}
                ]
            })
        )
        
        result = json.loads(response_obj['body'].read())
        content = result['content'][0]['text']
        
        # Extract JSON from response
        start_idx = content.find('{')
        end_idx = content.rfind('}') + 1
        json_str = content[start_idx:end_idx]
        
        scores = json.loads(json_str)
        return {k: v for k, v in scores.items() if k != "reasoning"}
        
    except Exception as e:
        print(f"Error in quality evaluation: {e}")
        return {
            "helpfulness": 0.0,
            "accuracy": 0.0,
            "clarity": 0.0,
            "professionalism": 0.0,
            "completeness": 0.0
        }

def evaluate_tool_usage(expected_tools: List[str], actual_tools: List[str]) -> float:
    """Evaluate tool usage effectiveness"""
    if not expected_tools:
        return 5.0 if not actual_tools else 3.0
    
    if not actual_tools:
        print(f"Expected tools {expected_tools}, but no tools were called")
        return 0.0
    
    expected_set = set(expected_tools)
    actual_set = set(actual_tools)
    
    precision = len(expected_set.intersection(actual_set)) / len(actual_set) if actual_set else 0
    recall = len(expected_set.intersection(actual_set)) / len(expected_set) if expected_set else 0
    
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return f1 * 5  # Scale to 0-5

print("✅ Evaluation functions defined")

✅ Evaluation functions defined
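
To make the tool-usage score concrete, the sketch below restates the same precision/recall/F1 logic as a standalone function and runs it on a few inputs. The name `tool_usage_score` and the `calculator` tool are assumptions introduced for illustration; only web_search appears in the notebook's test cases.

```python
from typing import List


def tool_usage_score(expected_tools: List[str], actual_tools: List[str]) -> float:
    """Standalone restatement of the F1-based tool score, scaled to 0-5."""
    if not expected_tools:
        # No tool expected: full marks if none used, partial credit otherwise
        return 5.0 if not actual_tools else 3.0
    if not actual_tools:
        return 0.0
    expected, actual = set(expected_tools), set(actual_tools)
    overlap = len(expected & actual)
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    if precision + recall == 0:
        return 0.0
    return 5 * 2 * precision * recall / (precision + recall)


print(tool_usage_score(["web_search"], ["web_search"]))                # exact match
print(tool_usage_score(["web_search"], ["web_search", "calculator"]))  # one extra tool
print(tool_usage_score([], ["web_search"]))                            # unexpected tool
print(tool_usage_score(["web_search"], []))                            # missing tool
```

An exact match scores 5.0. One extra tool halves precision, giving an F1 of 2/3 and a score of about 3.33. An unexpected tool on a no-tool test case scores 3.0, and a missing expected tool scores 0.0.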


## Step 4) Let's test a single test case end to end

In [33]:
async def evaluate_test_case(test_case: TestCase) -> EvaluationResult:
    """Evaluate a single test case"""
    print(f"🔍 Evaluating: {test_case.id} - {test_case.description}")
    
    # Invoke agent
    agent_result = await invoke_agent(test_case.query)
    
    if not agent_result["success"]:
        return EvaluationResult(
            test_case_id=test_case.id,
            query=test_case.query,
            response="",
            metrics={},
            response_time=agent_result["response_time"],
            success=False,
            error_message=agent_result.get("response", "Unknown error")
        )
    
    # Evaluate response quality
    quality_scores = await evaluate_response_quality(
        test_case.query,
        agent_result["response"],
        test_case.expected_criteria
    )
    
    # Evaluate tool usage
    tool_score = evaluate_tool_usage(
        test_case.expected_tools,
        agent_result["tool_calls"]
    )
    
    # Combine all metrics
    metrics = {
        **quality_scores,
        "tool_usage": tool_score,
        "response_time": agent_result["response_time"]
    }
    
    return EvaluationResult(
        test_case_id=test_case.id,
        query=test_case.query,
        response=agent_result["response"],
        metrics=metrics,
        response_time=agent_result["response_time"],
        success=True,
        tool_calls=agent_result["tool_calls"]
    )

print("✅ Test case evaluation function defined")

✅ Test case evaluation function defined


In [34]:
# Test a single case first
demo_test = TEST_CASES[1]  # Population search
demo_result = await evaluate_test_case(demo_test)

print(f"\n📊 Demo Result for '{demo_test.id}':")
print(f"Query: {demo_result.query}")
response_str = str(demo_result.response)
print(f"Response: {response_str[:200]}..." if len(response_str) > 200 else f"Response: {response_str}")
print(f"Response Time: {demo_result.response_time:.3f}s")
print(f"Tool Calls: {demo_result.tool_calls}")
print(f"Success: {demo_result.success}")
print(f"Metrics: {demo_result.metrics}")

🔍 Evaluating: city_population_search - City population information request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "As an AI, I don't have real-time data access, but as of the most recent estimates from the U.S. Census Bureau, Seattle, Washington, had a population of around 740,000 as of 2020. For the most up-to-d...
Session id eval-session-68787f58-a081-40ac-84ce-fb7a6a68e359
Tool detected via content analysis: web_search

📊 Demo Result for 'city_population_search':
Query: What is the population of Seattle?
Response: "As an AI, I don't have real-time data access, but as of the most recent estimates from the U.S. Census Bureau, Seattle, Washington, had a population of around 740,000 as of 2020. For the most up-to-d...
Response Time: 3.157s
Tool Calls: ['web_search']
Success: True
Metrics: {'helpfulness': 4, 'accuracy': 4, 'clarity': 5, 'professionalism': 5, 'completeness': 

## Step 5) Now that the evaluation works end to end for a single test case, let's run through all the test cases

In [35]:
async def run_full_evaluation(test_cases: List[TestCase]) -> Dict[str, Any]:
    """Run evaluation on all test cases"""
    print(f"🚀 Starting evaluation of {len(test_cases)} test cases...")
    
    results = []
    for i, test_case in enumerate(test_cases, 1):
        print(f"\n[{i}/{len(test_cases)}] Processing: {test_case.id}")
        result = await evaluate_test_case(test_case)
        results.append(result)
        
        # Brief pause between tests
        await asyncio.sleep(1)
    
    # Calculate summary statistics
    summary = calculate_summary(results)
    
    return {
        "agent_name": AGENT_NAME,
        "total_test_cases": len(test_cases),
        "results": [result.to_dict() for result in results],
        "summary": summary,
        "timestamp": datetime.now().isoformat()
    }

def calculate_summary(results: List[EvaluationResult]) -> Dict[str, Any]:
    """Calculate summary statistics"""
    successful_results = [r for r in results if r.success]
    
    if not successful_results:
        return {"error": "No successful test cases"}
    
    # Average scores
    metrics = ["helpfulness", "accuracy", "clarity", "professionalism", "completeness", "tool_usage"]
    avg_scores = {}
    
    for metric in metrics:
        scores = [r.metrics.get(metric, 0) for r in successful_results if metric in r.metrics]
        avg_scores[metric] = sum(scores) / len(scores) if scores else 0
    
    # Response time statistics
    response_times = sorted([r.response_time for r in successful_results])
    n = len(response_times)
    
    percentiles = {
        "p50": response_times[n//2] if n > 0 else 0,
        "p90": response_times[int(n*0.9)] if n > 0 else 0,
        "p95": response_times[int(n*0.95)] if n > 0 else 0,
        "p99": response_times[int(n*0.99)] if n > 0 else 0,
    }
    
    return {
        "success_rate": len(successful_results) / len(results),
        "average_scores": avg_scores,
        "overall_score": sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0,
        "response_time_percentiles": percentiles,
        "total_successful": len(successful_results),
        "total_failed": len(results) - len(successful_results)
    }

print("✅ Full evaluation functions defined")

✅ Full evaluation functions defined
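
The percentile lookup in the summary indexes directly into the sorted response times. The sketch below isolates that lookup to show how it behaves on a small sample. The function name `index_percentiles` and the response times are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def index_percentiles(values: List[float]) -> Dict[str, float]:
    """Same index-based lookup as the summary: ordered[int(n * q)]."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return {"p50": 0, "p90": 0, "p95": 0, "p99": 0}
    return {
        "p50": ordered[n // 2],
        "p90": ordered[int(n * 0.9)],
        "p95": ordered[int(n * 0.95)],
        "p99": ordered[int(n * 0.99)],
    }


# Hypothetical response times in seconds
print(index_percentiles([3.2, 1.1, 2.4]))
# {'p50': 2.4, 'p90': 3.2, 'p95': 3.2, 'p99': 3.2}
```

With only three test cases, p90, p95, and p99 all resolve to the slowest response, so the upper percentiles become informative only with a larger test set.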


In [36]:
# Run full evaluation
evaluation_results = await run_full_evaluation(TEST_CASES)

print("\n" + "="*60)
print("📊 EVALUATION COMPLETE")
print("="*60)

# Display results
summary = evaluation_results.get("summary", {})
print(f"\n🤖 Agent: {evaluation_results['agent_name']}")
print(f"📝 Total Test Cases: {evaluation_results['total_test_cases']}")
print(f"✅ Success Rate: {summary.get('success_rate', 0):.1%}")
print(f"🎯 Overall Score: {summary.get('overall_score', 0):.2f}/5.0")

print("\n📈 QUALITY METRICS (1-5 scale):")
avg_scores = summary.get("average_scores", {})
for metric, score in avg_scores.items():
    if metric != "response_time":
        emoji = "🟢" if score >= 4.0 else "🟡" if score >= 3.0 else "🔴"
        print(f"  {emoji} {metric.title()}: {score:.2f}")

print("\n⏱️  RESPONSE TIME PERCENTILES:")
percentiles = summary.get("response_time_percentiles", {})
for p, time_val in percentiles.items():
    print(f"  {p.upper()}: {time_val:.3f}s")

🚀 Starting evaluation of 3 test cases...

[1/3] Processing: basic_greeting
🔍 Evaluating: basic_greeting - Basic greeting and help request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "Of course, I'd be happy to help you find information about cities! Here are a few general categories and tips on how to gather information about any city you're interested in:\n\n### General Informat...
Session id eval-session-26d8f0e7-056f-48be-b886-235848608847
Tool detected via content analysis: web_search

[2/3] Processing: city_population_search
🔍 Evaluating: city_population_search - City population information request
AgentCore Response keys: ['ResponseMetadata', 'runtimeSessionId', 'traceId', 'baggage', 'contentType', 'statusCode', 'response']
Extracted response text: "The population of Seattle, according to the 2020 United States census, is 737,015.\n\nHere is the information provided in XM

In [37]:
# Save results to file
output_file = f"evaluation_results_{AGENT_NAME}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

with open(output_file, 'w') as f:
    json.dump(evaluation_results, f, indent=2)

print(f"💾 Results saved to: {output_file}")

💾 Results saved to: evaluation_results_citysearch_20250916_185225.json
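
Once results are saved, a later run or a CI job might want to flag weak metrics automatically. The sketch below is one possible approach, not part of the original notebook: `metrics_below` is a hypothetical helper, the sample scores are invented for illustration, and the 4.0 threshold reuses the green cutoff from the display code above.

```python
import json
from typing import Any, Dict, List


def metrics_below(results: Dict[str, Any], threshold: float = 4.0) -> List[str]:
    """Return the names of averaged metrics that fall under the threshold."""
    averages = results.get("summary", {}).get("average_scores", {})
    return sorted(name for name, score in averages.items() if score < threshold)


# Hypothetical results with the same shape as the saved file (scores invented)
sample_results = json.loads("""
{
  "agent_name": "citysearch",
  "summary": {
    "average_scores": {"helpfulness": 4.3, "accuracy": 3.7, "tool_usage": 4.3}
  }
}
""")

print(metrics_below(sample_results))  # ['accuracy']
```

To apply it to a real run, load the saved file with `json.load` and pass the resulting dictionary to the helper.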
