## Lab 5: AgentCore Evaluations - Online Evaluation for Customer Support Agent

### Overview

This lab demonstrates how to use AgentCore Evaluations to continuously monitor your production customer support agent from Lab 4. You'll configure online evaluation to automatically assess agent performance in real time as customers interact with it.

**Workshop Journey:**

- **Lab 1 (Done):** Create Agent Prototype - Built a functional customer support agent
- **Lab 2 (Done):** Enhance with Memory - Added conversation context and personalization
- **Lab 3 (Done):** Scale with Gateway & Identity - Shared tools across agents securely
- **Lab 4 (Done):** Deploy to Production - Used AgentCore Runtime with observability
- **Lab 5 (Current):** Evaluate Agent Performance - Monitor quality with online evaluations
- **Lab 6:** Build User Interface - Create a customer-facing application

### What You'll Learn

You'll configure online evaluation with built-in evaluators, generate test interactions, and analyze quality metrics through AgentCore Observability dashboards to improve agent performance.

### Online Evaluation Overview

Online evaluation continuously monitors deployed agents in production, whereas on-demand evaluation analyzes specific, selected interactions. It consists of three components: session sampling with configurable rules, multiple evaluation methods (built-in or custom evaluators), and monitoring through dashboards that show quality trends and support investigation of low-scoring sessions.

Since your agent runs on AgentCore Runtime, AgentCore Observability automatically instruments the code and provides comprehensive logs and traces using [OTEL](https://opentelemetry.io/) instrumentation.

### Prerequisites

Complete Lab 4 so that the customer support agent is deployed. You'll also need an AWS account with access to Amazon Bedrock AgentCore, including Evaluations permissions.

### Architecture
<div style="text-align:left">
    <img src="images/architecture_lab5_evaluation.png" width="75%"/>
</div>

*Online evaluation automatically monitors agent interactions, applies evaluators based on sampling rules, and outputs results to CloudWatch for analysis.*

### Step 1: Import Required Libraries and Initialize Clients

In [1]:
from bedrock_agentcore_starter_toolkit import Evaluation, Runtime
import json
import uuid
from pathlib import Path
from boto3.session import Session
from IPython.display import Markdown, display
from lab_helpers.utils import get_ssm_parameter, get_or_create_cognito_pool

In [2]:
boto_session = Session()
region = boto_session.region_name
print(f"Region: {region}")

Region: us-west-2


In [3]:
eval_client = Evaluation(region=region)
runtime_client = Runtime()

### Step 2: Retrieve Agent Information from Lab 4

Retrieve the customer support agent ARN from SSM Parameter Store where it was saved during Lab 4 deployment.

In [4]:
try:
    # Get agent ARN from SSM parameter store (saved in Lab 4)
    agent_arn = get_ssm_parameter("/app/customersupport/agentcore/runtime_arn")
    
    # Extract agent ID from ARN
    agent_id = agent_arn.split(":")[-1].split("/")[-1]
    
    # Set runtime client config path
    runtime_client._config_path = Path.cwd() / ".bedrock_agentcore.yaml"
    
    print("Agent ID:", agent_id)
    print("Agent ARN:", agent_arn)
except Exception as e:
    raise RuntimeError(f"Missing agent information from Lab 4. Please run lab-04-agentcore-runtime.ipynb first. Error: {e}") from e

Agent ID: customer_support_agent-v9Id4341wf
Agent ARN: arn:aws:bedrock-agentcore:us-west-2:426068478522:runtime/customer_support_agent-v9Id4341wf
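
The cell above extracts the agent ID by chained `split` calls, which silently returns a wrong value if the ARN is malformed. As a minimal sketch of a stricter alternative, the helper below (`parse_runtime_arn` is a hypothetical name, not part of the starter toolkit) splits the ARN into named parts and raises on unexpected input.

```python
def parse_runtime_arn(arn: str) -> dict:
    """Split an ARN of the form arn:partition:service:region:account:type/id.

    Hypothetical helper for illustration; not part of any AWS library.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts

    # The resource part looks like "runtime/<agent-id>".
    resource_type, _, resource_id = resource.partition("/")
    if not resource_id:
        raise ValueError(f"ARN has no resource ID: {arn!r}")

    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account_id": account_id,
        "resource_type": resource_type,
        "resource_id": resource_id,
    }
```

With the ARN printed above, `parse_runtime_arn(agent_arn)["resource_id"]` yields the same agent ID as the notebook cell.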


### Step 3: Create Online Evaluation Configuration

Now let's create an online evaluation configuration for our customer support agent. We'll use built-in evaluators to assess different aspects of agent performance:

- **Builtin.GoalSuccessRate** - Measures how well the agent achieves user goals
- **Builtin.Correctness** - Evaluates factual accuracy of responses
- **Builtin.ToolSelectionAccuracy** - Evaluates appropriate tool selection

We'll set the sampling rate to 100% for demonstration purposes, but in production you might use a lower rate (e.g., 10-20%) based on your traffic volume.

In [5]:
response = eval_client.create_online_config(
    agent_id=agent_id,
    config_name="customer_support_agent_eval",
    sampling_rate=100,  # Evaluate 100% of sessions for demo
    evaluator_list=[
        "Builtin.GoalSuccessRate", 
        "Builtin.Correctness",
        "Builtin.ToolSelectionAccuracy"
    ],
    config_description="Customer support agent online evaluation",
    auto_create_execution_role=True
)

print("Online evaluation configuration created successfully!")
print(f"Configuration ID: {response['onlineEvaluationConfigId']}")

Creating online evaluation config: customer_support_agent_eval for agent: customer_support_agent-v9Id4341wf
Configuration: sampling_rate=100.0%, evaluators=['Builtin.GoalSuccessRate', 'Builtin.Correctness', 'Builtin.ToolSelectionAccuracy']
Creating online evaluation config: customer_support_agent_eval for agent: customer_support_agent-v9Id4341wf
Auto-creating execution role for config: customer_support_agent_eval
Getting or creating evaluation execution role for config: customer_support_agent_eval
Using AWS region: us-west-2, account ID: 426068478522
Role name: AgentCoreEvalsSDK-us-west-2-d04ba7b68b
Role doesn't exist, creating new evaluation execution role: AgentCoreEvalsSDK-us-west-2-d04ba7b68b
Creating IAM role: AgentCoreEvalsSDK-us-west-2-d04ba7b68b
✓ Role created: arn:aws:iam::426068478522:role/AgentCoreEvalsSDK-us-west-2-d04ba7b68b
✓ Execution policy attached: AgentCoreEvaluationPolicy-us-west-2-d04ba7b68b
Waiting for IAM role propagation...
Role creation complete and ready f

Online evaluation configuration created successfully!
Configuration ID: customer_support_agent_eval-VuUOlc4onO
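
To build intuition for what a sampling rate below 100% means, here is a small sketch of deterministic, hash-based session sampling. This is an illustration only: it is an assumption about how percentage sampling can work in general, not a description of how AgentCore selects sessions, and `is_sampled` is a hypothetical function.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate: float) -> bool:
    """Return True if a session falls inside the sampled percentage.

    sampling_rate is a percentage from 0 to 100, matching the notebook's
    sampling_rate argument. Illustrative only; not AgentCore's algorithm.
    """
    if not 0 <= sampling_rate <= 100:
        raise ValueError("sampling_rate must be between 0 and 100")

    # Hash the session ID so the same session always gets the same decision.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999

    # A rate of 20 keeps buckets 0..1999, i.e. roughly 20% of sessions.
    return bucket < sampling_rate * 100
```

Hashing the session ID, rather than drawing a random number per request, keeps every trace of a session on the same side of the sampling decision.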


### Step 4: Verify Configuration Status

Verify the evaluation configuration is properly created and enabled by retrieving its details.

In [6]:
config_details = eval_client.get_online_config(config_id=response['onlineEvaluationConfigId'])
print("Configuration Details:")
print(json.dumps(config_details, indent=2, default=str))

Configuration Details:
{
  "ResponseMetadata": {
    "RequestId": "3a6c83c6-48c4-4175-8505-22659d33f98e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 16 Jan 2026 19:49:03 GMT",
      "content-type": "application/json",
      "content-length": "1052",
      "connection": "keep-alive",
      "x-amzn-requestid": "3a6c83c6-48c4-4175-8505-22659d33f98e",
      "x-amzn-remapped-x-amzn-requestid": "c4db37e5-b99a-469e-a8ec-5a32f5b05452",
      "x-amzn-remapped-content-length": "1052",
      "x-amzn-remapped-connection": "keep-alive",
      "x-cache": "Miss from cloudfront",
      "via": "1.1 11017c4db22106ac70e16ce75190a430.cloudfront.net (CloudFront)",
      "x-amz-cf-id": "zx_R60ZSlztRplKRIMfKh3n4Y3CGdFXjGSJY6bwyYVuPtTDPjjFQXA==",
      "x-amz-apigw-id": "XSxneEagPHcEQHQ=",
      "x-amzn-trace-id": "Root=1-696a962f-36e8e9f62d52aaa5017acd33",
      "x-amz-cf-pop": "HIO52-P4",
      "x-amzn-remapped-date": "Fri, 16 Jan 2026 19:49:03 GMT"
    },
    "RetryAttempts": 0
  }
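
Most of the output above is HTTP metadata rather than configuration. A minimal sketch for printing only the configuration fields follows; `strip_response_metadata` is a hypothetical helper, and the `status` field in the sample dictionary is an assumed placeholder, not a confirmed field name.

```python
import json


def strip_response_metadata(api_response: dict) -> dict:
    """Return a copy of an API response without the ResponseMetadata key."""
    return {
        key: value
        for key, value in api_response.items()
        if key != "ResponseMetadata"
    }


# Sample shaped like the output above; "status" is an assumed field.
sample = {
    "ResponseMetadata": {"HTTPStatusCode": 200, "RetryAttempts": 0},
    "onlineEvaluationConfigId": "customer_support_agent_eval-VuUOlc4onO",
    "status": "PLACEHOLDER",
}
print(json.dumps(strip_response_metadata(sample), indent=2, default=str))
```

In the notebook you would pass `config_details` instead of `sample`.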

### Step 5: Generate Test Interactions

Invoke the customer support agent with various queries to generate traces for evaluation. Different test scenarios will demonstrate how the evaluators assess agent performance.

In [7]:
# Get authentication token
access_token = get_or_create_cognito_pool(refresh_token=True)
print(f"Access token obtained: {access_token['bearer_token'][:20]}...")

def invoke_agent_runtime(prompt, session_id=None):
    """Invoke the agent runtime using starter toolkit"""
    if not session_id:
        session_id = str(uuid.uuid4())
    
    response = runtime_client.invoke(
        payload={"prompt": prompt},
        session_id=session_id,
        bearer_token=access_token['bearer_token']
    )
    
    return response, session_id

Access token obtained: eyJraWQiOiJsWktuejBt...


#### Test Scenario 1: Product Information Query

In [8]:
session1 = str(uuid.uuid4())
response, _ = invoke_agent_runtime(
    "I need information about the Gaming Console Pro. What are its specifications and price?",
    session1
)
print("Customer Query: Product information request")
display(Markdown(response["response"].replace('\\n', '\n')))

Using JWT authentication


Customer Query: Product information request


"Based on the search results, here's what I found about the **PlayStation 5 Pro** (which appears to be the \"Gaming Console Pro\" you're referring to):

## Key Specifications:
- **All-Digital Console** - No disc drive included
- **Advanced Ray Tracing** - Enhanced graphics capabilities
- **4K Display Support** - Super sharp image clarity for 4K TVs
- **High Frame Rate Gameplay** - Smooth performance for demanding games
- **Storage**: 2TB SSD
- **Release Date**: November 7, 2024

## Pricing:
- **Price**: $700 USD
- **Bundle Options**: Available with two DualSense controllers and other accessories

## Notable Features:
The PS5 Pro is designed to give game creators advanced tools to enhance their games with better visuals and performance compared to the standard PS5.

---

**Is there anything specific about the Gaming Console Pro you'd like to know more about?** For example, I can help you with:
- Return or warranty policies
- Technical support for any issues
- Information about compatible accessories
- Other gaming products

I also noticed from your profile that you're interested in low-latency headphones for competitive FPS gaming - I'd be happy to help you find the right gaming headphones to pair with your console!"
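
The rendered response above is wrapped in double quotes and contains escaped quotes (`\"`), which suggests the runtime returns a JSON-encoded string. Under that assumption, the sketch below decodes the string with `json.loads` and falls back to the notebook's `replace` approach when decoding fails. `clean_agent_text` is a hypothetical helper name.

```python
import json


def clean_agent_text(raw: str) -> str:
    """Decode a JSON-encoded agent response into plain Markdown text.

    Assumes the response may be a JSON string literal (quoted, with
    escaped newlines and quotes). Falls back to replacing literal
    backslash-n sequences if it is not.
    """
    text = raw.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        try:
            decoded = json.loads(text)
            if isinstance(decoded, str):
                return decoded
        except json.JSONDecodeError:
            pass  # Not valid JSON; use the simple fallback below.
    return raw.replace("\\n", "\n")
```

In the scenario cells this could be used as `display(Markdown(clean_agent_text(response["response"])))`, which would also remove the stray surrounding quotes.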

#### Test Scenario 2: Technical Support Request

In [9]:
session2 = str(uuid.uuid4())
response, _ = invoke_agent_runtime(
    "My laptop won't start up. Can you help me troubleshoot this issue?",
    session2
)
print("Customer Query: Technical support request")
display(Markdown(response["response"].replace('\\n', '\n')))

Using JWT authentication


Customer Query: Technical support request


"Great! I found some troubleshooting guidance. Here are the **steps to troubleshoot your laptop startup issue**:

## Power Issues Troubleshooting

**Step 1: Check Power Connection**
- Verify the power cable is firmly connected to both the laptop and the power outlet
- Look for any visible damage to the power cable or connector

**Step 2: Verify Power Outlet**
- Try plugging the power cable into a different power outlet to ensure the original outlet is working
- Test with a device you know works (like a lamp) to confirm the outlet has power

**Step 3: Try a Different Power Cable**
- If available, borrow or use an alternative power cable that's compatible with your laptop
- This helps determine if the power supply is faulty

**Step 4: Check for Physical Damage**
- Inspect your laptop for any signs of physical damage, water damage, or loose components
- Look at the charging port for corrosion or debris

## Additional Information

Since you have a ThinkPad, these general steps should apply to your model as well.

**What I need to know to help further:**
- Does your laptop show ANY signs of power (lights, fans, screen activity)?
- Is the power LED indicator lighting up?
- Any sounds or signs of activity when you press the power button?

Once you provide these details, I can guide you through more specific troubleshooting steps. If the issue persists after these checks, it may require professional technical support.

**Is there anything else I can help you with?**"

#### Test Scenario 3: Return Policy Inquiry

In [10]:
session3 = str(uuid.uuid4())
response, _ = invoke_agent_runtime(
    "I bought a smartphone last week but it's not working properly. What's your return policy?",
    session3
)
print("Customer Query: Return policy inquiry")
display(Markdown(response["response"].replace('\\n', '\n')))

Using JWT authentication


Customer Query: Return policy inquiry


"Great news! Here's our **smartphone return policy**:

**Return Window:** 30 days from delivery

**Conditions for Returns:**
- Device must be in original packaging
- No physical damage
- Factory reset required

**Return Process:** You can initiate a return through:
- Our online RMA (Return Merchandise Authorization) portal, or
- Our technical support team

**Shipping:** Free return shipping with a prepaid label provided

**Refund Timeline:** 5-7 business days after we inspect the device

**Warranty:** Your smartphone includes a 1-year manufacturer warranty

Since you purchased it last week, you're well within the 30-day return window. Before initiating a return, would you like me to help troubleshoot the issue? Sometimes problems can be resolved quickly without needing a return. Could you tell me more about what's not working properly with your phone? 

Is it the overheating issue you mentioned, or something else? I'm here to help! 📱"

#### Test Scenario 4: Complex Multi-Tool Query

In [11]:
session4 = str(uuid.uuid4())
response, _ = invoke_agent_runtime(
    "I need help with my Gaming Console Pro. First, can you tell me about its warranty? Then I need technical support for connection issues.",
    session4
)
print("Customer Query: Complex multi-tool request")
display(Markdown(response["response"].replace('\\n', '\n')))

Using JWT authentication


Customer Query: Complex multi-tool request


"Great! I've gathered information for you. Here's what I found:

## **Warranty Status - ⚠️ Important News**
Your Gaming Console Pro warranty **has expired**:
- **Serial Number:** MNO33333333
- **Purchase Date:** November 25, 2023
- **Warranty End Date:** November 25, 2024
- **Status:** Expired (418 days ago)
- **Warranty Type:** Gaming Warranty
- **Original Coverage:** Controller issues, overheating protection, and hard drive replacement

Since your warranty has expired, repair services may now have associated costs. However, I'd recommend contacting our support team to discuss extended warranty options or repair service pricing.

---

## **Technical Support - Connection Issues**

Here are the recommended troubleshooting steps for connection problems:

### **Wi-Fi Connection Issues:**
1. Restart both your router and Gaming Console Pro
2. Verify your Wi-Fi password is entered correctly
3. Update your network drivers
4. Reset network settings if problems persist
5. Move closer to your router and check for interference

### **Bluetooth Pairing Issues:**
1. Clear your Bluetooth cache
2. Remove and re-pair your devices
3. Verify device compatibility
4. Update your Bluetooth drivers
5. Ensure both devices are in pairing mode

---

**Could you provide more specific details about your connection issue?** For example:
- Are you having Wi-Fi or Bluetooth problems?
- Can you not connect at all, or is the connection dropping?
- Are there any error messages you're seeing?

This will help me provide more targeted troubleshooting steps. I'm here to help! 🎮"

#### Test Scenario 5: General Capability Query

In [12]:
session5 = str(uuid.uuid4())
response, _ = invoke_agent_runtime(
    "What kind of support can you provide? List all your available tools and capabilities.",
    session5
)
print("Customer Query: Capability inquiry")
display(Markdown(response["response"].replace('\\n', '\n')))

Using JWT authentication


Customer Query: Capability inquiry


"

I'm happy to help! Here's a comprehensive overview of my available tools and capabilities:

## 📋 Available Tools & Capabilities

### 1. **Product Information** (`get_product_info`)
- Get detailed technical specifications for electronics
- Access warranty information and features
- Review product policies and capabilities
- **Supported categories:** laptops, smartphones, headphones, monitors, and other electronics

### 2. **Return & Warranty Policies** (`get_return_policy`)
- Check return timeframes and conditions
- Understand warranty coverage details
- Review return procedures for specific product categories
- **Coverage:** smartphones, laptops, accessories, and more

### 3. **Technical Support** (`get_technical_support`)
- Get assistance with technical issues and troubleshooting
- Receive guidance for hardware or software problems
- Access technical solutions for various devices

### 4. **Warranty Status Check** (`LambdaUsingSDK___check_warranty_status`)
- Verify warranty coverage using serial numbers
- Optional email verification for account confirmation
- Check warranty expiration and coverage details

### 5. **Web Search** (`LambdaUsingSDK___web_search`)
- Search for updated technical documentation
- Access current product information
- Find the latest specifications and compatibility details
- Support for multiple regions

## 🎯 How I Can Help You

✅ Answer questions about product specifications and features  
✅ Explain warranty and return policies  
✅ Troubleshoot technical issues  
✅ Check warranty status on your devices  
✅ Find updated technical documentation  
✅ Provide recommendations based on your needs  

Given your profile, I can particularly assist with:
- **Gaming headphones** with low latency specifications
- **Linux-compatible laptops** recommendations
- **CPU installation documentation** and technical guidance
- Gaming peripheral support

**What can I help you with today?** Feel free to ask about products, technical issues, policies, or any other electronics support needs! 🎮💻"
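
The five scenario cells above repeat the same pattern. As a sketch of a more compact alternative, the loop below drives all scenarios from a list and records which session ID belongs to which scenario, which helps when you later look up sessions in the dashboard. `run_scenarios` is a hypothetical helper; it takes the invoke function as a parameter so that it can be tried with a stub before calling the real agent.

```python
import uuid
from typing import Callable, Dict, List, Tuple

# Labels and prompts copied from the scenario cells above.
SCENARIOS: List[Tuple[str, str]] = [
    ("product_info", "I need information about the Gaming Console Pro. What are its specifications and price?"),
    ("tech_support", "My laptop won't start up. Can you help me troubleshoot this issue?"),
    ("return_policy", "I bought a smartphone last week but it's not working properly. What's your return policy?"),
]


def run_scenarios(
    invoke: Callable[[str, str], object],
    scenarios: List[Tuple[str, str]],
) -> Dict[str, str]:
    """Invoke each prompt in a fresh session and return {label: session_id}."""
    sessions: Dict[str, str] = {}
    for label, prompt in scenarios:
        session_id = str(uuid.uuid4())  # one session per scenario
        invoke(prompt, session_id)
        sessions[label] = session_id
    return sessions
```

In the notebook, `run_scenarios(invoke_agent_runtime, SCENARIOS)` would reuse the function defined in Step 5.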

### Step 6: Monitor Evaluation Results

Monitor evaluation results through the AgentCore Observability console. Results may take a few minutes to appear as the system processes traces and applies evaluators.
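
Because results arrive with a delay, any script that reads them needs to wait and retry. The sketch below shows a generic polling loop with exponential backoff. It makes no assumption about how results are fetched: `check` is any function you supply that returns a truthy value once results exist, and `wait_for` is a hypothetical helper name.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    check: Callable[[], Optional[T]],
    timeout_s: float = 600.0,
    initial_delay_s: float = 5.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call check() until it returns a truthy value or the timeout expires."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = check()
        if result:
            return result
        # Stop if the next wait would run past the deadline.
        if clock() + delay > deadline:
            raise TimeoutError(f"No result after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 5s, 10s, 20s, ...
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.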

#### Accessing the Dashboard

1. Navigate to the [AgentCore Observability console](https://console.aws.amazon.com/cloudwatch/home#gen-ai-observability/agent-core/agents)
2. Find your customer support agent in the agents list
3. Click on the `DEFAULT` endpoint to view evaluation metrics
4. Look for the evaluation scores in the traces and sessions views

#### What You'll See

The dashboard will show:
- **Goal Success Rate**: How well the agent achieves customer objectives
- **Correctness**: Accuracy of information provided
- **Tool Selection Accuracy**: Appropriate tool choices for queries

![Online Evaluation Dashboard](images/online_evaluations_dashboard.png)

*Evaluation metrics displayed in the AgentCore Observability dashboard*

### Step 7: Understanding Evaluation Metrics

**Goal Success Rate** measures whether the agent successfully addresses the customer's primary intent. High scores indicate effective problem-solving; low scores suggest unmet needs, incomplete responses, or misunderstood requests.

**Correctness** evaluates factual accuracy of responses. High scores indicate accurate and reliable information; low scores suggest incorrect facts, outdated information, or misleading guidance.

**Tool Selection Accuracy** evaluates whether the agent chooses appropriate tools for each task. High scores indicate proper tool selection; low scores suggest wrong tools, unnecessary calls, or missing tool usage.
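
To make the high-versus-low interpretation concrete, the sketch below averages scores per evaluator and lists the sessions that fall under a threshold. Everything about the data shape is an assumption for illustration: the record keys (`evaluator`, `session_id`, `score`), the 0-to-1 score scale, and the 0.7 threshold are hypothetical, not the format AgentCore emits.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize_scores(results: List[dict], threshold: float = 0.7) -> Dict[str, dict]:
    """Group hypothetical score records by evaluator and flag low sessions."""
    by_evaluator = defaultdict(list)
    for record in results:
        by_evaluator[record["evaluator"]].append(record)

    summary = {}
    for name, rows in by_evaluator.items():
        summary[name] = {
            "mean_score": mean(row["score"] for row in rows),
            # Sessions worth opening in the dashboard for investigation.
            "low_sessions": sorted(
                {row["session_id"] for row in rows if row["score"] < threshold}
            ),
        }
    return summary
```

A low mean with only one or two flagged sessions points to specific failing conversations, while a low mean spread across many sessions suggests a systemic issue such as the system prompt or tool descriptions.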

### Step 8: Analyzing Results and Next Steps

**For Low Goal Success Rates:** Refine the agent's system prompt, improve tool descriptions and parameters, and add specific training examples.

**For Low Correctness Scores:** Update the knowledge base with current information, improve fact-checking mechanisms, and review tool responses.

**For Tool-Related Issues:** Refine tool parameter schemas, improve tool selection logic, and enhance tool documentation.

**Continuous Monitoring:** Set up CloudWatch alarms for evaluation metrics, create dashboards for trend analysis, and implement automated alerts for quality degradation.
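
As a starting point for the alarm suggestion above, the sketch below builds the keyword arguments for the CloudWatch `put_metric_alarm` API. The namespace and metric name are passed in as parameters because this lab does not state where evaluation metrics are published; look them up in the CloudWatch console for your account before using this. `build_quality_alarm` is a hypothetical helper, and the period and evaluation settings are example values.

```python
from typing import Optional


def build_quality_alarm(
    namespace: str,
    metric_name: str,
    threshold: float,
    alarm_name: Optional[str] = None,
) -> dict:
    """Build put_metric_alarm arguments for a 'score dropped too low' alarm.

    namespace and metric_name must come from your own CloudWatch account;
    this sketch does not assume specific values.
    """
    return {
        "AlarmName": alarm_name or f"{metric_name}-below-{threshold}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 300,                # evaluate in 5-minute windows
        "EvaluationPeriods": 3,       # require 3 consecutive low windows
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet periods do not alarm
    }


# Usage (requires boto3 and AWS credentials), with your own values:
# import boto3
# kwargs = build_quality_alarm("<your-namespace>", "<your-metric>", 0.7)
# boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Requiring several consecutive low windows is a design choice that reduces noise from a single poorly scored session.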

### Step 9: Clean Up (Optional)

Delete the online evaluation configuration if you no longer need it by uncommenting the code below.

In [13]:
# Uncomment the following lines to delete the evaluation configuration
# eval_client.delete_online_config(config_id=response['onlineEvaluationConfigId'])
# print("Online evaluation configuration deleted")

### Congratulations! 🎉

You have successfully completed **Lab 5: AgentCore Evaluations - Online Evaluation!**

### What You Accomplished

You configured continuous online evaluation for your customer support agent, using built-in evaluators to assess Goal Success Rate (whether the customer's goal was met), Correctness (factual accuracy), and Tool Selection Accuracy (proper tool usage). Evaluation results are integrated with AgentCore Observability dashboards for real-time insights.

**Key Benefits:** Proactive quality assurance catches issues before they affect customers, data-driven optimization guides improvements, performance monitoring at scale builds production confidence, and continuous learning surfaces patterns and opportunities.

**Next Steps:** Monitor your evaluation dashboard regularly, set up CloudWatch alarms for quality thresholds, use insights to iteratively improve your agent, and consider adding custom evaluators for domain-specific metrics.

### Next Up: [Lab 6: Build User Interface →](lab-06-frontend.ipynb)

Complete the customer experience by building a user-friendly web interface for customers to interact with your quality-monitored agent.

Your customer support agent is now production-ready with comprehensive quality monitoring! 🚀