# Agentic Evaluator
In this notebook, we set up a simple chat bot using Strands. Then we create an evaluator agent that can test and correct our chat bot.

## 1) Set up dependencies

In [1]:
# Install Strands Agents
!pip install strands-agents strands-agents-tools


In [2]:
!pip install googlesearch-python


In [3]:
from strands import Agent, tool

In [4]:
from strands.models import BedrockModel

In [5]:
from googlesearch import search

In [6]:
from bs4 import BeautifulSoup

In [7]:
import requests

We'll also import a list of city data, to use as our gold standard set.
This is from: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

In [8]:
import re

import pandas as pd

In [9]:
# Read the CSV file.
# It contains the city, state, population, and land area in square miles in 2024.
gold_standard_city_pop = pd.read_csv('city_pop.csv')
# Clean the dataset once when loading: Wikipedia formats numbers with commas.
gold_standard_city_pop['population'] = gold_standard_city_pop['population'].astype(str).str.replace(',', '').astype(float)
gold_standard_city_pop['land_area_mi2'] = gold_standard_city_pop['land_area_mi2'].astype(str).str.replace(',', '').astype(float)

# Show the first 3 rows, as a reference
print(gold_standard_city_pop.head(3))  # First 3 rows

          city state  population  land_area_mi2
0  New York[c]    NY   8478072.0          300.5
1  Los Angeles    CA   3878704.0          469.5
2      Chicago    IL   2721308.0          227.7
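
The `[c]` suffix on `New York[c]` is a Wikipedia footnote marker carried over from the source table, so city names need normalizing before they can be matched against user input. As a minimal standalone sketch of that normalization (the helper name `normalize_city` is hypothetical and not part of this notebook, and it uses a non-greedy pattern so each bracketed marker is removed separately):

```python
import re

def normalize_city(name: str) -> str:
    """Drop bracketed footnote markers such as '[c]' and normalize case and whitespace."""
    # Non-greedy match, so "A[b] C[d]" loses both markers but keeps " C".
    without_markers = re.sub(r"\[.*?\]", "", name)
    return without_markers.strip().lower()

print(normalize_city("New York[c]"))
print(normalize_city("  Chicago "))
```

With this in place, `"New York[c]"` and `"new york"` compare equal after normalization.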


Now we'll write a quick checking function. Our agent will try to find each city's population and area, so we'll compare its answers against this list.

In [10]:
def evaluate_city_guess(city, state, chatbot_response, dataset):
    """
    Evaluate population and area guesses against the gold standard dataset.
    
    Parameters:
    - city: str, city name
    - state: str, state abbreviation (e.g., 'NY', 'CA')
    - chatbot_response: Strands AgentResult object to be evaluated
    - dataset: pandas DataFrame, the gold standard dataset
    
    Returns:
    - dict with percent errors for population and area, and total tokens, execution time, and tool calls.
    
    Raises:
    - ValueError if city/state combination not found
    """
    
    # Clean the city name for matching
    city_clean = city.strip()

    
    # Use regex to extract the final answer as numbers
    final_msg = chatbot_response.message['content'][0]['text']
    try:
        guessed_pop = int(re.search(r'<pop>(.*?)</pop>', final_msg).group(1))
        guessed_area = float(re.search(r'<area>(.*?)</area>', final_msg).group(1))
    except (AttributeError, ValueError) as err:
        # AttributeError: a tag is missing; ValueError: the tag content is not a number
        raise ValueError("XML tags not found or not numeric in reply") from err

    
    #extract agent loop metrics
    total_tokens = chatbot_response.metrics.accumulated_usage['totalTokens']
    total_time = sum(chatbot_response.metrics.cycle_durations)
    
    # Sum the call counts across every tool the agent used
    tool_calls = sum(m.call_count for m in chatbot_response.metrics.tool_metrics.values())

    
    # Find the city in the dataset
    # Use case-insensitive matching and handle potential annotations
    mask = (dataset['city'].str.replace(r'\[.*\]', '', regex=True).str.strip().str.lower() == city_clean.lower()) & \
           (dataset['state'].str.upper() == state.upper())
    
    matching_rows = dataset[mask]
    
    if len(matching_rows) == 0:
        raise ValueError(f"City '{city}' in state '{state}' not found in dataset")
    
    if len(matching_rows) > 1:
        print(f"Warning: Multiple matches found for {city}, {state}. Using first match.")
    
    # Get the actual values
    actual_pop = matching_rows.iloc[0]['population']
    actual_area = matching_rows.iloc[0]['land_area_mi2']
    
    # Calculate percent error: |actual - guess| / actual * 100
    pop_error = abs(actual_pop - guessed_pop) / actual_pop * 100
    area_error = abs(actual_area - guessed_area) / actual_area * 100
    
    return {
        'city': matching_rows.iloc[0]['city'],
        'state': matching_rows.iloc[0]['state'],
        'actual_population': actual_pop,
        'guessed_population': guessed_pop,
        'population_error_percent': round(pop_error, 2),
        'actual_area': actual_area,
        'guessed_area': guessed_area,
        'area_error_percent': round(area_error, 2),
        'total_tokens': total_tokens,
        'total_time': total_time,
        'tool_calls': tool_calls
    }
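
To make the two core steps of the checking function easier to follow in isolation, here is a small sketch that parses the `<pop>` and `<area>` tags from a reply and computes the percent error, |actual - guess| / actual * 100. It runs without Strands: `sample_reply` is a made-up string standing in for the agent's final message, and the helper names `parse_tagged_answer` and `percent_error` are hypothetical.

```python
import re
from typing import Tuple

def parse_tagged_answer(text: str) -> Tuple[int, float]:
    """Extract the <pop> and <area> values from a reply; raise ValueError if a tag is missing."""
    pop_match = re.search(r"<pop>(.*?)</pop>", text)
    area_match = re.search(r"<area>(.*?)</area>", text)
    if pop_match is None or area_match is None:
        raise ValueError("XML tags not found in reply")
    return int(pop_match.group(1)), float(area_match.group(1))

def percent_error(actual: float, guess: float) -> float:
    """Percent error relative to the actual value."""
    return abs(actual - guess) / actual * 100

# Made-up reply for illustration only; not real agent output.
sample_reply = "Roughly 8.3 million people. <pop>8300000</pop> <area>300.5</area>"
guessed_pop, guessed_area = parse_tagged_answer(sample_reply)

# Gold-standard values for New York taken from the dataset preview above.
print(round(percent_error(8478072.0, guessed_pop), 2))
print(round(percent_error(300.5, guessed_area), 2))
```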

## 2) Create a simple chat bot

In [11]:
@tool
def web_search(topic: str) -> str:
    """Search Google for a given topic.
    Return a string listing the top 5 results including the url, title, and description of each result.
    """
    result_string = ""
    results = search(topic, num_results=5, advanced=True)
    for result in results:
        result_string += str(result)
    return result_string
    
@tool      
def get_page(url: str) -> str:
    """Take a URL and return the raw text from that page.
    Can be used to get more information based on a Google search result listing."""
    response = requests.get(url, timeout=20)  # time out rather than hang on a slow page
    response.raise_for_status()
    bs = BeautifulSoup(response.text,'html.parser')
    return bs.text

In [12]:
from botocore.config import Config

# A custom config for Bedrock that only allows short connections; for our demo we expect all calls to be fast.
# Here we turn off retries, allow 5 seconds to connect, and time out reads after 20 seconds.
quick_config = Config(
    connect_timeout=5,
    read_timeout=20,
    retries={"max_attempts": 0}
)

In [13]:
# Create the chatbot. We'll use Nova Micro to optimize for latency, cost, and capacity.
chatbot_model_name = "us.amazon.nova-micro-v1:0"
# Add the custom timeout config to the model, to keep the tool from hanging or retrying too much.
chatbot_model = BedrockModel(
    model_id=chatbot_model_name,
    boto_client_config=quick_config    
)
chatbot = Agent(tools=[web_search,get_page], model=chatbot_model)
#Call the chat bot with a simple request.
prompt = """How many people live in New York, and what's the area of the city in square miles?
After you respond, also include your answer in 'pop' and 'area' XML tags, for programmatic processing.
The values in the XML tags should only be numbers, no words or commas."""
chatbot_response = chatbot(prompt)

<thinking> 
To answer the user's question, I need to first find out the population and area of New York City. Since I don't have this information readily available, I will use the web_search tool to find the most recent data. I will search for "current population of New York City" and "area of New York City in square miles". After getting the results, I will extract the necessary information and format it in XML tags as requested.
</thinking>


Tool #1: web_search

Tool #2: web_search
<thinking> 
It seems that the connection to search for the population and area of New York City failed due to a network error. Unfortunately, I cannot directly access the internet to fetch this data. I recommend checking official city or government websites for this information directly.
</thinking> 

I'm sorry, but I'm currently unable to provide the population and area of New York City due to network issues. You might want to check official city or government websites for this information. 

Here's the 

### Now that we have an answer from one call, let's check the error and other metrics using our eval function.

In [112]:
result = evaluate_city_guess("New York", "NY", chatbot_response, gold_standard_city_pop)
print(f"Population error: {result['population_error_percent']}%")
print(f"Area error: {result['area_error_percent']}%")
print(f"Total Tokens: {result['total_tokens']} tokens")
print(f"Total Time: {result['total_time']:.2f} seconds")
print(f"Tool Calls: {result['tool_calls']}")

Population error: 6.39%
Area error: 0.01%
Total Tokens: 2425 tokens
Total Time: 1.05 seconds
Tool Calls: 2


Pretty close! Now let's build an evaluator agent that can run some tests for us using the gold standard dataset.
We'll do this by wrapping all of the above in a tool call. First, let's make it a simple model comparison.

In [143]:
@tool
def eval_model(model_name: str) -> str:
    """Start an evaluator for a particular model.
    model_name is the model endpoint to be evaluated.
    Returns a string containing information about this model.
    """
    #add custom timeout for the model, to keep the tool from hanging or retrying too much.
    chatbot_model = BedrockModel(
        model_id=model_name,
        boto_client_config=quick_config    
    )
    
    # callback_handler=None suppresses the sub-agent's printouts
    chatbot = Agent(tools=[web_search, get_page], model=chatbot_model, callback_handler=None)
    # Call the chat bot with a simple request.
    prompt = """How many people live in New York, and what's the area of the city in square miles?
    After you respond, also include your answer in 'pop' and 'area' XML tags, for programmatic processing.
    The values in the XML tags should only be numbers, no words or commas."""
    chatbot_response = chatbot(prompt)
    result = evaluate_city_guess("New York", "NY", chatbot_response, gold_standard_city_pop)
    # Build a newline-separated summary of the evaluation metrics
    result_string = "\n".join([
        f"Population error: {result['population_error_percent']}%",
        f"Area error: {result['area_error_percent']}%",
        f"Total Tokens: {result['total_tokens']} tokens",
        f"Total Time: {result['total_time']:.2f} seconds",
        f"Tool Calls: {result['tool_calls']}",
    ])
    print(result_string)
    return result_string

In [148]:
evaluator_prompt = """
Use the eval_model tool to evaluate these models:
Nova Micro: "us.amazon.nova-micro-v1:0",
Nova Lite: "us.amazon.nova-lite-v1:0",
Nova Pro: "us.amazon.nova-pro-v1:0",
Claude 3 Haiku: "us.anthropic.claude-3-haiku-20240307-v1:0",
Claude 3 Sonnet: "us.anthropic.claude-3-sonnet-20240229-v1:0"
Provide a table comparison of the results, and include columns for all evaluation data points, including the number of tool calls and the number of times the model failed to evaluate and had to be retried.
Do not include the endpoint names in the table, only the model names, to save space.
If a model fails to evaluate, you should retry it up to 3 times.
"""
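
The prompt above leaves retrying to the evaluator agent, so the retry count depends on how well the model follows the instruction. If a deterministic limit were preferred, one alternative would be to retry in plain Python around the evaluation call. The sketch below is only an illustration under that assumption: `call_with_retries` and `flaky_eval` are hypothetical names, and `flaky_eval` is a stand-in that fails twice before succeeding rather than a real model call.

```python
from typing import Callable, Tuple

def call_with_retries(fn: Callable[[], str], max_retries: int = 3) -> Tuple[str, int]:
    """Call fn, retrying after any exception up to max_retries times.

    Returns the result together with the number of retries that were needed.
    """
    last_error = None
    for retries_used in range(max_retries + 1):
        try:
            return fn(), retries_used
        except Exception as exc:  # broad on purpose: any failure counts as a failed evaluation
            last_error = exc
    raise RuntimeError(f"Evaluation failed after {max_retries} retries") from last_error

# Hypothetical stand-in for an evaluation call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_eval() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("XML tags not found in reply")
    return "Population error: 0.0%"

result, retries = call_with_retries(flaky_eval)
print(result, "| retries:", retries)
```

Returning the retry count alongside the result keeps the "number of retries" column available for the comparison table without relying on the agent to count its own attempts.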

In [149]:
evaluator = Agent(tools=[eval_model], model=chatbot_model)
evaluator_response = evaluator(evaluator_prompt)

<thinking> To evaluate the specified models using the `eval_model` tool, I will need to call this tool multiple times with different model names. Since each model has a unique endpoint, I will create a list of these endpoints and iterate over them. If a model fails to evaluate, I will retry it up to 3 times before considering it as failed. After collecting the results, I will compile them into a table format as requested.</thinking>


Tool #1: eval_model

Tool #2: eval_model

Tool #3: eval_model

Tool #4: eval_model

Tool #5: eval_model
Population error: 0.0%
Area error: 0.01%
Total Tokens: 1894 tokens
Total Time: 0.99 seconds
Tool Calls: 1
Population error: 0.0%
Area error: 0.01%
Total Tokens: 2595 tokens
Total Time: 1.16 seconds
Tool Calls: 2
Population error: 0.0%
Area error: 0.01%
Total Tokens: 1537 tokens
Total Time: 1.29 seconds
Tool Calls: 1
Population error: 0.0%
Area error: 0.01%
Total Tokens: 1774 tokens
Total Time: 1.40 seconds
Tool Calls: 1
Population error: 0.0%
Area error

### Awesome! Now we have some basic model evals. Note that the agent ran all 5 tests in parallel.
### Next, we'll expand our evaluator so it can check more than one data point. We add the calculator tool to assist.