## Evaluate Biomarker Supervisor Agent

In this notebook, we demonstrate how to run on-demand AgentCore Evaluations against our biomarker supervisor agent using the built-in evaluators.

#### Upgrade AgentCore and OpenTelemetry dependencies to latest versions

In [None]:
%pip install --upgrade bedrock-agentcore bedrock-agentcore-starter-toolkit aws-opentelemetry-distro

## Prerequisites

You need to complete notebook [05-multi_agent_biomarker_strands.ipynb](05-multi_agent_biomarker_strands.ipynb) and have the agent runtime deployed on Bedrock AgentCore. Also make sure you have performed the Bedrock AgentCore Observability setup explained in [00-setup_environment.ipynb](00-setup_environment.ipynb).

## On-demand Evaluations

On-demand evaluation provides a flexible way to evaluate specific agent interactions by directly analyzing a chosen set of spans. Unlike online evaluation, which continuously monitors production traffic, on-demand evaluation lets you perform targeted assessments of selected interactions at any time.

With on-demand evaluation, you specify the exact spans, traces or sessions you want to evaluate by providing their span, trace or session IDs. When using the AgentCore Starter toolkit you can also automatically evaluate all traces in a session.

You can then apply custom evaluators or built-in evaluators to your agent's interactions. This evaluation type is particularly useful when you need to investigate specific customer interactions, validate fixes for reported issues, or analyze historical data for quality improvements. Once you submit the evaluation request, the service processes only the specified spans and provides detailed results for your analysis.

#### Import required libraries

In [None]:
from bedrock_agentcore_starter_toolkit import Evaluation, Observability
import os
import json
import boto3
import uuid
from boto3.session import Session
from IPython.display import Markdown, display

### Generating Data for Evaluations

We are going to invoke the agent with sample questions within a single session and then use the built-in evaluators to see how our agent is performing.

In [None]:
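# Use a distinct name for the session so the imported boto3 module is not shadowed
boto_session = Session()
region = boto_session.region_name

ssm_client = boto_session.client('ssm', region_name=region)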
agent_arn = ssm_client.get_parameter(Name='/streamlitapp/env1/AGENT_ARN')['Parameter']['Value']
session_id = str(uuid.uuid4())
print(f"Session ID: {session_id}")
print(agent_arn)


In [None]:
questions = [
    "How many patients are current smokers?",
    "What is the average age of patients diagnosed with Adenocarcinoma?",
    "Can you search PubMed for evidence around the effects of biomarker use in oncology on clinical trial failure risk?",
    "What are the FDA approved biomarkers for non small cell lung cancer?",
    "According to literature evidence, what metagene cluster does gdf15 belong to",
    "What properties of the tumor are associated with metagene 19 activity and EGFR pathway"
]

Iterate over the list of questions, invoking the agent hosted on Bedrock AgentCore for each one. Please note that this next cell **will take around 5 minutes** to run.

In [None]:
agentcore_client = boto3.client(
    'bedrock-agentcore',
    region_name=region
)

def invoke_agentcore(test_query: str):
    response = agentcore_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        qualifier="DEFAULT",
        payload=json.dumps({"prompt": test_query}),
        runtimeSessionId=session_id
    )

    print(f"Testing orchestrator agent boto3 client: {test_query}")
    print("=" * (41 + len(test_query)))

    if "text/event-stream" in response.get("contentType", ""):
        # Processing streaming response
        for line in response["response"].iter_lines(chunk_size=1):
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    # remove the SSE structure
                    data = line[6:]
                    # we need to parse it twice to convert from JSON str to a dictionary
                    data_obj = json.loads(data)
                    data_obj = json.loads(data_obj)
                    # for this example we only care about the data field
                    if "data" in data_obj:
                        print(data_obj.get("data"))
    else:
        # Handle non-streaming response
        try:
            events = []
            for event in response.get("response", []):
                events.append(event)
        except Exception as e:
            events = [f"Error reading EventStream: {e}"]
        if events:
            try:
                response_data = json.loads(events[0].decode("utf-8"))
                display(Markdown(response_data))
            except Exception:
                # Fall back to the raw bytes if decoding or rendering fails
                print(f"Raw response: {events[0]}")

for question in questions:
    invoke_agentcore(question)
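
The streaming branch above strips the `data: ` prefix from each server-sent event line and decodes the JSON twice. As a minimal, offline sketch of that parsing step, the helper below (`parse_sse_lines`, a hypothetical name that is not part of any SDK) applies the same logic to hand-made sample bytes instead of a live response. The sample payloads are invented for illustration.

```python
import json


def parse_sse_lines(raw_lines):
    """Collect the "data" field from double-encoded SSE lines (hypothetical helper)."""
    chunks = []
    for raw in raw_lines:
        if not raw:
            continue  # skip blank separator lines
        line = raw.decode("utf-8")
        if not line.startswith("data: "):
            continue  # ignore anything that is not an SSE data line
        payload = json.loads(line[len("data: "):])
        # The first decode yields a JSON string, so decode a second time.
        if isinstance(payload, str):
            payload = json.loads(payload)
        if isinstance(payload, dict) and "data" in payload:
            chunks.append(payload["data"])
    return chunks


# Invented sample lines that mimic the double-encoded shape handled above.
sample = [
    b"data: " + json.dumps(json.dumps({"data": "Hello"})).encode("utf-8"),
    b"",
    b"data: " + json.dumps(json.dumps({"event": "ignored"})).encode("utf-8"),
    b"data: " + json.dumps(json.dumps({"data": " world"})).encode("utf-8"),
]
print("".join(parse_sse_lines(sample)))
```

Keeping the parsing in a small function like this makes it possible to test the decoding logic without invoking the agent.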

### Initialize AgentCore Evaluations Client

Now let's initialize the AgentCore Evaluations client from the AgentCore Starter toolkit.

In [None]:
eval_client = Evaluation(region=region)

# extract agent id from agent arn
agent_id = agent_arn.rsplit('/', 1)[-1]
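
The `rsplit('/', 1)[-1]` call above keeps everything after the last `/` in the ARN. A small sketch of the same step with an explicit check is shown below. The function name `agent_id_from_arn` and the example ARN are hypothetical and only illustrate the string handling; they are not taken from a real deployment.

```python
def agent_id_from_arn(arn: str) -> str:
    """Return the text after the last '/' in an ARN-like string (hypothetical helper)."""
    if "/" not in arn:
        raise ValueError(f"no resource path found in ARN: {arn!r}")
    return arn.rsplit("/", 1)[-1]


# Made-up ARN, for illustration only.
example_arn = "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/example_agent-AbCdEf1234"
print(agent_id_from_arn(example_arn))
```

Raising on a string without a `/` surfaces a malformed parameter value early, rather than passing the whole ARN on as the agent ID.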

You can use the `list_evaluators()` function to see a list of built-in evaluators.

In [None]:
available_evaluators = eval_client.list_evaluators()

### Running Evaluations

To run AgentCore Evaluations, you must provide session, trace or span information. Different metrics require different levels of information from your agent traces, as we saw in the table with built-in evaluations.

#### Trace level metrics

- **Builtin.Coherence:** Evaluates whether the response is logically structured and coherent.
- **Builtin.Conciseness:** Evaluates whether the response is appropriately brief without missing key information.
- **Builtin.Correctness:** Evaluates whether the information in the agent's response is factually accurate.
- **Builtin.InstructionFollowing:** Measures how well the agent follows the provided system instructions.

In [None]:
trace_level_results = eval_client.run(
    agent_id=agent_id,
    session_id=session_id, 
    evaluators=["Builtin.Coherence", "Builtin.Conciseness", "Builtin.Correctness", "Builtin.InstructionFollowing"]
)

#### Session level metrics

- **Builtin.GoalSuccessRate:** Evaluates whether the conversation successfully meets the user's goals.

In [None]:
goal_sucess_results = eval_client.run(
    agent_id=agent_id,
    session_id=session_id, 
    evaluators=["Builtin.GoalSuccessRate"]
)

#### Tool level metrics

- **Builtin.ToolSelectionAccuracy:** Component Level Metric. Evaluates whether the agent selected the appropriate tool for the task.

In [None]:
tool_selection_results = eval_client.run(
    agent_id=agent_id,
    session_id=session_id, 
    evaluators=["Builtin.ToolSelectionAccuracy"]
)

### Analyzing Evaluation Results

Let's now analyze the results. The trace-level run applied several evaluators in a single call, so we need to know which evaluator produced each result. We can do that with the `evaluator_name` property of each result. Let's start with the trace-level metrics:

In [None]:
for result in trace_level_results.results:
    if result.label is not None:
        information = f"""
        {result.evaluator_name} Result: {result.label} ({result.value})
        Explanation: \n{result.explanation}\n
        Token Usage: {result.token_usage}\n
        Context: {result.context}\n
        """
        print("===================================================")
        display(Markdown(information))

In [None]:
for result in goal_sucess_results.results:
    information = f"""
    {result.evaluator_name} Result: {result.label} ({result.value})
    Explanation: \n{result.explanation}\n
    Token Usage: {result.token_usage}\n
    Context: {result.context}\n
    """
    print("===================================================")
    display(Markdown(information))

In [None]:
for result in tool_selection_results.results:
    if result.label is not None:
        information = f"""
        {result.evaluator_name} Result: {result.label} ({result.value})
        Explanation: \n{result.explanation}\n
        Token Usage: {result.token_usage}\n
        Context: {result.context}\n
        """
        print("===================================================")
        display(Markdown(information))
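
Reading every explanation is useful for debugging, but a compact per-evaluator summary can make trends easier to spot. The sketch below is one possible way to aggregate results. It assumes only that each result exposes the `evaluator_name`, `label` and `value` attributes used in the cells above, and that `value` is numeric when present. The `FakeResult` class, the `summarize` helper, and the demo labels and scores are hypothetical stand-ins so the example runs without calling the service.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Stand-in exposing only the attributes read in the cells above."""
    evaluator_name: str
    label: Optional[str]
    value: Optional[float]


def summarize(results):
    """Return {evaluator_name: {"count", "skipped", "mean_value"}} (hypothetical helper)."""
    values = defaultdict(list)
    skipped = defaultdict(int)
    for r in results:
        if r.label is None or r.value is None:
            skipped[r.evaluator_name] += 1  # no score was produced for this item
        else:
            values[r.evaluator_name].append(r.value)
    summary = {}
    for name in sorted(set(values) | set(skipped)):
        scored = values.get(name, [])
        summary[name] = {
            "count": len(scored),
            "skipped": skipped.get(name, 0),
            "mean_value": sum(scored) / len(scored) if scored else None,
        }
    return summary


# Invented demo data; real labels and values come from the evaluation run.
demo = [
    FakeResult("Builtin.Coherence", "Yes", 1.0),
    FakeResult("Builtin.Coherence", "No", 0.0),
    FakeResult("Builtin.Conciseness", "Yes", 1.0),
    FakeResult("Builtin.Conciseness", None, None),
]
for name, stats in summarize(demo).items():
    print(name, stats)
```

In the notebook, the same helper could be given `trace_level_results.results` in place of `demo`, provided the assumptions about the attributes hold.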

### Saving evaluation results

The AgentCore Starter toolkit also helps you save the results of your agent evaluation in structured output files. To do so, all you need to provide is the `output` parameter during the run.

In [None]:
goal_sucess_results = eval_client.run(
    agent_id=agent_id,
    session_id=session_id, 
    evaluators=["Builtin.GoalSuccessRate"],
    output=f"eval_results/{session_id}_goal_success.json"
)
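
Once a run has written its output, the files can be loaded back for later analysis. The sketch below makes no assumption about the schema of the saved files and only reports their top-level shape. It assumes the results land as `.json` files under the `eval_results` directory used above; `load_json_files` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_json_files(directory):
    """Load every *.json file in a directory into {filename: parsed_content}."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            loaded[path.name] = json.load(fh)
    return loaded


for name, content in load_json_files("eval_results").items():
    # Report only the top-level shape; the file schema is not assumed here.
    if isinstance(content, dict):
        print(name, "keys:", sorted(content))
    else:
        print(name, "type:", type(content).__name__)
```

If the directory does not exist or contains no JSON files, the loop simply prints nothing.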