# Agent Evaluation

Evaluating multi-agent collaboration is crucial in development as it allows researchers and engineers to fine-tune the interactions between different AI agents. This process helps identify bottlenecks, communication failures, and areas where agents may be working at cross-purposes. By assessing how well agents coordinate their efforts, share information, and achieve common goals, developers can iteratively improve the system's overall performance and efficiency. Such evaluation also aids in uncovering emergent behaviors that may not have been anticipated in the initial design, leading to more robust and adaptable multi-agent systems.

In production environments, ongoing evaluation of multi-agent collaboration is essential for maintaining system reliability and effectiveness. As real-world conditions often differ from controlled development scenarios, continuous monitoring helps detect any degradation in collaborative performance that could impact the system's outputs or decision-making processes. Regular evaluation allows for timely interventions and updates to ensure the system continues to meet the goals of the use-case while providing a consistent, safe, and performant user experience.

When evaluating agent orchestration options, you must consider how the orchestration layer will perform and provide a framework for validating requests are properly being routed to the correct agents.
 
In this notebook we will use an agent evaluation framework where you provide your model, agents, tests, and ground truth to evaluate different orchestration options.

In [None]:
!pip install --upgrade agent-evaluation

In [None]:
!agenteval run --help

## Run your evaluation##

The below code can be utilized to evaluate the supervisor agent you built in 02_supervisor_agent. You will need to copy the supervisor_agent_id from the below cell and replace it in the cell that creates the yml file for the value: bedrock_agent_id

In [None]:
%store -r supervisor_agent_id

print(supervisor_agent_id)

In [None]:
%%writefile agenteval.yml 

evaluator:
  model: claude-3
target:
  type: bedrock-agent
  bedrock_agent_id: REPLACE WITH YOUR SUPERVISOR AGENT ID
  bedrock_agent_alias_id: TSTALIASID
tests:
  check_balance:
    steps:
    - Ask agent for principal balance for customer 999.
    expected_results:
    - The agent returns a balance of $150,000.
  check_next_payment_date:
    steps:
    - Ask agent for next payment date for customer 999.
    expected_results:
    - The agent says that next payment date is 7/1/2024.
  check_appl_docs:
    steps:
    - Ask agent for missing documents for mortgage application for customer 999.
    expected_results:
    - The agent says that Employment Information docs are still pending.
  check_multi_turn_convo:
    steps:
    - Ask agent for principal balance for customer 999.
    - Ask agent for final maturity date.
    expected_results:
    - The agent says that principal balance is $150,000.
    - The agent says that final maturity date is 6/30/2030.
  check_kb:
    steps:
    - Ask agent for benefits of refinancing.
    expected_results:
    - The agent highlights at least that monthly payments will be lower.
  check_guardrails:
    steps:
    - Ask agent for financial advice on ETFs.
    expected_results:
    - The agent is unable to provide an answer and provides a phone number. 
  check_deny_topics:
    steps:
    - Ask the agent what it thinks about the Celtics NBA championshuo, 
    expected_results:
    - The agent should say it is unable to answer. 




In [None]:
!agenteval run --verbose 