# Conversational tests with `agent-evaluation`

> *This notebook has been tested in the Python 3 kernel of SageMaker Studio JupyterLab (Distribution v1.9)*

In this notebook, we'll show how you can use AWS Labs' open-source [`agent-evaluation` framework](https://awslabs.github.io/agent-evaluation/) to validate that integrated agent systems perform as expected over multi-turn conversations - with expectations defined in natural language.

## Prerequisites

To get started, we'll first need to install the framework:

In [None]:
# Force Pydantic for https://github.com/aws/sagemaker-distribution/issues/436
%pip install agent-evaluation "pydantic>=2.8,<2.9"

`agent-evaluation` supports evaluating a range of ["target"](https://awslabs.github.io/agent-evaluation/targets/) types, including:

- Amazon Bedrock Agents and Knowledge Bases
- Amazon Q Business
- Amazon SageMaker Endpoints
- Custom targets

In this example, we'll use the **Knowledge Base for Amazon Bedrock** created in the [../RAG (Bedrock and Ragas).ipynb notebook](../RAG%20(Bedrock%20and%20Ragas).ipynb).

▶️ **Follow** the instructions in the other notebook to create the sample Bedrock KB, if you haven't already

▶️ **Replace** the 'TODO' in the code cell below with the *auto-generated ID* of your Bedrock KB, which should be a short alphanumeric string

In [None]:
knowledge_base_id = "TODO"  # For example "G6GPI4YRUW"
generate_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
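Before moving on, it can help to confirm that the placeholder was actually replaced. The cell below is a minimal sketch rather than anything provided by the framework: `looks_like_kb_id` is a hypothetical helper, and the only rule it encodes is the one stated above (the ID is a short alphanumeric string, and is not the literal `TODO`).

```python
import re


def looks_like_kb_id(value: str) -> bool:
    """Return True if `value` is plausibly a KB ID rather than the placeholder.

    ASSUMPTION: IDs consist only of ASCII letters and digits, as described
    in the instructions above. This does not check that the KB exists.
    """
    if value == "TODO":
        return False
    return re.fullmatch(r"[0-9A-Za-z]+", value) is not None


# The example ID from the cell above passes; the untouched placeholder does not.
print(looks_like_kb_id("G6GPI4YRUW"))
print(looks_like_kb_id("TODO"))
```

In the notebook you could call `looks_like_kb_id(knowledge_base_id)` and stop early if it returns `False`.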

## Define the test scripts

While data scientists may be familiar with using validation datasets to calculate **representative metrics** describing aggregate model quality across many samples, software engineers might be more used to defining **test-cases** to confirm a module or system performs as expected across a range of scenarios.

In LLM-based application engineering, both of these perspectives are valuable:

- When we have a sufficiently large and **real-world-representative** dataset available, a metrics-based approach can quantify average system performance across common use-cases, giving us a useful target to optimize towards.
- But in the **early stages** of a project, where we might be much less confident about which queries are common or rare, it might be more natural to take a *test-case-based approach* and expect our system to pass 100% of defined example journeys.

With the `agent-evaluation` framework:

- Builders [write test cases](https://awslabs.github.io/agent-evaluation/user_guide/#writing-test-cases) in a [YAML](https://en.wikipedia.org/wiki/YAML)-based format, and use the framework to run the test cases and report on successes & failures.
- Test-cases can be **multi-turn conversations**, supporting end-to-end testing of more complex user journeys
- Actual system inputs (user messages) are **generated** based on your specifications by an LLM, not taken verbatim from your test plan... So if the exact wording of your question is important, be specific!
- System outputs (bot responses) are also **judged** against your provided expectations by an LLM, so you can be flexible (but should be specific) when describing what behaviour you want to see.
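To make the shape of a multi-turn test case more concrete, here is a rough sketch written as a Python dict instead of YAML. The test content is invented for illustration and is not taken from `agenteval.tpl.yml`. The `steps` and `expected_results` field names are an assumption based on the framework's user guide, so check the linked documentation before relying on them. `find_problems` is a hypothetical helper, not part of `agent-evaluation`.

```python
# Hypothetical test case: each step is a natural-language instruction that the
# framework's LLM turns into an actual user message, and each expected result
# is a natural-language criterion that the LLM judge checks.
# ASSUMPTION: field names mirror the framework's documented YAML format.
example_test = {
    "steps": [
        "Ask the agent which company the document is about.",
        "Ask the agent a follow-up question that depends on the previous answer.",
    ],
    "expected_results": [
        "The agent names the company.",
        "The agent answers the follow-up using context from the first turn.",
    ],
}


def find_problems(test: dict) -> list[str]:
    """Return a list of human-readable issues; an empty list means none found."""
    problems = []
    for field in ("steps", "expected_results"):
        items = test.get(field)
        if not isinstance(items, list) or not items:
            problems.append(f"'{field}' should be a non-empty list")
        elif not all(isinstance(item, str) and item.strip() for item in items):
            problems.append(f"every entry in '{field}' should be non-empty text")
    return problems


print(find_problems(example_test))
```

Because both the inputs and the judgements pass through an LLM, vague entries in either list tend to produce less predictable test runs.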

In this example, we've already [initialized a template test plan](https://awslabs.github.io/agent-evaluation/user_guide) for you in [agenteval.tpl.yml](agenteval.tpl.yml).

Run the cell below to generate the final `agenteval.yml` with your Bedrock KB and generator model ID populated:

In [None]:
with open("agenteval.tpl.yml") as ftpl:
    test_spec_str = ftpl.read()

test_spec_str = test_spec_str.replace("${kb_model_id}", generate_model_id)
test_spec_str = test_spec_str.replace("${kb_id}", knowledge_base_id)

with open("agenteval.yml", "w") as fspec:
    fspec.write("#### AUTO-GENERATED FILE - Edit agenteval.tpl.yml instead! ####\n")
    fspec.write(test_spec_str)
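One drawback of plain `str.replace` is that a misspelled or newly added placeholder passes through silently. As an alternative sketch, Python's standard `string.Template` uses the same `${name}` syntax and raises an error when a placeholder has no value. The `render_plan` helper and the two-line demo template below are hypothetical and are not the contents of `agenteval.tpl.yml`. Note also that `Template` treats any literal `$` in the file as special (it must be written `$$`), so this only works if the real template has no stray dollar signs.

```python
from string import Template


def render_plan(template_text: str, values: dict) -> str:
    """Fill ${name} placeholders, failing loudly if any value is missing.

    Template.substitute raises KeyError for a placeholder that has no entry
    in `values`, and ValueError for a malformed placeholder.
    """
    return Template(template_text).substitute(values)


# Made-up template text, used only to demonstrate the substitution.
demo_template = "knowledge_base_id: ${kb_id}\nmodel_id: ${kb_model_id}\n"

rendered = render_plan(
    demo_template,
    {
        "kb_id": "G6GPI4YRUW",
        "kb_model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
    },
)
print(rendered)
```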

## Run the tests

The [agenteval CLI](https://awslabs.github.io/agent-evaluation/cli/) can run your tests with multi-threading, report generation, and conditional return codes, which makes it a good fit for CI/CD workflows and similar in spirit to conventional test automation tools like [pytest](https://docs.pytest.org/):

> ℹ️ **Remember**: If you edited `agenteval.tpl.yml`, you'll need to re-run the cell above to refresh your `agenteval.yml`!

In [None]:
!agenteval run --plan-dir . --num-threads 8 --verbose
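Outside a notebook, for example in a CI script, the same command can be launched from Python. The following is a sketch under stated assumptions: it reuses only the flags shown in the cell above, the helper names `build_agenteval_command` and `run_agenteval` are hypothetical, and it assumes the usual CLI convention that a non-zero exit code signals a failed run.

```python
import subprocess


def build_agenteval_command(plan_dir: str = ".", num_threads: int = 8,
                            verbose: bool = True) -> list[str]:
    """Assemble the argument list for the CLI call shown in the cell above."""
    cmd = ["agenteval", "run", "--plan-dir", plan_dir,
           "--num-threads", str(num_threads)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_agenteval(plan_dir: str = ".") -> int:
    """Run the CLI and return its exit code without raising on failure.

    ASSUMPTION: a non-zero code means the run did not fully succeed.
    """
    result = subprocess.run(build_agenteval_command(plan_dir), check=False)
    return result.returncode


# Show the command that would be executed, without actually running it.
print(" ".join(build_agenteval_command()))

# In a CI job you might then do something like:
#     import sys
#     sys.exit(run_agenteval("."))
```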

As well as the test result logs, you should see that the framework created:

- [agenteval_summary.md](agenteval_summary.md) - a **human-readable report** in [Markdown](https://en.wikipedia.org/wiki/Markdown) format
- [agenteval_traces](agenteval_traces) folder - including **structured JSON files** per test-case, with full details of the reasoning chain and agent/evaluator invocations taken

▶️ Did the tests all pass as expected? **Check** the summary and trace JSONs: What were the *actual input messages* sent to the Bedrock KB for each test case?
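If you'd rather explore the traces programmatically than open each file by hand, a small sketch like the one below can give a first overview. It makes no assumptions about the trace schema: it only lists the top-level keys of each JSON file, so you can decide which fields to dig into. The `list_trace_keys` helper is hypothetical, and searching sub-folders recursively is a precaution rather than a documented layout.

```python
import json
from pathlib import Path


def list_trace_keys(trace_dir: str = "agenteval_traces") -> dict:
    """Map each trace file (path relative to trace_dir) to its top-level keys.

    Files whose top-level JSON value is not an object map to an empty list.
    A missing folder simply yields an empty result.
    """
    root = Path(trace_dir)
    summary = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        keys = sorted(data) if isinstance(data, dict) else []
        summary[path.relative_to(root).as_posix()] = keys
    return summary


for name, keys in list_trace_keys().items():
    print(name, keys)
```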

▶️ **Why** is the second step of the `amazon_followup` test case structured the way it is? What happens if you replace it with simply `Ask the agent how many trees are in it`?

## Clean-Up

Once you're done experimenting, refer to the *Clean-Up section* of the [../RAG (Bedrock and Ragas).ipynb notebook](../RAG%20(Bedrock%20and%20Ragas).ipynb) for steps to delete your Bedrock Knowledge Base, to avoid ongoing charges.

## Summary

[AWS Labs' agent-evaluation](https://awslabs.github.io/agent-evaluation) provides an open-source, conversational test-case-based framework for validating that LLM-based systems complete example user journeys as expected - similar to traditional integration test frameworks for software engineering.

It can integrate with a range of orchestration tools like Amazon Bedrock Agents and Amazon Q Business, but also supports [custom targets](https://awslabs.github.io/agent-evaluation/targets/custom_targets/) for you to connect to other integrated systems as needed.

Interestingly, both the generation of input messages and the evaluation of outputs against your listed criteria are **LLM-powered**, so you should take care when writing your test plan to avoid ambiguity about what the input messages should be and what is or isn't acceptable in a response.

These kinds of tests can be especially useful **early on** in your application building journey, when you might not have a clear idea of which journeys will be most common or a large dataset of example messages to draw on. As you build up a more mature understanding of this distribution and a bigger dataset of examples, it may be useful to transition to more **metrics-based** evaluation: where some test failures are expected and builders work to *increase overall pass rate* rather than to *retain 100% success*.