# Fully custom LLM evaluation with Amazon Bedrock and fmeval

> *This notebook has been tested in the Python 3 kernel of SageMaker Studio JupyterLab (Distribution v1.6)*

Automating the evaluation of LLMs is useful (to accelerate prompt engineering and solution optimization), but difficult (because the models output natural language text).

The [open-source `fmeval` library](https://github.com/aws/fmeval) provides a range of evaluation algorithms, metrics, and integrations, and underpins the native automated foundation model evaluation capabilities offered [in Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html) and [in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-foundation-model-evaluate.html).

Recently, **LLM-based** automated evaluation procedures (which use one or more evaluator LLMs to judge the output of a candidate LLM) have become popular, with options included in tools like [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/#response-evaluation) and [Ragas](https://docs.ragas.io/en/latest/concepts/metrics/critique.html). However, at the time of writing, `fmeval`'s built-in evaluation algorithms do **not** include any LLM-critique-based methods.

This notebook shows a method to evaluate an LLM's accuracy for in-context question answering (with reference to known "ground truth" answers), using an evaluator LLM to determine whether the candidate model's response is in line with the ground truth. We show how this can be achieved within the context of `fmeval` (by providing a custom evaluation algorithm through the fmeval API), and explore how the LLM-based and built-in `QAAccuracy` metrics differ.

## Prerequisites

The [fmeval library](https://github.com/aws/fmeval) is not installed on SageMaker Studio JupyterLab kernels by default, so we'll first need to install it:

In [None]:
%pip install "fmeval>=1.0,<2.0"

You'll also need to **enable access to Anthropic Claude v3 (Haiku or Sonnet)** in Amazon Bedrock:

1. Select an [AWS Region where the model is available](https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html), which doesn't have to be the Region where you deployed this notebook: for example, `us-west-2`.
2. [Enable access to the Claude model in that region](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) through the Amazon Bedrock Console.
3. Grant [Bedrock IAM permissions to your SageMaker notebook execution role](https://docs.aws.amazon.com/bedrock/latest/userguide/api-setup.html#api-using-sage) to be able to invoke the model from here in the notebook.

Configure your Bedrock region, candidate & evaluator model ID, and Amazon S3 bucket in the cell below:

In [2]:
# Python Built-Ins:
from dataclasses import dataclass
import json
import os
from string import Template
from typing import Any, Callable, List, Optional

# External Dependencies:
import boto3
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
import pandas as pd  # Utilities for working with dataframes (tabular data)
import sagemaker  # Amazon SageMaker high-level SDK


BEDROCK_REGION = boto3.Session().region_name  # Override this with e.g. "us-east-1" if you need
CANDIDATE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
EVAL_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
bucket_name = sagemaker.Session().default_bucket()  # (Or a custom S3 bucket if you prefer)
prefix = "llm-eval/demo/squad"

pd.options.display.max_colwidth = 70



## A sample dataset

To demonstrate the pattern, we'll consider an **in-context question answering** use-case where the candidate LLM is presented with both a question and a document/snippet of source text that should include the answer. This is similar to the [Retrieval-Augmented Generation (RAG)](https://aws.amazon.com/what-is/retrieval-augmented-generation/) pattern, but assuming we've already been able to retrieve the relevant source document(s) for our query and are focussing solely on generating a coherent, accurate answer from that context.

Specifically, we'll use a small extract of the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) which has already been transformed and pre-processed in [datasets/question-answering/eval-job-input-qa.manifest.jsonl](datasets/question-answering/eval-job-input-qa.manifest.jsonl). See [datasets/Prepare-SQuAD.ipynb](datasets/Prepare-SQuAD.ipynb) for the code that was used to create this extract from the raw public SQuAD source.

## The default `QAAccuracy` algorithm

To better understand why it might be useful to implement a custom question-answering accuracy evaluator, let's try the fmeval default algorithm first.

(See the public fmeval example notebooks e.g. [examples/bedrock-claude-factual-knowledge.ipynb](https://github.com/aws/fmeval/blob/main/examples/bedrock-claude-factual-knowledge.ipynb) for more demos on how to set up model evaluation)

In [None]:
%%time

# Create a BedrockModelRunner with Claude v3's expected API structure:
os.environ["AWS_REGION"] = BEDROCK_REGION
candidate_model_runner = BedrockModelRunner(
    model_id=CANDIDATE_MODEL_ID,
    output="content[0].text",
    content_template='{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}',
)

# Initialise the built-in QAAccuracy evaluator with default settings:
eval_algo = QAAccuracy()

# Configure the dataset created in the last notebook:
data_config = DataConfig(
    dataset_name="squad_demo",
    dataset_uri="datasets/question-answering/eval-job-input-qa.manifest.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
    target_output_location="referenceResponse"
)

# Run the evaluation and save detailed results to local folder:
eval_output = eval_algo.evaluate(
    model=candidate_model_runner,
    dataset_config=data_config,
    prompt_template="$model_input",  # Prompt templating already done in data prep
    save=True,
)
with open("/tmp/eval_results/qa_accuracy_squad_demo.jsonl") as fin:
    os.makedirs("datasets/eval-local", exist_ok=True)
    with open("datasets/eval-local/qa_accuracy_squad_demo.jsonl", "w") as fout:
        fout.write(fin.read())
eval_output
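
The `content_template` and `output` arguments used above can be easier to reason about with a small standalone sketch. The code below is not fmeval's implementation: it assumes the runner fills `$prompt` with a JSON-encoded string (which is why the placeholder appears without surrounding quotes) and that `content[0].text` is a path into the parsed response. The helper names and the sample response are hypothetical.

```python
import json
from string import Template

# Same template string as passed to BedrockModelRunner above
CONTENT_TEMPLATE = (
    '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, '
    '"messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}'
)


def compose_request_body(prompt: str) -> dict:
    """Fill $prompt with a JSON-encoded string and parse the result (assumed behaviour)."""
    filled = Template(CONTENT_TEMPLATE).substitute(prompt=json.dumps(prompt))
    return json.loads(filled)


def extract_output(response: dict) -> str:
    """Plain-Python equivalent of the "content[0].text" output path."""
    return response["content"][0]["text"]


# Quotes and newlines in the prompt survive because the value is JSON-encoded first
body = compose_request_body('Where is "Normandy"?\nAnswer briefly.')
print(json.dumps(body, indent=2))

# Hypothetical response in the shape implied by the output path
sample_response = {"content": [{"type": "text", "text": "France"}]}
print(extract_output(sample_response))
```

If the placeholder were wrapped in quotes inside the template, a JSON-encoded prompt would end up double-quoted and the body would no longer parse, which is a likely reason the template is written this way.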

Many of the metrics generated by this out-of-the-box evaluator paint a pretty pessimistic view of the model's performance on the task, as shown in the summary below. Only `recall_over_words` gives a high score of ~97%:

In [4]:
for score in eval_output[0].dataset_scores:
    print(f"{score.name}: {score.value}")

f1_score: 0.17133769728379025
exact_match_score: 0.0
quasi_exact_match_score: 0.0
precision_over_words: 0.10331109925282639
recall_over_words: 0.9714285714285714


To get some more insight, let's look at a few actual examples from the dataset:

In [5]:
with open("datasets/question-answering/eval-job-input-qa.manifest.jsonl") as f:
    for ix, line in enumerate(f):
        print(f"EXAMPLE {ix}\n----------------")
        datum = json.loads(line)
        print(f"Target Response: {datum['referenceResponse']}\n")
        model_resp = candidate_model_runner.predict(datum["prompt"])
        print(f"Model Response:\n{model_resp[0]}\n----------------\n")
        if ix > 1:
            break

EXAMPLE 0
----------------
Target Response: France

Model Response:
According to the given documentation, Normandy is a region in France. The document states that "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France."
----------------

EXAMPLE 1
----------------
Target Response: Computational complexity theory

Model Response:
According to the provided documentation, the branch of theoretical computer science that deals with classifying computational problems by their inherent difficulty and relating those classes to each other is called computational complexity theory.

The document states that "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other."
----------------

EXAMPLE 2
---------------

Qualitatively, the problem is that while the model is returning *"correct"* answers, it's also providing explanations while the reference answers in the dataset are extremely concise. The `recall_over_words` metric is very high because Claude is including the right answer in its response most of the time, but metrics like `f1_score` and `exact_match_score` are very low because Claude is never giving **only** the extracted answer text in the style of the reference answer.
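
The gap between recall and F1 described above can be reproduced with a simplified word-overlap calculation. This is only a sketch: the tokenisation (lower-casing and keeping alphanumeric runs) is an assumption, and fmeval's own normalisation may differ in detail.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphanumeric word tokens (simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap_scores(model_output: str, target: str) -> Dict[str, float]:
    """Word-level precision, recall and F1 of a model output against one reference."""
    out_tokens = tokenize(model_output)
    tgt_tokens = tokenize(target)
    # Multiset intersection counts each shared word at most as often as it appears in both
    overlap = sum((Counter(out_tokens) & Counter(tgt_tokens)).values())
    precision = overlap / len(out_tokens) if out_tokens else 0.0
    recall = overlap / len(tgt_tokens) if tgt_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = word_overlap_scores(
    model_output="According to the given documentation, Normandy is a region in France.",
    target="France",
)
print(scores)
```

Here the single reference word is present, so recall is 1.0, but it is only one of eleven words in the response, so precision is 1/11 (about 0.09) and F1 is 1/6 (about 0.17).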

For some use-cases, you may **want** this behaviour that penalizes the model for deviations from the expected style of the reference answer(s). In this case, the most likely solution would be to re-engineer the input prompts to explicitly request a more concise response from the model.

...But in other cases, you might **allow** these stylistic deviations (and even think the added explanations are useful) - and instead want to evaluate **specific aspects** of the LLM's response like its "correctness" or alignment with the reference answer. For this, we can **ask an evaluator LLM** to make a judgement.

We can define these custom evaluation algorithms, including ones that might make additional LLM calls to calculate metrics, using the fmeval API.

## Set up a custom fmeval evaluator

In fmeval, custom evaluation algorithms should implement the [EvalAlgorithmInterface](https://github.com/aws/fmeval/blob/5d1af0949f404c522eb37c7567059e37666cfa8b/src/fmeval/eval_algorithms/eval_algorithm.py#L13).

The [existing evaluation algorithms](https://github.com/aws/fmeval/tree/main/src/fmeval/eval_algorithms) provide a nice reference to work from when implementing custom algorithms, although many of them (like `QAAccuracy`) are a little complex due to their number of features.

The below evaluator is similar to the implementation in [infra/prompt_app/src/datamodel/evaluations/self_critique.py](infra/prompt_app/src/datamodel/evaluations/self_critique.py), used by our prompt engineering app example.

In [6]:
# Python Built-Ins:
from dataclasses import dataclass
import json
from string import Template
from typing import Any, Dict, List, Optional

# External Dependencies:
from fmeval.constants import DatasetColumns, MEAN
from fmeval.data_loaders.data_config import DataConfig
from fmeval.data_loaders.util import get_dataset
from fmeval.eval_algorithms import EvalAlgorithm, EvalOutput, EvalScore
from fmeval.eval_algorithms.eval_algorithm import EvalAlgorithmConfig, EvalAlgorithmInterface
from fmeval.eval_algorithms.util import evaluate_dataset, get_dataset_configs, validate_dataset
from fmeval.exceptions import EvalAlgorithmClientError
from fmeval.model_runners.model_runner import ModelRunner
from fmeval.transforms.transform import Transform
from fmeval.transforms.transform_pipeline import TransformPipeline
from fmeval.transforms.util import validate_call
from fmeval.util import get_eval_results_path

# Prompt template to ask an LLM to critique correctness of answer vs ground truth:
EVAL_TPL = Template("""An AI model was asked a question for which the reference correct answer(s) were:

<ref-answers>
${target}
</ref-answers>

The model's answer was:

<model-answer>
${output}
</model-answer>

Did the model answer correctly in agreement with the provided reference(s)? Answer only Y for yes
or N for no, and do not include any other information or reasoning.
""")

OUTPUT_KEY = "llm_judged_accuracy"

class QAAccuracyByLLMScores(Transform):
    """Scorer inspired by fmeval.eval_algorithms.qa_accuracy.QAAccuracyScores"""
    def __init__(
        self,
        eval_model_runners: List[ModelRunner],
        target_output_key: str = DatasetColumns.TARGET_OUTPUT.value.name,
        model_output_key: str = DatasetColumns.MODEL_OUTPUT.value.name,
        target_output_delimiter: Optional[str] = "<OR>",
    ):
        output_keys = [OUTPUT_KEY]
        super().__init__(
            eval_model_runners,
            target_output_key,
            model_output_key,
            target_output_delimiter,
        )
        self.register_input_output_keys(
            input_keys=[target_output_key, model_output_key],
            output_keys=output_keys,
        )
        self.target_output_key = target_output_key
        self.model_output_key = model_output_key
        self.output_keys = output_keys
        self.target_output_delimiter = target_output_delimiter

        self.eval_model_runners = eval_model_runners
        if not (self.eval_model_runners and len(self.eval_model_runners)):
            raise EvalAlgorithmClientError(
                "You must provide at least one ModelRunner for LLM-based QA Accuracy evaluation"
            )

    @staticmethod
    def _get_score(model_runner: ModelRunner, model_output: str, targets: List[str]) -> float:
        prompt = EVAL_TPL.substitute(
            target="\n".join([f"<ref-answer>{t}</ref-answer>" for t in targets]),
            output=model_output,
        )
        eval_resp, logprobs = model_runner.predict(prompt)
        eval_resp = eval_resp.strip().upper()
        if not len(eval_resp):
            return 0.5  # Swallow unexpected evaluation result & return 'not sure'
        elif eval_resp[0] == "Y":
            return 1.0
        elif eval_resp[0] == "N":
            return 0.0
        else:
            return 0.5  # Swallow unexpected evaluation result & return 'not sure'

    @validate_call
    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
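        model_output = record[self.model_output_key]
        target_output = record[self.target_output_key]
        # A delimiter of None means "single reference answer": str.split(None) would
        # otherwise split the target on whitespace.
        target_outputs = (
            [target_output]
            if self.target_output_delimiter is None
            else target_output.split(self.target_output_delimiter)
        )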

        # Return average score across all evaluator models
        record[OUTPUT_KEY] = sum(
            self._get_score(model_runner=runner, model_output=model_output, targets=target_outputs)
            for runner in self.eval_model_runners
        ) / len(self.eval_model_runners)
        return record


@dataclass(frozen=True)
class QAAccuracyByLLMConfig(EvalAlgorithmConfig):
    """Configuration for the QA Accuracy Evaluation

    :param eval_model_runners: The QAAccuracyByLLM evaluator uses one or more LLMs to judge whether the
        answer generated by the model under test is in agreement with the reference answers. Therefore you need to
        provide one or more ModelRunner instances to use for the evaluation.
    :param target_output_delimiter: Target Output can have multiple answers. We expect the customer to combine all the
        possible answers into a single string and use the delimiter to separate them. For instance,
        if the answers are ["UK", "England"] and the delimiter="<OR>", then the target_output should be "UK<OR>England".
    """

    eval_model_runners: List[ModelRunner]
    target_output_delimiter: Optional[str] = "<OR>"

    def __post_init__(self):
        if not len(self.eval_model_runners):
            raise EvalAlgorithmClientError(
                "You must provide at least one ModelRunner for LLM-based QA Accuracy evaluation"
            )
        if self.target_output_delimiter == "":
            raise EvalAlgorithmClientError(
                "Empty target_output_delimiter is provided. Please either provide a non-empty string, or set it to None."
            )


class QAAccuracyByLLM(EvalAlgorithmInterface):
    """This evaluation measures question answering (QA) performance via critique from LLM(s)

    The code is closely aligned to fmeval's vanilla `QAAccuracy` eval algorithm, since the actual
    logic is implemented in the `QAAccuracyByLLMScores` transformer. We had to re-implement (rather
    than re-using the QAAccuracy) because of the way constants like the list of score names are
    referenced in the upstream.

    This evaluator outputs one metric only: the mean score (0, 0.5 or 1 per evaluator) assigned to
    each response by the panel of (potentially multiple) evaluator model runners.
    """

    eval_name = "qa_accuracy_by_llm"

    def __init__(self, eval_algorithm_config: QAAccuracyByLLMConfig):
        super().__init__(eval_algorithm_config)
        self._eval_algorithm_config = eval_algorithm_config
        self.transform = QAAccuracyByLLMScores(
            eval_model_runners=eval_algorithm_config.eval_model_runners,
            target_output_delimiter=eval_algorithm_config.target_output_delimiter,
        )

    def evaluate_sample(self, target_output: str, model_output: str) -> List[EvalScore]:
        """Compute QA accuracy metrics for a single sample.

        :param target_output: The expected/desired model output.
        :param model_output: The actual model output.
        :returns: A list of EvalScore objects, one for each of the QA accuracy metrics.
        """
        target_output_key = self.transform.target_output_key
        model_output_key = self.transform.model_output_key
        sample = {target_output_key: target_output, model_output_key: model_output}
        pipeline = TransformPipeline([self.transform])
        result = pipeline.execute_record(sample)
        return [EvalScore(name=score_name, value=result[score_name]) for score_name in self.transform.output_keys]

    def evaluate(
        self,
        model: Optional[ModelRunner] = None,
        dataset_config: Optional[DataConfig] = None,
        prompt_template: Optional[str] = None,
        num_records: int = 100,
        save: bool = False,
    ) -> List[EvalOutput]:
        dataset_configs = get_dataset_configs(dataset_config, self.eval_name)
        eval_outputs = []
        for dataset_config in dataset_configs:
            dataset = get_dataset(dataset_config, num_records)
            validate_dataset(dataset, [DatasetColumns.TARGET_OUTPUT.value.name])
            eval_output = evaluate_dataset(
                dataset=dataset,
                pipeline=TransformPipeline([self.transform]),
                dataset_name=dataset_config.dataset_name,
                eval_name=self.eval_name,
                metric_names=self.transform.output_keys,
                eval_results_path=get_eval_results_path(),
                model=model,
                prompt_template=prompt_template,
                agg_method=MEAN,
                save=save,
            )
            eval_outputs.append(eval_output)
        return eval_outputs
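
To make the scoring logic in `QAAccuracyByLLMScores` easier to follow in isolation, here is a dependency-free sketch of the same idea: split the delimited reference answers, map each evaluator's Y/N reply to a score, and average across the panel. The judge functions are hypothetical stand-ins for real `ModelRunner` calls, and treating a delimiter of `None` as "do not split" is an assumption.

```python
from typing import Callable, List, Optional

# A judge takes (model_output, targets) and returns the evaluator's raw text reply
Judge = Callable[[str, List[str]], str]


def split_targets(target_output: str, delimiter: Optional[str] = "<OR>") -> List[str]:
    """Split e.g. "UK<OR>England" into ["UK", "England"]; None means a single answer."""
    if delimiter is None:
        return [target_output]
    return target_output.split(delimiter)


def verdict_to_score(eval_resp: str) -> float:
    """Map a reply starting with Y to 1.0, with N to 0.0, and anything else to 0.5."""
    cleaned = eval_resp.strip().upper()
    if cleaned.startswith("Y"):
        return 1.0
    if cleaned.startswith("N"):
        return 0.0
    return 0.5  # Unexpected or empty reply: treat as 'not sure'


def panel_score(judges: List[Judge], model_output: str, targets: List[str]) -> float:
    """Average the per-judge scores, as the transform does across model runners."""
    scores = [verdict_to_score(judge(model_output, targets)) for judge in judges]
    return sum(scores) / len(scores)


def substring_judge(model_output: str, targets: List[str]) -> str:
    """Hypothetical judge: says Y if any reference appears in the model output."""
    return "Y" if any(t.lower() in model_output.lower() for t in targets) else "N"


def confused_judge(model_output: str, targets: List[str]) -> str:
    """Hypothetical judge that ignores the instruction to answer only Y or N."""
    return "I cannot decide."


targets = split_targets("UK<OR>England")
result = panel_score([substring_judge, confused_judge], "The answer is England.", targets)
print(targets, result)
```

Because only the first character of the reply is inspected, an off-script reply that happens to start with "N" or "Y" (such as "Not sure") would still be counted as a definite verdict, so it is worth spot-checking raw evaluator replies when scores look surprising.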

## Running the custom evaluation

With the custom evaluation algorithm defined, we can run a job similarly to before but substitute our own algorithm instead.

Note that since this algorithm uses a (panel of) LLM(s) to evaluate the response from the model under test, it requires additional `BedrockModelRunner`(s) in the configuration:

In [None]:
%%time

# Create a BedrockModelRunner for the evaluator model:
eval_model_runner = BedrockModelRunner(
    model_id=EVAL_MODEL_ID,
    output="content[0].text",
    content_template='{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}',
)

# Initialise the custom evaluation algorithm with the relevant config:
eval_by_llm_algo = QAAccuracyByLLM(QAAccuracyByLLMConfig(eval_model_runners=[eval_model_runner]))

# Run the evaluation and save detailed results to local folder:
eval_by_llm_output = eval_by_llm_algo.evaluate(
    model=candidate_model_runner,
    dataset_config=data_config,
    prompt_template="$model_input",  # Prompt templating already done in data prep
    save=True,
)
with open("/tmp/eval_results/qa_accuracy_by_llm_squad_demo.jsonl") as fin:
    os.makedirs("datasets/eval-local", exist_ok=True)
    with open("datasets/eval-local/qa_accuracy_by_llm_squad_demo.jsonl", "w") as fout:
        fout.write(fin.read())
eval_by_llm_output

The LLM-based evaluation, since it's only checking specifically for correctness of the generated answer in line with the reference answer, is much more generous and in line with our human perception of the model's performance for the task.

In this particular case, where we'd like to evaluate the correctness of models' answers against a ground truth independently of the "style" of the answer, this provides a more useful view of performance than default token-based metrics:

In [8]:
for score in eval_by_llm_output[0].dataset_scores:
    print(f"{score.name}: {score.value}")

llm_judged_accuracy: 0.9142857142857143


## Exploring results in detail

The detailed JSON-Lines result files also allow exploring results at the record level if required:

In [9]:
out_records = []
with open("datasets/eval-local/qa_accuracy_by_llm_squad_demo.jsonl") as f:
    for line in f:
        datum = json.loads(line)
        out_records.append({
            "target_output": datum["target_output"],
            "model_output": datum["model_output"],
            "llm_judged_accuracy": next(s for s in datum["scores"] if s["name"] == "llm_judged_accuracy")["value"],
        })
out_df = pd.DataFrame(out_records)
out_df

|   | target_output | model_output | llm_judged_accuracy |
|---|---|---|---|
| 0 | October<OR>October 1973<OR>1973 | According to the documentation, the 1973 oil crisis began in Octob... | 1.0 |
| 1 | Construction | Based on the documentation provided, the process of constructing a... | 1.0 |
| 2 | Paul Baran developed the concept Distributed Adaptive Message Bloc... | According to the documentation, Paul Baran developed the concept o... | 1.0 |
| 3 | Computational complexity theory | According to the provided documentation, the branch of theoretical... | 1.0 |
| 4 | diversified<OR>highly diversified | According to the documentation provided, the economy of Victoria i... | 1.0 |
| 5 | 1998<OR>Following a referendum in 1997 | According to the documentation, the current Parliament of Scotland... | 0.0 |
| 6 | Central Asia<OR>the arid plains of Central Asia | According to the AnyCompany documentation, the Black Death is thou... | 1.0 |
| 7 | Latin | According to the documentation provided, the word "imperialism" or... | 1.0 |
| 8 | the concept of force<OR>force | According to the documentation, philosophers in antiquity used the... | 1.0 |
| 9 | France | According to the documentation provided, Normandy is located in Fr... | 1.0 |


In [11]:
item = out_df[out_df["llm_judged_accuracy"] == 0].iloc[0]

print(f"Target\n------\n{item.target_output}")
print(f"\nActual\n------\n{item.model_output}")

Target
------
1998<OR>Following a referendum in 1997

Actual
------
According to the documentation, the current Parliament of Scotland was convened by the Scotland Act 1998, which sets out its powers as a devolved legislature. The first meeting of the new Parliament took place on 12 May 1999.
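
For a quick overview without pandas, the record-level scores can also be tallied directly from the JSON-Lines content. The sketch below assumes each line has a `scores` list of `{"name", "value"}` entries, as read in the cell above; the sample records are hypothetical and kept in memory so the example is self-contained.

```python
import json
from collections import Counter
from typing import Iterable

METRIC = "llm_judged_accuracy"


def judged_score(record: dict, metric: str = METRIC) -> float:
    """Pull one named metric value out of a result record."""
    return next(s["value"] for s in record["scores"] if s["name"] == metric)


def label(score: float) -> str:
    """Bucket a score: 1.0 and 0.0 are unanimous verdicts, anything else is not."""
    if score == 1.0:
        return "correct"
    if score == 0.0:
        return "incorrect"
    return "uncertain_or_split"


def summarise(lines: Iterable[str]) -> dict:
    """Compute the mean score and a count per bucket from JSON-Lines text."""
    scores = [judged_score(json.loads(line)) for line in lines if line.strip()]
    return {
        "mean": sum(scores) / len(scores),
        "counts": dict(Counter(label(s) for s in scores)),
    }


# Hypothetical records in the same shape as the saved results file
sample_lines = [
    json.dumps({"target_output": "France", "model_output": "...", "scores": [{"name": METRIC, "value": v}]})
    for v in (1.0, 0.0, 0.5, 1.0)
]
summary = summarise(sample_lines)
print(summary)
```

To run this against the real results, the lines of `datasets/eval-local/qa_accuracy_by_llm_squad_demo.jsonl` could be passed to `summarise` in place of `sample_lines`.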


## Conclusions

While the `fmeval` library provides a range of pre-built evaluation metrics for foundation model evaluation, the open design also supports bringing custom algorithms. This can be extended to include LLM-critique-based evaluations, which can be especially useful in cases where *particular aspects* of the response need to be evaluated - but are hard to describe using traditional pattern matching or text processing methods alone.