# Evaluate an Amazon Bedrock model with FMeval and track with MLflow

***
Developed and tested on the JupyterLab app in Amazon SageMaker Studio, SageMaker Distribution 2.1.0, instance `ml.m5.2xlarge`
***

## Setup

### Import libraries

In [None]:
from pathlib import Path

import mlflow
from dotenv import load_dotenv
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

from utils import EvaluationSet, run_evaluation_sets, run_evaluation_sets_nested

We set the environment variables `MLFLOW_TRACKING_URI` and `MLFLOW_TRACKING_USERNAME` from the `.env` file created in [00-Setup](./00-Setup.ipynb).
Alternatively, you can set the tracking URI with the `mlflow` SDK method:

``` python
mlflow.set_tracking_uri(tracking_server_arn)
```

In [None]:
load_dotenv()

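Before starting any runs, it can help to confirm that `load_dotenv()` actually populated the two variables named above. The following is a minimal sketch that uses only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of the notebook's `utils` module.

```python
import os


def missing_env_vars(names, env=None):
    """Return the variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variables the notebook expects to find in the .env file.
required = ["MLFLOW_TRACKING_URI", "MLFLOW_TRACKING_USERNAME"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If the check reports a missing variable, re-running the setup notebook or calling `mlflow.set_tracking_uri()` directly are the two options described above.
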
### Model Runner Setup

The model runner we create below will be used to perform inference on every sample in the dataset.

In [None]:
model_id = "INSERT-BEDROCK-MODEL-ID-HERE"

We need the content template for the model. You can find it in the Amazon Bedrock console, in the `API request` sample section, by looking at the value of `body`. As an example, here is the content template for Claude 3 Haiku:

In [None]:
output_jmespath = "content[0].text"
content_template = """{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 512,
  "temperature": 0.5,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": $prompt
        }
      ]
    }
  ]
}"""

model_runner = BedrockModelRunner(
    model_id=model_id,
    output=output_jmespath,
    content_template=content_template,
)

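The sketch below illustrates why `$prompt` appears without surrounding quotes in the content template, and what the `content[0].text` expression selects from a response. It approximates the behavior with the standard library (`string.Template` and `json`) and is not FMeval's internal implementation; the shortened template and the sample response are assumptions made for illustration.

```python
import json
from string import Template

# Shortened template for illustration; same shape as the one above.
template = Template(
    '{"max_tokens": 512, "messages": [{"role": "user", '
    '"content": [{"type": "text", "text": $prompt}]}]}'
)


def build_request_body(prompt):
    """Fill $prompt with a JSON-encoded string and parse the result."""
    # json.dumps adds the surrounding quotes and escapes special characters,
    # which is why $prompt is left unquoted in the template.
    return json.loads(template.substitute(prompt=json.dumps(prompt)))


def extract_text(response):
    """Plain-Python equivalent of the JMESPath expression content[0].text."""
    return response["content"][0]["text"]


body = build_request_body('Say "hi"\nplease')
# Hypothetical response, shaped like a Messages API reply.
fake_response = {"content": [{"type": "text", "text": "hi"}]}
print(body["messages"][0]["content"][0]["text"])
print(extract_text(fake_response))
```

Because the prompt is JSON-encoded before substitution, quotes and newlines in a dataset sample do not break the request body.
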
### Data
We first check that each dataset file used by the evaluation is present, and then create a `DataConfig` object for each dataset. Each dataset has been prepared to evaluate one of three categories: `Summarization`, `Factual Knowledge`, and `Toxicity`. More categories can be defined too.

In [None]:
dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

dataset_uri_factual_knowledge = dataset_path / "trex_sample.jsonl"
if not dataset_uri_factual_knowledge.is_file():
    print("ERROR - please make sure the file, trex_sample.jsonl, exists.")

data_config_factual_knowledge = DataConfig(
    dataset_name="trex_sample",
    dataset_uri=dataset_uri_factual_knowledge.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
)

dataset_uri_toxicity = dataset_path / "real_toxicity_sample.jsonl"
if not dataset_uri_toxicity.is_file():
    print("ERROR - please make sure the file, real_toxicity_sample.jsonl, exists.")

data_config_toxicity = DataConfig(
    dataset_name="toxicity_sample",
    dataset_uri=dataset_uri_toxicity.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
)

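The file checks above only confirm that each dataset exists. As a complementary sketch, the hypothetical helper below (`find_missing_fields`, not part of the notebook) reads a JSON Lines file and reports records that lack the fields named in `model_input_location` and `target_output_location`. It assumes those locations are plain top-level keys, as they are in the configurations above.

```python
import json
import tempfile
from pathlib import Path


def find_missing_fields(path, required_fields):
    """Return (line_number, missing_fields) for each record lacking a field."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [f for f in required_fields if f not in record]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo on a throwaway file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"document": "Long text.", "summary": "Short."}\n'
        '{"document": "No summary here."}\n',
        encoding="utf-8",
    )
    print(find_missing_fields(sample, ["document", "summary"]))
```

For the real datasets, a call such as `find_missing_fields(dataset_uri_summarization, ["document", "summary"])` would be expected to return an empty list.
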
### Evaluation sets
We can now create an evaluation set for each algorithm we want to use in our test.

More information about metrics and evaluation sets can be found in the paper ["Evaluating Large Language Models with <i>fmeval</i>"](https://arxiv.org/pdf/2407.12872).

In [None]:
summarization_accuracy = SummarizationAccuracy()
factual_knowledge = FactualKnowledge(
    FactualKnowledgeConfig(target_output_delimiter="<OR>")
)
toxicity = Toxicity(ToxicityConfig("detoxify"))

evaluation_set_summarization = EvaluationSet(
    data_config_summarization,
    summarization_accuracy,
    "Summarize the following text in one sentence: $model_input",
)

evaluation_set_factual = EvaluationSet(
    data_config_factual_knowledge,
    factual_knowledge,
    "$model_input",
)

evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    toxicity,
    "Complete the following sentence: $model_input",
)


evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

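Two details in the cell above are easy to miss: the `$model_input` placeholder in each prompt template, and the `<OR>` delimiter that separates alternative correct answers for the factual knowledge dataset. The sketch below illustrates both ideas with the standard library. It is a simplified illustration, not FMeval's scoring code, and the helper names `render_prompt` and `contains_any_target` are hypothetical.

```python
from string import Template


def render_prompt(prompt_template, model_input):
    """Substitute the dataset sample into the $model_input placeholder."""
    return Template(prompt_template).substitute(model_input=model_input)


def contains_any_target(model_output, target_output, delimiter="<OR>"):
    """Return 1 if any delimited target appears in the output, else 0."""
    targets = [t.strip().lower() for t in target_output.split(delimiter)]
    return int(any(t and t in model_output.lower() for t in targets))


print(render_prompt("Summarize the following text in one sentence: $model_input", "Rain fell."))
print(contains_any_target("It is London.", "London<OR>Greater London"))
print(contains_any_target("It is Paris.", "London<OR>Greater London"))
```

The case-insensitive substring match is an assumption chosen to keep the example short; the library may normalize text differently.
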
## Run evaluations
We set up the MLflow experiment used to track the evaluations.
We then create a new run for each model and run all the evaluations for that model within that run, so that the metrics all appear together.

We use the `model_id` as the run name to make it easier to identify this run as part of the larger experiment, and run the evaluations with the `run_evaluation_sets()` function defined in [utils.py](utils.py#20).

In [None]:
run_name = model_id

In [None]:
experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

In [None]:
with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)

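When several evaluations log into a single run, metric names need to stay distinct across datasets. The sketch below shows one possible naming scheme, `dataset.metric`; it is an assumption for illustration and does not describe what `run_evaluation_sets()` in `utils.py` actually does. The scores are made-up numbers.

```python
def flatten_scores(results):
    """Map {dataset: {metric: value}} to {"dataset.metric": value}."""
    flat = {}
    for dataset_name, scores in results.items():
        for metric_name, value in scores.items():
            flat[f"{dataset_name}.{metric_name}"] = float(value)
    return flat


# Made-up numbers, for illustration only.
example = {
    "gigaword_sample": {"rouge": 0.21, "meteor": 0.35},
    "trex_sample": {"factual_knowledge": 0.6},
}
metrics = flatten_scores(example)
print(metrics)
# Inside an active run, a dict like this could be passed to
# mlflow.log_metrics(metrics).
```
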
### Nested runs
An alternative way to organize the runs is to create a nested run for each task.

In [None]:
experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

In [None]:
with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)

## Model comparison
The evaluation is complete, and the results are recorded in the MLflow tracking server.

To compare the results across models, continue with [compare_models.ipynb](./compare_models.ipynb).