# Evaluate a SageMaker JumpStart model with FMeval and track with MLflow

***
Developed and tested on the JupyterLab app in Amazon SageMaker Studio, SageMaker Distribution 2.1.0, instance `ml.m5.2xlarge`.
***

This notebook shows you how to use FMeval to evaluate an LLM deployed via SageMaker JumpStart and track the evaluation results as metrics on an MLflow tracking server.

## Setup

### Import libraries

In [None]:
from pathlib import Path

import mlflow
from dotenv import load_dotenv
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from utils import EvaluationSet, run_evaluation_sets, run_evaluation_sets_nested

In [None]:
%load_ext autoreload
%autoreload 2

We set the environment variables `MLFLOW_TRACKING_URI` and `MLFLOW_TRACKING_USERNAME` from the `.env` file created in [00-Setup](./00-Setup.ipynb).
Alternatively, you can set the tracking URI using the `mlflow` SDK method:

``` python
mlflow.set_tracking_uri(tracking_server_arn)
```

In [None]:
load_dotenv()
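
A quick sanity check after loading the `.env` file can catch a missing variable before any run is started. The sketch below is only illustrative: `missing_env_vars` is a hypothetical helper, not part of `mlflow` or `python-dotenv`, and it assumes the two variable names mentioned above are the only ones required.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The two variables this notebook expects to find in the .env file.
required = ["MLFLOW_TRACKING_URI", "MLFLOW_TRACKING_USERNAME"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("MLflow environment variables are set.")
```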

Deploy the SageMaker JumpStart endpoint you want to test. You need the corresponding `model_id` from SageMaker JumpStart. To find it, navigate to the JumpStart section in SageMaker Studio and look at the model details, or at the sample notebook linked from the deployment section.

![jumpstart-model-id](../img/find-jumpstart-model-id.png)

Alternatively, if you have an existing SageMaker JumpStart endpoint, you can skip the deployment cell below and set only the `endpoint_name` variable:

```python
endpoint_name = "jumpstart-existing-endpoint-name"
```

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

model_id = "<JUMPSTART-MODEL-ID>"  # e.g., "huggingface-llm-falcon2-11b"
model = JumpStartModel(model_id=model_id)
accept_eula = False  # <-- some JumpStart models require explicitly accepting an EULA
predictor = model.deploy(accept_eula=accept_eula)
endpoint_name = predictor.endpoint_name

In [None]:
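# Run this cell only if you are reusing an existing endpoint; otherwise it
# overwrites the `endpoint_name` set by the deployment cell above.
endpoint_name = "<YOUR-EXISTING-JUMPSTART-ENDPOINT-NAME>"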

### Model Runner Setup

The model runner we create below will be used to perform inference on every sample in the dataset.

In [None]:
import json

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint
from sagemaker.predictor import retrieve_default

Let's extract information about the model. One particularly important piece of information is the `inputs` format, which tells us the prompt signature of the model we have deployed.

In [None]:
model_id, model_version, _, _, _ = get_model_info_from_endpoint(
    endpoint_name=endpoint_name
)
model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = retrieve_default(endpoint_name=endpoint_name)
sample_payload = model.retrieve_example_payload().body
print(json.dumps(sample_payload, indent=4))

In [None]:
print(json.dumps(predictor.predict(sample_payload), indent=4))

For JumpStart models, `FMeval` gets the payload and output formats from the model description, which makes it easier to set up the runner.

In [None]:
model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

Let's test our model runner. You should build the prompt according to the expected input signature of the model.

In [None]:
model_runner.predict(prompt="What's the tallest building in the world?")

### Data
We first check that each dataset file used by the evaluation is present, and then create a `DataConfig` object for each dataset. Each dataset has been prepared to evaluate one of three categories: `Summarization`, `Factual Knowledge`, and `Toxicity`. More categories can be defined too.

In [None]:
dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

dataset_uri_factual_knowledge = dataset_path / "trex_sample.jsonl"
if not dataset_uri_factual_knowledge.is_file():
    print("ERROR - please make sure the file, trex_sample.jsonl, exists.")

data_config_factual_knowledge = DataConfig(
    dataset_name="trex_sample",
    dataset_uri=dataset_uri_factual_knowledge.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
)

dataset_uri_toxicity = dataset_path / "real_toxicity_sample.jsonl"
if not dataset_uri_toxicity.is_file():
    print("ERROR - please make sure the file, real_toxicity_sample.jsonl, exists.")

data_config_toxicity = DataConfig(
    dataset_name="toxicity_sample",
    dataset_uri=dataset_uri_toxicity.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
)
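
Each `DataConfig` above names the fields that hold the model input and, where applicable, the target output. Before running a long evaluation, it can help to confirm that every record actually has those fields. The following is a minimal sketch under stated assumptions: `rows_missing_fields` is a hypothetical helper, the sample records are made up, and it only checks top-level keys, so it does not cover nested field locations.

```python
import json


def rows_missing_fields(lines, required_fields):
    """Return (line_number, missing_fields) for each JSON Lines record lacking a required key."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append((line_number, missing))
    return problems


# Made-up records shaped like the summarization dataset ("document" and "summary").
sample_lines = [
    '{"document": "A long news article.", "summary": "A short summary."}',
    '{"document": "Another article without a reference summary."}',
]
print(rows_missing_fields(sample_lines, ["document", "summary"]))
```

Because the helper accepts any iterable of lines, an open file handle for one of the dataset files could be passed in place of `sample_lines`.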

### Evaluation sets
We can now create an evaluation set for each algorithm we want to use in our test.

More information about metrics and evaluation sets can be found in the paper ["Evaluating Large Language Models with *fmeval*"](https://arxiv.org/pdf/2407.12872).

### Summarization

For the `Summarization` evaluation set, replace the prompt below with your own, following the input signature identified above. `FMeval` uses `$model_input` as a placeholder that is filled with the input from each record of your evaluation dataset.

In [None]:
summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
    data_config_summarization,
    summarization_accuracy,
    summarization_prompt,
)
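
To make the role of the `$model_input` placeholder concrete, the sketch below fills a prompt template with one record using the standard library's `string.Template`, which uses the same `$name` placeholder syntax. This is an illustration of the substitution only: `compose_prompt` is a hypothetical helper, the example sentence is made up, and FMeval's own prompt composition may differ in its details.

```python
from string import Template


def compose_prompt(template, model_input):
    """Fill the $model_input placeholder of a prompt template with one dataset record."""
    return Template(template).substitute(model_input=model_input)


summarization_prompt = "Summarize the following text in one sentence: $model_input"
print(compose_prompt(summarization_prompt, "The city council approved the new budget on Monday."))
```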

### Toxicity

Let's do the same for Toxicity.

In [None]:
toxicity_prompt = "Complete the following sentence: $model_input"

toxicity = Toxicity(ToxicityConfig("detoxify"))

evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    toxicity,
    toxicity_prompt,
)

### Factual Knowledge

And again for Factual Knowledge.

In [None]:
factual_knowledge_prompt = "$model_input"

factual_knowledge = FactualKnowledge(
    FactualKnowledgeConfig(target_output_delimiter="<OR>")
)

evaluation_set_factual = EvaluationSet(
    data_config_factual_knowledge,
    factual_knowledge,
    factual_knowledge_prompt,
)
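
The `target_output_delimiter` lets one dataset record list several acceptable answers in a single string, separated by `<OR>`. The sketch below shows the idea with a simplified, case-insensitive containment check. It is an assumption-laden illustration, not FMeval's implementation: `contains_any_target` is a hypothetical function and the example strings are made up.

```python
def contains_any_target(model_output, target_output, delimiter="<OR>"):
    """Return 1 if any delimiter-separated target appears in the output (case-insensitive), else 0."""
    targets = [t.strip().lower() for t in target_output.split(delimiter) if t.strip()]
    output = model_output.lower()
    return int(any(target in output for target in targets))


print(contains_any_target("The capital is London.", "London<OR>City of London"))
print(contains_any_target("I am not sure.", "London<OR>City of London"))
```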

Group all evaluation sets into a single list.

In [None]:
evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

## Run evaluation

We set up the MLflow experiment used to track the evaluations.
We then create a new run for each model and run all the evaluations for that model within that run, so that the metrics all appear together.

We'll use the `model_id` as the run name to make it easier to identify this run as part of the larger experiment, and run the evaluation using the `run_evaluation_sets()` function defined in [utils.py](utils.py#20).

In [None]:
run_name = f"{model_id}"

In [None]:
experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

In [None]:
with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)
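
For metrics from several evaluations to appear together in one run, each metric needs a distinct name. The sketch below shows one possible naming scheme that flattens results into a dictionary. It does not reproduce what `run_evaluation_sets()` in `utils.py` does: `Score`, `Result`, and `to_metrics` are hypothetical stand-ins, and the values are illustrative placeholders, not real evaluation results.

```python
from dataclasses import dataclass, field


@dataclass
class Score:
    """Stand-in for one aggregate score (hypothetical, not an FMeval class)."""
    name: str
    value: float


@dataclass
class Result:
    """Stand-in for one evaluation result on one dataset (hypothetical)."""
    eval_name: str
    dataset_name: str
    scores: list = field(default_factory=list)


def to_metrics(results):
    """Flatten results into {"<eval>.<dataset>.<score>": value} for metric logging."""
    return {
        f"{r.eval_name}.{r.dataset_name}.{s.name}": s.value
        for r in results
        for s in r.scores
    }


# Illustrative placeholder values only, not real results.
results = [
    Result("factual_knowledge", "trex_sample", [Score("factual_knowledge", 0.5)]),
    Result("toxicity", "toxicity_sample", [Score("toxicity", 0.25), Score("insult", 0.125)]),
]
metrics = to_metrics(results)
print(metrics)
```

Inside an active run, a dictionary like this could be passed to `mlflow.log_metrics(metrics)`.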

### Nested runs
An alternative approach to organize the runs is to create nested runs for the different tasks.

In [None]:
experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

In [None]:
with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)

The evaluation is complete, and the results are recorded in the MLflow tracking server.

To continue with the evaluation, move on to [compare_models.ipynb](./compare_models.ipynb).

## Clean up
Since SageMaker endpoints are [priced](https://aws.amazon.com/sagemaker/pricing/) by deployed infrastructure time rather than by requests, you can avoid unnecessary charges by deleting your endpoints when you're done experimenting.

[Here](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-delete-resources.html) you can find instructions on how to delete a SageMaker endpoint.
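
If you deployed the endpoint from this notebook, the `predictor` object created above can also be used for clean-up, since SageMaker `Predictor` objects expose `delete_model()` and `delete_endpoint()`. The wrapper below is a minimal sketch; `delete_endpoint_resources` is a hypothetical helper name, and the call is left commented out because it removes resources.

```python
def delete_endpoint_resources(predictor):
    """Delete the model first, then the endpoint, using the predictor's own methods."""
    predictor.delete_model()
    predictor.delete_endpoint()


# Usage in this notebook (uncomment when you are done experimenting):
# delete_endpoint_resources(predictor)
```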