# Evaluate a SageMaker JumpStart model with FMeval and track with MLflow

***
Developed and tested on the JupyterLab app in Amazon SageMaker Studio, SageMaker Distribution 2.1.0, instance `ml.m5.2xlarge`
***

This notebook shows how to use FMeval to evaluate an LLM deployed via SageMaker JumpStart and track the evaluation results as metrics on an MLflow tracking server.

## Setup

### Import libraries

In [2]:
!pip uninstall torch torchvision -y

Found existing installation: torch 2.6.0
Uninstalling torch-2.6.0:
  Successfully uninstalled torch-2.6.0
Found existing installation: torchvision 0.19.1
Uninstalling torchvision-0.19.1:
  Successfully uninstalled torchvision-0.19.1


In [3]:
!pip install torch torchvision

Collecting torch
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Using cached torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)
Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl (766.7 MB)
Using cached torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl (7.2 MB)
Installing collected packages: torch, torchvision
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.
autogluon-multimo

In [5]:
!pip uninstall torchvision -y
!pip uninstall torch -y
!pip uninstall fmeval -y

Found existing installation: torchvision 0.21.0
Uninstalling torchvision-0.21.0:
  Successfully uninstalled torchvision-0.21.0
Found existing installation: torch 2.6.0
Uninstalling torch-2.6.0:
  Successfully uninstalled torch-2.6.0
Found existing installation: fmeval 1.2.2
Uninstalling fmeval-1.2.2:
  Successfully uninstalled fmeval-1.2.2


In [6]:
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
!pip install fmeval

Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp311-cp311-linux_x86_64.whl.metadata (26 kB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.21.0%2Bcpu-cp311-cp311-linux_x86_64.whl.metadata (6.1 kB)
Downloading https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp311-cp311-linux_x86_64.whl (178.7 MB)
Downloading https://download.pytorch.org/whl/cpu/torchvision-0.21.0%2Bcpu-cp311-cp311-linux_x86_64.whl (1.8 MB)
Installing collected packages: torch, torchvision
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the

In [1]:
from pathlib import Path

import mlflow
from dotenv import load_dotenv
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from utils import EvaluationSet, run_evaluation_sets, run_evaluation_sets_nested



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


2025-03-29 23:19:05.753021: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
%load_ext autoreload
%autoreload 2

We set the environment variables `MLFLOW_TRACKING_URI` and `MLFLOW_TRACKING_USERNAME` from the `.env` file created in [00-Setup](./00-Setup.ipynb).
Alternatively, you can set the tracking URI using the `mlflow` SDK method:

``` python
mlflow.set_tracking_uri(tracking_server_arn)
```

In [3]:
load_dotenv()

True
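
As a minimal sketch of the lookup described above, the helper below resolves the tracking URI from an environment mapping and falls back to an explicitly supplied value. The function name `resolve_tracking_uri` and the placeholder ARN are hypothetical and not part of this notebook; only the variable name `MLFLOW_TRACKING_URI` comes from the text.

```python
import os
from typing import Mapping, Optional


def resolve_tracking_uri(
    env: Mapping[str, str], fallback: Optional[str] = None
) -> str:
    """Return the MLflow tracking URI from `env`, or `fallback` if unset."""
    uri = env.get("MLFLOW_TRACKING_URI") or fallback
    if not uri:
        raise ValueError(
            "MLFLOW_TRACKING_URI is not set and no fallback was provided"
        )
    return uri


# Hypothetical placeholder ARN, for illustration only.
example_arn = (
    "arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/example"
)

# After load_dotenv(), os.environ holds the values read from the .env file.
tracking_uri = resolve_tracking_uri(os.environ, fallback=example_arn)
# The result could then be passed to mlflow.set_tracking_uri(tracking_uri).
```

Failing loudly when neither source provides a URI avoids silently logging runs to a local `mlruns` directory.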

Deploy the SageMaker JumpStart endpoint you want to test. You need the corresponding `model_id` in SageMaker JumpStart. You can find it in SageMaker Studio by navigating to the JumpStart section and opening the model details or the sample notebook associated with the deployment section.

![jumpstart-model-id](../img/find-jumpstart-model-id.png)

Alternatively, if you have an existing SageMaker JumpStart endpoint, you can replace the cell below by setting only the `endpoint_name` variable:

```python
endpoint_name = "jumpstart-existing-endpoint-name"
```

In [7]:
# from sagemaker.jumpstart.model import JumpStartModel

# # `model_id` must be a JumpStart model ID, not an endpoint name. Passing an endpoint
# # name such as "hf-llm-mistral-7b-ins-20250329-220013Endpoint" raises
# # `ValueError: Invalid model ID`.
# model_id = "huggingface-llm-mistral-7b-instruct"  # e.g., "huggingface-llm-falcon2-11b"
# model = JumpStartModel(model_id=model_id)
# accept_eula = False  # <-- some JumpStart models require explicitly accepting a EULA
# predictor = model.deploy(accept_eula=accept_eula)
# endpoint_name = predictor.endpoint_name

In [4]:
endpoint_name = "jumpstart-dft-hf-llm-mistral-7b-ins-20250329-230644"

### Model Runner Setup

The model runner we create below will be used to perform inference on every sample in the dataset.

In [5]:
import json

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint
from sagemaker.predictor import retrieve_default

Let's extract information about the model. One particularly important piece of information is the `inputs` format, which tells us the prompt signature for the model we have deployed.

In [6]:
model_id, model_version, _, _, _ = get_model_info_from_endpoint(
    endpoint_name=endpoint_name
)
model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = retrieve_default(endpoint_name=endpoint_name)
sample_payload = model.retrieve_example_payload().body
print(json.dumps(sample_payload, indent=4))

No instance type selected for inference hosting endpoint. Defaulting to ml.g5.12xlarge.


{
    "inputs": "<s>[INST] what is the recipe of mayonnaise? [/INST]",
    "parameters": {
        "max_new_tokens": 256,
        "do_sample": true,
        "decoder_input_details": true,
        "details": true
    }
}


In [7]:
print(json.dumps(predictor.predict(sample_payload), indent=4))

[
    {
        "generated_text": "<s>[INST] what is the recipe of mayonnaise? [/INST]Mayonnaise is a creamy sauce made from oil, egg yolks, and an acid such as vinegar or lemon juice. Here's a basic recipe for traditional mayonnaise:\n\nIngredients:\n- 1 egg yolk\n- 1 teaspoon of mustard (Dijon mustard is a good choice)\n- 1 tablespoon of vinegar (white or apple cider vinegar) or lemon juice\n- 1 cup of oil (neutral oil like canola or vegetable oil work well)\n- Salt, to taste\n- Pepper, to taste\n\nInstructions:\n1. Combine the egg yolk, mustard, and acid (vinegar or lemon juice) in a medium-sized bowl. Whisk until well combined.\n2. Start adding the oil very slowly, drop by drop, while continually whisking the mixture. This process is crucial for keeping the sauce from separating.\n3. Once the sauce begins to thicken, you can start adding the oil in a thin, steady stream while continuing to whisk.\n4. Once all the oil is incorporated, season with salt and pepper to taste.\n5. If the

For JumpStart models, `FMeval` gets the payload and output formats from the model description, which makes it easier to set up the runners.

In [8]:
model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

Let's test our model runner. You should build the prompt according to the expected input signature of the model.

In [9]:
model_runner.predict(prompt="What's the tallest building in the world?")

('What\'s the tallest building in the world? Well, that answer\'s actually pretty tricky depending on your definition of "tallest."\n\nWhen the Burj Khalifa in Dubai, UAE, first opened in 2010, it dwarfed every other construction, with 163 floors and 828 meters (2,700 feet) in height. (The shaft of the building is 828 meters to the very pointy top, but most buildings\' height measurements only account for the first 153 meters from the ground to the top-most part of the roof.)\n\nSadly, the United Arab Emirates seems to have a monopoly on the world\'s tallest buildings, keeping four out of the top five places. (The rest of the top 10 is a list of just-shy-of-tallest: One World Trade Center in New York is in 6th place, 85 Skyscraper in Taipei is in 5th, the Makkah Royal Clock Tower in Mecca is 3rd, and CTF Finans Tower in Guangzhou, China, is 4th.)\n\nBut the real',
 -23.06406305725)
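
The sample payload printed earlier wraps the question in `<s>[INST] ... [/INST]` markers. The sketch below applies that wrapper to a plain question. The helper name `format_instruct_prompt` is hypothetical, and whether the markers improve results for a given endpoint is an assumption worth checking against the model's documentation.

```python
def format_instruct_prompt(question: str) -> str:
    """Wrap a question in the instruction markers seen in the sample payload."""
    return f"<s>[INST] {question.strip()} [/INST]"


prompt = format_instruct_prompt("What's the tallest building in the world?")
print(prompt)

# The runner returns a (text, log_probability) tuple, as in the output above,
# so a call against a live endpoint could be unpacked like this:
# generated_text, log_probability = model_runner.predict(prompt=prompt)
```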

### Data
We first check that each dataset file used by the evaluation is present, and then create a `DataConfig` object for each dataset. Each dataset has been prepared to evaluate one of three categories: `Summarization`, `Factual Knowledge`, and `Toxicity`. More categories can be defined as well.

In [10]:
dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

dataset_uri_factual_knowledge = dataset_path / "trex_sample.jsonl"
if not dataset_uri_factual_knowledge.is_file():
    print("ERROR - please make sure the file, trex_sample.jsonl, exists.")

data_config_factual_knowledge = DataConfig(
    dataset_name="trex_sample",
    dataset_uri=dataset_uri_factual_knowledge.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
)

dataset_uri_toxicity = dataset_path / "real_toxicity_sample.jsonl"
if not dataset_uri_toxicity.is_file():
    print("ERROR - please make sure the file, real_toxicity_sample.jsonl, exists.")

data_config_toxicity = DataConfig(
    dataset_name="toxicity_sample",
    dataset_uri=dataset_uri_toxicity.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
)
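
Each `DataConfig` above names the fields that hold the model input and the target output in a JSON Lines file. As a rough sketch, the helper below checks that every record contains the expected fields, treating the locations as plain top-level keys. The function name `missing_fields` and the sample records are made up for illustration and do not come from the datasets used here.

```python
import io
import json
from typing import Iterable, List, Tuple


def missing_fields(
    lines: Iterable[str], required: Tuple[str, ...]
) -> List[Tuple[int, str]]:
    """Return (line_number, field) pairs for required fields that are absent."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for field in required:
            if field not in record:
                problems.append((line_number, field))
    return problems


# Made-up records, shaped like the summarization dataset is assumed to be.
sample = io.StringIO(
    '{"document": "A long news article.", "summary": "Short headline."}\n'
    '{"document": "Another article."}\n'
)
print(missing_fields(sample, ("document", "summary")))
```

With a real dataset, an open file handle can be passed in place of the `io.StringIO` object.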

### Evaluation sets
We can now create an evaluation set for each algorithm we want to use in our test.

More information about metrics and evaluation sets can be found in the paper ["Evaluating Large Language Models with *fmeval*"](https://arxiv.org/pdf/2407.12872).

### Summarization

For the `Summarization` evaluation set, replace the prompt below with your own, following the input signature identified above. `FMeval` uses `$model_input` as a placeholder for the input taken from your evaluation dataset.

In [11]:
summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
    data_config_summarization,
    summarization_accuracy,
    summarization_prompt,
)

[nltk_data] Downloading package wordnet to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
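
The `$model_input` placeholder follows the same syntax as Python's `string.Template`. The sketch below mimics the substitution for a single record to show what the final prompt looks like. It is an illustration only, not FMeval's internal code, and the record is made up.

```python
from string import Template

summarization_prompt = "Summarize the following text in one sentence: $model_input"

# A made-up record, standing in for one row of the evaluation dataset.
record = {"document": "The city council approved the new transit budget on Monday."}

# Substitute the dataset field for the $model_input placeholder.
prompt = Template(summarization_prompt).substitute(model_input=record["document"])
print(prompt)
```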


### Toxicity

Let's do the same for Toxicity.

In [12]:
toxicity_prompt = "Complete the following sentence: $model_input"

toxicity = Toxicity(ToxicityConfig("detoxify"))

evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    toxicity,
    toxicity_prompt,
)

### Factual Knowledge

And again for Factual Knowledge.

In [13]:
factual_knowledge_prompt = "$model_input"

factual_knowledge = FactualKnowledge(
    FactualKnowledgeConfig(target_output_delimiter="<OR>")
)

evaluation_set_factual = EvaluationSet(
    data_config_factual_knowledge,
    factual_knowledge,
    factual_knowledge_prompt,
)
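
Setting `target_output_delimiter="<OR>"` lets a single target string list several acceptable answers. The sketch below is a simplified stand-in for that idea: it splits the target on the delimiter and checks whether any alternative appears in the model output, ignoring case. The function name `contains_any_target` is hypothetical, and FMeval's own scoring may apply additional normalization.

```python
def contains_any_target(
    model_output: str, targets: str, delimiter: str = "<OR>"
) -> int:
    """Return 1 if any delimited target occurs in the model output, else 0."""
    output = model_output.lower()
    candidates = [t.strip().lower() for t in targets.split(delimiter)]
    # Skip empty alternatives so a stray delimiter never counts as a match.
    return int(any(c and c in output for c in candidates))


print(contains_any_target("The capital of the UK is London.", "London<OR>Greater London"))
print(contains_any_target("I am not sure.", "London<OR>Greater London"))
```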

Group all evaluation sets into a single list:

In [14]:
evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

## Run evaluation

We set up the MLflow experiment used to track the evaluations.
We then create a new run for each model and run all the evaluations for that model within that run, so that the metrics all appear together.

We'll use the `model_id` as run name to make it easier to identify this run as part of the larger experiment, and run the evaluation using the `run_evaluation_sets()` defined in [utils.py](utils.py#20).
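
The contents of `utils.py` are not shown in this notebook, so the following is only a hypothetical sketch of how a helper like `run_evaluation_sets()` could be structured: iterate over the evaluation sets, call each algorithm's `evaluate` method, and log every dataset-level score. All names here (`EvaluationSetSketch`, `run_evaluation_sets_sketch`, the `log_metric` callback, and the stub classes) are assumptions, and the real helper may differ. In the notebook, the callback would be something like `mlflow.log_metric`.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class EvaluationSetSketch:
    """Hypothetical stand-in for the EvaluationSet helper in utils.py."""

    data_config: Any
    eval_algorithm: Any
    prompt_template: str


def run_evaluation_sets_sketch(
    model_runner: Any,
    evaluation_sets: List[EvaluationSetSketch],
    log_metric: Callable[[str, float], None],
) -> None:
    """Run every evaluation set and log each dataset-level score."""
    for evaluation_set in evaluation_sets:
        eval_outputs = evaluation_set.eval_algorithm.evaluate(
            model=model_runner,
            dataset_config=evaluation_set.data_config,
            prompt_template=evaluation_set.prompt_template,
        )
        for eval_output in eval_outputs:
            for score in eval_output.dataset_scores:
                log_metric(f"{eval_output.eval_name}.{score.name}", score.value)


# Stubs so the sketch runs without an endpoint; the values are made up.
@dataclass
class StubScore:
    name: str
    value: float


@dataclass
class StubOutput:
    eval_name: str
    dataset_scores: List[StubScore]


class StubAlgorithm:
    def evaluate(self, model, dataset_config, prompt_template):
        return [StubOutput("stub_eval", [StubScore("stub_metric", 0.5)])]


logged = {}
run_evaluation_sets_sketch(
    model_runner=None,
    evaluation_sets=[EvaluationSetSketch(None, StubAlgorithm(), "$model_input")],
    log_metric=logged.__setitem__,
)
print(logged)
```

Passing the logging function as a parameter keeps the loop testable without a tracking server.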

In [15]:
run_name = f"{model_id}"

In [16]:
experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

In [17]:
with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)

2025-03-29 23:20:22,345	INFO worker.py:1812 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
[nltk_data] Downloading package wordnet to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
2025-03-29 23:20:24,213	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:20:24,213	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[In

Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:20:24,905	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:20:24,905	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:20:25,022	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:20:25,022	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:20:25,969	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:20:25,970	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:20:34,911	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:20:34,912	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:21:01,350	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:21:01,351	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(MeteorScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(MeteorScore) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:21:20,912	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:21:20,913	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(RougeScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(RougeScore) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:21:43,187	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:21:43,188	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(BertScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(BertScore) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:05,374	INFO dataset.py:2631 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-03-29 23:24:05,376	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:05,376	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:08,139	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:08,139	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:08,421	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:08,421	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:08,714	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:08,714	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(<lambda>) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:11,893	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:11,894	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(<lambda>) 1: 0.00 row [00:00, ? row/s]

  return _dataset_source_registry.resolve(
2025-03-29 23:24:13,082	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:13,082	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON] -> AggregateNumRows[AggregateNumRows]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:13,191	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:13,192	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:13,305	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:13,306	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:24:13,458	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:13,458	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:24:22,899	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:24:22,900	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:10,622	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:10,622	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(FactualKnowledgeScores)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(FactualKnowledgeScores) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:19,917	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:19,918	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:20,349	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:20,350	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:20,675	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:20,675	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(<lambda>) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:21,023	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:21,024	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(<lambda>) 1: 0.00 row [00:00, ? row/s]

  return _dataset_source_registry.resolve(
2025-03-29 23:25:31,183	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:31,184	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON] -> AggregateNumRows[AggregateNumRows]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:31,289	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:31,290	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:31,400	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:31,400	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:25:31,543	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:31,543	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:25:40,738	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:25:40,738	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:26:12,888	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:26:12,889	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(ToxicityScores)]


Running 0: 0.00 row [00:00, ? row/s]

- MapBatches(ToxicityScores) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:26:35,776	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:26:35,776	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

Sort Sample 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Map 3:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

Shuffle Reduce 4:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

- limit=1 5: 0.00 row [00:00, ? row/s]

2025-03-29 23:26:35,840	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:26:35,840	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


Running 0: 0.00 row [00:00, ? row/s]

- Aggregate 1: 0.00 row [00:00, ? row/s]

2025/03/29 23:26:37 INFO mlflow.tracking._tracking_service.client: 🏃 View run huggingface-llm-mistral-7b-instruct-v3 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1/runs/f9c03994f35a430ea49d822b4a493cdc.
2025/03/29 23:26:37 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1.


### Nested runs
An alternative way to organize the runs is to create one parent run for the model and a nested (child) run for each evaluation task.
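
As a rough illustration of the parent/child pattern (separate from the notebook's own helper used in the cells below), the following sketch opens one parent run and one nested run per task. The names `EXAMPLE_RESULTS`, `plan_child_runs`, and `log_nested` are hypothetical, and the scores are placeholder values, not results from this notebook. It assumes an MLflow tracking URI has already been configured.

```python
from typing import Dict, List, Tuple

# Placeholder scores for illustration only; not results produced by this notebook.
EXAMPLE_RESULTS: Dict[str, Dict[str, float]] = {
    "SummarizationAccuracy": {"rouge": 0.21, "meteor": 0.25},
    "Toxicity": {"toxicity": 0.01},
}


def plan_child_runs(
    results: Dict[str, Dict[str, float]]
) -> List[Tuple[str, Dict[str, float]]]:
    """Return one (run_name, metrics) pair per evaluation task, in input order."""
    return [(task, dict(metrics)) for task, metrics in results.items()]


def log_nested(parent_run_name: str, results: Dict[str, Dict[str, float]]) -> str:
    """Log a parent run with one child run per task and return the parent run ID."""
    # Imported lazily so plan_child_runs can be used without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name=parent_run_name) as parent:
        for task, metrics in plan_child_runs(results):
            # nested=True attaches this run to the currently active parent run.
            with mlflow.start_run(run_name=task, nested=True):
                mlflow.log_metrics(metrics)
    return parent.info.run_id
```

Calling `log_nested("my-model", EXAMPLE_RESULTS)` would then show the task runs grouped under a single parent row in the MLflow UI, which keeps per-task metrics separate while still tying them to one model.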

In [18]:
experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

2025/03/29 23:30:27 INFO mlflow.tracking.fluent: Experiment with name 'fmeval-mlflow-nested-runs' does not exist. Creating a new experiment.


In [19]:
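# Open the parent run; run_evaluation_sets_nested is expected to log each
# evaluation task (for example, Toxicity) as a child run under it.
with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)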

[nltk_data] Downloading package wordnet to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
2025-03-29 23:30:29,398	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:30:29,399	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON] -> AggregateNumRows[AggregateNumRows]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:30:29,518	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:30:29,519	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:30:29,630	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:30:29,630	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:30:29,772	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:30:29,773	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:30:39,099	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:30:39,100	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:31:04,889	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:31:04,889	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(MeteorScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(MeteorScore) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:31:24,449	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:31:24,449	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(RougeScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(RougeScore) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:31:47,199	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:31:47,200	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(BertScore)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(BertScore) 1: 0.00 row [00:00, ? row/s]

MLflow parent run ID: 72ee6a618d3a4f3d8e6676fdc97415e7


2025/03/29 23:33:49 INFO mlflow.tracking._tracking_service.client: 🏃 View run SummarizationAccuracy at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2/runs/8cd454152abc4d49a7e8c4e5bf4ad26b.
2025/03/29 23:33:49 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2.
2025-03-29 23:33:50,055	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:33:50,056	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON] -> AggregateNumRows[AggregateNumRows]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:33:50,164	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:33:50,165	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:33:50,278	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:33:50,278	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:33:50,432	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:33:50,433	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:33:59,810	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:33:59,811	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:34:48,930	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:34:48,931	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(FactualKnowledgeScores)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(FactualKnowledgeScores) 1: 0.00 row [00:00, ? row/s]

MLflow parent run ID: 72ee6a618d3a4f3d8e6676fdc97415e7


2025/03/29 23:35:01 INFO mlflow.tracking._tracking_service.client: 🏃 View run FactualKnowledge at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2/runs/c49ac4940f314f6385011f171b250c22.
2025/03/29 23:35:01 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2.
2025-03-29 23:35:10,635	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:10,636	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON] -> AggregateNumRows[AggregateNumRows]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

- AggregateNumRows 2: 0.00 row [00:00, ? row/s]

2025-03-29 23:35:10,779	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:10,780	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCustomJSON]


Running 0: 0.00 row [00:00, ? row/s]

- ReadCustomJSON->SplitBlocks(96) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:35:10,952	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:10,953	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


Running 0: 0.00 row [00:00, ? row/s]

- Repartition 1: 0.00 row [00:00, ? row/s]

Split Repartition 2:   0%|          | 0.00/1.00 [00:00<?, ? row/s]

2025-03-29 23:35:11,174	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:11,175	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GeneratePrompt)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GeneratePrompt) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:35:20,823	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:20,824	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[Map(GetModelOutputs)]


Running 0: 0.00 row [00:00, ? row/s]

- Map(GetModelOutputs) 1: 0.00 row [00:00, ? row/s]

2025-03-29 23:35:54,197	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-29_23-20-21_071722_922/logs/ray-data
2025-03-29 23:35:54,197	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(ToxicityScores)]


Running 0: 0.00 row [00:00, ? row/s]

- MapBatches(ToxicityScores) 1: 0.00 row [00:00, ? row/s]

MLflow parent run ID: 72ee6a618d3a4f3d8e6676fdc97415e7


2025/03/29 23:36:19 INFO mlflow.tracking._tracking_service.client: 🏃 View run Toxicity at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2/runs/6ae4802c6ba24d6f863a288123ba6bb7.
2025/03/29 23:36:19 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2.
2025/03/29 23:36:19 INFO mlflow.tracking._tracking_service.client: 🏃 View run huggingface-llm-mistral-7b-instruct-v3 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2/runs/72ee6a618d3a4f3d8e6676fdc97415e7.
2025/03/29 23:36:19 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/2.


The evaluation is complete, and the results are recorded in the MLflow tracking server.
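
To inspect the recorded results programmatically rather than in the UI, one option is to query the child runs of a parent run. The sketch below is a minimal example under that assumption; `child_run_filter` and `fetch_child_runs` are hypothetical helper names, and it relies on MLflow storing the parent ID of a nested run in the `mlflow.parentRunId` tag.

```python
def child_run_filter(parent_run_id: str) -> str:
    """Build an MLflow search filter that matches the child runs of a parent run."""
    return f"tags.mlflow.parentRunId = '{parent_run_id}'"


def fetch_child_runs(experiment_name: str, parent_run_id: str):
    """Return the child runs of parent_run_id as a pandas DataFrame."""
    # Imported lazily so child_run_filter can be used without MLflow installed.
    import mlflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=child_run_filter(parent_run_id),
    )
```

For the nested experiment above, this could be called as `fetch_child_runs("fmeval-mlflow-nested-runs", "72ee6a618d3a4f3d8e6676fdc97415e7")`, using the experiment name and parent run ID shown in the cell output. Logged metrics appear in the returned DataFrame as columns prefixed with `metrics.`.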

To continue with the evaluation, move on to the [compare_models.ipynb](./compare_models.ipynb) notebook.

## Clean up
Since SageMaker endpoints are [priced](https://aws.amazon.com/sagemaker/pricing/) by deployed infrastructure time rather than by requests, you can avoid unnecessary charges by deleting your endpoints when you're done experimenting.

For instructions on deleting a SageMaker endpoint, see the [SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-delete-resources.html).
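
If you prefer to clean up from code, the sketch below shows one possible approach using the boto3 SageMaker client. The function name `delete_endpoint_resources` is hypothetical, and the endpoint name you pass in is whatever you used when deploying the model. The function also removes the endpoint configuration and any models the configuration references, so only use it when those resources are not shared with other endpoints.

```python
from typing import Any, Dict


def delete_endpoint_resources(sm_client: Any, endpoint_name: str) -> Dict[str, Any]:
    """Delete an endpoint, its endpoint config, and the models that config references.

    sm_client is expected to be a boto3 SageMaker client, e.g. boto3.client("sagemaker").
    """
    # Look up the dependent resources before deleting anything.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Deleting the endpoint is what stops the instance charges.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }
```

A typical call would be `delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)` after `import boto3`. Taking the client as a parameter keeps the function easy to try against a stub before running it on real resources.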