### Evaluate Model

**Environment**

- Base Python 3.0 kernel
- Studio Notebook instance type: ml.m5.xlarge

### Setup

In [19]:
# Install the fmeval package

!rm -Rf ~/.cache/pip/*
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall


Collecting fmeval
  Downloading fmeval-0.3.0-py3-none-any.whl.metadata (5.7 kB)
Collecting IPython (from fmeval)
  Downloading ipython-8.20.0-py3-none-any.whl.metadata (5.9 kB)
Collecting bert-score<0.4.0,>=0.3.13 (from fmeval)
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m804.7 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting detoxify<0.6.0,>=0.5.1 (from fmeval)
  Downloading detoxify-0.5.1-py3-none-any.whl (12 kB)
Collecting evaluate<0.5.0,>=0.4.0 (from fmeval)
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting ipykernel<7.0.0,>=6.26.0 (from fmeval)
  Downloading ipykernel-6.28.0-py3-none-any.whl.metadata (6.0 kB)
Collecting jiwer<4.0.0,>=3.0.3 (from fmeval)
  Downloading jiwer-3.0.3-py3-none-any.whl.metadata (2.6 kB)
Collecting markdown (from fmeval)
  Downloading Markdown-3.5.1-py3-none-any.whl.metadata (7.1 kB)
Collecting matplotlib<4.0.0,>=3.8.0 

In [None]:
!pip3 install sagemaker

### JumpStart Endpoint Creation

First we will deploy a Llama-2 model as a SageMaker endpoint. The cell below uses the 7B model ID "meta-textgeneration-llama-2-7b-f". To deploy the 13B or 70B models instead, change model_id to "meta-textgeneration-llama-2-13b" or "meta-textgeneration-llama-2-70b", respectively.



In [20]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

# These are needed, even if you use an existing endpoint, by a cell later in this notebook.
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "*"


In [21]:
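pretrained_model = JumpStartModel(model_id=model_id)
predictor = pretrained_model.deploy(accept_eula=True)
# Keep the endpoint name; the JumpStartModelRunner cells later in this notebook need it.
endpoint_name = predictor.endpoint_name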

Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


---------!

In [3]:
# Uncomment the lines below and fill in the endpoint name if you have an existing endpoint.
# endpoint_name = "meta-textgeneration-llama-2-7b-2024-01-09-18-10-40-222"
# predictor = sagemaker.predictor.Predictor(
#     endpoint_name=endpoint_name,
#     serializer=sagemaker.serializers.JSONSerializer(),
#     deserializer=sagemaker.deserializers.JSONDeserializer(),
# )


### Sample endpoint invocation

In [22]:
%%time

prompt = "London is the capital of"
payload = {
    "inputs": prompt,
    "parameters": {
        "top_p": 0.9,
        "temperature": 0.9,
        "max_new_tokens": 200,
    },
}

response = predictor.predict(payload)
print(response[0])

{'generated_text': ' England and the United Kingdom, located in the southeastern part of the island of Great Britain. It is a global financial and cultural hub, known for its iconic landmarks, rich history, and diverse culture. London has a population of over 8.9 million people, making it one of the most populous cities in Europe.\n\nLondon is home to many famous landmarks, including Buckingham Palace, the official residence of the British monarch; the Houses of Parliament, which houses the British Parliament; and Westminster Abbey, a UNESCO World Heritage Site and one of the most iconic churches in the world. The city is also known for its vibrant cultural scene, with over 200 museums, galleries, and theaters, including the British Museum, the National Gallery, and the Royal Opera House.\n\nLondon has a long history, dating back to the Roman era, and has played a significant role in'}
CPU times: user 21.4 ms, sys: 5.36 ms, total: 26.7 ms
Wall time: 6.34 s


### FMEval Setup

In [23]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

### Evaluate the model on a single sample

In [24]:
eval_algo = FactualKnowledge(FactualKnowledgeConfig("<OR>"))

prompt = "London is the capital of"
payload = {
    "inputs": prompt,
    "parameters": {
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 100,
    },
}

model_output = predictor.predict(payload)[0]["generated_text"]
print(model_output)

eval_algo.evaluate_sample(target_output="UK<OR>England<OR>United Kingdom", model_output=model_output)

 England and the United Kingdom. It is a global city, known for its history, culture, and entertainment. London has a population of over 8 million people and is home to many famous landmarks, such as Buckingham Palace, the Tower of London, and Big Ben.
London has a rich history, dating back to Roman times. The city has been the capital of England since the 11th century and has played a significant role in many historical events, including the


[EvalScore(name='factual_knowledge', value=1)]
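
The score of 1 above is consistent with a containment check: the sample passes if any of the `<OR>`-separated targets appears in the model output. The following is a minimal sketch of that idea, assuming a case-insensitive substring match. `factual_knowledge_score` is a hypothetical helper written for illustration, not the fmeval implementation.

```python
def factual_knowledge_score(model_output: str, target_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any delimiter-separated target occurs in the model output, else 0."""
    targets = [t.strip() for t in target_output.split(delimiter)]
    text = model_output.lower()
    # Empty targets are skipped so that a trailing delimiter cannot match everything.
    return int(any(t and t.lower() in text for t in targets))


sample_output = " England and the United Kingdom. It is a global city, known for its history."
print(factual_knowledge_score(sample_output, "UK<OR>England<OR>United Kingdom"))
print(factual_knowledge_score(sample_output, "France<OR>Paris"))
```

Under this assumption, listing several acceptable answers in the target makes the check more forgiving, since a single match is enough.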

### Evaluate the model using a dataset

### Data Config Setup

Below, we create a DataConfig for the local dataset file, trex_sample.jsonl.

- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `model_input_location` and `target_output_location` are JMESPath queries used to find the model inputs and target outputs within the dataset.
- `category_location` is a JMESPath query used to find the category that each sample belongs to.

The values that you specify here depend on the structure of the dataset itself. Take a look at trex_sample.jsonl to see where "question", "answers", and "knowledge_category" show up.
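
To make the location fields concrete, here is a sketch of what one line of a JSON Lines dataset with these keys might look like and how the three locations resolve against it. The record is illustrative (its values are borrowed from the results table later in this notebook), and the actual layout of trex_sample.jsonl may differ. For a top-level key, a JMESPath query is just the key name, so plain dictionary lookups stand in for the query engine here.

```python
import json

# One hypothetical JSON Lines record; a real file holds one such object per line.
line = (
    '{"question": "Kuching is the capital of", '
    '"answers": "Sarawak<OR>Crown Colony", '
    '"knowledge_category": "Capitals"}'
)
record = json.loads(line)

# The same key names that are passed to DataConfig in the next cell.
locations = {
    "model_input_location": "question",
    "target_output_location": "answers",
    "category_location": "knowledge_category",
}

# Resolve each location against the record.
resolved = {field: record[key] for field, key in locations.items()}
for field, value in resolved.items():
    print(f"{field}: {value}")
```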

In [25]:
config = DataConfig(
    dataset_name="trex_sample",
    dataset_uri="trex_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    category_location="knowledge_category",
)

### Model Runner setup

In [26]:
js_model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
    output='[0].generated_text',
    # log_probability='[0].details.prefill[*].logprob',
    content_template='{"inputs": $prompt, "parameters": { "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 200}}',
)

Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
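
The `content_template` and `output` arguments describe how a prompt becomes a request body and how the generated text is pulled out of the response. The sketch below imitates that round trip with the standard library only. It assumes that `$prompt` is replaced by the JSON-encoded prompt string, and it uses a hand-written response in the shape printed by the sample invocation earlier. `build_payload` and `extract_output` are hypothetical helpers, not part of fmeval.

```python
import json
from string import Template

content_template = '{"inputs": $prompt, "parameters": { "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 200}}'


def build_payload(prompt: str) -> dict:
    """Substitute the JSON-encoded prompt into the template and parse the result."""
    body = Template(content_template).substitute(prompt=json.dumps(prompt))
    return json.loads(body)


def extract_output(response: list) -> str:
    """Plain-Python equivalent of the JMESPath expression '[0].generated_text'."""
    return response[0]["generated_text"]


# A prompt containing quotes, to show that JSON-encoding keeps the body valid.
payload = build_payload('Kuching is the "capital" of')
print(payload["inputs"])

# Hand-written stand-in for an endpoint response.
fake_response = [{"generated_text": " Sarawak and the biggest city on Borneo."}]
print(extract_output(fake_response))
```

JSON-encoding the prompt before substitution keeps quotes and newlines in a prompt from breaking the request body.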


### Run Evaluation

In [27]:
eval_output = eval_algo.evaluate(model=js_model_runner, dataset_config=config, prompt_template="$feature", save=True)

2024-01-09 19:49:40,875	INFO read_api.py:406 -- To satisfy the requested parallelism of 8, each read task output is split into 8 smaller blocks.


Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]

Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]

  return transform_pyarrow.concat(tables)
2024-01-09 19:49:40,938	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition]
2024-01-09 19:49:40,939	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:49:40,939	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Repartition 1:   0%|          | 0/15 [00:00<?, ?it/s]

Split Repartition 2:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 19:49:41,033	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(process_batch)]
2024-01-09 19:49:41,034	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:49:41,035	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 19:49:41,140	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[Map(ModelRunnerWrapper)]
2024-01-09 19:49:41,140	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:49:41,141	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-01-09 19:49:41,163	INFO actor_pool_map_operator.py:106 -- Map(ModelRunnerWrapper): Waiting for 3 pool actors to start...


[2m[36m(_MapWorker pid=1516)[0m sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
[2m[36m(_MapWorker pid=1516)[0m sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


[2m[36m(_MapWorker pid=1516)[0m Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1516)[0m Unable to fetch log_probability from model response: Extractor cannot extract log_probability as log_probability_jmespath_expression is not provided
[2m[36m(_MapWorker pid=1518)[0m Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.[32m [repeated 2x across cluster][0m
[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1516)[0m Unable to fetch log_probability from model response: Extractor cannot extract log_probability as log_probability_jmespath_expression is not provided[32m [repeated 3x across cluster][0m
[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1516)[0m Unable to fetch log_probability from model response: Extractor cannot extract log_probability as log_probability_jmespath_expression is not provided[32m [repeated 3x across cluster]

- Aggregate 1:   0%|          | 0/15 [00:00<?, ?it/s]

[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1518)[0m Unable to fetch log_probability from model response: Extractor cannot extract log_probability as log_probability_jmespath_expression is not provided


Shuffle Map 2:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

[2m[36m(Map(_generate_eval_scores) pid=1607)[0m sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml[32m [repeated 3x across cluster][0m
[2m[36m(Map(_generate_eval_scores) pid=1607)[0m sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml[32m [repeated 3x across cluster][0m


2024-01-09 19:51:49,390	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)] -> LimitOperator[limit=1]
2024-01-09 19:51:49,391	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:51:49,391	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

2024-01-09 19:51:49,451	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)] -> AllToAllOperator[Aggregate] -> TaskPoolMapOperator[MapBatches(<lambda>)]
2024-01-09 19:51:49,452	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:51:49,452	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

Sort Sample 0:   0%|          | 0/15 [00:00<?, ?it/s]

  return transform_pyarrow.concat(tables)
2024-01-09 19:51:49,748	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)] -> AllToAllOperator[Aggregate]
2024-01-09 19:51:49,749	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:51:49,750	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

Sort Sample 0:   0%|          | 0/15 [00:00<?, ?it/s]

  return transform_pyarrow.concat(tables)
2024-01-09 19:51:49,975	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)->Map(<lambda>)]
2024-01-09 19:51:49,975	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:51:49,976	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 19:51:50,078	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)->Map(<lambda>)]
2024-01-09 19:51:50,078	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:51:50,079	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

### Parse Evaluation Results

In [28]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

[
    {
        "eval_name": "factual_knowledge",
        "dataset_name": "trex_sample",
        "dataset_scores": [
            {
                "name": "factual_knowledge",
                "value": 0.82
            }
        ],
        "prompt_template": "$feature",
        "category_scores": [
            {
                "name": "Capitals",
                "scores": [
                    {
                        "name": "factual_knowledge",
                        "value": 0.82
                    }
                ]
            }
        ],
        "output_path": "/tmp/eval_results/factual_knowledge_trex_sample.jsonl",
        "error": null
    }
]
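
When only the numbers are needed, the printed structure can be walked directly. The sketch below works on a copy of the JSON shown above (trimmed to the fields it uses) rather than on the `eval_output` objects themselves. `summarize` is a hypothetical helper.

```python
import json

# Copy of the JSON printed above, trimmed to the fields used here.
printed = """
[
    {
        "eval_name": "factual_knowledge",
        "dataset_name": "trex_sample",
        "dataset_scores": [{"name": "factual_knowledge", "value": 0.82}],
        "category_scores": [
            {"name": "Capitals", "scores": [{"name": "factual_knowledge", "value": 0.82}]}
        ],
        "output_path": "/tmp/eval_results/factual_knowledge_trex_sample.jsonl",
        "error": null
    }
]
"""


def summarize(results: list) -> list:
    """Flatten scores into (dataset, scope, metric, value) rows."""
    rows = []
    for result in results:
        for score in result["dataset_scores"]:
            rows.append((result["dataset_name"], "overall", score["name"], score["value"]))
        # category_scores can be null, so fall back to an empty list.
        for category in result.get("category_scores") or []:
            for score in category["scores"]:
                rows.append((result["dataset_name"], category["name"], score["name"], score["value"]))
    return rows


rows = summarize(json.loads(printed))
for row in rows:
    print(row)
```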


In [29]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []

# We obtain the path to the results file from "output_path" in the cell above
with open("/tmp/eval_results/factual_knowledge_trex_sample.jsonl", "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['eval_algo'] = df['scores'].apply(lambda x: x[0]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[0]['value'])
df

| | model_input | model_output | target_output | category | prompt | scores | eval_algo | eval_score |
|---|---|---|---|---|---|---|---|---|
| 0 | Cape Coast is the capital of | the Central region and is also the regional c... | `Central regional<OR>Central Region, Ghana` | Capitals | Cape Coast is the capital of | [{'name': 'factual_knowledge', 'value': 0}] | factual_knowledge | 0 |
| 1 | Kuching is the capital of | Sarawak and the biggest city on Borneo. It is... | `Sarawak<OR>Crown Colony` | Capitals | Kuching is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 2 | Sukhum is the capital of | the Republic of Abkhazia, a semi-autonomous s... | `Abkhazia` | Capitals | Sukhum is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 3 | Minsk is the capital of | Belarus. It is located on the Svislach and Ni... | `Byelorussian Soviet Socialist Republic<OR>Bela...` | Capitals | Minsk is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 4 | Vientiane is the capital of | Laos and the administrative, economic, cultur... | `Laotian<OR>Laos` | Capitals | Vientiane is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 5 | Port au Prince is the capital of | Haiti. It is a relatively small city with a p... | `Haiti` | Capitals | Port au Prince is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 6 | Bloemfontein is the capital of | the Free State Province and is located in the... | `Free State<OR>Mangaung Local Municipality<OR>O...` | Capitals | Bloemfontein is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 7 | Kabul is the capital of | Afghanistan. It is the second largest city in... | `Islamic State<OR>Afghan<OR>Taliban government<...` | Capitals | Kabul is the capital of | [{'name': 'factual_knowledge', 'value': 1}] | factual_knowledge | 1 |
| 8 | Senftenberg is the capital of | the region of Zlín in the Czech Republic. The... | `Oberspreewald-Lausitz` | Capitals | Senftenberg is the capital of | [{'name': 'factual_knowledge', 'value': 0}] | factual_knowledge | 0 |
| 9 | Bassein is the capital of | the Bassein district, which is the second-lar... | `Ayeyarwady Region` | Capitals | Bassein is the capital of | [{'name': 'factual_knowledge', 'value': 0}] | factual_knowledge | 0 |
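
The per-sample scores can also be aggregated without pandas. This sketch recomputes the mean score per category and lists the misses, using the standard library only. The records are typed in by hand from the ten rows displayed above (only the fields needed), so the result describes just those ten rows, not the full dataset behind the 0.82 reported earlier.

```python
from collections import defaultdict
from statistics import mean

# (model_input, category, eval_score) for the ten rows shown above.
rows = [
    ("Cape Coast is the capital of", "Capitals", 0),
    ("Kuching is the capital of", "Capitals", 1),
    ("Sukhum is the capital of", "Capitals", 1),
    ("Minsk is the capital of", "Capitals", 1),
    ("Vientiane is the capital of", "Capitals", 1),
    ("Port au Prince is the capital of", "Capitals", 1),
    ("Bloemfontein is the capital of", "Capitals", 1),
    ("Kabul is the capital of", "Capitals", 1),
    ("Senftenberg is the capital of", "Capitals", 0),
    ("Bassein is the capital of", "Capitals", 0),
]

# Group scores by category, then average each group.
by_category = defaultdict(list)
for _, category, score in rows:
    by_category[category].append(score)
category_means = {category: mean(scores) for category, scores in by_category.items()}

# Inputs whose score is 0 are the ones worth inspecting by hand.
misses = [model_input for model_input, _, score in rows if score == 0]

print(category_means)
print(misses)
```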


### Dolly dataset evaluation

In [30]:
config = DataConfig(
    dataset_name="dolly_qa",
    dataset_uri="dataset_qa_eval.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="model_input",
    target_output_location="target_output",
)

In [31]:
js_model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
    output='[0].generated_text',
    log_probability='[0].details.prefill[*].logprob',
    content_template='{"inputs": $prompt, "parameters": { "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 200}}',
)

Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


In [32]:
eval_output = eval_algo.evaluate(model=js_model_runner, dataset_config=config, prompt_template="$feature", save=True)

2024-01-09 19:52:22,272	INFO read_api.py:406 -- To satisfy the requested parallelism of 8, each read task output is split into 8 smaller blocks.


Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]

Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]

  return transform_pyarrow.concat(tables)
2024-01-09 19:52:22,363	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition]
2024-01-09 19:52:22,363	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:52:22,364	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Repartition 1:   0%|          | 0/15 [00:00<?, ?it/s]

Split Repartition 2:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 19:52:22,463	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(process_batch)]
2024-01-09 19:52:22,463	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:52:22,464	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 19:52:22,581	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[Map(ModelRunnerWrapper)]
2024-01-09 19:52:22,582	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 19:52:22,583	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-01-09 19:52:22,604	INFO actor_pool_map_operator.py:106 -- Map(ModelRunnerWrapper): Waiting for 3 pool actors to start...


[2m[36m(_MapWorker pid=1732)[0m sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
[2m[36m(_MapWorker pid=1732)[0m sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


[2m[36m(_MapWorker pid=1732)[0m Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1734)[0m Unable to fetch log_probability from model response: JMESpath [0].details.prefill[*].logprob could not find any data
[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1734)[0m Unable to fetch log_probability from model response: JMESpath [0].details.prefill[*].logprob could not find any data
[2m[36m(_MapWorker pid=1733)[0m Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '*'. You can pin to version '3.0.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.[32m [repeated 2x across cluster][0m
[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1733)[0m Unable to fetch log_probability from model response: JMESpath [0].details.prefill[*].logprob could not find any data[32m [repeated 2x across cluster][0m
[2m[36m(MapWorker(Map(ModelRunnerWrapper)) pid=1733)[0m Unable to fetch log_probability from model response: JMESpath [0].deta

- Aggregate 1:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/15 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/15 [00:00<?, ?it/s]

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

[2m[36m(Map(_generate_eval_scores) pid=1848)[0m sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml[32m [repeated 3x across cluster][0m
[2m[36m(Map(_generate_eval_scores) pid=1848)[0m sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml[32m [repeated 3x across cluster][0m


2024-01-09 20:00:51,621	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)] -> LimitOperator[limit=1]
2024-01-09 20:00:51,621	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 20:00:51,622	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

2024-01-09 20:00:51,694	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)->Map(<lambda>)]
2024-01-09 20:00:51,695	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 20:00:51,695	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

2024-01-09 20:00:51,816	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(_generate_eval_scores)->Map(<lambda>)]
2024-01-09 20:00:51,817	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 20:00:51,818	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/15 [00:00<?, ?it/s]

In [15]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

[
    {
        "eval_name": "factual_knowledge",
        "dataset_name": "dolly_qa",
        "dataset_scores": [
            {
                "name": "factual_knowledge",
                "value": 0.04666666666666667
            }
        ],
        "prompt_template": "$feature",
        "category_scores": null,
        "output_path": "/tmp/eval_results/factual_knowledge_dolly_qa.jsonl",
        "error": null
    }
]
