## Get Evaluations on all inference files and gather findings on quantitative metrics (such as _Cosine Similarity_) and subjective metrics on various criteria using _LLM as a judge_ - Max Voting & Average Pooling using PoLL (Panel of LLM Evaluators)

---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

#### This step of the solution focuses on evaluating the quality of responses. It gathers the following information and performs the steps below:

- **Gets all per-inference request files**: This step first reads all of the per-inference request files into a dataframe that contains the response from the LLM as well as the ground truth, if one is provided.

- **Generates quantitative metrics for evaluation**: Calculates quantitative metrics that measure similarity and accuracy, for example _Cosine Similarity_. This gives an overall quantitative score across the entire dataset that indicates which model generates outputs most similar to the ground truth (if one is provided). With this statistic, customers and users in the open source community can make business-level judgements.

- **Uses an _LLM as a judge_ approach to get subjective evaluations**: Refer to this [paper](https://arxiv.org/pdf/2404.18796). We use the following methods to evaluate the responses from the `candidate models` (the models used to generate inferences):

    1. **Max Voting**: When a dataset provides a ground truth, we use a technique called `Max Voting`. Here, we use PoLL, a panel of LLM evaluators from different model families, to judge whether each candidate model's response is `correct` or `incorrect` based purely on a comparison with the ground truth. Drawing the panel from different model families brings its evaluation ability closer to that of a human evaluator and reduces intra-model bias.

    2. **Average Pooling**: When a dataset does not provide a ground truth, or when the task being evaluated calls for subjective judgements, we use `Average Pooling`. Each PoLL evaluator scores the candidate model's responses on a scale of 1-5 against specific subjective criteria. We then average these scores to see how the panel rated each candidate model.
    
***All evaluations are generated in a JSON format for further downstream analytics on the evaluation results***
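
The two aggregation methods described above can be sketched as follows. This is a minimal illustration using only the standard library, not the implementation used by this notebook: the judge names, verdicts, scores, and the explicit `"tie"` result are all assumptions made for the example.

```python
import json
from collections import Counter
from statistics import mean
from typing import Dict, List


def max_vote(verdicts: List[str]) -> str:
    """Return the verdict chosen by the most judges on the panel."""
    counts = Counter(verdicts)
    top_verdict, top_count = counts.most_common(1)[0]
    # Assumption: a tie is reported explicitly rather than broken silently.
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "tie"
    return top_verdict


def average_pool(scores: List[int]) -> float:
    """Average the 1-5 scores assigned by each judge on the panel."""
    return float(mean(scores))


# Hypothetical verdicts and scores from three judges of different model families.
panel_verdicts: Dict[str, str] = {
    "judge_a": "correct",
    "judge_b": "correct",
    "judge_c": "incorrect",
}
panel_scores: Dict[str, int] = {"judge_a": 4, "judge_b": 5, "judge_c": 3}

# Emit the aggregated result as JSON for downstream analytics.
evaluation = {
    "max_voting": max_vote(list(panel_verdicts.values())),
    "average_pooling": average_pool(list(panel_scores.values())),
}
print(json.dumps(evaluation, indent=2))
```

Using an odd number of judges avoids ties for a binary `correct`/`incorrect` verdict, which is one reason a three-member panel is a convenient choice.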

#### Import all of the necessary libraries below to run this notebook

In [1]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [2]:
import io
import ray
import math
import time
import json
import torch
import logging
import tempfile
import datetime
import matplotlib
import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm
from litellm import completion
from sentence_transformers import SentenceTransformer

# Import seaborn and other related libraries for visualizations and plotting charts
import seaborn as sns
from pathlib import Path
from tomark import Tomark
from fmbench.utils import *
from fmbench.globals import *
from datetime import datetime
from datetime import timezone
from dateutil.parser import parse
from typing import List, Optional, Dict
import importlib.resources as pkg_resources
from fmbench import __version__ as fmbench_version

region_name=us-west-2


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role
config file current -> configs/bedrock/config-bedrock.yml, None
loaded config: {'general': {'name': 'fmbench-bedrock', 'model_name': 'FMs available in Amazon Bedrock'}, 'aws': {'region': 'us-west-2', 'sagemaker_execution_role': 'arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role', 'bucket': 'sagemaker-fmbench-write-us-west-2-387192758086'}, 'dir_paths': {'data_prefix': 'data', 'prompts_prefix': 'prompts', 'all_prompts_file': 'all_prompts.csv', 'metrics_dir': 'metrics', 'models_dir': 'models', 'metadata_dir': 'metadata'}, 's3_read_data': {'read_bucket': 'sagemaker-fmbench-read-us-west-2-387192758086', 'scripts_prefix': 'scripts', 'script_files': ['hf_token.txt'], 'eval_prompts_dir': 'eval_criteria_prompts', 'eval_prompt_template_dir_list': ['claude_eval_prompt_templates', 'llama3_eval_promp

In [3]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [4]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-07-12 16:35:41,603	INFO worker.py:1788 -- Started a local Ray instance.


Python version: 3.11.9
Ray version: 2.32.0


(async_run_eval pid=27749) async_run_eval, i=1, total=10, judge_model_info=anthropic.claude-3-haiku-20240307-v1:0, eval_method: max_voting, uuid: 6e77fb90ea804956a115b0268df059b0
(async_run_eval pid=27749) run_eval, row 1/10, judge_model_id=anthropic.claude-3-haiku-20240307-v1:0, candidate model=mistral.mistral-7b-instruct-v0:2
(async_run_eval pid=27749) get_inference, model_id=anthropic.claude-3-haiku-20240307-v1:0
(async_run_eval pid=27750) Invoking bedrock/anthropic.claude-3-haiku-20240307-v1:0......
(async_run_eval pid=27750)
(async_run_eval pid=27750) async_run_eval, i=6, total=10, judge_model_info=anthropic.claude-3-haiku-20240307-v1:0, eval_method: max_voting, uuid: bbf3e53076c3497aba6e0cd8eba686fd [repeated 24x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication f

(async_run_eval pid=27750) Exception occurred during invoking anthropic.claude-3-haiku-20240307-v1:0, exception=BedrockException: Timeout Error - {"message":"Model has timed out in processing the request. Try your request again."}


(async_run_eval pid=27750) async_run_eval, i=4, total=10, judge_model_info=anthropic.claude-3-haiku-20240307-v1:0, eval_method: max_voting, uuid: 715d1d30b18244279ff5c0fbf112f245
(async_run_eval pid=27750) run_eval, row 4/10, judge_model_id=anthropic.claude-3-haiku-20240307-v1:0, candidate model=mistral.mistral-7b-instruct-v0:2
(async_run_eval pid=27750) get_inference, model_id=anthropic.claude-3-haiku-20240307-v1:0
(async_run_eval pid=27750) Invoking bedrock/anthropic.claude-3-haiku-20240307-v1:0......
(async_run_eval pid=27749) async_run_eval, i=5, total=10, judge_model_info=anthropic.claude-3-haiku-20240307-v1:0, eval_method: max_voting, uuid: ea46fe7df1c24763b2332d518096d818 [repeated 24x across cluster]
(async_run_eval pid=27749) run_eval, row 5/10, judge_model_id=anthropic.claude-3-haiku-20240307-v1:0, candidate model=mistral.mistral-7b-instruct-v0:2 [repeated 24x across cluster]
(async_run_eval pid=2774

Load the config YAML file, which contains information used across this benchmarking environment, such as information about the AWS account, the prompts, and the payloads to be used for invocations.

In [5]:
logger.info(f"CONFIG_FILE={CONFIG_FILE}")
config = load_main_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

[2024-07-12 16:35:42,281] p27460 {2445076252.py:1} INFO - CONFIG_FILE=configs/bedrock/config-bedrock.yml


region_name=us-west-2


[2024-07-12 16:35:42,602] p27460 {2445076252.py:3} INFO - {
  "general": {
    "name": "fmbench-bedrock",
    "model_name": "FMs available in Amazon Bedrock"
  },
  "aws": {
    "region": "us-west-2",
    "sagemaker_execution_role": "arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role",
    "bucket": "sagemaker-fmbench-write-us-west-2-387192758086"
  },
  "dir_paths": {
    "data_prefix": "data",
    "prompts_prefix": "prompts",
    "all_prompts_file": "all_prompts.csv",
    "metrics_dir": "metrics",
    "models_dir": "models",
    "metadata_dir": "metadata"
  },
  "s3_read_data": {
    "read_bucket": "sagemaker-fmbench-read-us-west-2-387192758086",
    "scripts_prefix": "scripts",
    "script_files": [
      "hf_token.txt"
    ],
    "eval_prompts_dir": "eval_criteria_prompts",
    "eval_prompt_template_dir_list": [
      "claude_eval_prompt_templates",
      "llama3_eval_prompt_templates",
      "cohere_eval_prompt_templates"
    ],
    "eval_instructions_dir": "eval_instruct

role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role


#### Load the associated pricing config file

In [6]:
# represents getting the config file from the s3 bucket/https path for pricing yml information
pricing_file_path: str = config['pricing'] 

# initialize the pricing config file to None
pricing_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

pricing_module = Path(config['pricing'])
logger.info(f"pricing config provided for inference from this model is --> {pricing_module}")
pricing_file_path = os.path.join(config_dir, pricing_module)
logger.info(f"pricing config file path is --> {pricing_file_path}")

pricing_config = load_config(pricing_file_path)
logger.info(f"pricing config file recorded: {json.dumps(pricing_config, indent=2)}")

[2024-07-12 16:35:42,608] p27460 {2131877439.py:9} INFO - Using fmbench.configs directory: /home/ec2-user/anaconda3/envs/fmbench_eval_python311/lib/python3.11/site-packages/fmbench/configs
[2024-07-12 16:35:42,609] p27460 {2131877439.py:12} INFO - pricing config provided for inference from this model is --> pricing.yml
[2024-07-12 16:35:42,610] p27460 {2131877439.py:14} INFO - pricing config file path is --> /home/ec2-user/anaconda3/envs/fmbench_eval_python311/lib/python3.11/site-packages/fmbench/configs/pricing.yml


region_name=us-west-2


[2024-07-12 16:35:42,902] p27460 {2131877439.py:17} INFO - pricing config file recorded: {
  "pricing": {
    "instance_based": {
      "ml.m5.xlarge": 0.23,
      "ml.g5.xlarge": 1.4084,
      "ml.g5.2xlarge": 1.515,
      "ml.g5.12xlarge": 7.09,
      "ml.g5.24xlarge": 10.18,
      "ml.g5.48xlarge": 20.36,
      "ml.inf2.xlarge": 0.99,
      "ml.inf2.8xlarge": 2.36,
      "ml.inf2.24xlarge": 7.79,
      "ml.inf2.48xlarge": 15.58,
      "ml.trn1.32xlarge": 28.497,
      "ml.p4d.24xlarge": 37.688,
      "ml.p5.48xlarge": 113.068,
      "ml.p3.2xlarge": 3.825,
      "ml.g4dn.12xlarge": 4.89,
      "ml.g6.2xlarge": 1.222,
      "ml.g6.16xlarge": 4.246,
      "ml.g6.12xlarge": 5.752,
      "ml.g6.24xlarge": 8.344,
      "ml.g6.48xlarge": 16.688,
      "anthropic.claude-v3-sonnet-pt-nc": 88,
      "m5.xlarge": 0.192,
      "g5.xlarge": 1.006,
      "g5.2xlarge": 1.212,
      "g5.12xlarge": 5.672,
      "g5.24xlarge": 8.144,
      "g5.48xlarge": 16.288,
      "inf2.xlarge": 0.7582,
      "i

role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role


In [7]:
debug = False
if debug is True:
    metrics_path_file: str = os.path.join("..", "..", METADATA_DIR, METRICS_PATH_FNAME)
else:
    metrics_path_file: str = os.path.join(METADATA_DIR, METRICS_PATH_FNAME)
logger.info(f"cwd={os.getcwd()}, METADATA_DIR={METADATA_DIR}, METRICS_PATH_FNAME={METRICS_PATH_FNAME}, metrics_path_file={metrics_path_file}")
METRICS_DIR: str = Path(metrics_path_file).read_text().strip()
logger.info(f"metrics_path_file={metrics_path_file}, METRICS_DIR={METRICS_DIR}")

[2024-07-12 16:35:42,907] p27460 {3887258129.py:6} INFO - cwd=/home/ec2-user/SageMaker/foundation-model-benchmarking-tool/src/fmbench, METADATA_DIR=metadata, METRICS_PATH_FNAME=metrics_path.txt, metrics_path_file=metadata/metrics_path.txt


FileNotFoundError: [Errno 2] No such file or directory: 'metadata/metrics_path.txt'

In [8]:
file_path: str = "fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=01/mm=18/per_inference_request_results.csv"
logger.info(f"File path containing the metrics per inference folder --> {file_path}")

# Read the file from S3
try:
    file_content = get_s3_object(config['aws']['bucket'], file_path)
    # Use pandas to read the CSV content
    df_per_inference = pd.read_csv(io.StringIO(file_content))
    logger.info(f"{file_path} read into dataframe of shape {df_per_inference.shape}, "
                f"cols={df_per_inference.columns}")
    logger.info(f"{file_path} contains results for the following endpoints={df_per_inference.endpoint_name.unique()}")
    logger.info(df_per_inference.head())
except Exception as e:
    logger.error(f"Error reading from S3: {e}")

[2024-07-12 16:35:48,216] p27460 {187576928.py:2} INFO - File path containing the metrics per inference folder --> fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=01/mm=18/per_inference_request_results.csv
[2024-07-12 16:35:48,321] p27460 {187576928.py:9} INFO - fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=01/mm=18/per_inference_request_results.csv read into dataframe of shape (360, 22), cols=Index(['endpoint_name', 'prompt', 'ground_truth', 'temperature', 'max_tokens',
       'top_p', 'completion', 'prompt_tokens', 'completion_tokens', 'latency',
       'time_to_first_token', 'time_per_output_token', 'time_to_last_token',
       'uuid', 'experiment_name', 'concurrency', 'instance_type',
       'instance_count', 'EndpointName', 'ModelName', 'Image', 'S3Uri'],
      dtype='object')
[2024-07-12 16:35:48,322] p27460 {187576928.py:11} INFO - fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=

In [9]:
df_per_inference.head()

Unnamed: 0,endpoint_name,prompt,ground_truth,temperature,max_tokens,top_p,completion,prompt_tokens,completion_tokens,latency,...,time_to_last_token,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri
0,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.440897,...,,6e77fb90ea804956a115b0268df059b0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,
1,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.45033,...,,82d04de71238454a9bbe05f520f22cb0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,
2,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.461762,...,,22303e132fb646aa86b938083660dce8,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,
3,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.453813,...,,715d1d30b18244279ff5c0fbf112f245,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,
4,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.435938,...,,6ede1f51a3d143f09377dda35107693a,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,


### Relationship between prompt token length and inference latency for different instances and concurrency levels

In [10]:
df_per_inference.latency.describe()

count    360.000000
mean       0.999005
std        0.660359
min        0.273340
25%        0.549609
50%        0.800043
75%        1.293201
max        5.731676
Name: latency, dtype: float64
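
The summary above describes latency on its own. One simple way to quantify its relationship with prompt length is a Pearson correlation between the `prompt_tokens` and `latency` columns. The sketch below is illustrative only: it uses plain Python, the helper name `pearson` is hypothetical, and the sample values are made up rather than taken from this run.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally sized samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for df_per_inference.prompt_tokens and
# df_per_inference.latency; they are NOT results from this notebook.
sample_prompt_tokens = [100, 200, 300, 400]
sample_latencies = [0.3, 0.5, 0.7, 0.9]
print(f"correlation={pearson(sample_prompt_tokens, sample_latencies):.3f}")
```

With the dataframe loaded in this notebook, the pandas equivalent would be along the lines of `df_per_inference[['prompt_tokens', 'latency']].corr()`, optionally computed per `instance_type` and `concurrency` group.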

### Use the `sentence-transformers/all-mpnet-base-v2` embeddings model to calculate the _Cosine Similarity_ scores 
---

This portion of the evaluation step does the following:

1. Uses the `sentence-transformers/all-mpnet-base-v2` model from Hugging Face. This sentence-transformers model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

1. Uses the embeddings model to get quantitative metrics from the inferences: a similarity score between the ground truth answers from the dataset (if any are given) and the actual responses received from the model during inference.

1. If no ground truth is provided, cosine similarity is calculated between the response and the content provided to answer the question.
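
The scoring rule in the steps above can be sketched without loading any model. This is a minimal illustration under stated assumptions: `cosine_similarity` works on plain lists of floats, and `pick_reference_text` is a hypothetical helper that chooses the ground truth when it is present and otherwise falls back to the content provided to answer the question (called `context` here).

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """cos(a, b) = (a . b) / (|a| * |b|); None when either vector has zero length."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None
    return dot_product / (norm_a * norm_b)


def pick_reference_text(ground_truth: object, context: str) -> str:
    """Use the ground truth when it is a non-empty string, else the context.

    A missing value read by pandas shows up as NaN (a float), so anything
    that is not a non-empty string is treated as "no ground truth".
    """
    if isinstance(ground_truth, str) and ground_truth.strip():
        return ground_truth
    return context


# Toy vectors standing in for real embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(pick_reference_text(float("nan"), "some context"))
```

In the notebook, the vectors would come from the embeddings model rather than from hand-written lists; the selection of the reference text is the only part this sketch adds.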

In [11]:
# get the quantitative evaluation information from the config file, such as the embeddings model
# to be used
embeddings_model_quantitative_info: Dict = config['model_evaluations']['quantitative_eval_info']


def load_model():
    """
    This function loads the sentence-transformers model based on the provided model ID.
    """
    try: 
        model=None
        model_id = embeddings_model_quantitative_info['embeddings_model_id'].get('model_id', None)
        if model_id:
            model = SentenceTransformer(model_id)
        else:
            raise ValueError("Model ID is not provided or invalid in the configuration.")
    except Exception as e:
        logger.error(f"The SentenceTransformer embeddings model could not be loaded: {e}")
        model=None
    return model

In [12]:
# load the embeddings model to calculate the cosine similarity scores
model = load_model()


def get_cosine_similarity(text1: str, text2: str) -> Optional[float]:
    """
    This function calculates the cosine similarity between two texts.
    Returns None if the similarity could not be calculated.
    """
    cosine: Optional[float] = None
    try:
        # get the embedding for each text using the sentence-transformers model
        A = model.encode([text1])[0]
        B = model.encode([text2])[0]
        # cosine similarity = dot product divided by the product of the vector norms
        cosine = float(dot(A, B) / (norm(A) * norm(B)))
    except Exception as e:
        logger.error(f"Cosine similarity was not calculated at this iteration: {e}")
        cosine = None
    return cosine


# calculate the cosine similarity between each completion and its ground truth
df_per_inference['cosine_similarity_score'] = df_per_inference.apply(
    lambda row: get_cosine_similarity(row['completion'], row['ground_truth']), axis=1
)
df_per_inference.head()

[2024-07-12 16:35:50,764] p27460 {SentenceTransformer.py:189} INFO - Use pytorch device_name: cpu
[2024-07-12 16:35:50,765] p27460 {SentenceTransformer.py:197} INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,endpoint_name,prompt,ground_truth,temperature,max_tokens,top_p,completion,prompt_tokens,completion_tokens,latency,...,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score
0,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.440897,...,6e77fb90ea804956a115b0268df059b0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
1,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.45033,...,82d04de71238454a9bbe05f520f22cb0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
2,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.461762,...,22303e132fb646aa86b938083660dce8,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
3,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.453813,...,715d1d30b18244279ff5c0fbf112f245,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
4,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.435938,...,6ede1f51a3d143f09377dda35107693a,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724


In [13]:
# define the all_metrics path to send the evaluation metrics to
all_metrics_fpath = os.path.join(METRICS_DIR, config["report"]["all_metrics_file"])

csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_cosine_similarity_scores_csv = csv_buffer.getvalue()

# Define the file name for S3 based on the original file path. Use a separate
# variable so that the CSV data held in df_per_inference_with_cosine_similarity_scores_csv
# is not overwritten before it is written to S3
all_metrics_summary_fname = all_metrics_fpath.replace("all_metrics", "all_metrics_summary").split('/')[-1]
inference_cosine_similarity_scores_s3_path = os.path.join(METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_cosine_similarity_scores_csv, BUCKET_NAME, "", 
            METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)
logger.info(f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{inference_cosine_similarity_scores_s3_path}")

df_per_inference.head()

[2024-07-12 16:36:34,492] p27460 {100736689.py:15} INFO - Per inference cosine similarity scores saved to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/per_inference_cosine_similarity.csv


Unnamed: 0,endpoint_name,prompt,ground_truth,temperature,max_tokens,top_p,completion,prompt_tokens,completion_tokens,latency,...,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score
0,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.440897,...,6e77fb90ea804956a115b0268df059b0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
1,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.45033,...,82d04de71238454a9bbe05f520f22cb0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
2,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.461762,...,22303e132fb646aa86b938083660dce8,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
3,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.453813,...,715d1d30b18244279ff5c0fbf112f245,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
4,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.435938,...,6ede1f51a3d143f09377dda35107693a,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724


### Use _Panel of LLM Evaluators_ to get Subjective Evaluations on various evaluation criteria
---

In this portion of the notebook, we evaluate the content generated by the different candidate models using two main methods: `Max Voting` and `Average Pooling`. To reduce intra-model bias, we score answer correctness not with a single judge but with a panel composed of multiple evaluator models. Similar pooling techniques are used to reduce variance in human annotations by normalizing out both the natural variation in human judgements caused by subjective biases and human error. We use the following two techniques:

1. **Max Voting**: We use the PoLL to evaluate candidate model responses by checking their correctness against the ground truth answer provided in the dataset. We prompt each evaluator in the PoLL to respond in a JSON structure containing a verdict on whether the response is correct or incorrect and an explanation of why. Using this, we can perform downstream analytics such as:

    1. Calculate the overall accuracy of each model using the correct versus the (correct + incorrect) responses
    
    1. Calculate the `error rate`, or frequency of incorrect responses
    
    1. Categorize the errors based on the explanations provided by the evaluators. Common categories might include misunderstanding the question, incomplete answers, factual inaccuracies
    
    1. Summarize the overall correct/incorrect counts and identify the best model based on the PoLL by ranking the models on correctness versus incorrectness.

1. **Average Pooling**: We use the PoLL to rate the response of each candidate model on more subjective criteria. Each response is rated on a scale of 1-5 against the criteria, together with an explanation for the rating. Using this, we can:

    1. Calculate the average score for each model across all questions to get an overall performance measure.
    
    1. Compute the standard deviation of the scores to understand the consistency of the model's performance.

After all evaluations are complete, a final layer of evaluation is added. This layer uses another LLM that acts as a final summarizer. It takes in the ratings and the answers generated by each unique model used in inference, and gives a list of trends, overall patterns and observations as to which model is suited for a given task on a given dataset.
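
The Max Voting and Average Pooling aggregation steps described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation. The JSON keys `verdict` and `explanation`, the helper names (`parse_verdict`, `max_vote`, `accuracy`, `average_pool`), and the choice to treat ties and unparseable replies as undecided are all assumptions made for the example.

```python
import json
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Optional


def parse_verdict(raw_completion: str) -> Optional[str]:
    """Extract a 'correct'/'incorrect' verdict from one judge's JSON reply.

    The 'verdict' key is an assumed field name; returns None if the reply
    cannot be parsed or holds an unexpected value.
    """
    try:
        verdict = json.loads(raw_completion).get("verdict")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None
    if isinstance(verdict, str) and verdict.strip().lower() in ("correct", "incorrect"):
        return verdict.strip().lower()
    return None


def max_vote(verdicts: List[Optional[str]]) -> Optional[str]:
    """Return the majority verdict across judges, or None on a tie or no valid votes."""
    counts = Counter(v for v in verdicts if v is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between 'correct' and 'incorrect'
    return ranked[0][0]


def accuracy(majority_verdicts: List[Optional[str]]) -> float:
    """Compute correct / (correct + incorrect), ignoring undecided items."""
    decided = [v for v in majority_verdicts if v is not None]
    if not decided:
        return float("nan")
    return sum(v == "correct" for v in decided) / len(decided)


def average_pool(scores_per_question: List[List[int]]) -> Dict[str, float]:
    """Average the judges' 1-5 scores per question, then summarize across questions."""
    per_question = [mean(scores) for scores in scores_per_question if scores]
    return {"mean": mean(per_question), "std": pstdev(per_question)}


# Illustrative (made-up) judge replies for two questions answered by one candidate model
judge_replies = [
    ['{"verdict": "correct", "explanation": "matches ground truth"}',
     '{"verdict": "correct", "explanation": "same meaning"}',
     '{"verdict": "incorrect", "explanation": "incomplete"}'],
    ['{"verdict": "incorrect", "explanation": "wrong entity"}',
     'not valid json',
     '{"verdict": "incorrect", "explanation": "factual error"}'],
]
majority = [max_vote([parse_verdict(r) for r in replies]) for replies in judge_replies]
model_accuracy = accuracy(majority)
pooled_scores = average_pool([[4, 5, 3], [2, 3, 4]])
```

Using an odd number of judges, as in the three-model panel configured for this notebook, keeps ties rare; they can still occur when a judge's reply cannot be parsed, which is why the sketch reports such items as undecided rather than forcing a verdict.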

In [14]:
# get the qualitative/subjective evaluation information from the config file to evaluate answers from different
# endpoints on various criteria
model_eval_subjective_info: Dict = config['model_evaluations']['subjective_eval_info']
eval_criteria_list = model_eval_subjective_info.get('eval_criteria', None)
logger.info(f"available llm as a judge evaluation information to use: {json.dumps(model_eval_subjective_info, indent=2)}")

[2024-07-12 16:36:34,505] p27460 {474703826.py:5} INFO - available llm as a judge evaluation information to use: {
  "judge_panel_list": [
    {
      "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
      "eval_prompt_template_dir": "claude_eval_prompt_templates",
      "eval_prompt_template_name_prefix": "claude_eval"
    },
    {
      "model_id": "meta.llama3-70b-instruct-v1:0",
      "eval_prompt_template_dir": "llama3_eval_prompt_templates",
      "eval_prompt_template_name_prefix": "llama3_eval"
    },
    {
      "model_id": "cohere.command-r-v1:0",
      "eval_prompt_template_dir": "cohere_eval_prompt_templates",
      "eval_prompt_template_name_prefix": "cohere_eval"
    }
  ],
  "inference_parameters": {
    "temperature": 0.1,
    "max_tokens": 300,
    "top_p": 0.92,
    "caching": false
  },
  "run_parallel_inference_count": 10
}


In [15]:
# get the inference parameters that the LLM judge panel will use while evaluating model candidate responses
INFERENCE_PARAMETERS_LLM_PANEL: Dict = config['model_evaluations']['subjective_eval_info'].get('inference_parameters', None)

In [16]:
def get_inference(model_id: str,
                  prompt: str):
    """
    Get inference using LiteLLM. This gets inference on the answers provided and evaluates each
    answer based on a given evaluation prompt template and the specific set of rules for each
    evaluation criteria.
    """
    # represents the service name
    print(f"get_inference, model_id={model_id}")
    service_name: str = "bedrock"
    # represents creating the bedrock model to invoke the litellm api for response for titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name 
    # initialize the response dict
    ret = dict(exception=None,
               prompt=prompt,
               completion=None,
               completion_token_count=None,
               prompt_token_count=None,
               model_id=model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    try:
        # Represents calling the litellm completion/messaging api utilizing the completion/embeddings API
        print(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{"content": body,"role": "user"}],
                              temperature=INFERENCE_PARAMETERS_LLM_PANEL.get('temperature', 0.1),
                              max_tokens=INFERENCE_PARAMETERS_LLM_PANEL.get('max_tokens', 100),
                              caching=INFERENCE_PARAMETERS_LLM_PANEL.get('caching', False))
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens        
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [17]:
def safe_filename(s):
    """
    convert a string to another string that can be used as a filename
    i.e. remove white space and non-word chars
    """
    if s is None:
        return "None"
    # Remove all characters except word characters (letters, digits, underscore) and whitespace
    s = re.sub(r"[^\w\s]", '', s)

    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)

    return s

In [18]:
def parse_as_json(x: str) -> Optional[Dict]:
    """
    Convert a string into a dictionary. Newlines and tabs are removed
    first, since they can break the JSON parsing. Returns None if parsing fails.
    """
    d: Optional[Dict] = None
    try:
        x = x.replace("\n", "").replace("\t", "")
        d = json.loads(x)
    except Exception as e:
        print(f"parse_as_json, error parsing string as json, string={x}")
    return d
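Judge completions are not always valid JSON. For example, an explanation may contain unescaped double quotes, in which case `json.loads` fails and `parse_as_json` returns `None`. One possible fallback is sketched below. It is not part of the notebook, and `parse_with_fallback` and `VERDICT_RE` are hypothetical names. It recovers only the verdict and assumes the verdict is one of `correct` or `incorrect`.

```python
import json
import re
from typing import Dict, Optional

# Hypothetical pattern: matches the "verdict" field of a max voting response
VERDICT_RE = re.compile(r'"verdict"\s*:\s*"(correct|incorrect)"', re.IGNORECASE)

def parse_with_fallback(x: Optional[str]) -> Optional[Dict]:
    """Try strict JSON parsing first, then fall back to extracting the verdict."""
    if not x:
        return None
    try:
        # strict=False tolerates raw control characters (newlines, tabs) in strings
        return json.loads(x, strict=False)
    except json.JSONDecodeError:
        match = VERDICT_RE.search(x)
        if match is None:
            return None
        # The explanation cannot be recovered reliably, so it is left empty
        return {"verdict": match.group(1).lower(), "explanation": None}

# An explanation with unescaped inner quotes breaks json.loads
bad = '{ "verdict": "correct", "explanation": "The answer, "Reality television shows", is correct." }'
print(parse_with_fallback(bad))
```

The trade-off is that the explanation is lost for malformed responses, but the verdict still counts in the downstream summaries.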

### Read the latest dataframe and run LLM as a judge evaluations on it

In [19]:
df_per_inference.head()

Unnamed: 0,endpoint_name,prompt,ground_truth,temperature,max_tokens,top_p,completion,prompt_tokens,completion_tokens,latency,...,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score
0,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.440897,...,6e77fb90ea804956a115b0268df059b0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
1,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.45033,...,82d04de71238454a9bbe05f520f22cb0,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
2,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.461762,...,22303e132fb646aa86b938083660dce8,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
3,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.453813,...,715d1d30b18244279ff5c0fbf112f245,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724
4,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.435938,...,6ede1f51a3d143f09377dda35107693a,mistral.mistral-7b-instruct-v0:2,1,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724


### Prepare the evaluation prompt payloads
---

Here, the evaluation prompt template is used by the LLM judge to evaluate the answers on different criteria.
The function below fills the prompt template with a set of rules, the context, the answer, and the ground truth (if any).

In [20]:
def prepare_eval_prompts(eval_template: str,
                         answer: str, 
                         rules: str, 
                         context: str, 
                         ground_truth: Optional[str]):
    """
    This function prepares the evaluation prompts by preparing the standard eval prompt template
    with the rules of a given subjective criteria, context, answer and ground truth (if any ground truth is provided)
    """
    processed_eval_template: Optional[str] = None
    processed_eval_template = eval_template.format(
        rules=rules,
        answer=answer,
        context=context,
        ground_truth=ground_truth)
    return processed_eval_template
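The template filling relies on Python's `str.format`. The sketch below uses a shortened, hypothetical template (the real templates are read from the eval prompt template files) to show two details worth keeping in mind: literal braces in a template, such as a JSON example, must be doubled, and a missing ground truth is rendered as the text `None`.

```python
# Hypothetical miniature template with the same four placeholders
template = (
    "<evaluation_instructions>\n{rules}\n</evaluation_instructions>\n"
    "<context>\n{context}\n</context>\n"
    "<answer>\n{answer}\n</answer>\n"
    "<ground_truth>\n{ground_truth}\n</ground_truth>\n"
    # Doubled braces are emitted as single literal braces by str.format
    'Respond as {{"verdict": "...", "explanation": "..."}}'
)

prompt = template.format(
    rules="Answer with correct or incorrect.",
    context="Which family do both genera belong to?",
    answer="Lardizabalaceae",
    ground_truth=None,  # no ground truth available for this row
)
print(prompt)
```

If a dataset has no ground truth, it may be preferable to substitute an explicit placeholder string before formatting, so that the judge does not read `None` as the expected answer.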

In [21]:
def run_eval(i: int, total: int, row: Dict,  model_id: str, eval_method_name: str, uuid: str) -> Dict:
    """
    Runs the evaluation for one row 
    The eval prompt is already available in the row dictionary
    and we simply want to run the inference against the judge model.
    The results are returned in a new dictionary that contains the model 
    response and some fields from the original dictionary
    """
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"run_eval, row {i}/{total}, judge_model_id={model_id}, candidate model={row['endpoint_name']}")
    candidate_model_response: str = row['completion']
    # fetch the eval prompt that was prepared for this judge model and eval method
    prompt = row[f'{model_id}_{eval_method_name}_eval_prompt']
    # run inference against the judge model to get its evaluation
    resp = get_inference(model_id, prompt)
    resp['candidate_model_response'] = candidate_model_response
    resp['candidate_model'] = row['endpoint_name']
    resp['cosine_similarity_score'] = row['cosine_similarity_score']
    if 'ground_truth' in row:
        resp['ground_truth'] = row['ground_truth']
    # save the judge's response as a JSON file in a directory per judge model and eval method
    model_eval_completions_dir: str = os.path.join(RESULTS_DIR, MODEL_EVALUATION_JUDGE_COMPLETIONS_DIR)
    dir_path = os.path.join(model_eval_completions_dir, model_id, eval_method_name)
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"{model_id}_{eval_method_name}_{uuid}.json")

    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))

    return resp

# we use Ray to parallelize the evaluations
@ray.remote
def async_run_eval(i: int, total: int, row: Dict, model_id: str, eval_method_name: str, uuid: str) -> Dict:
    print(f"async_run_eval, i={i}, total={total}, judge_model_info={model_id}, eval_method: {eval_method_name}, uuid: {uuid}")
    return run_eval(i, total, row, model_id, eval_method_name, uuid)

In [22]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
df_per_inference_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"eval_records_list has {len(df_per_inference_list)} entries")

[2024-07-12 16:36:34,557] p27460 {2376022395.py:3} INFO - eval_records_list has 360 entries


### Prepare evaluation prompt templates
---

This portion of the step prepares the evaluation prompt templates that the PoLL uses for `Max Voting` or `Average Pooling`.

In [23]:
model_eval_subjective_info

{'judge_panel_list': [{'model_id': 'anthropic.claude-3-haiku-20240307-v1:0',
   'eval_prompt_template_dir': 'claude_eval_prompt_templates',
   'eval_prompt_template_name_prefix': 'claude_eval'},
  {'model_id': 'meta.llama3-70b-instruct-v1:0',
   'eval_prompt_template_dir': 'llama3_eval_prompt_templates',
   'eval_prompt_template_name_prefix': 'llama3_eval'},
  {'model_id': 'cohere.command-r-v1:0',
   'eval_prompt_template_dir': 'cohere_eval_prompt_templates',
   'eval_prompt_template_name_prefix': 'cohere_eval'}],
 'inference_parameters': {'temperature': 0.1,
  'max_tokens': 300,
  'top_p': 0.92,
  'caching': False},
 'run_parallel_inference_count': 10}

In [24]:
# get the method that is being used to evaluate the content (which is either 
# max voting or average pooling)
method_name: str = config['model_evaluations']['PoLL_Composition_and_Voting'].get('method', None)
logger.info(f"The evaluation method FMBench is going to use to evaluate different model responses: {method_name}")
logger.info(f"judge panel being used to evaluate model responses: {model_eval_subjective_info.get('judge_panel_list', None)}")

[2024-07-12 16:36:34,566] p27460 {2514474553.py:4} INFO - The evaluation method FMBench is going to use to evaluate different model responses: max_voting
[2024-07-12 16:36:34,567] p27460 {2514474553.py:5} INFO - judge panel being used to evaluate model responses: [{'model_id': 'anthropic.claude-3-haiku-20240307-v1:0', 'eval_prompt_template_dir': 'claude_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'claude_eval'}, {'model_id': 'meta.llama3-70b-instruct-v1:0', 'eval_prompt_template_dir': 'llama3_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'llama3_eval'}, {'model_id': 'cohere.command-r-v1:0', 'eval_prompt_template_dir': 'cohere_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'cohere_eval'}]


In [25]:
# Iterate through each LLM as a judge and each evaluation criterion
for llm_info in model_eval_subjective_info.get('judge_panel_list', None):
    model_id = llm_info['model_id']
    eval_prompt_template_fname: str = f"{llm_info.get('eval_prompt_template_name_prefix', None)}_{method_name}.txt"
    eval_prompt_template_dir = llm_info.get('eval_prompt_template_dir', None)
    eval_prompt_template_path: str = os.path.join(EVAL_DIR, eval_prompt_template_dir, eval_prompt_template_fname)
    logger.info(f"evaluation prompt template file path being used for {model_id}: {eval_prompt_template_path}")
    logger.info(f"evaluation prompt template file name: {eval_prompt_template_fname}")
    try:
        eval_prompt_template = Path(eval_prompt_template_path).read_text()
    except FileNotFoundError:
        logger.error(f"File not found: {eval_prompt_template_path}")
        continue

    print(f"Evaluation prompt template being used: {eval_prompt_template}")
    eval_instructions_fname: str = next((rule for rule in config['s3_read_data']['eval_instructions_files'] if method_name in rule), None)
    rules = Path(os.path.join(EVAL_DIR, eval_instructions_fname)).read_text()
    logger.info(f"rules: {rules}")
    column_name = f"{model_id}_{method_name}_eval_prompt"
    logger.info(f"column_name: {column_name}")

    df_per_inference[column_name] = df_per_inference.apply(
        lambda r: prepare_eval_prompts(
            eval_prompt_template,
            r['completion'],
            rules,
            r['prompt'],
            r['ground_truth']
        ),
        axis=1
    )

[2024-07-12 16:36:34,574] p27460 {578681319.py:7} INFO - evaluation prompt template file path being used for anthropic.claude-3-haiku-20240307-v1:0: eval_criteria_prompts/claude_eval_prompt_templates/claude_eval_max_voting.txt
[2024-07-12 16:36:34,575] p27460 {578681319.py:8} INFO - evaluation prompt template file name: claude_eval_max_voting.txt
[2024-07-12 16:36:34,575] p27460 {578681319.py:18} INFO - rules: 1. Your role is to evaluate whether the answer is "correct" or "incorrect" compared to the ground truth
provided and the question in the context.

2. Your response should be a JSON containing 2 main elements: "verdict" and "explanation". In the "verdict"
field of the JSON response, you should mention whether the question is "correct" or "incorrect" based on the 
comparison of the answer to the ground truth provided. The "explanation" field of the JSON contains the 
reason why the answer is correct or incorrect after your evaluation of it against the ground truth.

3. Make sure to

Evaluation prompt template being used: Human: You are a judge who evaluates the correctness of the answer to a given question in the context 
in the <context></context> tags. Your role is to evaluate whether the answer provided in the <answer></answer> 
tags is correct compared to the ground truth answer provided in the <ground_truth></ground_truth> xml tags.

Follow the instructions below while giving your evaluation in the <evaluation_instructions></evaluation_instructions>
tags:

<evaluation_instructions>
{rules}
</evaluation_instructions>

Refer to the context below in the <context></context> xml tags:
<context>
{context}
</context>

Refer to the answer to be evaluated in the <answer></answer> tags:
<answer>
{answer}
</answer> 

Refer to the ground truth to the question in the context below in the <ground_truth></ground_truth> xml tags: 
<ground_truth>
{ground_truth}
</ground_truth> 

Assistant: Sure, here is my evaluation in JSON:
Evaluation prompt template being used: <|begin_of_

In [26]:
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_eval_prompt_payloads = csv_buffer.getvalue()
eval_prompt_payloads_for_inference = os.path.join(METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_eval_prompt_payloads, BUCKET_NAME, "", 
            METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)
logger.info(f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{eval_prompt_payloads_for_inference}")

df_per_inference.head()

[2024-07-12 16:36:34,909] p27460 {4247376453.py:9} INFO - Per inference cosine similarity scores saved to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/processed_eval_prompts_for_inference.csv


Unnamed: 0,endpoint_name,prompt,ground_truth,temperature,max_tokens,top_p,completion,prompt_tokens,completion_tokens,latency,...,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score,anthropic.claude-3-haiku-20240307-v1:0_max_voting_eval_prompt,meta.llama3-70b-instruct-v1:0_max_voting_eval_prompt,cohere.command-r-v1:0_max_voting_eval_prompt
0,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.440897,...,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724,Human: You are a judge who evaluates the corre...,<|begin_of_text|><|start_header_id|>user<|end_...,You are a judge who evaluates the correctness ...
1,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.45033,...,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724,Human: You are a judge who evaluates the corre...,<|begin_of_text|><|start_header_id|>user<|end_...,You are a judge who evaluates the correctness ...
2,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.461762,...,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724,Human: You are a judge who evaluates the corre...,<|begin_of_text|><|start_header_id|>user<|end_...,You are a judge who evaluates the correctness ...
3,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.453813,...,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724,Human: You are a judge who evaluates the corre...,<|begin_of_text|><|start_header_id|>user<|end_...,You are a judge who evaluates the correctness ...
4,mistral.mistral-7b-instruct-v0:2,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,0.1,100,0.92,The genera Sinofranchetia and Stauntonia belon...,319,26,0.435938,...,mistral.mistral-7b-instruct-v0:2,1.0,,,,,0.710724,Human: You are a judge who evaluates the corre...,<|begin_of_text|><|start_header_id|>user<|end_...,You are a judge who evaluates the correctness ...


In [27]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
eval_records_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"eval_records_list has {len(eval_records_list)} entries")

[2024-07-12 16:36:34,950] p27460 {3717042138.py:3} INFO - eval_records_list has 360 entries


### Run LLM as a Judge Evaluations
---

In this portion of the step, FMBench performs the following actions:

1. If the evaluation method is `Max Voting`, we assume that the dataset already contains a ground truth for the question in the context or task. The panel of LLM judges (in this case 3 judges) gives a verdict on whether the `answer` from each candidate model is `correct` or `incorrect` when compared to the ground truth.

1. If the evaluation method is `Average Pooling`, we assume that the completions from the candidate models should be evaluated on more subjective criteria, rather than only on whether they are correct or incorrect compared to the ground truth. In this case, the Judge Panel uses the average pooling prompt templates to rate each model completion on a scale of 1-5 on different criteria, such as relevancy, helpfulness, correctness, and so on.

1. Each judge model response is returned as a JSON structure, which is used in downstream analytics to compare evaluation results across the candidate models.

***This step takes about 6 minutes to complete. Model completion time depends on the PoLL models being used. `Llama3-70b`, `Cohere command-r-v1` and `claude 3 haiku` were used for this example***
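For `Max Voting`, the verdicts of the individual judges can be combined into a single panel verdict by a simple majority. The sketch below is one way to do that. It is an assumption about how votes could be combined, not the notebook's own implementation, and `majority_verdict` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Optional

def majority_verdict(verdicts: List[Optional[str]]) -> Optional[str]:
    """Return the verdict chosen by most judges, or None on a tie or no valid votes."""
    # Ignore missing or malformed verdicts (for example, unparsable judge output)
    votes = Counter(v for v in verdicts if v in ("correct", "incorrect"))
    if not votes:
        return None
    ranked = votes.most_common()
    # A tie between the top two verdicts yields no panel decision
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

print(majority_verdict(["correct", "incorrect", "correct"]))
```

With an odd number of judges and no missing verdicts, a tie cannot occur. The tie check matters once a judge response fails to parse.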

In [28]:
# get the llm as a judge panel list
judge_panel_list = model_eval_subjective_info.get('judge_panel_list', None)
logger.info(f"The judge panel list contains {len(judge_panel_list)} judges: {judge_panel_list}")

[2024-07-12 16:36:34,954] p27460 {602205354.py:3} INFO - The judge panel list contains 3 judges: [{'model_id': 'anthropic.claude-3-haiku-20240307-v1:0', 'eval_prompt_template_dir': 'claude_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'claude_eval'}, {'model_id': 'meta.llama3-70b-instruct-v1:0', 'eval_prompt_template_dir': 'llama3_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'llama3_eval'}, {'model_id': 'cohere.command-r-v1:0', 'eval_prompt_template_dir': 'cohere_eval_prompt_templates', 'eval_prompt_template_name_prefix': 'cohere_eval'}]


In [31]:
n = model_eval_subjective_info.get('run_parallel_inference_count', 5)
list_of_lists = [eval_records_list[i * n:(i + 1) * n] for i in range((len(eval_records_list) + n - 1) // n)]
resp_list = []
st = time.perf_counter()

# Iterate over the judge panel and sublists
for judge_panelist_info in judge_panel_list:
    logger.info(f"============Running inference for judge panelist {judge_panelist_info['model_id']} for {method_name} ============")
    for idx, sublist in enumerate(list_of_lists):
        model_id = judge_panelist_info['model_id']
        logger.info(f"getting inference for list {idx + 1}/{len(list_of_lists)}, size of list={len(sublist)}")

        # Run inference in parallel
        resp_list.extend(ray.get([async_run_eval.remote(i + 1, len(sublist), record, model_id, method_name, record['uuid'])
                                  for i, record in enumerate(sublist)]))

elapsed_time = time.perf_counter() - st
logger.info(f"Total elapsed time for inference: {elapsed_time:.2f} seconds")

[2024-07-12 16:38:58,900] p27460 {3777942767.py:11} INFO - getting inference for list 1/36, size of list=10
[2024-07-12 16:39:01,639] p27460 {3777942767.py:11} INFO - getting inference for list 2/36, size of list=10
[2024-07-12 16:39:03,834] p27460 {3777942767.py:11} INFO - getting inference for list 3/36, size of list=10
[2024-07-12 16:39:06,366] p27460 {3777942767.py:11} INFO - getting inference for list 4/36, size of list=10
[2024-07-12 16:39:08,851] p27460 {3777942767.py:11} INFO - getting inference for list 5/36, size of list=10
[2024-07-12 16:39:11,269] p27460 {3777942767.py:11} INFO - getting inference for list 6/36, size of list=10
[2024-07-12 16:39:14,318] p27460 {3777942767.py:11} INFO - getting inference for list 7/36, size of list=10
[2024-07-12 16:39:16,838] p27460 {3777942767.py:11} INFO - getting inference for list 8/36, size of list=10
[2024-07-12 16:39:19,074] p27460 {3777942767.py:11} INFO - getting inference for list 9/36, size of list=10
[2024-07-12 16:39:22,004] p2
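The batching used in the loop above can be checked in isolation. The sketch below reuses the same slicing expression in a hypothetical `chunk` helper and applies it to placeholder records instead of real evaluation rows.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def chunk(records: Sequence[T], n: int) -> List[Sequence[T]]:
    """Split records into consecutive batches of at most n items."""
    # (len + n - 1) // n is ceiling division, so a partial last batch is kept
    return [records[i * n:(i + 1) * n] for i in range((len(records) + n - 1) // n)]

batches = chunk(list(range(360)), 10)
print(len(batches))

uneven = chunk(list(range(25)), 10)
print([len(b) for b in uneven])
```

With 360 records and a batch size of 10, each judge processes 36 batches, which matches the `list 1/36` entries in the log. Three judges then produce 1080 results in total.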

### Perform downstream analytical tasks on each PoLL evaluation result
---

In [32]:
# convert the results list into a dataframe for easy analytics
df_eval_results = pd.DataFrame(resp_list)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.head()

[2024-07-12 16:44:46,615] p27460 {1134668164.py:3} INFO - df_eval_results shape=(1080, 10)


Unnamed: 0,exception,prompt,completion,completion_token_count,prompt_token_count,model_id,candidate_model_response,candidate_model,cosine_similarity_score,ground_truth
0,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...
1,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...
2,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...
3,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...
4,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...


In [33]:
# parse out the completion from LLM as a judge and column bind
# the fields of the dictionary to the original results dataframe
df_eval_results_only = df_eval_results['completion'].apply(parse_as_json).apply(pd.Series)
df_eval_results = pd.concat([df_eval_results, df_eval_results_only], axis=1)
df_eval_results.rename(columns={'model_id': 'judge_model_id'}, inplace=True)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.head()

[2024-07-12 16:44:47,691] p27460 {4249199706.py:6} INFO - df_eval_results shape=(1080, 12)


parse_as_json, error parsing string as json, string={  "verdict": "correct",  "explanation": "The answer provided, "Reality television shows", is correct compared to the ground truth answer "American reality television series". Both WAGS Atlanta and WAGS are American reality television series that chronicle the lives of wives and girlfriends of professional athletes."}
parse_as_json, error parsing string as json, string={  "verdict": "correct",  "explanation": "The answer provided correctly identifies that both WAGS Atlanta and WAGS are American reality television series that chronicle the lives of "WAGs" (wives and girlfriends of high-profile athletes). The ground truth also confirms that they are both American reality television series, so the answer matches the ground truth."}
parse_as_json, error parsing string as json, string={  "verdict": "correct",  "explanation": "The answer provided correctly identifies that both WAGS Atlanta and WAGS are American reality television series tha

Unnamed: 0,exception,prompt,completion,completion_token_count,prompt_token_count,judge_model_id,candidate_model_response,candidate_model,cosine_similarity_score,ground_truth,verdict,explanation
0,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
1,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
2,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
3,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
4,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...


In [34]:
# send the raw results as a csv file to the S3 bucket
csv_buffer = io.StringIO()
df_eval_results.to_csv(csv_buffer, index=False)
eval_llm_as_a_judge_results = csv_buffer.getvalue()
eval_results_csv_fpath = os.path.join(METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(eval_llm_as_a_judge_results, BUCKET_NAME, "", 
            METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)
logger.info(f"Per PoLL model responses saved as a csv to s3://{BUCKET_NAME}/{eval_results_csv_fpath}")
df_eval_results.head()

[2024-07-12 16:44:48,463] p27460 {3367761925.py:10} INFO - Per PoLL model responses saved as a csv to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/raw_llm_as_a_judge_evals.csv


Unnamed: 0,exception,prompt,completion,completion_token_count,prompt_token_count,judge_model_id,candidate_model_response,candidate_model,cosine_similarity_score,ground_truth,verdict,explanation
0,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
1,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
2,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
3,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...
4,,Human: You are a judge who evaluates the corre...,"{\n ""verdict"": ""correct"",\n ""explanation"": ""...",63,856,anthropic.claude-3-haiku-20240307-v1:0,The genera Sinofranchetia and Stauntonia belon...,mistral.mistral-7b-instruct-v0:2,0.710724,a genus of flowering plant in the Lardizabalac...,correct,The answer provided correctly states that the ...


In [35]:
panel_summary_responses_df = df_eval_results.groupby(['judge_model_id', 'candidate_model', 'verdict']).size().unstack(fill_value=0)
panel_summary_responses_df.reset_index(inplace=True)
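One property of this aggregation is worth noting: by default, `groupby` drops rows whose grouping key is missing. A judge response that could not be parsed has no `verdict`, so it is counted as neither `correct` nor `incorrect`. This may explain summary rows whose counts add up to fewer responses than were evaluated. The toy data below is hypothetical and only illustrates the behavior.

```python
import pandas as pd

# Hypothetical judge results; the last row stands for an unparsable response
toy = pd.DataFrame({
    "judge_model_id": ["judge-1"] * 4,
    "candidate_model": ["model-a"] * 4,
    "verdict": ["correct", "correct", "incorrect", None],
})

# Same aggregation as in the notebook cell
counts = (
    toy.groupby(["judge_model_id", "candidate_model", "verdict"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
print(counts)

# The row with the missing verdict is not part of either count
counted = int(counts.loc[0, "correct"]) + int(counts.loc[0, "incorrect"])
print(f"{counted} of {len(toy)} responses counted")
```

Counting missing verdicts separately (for example with `verdict.isna().sum()` per group) makes such gaps visible in the summary.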

In [36]:
# send the per judge summary of verdicts as a csv file to the S3 bucket
csv_buffer = io.StringIO()
panel_summary_responses_df.to_csv(csv_buffer, index=False)
panel_summary_responses = csv_buffer.getvalue()
llm_as_a_judge_per_eval_summary_fpath = os.path.join(METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(panel_summary_responses, BUCKET_NAME, "", 
            METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)
logger.info(f"Summary on each eval (max voting/average pooling) for each panel judge sent to s3://{BUCKET_NAME}/{llm_as_a_judge_per_eval_summary_fpath}")
panel_summary_responses_df.head(40)

[2024-07-12 16:44:50,952] p27460 {1355189319.py:10} INFO - Summary on each eval (max voting/average pooling) for each panel judge sent to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/llm_as_a_judge_per_eval_summary.csv


verdict,judge_model_id,candidate_model,correct,incorrect
0,anthropic.claude-3-haiku-20240307-v1:0,ai21.j2-mid-v1,30,0
1,anthropic.claude-3-haiku-20240307-v1:0,ai21.j2-ultra-v1,30,0
2,anthropic.claude-3-haiku-20240307-v1:0,amazon.titan-text-express-v1,30,0
3,anthropic.claude-3-haiku-20240307-v1:0,amazon.titan-text-lite-v1,26,3
4,anthropic.claude-3-haiku-20240307-v1:0,anthropic.claude-3-haiku-20240307-v1:0,27,0
5,anthropic.claude-3-haiku-20240307-v1:0,anthropic.claude-3-sonnet-20240229-v1:0,30,0
6,anthropic.claude-3-haiku-20240307-v1:0,cohere.command-light-text-v14,23,6
7,anthropic.claude-3-haiku-20240307-v1:0,cohere.command-text-v14,27,3
8,anthropic.claude-3-haiku-20240307-v1:0,meta.llama2-13b-chat-v1,27,3
9,anthropic.claude-3-haiku-20240307-v1:0,meta.llama2-70b-chat-v1,30,0
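
The summary table above comes from counting rows per (judge, candidate, verdict) group and pivoting the verdict level into columns. A minimal sketch on made-up data shows the same `groupby` / `size` / `unstack` pattern; the model names and verdicts here are hypothetical and not taken from the run above.

```python
import pandas as pd

# Hypothetical verdicts: one row per (judge, candidate, question) evaluation
toy_eval_df = pd.DataFrame({
    "judge_model_id": ["judge-a", "judge-a", "judge-a", "judge-b", "judge-b", "judge-b"],
    "candidate_model": ["model-x", "model-x", "model-y", "model-x", "model-y", "model-y"],
    "verdict": ["correct", "incorrect", "correct", "correct", "correct", "correct"],
})

# Count rows per group, then turn each distinct verdict into its own column.
# fill_value=0 covers groups that never received a given verdict.
toy_summary_df = (
    toy_eval_df
    .groupby(["judge_model_id", "candidate_model", "verdict"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
print(toy_summary_df)
```

Any verdict string that is neither `correct` nor `incorrect` would show up as an extra column with this pattern, which can be a quick way to spot malformed judge responses.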


#### Calculate the overall accuracy of each model scored by the PoLL
---

In [37]:
per_panel_judgement_result_df = panel_summary_responses_df.groupby(['candidate_model', 'judge_model_id']).sum()
# Compute the accuracy and error rate (in percent) for each (candidate model, judge) pair.
# The ratio is rounded to 2 decimals before scaling by 100, so values land on whole percentages.
total_verdicts = per_panel_judgement_result_df['correct'] + per_panel_judgement_result_df['incorrect']
per_panel_judgement_result_df['accuracy'] = (per_panel_judgement_result_df['correct'] / total_verdicts).round(2) * 100
per_panel_judgement_result_df['error_rate'] = (per_panel_judgement_result_df['incorrect'] / total_verdicts).round(2) * 100
per_panel_judgement_result_df = per_panel_judgement_result_df[['accuracy', 'error_rate']].reset_index()
per_panel_judgement_result_df.head(40)

verdict,candidate_model,judge_model_id,accuracy,error_rate
0,ai21.j2-mid-v1,anthropic.claude-3-haiku-20240307-v1:0,100.0,0.0
1,ai21.j2-mid-v1,cohere.command-r-v1:0,100.0,0.0
2,ai21.j2-mid-v1,meta.llama3-70b-instruct-v1:0,100.0,0.0
3,ai21.j2-ultra-v1,anthropic.claude-3-haiku-20240307-v1:0,100.0,0.0
4,ai21.j2-ultra-v1,cohere.command-r-v1:0,90.0,10.0
5,ai21.j2-ultra-v1,meta.llama3-70b-instruct-v1:0,100.0,0.0
6,amazon.titan-text-express-v1,anthropic.claude-3-haiku-20240307-v1:0,100.0,0.0
7,amazon.titan-text-express-v1,cohere.command-r-v1:0,93.0,7.0
8,amazon.titan-text-express-v1,meta.llama3-70b-instruct-v1:0,100.0,0.0
9,amazon.titan-text-lite-v1,anthropic.claude-3-haiku-20240307-v1:0,90.0,10.0
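
The percentages above are whole numbers because the ratio is rounded to two decimals before it is multiplied by 100. A minimal sketch in plain Python compares the two orderings, using the 26 correct / 3 incorrect counts reported for `amazon.titan-text-lite-v1` by the Claude 3 Haiku judge; the function names are illustrative only.

```python
def accuracy_pct_round_then_scale(correct: int, incorrect: int) -> float:
    """Round the ratio first, then scale (the ordering used in the cell above)."""
    return round(correct / (correct + incorrect), 2) * 100


def accuracy_pct_scale_then_round(correct: int, incorrect: int) -> float:
    """Scale to a percentage first, then round, which keeps two decimals."""
    return round(correct / (correct + incorrect) * 100, 2)


print(accuracy_pct_round_then_scale(26, 3))
print(accuracy_pct_scale_then_round(26, 3))
```

Rounding first is coarser and can occasionally leave floating-point residue after the multiplication; scaling first keeps more precision. Which one to report is a presentation choice.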


In [38]:
# Calculate mean cosine similarity for each candidate model
mean_cosine_similarity = df_eval_results.groupby('candidate_model')['cosine_similarity_score'].mean().reset_index()
mean_cosine_similarity = mean_cosine_similarity.rename(columns={'cosine_similarity_score': 'mean_cosine_similarity'})
mean_cosine_similarity

Unnamed: 0,candidate_model,mean_cosine_similarity
0,ai21.j2-mid-v1,0.40804
1,ai21.j2-ultra-v1,0.418304
2,amazon.titan-text-express-v1,0.477337
3,amazon.titan-text-lite-v1,0.537419
4,anthropic.claude-3-haiku-20240307-v1:0,0.400099
5,anthropic.claude-3-sonnet-20240229-v1:0,0.433328
6,cohere.command-light-text-v14,0.590262
7,cohere.command-text-v14,0.338896
8,meta.llama2-13b-chat-v1,0.431822
9,meta.llama2-70b-chat-v1,0.438103


In [39]:
# Sum only the numeric verdict counts so the judge id strings are not aggregated
overall_accuracy_grouped_panel_df = panel_summary_responses_df.groupby('candidate_model')[['correct', 'incorrect']].sum()
# Compute the accuracy and error rate (in percent) of each candidate model, pooled across all judges
total_verdicts = overall_accuracy_grouped_panel_df['correct'] + overall_accuracy_grouped_panel_df['incorrect']
overall_accuracy_grouped_panel_df['accuracy'] = (overall_accuracy_grouped_panel_df['correct'] / total_verdicts).round(2) * 100
overall_accuracy_grouped_panel_df['error_rate'] = (overall_accuracy_grouped_panel_df['incorrect'] / total_verdicts).round(2) * 100
overall_accuracy_grouped_panel_df = overall_accuracy_grouped_panel_df[['accuracy', 'error_rate']].reset_index()
# Merge in the mean cosine similarity. An inner merge keeps the row order of the left frame
# (mean_cosine_similarity), so sorting by accuracy before the merge has no effect on the result;
# the ranking by accuracy is done in a later cell.
overall_accuracy_grouped_panel_df = pd.merge(mean_cosine_similarity, overall_accuracy_grouped_panel_df, on='candidate_model')
overall_accuracy_grouped_panel_df

Unnamed: 0,candidate_model,mean_cosine_similarity,accuracy,error_rate
0,ai21.j2-mid-v1,0.40804,100.0,0.0
1,ai21.j2-ultra-v1,0.418304,97.0,3.0
2,amazon.titan-text-express-v1,0.477337,98.0,2.0
3,amazon.titan-text-lite-v1,0.537419,89.0,11.0
4,anthropic.claude-3-haiku-20240307-v1:0,0.400099,100.0,0.0
5,anthropic.claude-3-sonnet-20240229-v1:0,0.433328,100.0,0.0
6,cohere.command-light-text-v14,0.590262,83.0,17.0
7,cohere.command-text-v14,0.338896,90.0,10.0
8,meta.llama2-13b-chat-v1,0.431822,90.0,10.0
9,meta.llama2-70b-chat-v1,0.438103,100.0,0.0
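
The merged table above is ordered by `candidate_model` rather than by accuracy, because an inner `pd.merge` follows the key order of its left frame. A minimal sketch with hypothetical models and scores shows this, and one way to get a ranked table by sorting after the merge:

```python
import pandas as pd

# Hypothetical inputs; names and scores are made up for illustration
similarity_df = pd.DataFrame({
    "candidate_model": ["model-a", "model-b", "model-c"],
    "mean_cosine_similarity": [0.41, 0.59, 0.34],
})
accuracy_df = pd.DataFrame({
    "candidate_model": ["model-a", "model-b", "model-c"],
    "accuracy": [100.0, 83.0, 90.0],
}).sort_values(by="accuracy", ascending=False)  # this ordering is lost by the merge

# The inner merge preserves the order of the left frame's keys
merged_df = pd.merge(similarity_df, accuracy_df, on="candidate_model")

# Sorting after the merge yields the ranked view
ranked_df = merged_df.sort_values(by="accuracy", ascending=False).reset_index(drop=True)
print(ranked_df)
```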


In [40]:
# send the accuracy metrics to s3
csv_buffer = io.StringIO()
overall_accuracy_grouped_panel_df.to_csv(csv_buffer, index=False)
overall_panel_result = csv_buffer.getvalue()
overall_panel_accuracy_metrics_fpath = os.path.join(METRICS_DIR, PER_MODEL_ACCURACY_POLL)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(overall_panel_result, BUCKET_NAME, "", 
            METRICS_DIR, PER_MODEL_ACCURACY_POLL)
logger.info(f"Overall accuracy and error rates results of each model sent to s3://{BUCKET_NAME}/{overall_panel_accuracy_metrics_fpath}")
overall_accuracy_grouped_panel_df.head(10)

[2024-07-12 16:45:10,049] p27460 {1651915844.py:10} INFO - Overall accuracy and error rates results of each model sent to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/PoLL_per_model_accuracy.csv


Unnamed: 0,candidate_model,mean_cosine_similarity,accuracy,error_rate
0,ai21.j2-mid-v1,0.40804,100.0,0.0
1,ai21.j2-ultra-v1,0.418304,97.0,3.0
2,amazon.titan-text-express-v1,0.477337,98.0,2.0
3,amazon.titan-text-lite-v1,0.537419,89.0,11.0
4,anthropic.claude-3-haiku-20240307-v1:0,0.400099,100.0,0.0
5,anthropic.claude-3-sonnet-20240229-v1:0,0.433328,100.0,0.0
6,cohere.command-light-text-v14,0.590262,83.0,17.0
7,cohere.command-text-v14,0.338896,90.0,10.0
8,meta.llama2-13b-chat-v1,0.431822,90.0,10.0
9,meta.llama2-70b-chat-v1,0.438103,100.0,0.0
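
The accuracy above pools every individual judge verdict for a candidate model. Another reading of max voting is to take the majority verdict of the panel per question and score the candidate on those final verdicts. The sketch below illustrates that alternative in plain Python; it is not the computation used in this notebook, and the question ids, the votes, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common verdict among the judges.

    Ties resolve to 'incorrect' (a conservative assumption, not from the notebook).
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "incorrect"
    return ranked[0][0]


# Hypothetical votes from a three-judge panel for one candidate model
votes_by_question = {
    "q1": ["correct", "correct", "incorrect"],
    "q2": ["incorrect", "incorrect", "correct"],
    "q3": ["correct", "correct", "correct"],
}

final_verdicts = {q: majority_verdict(v) for q, v in votes_by_question.items()}
majority_accuracy = sum(v == "correct" for v in final_verdicts.values()) / len(final_verdicts)
print(final_verdicts, round(majority_accuracy * 100, 2))
```

With an odd number of judges and only two possible verdicts, ties cannot occur, so the tie-breaking rule matters only if a judge response is missing or malformed.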


In [41]:
# Build the inputs for the accuracy report:
# rank models by accuracy, highest first
ranked_models = overall_accuracy_grouped_panel_df.sort_values(by='accuracy', ascending=False)
highest_accuracy = ranked_models['accuracy'].max()

# Group models with the highest accuracy
top_performers = ranked_models[ranked_models['accuracy'] == highest_accuracy]
other_models = ranked_models[ranked_models['accuracy'] < highest_accuracy]
final_ranking = pd.concat([top_performers, other_models])
unique_judge_model_ids = per_panel_judgement_result_df['judge_model_id'].unique()
PoLL_model_ids = ', '.join(map(str, unique_judge_model_ids))
top_performing_model_ids = ', '.join(top_performers['candidate_model'].tolist())

# Find the candidate model with the highest mean cosine similarity
highest_cosine_model = final_ranking.loc[final_ranking['mean_cosine_similarity'].idxmax()]
highest_cosine_model_name = highest_cosine_model['candidate_model']
highest_cosine_similarity = highest_cosine_model['mean_cosine_similarity']

In [47]:
accuracy_statement = MAX_VOTING_RESULT_STATEMENT.format(
    judge_model_ids=PoLL_model_ids,
    highest_accuracy=highest_accuracy,
    top_models=top_performers.to_string(index=False),
    highest_cosine_similarity=round(highest_cosine_similarity, 4),
    top_cosine_similarity_model=highest_cosine_model_name,
    ranked_models=other_models.to_string(index=False),
    top_performing_model_ids=top_performing_model_ids
)
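
`MAX_VOTING_RESULT_STATEMENT` is defined outside this section, so its exact text is not shown here. The sketch below uses a hypothetical template, reconstructed from the report printed further down, to show how the keyword arguments map onto `str.format` placeholders; the real template may differ, and all values passed in are made up.

```python
# Hypothetical template reconstructed from the printed report; the real
# MAX_VOTING_RESULT_STATEMENT is defined elsewhere and may differ.
HYPOTHETICAL_RESULT_STATEMENT = """
A Detailed Analysis of Model Performance Based on Accuracy using Panel of LLM Evaluators (PoLL):

This accuracy benchmarking was done using a Panel of LLM evaluators. {judge_model_ids} were used as judges.

Top Performing Models ({highest_accuracy}% Accuracy):
{top_models}

Top Performing Model ({highest_cosine_similarity} Cosine Similarity Score):
{top_cosine_similarity_model}

Other Ranked Models:
{ranked_models}
"""

statement = HYPOTHETICAL_RESULT_STATEMENT.format(
    judge_model_ids="judge-a, judge-b",
    highest_accuracy=100.0,
    top_models="model-x",
    highest_cosine_similarity=round(0.59026, 4),
    top_cosine_similarity_model="model-y",
    ranked_models="model-y, model-z",
    # Not referenced by this template; str.format ignores extra keyword arguments
    top_performing_model_ids="model-x",
)
print(statement)
```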

In [49]:
# send the overall accuracy report to s3
txt_buffer = io.StringIO()
txt_buffer.write(accuracy_statement)
poll_txt_file_content = txt_buffer.getvalue()
overall_panel_accuracy_metrics_fpath = os.path.join(METRICS_DIR, OVERALL_POLL_REPORT)
write_to_s3(poll_txt_file_content, BUCKET_NAME, "", 
            METRICS_DIR, OVERALL_POLL_REPORT)
logger.info(f"Overall accuracy and error rates results of each model sent to s3://{BUCKET_NAME}/{overall_panel_accuracy_metrics_fpath}")
print(accuracy_statement)

[2024-07-12 16:46:58,497] p27460 {84484965.py:8} INFO - Overall accuracy and error rates results of each model sent to s3://sagemaker-fmbench-write-us-west-2-387192758086/fmbench-bedrock-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=12/hh=16/mm=35/overall_PoLL_report.txt


 
A Detailed Analysis of Model Performance Based on Accuracy using Panel of LLM Evaluators (PoLL):

This accuracy benchmarking was done using a Panel of LLM evaluators. anthropic.claude-3-haiku-20240307-v1:0, cohere.command-r-v1:0, meta.llama3-70b-instruct-v1:0 were used as judges.

Top Performing Models (100.0% Accuracy):
                        candidate_model  mean_cosine_similarity  accuracy  error_rate
                         ai21.j2-mid-v1                0.408040     100.0         0.0
       mistral.mistral-7b-instruct-v0:2                0.424023     100.0         0.0
anthropic.claude-3-sonnet-20240229-v1:0                0.433328     100.0         0.0
 anthropic.claude-3-haiku-20240307-v1:0                0.400099     100.0         0.0
                meta.llama2-70b-chat-v1                0.438103     100.0         0.0

Top Performing Model (0.5903 Cosine Similarity Score):
cohere.command-light-text-v14

Other Ranked Models:
                   candidate_model  mean_cosine_sim