## Evaluate all inference files: gather quantitative metrics (such as _Cosine Similarity_) and subjective ratings on various criteria using an _LLM as a judge_
---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

#### This step of the solution focuses on evaluating the quality of responses. It gathers the following information and performs the steps below:

- **Gets all per-inference request files**: This step first reads all of the per-inference request files into a dataframe that contains the response from the LLM as well as the ground truth, if any is provided. 

- **Generates quantitative metrics for evaluation**: Calculates quantitative metrics that measure similarity and accuracy, for example _Cosine Similarity_. This gives an overall quantitative score across the entire dataset that shows which model generates outputs most similar to the ground truth (if any is provided). With this statistic, customers and users in the open source community can make business-level judgements. 

- **Uses an _LLM as a judge_ approach to get subjective evaluations**: This step uses a Large Language Model (LLM) that _acts as a judge_: it evaluates the output of the other LLMs that FMBench uses during the inference step. This step helps in the following ways:

    1. Assists users with a subjective evaluation, making the evaluation process more streamlined and personalized for their use case
    
    2. Generates an evaluation rating for each task on a scale of 1-5 and accumulates the overall rating across multiple evaluation criteria

- **Ground Truth Evaluation**: WIP
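
The accumulation of per-criterion ratings described above can be pictured with a small sketch. It assumes, hypothetically, that each judge result is a record holding a model name, a criterion, and a 1-5 rating. The field names, the model names, and the plain mean used here are illustrative choices, not FMBench's actual schema or aggregation method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: one record per (model, criterion) evaluation.
ratings = [
    {"model": "model-a", "criterion": "Relevance", "rating": 4},
    {"model": "model-a", "criterion": "Correctness", "rating": 5},
    {"model": "model-b", "criterion": "Relevance", "rating": 3},
    {"model": "model-b", "criterion": "Correctness", "rating": 2},
]


def overall_ratings(records):
    """Average the 1-5 ratings per model across all criteria."""
    by_model = defaultdict(list)
    for rec in records:
        if not 1 <= rec["rating"] <= 5:
            raise ValueError(f"rating out of range: {rec}")
        by_model[rec["model"]].append(rec["rating"])
    return {model: mean(vals) for model, vals in by_model.items()}


overall = overall_ratings(ratings)
```

A weighted mean or a per-criterion breakdown would fit the same structure if some criteria matter more for a given use case.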

#### Import all of the necessary libraries below to run this notebook

In [1]:
# if interactive mode is set to "no" -> pick up fmbench from the Python installation path
# if interactive mode is set to "yes" or is not defined -> pick up fmbench from the current path (one level above this notebook)
# the premise is that a non-interactive run can only happen through main.py, which sets interactive mode to "no"
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [38]:
import io
import ray
import math
import json
import tempfile
import logging  # used by logging.basicConfig below; the datetime class is imported from datetime further down
import matplotlib
import numpy as np
import pandas as pd

# Import seaborn and other related libraries for visualizations and plotting charts
import seaborn as sns
from pathlib import Path
from tomark import Tomark
from fmbench.utils import *
from fmbench.globals import *
from datetime import datetime
from datetime import timezone
from dateutil.parser import parse
from typing import List, Optional, Dict
import importlib.resources as pkg_resources
from fmbench import __version__ as fmbench_version

In [39]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [40]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Python version: 3.11.9
Ray version: 2.32.0


Load the `config.yml` file. It contains information that is used across this benchmarking environment, such as the AWS account details, the prompts, and the payloads to be used for invocations.

In [4]:
logger.info(f"CONFIG_FILE={CONFIG_FILE}")
config = load_main_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

[2024-07-10 23:53:13,660] p23381 {2445076252.py:1} INFO - CONFIG_FILE=configs/llama2/7b/config-llama2-7b-g5-quick.yml


region_name=us-west-2


[2024-07-10 23:53:13,943] p23381 {2445076252.py:3} INFO - {
  "general": {
    "name": "llama2-7b-v1",
    "model_name": "Llama2-7b"
  },
  "aws": {
    "region": "us-west-2",
    "sagemaker_execution_role": "arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role",
    "bucket": "sagemaker-fmbench-write-us-west-2-387192758086"
  },
  "dir_paths": {
    "data_prefix": "data",
    "prompts_prefix": "prompts",
    "all_prompts_file": "all_prompts.csv",
    "metrics_dir": "metrics",
    "models_dir": "models",
    "metadata_dir": "metadata"
  },
  "s3_read_data": {
    "read_bucket": "sagemaker-fmbench-read-us-west-2-387192758086",
    "scripts_prefix": "scripts",
    "script_files": [
      "hf_token.txt"
    ],
    "configs_prefix": "configs",
    "config_files": [
      "pricing.yml"
    ],
    "source_data_prefix": "source_data",
    "source_data_files": [
      "2wikimqa_e.jsonl",
      "2wikimqa.jsonl",
      "hotpotqa_e.jsonl",
      "hotpotqa.jsonl",
      "narrativeqa.jsonl",

role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role


#### Load the associated pricing config file

In [5]:
# represents getting the config file from the s3 bucket/https path for pricing yml information
pricing_file_path: str = config['pricing'] 

# initialize the pricing config file to None
pricing_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

pricing_module = Path(config['pricing'])
logger.info(f"pricing config provided for inference from this model is --> {pricing_module}")
pricing_file_path = os.path.join(config_dir, pricing_module)
logger.info(f"pricing config file path is --> {pricing_file_path}")

pricing_config = load_config(pricing_file_path)
logger.info(f"pricing config file recorded: {json.dumps(pricing_config, indent=2)}")

[2024-07-10 23:53:13,949] p23381 {2131877439.py:9} INFO - Using fmbench.configs directory: /home/ec2-user/anaconda3/envs/fmbench_python311/lib/python3.11/site-packages/fmbench/configs
[2024-07-10 23:53:13,950] p23381 {2131877439.py:12} INFO - pricing config provided for inference from this model is --> pricing.yml
[2024-07-10 23:53:13,951] p23381 {2131877439.py:14} INFO - pricing config file path is --> /home/ec2-user/anaconda3/envs/fmbench_python311/lib/python3.11/site-packages/fmbench/configs/pricing.yml


region_name=us-west-2


[2024-07-10 23:53:14,226] p23381 {2131877439.py:17} INFO - pricing config file recorded: {
  "pricing": {
    "instance_based": {
      "ml.m5.xlarge": 0.23,
      "ml.g5.xlarge": 1.4084,
      "ml.g5.2xlarge": 1.515,
      "ml.g5.12xlarge": 7.09,
      "ml.g5.24xlarge": 10.18,
      "ml.g5.48xlarge": 20.36,
      "ml.inf2.xlarge": 0.99,
      "ml.inf2.8xlarge": 2.36,
      "ml.inf2.24xlarge": 7.79,
      "ml.inf2.48xlarge": 15.58,
      "ml.trn1.32xlarge": 28.497,
      "ml.p4d.24xlarge": 37.688,
      "ml.p5.48xlarge": 113.068,
      "ml.p3.2xlarge": 3.825,
      "ml.g4dn.12xlarge": 4.89,
      "ml.g6.2xlarge": 1.222,
      "ml.g6.16xlarge": 4.246,
      "ml.g6.12xlarge": 5.752,
      "ml.g6.24xlarge": 8.344,
      "ml.g6.48xlarge": 16.688,
      "anthropic.claude-v3-sonnet-pt-nc": 88,
      "m5.xlarge": 0.192,
      "g5.xlarge": 1.006,
      "g5.2xlarge": 1.212,
      "g5.12xlarge": 5.672,
      "g5.24xlarge": 8.144,
      "g5.48xlarge": 16.288,
      "inf2.xlarge": 0.7582,
      "i

role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::387192758086:role/fmbench-stack-us-west-2-role


In [6]:
pwd

'/home/ec2-user/SageMaker/foundation-model-benchmarking-tool/src/fmbench'

In [7]:
debug = False
if debug is True:
    metrics_path_file: str = os.path.join("..", "..", METADATA_DIR, METRICS_PATH_FNAME)
else:
    metrics_path_file: str = os.path.join(METADATA_DIR, METRICS_PATH_FNAME)
logger.info(f"cwd={os.getcwd()}, METADATA_DIR={METADATA_DIR}, METRICS_PATH_FNAME={METRICS_PATH_FNAME}, metrics_path_file={metrics_path_file}")
METRICS_DIR: str = Path(metrics_path_file).read_text().strip()
logger.info(f"metrics_path_file={metrics_path_file}, METRICS_DIR={METRICS_DIR}")

[2024-07-10 23:53:14,237] p23381 {3887258129.py:6} INFO - cwd=/home/ec2-user/SageMaker/foundation-model-benchmarking-tool/src/fmbench, METADATA_DIR=metadata, METRICS_PATH_FNAME=metrics_path.txt, metrics_path_file=metadata/metrics_path.txt


FileNotFoundError: [Errno 2] No such file or directory: 'metadata/metrics_path.txt'
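
The `FileNotFoundError` above arises because `metadata/metrics_path.txt` is resolved relative to the current working directory. A minimal sketch of a more defensive read follows. `read_metrics_dir` is a hypothetical helper, not part of FMBench, and it assumes only what the cell above already does: that the file holds a single path string.

```python
from pathlib import Path


def read_metrics_dir(metrics_path_file: str) -> str:
    """Return the metrics directory stored in the given text file."""
    path = Path(metrics_path_file)
    if not path.is_file():
        # Report the working directory, since the path is usually relative to it.
        raise FileNotFoundError(
            f"{path} not found relative to cwd={Path.cwd()}; "
            "check the working directory or the debug flag"
        )
    metrics_dir = path.read_text().strip()
    if not metrics_dir:
        raise ValueError(f"{path} is empty, so no metrics directory can be derived")
    return metrics_dir
```

In the notebook this could be called as `read_metrics_dir(metrics_path_file)` in place of the bare `Path(...).read_text().strip()`, so that a missing file fails with a message that points at the likely cause.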

In [8]:
# file_path = os.path.join(METRICS_DIR, config["report"]["per_inference_request_file"])
file_path: str = "llama2-7b-v1-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=10/hh=23/mm=37/per_inference_request_results.csv"
logger.info(f"File path containing the metrics per inference folder --> {file_path}")

# Read the file from S3
try:
    file_content = get_s3_object(config['aws']['bucket'], file_path)
    # Use pandas to read the CSV content
    df_per_inference = pd.read_csv(io.StringIO(file_content))
    logger.info(f"{file_path} read into dataframe of shape {df_per_inference.shape}, "
                f"cols={df_per_inference.columns}")
    logger.info(f"{file_path} contains results for the following endpoints={df_per_inference.endpoint_name.unique()}")
    logger.info(df_per_inference.head())
except Exception as e:
    logger.error(f"Error reading from S3: {e}")

[2024-07-10 23:53:25,061] p23381 {67534588.py:3} INFO - File path containing the metrics per inference folder --> llama2-7b-v1-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=10/hh=23/mm=37/per_inference_request_results.csv
[2024-07-10 23:53:25,197] p23381 {67534588.py:10} INFO - llama2-7b-v1-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=07/dd=10/hh=23/mm=37/per_inference_request_results.csv read into dataframe of shape (140, 25), cols=Index(['endpoint_name', 'prompt', 'ground_truth', 'do_sample', 'temperature',
       'top_p', 'top_k', 'max_new_tokens', 'return_full_text', 'completion',
       'prompt_tokens', 'completion_tokens', 'latency', 'time_to_first_token',
       'time_per_output_token', 'time_to_last_token', 'uuid',
       'experiment_name', 'concurrency', 'instance_type', 'instance_count',
       'EndpointName', 'ModelName', 'Image', 'S3Uri'],
      dtype='object')
[2024-07-10 23:53:25,199] p23381 {67534588.py:12} INFO - llama2-7b-v1-fmbench-stack-us-we

In [9]:
df_per_inference.head()

Unnamed: 0,endpoint_name,prompt,ground_truth,do_sample,temperature,top_p,top_k,max_new_tokens,return_full_text,completion,...,time_to_last_token,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri
0,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,,18f167cf937c435caa5a6f791a360aa8,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,
1,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,,50c72f34f8a041af94fb86e362524baf,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,
2,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,,9c933384680a428784e009de8780a764,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,
3,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,,470b83c90f884371b5c46fdf693f98ac,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,
4,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,,01bb6ca6d3574d50abbd34e2537c2d4a,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,


### Relationship between prompt token length and inference latency for different instances and concurrency levels

In [10]:
df_per_inference.latency.describe()

count    140.000000
mean       1.510421
std        0.685664
min        0.739065
25%        1.002028
50%        1.138360
75%        1.831972
max        3.845381
Name: latency, dtype: float64
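
The summary above describes latency across all requests at once. To look at how latency varies with prompt token length per instance type and concurrency level, the rows can be grouped first. The sketch below uses a small made-up dataframe that mimics the `instance_type`, `concurrency`, `prompt_tokens`, and `latency` columns of `df_per_inference`; the values and the token bin edges are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns of df_per_inference.
toy = pd.DataFrame({
    "instance_type": ["ml.g5.xlarge"] * 4 + ["ml.g5.2xlarge"] * 4,
    "concurrency": [1, 1, 2, 2, 1, 1, 2, 2],
    "prompt_tokens": [500, 1500, 500, 1500, 500, 1500, 500, 1500],
    "latency": [1.0, 2.0, 1.5, 3.0, 0.8, 1.6, 1.2, 2.4],
})

# Bucket prompts by token count (bin edges are arbitrary, for illustration only).
toy["token_bucket"] = pd.cut(
    toy["prompt_tokens"], bins=[0, 1000, 2000], labels=["0-1000", "1000-2000"]
)

# Mean latency for every (instance type, concurrency, token bucket) combination.
latency_summary = (
    toy.groupby(["instance_type", "concurrency", "token_bucket"], observed=True)["latency"]
    .mean()
    .reset_index(name="mean_latency")
)
```

Replacing `toy` with `df_per_inference` and choosing bin edges that match the dataset's prompt lengths would give the same kind of table for the real run.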

### Load the `sentence-transformers/all-mpnet-base-v2` embeddings model to calculate the _Cosine Similarity_ scores 
---

This portion of the evaluation step does the following:

1. Loads the `sentence-transformers/all-mpnet-base-v2` model from Hugging Face. This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

1. Uses the embeddings model to get quantitative metrics from the inferences. This yields a similarity score between the ground truth answers from a dataset, if any are given, and the actual responses received from the model during inference.

1. If no ground truth is provided, cosine similarity is calculated between the response and the content provided to answer the question.
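
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below works through this on tiny hand-made vectors using only the standard library. It is an illustration of the metric itself, not the notebook's implementation, which applies the same formula to model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite directions: similarity -1.
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])
```

Because vector length cancels out, a long completion and a short ground truth can still score highly when their embeddings point in a similar direction.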

In [41]:
import torch
from numpy import dot
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

# Read the embeddings model settings from the config
embeddings_model_info: Dict = config['get_model_evaluations']['quantitative_eval_info']


def load_model():
    """
    This function loads the sentence-transformers model based on the provided model ID.
    """
    model_id = embeddings_model_info['get_embeddings_model_id'].get('model_id', None)
    if model_id:
        model = SentenceTransformer(model_id)
        return model
    else:
        raise ValueError("Model ID is not provided or invalid in the configuration.")


# Load the model
model = load_model()


def get_embeddings(text: str) -> np.ndarray:
    """
    This function returns the embedding for a given text using the sentence-transformers model.
    """
    # encode() returns a NumPy array by default; take the first row for the single input text
    return model.encode([text])[0]


def get_cosine_similarity(text1: str, text2: str) -> float:
    """
    This function calculates the cosine similarity between two texts.
    """
    A = get_embeddings(text1)
    B = get_embeddings(text2)
    cosine = dot(A, B) / (norm(A) * norm(B))
    # convert the NumPy scalar to a plain Python float to match the annotation
    return float(cosine)

# Score each completion against its ground truth, one row at a time
df_per_inference['cosine_similarity_score'] = df_per_inference.apply(
    lambda row: get_cosine_similarity(row['completion'], row['ground_truth']), axis=1
)
df_per_inference.head()

[2024-07-11 01:31:22,577] p23381 {SentenceTransformer.py:189} INFO - Use pytorch device_name: cpu
[2024-07-11 01:31:22,578] p23381 {SentenceTransformer.py:197} INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Unnamed: 0,endpoint_name,prompt,ground_truth,do_sample,temperature,top_p,top_k,max_new_tokens,return_full_text,completion,...,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score
0,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,18f167cf937c435caa5a6f791a360aa8,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
1,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,50c72f34f8a041af94fb86e362524baf,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
2,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,9c933384680a428784e009de8780a764,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
3,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,470b83c90f884371b5c46fdf693f98ac,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
4,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,01bb6ca6d3574d50abbd34e2537c2d4a,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
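
The row-by-row `apply` above encodes one text at a time, which means two encode calls per row. A possible alternative is to encode each column in a single batch and to skip rows without a usable ground truth. The sketch below shows the idea with a pluggable `encode` callable so that it runs without downloading a model. `batched_cosine_scores` and `toy_encode` are hypothetical names introduced here, and the toy encoder is a stand-in, not a real embedding model.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def batched_cosine_scores(
    completions: List[str],
    ground_truths: List[Optional[str]],
    encode: Callable[[List[str]], List[Vector]],
) -> List[Optional[float]]:
    """Score each completion against its ground truth with one encode call per column."""
    # Keep only rows whose ground truth is a non-empty string (NaN and None are skipped).
    usable = [i for i, gt in enumerate(ground_truths) if isinstance(gt, str) and gt.strip()]
    scores: List[Optional[float]] = [None] * len(completions)
    if not usable:
        return scores
    emb_a = encode([completions[i] for i in usable])
    emb_b = encode([ground_truths[i] for i in usable])
    for i, a, b in zip(usable, emb_a, emb_b):
        dot = sum(x * y for x, y in zip(a, b))
        denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        scores[i] = dot / denom if denom else None
    return scores


def toy_encode(texts: List[str]) -> List[Vector]:
    """Stand-in encoder for illustration only: counts of 'a' and 'b' characters."""
    return [[float(t.count("a")), float(t.count("b"))] for t in texts]


scores = batched_cosine_scores(["aab", "bb", "a"], ["aab", float("nan"), "b"], toy_encode)
```

With the notebook's model, `encode` could be something like `lambda texts: model.encode(texts, show_progress_bar=False)`, which would also suppress the per-call progress bars. The sketch assumes the completions themselves are always strings.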


In [42]:
# define the all_metrics path to send the evaluation metrics to
all_metrics_fpath = os.path.join(METRICS_DIR, config["report"]["all_metrics_file"])

# serialize the dataframe, which now includes the cosine similarity scores, to CSV
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_cosine_similarity_scores_csv = csv_buffer.getvalue()

# Define the file name for S3 based on the original file path.
# This is kept in its own variable so that it does not overwrite the CSV content above.
all_metrics_summary_fname = all_metrics_fpath.replace("all_metrics", "all_metrics_summary").split('/')[-1] 

# PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES must be defined (for example in fmbench.globals)
# before this cell runs; the NameError below shows that it was missing in this run
inference_cosine_similarity_scores_s3_path = os.path.join(METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_cosine_similarity_scores_csv, BUCKET_NAME, "", 
            METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)
logger.info(f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{inference_cosine_similarity_scores_s3_path}")

df_per_inference.head()

NameError: name 'PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES' is not defined

### Use _LLM as a judge_ to get Subjective Evaluations on various evaluation criteria
---

In this portion of the notebook, we evaluate the generated content against several criteria. By default, 
FMBench supports `Relevance`, `Depth`, `Creativity`, `Correctness`, and `Helpfulness`. Each criterion is evaluated against a set of 
rules that are populated into a standard prompt template during the LLM as a judge evaluation process. The steps followed in this 
process are given below:

1. The LLM as a judge (configurable) uses a standard prompt template to evaluate content. This template is populated at runtime with the rules and instructions for the criterion being evaluated. For example, when evaluating correctness, the rules and instructions for correctness are inserted into the prompt template to evaluate _content1_. 

1. Each criterion is scored on a scale of 1-5. The rules for scoring the content are defined in the standard prompt template.

1. Along with each rating, the judge provides a `subjective_explanation` describing why it gave that rating, which offers more insight into the evaluation process.

1. After all evaluations are complete, a final layer of evaluation is added. This layer uses another LLM that acts as a final summarizer. It takes the ratings and the answers generated by each unique model used in inference, and produces a list of trends, overall patterns, and observations about which model is suited for a given task and dataset.

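As a rough sketch of steps 2 and 3, a judge response for one criterion could be checked before it is aggregated. The `rating` key and the `validate_judgement` helper are assumptions made for illustration; only `subjective_explanation` is named above, and the actual response schema is set by the prompt template.

```python
import json
from typing import Dict, Optional


def validate_judgement(raw: str) -> Optional[Dict]:
    """Return the parsed judgement if it holds an integer rating in 1-5
    and a non-empty explanation, otherwise return None."""
    try:
        d = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(d, dict):
        return None
    rating = d.get("rating")  # hypothetical key name
    explanation = d.get("subjective_explanation")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(rating, bool) or not isinstance(rating, int):
        return None
    if not 1 <= rating <= 5:
        return None
    if not isinstance(explanation, str) or not explanation.strip():
        return None
    return {"rating": rating, "subjective_explanation": explanation.strip()}


good = validate_judgement('{"rating": 4, "subjective_explanation": "Mostly accurate."}')
bad = validate_judgement('{"rating": 9, "subjective_explanation": "Out of range."}')
```

Here `good` is a dictionary with the rating and explanation, while `bad` is `None` because the rating falls outside the 1-5 scale.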
In [43]:
def get_inference(model_id: str,
                  prompt: str):
    """
    Get inference from the judge model using LiteLLM. The prompt is an evaluation prompt
    built from the evaluation prompt template and the specific set of rules for an
    evaluation criterion. The completion and token counts are returned in a dictionary.
    """
    print(f"get_inference, model_id={model_id}")
    # the service name, used as the provider prefix for litellm
    service_name: str = "bedrock"
    # the model identifier passed to the litellm api for Bedrock models such as titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name 
    # initialize the response dict
    ret = dict(exception=None,
               prompt=prompt,
               completion=None,
               completion_token_count=None,
               prompt_token_count=None,
               model_id=model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    try:
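        # Call the litellm completion API with the prompt as a single user message.
        # temperature, max_tokens and caching are not parameters of this function;
        # they must be defined at module level before this function is called.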
        print(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{"content": body,"role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # extract the number of prompt (input) and completion (output) tokens
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [44]:
def safe_filename(s):
    """
    Convert a string to another string that can be used as a filename,
    i.e. remove non-word chars and replace white space with dashes
    """
    if s is None:
        return "None"
    # Remove everything except word characters (letters, digits, underscore) and whitespace
    s = re.sub(r"[^\w\s]", '', s)

    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)

    return s

In [45]:
def parse_as_json(x: str) -> Optional[Dict]:
    """
    Convert a string into a dictionary. Newline and tab characters are
    removed first since they could break the json parsing. Returns None
    if the string cannot be parsed.
    """
    d: Optional[Dict] = None
    try:
        x = x.replace("\n", "").replace("\t", "")
        d = json.loads(x)
    except Exception as e:
        print(f"parse_as_json, error parsing string as json, string={x}, exception={e}")
    return d

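Judge models sometimes wrap the JSON object in extra text. The sketch below shows one possible way to handle that case; `extract_json_object` is a hypothetical helper and not part of FMBench. It takes the span between the first `{` and the last `}` and parses it with `strict=False`, which lets `json.loads` accept raw control characters such as newlines inside string values instead of deleting them.

```python
import json
from typing import Dict, Optional


def extract_json_object(text: str) -> Optional[Dict]:
    """Parse the outermost {...} span in text, or return None if there is none."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        # strict=False allows control characters (e.g. newlines) inside strings
        parsed = json.loads(text[start:end + 1], strict=False)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = 'Here is my evaluation:\n{"rating": 3, "why": "line one\nline two"}\nDone.'
result = extract_json_object(reply)
```

This approach assumes the reply contains a single JSON object; a reply with several separate objects would not parse and would yield `None`.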
### Read the latest dataframe and run LLM as a judge evaluations on it

In [46]:
df_per_inference.head()

Unnamed: 0,endpoint_name,prompt,ground_truth,do_sample,temperature,top_p,top_k,max_new_tokens,return_full_text,completion,...,uuid,experiment_name,concurrency,instance_type,instance_count,EndpointName,ModelName,Image,S3Uri,cosine_similarity_score
0,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,18f167cf937c435caa5a6f791a360aa8,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
1,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,50c72f34f8a041af94fb86e362524baf,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
2,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,9c933384680a428784e009de8780a764,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
3,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,470b83c90f884371b5c46fdf693f98ac,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455
4,llama-2-7b-g5-2xlarge-1720647440-3564436,<s>[INST] <<SYS>>\nYou are an assistant for qu...,a genus of flowering plant in the Lardizabalac...,True,0.1,0.92,120,100,False,The genus' Sinofranchetia and Stauntonia are ...,...,01bb6ca6d3574d50abbd34e2537c2d4a,llama2-7b-g5.xlarge-huggingface-pytorch-tgi-in...,1,ml.g5.xlarge,1,,,,,0.695455


### Prepare the evaluation prompt payloads
---

The function below fills the evaluation prompt template with the rules for a criterion, the context, the answer, and the
ground truth (if any). The LLM judge then uses the resulting prompt to evaluate the answers on the different criteria.

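To illustrate how such a template is populated, here is a minimal sketch using `str.format`. The template text and the sample values are made up for illustration; the real template is read from a file. Two details are worth noting: literal braces in the template (for example, a JSON response format) must be doubled, and a ground truth of `None` is rendered as the text `None`.

```python
# hypothetical template text, for illustration only
EVAL_TEMPLATE = (
    "Rules:\n{rules}\n\n"
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Ground truth:\n{ground_truth}\n\n"
    # doubled braces become single literal braces after formatting
    'Respond as JSON: {{"rating": <1-5>, "subjective_explanation": "<text>"}}'
)

prompt = EVAL_TEMPLATE.format(
    rules="Rate factual correctness from 1 to 5.",
    context="Paris is the capital of France.",
    answer="The capital of France is Paris.",
    ground_truth=None,  # no ground truth provided
)
print(prompt)
```

If rendering `None` into the prompt is not desired, the caller could substitute an empty string or a short placeholder before formatting.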
In [47]:
def prepare_eval_prompts(eval_template_path: str,
                         answer: str, 
                         rules: str, 
                         context: str, 
                         ground_truth: Optional[str]):
    """
    This function prepares an evaluation prompt by filling the standard eval prompt template
    with the rules of a given subjective criterion, the context, the answer and the ground truth (if any is provided)
    """
    processed_eval_template: Optional[str] = None
    eval_template = Path(eval_template_path).read_text()
    processed_eval_template = eval_template.format(
        rules=rules,
        answer=answer,
        context=context,
        ground_truth=ground_truth)
    return processed_eval_template

In [48]:
def run_eval(i: int, total: int, row: Dict,  judge_model_info: Dict) -> Dict:
    """
    Runs the evaluation for one row 
    The eval prompt is already available in the row dictionary
    and we simply want to run the inference against the judge model.
    The results are returned in a new dictionary that contains the model 
    response and some fields from the original dictionary
    """
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    judge_model_id = judge_model_info['llm_as_a_judge_model_id']
    print(f"run_eval, row {i}/{total}, judge_model_id={judge_model_id}, model_id={row['model_id']}")

    # create the payload for model inference
    prompt = row['eval_prompt']
    # run the evaluation prompt against the judge model
    resp = get_inference(judge_model_id, prompt)
    resp['user_input'] = row['user_input']
    resp['model_id'] = row['model_id']
    resp['sql'] = row['sql']
    # save the judge response as a JSON file in a per-model directory
    dir_path = os.path.join(EVAL_COMPLETIONS_DIR, safe_filename(row['model_id']))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"{safe_filename(row['user_input'])}.json")

    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))

    return resp

# we use Ray to parallelize
@ray.remote
def async_run_eval(i: int, total: int, row: Dict, judge_model_info: Dict) -> Dict:
    print(f"async_run_eval, i={i}, total={total}, judge_model_info={judge_model_info}")
    return run_eval(i, total, row, judge_model_info)

In [None]:
# we divide the full list into sublists of size n because we don't want to 
# run into throttling issues with Bedrock
n: int = config['get_model_evaluations']['subjective_eval_info'].get('run_parallel_inference_count', 5)
resp_list: List = []
st = time.perf_counter()
logger.info(f"running inference for {LLM_AS_A_JUDGE_MODEL_INFO['model']}")

list_of_lists = [eval_records_list_flat[i * n:(i + 1) * n] \
                   for i in range((len(eval_records_list_flat) + n - 1) // n )]

logger.info(f"split input list of size {len(eval_records_list_flat)} "
            f"into {len(list_of_lists)} lists")

for idx, l in enumerate(list_of_lists):
    logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)} ")
    resp_list.extend(ray.get([async_run_eval.remote(i+1, len(l), e, LLM_AS_A_JUDGE_MODEL_INFO) \
                              for i, e in enumerate(l)]))

elapsed_time = time.perf_counter() - st
logger.info(f"model={LLM_AS_A_JUDGE_MODEL_INFO.get('model')} completed "
            f"in {elapsed_time:.2f} seconds")
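
The batching used above can be seen in isolation in the following sketch. `split_into_batches` is a hypothetical helper name; it applies the same ceiling-division slicing as the cell above, without Ray or Bedrock, so the batch sizes are easy to inspect.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], n: int) -> List[List[T]]:
    """Split items into consecutive sublists of at most n elements each."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # ceiling division: number of batches needed to cover all items
    num_batches = (len(items) + n - 1) // n
    return [list(items[i * n:(i + 1) * n]) for i in range(num_batches)]


batches = split_into_batches(list(range(12)), 5)
print([len(b) for b in batches])  # prints [5, 5, 2]
```

Each batch is submitted and awaited as a whole, so at most `n` requests are in flight at any time. This limits the load on the judge model at the cost of waiting for the slowest request in each batch.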