## Evaluate the inferences generated by candidate models in the inference step: gather quantitative metrics (such as _Cosine Similarity, Levenshtein distance, and token set ratio_) and subjective metrics on various criteria using Max Voting & Average Pooling with a PoLL (Panel of LLM Evaluators)

---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

#### This step of the solution focuses on evaluating the quality of responses. It gathers the following information and performs the steps below:

- **Gets the inference request file that contains all inferences from the inference step**: This step first reads the inference request file into a dataframe. The file contains the responses from the candidate models, the ground truth (if any), and other information, such as the source payload file, concurrency level, etc.

- **Generates quantitative metrics for evaluation**: Calculates quantitative metrics that measure similarity and accuracy, for example _Cosine Similarity, Levenshtein distance, and token set ratio_. These give an overall quantitative score across the entire dataset, showing which model generates outputs that are most similar to the ground truth (if any is provided). We use these metrics to build a hierarchical evaluation decision tree, and only move on to the next step of evaluation if the correctness of an answer cannot be determined from the metrics alone.
    
    The steps followed as part of this evaluation hierarchy (for Max Voting) are as follows:
    
    1. We check whether any of the _Cosine Similarity, Levenshtein similarity, or token set ratio_ values exceeds a given threshold. If one does, we assume that the answer is correct and do not pass it on to the next step. This saves latency and cost, and also acts as an evaluation filter.
    
    1. The remaining answers, which are not obviously correct or have no semantic relation to the ground truth, move to the next step in the hierarchical tree: evaluation by a panel of LLM evaluators.

- **Uses a _Panel of LLM Evaluator_ approach to get subjective evaluations**: Refer to this [paper](https://arxiv.org/pdf/2404.18796). We use the following ways to evaluate the responses from the `candidate models` (models used to generate inferences)

    1. **Max Voting**: When a dataset provides a ground truth, we use a technique called `Max Voting`. Here, we use a PoLL, or panel of LLM evaluators, drawn from different model families to judge whether each candidate model's response is `correct` or `incorrect`, based simply on a comparison with the ground truth. Using models from different families as a PoLL brings its evaluation ability closer to that of a human evaluation and reduces intra-model bias during the evaluation process.
    
    2. **Average Pooling**: When a dataset does not provide a ground truth, or when the task being evaluated needs deeper subjective judgements, we use `Average Pooling`. Here, each evaluator on the PoLL rates the candidate model responses on a scale of 1-5 against specific subjective criteria. We then average the scores for each criterion to see how each candidate model was rated by the PoLL.
    
FMBench uses this PoLL approach to reduce intra-model bias by using judge models from different model families. This brings the evaluation results closer to those of a human evaluation, makes the evaluation process more streamlined and consistent across all responses, and reduces the latency and cost of evaluating the candidate models over time.
    
***All evaluations are generated in a JSON format for further downstream analytics on the evaluation results***

#### Import all of the necessary libraries below to run this notebook

In [None]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [None]:
import io
import re
import ray
import time
import json
import glob
import yaml
import boto3
import logging
import pandas as pd
from numpy import dot
import seaborn as sns
from pathlib import Path
from fuzzywuzzy import fuzz
from fmbench.utils import *
from fmbench.globals import *
from numpy.linalg import norm
from litellm import completion
from typing import List, Optional, Dict
from difflib import SequenceMatcher as SM
import importlib.resources as pkg_resources
from fmbench import __version__ as fmbench_version
from sentence_transformers import SentenceTransformer

In [None]:
# set a logger to get logs
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

Load the Config.yml file. It contains information that is used across this benchmarking environment, such as information about the AWS account, prompts, and payloads to be used for invocations.

In [None]:
logger.info(f"CONFIG_FILE={CONFIG_FILE}")
config = load_main_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

#### Load the associated pricing config file

In [None]:
# get the name of the pricing yml file from the main config
pricing_file_path: str = config['pricing'] 

# initialize the pricing config file to None
pricing_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

pricing_module = Path(config['pricing'])
logger.info(f"pricing config provided for inference from this model is --> {pricing_module}")
pricing_file_path = os.path.join(config_dir, pricing_module)
logger.info(f"pricing config file path is --> {pricing_file_path}")

pricing_config = load_config(pricing_file_path)
logger.info(f"pricing config file recorded: {json.dumps(pricing_config, indent=2)}")

### Load the model evaluation information
---

In [None]:
# get the name of the model evaluation config file from the main config
model_eval_fpath: str = config['model_evaluations'] 

# initialize the evaluation config to None
eval_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

eval_module = Path(config['model_evaluations'])
logger.info(f"eval config provided for evaluation --> {eval_module}")
eval_file_path = os.path.join(config_dir, eval_module)
logger.info(f"eval config file path is --> {eval_file_path}")

# eval_config = load_config(eval_file_path).format(method_name=config['method_name'])
with open(eval_file_path, 'r') as file:
    model_eval_info = file.read()
    model_eval_formatted_content = model_eval_info.format(method_name=config['PoLL_Composition_and_Voting'].get('method', None),
                                                         ground_truth=config['PoLL_Composition_and_Voting'].get('ground_truth_col', None), 
                                                         criteria=config['PoLL_Composition_and_Voting'].get('subjective_eval_criteria', None))
    eval_config = yaml.safe_load(model_eval_formatted_content)
logger.info(f"eval config file recorded: {json.dumps(eval_config, indent=2)}")

In [None]:
debug = False
if debug is True:
    metrics_path_file: str = os.path.join("..", "..", METADATA_DIR, METRICS_PATH_FNAME)
else:
    metrics_path_file: str = os.path.join(METADATA_DIR, METRICS_PATH_FNAME)
logger.info(f"cwd={os.getcwd()}, METADATA_DIR={METADATA_DIR}, METRICS_PATH_FNAME={METRICS_PATH_FNAME}, metrics_path_file={metrics_path_file}")
METRICS_DIR: str = Path(metrics_path_file).read_text().strip()
logger.info(f"metrics_path_file={metrics_path_file}, METRICS_DIR={METRICS_DIR}")

In [None]:
file_path: str = os.path.join(METRICS_DIR, config["report"]["per_inference_request_file"])
# example of a fully qualified path, useful for pinning a specific run while debugging:
# file_path='fmbench-bedrock-anthropic-models-fmbench-1-us-east-1-role/data/metrics/yyyy=2024/mm=07/dd=23/hh=22/mm=19/per_inference_request_results.csv'
logger.info(f"File path containing the metrics per inference folder --> {file_path}")

# Read the file from S3
try:
    file_content = get_s3_object(config['aws']['bucket'], file_path)
    # Use pandas to read the CSV content
    df_per_inference = pd.read_csv(io.StringIO(file_content))
    logger.info(f"{file_path} read into dataframe of shape {df_per_inference.shape}, "
                f"cols={df_per_inference.columns}")
    logger.info(f"{file_path} contains results for the following endpoints={df_per_inference.endpoint_name.unique()}")
    logger.info(df_per_inference.head())
except Exception as e:
    logger.error(f"Error reading from S3: {e}")


In [None]:
logger.info(f"Going to be using this inference file to generate evaluations on -> {df_per_inference.head()}")

### Summary statistics of inference latency in the inference file

In [None]:
logger.info(f"Latency summary statistics for the inference file being used for evaluations: {df_per_inference.latency.describe()}")

### Use the `sentence-transformers/all-mpnet-base-v2` embeddings model to calculate the _Cosine Similarity_ scores 
---

This portion of the evaluation step does the following:

1. Uses the `sentence-transformers/all-mpnet-base-v2` model from Hugging Face. This is a sentence-transformers model. It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

1. Uses the embeddings model to get quantitative metrics from the inferences. This gives a similarity score between the ground truth answers from a dataset (if any are given) and the actual responses received from the model during inference.

1. If no ground truth is provided, cosine similarity is calculated between the response and the content provided to answer the question.
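As a small worked illustration of the cosine similarity formula used in this step, the sketch below computes the score on toy vectors in plain Python. The vectors and the helper name `cosine_similarity` are made up for this example; in the notebook the vectors are the 768-dimensional embeddings returned by the sentence-transformers model.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Cosine of the angle between two equal-length vectors, or None for a zero vector."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # the score is undefined when either vector has zero length
        return None
    return dot_product / (norm_a * norm_b)


# toy "embeddings" (illustrative values only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"same direction: {same_direction}, orthogonal: {orthogonal}")
```

Vectors that point in the same direction score close to 1 regardless of their magnitude, while unrelated (orthogonal) vectors score 0, which is why the score can be compared against a fixed threshold.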

In [None]:
# get the quantitative evaluation information from the config file, such as the embeddings model
# to be used
embeddings_model_quantitative_info: Dict = eval_config['model_evaluations']['quantitative_eval_info']


def load_model():
    """
    This function loads the sentence-transformers model based on the provided model ID.
    """
    try:
        model=None
        model_id = embeddings_model_quantitative_info['embeddings_model_id'].get('model_id', None)
        if model_id:
            model = SentenceTransformer(model_id)
        else:
            raise ValueError("Model ID is not provided or invalid in the configuration.")
    except Exception as e:
        logger.error(f"The SentenceTransformer embeddings model could not be loaded: {e}")
        model=None
    return model

In [None]:
# load the embeddings model to calculate the cosine similarity scores
model = load_model()


def calculate_cosine_similarity(text1: str, text2: str) -> float:
    """
    This function calculates the cosine similarity between two texts. In this case, 
    the cosine similarity is the comparison between the ground truth in the given dataset
    and the candidate model's response
    """
    try:
        cosine: float = None
        # returns the embedding for a given text using the sentence-transformers model.
        A = model.encode([text1])[0]
        B = model.encode([text2])[0]
        cosine = dot(A, B) / (norm(A) * norm(B))
        logger.info(f"Calculating the cosine similarity score, current score: {cosine}")
    except Exception as e:
        logger.error(f"Cosine similarity was not calculated at this iteration: {e}")
        cosine=None
    return cosine

In [None]:
# get the method that is being used to evaluate the content (which is either 
# max voting or average pooling)
model_eval_subjective_info: Dict = eval_config['model_evaluations']['subjective_eval_info']
method_name: str = eval_config['model_evaluations']['PoLL_Composition_and_Voting'].get('method', None)
logger.info(f"The evaluation method FMBench is going to use to evaluate different model responses: {method_name}")
logger.info(f"judge panel being used to evaluate model responses: {model_eval_subjective_info.get('judge_panel_list', None)}")

In [None]:
logger.info("~Creating embeddings of all candidate model responses now. This might take 1-2 minutes~")

# calculate the quantitative metrics if evaluation is set to max voting
if method_name == "max_voting":
    logger.info(f"ground truth column found: {eval_config['model_evaluations'].get('ground_truth_col')}, calculating cosine similarity scores")
    # score each candidate model response against its ground truth
    df_per_inference['cosine_similarity_score'] = df_per_inference.apply(
        lambda row: calculate_cosine_similarity(row['completion'], row['ground_truth']), axis=1
    )
df_per_inference.head()

## Model Evaluations: Hierarchical Flow
--- 

For this portion of the step, we start with the model evaluation process. Here we perform the following steps:

1. Check for the lexical match/similarity between the ground truth (if any) and the answer.

1. Compute the similarity score using three main quantitative metrics: Cosine similarity score, Levenshtein similarity, and Token set ratio. If the threshold for any of these is met, the model evaluation is complete and the answer is considered correct. 

1. If the answer is not obvious, i.e., none of the three thresholds of quantitative evaluations are met, then the data moves to the Panel of LLM Evaluators for a further deep dive into the evaluation process.
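The hierarchical flow above can be sketched as a simple routing function. This is only an illustration under stated assumptions: the threshold values are hypothetical (the real ones are defined in the configuration files), the helper names `is_obviously_correct` and `route_response` do not exist in FMBench, and the sketch assumes that meeting any one threshold is enough, as described in the steps above.

```python
from typing import Dict, Optional

# Hypothetical thresholds for illustration; the real values come from the config files.
EXAMPLE_THRESHOLDS: Dict[str, float] = {
    "cosine_similarity_score": 0.9,
    "levenshtein_distance": 0.9,   # column name used in this notebook; holds a 0-1 similarity
    "token_set_ratio_value": 0.9,
}


def is_obviously_correct(scores: Dict[str, Optional[float]],
                         thresholds: Dict[str, float]) -> bool:
    """Return True if any available score meets or exceeds its threshold."""
    for name, threshold in thresholds.items():
        value = scores.get(name)
        if value is not None and value >= threshold:
            return True
    return False


def route_response(scores: Dict[str, Optional[float]],
                   thresholds: Dict[str, float]) -> str:
    """Decide whether a response is accepted by the filter or sent to the PoLL."""
    return "correct" if is_obviously_correct(scores, thresholds) else "send_to_poll"


clear_match = {"cosine_similarity_score": 0.95, "levenshtein_distance": 0.4, "token_set_ratio_value": 0.6}
unclear = {"cosine_similarity_score": 0.5, "levenshtein_distance": None, "token_set_ratio_value": 0.3}
print(route_response(clear_match, EXAMPLE_THRESHOLDS))
print(route_response(unclear, EXAMPLE_THRESHOLDS))
```

Scores that could not be computed (stored as `None`) are skipped rather than treated as failures, so a response with missing metrics still reaches the panel.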

### Model Evaluation Part 1: Lexical Match & Cosine Similarity Score Accuracy Evaluation Filter
---

Before having the Panel of LLM Evaluators evaluate each candidate model response, we pass those responses through a filtering step. In this step we apply thresholds to the `Lexical match` (token set ratio), `Cosine Similarity`, and `Levenshtein Similarity` scores to decide whether an answer is correct without having an LLM evaluate it. The thresholds for correctness are defined in the configuration files. 

The reason for doing this is to make the evaluation process a hierarchy of checks, so that every candidate model response is evaluated appropriately. Additionally, using these scores as a filter to determine whether a candidate model response is correct narrows down the set of responses the PoLL has to evaluate, reducing the time and cost to complete all evaluations. This is specific to the `Ground Truth based approach`. 

For the lexical match, we use the `token_set_ratio` function from the `fuzzywuzzy` fuzzy-matching library to determine what percentage of the two texts is similar.

**Note**: `Token_set_ratio` algorithm tokenizes both input strings, removes duplicate tokens, and calculates the similarity score based on the intersection and union of the token sets. It captures the essence of the strings’ content rather than their specific order.
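To make the idea behind a token set comparison concrete, here is a simplified, standard-library-only sketch. It is not the `fuzzywuzzy` implementation (which also normalizes punctuation and uses its own ratio function); the helper name `simple_token_set_ratio` is hypothetical and only shows why duplicate tokens and word order do not affect the score.

```python
from difflib import SequenceMatcher


def simple_token_set_ratio(text1: str, text2: str) -> float:
    """Simplified token-set similarity in the range 0-1 (illustration only)."""
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    # sorted, de-duplicated tokens shared by both texts and unique to each
    shared = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{shared} {only1}".strip()
    combined2 = f"{shared} {only2}".strip()
    # compare the shared part against each combined string, and the two
    # combined strings against each other; keep the best score
    pairs = [(shared, combined1), (shared, combined2), (combined1, combined2)]
    return max(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


duplicates_ignored = simple_token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
order_ignored = simple_token_set_ratio("new york mets", "mets new york")
different = simple_token_set_ratio("the cat sat", "dog ran")
print(duplicates_ignored, order_ignored, different)
```

Because one of the compared pairs is the shared tokens against a full token string, a short ground truth that is fully contained in a long response still scores highly, which suits comparing verbose model answers with terse reference answers.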

In [None]:
def calculate_token_set_ratio(text1: str, text2: str) -> float:
    """
    This function calculates the token set ratio (a fuzzy lexical match score, scaled to 0-1)
    between two strings. If the token set ratio exceeds its threshold and the cosine similarity
    matches or exceeds its threshold, then the answer is treated as correct and it is not
    evaluated using a judge. If not, then it is passed through the PoLL process
    """
    try:
        token_set_ratio: Optional[float] = None
        if text1 and text2:
            token_set_ratio = fuzz.token_set_ratio(text1, text2) / 100.0
        else:
            token_set_ratio=None
    except Exception as e:
        logger.error(f"Error in calculating token set ratio: {e}")
        token_set_ratio=None
    return token_set_ratio

### Levenshtein distance algorithm
---
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
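As a worked example of the definition above, the sketch below computes the distance for the classic pair "kitten" and "sitting" using a compact two-row variant of the dynamic programming approach, then normalizes it into a 0-1 similarity by dividing by the longer string's length. The function name `levenshtein` is chosen for this illustration only.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance using two rows of the dynamic programming table."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, char_s in enumerate(s, start=1):
        current = [i]  # turning the first i characters of s into "" takes i deletions
        for j, char_t in enumerate(t, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + substitution_cost))  # substitution or match
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
# normalize by the longer string so that identical strings score 1.0
similarity = 1 - distance / max(len("kitten"), len("sitting"))
print(f"distance={distance}, similarity={similarity:.3f}")
```

The three edits here are k→s, e→i, and the insertion of the final g, giving a distance of 3 and a similarity of 1 - 3/7.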

In [None]:
def levenshtein_distance(s: str, t: str) -> int:
    """
    Here, we use Dynamic Programming (DP) to compute the Levenshtein distance
    (the minimum number of single-character edits) between two strings
    """
    # Initialize lengths of both strings
    m, n = len(s), len(t)

    # Ensure s is the longer string
    if m < n:
        s, t = t, s
        m, n = n, m

    # Initialize the distance matrix with dimensions (m+1) x (n+1)
    d = [list(range(n + 1))] + [[i] + [0] * n for i in range(1, m + 1)]

    # Populate the matrix
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            # If characters match, no cost is added
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                # Otherwise, take the minimum cost from insert, delete, or replace operations
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    # Return the computed Levenshtein distance (bottom-right cell of the matrix)
    return d[m][n]


def calculate_levenshtein_distance(input_string: str, reference_string: str) -> float:
    """
    In this function, we calculate the Levenshtein distance between the input string (candidate model response) and 
    the reference string (which can be the ground truth or the context provided to answer the question), and
    return it as a normalized similarity: 1 - (distance / length of the longer string).
    """
    try:
        similarity: Optional[float]=None
        distance = levenshtein_distance(input_string, reference_string)
        max_length = max(len(input_string), len(reference_string))
        similarity = 1 - (distance / max_length)
    except Exception as e:
        logger.error(f"Could not compute the levenshtein similarity score: {e}")
        similarity=None
    return similarity

In [None]:
# # These are examples from the LongBench dataset for testing purposes
# candidate_model_response: str = "Both Sinofranchetia and Stauntonia are from the Lardizabalaceae family. This information is mentioned in the passages for both genera."
# ground_truth: str = "a genus of flowering plant in the Lardizabalaceae family"
# ratio = calculate_levenshtein_distance(candidate_model_response, ground_truth)
# print(f"ratio calculated: {ratio}")

In [None]:
# Compute the token set ratio and the Levenshtein similarity for each row and add them as new columns.
# Both metrics compare the candidate model response against the ground truth, so they are
# only computed when a ground truth is available

# calculate the quantitative metrics if evaluation is set to max voting
if method_name == "max_voting":
    logger.info(f"ground truth column is found: {eval_config['model_evaluations'].get('ground_truth_col')}, calculating token set ratio and levenshtein distance")
    df_per_inference = df_per_inference.assign(
        token_set_ratio_value=lambda df: df.apply(lambda row: calculate_token_set_ratio(row['completion'], row['ground_truth']), axis=1),
        levenshtein_distance=lambda df: df.apply(lambda row: calculate_levenshtein_distance(row['completion'], row['ground_truth']), axis=1)
    )
df_per_inference.head()

In [None]:
# define the all_metrics path to send the evaluation metrics to
all_metrics_fpath: str = os.path.join(METRICS_DIR, config["report"]["all_metrics_file"])
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_cosine_similarity_scores_csv = csv_buffer.getvalue()
inference_cosine_similarity_scores_s3_path = os.path.join(METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_cosine_similarity_scores_csv, BUCKET_NAME, "", 
            METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)
logger.info(f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{inference_cosine_similarity_scores_s3_path}")
df_per_inference.head()

### Model Evaluation Part 2: Use _Panel of LLM Evaluators_ to get Subjective Evaluations on various evaluation criteria
---

In this portion of the notebook, we run evaluations on the content generated by different candidate models. We use two main evaluation methods: `Max Voting` and `Average Pooling`. To reduce intra-model bias, we score answer correctness based not on a single judge but on a panel composed of multiple evaluator models. Similar pooling techniques are used to reduce variance in human annotations by normalizing out both the natural variation in human judgements caused by subjective biases and human error. We use the following two techniques:

1. **Max Voting**: We use the PoLL to evaluate candidate model responses by checking their correctness against a ground truth answer provided in the dataset. We prompt each evaluator on the panel to respond in a JSON structure that gives a verdict on whether the response is correct or incorrect, along with an explanation. Using this, we can perform downstream analytics such as: 

    1. Calculate the overall accuracy of each model using the correct versus the (correct + incorrect) responses
    
    1. Calculate the `error rate`, or frequency of incorrect responses
    
    1. Categorize the errors based on the explanations provided by the evaluators. Common categories might include misunderstanding the question, incomplete answers, factual inaccuracies
    
    1. Summary of overall correct/incorrect, and the best model based on the PoLL. Rank the models on Correctness versus Incorrectness.

1. **Average Pooling**: We use the PoLL to rate the response of each candidate model on more subjective criteria. Here, the candidate model responses are rated on a scale of 1-5 against each subjective criterion, with an explanation for each rating. Using this, we can do the following:

    1. Calculate the average score for each model across all questions to get an overall performance measure.
    
    1. Compute the standard deviation of the scores to understand the consistency of the model's performance.
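The two aggregation techniques described above can be sketched in a few lines. This is a minimal illustration, not the FMBench implementation: the function names, the verdict strings, and the sample panel outputs are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def max_vote(verdicts: List[str]) -> str:
    """Return the verdict given by the most judges.

    Ties are resolved in favor of the verdict seen first, so an odd-sized
    panel is the simplest way to avoid ties between two verdicts.
    """
    return Counter(verdicts).most_common(1)[0][0]


def accuracy(final_verdicts: List[str]) -> float:
    """correct / (correct + incorrect) over all evaluated responses."""
    if not final_verdicts:
        return 0.0
    return sum(v == "correct" for v in final_verdicts) / len(final_verdicts)


def average_pool(scores: List[float]) -> Tuple[float, float]:
    """Return the mean and the (population) standard deviation of 1-5 ratings."""
    return mean(scores), pstdev(scores)


# hypothetical verdicts from a three-judge panel for two candidate responses
panel_verdicts = [["correct", "incorrect", "correct"],
                  ["incorrect", "incorrect", "correct"]]
final = [max_vote(v) for v in panel_verdicts]
print(final, accuracy(final), 1 - accuracy(final))  # verdicts, accuracy, error rate

# hypothetical ratings from the same panel for one response on one criterion
avg_score, spread = average_pool([4, 5, 3])
print(avg_score, round(spread, 3))
```

A low standard deviation means the judges largely agree on a rating; a high one flags responses that may be worth inspecting by hand.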

In [None]:
# get the qualitative/subjective evaluation information from the config file to evaluate answers from different
# endpoints on various criteria
model_eval_subjective_info: Dict = eval_config['model_evaluations']['subjective_eval_info']
eval_criteria_list = model_eval_subjective_info.get('eval_criteria', None)
logger.info(f"available llm as a judge evaluation information to use: {json.dumps(model_eval_subjective_info, indent=2)}")

In [None]:
# get the inference parameters that the LLM judge panel will use while evaluating model candidate responses
INFERENCE_PARAMETERS_LLM_PANEL: Dict = eval_config['model_evaluations']['subjective_eval_info'].get('inference_parameters', None)
logger.info(f"Inference parameters that LLM evaluators will use: {INFERENCE_PARAMETERS_LLM_PANEL}")

In [None]:
def get_panel_of_llm_evaluation(model_id: str,
                                  prompt: str):
    """
    Get inference using LiteLLM. This function is called by each evaluator on the panel of 
    llm evaluators to get a response on a given prompt. This is in the case of where there is 
    max voting or average pooling enabled
    """
    # represents the service name
    logger.info(f"get_inference, model_id={model_id}")
    service_name: str = "bedrock"
    # represents creating the bedrock model to invoke the litellm api for response for titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name 
    # initialize the response dict
    ret = dict(exception=None,
               prompt=prompt,
               completion=None,
               completion_token_count=None,
               prompt_token_count=None,
               model_id=model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    try:
        # Represents calling the litellm completion/messaging api utilizing the completion/embeddings API
        print(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{"content": body,"role": "user"}],
                              temperature=INFERENCE_PARAMETERS_LLM_PANEL.get('temperature', 0.1),
                              max_tokens=INFERENCE_PARAMETERS_LLM_PANEL.get('max_tokens', 100),
                              caching=INFERENCE_PARAMETERS_LLM_PANEL.get('caching', False))
        print(f"response: {response}")
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens        
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [None]:
def safe_filename(s):
    """
    convert a string to another string that can be used as a filename
    i.e. remove white space and non-word chars
    """
    if s is None:
        return "None"
    # Remove all characters other than letters, numbers, underscores and whitespace
    s = re.sub(r"[^\w\s]", '', s)
    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)
    return s

In [None]:
def parse_as_json(x: str) -> Optional[Dict]:
    """
    Convert a string into a dictionary. Remove any
    stray whitespaces which could break the json parsing
    """
    d: Optional[Dict] = None
    try:
        x = x.replace("\n", "").replace("\t", "")
        d = json.loads(x)
    except Exception as e:
        print(f"parse_as_json, error parsing string as json, string={x}, error={e}")
    return d

In [None]:
df_per_inference.rename(columns={'completion': 'candidate_model_response'}, inplace=True)
df_per_inference.head()

#### Prepare the evaluation prompt payloads
---

Here, the evaluation prompt template is used by the LLM judge to evaluate the answers on different criteria.
The function below fills the prompt template with a set of rules, the context, the answer, and the ground truth (if any)
to produce the final evaluation prompt.
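As an illustration of how such a template is filled in, the sketch below uses Python's `str.format` with the same placeholder names that the function in the next cell passes (`rules`, `answer`, `context`, `ground_truth`, `subjective_criteria`). The template text and the sample values are hypothetical; the real templates are read from the prompt template files.

```python
# Hypothetical template for illustration only. Literal braces, such as those in
# a JSON example, must be doubled so that str.format does not treat them as
# placeholders.
EXAMPLE_TEMPLATE = (
    "Rules:\n{rules}\n\n"
    "Context:\n{context}\n\n"
    "Candidate answer:\n{answer}\n\n"
    "Ground truth:\n{ground_truth}\n\n"
    "Criteria:\n{subjective_criteria}\n\n"
    'Respond in JSON, for example: {{"verdict": "correct", "explanation": "..."}}\n'
)

example_prompt = EXAMPLE_TEMPLATE.format(
    rules="Compare the candidate answer with the ground truth.",
    context="Sinofranchetia is a genus of flowering plant in the Lardizabalaceae family.",
    answer="Sinofranchetia belongs to the Lardizabalaceae family.",
    ground_truth="a genus of flowering plant in the Lardizabalaceae family",
    subjective_criteria="",  # left empty when evaluating against a ground truth
)
print(example_prompt)
```

If a template contains single braces that are not placeholders, `str.format` raises an error; the function in the next cell catches such errors and returns `None`, which is worth checking when a judge receives an empty prompt.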

In [None]:
def prepare_eval_prompts(eval_template: str,
                         answer: str, 
                         rules: str, 
                         context: str, 
                         ground_truth: Optional[str], 
                         subjective_criteria: Optional[str]):
    """
    This function prepares the evaluation prompts by preparing the standard eval prompt template
    with the rules of a given subjective criteria, context, answer and ground truth (if any ground truth is provided)
    """
    try:
        processed_eval_template: Optional[str] = None
        processed_eval_template = eval_template.format(
            rules=rules,
            answer=answer,
            context=context,
            ground_truth=ground_truth, 
            subjective_criteria=subjective_criteria)
    except Exception as e:
        logger.error(f"Error encountered while generating the evaluation prompt template: {e}")
        processed_eval_template=None
    return processed_eval_template

In [None]:
def clear_dir(dir_path: str):
    files = glob.glob(os.path.join(dir_path, "*"))
    for f in files:
        os.remove(f)

# create the metrics directory that stores all of the json files containing evaluations from all Panel of LLM evaluators
METRICS_PER_POLL_EVAL_DIR: str = os.path.join(METRICS_DIR, METRICS_PER_POLL_EVAL_DIR_NAME)
_ = list(map(clear_dir, [METRICS_PER_POLL_EVAL_DIR]))

In [None]:
def run_llm_evals(i: int, total: int, row: Dict,  model_id: str, eval_method_name: str, uuid: str) -> Dict:
    """
    Runs the evaluation for one row 
    The eval prompt is already available in the row dictionary
    and we simply want to run the inference against the judge model.
    The results are returned in a new dictionary that contains the model 
    response and some fields from the original dictionary
    """
    try: 
        # save all the responses from the model in a dictionary
        resp: Dict = {}
        print(f"run_eval, row {i}/{total}, judge_model_id={model_id}, candidate model={row['endpoint_name']}")
        # create the payload for model inference
        prompt = row[f'{model_id}_{eval_method_name}_eval_prompt']
        # generate the evaluation on the data using the model judge
        resp = get_panel_of_llm_evaluation(model_id, prompt)
        # assign the completion from the candidate model to the `candidate_model_response`, 
        # and the actual evaluation will be contained in a field called `completion`
        resp['candidate_model_response'] = row['candidate_model_response']
        logger.info(f"Panel of LLM evaluator {model_id} completion: {resp['completion']}")
        resp['candidate_model'] = row['endpoint_name']
        if eval_method_name == "max_voting":
            resp['cosine_similarity_score'] = row['cosine_similarity_score']
            resp['levenshtein_distance'] = row['levenshtein_distance']
            resp['token_set_ratio_value'] = row['token_set_ratio_value']
        resp['payload_file'] = row['payload_file']
        # if there is a ground truth (in case of max voting) or 
        # criteria name (in case of average pooling), include those in the json response
        if 'ground_truth' in row:
            resp['ground_truth'] = row['ground_truth']
        if 'criteria_name' in row:
            resp['criteria_name'] = row['criteria_name']
    except Exception as e:
        logger.error(f"Error encountered while running evaluation: {e}")
        resp=None
    return resp

# we use Ray to parallelize
@ray.remote
def async_run_eval(i: int, total: int, row: Dict, model_id: str, eval_method_name: str, uuid: str) -> Dict:
    print(f"async_run_eval, i={i}, total={total}, judge_model_info={model_id}, eval_method: {eval_method_name}, uuid: {uuid}")
    return run_llm_evals(i, total, row, model_id, eval_method_name, uuid)

In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
df_per_inference_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"Total number of candidate model responses going to be evaluated: {len(df_per_inference_list)}")

#### Prepare evaluation prompt templates
---

This portion of the step prepares the evaluation prompt templates that are used in the evaluation process of using `Max Voting` or `Average Pooling` using the PoLL.

In [None]:
model_eval_subjective_info

In [None]:
model_eval_subjective_info.get('subjective_eval_criteria', None)

In [None]:
# Assuming fmbench is a valid Python package and scripts is a subdirectory within it
eval_prompts_dir: str = Path(pkg_resources.files('fmbench'), f"{config['s3_read_data']['prompt_template_dir']}/{config['s3_read_data']['eval_prompts_dir']}")
# Iterate through each LLM as a judge and each evaluation criterion
for llm_info in model_eval_subjective_info.get('judge_panel_list', []):
    model_id = llm_info['model_id']
    method_name = eval_config['model_evaluations']['PoLL_Composition_and_Voting'].get("method", None)
    eval_prompt_template_fname = f"{llm_info.get('eval_prompt_template_name', None)}.txt"

    eval_prompt_template_dir = llm_info.get('eval_prompt_template_dir', None)
    eval_prompt_template_path = os.path.join(eval_prompts_dir, eval_prompt_template_dir, eval_prompt_template_fname)
    logger.info(f"evaluation prompt template file path being used for {model_id}: {eval_prompt_template_path}")
    logger.info(f"evaluation prompt template file name: {eval_prompt_template_fname}")

    try:
        eval_prompt_template = Path(eval_prompt_template_path).read_text()
    except FileNotFoundError:
        logger.error(f"File not found: {eval_prompt_template_path}")
        continue

    logger.info(f"Evaluation prompt template being used: {eval_prompt_template}")

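    eval_instructions_fname = next((rule for rule in config['s3_read_data']['eval_instructions_files'] if method_name in rule), None)
    if eval_instructions_fname is None:
        logger.error(f"No evaluation instructions file found for method: {method_name}")
        continue
    rules = Path(os.path.join(eval_prompts_dir, eval_instructions_fname)).read_text()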
    logger.info(f"rules: {rules}")

    if method_name == "max_voting":
        column_name = f"{model_id}_{method_name}_eval_prompt"
        df_per_inference[column_name] = df_per_inference.apply(
            lambda r: prepare_eval_prompts(
                eval_prompt_template,
                r['candidate_model_response'],
                rules,
                r['prompt'],
                r['ground_truth'],
                ""
            ),
            axis=1
        )

    elif method_name == "avg_pooling":
        criteria_info = model_eval_subjective_info.get('subjective_eval_criteria', None)
        criteria_dir = criteria_info.get('criteria_dir', None)
        criteria_files = criteria_info.get('criteria', None)
        logger.info(f"Iterating through criteria in directory: {criteria_dir}")

        # List to store DataFrames for each criteria
        all_dataframes = []

        # loop through each criteria to form a prompt template for each
        for criteria in criteria_files:
            criteria_file = f"{criteria}.txt"
            criteria_path = os.path.join(eval_prompts_dir, criteria_dir, criteria_file)
            logger.info(f"path to the evaluation criteria: {criteria_path}")

            try:
                subjective_criteria_content = Path(criteria_path).read_text()
                logger.info(f"subjective_criteria_content: {subjective_criteria_content}")
            except FileNotFoundError:
                logger.error(f"Subjective criteria file not found: {criteria_path}")
                continue

            # Create a copy of the original DataFrame for this criteria
            df_criteria = df_per_inference.copy()
            df_criteria['criteria_name'] = criteria

            column_name = f"{model_id}_{method_name}_eval_prompt"
            df_criteria[column_name] = df_criteria.apply(
                lambda r: prepare_eval_prompts(
                    eval_prompt_template,
                    r['candidate_model_response'],
                    rules,
                    r['prompt'],
                    "",
                    subjective_criteria_content
                ),
                axis=1
            )

            all_dataframes.append(df_criteria)

        # Concatenate all the DataFrames
        df_per_inference = pd.concat(all_dataframes, ignore_index=True)


In [None]:
df_per_inference.columns

In [None]:
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_eval_prompt_payloads = csv_buffer.getvalue()
eval_prompt_payloads_for_inference = os.path.join(METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_eval_prompt_payloads, BUCKET_NAME, "", 
            METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)
logger.info(f"Evaluation prompt payloads saved to s3://{BUCKET_NAME}/{eval_prompt_payloads_for_inference}")
df_per_inference.head()

In [None]:
df_per_inference.shape

In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
eval_records_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"Total number of evaluations to be done: {len(eval_records_list)}")

### Run the hierarchy of Model Evaluations
---

In this portion of the step, FMBench performs the following actions:

1. If the evaluation method is `Max Voting`, we assume that a ground truth for the question already exists in the dataset. We first calculate the quantitative metrics. If the cosine similarity, Levenshtein similarity, or token set ratio meets or exceeds its configured threshold (or, failing that, if their average meets the overall threshold), the response is marked `correct` and evaluation moves on to the next record. If none of the thresholds is met, the record moves on to the panel of LLM evaluators.

1. The panel of LLM judges (3 judges in this case) gives a verdict on whether the `answer` from each candidate model is `correct` or `incorrect`.

1. If the evaluation method is `Average Pooling`, the completions from the candidate models are evaluated on more subjective criteria rather than simply judged correct or incorrect against a ground truth. In this case, the Judge Panel uses the average pooling prompt templates to rate each model completion on a scale of 1 to 5 on different criteria, such as relevancy, helpfulness, correctness, and so on.

1. Each evaluation response is returned as a JSON structure, which is used for downstream analytics such as comparing evaluation results across candidate models.

***This step takes about 6 minutes to complete. The completion time depends on the PoLL models being used. `Llama3-70b`, `Cohere command-r-v1` and `claude 3 haiku` were used for this example.***
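
The hierarchy described above can be sketched in a few lines. This is a generic illustration of the idea, not the notebook's implementation: the threshold values, the metric key names, and the helper functions (`lexical_gate`, `max_vote`, `average_pool`, `evaluate`) are assumptions made for this example. The notebook reads its thresholds from `eval_config` and also applies an average-score fallback that is left out here for brevity.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical thresholds; the notebook reads the real ones from eval_config
THRESHOLDS: Dict[str, float] = {
    "cosine_similarity": 0.9,
    "levenshtein": 0.9,
    "token_set_ratio": 0.9,
}


def lexical_gate(scores: Dict[str, float]) -> bool:
    """Return True when any metric meets or exceeds its threshold."""
    return any(scores[name] >= limit for name, limit in THRESHOLDS.items())


def max_vote(verdicts: List[str]) -> str:
    """Majority verdict across judges (an odd panel size avoids ties)."""
    return Counter(verdicts).most_common(1)[0][0]


def average_pool(ratings: List[float]) -> float:
    """Mean rating across judges for one criterion."""
    return sum(ratings) / len(ratings)


def evaluate(scores: Dict[str, float], judge_verdicts: List[str]) -> str:
    """Skip the panel when the lexical gate passes, otherwise take a vote."""
    if lexical_gate(scores):
        return "correct"
    return max_vote(judge_verdicts)


close_match = {"cosine_similarity": 0.95, "levenshtein": 0.2, "token_set_ratio": 0.3}
weak_match = {"cosine_similarity": 0.1, "levenshtein": 0.1, "token_set_ratio": 0.1}

print(evaluate(close_match, ["incorrect", "incorrect", "correct"]))
print(evaluate(weak_match, ["incorrect", "correct", "incorrect"]))
print(average_pool([4, 5, 3]))
```

In the first call the panel verdicts are ignored because the cosine similarity already clears its threshold, which is where the latency and cost savings come from.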

In [None]:
# get the llm as a judge panel list (default to an empty list so that len() does not fail)
judge_panel_list: List[Dict] = model_eval_subjective_info.get('judge_panel_list', [])
logger.info(f"The judge panel list contains {len(judge_panel_list)} judges. Their information: {judge_panel_list}")

In [None]:
logger.info(f"~Panel of LLM evaluators are going to start evaluating responses. This might take a couple of minutes depending on the size of the dataset and candidate model responses~")

In [None]:
def process_record_for_lexical_similarity(method_name: str, record):
    """
    Given a record, this function calculates the average of the token set ratio, Levenshtein
    similarity and cosine similarity. If any one of the three metrics meets or exceeds its own
    threshold, or if their average meets or exceeds the overall threshold, the completion is
    marked correct and an explanation is recorded without going through an LLM evaluator.
    """
    try:
        if method_name == "max_voting":
            # Get the quantitative metrics
            token_set_ratio = record['token_set_ratio_value']
            levenshtein_ratio = record['levenshtein_distance']
            cosine_similarity_score = record['cosine_similarity_score']

            # Calculate the average of the metrics
            average_score = (token_set_ratio + levenshtein_ratio + cosine_similarity_score) / 3

            # Check if any of the quantitative metrics thresholds are met
            is_quantitative_threshold_met = (
                cosine_similarity_score >= eval_config['model_evaluations']['quantitative_eval_info'].get('cosine_similarity_threshold', None) or 
                token_set_ratio >= eval_config['model_evaluations']['quantitative_eval_info'].get('token_set_ratio_threshold', None) or
                levenshtein_ratio >= eval_config['model_evaluations']['quantitative_eval_info'].get('levenshtein_distance_threshold', None)
            )

            # If no individual threshold is met, check if the average meets the overall threshold
            if not is_quantitative_threshold_met:
                is_quantitative_threshold_met = (
                    average_score >= eval_config['model_evaluations']['quantitative_eval_info'].get('overall_eval_threshold', None)
                )

            if is_quantitative_threshold_met:
                verdict = "correct"
                explanation = (
                    f"Lexical match check passed with: "
                    f"Token set ratio = {token_set_ratio * 100}%, "
                    f"Levenshtein similarity match = {levenshtein_ratio * 100}%, "
                    f"Cosine similarity = {cosine_similarity_score:.3f}, "
                    f"Average score = {average_score * 100:.2f}%, not going through a panel of LLM evaluator."
                )
                record.update({
                    'candidate_model': record['endpoint_name'],
                    'completion': f'{{\n  "verdict": "{verdict}",\n  "explanation": "{explanation}"\n}}',
                })
        else:
            is_quantitative_threshold_met=False
            record=record
        logger.debug(f"Processed record: {record}")
    except Exception as e:
        logger.error(f"Error occurred while checking for text similarity: {e}")
        record = None
        is_quantitative_threshold_met = False
    return record, is_quantitative_threshold_met

#### Start the evaluation process
---

In [None]:
n: int = model_eval_subjective_info.get('run_parallel_inference_count', 5)
list_of_lists = [eval_records_list[i * n:(i + 1) * n] for i in range((len(eval_records_list) + n - 1) // n)]
resp_list = []
erroneous_count: int = 0
st: float = time.perf_counter()

# Iterate over the judge panel and sublists
for judge_panelist_info in judge_panel_list:
    logger.info(f"============Running inference for judge panelist {judge_panelist_info['model_id']} for {method_name} ============")
    for idx, sublist in enumerate(list_of_lists):
        model_id: str = judge_panelist_info['model_id']
        logger.info(f"Getting inference for list {idx + 1}/{len(list_of_lists)}, size of list={len(sublist)}")
        # this list will hold all of the records that do not pass the metrics
        # threshold test. The records that will be populated in this list will be used
        # by the LLM evaluators to evaluate
        records_not_meeting_quantitative_metric_threshold = []
        for record in sublist:
            # First, check if the current content of the record matches the threshold for 
            # token set ratio/cosine similarity/levenshtein similarity
            processed_record, is_quantitative_threshold_met = process_record_for_lexical_similarity(method_name, record)
            if is_quantitative_threshold_met:
                # if the quantitative threshold is met, append the updated record with the 
                # already decided verdict in the response list
                resp_list.append(processed_record)
            else:
                records_not_meeting_quantitative_metric_threshold.append(record)
        try:
            # If the quantitative metric thresholds are not met, we pass the records through the LLM
            # evaluator to dive deep and correctly evaluate whether the model candidate response is 
            # correct or incorrect
            if records_not_meeting_quantitative_metric_threshold:
                # Run inference in parallel for non-matching records
                resp_list.extend(ray.get([async_run_eval.remote(i + 1, len(records_not_meeting_quantitative_metric_threshold), record, model_id, method_name, record['uuid'])
                                   for i, record in enumerate(records_not_meeting_quantitative_metric_threshold)]))
        except Exception as e:
            logger.error(f"Error processing list {idx + 1}/{len(list_of_lists)}: {e}")
            erroneous_count += 1
    # Sleep for one second before moving on to the next judge model
    logger.info(f"~Sleeping for one second before the next LLM evaluator on the panel evaluates the responses~")
    time.sleep(1)

elapsed_time = time.perf_counter() - st
logger.info(f"Total elapsed time for inference: {elapsed_time:.2f} seconds")
logger.info(f"Total erroneous lists: {erroneous_count}")

#### Send all Panel of LLM evaluator responses to S3 as `JSON` files
---

In [None]:
# Collect all of the panel of LLM evals and send them all as JSON files
# to s3
if resp_list:
    save_s3_list = []
    try:
        for resp in resp_list:
            # skip evaluations that failed and returned None
            if resp is None:
                continue
            llm_eval_response = json.dumps(resp, indent=2)
            candidate_model_id = resp.get('candidate_model', None)
            # Extract a few words from the poll eval response to append to the file name
            response_excerpt = " ".join(resp.get('candidate_model_response', "").split()[:5])
            sanitized_response_excerpt = "".join([c if c.isalnum() else "_" for c in response_excerpt])
            llm_eval_json_fname = f"{candidate_model_id}_{time.time()}_{sanitized_response_excerpt}.json"
            response_s3_path = os.path.join(METRICS_PER_POLL_EVAL_DIR, llm_eval_json_fname)
            logger.info(f"Sending model eval result files to s3 path prefix: {response_s3_path}")
            save_s3_list.append((llm_eval_response,
                                config['aws']['bucket'],
                                "",
                                METRICS_PER_POLL_EVAL_DIR,
                                llm_eval_json_fname))
        write_multiple_to_s3(save_s3_list)
    except Exception as e:
        logger.error(f"Error processing or writing to S3: {e}")
else:
    logger.info("No responses to write to S3")

### Perform downstream analytical tasks on each PoLL evaluation result
---

In [None]:
# convert the results list into a dataframe for easy analytics
df_eval_results = pd.DataFrame(resp_list)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how='all')
# the exception, judge model id, prompt token count, will be NaN for the verdicts decided
# using the lexical match and not moved forward to the panel of LLM evaluators
df_eval_results.head()

In [None]:
# parse out the completion from LLM as a judge and column bind
# the fields of the dictionary to the original results dataframe
df_eval_results_only = df_eval_results['completion'].apply(parse_as_json).apply(pd.Series)
df_eval_results_only.dropna(axis=1, how='all')
df_eval_results = pd.concat([df_eval_results, df_eval_results_only], axis=1)
df_eval_results.rename(columns={'model_id': 'judge_model_id'}, inplace=True)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how='all')
df_eval_results.head()

In [None]:
# send the raw results as a csv file to the S3 bucket
csv_buffer = io.StringIO()
df_eval_results.to_csv(csv_buffer, index=False)
eval_llm_as_a_judge_results = csv_buffer.getvalue()
eval_results_csv_fpath = os.path.join(METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(eval_llm_as_a_judge_results, BUCKET_NAME, "", 
            METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)
logger.info(f"Per PoLL model responses saved as a csv to s3://{BUCKET_NAME}/{eval_results_csv_fpath}")
df_eval_results.head()

In [None]:
logger.info(f"Shape of the dataframe containing all evaluations: {df_eval_results.shape}")

### Send the incorrect and correct responses to S3 separately in `CSV` files for downstream analytics for each model judge
---

In [None]:
def save_evaluation_verdicts_and_ratings_to_s3(df_eval_results: pd.DataFrame, method_name: str) -> Optional[pd.DataFrame]:
    """
    This function sends the evaluation results to s3 based on the method name (either max_voting
    or avg_pooling). For max voting, all of the correct and incorrect verdicts are saved in separate
    CSV files, along with a text file per judge and verdict. For average pooling, a pivoted table of
    the ratings per criteria and the overall rating is saved as a CSV file.
    """
    try:
        result_df: Optional[pd.DataFrame] = None
        if method_name == 'max_voting':
            logger.info(f"Method name is {method_name}, sending the correct and incorrect verdicts to s3")
            verdict_types: List[str] = ['incorrect', 'correct']
            all_verdicts_df = pd.DataFrame()
            for verdict in verdict_types:
                df_verdicts = df_eval_results[df_eval_results['verdict'] == verdict]
                all_verdicts_df = pd.concat([all_verdicts_df, df_verdicts])

                if not df_verdicts.empty:
                    csv_buffer = io.StringIO()
                    df_verdicts.to_csv(csv_buffer, index=False)
                    verdict_responses = csv_buffer.getvalue()
                    verdict_file = INCORRECT_VERDICT_RESPONSES_FILE if verdict == 'incorrect' else CORRECT_VERDICT_RESPONSES_FILE
                    verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
                    write_to_s3(verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file)
                    logger.info(f"{verdict.capitalize()} verdict responses sent to s3://{BUCKET_NAME}/{verdict_responses_fpath}")
                    logger.info(f"Number of {verdict} responses in total: {df_verdicts.shape[0]}")

            result_df = all_verdicts_df

            judge_model_ids = df_eval_results['judge_model_id'].unique()
            for judge_model_id in judge_model_ids:
                for verdict in verdict_types:
                    df_judge_verdict = df_eval_results[(df_eval_results['verdict'] == verdict) & (df_eval_results['judge_model_id'] == judge_model_id)]

                    if not df_judge_verdict.empty:
                        txt_buffer = io.StringIO()
                        for index, row in df_judge_verdict.iterrows():
                            txt_buffer.write(
                                f"candidate model: {row['candidate_model']}\n"
                                f"candidate model response: {row['candidate_model_response']}\n"
                                f"ground truth: {row['ground_truth']}\n"
                                f"verdict and explanation: {row['completion']}\n\n"
                            )
                        judge_verdict_responses = txt_buffer.getvalue()
                        verdict_file = f"{judge_model_id}_{verdict}_verdicts_evaluation.txt"
                        judge_verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
                        write_to_s3(judge_verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file)
                        logger.info(f"{verdict.capitalize()} verdict responses for judge {judge_model_id} saved to s3://{BUCKET_NAME}/{judge_verdict_responses_fpath}")

        # if the eval method is average pooling, get the pivoted table containing each eval criteria, and
        # the overall rating for that candidate model response
        elif method_name == 'avg_pooling':
            logger.info(f"Method name is {method_name}, sending the different criteria evals to s3")
            df_avg_pooling = df_eval_results.pivot_table(
                index=['candidate_model', 'judge_model_id', 'candidate_model_response', 'payload_file'],
                columns='criteria_name',
                values='eval_rating',
                aggfunc='mean'
            ).reset_index()

            # Ensure all columns are numeric for mean calculation
            numeric_columns = df_avg_pooling.select_dtypes(include='number').columns
            df_avg_pooling['overall_eval_rating'] = df_avg_pooling[numeric_columns].mean(axis=1)
            csv_buffer = io.StringIO()
            df_avg_pooling.to_csv(csv_buffer, index=False)
            avg_pooling_eval_responses = csv_buffer.getvalue()
            avg_pooling_responses_fpath = os.path.join(METRICS_DIR, AVERAGE_POOLING_ALL_EVALS)
            write_to_s3(avg_pooling_eval_responses, BUCKET_NAME, "", METRICS_DIR, AVERAGE_POOLING_ALL_EVALS)
            logger.info(f"Average pooling evaluation responses sent to s3://{BUCKET_NAME}/{avg_pooling_responses_fpath}")
            result_df = df_avg_pooling
    except Exception as e:
        logger.error(f"Error encountered while writing the evaluation responses to s3: {e}")
        result_df = None
    return result_df

In [None]:
# for max_voting, save the correct and incorrect verdict files to S3; for avg_pooling, save the pivoted ratings per criteria
avg_pooling_eval_df = save_evaluation_verdicts_and_ratings_to_s3(df_eval_results, method_name)
avg_pooling_eval_df
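
To make the average pooling branch above more concrete, here is a small sketch of the same pivot on toy data. The ratings, model names, and judge names are made up for illustration; only the column names follow the function above.

```python
import pandas as pd

# Toy ratings: one candidate model, two judges, two criteria (made-up values)
ratings = pd.DataFrame({
    "candidate_model": ["model-a"] * 4,
    "judge_model_id": ["judge-1", "judge-1", "judge-2", "judge-2"],
    "criteria_name": ["relevance", "helpfulness", "relevance", "helpfulness"],
    "eval_rating": [5, 4, 3, 4],
})

# One row per (candidate, judge), one column per criterion
pivoted = ratings.pivot_table(
    index=["candidate_model", "judge_model_id"],
    columns="criteria_name",
    values="eval_rating",
    aggfunc="mean",
).reset_index()

# The overall rating is the mean across the numeric criterion columns
numeric_columns = pivoted.select_dtypes(include="number").columns
pivoted["overall_eval_rating"] = pivoted[numeric_columns].mean(axis=1)

# Rows that did not receive a perfect overall rating
non_perfect = pivoted[pivoted["overall_eval_rating"] < 5]
print(pivoted)
```

Here `judge-1` averages 4.5 and `judge-2` averages 3.5, so both rows would be reported as non-perfect ratings.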

In [None]:
def non_perfect_overall_rating_counts(df_eval_results: pd.DataFrame, method_name: str) -> Optional[pd.DataFrame]:
    """
    For the "avg_pooling" method, this function saves to s3 and returns the rows whose overall
    evaluation rating is below 5. For "max_voting" it returns None.
    """
    try:
        result_df: Optional[pd.DataFrame] = None
        if method_name == 'max_voting':
            logger.info(f"Method name is {method_name}. Cannot get evaluation ratings since that is done for average pooling")
            return
        # if the eval method is average pooling, get the pivoted table containing each eval criteria, and
        # the overall rating for that candidate model response
        elif method_name == 'avg_pooling':
            logger.info(f"Method name is {method_name}, extracting entries with non perfect ratings (< 5) and sending them to s3")
            result_df = df_eval_results[df_eval_results['overall_eval_rating'] < 5]
            csv_buffer = io.StringIO()
            result_df.to_csv(csv_buffer, index=False)
            non_perfect_ratings_avg_pooling_eval_responses = csv_buffer.getvalue()
            non_perfect_ratings_avg_pooling_responses_fpath = os.path.join(METRICS_DIR, NON_PERFECT_RATING_RESPONSES)
            write_to_s3(non_perfect_ratings_avg_pooling_eval_responses, BUCKET_NAME, "", METRICS_DIR, NON_PERFECT_RATING_RESPONSES)
            logger.info(f"All evaluation ratings below 5 are sent to s3://{BUCKET_NAME}/{non_perfect_ratings_avg_pooling_responses_fpath}")
    except Exception as e:
        logger.error(f"Error encountered while writing the non perfect evaluation ratings to s3: {e}")
        result_df = None
    return result_df

In [None]:
non_perfect_overall_rating_counts(avg_pooling_eval_df, method_name)

#### Check each LLM evaluator's verdict counts on the dataset
---

In [None]:
def generate_panel_summary_responses(df_eval_results: pd.DataFrame, method_name: str) -> pd.DataFrame:
    """
    This function is used when the evaluation type is max voting. For each judge, candidate model
    and verdict, it gives the count of responses along with the mean cosine similarity score,
    levenshtein distance and token set ratio
    """
    if method_name != 'max_voting':
        logger.info(f"Evaluation method is set to {method_name}, exiting out of this function")
        return pd.DataFrame()

    panel_summary_responses_df = df_eval_results.groupby(['judge_model_id', 'candidate_model', 'verdict']).agg(
        count=('verdict', 'size'),
        mean_cosine_similarity=('cosine_similarity_score', 'mean'),
        mean_levenshtein_distance=('levenshtein_distance', 'mean'),
        mean_token_set_ratio=('token_set_ratio_value', 'mean')
    ).unstack(fill_value=0).stack().reset_index()
    csv_buffer = io.StringIO()
    panel_summary_responses_df.to_csv(csv_buffer, index=False)
    panel_summary_responses = csv_buffer.getvalue()
    llm_as_a_judge_per_eval_summary_fpath = os.path.join(METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)

    write_to_s3(panel_summary_responses, BUCKET_NAME, "", METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)
    logger.info(f"Summary on each eval (max voting/average pooling) for each panel judge sent to s3://{BUCKET_NAME}/{llm_as_a_judge_per_eval_summary_fpath}")

    return panel_summary_responses_df

panel_summary_responses_df = generate_panel_summary_responses(df_eval_results, method_name)
panel_summary_responses_df.head()

#### Calculate the overall accuracy of each model scored by the PoLL
---

In [None]:
def generate_per_panel_judgement_result_df(panel_summary_responses_df: pd.DataFrame, method_name: str) -> pd.DataFrame:
    """
    This function is used to get the per panel judgement on each candidate model in terms of
    how many responses were correct (accuracy) and how many were incorrect (error rate)
    """
    if method_name != 'max_voting':
        logger.info(f"Evaluation method is set to {method_name}, exiting out of this function")
        return pd.DataFrame()

    per_panel_judgement_result_df = panel_summary_responses_df.pivot_table(
        index=['candidate_model', 'judge_model_id'],
        columns='verdict',
        values='count',
        fill_value=0
    ).reset_index()

    # Ensure 'correct' and 'incorrect' columns exist
    if 'correct' not in per_panel_judgement_result_df.columns:
        per_panel_judgement_result_df['correct'] = 0
    if 'incorrect' not in per_panel_judgement_result_df.columns:
        per_panel_judgement_result_df['incorrect'] = 0

    # Calculate accuracy and error rate
    per_panel_judgement_result_df = per_panel_judgement_result_df.assign(
        accuracy=lambda df: df.apply(lambda row: 100 if row['incorrect'] == 0 else round(row['correct'] / (row['correct'] + row['incorrect']), 2) * 100, axis=1),
        error_rate=lambda df: df.apply(lambda row: 0 if row['incorrect'] == 0 else round(row['incorrect'] / (row['correct'] + row['incorrect']), 2) * 100, axis=1)
    )

    return per_panel_judgement_result_df

per_panel_judgement_result_df = generate_per_panel_judgement_result_df(panel_summary_responses_df, method_name)
per_panel_judgement_result_df.head()
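
As a worked example of the arithmetic, the sketch below computes accuracy and error rate per judge from verdict counts and then averages them across the panel, which is how the overall accuracy is derived in the next cell. The counts and judge names are invented for illustration, and `accuracy_and_error_rate` is a hypothetical helper rather than a function from the notebook.

```python
from typing import Dict, Tuple


def accuracy_and_error_rate(correct: int, incorrect: int) -> Tuple[float, float]:
    """Return (accuracy, error rate) as percentages for one judge."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("at least one verdict is required")
    return round(100 * correct / total, 2), round(100 * incorrect / total, 2)


# Made-up verdict counts for one candidate model: judge -> (correct, incorrect)
verdict_counts: Dict[str, Tuple[int, int]] = {
    "judge-1": (8, 2),
    "judge-2": (9, 1),
    "judge-3": (7, 3),
}

per_judge = {
    judge: accuracy_and_error_rate(correct, incorrect)
    for judge, (correct, incorrect) in verdict_counts.items()
}

# Panel-level accuracy is the mean of the per-judge accuracies
overall_accuracy = sum(acc for acc, _ in per_judge.values()) / len(per_judge)
print(per_judge)
print(overall_accuracy)
```

With these counts the three judges report 80%, 90% and 70% accuracy, which averages to 80% for the candidate model.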

In [None]:
def save_overall_accuracy_metrics(df_eval_results: pd.DataFrame, per_panel_judgement_result_df: pd.DataFrame, method_name: str) -> pd.DataFrame:
    if method_name != 'max_voting':
        logger.info(f"Evaluation method is set to {method_name}, exiting out of this function")
        return pd.DataFrame()

    mean_cosine_similarity = df_eval_results.groupby('candidate_model')['cosine_similarity_score'].mean().reset_index().rename(columns={'cosine_similarity_score': 'mean_cosine_similarity'})
    mean_levenshtein_distance = df_eval_results.groupby('candidate_model')['levenshtein_distance'].mean().reset_index().rename(columns={'levenshtein_distance': 'mean_levenshtein_distance'})
    mean_token_set_ratio = df_eval_results.groupby('candidate_model')['token_set_ratio_value'].mean().reset_index().rename(columns={'token_set_ratio_value': 'mean_token_set_ratio_value'})

    overall_accuracy_grouped_panel_df = per_panel_judgement_result_df.groupby('candidate_model')[['accuracy', 'error_rate']].mean().reset_index()
    overall_accuracy_grouped_panel_df = (
        pd.merge(mean_cosine_similarity, overall_accuracy_grouped_panel_df, on='candidate_model')
        .merge(mean_levenshtein_distance, on='candidate_model')
        .merge(mean_token_set_ratio, on='candidate_model')
        .sort_values(by='accuracy', ascending=False)
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    overall_accuracy_grouped_panel_df.to_csv(csv_buffer, index=False)
    overall_panel_result = csv_buffer.getvalue()
    overall_panel_accuracy_metrics_fpath = os.path.join(METRICS_DIR, PER_MODEL_ACCURACY_POLL)

    write_to_s3(overall_panel_result, BUCKET_NAME, "", METRICS_DIR, PER_MODEL_ACCURACY_POLL)
    logger.info(f"Overall accuracy and error rates results of each model sent to s3://{BUCKET_NAME}/{overall_panel_accuracy_metrics_fpath}")

    return overall_accuracy_grouped_panel_df

overall_accuracy_grouped_panel_df = save_overall_accuracy_metrics(df_eval_results, per_panel_judgement_result_df, method_name)
overall_accuracy_grouped_panel_df.head()

In [None]:
def generate_and_save_accuracy_statement(overall_accuracy_grouped_panel_df: pd.DataFrame, per_panel_judgement_result_df: pd.DataFrame, method_name: str):
    if method_name != 'max_voting':
        logger.info(f"Evaluation method is set to {method_name}, exiting out of this function")
        return

    # Rank models by accuracy
    ranked_models = overall_accuracy_grouped_panel_df.sort_values(by='accuracy', ascending=False)
    highest_accuracy = ranked_models['accuracy'].max()

    # Group models with the highest accuracy
    top_performers = ranked_models[ranked_models['accuracy'] == highest_accuracy]
    other_models = ranked_models[ranked_models['accuracy'] < highest_accuracy]
    final_ranking = pd.concat([top_performers, other_models])
    unique_judge_model_ids = per_panel_judgement_result_df['judge_model_id'].unique()
    PoLL_model_ids = ', '.join(map(str, unique_judge_model_ids))
    top_performing_model_ids = ', '.join(top_performers['candidate_model'].tolist())

    # Cosine similarity score data
    highest_cosine_model = final_ranking.loc[final_ranking['mean_cosine_similarity'].idxmax()]
    highest_cosine_model_name = highest_cosine_model['candidate_model']
    highest_cosine_similarity = highest_cosine_model['mean_cosine_similarity']

    # Levenshtein distance data
    highest_levenshtein_model = final_ranking.loc[final_ranking['mean_levenshtein_distance'].idxmin()]
    highest_levenshtein_model_name = highest_levenshtein_model['candidate_model']
    highest_levenshtein_distance = highest_levenshtein_model['mean_levenshtein_distance']

    # Token set ratio data
    highest_token_set_ratio_model = final_ranking.loc[final_ranking['mean_token_set_ratio_value'].idxmax()]
    highest_token_set_ratio_model_name = highest_token_set_ratio_model['candidate_model']
    highest_token_set_ratio_value = highest_token_set_ratio_model['mean_token_set_ratio_value']

    if other_models.empty:
        other_models_statement = f"All models performed the same with an accuracy of {highest_accuracy:.2f}."
    else:
        other_models_statement = other_models.to_string(index=False)

    # Create the accuracy statement
    accuracy_statement = MAX_VOTING_RESULT_STATEMENT.format(
        judge_model_ids=PoLL_model_ids,
        highest_accuracy=highest_accuracy,
        top_models=top_performers.to_string(index=False),
        highest_cosine_similarity=round(highest_cosine_similarity, 4),
        top_cosine_similarity_model=highest_cosine_model_name,
        highest_levenshtein_distance=round(highest_levenshtein_distance, 4),
        top_levenshtein_model=highest_levenshtein_model_name,
        highest_token_set_ratio_value=round(highest_token_set_ratio_value, 4),
        top_token_set_ratio_model=highest_token_set_ratio_model_name,
        ranked_models=other_models_statement,
        top_performing_model_ids=top_performing_model_ids
    )

    # Send the overall accuracy report to S3
    txt_buffer = io.StringIO()
    txt_buffer.write(accuracy_statement)
    poll_txt_file_content = txt_buffer.getvalue()
    overall_panel_accuracy_metrics_fpath = os.path.join(METRICS_DIR, OVERALL_POLL_REPORT)
    write_to_s3(poll_txt_file_content, BUCKET_NAME, "", METRICS_DIR, OVERALL_POLL_REPORT)
    logger.info(f"Overall PoLL accuracy report sent to s3://{BUCKET_NAME}/{overall_panel_accuracy_metrics_fpath}")
    print(accuracy_statement)

generate_and_save_accuracy_statement(overall_accuracy_grouped_panel_df, per_panel_judgement_result_df, method_name)

### Send all responses from the evaluation process to S3 as a txt file for further downstream processing and readability purposes
---

In [None]:
def write_all_evaluations_to_s3(df_eval_results: pd.DataFrame, method_name: str) -> Optional[str]:
    if method_name != 'max_voting':
        logger.info(f"Evaluation method is set to {method_name}, exiting out of this function")
        return None
    # Write all explanations to a file and send to S3
    explanations_txt_buffer = io.StringIO()
    for index, row in df_eval_results.iterrows():
        explanations_txt_buffer.write(
            f"candidate model: {row['candidate_model']}\n"
            f"candidate model response: {row['candidate_model_response']}\n"
            f"ground truth: {row['ground_truth']}\n"
            f"verdict and explanation: {row['completion']}\n\n"
        )

    explanations_txt_file_content = explanations_txt_buffer.getvalue()
    explanations_fpath = os.path.join(METRICS_DIR, ALL_EVALUATIONS_IN_TXT)
    write_to_s3(explanations_txt_file_content, BUCKET_NAME, "", METRICS_DIR, ALL_EVALUATIONS_IN_TXT)
    logger.info(f"All text eval content from the llm judge panelists sent to s3://{BUCKET_NAME}/{explanations_fpath}")
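    # Log the content that was written (the original referenced an undefined name, `evaluations_content`)
    logger.info(f"All of the content including the candidate model responses, ground truth, evaluation are written: {explanations_txt_file_content}")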
    return explanations_txt_file_content

write_all_evaluations_to_s3(df_eval_results, method_name)

### Evaluation of Evaluations (for max voting debugging purposes)
---

Reviewing every evaluation by hand is possible but tedious. Instead, this portion of the step does the following:

1. Passes all of the evaluations, in chunks, through a model on Amazon Bedrock to check whether each evaluation was done correctly. For example, for Max Voting it checks whether each verdict (correct or incorrect) is consistent with the comparison between the candidate model response and the ground truth.

1. Scans for false positive or false negative evaluations. If any are found, the evaluation prompts can be refined to give more accurate results.

1. Gives a summary of all evaluations.
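
As a rough illustration of the chunking mentioned in the first step, the sketch below splits text into word-based chunks whose neighbors share a fixed number of words. It follows the same idea as the commented-out `split_into_chunks` helper in the next cell, with two assumptions added here that are not in the original: a guard against an overlap that is not smaller than the chunk size, and an early exit once the end of the text is covered so that no trailing chunk merely repeats words already seen. The sample text and the small sizes are for demonstration only.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        # Assumption: reject settings that would give a zero or negative step
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            # The remaining words are already included in this chunk
            break
    return chunks


# Illustrative input only; the notebook uses much larger sizes (50,000 and 10,000 words)
sample = "one two three four five six seven eight nine ten"
demo_chunks = split_into_chunks(sample, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

With the overlap, a verdict and the response it refers to are less likely to be separated at a chunk boundary, at the cost of the model seeing some evaluations twice.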

In [None]:
# # Get the content of the explanations
# explanations_txt_file_content = evaluations_content.getvalue()
# final_summarizer_model_id: str = model_eval_subjective_info.get("final_evaluation_summarizer", None)

# # Split the explanations into chunks of 50,000 words with some overlap
# chunk_size: int = 50000
# chunk_overlap: int = 10000


# def split_into_chunks(text, chunk_size, chunk_overlap):
#     words = text.split()
#     chunks = []
#     for i in range(0, len(words), chunk_size - chunk_overlap):
#         chunk = ' '.join(words[i:i + chunk_size])
#         chunks.append(chunk)
#     return chunks


# chunks = split_into_chunks(explanations_txt_file_content, chunk_size, chunk_overlap)
# num_chunks = len(chunks)

# # Log the number of chunks created
# logger.info(f"Number of chunks created: {num_chunks}")

# # Load the final evaluation prompt template
# final_eval_prompt_template = os.path.join(eval_prompts_dir, model_eval_subjective_info.get("final_evaluation_prompt_template", None))
# final_eval_prompt_content = Path(final_eval_prompt_template).read_text()
# logger.info(f"Prompt being used to evaluate all evaluations: {final_eval_prompt_content}")

# # Evaluate each chunk with the model on Amazon Bedrock and collect the responses
# responses = []
# for chunk in chunks:
#     prompt_content = final_eval_prompt_content.format(context=chunk)
#     response = completion(model=final_summarizer_model_id,
#                           messages=[{"content": prompt_content, "role": "user"}],
#                           temperature=INFERENCE_PARAMETERS_LLM_PANEL.get('temperature', 0.1),
#                           max_tokens=1000,
#                           caching=INFERENCE_PARAMETERS_LLM_PANEL.get('caching', False))
#     responses.append(response['choices'][0]['message']['content'])

# # Combine all the responses into a single text
# combined_response = '\n\n'.join(responses)

# # Write the combined response to a new file and upload to S3
# combined_response_buffer = io.StringIO(combined_response)
# combined_response_content = combined_response_buffer.getvalue()
# combined_response_fpath = os.path.join(METRICS_DIR, "eval_of_evaluations_summary_debug.txt")
# write_to_s3(combined_response_content, BUCKET_NAME, "", METRICS_DIR, "eval_of_evaluations_summary_debug.txt")

# # Log the S3 path and print the combined response
# logger.info(f"Combined evaluations response sent to s3://{BUCKET_NAME}/{combined_response_fpath}")
# print(combined_response_content)

In [None]:
# print(combined_response)