## Evaluate inferences generated by candidate models: gather findings on quantitative metrics (such as _Cosine Similarity, Levenshtein distance, and token set ratio_) and subjective metrics using Majority Voting with PoLL (Panel of LLM Evaluators)

---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

#### This step of the solution evaluates the quality of the responses generated by the candidate models. It does so by performing the steps below:

- **Gets the inference request file that contains all results from the inference stage**: The inference request results file containing all inferences from each candidate model is fetched along with the associated metrics such as ground truth (if any), source payload file, concurrency level, etc.

- **Generates quantitative metrics for evaluation**: Calculates quantitative metrics that measure similarity to the ground truth, using _Cosine Similarity, Levenshtein distance, and token set ratio_. Cosine similarity measures how similar two vectors are, regardless of their magnitude. Levenshtein distance is a string metric that measures the difference between two sequences. The token set ratio algorithm tokenizes both input strings, removes duplicate tokens, and calculates a similarity score. Together these give a quantitative score across the entire dataset that indicates which model generates outputs closest to the provided ground truth. We use these metrics to build a hierarchical evaluation decision tree: a response moves up to the next step of evaluation only if its correctness cannot be determined from the quantitative metrics alone. 
    
    The steps followed in this evaluation hierarchy (for Majority Voting) are given below:
    
    1. First, check whether the _Cosine Similarity, Levenshtein similarity, or token set ratio_ value, or the average of all three, exceeds a given threshold. If so, users can choose to make decisions based on the quantitative metrics alone. This means that if all quantitative metric values surpass the threshold, those answers can be marked `correct` automatically without passing them through an LLM evaluator. This saves cost and latency, but can misjudge some edge cases.
    
    1. The remaining answers, which are not obviously correct or have no semantic relation to the ground truth, move to the next step in the hierarchical tree: a panel of LLM evaluators. Responses that meet or exceed all quantitative metric thresholds can also be passed through the LLM evaluator stage.

- **Use a _Panel of LLM Evaluators_ approach to get subjective evaluations**: Refer to this [paper](https://arxiv.org/pdf/2404.18796). We use the following method to evaluate the responses from the `candidate models` (the models used to generate inferences):

- **Majority Voting**: When a dataset provides a ground truth, FMBench uses a technique called `Majority Voting`. Here, we use a PoLL, _or panel of LLM evaluators_, drawn from different model families to judge whether each candidate model's response is `correct` or `incorrect`, based solely on a comparison with the ground truth. Using models from different model families as a PoLL brings the evaluation closer to human-level judgment, keeps it consistent across all responses, and can reduce the latency and cost of evaluating the candidate models over time. Intra-model bias is also reduced, since more than one model acts as an evaluator. FMBench uses [the majority voting evaluation instructions](prompt_template/eval_criteria/evaluation_instructions_majority_vote.txt), which are fed into the prompt templates supplied to the different judge models to evaluate responses at runtime.
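
The tallying step of a majority vote can be sketched as follows. This is only an illustration: the function name `majority_verdict`, the verdict strings, and the rule that a tie yields no verdict are assumptions made for this example, not FMBench's implementation.

```python
from collections import Counter
from typing import List, Optional


def majority_verdict(verdicts: List[Optional[str]]) -> Optional[str]:
    """Return the verdict most judges agree on, or None on a tie or when no valid votes exist."""
    # Ignore judges whose response could not be parsed (represented here as None)
    valid = [v.strip().lower() for v in verdicts if v]
    if not valid:
        return None
    counts = Counter(valid).most_common()
    # Hypothetical rule: a tie between the top two verdicts means there is no majority
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Three judges from different model families vote on one candidate response
panel_votes = ["correct", "incorrect", "correct"]
print(majority_verdict(panel_votes))
```

Using an odd number of judges avoids ties for a binary verdict, as long as every judge returns a parseable response.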
       
***All evaluations are generated in a JSON format for further downstream analytics on the evaluation results***

#### Import all of the necessary libraries below to run this notebook

In [None]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [None]:
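import io
# re, boto3 and logging are used later in this notebook; import them explicitly
# rather than relying on the wildcard imports below
import re
import boto3
import logging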
import ray
import time
import json
import glob
import yaml
import pandas as pd
from numpy import dot
from pathlib import Path
from fuzzywuzzy import fuzz
from fmbench.utils import *
from fmbench.globals import *
from numpy.linalg import norm
from litellm import completion
from typing import List, Optional, Dict
import importlib.resources as pkg_resources
from sentence_transformers import SentenceTransformer

In [None]:
# set a logger to get logs
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

Load the Config.yml file, which contains information used across this benchmarking environment, such as information about the AWS account, prompts, and payloads to be used for invocations.

In [None]:
logger.info(f"CONFIG_FILE={CONFIG_FILE}")
config = load_main_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

#### Load the associated pricing config file

In [None]:
# represents getting the config file from the s3 bucket/https path for pricing yml information
pricing_file_path: str = config['pricing'] 

# initialize the pricing config file to None
pricing_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

pricing_module = Path(config['pricing'])
logger.info(f"pricing config provided for inference from this model is --> {pricing_module}")
pricing_file_path = os.path.join(config_dir, pricing_module)
logger.info(f"pricing config file path is --> {pricing_file_path}")

pricing_config = load_config(pricing_file_path)
logger.info(f"pricing config file recorded: {json.dumps(pricing_config, indent=2)}")

### Load the model evaluation information
---

The common model configuration file contains information about which evaluation strategy to use (`majority voting`), 
the ground truth column (if the user provided one in the config file used in the experiment), which FMs on Bedrock to use as
LLM evaluators, the prompt templates used by each of them for Majority voting, the quantitative metric thresholds
for an answer to be considered correct, inference parameters, and more. 

In [None]:
# represents the path of the model evaluation yml file referenced in the main config
model_eval_fpath: str = config['model_evaluations']

# initialize the eval config to None
eval_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files('fmbench'), 'configs')
logger.info(f"Using fmbench.configs directory: {config_dir}")

eval_module = Path(config['model_evaluations'])
logger.info(f"eval config provided for evaluation --> {eval_module}")
eval_file_path = os.path.join(config_dir, eval_module)
logger.info(f"eval config file path is --> {eval_file_path}")

# eval_config = load_config(eval_file_path).format(method_name=config['method_name'])
with open(eval_file_path, 'r') as file:
    model_eval_info = file.read()
    # load the preliminary unformatted config file to fetch the method name and plug it into
    # the prompt template file names
    model_eval_info_config =  yaml.safe_load(model_eval_info)
    model_eval_formatted_content = model_eval_info.format(ground_truth=config['datasets'].get('ground_truth_col_key', None),
                                                          method_name=model_eval_info_config['model_evaluations']['PoLL_Composition_and_Voting'].get('method', None))
    eval_config = yaml.safe_load(model_eval_formatted_content)

# view all information that will be used in the evaluation process, which includes the ground truth
# in the dataset, the evaluation method (Majority voting) and associated information
logger.info(f"eval config file recorded: {json.dumps(eval_config, indent=2)}")

In [None]:
debug = False
if debug is True:
    metrics_path_file: str = os.path.join("..", "..", METADATA_DIR, METRICS_PATH_FNAME)
else:
    metrics_path_file: str = os.path.join(METADATA_DIR, METRICS_PATH_FNAME)
logger.info(f"cwd={os.getcwd()}, METADATA_DIR={METADATA_DIR}, METRICS_PATH_FNAME={METRICS_PATH_FNAME}, metrics_path_file={metrics_path_file}")
METRICS_DIR: str = Path(metrics_path_file).read_text().strip()
logger.info(f"metrics_path_file={metrics_path_file}, METRICS_DIR={METRICS_DIR}")

In [None]:
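# read the per inference results file written by the inference step of this run.
# To evaluate an earlier run instead, set file_path to that run's S3 key, for example:
# file_path: str = 'fmbench-bedrock-llama3-stream-responses-fmbench-stack-us-west-2-role/data/metrics/yyyy=2024/mm=08/dd=02/hh=01/mm=09/per_inference_request_results.csv'
file_path: str = os.path.join(METRICS_DIR, config["report"]["per_inference_request_file"])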
logger.info(f"File path containing the metrics per inference folder --> {file_path}")

# Read the file from S3
try:
    file_content = get_s3_object(config['aws']['bucket'], file_path)
    # Use pandas to read the CSV content
    df_per_inference = pd.read_csv(io.StringIO(file_content))
    logger.info(f"{file_path} read into dataframe of shape {df_per_inference.shape}, "
                f"cols={df_per_inference.columns}")
    logger.info(f"{file_path} contains results for the following endpoints={df_per_inference.endpoint_name.unique()}")
    logger.info(df_per_inference.head())
except Exception as e:
    logger.error(f"Error reading from S3: {e}")


In [None]:
logger.info(f"Going to be using this inference file to generate evaluations on -> {df_per_inference.head()}")

### Relationship between prompt token length and inference latency for different instances and concurrency levels

In [None]:
logger.info(f"Information on the inference file being used for evaluations: {df_per_inference.latency.describe()}")

In [None]:
logger.info(f"Total number of inferences to evaluate from candidate models: {df_per_inference.shape[0]}")

### Use the `sentence-transformers/all-mpnet-base-v2` embeddings model to calculate the _Cosine Similarity_ scores 
---

This portion of the evaluation step does the following:

1. Loads the `sentence-transformers/all-mpnet-base-v2` model from Hugging Face. This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

1. Uses the embeddings model to compute quantitative metrics from the inferences: a similarity score between the ground truth answers in the dataset (if any are given) and the responses the candidate model returned during inference.
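
To make the metric concrete, here is a small worked example of cosine similarity on toy vectors, written with only the standard library. The three-dimensional vectors are made up for illustration; real embeddings from the model have 768 dimensions, and the helper name `cosine_similarity` is an assumption for this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


# Same direction, different magnitude: the score ignores vector length
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors share nothing in common
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

The first pair shows why the metric suits text of different lengths: scaling a vector does not change its direction, so it does not change the score.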

In [None]:
# get the quantitative evaluation information from the config file, such as the embeddings model
# to be used
embeddings_model_quantitative_info: Dict = eval_config['model_evaluations']['quantitative_eval_info']


def load_model():
    """
    This function loads the sentence-transformers model based on the provided model ID.
    """
    try:
        model=None
        model_id = embeddings_model_quantitative_info['embeddings_model_id'].get('model_id', None)
        if model_id:
            model = SentenceTransformer(model_id)
        else:
            raise ValueError("Model ID is not provided or invalid in the configuration.")
    except Exception as e:
        logger.error(f"The SentenceTransformer embeddings model could not be loaded: {e}")
        model=None
    return model

In [None]:
# load the embeddings model to calculate the cosine similarity scores
model = load_model()
logger.info(f"Embeddings model info which will be used to calculate the cosine similarity scores for Majority Voting Eval: {model}")

In [None]:
def calculate_cosine_similarity(text1: str, text2: str) -> float:
    """
    This function calculates the cosine similarity between two texts. In this case, 
    the cosine similarity is the comparison between the ground truth in the given dataset
    and the candidate model's response
    """
    try:
        cosine_similarity_score: float = None
        # returns the embedding for a given text using the sentence-transformers model.
        A = model.encode([text1])[0]
        B = model.encode([text2])[0]
        cosine_similarity_score = dot(A, B) / (norm(A) * norm(B))
        logger.info(f"Calculating the cosine similarity score, current score: {cosine_similarity_score}")
    except Exception as e:
        logger.error(f"Cosine similarity was not calculated at this iteration: {e}")
        cosine_similarity_score=None
    return cosine_similarity_score

In [None]:
# get the method that is being used to evaluate the content (Majority voting)
model_eval_subjective_info: Dict = eval_config['model_evaluations']['subjective_eval_info']
method_name: str = eval_config['model_evaluations']['PoLL_Composition_and_Voting'].get('method', None)
logger.info(f"The evaluation method FMBench is going to use to evaluate different model responses: {method_name}")
logger.info(f"judge panel being used to evaluate model responses: {model_eval_subjective_info.get('judge_panel_list', None)}")

In [None]:
# calculate the quantitative metrics if evaluation is set to Majority voting
logger.info(f"Valid ground truth column found in the inference file: {eval_config['model_evaluations'].get('ground_truth_col')}, calculating cosine similarity scores")
logger.info("Creating embeddings and calculating cosine similarity scores for all candidate model responses now. This might take 1-2 minutes")
ground_truth_col_name: Optional[str] = config['datasets'].get('ground_truth_col_key', None)

# Check for ground truth column and raise an exception if not found
if ground_truth_col_name is None:
    raise ValueError(f"Expected a valid ground truth column name in the config file information, got {ground_truth_col_name}. Cannot continue.")

# If we reach this point, we know the ground truth column exists
df_per_inference['cosine_similarity_score'] = df_per_inference.apply(
    lambda row: calculate_cosine_similarity(row['completion'], row['ground_truth']), axis=1
)

logger.info(f"Calculated the cosine similarity score: {df_per_inference.head()}")

## Model Evaluations: Hierarchical Flow
--- 

1. Check for the lexical match/similarity between the ground truth and the answer using three main quantitative metrics: _Cosine similarity score, Levenshtein similarity, and Token set ratio_. 

1. If any of these thresholds, or the overall quantitative metric threshold, is passed (thresholds are specified in the [model_eval_all_info](configs/model_eval_all_info.yml) config file), users can decide not to pass the obviously correct responses through the LLM evaluator process. 

1. For simple datasets that contain a direct answer to a question, quantitative metrics can be used to determine the correctness of an answer. However, in edge cases where quantitative metrics cannot be relied on (for example, when an answer must be judged on the relationship between two people rather than on the similarity of words), the responses can be passed through the LLM evaluator.
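
The routing decision described in the list above might look like the sketch below. It follows one reading of the rule (any single metric, or the mean of the available metrics, meeting its threshold). The function name, the threshold values, and the routing labels are hypothetical; the real thresholds live in the evaluation config file.

```python
from typing import Dict, Optional


def passes_quantitative_gate(scores: Dict[str, Optional[float]],
                             thresholds: Dict[str, float],
                             overall_threshold: float) -> bool:
    """
    Return True if any single metric, or the mean of all available metrics,
    meets its threshold. Missing (None) scores never pass.
    """
    available = {k: v for k, v in scores.items() if v is not None}
    if not available:
        return False
    any_metric_passes = any(v >= thresholds[k] for k, v in available.items())
    mean_passes = sum(available.values()) / len(available) >= overall_threshold
    return any_metric_passes or mean_passes


# Hypothetical scores for one response and hypothetical thresholds
scores = {"cosine_similarity_score": 0.91,
          "levenshtein_distance": 0.40,
          "token_set_ratio_value": 0.75}
thresholds = {name: 0.9 for name in scores}

gate_passed = passes_quantitative_gate(scores, thresholds, overall_threshold=0.8)
route = "correct" if gate_passed else "send_to_llm_panel"
print(route)
```

A stricter variant would require every metric to pass (`all` instead of `any`), which sends more responses to the LLM panel in exchange for fewer false positives.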

### Model Evaluation Part 1: Lexical Match & Similarity Score Evaluation Filter
---

Before having the Panel of LLM Evaluators evaluate each candidate model's response, we pass those responses through a quantitative evaluation step. In this step we apply thresholds to the `Lexical match`, `Cosine Similarity`, and `Levenshtein Similarity` scores to decide whether an answer is correct without necessarily having an LLM evaluate it. The correctness thresholds for each quantitative metric are defined in the `model_eval_all_info.yml` config file. 

The reason for doing this is to make the evaluation process a hierarchy of checks, so that every candidate model response is evaluated appropriately. Additionally, if the user decides to use the quantitative metrics as the decision point for whether an answer is correct or incorrect, without passing it through an LLM for evaluation, this can reduce cost and latency. This is specific to the `Ground Truth based approach`. 

For the lexical match, we use the `token_set_ratio` fuzzy matching algorithm from the `fuzzywuzzy` library to determine how similar the two texts are.

**Note**: The `token_set_ratio` algorithm tokenizes both input strings, removes duplicate tokens, and calculates the similarity score based on the intersection and union of the token sets. It captures the essence of the strings’ content rather than their specific order.
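
The idea behind a token set comparison can be approximated with the standard library, as in the sketch below. It is not the `fuzzywuzzy` implementation: the helper name `token_set_ratio_sketch` is hypothetical, `difflib.SequenceMatcher` stands in for the library's string scorer, and the preprocessing is simplified to lowercasing and whitespace splitting, so scores may differ from `fuzz.token_set_ratio`.

```python
from difflib import SequenceMatcher


def token_set_ratio_sketch(text1: str, text2: str) -> float:
    """Approximate token-set similarity in [0, 1] using only the standard library."""
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    if not tokens1 or not tokens2:
        return 0.0
    # Shared tokens, and the tokens unique to each side, each in sorted order
    shared = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{shared} {only1}".strip()
    combined2 = f"{shared} {only2}".strip()

    def ratio(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    # Keep the best of the three pairwise comparisons
    return max(ratio(shared, combined1), ratio(shared, combined2), ratio(combined1, combined2))


# Word order does not matter
reordered = token_set_ratio_sketch("new york mets", "mets new york")
# A response whose tokens are a subset of the ground truth also scores perfectly
subset = token_set_ratio_sketch("the Lardizabalaceae family",
                                "a genus of flowering plant in the Lardizabalaceae family")
print(reordered, subset)
```

The second case is worth keeping in mind when choosing thresholds: under this scheme, a response that merely repeats part of the ground truth can score as a perfect match.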

In [None]:
def calculate_token_set_ratio(text1: str, text2: str) -> Optional[float]:
    """
    This function calculates the token set ratio (a fuzzy lexical match score) between two
    strings, scaled to the range 0 to 1. It returns None if either string is empty or the score
    cannot be computed. If this score and the cosine similarity meet or exceed their thresholds,
    the answer can be treated as correct without being evaluated by a judge. If not, the answer
    is passed through the PoLL process
    """
    try:
        token_set_ratio: Optional[float] = None
        if text1 and text2:
            token_set_ratio = fuzz.token_set_ratio(text1, text2) / 100.0
        else:
            token_set_ratio=None
    except Exception as e:
        logger.error(f"Error in calculating token set ratio: {e}")
        token_set_ratio=None
    return token_set_ratio

### Levenshtein distance algorithm
---
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
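
As a worked example, the classic pair "kitten" and "sitting" needs three edits (two substitutions and one insertion). The sketch below computes this with a two-row dynamic programming table and then normalizes the distance into a similarity score by dividing by the longer string's length, which is the same normalization this notebook applies. The function name `levenshtein` is chosen for this example only.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + substitution_cost))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
# Normalize to a similarity in [0, 1]: 1 means the strings are identical
similarity = 1 - distance / max(len("kitten"), len("sitting"))
print(distance, round(similarity, 3))
```

Keeping only two rows reduces memory from O(m·n) to O(n) while producing the same distance as a full matrix.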

In [None]:
def levenshtein_distance(s: str, t: str):
    """
    Here, we use Dynamic Programming (DP) to compute the levenshtein distance
    between two strings
    """
    # Initialize lengths of both strings
    m, n = len(s), len(t)

    # Ensure s is the longer string
    if m < n:
        s, t = t, s
        m, n = n, m

    # Initialize the distance matrix with dimensions (m+1) x (n+1)
    d = [list(range(n + 1))] + [[i] + [0] * n for i in range(1, m + 1)]

    # Populate the matrix
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            # If characters match, no cost is added
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                # Otherwise, take the minimum cost from insert, delete, or replace operations
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    # Return the computed Levenshtein distance (bottom-right cell of the matrix)
    return d[m][n]


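def calculate_levenshtein_distance(input_string: str, reference_string: str) -> Optional[float]:
    """
    In this function, we calculate the levenshtein distance between the input string (candidate model response) and 
    the reference string (which can be the ground truth or the context provided to answer the question), and
    return it as a normalized similarity: 1 - (distance / length of the longer string). A value of 1 means the
    strings are identical. Returns None if the score cannot be computed (for example, if both strings are empty).
    """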
    try:
        levenshtein_similarity: Optional[float]=None
        distance = levenshtein_distance(input_string, reference_string)
        max_length = max(len(input_string), len(reference_string))
        levenshtein_similarity = 1 - (distance / max_length)
    except Exception as e:
        logger.error(f"Could not compute the levenshtein similarity score: {e}")
        levenshtein_similarity=None
    return levenshtein_similarity

In [None]:
# # These are examples from the LongBench dataset for testing purposes
# candidate_model_response: str = "Both Sinofranchetia and Stauntonia are from the Lardizabalaceae family. This information is mentioned in the passages for both genera."
# ground_truth: str = "a genus of flowering plant in the Lardizabalaceae family"
# ratio = calculate_levenshtein_distance(candidate_model_response, ground_truth)
# print(f"ratio calculated: {ratio}")

In [None]:
# Compute the token set ratio and the levenshtein similarity for each row and add them as new columns.
# In this case, the ground truth is used as the reference text for both metrics

# calculate the quantitative metrics if evaluation is set to Majority voting
logger.info(f"ground truth column is found: {eval_config['model_evaluations'].get('ground_truth_col')}, calculating token set ratio and levenshtein distance")
df_per_inference = df_per_inference.assign(
    token_set_ratio_value=lambda df: df.apply(lambda row: calculate_token_set_ratio(row['completion'], row['ground_truth']), axis=1),
    levenshtein_distance=lambda df: df.apply(lambda row: calculate_levenshtein_distance(row['completion'], row['ground_truth']), axis=1)
)
df_per_inference.head()

In [None]:
# define the all_metrics path to send the evaluation metrics to
all_metrics_fpath: str = os.path.join(METRICS_DIR, config["report"]["all_metrics_file"])
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_cosine_similarity_scores_csv = csv_buffer.getvalue()
inference_cosine_similarity_scores_s3_path = os.path.join(METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_cosine_similarity_scores_csv, BUCKET_NAME, "", 
            METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES)
logger.info(f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{inference_cosine_similarity_scores_s3_path}")
df_per_inference.head()

### Model Evaluation Part 2: Use _Panel of LLM Evaluators_ to get Subjective Evaluations on various evaluation criteria
---

In this portion of the notebook, we run evaluations on all candidate models using a panel of LLM evaluators. We use one main evaluation method: `Majority Voting`. To reduce intra-model bias, we score answer correctness based not on a single judge but on a panel composed of multiple evaluator models.

1. **Majority Voting**: We use the PoLL to evaluate candidate model responses by checking their correctness against the ground truth answer provided in the dataset. We prompt each evaluator in the PoLL to respond in a JSON structure, giving a verdict on whether the response is correct or incorrect based on its comparison with the ground truth, along with an explanation. With all verdicts and responses in JSON, we can perform downstream tasks such as:

    1. Calculate the overall accuracy of each model as correct / (correct + incorrect) responses
    
    1. Calculate the `error rate`, or frequency of incorrect responses
    
    1. Categorize the errors based on the explanations provided by the evaluators. Common categories might include misunderstanding the question, incomplete answers, and factual inaccuracies
    
    1. Summarize the overall correct/incorrect counts and identify the best model according to the PoLL. Rank the models on correctness versus incorrectness.
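
The first, second, and fourth downstream tasks above can be sketched as follows. The records, the model names, and the `verdict` key are made up for illustration; the actual JSON fields depend on the evaluation instructions given to the judges.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical final verdicts, one per evaluated response
evaluations: List[Dict[str, str]] = [
    {"candidate_model": "model-a", "verdict": "correct"},
    {"candidate_model": "model-a", "verdict": "correct"},
    {"candidate_model": "model-a", "verdict": "incorrect"},
    {"candidate_model": "model-b", "verdict": "correct"},
    {"candidate_model": "model-b", "verdict": "incorrect"},
    {"candidate_model": "model-b", "verdict": "incorrect"},
]


def summarize(evals: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute accuracy and error rate per candidate model from verdicts."""
    counts = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for record in evals:
        verdict = record["verdict"].strip().lower()
        # Skip anything that is not a clean verdict
        if verdict in ("correct", "incorrect"):
            counts[record["candidate_model"]][verdict] += 1
    summary = {}
    for model, c in counts.items():
        accuracy = c["correct"] / (c["correct"] + c["incorrect"])
        summary[model] = {"accuracy": accuracy, "error_rate": 1 - accuracy}
    return summary


summary = summarize(evaluations)
# Rank models from most to least accurate
ranking = sorted(summary, key=lambda m: summary[m]["accuracy"], reverse=True)
print(ranking, summary)
```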

In [None]:
# get the qualitative/subjective evaluation information from the config file to evaluate answers from different
# endpoints on various criteria
model_eval_subjective_info: Dict = eval_config['model_evaluations']['subjective_eval_info']
eval_criteria_list = model_eval_subjective_info.get('eval_criteria', None)
logger.info(f"available llm as a judge evaluation information to use: {json.dumps(model_eval_subjective_info, indent=2)}")

In [None]:
# get the inference parameters that the LLM judge panel will use while evaluating model candidate responses
INFERENCE_PARAMETERS_LLM_PANEL: Dict = eval_config['model_evaluations']['subjective_eval_info'].get('inference_parameters', None)
logger.info(f"Inference parameters that LLM evaluators will use: {INFERENCE_PARAMETERS_LLM_PANEL}")

In [None]:
def get_llm_evaluation(model_id: str,
                        prompt: str) -> Dict:
    """
    Get inference using LiteLLM. This function is called by each evaluator on the panel of 
    llm evaluators to get a response on a given prompt when Majority voting is enabled.
    Returns a dict with the prompt, completion, token counts, model id and any exception raised
    """
    # represents the service name
    logger.info(f"get_inference, model_id={model_id}")
    service_name: str = "bedrock"
    # represents creating the bedrock model to invoke the litellm api for response for titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name 
    # initialize the response dict
    ret = dict(exception=None,
               prompt=prompt,
               completion=None,
               completion_token_count=None,
               prompt_token_count=None,
               model_id=model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    try:
        # Represents calling the litellm completion/messaging api utilizing the completion/embeddings API
        print(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{"content": body,"role": "user"}],
                              temperature=INFERENCE_PARAMETERS_LLM_PANEL.get('temperature', 0.1),
                              max_tokens=INFERENCE_PARAMETERS_LLM_PANEL.get('max_tokens', 100),
                              caching=INFERENCE_PARAMETERS_LLM_PANEL.get('caching', False))
        print(f"response: {response}")
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens        
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [None]:
def safe_filename(s):
    """
    convert a string to another string that can be used as a filename
    i.e. remove white space and non-word chars
    """
    if s is None:
        return "None"
    # Remove everything except word characters (letters, digits, underscore) and whitespace
    s = re.sub(r"[^\w\s]", '', s)
    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)
    return s

In [None]:
def parse_as_json(x: str) -> Optional[Dict]:
    """
    Convert a string into a dictionary. Remove any
    stray whitespaces which could break the json parsing
    """
    d: Optional[Dict] = None
    try:
        x = x.replace("\n", "").replace("\t", "")
        d = json.loads(x)
    except Exception as e:
        print(f"parse_as_json, error parsing string as json, string={x}, error={e}")
    return d

In [None]:
df_per_inference.rename(columns={'completion': 'candidate_model_response'}, inplace=True)
df_per_inference.head()

#### Prepare the evaluation prompt payloads
---

Here, we build the evaluation prompt that each LLM judge uses to evaluate the answers.
The prompt preparation function fills a prompt template with a set of rules, the answer,
and the ground truth (if any).
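
A minimal sketch of this template filling with `str.format` is shown below. The template text and the rules are invented for illustration; the real templates ship with FMBench as text files. One detail worth noting when a template asks for JSON output: literal braces must be doubled so that `str.format` does not treat them as placeholders.

```python
# Hypothetical template; {rules}, {ground_truth} and {answer} are the placeholders
eval_template = (
    "{rules}\n\n"
    "Ground truth: {ground_truth}\n"
    "Candidate answer: {answer}\n"
    # Doubled braces become single literal braces after formatting
    'Respond as {{"verdict": "...", "explanation": "..."}}'
)

rules = "Mark the answer correct only if it agrees with the ground truth."
prompt = eval_template.format(rules=rules,
                              ground_truth="Paris",
                              answer="The capital is Paris.")
print(prompt)
```

Braces inside the substituted values (for example, a candidate answer that contains JSON) are safe, because `str.format` does not re-scan the arguments it inserts.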

In [None]:
def prepare_eval_prompts(eval_template: str,
                         answer: str, 
                         rules: str, 
                         ground_truth: Optional[str]) -> Optional[str]:
    """
    This function prepares the evaluation prompt by filling the standard eval prompt template
    with the evaluation rules, the answer and the ground truth (if any ground truth is provided).
    It prepares prompt payloads for the Majority voting evaluation method. In the 
    case of Majority voting, there is no subjective criteria that is inputted. Returns None
    if the template could not be formatted.
    """
    try:
        processed_eval_template: Optional[str] = None
        processed_eval_template = eval_template.format(
            rules=rules,
            answer=answer,
            ground_truth=ground_truth)
    except Exception as e:
        logger.error(f"Error encountered while generating the evaluation prompt template: {e}")
        processed_eval_template=None
    return processed_eval_template

In [None]:
def clear_dir(dir_path: str):
    files = glob.glob(os.path.join(dir_path, "*"))
    for f in files:
        os.remove(f)

# create the metrics directory that stores all of the json files containing evaluations from all Panel of LLM evaluators
METRICS_PER_POLL_EVAL_DIR: str = os.path.join(METRICS_DIR, METRICS_PER_POLL_EVAL_DIR_NAME)
_ = list(map(clear_dir, [METRICS_PER_POLL_EVAL_DIR]))

In [None]:
def run_panel_of_llm_evals(i: int, total: int, row: Dict,  model_id: str, eval_method_name: str, uuid: str) -> Dict:
    """
    Runs the evaluation for one row 
    The eval prompt is already available in the row dictionary
    and we simply want to run the inference against the judge model.
    The results are returned in a new dictionary that contains the model 
    response and some fields from the original dictionary
    """
    try: 
        # save all the responses from the model in a dictionary
        resp: Optional[Dict]=None
        logger.info(f"run_eval, row {i}/{total}, judge_model_id={model_id}, candidate model={row['endpoint_name']}")
        # create the payload for model inference
        prompt = row[f'{model_id}_{method_name}_eval_prompt']
        # generate the evaluation on the data using the model judge
        resp = get_llm_evaluation(model_id, prompt)
        # assign the completion from the candidate model to the `candidate_model_response`, 
        # and the actual evaluation will be contained in a field called `completion`
        resp['candidate_model_response'] = row['candidate_model_response']
        logger.info(f"Panel of LLM evaluator {model_id} completion: {resp['completion']}")
        resp['candidate_model'] = row['endpoint_name']
        resp['payload_file'] = row['payload_file']
        resp['cosine_similarity_score'] = row['cosine_similarity_score']
        resp['levenshtein_distance'] = row['levenshtein_distance']
        resp['token_set_ratio_value'] = row['token_set_ratio_value']
        # if there is a ground truth (in case of Majority voting) or
        # criteria name (in case of average pooling), include those in the json response
        resp['ground_truth'] = row['ground_truth']
    except Exception as e:
        logger.error(f"Error encountered while running evaluation: {e}")
        resp=None
    return resp

# we use Ray to parallelize
@ray.remote
def async_run_eval(i: int, total: int, row: Dict, model_id: str, eval_method_name: str, uuid: str) -> Dict:
    print(f"async_run_eval, i={i}, total={total}, judge_model_info={model_id}, eval_method: {eval_method_name}, uuid: {uuid}")
    return run_panel_of_llm_evals(i, total, row, model_id, eval_method_name, uuid)

In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
df_per_inference_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"Total number of candidate model responses going to be evaluated: {len(df_per_inference_list)}")

#### Prepare evaluation prompt templates
---

This portion of the step prepares the evaluation prompt templates that the PoLL uses during `Majority Voting` evaluation.

In [None]:
model_eval_subjective_info

In [None]:
logger.info(f"Number of judges being used for this model evaluation: {len(model_eval_subjective_info.get('judge_panel_list', []))}")
logger.info(f"Inference Parameters that are going to be used by the judge panels while evaluating candidate models: {model_eval_subjective_info.get('inference_parameters', None)}")

#### Prepare prompt payloads
---

In this portion of the step, FMBench iterates through each row containing a model response and prepares the corresponding prompt payload using the prompt template for the given evaluation method. For Majority Voting, a standard prompt template is used that contains the evaluation instructions and the candidate model response.
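
As a rough illustration of what preparing a prompt payload involves, the sketch below fills a template with the rules, the ground truth, and one candidate answer. The template text, the placeholder names, and `build_eval_prompt` are hypothetical and chosen for this example only; the actual FMBench templates and the `prepare_eval_prompts` helper may be structured differently.

```python
from string import Template

# Hypothetical template for illustration; the real templates are read from the
# eval prompts directory configured for FMBench.
EVAL_TEMPLATE = Template(
    "Rules:\n$rules\n\n"
    "Ground truth:\n$ground_truth\n\n"
    "Candidate answer:\n$answer\n\n"
    'Respond with JSON: {"verdict": "...", "explanation": "..."}'
)


def build_eval_prompt(template: Template, answer: str, rules: str, ground_truth: str) -> str:
    """Fill the template placeholders for a single candidate model response."""
    return template.substitute(rules=rules, ground_truth=ground_truth, answer=answer)


prompt = build_eval_prompt(
    EVAL_TEMPLATE,
    answer="Paris",
    rules="Answer 'correct' or 'incorrect'.",
    ground_truth="Paris is the capital of France.",
)
print(prompt)
```

Applying such a function row by row (for example with `DataFrame.apply`) yields one prompt column per judge model.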

In [None]:
# Assuming fmbench is a valid Python package and scripts is a subdirectory within it
# `model_eval_dir` is a dictionary from the eval config and the prompts directory is a `Path`
model_eval_dir: Dict = eval_config['model_evaluations']['model_eval_dir']
eval_prompts_dir: Path = Path(pkg_resources.files('fmbench'), 
                              f"{config['s3_read_data']['prompt_template_dir']}/{model_eval_dir.get('eval_prompts_dir', None)}")

try:
    # Iterate through each LLM as a judge and each evaluation criterion
    for llm_info in model_eval_subjective_info.get('judge_panel_list', []):
        model_id: str = llm_info['model_id']
        method_name: str = eval_config['model_evaluations']['PoLL_Composition_and_Voting'].get("method", None)
        eval_prompt_template_fname: str = f"{llm_info.get('eval_prompt_template_name', None)}.txt"

        # Use the evaluation prompt template path to read in the standard prompt template that
        # is used in the creation of prompt payloads
        eval_prompt_template_dir = llm_info.get('eval_prompt_template_dir', None)
        eval_prompt_template_path = os.path.join(eval_prompts_dir, eval_prompt_template_dir, eval_prompt_template_fname)
        logger.info(f"evaluation prompt template file path being used for {model_id}: {eval_prompt_template_path}")
        logger.info(f"evaluation prompt template file name: {eval_prompt_template_fname}")
        eval_prompt_template = Path(eval_prompt_template_path).read_text()
        logger.info(f"Evaluation prompt template being used: {eval_prompt_template}")

        # There is a standard instructions file for Majority voting on how to evaluate the 
        # model responses (whether it should be a binary decision or rating on a scale of 1-5)
        eval_instructions_fname = next((rule for rule in model_eval_dir.get('eval_instructions_files', []) if method_name in rule), None)
        rules = Path(os.path.join(eval_prompts_dir, eval_instructions_fname)).read_text()
        logger.info(f"rules: {rules}")
        column_name = f"{model_id}_{method_name}_eval_prompt"
        df_per_inference[column_name] = df_per_inference.apply(
            lambda r: prepare_eval_prompts(
                eval_prompt_template,
                r['candidate_model_response'],
                rules,
                r['ground_truth']
            ),
            axis=1
        )
except Exception as e:
    logger.error(f"Error occurred in the creation of prompt payloads: {e}")
    df_per_inference=None

df_per_inference.head()

In [None]:
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_eval_prompt_payloads = csv_buffer.getvalue()
eval_prompt_payloads_for_inference = os.path.join(METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(df_per_inference_with_eval_prompt_payloads, BUCKET_NAME, "", 
            METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS)
logger.info(f"Per inference evaluation prompt payloads saved to s3://{BUCKET_NAME}/{eval_prompt_payloads_for_inference}")
df_per_inference.head()

In [None]:
df_per_inference.shape

In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
eval_records_list = json.loads(df_per_inference.to_json(orient='records'))
logger.info(f"Total number of evaluations to be done: {len(eval_records_list)}")

### Run the hierarchy of Model Evaluations
---

In this portion of the step, FMBench performs the following actions:

1. For `Majority Voting`, we assume that a ground truth already exists in the dataset. We first calculate the quantitative metrics. If the user enables the `use_quantitative_metrics` parameter in the common model eval config file and the threshold for cosine similarity, levenshtein distance, or token set ratio is exceeded, the candidate model response is marked as correct. If none of the individual thresholds is passed, we check whether the overall (average) metric satisfies the average quantitative threshold. If it does, customers can choose to filter out the responses that pass the threshold and assume they are correct. If not, all the questions can be supplied to the panel of LLMs for the next set of evaluations.

1. We use the LLM panel of judges (in this case 3 judges) to give a verdict on whether the `answer` from each candidate model during inference is `correct` or `incorrect`. Each judge also explains why it evaluated a candidate model response as correct or incorrect.

1. Each judge response is returned in a JSON structure that is used for downstream analytics, such as comparing evaluation results between different candidate models.

***This step takes a couple of minutes to complete based on the size of the dataset and the judge models. Model completion time depends on the PoLL models being used. `Llama3-70b`, `Cohere command-r-v1` and `claude 3 Sonnet` were used for this example***
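
The quantitative part of the hierarchy described above can be sketched as follows. This is a minimal illustration, not the FMBench implementation: the function name, the dictionary keys, and the assumption that every score is normalized to the range 0 to 1 with higher meaning more similar are all made up for this example.

```python
from statistics import mean
from typing import Dict, Optional


def quantitative_verdict(scores: Dict[str, float],
                         thresholds: Dict[str, float],
                         average_threshold: float) -> Optional[str]:
    """Return 'correct' when the metrics alone settle the answer,
    otherwise None, meaning the response goes to the judge panel."""
    # Step 1: any single metric at or above its own threshold is enough
    if any(scores[name] >= thresholds[name] for name in thresholds):
        return "correct"
    # Step 2: otherwise fall back to the average of all metrics
    if mean(scores[name] for name in thresholds) >= average_threshold:
        return "correct"
    # Step 3: undecided, defer to the panel of LLM evaluators
    return None


# Hypothetical thresholds and scores (assumed normalized to [0, 1])
thresholds = {"cosine_similarity": 0.9, "levenshtein_similarity": 0.9, "token_set_ratio": 0.9}
scores = {"cosine_similarity": 0.95, "levenshtein_similarity": 0.40, "token_set_ratio": 0.55}
print(quantitative_verdict(scores, thresholds, average_threshold=0.8))
```

Responses for which the function returns `None` are the ones that would still need a verdict from the judges.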

In [None]:
# get the llm as a judge panel list
judge_panel_list: List[Dict] = model_eval_subjective_info.get('judge_panel_list', [])
logger.info(f"The judge panel list contains {len(judge_panel_list)} judges. Their information: {judge_panel_list}")

In [None]:
logger.info(f"~Panel of LLM evaluators are going to start evaluating responses. This might take a couple of minutes depending on the size of the dataset and candidate model responses~")

In [None]:
is_quantitative_eval_enabled: bool = eval_config['model_evaluations']['PoLL_Composition_and_Voting'].get('use_quantitative_metrics', False)
logger.info(f"Are quantitative metrics going to be used to make a final eval decision: {is_quantitative_eval_enabled}")

### Start the evaluation process
---

This process loops through the evaluation prompt payloads that are prepared. For Majority voting, a JSON containing 2 elements is generated: "verdict" of whether the given answer is correct or incorrect and an "explanation". 

Responses from the evaluation process are sent to downstream processes to determine the most accurate
and subjectively correct model for domain-specific use cases.
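
To make the "verdict" and "explanation" JSON concrete, here is a small sketch that parses judge completions and takes a majority vote for a single candidate response. The helper names are hypothetical, and this per-response vote is only an illustration of the idea; the cells further down aggregate verdicts per judge and per candidate model instead.

```python
import json
from collections import Counter
from typing import List, Optional


def parse_verdict(completion: str) -> Optional[str]:
    """Extract the 'verdict' field from a judge completion; None if unusable."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    verdict = parsed.get("verdict")
    return verdict.strip().lower() if isinstance(verdict, str) else None


def majority_vote(completions: List[str]) -> Optional[str]:
    """Return the most common verdict, or None on a tie or when nothing parses."""
    votes = Counter(v for v in map(parse_verdict, completions) if v is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two verdicts
    return ranked[0][0]


# Hypothetical completions from a panel of three judges
panel = [
    '{"verdict": "correct", "explanation": "Matches the ground truth."}',
    '{"verdict": "incorrect", "explanation": "Misses a detail."}',
    '{"verdict": "correct", "explanation": "Same meaning."}',
]
print(majority_vote(panel))
```

Using an odd number of judges, as in this notebook's panel of three, avoids ties for binary verdicts as long as every completion parses.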

In [None]:
n: int = model_eval_subjective_info.get('run_parallel_inference_count', 5)
list_of_lists = [eval_records_list[i * n:(i + 1) * n] for i in range((len(eval_records_list) + n - 1) // n)]
resp_list = []
erroneous_count: int = 0
st: float = time.perf_counter()

# Iterate over the judge panel and sublists
for judge_panelist_info in judge_panel_list:
    logger.info(f"============Running inference for judge panelist {judge_panelist_info['model_id']} for {method_name} ============")
    for idx, sublist in enumerate(list_of_lists):
        model_id: str = judge_panelist_info['model_id']
        logger.info(f"Getting inference for list {idx + 1}/{len(list_of_lists)}, size of list={len(sublist)}")
        try:
            resp_list.extend(ray.get([async_run_eval.remote(i + 1, len(sublist), record, model_id, method_name, record['uuid'])
                               for i, record in enumerate(sublist)]))
        except Exception as e:
            logger.error(f"Error processing list {idx + 1}/{len(list_of_lists)}: {e}")
            erroneous_count += 1
    # Sleep for one second before moving on to the next judge model
    logger.info(f"~Sleeping for one second before the next Panel of LLM evaluates the responses~")
    time.sleep(1)

elapsed_time = time.perf_counter() - st
logger.info(f"Total elapsed time for inference: {elapsed_time:.2f} seconds")
logger.info(f"Total erroneous lists: {erroneous_count}")

#### Send all Panel of LLM evaluator responses to S3 as `JSON` files
---
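
The cell below writes one JSON file per evaluation and throttles the uploads in batches. The general pattern can be sketched as follows; `batched`, `write_in_batches`, and the `write_batch` callback are hypothetical names for this example and stand in for whatever function performs the upload.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def write_in_batches(items: Sequence[T],
                     write_batch: Callable[[List[T]], None],
                     batch_size: int = 50,
                     delay_seconds: float = 1.0) -> int:
    """Call `write_batch` once per batch, pausing between batches.
    Returns the number of batches written."""
    batches_written = 0
    for batch in batched(items, batch_size):
        write_batch(batch)
        batches_written += 1
        time.sleep(delay_seconds)  # stay below the storage request rate limit
    return batches_written


# Usage with a stand-in writer that only records the batch sizes
sizes: List[int] = []
n_batches = write_in_batches(list(range(120)), lambda b: sizes.append(len(b)), delay_seconds=0)
print(n_batches, sizes)
```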

In [None]:
# Collect all of the panel of LLM evals and send them all as JSON files to S3
if resp_list:
    save_s3_list = []
    try:
        for resp in resp_list:
            llm_eval_response = json.dumps(resp, indent=2)
            candidate_model_id = resp.get('candidate_model', None)
            # Extract a few words from the poll eval response to append to the file name
            response_excerpt = " ".join(resp.get('candidate_model_response', "").split()[:5])
            sanitized_response_excerpt = "".join([c if c.isalnum() else "_" for c in response_excerpt])
            llm_eval_json_fname = f"{candidate_model_id}_{time.time()}_{sanitized_response_excerpt}.json"
            response_s3_path = os.path.join(METRICS_PER_POLL_EVAL_DIR, llm_eval_json_fname)
            logger.info(f"Sending model eval result files to s3 path prefix: {response_s3_path}")
            save_s3_list.append((llm_eval_response,
                                config['aws']['bucket'],
                                "",
                                METRICS_PER_POLL_EVAL_DIR,
                                llm_eval_json_fname))

        # Split the save_s3_list into smaller batches to avoid S3 write failures
        # caused by the request rate hitting the maximum threshold
        batch_size: int = 50
        delay: float = 1 
        for i in range(0, len(save_s3_list), batch_size):
            batch = save_s3_list[i:i + batch_size]
            # write a batch of evaluation result files to s3
            write_multiple_to_s3(batch)
            time.sleep(delay)  # Delay between batches

    except Exception as e:
        logger.error(f"Error processing or writing to S3: {e}")
else:
    logger.info("No responses to write to S3")

### Save All Results: Perform downstream analytical tasks on each PoLL evaluation result
---

In this portion of the evaluation step:

1. We compile all metrics gathered from the Majority Voting experiment, and send them as `CSV`, `txt` files to s3.

1. These metrics include: Quantitative metrics and binary decision scores (for Majority Voting).

In [None]:
# convert the results list into a dataframe for easy analytics
df_eval_results = pd.DataFrame(resp_list)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how='all')
# the exception, judge model id, prompt token count, will be NaN for the verdicts decided
# using the lexical match and not moved forward to the panel of LLM evaluators
df_eval_results.head()

In [None]:
# parse out the completion from LLM as a judge and column bind
# the fields of the dictionary to the original results dataframe
df_eval_results_only = df_eval_results['completion'].apply(parse_as_json).apply(pd.Series)
df_eval_results_only.dropna(axis=1, how='all')
df_eval_results = pd.concat([df_eval_results, df_eval_results_only], axis=1)
df_eval_results.rename(columns={'model_id': 'judge_model_id'}, inplace=True)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how='all')
df_eval_results.head()

### Evaluate the correctness of LLM Evaluators using quantitative metrics
---

In this portion of the evaluation step, we perform the following steps:

1. Evaluate whether the LLM evaluators produced plausible verdicts by applying another layer of checks based on the _Cosine Similarity Score_. 

1. If a verdict from an LLM evaluator (`correct` or `incorrect`) does not meet the respective cosine similarity threshold, it is flagged for further analysis by a human or another LLM evaluation loop. 

There are two possible cases for this evaluation: 

1. **Incorrect Verdicts**: If the judge model's verdict is `incorrect`, check whether the cosine similarity of that response is less than or equal to the `incorrect_verdict_cosine_similarity_threshold`. If so, the verdict is kept as is in the dataframe. If the cosine similarity is higher than that threshold, the verdict is marked as "needs further human/LLM evaluation".

2. **Correct Verdicts**: If the judge model's verdict is `correct` and the cosine similarity meets or exceeds the `correct_verdict_cosine_similarity_threshold`, the verdict is kept and sent on for downstream analytics. Correct verdicts that do not meet that threshold are marked as "needs further human/LLM evaluation".

In [None]:
def quantitative_verdict_cosine_similarity_decision(row: pd.Series) -> pd.Series:
    """
    Given an LLM evaluator response, this function checks for whether a verdict provided by an LLM evaluator 
    is correctly evaluated using a cosine similarity metric threshold for correct and incorrect verdicts. These
    are the two cases that this function handles for each evaluation done using LLM as evaluators:

    1. Incorrect Verdicts: If the judge model's verdict is incorrect, check whether the cosine similarity of that
    response is less than or equal to the `incorrect_verdict_cosine_similarity_threshold`. If so, the verdict is 
    kept as is in the dataframe. If the cosine similarity is higher than that threshold, the verdict is marked as
    "needs further human/LLM evaluation".

    2. Correct Verdicts: If the judge model's verdict is correct and the cosine similarity meets or exceeds the 
    `correct_verdict_cosine_similarity_threshold`, the verdict is kept and sent on for downstream analytics. 
    Correct verdicts that do not meet that threshold are marked as "needs further human/LLM evaluation".

    This function is used if the evaluation method being used is Majority voting, specifically in the case
    of when ground truth is provided.
    """
    try:
        # This is a boolean value that is returned defining whether a given verdict is valid based on 
        # the comparison of its respective cosine similarity score and cosine similarity threshold for correctness/incorrectness
        is_eval_done_correctly: Optional[bool] = None
        correct_cosine_similarity_threshold: Optional[float] = None
        incorrect_cosine_similarity_threshold: Optional[float] = None

        # Check if the evaluation method is Majority voting and if the customer has enabled
        # evaluation decisions to also be made by quantitative metric thresholds
        if is_quantitative_eval_enabled:
            # Retrieve the information that is going to be used to check for whether a verdict is 
            # incorrectly identified as correct or incorrect
            judge_model_id: str = row['judge_model_id']
            verdict: str = row['verdict']
            explanation: str = row['explanation']
            cosine_similarity_score: float = row['cosine_similarity_score']

            # Get the correctness and incorrectness cosine similarity threshold scores
            correct_cosine_similarity_threshold = eval_config['model_evaluations']['quantitative_eval_info'].get('correct_verdict_cosine_similarity_threshold', None)
            incorrect_cosine_similarity_threshold = eval_config['model_evaluations']['quantitative_eval_info'].get('incorrect_verdict_cosine_similarity_threshold', None)

            # If the verdict is correct and is greater than or equal to the correct cosine similarity threshold, then 
            # the verdict is correct. If not, the verdict is identified to need further evaluation
            if verdict == 'correct':
                if cosine_similarity_score >= correct_cosine_similarity_threshold:
                    row['explanation'] = f"Judge model explanation: {explanation}. Cosine similarity is {cosine_similarity_score}, which meets the threshold of {correct_cosine_similarity_threshold}."
                    is_eval_done_correctly = True
                else:
                    row['verdict'] = "needs further human/LLM evaluation"
                    row['explanation'] = f"Judge model explanation: {explanation}. Cosine similarity is {cosine_similarity_score}, which does not meet the threshold of {correct_cosine_similarity_threshold}. Evaluate it further to determine the correct answer."
                    is_eval_done_correctly = False

            # If the verdict is incorrect and is less than or equal to the incorrect cosine similarity threshold, then 
            # the verdict is correctly identified as incorrect. If not, the verdict is identified to need further evaluation
            elif verdict == 'incorrect':
                if cosine_similarity_score <= incorrect_cosine_similarity_threshold:
                    row['explanation'] = f"Judge model explanation: {explanation}. Cosine similarity is {cosine_similarity_score}, which is within the incorrect verdict threshold of {incorrect_cosine_similarity_threshold}."
                    is_eval_done_correctly = True
                else:
                    row['verdict'] = "needs further human/LLM evaluation"
                    row['explanation'] = f"Judge model explanation: {explanation}. Cosine similarity is {cosine_similarity_score}, which does not meet the threshold of {incorrect_cosine_similarity_threshold}. Evaluate it further to determine the correct answer."
                    is_eval_done_correctly = False
    except Exception as e:
        logger.error(f"Error in quantitative_verdict_cosine_similarity_decision: {str(e)}")
        is_eval_done_correctly = None
    return row

#### Apply the layer of another evaluation filter on the dataframe containing all LLM as evaluator results
---

In [None]:
if df_eval_results is not None:
    df_eval_results = df_eval_results.apply(lambda r: quantitative_verdict_cosine_similarity_decision(r), axis=1)
df_eval_results.head()

In [None]:
df_eval_results[df_eval_results['verdict'] == 'needs further human/LLM evaluation'].count()

In [None]:
# send the raw results as a csv file to the S3 bucket
csv_buffer = io.StringIO()
df_eval_results.to_csv(csv_buffer, index=False)
eval_llm_as_a_judge_results = csv_buffer.getvalue()
eval_results_csv_fpath = os.path.join(METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(eval_llm_as_a_judge_results, BUCKET_NAME, "", 
            METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV)
logger.info(f"Per PoLL model responses saved as a csv to s3://{BUCKET_NAME}/{eval_results_csv_fpath}")
df_eval_results.head()

In [None]:
logger.info(f"Total number of evaluations that are done using different panel of LLM evaluators: {df_eval_results.shape[0]}")

### Majority Voting Results: Send the incorrect and correct responses to S3 separately in `CSV` files for downstream analytics for each model judge
---

In this portion of the step, we send the model responses as `CSV` and `txt` files to S3 for further downstream processing and report generation.

In [None]:
# For Majority Voting - all responses from the panel of LLM as evaluators are sent 
# to s3 as a csv file
try:
    logger.info(f"Method name is {method_name}, sending the correct and incorrect verdicts to s3")
    verdict_types: List[str] = ['incorrect', 'correct']
    all_llm_eval_responses_df: Optional[pd.DataFrame] = None
    # iterate through each verdict type and save the responses of that type from all evaluators in a separate
    # csv file. For example, one csv file contains only the incorrect verdicts from all model judges, whereas another 
    # csv file contains only the correct verdicts.
    for verdict in verdict_types:
        df_verdicts = df_eval_results[df_eval_results['verdict'] == verdict]
        all_llm_eval_responses_df = pd.concat([all_llm_eval_responses_df, df_verdicts], ignore_index=True)
        if not df_verdicts.empty:
            csv_buffer = io.StringIO()
            df_verdicts.to_csv(csv_buffer, index=False)
            verdict_responses = csv_buffer.getvalue()
            verdict_file = INCORRECT_VERDICT_RESPONSES_FILE if verdict == 'incorrect' else CORRECT_VERDICT_RESPONSES_FILE
            verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
            write_to_s3(verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file)
            logger.info(f"{verdict.capitalize()} verdict responses sent to s3://{BUCKET_NAME}/{verdict_responses_fpath}")
            logger.info(f"Number of {verdict} responses in total: {df_verdicts.shape[0]}")
except Exception as e:
    logger.error(f"Error encountered while writing the evaluation responses to s3: {e}")
    all_llm_eval_responses_df = None

all_llm_eval_responses_df.head()

In [None]:
# For Majority Voting - send all incorrect and correct verdicts as txt files to s3 for readability purposes
try:
    logger.info(f"Method name is {method_name}, sending the correct and incorrect verdicts to s3")
    verdict_types: List[str] = ['incorrect', 'correct']
    judge_model_ids = df_eval_results['judge_model_id'].unique()
    # save each judge model's correct and incorrect verdict files as txt files
    # for downstream analytics and readability purposes
    for judge_model_id in judge_model_ids:
        for verdict in verdict_types:
            df_judge_verdict = df_eval_results[(df_eval_results['verdict'] == verdict) & (df_eval_results['judge_model_id'] == judge_model_id)]
            if not df_judge_verdict.empty:
                txt_buffer = io.StringIO()
                for index, row in df_judge_verdict.iterrows():
                    txt_buffer.write(
                        f"candidate model: {row['candidate_model']}\n"
                        f"candidate model response: {row['candidate_model_response']}\n"
                        f"ground truth: {row['ground_truth']}\n"
                        f"verdict and explanation: {row['completion']}\n\n"
                    )
                judge_verdict_responses = txt_buffer.getvalue()
                verdict_file = f"{judge_model_id}_{verdict}_verdicts_evaluation.txt"
                judge_verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
                write_to_s3(judge_verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file)
                logger.info(f"{verdict.capitalize()} verdict responses for judge {judge_model_id} saved to s3://{BUCKET_NAME}/{judge_verdict_responses_fpath}")
except Exception as e:
    logger.error(f"Error encountered while writing the evaluation responses to s3: {e}")

#### Calculate the overall quantitative metrics of each model scored by the PoLL
---
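
As a compact reference for the accuracy and error rate calculation in this part of the notebook, the sketch below computes both per candidate model and judge from a toy dataframe. It is an illustration under stated assumptions rather than the notebook's exact logic: `per_judge_accuracy` is a hypothetical helper, rows flagged for further evaluation are excluded from the denominator, and rounding happens after converting to a percentage.

```python
import pandas as pd


def per_judge_accuracy(df: pd.DataFrame) -> pd.DataFrame:
    """Accuracy and error rate (in percent) per candidate model and judge,
    counting only rows whose verdict is 'correct' or 'incorrect'."""
    counts = (
        df[df["verdict"].isin(["correct", "incorrect"])]
        .groupby(["candidate_model", "judge_model_id", "verdict"])
        .size()
        .unstack(fill_value=0)
        # make sure both columns exist even if one verdict never occurs
        .reindex(columns=["correct", "incorrect"], fill_value=0)
    )
    total = counts["correct"] + counts["incorrect"]
    counts["accuracy"] = (100 * counts["correct"] / total).round(2)
    counts["error_rate"] = (100 * counts["incorrect"] / total).round(2)
    return counts.reset_index()


# Toy data with made-up model and judge names
toy = pd.DataFrame({
    "candidate_model": ["model-a"] * 4 + ["model-b"],
    "judge_model_id": ["judge-1"] * 5,
    "verdict": ["correct", "correct", "incorrect",
                "needs further human/LLM evaluation", "correct"],
})
result = per_judge_accuracy(toy)
print(result)
```

Averaging the `accuracy` column over judges for each candidate model then gives a single score per model.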

In [None]:
# mean cosine similarity score, levenshtein distance and token set ratio
try:
    panel_summary_responses_df = df_eval_results.groupby(['judge_model_id', 'candidate_model', 'verdict']).agg(
        count=('verdict', 'size'),
        mean_cosine_similarity=('cosine_similarity_score', 'mean'),
        mean_levenshtein_distance=('levenshtein_distance', 'mean'),
        mean_token_set_ratio=('token_set_ratio_value', 'mean')
    ).unstack(fill_value=0).stack().reset_index()
    csv_buffer = io.StringIO()
    panel_summary_responses_df.to_csv(csv_buffer, index=False)
    panel_summary_responses = csv_buffer.getvalue()
    llm_as_a_judge_per_eval_summary_fpath = os.path.join(METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)
    write_to_s3(panel_summary_responses, BUCKET_NAME, "", METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES)
    logger.info(f"Summary on each eval (Majority voting) for each panel judge sent to s3://{BUCKET_NAME}/{llm_as_a_judge_per_eval_summary_fpath}")
    logger.info(f"View information on the accuracy metrics: {panel_summary_responses_df.head()}")
except Exception as e:
    logger.error(f"Could not calculate the overall accuracy metrics for Majority Voting: {e}")
panel_summary_responses_df.head(15)

In [None]:
try:
    # Get the per panel judgement on each candidate model in terms of 
    # how many responses were correct (accuracy) and how many were incorrect (error rate)
    per_panel_judgement_result_df = panel_summary_responses_df.pivot_table(
        index=['candidate_model', 'judge_model_id'],
        columns='verdict',
        values='count',
        fill_value=0
    ).reset_index()

    # Ensure 'correct' and 'incorrect' columns exist
    if 'correct' not in per_panel_judgement_result_df.columns:
        per_panel_judgement_result_df['correct'] = 0
    if 'incorrect' not in per_panel_judgement_result_df.columns:
        per_panel_judgement_result_df['incorrect'] = 0

    # Calculate accuracy and error rate
    per_panel_judgement_result_df = per_panel_judgement_result_df.assign(
        accuracy=lambda df: df.apply(lambda row: 100 if row['incorrect'] == 0 else round(row['correct'] / (row['correct'] + row['incorrect']), 2) * 100, axis=1),
        error_rate=lambda df: df.apply(lambda row: 0 if row['incorrect'] == 0 else round(row['incorrect'] / (row['correct'] + row['incorrect']), 2) * 100, axis=1)
    )
    per_panel_judgement_result_df.head()
except Exception as e:
    logger.error(f"Could not calculate the overall accuracy metrics for Majority Voting: {e}")

per_panel_judgement_result_df.head(10)

In [None]:
try:
    mean_cosine_similarity = df_eval_results.groupby('candidate_model')['cosine_similarity_score'].mean().reset_index().rename(columns={'cosine_similarity_score': 'mean_cosine_similarity'})
    mean_levenshtein_distance = df_eval_results.groupby('candidate_model')['levenshtein_distance'].mean().reset_index().rename(columns={'levenshtein_distance': 'mean_levenshtein_distance'})
    mean_token_set_ratio = df_eval_results.groupby('candidate_model')['token_set_ratio_value'].mean().reset_index().rename(columns={'token_set_ratio_value': 'mean_token_set_ratio_value'})

    overall_accuracy_grouped_panel_df = per_panel_judgement_result_df.groupby('candidate_model')[['accuracy', 'error_rate']].mean().reset_index()
    overall_accuracy_grouped_panel_df = (
        pd.merge(mean_cosine_similarity, overall_accuracy_grouped_panel_df, on='candidate_model')
        .merge(mean_levenshtein_distance, on='candidate_model')
        .merge(mean_token_set_ratio, on='candidate_model')
        .sort_values(by='accuracy', ascending=False)
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    overall_accuracy_grouped_panel_df.to_csv(csv_buffer, index=False)
    overall_panel_result = csv_buffer.getvalue()
    overall_panel_accuracy_metrics_fpath = os.path.join(METRICS_DIR, PER_MODEL_ACCURACY_POLL)

    write_to_s3(overall_panel_result, BUCKET_NAME, "", METRICS_DIR, PER_MODEL_ACCURACY_POLL)
    logger.info(f"Overall accuracy and error rates results of each model sent to s3://{BUCKET_NAME}/{overall_panel_accuracy_metrics_fpath}")
except Exception as e:
    logger.error(f"Could not calculate the overall accuracy metrics for Majority Voting: {e}")

overall_accuracy_grouped_panel_df.head()

#### Send all responses from the evaluation process to S3 as a txt file for further downstream processing and readability purposes
---

In [None]:
try:
    # Write all explanations to a file and send to S3
    explanations_txt_buffer = io.StringIO()
    for index, row in df_eval_results.iterrows():
        explanations_txt_buffer.write(
            f"candidate model: {row['candidate_model']}\n"
            f"candidate model response: {row['candidate_model_response']}\n"
            f"ground truth: {row['ground_truth']}\n"
            f"verdict and explanation: {row['completion']}\n\n"
        )

    explanations_txt_file_content = explanations_txt_buffer.getvalue()
    explanations_fpath = os.path.join(METRICS_DIR, ALL_EVALUATIONS_IN_TXT)
    write_to_s3(explanations_txt_file_content, BUCKET_NAME, "", METRICS_DIR, ALL_EVALUATIONS_IN_TXT)
    logger.info(f"All text eval content from the llm judge panelists sent to s3://{BUCKET_NAME}/{explanations_fpath}")
    logger.info(f"All of the content including the candidate model responses, ground truth, evaluation are written: {explanations_txt_file_content}")
except Exception as e:
    logger.error(f"Could not write the evaluation explanations to s3: {e}")