## Evaluate candidate models using Majority Voting with PoLL (Panel of LLM Evaluators) and gather findings on quantitative metrics (such as _Cosine Similarity, Levenshtein distance, and token set ratio_)

---

_This notebook works best with the conda_python3 kernel on an ml.t3.medium machine_.

#### This step of the solution evaluates the quality of responses generated by the candidate models that have to be benchmarked/evaluated. It does so by performing the steps below:

- **Fetches the inference request file that contains all results from the inference step**: The inference results file that contains all inferences from each candidate model is fetched along with the associated metadata such as the ground truth (if any), source payload file, concurrency level, etc.

- **Generates quantitative metrics for evaluation**: This step calculates quantitative metrics that measure similarity and accuracy: _Cosine Similarity, Levenshtein distance, and token set ratio_. Cosine similarity measures how similar two vectors are, regardless of their magnitude. Levenshtein distance is a string metric that measures the difference between two sequences as the minimum number of single-character edits needed to turn one into the other. The token set ratio algorithm tokenizes both input strings, removes duplicate tokens, and calculates a similarity score. Together these give an overall accuracy/similarity score for each model based on its responses. These quantitative metrics can also be used in downstream tasks to measure the trends, patterns and behaviours of different models on the same dataset, or of the same models hosted on different serving stacks across various AWS Generative AI services.
- **Uses a _Panel of LLM Evaluators_ approach to get subjective evaluations**: Refer to this [paper](https://arxiv.org/pdf/2404.18796). We evaluate the responses from the `candidate models` (models used to generate inferences) in the following way:

  - **Majority Voting**: When a dataset provides a ground truth, FMBench uses a technique called `Majority Voting`. Here, we use PoLL, _or a panel of LLM evaluators_, from different model families to evaluate each candidate model's response as `correct` or `incorrect` based on its comparison with the ground truth. Using models from different model families as a PoLL brings the evaluation closer to a human-level evaluation, makes the evaluation process more streamlined and consistent across all the responses, and reduces the latency and cost of evaluating the candidate models over time. Intra-model bias during the evaluation process is also reduced, since more than a single model acts as a panel evaluator. FMBench uses [the majority voting evaluation instructions](prompt_template/eval_criteria/evaluation_instructions_majority_vote.txt) that are fed into the prompt templates supplied to different judge models to evaluate responses at runtime.

  **Evaluation Process Flow**

  1. First, all the quantitative metrics are calculated for each candidate model, such as the _Cosine Similarity, Levenshtein distance, and token set ratio_. Once all the quantitative metrics are calculated, all the responses from each candidate model on each payload file are sent through a _Panel of LLM Evaluators_. This panel uses the ground truth for each sample from the data and checks the correctness of every candidate model output across the entire dataset. An example of a prompt that is used to perform this evaluation is [here](prompt_template/eval_criteria/claude_eval_prompt_templates/claude_eval_majority_vote.txt). Every LLM evaluator uses its specific prompt template that is populated with the question, context, ground truth, candidate model response and evaluation instructions at runtime. The evaluation instructions that are used for majority voting can be viewed [here](prompt_template/eval_criteria/evaluation_instructions_majority_vote.txt). Every LLM evaluator compares the model output to the given ground truth and gives a binary verdict (`correct`/`incorrect`) on the correctness of the candidate model response, along with an explanation for that verdict. For the purpose of this evaluation, FMBench uses 3 models as LLM evaluators, but users/customers can use fewer or more.

  1. Next, all LLM evaluations of the candidate models are sent through another evaluation layer. This layer uses the _Cosine Similarity Score_ to check the evaluations made by the panel of LLM evaluators, reinforcing the correctness or incorrectness of each verdict. It places each evaluation into one of 3 categories:

     1. If an LLM evaluator evaluates a candidate model response as `incorrect`, then we check if the _Cosine similarity score_ of that response is less than or equal to the `incorrect_verdict_cosine_similarity_threshold`. If so, then we define the evaluation done as correctly incorrect and move to the next.

     1. If an LLM evaluator evaluates a candidate model response as `correct`, then we check if the _Cosine similarity score_ of that response is greater than or equal to the `correct_verdict_cosine_similarity_threshold`. If so, then we define the evaluation done as correctly correct and move to the next.

     1. If an LLM evaluator evaluates a candidate model response as `correct` or `incorrect` but the corresponding Cosine Similarity threshold is not met, the response is labelled as `needs_further_human_or_LLM_evaluation`. Responses that are initially evaluated as `incorrect` and do not satisfy the cosine similarity threshold are still categorized as incorrect.

     ![](img/llm_eval_flowchart.png)

**_All evaluations are generated in a JSON format for further downstream analytics on the evaluation results_**
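
The string metrics described above can be illustrated with a minimal, standard-library sketch. The Levenshtein function below is the usual dynamic-programming formulation. The `token_set_overlap` function is only a simplified stand-in (a Jaccard overlap of de-duplicated tokens) and is not the exact token set ratio formula used by fuzzy-matching libraries. Both function names are hypothetical and are not part of FMBench.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    # keep the shorter string in the inner loop to use less memory
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def token_set_overlap(a: str, b: str) -> float:
    """Simplified stand-in for a token set comparison: tokenize on whitespace,
    remove duplicate tokens, and return the Jaccard overlap of the two sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein_distance("kitten", "sitting"))
print(token_set_overlap("the cat the cat", "cat the"))
```

Because duplicate tokens and token order are ignored, the second call treats the two strings as identical, which is the behaviour that makes token-set style metrics tolerant of repetition and reordering.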


#### Import all of the necessary libraries below to run this notebook


In [None]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys

if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))


In [None]:
import io
import re
import ray
import time
import json
import glob
import yaml
import boto3
import logging
import tempfile
import pandas as pd
from numpy import dot
import plotly.io as pio
from pathlib import Path
from statistics import mode
import plotly.express as px
from fmbench.utils import *
from fmbench.globals import *
from numpy.linalg import norm
from litellm import completion
from typing import List, Optional, Dict
import importlib.resources as pkg_resources
from sentence_transformers import SentenceTransformer
from fmbench.scripts.pricing import load_and_update_pricing


In [None]:
# set a logger to get logs
logging.basicConfig(
    format="[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)


In [None]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()


Load the Config.yml file, which contains information that is used across this benchmarking environment, such as information about the AWS account, prompts, and payloads to be used for invocations.


In [None]:
logger.info(f"CONFIG_FILE={CONFIG_FILE}")
config = load_main_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))


#### Load the associated pricing config file


In [None]:
# represents getting the config file from the s3 bucket/https path for pricing yml information
pricing_file_path: str = config["pricing"]

# initialize the pricing config file to None
pricing_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files("fmbench"), "configs")
logger.info(f"Using fmbench.configs directory: {config_dir}")

pricing_module = Path(config["pricing"])
logger.info(
    f"pricing config provided for inference from this model is --> {pricing_module}"
)
pricing_file_path = os.path.join(config_dir, pricing_module)
logger.info(f"pricing config file path is --> {pricing_file_path}")

instance_list = [
    experiment.get("instance_type")
    for experiment in config.get("experiments", [])
    if experiment.get("instance_type")
]


# Print the extracted instance types
logger.info(f"Extracted instances from the main config --> {instance_list}")

pricing_config = load_and_update_pricing(
    pricing_file_path, PRICING_FALLBACK_YAML_PATH, instance_list
)
logger.info(f"pricing config file recorded: {json.dumps(pricing_config, indent=2)}")


### Load the model evaluation information

---

The common model configuration file contains information about which evaluation strategy to use (`majority voting`),
the ground truth column if provided by the user in the config file, which FMs on Bedrock to use as LLM evaluators,
the prompt templates used by each LLM evaluator for Majority voting, the quantitative metric thresholds for an evaluation to be correct/incorrect,
directory paths, inference parameters and more.


In [None]:
# represents getting the config file path for the model evaluation yml information
model_eval_fpath: str = config["model_evaluations"]

# initialize the evaluation config to None
eval_config: Optional[Dict] = None

# get the current config dir path
config_dir = Path(pkg_resources.files("fmbench"), "configs")
logger.info(f"Using fmbench.configs directory: {config_dir}")

eval_module = Path(config["model_evaluations"])
logger.info(f"eval config provided for evaluation --> {eval_module}")
eval_file_path = os.path.join(config_dir, eval_module)
logger.info(f"eval config file path is --> {eval_file_path}")

# eval_config = load_config(eval_file_path).format(method_name=config['method_name'])
with open(eval_file_path, "r") as file:
    model_eval_info = file.read()
    # load the preliminary unformatted config file to fetch the method name and plug it into
    # the prompt template file names
    model_eval_info_config = yaml.safe_load(model_eval_info)
    model_eval_formatted_content = model_eval_info.format(
        ground_truth=config["datasets"].get("ground_truth_col_key", None),
        method_name=model_eval_info_config["model_evaluations"][
            "PoLL_Composition_and_Voting"
        ].get("method", None),
        question=config["datasets"].get("question_col_key", None),
    )
    eval_config = yaml.safe_load(model_eval_formatted_content)

# view all information that will be used in the evaluation process, which includes the ground truth
# in the dataset, the evaluation method (Majority voting) and associated information
logger.info(f"eval config file recorded: {json.dumps(eval_config, indent=2)}")


In [None]:
debug = False
if debug is True:
    metrics_path_file: str = os.path.join("..", "..", METADATA_DIR, METRICS_PATH_FNAME)
else:
    metrics_path_file: str = os.path.join(METADATA_DIR, METRICS_PATH_FNAME)
logger.info(
    f"cwd={os.getcwd()}, METADATA_DIR={METADATA_DIR}, METRICS_PATH_FNAME={METRICS_PATH_FNAME}, metrics_path_file={metrics_path_file}"
)
METRICS_DIR: str = Path(metrics_path_file).read_text().strip()
logger.info(f"metrics_path_file={metrics_path_file}, METRICS_DIR={METRICS_DIR}")


In [None]:
file_path: str = os.path.join(
    METRICS_DIR, config["report"]["per_inference_request_file"]
)
logger.info(f"File path containing the metrics per inference folder --> {file_path}")

# Read the file from S3
try:
    file_content = get_s3_object(config["aws"]["bucket"], file_path)
    # Use pandas to read the CSV content
    df_per_inference = pd.read_csv(io.StringIO(file_content))
    logger.info(
        f"{file_path} read into dataframe of shape {df_per_inference.shape}, "
        f"cols={df_per_inference.columns}"
    )
    logger.info(
        f"{file_path} contains results for the following endpoints={df_per_inference.endpoint_name.unique()}"
    )
    logger.info(df_per_inference.head())
except Exception as e:
    logger.error(f"Error reading from S3: {e}")


#### Remove duplicates in the inference file caused by higher concurrency levels

Calculate the accuracy on a unique set of data for each candidate model. If a given candidate model ran inferences at multiple concurrency levels for benchmarking purposes, FMBench uses only the unique set of prompts used per candidate model to get a measure of accuracy. This in turn reduces the time and cost of getting model evaluations through the panel of LLM evaluators.
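
The de-duplication idea can be sketched in plain Python as follows. This is only an illustration on hypothetical rows: it keeps the last row seen for each `(endpoint_name, prompt)` pair, which is the intent of a `drop_duplicates(["endpoint_name", "prompt"], keep="last")` call on a dataframe, although the row order of the result may differ from what pandas returns. The helper name and the sample values are assumptions.

```python
from typing import Dict, List, Tuple


def drop_duplicate_inferences(rows: List[Dict]) -> List[Dict]:
    """Keep one row per (endpoint_name, prompt) pair, preferring the last one seen."""
    latest: Dict[Tuple[str, str], Dict] = {}
    for row in rows:
        # a later row with the same key overwrites the earlier one
        latest[(row["endpoint_name"], row["prompt"])] = row
    return list(latest.values())


# hypothetical rows: the same prompt was sent at two concurrency levels
rows = [
    {"endpoint_name": "model-a", "prompt": "What is 2+2?", "concurrency": 1},
    {"endpoint_name": "model-a", "prompt": "What is 2+2?", "concurrency": 2},
    {"endpoint_name": "model-b", "prompt": "What is 2+2?", "concurrency": 1},
]
unique_rows = drop_duplicate_inferences(rows)
print(len(unique_rows))
```

Each candidate model keeps one row per prompt, so the judge panel is invoked once per unique prompt instead of once per concurrency level.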


In [None]:
df_per_inference.head(100)


In [None]:
logger.info(
    f"Inferences recorded from {len(df_per_inference.endpoint_name.unique())} endpoints."
)
logger.info(
    f"Shape of the inference file before removing duplicate inferences per candidate model: {df_per_inference.shape}"
)
df_per_inference = df_per_inference.drop_duplicates(
    ["endpoint_name", "prompt"], keep="last"
)
logger.info(
    f"Shape of the inference file after removing duplicate inferences per candidate model: {df_per_inference.shape}"
)
df_per_inference.head(10)


In [None]:
logger.info(
    f"Going to be using this inference file to generate evaluations on -> {df_per_inference.head()}"
)


### Relationship between prompt token length and inference latency for different instances and concurrency levels


In [None]:
logger.info(
    f"Information on the inference file being used for evaluations: {df_per_inference.latency.describe()}"
)


In [None]:
logger.info(
    f"Total number of inferences to evaluate from candidate models: {df_per_inference.shape[0]}"
)


### Use the `sentence-transformers/all-mpnet-base-v2` embeddings model to calculate the _Cosine Similarity_ scores

---

This portion of the evaluation step does the following:

1. Uses the `sentence-transformers/all-mpnet-base-v2` model from Hugging Face. This is a sentence-transformers model. It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

1. Uses the embeddings model to get quantitative metrics from the inferences. This gives a similarity score between the ground truth answers from a dataset (if any are given) and the actual responses received from the model during inference.
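
As a minimal sketch of the score itself, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The example below uses only the standard library and short hand-written vectors in place of real 768-dimensional embeddings, so it runs without downloading the embeddings model. The function name is hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Cosine of the angle between two vectors, or None if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # the score is undefined for a zero vector
        return None
    return dot_product / (norm_a * norm_b)


# vectors pointing in the same direction score close to 1 regardless of magnitude
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# orthogonal vectors score 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The first call shows why the metric is insensitive to vector magnitude: scaling a vector changes its norm but not its direction.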


In [None]:
# get the quantitative evaluation information from the config file, such as the embeddings model
# to be used
embeddings_model_quantitative_info: Dict = eval_config["model_evaluations"][
    "quantitative_eval_info"
]


def load_model():
    """
    This function loads the sentence-transformers model based on the provided model ID.
    """
    try:
        model = None
        model_id = embeddings_model_quantitative_info["embeddings_model_id"].get(
            "model_id", None
        )
        if model_id:
            model = SentenceTransformer(model_id)
        else:
            raise ValueError(
                "Model ID is not provided or invalid in the configuration."
            )
    except Exception as e:
        logger.error(
            f"The SentenceTransformer embeddings model could not be loaded: {e}"
        )
        model = None
    return model


In [None]:
# load the embeddings model to calculate the cosine similarity scores
model = load_model()
logger.info(
    f"Embeddings model info which will be used to calculate the cosine similarity scores for Majority Voting Eval: {model}"
)


In [None]:
def calculate_cosine_similarity(text1: str, text2: str) -> Optional[float]:
    """
    This function calculates the cosine similarity between two texts. In this case,
    the cosine similarity is the comparison between the ground truth in the given dataset
    and the candidate model's response. Returns None if the score cannot be calculated.
    """
    try:
        cosine_similarity_score: Optional[float] = None
        # get the embedding for each text using the sentence-transformers model.
        A = model.encode([text1])[0]
        B = model.encode([text2])[0]
        cosine_similarity_score = dot(A, B) / (norm(A) * norm(B))
        logger.info(
            f"Calculating the cosine similarity score, current score: {cosine_similarity_score}"
        )
    except Exception as e:
        logger.error(f"Cosine similarity was not calculated at this iteration: {e}")
        cosine_similarity_score = None
    return cosine_similarity_score


In [None]:
# get the method that is being used to evaluate the content (which is Majority voting)
model_eval_subjective_info: Dict = eval_config["model_evaluations"][
    "subjective_eval_info"
]
method_name: str = eval_config["model_evaluations"]["PoLL_Composition_and_Voting"].get(
    "method", None
)
logger.info(
    f"The evaluation method FMBench is going to use to evaluate different model responses: {method_name}"
)
logger.info(
    f"judge panel being used to evaluate model responses: {model_eval_subjective_info.get('judge_panel_list', None)}"
)


In [None]:
# calculate the quantitative metrics if evaluation is set to Majority voting
logger.info(
    f"Valid ground truth column found in the inference file: {eval_config['model_evaluations'].get('ground_truth_col')}, calculating cosine similarity scores"
)
logger.info(
    "~Creating embeddings and calculating cosine similarity scores for all candidate model responses now. This might take 1-2 minutes~"
)
ground_truth_col_name: Optional[str] = config["datasets"].get(
    "ground_truth_col_key", None
)

# Check for ground truth column and raise an exception if not found
if ground_truth_col_name is None:
    raise ValueError(
        f"Expected a valid ground truth column name in the config file information, got {ground_truth_col_name}. Cannot continue."
    )

# If we reach this point, we know the ground truth column exists
df_per_inference["cosine_similarity_score"] = df_per_inference.apply(
    lambda row: calculate_cosine_similarity(row["completion"], row["ground_truth"]),
    axis=1,
)

logger.info(f"Calculated the cosine similarity score: {df_per_inference.head()}")


## Model Evaluations: Hierarchical Flow

---

1. Check for the semantic match/similarity between the ground truth and the answer using one main quantitative metric: the _Cosine similarity score_.

1. Parse each candidate model response through a panel of LLM evaluators to determine the accuracy of that model across the entire dataset.

1. Send all evaluations from the LLM evaluators through a final evaluation layer to check if the evaluation is correctly or incorrectly made with the help of quantitative metric thresholds.
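
The final evaluation layer in step 3 can be sketched as below. This is one reading of the rules described at the top of this notebook, not the FMBench implementation: a `correct` verdict is confirmed only when the cosine similarity score reaches the correct-verdict threshold, while an `incorrect` verdict stays incorrect whether or not the threshold confirms it. The function name and the threshold values used in the example are hypothetical; the real thresholds come from the evaluation config.

```python
from typing import Optional, Tuple

NEEDS_REVIEW = "needs_further_human_or_LLM_evaluation"


def verify_verdict(
    verdict: str,
    cosine_similarity_score: Optional[float],
    correct_verdict_cosine_similarity_threshold: float,
    incorrect_verdict_cosine_similarity_threshold: float,
) -> Tuple[str, bool]:
    """Return (final label, whether the cosine similarity score confirmed the verdict)."""
    score = cosine_similarity_score
    if verdict == "correct":
        if score is not None and score >= correct_verdict_cosine_similarity_threshold:
            return "correct", True
        # the judge said correct but the score does not back it up
        return NEEDS_REVIEW, False
    if verdict == "incorrect":
        confirmed = (
            score is not None
            and score <= incorrect_verdict_cosine_similarity_threshold
        )
        # an incorrect verdict stays incorrect even when it is not confirmed
        return "incorrect", confirmed
    # anything other than correct/incorrect cannot be verified automatically
    return NEEDS_REVIEW, False


# illustrative thresholds only
print(verify_verdict("correct", 0.9, 0.5, 0.2))
print(verify_verdict("correct", 0.3, 0.5, 0.2))
print(verify_verdict("incorrect", 0.6, 0.5, 0.2))
```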


In [None]:
# define the all_metrics path to send the evaluation metrics to
all_metrics_fpath: str = os.path.join(METRICS_DIR, config["report"]["all_metrics_file"])
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_cosine_similarity_scores_csv = csv_buffer.getvalue()
inference_cosine_similarity_scores_s3_path = os.path.join(
    METRICS_DIR, PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES
)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(
    df_per_inference_with_cosine_similarity_scores_csv,
    BUCKET_NAME,
    "",
    METRICS_DIR,
    PER_INFERENCE_FILE_WITH_COSINE_SIMILARITY_SCORES,
)
logger.info(
    f"Per inference cosine similarity scores saved to s3://{BUCKET_NAME}/{inference_cosine_similarity_scores_s3_path}"
)
df_per_inference.head()


### Use _Panel of LLM Evaluators_ to get Subjective Evaluations on various evaluation criteria

---

In this portion of the notebook, we run evaluations on all candidate models using a panel of LLM evaluators. We use one main evaluation method: `Majority Voting`. To reduce intra-model bias, answer correctness is scored not by a single judge but by a panel composed of multiple evaluator models.

1. **Majority Voting**: We use the PoLL to evaluate candidate model responses by checking their correctness against the ground truth answer provided in the dataset. We prompt each evaluator on the PoLL to respond in a JSON structure, giving a verdict on whether the response is correct or incorrect based on its comparison with the ground truth, and an explanation as to why. With all verdicts and responses in JSON, we can perform downstream tasks such as:

   1. Calculate the overall accuracy of each model using the correct versus the (correct + incorrect) responses

   1. Calculate the `error rate` or frequency of incorrect responses

   1. Categorize the errors based on the explanations provided by the evaluators. Common categories might include misunderstanding the question, incomplete answers, factual inaccuracies

   1. Summary of overall correct/incorrect, and the best model based on the PoLL. Rank the models on Correctness versus Incorrectness.
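
A minimal sketch of the voting and accuracy calculation described above follows. It assumes each judge returns a JSON object with a `verdict` key set to `correct` or `incorrect`; that key name, the function names and the sample completions are assumptions made for illustration and may not match the actual prompt templates.

```python
import json
from collections import Counter
from typing import Dict, List, Optional


def parse_verdict(judge_completion: str) -> Optional[str]:
    """Extract the verdict from one judge's JSON completion, or None if unusable."""
    try:
        verdict = json.loads(judge_completion).get("verdict")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None
    return verdict if verdict in ("correct", "incorrect") else None


def majority_vote(judge_completions: List[str]) -> Optional[str]:
    """Return the verdict given by most judges, or None on a tie or no valid votes."""
    counts = Counter(
        v for v in map(parse_verdict, judge_completions) if v is not None
    )
    if counts["correct"] == counts["incorrect"]:
        return None
    return "correct" if counts["correct"] > counts["incorrect"] else "incorrect"


def accuracy_and_error_rate(final_verdicts: List[Optional[str]]) -> Dict:
    """Accuracy = correct / (correct + incorrect); undecided responses are excluded."""
    correct = final_verdicts.count("correct")
    incorrect = final_verdicts.count("incorrect")
    total = correct + incorrect
    if total == 0:
        return {"accuracy": None, "error_rate": None}
    return {"accuracy": correct / total, "error_rate": incorrect / total}


# three hypothetical judge completions for one candidate response
panel = [
    '{"verdict": "correct", "explanation": "matches the ground truth"}',
    '{"verdict": "incorrect", "explanation": "missing a detail"}',
    '{"verdict": "correct", "explanation": "same meaning"}',
]
print(majority_vote(panel))
print(accuracy_and_error_rate(["correct", "correct", "incorrect", None]))
```

With an odd number of judges and a binary verdict, a tie can only occur when a judge's completion fails to parse, which is why the sketch returns `None` rather than guessing.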


In [None]:
# get the qualitative/subjective evaluation information from the config file to evaluate answers from different
# endpoints on various criteria
model_eval_subjective_info: Dict = eval_config["model_evaluations"][
    "subjective_eval_info"
]
eval_criteria_list = model_eval_subjective_info.get("eval_criteria", None)
logger.info(
    f"available llm as a judge evaluation information to use: {json.dumps(model_eval_subjective_info, indent=2)}"
)


In [None]:
# get the inference parameters that the LLM judge panel will use while evaluating model candidate responses
INFERENCE_PARAMETERS_LLM_PANEL: Dict = eval_config["model_evaluations"][
    "subjective_eval_info"
].get("inference_parameters", None)
logger.info(
    f"Inference parameters that LLM evaluators will use: {INFERENCE_PARAMETERS_LLM_PANEL}"
)


In [None]:
def get_llm_evaluation(model_id: str, prompt: str):
    """
    Get inference using LiteLLM. This function is called by each evaluator on the panel of
    llm evaluators to get a response on a given prompt. This is in the case of where there is
    Majority voting enabled
    """
    # represents the service name
    logger.info(f"get_inference, model_id={model_id}")
    service_name: str = "bedrock"
    # represents creating the bedrock model to invoke the litellm api for response for titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name
    # initialize the response dict
    ret = dict(
        exception=None,
        prompt=prompt,
        completion=None,
        completion_token_count=None,
        prompt_token_count=None,
        input_token_cost=None,
        output_token_cost=None,
        total_cost=None,
        model_id=model_id,
    )
    body = ret["prompt"]
    os.environ["AWS_REGION_NAME"] = aws_region
    try:
        # Represents calling the litellm completion/messaging api utilizing the completion/embeddings API
        print(f"Invoking {bedrock_model}......")
        response = completion(
            model=bedrock_model,
            messages=[{"content": body, "role": "user"}],
            temperature=INFERENCE_PARAMETERS_LLM_PANEL.get("temperature", 0.1),
            max_tokens=INFERENCE_PARAMETERS_LLM_PANEL.get("max_tokens", 100),
            caching=INFERENCE_PARAMETERS_LLM_PANEL.get("caching", False),
        )
        print(f"response: {response}")
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens
        ret["prompt_token_count"] = response.usage.prompt_tokens
        ret["completion_token_count"] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret["exception"] = e
    logger.info(f"completion: {ret['completion']}")
    return ret


In [None]:
def safe_filename(s):
    """
    convert a string to another string that can be used as a filename
    i.e. remove white space and non-word chars
    """
    if s is None:
        return "None"
    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", "", s)
    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", "-", s)
    return s


In [None]:
def parse_as_json(x: str) -> Optional[Dict]:
    """
    Convert a string into a dictionary. Remove any
    stray whitespaces which could break the json parsing
    """
    d: Optional[Dict] = None
    try:
        x = x.replace("\n", "").replace("\t", "")
        d = json.loads(x)
    except Exception as e:
        print(f"parse_as_json, error parsing string as json, error={e}, string={x}")
    return d


In [None]:
df_per_inference.rename(
    columns={"completion": "candidate_model_response"}, inplace=True
)
df_per_inference.head()


#### Prepare the evaluation prompt payloads

---

Here, the evaluation prompt template is prepared for the LLM judge that evaluates the answers.
The prompt template function fills the prompt template with a set of rules, the question, the answer,
and the ground truth (if any).
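
The template filling can be sketched with `str.format` as shown below. The template text here is hypothetical and much shorter than the real evaluation prompt templates; only the placeholder names (`rules`, `answer`, `ground_truth`, `question`) follow the function defined in the next cell. One detail worth noting when a template shows an example JSON response: literal braces in the template have to be doubled (`{{` and `}}`) so that `str.format` does not treat them as placeholders.

```python
from typing import Optional

# hypothetical template; doubled braces render as literal { and }
EVAL_TEMPLATE = (
    "Evaluation instructions:\n{rules}\n\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Candidate answer: {answer}\n\n"
    'Respond in JSON, for example: {{"verdict": "correct", "explanation": "..."}}'
)


def build_eval_prompt(
    eval_template: str,
    answer: str,
    rules: str,
    ground_truth: Optional[str],
    question: Optional[str],
) -> str:
    """Fill the placeholders of the evaluation template with one sample's values."""
    return eval_template.format(
        rules=rules, answer=answer, ground_truth=ground_truth, question=question
    )


prompt = build_eval_prompt(
    EVAL_TEMPLATE,
    answer="The capital is Paris {as of today}.",
    rules="Compare the candidate answer with the ground truth.",
    ground_truth="Paris",
    question="What is the capital of France?",
)
print(prompt)
```

Braces inside the substituted values, such as the ones in the sample answer, are left untouched because only the template itself is parsed for placeholders.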


In [None]:
def prepare_eval_prompts(
    eval_template: str,
    answer: str,
    rules: str,
    ground_truth: Optional[str],
    question: Optional[str],
):
    """
    This function prepares the evaluation prompts by filling the standard eval prompt template
    with the evaluation rules, question, answer and ground truth (if any ground truth is provided).
    This function prepares prompt payloads for the Majority voting evaluation method. In the
    case of Majority voting, there is no subjective criteria that is inputted.
    """
    try:
        processed_eval_template: Optional[str] = None
        processed_eval_template = eval_template.format(
            rules=rules, answer=answer, ground_truth=ground_truth, question=question
        )
    except Exception as e:
        logger.error(
            f"Error encountered while generating the evaluation prompt template: {e}"
        )
        processed_eval_template = None
    return processed_eval_template


In [None]:
def clear_dir(dir_path: str):
    files = glob.glob(os.path.join(dir_path, "*"))
    for f in files:
        os.remove(f)


# create the metrics directory that stores all of the json files containing evaluations from all Panel of LLM evaluators
METRICS_PER_POLL_EVAL_DIR: str = os.path.join(
    METRICS_DIR, METRICS_PER_POLL_EVAL_DIR_NAME
)
_ = list(map(clear_dir, [METRICS_PER_POLL_EVAL_DIR]))


In [None]:
# Retrieve the token-based pricing information used to cost the judge model invocations
bedrock_pricing = pricing_config["pricing"]["token_based"]


In [None]:
def normalize_candidate_model_name(model_name: str) -> str:
    # the candidate model is actually an endpoint name, so remove the timestamp and -endpoint from
    # string so "Meta-Llama-3-1-8B-Instruct-g5-2024-08-17-01-25-45-284-endpoint" would become
    # "Meta-Llama-3-1-8B-Instruct-g5", and no change would happen for Bedrock models as they dont
    # contain timestamp, for example anthropic.claude-3-opus-20240229-v1:0 would remain unchanged

    # regex to match the timestamp and the endpoint part
    regex = r"-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}-\d{3}.*$"
    # removing the matched part
    model_name_normalized = re.sub(regex, "", model_name)
    return model_name_normalized


def run_panel_of_llm_evals(
    i: int, total: int, row: Dict, model_id: str, eval_method_name: str, uuid: str
) -> Dict:
    """
    Runs the evaluation for one row
    The eval prompt is already available in the row dictionary
    and we simply want to run the inference against the judge model.
    The results are returned in a new dictionary that contains the model
    response and some fields from the original dictionary
    """
    try:
        # initialize the response dictionary that contains the pricing information
        # along with other metrics. If there is any error encountered, this response
        # dictionary is returned as is
        resp = dict(
            exception=None,
            completion=None,
            completion_token_count=None,
            prompt_token_count=None,
            input_token_cost=None,
            output_token_cost=None,
            total_cost=None,
            model_id=model_id,
            candidate_model_response=row["candidate_model_response"],
            candidate_model=None,
            payload_file=row["payload_file"],
            cosine_similarity_score=row["cosine_similarity_score"],
            ground_truth=row["ground_truth"],
            question=row["question"] if "question" in row else None,
        )
        # normalize the endpoint name to get the candidate model name; `resp` keeps the
        # default values initialized above until the judge model response replaces it
        candidate_model = normalize_candidate_model_name(row["endpoint_name"])
        logger.info(
            f"run_eval, row {i}/{total}, judge_model_id={model_id}, candidate model={candidate_model}"
        )
        # create the payload for model inference using the evaluation method passed to this function
        prompt = row[f"{model_id}_{eval_method_name}_eval_prompt"]
        # generate the evaluation on the data using the model judge
        resp = get_llm_evaluation(model_id, prompt)
        # assign the completion from the candidate model to the `candidate_model_response`,
        # and the actual evaluation will be contained in a field called `completion`
        resp["candidate_model_response"] = row["candidate_model_response"]
        resp["candidate_model"] = candidate_model
        resp["payload_file"] = row["payload_file"]
        resp["cosine_similarity_score"] = row["cosine_similarity_score"]
        # Calculate cost based on the number of input and output tokens
        model_pricing = bedrock_pricing.get(model_id, None)
        if model_pricing:
            resp["input_token_cost"] = (
                resp["prompt_token_count"] / 1000.0
            ) * model_pricing["input-per-1k-tokens"]
            resp["output_token_cost"] = (
                resp["completion_token_count"] / 1000.0
            ) * model_pricing["output-per-1k-tokens"]
            resp["total_cost"] = resp["input_token_cost"] + resp["output_token_cost"]
            logger.info(
                f"model_id={model_id}, prompt_tokens={resp['prompt_token_count']}, "
                f"input_token_cost={resp['input_token_cost']}, completion_tokens={resp['completion_token_count']}, "
                f"output_token_cost={resp['output_token_cost']}, total cost={resp['total_cost']}"
            )
        else:
            logger.error(
                f'model pricing for "{model_id}" not found, '
                f"cannot calculate experiment cost"
            )
        # include the ground truth (used for Majority voting) and the question,
        # if available, in the json response
        resp["ground_truth"] = row["ground_truth"]
        if "question" in row:
            resp["question"] = row["question"]
    except Exception as e:
        logger.error(f"Error encountered while running evaluation: {e}")
        resp["exception"] = str(e)
    return resp


# we use Ray to parallelize
@ray.remote
def async_run_eval(
    i: int, total: int, row: Dict, model_id: str, eval_method_name: str, uuid: str
) -> Dict:
    print(
        f"async_run_eval, i={i}, total={total}, judge_model_info={model_id}, eval_method: {eval_method_name}, uuid: {uuid}"
    )
    logger.info(
        f"async_run_eval, i={i}, total={total}, judge_model_info={model_id}, eval_method: {eval_method_name}, uuid: {uuid}"
    )
    return run_panel_of_llm_evals(i, total, row, model_id, eval_method_name, uuid)


In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
df_per_inference_list = json.loads(df_per_inference.to_json(orient="records"))
logger.info(
    f"Total number of candidate model responses going to be evaluated: {len(df_per_inference_list)}"
)


#### Prepare evaluation prompt templates

---

This portion of the step prepares the evaluation prompt templates that the PoLL uses during `Majority Voting`.


In [None]:
model_eval_subjective_info


In [None]:
logger.info(
    f"Number of judges being used for this model evaluation: {len(model_eval_subjective_info.get('judge_panel_list', []))}"
)
logger.info(
    f"Inference Parameters that are going to be used by the judge panels while evaluating candidate models: {model_eval_subjective_info.get('inference_parameters', None)}"
)


#### Prepare prompt payloads

---

In this portion of the step, FMBench iterates through each row containing a model response and prepares the corresponding prompt payload, using the prompt template for the given evaluation method. For Majority Voting, a standard prompt template is combined with the evaluation instructions and the candidate model response.
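
As a rough illustration of this step, the sketch below fills a template once per response row. The template text, its placeholder names, and `build_eval_prompt` are hypothetical and only stand in for the idea; the `prepare_eval_prompts` helper and the templates that FMBench actually ships may look different.

```python
from typing import Dict, List

# Hypothetical template; the real templates are read from the prompt template directory.
EVAL_TEMPLATE = (
    "{rules}\n\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Candidate answer: {answer}\n"
)


def build_eval_prompt(
    template: str, answer: str, rules: str, ground_truth: str, question: str
) -> str:
    """Fill the (hypothetical) template with the fields of one response row."""
    return template.format(
        rules=rules, question=question, ground_truth=ground_truth, answer=answer
    )


# One record per candidate model response, mirroring the columns used in this notebook
rows: List[Dict[str, str]] = [
    {
        "question": "What is the capital of France?",
        "ground_truth": "Paris",
        "candidate_model_response": "The capital is Paris.",
    }
]
rules = "Reply with a JSON object containing a verdict and an explanation."
eval_prompts = [
    build_eval_prompt(
        EVAL_TEMPLATE,
        r["candidate_model_response"],
        rules,
        r["ground_truth"],
        r["question"],
    )
    for r in rows
]
```

Each judge model gets its own prompt column in the notebook, so the same loop would run once per judge with that judge's template.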


In [None]:
# Assuming fmbench is a valid Python package and scripts is a subdirectory within it
model_eval_dir: Optional[Dict] = eval_config["model_evaluations"]["model_eval_dir"]
eval_prompts_dir: Path = Path(
    pkg_resources.files("fmbench"),
    f"{config['s3_read_data']['prompt_template_dir']}/{model_eval_dir.get('eval_prompts_dir', None)}",
)

try:
    # Iterate through each LLM as a judge and each evaluation criterion
    for llm_info in model_eval_subjective_info.get("judge_panel_list", []):
        model_id: str = llm_info["model_id"]
        method_name: str = eval_config["model_evaluations"][
            "PoLL_Composition_and_Voting"
        ].get("method", None)
        eval_prompt_template_fname: str = (
            f"{llm_info.get('eval_prompt_template_name', None)}.txt"
        )

        # Use the evaluation prompt template path to read in the standard prompt template that
        # is used in the creation of prompt payloads
        eval_prompt_template_dir = llm_info.get("eval_prompt_template_dir", None)
        eval_prompt_template_path = os.path.join(
            eval_prompts_dir, eval_prompt_template_dir, eval_prompt_template_fname
        )
        logger.info(
            f"evaluation prompt template file path being used for {model_id}: {eval_prompt_template_path}"
        )
        logger.info(
            f"evaluation prompt template file name: {eval_prompt_template_fname}"
        )
        eval_prompt_template = Path(eval_prompt_template_path).read_text()
        logger.info(f"Evaluation prompt template being used: {eval_prompt_template}")

        # Each evaluation method has a standard instructions file on how to evaluate the
        # model responses (whether it should be a binary decision or a rating on a scale of 1-5)
        eval_instructions_fname = next(
            (
                rule
                for rule in model_eval_dir.get("eval_instructions_files", None)
                if method_name in rule
            ),
            None,
        )
        rules = Path(
            os.path.join(eval_prompts_dir, eval_instructions_fname)
        ).read_text()
        logger.info(f"rules: {rules}")
        column_name = f"{model_id}_{method_name}_eval_prompt"
        df_per_inference[column_name] = df_per_inference.apply(
            lambda r: prepare_eval_prompts(
                eval_prompt_template,
                r["candidate_model_response"],
                rules,
                r["ground_truth"],
                r["question"],
            ),
            axis=1,
        )
except Exception as e:
    logger.error(f"Error occurred in the creation of prompt payloads: {e}")
    df_per_inference = None

df_per_inference.head()


In [None]:
csv_buffer = io.StringIO()
df_per_inference.to_csv(csv_buffer, index=False)
df_per_inference_with_eval_prompt_payloads = csv_buffer.getvalue()
eval_prompt_payloads_for_inference = os.path.join(
    METRICS_DIR, PROCESSED_EVAL_PROMPT_PAYLOADS
)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(
    df_per_inference_with_eval_prompt_payloads,
    BUCKET_NAME,
    "",
    METRICS_DIR,
    PROCESSED_EVAL_PROMPT_PAYLOADS,
)
logger.info(
    f"Evaluation prompt payloads saved to s3://{BUCKET_NAME}/{eval_prompt_payloads_for_inference}"
)
df_per_inference.head()


In [None]:
df_per_inference.shape


In [None]:
# convert the dataframe into a list of dicts as that is easy to parallelize via Ray
eval_records_list = json.loads(df_per_inference.to_json(orient="records"))
logger.info(f"Total number of evaluations to be done: {len(eval_records_list)}")


### Run the hierarchy of Model Evaluations

---

In this portion of the step, FMBench performs the following actions:

1. For `Majority Voting`, we assume that a ground truth already exists in the dataset. We first calculate quantitative metrics.

1. We use the panel of LLM judges (in this case, 3 judges) to give a verdict on whether the `answer` from each candidate model during inference is `correct` or `incorrect`. Each judge also explains why it evaluated a candidate model response as correct or incorrect.

1. Each judge response is returned as a JSON structure, which is used for downstream analytics such as comparing evaluation results across candidate models.

1. The evaluations are sent through a final layer that decides whether each evaluation made by an LLM evaluator was made correctly or incorrectly.

**_This step takes a couple of minutes to complete based on the size of the dataset and the judge models. Model completion time depends on the PoLL models being used. `Llama3-70b`, `Cohere command-r-v1` and `claude 3 Sonnet` were used for this example_**


In [None]:
# get the llm as a judge panel list
judge_panel_list: List[Dict] = model_eval_subjective_info.get("judge_panel_list", [])
logger.info(
    f"The judge panel list contains {len(judge_panel_list)} judges. Their information: {judge_panel_list}"
)


In [None]:
logger.info(
    f"~Panel of LLM evaluators are going to start evaluating responses. This might take a couple of minutes depending on the size of the dataset and candidate model responses~"
)


In [None]:
is_quantitative_eval_enabled: bool = eval_config["model_evaluations"][
    "PoLL_Composition_and_Voting"
].get("use_quantitative_metrics", False)
logger.info(
    f"Are quantitative metrics going to be used to make a final eval decision: {is_quantitative_eval_enabled}"
)


### Start the evaluation process

---

This process loops through the prepared evaluation prompt payloads. For Majority Voting, each judge generates a JSON object containing 2 elements: a "verdict" stating whether the given answer is correct or incorrect, and an "explanation".

The judge responses are sent to further downstream processes to determine the most accurate
and subjectively correct model based on domain specific use cases.
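
To make the expected judge output concrete, here is a minimal sketch of parsing one completion into a verdict and an explanation. `parse_verdict` is a hypothetical helper written for this illustration; the notebook uses its own `parse_as_json` function, which may handle malformed output differently.

```python
import json
from typing import Dict, Optional


def parse_verdict(completion: str) -> Optional[Dict[str, str]]:
    """Parse the outermost brace-delimited span of a judge completion.

    Returns None when no usable JSON object with a valid verdict is found.
    """
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(completion[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Normalize the verdict so that "Correct" and "correct" are treated alike
    verdict = str(parsed.get("verdict", "")).strip().lower()
    if verdict not in ("correct", "incorrect"):
        return None
    return {"verdict": verdict, "explanation": str(parsed.get("explanation", ""))}


example = 'Here is my evaluation: {"verdict": "Correct", "explanation": "Matches the ground truth."}'
result = parse_verdict(example)
```

Returning `None` for unusable completions keeps parsing failures separate from genuine `incorrect` verdicts, which is one possible design choice rather than a description of FMBench's behaviour.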


In [None]:
n: int = model_eval_subjective_info.get("run_parallel_inference_count", 5)
list_of_lists = [
    eval_records_list[i * n : (i + 1) * n]
    for i in range((len(eval_records_list) + n - 1) // n)
]
resp_list = []
erroneous_count: int = 0
st: float = time.perf_counter()

# Iterate over the judge panel and sublists
for judge_panelist_info in judge_panel_list:
    logger.info(
        f"============Running inference for judge panelist {judge_panelist_info['model_id']} for {method_name} ============"
    )
    for idx, sublist in enumerate(list_of_lists):
        model_id: str = judge_panelist_info["model_id"]
        logger.info(
            f"Getting inference for list {idx + 1}/{len(list_of_lists)}, size of list={len(sublist)}"
        )
        try:
            resp_list.extend(
                ray.get(
                    [
                        async_run_eval.remote(
                            i + 1,
                            len(sublist),
                            record,
                            model_id,
                            method_name,
                            record["uuid"],
                        )
                        for i, record in enumerate(sublist)
                    ]
                )
            )
        except Exception as e:
            logger.error(f"Error processing list {idx + 1}/{len(list_of_lists)}: {e}")
            erroneous_count += 1
    # Sleep for one second before moving on to the next judge model
    logger.info(
        "~Sleeping for one second before the next judge model evaluates the responses~"
    )
    time.sleep(1)

elapsed_time = time.perf_counter() - st
logger.info(f"Total elapsed time for inference: {elapsed_time:.2f} seconds")
logger.info(f"Total erroneous lists: {erroneous_count}")


#### Send all Panel of LLM evaluator responses to S3 as `JSON` files

---


In [None]:
# Collect all of the panel of LLM evals and send them all as JSON files to S3
if resp_list:
    save_s3_list = []
    try:
        for resp in resp_list:
            if resp:
                llm_eval_response = json.dumps(resp, indent=2)
                candidate_model_id = resp.get("candidate_model", None)
                if candidate_model_id:  # Ensure candidate_model_id is not None
                    # Extract a few words from the poll eval response to append to the file name
                    response_excerpt = " ".join(
                        resp.get("candidate_model_response", "").split()[:5]
                    )
                    sanitized_response_excerpt = "".join(
                        [c if c.isalnum() else "_" for c in response_excerpt]
                    )
                    llm_eval_json_fname = f"{candidate_model_id}_{time.time()}_{sanitized_response_excerpt}.json"
                    response_s3_path = os.path.join(
                        METRICS_PER_POLL_EVAL_DIR, llm_eval_json_fname
                    )
                    logger.info(
                        f"Sending model eval result files to s3 path prefix: {response_s3_path}"
                    )
                    save_s3_list.append(
                        (
                            llm_eval_response,
                            config["aws"]["bucket"],
                            "",
                            METRICS_PER_POLL_EVAL_DIR,
                            llm_eval_json_fname,
                        )
                    )
                else:
                    logger.warning(
                        "candidate_model_id is None, skipping this response."
                    )
            else:
                logger.warning("Response is None, skipping this entry.")
        if save_s3_list:
            # Split the save_s3_list into smaller batches to get
            # rid of the cannot write to s3 bucket - request rate was hitting maximum threshold
            batch_size: int = 50
            delay: float = 1
            for i in range(0, len(save_s3_list), batch_size):
                batch = save_s3_list[i : i + batch_size]
                # write a batch of evaluation result files to s3
                write_multiple_to_s3(batch)
                time.sleep(delay)  # Delay between batches
        else:
            logger.error("No valid responses to write to S3.")

    except Exception as e:
        logger.error(f"Error processing or writing to S3: {e}")
else:
    logger.info("No responses to write to S3")


### Save All Results: Perform downstream analytical tasks on each PoLL evaluation result

---

In this portion of the evaluation step:

1. We compile all metrics gathered from the Majority Voting experiment, and send them as `CSV` and `txt` files to S3.

1. These metrics include quantitative metrics and binary decision scores (for Majority Voting).


In [None]:
# convert the results list into a dataframe for easy analytics
df_eval_results = pd.DataFrame(resp_list)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how="all")
# the exception, judge model id, prompt token count, will be NaN for the verdicts decided
# using the lexical match and not moved forward to the panel of LLM evaluators
df_eval_results.head()


In [None]:
# parse out the completion from LLM as a judge and column bind
# the fields of the dictionary to the original results dataframe
df_eval_results_only = (
    df_eval_results["completion"].apply(parse_as_json).apply(pd.Series)
)
df_eval_results_only.dropna(axis=1, how="all")
df_eval_results = pd.concat([df_eval_results, df_eval_results_only], axis=1)
df_eval_results.rename(columns={"model_id": "judge_model_id"}, inplace=True)
logger.info(f"df_eval_results shape={df_eval_results.shape}")
df_eval_results.dropna(axis=1, how="all")
df_eval_results.head()


In [None]:
# create a new column and assign the original verdict to this column
df_eval_results["original_verdict"] = df_eval_results["verdict"]
df_eval_results.head(10)


### Evaluate the correctness of LLM Evaluators using quantitative metrics

---

In this portion of the evaluation step, we perform the following steps:

1. Check whether the LLM evaluators returned sound verdicts by applying another layer of checks based on the _Cosine Similarity Score_.

1. If a verdict from an LLM evaluator (`correct` or `incorrect`) does not meet the corresponding cosine similarity threshold, it is flagged for further analysis by a human or another LLM evaluation loop.

There are two possible cases for this evaluation:

1. **Incorrect Verdicts**: If the judge model's verdict is `incorrect`, check whether the cosine similarity of that response is
   less than or equal to the `incorrect_verdict_cosine_similarity_threshold`. If so, the verdict is kept as is in the dataframe.
   If the cosine similarity is higher than this threshold, the response is a candidate for further evaluation by a human or
   another LLM; the function below keeps the verdict as `incorrect` and records the threshold mismatch in the explanation.

2. **Correct Verdicts**: If the judge model's verdict is `correct` and the cosine similarity meets or exceeds the
   `correct_verdict_cosine_similarity_threshold`, the response is evaluated as correct and sent on for downstream analytics.
   Correct verdicts that do not meet this threshold are marked as `needs_further_human_or_LLM_evaluation`.
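
The two cases can be sketched as a small pure function. This is an illustration only: `check_verdict` and the threshold values are hypothetical, and the returned flag is an addition of this sketch. It follows the function defined in this section in keeping an `incorrect` verdict as `incorrect` even when the similarity is above the threshold.

```python
from typing import Tuple

NEEDS_REVIEW = "needs_further_human_or_LLM_evaluation"


def check_verdict(
    verdict: str,
    cosine_similarity: float,
    correct_threshold: float,
    incorrect_threshold: float,
) -> Tuple[str, bool]:
    """Return (final_verdict, agrees_with_threshold) for one judge verdict."""
    if verdict == "correct":
        if cosine_similarity >= correct_threshold:
            return "correct", True
        # A "correct" verdict with low similarity is flagged for review
        return NEEDS_REVIEW, False
    if verdict == "incorrect":
        # The verdict stays "incorrect"; the flag reports whether the score supports it
        return "incorrect", cosine_similarity <= incorrect_threshold
    # Any other label is passed through unchanged and not treated as confirmed
    return verdict, False


# Hypothetical thresholds; the real values come from the evaluation config file
CORRECT_T, INCORRECT_T = 0.5, 0.4
examples = [
    check_verdict("correct", 0.9, CORRECT_T, INCORRECT_T),
    check_verdict("correct", 0.2, CORRECT_T, INCORRECT_T),
    check_verdict("incorrect", 0.1, CORRECT_T, INCORRECT_T),
    check_verdict("incorrect", 0.8, CORRECT_T, INCORRECT_T),
]
```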


In [None]:
def quantitative_verdict_cosine_similarity_decision(row: pd.Series) -> pd.Series:
    """
    Given an LLM evaluator response, this function checks for whether a verdict provided by an LLM evaluator
    is correctly evaluated using a cosine similarity metric threshold for correct and incorrect verdicts. These
    are the two cases that this function handles for each evaluation done using LLM as evaluators:

    1. Incorrect Verdicts: If the verdict from the judge model is incorrect, then check if the cosine similarity of that
    response is less than or equal to the `incorrect_verdict_cosine_similarity_threshold`. If so, then it is
    sent as is into the dataframe. If the cosine similarity is higher than the incorrect cosine similarity threshold,
    the verdict is kept as incorrect and the explanation records that the threshold was not met, so that it can be
    reviewed further by a human or another LLM evaluation.

    2. Correct Verdicts: If the verdict from the judge model is correct and the cosine similarity meets or exceeds the correctness
    cosine similarity threshold, then the response is evaluated as correct and sent in for further downstream analytics. Correct verdicts
    that do not meet the correctness cosine similarity threshold are marked as "needs_further_human_or_LLM_evaluation".

    This function is used if the evaluation method being used is Majority voting, specifically in the case
    of when ground truth is provided.
    """
    try:
        # This is a boolean value that is returned defining whether a given verdict is valid based on
        # the comparison of its respective cosine similarity score and cosine similarity threshold for correctness/incorrectness
        is_eval_done_correctly: Optional[bool] = None
        correct_cosine_similarity_threshold: Optional[float] = None
        incorrect_cosine_similarity_threshold: Optional[float] = None

        # Check if the evaluation method is Majority voting and if the customer has enabled
        # evaluation decisions to also be made by quantitative metric thresholds
        if is_quantitative_eval_enabled:
            # Retrieve the information that is going to be used to check for whether a verdict is
            # incorrectly identified as correct or incorrect
            judge_model_id: str = row["judge_model_id"]
            verdict: str = row["verdict"]
            explanation: str = row["explanation"]
            cosine_similarity_score: float = row["cosine_similarity_score"]

            # Get the correctness and incorrectness cosine similarity threshold scores
            correct_cosine_similarity_threshold = eval_config["model_evaluations"][
                "quantitative_eval_info"
            ].get("correct_verdict_cosine_similarity_threshold", None)
            incorrect_cosine_similarity_threshold = eval_config["model_evaluations"][
                "quantitative_eval_info"
            ].get("incorrect_verdict_cosine_similarity_threshold", None)

            # If the verdict is correct and is greater than or equal to the correct cosine similarity threshold, then
            # the verdict is correct. If not, the verdict is identified to need further evaluation

            # include the original verdict here
            if verdict == "correct":
                if cosine_similarity_score >= correct_cosine_similarity_threshold:
                    row["explanation"] = (
                        f"{explanation} Cosine similarity is {cosine_similarity_score}, which meets or exceeds the threshold of {correct_cosine_similarity_threshold}."
                    )
                    is_eval_done_correctly = True
                else:
                    row["verdict"] = "needs_further_human_or_LLM_evaluation"
                    row["explanation"] = (
                        f"{explanation} Cosine similarity is {cosine_similarity_score}, which does not meet the threshold of {correct_cosine_similarity_threshold}. Evaluate it further to determine the correct answer."
                    )
                    is_eval_done_correctly = False

            # If the verdict is incorrect and the score is less than or equal to the incorrect cosine similarity
            # threshold, then the verdict is correctly identified as incorrect
            elif verdict == "incorrect":
                if cosine_similarity_score <= incorrect_cosine_similarity_threshold:
                    row["explanation"] = (
                        f"{explanation} Cosine similarity is {cosine_similarity_score}, which is at or below the threshold of {incorrect_cosine_similarity_threshold}."
                    )
                    is_eval_done_correctly = True
                else:
                    # The score is above the incorrect threshold. The original verdict is kept as
                    # incorrect and the explanation records that the threshold was not met
                    row["verdict"] = "incorrect"
                    row["explanation"] = (
                        f"{explanation} Cosine similarity of {cosine_similarity_score} is above the incorrect cosine similarity threshold of {incorrect_cosine_similarity_threshold}, so it does not meet the threshold."
                    )
                    is_eval_done_correctly = True
    except Exception as e:
        logger.error(
            f"Error in quantitative_verdict_cosine_similarity_decision: {str(e)}"
        )
        is_eval_done_correctly = None
    return row


#### Apply the layer of another evaluation filter on the dataframe containing all LLM as evaluator results

---


In [None]:
if df_eval_results is not None:
    df_eval_results = df_eval_results.apply(
        lambda r: quantitative_verdict_cosine_similarity_decision(r), axis=1
    )
df_eval_results.head()


In [None]:
# send the raw results as a csv file to the S3 bucket
csv_buffer = io.StringIO()
df_eval_results.to_csv(csv_buffer, index=False)
eval_llm_as_a_judge_results = csv_buffer.getvalue()
eval_results_csv_fpath = os.path.join(
    METRICS_DIR, MODEL_EVAL_COMPLETIONS_CSV
)  # Define full S3 path

# Write the CSV data to S3
write_to_s3(
    eval_llm_as_a_judge_results,
    BUCKET_NAME,
    "",
    METRICS_DIR,
    MODEL_EVAL_COMPLETIONS_CSV,
)
logger.info(
    f"Per PoLL model responses saved as a csv to s3://{BUCKET_NAME}/{eval_results_csv_fpath}"
)
df_eval_results.head()


In [None]:
logger.info(
    f"Total number of evaluations that are done using different panel of LLM evaluators: {df_eval_results.shape[0]}"
)


#### Calculate evaluation cost per LLM evaluator per candidate model

---

In this portion of the evaluation step, the evaluation cost is calculated. The cost for the input and output tokens processed per LLM evaluator for each evaluation for each candidate model is summed up to give a total cost for evaluating the dataset using each evaluator. The total cost is added up in the final model metrics step.
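
As a worked example of this calculation, the sketch below prices each evaluation from its token counts and sums the result per judge. The judge name, the prices, and the helper functions are hypothetical; only the `input-per-1k-tokens` and `output-per-1k-tokens` keys follow the pricing lookup used earlier in this notebook.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens; real values come from the pricing config
PRICING: Dict[str, Dict[str, float]] = {
    "judge-a": {"input-per-1k-tokens": 0.003, "output-per-1k-tokens": 0.015},
}


def evaluation_cost(
    prompt_tokens: int, completion_tokens: int, pricing: Dict[str, float]
) -> float:
    """Cost of one evaluation: input and output tokens are priced separately."""
    input_cost = (prompt_tokens / 1000.0) * pricing["input-per-1k-tokens"]
    output_cost = (completion_tokens / 1000.0) * pricing["output-per-1k-tokens"]
    return input_cost + output_cost


def total_cost_per_judge(
    records: List[Dict], pricing_table: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Sum the per-evaluation cost for each judge model."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        judge = rec["judge_model_id"]
        totals[judge] += evaluation_cost(
            rec["prompt_token_count"],
            rec["completion_token_count"],
            pricing_table[judge],
        )
    return dict(totals)


records = [
    {"judge_model_id": "judge-a", "prompt_token_count": 1000, "completion_token_count": 200},
    {"judge_model_id": "judge-a", "prompt_token_count": 500, "completion_token_count": 100},
]
totals = total_cost_per_judge(records, PRICING)
```

With these made-up prices, the first evaluation costs 0.003 + 0.003 and the second 0.0015 + 0.0015, so the judge's total is about 0.009 USD.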


In [None]:
eval_cost_df = (
    df_eval_results.groupby("judge_model_id")[
        ["total_cost", "prompt_token_count", "completion_token_count"]
    ]
    .sum()
    .reset_index()
)
eval_cost_df = eval_cost_df.sort_values("total_cost", ascending=False)
eval_cost_df


In [None]:
# Send the cost calculation for running each evaluator to s3. This CSV file contains the total cost (which is the
# summation of the input and output tokens across all evaluations across all candidate models), the total prompt token counts
# and the total completion token counts across the entire dataset
try:
    eval_cost_df = (
        df_eval_results.groupby("judge_model_id")[
            ["total_cost", "prompt_token_count", "completion_token_count"]
        ]
        .sum()
        .reset_index()
    )
    eval_cost_df = eval_cost_df.sort_values("total_cost", ascending=False)
    eval_cost_df["total_cost"] = round(eval_cost_df["total_cost"], 4)
    csv_buffer = io.StringIO()
    eval_cost_df.to_csv(csv_buffer, index=False)
    eval_cost_df_responses = csv_buffer.getvalue()
    eval_cost_df_responses_fpath = os.path.join(METRICS_DIR, EVAL_COST_PER_JUDGE_MODEL)
    write_to_s3(
        eval_cost_df_responses, BUCKET_NAME, "", METRICS_DIR, EVAL_COST_PER_JUDGE_MODEL
    )
    logger.info(
        f"Cost calculations for running each LLM evaluator to evaluate candidate models is sent to s3://{BUCKET_NAME}/{eval_cost_df_responses_fpath}"
    )
except Exception as e:
    logger.error(
        f"Could not calculate the total cost for running each LLM evaluator to evaluate candidate models: {e}"
    )

if eval_cost_df is not None:
    eval_cost_df.head(15)


### Majority Voting Results: Send the incorrect and correct responses to S3 separately in `CSV` files for downstream analytics for each model judge

---

In this portion of the step, we send the model responses as CSV and txt files to S3 for further downstream processing and report generation.

1. We calculate the majority vote using the verdicts from each judge on the panel of LLM judges.

1. Calculate the majority vote accuracy ranking for each candidate model, i.e., which candidate model ranked at the top using majority correct votes from the panel of LLM evaluators, and so on.

1. Generate metrics on a final `candidate_model_accuracy` table containing insights into the accuracy of each candidate model per judge, as well as the accuracy of that model across all judges as per the majority vote.
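
The first and last steps can be sketched with the standard library alone. This is an assumption-laden illustration: `majority_verdict` requires a strict majority and falls back to `no_majority_vote`, which is a design choice of this sketch and not necessarily what the notebook's `mode`-based function does.

```python
from collections import Counter
from typing import List


def majority_verdict(verdicts: List[str]) -> str:
    """Return the verdict held by a strict majority of judges, else 'no_majority_vote'."""
    if not verdicts:
        return "no_majority_vote"
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else "no_majority_vote"


def majority_voting_accuracy(votes: List[str]) -> float:
    """Percentage of 'correct' majority votes among the decided (correct or incorrect) ones."""
    correct = votes.count("correct")
    incorrect = votes.count("incorrect")
    decided = correct + incorrect
    return round(100 * correct / decided, 2) if decided else 0.0


# One inner list per question: the verdicts of the three judges for one candidate model
panel_verdicts = [
    ["correct", "correct", "incorrect"],
    ["correct", "correct", "correct"],
    ["incorrect", "incorrect", "correct"],
]
votes = [majority_verdict(v) for v in panel_verdicts]
accuracy = majority_voting_accuracy(votes)
```

Here two of the three questions receive a `correct` majority, so the accuracy for this candidate model is 66.67.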


In [None]:
# For Majority Voting - all responses from the panel of LLM as evaluators are sent
# to s3 as a csv file
try:
    logger.info(
        f"Method name is {method_name}, sending the correct and incorrect verdicts to s3"
    )
    verdict_types: List[str] = [
        "incorrect",
        "correct",
        "needs_further_human_or_LLM_evaluation",
    ]
    all_llm_eval_responses_df: Optional[pd.DataFrame] = None
    # iterate through each verdict type and save the responses of that verdict type from all evaluators in a separate
    # csv file. For example, one csv file contains only the incorrect verdicts from all model judges, whereas another
    # csv file contains only the correct verdicts.
    for verdict in verdict_types:
        df_verdicts = df_eval_results[df_eval_results["verdict"] == verdict]
        all_llm_eval_responses_df = pd.concat(
            [all_llm_eval_responses_df, df_verdicts], ignore_index=True
        )
        if not df_verdicts.empty:
            csv_buffer = io.StringIO()
            df_verdicts.to_csv(csv_buffer, index=False)
            verdict_responses = csv_buffer.getvalue()
            verdict_file = (
                INCORRECT_VERDICT_RESPONSES_FILE
                if verdict == "incorrect"
                else (
                    CORRECT_VERDICT_RESPONSES_FILE
                    if verdict == "correct"
                    else NEEDS_FURTHER_EVAL_FILE
                )
            )
            verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
            write_to_s3(verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file)
            logger.info(
                f"{verdict.capitalize()} verdict responses sent to s3://{BUCKET_NAME}/{verdict_responses_fpath}"
            )
            logger.info(
                f"Number of {verdict} responses in total: {df_verdicts.shape[0]}"
            )
except Exception as e:
    logger.error(f"Error encountered while writing the evaluation responses to s3: {e}")
    all_llm_eval_responses_df = None

all_llm_eval_responses_df.head()


In [None]:
# get the number of unique judges
num_judge_models: int = len(all_llm_eval_responses_df.judge_model_id.unique())
logger.info(f"there are {num_judge_models} LLM judge models")


In [None]:
# For Majority Voting - send all incorrect and correct verdicts as txt files to s3 for readability purposes
try:
    logger.info(
        f"Method name is {method_name}, sending the correct and incorrect verdicts to s3"
    )
    verdict_types: List[str] = [
        "incorrect",
        "correct",
        "needs_further_human_or_LLM_evaluation",
    ]
    judge_model_ids = df_eval_results["judge_model_id"].unique()
    # save each judge model's correct and incorrect verdict files as txt files
    # for downstream analytics and readability purposes
    for judge_model_id in judge_model_ids:
        for verdict in verdict_types:
            df_judge_verdict = df_eval_results[
                (df_eval_results["verdict"] == verdict)
                & (df_eval_results["judge_model_id"] == judge_model_id)
            ]
            if not df_judge_verdict.empty:
                txt_buffer = io.StringIO()
                for index, row in df_judge_verdict.iterrows():
                    txt_buffer.write(
                        f"candidate model: {row['candidate_model']}\n"
                        f"Question: {row['question']}\n"
                        f"candidate model response: {row['candidate_model_response']}\n"
                        f"ground truth: {row['ground_truth']}\n"
                        f"verdict: {row['verdict']}\n"
                        f"explanation: {row['explanation']}\n"
                        f"cosine similarity: {row['cosine_similarity_score']}\n\n"
                    )
                judge_verdict_responses = txt_buffer.getvalue()
                verdict_file = f"{judge_model_id}_{verdict}_verdicts_evaluation.txt"
                judge_verdict_responses_fpath = os.path.join(METRICS_DIR, verdict_file)
                write_to_s3(
                    judge_verdict_responses, BUCKET_NAME, "", METRICS_DIR, verdict_file
                )
                logger.info(
                    f"{verdict.capitalize()} verdict responses for judge {judge_model_id} saved to s3://{BUCKET_NAME}/{judge_verdict_responses_fpath}"
                )
except Exception as e:
    logger.error(f"Error encountered while writing the evaluation responses to s3: {e}")


#### Calculate the overall quantitative metrics of each model scored by the PoLL

---


In [None]:
# count of verdicts and mean cosine similarity score per judge model, candidate model and verdict
try:
    panel_summary_responses_df = (
        df_eval_results.groupby(["judge_model_id", "candidate_model", "verdict"])
        .agg(
            count=("verdict", "size"),
            mean_cosine_similarity=("cosine_similarity_score", "mean"),
        )
        .unstack(fill_value=0)
        .stack()
        .reset_index()
    )
    csv_buffer = io.StringIO()
    panel_summary_responses_df.to_csv(csv_buffer, index=False)
    panel_summary_responses = csv_buffer.getvalue()
    llm_as_a_judge_per_eval_summary_fpath = os.path.join(
        METRICS_DIR, LLM_JUDGE_PANEL_RESPONSE_SUMMARIES
    )
    write_to_s3(
        panel_summary_responses,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        LLM_JUDGE_PANEL_RESPONSE_SUMMARIES,
    )
    logger.info(
        f"Summary on each eval (Majority voting) for each panel judge sent to s3://{BUCKET_NAME}/{llm_as_a_judge_per_eval_summary_fpath}"
    )
    logger.info(
        f"View information on the accuracy metrics: {panel_summary_responses_df.head()}"
    )
except Exception as e:
    logger.error(
        f"Could not calculate the overall accuracy metrics for Majority Voting: {e}"
    )
panel_summary_responses_df.head(15)


In [None]:
def majority_vote(row):
    """
    This function calculates the majority vote on whether the candidate model response is correct or incorrect
    based on the verdicts from the panel of judges. It returns the most frequent value across the `*_verdict`
    columns of the row, as computed by `mode`. Ties and missing (NaN) verdicts are resolved however `mode`
    resolves them; this function does not return a separate 'no_majority_vote' value.
    """
    verdict_columns = [col for col in row.index if col.endswith("_verdict")]
    # find the most frequent verdict across all judges
    verdicts = [row[c] for c in verdict_columns]
    return mode(verdicts)


In [None]:
# get the majority voting pivot table along with the majority vote decision
try:
    majority_vote_pivoted_df = df_eval_results.pivot_table(
        index=["question", "candidate_model", "payload_file"],
        columns="judge_model_id",
        values=["verdict"],
        aggfunc="first",
    )

    majority_vote_pivoted_df.columns = [
        f"{judge_model}_{col}" for col, judge_model in majority_vote_pivoted_df.columns
    ]
    majority_vote_pivoted_df.reset_index(inplace=True)
    majority_vote_pivoted_df["majority_vote"] = majority_vote_pivoted_df.apply(
        majority_vote, axis=1
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    majority_vote_pivoted_df.to_csv(csv_buffer, index=False)
    majority_vote_raw_results = csv_buffer.getvalue()
    majority_vote_raw_results_metrics_fpath = os.path.join(
        METRICS_DIR, MAJORITY_VOTE_DF_RAW_RESULTS_FILE
    )

    write_to_s3(
        majority_vote_raw_results,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        MAJORITY_VOTE_DF_RAW_RESULTS_FILE,
    )
    logger.info(
        f"Majority results file containing raw results sent to s3://{BUCKET_NAME}/{majority_vote_raw_results_metrics_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate the raw responses for Majority Voting: {e}")

majority_vote_pivoted_df.head()


In [None]:
# get the total number of correct and incorrect count for the model for each payload file
majority_vote_data_df_per_payload = pd.DataFrame()
majority_vote_data_df_per_payload["correct_count"] = majority_vote_pivoted_df.groupby(
    ["candidate_model", "payload_file"]
)["majority_vote"].apply(lambda x: (x == "correct").sum())
majority_vote_data_df_per_payload["incorrect_count"] = majority_vote_pivoted_df.groupby(
    ["candidate_model", "payload_file"]
)["majority_vote"].apply(lambda x: (x == "incorrect").sum())
majority_vote_data_df_per_payload.reset_index(inplace=True)
majority_vote_data_df_per_payload.sort_values(by="correct_count", ascending=False)


#### Accuracy as per the Majority Vote per Payload file
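
As a rough illustration of the calculation in the next cell, the sketch below computes the majority voting accuracy for a single (candidate model, payload file) group using only the standard library. The helper name `majority_voting_accuracy` and the sample votes are hypothetical. The point it shows is that rows without a `correct` or `incorrect` majority are left out of the denominator, which mirrors dividing `correct_count` by `correct_count + incorrect_count`.

```python
from collections import Counter


def majority_voting_accuracy(votes):
    """Percentage of 'correct' among rows that reached a correct/incorrect majority.

    Hypothetical helper for illustration only. Returns None when no row has a
    'correct' or 'incorrect' majority, which avoids dividing by zero.
    """
    counts = Counter(votes)
    decided = counts["correct"] + counts["incorrect"]
    if decided == 0:
        return None
    return round(counts["correct"] / decided * 100, 2)


# made-up majority votes for one candidate model on one payload file
sample_votes = ["correct", "correct", "incorrect", "no_majority_vote"]
accuracy = majority_voting_accuracy(sample_votes)
print(accuracy)
```

Returning `None` for an empty denominator is a design choice of this sketch; the pandas version in the notebook would produce `NaN` in that case instead.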


In [None]:
# get the accuracy of the model based on majority voting
# check that both count columns exist ("a" and "b" in cols would only test "b")
if {"correct_count", "incorrect_count"}.issubset(
    majority_vote_data_df_per_payload.columns
):
    majority_vote_data_df_per_payload["majority_voting_accuracy"] = round(
        (
            majority_vote_data_df_per_payload["correct_count"]
            / (
                majority_vote_data_df_per_payload["correct_count"]
                + majority_vote_data_df_per_payload["incorrect_count"]
            )
        )
        * 100,
        2,
    )

majority_vote_data_df_per_payload.sort_values(
    by="majority_voting_accuracy", ascending=False
)


#### Get the per candidate model accuracy per judge
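
The per judge accuracy computed in the next cell counts each judge's verdicts for every candidate model and divides the `correct` count by the sum of the `correct`, `incorrect` and `needs_further_evaluation` counts. A minimal standard-library sketch of that idea follows; the records and the judge and model names are made up for illustration.

```python
from collections import Counter, defaultdict

# made-up evaluation records: (judge_model_id, candidate_model, verdict)
records = [
    ("judge_a", "model_x", "correct"),
    ("judge_a", "model_x", "incorrect"),
    ("judge_a", "model_x", "correct"),
    ("judge_b", "model_x", "correct"),
    ("judge_b", "model_x", "needs_further_evaluation"),
]

# count verdicts for every (judge, candidate) pair
verdict_counts = defaultdict(Counter)
for judge, candidate, verdict in records:
    verdict_counts[(judge, candidate)][verdict] += 1

per_judge_accuracy = {}
for key, counts in verdict_counts.items():
    total = (
        counts["correct"]
        + counts["incorrect"]
        + counts["needs_further_evaluation"]
    )
    # undecided verdicts stay in the denominator here
    per_judge_accuracy[key] = (
        round(counts["correct"] / total * 100, 2) if total else None
    )

print(per_judge_accuracy)
```

Because `needs_further_evaluation` verdicts remain in the denominator, a judge that often defers lowers the candidate's score under this metric.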


In [None]:
# get the df on judge model id, candidate model and verdict, and then calculate the accuracy of each judge model
df_per_model_accuracy_counts_df = (
    df_eval_results.groupby(
        ["judge_model_id", "candidate_model", "payload_file", "verdict"]
    )
    .size()
    .unstack(fill_value=0)
)

# get the accuracy for each candidate model
df_per_model_accuracy_counts_df["accuracy"] = (
    df_per_model_accuracy_counts_df.get("correct", 0)
    / (
        df_per_model_accuracy_counts_df.get("incorrect", 0)
        + df_per_model_accuracy_counts_df.get("needs_further_evaluation", 0)
        + df_per_model_accuracy_counts_df.get("correct", 0)
    )
    * 100
)

df_per_model_accuracy_counts_df["accuracy"] = round(
    df_per_model_accuracy_counts_df["accuracy"], 2
)
df_per_model_accuracy_counts_df.reset_index(inplace=True)
df_per_model_accuracy_counts_df


In [None]:
overall_accuracy_df = df_per_model_accuracy_counts_df.pivot_table(
    index=["candidate_model", "payload_file"],
    columns="judge_model_id",
    values="accuracy",
)
overall_accuracy_df.reset_index(inplace=True)
print(overall_accuracy_df.columns)
overall_accuracy_df.columns = ["candidate_model", "payload_file"] + [
    f"judge_{col}_accuracy" for col in overall_accuracy_df.columns[2:]
]
overall_accuracy_df.head(10)


In [None]:
# merge both panel voting and per model eval df to get all metrics together
merged_accuracy_df = pd.merge(
    overall_accuracy_df,
    majority_vote_data_df_per_payload,
    on=["candidate_model", "payload_file"],
)
merged_accuracy_df = merged_accuracy_df.drop(
    columns=["correct_count", "incorrect_count"]
)
merged_accuracy_df = merged_accuracy_df.sort_values(
    by="majority_voting_accuracy", ascending=False
)

# Send the accuracy metrics to S3
csv_buffer = io.StringIO()
merged_accuracy_df.to_csv(csv_buffer, index=False)
per_model_per_payload_accuracy_counts = csv_buffer.getvalue()
per_model_per_payload_accuracy_counts_fpath = os.path.join(
    METRICS_DIR, PER_PAYLOAD_MODEL_ACCURACY_MAJORITY_VOTING
)

write_to_s3(
    per_model_per_payload_accuracy_counts,
    BUCKET_NAME,
    "",
    METRICS_DIR,
    PER_PAYLOAD_MODEL_ACCURACY_MAJORITY_VOTING,
)
logger.info(
    f"Per model per payload majority vote accuracy scores sent to s3://{BUCKET_NAME}/{per_model_per_payload_accuracy_counts_fpath}"
)
merged_accuracy_df


#### Get the majority voting accuracy per model

---


In [None]:
# get the majority voting accuracy per model based on the number of correct and incorrect verdicts
try:
    majority_vote_data_df = pd.DataFrame()
    majority_vote_data_df["correct_count"] = majority_vote_pivoted_df.groupby(
        "candidate_model"
    )["majority_vote"].apply(lambda x: (x == "correct").sum())
    majority_vote_data_df["incorrect_count"] = majority_vote_pivoted_df.groupby(
        "candidate_model"
    )["majority_vote"].apply(lambda x: (x == "incorrect").sum())
    majority_vote_data_df.reset_index(inplace=True)

    # check that both count columns exist ("a" and "b" in cols would only test "b")
    if {"correct_count", "incorrect_count"}.issubset(majority_vote_data_df.columns):
        majority_vote_data_df["majority_voting_accuracy"] = round(
            (
                majority_vote_data_df["correct_count"]
                / (
                    majority_vote_data_df["correct_count"]
                    + majority_vote_data_df["incorrect_count"]
                )
            )
            * 100,
            2,
        )

    majority_vote_data_df = majority_vote_data_df.sort_values(
        by="majority_voting_accuracy", ascending=False
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    majority_vote_data_df.to_csv(csv_buffer, index=False)
    majority_vote_per_model_accuracy = csv_buffer.getvalue()
    majority_vote_per_model_accuracy_metrics_fpath = os.path.join(
        METRICS_DIR, PER_MODEL_ACCURACY_POLL
    )

    write_to_s3(
        majority_vote_per_model_accuracy,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        PER_MODEL_ACCURACY_POLL,
    )
    logger.info(
        f"Per model PoLL accuracy sent to s3://{BUCKET_NAME}/{majority_vote_per_model_accuracy_metrics_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate per model PoLL accuracy: {e}")

majority_vote_data_df.head()


In [None]:
# get the majority vote per payload file
try:
    # Group by payload_file and candidate_model to calculate correct and incorrect counts
    majority_vote_payload_df = (
        majority_vote_pivoted_df.groupby(["payload_file", "candidate_model"])[
            "majority_vote"
        ]
        .apply(lambda x: (x == "correct").sum())
        .reset_index(name="correct_count")
    )
    majority_vote_payload_df["incorrect_count"] = (
        majority_vote_pivoted_df.groupby(["payload_file", "candidate_model"])[
            "majority_vote"
        ]
        .apply(lambda x: (x == "incorrect").sum())
        .reset_index(drop=True)
    )

    # check the per payload dataframe (not the per model one) for both count columns
    if {"correct_count", "incorrect_count"}.issubset(majority_vote_payload_df.columns):
        majority_vote_payload_df["majority_voting_accuracy"] = round(
            (
                majority_vote_payload_df["correct_count"]
                / (
                    majority_vote_payload_df["correct_count"]
                    + majority_vote_payload_df["incorrect_count"]
                )
            )
            * 100,
            2,
        )

    # Sort by accuracy for better readability
    majority_vote_payload_df = majority_vote_payload_df.sort_values(
        by="majority_voting_accuracy", ascending=False
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    majority_vote_payload_df.to_csv(csv_buffer, index=False)
    majority_vote_per_payload_accuracy = csv_buffer.getvalue()
    majority_vote_per_payload_accuracy_metrics_fpath = os.path.join(
        METRICS_DIR, PER_PAYLOAD_PER_MODEL_POLL_ACCURACY
    )
    write_to_s3(
        majority_vote_per_payload_accuracy,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        PER_PAYLOAD_PER_MODEL_POLL_ACCURACY,
    )
    logger.info(
        f"Per payload file accuracy sent to s3://{BUCKET_NAME}/{majority_vote_per_payload_accuracy_metrics_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate per payload file accuracy: {e}")

majority_vote_payload_df.head()


In [None]:
# get the per candidate model accuracy per panel of LLM evaluator
try:
    df_per_model_accuracy_counts_df = (
        df_eval_results.groupby(["judge_model_id", "candidate_model", "verdict"])
        .size()
        .unstack(fill_value=0)
    )

    # get the accuracy for each candidate model
    df_per_model_accuracy_counts_df["accuracy"] = (
        df_per_model_accuracy_counts_df.get("correct", 0)
        / (
            df_per_model_accuracy_counts_df.get("incorrect", 0)
            + df_per_model_accuracy_counts_df.get("needs_further_evaluation", 0)
            + df_per_model_accuracy_counts_df.get("correct", 0)
        )
        * 100
    )

    df_per_model_accuracy_counts_df["accuracy"] = round(
        df_per_model_accuracy_counts_df["accuracy"], 2
    )
    df_per_model_accuracy_counts_df.reset_index(inplace=True)

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    df_per_model_accuracy_counts_df.to_csv(csv_buffer, index=False)
    df_per_model_accuracy_counts = csv_buffer.getvalue()
    df_per_model_accuracy_counts_metrics_fpath = os.path.join(
        METRICS_DIR, PER_MODEL_ACCURACY_PER_EVAL_JUDGE
    )

    write_to_s3(
        df_per_model_accuracy_counts,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        PER_MODEL_ACCURACY_PER_EVAL_JUDGE,
    )
    logger.info(
        f"Per model accuracy per eval judge sent to s3://{BUCKET_NAME}/{df_per_model_accuracy_counts_metrics_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate per model accuracy per eval judge: {e}")

df_per_model_accuracy_counts_df.head()


#### Get the summary table

---

Fetch the summary table containing the per judge accuracy per candidate model and the per model accuracy based on majority vote
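
To make the shape of the summary table concrete, here is a small standard-library sketch that joins per judge accuracies with the majority voting accuracy for each candidate model and sorts by the latter. All names and numbers are made up, and the inner-join behavior is written out by hand to mirror what `pd.merge` does by default.

```python
# made-up inputs keyed by candidate model
per_judge_accuracy = {
    "model_x": {"panelist_1": 66.67, "panelist_2": 50.0},
    "model_y": {"panelist_1": 80.0, "panelist_2": 75.0},
}
majority_voting_accuracy = {"model_x": 60.0, "model_y": 78.0}

summary = []
# keep only candidates present in both inputs (inner-join semantics)
for candidate in per_judge_accuracy.keys() & majority_voting_accuracy.keys():
    row = {"candidate_model": candidate}
    for judge, acc in per_judge_accuracy[candidate].items():
        row[f"judge_{judge}_accuracy"] = acc
    row["majority_voting_accuracy"] = majority_voting_accuracy[candidate]
    summary.append(row)

# highest majority voting accuracy first
summary.sort(key=lambda r: r["majority_voting_accuracy"], reverse=True)
for row in summary:
    print(row)
```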


In [None]:
# build the summary table: per judge accuracy merged with the majority voting accuracy per candidate model
try:
    overall_accuracy_df = df_per_model_accuracy_counts_df.pivot_table(
        index="candidate_model", columns="judge_model_id", values="accuracy"
    )
    overall_accuracy_df.reset_index(inplace=True)
    overall_accuracy_df.columns = ["candidate_model"] + [
        f"judge_{col}_accuracy" for col in overall_accuracy_df.columns[1:]
    ]

    merged_accuracy_df = pd.merge(
        overall_accuracy_df, majority_vote_data_df, on="candidate_model"
    )
    merged_accuracy_df = merged_accuracy_df.drop(
        columns=["correct_count", "incorrect_count"]
    )
    merged_accuracy_df = merged_accuracy_df.sort_values(
        by="majority_voting_accuracy", ascending=False
    )

    # Send the accuracy metrics to S3
    csv_buffer = io.StringIO()
    merged_accuracy_df.to_csv(csv_buffer, index=False)
    merged_accuracy_df_val = csv_buffer.getvalue()
    merged_accuracy_df_metrics_fpath = os.path.join(
        METRICS_DIR, CANDIDATE_MODEL_ACCURACY_FILE
    )

    write_to_s3(
        merged_accuracy_df_val,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        CANDIDATE_MODEL_ACCURACY_FILE,
    )
    logger.info(
        f"Candidate model accuracy summary sent to s3://{BUCKET_NAME}/{merged_accuracy_df_metrics_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate the candidate model accuracy summary: {e}")

merged_accuracy_df.head()


#### Final Verdict Type: Overlap Analysis

---

In this portion, we check the `final verdict type` of each candidate model response. The verdict falls into one of the following four categories:

1. correct_by_unanimous_decision: If every judge on the panel evaluates a candidate model response as `correct`, the final verdict is correct by unanimous decision.

1. incorrect_by_unanimous_decision: If every judge on the panel evaluates a candidate model response as `incorrect`, the final verdict is incorrect by unanimous decision.

1. correct_by_majority_vote_w_disagreement: If the judges disagree but the majority vote for a given candidate model response is `correct`, the final verdict is correct_by_majority_vote_w_disagreement.

1. incorrect_by_majority_vote_w_disagreement: If the judges disagree but the majority vote for a given candidate model response is `incorrect`, the final verdict is incorrect_by_majority_vote_w_disagreement.

The labels written by the code for the two majority vote categories also embed the number of dissenting or missing verdicts. A response whose majority vote is neither `correct` nor `incorrect` is labeled `no_overlaps`.
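
The categories above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name `classify_verdicts`, the strict-majority rule (more than half of all judges) and the `no_majority` fallback label are hypothetical and are not taken from the notebook's own implementation.

```python
def classify_verdicts(verdicts):
    """Hypothetical helper: label one response given its judges' verdicts.

    `verdicts` holds one entry per judge; None stands for a missing verdict.
    """
    total = len(verdicts)
    correct = sum(v == "correct" for v in verdicts)
    incorrect = sum(v == "incorrect" for v in verdicts)

    if total and correct == total:
        return "correct_by_unanimous_decision"
    if total and incorrect == total:
        return "incorrect_by_unanimous_decision"
    # assumed rule: the winning verdict needs more than half of all judges
    if correct > total / 2:
        return "correct_by_majority_vote_w_disagreement"
    if incorrect > total / 2:
        return "incorrect_by_majority_vote_w_disagreement"
    return "no_majority"


print(classify_verdicts(["correct", "correct", "correct"]))
print(classify_verdicts(["correct", "incorrect", "correct"]))
print(classify_verdicts(["incorrect", None, "incorrect"]))
print(classify_verdicts(["correct", "incorrect"]))
```

With an even number of judges a tie is possible, which is why the sketch keeps an explicit fallback label rather than forcing a verdict.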


In [None]:
def check_overlap_of_PoLL(row):
    """
    This function checks how far the judges agreed when rating a candidate model response.
    It returns 'correct_by_unanimous_decision' or 'incorrect_by_unanimous_decision' only if every
    column ending with '_verdict' holds that verdict (i.e., no NaNs or opposing votes). Otherwise it
    follows the majority vote and reports how many judges disagreed or gave no verdict.
    """
    # Filter columns that end with '_verdict'
    verdict_columns = [col for col in row.index if col.endswith("_verdict")]

    # Count the verdicts across the filtered columns
    correct_count = (row[verdict_columns] == "correct").sum()
    incorrect_count = (row[verdict_columns] == "incorrect").sum()
    nan_count = row[verdict_columns].isna().sum()
    # total number of judges on the panel, used to detect unanimous decisions
    total_judges: int = len(verdict_columns)

    # Determine the overlap based on the counts
    if correct_count == total_judges:
        return "correct_by_unanimous_decision"
    elif incorrect_count == total_judges:
        return "incorrect_by_unanimous_decision"
    elif row["majority_vote"] == "correct":
        return f"correct_by_majority_vote_w_{incorrect_count+nan_count}_dissagreement"
    elif row["majority_vote"] == "incorrect":
        return f"incorrect_by_majority_vote_w_{correct_count+nan_count}_dissagreement"
    else:
        return "no_overlaps"


In [None]:
try:
    majority_vote_pivoted_df["verdict_type"] = majority_vote_pivoted_df.apply(
        check_overlap_of_PoLL, axis=1
    )

    csv_buffer = io.StringIO()
    majority_vote_pivoted_df.to_csv(csv_buffer, index=False)
    majority_vote_pivoted_final_verdict = csv_buffer.getvalue()
    majority_vote_pivoted_final_verdict_fpath = os.path.join(
        METRICS_DIR, PER_MODEL_ACCURACY_W_VERDICT_TYPE_FILE
    )

    write_to_s3(
        majority_vote_pivoted_final_verdict,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        PER_MODEL_ACCURACY_W_VERDICT_TYPE_FILE,
    )
    logger.info(
        f"Majority vote data and final verdicts are sent to s3://{BUCKET_NAME}/{majority_vote_pivoted_final_verdict_fpath}"
    )
except Exception as e:
    logger.error(f"Could not calculate majority vote data and final verdicts: {e}")

majority_vote_pivoted_df.head(10)


In [None]:
# now calculate the verdict breakdown for correct responses and verdict breakdown for
# incorrect responses


In [None]:
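# verdict type breakdown for incorrect responses
# (normalize=True, so the "counts" column holds proportions rather than raw counts)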
try:
    majority_vote_df_for_incorrect_verdict_analysis = majority_vote_pivoted_df.copy()
    majority_vote_df_for_incorrect_verdict_analysis = (
        majority_vote_df_for_incorrect_verdict_analysis[
            majority_vote_df_for_incorrect_verdict_analysis.majority_vote == "incorrect"
        ]["verdict_type"]
        .value_counts(normalize=True)
        .rename_axis("verdict_type_breakdown_for_incorrect")
        .reset_index(name="counts")
    )

    csv_buffer = io.StringIO()
    majority_vote_df_for_incorrect_verdict_analysis.to_csv(csv_buffer, index=False)
    majority_vote_pivoted_df_incorrect = csv_buffer.getvalue()
    majority_vote_pivoted_df_incorrect_fpath = os.path.join(
        METRICS_DIR, VERDICT_TYPE_BREAKDOWN_FOR_INCORRECT_FILE
    )

    write_to_s3(
        majority_vote_pivoted_df_incorrect,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        VERDICT_TYPE_BREAKDOWN_FOR_INCORRECT_FILE,
    )
    logger.info(
        f"Majority vote with incorrect verdict breakdown is sent to s3://{BUCKET_NAME}/{majority_vote_pivoted_df_incorrect_fpath}"
    )
except Exception as e:
    logger.error(
        f"Could not calculate majority vote with incorrect verdict breakdown: {e}"
    )

majority_vote_df_for_incorrect_verdict_analysis.head(10)


In [None]:
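# verdict type breakdown for correct responses
# (normalize=True, so the "counts" column holds proportions rather than raw counts)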
try:
    majority_vote_df_for_correct_verdict_analysis = majority_vote_pivoted_df.copy()
    majority_vote_df_for_correct_verdict_analysis = (
        majority_vote_df_for_correct_verdict_analysis[
            majority_vote_df_for_correct_verdict_analysis.majority_vote == "correct"
        ]["verdict_type"]
        .value_counts(normalize=True)
        .rename_axis("verdict_type_breakdown_for_correct")
        .reset_index(name="counts")
    )

    csv_buffer = io.StringIO()
    majority_vote_df_for_correct_verdict_analysis.to_csv(csv_buffer, index=False)
    majority_vote_pivoted_df_correct = csv_buffer.getvalue()
    majority_vote_pivoted_df_correct_fpath = os.path.join(
        METRICS_DIR, VERDICT_TYPE_BREAKDOWN_FOR_CORRECT_FILE
    )

    write_to_s3(
        majority_vote_pivoted_df_correct,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        VERDICT_TYPE_BREAKDOWN_FOR_CORRECT_FILE,
    )
    logger.info(
        f"Majority vote with correct verdict breakdown is sent to s3://{BUCKET_NAME}/{majority_vote_pivoted_df_correct_fpath}"
    )
except Exception as e:
    logger.error(
        f"Could not calculate majority vote with correct verdict breakdown: {e}"
    )

majority_vote_df_for_correct_verdict_analysis.head(10)


#### Send all responses from the evaluation process to S3 as a txt file for further downstream processing and readability purposes

---


In [None]:
try:
    # Write all explanations to a file and send to S3
    explanations_txt_buffer = io.StringIO()
    for _, row in df_eval_results.iterrows():
        explanations_txt_buffer.write(
            f"candidate model: {row['candidate_model']}\n"
            f"Question: {row['question']}\n"
            f"candidate model response: {row['candidate_model_response']}\n"
            f"ground truth: {row['ground_truth']}\n"
            f"verdict: {row['verdict']}\n"
            f"explanation: {row['explanation']}\n"
            f"cosine similarity: {row['cosine_similarity_score']}\n\n"
        )

    explanations_txt_file_content = explanations_txt_buffer.getvalue()
    explanations_fpath = os.path.join(METRICS_DIR, ALL_EVALUATIONS_IN_TXT)
    write_to_s3(
        explanations_txt_file_content,
        BUCKET_NAME,
        "",
        METRICS_DIR,
        ALL_EVALUATIONS_IN_TXT,
    )
    logger.info(
        f"All text eval content from the llm judge panelists sent to s3://{BUCKET_NAME}/{explanations_fpath}"
    )
    logger.info(
        f"All of the content including the candidate model responses, ground truth, evaluation are written: {explanations_txt_file_content}"
    )
except Exception as e:
    logger.error(
        f"Could not write the evaluation explanations to S3: {e}"
    )
