### Generate metrics & Run evaluations (ROUGE, COSINE, LLM acting as a judge in the loop)
---

In this notebook:

1. We will extract the titles generated as completions from the Bedrock models (Claude Sonnet, Llama, Mistral) and load them into a CSV file.

1. Generate metrics on accuracy ([ROUGE-L](https://en.wikipedia.org/wiki/ROUGE_(metric)) and [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) scores), as well as latency, token counts, and cost per request.

1. View all model completions to get a ***vibe check*** on how each model performs. Next, use Claude Sonnet as a judge in the loop to go through the completions from all models and decide which one best matches the human-generated title. [Claude Sonnet](https://www.anthropic.com/claude) selects the best model according to the [evaluation prompt](data/prompts/eval_template.txt). In this case, Sonnet acts as a judge to find the title that best captures the content of the meeting.

In [None]:
# import libraries
import os
import ray
import json
import yaml
import glob
import copy
import time
import boto3
import logging
import pandas as pd  
from numpy import dot
from pathlib import Path
from numpy.linalg import norm
from litellm import completion ## support for text generation models on bedrock
from rouge_score import rouge_scorer
from typing import Dict, Optional, List
from bedrock_utils import get_bedrock_client


#### Set a logger 

In [None]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

#### Initialize the Ray Server that is used to run Asynchronous inference

In [None]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

#### Load the config file: Contains model information, data directory information

In [None]:
## load the config file
# global constants
CONFIG_FILE_PATH = "config.yml"

In [None]:
# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

In [None]:
# find all completion files (one JSON file per model completion) under the completions directory
fpath = os.path.join(config['dir']['completions'], "**", "*", "*.json")
metric_files = glob.glob(fpath, recursive=True)
logger.info(f"there are {len(metric_files)} files in {fpath}")

#### Generate a simple CSV with metrics on title completions, chapters, and performance latency
---

1. This section collects, for each model in the config file, the title completion generated for every chapter together with its latency.

1. The CSV also contains the human-generated title from the original data frame, when one is available. If no human-generated title was provided, that column is absent.

In [None]:
metrics = []
for f in metric_files:
    metrics.append(json.loads(Path(f).read_text()))
df = pd.DataFrame(metrics)
df = df.drop(columns=['exception', 'prompt'])
df = df.sort_values(by=['file_name', 'model_id', 'chapter_id'])
df = df.rename(columns={'completion': 'chapter_title', 'time_taken_in_seconds': 'latency_seconds'})
logger.info(f"all metrics data is read into a dataframe of shape {df.shape}")
count = df.shape[0]

In [None]:
df.head(20)

#### Calculate Cosine Similarity and ROUGE metrics for generated chapter titles

In [None]:
def sanitize_title(title):
    """
    Sanitize a generated chapter title: remove the unwanted substrings listed under
    'response_prefix_to_remove' in the config file, trim whitespace, and keep only the first line.
    To remove additional elements from the chapter titles, add them to that config entry.
    """
    # leave missing values (None/NaN) untouched
    if not isinstance(title, str):
        return title
    strings_to_remove: List[str] = config['response_prefix_to_remove']
    for response_to_remove in strings_to_remove:
        title = title.replace(response_to_remove, "")
    title = title.strip()
    title = title.split("\n")[0]
    return title
df.chapter_title = df.chapter_title.map(sanitize_title)
# view information about the type of data generated by the models, and other metrics below
df.head(10)

#### ROUGE & Cosine Similarity Scores for titles:
---

Here, the `amazon.titan-embed-text-v1` model is used to generate the text embeddings. To use a different embeddings model, change the `model` entry under `embeddings_model_info` in the config file and adapt the `get_embedding` function to that model's request and response format.

In [None]:
MAX_TEXT_LEN_FOR_EMBEDDING: int = config['embeddings_model_info']['max_text_len_for_embedding']
# Bedrock client, created lazily on the first call to get_embedding
bedrock = None

def get_embedding(text: str, modelId: str=config['embeddings_model_info'].get('model'), accept: str='application/json', contentType: str='application/json'):
    """
    Generates embeddings for the chapter titles and original titles to generate cosine similarity measures
    """
    global bedrock
    if bedrock is None:
        bedrock = get_bedrock_client()
    body = json.dumps({"inputText": text[:MAX_TEXT_LEN_FOR_EMBEDDING]})
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    token_count = response_body.get('inputTextTokenCount')
    return embedding, token_count

def get_cosine_similarity(text1: str, text2: str) -> float:
    """
    This function calculates the cosine similarity between the chapter title generated from models, and the human generated title (if any)
    """
    A,_ = get_embedding(text1)
    B,_ = get_embedding(text2)
    cosine = dot(A, B)/(norm(A)*norm(B))
    return cosine

def get_rouge_l_score(completion: str, golden: str) -> float:
    """
    This function calculates the rouge-l score between the chapter title generated from models, and the human generated title (if any)
    """
    rouge_metric_selection: str = config['embeddings_model_info']['rouge_metric_selection']
    scorer = rouge_scorer.RougeScorer([rouge_metric_selection])
    scores = scorer.score(golden, completion)
    return round(scores[rouge_metric_selection].fmeasure, 4)
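
To make the cosine computation concrete, here is a minimal sketch that applies the same `dot(A, B) / (norm(A) * norm(B))` formula to small hand-made vectors. The vectors and the helper name `cosine_similarity` are illustrative assumptions; they stand in for real Titan embeddings, which are much longer.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", for illustration only
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # unrelated vectors
print(same_direction, orthogonal)
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0, which is why the metric is a reasonable proxy for semantic closeness of two titles.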

In [None]:
def compare_titles(row):
    """
    Generates the rouge and cosine similarity scores for chapter titles and original titles
    """
    original_title, chapter_title = row.get('original_title'), row.get('chapter_title')
    # both titles must be present (not None/NaN) and non-empty to be compared
    if pd.notna(original_title) and pd.notna(chapter_title) and original_title and chapter_title:
        logger.info(f"Chapter title: {chapter_title}, Original title: {original_title}")
        rouge_l_score = get_rouge_l_score(chapter_title, original_title)
        cosine_sim = get_cosine_similarity(chapter_title.lower(), original_title.lower())
        return pd.Series([rouge_l_score, cosine_sim])
    else:
        logger.info("ROUGE and Cosine similarity scores cannot be computed for this row since a title is missing")
        # return NaNs of the same shape so that df.apply still yields two numeric columns
        return pd.Series([float('nan'), float('nan')])

if 'original_title' in df.columns:
    df[['rouge_l_f1_score', 'cosine_similarity']] = df.apply(compare_titles, axis=1)
else:
    logger.info(f"No evaluation metrics available since Golden titles are not provided in the dataset.")
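
For intuition about what the ROUGE-L F-measure captures, the following is a simplified sketch based on the longest common subsequence (LCS) of tokens. It is not the `rouge_score` implementation: it assumes plain lowercase whitespace tokenization with no stemming, and the function names are hypothetical.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # extend the match on equal tokens, otherwise carry the best length so far
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision (over candidate) and recall (over reference)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Illustrative titles: the LCS is ["budget", "review"], so precision=2/3 and recall=2/5
score = rouge_l_f1("quarterly budget review", "budget review for the quarter")
print(score)
```

Because the score depends on exact token overlap, a title that paraphrases the golden title can score low on ROUGE-L while still scoring high on cosine similarity, which is one reason both metrics are reported.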

In [None]:
# show the number of chapter titles generated by each model
df_per_model_id_counts = df['model_id'].value_counts()
df_per_model_id_counts

In [None]:
df.head(30)

In [None]:
metrics_dir = config['dir']['metrics']
# Create the directory if it doesn't exist
os.makedirs(metrics_dir, exist_ok=True)
# Construct the file path
metrics_file_path = os.path.join(metrics_dir, config['dir']['metrics_file'])
df.to_csv(metrics_file_path, index=False)

In [None]:
df_summary = df.groupby('model_id').mean(numeric_only=True)
# check that BOTH score columns exist ('a' and 'b' in cols would only test the second one)
if {'rouge_l_f1_score', 'cosine_similarity'}.issubset(df_summary.columns):
    df_summary = df_summary.rename(columns={'rouge_l_f1_score': 'mean_rouge_l_f1_score', 'cosine_similarity': 'mean_cosine_similarity'})
df_summary['p95_latency_seconds'] = df.groupby('model_id')['latency_seconds'].quantile(0.95)
df_summary['avg_cost_per_txn'] = df_summary.input_token_price + df_summary.output_token_pricing
df_summary['p95_cost_per_txn'] = df.groupby('model_id')['input_token_price'].quantile(0.95) + \
                                 df.groupby('model_id')['output_token_pricing'].quantile(0.95)
df_summary.completion_token_count = df_summary.completion_token_count.astype(int)
df_summary.prompt_token_count = df_summary.prompt_token_count.astype(int)
df_summary['p95_completion_token_count'] = df.groupby('model_id')['completion_token_count'].quantile(0.95)
df_summary['p95_prompt_token_count'] = df.groupby('model_id')['prompt_token_count'].quantile(0.95)
df_summary = df_summary.drop(columns=['chapter_id'])
# Reset the index to make 'model_id' a column
df_summary = df_summary.reset_index()
df_summary
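
The difference between the mean and the p95 columns is easiest to see on a toy example. The sketch below uses made-up latencies and hypothetical model names; it follows the same pattern as the summary cell (a per-model mean plus a per-model 0.95 quantile aligned on `model_id`).

```python
import pandas as pd

# Made-up latencies for two hypothetical models; model-b has one slow outlier
toy = pd.DataFrame({
    "model_id": ["model-a"] * 4 + ["model-b"] * 4,
    "latency_seconds": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 10.0],
})

grouped = toy.groupby("model_id")["latency_seconds"]
summary = grouped.mean().to_frame("mean_latency_seconds")
# the quantile Series is indexed by model_id, so it aligns row by row with the means
summary["p95_latency_seconds"] = grouped.quantile(0.95)
summary = summary.reset_index()
print(summary)
```

In this toy data the outlier moves the mean of `model-b` only to 4.0 seconds, while its p95 is 8.8 seconds, so the p95 column is the better indicator of worst-case behavior.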


#### Pivot the completions from long to wide format (one title column per model)

In [None]:
# if the human-generated title is present in the data frame, include it in the pivoted df, else exclude it
if 'original_title' in df.columns:
    index_cols = ['file_name', 'chapter_id', 'chapter_text', 'original_title']
else:
    index_cols = ['file_name', 'chapter_id', 'chapter_text']
    
df_pivoted = df.pivot_table(index=index_cols, columns='model_id', values='chapter_title', aggfunc='first')
cols_other_than_index_cols = [f"{c}_title" for c in df_pivoted.columns if c not in index_cols]
df_pivoted = df_pivoted.reset_index()
df_pivoted.columns = index_cols + cols_other_than_index_cols
df_pivoted.head()
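
As an illustration of the reshaping step, the sketch below pivots a tiny long-format frame (one row per chapter and model) into a wide frame with one `<model_id>_title` column per model. The file name, model names, and titles are made up for the example.

```python
import pandas as pd

# Long format: one row per (chapter, model) pair, with hypothetical data
long_df = pd.DataFrame({
    "file_name": ["meeting1"] * 4,
    "chapter_id": [1, 1, 2, 2],
    "model_id": ["model-a", "model-b", "model-a", "model-b"],
    "chapter_title": ["Budget review", "Q3 budget", "Hiring plan", "Recruiting update"],
})

index_cols = ["file_name", "chapter_id"]
# aggfunc="first" keeps the single title that exists for each (chapter, model) pair
wide_df = long_df.pivot_table(index=index_cols, columns="model_id",
                              values="chapter_title", aggfunc="first")
wide_df.columns = [f"{c}_title" for c in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df)
```

The wide layout puts all candidate titles for a chapter on a single row, which is the shape the judge prompt needs later on.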

In [None]:
# Construct the file path
model_evals_fpath = os.path.join(metrics_dir, config['dir']['model_evals_file'])
df_pivoted.to_csv(model_evals_fpath, index=False)
df_pivoted.head()

In [None]:
df_summary

In [None]:
def create_summary(row, summary):
    return summary.format(
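                # df_summary has 'model_id' as a regular column after reset_index, so row.name would be the integer row position
                model_id=row['model_id'],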
                avg_latency=round(row['latency_seconds'], 4),
                p95_latency=round(row['p95_latency_seconds'], 4),
                avg_cost=round(10000 * row['avg_cost_per_txn'], 6),
                p95_cost_per_txn=round(10000 * row['p95_cost_per_txn'], 6),
                avg_prompt_token_count=row['prompt_token_count'],
                p95_prompt_token_count=row['p95_prompt_token_count'],
                avg_completion_token_count=row['completion_token_count'],
                p95_completion_token_count=row['p95_completion_token_count'],
                mean_rouge_l_score=('None' if row.get('mean_rouge_l_f1_score') is None else round(row['mean_rouge_l_f1_score'], 4)),
                mean_cosine_similarity_score=('None, (no human generated title provided in the data)' if row.get('mean_cosine_similarity') is None else round(row['mean_cosine_similarity'], 4)),
                count=int(row['count'])
            )
df_summary = pd.merge(left=df_summary, right=df_per_model_id_counts, on="model_id", how="left")

df_summary['overall_report'] = df_summary.apply(lambda r: create_summary(r, config['report']['summary_text']), axis=1)
df_summary = df_summary.round(6)
summary_metrics_file_path = os.path.join(metrics_dir, config['dir']['summary_metrics_file'])
df_summary = df_summary.sort_values(by=['mean_cosine_similarity', 'mean_rouge_l_f1_score'], ascending=False)
df_summary.to_csv(summary_metrics_file_path, index=False)

In [None]:
# view the df_summary elements
df_summary.head(10)

### Title Evaluation: Using LLM as a Judge in the loop
---

In this portion:

1. Titles generated by each model are evaluated on relevance and meaning by [Claude](https://www.anthropic.com/news/claude-3-family) Sonnet or another model of your choice. The prompt for the model that acts as a judge in the loop can be viewed in [eval_template.txt](data/prompts/eval_template.txt). Review and edit this prompt based on your use case and criteria for subjective evaluation.

2. The role of the judge model is to compare the titles generated by each model to a human-generated title (the ***golden title***). It returns the selected model, the selected title, and an explanation of its choice, including how the other titles compare and why the chosen one was preferred. In this case, the judge is prompted to pick the title that ***captures the most relevant aspects of the meeting***.

3. A final evaluation metric is calculated that shows how often each model's title was selected. This pick-rate distribution helps decide which model to use for production workloads.

***Note: For more information on the use of having a Model act as a judge, view: https://huggingface.co/learn/cookbook/en/llm_judge***

In [None]:
# initialize to None so that later cells can check whether the file was loaded
model_eval_df = None
try:
    # load the pivoted model completions that were saved earlier as a CSV
    model_eval_df = pd.read_csv(os.path.join(config['dir']['metrics'], config['dir']['model_evals_file']))
    logger.info("Model eval file found with all model completions. Ready to evaluate responses...")
except Exception as e:
    logger.error(f"Model evaluation csv file could not be loaded. Error: {e}")
model_eval_df.head(10) if model_eval_df is not None else None

#### Prepare the evaluation prompt payloads

Here, the [`evaluation prompt template`](data/prompts/eval_template.txt) is used by the LLM judge to evaluate different chapter titles and suggest the most suitable title based on the evaluation criteria mentioned in the prompt template.

In [None]:
def prepare_eval_prompts(row):
    """
    This function prepares the evaluation prompt for a row by inserting the chapter text, the original title,
    and all of the titles generated by the various Bedrock models into the evaluation prompt template.
    """
    # represents the eval template used by the model judge
    eval_template: Optional[str] = None
    processed_eval_template: Optional[str] = None
    model_titles: List[str] = []
    try:
        # file path to the eval template
        eval_template_path: str = os.path.join(config['dir']['prompts'], config['eval_model_info'].get('prompt_template'))
        with open(eval_template_path, "r") as f:
            eval_template = f.read()
            logger.info(f"evaluation prompt template recorded: {eval_template}")
    except FileNotFoundError:
        # without the template no prompt can be built, so log the problem and stop here
        logger.error(f"Evaluation template not found at {eval_template_path}")
        raise
    logger.info(f"chapter_text: {row['chapter_text']}")
    logger.info(f"original_title: {row['original_title']}")
    for column in row.index:
        if column.endswith("_title") and column != "original_title":
            model_id = column.split("_title")[0]
            model_title = row[column]
            model_titles.append(f"\n<{model_id}>\n{model_title}\n</{model_id}>\n")
    processed_eval_template = eval_template.format(
        chapter_text=row['chapter_text'], 
        original_title=row['original_title'],
        model_titles="\n".join(model_titles)
    )

    return processed_eval_template
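
The prompt assembly can be sketched in isolation as follows. The template text, row values, and model names below are hypothetical stand-ins (the real template lives in `data/prompts/eval_template.txt`); only the placeholder names `chapter_text`, `original_title`, and `model_titles` are taken from the function above.

```python
# Hypothetical template, standing in for data/prompts/eval_template.txt
template = (
    "Chapter:\n{chapter_text}\n\n"
    "Human title: {original_title}\n\n"
    "Candidate titles:\n{model_titles}"
)

# Hypothetical pivoted row: one '<model_id>_title' entry per model
row = {
    "chapter_text": "The team reviewed third-quarter spending.",
    "original_title": "Budget review",
    "model-a_title": "Q3 budget discussion",
    "model-b_title": "Finance sync",
}

SUFFIX = "_title"
tagged_titles = []
for column, title in row.items():
    # every '*_title' column except the golden title holds a model completion
    if column.endswith(SUFFIX) and column != "original_title":
        model_id = column[: -len(SUFFIX)]
        tagged_titles.append(f"<{model_id}>\n{title}\n</{model_id}>")

prompt = template.format(
    chapter_text=row["chapter_text"],
    original_title=row["original_title"],
    model_titles="\n".join(tagged_titles),
)
print(prompt)
```

Wrapping each candidate in a tag named after its model lets the judge refer to a model by name in its answer. Note that `str.format` treats braces as placeholders, so a template that contains literal braces (for example a JSON output example) needs them doubled as `{{` and `}}`.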

Add the `evaluation prompt` as a column to the data frame, alongside the respective model and chapter titles, so that each row can be sent to the judge model for evaluation in the loop.

In [None]:
if model_eval_df is not None:
    logger.info("preparing the evaluation prompt templates for the LLM judge....")
    model_eval_df['eval_prompt'] = model_eval_df.apply(prepare_eval_prompts, axis=1)
    # save the prepared prompts only when there is data to save
    model_eval_df_f_path = os.path.join(metrics_dir, config['dir']['processed_prompts_for_eval'])
    model_eval_df.to_csv(model_eval_df_f_path, index=False)
else:
    logger.error("Model evaluation dataset is not available to process.")
model_eval_df.head(10)

#### Using LLM (Claude) as a judge in the loop to evaluate and narrow down the titles generated by different models of choice

In [None]:
def llm_judge_json_evaluations(model_id: str, prompt: str):
    """
    Send the evaluation prompt to the judge model on Amazon Bedrock through litellm and
    return a dict with the completion, the token counts, and any exception that occurred.
    """
    # litellm addresses Bedrock models as "bedrock/<model_id>"
    service_name: str = "bedrock"
    bedrock_model: str = f"{service_name}/{model_id}"
    # AWS region of the current boto3 session
    aws_region = boto3.Session().region_name
    # initialize the response dict
    ret = dict(exception = None,
               prompt = prompt,
               completion = None,
               file_name = None,
               original_title = None, 
               # initializing to 0 since none type throws an error later, this is used to calculate price per token input/output on ODT pricing
               completion_token_count = 0,
               # initializing to 0 since none type throws an error later
               prompt_token_count=0,
               input_token_price = None, 
               output_token_pricing = None,
               model_id = model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    parameters = config['inference_parameters_for_explanations']
    temperature = parameters.get('temperature', 0.1)
    caching = parameters.get('caching', False)
    max_tokens = parameters.get("max_tokens", 500)
    try:
        # call the litellm completion API, which routes the request to Amazon Bedrock
        logger.info(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{"content": body, "role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)

        # keep the content of the last non-empty choice in the response
        for choice in response.choices:
            if choice.message and choice.message.content:
                ret["completion"] = choice.message.content.strip()
        # record the number of prompt and completion tokens reported in the response
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    
    logger.info(f"completion: {ret['completion']}")
    return ret

In [None]:
def get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"row={row}")
    logger.info(f"row {i}/{total}, prompt_template={model_info['prompt_template']}, model_id={model_info['model']}")
    model_id = model_info['model']
    # create the payload for model inference
    prompt = row['eval_prompt']
    # ask the judge model to evaluate the candidate titles contained in the prompt
    resp = llm_judge_json_evaluations(model_id, prompt)
    resp['original_title'] = row['original_title']
    resp['file_name'] = row['file_name']
    # calculate the input and output token price for all of the calls
    resp['input_token_price'] = (resp['prompt_token_count']/1000) * model_info['input_tokens_pricing']
    logger.info(f"The price for {resp['prompt_token_count']} tokens for {model_id} for filename={row['file_name']} chapter={row['chapter_id']} is {resp['input_token_price']}")
    resp['output_token_pricing'] = (resp['completion_token_count']/1000) * model_info['output_tokens_pricing']
    logger.info(f"The price for {resp['completion_token_count']} tokens for {model_id} for filename={row['file_name']} chapter={row['chapter_id']} is {resp['output_token_pricing']}")
    dir_path = os.path.join(config['dir']['model_eval_completions'], row['file_name'], model_id.replace(":", "-"))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"model_evaluation_{row['chapter_id']}.json")
    logger.info(f"writing response={resp} to {fpath}")
    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))
    logger.info(f"response {i}: {resp}")
    return resp

@ray.remote
def async_get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return get_inference(i, row, total, model_info)
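
The per-call cost above is plain arithmetic: the token count divided by 1,000, multiplied by the price per 1,000 tokens. A minimal sketch with made-up prices (these are not actual Amazon Bedrock prices, and `token_cost` is a hypothetical helper):

```python
def token_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of a number of tokens given a price per 1,000 tokens."""
    return (token_count / 1000) * price_per_1k_tokens

# Hypothetical prices per 1,000 tokens, for illustration only
input_cost = token_cost(1500, 0.003)   # 1.5 * 0.003
output_cost = token_cost(200, 0.015)   # 0.2 * 0.015
total_cost = input_cost + output_cost
print(input_cost, output_cost, total_cost)
```

Since the token counts are initialized to 0 in the response dict, a failed call contributes a cost of 0 rather than raising an error in this calculation.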

In [None]:
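# convert the data frame into a list of row dicts so that each row can be passed to a Ray task
model_eval_df = json.loads(model_eval_df.to_json(orient='records'))
# number of requests that are sent to the judge model in parallel
n: int = config['parallel_inference_count']
resp_list: List = []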
model_list = config['eval_model_info']
st = time.perf_counter()
logger.info(f"------ running inference for {model_list.get('model')} -----")
list_of_lists = [model_eval_df[i * n:(i + 1) * n] for i in range((len(model_eval_df) + n - 1) // n )]
logger.info(f"split input list of size {len(model_eval_df)} into {len(list_of_lists)} lists")
for idx, l in enumerate(list_of_lists):
    logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)} ")
    resp_list.extend(ray.get([async_get_inference.remote(i+1, e, len(l), model_list) for i, e in enumerate(l)]))
elapsed_time = time.perf_counter() - st
logger.info(f"------ model={model_list.get('model')} completed in {elapsed_time} ------ ")
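
The batching expression used above can be tried on its own. This sketch wraps the same slicing logic in a hypothetical helper, `split_into_batches`, and applies it to a short list of integers instead of prompt rows.

```python
def split_into_batches(items, n):
    """Split a list into consecutive batches of at most n items (ceiling division)."""
    return [items[i * n:(i + 1) * n] for i in range((len(items) + n - 1) // n)]

batches = split_into_batches(list(range(7)), 3)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch is submitted to Ray in one go, and `ray.get` waits for the whole batch to finish before the next one starts, so `n` caps the number of concurrent requests to the judge model.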

#### Extract all evaluations from the model evaluator

In [None]:
# find all evaluation files written by the LLM judge
fpath_evaluated_files = os.path.join(config['dir']['model_eval_completions'], "**", "*", "*.json")
eval_metric_files = glob.glob(fpath_evaluated_files, recursive=True)
logger.info(f"there are {len(eval_metric_files)} evaluated files by {config['eval_model_info']['model']} LLM judge in {fpath_evaluated_files}")

In [None]:
model_evaluation_responses = []
for f in eval_metric_files:
    with open(f, 'r') as file:
        model_evaluation_responses.append(json.loads(file.read()))
# results_df will contain the evaluation responses, including the completion and the model id
results_df = pd.DataFrame(model_evaluation_responses)
results_df = results_df.drop(columns=['exception', 'prompt', 'file_name'])
results_df.head(10)

In [None]:
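def clean_model_eval_json(data):
    """
    Parse the JSON returned by the model evaluator and extract the selected title,
    the selected model, and the explanation. All three fields are None if the
    response is missing or is not valid JSON with the expected keys.
    """
    try:
        json_data = json.loads(data.replace('\\', '\\\\'))
        return pd.Series({
            'best_match_title': json_data['best_match_title'],
            'selected_model': json_data['selected_model'],
            'explanation': json_data['explanation'],
        })
    # also covers a missing completion (None/NaN) and JSON without the expected keys
    except (AttributeError, TypeError, KeyError, json.JSONDecodeError):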
        return pd.Series({
            'best_match_title': None,
            'selected_model': None,
            'explanation': None,
        })
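
As a quick illustration of the parsing behavior, the sketch below runs a plain-dict variant of the same idea on a well-formed and a malformed judge response. The sample response text and the helper name `parse_judge_response` are made up; only the three key names come from the function above.

```python
import json

EXPECTED_KEYS = ("best_match_title", "selected_model", "explanation")

def parse_judge_response(text):
    """Return the three expected fields, or None for each if parsing fails."""
    try:
        data = json.loads(text)
        return {key: data[key] for key in EXPECTED_KEYS}
    except (TypeError, KeyError, json.JSONDecodeError):
        return {key: None for key in EXPECTED_KEYS}

# Hypothetical judge output, for illustration only
good = parse_judge_response(
    '{"best_match_title": "Q3 budget discussion", '
    '"selected_model": "model-a", '
    '"explanation": "Closest in meaning to the golden title."}'
)
bad = parse_judge_response("Sorry, I cannot answer in JSON.")
missing = parse_judge_response(None)
print(good, bad, missing)
```

Returning `None` for every field keeps one row per evaluation in the results, so failed parses show up as missing values instead of silently shrinking the dataset.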

In [None]:
def tidy_split(df, column, sep=',', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.
    
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [None]:
new_results_df = results_df['completion'].apply(clean_model_eval_json)
# removing any unnecessary characters from the selected_model if any
new_results_df['selected_model'] = new_results_df['selected_model'].str.replace(r'<[^>]+>', '', regex=True)
# here we split the elements of the selected_model column using the tidy split function
new_exploded_df = tidy_split(new_results_df, 'selected_model', sep=',')
new_results_df['chapter_title'] = results_df['original_title']
new_results_df['input_token_price'] = results_df['input_token_price']
new_results_df['output_token_price'] = results_df['output_token_pricing']
new_results_df = new_results_df.reindex(columns=['chapter_title', 'best_match_title', 'selected_model', 'explanation', 'input_token_price', 'output_token_price'])
logger.info(f"All evaluation data is read into a dataframe of shape {results_df.shape}")
processed_prompts_for_eval_path = os.path.join(metrics_dir, config['dir']['filtered_titles_for_eval'])
new_results_df.to_csv(processed_prompts_for_eval_path, index=False)
# display the selected title, model explanation and the respective golden title in a side by side view
new_results_df.head(10)

In [None]:
# Compute the percentage of each model selection and reset the index
new_exploded_df['selected_model'] = new_exploded_df['selected_model'].map(lambda x: x.strip())
model_percentage_df = new_exploded_df['selected_model'].value_counts(normalize=True).reset_index()
model_percentage_df['proportion'] *= 100
model_distribution_fpath = os.path.join(metrics_dir, config['dir']['model_distribution'])
model_percentage_df.to_csv(model_distribution_fpath, index=False)
model_percentage_df.rename(columns = {'selected_model':'model_id'}, inplace = True)
model_percentage_df.head(10)
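
The split-then-count logic can also be expressed with built-in pandas operations. The sketch below is an alternative to `tidy_split`, not a replacement used by this notebook; it relies on `str.split` and `DataFrame.explode` and uses made-up judge picks, including one tie and one missing value.

```python
import pandas as pd

# Hypothetical judge picks: one tie ("model-a, model-b") and one failed evaluation (None)
picks = pd.DataFrame({
    "selected_model": ["model-a", "model-a, model-b", None, "model-b", "model-a"]
})

exploded = (
    picks.dropna(subset=["selected_model"])
         .assign(selected_model=lambda d: d["selected_model"].str.split(","))
         .explode("selected_model")  # one row per model named in a pick
)
exploded["selected_model"] = exploded["selected_model"].str.strip()

# share of picks per model, as a percentage
pick_rate = exploded["selected_model"].value_counts(normalize=True) * 100
print(pick_rate)
```

With this approach a tie counts once for each model it names, so the percentages are shares of all picks rather than shares of evaluated chapters.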

In [None]:
# Identify the most frequently selected model
most_selected_index = model_percentage_df.proportion.idxmax()
report_template: str = config['report']['model_recommendation']
report: str = report_template.format(
                count=new_results_df.best_match_title.count(),
                model_id=model_percentage_df.iloc[most_selected_index]['model_id'],
                percentage_of_occurrence=model_percentage_df.proportion.max(), 
                total_evaluation_cost=round((new_results_df.input_token_price.sum() + new_results_df.output_token_price.sum()), 4))
result_data = {'model_recommendation': [report]}
results_summary_df = pd.DataFrame(result_data)
recommended_model_fpath = os.path.join(metrics_dir, config['dir']['final_report'])
# Saving to CSV
results_summary_df.to_csv(recommended_model_fpath, index=False)
print(report)

In [None]:
merged_df = pd.merge(df_summary, model_percentage_df, on='model_id', how='left')
merged_df.rename(columns={'proportion': 'LLM_as_a_judge_pick_rate'}, inplace=True)
merged_df['LLM_as_a_judge_pick_rate'] = merged_df['LLM_as_a_judge_pick_rate'].fillna("not available")
merged_df['mean_rouge_l_f1_score'] = merged_df['mean_rouge_l_f1_score'].fillna("not available")
merged_df['mean_cosine_similarity'] = merged_df['mean_cosine_similarity'].fillna("not available")
eval_report_template = config['report']['eval_report_explanation']

# Calculate the evaluation report for each row for the mean cosine, rouge and llm as a judge pick rate
merged_df['eval_report'] = merged_df.apply(lambda row: eval_report_template.format(
    rouge_score=row['mean_rouge_l_f1_score'],
    cosine_score=row['mean_cosine_similarity'],
    llm_as_a_judge=row['LLM_as_a_judge_pick_rate']
), axis=1)
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]
cols = merged_df.columns.tolist()
idx = cols.index('mean_cosine_similarity')
cols.insert(idx + 1, cols.pop(cols.index('LLM_as_a_judge_pick_rate')))
cols.insert(idx + 2, cols.pop(cols.index('eval_report')))
merged_df = merged_df[cols]
merged_df.to_csv(summary_metrics_file_path, index=False)
merged_df

In [None]:
# get the explanation results from llm as a judge
new_results_df = new_results_df.loc[:, ~new_results_df.columns.duplicated()]
new_results_df = new_results_df.rename(columns={'selected_model': 'model_id'})
new_results_df = pd.merge(new_results_df, merged_df[['model_id', 'eval_report']], on='model_id', how='left')
new_results_df = new_results_df.rename(columns={'model_id': 'selected_model'})

In [None]:
# insert the report right next to the explanation
cols = new_results_df.columns.tolist()
explanation_idx = cols.index('explanation')
cols.insert(explanation_idx + 1, cols.pop(cols.index('eval_report')))
new_results_df = new_results_df[cols]
new_results_df.to_csv(processed_prompts_for_eval_path, index=False)
new_results_df.head(10)

### Compute the Recommended LLM based on a combined score of `Subjective` and `Quantitative` evaluation using `LLM as a judge`, `ROUGE` and `Cosine Similarity` metrics

In [None]:
# Replace the 'not available' placeholders with 0 so that the score columns are numeric again.
# Assign the result back instead of using inplace=True on a column selection, which may not modify merged_df.
for col in ['LLM_as_a_judge_pick_rate', 'mean_rouge_l_f1_score', 'mean_cosine_similarity']:
    merged_df[col] = pd.to_numeric(merged_df[col].replace('not available', 0))
# convert the pick rate from a percentage to a fraction
merged_df['LLM_as_a_judge_pick_rate'] = merged_df['LLM_as_a_judge_pick_rate'] / 100
merged_df

In [None]:
best_llm_judge_model = merged_df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['model_id']
best_llm_judge_model

In [None]:
best_rouge_score_model = merged_df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['model_id']
best_rouge_score_model

In [None]:
best_cosine_model = merged_df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['model_id']
best_cosine_model

In [None]:
best_llm_judge_model_value = merged_df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['LLM_as_a_judge_pick_rate']
best_llm_judge_model_value

In [None]:
def recommend_model(df) -> Optional[str]:
    """
    Compute the recommended model based on the three evaluation criteria
    (LLM as a judge pick rate, ROUGE-L F1 score, and Cosine Similarity).
    If one model has the highest score on all three criteria, it is recommended as the model agreed on by all three.
    Otherwise, each pair of criteria is checked for agreement. If no two criteria agree,
    the best model for each individual criterion is reported.
    """
    try: 
        evaluation_report: Optional[str] = None
        # model with the highest score using LLM as a judge eval
        best_llm_judge_model = df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['model_id']
        best_llm_judge_model_value = df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['LLM_as_a_judge_pick_rate']
        # model with the highest score using the ROUGE f1 score
        best_rouge_score_model = df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['model_id']
        best_rouge_score_model_value = df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['mean_rouge_l_f1_score']
        # model with the highest score using the Cosine Similarity score
        best_cosine_model = df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['model_id']
        best_cosine_model_value = df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['mean_cosine_similarity']

        # check if all three models that are selected on the three criteria are the same
        if best_llm_judge_model == best_rouge_score_model == best_cosine_model:
            evaluation_report = (
                f"As per all three evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, Cosine Similarity of {best_cosine_model_value} and ROUGE score of {best_rouge_score_model_value}."
            )
        # Check combinations of any two criteria permutations
        elif best_llm_judge_model == best_rouge_score_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, and ROUGE score of {best_rouge_score_model_value}."
            )
        elif best_llm_judge_model == best_cosine_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, and Cosine Similarity score of {best_cosine_model_value}."
            )
        elif best_rouge_score_model == best_cosine_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_rouge_score_model}' is the best recommended model for your workload "
                f"based on the Cosine Similarity of {best_cosine_model_value} and ROUGE score of {best_rouge_score_model_value}."
            )
        # If none of the combinations match, recommend based on each individual criterion
        else:
            evaluation_report = (
                f"Based on each evaluation criteria, the following models are best recommended. "
                f"LLM as a judge selects {best_llm_judge_model} as the best recommended model. "
                f"Cosine Similarity score selects {best_cosine_model} as the best recommended model. "
                f"ROUGE score selects {best_rouge_score_model} as the best recommended model."
            )
    except Exception as e:
        logger.error(f"The best recommended model could not be provided: {e}")
        evaluation_report: Optional[str] = None
    return evaluation_report
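
The agreement check in `recommend_model` can be viewed as a vote among the three criteria. The sketch below is a compact alternative formulation, not the notebook's implementation: it uses plain dictionaries, made-up scores, and hypothetical model names to find the winner per criterion and count how many criteria agree.

```python
from collections import Counter

# Made-up scores for two hypothetical models
scores = {
    "model-a": {"judge": 0.60, "rouge": 0.41, "cosine": 0.80},
    "model-b": {"judge": 0.40, "rouge": 0.45, "cosine": 0.78},
}

def best_by(metric: str) -> str:
    """Model with the highest score for the given metric."""
    return max(scores, key=lambda model: scores[model][metric])

winners = {metric: best_by(metric) for metric in ("judge", "rouge", "cosine")}
# votes == 3: all criteria agree, votes == 2: two agree, votes == 1: no agreement
top_model, votes = Counter(winners.values()).most_common(1)[0]
print(winners, top_model, votes)
```

In this toy data the judge and cosine similarity both favor `model-a` while ROUGE favors `model-b`, which corresponds to the "two evaluation criteria" branch of the function above.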

In [None]:
# get the overall model recommendation based on the three evaluation criteria
recommendation = recommend_model(merged_df)
# Save the overall model evaluation recommendation to a csv
overall_eval_report_fpath: str = os.path.join(config['dir']['metrics'], config['dir']['overall_eval_report'])
overall_eval_data = {'overall_eval_recommendation': [recommendation]}
overall_eval_df = pd.DataFrame(overall_eval_data)
overall_eval_df.to_csv(overall_eval_report_fpath, index=False)
print(recommendation)