## Evaluate Model Responses using an _LLM as a judge_
---

This notebook does the following:

1. Reads all the responses from the previous inference step and evaluates them using an _LLM as a judge_. For each question and context, the judge selects the best model and its response, and provides a subjective explanation for that choice.

1. Records metrics such as the `p90` and `p95` latency, along with `explanation` files describing why the _LLM as a judge_ selected a given model and rejected the others, based on correctness and relevancy.

1. Uses a _Final LLM as a summarizer_ to go through all of the subjective evaluations/explanations provided by the _LLM as a judge_. It produces a final analysis of the trends and patterns in model performance and summarizes which model is preferred for a given use case/dataset.

*The model to be used as a judge and the final analysis summarizer can be configured in the `llm_as_a_judge_info` and the `final_analysis_summarizer` sections in the [config.yaml](config.yaml) file.*
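
As a quick sanity check before running the cells below, the two sections named above can be verified once the config is loaded. This is a minimal sketch: `missing_config_sections` is a hypothetical helper, and the sample dictionary uses placeholder values, not the real contents of `config.yaml`.

```python
from typing import Dict, List

# Section names taken from the note above; everything else here is illustrative.
REQUIRED_SECTIONS: List[str] = ["llm_as_a_judge_info", "final_analysis_summarizer"]

def missing_config_sections(config: Dict) -> List[str]:
    """Return the required top-level sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not config.get(name)]

# Placeholder config standing in for the parsed YAML (values are hypothetical).
sample_config = {
    "llm_as_a_judge_info": {
        "model": "<judge-model-id>",
        "prompt_template": "<path-to-template>",
    },
}
missing = missing_config_sections(sample_config)
print(missing)  # ['final_analysis_summarizer']
```

Running a check like this right after `yaml.safe_load` surfaces a missing section immediately, before any judge calls are made.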

In [None]:
# import the libraries
import os
import re
import ray
import json
import glob
import yaml
import time
import boto3
import logging
import botocore
import pandas as pd
from pathlib import Path
from functools import reduce
from litellm import completion
from typing import Dict, List, Optional

In [None]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

In [None]:
# global constants
CONFIG_FILE_PATH = "config.yaml"

# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

In [None]:
# initialize all global variables that are used across this notebook hydrated from the `config.yaml` file

# name of your csv file (containing the dataframe)
FILE_NAME: str = config['dir_info']['dataset_file_name']
# data directory
DATA_DIR: str = config['dir_info']['data_dir']

# result files
INFERENCE_LATENCY_SUMMARY_FPATH = os.path.join(DATA_DIR, config['dir_info']['inference_latency_summary_fname'])
METRICS_DIR: str = os.path.join(DATA_DIR, config['dir_info']['metrics'])
JSON_TXT_FILE_PATH: str = os.path.join(METRICS_DIR, config['dir_info']['llm_comparisons_txt'])
ALL_EXPLANATIONS_FPATH: str = os.path.join(METRICS_DIR, config['dir_info']['all_explanations'])
FINAL_ANALYSIS_MODEL_ID: str = config['final_analysis_summarizer']
FINAL_SUMMARY_ANALYSIS: str = os.path.join(METRICS_DIR, config['dir_info']['final_summary_analysis'])
bedrock_model_ids: List[str] = config['bedrock_fms_to_test']
USER_PROMPT_COL: str = config['dataset_info']['user_question_col']
SYSTEM_PROMPT_COL: str = config['dataset_info']['system_prompt_col']
INFERENCE_PARAMETERS: Dict = config['inference_parameters']
ON_LIST = list(filter(None, [USER_PROMPT_COL, 
                             SYSTEM_PROMPT_COL]))

### Use _LLM as a Judge_ Evaluations
---
In this portion:

1. Responses generated by each model are evaluated on relevance and meaning by your model of choice, which acts as a `Judge`. The prompt for the judge model can be viewed and tweaked for different use cases in the [prompt_template/](prompt_template/) directory. Review and edit this prompt based on your use case and criteria for subjective evaluation.

1. The role of the model acting as a judge is to compare the responses generated by each model with each other and with the responses already provided in the source dataset (if any). It returns the selected model, the selected response, and an explanation of its selection, including a detailed comparison with the other responses and the reasons for its choice.

*Note: For more information on the use of having a Model act as a judge, view: https://huggingface.co/learn/cookbook/en/llm_judge*
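
To make the judge's input concrete, here is a minimal sketch of how a template with `{context}`, `{question}`, `{original_answer}` and `{model_responses}` placeholders gets filled in. The template wording, the helper name `fill_judge_prompt` and the model names are invented for illustration; the real templates live in the `prompt_template/` directory.

```python
from typing import Dict

# Hypothetical template text; only the four placeholder names mirror this notebook.
EXAMPLE_TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Reference answer: {original_answer}\n"
    "Candidate responses:\n{model_responses}\n"
    "Pick the best candidate and explain why."
)

def fill_judge_prompt(template: str, context: str, question: str,
                      original_answer: str, responses: Dict[str, str]) -> str:
    """Wrap each response in <model_id>...</model_id> tags and fill the template."""
    tagged = [f"<{model_id}>\n{text}\n</{model_id}>" for model_id, text in responses.items()]
    return template.format(
        context=context,
        question=question,
        original_answer=original_answer,
        model_responses="\n".join(tagged),
    )

prompt = fill_judge_prompt(
    EXAMPLE_TEMPLATE,
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    original_answer="Paris",
    responses={"model-a": "Paris.", "model-b": "It is Lyon."},
)
print(prompt)
```

Tagging each response with its model id lets the judge refer to a candidate by name in its answer.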

In [None]:
def prepare_eval_prompts(row):
    """
    This function prepares the evaluation prompt for a single row by inserting all of the responses generated by the various models into the evaluation prompt template.
    """
    eval_template: Optional[str] = None
    processed_eval_template: Optional[str] = None
    model_responses: List[str] = []
    # file path to the eval template
    eval_template_path: str = config['llm_as_a_judge_info']['prompt_template']
    try:
        with open(eval_template_path, "r") as f:
            eval_template = f.read()
            logger.info(f"evaluation prompt template recorded: {eval_template}")
    except FileNotFoundError:
        # without the template there is nothing to format, so fail with a clear message
        logger.error(f"Evaluation template not found at {eval_template_path}")
        raise
    for column in row.index:
        if column.endswith("-response") and column != config['dataset_info']['pre_existing_response_col']:
            model_id = column.split("-response")[0]
            model_response = row[column]
            model_responses.append(f"\n<{model_id}>\n{model_response}\n</{model_id}>\n")
    print(f"model_responses: {model_responses}")

    if config['dataset_info']['system_prompt_col'] is not None:
        # if the system prompt is provided in the dataset, it is used as context
        processed_eval_template = eval_template.format(
            context=row[config['dataset_info']['system_prompt_col']], 
            question=row[config['dataset_info']['user_question_col']], 
            original_answer=row[config['dataset_info']['pre_existing_response_col']],
            model_responses="\n".join(model_responses)
        )
    else:
        # if the system prompt is not provided, the user column is assumed to have the context and so 
        # all the context is fit into the question itself
        processed_eval_template = eval_template.format(
            context=" ", 
            question=row[config['dataset_info']['user_question_col']], 
            original_answer=row[config['dataset_info']['pre_existing_response_col']],
            model_responses="\n".join(model_responses)
        )
    return processed_eval_template

#### Retrieve all the results from the `results.csv` file generated in the _Inference Step_
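
The inference step is expected to write one `<model_id>-response` column per model, which is the convention `prepare_eval_prompts` relies on. A small sketch of that column selection, using a plain dictionary as a stand-in for one row (the helper name, column names and values below are made up):

```python
from typing import Dict, Optional

def collect_model_responses(row: Dict[str, str],
                            pre_existing_col: Optional[str] = None) -> Dict[str, str]:
    """Map model id -> response for every '<model_id>-response' column in a row."""
    suffix = "-response"
    return {
        column[: -len(suffix)]: value
        for column, value in row.items()
        if column.endswith(suffix) and column != pre_existing_col
    }

# Hypothetical row; 'gold-response' plays the role of the pre-existing answer column.
example_row = {
    "question": "What is 2 + 2?",
    "gold-response": "4",
    "model-a-response": "4",
    "model-b-response": "5",
}
responses = collect_model_responses(example_row, pre_existing_col="gold-response")
print(responses)  # {'model-a': '4', 'model-b': '5'}
```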

In [None]:
# Read the inference results
inference_results_file: str = os.path.join(METRICS_DIR, 
                                           config['dir_info']['all_results_file_name'])
df_resp_all = pd.read_csv(inference_results_file)
df_resp_all.head(10)

### Construct the ***LLM as a Judge Prompt Template***
---

In this portion of the notebook, the prompt template used by the LLM as a judge is prepared. This sample includes two example evaluation prompt templates: one for Llama3 [here](model-evals/llm_as_a_judge/data/prompt_template/llama3_eval_prompt.txt) and one for Anthropic Claude [here](model-evals/llm_as_a_judge/data/prompt_template/claude_eval_prompt.txt).

Information on which LLM as a judge to use can be configured in the `llm_as_a_judge_info` section of the config file.
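
One detail worth checking when editing a template, assuming it is filled with `str.format` as in `prepare_eval_prompts`: literal braces, such as a JSON example showing the expected output, need to be doubled so that they are not treated as placeholders. The template text below is illustrative only.

```python
# Illustrative template: the doubled braces render as literal JSON braces,
# while {question} is still substituted by str.format.
template = (
    "Question: {question}\n"
    "Respond in JSON like this:\n"
    '{{"selected_model": "...", "explanation": "..."}}'
)

rendered = template.format(question="Which response is best?")
print(rendered)

# Single braces around the JSON example are parsed as a replacement field,
# so formatting fails instead of producing the intended text.
broken = 'Respond as {"selected_model": "..."} for {question}'
try:
    broken.format(question="Which response is best?")
    format_failed = False
except (KeyError, ValueError, IndexError) as err:
    format_failed = True
    print(f"formatting failed: {type(err).__name__}")
```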

In [None]:
if df_resp_all is not None:
    logger.info("preparing the evaluation prompt templates for the LLM judge....")
    df_resp_all['eval_prompt'] = df_resp_all.apply(lambda r: prepare_eval_prompts(r), axis=1)
else:
    logger.error("Model evaluation dataset is not available to process.")
eval_path_df: str = os.path.join(METRICS_DIR, config['dir_info']['processed_eval_prompts'])
df_resp_all.insert(0, 'prompt_id', df_resp_all.index)
df_resp_all.to_csv(eval_path_df, index=False)

In [None]:
df_resp_all

### Using LLM as a judge in the loop to evaluate and narrow down the responses generated by different models of choice
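
Two small pieces of arithmetic drive the cells that follow: the records are split into batches of `n` so that only that many judge calls run in parallel, and each call is priced per 1,000 tokens. A minimal sketch of both, with hypothetical helper names and made-up prices:

```python
from typing import List, Sequence

def split_into_batches(items: Sequence, n: int) -> List[list]:
    """Split items into consecutive batches of at most n elements."""
    if n < 1:
        raise ValueError("batch size must be at least 1")
    return [list(items[i:i + n]) for i in range(0, len(items), n)]

def token_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of a call when pricing is quoted per 1,000 tokens."""
    return (token_count / 1000) * price_per_1k_tokens

batches = split_into_batches(list(range(7)), n=3)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# Made-up price of 0.5 per 1,000 input tokens, for illustration only.
print(token_cost(2000, 0.5))  # 1.0
```

The last batch is allowed to be shorter than `n`, which matches the ceiling division used when the input list is split later in this section.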

In [None]:
def llm_judge_json_evaluations(model_id: str, prompt: str):
    # name of the service, used as the provider prefix for litellm
    service_name: str = "bedrock"
    # model string passed to litellm, in the form "bedrock/<model_id>"
    bedrock_model: str = f"{service_name}/{model_id}"
    # current aws region from the default boto3 session
    aws_region = boto3.Session().region_name
    # initialize the response dict
    ret = dict(exception = None,
               user_prompt=None,
               prompt = prompt,
               completion = None,
               # initializing to 0 since none type throws an error later, this is used to calculate price per token input/output on ODT pricing
               completion_token_count = 0,
               # initializing to 0 since none type throws an error later
               prompt_token_count=0,
               input_token_cost = None, 
               output_token_cost = None,
               model_id = model_id)
    
    body = ret['prompt']
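    # setting an environment variable to None raises a TypeError, so only set it
    # when the default boto3 session has a region configured
    if aws_region:
        os.environ["AWS_REGION_NAME"] = aws_region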
    parameters = config['inference_parameters']
    temperature = parameters.get('temperature', 0.1)
    caching = parameters.get('caching', False)
    max_tokens = parameters.get("max_tokens", 500)

    try:
        # call the litellm completion API with the prompt as a single user message
        logger.info(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{ "content": body,"role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)
        # iterate through the choices returned by the model
        for idx, choice in enumerate(response.choices):
            # keep the message content if this choice has one
            if choice.message and choice.message.content:
                ret["completion"] = choice.message.content.strip()
        # extract the number of input (prompt) and output (completion) tokens
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
        
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [None]:
def get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"row={row}")
    model_id = model_info['model']
    # create the payload for model inference
    prompt = row['eval_prompt']
    # invoke the LLM judge with the prepared evaluation prompt
    resp = llm_judge_json_evaluations(model_id, prompt)
    resp[config['dataset_info']['pre_existing_response_col']] = row[config['dataset_info']['pre_existing_response_col']]
    # calculate the input and output token price for all of the calls
    resp['input_token_cost'] = (resp['prompt_token_count']/1000) * model_info['input_tokens_pricing']
    resp['output_token_cost'] = (resp['completion_token_count']/1000) * model_info['output_tokens_pricing']
    dir_path = os.path.join(config['dir_info']['llm_as_a_judge_dir'], str(row['prompt_id']), model_id.replace(":", "-"))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"model_evaluation_{row['prompt_id']}.json")
    logger.info(f"writing response={resp} to {fpath}")
    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))
    logger.info(f"response {i}: {resp}")
    return resp

@ray.remote
def async_get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return get_inference(i, row, total, model_info)

In [None]:
df_resp_all = json.loads(df_resp_all.to_json(orient='records'))
n: int = config.get('parallel_inference_count')
resp_list: List = []
erroneous_count = 0  # To keep track of errors
st = time.perf_counter()
EVAL_MODEL_INFO: Dict = config['llm_as_a_judge_info']
logger.info(f"------ running inference for {EVAL_MODEL_INFO.get('model')} -----")

# Split the input list
list_of_lists = [df_resp_all[i * n:(i + 1) * n] for i in range((len(df_resp_all) + n - 1) // n)]
logger.info(f"split input list of size {len(df_resp_all)} into {len(list_of_lists)} lists")

# Process each list
for idx, l in enumerate(list_of_lists):
    try:
        logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)}")
        resp_list.extend(ray.get([async_get_inference.remote(i + 1, e, len(l), EVAL_MODEL_INFO) for i, e in enumerate(l)]))
    except Exception as e:
        logger.error(f"Error processing list {idx+1}/{len(list_of_lists)}: {e}")
        erroneous_count += 1

elapsed_time = time.perf_counter() - st
logger.info(f"------ model={EVAL_MODEL_INFO.get('model')} completed in {elapsed_time} ------")
logger.info(f"Total erroneous lists: {erroneous_count}")

In [None]:
# view the records (including the evaluation prompts) that were sent to the LLM as a judge
df_resp_all

### Visualize `LLM as a judge` completions and get more evaluation metrics
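
The judge is expected to return a JSON object with `best_match_answer`, `selected_model` and `explanation` keys. The sketch below shows one defensive way to parse such a completion. `parse_judge_completion` is a hypothetical helper, simpler than the `clean_model_eval_json` function defined later in this section, and the sample completion is invented.

```python
import json
import re
from typing import Dict, Optional

FIELDS = ("best_match_answer", "selected_model", "explanation")

def parse_judge_completion(completion: Optional[str]) -> Dict[str, Optional[str]]:
    """Parse a judge completion; return None for every field if it is not a JSON object."""
    empty = {field: None for field in FIELDS}
    if not completion:
        return empty
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    parsed = {field: data.get(field) for field in FIELDS}
    # The judge may echo the <model_id> tags from the prompt; strip the brackets.
    if isinstance(parsed["selected_model"], str):
        parsed["selected_model"] = re.sub(r"[<>]", "", parsed["selected_model"]).strip()
    return parsed

sample = (
    '{"best_match_answer": "Paris.", '
    '"selected_model": "<model-a>", '
    '"explanation": "Matches the reference."}'
)
print(parse_judge_completion(sample)["selected_model"])      # model-a
print(parse_judge_completion("not json")["selected_model"])  # None
```

Returning a row of `None` values instead of raising keeps one malformed completion from stopping the whole evaluation.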

In [None]:
# collect all of the evaluation files written by the LLM as a judge
fpath_evaluated_files = os.path.join(config['dir_info']['llm_as_a_judge_dir'], "**", "*", "*.json")
eval_metric_files = glob.glob(fpath_evaluated_files, recursive=True)
logger.info(f"there are {len(eval_metric_files)} evaluated files by {config['llm_as_a_judge_info']['model']} LLM judge in {fpath_evaluated_files}")

In [None]:
def extract_sections(text: str) -> Optional[str]:
    """
    This function extracts the text between 'Question:' and the next triple backtick
    in the output generated by the LLM as a judge. Returns None if no match is found.
    """
    try:
        question_match = re.search(r'Question:(.*?)```', text, re.DOTALL)
        question = question_match.group(1).strip() if question_match else None
    except Exception as e:
        print(f"The question was not extracted: {e}")
        question = None
    return question

In [None]:
# create the metrics directory (under the data directory) if it does not exist yet
os.makedirs(METRICS_DIR, exist_ok=True)
model_evaluation_responses = []

for f in eval_metric_files:
    with open(f, 'r') as file:
        model_evaluation_responses.append(json.loads(file.read()))
# results_df will contain the evaluation responses, including the completion and the model id
results_df = pd.DataFrame(model_evaluation_responses)
raw_llm_as_a_judge_responses: str = config['dir_info']['raw_llm_as_a_judge_completions']
raw_llm_fpath: str = os.path.join(METRICS_DIR, raw_llm_as_a_judge_responses)
results_df = results_df.dropna(axis=1, how='all')
results_df.head(10)

In [None]:
def replace_unescaped_quotes(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, str):
            value = value.replace("'", r"\'").replace('"', r'\"')
        new_pairs.append((key, value))
    return dict(new_pairs)

def clean_model_eval_json(data):
    """
    This function takes in the JSON string returned by the LLM as a judge, cleans it, and extracts the best matching answer, the selected model and the explanation.
    """
    try:
        # Preprocess the input string to handle unescaped double quotes at the start
        if data.startswith('"'):
            data = "'" + data[1:-1].replace('"', '\\"') + "'"
        data = data.replace('\n', ' ')

        json_data = json.loads(data, object_pairs_hook=replace_unescaped_quotes)
        
        # Remove angle brackets from the selected_model value
        selected_model = json_data.get('selected_model', '')
        json_data['selected_model'] = re.sub(r'[<>]', '', selected_model)

        return pd.Series({
            'best_match_answer': json_data.get('best_match_answer'),
            'selected_model': json_data.get('selected_model'),
            'explanation': json_data.get('explanation'),
        })
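    # AttributeError/TypeError cover completions that are missing (None) or not strings
    except (json.JSONDecodeError, KeyError, AttributeError, TypeError) as e: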
        print(f"Invalid JSON data: {data} - {e}")
        return pd.Series({
            'best_match_answer': None,
            'selected_model': None,
            'explanation': None,
        })

In [None]:
def tidy_split(df, column, sep=',', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [None]:
new_results_df = results_df['completion'].apply(clean_model_eval_json)
# removing any unnecessary characters from the selected_model if any
new_results_df['selected_model'] = new_results_df['selected_model'].str.replace(r'<[^>]+>', '', regex=True)
# here we split the elements of the selected_model column using the tidy split function
new_exploded_df = tidy_split(new_results_df, 'selected_model', sep=',')
new_results_df[config['dataset_info']['pre_existing_response_col']] = results_df[config['dataset_info']['pre_existing_response_col']]
new_results_df['input_token_cost'] = results_df['input_token_cost']
new_results_df['output_token_cost'] = results_df['output_token_cost']
logger.info(f"All evaluation data is read into a dataframe of shape {results_df.shape}")
# the token cost columns are not needed for the side by side view
new_results_df.drop(columns=['input_token_cost', 'output_token_cost'], inplace=True)
# move the pre-existing response column right after the selected_model column
cols = new_results_df.columns.tolist()
idx = cols.index('selected_model')
cols.insert(idx + 1, cols.pop(cols.index(config['dataset_info']['pre_existing_response_col'])))
new_results_df = new_results_df[cols]
# display the selected model, its explanation and the pre-existing response in a side by side view
new_results_df.head(20)

In [None]:
initial_df = pd.read_csv(eval_path_df)
# Merge the two DataFrames on the pre-existing response column to bring in the user question
merged_df = pd.merge(new_results_df, initial_df[[config['dataset_info']['pre_existing_response_col'], 
                                                config['dataset_info']['user_question_col']]], on=config['dataset_info']['pre_existing_response_col'], how='left')

cols = [col for col in merged_df.columns if col != 'user prompt']
processed_prompts_for_eval_path = os.path.join(METRICS_DIR, config['dir_info']['llm_as_a_judge_comparisons'])
merged_df.to_csv(processed_prompts_for_eval_path, index=False)
merged_df

### View the LLM as a judge comparison and evaluation

In [None]:
processed_prompts_for_eval_path = os.path.join(METRICS_DIR, config['dir_info']['llm_as_a_judge_comparisons'])
merged_df = pd.read_csv(processed_prompts_for_eval_path)
merged_df

In [None]:
# Convert the DataFrame to JSON
merged_df_json = merged_df.to_json(orient='records')

# Save the JSON to a text file
with open(JSON_TXT_FILE_PATH, 'w') as json_text_file:
    json_text_file.write(merged_df_json)
logger.info(f"JSON saved to: {JSON_TXT_FILE_PATH}")

### Generate the LLM as a judge `pick rate` to show how often each model was picked as having the best response
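
The pick rate is the share of selections each model received, where a judge answer naming several models (comma-separated) counts once for each of them. The following is a pandas-free sketch of that calculation using invented selections; the notebook itself relies on `tidy_split` and `value_counts(normalize=True)`, and `pick_rate` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

def pick_rate(selected_models: Iterable[Optional[str]]) -> Dict[str, float]:
    """Percentage of picks per model; comma-separated picks count once per model."""
    picks = [
        name.strip()
        for entry in selected_models
        if entry  # skip rows where the judge output could not be parsed
        for name in entry.split(",")
        if name.strip()
    ]
    counts = Counter(picks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: 100 * count / total for model, count in counts.items()}

# Invented judge selections for illustration.
rates = pick_rate(["model-a", "model-b", "model-a, model-b", None])
print(rates)  # {'model-a': 50.0, 'model-b': 50.0}
```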

In [None]:
# Compute the percentage of each model selection and reset the index
new_exploded_df['selected_model'] = new_exploded_df['selected_model'].map(lambda x: x.strip())
response_index_percentage_df = new_exploded_df['selected_model'].value_counts(normalize=True).reset_index()
response_distribution_fpath = os.path.join(METRICS_DIR, config['dir_info']['llm_as_a_judge_pick_rate'])
response_index_percentage_df['proportion'] *= 100
response_index_percentage_df.to_csv(response_distribution_fpath, index=False)
response_index_percentage_df.head(10)

### Final Summary: `LLM evaluation`
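
The summarizer call below sends an Anthropic Messages style request through Bedrock and reads the text of the first content block from the reply. The sketch separates those two steps into pure functions so they can be checked without calling AWS. The function names and the sample reply are hypothetical.

```python
import json
from typing import Optional

def build_summary_request(prompt: str, max_tokens: int = 2000,
                          temperature: float = 0.1) -> str:
    """Serialize a single-turn user message in the shape used by final_analysis_summary."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]},
        ],
    })

def extract_summary_text(response_body: dict) -> Optional[str]:
    """Return the text of the first content block, or None if the reply has none."""
    content = response_body.get("content") or []
    if content and isinstance(content[0], dict):
        return content[0].get("text")
    return None

body = build_summary_request("Summarize the judge explanations.")
print(json.loads(body)["messages"][0]["role"])  # user

# Hand-written stand-in for a decoded response body.
fake_reply = {"content": [{"type": "text", "text": "Model A is preferred."}]}
print(extract_summary_text(fake_reply))  # Model A is preferred.
```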

In [None]:
# simple function to get a final summary on all of the data provided from LLM as a judge
def final_analysis_summary(bedrock: botocore.client, 
                           prompt: str) -> str:
    """
    This function sends the final summary prompt, which contains all of the explanations from the LLM as a judge,
    to the summarizer model and returns its analysis as text. Returns None if the model invocation fails.
    """
    modelId=FINAL_ANALYSIS_MODEL_ID
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0.1,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })

    try:
        response = bedrock.invoke_model(
        modelId=modelId,
        body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].replace('"', "'")

    except Exception as e:
        logger.error(f"exception={e}")
        llm_response = None
    return llm_response

In [None]:
new_results_df

In [None]:
with open(ALL_EXPLANATIONS_FPATH, 'w') as file:
    for index, row in merged_df.iterrows():
        file.write(f"Selected Model: {row['selected_model']}\nExplanation: {row['explanation']}\n\n")

# Read the content back to use as analysis context
with open(ALL_EXPLANATIONS_FPATH, 'r') as file:
    analysis_context = file.read()
print(analysis_context)

In [None]:
# open the prompt template and prepare it for inference
with open(config['dir_info']['claude_final_summary_eval_prompt'], 'r') as file:
    final_summary_prompt = file.read()
    processed_summary_eval_prompt: str = final_summary_prompt.format(context=analysis_context)

endpoint_url: str = config['bedrock_ep_url'].format(region=config['aws']['region'])
bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=endpoint_url)
final_analysis: str = final_analysis_summary(bedrock, prompt=processed_summary_eval_prompt)

In [None]:
logger.info(final_analysis)

In [None]:
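# write the final analysis to disk only if the summarizer returned a response
if final_analysis is not None:
    Path(FINAL_SUMMARY_ANALYSIS).write_text(final_analysis + "\n")
else:
    logger.error("no final analysis was generated, nothing was written to disk")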