## Evaluate Model Responses using _LLM as a judge_
---

This notebook does the following:

1. Reads all of the model responses into a dataframe and evaluates them with an `LLM as a judge`. For each question and context, the judge selects the best model and its response, and explains why it chose that model.

1. Records metrics such as `p90` and `p95` latency, and writes `explanation` files describing why the `LLM as a judge` selected a given model and rejected the others, based on correctness and relevancy.

1. Uses a `Final LLM as a summarizer` to read all of the explanations produced by the `LLM as a judge`, analyze trends and patterns in model performance, and summarize which model is preferred for a given use case/dataset.

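As a rough illustration of the latency metrics mentioned above, the sketch below computes `p90` and `p95` with linear interpolation between sorted observations. The helper name `percentile`, the model names, and the latency values are hypothetical; the notebook's own metric code may compute these differently.

```python
from typing import Dict, List

def percentile(values: List[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # fractional position of the percentile within the sorted list
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

# hypothetical per-request latencies in seconds, keyed by model
latencies: Dict[str, List[float]] = {
    "model-a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "model-b": [2.0, 4.0, 6.0, 8.0, 10.0],
}
latency_summary = {
    model: {"p90": percentile(vals, 90), "p95": percentile(vals, 95)}
    for model, vals in latencies.items()
}
print(latency_summary)
```

With only a handful of requests per model, high percentiles are dominated by the one or two slowest calls, so they are best read alongside the raw latencies.
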
In [1]:
# import the libraries
import os
import re
import ray
import json
import glob
import yaml
import time
import boto3
import logging
import botocore
import pandas as pd
from pathlib import Path
from functools import reduce
from litellm import completion
from typing import Dict, List, Optional

2024-06-05 21:07:17,580	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.6.5 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


In [2]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [3]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-06-05 21:07:22,043	INFO worker.py:1749 -- Started a local Ray instance.


Python version: 3.11.7
Ray version: 2.11.0


In [4]:
# global constants
CONFIG_FILE_PATH = "config.yaml"

# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-06-05 21:07:23,275] p36671 {2927026569.py:8} INFO - config read from config.yaml -> {
  "app_name": "llm-as-a-judge-eval-pipeline",
  "aws": {
    "region": "us-east-1"
  },
  "pdf_dir_info": {
    "data_dir": "data",
    "dataset_dir": "source_data",
    "dataset_file_name": "data.xlsx",
    "metrics": "metrics",
    "llm_as_a_judge_dir": "eval_completions",
    "prompt_dir": "prompt_template",
    "llm_as_a_judge_completions": "llm_as_a_judge_completions.csv",
    "raw_llm_as_a_judge_completions": "raw_llm_responses.csv",
    "llm_as_a_judge_comparisons": "llm_as_a_judge_comparisons.csv",
    "llm_comparisons_txt": "llm_as_a_judge_comparisons.txt",
    "llm_as_a_judge_pick_rate": "llm_as_a_judge_pick_rate.csv",
    "eval_prompt_template": "llama3_eval_prompt.txt",
    "prompt_template": "prompt_template.txt",
    "processed_eval_prompts": "processed_eval_prompts.csv",
    "inference_latency_summary_fname": "inference_latency_summary.txt",
    "all_results_file_name": "all_resul

In [5]:
# initialize the global variables used across this notebook from the `config.yaml` file
# name of the dataset file
FILE_NAME: str = config['pdf_dir_info']['dataset_file_name']
# data directory
DATA_DIR: str = config['pdf_dir_info']['data_dir']

# result files
ALL_RESULTS_FPATH = os.path.join(DATA_DIR, config['pdf_dir_info']['all_results_file_name'])
INFERENCE_LATENCY_SUMMARY_FPATH = os.path.join(DATA_DIR, config['pdf_dir_info']['inference_latency_summary_fname'])
METRICS_DIR: str = config['pdf_dir_info']['metrics']
JSON_TXT_FILE_PATH: str = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_comparisons_txt'])
FINAL_ANALYSIS_MODEL_ID: str = config['final_analysis_summarizer']
bedrock_model_ids: List[str] = config['bedrock_fms_to_test']

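The cells in this notebook index into several top-level sections of `config.yaml`, so a missing section only surfaces as a `KeyError` partway through a run. A minimal sketch of an up-front check is shown here; the helper `missing_config_keys` and the `example_cfg` dictionary are hypothetical, and the key list covers only the sections this notebook reads directly.

```python
from typing import Dict, List

# top-level config sections read by the cells in this notebook
REQUIRED_KEYS: List[str] = [
    "pdf_dir_info",
    "dataset_info",
    "llm_as_a_judge_info",
    "inference_parameters",
    "final_analysis_summarizer",
    "bedrock_fms_to_test",
]

def missing_config_keys(cfg: Dict, required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required top-level keys that are absent from cfg."""
    return [key for key in required if key not in cfg]

# hypothetical, incomplete config used only to exercise the check
example_cfg: Dict = {"pdf_dir_info": {}, "dataset_info": {}}
missing = missing_config_keys(example_cfg)
if missing:
    print(f"config is missing sections: {missing}")
```
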
### Use `LLM as a Judge` Evaluations
---
In this portion:

1. The responses generated by each model are evaluated on relevance and meaning by a model of your choice that acts as a `Judge`. The prompt for the judge model is in the ['prompt_template/'](prompt_template/) directory. Review and edit this prompt to match your use case and your criteria for subjective evaluation.

1. The role of the judge model is to compare the responses generated by each model. It returns the selected model, the selected response, and an explanation of its choice, including an in-depth comparison with the other responses.

*Note: For more information on using a model as a judge, see: https://huggingface.co/learn/cookbook/en/llm_judge*

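To make the judge's input concrete, here is a minimal sketch of how an evaluation prompt can be assembled: each candidate response is wrapped in a tag named after its model, and the tagged responses are substituted into a template together with the context, question, and reference answer. The template text, the row keys `context`, `question`, and `reference`, and the helper `build_eval_prompt` are hypothetical stand-ins; only the four placeholder names match the ones used in this notebook.

```python
from typing import Dict, List

# hypothetical, heavily shortened stand-in for the judge prompt template
EVAL_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Reference answer: {original_answer}\n\n"
    "Candidate responses:\n{model_responses}\n"
)

def build_eval_prompt(row: Dict[str, str], template: str = EVAL_TEMPLATE) -> str:
    """Wrap every `<model_id>-response` value in model tags and fill the template."""
    suffix = "-response"
    tagged: List[str] = []
    for column, value in row.items():
        if column.endswith(suffix):
            model_id = column[: -len(suffix)]
            tagged.append(f"<{model_id}>\n{value}\n</{model_id}>")
    return template.format(
        context=row.get("context", " "),
        question=row["question"],
        original_answer=row["reference"],
        model_responses="\n".join(tagged),
    )

example_row = {
    "question": "What is X?",
    "reference": "X is a thing.",
    "model-a-response": "Answer A",
    "model-b-response": "Answer B",
}
prompt = build_eval_prompt(example_row)
print(prompt)
```

Because the template is filled with `str.format`, any literal braces in the template file (for example a JSON example shown to the judge) have to be doubled as `{{` and `}}`, otherwise formatting raises an error.
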
In [6]:
def prepare_eval_prompts(row):
    """
    Builds the evaluation prompt for one row by inserting the responses generated
    by the various Bedrock models into the evaluation prompt template.
    Returns None if the template file cannot be read.
    """
    eval_template: Optional[str] = None
    processed_eval_template: Optional[str] = None
    model_responses: List[str] = []
    dataset_info: Dict = config['dataset_info']
    # file path to the eval template
    eval_template_path: str = config['llm_as_a_judge_info']['prompt_template']
    try:
        with open(eval_template_path, "r") as f:
            eval_template = f.read()
        logger.info(f"evaluation prompt template recorded: {eval_template}")
    except FileNotFoundError:
        logger.error(f"Evaluation template not found at {eval_template_path}")
        return None
    # wrap each model response in <model_id>...</model_id> tags
    for column in row.index:
        if column.endswith("-response") and column != dataset_info['pre_existing_response_col']:
            model_id = column.split("-response")[0]
            model_response = row[column]
            model_responses.append(f"\n<{model_id}>\n{model_response}\n</{model_id}>\n")
    print(f"model_responses: {model_responses}")
    # use the system prompt column as context when one is configured; this assumes the
    # `system_prompt_col` and `user_question_col` keys for both cases, so that the
    # lookup matches the check
    system_prompt_col = dataset_info.get('system_prompt_col')
    context = row[system_prompt_col] if system_prompt_col is not None else " "
    processed_eval_template = eval_template.format(
        context=context,
        question=row[dataset_info['user_question_col']],
        original_answer=row[dataset_info['pre_existing_response_col']],
        model_responses="\n".join(model_responses)
    )
    return processed_eval_template

#### Retrieve all the results from the CSV file generated in step 1 containing the model responses

In [7]:
df_resp_all = pd.read_csv(ALL_RESULTS_FPATH)
df_resp_all.head(10)

Unnamed: 0,user_input,model_1,anthropic.claude-3-haiku-20240307-v1:0-response,anthropic.claude-3-haiku-20240307-v1:0-time_taken_in_seconds,anthropic.claude-3-haiku-20240307-v1:0-prompt_token_count,anthropic.claude-3-haiku-20240307-v1:0-completion_token_count,anthropic.claude-3-haiku-20240307-v1:0-exception,anthropic.claude-3-sonnet-20240229-v1:0-response,anthropic.claude-3-sonnet-20240229-v1:0-time_taken_in_seconds,anthropic.claude-3-sonnet-20240229-v1:0-prompt_token_count,anthropic.claude-3-sonnet-20240229-v1:0-completion_token_count,anthropic.claude-3-sonnet-20240229-v1:0-exception
0,Human: You are an assistant for question-answe...,The Heisenberg uncertainty principle states th...,The Heisenberg uncertainty principle states th...,1.152863,170,77,,The Heisenberg uncertainty principle is a fund...,4.446908,170,77,
1,Human: You are an assistant for question-answe...,The Schrödinger equation is a fundamental equa...,The Schrödinger equation is a fundamental equa...,0.841946,661,108,,The Schrödinger equation is a fundamental equa...,4.938234,661,117,
2,Human: You are an assistant for question-answe...,The greenhouse effect is a natural process tha...,The greenhouse effect is a natural process tha...,1.515494,604,132,,The greenhouse effect is a natural process whe...,3.419331,604,98,
3,Human: You are an assistant for question-answe...,"When light shines on a metal, electrons can be...",The photoelectric effect is a phenomenon in wh...,1.308103,588,117,,The photoelectric effect is a phenomenon where...,2.201291,588,120,
4,Human: You are an assistant for question-answe...,"Modern atomic models, based on quantum mechani...",```\nThe structure of the atom was determined ...,1.631458,582,141,,The structure of the atom was determined throu...,2.647344,582,129,
5,Human: You are an assistant for question-answe...,A catalyst is a substance that can be added to...,```\nCatalysts play a crucial role in chemical...,1.220836,544,100,,The role of catalysts in chemical reactions is...,1.971382,544,85,
6,Human: You are an assistant for question-answe...,The second law of thermodynamics states that t...,The second law of thermodynamics states that i...,0.996995,599,109,,The second law of thermodynamics states that i...,2.266173,599,107,
7,Human: You are an assistant for question-answe...,The phenomenon of nuclear fission. Fission occ...,```\nThe main difference between nuclear fissi...,1.20688,472,72,,The main difference between nuclear fission an...,3.56998,472,75,
8,Human: You are an assistant for question-answe...,Classical mechanics describes the physics of m...,The main differences between classical mechani...,0.686361,702,63,,The main differences between classical mechani...,1.775891,702,94,
9,Human: You are an assistant for question-answe...,If you touch a container that holds an endothe...,The main difference between endothermic and ex...,0.929525,679,86,,The main difference between endothermic and ex...,2.923872,679,92,


In [8]:
if df_resp_all is not None:
    logger.info("preparing the evaluation prompt templates for the LLM judge....")
    df_resp_all['eval_prompt'] = df_resp_all.apply(prepare_eval_prompts, axis=1)
    # persist the prompts only when there is a dataset to process
    eval_path_df: str = os.path.join(DATA_DIR, config['pdf_dir_info']['processed_eval_prompts'])
    df_resp_all.insert(0, 'prompt_id', df_resp_all.index)
    df_resp_all.to_csv(eval_path_df, index=False)
else:
    logger.error("Model evaluation dataset is not available to process.")

[2024-06-05 21:07:23,340] p36671 {3675902893.py:13} INFO - evaluation prompt template recorded: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Use the following pieces of retrieved context in the section demarcated by "```" and the question related to that task below it. There are responses from different models completing that task by answering the question below. Your task is to select the answer that best answers the question based on the task provided in terms of relevancy and correctness.
Put the selected answer (without truncating it and give the complete answer within your response), model name and explanation for selecting the answer and not selecting other answer in a JSON as within 3 elements: "best_match_answer" (which contains the full answer you select), "selected_model" (which contains the model name), and "explanation". 
Your explanation should include both model name and answer description so that it is simple to understand which answer was generated by whic

model_responses: ['\n<anthropic.claude-3-haiku-20240307-v1:0>\nThe Heisenberg uncertainty principle states that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle, such as position and momentum, can be known simultaneously. This principle arises from the wave-particle duality of quantum particles and has profound implications for our understanding of the behavior of matter at the atomic and subatomic scales.\n</anthropic.claude-3-haiku-20240307-v1:0>\n', '\n<anthropic.claude-3-sonnet-20240229-v1:0>\nThe Heisenberg uncertainty principle is a fundamental principle in quantum mechanics that states there is a limit to how precisely certain pairs of physical properties of a particle, such as position and momentum, can be measured simultaneously. It arises from the wave-particle duality of quantum particles and has profound implications for understanding the behavior of matter at atomic and subatomic scales.\n</anthropic.claude-3-sonne

[33m(raylet)[0m [2024-06-05 21:07:30,952 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3473813504; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:07:41,037 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3473375232; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:07:51,121 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3466485760; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:08:01,156 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3476672512;

In [9]:
df_resp_all

Unnamed: 0,prompt_id,user_input,model_1,anthropic.claude-3-haiku-20240307-v1:0-response,anthropic.claude-3-haiku-20240307-v1:0-time_taken_in_seconds,anthropic.claude-3-haiku-20240307-v1:0-prompt_token_count,anthropic.claude-3-haiku-20240307-v1:0-completion_token_count,anthropic.claude-3-haiku-20240307-v1:0-exception,anthropic.claude-3-sonnet-20240229-v1:0-response,anthropic.claude-3-sonnet-20240229-v1:0-time_taken_in_seconds,anthropic.claude-3-sonnet-20240229-v1:0-prompt_token_count,anthropic.claude-3-sonnet-20240229-v1:0-completion_token_count,anthropic.claude-3-sonnet-20240229-v1:0-exception,eval_prompt
0,0,Human: You are an assistant for question-answe...,The Heisenberg uncertainty principle states th...,The Heisenberg uncertainty principle states th...,1.152863,170,77,,The Heisenberg uncertainty principle is a fund...,4.446908,170,77,,<|begin_of_text|><|start_header_id|>user<|end_...
1,1,Human: You are an assistant for question-answe...,The Schrödinger equation is a fundamental equa...,The Schrödinger equation is a fundamental equa...,0.841946,661,108,,The Schrödinger equation is a fundamental equa...,4.938234,661,117,,<|begin_of_text|><|start_header_id|>user<|end_...
2,2,Human: You are an assistant for question-answe...,The greenhouse effect is a natural process tha...,The greenhouse effect is a natural process tha...,1.515494,604,132,,The greenhouse effect is a natural process whe...,3.419331,604,98,,<|begin_of_text|><|start_header_id|>user<|end_...
3,3,Human: You are an assistant for question-answe...,"When light shines on a metal, electrons can be...",The photoelectric effect is a phenomenon in wh...,1.308103,588,117,,The photoelectric effect is a phenomenon where...,2.201291,588,120,,<|begin_of_text|><|start_header_id|>user<|end_...
4,4,Human: You are an assistant for question-answe...,"Modern atomic models, based on quantum mechani...",```\nThe structure of the atom was determined ...,1.631458,582,141,,The structure of the atom was determined throu...,2.647344,582,129,,<|begin_of_text|><|start_header_id|>user<|end_...
5,5,Human: You are an assistant for question-answe...,A catalyst is a substance that can be added to...,```\nCatalysts play a crucial role in chemical...,1.220836,544,100,,The role of catalysts in chemical reactions is...,1.971382,544,85,,<|begin_of_text|><|start_header_id|>user<|end_...
6,6,Human: You are an assistant for question-answe...,The second law of thermodynamics states that t...,The second law of thermodynamics states that i...,0.996995,599,109,,The second law of thermodynamics states that i...,2.266173,599,107,,<|begin_of_text|><|start_header_id|>user<|end_...
7,7,Human: You are an assistant for question-answe...,The phenomenon of nuclear fission. Fission occ...,```\nThe main difference between nuclear fissi...,1.20688,472,72,,The main difference between nuclear fission an...,3.56998,472,75,,<|begin_of_text|><|start_header_id|>user<|end_...
8,8,Human: You are an assistant for question-answe...,Classical mechanics describes the physics of m...,The main differences between classical mechani...,0.686361,702,63,,The main differences between classical mechani...,1.775891,702,94,,<|begin_of_text|><|start_header_id|>user<|end_...
9,9,Human: You are an assistant for question-answe...,If you touch a container that holds an endothe...,The main difference between endothermic and ex...,0.929525,679,86,,The main difference between endothermic and ex...,2.923872,679,92,,<|begin_of_text|><|start_header_id|>user<|end_...


### Using an LLM as a judge in the loop to evaluate and narrow down the responses generated by different models of choice

In [10]:
def llm_judge_json_evaluations(model_id: str, prompt: str):
    # service name used as the litellm provider prefix
    service_name: str = "bedrock"
    # litellm model string for the Bedrock model (e.g. Titan, Llama, or Claude)
    bedrock_model: str = f"{service_name}/{model_id}"
    # current AWS region; fall back to the region in config.yaml if the session has none
    aws_region = boto3.Session().region_name or config['aws']['region']
    # initialize the response dict
    ret = dict(exception = None,
               user_prompt=None,
               prompt = prompt,
               completion = None,
               # initializing to 0 since none type throws an error later, this is used to calculate price per token input/output on ODT pricing
               completion_token_count = 0,
               # initializing to 0 since none type throws an error later
               prompt_token_count=0,
               input_token_cost = None, 
               output_token_cost = None,
               model_id = model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    parameters = config['inference_parameters']
    temperature = parameters.get('temperature', 0.1)
    caching = parameters.get('caching', False)
    max_tokens = parameters.get("max_tokens", 500)
    try:
        # call the litellm completion API with the evaluation prompt as a single user message
        logger.info(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{ "content": body,"role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)
        # keep the content of the last non-empty choice in the response
        for choice in response.choices:
            # extract the message content returned by litellm
            if choice.message and choice.message.content:
                ret["completion"] = choice.message.content.strip()
        # extract the prompt and completion token counts reported by litellm
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret

In [11]:
def get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
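    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"row={row}")
    model_id = model_info['model']
    # the evaluation prompt for this row is the payload for the judge model
    prompt = row['eval_prompt']
    # ask the judge model to evaluate the candidate responses in the prompt
    resp = llm_judge_json_evaluations(model_id, prompt)
    # keep the pre-existing (reference) response alongside the judge output; if `target_response_col`
    # is not configured, fall back to the source column name instead of raising a KeyError
    dataset_info = config['dataset_info']
    target_col = dataset_info.get('target_response_col', dataset_info['pre_existing_response_col'])
    resp[target_col] = row[dataset_info['pre_existing_response_col']]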
    # calculate the input and output token price for all of the calls
    resp['input_token_cost'] = (resp['prompt_token_count']/1000) * model_info['input_tokens_pricing']
    resp['output_token_cost'] = (resp['completion_token_count']/1000) * model_info['output_tokens_pricing']
    dir_path = os.path.join(config['pdf_dir_info']['llm_as_a_judge_dir'], str(row['prompt_id']), model_id.replace(":", "-"))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"model_evaluation_{row['prompt_id']}.json")
    logger.info(f"writing response={resp} to {fpath}")
    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))
    logger.info(f"response {i}: {resp}")
    return resp

@ray.remote
def async_get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return get_inference(i, row, total, model_info)

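The cost calculation in `get_inference` multiplies the token count, expressed in thousands, by a per-1K-token price. A small worked sketch follows; the two prices are assumptions chosen for illustration, and the real values come from `input_tokens_pricing` and `output_tokens_pricing` in the judge's config section.

```python
def token_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of token_count tokens at a price quoted per 1,000 tokens."""
    return (token_count / 1000) * price_per_1k_tokens

# assumed per-1K-token prices, for illustration only
INPUT_PRICE_PER_1K = 0.00265
OUTPUT_PRICE_PER_1K = 0.0035

# token counts of a single judge call: 1211 prompt tokens, 163 completion tokens
input_cost = token_cost(1211, INPUT_PRICE_PER_1K)
output_cost = token_cost(163, OUTPUT_PRICE_PER_1K)
print(f"input={input_cost:.6f}, output={output_cost:.6f}, total={input_cost + output_cost:.6f}")
```

Initializing the token counts to `0` rather than `None`, as `llm_judge_json_evaluations` does, keeps this arithmetic from failing when a call raises an exception before usage is reported.
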
In [12]:
df_resp_all = json.loads(df_resp_all.to_json(orient='records'))
n: int = 8
resp_list: List = []
erroneous_count = 0  # To keep track of errors
st = time.perf_counter()
EVAL_MODEL_INFO: Dict = config['llm_as_a_judge_info']
logger.info(f"------ running inference for {EVAL_MODEL_INFO.get('model')} -----")

# Split the input list
list_of_lists = [df_resp_all[i * n:(i + 1) * n] for i in range((len(df_resp_all) + n - 1) // n)]
logger.info(f"split input list of size {len(df_resp_all)} into {len(list_of_lists)} lists")

# Process each list
for idx, l in enumerate(list_of_lists):
    try:
        logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)}")
        resp_list.extend(ray.get([async_get_inference.remote(i + 1, e, len(l), EVAL_MODEL_INFO) for i, e in enumerate(l)]))
    except Exception as e:
        logger.error(f"Error processing list {idx+1}/{len(list_of_lists)}: {e}")
        erroneous_count += 1

elapsed_time = time.perf_counter() - st
logger.info(f"------ model={EVAL_MODEL_INFO.get('model')} completed in {elapsed_time} ------")
logger.info(f"Total erroneous lists: {erroneous_count}")

[2024-06-05 21:09:00,934] p36671 {2969118206.py:7} INFO - ------ running inference for meta.llama3-70b-instruct-v1:0 -----
[2024-06-05 21:09:00,935] p36671 {2969118206.py:11} INFO - split input list of size 10 into 2 lists
[2024-06-05 21:09:00,936] p36671 {2969118206.py:16} INFO - getting inference for list 1/2, size of list=8
[33m(raylet)[0m [2024-06-05 21:09:01,513 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3480121344; capacity: 245107195904. Object creation will fail if spilling is required.
[36m(async_get_inference pid=36686)[0m [2024-06-05 21:09:02,955] p36686 {363383627.py:28} INFO - Invoking bedrock/meta.llama3-70b-instruct-v1:0......
[36m(async_get_inference pid=36686)[0m [2024-06-05 21:09:02,962] p36686 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials


[36m(async_get_inference pid=36686)[0m row={'prompt_id': 3, 'user_input': 'Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don\'t know the answer just say that you don\'t know. Use three sentences maximum and keep the answer concise.\n\n```\nThe photoelectric effect is a phenomenon in which electrons are emitted from a metal surface when light of sufficient energy (above a certain threshold frequency) shines on it. This effect was first observed by Heinrich Hertz in 1887 and later explained by Albert Einstein in 1905, for which he received the Nobel Prize in Physics in 1921.\n\nThe photoelectric effect contradicted classical physics, which predicted that the energy of the emitted electrons should increase with the intensity of the incident light. However, experiments showed that the energy of the emitted electrons depended only on the frequency of the light and no

[36m(async_get_inference pid=36688)[0m [92m21:09:03 - LiteLLM:INFO[0m: utils.py:1133 - [92m
[36m(async_get_inference pid=36688)[0m Request Sent from LiteLLM:
[36m(async_get_inference pid=36688)[0m 
[36m(async_get_inference pid=36688)[0m             response = client.invoke_model(
[36m(async_get_inference pid=36688)[0m                 body={"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nUse the following pieces of retrieved context in the section demarcated by \"```\" and the question related to that task below it. There are responses from different models completing that task by answering the question below. Your task is to select the answer that best answers the question based on the task provided in terms of relevancy and correctness.\nPut the selected answer (without truncating it and give the complete answer within your response), model name and explanation for selecting the answer and not selecting other answer in a JSON as within 3 elements: 

[36m(async_get_inference pid=36687)[0m row={'prompt_id': 8, 'user_input': 'Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don\'t know the answer just say that you don\'t know. Use three sentences maximum and keep the answer concise.\n\n```\nClassical mechanics and quantum mechanics are two fundamental theories in physics that describe the behavior of particles and systems at different scales and under different conditions.\n\nClassical Mechanics:\nClassical mechanics is a branch of physics that deals with the motion and behavior of macroscopic objects, such as planets, stars, and everyday objects. It is based on the principles of Newtonian mechanics and is governed by Newton\'s laws of motion, as well as concepts like energy, momentum, and force.\n\nClassical mechanics assumes that particles and objects have well-defined positions and velocities at any given time

[36m(async_get_inference pid=36690)[0m [2024-06-05 21:09:08,850] p36690 {363383627.py:46} INFO - completion: {"best_match_answer": "The Schrödinger equation is a fundamental equation in quantum mechanics that describes the behavior of particles at the quantum level. It relates the wave function, which contains information about the particle's quantum state, to the particle's energy and potential energy. Solving the Schrödinger equation provides a probabilistic description of the particle's behavior and is essential for understanding atomic and molecular structure, solid-state physics, quantum computing, and quantum chemistry.", "selected_model": "dummy_model", "explanation": "I selected dummy_model because it provides a concise and clear answer that covers the main aspects of the Schrödinger equation, including its purpose, relation to wave function and energy, and its applications. The other models provide similar information, but dummy_model's answer is more comprehensive and well-

[36m(async_get_inference pid=36690)[0m row={'prompt_id': 9, 'user_input': 'Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don\'t know the answer just say that you don\'t know. Use three sentences maximum and keep the answer concise.\n\n```\nIn chemistry, chemical reactions can be classified as either endothermic or exothermic based on the energy changes that occur during the reaction process. The main difference between endothermic and exothermic reactions lies in the direction of energy flow and the temperature changes associated with the reaction.\n\nEndothermic Reactions:\nEndothermic reactions are chemical processes that absorb energy from the surroundings in the form of heat. In an endothermic reaction, the reactants require an input of energy to break existing bonds and form new bonds, resulting in an increase in the overall energy of the system.\n\nEndothe

2024-06-05 21:09:09,521	ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::async_get_inference() (pid=36691, ip=127.0.0.1)
  File "/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/ipykernel_36671/2172491095.py", line 26, in async_get_inference
  File "/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/ipykernel_36671/2172491095.py", line 10, in get_inference
KeyError: 'target_response_col'
[36m(async_get_inference pid=36691)[0m [2024-06-05 21:09:09,520] p36691 {363383627.py:46} INFO - completion: {"best_match_answer": "Catalysts play a crucial role in chemical reactions by providing an alternative pathway with a lower activation energy, which allows the reaction to proceed more easily and at a faster rate. They do this by interacting with the reactants and forming intermediate species that require less energy to undergo the necessary bond rearrangements. Catalysts can significantly increase the rate of a chemical reaction, sometimes b

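The batching loop above wraps `ray.get` for a whole batch in one `try`, so a single failing task (such as the `KeyError` in the log) raises for the batch and none of that batch's results reach `resp_list`. The sketch below shows the same chunking arithmetic in plain Python and one way to isolate failures per row instead. `chunk`, `safe_call`, and `fake_inference` are hypothetical helpers, and no Ray calls are made here; with Ray, the equivalent change would be to catch the exception inside the remote function.

```python
from typing import Any, Callable, Dict, List

def chunk(items: List[Any], n: int) -> List[List[Any]]:
    """Split items into consecutive lists of at most n elements."""
    return [items[i:i + n] for i in range(0, len(items), n)]

def safe_call(fn: Callable[[Dict], Dict], row: Dict) -> Dict:
    """Run fn on one row and record any exception instead of raising it."""
    try:
        return fn(row)
    except Exception as e:
        return {"prompt_id": row.get("prompt_id"), "exception": repr(e)}

def fake_inference(row: Dict) -> Dict:
    # hypothetical stand-in for the judge call; fails for one row on purpose
    if row["prompt_id"] == 5:
        raise KeyError("target_response_col")
    return {"prompt_id": row["prompt_id"], "exception": None}

rows = [{"prompt_id": i} for i in range(10)]
batches = chunk(rows, 8)
results = [safe_call(fake_inference, r) for batch in batches for r in batch]
failed = [r for r in results if r["exception"] is not None]
print(len(batches), [len(b) for b in batches], len(failed))  # prints: 2 [8, 2] 1
```

Recording the exception per row keeps the nine successful results and makes the failing `prompt_id` visible, rather than counting only the number of failed batches.
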
### Visualize `LLM as a judge` completions and get more evaluation metrics

In [19]:
# collect all evaluation files written by the LLM judge
fpath_evaluated_files = os.path.join(config['pdf_dir_info']['llm_as_a_judge_dir'], "**", "*", "*.json")
eval_metric_files = glob.glob(fpath_evaluated_files, recursive=True)
logger.info(f"there are {len(eval_metric_files)} evaluated files by {config['llm_as_a_judge_info']['model']} LLM judge in {fpath_evaluated_files}")

[2024-06-05 21:11:37,713] p36671 {197194707.py:4} INFO - there are 10 evaluated files by meta.llama3-70b-instruct-v1:0 LLM judge in eval_completions/**/*/*.json

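The glob pattern relies on the directory layout written by `get_inference`: one directory per `prompt_id`, one subdirectory per judge model (with `:` replaced by `-`), and a `model_evaluation_<prompt_id>.json` file inside. A minimal sketch that recreates this layout in a temporary directory and applies the same pattern is shown here; the temporary directory and the file contents are placeholders.

```python
import glob
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # ":" in the model id is replaced by "-", as in get_inference
    model_dir = "meta.llama3-70b-instruct-v1:0".replace(":", "-")
    for prompt_id in (0, 1):
        dir_path = os.path.join(root, str(prompt_id), model_dir)
        os.makedirs(dir_path, exist_ok=True)
        fpath = os.path.join(dir_path, f"model_evaluation_{prompt_id}.json")
        with open(fpath, "w") as f:
            json.dump({"prompt_id": prompt_id}, f)

    # "**" matches the prompt_id level, "*" the model level
    pattern = os.path.join(root, "**", "*", "*.json")
    found = sorted(glob.glob(pattern, recursive=True))
    relative = [os.path.relpath(p, root) for p in found]

print(relative)
```

Since the pattern matches every JSON file at that depth, files left over from an earlier run with a different judge model would also be picked up unless the output directory is cleared first.
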

In [20]:
def extract_sections(text: str) -> Optional[str]:
    """
    Extracts the question from a prompt: the text between `Question:` and the next "```" fence.
    Returns None if no question is found.
    """
    try:
        question_match = re.search(r'Question:(.*?)```', text, re.DOTALL)
        question = question_match.group(1).strip() if question_match else None
    except Exception as e:
        logger.error(f"The question was not extracted: {e}")
        question = None
    return question

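A short usage sketch of the question-extraction regex follows. It reuses the same pattern under a hypothetical name, `extract_question`, and the sample prompt is an invented layout that assumes the question is followed by a "```" fence.

```python
import re
from typing import Optional

def extract_question(text: str) -> Optional[str]:
    """Return the text between 'Question:' and the next ``` fence, if any."""
    match = re.search(r'Question:(.*?)```', text, re.DOTALL)
    return match.group(1).strip() if match else None

# hypothetical prompt layout: context fence, question, then another fence
sample = "```\nsome context\n```\nQuestion: What is a catalyst?\n```\nanswers\n```"
question = extract_question(sample)
no_question = extract_question("Question: no closing fence here")
print(question, no_question)
```

The non-greedy `(.*?)` stops at the first fence after `Question:`, and `re.DOTALL` lets the question span several lines. A prompt without a fence after the question yields `None`.
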
[33m(raylet)[0m [2024-06-05 21:11:42,534 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3475947520; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:11:52,622 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3475906560; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:12:02,716 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3474362368; capacity: 245107195904. Object creation will fail if spilling is required.
[33m(raylet)[0m [2024-06-05 21:12:12,814 E 36682 3738248] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-06-05_21-07-19_057326_36671 is over 95% full, available space: 3474190336;

In [15]:
os.makedirs(config['pdf_dir_info']['metrics'], exist_ok=True)
model_evaluation_responses = []
for f in eval_metric_files:
    with open(f, 'r') as file:
        model_evaluation_responses.append(json.loads(file.read()))
# results_df will contain the evaluation responses, including the completion and the model id
results_df = pd.DataFrame(model_evaluation_responses)
raw_llm_as_a_judge_responses: str = config['pdf_dir_info']['raw_llm_as_a_judge_completions']
raw_llm_fpath: str = os.path.join(METRICS_DIR, raw_llm_as_a_judge_responses)
results_df = results_df.dropna(axis=1, how='all')
results_df.to_csv(raw_llm_fpath, index=False)
results_df.head(10)

Unnamed: 0,prompt,completion,completion_token_count,prompt_token_count,input_token_cost,output_token_cost,model_id,dummy_model_response
0,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The main difference bet...",163,1211,0.003209,0.000571,meta.llama3-70b-instruct-v1:0,If you touch a container that holds an endothe...
1,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The Heisenberg uncertai...",181,767,0.002033,0.000633,meta.llama3-70b-instruct-v1:0,The Heisenberg uncertainty principle states th...
2,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The main difference bet...",128,1017,0.002695,0.000448,meta.llama3-70b-instruct-v1:0,The phenomenon of nuclear fission. Fission occ...
3,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The second law of therm...",179,1209,0.003204,0.000626,meta.llama3-70b-instruct-v1:0,The second law of thermodynamics states that t...
4,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The Schrödinger equatio...",173,1340,0.003551,0.000605,meta.llama3-70b-instruct-v1:0,The Schrödinger equation is a fundamental equa...
5,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The main differences be...",149,1252,0.003318,0.000521,meta.llama3-70b-instruct-v1:0,Classical mechanics describes the physics of m...
6,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The structure of the at...",229,1371,0.003633,0.000802,meta.llama3-70b-instruct-v1:0,"Modern atomic models, based on quantum mechani..."
7,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The photoelectric effec...",217,1240,0.003286,0.000759,meta.llama3-70b-instruct-v1:0,"When light shines on a metal, electrons can be..."
8,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The greenhouse effect i...",189,1224,0.003244,0.000661,meta.llama3-70b-instruct-v1:0,The greenhouse effect is a natural process tha...
9,<|begin_of_text|><|start_header_id|>user<|end_...,"{""best_match_answer"": ""The role of catalysts i...",169,1126,0.002984,0.000592,meta.llama3-70b-instruct-v1:0,A catalyst is a substance that can be added to...


In [16]:
def replace_unescaped_quotes(pairs):
    """
    object_pairs_hook for json.loads: backslash-escape the single and double
    quotes found inside string values
    """
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, str):
            value = value.replace("'", r"\'").replace('"', r'\"')
        new_pairs.append((key, value))
    return dict(new_pairs)

def clean_model_eval_json(data):
    """
    Parse the JSON completion returned by the LLM judge and return the best matching
    answer, the selected model and the explanation as a pandas Series. A completion
    that cannot be parsed yields a Series of None values.
    """
    try:
        # If the completion is wrapped in an extra pair of double quotes, strip the
        # wrapper so that the JSON object inside it can be parsed
        if data.startswith('"') and data.endswith('"'):
            data = data[1:-1]

        json_data = json.loads(data, object_pairs_hook=replace_unescaped_quotes)

        # Remove angle brackets from the selected_model value
        selected_model = json_data['selected_model']
        json_data['selected_model'] = re.sub(r'[<>]', '', selected_model)

        return pd.Series({
            'best_match_answer': json_data['best_match_answer'],
            'selected_model': json_data['selected_model'],
            'explanation': json_data['explanation'],
        })
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # TypeError/AttributeError cover non-string input such as a missing (NaN) completion
        logger.error(f"Invalid JSON data: {data}")
        return pd.Series({
            'best_match_answer': None,
            'selected_model': None,
            'explanation': None,
        })
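
To make the expected input concrete, here is a minimal, standalone sketch of parsing one judge completion of the shape seen in the `completion` column. The sample string and the helper name `parse_judge_completion` are made up for illustration; only the three keys (`best_match_answer`, `selected_model`, `explanation`) come from the notebook.

```python
import json
import re

# hypothetical completion, shaped like the rows in the `completion` column
sample_completion = (
    '{"best_match_answer": "Catalysts lower the activation energy.", '
    '"selected_model": "<model_1>", '
    '"explanation": "Most complete and relevant answer."}'
)

def parse_judge_completion(raw: str) -> dict:
    """Illustrative helper (not part of the notebook): parse one judge completion."""
    try:
        parsed = json.loads(raw)
        return {
            "best_match_answer": parsed["best_match_answer"],
            # drop angle brackets that the judge may echo back from the prompt
            "selected_model": re.sub(r"[<>]", "", parsed["selected_model"]),
            "explanation": parsed["explanation"],
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # malformed completions yield an all-None record instead of raising
        return {"best_match_answer": None, "selected_model": None, "explanation": None}

good = parse_judge_completion(sample_completion)
bad = parse_judge_completion("not json at all")
print(good["selected_model"])
print(bad["selected_model"])
```

Returning an all-None record for bad input keeps the row count aligned with `results_df`, so later column assignments by index still line up.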

In [17]:
def tidy_split(df, column, sep=',', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
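
For comparison, the same one-row-per-value expansion can be sketched with pandas' built-in `str.split` and `explode`. This is an alternative under stated assumptions, not the notebook's method: the toy frame and model names are placeholders, and unlike `tidy_split` it has no `keep` option.

```python
import pandas as pd

# toy judge output; the model names are placeholders
toy_df = pd.DataFrame({
    "selected_model": ["model_1", "model_1, model_2", None],
    "explanation": ["a", "b", "c"],
})

exploded = (
    toy_df.dropna(subset=["selected_model"])  # drop rows with no selection
    .assign(selected_model=lambda d: d["selected_model"].str.split(","))
    .explode("selected_model")  # one row per selected model
)
# remove the whitespace left over after splitting on ","
exploded["selected_model"] = exploded["selected_model"].str.strip()
print(exploded)
```

The row where the judge picked two models appears twice, once per model, with the other columns repeated.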

In [18]:
new_results_df = results_df['completion'].apply(clean_model_eval_json)
# removing any unnecessary characters from the selected_model if any
new_results_df['selected_model'] = new_results_df['selected_model'].str.replace(r'<[^>]+>', '', regex=True)
# split comma separated selections so that each selected model gets its own row
new_exploded_df = tidy_split(new_results_df, 'selected_model', sep=',')
pre_existing_response_col: str = config['dataset_info']['pre_existing_response_col']
new_results_df[pre_existing_response_col] = results_df[pre_existing_response_col]
logger.info(f"All evaluation data is read into a dataframe of shape {results_df.shape}")
# move the pre-existing response next to the selected model and apply the new column order
cols = new_results_df.columns.tolist()
idx = cols.index('selected_model')
cols.insert(idx + 1, cols.pop(cols.index(pre_existing_response_col)))
new_results_df = new_results_df[cols]
# display the selected model, the pre-existing response and the judge's explanation side by side
new_results_df.head(20)

2024-06-05 21:09:14,303	ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): [36mray::async_get_inference()[39m (pid=36687, ip=127.0.0.1)
  File "/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/ipykernel_36671/2172491095.py", line 26, in async_get_inference
  File "/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/ipykernel_36671/2172491095.py", line 10, in get_inference
KeyError: 'target_response_col'
[36m(async_get_inference pid=36687)[0m [92m21:09:14 - LiteLLM:INFO[0m: utils.py:2911 - Wrapper: Completed Call, calling success_handler
[36m(async_get_inference pid=36687)[0m [2024-06-05 21:09:14,296] p36687 {utils.py:2911} INFO - Wrapper: Completed Call, calling success_handler
[36m(async_get_inference pid=36687)[0m [2024-06-05 21:09:14,296] p36687 {363383627.py:46} INFO - completion: {"best_match_answer": "The main differences between classical mechanics and quantum mechanics are: 1. Scale: Classical mechanics applies to macroscopic objects,

KeyError: 'model_1'


In [None]:
initial_df = pd.read_csv('data/processed_eval_prompts.csv')
dataset_info = config['dataset_info']
target_response_col = dataset_info['target_response_col']
# pick the column that holds the user question, depending on whether a system prompt column is configured
if dataset_info['system_prompt_col'] is not None:
    question_col = dataset_info['user_prompt']
else:
    question_col = dataset_info['user_question_col']
# Left merge the judge results with the original prompts on the target response column
merged_df = pd.merge(new_results_df,
                     initial_df[[target_response_col, question_col]],
                     on=target_response_col,
                     how='left')
processed_prompts_for_eval_path = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_as_a_judge_comparisons'])
merged_df.to_csv(processed_prompts_for_eval_path, index=False)
merged_df

In [None]:
# Convert the DataFrame to JSON
merged_df_json = merged_df.to_json(orient='records')

# Save the JSON to a text file
with open(JSON_TXT_FILE_PATH, 'w') as json_text_file:
    json_text_file.write(merged_df_json)
logger.info(f"JSON saved to: {JSON_TXT_FILE_PATH}")
logger.info(f"CSV saved to: {processed_prompts_for_eval_path}")

In [None]:
# Compute how often each model was selected by the judge, as a percentage of all selections
new_exploded_df['selected_model'] = new_exploded_df['selected_model'].str.strip()
# name the index and the value column explicitly so the result does not depend on the pandas version
response_index_percentage_df = (
    new_exploded_df['selected_model']
    .value_counts(normalize=True)
    .rename_axis('selected_model')
    .reset_index(name='proportion')
)
response_distribution_fpath = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_as_a_judge_pick_rate'])
response_index_percentage_df['proportion'] *= 100
response_index_percentage_df.to_csv(response_distribution_fpath, index=False)
response_index_percentage_df.head(10)
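
As a small worked example of the pick rate, the sketch below runs the same `value_counts(normalize=True)` calculation on four made-up selections. The model names and the `pick_rate_pct` column name are placeholders, not values from the notebook's data.

```python
import pandas as pd

# four hypothetical judge selections
picks = pd.Series(["model_1", "model_2", "model_1", "model_1"], name="selected_model")

pick_rate = (
    picks.value_counts(normalize=True)  # fraction of selections per model
    .mul(100)                           # convert to a percentage
    .rename_axis("selected_model")
    .reset_index(name="pick_rate_pct")
)
print(pick_rate)
```

Because the percentages are computed on the exploded frame, a question where the judge names two models contributes two selections, so the denominator is the number of selections rather than the number of questions.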

### Final Summary: `LLM evaluation`

In [None]:
# simple function to get a final summary on all of the data provided from LLM as a judge
def final_analysis_summary(bedrock: botocore.client.BaseClient,
                           prompt: str) -> Optional[str]:
    """
    Send the final summary prompt (which contains all of the explanations from the
    LLM as a judge) to the final analysis model on Bedrock and return the generated
    summary text. Returns None if the model invocation fails.
    """
    modelId = FINAL_ANALYSIS_MODEL_ID
    # request body in the Anthropic messages format expected by Bedrock
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })

    try:
        response = bedrock.invoke_model(modelId=modelId, body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        # swap double quotes for single quotes in the generated text
        llm_response = response_body['content'][0]['text'].replace('"', "'")

    except Exception as e:
        logger.error(f"exception={e}")
        llm_response = None
    return llm_response
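
The request and response handling can be exercised without calling Bedrock. The sketch below uses a stub client that returns a canned reply in the same shape the function reads (`content[0].text`). `StubBedrockClient`, `summarize` and the `example-model-id` value are hypothetical names introduced only for this illustration.

```python
import io
import json

class StubBedrockClient:
    """Hypothetical stand-in for the bedrock-runtime client; returns a canned reply."""
    def invoke_model(self, modelId: str, body: str) -> dict:
        request = json.loads(body)
        prompt_text = request["messages"][0]["content"][0]["text"]
        reply = {"content": [{"type": "text", "text": f'Summary of: "{prompt_text}"'}]}
        # the real client returns a stream under 'body'; BytesIO mimics its read()
        return {"body": io.BytesIO(json.dumps(reply).encode("utf-8"))}

def summarize(client, prompt: str, model_id: str = "example-model-id") -> str:
    # same request body layout as the notebook's function
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })
    response = client.invoke_model(modelId=model_id, body=body)
    response_body = json.loads(response["body"].read().decode("utf-8"))
    # double quotes in the reply are swapped for single quotes
    return response_body["content"][0]["text"].replace('"', "'")

result = summarize(StubBedrockClient(), "model_1 was picked most often")
print(result)
```

A stub like this makes it possible to check the parsing path (and the quote replacement) in isolation before spending tokens on a real invocation.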

In [None]:
new_results_df

In [None]:
with open(config['pdf_dir_info']['all_explanations'], 'w') as file:
    for index, row in new_results_df.iterrows():
        file.write(f"Selected Model: {row['selected_model']}\nExplanation: {row['explanation']}\n\n")
# Read the content back to use as analysis context
with open(config['pdf_dir_info']['all_explanations'], 'r') as file:
    analysis_context = file.read()
print(analysis_context)

In [None]:
# open the prompt template and prepare it for inference
with open(config['pdf_dir_info']['claude_final_summary_eval_prompt'], 'r') as file:
    final_summary_prompt = file.read()
    processed_summary_eval_prompt: str = final_summary_prompt.format(context=analysis_context)

endpoint_url: str = config['bedrock_ep_url'].format(region=config['aws']['region'])
bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=endpoint_url)
final_analysis: str = final_analysis_summary(bedrock, prompt=processed_summary_eval_prompt)

In [None]:
final_analysis

In [None]:
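# final_analysis is None when the model invocation failed, so only write a real summary
if final_analysis is not None:
    with open(config['pdf_dir_info']['final_summary_analysis'], "a") as f:
        f.write(final_analysis + "\n")
else:
    logger.error("No final analysis was generated, nothing was written to the summary file")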

