## Evaluate Model Responses using _LLM as a judge_
---

This notebook does the following:

1. Reads all the responses from the previous inference step and evaluates them using an _LLM as a judge_. For each question and context, the judge selects the best model and its response, and provides a subjective explanation for that choice.

1. Records metrics such as the `p90` and `p95` latency, and writes `explanation` files describing why the _LLM as a judge_ selected a given model and rejected the others, based on correctness and relevancy.

1. Uses a _Final LLM as a summarizer_ to parse all of the subjective evaluations/explanations provided by the _LLM as a judge_, and produces a final analysis of the trends and patterns in model performance along with a summary of which model is preferred for a given use case/dataset.

*The model to be used as a judge and the final analysis summarizer can be configured in the `llm_as_a_judge_info` and the `final_analysis_summarizer` sections in the [config.yaml](config.yaml) file.*
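
The latency percentiles mentioned in step 2 can be illustrated with a minimal sketch that uses only the standard library. This is not necessarily how the notebook computes them: `latency_percentiles` is a hypothetical helper, and the sample values are the Claude 3 Haiku `time_taken_in_seconds` figures shown in the inference results table of this notebook.

```python
import statistics
from typing import Dict, List


def latency_percentiles(latencies: List[float]) -> Dict[str, float]:
    """Return the p90 and p95 of a list of latencies (in seconds)."""
    # n=100 yields 99 cut points; index 89 is the 90th percentile and
    # index 94 the 95th. "inclusive" interpolates linearly between the
    # observed minimum and maximum.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p90": cuts[89], "p95": cuts[94]}


# per-prompt latencies for one model, in seconds
haiku_latencies = [3.131573, 4.926864, 5.654147, 5.587937, 5.674264,
                   5.593801, 4.400713, 3.910536, 3.582687, 4.25086]
summary = latency_percentiles(haiku_latencies)
print(summary)
```

With only ten samples, both percentiles fall between the two slowest calls, so treat them as rough indicators until more prompts have been run.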

In [1]:
# import the libraries
import os
import re
import ray
import json
import glob
import yaml
import time
import boto3
import logging
import botocore
import textwrap
import pandas as pd
from pathlib import Path
from functools import reduce
from litellm import completion
from typing import Dict, List, Optional

In [2]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [3]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-06-07 13:33:06,944	INFO worker.py:1752 -- Started a local Ray instance.


Python version: 3.11.7
Ray version: 2.10.0


In [4]:
# global constants
CONFIG_FILE_PATH = "config.yaml"

# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-06-07 13:33:08,059] p96314 {2927026569.py:8} INFO - config read from config.yaml -> {
  "app_name": "llm-as-a-judge-eval-pipeline",
  "aws": {
    "region": "us-east-1"
  },
  "run_steps": {
    "1_get_inference.ipynb": true,
    "2_get_llm_as_a_judge_eval.ipynb": true
  },
  "pdf_dir_info": {
    "data_dir": "data",
    "dataset_dir": "source_data",
    "dataset_file_name": "data_user_system_prompt_version.csv",
    "metrics": "results",
    "llm_as_a_judge_dir": "eval_completions",
    "prompt_dir": "prompt_template",
    "llm_as_a_judge_completions": "llm_as_a_judge_completions.csv",
    "raw_llm_as_a_judge_completions": "raw_llm_responses.csv",
    "llm_as_a_judge_comparisons": "llm_as_a_judge_comparisons.csv",
    "llm_comparisons_txt": "llm_as_a_judge_comparisons.txt",
    "llm_as_a_judge_pick_rate": "llm_as_a_judge_pick_rate.csv",
    "eval_prompt_template": "llama3_eval_prompt.txt",
    "prompt_template": "prompt_template.txt",
    "processed_eval_prompts": "processed_eva

In [5]:
# initialize all global variables that are used across this notebook hydrated from the `config.yaml` file

# name of your csv file (containing the dataframe)
FILE_NAME: str = config['pdf_dir_info']['dataset_file_name']
# data directory
DATA_DIR: str = config['pdf_dir_info']['data_dir']

# result files
INFERENCE_LATENCY_SUMMARY_FPATH = os.path.join(DATA_DIR, config['pdf_dir_info']['inference_latency_summary_fname'])
METRICS_DIR: str = os.path.join(DATA_DIR, config['pdf_dir_info']['metrics'])
JSON_TXT_FILE_PATH: str = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_comparisons_txt'])
ALL_EXPLANATIONS_FPATH: str = os.path.join(METRICS_DIR, config['pdf_dir_info']['all_explanations'])
FINAL_ANALYSIS_MODEL_ID: str = config['final_analysis_summarizer']
FINAL_SUMMARY_ANALYSIS: str = os.path.join(METRICS_DIR, config['pdf_dir_info']['final_summary_analysis'])
bedrock_model_ids: List[str] = config['bedrock_fms_to_test']
USER_PROMPT_COL: str = config['dataset_info']['user_question_col']
SYSTEM_PROMPT_COL: str = config['dataset_info']['system_prompt_col']
INFERENCE_PARAMETERS: Dict = config['inference_parameters']
ON_LIST = list(filter(None, [USER_PROMPT_COL, 
                            SYSTEM_PROMPT_COL]))

In [6]:
def wrap_text(df, width):
    """
    This function wraps the text in a specific cell to a given
    width
    """
    for col in df.columns:
        df[col] = df[col].apply(lambda x: '\n'.join(textwrap.wrap(str(x), width)))
    return df
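
One side effect of `wrap_text` is worth noting: every cell passes through `str(...)`, so numeric and missing values become strings. The sketch below reproduces the per-cell logic without pandas; `wrap_cell` is a hypothetical name used only for this illustration.

```python
import textwrap


def wrap_cell(value, width: int) -> str:
    """Apply the same per-cell transformation that wrap_text uses."""
    return "\n".join(textwrap.wrap(str(value), width))


# long text is broken into lines of at most `width` characters
wrapped = wrap_cell("The Heisenberg uncertainty principle states that", 40)
print(repr(wrapped))

# non-string values are converted to strings as well
print(repr(wrap_cell(3.131573, 40)))      # a float becomes text
print(repr(wrap_cell(0, 40)))             # so does an integer id
print(repr(wrap_cell(float("nan"), 40)))  # a missing value becomes "nan"
```

If numeric columns are needed later, for example to compute latency statistics, they have to be read before wrapping or converted back afterwards.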

### Use _LLM as a Judge_ Evaluations
---
In this portion:

1. The responses generated by each model are evaluated on relevance and meaning by a model of your choice that acts as a `Judge`. The prompt for the judge model can be viewed and tweaked for different use cases in the [prompt_template/](prompt_template/) directory. Review and edit this prompt based on your use case and criteria for subjective evaluation.

1. The role of the judge model is to compare the responses generated by each model with the responses already provided in the source dataset (if any). It reports the selected model, the selected response, and an explanation of its choice, including a detailed comparison with the other responses.

*Note: For more information on using a model as a judge, see: https://huggingface.co/learn/cookbook/en/llm_judge*

In [7]:
def prepare_eval_prompts(row):
    """
    This function prepares the evaluation prompt by inserting the responses generated by the various models into the evaluation prompt template.
    """
    eval_template: Optional[str] = None
    processed_eval_template: Optional[str] = None
    model_responses: List[str] = []
    try:
        # file path to the eval template
        eval_template_path: str = config['llm_as_a_judge_info']['prompt_template']
        with open(eval_template_path, "r") as f:
            eval_template = f.read()
            logger.info(f"evaluation prompt template recorded: {eval_template}")
    except FileNotFoundError:
        # without the template there is nothing to format, so log the error and re-raise
        logger.error(f"Evaluation template not found at {eval_template_path}")
        raise
    for column in row.index:
        if column.endswith("-response") and column != config['dataset_info']['pre_existing_response_col']:
            model_id = column.split("-response")[0]
            model_response = row[column]
            model_responses.append(f"\n<{model_id}>\n{model_response}\n</{model_id}>\n")
    print(f"model_responses: {model_responses}")

    if config['dataset_info']['system_prompt_col'] is not None:
        # if the system prompt is provided in the dataset, it is used as context
        processed_eval_template = eval_template.format(
            context=row[config['dataset_info']['system_prompt_col']], 
            question=row[config['dataset_info']['user_question_col']], 
            original_answer=row[config['dataset_info']['pre_existing_response_col']],
            model_responses="\n".join(model_responses)
        )
    else:
        # if the system prompt is not provided, the user column is assumed to have the context and so 
        # all the context is fit into the question itself
        processed_eval_template = eval_template.format(
            context=" ", 
            question=row[config['dataset_info']['user_question_col']], 
            original_answer=row[config['dataset_info']['pre_existing_response_col']],
            model_responses="\n".join(model_responses)
        )
    return processed_eval_template
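
The tagging and formatting performed by `prepare_eval_prompts` can be seen in isolation with the sketch below. The template here is a hypothetical miniature, not the one shipped in the prompt template directory, and `tag_responses` is an illustrative helper. Because the template is filled with `str.format`, any literal braces in it (such as a JSON example for the judge) must be doubled.

```python
from typing import Dict

# hypothetical miniature template; note the doubled braces around the JSON
TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Original answer: {original_answer}\n"
    "Candidate responses:{model_responses}\n"
    'Reply as JSON: {{"selected_model": "...", "explanation": "..."}}'
)


def tag_responses(responses: Dict[str, str]) -> str:
    """Wrap each response in <model_id>...</model_id> tags."""
    return "\n".join(f"\n<{m}>\n{r}\n</{m}>\n" for m, r in responses.items())


prompt = TEMPLATE.format(
    context=" ",
    question="What is a catalyst?",
    original_answer="A substance that speeds up a reaction.",
    model_responses=tag_responses({
        "model-a": "It speeds up reactions.",
        "model-b": "It is a fuel.",
    }),
)
print(prompt)
```

A template with single, undoubled braces around literal JSON would make `str.format` raise an error, which is a common cause of failures when editing the judge prompt.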

#### Retrieve all the results from the `results.csv` file generated in the _Inference Step_

In [8]:
# Read the inference results
inference_results_file: str = os.path.join(METRICS_DIR, 
                                           config['pdf_dir_info']['all_results_file_name'])
df_resp_all = pd.read_csv(inference_results_file)
df_resp_all.head(10)

Unnamed: 0,user_input,anthropic.claude-3-haiku-20240307-v1:0-response,anthropic.claude-3-sonnet-20240229-v1:0-response,system_prompt,anthropic.claude-3-haiku-20240307-v1:0-time_taken_in_seconds,anthropic.claude-3-haiku-20240307-v1:0-prompt_token_count,anthropic.claude-3-haiku-20240307-v1:0-completion_token_count,anthropic.claude-3-haiku-20240307-v1:0-exception,anthropic.claude-3-sonnet-20240229-v1:0-time_taken_in_seconds,anthropic.claude-3-sonnet-20240229-v1:0-prompt_token_count,anthropic.claude-3-sonnet-20240229-v1:0-completion_token_count,anthropic.claude-3-sonnet-20240229-v1:0-exception,model_1
0,What is the Heisenberg uncertainty\nprinciple?,The Heisenberg uncertainty principle\nstates t...,"The Heisenberg uncertainty principle,\nformula...",The Heisenberg uncertainty principle is\na fun...,3.131573,98,248,,5.701876,98,303,,The Heisenberg uncertainty principle\nstates t...
1,What is the Schrödinger equation and how\nis i...,The Schrödinger equation is a\nfundamental equ...,The Schrödinger equation is a\nfundamental equ...,The Schrödinger equation is a\nfundamental equ...,4.926864,589,568,,7.400591,589,591,,The Schrödinger equation is a\nfundamental equ...
2,What is the greenhouse effect and how\ndoes it...,The greenhouse effect is a natural\nprocess th...,The greenhouse effect is a natural\nprocess th...,The greenhouse effect is a natural\nprocess th...,5.654147,532,409,,8.838915,532,513,,The greenhouse effect is a natural\nprocess th...
3,What is the photoelectric effect and how\ndid ...,The photoelectric effect is a phenomenon\nin w...,The photoelectric effect is a phenomenon\nin w...,The photoelectric effect is a phenomenon\nin w...,5.587937,516,518,,14.179667,516,572,,"When light shines on a metal, electrons\ncan b..."
4,What is the structure of the atom and\nhow was...,The structure of the atom has been\ndetermined...,The current model of the atomic\nstructure is ...,The structure of the atom has been a\nfundamen...,5.674264,510,493,,9.992657,510,499,,"Modern atomic models, based on quantum\nmechan..."
5,What is the role of catalysts in\nchemical rea...,Catalysts play a crucial role in\nchemical rea...,The primary role of catalysts in\nchemical rea...,Catalysts are substances that increase\nthe ra...,5.593801,472,387,,8.27484,472,375,,A catalyst is a substance that can be\nadded t...
6,What is the second law of thermodynamics\nand ...,The second law of thermodynamics states\nthat ...,The second law of thermodynamics is a\nfundame...,The second law of thermodynamics is one\nof th...,4.400713,527,336,,8.548479,527,462,,The second law of thermodynamics states\nthat ...
7,What is the difference between nuclear\nfissio...,The main differences between nuclear\nfission ...,The main difference between nuclear\nfission a...,Nuclear fission and nuclear fusion are\ntwo fu...,3.910536,400,332,,7.40166,400,350,,The phenomenon of nuclear fission.\nFission oc...
8,What is the difference between classical\nmech...,The main differences between classical\nmechan...,The main differences between classical\nmechan...,Classical mechanics and quantum\nmechanics are...,3.582687,630,360,,8.474581,630,373,,Classical mechanics describes the\nphysics of ...
9,What is the difference between\nendothermic an...,The main difference between endothermic\nand e...,The main difference between endothermic\nand e...,"In chemistry, chemical reactions can be\nclass...",4.25086,607,362,,4.078007,607,220,,If you touch a container that holds an\nendoth...


### Construct the ***LLM as a Judge Prompt Template***
---

In this portion of the notebook, the prompt template used by the LLM as a judge is prepared. This sample includes two example evaluation prompt templates: one for Llama3 [here](model-evals/llm_as_a_judge/data/prompt_template/llama3_eval_prompt.txt) and one for Anthropic Claude [here](model-evals/llm_as_a_judge/data/prompt_template/claude_eval_prompt.txt).

Information on which LLM as a judge to use can be configured in the `llm_as_a_judge_info` section of the config file.
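
As a rough guide to what that section needs to contain, the sketch below checks for the keys that the code in this notebook reads from `llm_as_a_judge_info`. The helper name, the example path, and the prices are assumptions made for illustration; consult the actual `config.yaml` for the real values.

```python
from typing import Dict, List

# keys this notebook reads from the `llm_as_a_judge_info` section
REQUIRED_JUDGE_KEYS: List[str] = [
    "model",
    "prompt_template",
    "input_tokens_pricing",
    "output_tokens_pricing",
]


def missing_judge_keys(config: Dict) -> List[str]:
    """Return the required judge keys that are absent from the config."""
    judge_info = config.get("llm_as_a_judge_info") or {}
    return [key for key in REQUIRED_JUDGE_KEYS if key not in judge_info]


# hypothetical config fragment; the path and prices are placeholders
example_config = {
    "llm_as_a_judge_info": {
        "model": "meta.llama3-70b-instruct-v1:0",
        "prompt_template": "data/prompt_template/llama3_eval_prompt.txt",
        "input_tokens_pricing": 0.001,
        "output_tokens_pricing": 0.002,
    }
}
missing = missing_judge_keys(example_config)
print(missing)
```

Running a check like this right after loading the config surfaces a missing key immediately, rather than as a `KeyError` inside a Ray worker.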

In [9]:
if df_resp_all is not None:
    df_resp_all['eval_prompt'] = df_resp_all.apply(lambda r: prepare_eval_prompts(r), axis=1)
    logger.info("preparing the evaluation prompt templates for the LLM judge....")
else:
    logger.error("Model evaluation dataset is not available to process.")
eval_path_df: str = os.path.join(METRICS_DIR, config['pdf_dir_info']['processed_eval_prompts'])
df_resp_all.insert(0, 'prompt_id', df_resp_all.index)
df_resp_all = wrap_text(df_resp_all, width=40)
df_resp_all.to_csv(eval_path_df, index=False)

[2024-06-07 13:33:08,152] p96314 {3998410777.py:13} INFO - evaluation prompt template recorded: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Use the following pieces of retrieved context in the section demarcated by "```" and the question related to that task below it. There are responses from different models completing that task by answering the question below. Your task is to select the answer that best answers the question based on the task provided in terms of relevancy and correctness.
Put the selected answer (without truncating it and give the complete answer within your response), model name and explanation for selecting the answer and not selecting other answer in a JSON as within 3 elements: "best_match_answer" (which contains the full answer you select), "selected_model" (which contains the model name), and "explanation". 
Your explanation should include both model name and answer description so that it is simple to understand which answer was generated by whic

[2024-06-07 13:33:08,157] p96314 {1619896067.py:3} INFO - preparing the evaluation prompt templates for the LLM judge....


model_responses: ["\n<anthropic.claude-3-haiku-20240307-v1:0>\nThe Heisenberg uncertainty principle\nstates that there is a fundamental limit\nto the precision with which certain\npairs of physical properties of a\nparticle, such as position and momentum,\ncan be known simultaneously.  The key\npoints about the Heisenberg uncertainty\nprinciple are:  1. It applies to pairs\nof complementary variables, such as\nposition and momentum, or energy and\ntime.  2. The more precisely you measure\none variable, the more uncertain the\nother becomes. For example, the more\nprecisely you measure a particle's\nposition, the more uncertain its\nmomentum becomes.  3. The uncertainty is\nnot due to imperfect measurement\ntechniques, but is a fundamental\nproperty of quantum mechanics. It arises\nfrom the wave-particle duality of\nquantum particles.  4. The uncertainty\nprinciple is expressed mathematically\nas:  ΔxΔp ≥ h/4π  Where Δx is the\nuncertainty in position, Δp is the\nuncertainty in momentum

In [10]:
df_resp_all

Unnamed: 0,prompt_id,user_input,anthropic.claude-3-haiku-20240307-v1:0-response,anthropic.claude-3-sonnet-20240229-v1:0-response,system_prompt,anthropic.claude-3-haiku-20240307-v1:0-time_taken_in_seconds,anthropic.claude-3-haiku-20240307-v1:0-prompt_token_count,anthropic.claude-3-haiku-20240307-v1:0-completion_token_count,anthropic.claude-3-haiku-20240307-v1:0-exception,anthropic.claude-3-sonnet-20240229-v1:0-time_taken_in_seconds,anthropic.claude-3-sonnet-20240229-v1:0-prompt_token_count,anthropic.claude-3-sonnet-20240229-v1:0-completion_token_count,anthropic.claude-3-sonnet-20240229-v1:0-exception,model_1,eval_prompt
0,0,What is the Heisenberg uncertainty\nprinciple?,The Heisenberg uncertainty principle\nstates t...,"The Heisenberg uncertainty principle,\nformula...",The Heisenberg uncertainty principle is\na fun...,3.131573,98,248,,5.701876,98,303,,The Heisenberg uncertainty principle\nstates t...,<|begin_of_text|><|start_header_id|>user\n<|en...
1,1,What is the Schrödinger equation and how\nis i...,The Schrödinger equation is a\nfundamental equ...,The Schrödinger equation is a\nfundamental equ...,The Schrödinger equation is a\nfundamental equ...,4.926864,589,568,,7.400591,589,591,,The Schrödinger equation is a\nfundamental equ...,<|begin_of_text|><|start_header_id|>user\n<|en...
2,2,What is the greenhouse effect and how\ndoes it...,The greenhouse effect is a natural\nprocess th...,The greenhouse effect is a natural\nprocess th...,The greenhouse effect is a natural\nprocess th...,5.654147,532,409,,8.838915,532,513,,The greenhouse effect is a natural\nprocess th...,<|begin_of_text|><|start_header_id|>user\n<|en...
3,3,What is the photoelectric effect and how\ndid ...,The photoelectric effect is a phenomenon\nin w...,The photoelectric effect is a phenomenon\nin w...,The photoelectric effect is a phenomenon\nin w...,5.587937,516,518,,14.179667,516,572,,"When light shines on a metal, electrons\ncan b...",<|begin_of_text|><|start_header_id|>user\n<|en...
4,4,What is the structure of the atom and\nhow was...,The structure of the atom has been\ndetermined...,The current model of the atomic\nstructure is ...,The structure of the atom has been a\nfundamen...,5.674264,510,493,,9.992657,510,499,,"Modern atomic models, based on quantum\nmechan...",<|begin_of_text|><|start_header_id|>user\n<|en...
5,5,What is the role of catalysts in\nchemical rea...,Catalysts play a crucial role in\nchemical rea...,The primary role of catalysts in\nchemical rea...,Catalysts are substances that increase\nthe ra...,5.593801,472,387,,8.27484,472,375,,A catalyst is a substance that can be\nadded t...,<|begin_of_text|><|start_header_id|>user\n<|en...
6,6,What is the second law of thermodynamics\nand ...,The second law of thermodynamics states\nthat ...,The second law of thermodynamics is a\nfundame...,The second law of thermodynamics is one\nof th...,4.400713,527,336,,8.548479,527,462,,The second law of thermodynamics states\nthat ...,<|begin_of_text|><|start_header_id|>user\n<|en...
7,7,What is the difference between nuclear\nfissio...,The main differences between nuclear\nfission ...,The main difference between nuclear\nfission a...,Nuclear fission and nuclear fusion are\ntwo fu...,3.910536,400,332,,7.40166,400,350,,The phenomenon of nuclear fission.\nFission oc...,<|begin_of_text|><|start_header_id|>user\n<|en...
8,8,What is the difference between classical\nmech...,The main differences between classical\nmechan...,The main differences between classical\nmechan...,Classical mechanics and quantum\nmechanics are...,3.582687,630,360,,8.474581,630,373,,Classical mechanics describes the\nphysics of ...,<|begin_of_text|><|start_header_id|>user\n<|en...
9,9,What is the difference between\nendothermic an...,The main difference between endothermic\nand e...,The main difference between endothermic\nand e...,"In chemistry, chemical reactions can be\nclass...",4.25086,607,362,,4.078007,607,220,,If you touch a container that holds an\nendoth...,<|begin_of_text|><|start_header_id|>user\n<|en...


### Using LLM as a judge in the loop to evaluate and narrow down the responses generated by different models of choice

In [11]:
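def llm_judge_json_evaluations(model_id: str, prompt: str):
    """
    Invokes the LLM judge on Amazon Bedrock through litellm and returns the completion
    along with the token counts and any exception that was raised.
    """
    # service name used as the litellm provider prefix
    service_name: str = "bedrock"
    # litellm model string for bedrock models (titan, llama and claude)
    bedrock_model: str = f"{service_name}/{model_id}"
    # current aws region; fall back to the region in config.yaml if the session has none
    aws_region = boto3.Session().region_name or config['aws']['region']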
    # initialize the response dict
    ret = dict(exception = None,
               user_prompt=None,
               prompt = prompt,
               completion = None,
               # initializing to 0 since none type throws an error later, this is used to calculate price per token input/output on ODT pricing
               completion_token_count = 0,
               # initializing to 0 since none type throws an error later
               prompt_token_count=0,
               input_token_cost = None, 
               output_token_cost = None,
               model_id = model_id)
    
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    parameters = config['inference_parameters']
    temperature = parameters.get('temperature', 0.1)
    caching = parameters.get('caching', False)
    max_tokens = parameters.get("max_tokens", 500)

    try:
        # call the litellm completion API
        logger.info(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{ "content": body,"role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens (this is the same structure for embeddings and text generation models on Amazon Bedrock)
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
        
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    logger.info(f"completion: {ret['completion']}")
    return ret
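
The evaluation prompt asks the judge to answer with a JSON object containing `best_match_answer`, `selected_model`, and `explanation`, but models often surround the JSON with extra text. One possible way to recover the object from `ret["completion"]` is sketched below. `parse_judge_completion` is a hypothetical helper, not a function from this notebook, and the sample completion is made up.

```python
import json
import re
from typing import Dict, Optional


def parse_judge_completion(completion: Optional[str]) -> Optional[Dict]:
    """Return the JSON object embedded in a completion, or None."""
    if not completion:
        return None
    # take everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        # the judge produced malformed JSON; let the caller decide what to do
        return None


sample_completion = (
    "Here is my evaluation:\n"
    '{"best_match_answer": "A catalyst speeds up a reaction.", '
    '"selected_model": "model-a", '
    '"explanation": "model-a is correct; model-b is off topic."}'
)
parsed = parse_judge_completion(sample_completion)
print(parsed["selected_model"] if parsed else "could not parse")
```

Returning `None` instead of raising keeps a single malformed completion from aborting the whole batch; such rows can be counted and reviewed separately.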

In [12]:
def get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"row={row}")
    model_id = model_info['model']
    # create the payload for model inference
    prompt = row['eval_prompt']
    # ask the LLM judge to evaluate the candidate responses contained in the prompt
    resp = llm_judge_json_evaluations(model_id, prompt)
    resp[config['dataset_info']['pre_existing_response_col']] = row[config['dataset_info']['pre_existing_response_col']]
    # calculate the input and output token price for all of the calls
    resp['input_token_cost'] = (resp['prompt_token_count']/1000) * model_info['input_tokens_pricing']
    resp['output_token_cost'] = (resp['completion_token_count']/1000) * model_info['output_tokens_pricing']
    dir_path = os.path.join(config['pdf_dir_info']['llm_as_a_judge_dir'], str(row['prompt_id']), model_id.replace(":", "-"))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"model_evaluation_{row['prompt_id']}.json")
    logger.info(f"writing response={resp} to {fpath}")
    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))
    logger.info(f"response {i}: {resp}")
    return resp

@ray.remote
def async_get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return get_inference(i, row, total, model_info)
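
The per-call cost above is the token count divided by 1,000 and multiplied by a price quoted per 1,000 tokens. A minimal sketch of that arithmetic follows. The token counts are taken from one row of the results table in this notebook; the two price constants are assumptions inferred from that row's cost columns, and the real values come from `input_tokens_pricing` and `output_tokens_pricing` in the config file.

```python
def token_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of `token_count` tokens at a price quoted per 1,000 tokens."""
    return (token_count / 1000) * price_per_1k_tokens


# Assumed prices per 1,000 tokens (illustrative, not read from config.yaml).
INPUT_PRICE_PER_1K = 0.00265
OUTPUT_PRICE_PER_1K = 0.0035

input_cost = token_cost(1844, INPUT_PRICE_PER_1K)
output_cost = token_cost(426, OUTPUT_PRICE_PER_1K)
total_cost = input_cost + output_cost
print(f"input={input_cost:.6f} output={output_cost:.6f} total={total_cost:.6f}")
```

If a judge call fails, the token counts may be missing from the response dictionary, so a guard before this arithmetic may be worth adding.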

In [13]:
df_resp_all = json.loads(df_resp_all.to_json(orient='records'))
n: int = config.get('parallel_inference_count')
resp_list: List = []
erroneous_count = 0  # To keep track of errors
st = time.perf_counter()
EVAL_MODEL_INFO: Dict = config['llm_as_a_judge_info']
logger.info(f"------ running inference for {EVAL_MODEL_INFO.get('model')} -----")

# Split the input list
list_of_lists = [df_resp_all[i * n:(i + 1) * n] for i in range((len(df_resp_all) + n - 1) // n)]
logger.info(f"split input list of size {len(df_resp_all)} into {len(list_of_lists)} lists")

# Process each list
for idx, l in enumerate(list_of_lists):
    try:
        logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)}")
        resp_list.extend(ray.get([async_get_inference.remote(i + 1, e, len(l), EVAL_MODEL_INFO) for i, e in enumerate(l)]))
    except Exception as e:
        logger.error(f"Error processing list {idx+1}/{len(list_of_lists)}: {e}")
        erroneous_count += 1

elapsed_time = time.perf_counter() - st
logger.info(f"------ model={EVAL_MODEL_INFO.get('model')} completed in {elapsed_time} ------")
logger.info(f"Total erroneous lists: {erroneous_count}")

[2024-06-07 13:33:08,262] p96314 {832808186.py:7} INFO - ------ running inference for meta.llama3-70b-instruct-v1:0 -----


[2024-06-07 13:33:08,263] p96314 {832808186.py:11} INFO - split input list of size 10 into 2 lists


[2024-06-07 13:33:08,263] p96314 {832808186.py:16} INFO - getting inference for list 1/2, size of list=5


[2024-06-07 13:33:25,323] p96314 {832808186.py:16} INFO - getting inference for list 2/2, size of list=5


[2024-06-07 13:33:36,299] p96314 {832808186.py:23} INFO - ------ model=meta.llama3-70b-instruct-v1:0 completed in 28.035466041998006 ------


[2024-06-07 13:33:36,300] p96314 {832808186.py:24} INFO - Total erroneous lists: 0
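
The batching above uses ceiling division to split the rows into groups of at most `n`, so that only `n` remote calls are in flight at a time. A small standalone sketch of the same split, without Ray, is shown below. The helper name `split_into_batches` is hypothetical, and the explicit check on the batch size is an addition, since `config.get('parallel_inference_count')` returns `None` when the key is absent.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split `items` into consecutive batches of at most `batch_size` elements."""
    if not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # Ceiling division: the last batch holds the remainder, if any.
    num_batches = (len(items) + batch_size - 1) // batch_size
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]


batches = split_into_batches(list(range(10)), 5)
uneven = split_into_batches(list(range(7)), 3)
print([len(b) for b in batches], [len(b) for b in uneven])
```

With 10 rows and a batch size of 5 this yields two batches of five, which matches the log output above.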


In [14]:
# view the raw responses from the LLM as a judge evaluation
df_resp_all

[{'prompt_id': '0',
  'user_input': 'What is the Heisenberg uncertainty\nprinciple?',
  'anthropic.claude-3-haiku-20240307-v1:0-response': "The Heisenberg uncertainty principle\nstates that there is a fundamental limit\nto the precision with which certain\npairs of physical properties of a\nparticle, such as position and momentum,\ncan be known simultaneously.  The key\npoints about the Heisenberg uncertainty\nprinciple are:  1. It applies to pairs\nof complementary variables, such as\nposition and momentum, or energy and\ntime.  2. The more precisely you measure\none variable, the more uncertain the\nother becomes. For example, the more\nprecisely you measure a particle's\nposition, the more uncertain its\nmomentum becomes.  3. The uncertainty is\nnot due to imperfect measurement\ntechniques, but is a fundamental\nproperty of quantum mechanics. It arises\nfrom the wave-particle duality of\nquantum particles.  4. The uncertainty\nprinciple is expressed mathematically\nas:  ΔxΔp ≥ h/4π 

### Visualize `LLM as a judge` completions and get more evaluation metrics

In [15]:
# Collect all of the evaluation files written by the LLM as a judge
fpath_evaluated_files = os.path.join(config['pdf_dir_info']['llm_as_a_judge_dir'], "**", "*", "*.json")
eval_metric_files = glob.glob(fpath_evaluated_files, recursive=True)
logger.info(f"there are {len(eval_metric_files)} evaluated files by {config['llm_as_a_judge_info']['model']} LLM judge in {fpath_evaluated_files}")

[2024-06-07 13:33:36,390] p96314 {197194707.py:4} INFO - there are 10 evaluated files by meta.llama3-70b-instruct-v1:0 LLM judge in eval_completions/**/*/*.json
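
The pattern above relies on `recursive=True`, where `**` matches zero or more directory levels and the following `*` requires at least one directory between the root and the JSON file. The sketch below illustrates that behavior on a temporary directory. The directory and file names are hypothetical and only mirror the `<prompt_id>/<model_id>/` layout written by the inference step.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/<prompt_id>/<model_id>/model_evaluation_<id>.json
    nested_dir = os.path.join(root, "0", "model-a")
    os.makedirs(nested_dir)
    for path in (
        os.path.join(nested_dir, "model_evaluation_0.json"),
        os.path.join(root, "top_level.json"),  # sits directly in the root
    ):
        with open(path, "w") as f:
            f.write("{}")

    pattern = os.path.join(root, "**", "*", "*.json")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern, recursive=True))

print(matched)
```

The file placed directly in the root is not matched, because the pattern needs at least one intermediate directory.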


In [16]:
def extract_sections(text: str) -> Optional[str]:
    """
    Extracts the question from the text generated by the LLM as a judge: everything
    between 'Question:' and the next code fence. Returns None if no match is found.
    """
    try:
        question_match = re.search(r'Question:(.*?)```', text, re.DOTALL)
        question = question_match.group(1).strip() if question_match else None
    except Exception as e:
        print(f"The question was not extracted: {e}")
        question = None
    return question
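
To make the behavior of the regular expression concrete, here is a minimal sketch that applies the same pattern to a made-up string. The sample text and the name `extract_question` are hypothetical; the pattern is the one used above, with the fence marker built programmatically.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, the delimiter the pattern stops at


def extract_question(text: str) -> Optional[str]:
    """Return the text between 'Question:' and the next code fence, if any."""
    # re.DOTALL lets '.' span newlines; '.*?' stops at the first fence.
    match = re.search(r"Question:(.*?)" + re.escape(FENCE), text, re.DOTALL)
    return match.group(1).strip() if match else None


sample = "Question: What is the\nHeisenberg uncertainty principle?\n" + FENCE + "\nAnswer: ..."
extracted = extract_question(sample)
missing = extract_question("no question marker here")
print(repr(extracted), missing)
```

Because the match is non-greedy, only the text up to the first fence is captured, and inputs without a fence yield `None`.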

In [17]:
os.makedirs(config['pdf_dir_info']['metrics'], exist_ok=True)
model_evaluation_responses = []

for f in eval_metric_files:
    with open(f, 'r') as file:
        model_evaluation_responses.append(json.loads(file.read()))
# results_df will contain the evaluation responses, including the completion and the model id
results_df = pd.DataFrame(model_evaluation_responses)
raw_llm_as_a_judge_responses: str = config['pdf_dir_info']['raw_llm_as_a_judge_completions']
raw_llm_fpath: str = os.path.join(METRICS_DIR, raw_llm_as_a_judge_responses)
results_df = results_df.dropna(axis=1, how='all')
results_df.head(10)

Unnamed: 0,prompt,completion,completion_token_count,prompt_token_count,input_token_cost,output_token_cost,model_id,model_1
0,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The main difference bet...",426,1844,0.004887,0.001491,meta.llama3-70b-instruct-v1:0,If you touch a container that holds an\nendoth...
1,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The Heisenberg uncertai...",324,1361,0.003607,0.001134,meta.llama3-70b-instruct-v1:0,The Heisenberg uncertainty principle\nstates t...
2,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The main differences be...",381,1769,0.004688,0.001334,meta.llama3-70b-instruct-v1:0,The phenomenon of nuclear fission.\nFission oc...
3,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The second law of therm...",399,2083,0.00552,0.001397,meta.llama3-70b-instruct-v1:0,The second law of thermodynamics states\nthat ...
4,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The Schrödinger equatio...",609,2599,0.006887,0.002132,meta.llama3-70b-instruct-v1:0,The Schrödinger equation is a\nfundamental equ...
5,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The main differences be...",434,2207,0.005849,0.001519,meta.llama3-70b-instruct-v1:0,Classical mechanics describes the\nphysics of ...
6,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The structure of the at...",553,2384,0.006318,0.001936,meta.llama3-70b-instruct-v1:0,"Modern atomic models, based on quantum\nmechan..."
7,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The photoelectric effec...",554,2424,0.006424,0.001939,meta.llama3-70b-instruct-v1:0,"When light shines on a metal, electrons\ncan b..."
8,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""The greenhouse effect i...",494,2284,0.006053,0.001729,meta.llama3-70b-instruct-v1:0,The greenhouse effect is a natural\nprocess th...
9,<|begin_of_text|><|start_header_id|>user\n<|en...,"{""best_match_answer"": ""Catalysts play a crucia...",451,2052,0.005438,0.001579,meta.llama3-70b-instruct-v1:0,A catalyst is a substance that can be\nadded t...


In [18]:
def replace_unescaped_quotes(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, str):
            value = value.replace("'", r"\'").replace('"', r'\"')
        new_pairs.append((key, value))
    return dict(new_pairs)

def clean_model_eval_json(data):
    """
    This function takes in the JSON completion from the LLM as a judge, cleans it, and returns the
    best matching answer, the selected model and the explanation as a pandas Series (all None if parsing fails).
    """
    try:
        # Preprocess the input string to handle unescaped double quotes at the start
        if data.startswith('"'):
            data = "'" + data[1:-1].replace('"', '\\"') + "'"
        data = data.replace('\n', ' ')

        json_data = json.loads(data, object_pairs_hook=replace_unescaped_quotes)
        
        # Remove angle brackets from the selected_model value
        selected_model = json_data.get('selected_model', '')
        json_data['selected_model'] = re.sub(r'[<>]', '', selected_model)

        return pd.Series({
            'best_match_answer': json_data.get('best_match_answer'),
            'selected_model': json_data.get('selected_model'),
            'explanation': json_data.get('explanation'),
        })
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Invalid JSON data: {data} - {e}")
        return pd.Series({
            'best_match_answer': None,
            'selected_model': None,
            'explanation': None,
        })
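
The parsing step above can be sketched with the standard library alone. The version below is a simplified illustration, not the notebook's function: it skips the quote preprocessing, returns a plain dictionary, and the name `parse_judge_completion` and the sample strings are hypothetical.

```python
import json
import re
from typing import Dict, Optional

EXPECTED_KEYS = ("best_match_answer", "selected_model", "explanation")


def parse_judge_completion(completion: Optional[str]) -> Dict[str, Optional[str]]:
    """Parse a judge completion into the three expected fields, or all None."""
    empty: Dict[str, Optional[str]] = {key: None for key in EXPECTED_KEYS}
    try:
        # Raw newlines inside JSON strings are invalid, so flatten them first.
        data = json.loads(completion.replace("\n", " "))
    except (json.JSONDecodeError, AttributeError):
        return empty
    if not isinstance(data, dict):
        return empty
    parsed = {key: data.get(key) for key in EXPECTED_KEYS}
    if isinstance(parsed["selected_model"], str):
        # Drop angle brackets the judge sometimes wraps around the model name.
        parsed["selected_model"] = re.sub(r"[<>]", "", parsed["selected_model"]).strip()
    return parsed


good = parse_judge_completion(
    '{"best_match_answer": "A", "selected_model": "<model_1>", "explanation": "clear"}'
)
bad = parse_judge_completion("not json at all")
print(good["selected_model"], bad["selected_model"])
```

Returning a fixed set of keys on failure keeps the downstream DataFrame columns stable even when the judge emits malformed JSON.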

In [19]:
def tidy_split(df, column, sep=',', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
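
The idea behind `tidy_split` is that a cell such as `"model_a, model_b"` becomes two rows, one per selected model. The sketch below shows the same expansion on plain dictionaries instead of a DataFrame. The function name `split_rows` and the sample records are hypothetical, and unlike the function above it also strips whitespace from each part.

```python
from typing import Dict, List


def split_rows(rows: List[Dict], column: str, sep: str = ",") -> List[Dict]:
    """Expand each row into one row per `sep`-separated value in `column`."""
    expanded: List[Dict] = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue  # mirror the dropna step: skip rows with no value
        for part in str(value).split(sep):
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[column] = part.strip()
            expanded.append(new_row)
    return expanded


records = [
    {"selected_model": "model_a, model_b", "explanation": "tie"},
    {"selected_model": "model_a", "explanation": "clear"},
    {"selected_model": None, "explanation": "unparsed"},
]
exploded = split_rows(records, "selected_model")
print([r["selected_model"] for r in exploded])
```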

In [20]:
new_results_df = results_df['completion'].apply(clean_model_eval_json)
# removing any unnecessary characters from the selected_model if any
new_results_df['selected_model'] = new_results_df['selected_model'].str.replace(r'<[^>]+>', '', regex=True)
# here we split the elements of the selected_model column using the tidy split function
new_exploded_df = tidy_split(new_results_df, 'selected_model', sep=',')
new_results_df[config['dataset_info']['pre_existing_response_col']] = results_df[config['dataset_info']['pre_existing_response_col']]
new_results_df['input_token_cost'] = results_df['input_token_cost']
new_results_df['output_token_cost'] = results_df['output_token_cost']
logger.info(f"All evaluation data is read into a dataframe of shape {results_df.shape}")
cols = new_results_df.columns.tolist()
idx = cols.index('selected_model')
cols.insert(idx + 1, cols.pop(cols.index(config['dataset_info']['pre_existing_response_col'])))
new_results_df.drop(columns=['input_token_cost', 'output_token_cost'], inplace=True)
# display the best matching answer, the selected model, the judge's explanation and the pre-existing response side by side
new_results_df.head(20)

[2024-06-07 13:33:36,494] p96314 {471329066.py:9} INFO - All evaluation data is read into a dataframe of shape (10, 8)


Unnamed: 0,best_match_answer,selected_model,explanation,model_1
0,The main difference between endothermic and ex...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,If you touch a container that holds an\nendoth...
1,The Heisenberg uncertainty principle states th...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,The Heisenberg uncertainty principle\nstates t...
2,The main differences between nuclear fission a...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The phenomenon of nuclear fission.\nFission oc...
3,The second law of thermodynamics states that t...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The second law of thermodynamics states\nthat ...
4,The Schrödinger equation is a fundamental equa...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,The Schrödinger equation is a\nfundamental equ...
5,The main differences between classical mechani...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,Classical mechanics describes the\nphysics of ...
6,The structure of the atom has been determined ...,anthropic.claude-3-haiku-20240307-v1:0,This model was selected because it provides a ...,"Modern atomic models, based on quantum\nmechan..."
7,The photoelectric effect is a phenomenon in wh...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,"When light shines on a metal, electrons\ncan b..."
8,The greenhouse effect is a natural process tha...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The greenhouse effect is a natural\nprocess th...
9,Catalysts play a crucial role in chemical reac...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,A catalyst is a substance that can be\nadded t...


In [21]:
initial_df = pd.read_csv(eval_path_df)
# Merge the two DataFrames on the pre-existing response column to bring in the user question
merged_df = pd.merge(new_results_df, initial_df[[config['dataset_info']['pre_existing_response_col'], 
                                                config['dataset_info']['user_question_col']]], on=config['dataset_info']['pre_existing_response_col'], how='left')

cols = [col for col in merged_df.columns if col != 'user prompt']
processed_prompts_for_eval_path = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_as_a_judge_comparisons'])
merged_df.to_csv(processed_prompts_for_eval_path, index=False)
merged_df

Unnamed: 0,best_match_answer,selected_model,explanation,model_1,user_input
0,The main difference between endothermic and ex...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,If you touch a container that holds an\nendoth...,What is the difference between\nendothermic an...
1,The Heisenberg uncertainty principle states th...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,The Heisenberg uncertainty principle\nstates t...,What is the Heisenberg uncertainty\nprinciple?
2,The main differences between nuclear fission a...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The phenomenon of nuclear fission.\nFission oc...,What is the difference between nuclear\nfissio...
3,The second law of thermodynamics states that t...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The second law of thermodynamics states\nthat ...,What is the second law of thermodynamics\nand ...
4,The Schrödinger equation is a fundamental equa...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,The Schrödinger equation is a\nfundamental equ...,What is the Schrödinger equation and how\nis i...
5,The main differences between classical mechani...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,Classical mechanics describes the\nphysics of ...,What is the difference between classical\nmech...
6,The structure of the atom has been determined ...,anthropic.claude-3-haiku-20240307-v1:0,This model was selected because it provides a ...,"Modern atomic models, based on quantum\nmechan...",What is the structure of the atom and\nhow was...
7,The photoelectric effect is a phenomenon in wh...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,"When light shines on a metal, electrons\ncan b...",What is the photoelectric effect and how\ndid ...
8,The greenhouse effect is a natural process tha...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The greenhouse effect is a natural\nprocess th...,What is the greenhouse effect and how\ndoes it...
9,Catalysts play a crucial role in chemical reac...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,A catalyst is a substance that can be\nadded t...,What is the role of catalysts in\nchemical rea...


### View the LLM as a judge comparison and evaluation

In [22]:
processed_prompts_for_eval_path = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_as_a_judge_comparisons'])
merged_df = pd.read_csv(processed_prompts_for_eval_path)
merged_df

Unnamed: 0,best_match_answer,selected_model,explanation,model_1,user_input
0,The main difference between endothermic and ex...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,If you touch a container that holds an\nendoth...,What is the difference between\nendothermic an...
1,The Heisenberg uncertainty principle states th...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,The Heisenberg uncertainty principle\nstates t...,What is the Heisenberg uncertainty\nprinciple?
2,The main differences between nuclear fission a...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The phenomenon of nuclear fission.\nFission oc...,What is the difference between nuclear\nfissio...
3,The second law of thermodynamics states that t...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The second law of thermodynamics states\nthat ...,What is the second law of thermodynamics\nand ...
4,The Schrödinger equation is a fundamental equa...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,The Schrödinger equation is a\nfundamental equ...,What is the Schrödinger equation and how\nis i...
5,The main differences between classical mechani...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,Classical mechanics describes the\nphysics of ...,What is the difference between classical\nmech...
6,The structure of the atom has been determined ...,anthropic.claude-3-haiku-20240307-v1:0,This model was selected because it provides a ...,"Modern atomic models, based on quantum\nmechan...",What is the structure of the atom and\nhow was...
7,The photoelectric effect is a phenomenon in wh...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,"When light shines on a metal, electrons\ncan b...",What is the photoelectric effect and how\ndid ...
8,The greenhouse effect is a natural process tha...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The greenhouse effect is a natural\nprocess th...,What is the greenhouse effect and how\ndoes it...
9,Catalysts play a crucial role in chemical reac...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,A catalyst is a substance that can be\nadded t...,What is the role of catalysts in\nchemical rea...


In [23]:
# Convert the DataFrame to JSON
merged_df_json = merged_df.to_json(orient='records')

# Save the JSON to a text file
with open(JSON_TXT_FILE_PATH, 'w') as json_text_file:
    json_text_file.write(merged_df_json)
logger.info(f"CSV saved to: {processed_prompts_for_eval_path}")

[2024-06-07 13:33:36,579] p96314 {1042489399.py:7} INFO - CSV saved to: data/results/llm_as_a_judge_comparisons.csv


### Generate the LLM as a judge `pick rate` to show how often each model was picked as having the best response over the other models

In [24]:
# Compute the percentage of each model selection and reset the index
new_exploded_df['selected_model'] = new_exploded_df['selected_model'].map(lambda x: x.strip())
response_index_percentage_df = new_exploded_df['selected_model'].value_counts(normalize=True).reset_index()
response_distribution_fpath = os.path.join(METRICS_DIR, config['pdf_dir_info']['llm_as_a_judge_pick_rate'])
response_index_percentage_df['proportion'] *= 100
response_index_percentage_df.to_csv(response_distribution_fpath, index=False)
response_index_percentage_df.head(10)

Unnamed: 0,selected_model,proportion
0,anthropic.claude-3-haiku-20240307-v1:0,100.0
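
The pick rate is the share of judge selections that went to each model, expressed as a percentage. A dependency-free sketch of the same calculation is given below, using `collections.Counter` in place of `value_counts(normalize=True)`. The function name `pick_rate` and the model names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def pick_rate(selected_models: List[str]) -> Dict[str, float]:
    """Percentage of selections per model, most frequently picked first."""
    # Strip whitespace so ' model_a' and 'model_a' count as the same model.
    cleaned = [m.strip() for m in selected_models if m and m.strip()]
    if not cleaned:
        return {}
    counts = Counter(cleaned)
    total = len(cleaned)
    return {model: 100.0 * count / total for model, count in counts.most_common()}


rates = pick_rate(["model_a", " model_a", "model_b", "model_a"])
print(rates)
```

When the judge selects several models for one prompt, each of them counts as one selection after the split, so the percentages are relative to the number of selections rather than the number of prompts.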


### Final Summary: `LLM evaluation`

In [25]:
# simple function to get a final summary on all of the data provided from LLM as a judge
def final_analysis_summary(bedrock: botocore.client, 
                           prompt: str) -> Optional[str]:
    """
    This function sends the final summary prompt, which contains all of the LLM as a judge explanations, 
    to the summarizer model on Amazon Bedrock and returns its analysis as text (None if the call fails)
    """
    modelId=FINAL_ANALYSIS_MODEL_ID
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0.1,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })

    try:
        response = bedrock.invoke_model(
        modelId=modelId,
        body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].replace('"', "'")

    except Exception as e:
        logger.error(f"exception={e}")
        llm_response = None
    return llm_response
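
The request and response handling above can be exercised without calling Amazon Bedrock. The sketch below builds the same Anthropic messages body and reads the text out of a response dictionary. The helper names and the fake response are hypothetical; the field names are the ones used in the function above.

```python
import json
from typing import Optional


def build_messages_body(prompt: str, max_tokens: int = 2000, temperature: float = 0.1) -> str:
    """Serialize a single-turn user message in the Anthropic messages format."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]},
        ],
    })


def extract_text(response_body: dict) -> Optional[str]:
    """Return the first text block of a messages response, or None if absent."""
    for block in response_body.get("content") or []:
        if isinstance(block, dict) and block.get("type") == "text":
            return block.get("text")
    return None


body = json.loads(build_messages_body("Summarize the evaluations."))
# Hand-written stand-in for a decoded response body (not real model output).
fake_response = {"content": [{"type": "text", "text": "model_a was preferred."}]}
summary = extract_text(fake_response)
print(body["messages"][0]["role"], summary)
```

Looking up the text block defensively avoids an `IndexError` when the response contains no content, at the cost of a few extra lines.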

In [26]:
new_results_df

Unnamed: 0,best_match_answer,selected_model,explanation,model_1
0,The main difference between endothermic and ex...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,If you touch a container that holds an\nendoth...
1,The Heisenberg uncertainty principle states th...,anthropic.claude-3-haiku-20240307-v1:0,This answer is selected because it provides a ...,The Heisenberg uncertainty principle\nstates t...
2,The main differences between nuclear fission a...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The phenomenon of nuclear fission.\nFission oc...
3,The second law of thermodynamics states that t...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The second law of thermodynamics states\nthat ...
4,The Schrödinger equation is a fundamental equa...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,The Schrödinger equation is a\nfundamental equ...
5,The main differences between classical mechani...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,Classical mechanics describes the\nphysics of ...
6,The structure of the atom has been determined ...,anthropic.claude-3-haiku-20240307-v1:0,This model was selected because it provides a ...,"Modern atomic models, based on quantum\nmechan..."
7,The photoelectric effect is a phenomenon in wh...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a comprehensive and detail...,"When light shines on a metal, electrons\ncan b..."
8,The greenhouse effect is a natural process tha...,anthropic.claude-3-haiku-20240307-v1:0,This model provides a clear and comprehensive ...,The greenhouse effect is a natural\nprocess th...
9,Catalysts play a crucial role in chemical reac...,anthropic.claude-3-haiku-20240307-v1:0,I selected this answer because it provides a c...,A catalyst is a substance that can be\nadded t...


In [27]:
with open(ALL_EXPLANATIONS_FPATH, 'w') as file:
    for index, row in merged_df.iterrows():
        file.write(f"Selected Model: {row['selected_model']}\nExplanation: {row['explanation']}\n\n")

# Read the content back to use as analysis context
with open(ALL_EXPLANATIONS_FPATH, 'r') as file:
    analysis_context = file.read()
print(analysis_context)

Selected Model: anthropic.claude-3-haiku-20240307-v1:0
Explanation: This answer is selected because it provides a clear and detailed explanation of the difference between endothermic and exothermic reactions, including the direction of energy flow, temperature changes, and examples of each type of reaction. The other models do not provide as comprehensive of an explanation, with model_1 only providing a brief and incomplete description and model anthropic.claude-3-sonnet-20240229-v1:0 providing a similar but less detailed explanation.

Selected Model: anthropic.claude-3-haiku-20240307-v1:0
Explanation: This answer is selected because it provides a clear and comprehensive explanation of the Heisenberg uncertainty principle, including its key points, mathematical expression, and implications. The other models provide shorter or more limited explanations, whereas this model provides a detailed and accurate description of the principle.

Selected Model: anthropic.claude-3-haiku-20240307-v1

In [28]:
# open the prompt template and prepare it for inference
with open(config['pdf_dir_info']['claude_final_summary_eval_prompt'], 'r') as file:
    final_summary_prompt = file.read()
    processed_summary_eval_prompt: str = final_summary_prompt.format(context=analysis_context)

endpoint_url: str = config['bedrock_ep_url'].format(region=config['aws']['region'])
bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=endpoint_url)
final_analysis: str = final_analysis_summary(bedrock, prompt=processed_summary_eval_prompt)

[2024-06-07 13:33:36,790] p96314 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
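
The prompt template is filled in with `str.format`, which treats every `{...}` in the template as a placeholder. If a template contains literal braces, for example a JSON output specification, they need to be doubled. A short sketch with a made-up template (the actual prompt file is not shown in this notebook):

```python
# Hypothetical template: literal braces are doubled, {context} is the placeholder.
template = (
    'Summarize the evaluations below and answer as {{"summary": "..."}}.\n\n'
    "{context}"
)

# Braces inside the substituted value are inserted as-is, not re-interpreted.
context = "Selected Model: model_a\nExplanation: uses {curly} notation"
prompt = template.format(context=context)

# An unescaped literal brace pair is read as a field name and fails.
try:
    '{"summary": "..."} {context}'.format(context=context)
    unescaped_failed = False
except (KeyError, ValueError, IndexError):
    unescaped_failed = True

print(prompt)
```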


In [29]:
final_analysis

'Based on the context provided, the model anthropic.claude-3-haiku-20240307-v1:0 was consistently selected over the other models for its ability to provide clear, comprehensive, and detailed explanations on various scientific concepts and phenomena. The key reasons for its selection include:\n\n1. Depth and Clarity: This model excelled in offering thorough and well-structured explanations, covering multiple aspects of the topic in a clear and organized manner. For instance, in explaining the difference between endothermic and exothermic reactions, it provided a detailed account of energy flow, temperature changes, and examples, making it more comprehensive than the other models.\n\n2. Comprehensive Coverage: The model demonstrated a strong grasp of complex scientific principles, providing in-depth explanations that covered key points, mathematical formulations, implications, and applications. This was evident in its explanations of concepts like the Heisenberg uncertainty principle, th

In [30]:
Path(FINAL_SUMMARY_ANALYSIS).write_text(final_analysis + "\n")

2138