### Generate metrics & Run evaluations (ROUGE, COSINE, LLM acting as a judge in the loop)
---

In this notebook:

1. We will extract the titles generated as completions from the Bedrock models (Claude Sonnet, Llama, Mistral) and load them into a CSV file.

1. Generate metrics on accuracy ([ROUGE-L](https://en.wikipedia.org/wiki/ROUGE_(metric)) and [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) scores), inference latency, token counts, and cost.

1. View all model completions to get a ***vibe check*** on how each model performs. Next, use Claude Sonnet as a judge in the loop to go through the completions from multiple models and decide which one best matches the human generated title. [Claude Sonnet](https://www.anthropic.com/claude) selects the best title based on the [evaluation prompt](data/prompts/eval_template.txt) it is given. In this case, Sonnet acts as a judge to find the title that best captures the content of the meeting.

In [1]:
# import libraries
import os
import ray
import json
import yaml
import glob
import copy
import time
import boto3
import logging
import pandas as pd  
from numpy import dot
from pathlib import Path
from numpy.linalg import norm
from litellm import completion ## support for text generation models on bedrock
from rouge_score import rouge_scorer
from typing import Dict, Optional, List
from bedrock_utils import get_bedrock_client


2024-06-01 17:14:07,514	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.6.5 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


#### Set a logger 

In [2]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

#### Initialize the Ray server used to run asynchronous inference

In [3]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-06-01 17:14:11,293	INFO worker.py:1749 -- Started a local Ray instance.


Python version: 3.11.7
Ray version: 2.11.0


#### Load the config file: Contains model information, data directory information

In [52]:
## load the config file
# global constants
CONFIG_FILE_PATH = "config.yml"

In [53]:
# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-06-01 17:33:27,913] p163 {3034282685.py:5} INFO - config read from config.yml -> {
  "app_name": "genai-chapterize-meeting-transcripts",
  "aws": {
    "region": "us-east-1"
  },
  "dir": {
    "data": "data",
    "raw": "data/source_data",
    "processed": "data/processed_data",
    "completions": "data/title_completions",
    "model_eval_completions": "data/model_eval_completions",
    "golden": "data/source_data/golden",
    "prompts": "data/prompts",
    "metrics": "data/metrics",
    "processed_file": "processed.csv",
    "chapterized_file": "chapterized.csv",
    "metrics_file": "per_request_results.csv",
    "summary_metrics_file": "summary_metrics.csv",
    "model_evals_file": "model_eval.csv",
    "processed_prompts_for_eval": "processed_evaluation_prompts.csv",
    "filtered_titles_for_eval": "filtered_titles_for_eval.csv",
    "final_report": "recommended_model.csv",
    "model_distribution": "model_distribution_count.csv",
    "overall_eval_report": "overall_evaluatio

In [6]:
# collect the JSON files that hold the title completions and their per-request metrics
fpath = os.path.join(config['dir']['completions'], "**", "*", "*.json")
metric_files = glob.glob(fpath, recursive=True)
logger.info(f"there are {len(metric_files)} files in {fpath}")

[2024-06-01 17:14:12,372] p163 {3931314738.py:4} INFO - there are 62 files in data/title_completions/**/*/*.json


#### Generate a simple CSV with metrics on title completions, chapters, and performance latency
---

1. This section of the notebook collects the title completion generated by each model in the config file for each chapter, along with metrics such as latency.

1. The CSV also contains the human generated title from the original data frame, if one was provided. If no human generated title is provided, the data frame will not have that column.

In [7]:
metrics = []
for f in metric_files:
    metrics.append(json.loads(Path(f).read_text()))
df = pd.DataFrame(metrics)
df = df.drop(columns=['exception', 'prompt'])
df = df.sort_values(by=['file_name', 'model_id', 'chapter_id'])
df = df.rename(columns={'completion': 'chapter_title', 'time_taken_in_seconds': 'latency_seconds'})
logger.info(f"all metrics data is read into a dataframe of shape {df.shape}")
count = df.shape[0]

[2024-06-01 17:14:12,400] p163 {3027428405.py:8} INFO - all metrics data is read into a dataframe of shape (62, 11)


In [8]:
df.head(20)

Unnamed: 0,chapter_title,file_name,chapter_id,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,chapter_text,original_title
31,Higgs Boson and its Implications,particle_physics_meeting.json,0,amazon.titan-text-express-v1,0.95907,10,341,0.000273,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications
30,Particle Physics Prospects,particle_physics_meeting.json,1,amazon.titan-text-express-v1,0.875003,6,330,0.000264,1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics
33,Exploring Quantum Gravity,particle_physics_meeting.json,2,amazon.titan-text-express-v1,1.146632,6,343,0.000274,1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity
32,Higher Dimensions,particle_physics_meeting.json,3,amazon.titan-text-express-v1,1.135567,4,330,0.000264,6e-06,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...
35,Exploring the Frontiers of Reality,particle_physics_meeting.json,4,amazon.titan-text-express-v1,1.225493,9,298,0.000238,1.4e-05,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics
34,Antimatter in Medicine and Space Exploration,particle_physics_meeting.json,5,amazon.titan-text-express-v1,1.360492,10,296,0.000237,1.6e-05,"1015 ""Dr. E"" (1234567890)\n10:04:10.100 --> 10...",Antimatter and Its Potential Applications
19,Higgs Boson Particle Discoveries,particle_physics_meeting.json,0,anthropic.claude-3-haiku-20240307-v1:0,0.729197,13,387,9.7e-05,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications
18,Particle Physics Frontiers,particle_physics_meeting.json,1,anthropic.claude-3-haiku-20240307-v1:0,0.553375,9,373,9.3e-05,1.1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics
21,Quantum Gravity Theories,particle_physics_meeting.json,2,anthropic.claude-3-haiku-20240307-v1:0,0.7532,9,385,9.6e-05,1.1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity
20,Multidimensional Universe Theories,particle_physics_meeting.json,3,anthropic.claude-3-haiku-20240307-v1:0,0.802009,11,375,9.4e-05,1.4e-05,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...


#### Calculate Cosine versus ROUGE metrics for generated chapter titles

In [9]:
def sanitize_title(title):
    """
    This function sanitizes the chapter titles that are generated. To add elements you want to remove from the chapter titles, modify the 
    'response_prefix_to_remove' in the config file
    """
    if title is None:
        return title
    # strings listed under 'response_prefix_to_remove' in the config file are stripped from the title
    prefixes_to_remove: List[str] = config['response_prefix_to_remove']
    for response_to_remove in prefixes_to_remove:
        title = title.replace(response_to_remove, "")
    title = title.strip()
    title = title.split("\n")[0]
    return title
df.chapter_title = df.chapter_title.map(sanitize_title)
# view information about the type of data generated by the models, and other metrics below
df.head(10)

Unnamed: 0,chapter_title,file_name,chapter_id,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,chapter_text,original_title
31,Higgs Boson and its Implications,particle_physics_meeting.json,0,amazon.titan-text-express-v1,0.95907,10,341,0.000273,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications
30,Particle Physics Prospects,particle_physics_meeting.json,1,amazon.titan-text-express-v1,0.875003,6,330,0.000264,1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics
33,Exploring Quantum Gravity,particle_physics_meeting.json,2,amazon.titan-text-express-v1,1.146632,6,343,0.000274,1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity
32,Higher Dimensions,particle_physics_meeting.json,3,amazon.titan-text-express-v1,1.135567,4,330,0.000264,6e-06,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...
35,Exploring the Frontiers of Reality,particle_physics_meeting.json,4,amazon.titan-text-express-v1,1.225493,9,298,0.000238,1.4e-05,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics
34,Antimatter in Medicine and Space Exploration,particle_physics_meeting.json,5,amazon.titan-text-express-v1,1.360492,10,296,0.000237,1.6e-05,"1015 ""Dr. E"" (1234567890)\n10:04:10.100 --> 10...",Antimatter and Its Potential Applications
19,Higgs Boson Particle Discoveries,particle_physics_meeting.json,0,anthropic.claude-3-haiku-20240307-v1:0,0.729197,13,387,9.7e-05,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications
18,Particle Physics Frontiers,particle_physics_meeting.json,1,anthropic.claude-3-haiku-20240307-v1:0,0.553375,9,373,9.3e-05,1.1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics
21,Quantum Gravity Theories,particle_physics_meeting.json,2,anthropic.claude-3-haiku-20240307-v1:0,0.7532,9,385,9.6e-05,1.1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity
20,Multidimensional Universe Theories,particle_physics_meeting.json,3,anthropic.claude-3-haiku-20240307-v1:0,0.802009,11,375,9.4e-05,1.4e-05,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...


#### ROUGE & Cosine Similarity Scores for titles:
---

Here, `amazon.titan-embed-text-v1` is used to get the embeddings of the titles. To use a different embeddings model, change the `model` under `embeddings_model_info` in the config file and adapt the `get_embedding` function to that model's request and response format.
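
To make the ROUGE-L F1 score concrete, here is a minimal pure-Python sketch based on the longest common subsequence (LCS) of lowercased word tokens. It is a simplified stand-in for illustration only, not the `rouge_score` implementation this notebook uses, and the helper names `tokenize`, `lcs_length`, and `rouge_l_f1` are hypothetical.

```python
import re


def tokenize(text: str) -> list:
    # Simplified tokenizer (an assumption): lowercase, keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list, b: list) -> int:
    # Classic dynamic-programming LCS, keeping only the previous row
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the generated title that is in the LCS
    recall = lcs / len(ref)      # share of the human title that is in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "Higgs Boson and its Implications",
    "Discussing the Higgs Boson and Its Implications",
)
print(round(score, 4))  # 0.8333
```

For this pair the LCS covers all 5 tokens of the generated title and 5 of the 7 tokens of the human title, so precision is 1, recall is 5/7, and the F1 score is 10/12.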

In [10]:
MAX_TEXT_LEN_FOR_EMBEDDING: int = config['embeddings_model_info']['max_text_len_for_embedding']
# Bedrock runtime client, created lazily on the first call to get_embedding
bedrock = None

def get_embedding(text: str, modelId: str=config['embeddings_model_info'].get('model'), accept: str='application/json', contentType: str='application/json'):
    """
    Generates embeddings for the chapter titles and original titles to generate cosine similarity measures
    """
    global bedrock
    if bedrock is None:
        bedrock = get_bedrock_client()
    body = json.dumps({"inputText": text[:MAX_TEXT_LEN_FOR_EMBEDDING]})
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    token_count = response_body.get('inputTextTokenCount')
    return embedding, token_count

def get_cosine_similarity(text1: str, text2: str) -> float:
    """
    This function calculates the cosine similarity between the chapter title generated from models, and the human generated title (if any)
    """
    A,_ = get_embedding(text1)
    B,_ = get_embedding(text2)
    cosine = dot(A, B)/(norm(A)*norm(B))
    return cosine

def get_rouge_l_score(completion: str, golden: str) -> float:
    """
    This function calculates the rouge-l score between the chapter title generated from models, and the human generated title (if any)
    """
    rouge_metric_selection: str = config['embeddings_model_info']['rouge_metric_selection']
    scorer = rouge_scorer.RougeScorer([rouge_metric_selection])
    scores = scorer.score(golden, completion)
    return round(scores[rouge_metric_selection].fmeasure, 4)
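
As a quick sanity check of the cosine formula used in `get_cosine_similarity`, the sketch below applies it to small hand-written vectors instead of Titan embeddings. The vectors and the helper name `cosine_similarity` are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    # dot(A, B) / (|A| * |B|), the same formula as in the notebook
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Vectors pointing in the same direction score (approximately) 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on the angle between the vectors, scaling an embedding does not change it.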

In [11]:
def compare_titles(row):
    """
    Generates the rouge and cosine similarity scores for chapter titles and original titles
    """
    original_title = row.get('original_title')
    chapter_title = row.get('chapter_title')
    # both titles must be present (not None/NaN) and the original title must be non-empty
    if pd.notna(original_title) and pd.notna(chapter_title) and original_title:
        logger.info(f"Chapter title: {chapter_title}, Original title: {original_title}")
        rouge_l_score = get_rouge_l_score(chapter_title, original_title)
        cosine_sim = get_cosine_similarity(chapter_title.lower(), original_title.lower())
        return pd.Series([rouge_l_score, cosine_sim])
    logger.info("ROUGE scores and Cosine similarity scores cannot be computed since original titles are not provided in the chapterized dataset")
    # return explicit missing values so that both score columns are populated for every row
    return pd.Series([float('nan'), float('nan')])

if 'original_title' in df.columns:
    df[['rouge_l_f1_score', 'cosine_similarity']] = df.apply(compare_titles, axis=1)
else:
    logger.info("No evaluation metrics available since Golden titles are not provided in the dataset.")

[2024-06-01 17:14:12,444] p163 {96059692.py:6} INFO - Chapter title: Higgs Boson and its Implications, Original title: Discussing the Higgs Boson and Its Implications
[2024-06-01 17:14:12,444] p163 {rouge_scorer.py:83} INFO - Using default tokenizer.
[2024-06-01 17:14:12,452] p163 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials


Create new client
  Using region: None
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


[2024-06-01 17:14:13,516] p163 {96059692.py:6} INFO - Chapter title: Particle Physics Prospects, Original title: Exploring Future Frontiers in Particle Physics
[2024-06-01 17:14:13,516] p163 {rouge_scorer.py:83} INFO - Using default tokenizer.
[2024-06-01 17:14:14,063] p163 {96059692.py:6} INFO - Chapter title: Exploring Quantum Gravity, Original title: The Quest for a Theory of Quantum Gravity
[2024-06-01 17:14:14,064] p163 {rouge_scorer.py:83} INFO - Using default tokenizer.
[2024-06-01 17:14:14,927] p163 {96059692.py:6} INFO - Chapter title: Higher Dimensions, Original title: Exploring Higher Dimensions and the Fabric of Reality
[2024-06-01 17:14:14,928] p163 {rouge_scorer.py:83} INFO - Using default tokenizer.
[2024-06-01 17:14:15,463] p163 {96059692.py:6} INFO - Chapter title: Exploring the Frontiers of Reality, Original title: The Role of Theory and Experimentation in Physics
[2024-06-01 17:14:15,465] p163 {rouge_scorer.py:83} INFO - Using default tokenizer.
[2024-06-01 17:14:16,

In [12]:
# show the number of chapter titles generated by each model
df_per_model_id_counts = df['model_id'].value_counts()
df_per_model_id_counts

model_id
amazon.titan-text-express-v1               10
anthropic.claude-3-haiku-20240307-v1:0     10
anthropic.claude-3-sonnet-20240229-v1:0    10
mistral.mistral-7b-instruct-v0:2           10
meta.llama3-70b-instruct-v1:0               6
meta.llama3-8b-instruct-v1:0                6
mistral.mixtral-8x7b-instruct-v0:1          6
meta.llama2-13b-chat-v1                     4
Name: count, dtype: int64

In [13]:
df.head(30)

Unnamed: 0,chapter_title,file_name,chapter_id,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,chapter_text,original_title,rouge_l_f1_score,cosine_similarity
31,Higgs Boson and its Implications,particle_physics_meeting.json,0,amazon.titan-text-express-v1,0.95907,10,341,0.000273,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,0.8333,0.856295
30,Particle Physics Prospects,particle_physics_meeting.json,1,amazon.titan-text-express-v1,0.875003,6,330,0.000264,1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,0.4444,0.851623
33,Exploring Quantum Gravity,particle_physics_meeting.json,2,amazon.titan-text-express-v1,1.146632,6,343,0.000274,1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,0.3636,0.833529
32,Higher Dimensions,particle_physics_meeting.json,3,amazon.titan-text-express-v1,1.135567,4,330,0.000264,6e-06,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,0.4,0.746537
35,Exploring the Frontiers of Reality,particle_physics_meeting.json,4,amazon.titan-text-express-v1,1.225493,9,298,0.000238,1.4e-05,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics,0.3077,0.220735
34,Antimatter in Medicine and Space Exploration,particle_physics_meeting.json,5,amazon.titan-text-express-v1,1.360492,10,296,0.000237,1.6e-05,"1015 ""Dr. E"" (1234567890)\n10:04:10.100 --> 10...",Antimatter and Its Potential Applications,0.3636,0.816674
19,Higgs Boson Particle Discoveries,particle_physics_meeting.json,0,anthropic.claude-3-haiku-20240307-v1:0,0.729197,13,387,9.7e-05,1.6e-05,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,0.3636,0.712618
18,Particle Physics Frontiers,particle_physics_meeting.json,1,anthropic.claude-3-haiku-20240307-v1:0,0.553375,9,373,9.3e-05,1.1e-05,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,0.4444,0.880063
21,Quantum Gravity Theories,particle_physics_meeting.json,2,anthropic.claude-3-haiku-20240307-v1:0,0.7532,9,385,9.6e-05,1.1e-05,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,0.3636,0.891044
20,Multidimensional Universe Theories,particle_physics_meeting.json,3,anthropic.claude-3-haiku-20240307-v1:0,0.802009,11,375,9.4e-05,1.4e-05,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,0.0,0.675618


In [14]:
metrics_dir = config['dir']['metrics']
# Create the directory if it doesn't exist
os.makedirs(metrics_dir, exist_ok=True)
# Construct the file path
metrics_file_path = os.path.join(metrics_dir, config['dir']['metrics_file'])
df.to_csv(metrics_file_path, index=False)

In [15]:
df_summary = df.groupby('model_id').mean(numeric_only=True)
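# check that both score columns exist (a bare `'a' and 'b' in cols` only tests the second name)
if {'rouge_l_f1_score', 'cosine_similarity'}.issubset(df_summary.columns):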
    df_summary = df_summary.rename(columns={'rouge_l_f1_score': 'mean_rouge_l_f1_score', 'cosine_similarity': 'mean_cosine_similarity'})
df_summary['p95_latency_seconds'] = df.groupby('model_id')['latency_seconds'].quantile(0.95)
df_summary['avg_cost_per_txn'] = df_summary.input_token_price + df_summary.output_token_pricing
df_summary['p95_cost_per_txn'] = df.groupby('model_id')['input_token_price'].quantile(0.95) + \
                                 df.groupby('model_id')['output_token_pricing'].quantile(0.95)
df_summary.completion_token_count = df_summary.completion_token_count.astype(int)
df_summary.prompt_token_count = df_summary.prompt_token_count.astype(int)
df_summary['p95_completion_token_count'] = df.groupby('model_id')['completion_token_count'].quantile(0.95)
df_summary['p95_prompt_token_count'] = df.groupby('model_id')['prompt_token_count'].quantile(0.95)
df_summary = df_summary.drop(columns=['chapter_id'])
# Reset the index to make 'model_id' a column
df_summary = df_summary.reset_index()
df_summary


Unnamed: 0,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,mean_rouge_l_f1_score,mean_cosine_similarity,p95_latency_seconds,avg_cost_per_txn,p95_cost_per_txn,p95_completion_token_count,p95_prompt_token_count
0,amazon.titan-text-express-v1,1.126562,7,385,0.000309,1.2e-05,0.4521,0.720899,1.299742,0.000321,0.000508,11.65,611.9
1,anthropic.claude-3-haiku-20240307-v1:0,0.762285,14,441,0.00011,1.8e-05,0.297117,0.765014,0.962902,0.000129,0.000205,23.1,703.85
2,anthropic.claude-3-sonnet-20240229-v1:0,1.122653,13,441,0.001324,0.000202,0.380583,0.788797,1.814916,0.001526,0.00248,24.55,703.85
3,meta.llama2-13b-chat-v1,1.293597,15,488,0.000366,1.6e-05,,,1.69403,0.000382,0.000499,25.75,631.3
4,meta.llama3-70b-instruct-v1:0,0.964386,7,347,0.00092,2.4e-05,0.396567,0.750383,1.114516,0.000944,0.001002,8.75,366.5
5,meta.llama3-8b-instruct-v1:0,0.754584,6,347,0.000139,4e-06,0.389167,0.753589,0.766312,0.000143,0.000151,7.75,366.5
6,mistral.mistral-7b-instruct-v0:2,0.984149,35,401,6e-05,7e-06,0.327533,0.631595,1.88751,6.7e-05,0.000113,95.5,627.9
7,mistral.mixtral-8x7b-instruct-v0:1,0.675825,5,339,0.000153,4e-06,0.389167,0.761117,0.736683,0.000157,0.000166,6.75,358.5
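
One point to keep in mind when reading `p95_cost_per_txn`: it adds the per-model p95 of the input cost to the per-model p95 of the output cost, which can differ from the p95 of the per-request total cost. The sketch below shows the difference on made-up numbers; the `toy` data frame and the `total_cost` column are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Made-up costs for a single model: input and output cost move in opposite directions
toy = pd.DataFrame({
    "model_id": ["model-a"] * 5,
    "input_token_price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "output_token_pricing": [5.0, 4.0, 3.0, 2.0, 1.0],
})

grouped = toy.groupby("model_id")
# Approach used in the summary cell: p95 of each component, then add
sum_of_p95 = (grouped["input_token_price"].quantile(0.95)
              + grouped["output_token_pricing"].quantile(0.95))

# Alternative: add the components per request first, then take the p95
toy["total_cost"] = toy["input_token_price"] + toy["output_token_pricing"]
p95_of_sum = toy.groupby("model_id")["total_cost"].quantile(0.95)

print(sum_of_p95["model-a"], p95_of_sum["model-a"])
```

Here every request costs exactly 6 in total, yet the sum of the component p95 values is about 9.6. Which definition is appropriate depends on how the number will be used.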


#### Pivot the completions into a wide view (one title column per model)

In [16]:
# handle if the title is given in the data frame, include it in the pivoted df, else exclude it
if 'original_title' in df.columns:
    index_cols = ['file_name', 'chapter_id', 'chapter_text', 'original_title']
else:
    index_cols = ['file_name', 'chapter_id', 'chapter_text']
    
df_pivoted = df.pivot_table(index=index_cols, columns='model_id', values='chapter_title', aggfunc='first')
cols_other_than_index_cols = [f"{c}_title" for c in df_pivoted.columns if c not in index_cols]
df_pivoted = df_pivoted.reset_index()
df_pivoted.columns = index_cols + cols_other_than_index_cols
df_pivoted.head()

Unnamed: 0,file_name,chapter_id,chapter_text,original_title,amazon.titan-text-express-v1_title,anthropic.claude-3-haiku-20240307-v1:0_title,anthropic.claude-3-sonnet-20240229-v1:0_title,meta.llama3-70b-instruct-v1:0_title,meta.llama3-8b-instruct-v1:0_title,mistral.mistral-7b-instruct-v0:2_title,mistral.mixtral-8x7b-instruct-v0:1_title
0,particle_physics_meeting.json,0,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,Higgs Boson and its Implications,Higgs Boson Particle Discoveries,Higgs Boson Discussion,Higgs Boson Implications,Higgs Boson Discussion,Higgs Applications,Higgs Boson Discussion
1,particle_physics_meeting.json,1,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,Particle Physics Prospects,Particle Physics Frontiers,Future of Particle Physics,Future of Physics,Future of Particle Physics,Future Physics Projects,Future of Particle Physics
2,particle_physics_meeting.json,2,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,Exploring Quantum Gravity,Quantum Gravity Theories,Quantum Gravity Discussion,Quantum Gravity Implications,Quantum Gravity Implications,Quantum Gravity Discussions,Quantum Gravity Theories
3,particle_physics_meeting.json,3,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,Higher Dimensions,Multidimensional Universe Theories,Dimensions Beyond Space-Time,Higher Dimensions,Higher Dimensions,Higher Dimensions Debate,Higher Dimensions
4,particle_physics_meeting.json,4,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics,Exploring the Frontiers of Reality,Probing Particle Physics Frontiers,Probing Fundamental Physics,Probing Reality's Fabric,Particle Accelerators,New Insights,Exploring Reality
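
The pivoted view has no column for `meta.llama2-13b-chat-v1`, whose mean scores are missing in the summary. One likely reason is that `pivot_table` drops rows whose index values are missing, so chapters without an `original_title` do not survive the pivot. The sketch below reproduces that behavior on made-up rows; the file names, model names, and titles are hypothetical.

```python
import numpy as np
import pandas as pd

# Long format: one row per (chapter, model); the last row has no human title
long_df = pd.DataFrame({
    "file_name": ["a.json", "a.json", "b.json"],
    "chapter_id": [0, 0, 0],
    "original_title": ["Golden title", "Golden title", np.nan],
    "model_id": ["model-x", "model-y", "model-z"],
    "chapter_title": ["Title from X", "Title from Y", "Title from Z"],
})

index_cols = ["file_name", "chapter_id", "original_title"]
# aggfunc='first' is needed because the values are strings, not numbers
wide_df = long_df.pivot_table(
    index=index_cols, columns="model_id", values="chapter_title", aggfunc="first"
).reset_index()

print(list(wide_df.columns))
```

If chapters without a human title should stay in the wide view, one option is to fill the missing `original_title` values with a placeholder before pivoting.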


In [17]:
# Construct the file path
model_evals_fpath = os.path.join(metrics_dir, config['dir']['model_evals_file'])
df_pivoted.to_csv(model_evals_fpath, index=False)
df_pivoted.head()

Unnamed: 0,file_name,chapter_id,chapter_text,original_title,amazon.titan-text-express-v1_title,anthropic.claude-3-haiku-20240307-v1:0_title,anthropic.claude-3-sonnet-20240229-v1:0_title,meta.llama3-70b-instruct-v1:0_title,meta.llama3-8b-instruct-v1:0_title,mistral.mistral-7b-instruct-v0:2_title,mistral.mixtral-8x7b-instruct-v0:1_title
0,particle_physics_meeting.json,0,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,Higgs Boson and its Implications,Higgs Boson Particle Discoveries,Higgs Boson Discussion,Higgs Boson Implications,Higgs Boson Discussion,Higgs Applications,Higgs Boson Discussion
1,particle_physics_meeting.json,1,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,Particle Physics Prospects,Particle Physics Frontiers,Future of Particle Physics,Future of Physics,Future of Particle Physics,Future Physics Projects,Future of Particle Physics
2,particle_physics_meeting.json,2,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,Exploring Quantum Gravity,Quantum Gravity Theories,Quantum Gravity Discussion,Quantum Gravity Implications,Quantum Gravity Implications,Quantum Gravity Discussions,Quantum Gravity Theories
3,particle_physics_meeting.json,3,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,Higher Dimensions,Multidimensional Universe Theories,Dimensions Beyond Space-Time,Higher Dimensions,Higher Dimensions,Higher Dimensions Debate,Higher Dimensions
4,particle_physics_meeting.json,4,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics,Exploring the Frontiers of Reality,Probing Particle Physics Frontiers,Probing Fundamental Physics,Probing Reality's Fabric,Particle Accelerators,New Insights,Exploring Reality


In [18]:
df_summary

Unnamed: 0,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,mean_rouge_l_f1_score,mean_cosine_similarity,p95_latency_seconds,avg_cost_per_txn,p95_cost_per_txn,p95_completion_token_count,p95_prompt_token_count
0,amazon.titan-text-express-v1,1.126562,7,385,0.000309,1.2e-05,0.4521,0.720899,1.299742,0.000321,0.000508,11.65,611.9
1,anthropic.claude-3-haiku-20240307-v1:0,0.762285,14,441,0.00011,1.8e-05,0.297117,0.765014,0.962902,0.000129,0.000205,23.1,703.85
2,anthropic.claude-3-sonnet-20240229-v1:0,1.122653,13,441,0.001324,0.000202,0.380583,0.788797,1.814916,0.001526,0.00248,24.55,703.85
3,meta.llama2-13b-chat-v1,1.293597,15,488,0.000366,1.6e-05,,,1.69403,0.000382,0.000499,25.75,631.3
4,meta.llama3-70b-instruct-v1:0,0.964386,7,347,0.00092,2.4e-05,0.396567,0.750383,1.114516,0.000944,0.001002,8.75,366.5
5,meta.llama3-8b-instruct-v1:0,0.754584,6,347,0.000139,4e-06,0.389167,0.753589,0.766312,0.000143,0.000151,7.75,366.5
6,mistral.mistral-7b-instruct-v0:2,0.984149,35,401,6e-05,7e-06,0.327533,0.631595,1.88751,6.7e-05,0.000113,95.5,627.9
7,mistral.mixtral-8x7b-instruct-v0:1,0.675825,5,339,0.000153,4e-06,0.389167,0.761117,0.736683,0.000157,0.000166,6.75,358.5


In [19]:
def create_summary(row, summary):
    """
    Fills the report template from the config file with the summary metrics of a single model
    """
    # missing scores show up as NaN (not None) in the data frame, so check with pd.isna
    mean_rouge_l = row.get('mean_rouge_l_f1_score')
    mean_cosine = row.get('mean_cosine_similarity')
    return summary.format(
                # the index was reset above, so the model id is a regular column
                model_id=row['model_id'],
                avg_latency=round(row['latency_seconds'], 4),
                p95_latency=round(row['p95_latency_seconds'], 4),
                avg_cost=round(10000 * row['avg_cost_per_txn'], 6),
                p95_cost_per_txn=round(10000 * row['p95_cost_per_txn'], 6),
                avg_prompt_token_count=row['prompt_token_count'],
                p95_prompt_token_count=row['p95_prompt_token_count'],
                avg_completion_token_count=row['completion_token_count'],
                p95_completion_token_count=row['p95_completion_token_count'],
                mean_rouge_l_score=('None' if pd.isna(mean_rouge_l) else round(mean_rouge_l, 4)),
                mean_cosine_similarity_score=('None, (no human generated title provided in the data)' if pd.isna(mean_cosine) else round(mean_cosine, 4)),
                count=int(row['count'])
            )
df_summary = pd.merge(left=df_summary, right=df_per_model_id_counts, on="model_id", how="left")

df_summary['overall_report'] = df_summary.apply(lambda r: create_summary(r, config['report']['summary_text']), axis=1)
df_summary = df_summary.round(6)
summary_metrics_file_path = os.path.join(metrics_dir, config['dir']['summary_metrics_file'])
df_summary = df_summary.sort_values(by=['mean_cosine_similarity', 'mean_rouge_l_f1_score'], ascending=False)
df_summary.to_csv(summary_metrics_file_path, index=False)

In [20]:
# view the df_summary elements
df_summary.head(10)

Unnamed: 0,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,mean_rouge_l_f1_score,mean_cosine_similarity,p95_latency_seconds,avg_cost_per_txn,p95_cost_per_txn,p95_completion_token_count,p95_prompt_token_count,count,overall_report
2,anthropic.claude-3-sonnet-20240229-v1:0,1.122653,13,441,0.001324,0.000202,0.380583,0.788797,1.814916,0.001526,0.00248,24.55,703.85,10,The average inference latency for this workloa...
1,anthropic.claude-3-haiku-20240307-v1:0,0.762285,14,441,0.00011,1.8e-05,0.297117,0.765014,0.962902,0.000129,0.000205,23.1,703.85,10,The average inference latency for this workloa...
7,mistral.mixtral-8x7b-instruct-v0:1,0.675825,5,339,0.000153,4e-06,0.389167,0.761117,0.736683,0.000157,0.000166,6.75,358.5,6,The average inference latency for this workloa...
5,meta.llama3-8b-instruct-v1:0,0.754584,6,347,0.000139,4e-06,0.389167,0.753589,0.766312,0.000143,0.000151,7.75,366.5,6,The average inference latency for this workloa...
4,meta.llama3-70b-instruct-v1:0,0.964386,7,347,0.00092,2.4e-05,0.396567,0.750383,1.114516,0.000944,0.001002,8.75,366.5,6,The average inference latency for this workloa...
0,amazon.titan-text-express-v1,1.126562,7,385,0.000309,1.2e-05,0.4521,0.720899,1.299742,0.000321,0.000508,11.65,611.9,10,The average inference latency for this workloa...
6,mistral.mistral-7b-instruct-v0:2,0.984149,35,401,6e-05,7e-06,0.327533,0.631595,1.88751,6.7e-05,0.000113,95.5,627.9,10,The average inference latency for this workloa...
3,meta.llama2-13b-chat-v1,1.293597,15,488,0.000366,1.6e-05,,,1.69403,0.000382,0.000499,25.75,631.3,4,The average inference latency for this workloa...


### Title Evaluation: Using LLM as a Judge in the loop
---

In this portion:

1. Titles generated by each model are evaluated on relevance and meaning by [Claude](https://www.anthropic.com/news/claude-3-family) Sonnet or another model of your choice. The prompt for the model that acts as a judge in the loop can be viewed in [eval_template.txt](data/prompts/eval_template.txt). Review and edit this prompt based on your use case and criteria for subjective evaluation.

2. The role of the model acting as a judge is to compare the titles generated by each model to a human generated title (also known as the ***golden title***). It returns the selected model, the selected title, and an explanation of its choice, including an in-depth comparison with the other titles and why it chose the one it did. In this case, the judge is prompted to pick the title that ***captures the most relevant aspects of the meeting***.

3. A final evaluation metric is calculated that shows how often each model's title was selected. This distribution helps decide which model to use for production workloads.

***Note: For more information on the use of having a Model act as a judge, view: https://huggingface.co/learn/cookbook/en/llm_judge***
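
The distribution mentioned in step 3 can be sketched with a simple count of the judge's picks. The data frame below is made up, and the column name `selected_model` is an assumption about how the judge's choice might be stored, not the notebook's actual schema.

```python
import pandas as pd

# Hypothetical judge output: one selected model per chapter
judge_picks = pd.DataFrame({
    "chapter_id": [0, 1, 2, 3],
    "selected_model": ["model-a", "model-b", "model-a", "model-a"],
})

# Count how often each model was selected, then convert the counts to shares
distribution = (
    judge_picks["selected_model"]
    .value_counts()
    .rename_axis("selected_model")
    .reset_index(name="count")
)
distribution["share"] = distribution["count"] / distribution["count"].sum()
print(distribution)
```

A model that wins a large share of chapters is a natural candidate, though cost and latency from the summary metrics are worth weighing alongside it.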

In [21]:
try:
    # load the pivoted model completions saved earlier into a data frame
    model_eval_df = pd.read_csv(os.path.join(config['dir']['metrics'], config['dir']['model_evals_file']))
    logger.info("Model eval file found with all model completions. Ready to evaluate responses...")
except Exception as e:
    logger.error(f"Could not read the model evaluation csv file. Error: {e}")
    # re-raise so that the next line does not fail with an unrelated NameError
    raise
model_eval_df.head(10)

[2024-06-01 17:14:37,603] p163 {1587218011.py:4} INFO - Model eval file found with all model completions. Ready to evaluate responses...


Unnamed: 0,file_name,chapter_id,chapter_text,original_title,amazon.titan-text-express-v1_title,anthropic.claude-3-haiku-20240307-v1:0_title,anthropic.claude-3-sonnet-20240229-v1:0_title,meta.llama3-70b-instruct-v1:0_title,meta.llama3-8b-instruct-v1:0_title,mistral.mistral-7b-instruct-v0:2_title,mistral.mixtral-8x7b-instruct-v0:1_title
0,particle_physics_meeting.json,0,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,Higgs Boson and its Implications,Higgs Boson Particle Discoveries,Higgs Boson Discussion,Higgs Boson Implications,Higgs Boson Discussion,Higgs Applications,Higgs Boson Discussion
1,particle_physics_meeting.json,1,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,Particle Physics Prospects,Particle Physics Frontiers,Future of Particle Physics,Future of Physics,Future of Particle Physics,Future Physics Projects,Future of Particle Physics
2,particle_physics_meeting.json,2,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,Exploring Quantum Gravity,Quantum Gravity Theories,Quantum Gravity Discussion,Quantum Gravity Implications,Quantum Gravity Implications,Quantum Gravity Discussions,Quantum Gravity Theories
3,particle_physics_meeting.json,3,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,Higher Dimensions,Multidimensional Universe Theories,Dimensions Beyond Space-Time,Higher Dimensions,Higher Dimensions,Higher Dimensions Debate,Higher Dimensions
4,particle_physics_meeting.json,4,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics,Exploring the Frontiers of Reality,Probing Particle Physics Frontiers,Probing Fundamental Physics,Probing Reality's Fabric,Particle Accelerators,New Insights,Exploring Reality
5,particle_physics_meeting.json,5,"1015 ""Dr. E"" (1234567890)\n10:04:10.100 --> 10...",Antimatter and Its Potential Applications,Antimatter in Medicine and Space Exploration,Antimatter's Scientific Potential,Antimatter Applications,Antimatter Applications,Antimatter Applications,Antimatter Applications,Antimatter Applications


#### Prepare the evaluation prompt payloads

Here, the [`evaluation prompt template`](data/prompts/eval_template.txt) is used by the LLM judge to evaluate different chapter titles and suggest the most suitable title based on the evaluation criteria mentioned in the prompt template.

In [22]:
def prepare_eval_prompts(row):
    """
    Build the evaluation prompt for one row by inserting the chapter text, the original title, and
    all of the titles generated by the Bedrock models into the evaluation prompt template.
    Returns None if the template file cannot be found.
    """
    # represents the eval template used by the model judge
    eval_template: Optional[str] = None
    processed_eval_template: Optional[str] = None
    model_titles: List[str] = []
    # file path to the eval template
    eval_template_path: str = os.path.join(config['dir']['prompts'], config['eval_model_info'].get('prompt_template'))
    try:
        with open(eval_template_path, "r") as f:
            eval_template = f.read()
            logger.info(f"evaluation prompt template recorded: {eval_template}")
    except FileNotFoundError:
        # without a template there is nothing to format, so return early instead of failing on None
        logger.error(f"Evaluation template not found at {eval_template_path}")
        return processed_eval_template
    logger.info(f"chapter_text: {row['chapter_text']}")
    logger.info(f"original_title: {row['original_title']}")
    for column in row.index:
        if column.endswith("_title") and column != "original_title":
            model_id = column.split("_title")[0]
            model_title = row[column]
            model_titles.append(f"\n<{model_id}>\n{model_title}\n</{model_id}>\n")
    processed_eval_template = eval_template.format(
        chapter_text=row['chapter_text'], 
        original_title=row['original_title'],
        model_titles="\n".join(model_titles)
    )

    return processed_eval_template
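
To make the prompt assembly easier to follow, here is a minimal, self-contained sketch of the same idea. The template string, the `build_eval_prompt` helper, and the `model_a`/`model_b` names are assumptions made up for illustration; the real template lives in `eval_template.txt`.

```python
from typing import Dict

# Hypothetical, shortened stand-in for eval_template.txt (an assumption for this sketch)
TEMPLATE = (
    "<chapter>\n{chapter_text}\n</chapter>\n\n"
    "<model_x>\n{original_title}\n</model_x>\n\n"
    "{model_titles}\n"
)

def build_eval_prompt(row: Dict[str, str], template: str = TEMPLATE) -> str:
    """Wrap every '<model_id>_title' value in a tag named after the model and fill the template."""
    blocks = []
    for column, title in row.items():
        if column.endswith("_title") and column != "original_title":
            model_id = column[: -len("_title")]
            blocks.append(f"\n<{model_id}>\n{title}\n</{model_id}>\n")
    return template.format(
        chapter_text=row["chapter_text"],
        original_title=row["original_title"],
        model_titles="\n".join(blocks),
    )

# Toy row with hypothetical model names
row = {
    "chapter_text": "Toy transcript text.",
    "original_title": "Golden Title",
    "model_a_title": "Title A",
    "model_b_title": "Title B",
}
prompt = build_eval_prompt(row)
print(prompt)
```

Because the template is filled with `str.format`, any literal braces inside the template file itself would need to be doubled (`{{` and `}}`); braces inside the chapter text are not affected, since the text is passed in as an argument.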

Add the evaluation prompt as a new `eval_prompt` column in the dataframe, alongside the chapter text and the model titles, so that each row can be sent to the judge model for evaluation.

In [23]:
if model_eval_df is not None:
    logger.info("preparing the evaluation prompt templates for the LLM judge....")
    model_eval_df['eval_prompt'] = model_eval_df.apply(lambda r: prepare_eval_prompts(r), axis=1)
    # save the prompts only when there is a dataframe to save
    model_eval_df_f_path = os.path.join(metrics_dir, config['dir']['processed_prompts_for_eval'])
    model_eval_df.to_csv(model_eval_df_f_path, index=False)
else:
    logger.error("Model evaluation dataset is not available to process.")
model_eval_df.head(10) if model_eval_df is not None else None

[2024-06-01 17:14:37,618] p163 {2540286596.py:14} INFO - evaluation prompt template recorded: Human: here is a transcript from a meeting in the <chapter></chapter> tag followed by chapter titles generated by different models.
Your task is to select the title that best captures the content and meaning of the chapter in 1 to 4 words.
Put the selected title, model name and explanation for selecting the title and not selecting other titles in a JSON as within 3 elements: "best_match_title", "selected_model", and "explanation".
Your explanation should include both model name and title so that it is simple to understand which title was generated by which model and why it was or was not selected.

<chapter>
{chapter_text}
</chapter>

<model_x>
{original_title}
</model_x>

{model_titles}

Assistant: Here is the response in json:

[2024-06-01 17:14:37,618] p163 {2540286596.py:17} INFO - chapter_text: 1 "Dr. E" (1234567890)
00:00:00.000 --> 00:00:05.340
Have you all seen the latest results from 

Unnamed: 0,file_name,chapter_id,chapter_text,original_title,amazon.titan-text-express-v1_title,anthropic.claude-3-haiku-20240307-v1:0_title,anthropic.claude-3-sonnet-20240229-v1:0_title,meta.llama3-70b-instruct-v1:0_title,meta.llama3-8b-instruct-v1:0_title,mistral.mistral-7b-instruct-v0:2_title,mistral.mixtral-8x7b-instruct-v0:1_title,eval_prompt
0,particle_physics_meeting.json,0,"1 ""Dr. E"" (1234567890)\n00:00:00.000 --> 00:00...",Discussing the Higgs Boson and Its Implications,Higgs Boson and its Implications,Higgs Boson Particle Discoveries,Higgs Boson Discussion,Higgs Boson Implications,Higgs Boson Discussion,Higgs Applications,Higgs Boson Discussion,Human: here is a transcript from a meeting in ...
1,particle_physics_meeting.json,1,"996 ""Dr. M"" (2345678901)\n10:02:15.340 --> 10:...",Exploring Future Frontiers in Particle Physics,Particle Physics Prospects,Particle Physics Frontiers,Future of Particle Physics,Future of Physics,Future of Particle Physics,Future Physics Projects,Future of Particle Physics,Human: here is a transcript from a meeting in ...
2,particle_physics_meeting.json,2,"1001 ""Dr. S"" (3456789012)\n10:02:42.520 --> 10...",The Quest for a Theory of Quantum Gravity,Exploring Quantum Gravity,Quantum Gravity Theories,Quantum Gravity Discussion,Quantum Gravity Implications,Quantum Gravity Implications,Quantum Gravity Discussions,Quantum Gravity Theories,Human: here is a transcript from a meeting in ...
3,particle_physics_meeting.json,3,"1006 ""Dr. D"" (4567890123)\n10:03:13.400 --> 10...",Exploring Higher Dimensions and the Fabric of ...,Higher Dimensions,Multidimensional Universe Theories,Dimensions Beyond Space-Time,Higher Dimensions,Higher Dimensions,Higher Dimensions Debate,Higher Dimensions,Human: here is a transcript from a meeting in ...
4,particle_physics_meeting.json,4,"1011 ""Dr. E"" (1234567890)\n10:03:44.900 --> 10...",The Role of Theory and Experimentation in Physics,Exploring the Frontiers of Reality,Probing Particle Physics Frontiers,Probing Fundamental Physics,Probing Reality's Fabric,Particle Accelerators,New Insights,Exploring Reality,Human: here is a transcript from a meeting in ...
5,particle_physics_meeting.json,5,"1015 ""Dr. E"" (1234567890)\n10:04:10.100 --> 10...",Antimatter and Its Potential Applications,Antimatter in Medicine and Space Exploration,Antimatter's Scientific Potential,Antimatter Applications,Antimatter Applications,Antimatter Applications,Antimatter Applications,Antimatter Applications,Human: here is a transcript from a meeting in ...


#### Using LLM (Claude) as a judge in the loop to evaluate and narrow down the titles generated by different models of choice

In [24]:
def llm_judge_json_evaluations(model_id: str, prompt: str):
    # litellm provider prefix for Amazon Bedrock
    service_name: str = "bedrock"
    # litellm model identifier in the form "bedrock/<model_id>"
    bedrock_model: str = f"{service_name}/{model_id}"
    # represents the current aws region
    aws_region = boto3.Session().region_name 
    # initialize the response dict
    ret = dict(exception = None,
               prompt = prompt,
               completion = None,
               file_name = None,
               original_title = None, 
               # initializing to 0 since none type throws an error later, this is used to calculate price per token input/output on ODT pricing
               completion_token_count = 0,
               # initializing to 0 since none type throws an error later
               prompt_token_count=0,
               input_token_price = None, 
               output_token_pricing = None,
               model_id = model_id)
    body = ret['prompt']
    os.environ["AWS_REGION_NAME"] = aws_region
    parameters = config['inference_parameters_for_explanations']
    temperature = parameters.get('temperature', 0.1)
    caching = parameters.get('caching', False)
    max_tokens = parameters.get("max_tokens", 500)
    try:
        # call the litellm completion API with the evaluation prompt as a single user message
        logger.info(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=[{ "content": body,"role": "user"}],
                              temperature=temperature,
                              max_tokens=max_tokens,
                              caching=caching)
        
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # Extract number of input and completion prompt tokens (this is the same structure for embeddings and text generation models on Amazon Bedrock)
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
    except Exception as e:
        logger.error(f"Exception occurred during invoking {model_id}, exception={e}")
        ret['exception'] = e
    
    logger.info(f"completion: {ret['completion']}")
    return ret

In [25]:
def get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    # save all the responses from the model in a dictionary
    resp: Dict = {}
    print(f"row={row}")
    logger.info(f"row {i}/{total}, prompt_template={model_info['prompt_template']}, model_id={model_info['model']}")
    model_id = model_info['model']
    # create the payload for model inference
    prompt = row['eval_prompt']
    # ask the judge model to select the best title for the chapter in the prompt
    resp = llm_judge_json_evaluations(model_id, prompt)
    resp['original_title'] = row['original_title']
    resp['file_name'] = row['file_name']
    # calculate the input and output token price for all of the calls
    resp['input_token_price'] = (resp['prompt_token_count']/1000) * model_info['input_tokens_pricing']
    logger.info(f"The price for {resp['prompt_token_count']} tokens for {model_id} for filename={row['file_name']} chapter={row['chapter_id']} is {resp['input_token_price']}")
    resp['output_token_pricing'] = (resp['completion_token_count']/1000) * model_info['output_tokens_pricing']
    logger.info(f"The price for {resp['completion_token_count']} tokens for {model_id} for filename={row['file_name']} chapter={row['chapter_id']} is {resp['output_token_pricing']}")
    dir_path = os.path.join(config['dir']['model_eval_completions'], row['file_name'], model_id.replace(":", "-"))
    os.makedirs(dir_path, exist_ok=True)
    fpath = os.path.join(dir_path, f"model_evaluation_{row['chapter_id']}.json")
    logger.info(f"writing response={resp} to {fpath}")
    Path(fpath).write_text(json.dumps(resp, default=str, indent=2))
    logger.info(f"response {i}: {resp}")
    return resp

@ray.remote
def async_get_inference(i: int, row: Dict, total: int, model_info: Dict) -> Dict:
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return get_inference(i, row, total, model_info)

In [26]:
model_eval_df = json.loads(model_eval_df.to_json(orient='records'))
n: int = config['parallel_inference_count']
resp_list: List = []
model_list = config['eval_model_info']
st = time.perf_counter()
logger.info(f"------ running inference for {model_list.get('model')} -----")
list_of_lists = [model_eval_df[i * n:(i + 1) * n] for i in range((len(model_eval_df) + n - 1) // n )]
logger.info(f"split input list of size {len(model_eval_df)} into {len(list_of_lists)} lists")
for idx, l in enumerate(list_of_lists):
    logger.info(f"getting inference for list {idx+1}/{len(list_of_lists)}, size of list={len(l)} ")
    resp_list.extend(ray.get([async_get_inference.remote(i+1, e, len(l), model_list) for i, e in enumerate(l)]))
elapsed_time = time.perf_counter() - st
logger.info(f"------ model={model_list.get('model')} completed in {elapsed_time} ------ ")

[2024-06-01 17:14:37,647] p163 {161412846.py:7} INFO - ------ running inference for anthropic.claude-3-sonnet-20240229-v1:0 -----
[2024-06-01 17:14:37,648] p163 {161412846.py:9} INFO - split input list of size 6 into 1 lists
[2024-06-01 17:14:37,648] p163 {161412846.py:11} INFO - getting inference for list 1/1, size of list=6 
(async_get_inference pid=197) [2024-06-01 17:14:39,332] p197 {3857863317.py:5} INFO - row 6/6, prompt_template=eval_template.txt, model_id=anthropic.claude-3-sonnet-20240229-v1:0
(async_get_inference pid=197) [2024-06-01 17:14:39,336] p197 {2625041246.py:29} INFO - Invoking bedrock/anthropic.claude-3-sonnet-20240229-v1:0......
(async_get_inference pid=199) [2024-06-01 17:14:39,349] p199 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
(async_get_inference pid=197) 17:14:39 - LiteLLM:INFO: utils.py:1133 -
(async_get_inference pid=197) Request Sent from LiteLLM:

(async_get_inference pid=197) row={'file_name': 'particle_physics_meeting.json', 'chapter_id': 5, 'chapter_text': '1015 "Dr. E" (1234567890)\n10:04:10.100 --> 10:04:16.400\nTake the concept of antimatter, for instance. It was purely theoretical until we found ways to produce and study it.\n\n1016 "Dr. M" (2345678901)\n10:04:16.400 --> 10:04:22.700\nAnd now we\'re harnessing antimatter for medical applications like proton therapy and advanced imaging techniques.\n\n1017 "Dr. S" (3456789012)\n10:04:22.700 --> 10:04:29.000\nNot to mention the potential for antimatter-based propulsion systems, which could revolutionize space exploration.\n\n1018 "Dr. D" (4567890123)\n10:04:29.000 --> 10:04:35.300\nAh yes, the dream of achieving faster-than-light travel. But we\'d need to overcome some significant hurdles first.', 'original_title': 'Antimatter and Its Potential Applications', 'amazon.titan-text-express-v1_title': 'Antimatter in Medicine and Space Exploration', 'anthropic.claude-3-h

(async_get_inference pid=198) 17:14:42 - LiteLLM:INFO: utils.py:2911 - Wrapper: Completed Call, calling success_handler
(async_get_inference pid=198) [2024-06-01 17:14:42,808] p198 {utils.py:2911} INFO - Wrapper: Completed Call, calling success_handler
(async_get_inference pid=198) [2024-06-01 17:14:42,808] p198 {2625041246.py:49} INFO - completion: {
(async_get_inference pid=198)   "best_match_title": "Higgs Boson Discussion",
(async_get_inference pid=198)   "selected_model": "anthropic.claude-3-sonnet-20240229-v1:0",
(async_get_inference pid=198)   "explanation": "The title 'Higgs Boson Discussion' generated by the anthropic.claude-3-sonnet-20240229-v1:0 model best captures the content and meaning of the chapter. The transcript is a discussion among scientists about the Higgs boson, its implications, and potential applications. The other titles like 'Higgs Boson Implications' and 'Higgs Boson Particle Discoveries' are too

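The inference cell above splits the rows into batches of at most `parallel_inference_count` items and waits for each batch to finish before starting the next. The sketch below isolates that batching step; `split_into_batches` is a hypothetical helper name used only for illustration, and the batch size of 4 is an arbitrary example value.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def split_into_batches(items: Sequence[T], n: int) -> List[List[T]]:
    """Split items into consecutive batches of at most n elements."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # ceiling division: the same (len + n - 1) // n expression used in the cell above
    num_batches = (len(items) + n - 1) // n
    return [list(items[i * n:(i + 1) * n]) for i in range(num_batches)]

# Six items with a batch size of four give one full batch and one partial batch
batches = split_into_batches(list(range(6)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5]]
```

The ceiling division ensures that a final partial batch is still processed, and an empty input produces no batches at all.
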
#### Extract all evaluations from the model evaluator

In [27]:
# collect all of the evaluation files written by the LLM judge
fpath_evaluated_files = os.path.join(config['dir']['model_eval_completions'], "**", "*", "*.json")
eval_metric_files = glob.glob(fpath_evaluated_files, recursive=True)
logger.info(f"there are {len(eval_metric_files)} evaluated files by {config['eval_model_info']['model']} LLM judge in {fpath_evaluated_files}")

[2024-06-01 17:14:44,786] p163 {1986411679.py:4} INFO - there are 6 evaluated files by anthropic.claude-3-sonnet-20240229-v1:0 LLM judge in data/model_eval_completions/**/*/*.json
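
The glob pattern in the cell above combines `**` (zero or more directories when `recursive=True`) with a single `*` directory component, so a JSON file must sit inside at least one sub-directory of the root to be matched. The sketch below shows this on a temporary directory; the directory and file names are hypothetical and only mirror the `<file_name>/<model_id>/model_evaluation_<n>.json` layout written by `get_inference`.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/<file_name>/<model_id>/model_evaluation_<n>.json
    nested = os.path.join(root, "meeting.json", "model-a")
    os.makedirs(nested)
    for name in ("model_evaluation_0.json", "model_evaluation_1.json"):
        with open(os.path.join(nested, name), "w") as f:
            f.write("{}")
    # A JSON file directly under the root has no parent sub-directory to satisfy the "*" component
    with open(os.path.join(root, "stray.json"), "w") as f:
        f.write("{}")

    pattern = os.path.join(root, "**", "*", "*.json")
    matches = sorted(os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True))

print(matches)
```

Only the two nested evaluation files are returned; `stray.json` at the top level is skipped.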


In [28]:
model_evaluation_responses = []
for f in eval_metric_files:
    with open(f, 'r') as file:
        model_evaluation_responses.append(json.loads(file.read()))
# results_df will contain the evaluation responses, including the completion and the model id
results_df = pd.DataFrame(model_evaluation_responses)
results_df = results_df.drop(columns=['exception', 'prompt', 'file_name'])
results_df.head(10)

Unnamed: 0,completion,original_title,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,model_id
0,"{\n ""best_match_title"": ""Dimensions Beyond Sp...",Exploring Higher Dimensions and the Fabric of ...,182,754,0.002262,0.00273,anthropic.claude-3-sonnet-20240229-v1:0
1,"{\n ""best_match_title"": ""Quantum Gravity Disc...",The Quest for a Theory of Quantum Gravity,232,773,0.002319,0.00348,anthropic.claude-3-sonnet-20240229-v1:0
2,"{\n ""best_match_title"": ""Antimatter Applicati...",Antimatter and Its Potential Applications,250,717,0.002151,0.00375,anthropic.claude-3-sonnet-20240229-v1:0
3,"{\n ""best_match_title"": ""Probing Fundamental ...",The Role of Theory and Experimentation in Physics,354,724,0.002172,0.00531,anthropic.claude-3-sonnet-20240229-v1:0
4,"{\n ""best_match_title"": ""Future of Particle P...",Exploring Future Frontiers in Particle Physics,316,752,0.002256,0.00474,anthropic.claude-3-sonnet-20240229-v1:0
5,"{\n ""best_match_title"": ""Higgs Boson Discussi...",Discussing the Higgs Boson and Its Implications,214,790,0.00237,0.00321,anthropic.claude-3-sonnet-20240229-v1:0
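
The two price columns above follow the per-1,000-token formula used in `get_inference`. As a worked check, the sketch below recomputes the first row. The two rates are assumptions inferred from that row (754 prompt tokens costing 0.002262 and 182 completion tokens costing 0.00273); they are not read from the notebook's config file.

```python
import math

def token_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of token_count tokens at a price quoted per 1,000 tokens."""
    return (token_count / 1000) * price_per_1k_tokens

# Assumed rates, inferred from the first row of the table above (not read from the config file)
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

input_cost = token_cost(754, INPUT_PRICE_PER_1K)    # prompt_token_count of row 0
output_cost = token_cost(182, OUTPUT_PRICE_PER_1K)  # completion_token_count of row 0

# Compare with a tolerance, since these are floating point products
print(math.isclose(input_cost, 0.002262), math.isclose(output_cost, 0.00273))
```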


In [29]:
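def clean_model_eval_json(data):
    """
    Parse the JSON completion returned by the model evaluator and return the selected title,
    the selected model, and the explanation. All three fields are None if the completion cannot be parsed.
    """
    try:
        json_data = json.loads(data.replace('\\', '\\\\'))
        return pd.Series({
            'best_match_title': json_data['best_match_title'],
            'selected_model': json_data['selected_model'],
            'explanation': json_data['explanation'],
        })
    # malformed JSON, a missing key, or a missing completion (None) all yield an empty result
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):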
        return pd.Series({
            'best_match_title': None,
            'selected_model': None,
            'explanation': None,
        })
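
A judge model may occasionally wrap its JSON in extra text, in which case `json.loads` on the whole completion fails and the row ends up with empty fields. One possible mitigation, sketched below, is to parse only the span between the first `{` and the last `}`. The `extract_json_object` helper and the `model_a` value are hypothetical and are not part of the notebook.

```python
import json
from typing import Any, Dict, Optional

def extract_json_object(text: Optional[str]) -> Optional[Dict[str, Any]]:
    """Parse the outermost {...} span in text; return None when no valid JSON object is found."""
    if not text:
        return None
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Hypothetical completion with a prose preamble before the JSON object
completion = (
    'Here is the response in json:\n'
    '{"best_match_title": "Higgs Boson Discussion", "selected_model": "model_a", "explanation": "..."}'
)
print(extract_json_object(completion))
```

This is a heuristic: it assumes the completion contains a single JSON object, and it returns None rather than raising when that assumption does not hold.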

In [30]:
def tidy_split(df, column, sep=',', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.
    
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [31]:
new_results_df = results_df['completion'].apply(clean_model_eval_json)
# removing any unnecessary characters from the selected_model if any
new_results_df['selected_model'] = new_results_df['selected_model'].str.replace(r'<[^>]+>', '', regex=True)
# here we split the elements of the selected_model column using the tidy split function
new_exploded_df = tidy_split(new_results_df, 'selected_model', sep=',')
new_results_df['chapter_title'] = results_df['original_title']
new_results_df['input_token_price'] = results_df['input_token_price']
new_results_df['output_token_price'] = results_df['output_token_pricing']
new_results_df = new_results_df.reindex(columns=['chapter_title', 'best_match_title', 'selected_model', 'explanation', 'input_token_price', 'output_token_price'])
logger.info(f"All evaluation data is read into a dataframe of shape {results_df.shape}")
processed_prompts_for_eval_path = os.path.join(metrics_dir, config['dir']['filtered_titles_for_eval'])
new_results_df.to_csv(processed_prompts_for_eval_path, index=False)
# display the selected title, model explanation and the respective golden title in a side by side view
new_results_df.head(10)

[2024-06-01 17:14:44,832] p163 {3809385060.py:10} INFO - All evaluation data is read into a dataframe of shape (6, 7)


Unnamed: 0,chapter_title,best_match_title,selected_model,explanation,input_token_price,output_token_price
0,Exploring Higher Dimensions and the Fabric of ...,Dimensions Beyond Space-Time,anthropic.claude-3-sonnet-20240229-v1:0,The title 'Dimensions Beyond Space-Time' gener...,0.002262,0.00273
1,The Quest for a Theory of Quantum Gravity,Quantum Gravity Discussion,anthropic.claude-3-sonnet-20240229-v1:0,The title 'Quantum Gravity Discussion' generat...,0.002319,0.00348
2,Antimatter and Its Potential Applications,Antimatter Applications,"anthropic.claude-3-sonnet-20240229-v1:0, meta....",The title 'Antimatter Applications' best captu...,0.002151,0.00375
3,The Role of Theory and Experimentation in Physics,Probing Fundamental Physics,anthropic.claude-3-sonnet-20240229-v1:0,The title 'Probing Fundamental Physics' genera...,0.002172,0.00531
4,Exploring Future Frontiers in Particle Physics,Future of Particle Physics,anthropic.claude-3-sonnet-20240229-v1:0,The title 'Future of Particle Physics' generat...,0.002256,0.00474
5,Discussing the Higgs Boson and Its Implications,Higgs Boson Discussion,anthropic.claude-3-sonnet-20240229-v1:0,The title 'Higgs Boson Discussion' generated b...,0.00237,0.00321


In [32]:
# Compute the percentage of each model selection and reset the index
new_exploded_df['selected_model'] = new_exploded_df['selected_model'].map(lambda x: x.strip())
model_percentage_df = new_exploded_df['selected_model'].value_counts(normalize=True).reset_index()
model_percentage_df['proportion'] *= 100
model_distribution_fpath = os.path.join(metrics_dir, config['dir']['model_distribution'])
model_percentage_df.to_csv(model_distribution_fpath, index=False)
model_percentage_df.rename(columns = {'selected_model':'model_id'}, inplace = True)
model_percentage_df.head(10)

Unnamed: 0,model_id,proportion
0,anthropic.claude-3-sonnet-20240229-v1:0,60.0
1,meta.llama3-70b-instruct-v1:0,10.0
2,meta.llama3-8b-instruct-v1:0,10.0
3,mistral.mistral-7b-instruct-v0:2,10.0
4,mistral.mixtral-8x7b-instruct-v0:1,10.0


In [33]:
# Identify the most frequently selected model
most_selected_index = model_percentage_df.proportion.idxmax()
report_template: str = config['report']['model_recommendation']
report: str = report_template.format(
                count=new_results_df.best_match_title.count(),
                model_id=model_percentage_df.iloc[most_selected_index]['model_id'],
                percentage_of_occurrence=model_percentage_df.proportion.max(), 
                total_evaluation_cost=round((new_results_df.input_token_price.sum() + new_results_df.output_token_price.sum()), 4))
result_data = {'model_recommendation': [report]}
results_summary_df = pd.DataFrame(result_data)
recommended_model_fpath = os.path.join(metrics_dir, config['dir']['final_report'])
# Saving to CSV
results_summary_df.to_csv(recommended_model_fpath, index=False)
print(report)

The recommended model with the best match title based on an evaluation of 6 titles is anthropic.claude-3-sonnet-20240229-v1:0, with a 60.0% selection rate. The total cost to run the LLM as a judge evaluation is $0.0368.


In [58]:
merged_df = pd.merge(df_summary, model_percentage_df, on='model_id', how='left')
merged_df.rename(columns={'proportion': 'LLM_as_a_judge_pick_rate'}, inplace=True)
merged_df['LLM_as_a_judge_pick_rate'] = merged_df['LLM_as_a_judge_pick_rate'].fillna("not available")
eval_report_template = config['report']['eval_report_explanation']

# Calculate the evaluation report for each row for the mean cosine, rouge and llm as a judge pick rate
merged_df['eval_report'] = merged_df.apply(lambda row: eval_report_template.format(
    rouge_score=row['mean_rouge_l_f1_score'],
    cosine_score=row['mean_cosine_similarity'],
    llm_as_a_judge=row['LLM_as_a_judge_pick_rate']
), axis=1)
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]
cols = merged_df.columns.tolist()
idx = cols.index('mean_cosine_similarity')
cols.insert(idx + 1, cols.pop(cols.index('LLM_as_a_judge_pick_rate')))
cols.insert(idx + 2, cols.pop(cols.index('eval_report')))
merged_df = merged_df[cols]
merged_df.to_csv(summary_metrics_file_path, index=False)
merged_df

Unnamed: 0,model_id,latency_seconds,completion_token_count,prompt_token_count,input_token_price,output_token_pricing,mean_rouge_l_f1_score,mean_cosine_similarity,LLM_as_a_judge_pick_rate,eval_report,p95_latency_seconds,avg_cost_per_txn,p95_cost_per_txn,p95_completion_token_count,p95_prompt_token_count,count,overall_report
0,anthropic.claude-3-sonnet-20240229-v1:0,1.122653,13,441,0.001324,0.000202,0.380583,0.788797,60.0,"The mean ROUGE-L score is 0.380583, mean Cosin...",1.814916,0.001526,0.00248,24.55,703.85,10,The average inference latency for this workloa...
1,anthropic.claude-3-haiku-20240307-v1:0,0.762285,14,441,0.00011,1.8e-05,0.297117,0.765014,not available,"The mean ROUGE-L score is 0.297117, mean Cosin...",0.962902,0.000129,0.000205,23.1,703.85,10,The average inference latency for this workloa...
2,mistral.mixtral-8x7b-instruct-v0:1,0.675825,5,339,0.000153,4e-06,0.389167,0.761117,10.0,"The mean ROUGE-L score is 0.389167, mean Cosin...",0.736683,0.000157,0.000166,6.75,358.5,6,The average inference latency for this workloa...
3,meta.llama3-8b-instruct-v1:0,0.754584,6,347,0.000139,4e-06,0.389167,0.753589,10.0,"The mean ROUGE-L score is 0.389167, mean Cosin...",0.766312,0.000143,0.000151,7.75,366.5,6,The average inference latency for this workloa...
4,meta.llama3-70b-instruct-v1:0,0.964386,7,347,0.00092,2.4e-05,0.396567,0.750383,10.0,"The mean ROUGE-L score is 0.396567, mean Cosin...",1.114516,0.000944,0.001002,8.75,366.5,6,The average inference latency for this workloa...
5,amazon.titan-text-express-v1,1.126562,7,385,0.000309,1.2e-05,0.4521,0.720899,not available,"The mean ROUGE-L score is 0.4521, mean Cosine ...",1.299742,0.000321,0.000508,11.65,611.9,10,The average inference latency for this workloa...
6,mistral.mistral-7b-instruct-v0:2,0.984149,35,401,6e-05,7e-06,0.327533,0.631595,10.0,"The mean ROUGE-L score is 0.327533, mean Cosin...",1.88751,6.7e-05,0.000113,95.5,627.9,10,The average inference latency for this workloa...
7,meta.llama2-13b-chat-v1,1.293597,15,488,0.000366,1.6e-05,,,not available,"The mean ROUGE-L score is nan, mean Cosine Sim...",1.69403,0.000382,0.000499,25.75,631.3,4,The average inference latency for this workloa...


In [None]:
merged_relevant = merged_df[['model_id', 'LLM_as_a_judge_pick_rate', 'mean_cosine_similarity', 'mean_rouge_l_f1_score']]

new_results_df = new_results_df.rename(columns={'selected_model': 'model_id'})

# Merging new_results_df with the relevant columns from merged_df
result_df = pd.merge(new_results_df, merged_relevant, on='model_id', how='left')

# Renaming the 'model_id' column back to 'selected_model'
result_df = result_df.rename(columns={'model_id': 'selected_model'})

# Display the first 10 rows of the resulting DataFrame
result_df.head(10)

### Compute the Recommended LLM based on a combined score of `Subjective` and `Quantitative` evaluation using `LLM as a judge`, `ROUGE` and `Cosine Similarity` metrics

In [49]:
# The pick rate column holds the string "not available" for models that the judge never selected,
# so coerce it to numeric (non-numeric values become NaN), fill NaN with 0, and scale from percent to a fraction
merged_df['LLM_as_a_judge_pick_rate'] = (
    pd.to_numeric(merged_df['LLM_as_a_judge_pick_rate'], errors='coerce').fillna(0) / 100
)
merged_df

In [36]:
best_llm_judge_model = merged_df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['model_id']
best_llm_judge_model

'anthropic.claude-3-sonnet-20240229-v1:0'

In [37]:
best_rouge_score_model = merged_df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['model_id']
best_rouge_score_model

'amazon.titan-text-express-v1'

In [38]:
best_cosine_model = merged_df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['model_id']
best_cosine_model

'anthropic.claude-3-sonnet-20240229-v1:0'

In [39]:
best_llm_judge_model_value = merged_df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['LLM_as_a_judge_pick_rate']
best_llm_judge_model_value

0.6

In [40]:
def recommend_model(df) -> str:
    """
    Recommend a model based on the three evaluation criteria: LLM as a judge pick rate,
    ROUGE-L F1 score, and Cosine Similarity.
    If the same model has the highest score on all three criteria, it is recommended by all three.
    Otherwise, each pair of criteria is checked for agreement. If no two criteria agree,
    the best model for each individual criterion is reported.
    """
    try: 
        evaluation_report: Optional[str] = None
        # model with the highest score using LLM as a judge eval
        best_llm_judge_model = df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['model_id']
        best_llm_judge_model_value = df.sort_values(by='LLM_as_a_judge_pick_rate', ascending=False).iloc[0]['LLM_as_a_judge_pick_rate']
        # model with the highest score using the ROUGE f1 score
        best_rouge_score_model = df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['model_id']
        best_rouge_score_model_value = df.sort_values(by='mean_rouge_l_f1_score', ascending=False).iloc[0]['mean_rouge_l_f1_score']
        # model with the highest score using the Cosine Similarity score
        best_cosine_model = df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['model_id']
        best_cosine_model_value = df.sort_values(by='mean_cosine_similarity', ascending=False).iloc[0]['mean_cosine_similarity']

        # check if all three models that are selected on the three criteria are the same
        if best_llm_judge_model == best_rouge_score_model == best_cosine_model:
            evaluation_report = (
                f"As per all three evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, Cosine Similarity of {best_cosine_model_value} and ROUGE score of {best_rouge_score_model_value}."
            )
        # Otherwise, check whether any two of the three criteria agree on the same model
        elif best_llm_judge_model == best_rouge_score_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, and ROUGE score of {best_rouge_score_model_value}."
            )
        elif best_llm_judge_model == best_cosine_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_llm_judge_model}' is the best recommended model for your workload "
                f"based on the LLM as a judge pick rate of {best_llm_judge_model_value*100}%, and Cosine Similarity score of {best_cosine_model_value}."
            )
        elif best_rouge_score_model == best_cosine_model:
            evaluation_report = (
                f"As per the two evaluation criteria, '{best_rouge_score_model}' is the best recommended model for your workload "
                f"based on the Cosine Similarity of {best_cosine_model_value} and ROUGE score of {best_rouge_score_model_value}."
            )
        # If none of the combinations match, recommend based on each individual criterion
        else:
            evaluation_report = (
                f"Based on each evaluation criterion, the following models are best recommended. "
                f"LLM as a judge selects {best_llm_judge_model} as the best recommended model. "
                f"Cosine Similarity score selects {best_cosine_model} as the best recommended model. "
                f"ROUGE score selects {best_rouge_score_model} as the best recommended model."
            )
    except Exception as e:
        # e.g. an empty DataFrame or a missing metric column
        logger.error(f"The best recommended model could not be provided: {e}")
        evaluation_report = None
    return evaluation_report
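
The agreement logic described in the docstring can also be expressed as a vote count over the per-metric winners. The following is a minimal sketch on a toy DataFrame, not the notebook's implementation: the helper names `winners_by_metric` and `consensus`, the model IDs, and the scores are all hypothetical; only the three metric column names and `model_id` come from the function above.

```python
from collections import Counter

import pandas as pd

# Metric columns used by recommend_model (names taken from the cell above)
METRICS = ["LLM_as_a_judge_pick_rate", "mean_rouge_l_f1_score", "mean_cosine_similarity"]


def winners_by_metric(df: pd.DataFrame, metrics=METRICS) -> dict:
    """Map each metric to the model_id of the row with the highest value for it."""
    return {m: df.loc[df[m].idxmax(), "model_id"] for m in metrics}


def consensus(winners: dict) -> tuple:
    """Return (model_id, votes) for the model that leads on the most metrics."""
    return Counter(winners.values()).most_common(1)[0]


# Hypothetical scores, for illustration only
toy_df = pd.DataFrame({
    "model_id": ["model-a", "model-b", "model-c"],
    "LLM_as_a_judge_pick_rate": [0.6, 0.3, 0.1],
    "mean_rouge_l_f1_score": [0.20, 0.25, 0.15],
    "mean_cosine_similarity": [0.79, 0.70, 0.65],
})

winners = winners_by_metric(toy_df)
model, votes = consensus(winners)
print(winners)
print(f"{model} leads on {votes} of {len(winners)} criteria")
```

A vote count of 3 corresponds to the "all three criteria" branch, 2 to the "two criteria" branches, and 1 means every criterion picked a different model, so no single recommendation applies.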

In [41]:
# get the overall model recommendation based on the three evaluation criteria
recommendation = recommend_model(merged_df)
# Save the overall model evaluation recommendation to a CSV file
overall_eval_report_fpath: str = os.path.join(config['dir']['metrics'], config['dir']['overall_eval_report'])
overall_eval_data = {'overall_eval_recommendation': [recommendation]}
overall_eval_df = pd.DataFrame(overall_eval_data)
overall_eval_df.to_csv(overall_eval_report_fpath, index=False)
print(recommendation)

As per the two evaluation criteria, 'anthropic.claude-3-sonnet-20240229-v1:0' is the best recommended model for your workload based on the LLM as a judge pick rate of 60.0%, and Cosine Similarity score of 0.788797.
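
The report above interpolates the raw floats, so the number of digits shown depends on the values themselves. If fixed precision is preferred, Python's format specifications can be applied inside the f-strings. This is a small sketch assuming one decimal place for the pick rate and four for the similarity score; `format_scores` is a hypothetical helper, not part of the notebook.

```python
def format_scores(pick_rate: float, cosine: float) -> str:
    """Format a pick rate as a percentage and a similarity score to four decimals."""
    # ':.1%' multiplies by 100 and appends '%'; ':.4f' rounds to four decimal places
    return f"pick rate of {pick_rate:.1%}, Cosine Similarity score of {cosine:.4f}"


# Values taken from the recommendation printed above
formatted = format_scores(0.6, 0.788797)
print(formatted)
```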
