## Step 1: Invoke Bedrock Models to get _Inferences_ on a user-provided Dataset
---

This notebook does the following:

1. Generates inferences on a user-provided dataset using foundation models on Amazon Bedrock.

1. Uses [LiteLLM](https://www.litellm.ai/) as the interface to the Bedrock API.

1. Uses `Ray` to run the inference calls asynchronously.

1. Records per-request latency (summarized as `p50` and `p95`), `prompt token counts`, and `completion token counts`.

1. Saves the combined _model responses_ to the user questions from the source dataset in an `all_results.csv` file, which the later _evaluation step_ consumes.
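
The wide layout of the combined results can be pictured with a small pure-Python sketch. It assumes that every model contributes its own `<model_id>-response` style columns for the same question; the helper name `combine_by_question`, the column name `question`, and the model ids `model-a` and `model-b` are hypothetical and only serve the illustration. The notebook itself performs this step with `pd.merge`.

```python
from typing import Dict, List


def combine_by_question(rows: List[Dict], question_col: str) -> List[Dict]:
    """Fold per-model result rows into one wide row per question (hypothetical helper)."""
    combined: Dict[str, Dict] = {}
    for row in rows:
        key = row[question_col]
        # rows that share a question are merged into a single dictionary
        combined.setdefault(key, {}).update(row)
    return list(combined.values())


# illustrative rows, not real model output
rows = [
    {"question": "What is entropy?", "model-a-response": "A measure of disorder."},
    {"question": "What is entropy?", "model-b-response": "Energy dispersal."},
]
wide = combine_by_question(rows, "question")
print(wide)
```

Each question ends up as one row that carries the answers of all models side by side, which is the shape the evaluation step reads.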

In [19]:
# import the libraries
import os
import ray
import json
import yaml
import time
import boto3
import logging
import pandas as pd
from pathlib import Path
from functools import reduce
from litellm import completion
from typing import Dict, List, Optional

In [20]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [21]:
# initialize the ray service to run async calls in parallel to bedrock easily
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-06-06 11:49:44,418	INFO worker.py:1749 -- Started a local Ray instance.


Python version: 3.11.7
Ray version: 2.11.0


In [22]:
# global constants
CONFIG_FILE_PATH = "config.yaml"

# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-06-06 11:49:46,468] p72168 {2927026569.py:8} INFO - config read from config.yaml -> {
  "app_name": "llm-as-a-judge-eval-pipeline",
  "aws": {
    "region": "us-east-1"
  },
  "run_steps": {
    "1_get_inference.ipynb": true,
    "2_get_llm_as_a_judge_eval.ipynb": true
  },
  "pdf_dir_info": {
    "data_dir": "data",
    "dataset_dir": "source_data",
    "dataset_file_name": "data.csv",
    "metrics": "metrics",
    "llm_as_a_judge_dir": "eval_completions",
    "prompt_dir": "prompt_template",
    "llm_as_a_judge_completions": "llm_as_a_judge_completions.csv",
    "raw_llm_as_a_judge_completions": "raw_llm_responses.csv",
    "llm_as_a_judge_comparisons": "llm_as_a_judge_comparisons.csv",
    "llm_comparisons_txt": "llm_as_a_judge_comparisons.txt",
    "llm_as_a_judge_pick_rate": "llm_as_a_judge_pick_rate.csv",
    "eval_prompt_template": "llama3_eval_prompt.txt",
    "prompt_template": "prompt_template.txt",
    "processed_eval_prompts": "processed_eval_prompts.csv",
    "infere

In [23]:
# initialize all global variables that are used across this notebook hydrated from the `config.yaml` file

# name of your source xlsx/xls/csv file 
FILE_NAME: str = config['pdf_dir_info']['dataset_file_name']
# data directory
DATA_DIR: str = config['pdf_dir_info']['data_dir']
FILE_RELATIVE_PATH: str = os.path.join(config['pdf_dir_info']['dataset_dir'], FILE_NAME)
INPUT_FPATH: str = os.path.join(DATA_DIR, FILE_RELATIVE_PATH)
# result files
ALL_RESULTS_FPATH = os.path.join(DATA_DIR, config['pdf_dir_info']['all_results_file_name'])
INFERENCE_LATENCY_SUMMARY_FPATH = os.path.join(DATA_DIR, config['pdf_dir_info']['inference_latency_summary_fname'])
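# each entry of `bedrock_fms_to_test` is a dictionary, so extract the model ids from it
bedrock_model_ids: List[str] = [d['model_id'] for d in config['bedrock_fms_to_test']]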


### Get Inference for the given dataset
---

This portion of the notebook uses `Ray` to make asynchronous calls to `LiteLLM`, which returns a model inference for each user question in the given dataset.
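
The inference function handles two prompt layouts: a dataset with separate system and user prompt columns, and a dataset where a single column holds the entire prompt. A minimal sketch of that branching, in isolation from Bedrock, might look as follows. The helper name `build_messages` and the column names `question` and `system` are hypothetical; in the notebook the column names come from `dataset_info` in `config.yaml`.

```python
from typing import Dict, List, Optional


def build_messages(row: Dict, user_col: str, system_col: Optional[str]) -> List[Dict]:
    """Build a chat-style message list from one dataset row (hypothetical helper)."""
    if system_col is not None:
        # separate system and user prompts are provided in the dataset
        return [
            {"role": "system", "content": row.get(system_col)},
            {"role": "user", "content": row.get(user_col)},
        ]
    # otherwise the user column holds the entire prompt payload
    return [{"role": "user", "content": row.get(user_col)}]


# illustrative row, not taken from the source dataset
row = {"question": "What is a catalyst?", "system": "Answer in one sentence."}
print(build_messages(row, "question", "system"))
print(build_messages(row, "question", None))
```

The resulting list is what gets passed as `messages` to the `completion` call.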

In [24]:
def generate_task_inference(model_id: str, input: dict) -> Dict:
    """
    This function takes in a dictionary (which contains information on the user data and prompts) 
    to generate inference using a bedrock model id, and returns a dictionary containing the model 
    completion, and latency (in seconds).
    """
    print(f"input being parsed: {input}")
    # initializing the user and system prompts to None; they are filled in
    # if those are provided in the dataset
    user_prompt: Optional[str] = None
    system_prompt: Optional[str] = None
    # represents the service name
    service_name: str = "bedrock"
    # represents creating the bedrock model to invoke the litellm api for response for titan, llama and claude
    bedrock_model: str = f"{service_name}/{model_id}"
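    # represents the current aws region; fall back to the region in the config file
    # when the boto3 session has no default region configured
    aws_region = boto3.Session().region_name or config['aws']['region']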
    # if the user and system role prompts are in the df, prompt remains a dict
    # else, it uses user prompt col as prompt payload for inference
    if config['dataset_info']['system_prompt_col'] is not None:
        prompt: Dict = input
        logger.info(f"both user and system prompts provided...")
    else:
        # this column (mentioned in the config file by the user) contains the entire
        # prompt payload that is used to generate inference
        prompt: str = input.get(config['dataset_info']['user_question_col'])
        print(f"prompt: {prompt}")
    # initialize the response dict
    ret = dict(prompt=prompt,
               completion=None,
               model_id=model_id,
               time_taken_in_seconds=None,
               prompt_token_count=None,
               completion_token_count=None,
               exception=None)
    # custom messages formatting for when the user/system roles are given together
    if config['dataset_info']['system_prompt_col'] is not None:
        user_prompt = prompt.get(config['dataset_info']['user_question_col'])
        system_prompt = prompt.get(config['dataset_info']['system_prompt_col'])
        messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        ]
        print(f"messages: {messages}")
    else:
        body = ret['prompt']
        messages=[{ "content": body, "role": "user"}]
    # set the env var for aws_region
    os.environ["AWS_REGION_NAME"] = aws_region 
    print(f"refined_prompt: {prompt}")
    try:
        print(f"Invoking {bedrock_model}......")
        response = completion(model=bedrock_model,
                              messages=messages,
                              temperature=config['inference_parameters']['temperature'],
                              max_tokens=config['inference_parameters']['max_tokens'],
                              caching=config['inference_parameters']['caching'])
        # iterate through the entire model response
        for idx, choice in enumerate(response.choices):
            print(f"choice {idx+1} of {len(response.choices)} ")
            # extract the message and the message's content from litellm
            if choice.message and choice.message.content:
                # extract the response from the dict
                ret["completion"] = choice.message.content.strip()
        # record the number of prompt and completion tokens
        # (this is the same structure for embeddings and text generation models on Amazon Bedrock)
        ret['prompt_token_count'] = response.usage.prompt_tokens
        ret['completion_token_count'] = response.usage.completion_tokens
        # Extract latency in seconds
        latency_ms = response._response_ms
        ret['time_taken_in_seconds']  = latency_ms / 1000
    except Exception as e:
        print(f"Exception occurred during invoking {model_id}, exception={e}")
        # store the message rather than the exception object so the result serializes cleanly
        ret['exception'] = str(e)
    return ret

In [25]:
@ray.remote
def async_generate_task_inference(input: Dict, model_id: str) -> Dict:
    resp = generate_task_inference(model_id, input)
    resp_this_model = {"model_id": model_id,
                       f"{model_id}-response": resp['completion'],
                       f"{model_id}-time_taken_in_seconds": resp['time_taken_in_seconds'],
                       f"{model_id}-prompt_token_count": resp['prompt_token_count'],
                       f"{model_id}-completion_token_count": resp['completion_token_count'],
                       f"{model_id}-exception": resp['exception']}
    return input | resp_this_model

In [26]:
logger.info(f"File name to be processed: {INPUT_FPATH}")
data_file = Path(INPUT_FPATH)
if data_file.suffix == '.csv':
    logger.info(f"processing the csv file: {data_file}")
    original_eval_df = pd.read_csv(data_file)
elif data_file.suffix in ['.xls', '.xlsx']:
    logger.info(f"processing the xls/xlsx file: {data_file}")
    original_eval_df = pd.read_excel(data_file)
else:
    raise ValueError(f"Unsupported file format: {data_file.suffix}")
logger.info(f"input data frame shape is {original_eval_df.shape}")
# drop the columns that have all 'NaN' values
original_eval_df = original_eval_df.dropna(axis=1, how='all')
original_eval_df.head(10)

[2024-06-06 11:50:00,258] p72168 {3661773823.py:1} INFO - File name to be processed: data/source_data/data.csv
[2024-06-06 11:50:00,260] p72168 {3661773823.py:4} INFO - processing the csv file: data/source_data/data.csv
[2024-06-06 11:50:00,317] p72168 {3661773823.py:11} INFO - input data frame shape is (10, 2)


Unnamed: 0,data_useruser_input,model_1
0,Human: You are an assistant for question-answe...,The Heisenberg uncertainty principle states th...
1,Human: You are an assistant for question-answe...,The Schrödinger equation is a fundamental equa...
2,Human: You are an assistant for question-answe...,The greenhouse effect is a natural process tha...
3,Human: You are an assistant for question-answe...,"When light shines on a metal, electrons can be..."
4,Human: You are an assistant for question-answe...,"Modern atomic models, based on quantum mechani..."
5,Human: You are an assistant for question-answe...,A catalyst is a substance that can be added to...
6,Human: You are an assistant for question-answe...,The second law of thermodynamics states that t...
7,Human: You are an assistant for question-answe...,The phenomenon of nuclear fission. Fission occ...
8,Human: You are an assistant for question-answe...,Classical mechanics describes the physics of m...
9,Human: You are an assistant for question-answe...,If you touch a container that holds an endothe...


In [27]:
original_eval_list = json.loads(original_eval_df.to_json(orient='records'))
original_eval_list


[{'data_useruser_input': 'Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don\'t know the answer just say that you don\'t know. Use three sentences maximum and keep the answer concise.\n\n```\nThe Heisenberg uncertainty principle is a fundamental principle in quantum mechanics that states that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle, such as position and momentum, can be known simultaneously. This principle arises from the wave-particle duality of quantum particles and has profound implications for our understanding of the behavior of matter at the atomic and subatomic scales\n```\n\nQuestion: What is the Heisenberg uncertainty principle?\n\nAssistant:',
  'model_1': 'The Heisenberg uncertainty principle states that the position and momentum of a particle cannot be measured precisely at the same tim


In [None]:
# list of the bedrock model ids that are used in generating inferences
bedrock_model_ids: List[str] =[d['model_id'] for d in config['bedrock_fms_to_test']]
bedrock_model_ids

### Run the inferences to get model responses in parallel using `Ray`
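
Batched parallel inference comes down to two pieces: splitting the inputs into sublists of at most `n` items (the role played by `parallel_inference_count`), and submitting a whole sublist before waiting on any single result. The sketch below shows that shape with the standard library's `ThreadPoolExecutor` as a stand-in for Ray, so it runs without a cluster; the helper names `chunk` and `run_in_batches` are hypothetical. With Ray, the equivalent pattern is to gather the object references returned by `.remote(...)` into a list and pass that list to `ray.get`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk(items: List, n: int) -> List[List]:
    """Split items into consecutive sublists of at most n elements."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def run_in_batches(items: List, n: int, task: Callable) -> List:
    """Run task over items, at most n at a time, preserving input order."""
    results = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        for batch in chunk(items, n):
            # submit the whole batch before waiting on any single result
            futures = [pool.submit(task, item) for item in batch]
            results.extend(f.result() for f in futures)
    return results


print(chunk(list(range(7)), 3))
print(run_in_batches(list(range(7)), 3, lambda x: x * x))
```

Waiting on each call immediately after submitting it would process the items one by one, so the submit-then-collect order is what provides the concurrency.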

In [None]:
erroneous_count: int = 0
resp_list = []
n: int = config['parallel_inference_count']
st_overall = time.perf_counter()
# Iterate over each bedrock model ID
for model_id in bedrock_model_ids:
    logger.info(f"going to get inference from model={model_id}")
    list_of_lists = [original_eval_list[i * n:(i + 1) * n] for i in range((len(original_eval_list) + n - 1) // n )]
    st = time.perf_counter()
    for idx, sublist in enumerate(list_of_lists):
        logger.info(f"processing sublist={idx+1}/{len(list_of_lists)} for model_id={model_id}")
        # submit every input in the sublist first so the remote calls run concurrently;
        # calling ray.get() right after each .remote() would block and serialize them
        futures = [async_generate_task_inference.remote(input, model_id) for input in sublist]
        for input, future in zip(sublist, futures):
            try:
                resp_list.append(ray.get(future))
            except Exception as e:
                logger.error(f"Error processing input: {input} for model_id={model_id}, error: {e}")
                erroneous_count += 1

    elapsed_time = time.perf_counter() - st
    logger.info(f"total time taken for {len(original_eval_list)} with model={model_id} is {elapsed_time:0.2f}")
elapsed_time = time.perf_counter() - st_overall
logger.info(f"total time taken for {len(original_eval_list)} with models={bedrock_model_ids} is {elapsed_time:0.2f}")
logger.info(f"total erroneous count: {erroneous_count}")

In [None]:
# view some responses generated
resp_list[:5]

In [None]:
df_list = []
for model_id in bedrock_model_ids:
    df_list.append(pd.DataFrame([r for r in resp_list if r['model_id'] == model_id]).drop(['model_id'], axis=1))
# merge the per-model data frames on the user question column, plus the system prompt
# and pre-existing response columns whenever those are set in the config
merge_cols = [config['dataset_info'][k]
              for k in ('user_question_col', 'system_prompt_col', 'pre_existing_response_col')
              if config['dataset_info'][k] is not None]
# keep `df_resp` defined even if the merge fails
df_resp = None
try:
    df_resp = reduce(lambda x, y: pd.merge(x, y, on=merge_cols), df_list)
    logger.info(f"shape of response data frame={df_resp.shape}")
except Exception as e:
    logger.error(f"df was not merged: {e}")

In [None]:
# view the data frame
df_resp.head(10)

In [None]:
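# get the original/target responses, if any, and merge them with the current df;
# default to the un-merged responses so that `df_resp_all` is always defined
df_resp_all = df_resp
try: 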
    if df_resp is not None and config['dataset_info']['pre_existing_response_col'] is not None:
        if config['dataset_info']['system_prompt_col'] is not None:
            df_resp_all = pd.merge(left=df_resp, right=original_eval_df, how="left",
                                left_on=[config['dataset_info']['user_question_col'], 
                                            config['dataset_info']['system_prompt_col'], 
                                            config['dataset_info']['pre_existing_response_col']], 
                                right_on=[config['dataset_info']['user_question_col'], 
                                            config['dataset_info']['system_prompt_col'], 
                                            config['dataset_info']['pre_existing_response_col']])
        else:
            df_resp_all = pd.merge(left=df_resp, right=original_eval_df, how="left", 
                                left_on=[config['dataset_info']['user_question_col'], 
                                            config['dataset_info']['pre_existing_response_col']], 
                                right_on=[config['dataset_info']['user_question_col'], 
                                            config['dataset_info']['pre_existing_response_col']])
except Exception as e:
    logger.error(f"Could not perform the merge with the original data frame: {e}")


In [None]:
# view the current data in the df
df_resp_all.head(10)

### Record the `p50` and `p95` inference latencies in a `txt` file
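
For readers who want to see what the `p50` and `p95` figures mean numerically, here is a small standard-library sketch of a quantile with linear interpolation between the two nearest ranks. It assumes the default `linear` interpolation of pandas `Series.quantile`; the helper name `quantile_linear` is hypothetical and the latency values are illustrative, not measured results.

```python
import math
from typing import List


def quantile_linear(values: List[float], q: float) -> float:
    """q-th quantile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # fractional position of the quantile within the sorted values
    pos = (len(ordered) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


latencies = [1.0, 2.0, 3.0, 4.0]  # illustrative values, not measured results
p50 = quantile_linear(latencies, 0.5)
p95 = quantile_linear(latencies, 0.95)
print(f"[p50, p95]={[p50, p95]}")
```

With only a handful of requests the `p95` value is dominated by the slowest call, so the summary is more informative on larger datasets.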

In [None]:
time_taken_in_seconds_cols = [c for c in df_resp_all.columns if 'time_taken_in_seconds' in c]
Latency_cols = [c for c in df_resp_all.columns if 'Latency ' in c]
all_latency_cols_of_interest = time_taken_in_seconds_cols + Latency_cols
summary = ""
for c in all_latency_cols_of_interest:
    quantiles = list(df_resp_all[c].quantile([0.5, 0.95]))
    s = f"[p50, p95] for {c}={quantiles}\n"
    summary += s
    logger.info(s)
Path(INFERENCE_LATENCY_SUMMARY_FPATH).write_text(summary)

### Save the overall results to an `all_results.csv` file
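
Before the results are written, the `-response` columns are moved so that they sit right after the user question column, which makes the file easier to read side by side. A minimal sketch of that reordering on a plain list of column names is shown below. The helper name `move_response_columns` and the example column names are hypothetical; unlike the cell that follows, the sketch locates the anchor column after the response columns have been removed.

```python
from typing import List


def move_response_columns(cols: List[str], anchor_col: str, suffix: str = "-response") -> List[str]:
    """Return cols with every column ending in suffix placed right after anchor_col."""
    response_cols = [c for c in cols if c.endswith(suffix)]
    others = [c for c in cols if not c.endswith(suffix)]
    # locate the anchor after removing the response columns, so the position is stable
    at = others.index(anchor_col) + 1
    return others[:at] + response_cols + others[at:]


# illustrative column names
cols = ["q", "ref", "m1-response", "m1-time", "m2-response"]
print(move_response_columns(cols, "q"))
```

With pandas, the returned list can be used to reorder a data frame via `df[move_response_columns(list(df.columns), anchor_col)]`.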

In [None]:
cols = list(df_resp_all.columns)
user_input_index = df_resp_all.columns.get_loc(config['dataset_info']['user_question_col'])
response_cols = [col for col in df_resp_all.columns if col.endswith('-response')]
for col in response_cols:
    cols.pop(cols.index(col))
# Reinsert the response columns right after the user_input column
for col in reversed(response_cols):
    cols.insert(user_input_index + 1, col)
df_resp_all = df_resp_all[cols]
df_resp_all.to_csv(ALL_RESULTS_FPATH, index=False)
df_resp_all.head(10)
