# Generate Data: Gather data, create prompts/payloads of different sizes
---------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium instance*.

### This part of our solution design covers

- downloading and loading our specific dataset

- generating prompts of different sizes and packaging them as payloads. These payloads are later sent to the model endpoints at different concurrency levels to run inference and to generate benchmarking metrics and visualizations.

#### This file generates data from the wikiqa dataset (English version) with prompt lengths of 300 to 4000 tokens, grouped into payload files of different sizes that are sent to the model endpoint during the inference pipeline. You can also generate data from the regular wikiqa dataset in the original LongBench dataset. The notebook focuses on 3 main deliverables:

1. Loading the dataset stored in the data directory.


2. Generating prompts: the notebook converts each loaded record into a prompt based on the input question and records the token length of the prompt, so that it can be sent as part of the payload when running inference on the deployed endpoints.

    - All of the prompts are saved in the data directory in a file named all_prompts.csv.
    

3. Constructing payloads of different sizes.

#### Import all of the necessary libraries below to run this notebook

In [1]:
# If interactive mode is set to "no", pick up fmbench from the Python installation path.
# If interactive mode is set to "yes" or is not defined, pick up fmbench from the current
# path (one level above this notebook).
# The premise is that a non-interactive run can only happen through main.py, which sets
# interactive mode to "no".
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [2]:
import io
import copy
import json
import logging
import itertools
import pandas as pd
from pathlib import Path  # used below to build the local prompt template path
from fmbench.utils import *
from fmbench.globals import *
from typing import Dict, List
import importlib.resources as pkg_resources

config file current -> configs/config-claude-models.yml, None
Loaded config: {'general': {'name': 'fmbench-claude', 'model_name': 'claude'}, 'aws': {'region': 'us-east-1', 'sagemaker_execution_role': 'arn:aws:iam::121797993273:user/ab3', 'bucket': 'sagemaker-fmbench-write-121797993273'}, 'dir_paths': {'data_prefix': 'data', 'prompts_prefix': 'prompts', 'all_prompts_file': 'all_prompts.csv', 'metrics_dir': 'metrics', 'models_dir': 'models', 'metadata_dir': 'metadata'}, 's3_read_data': {'read_bucket': 'sagemaker-fmbench-read-121797993273', 'scripts_prefix': 'scripts', 'script_files': ['hf_token.txt'], 'source_data_prefix': 'source_data', 'source_data_files': ['2wikimqa_e.jsonl', '2wikimqa.jsonl', 'hotpotqa_e.jsonl', 'hotpotqa.jsonl', 'narrativeqa.jsonl', 'triviaqa_e.jsonl', 'triviaqa.jsonl'], 'tokenizer_prefix': 'tokenizer', 'prompt_template_dir': 'prompt_template', 'prompt_template_file': 'prompt_template_claude.txt'}, 'run_steps': {'0_setup.ipynb': True, '1_generate_data.ipynb': True, 

No files found in S3 Bucket: 'sagemaker-fmbench-read-121797993273' with Prefix: 'tokenizer'


CustomTokenizer, based on HF transformers


#### Pygmentize globals.py to view and use any of the globally initialized variables 

#### Set up a logger to log all messages while the code runs

In [3]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [4]:
## config.yml file contains information that is used across this benchmarking environment, 
## such as information about the aws account, prompts, payloads to be used for invocations
config = load_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

[2024-03-25 13:19:11,502] p43628 {2001288519.py:4} INFO - {
  "general": {
    "name": "fmbench-claude",
    "model_name": "claude"
  },
  "aws": {
    "region": "us-east-1",
    "sagemaker_execution_role": "arn:aws:iam::121797993273:user/ab3",
    "bucket": "sagemaker-fmbench-write-121797993273"
  },
  "dir_paths": {
    "data_prefix": "data",
    "prompts_prefix": "prompts",
    "all_prompts_file": "all_prompts.csv",
    "metrics_dir": "metrics",
    "models_dir": "models",
    "metadata_dir": "metadata"
  },
  "s3_read_data": {
    "read_bucket": "sagemaker-fmbench-read-121797993273",
    "scripts_prefix": "scripts",
    "script_files": [
      "hf_token.txt"
    ],
    "source_data_prefix": "source_data",
    "source_data_files": [
      "2wikimqa_e.jsonl",
      "2wikimqa.jsonl",
      "hotpotqa_e.jsonl",
      "hotpotqa.jsonl",
      "narrativeqa.jsonl",
      "triviaqa_e.jsonl",
      "triviaqa.jsonl"
    ],
    "tokenizer_prefix": "tokenizer",
    "prompt_template_dir": "prompt

#### Define the file path for the prompt template

In [5]:
s3_file_path = "/".join([config['s3_read_data']['prompt_template_dir'],
                         config['s3_read_data']['prompt_template_file']])

## download the file from s3 else check locally and use that version
prompt_template_from_s3: str = read_from_s3(config['s3_read_data']['read_bucket'], s3_file_path)

prompt_template_dir = Path(pkg_resources.files(FMBENCH_PACKAGE_NAME), config['s3_read_data']['prompt_template_dir'])
logger.info(f"Using fmbench.{config['s3_read_data']['prompt_template_dir']} directory: {prompt_template_dir}")

if prompt_template_from_s3 is None:
    prompt_template_fpath: str = os.path.join(prompt_template_dir, config['s3_read_data']['prompt_template_file'])
    prompt_template = Path(prompt_template_fpath).read_text()
    logger.info(f"Using the default local prompt template --> {prompt_template}")
else:
    prompt_template = prompt_template_from_s3
    logger.info(f"Using the prompt template from S3 --> {prompt_template}")
prompt_template = prompt_template.strip()

# Calculate the number of tokens in the prompt template
prompt_template_keys = config['datasets']['prompt_template_keys']
args = {}
for k in prompt_template_keys:
    args[k] = ""
empty_prompt_template = prompt_template.format(**args)
logger.info(f"empty prompt template = \"{empty_prompt_template}\"")
empty_prompt_len_in_tokens = count_tokens(empty_prompt_template)

# Log the number of tokens
logger.info(f"prompt template length={empty_prompt_len_in_tokens} tokens")

[2024-03-25 13:19:11,646] p43628 {utils.py:189} ERROR - read_from_s3, An error occurred: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
[2024-03-25 13:19:11,648] p43628 {3708492309.py:8} INFO - Using fmbench.prompt_template directory: /Users/madhurpt/Documents/foundation-model-benchmarking-tool-12/src/fmbench/prompt_template
[2024-03-25 13:19:11,649] p43628 {3708492309.py:13} INFO - Using the default local prompt template --> Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.

```
{context}
```

Question: {input}

Assistant:


[2024-03-25 13:19:11,649] p43628 {3708492309.py:25} INFO - empty prompt template = "Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved cont
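
The template-overhead calculation in the cell above can be illustrated in isolation. This is a minimal sketch under stated assumptions: `count_tokens_whitespace` is a hypothetical stand-in that splits on whitespace, whereas the notebook's `count_tokens` (from `fmbench.utils`) relies on the configured tokenizer, so real counts will differ. The template text and key names are illustrative as well.

```python
# Minimal sketch: measure how many tokens the prompt template itself consumes.
# ASSUMPTION: count_tokens_whitespace is a hypothetical stand-in for fmbench's
# count_tokens; it only splits on whitespace.

def count_tokens_whitespace(text: str) -> int:
    return len(text.split())

# ASSUMPTION: an illustrative template, not the one shipped with fmbench.
template = (
    "Human: Use the context between the markers to answer the question.\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {input}\n\n"
    "Assistant:"
)
template_keys = ["input", "context"]

# Fill every placeholder with an empty string to isolate the template overhead.
empty_args = {k: "" for k in template_keys}
empty_template = template.format(**empty_args)
overhead = count_tokens_whitespace(empty_template)

# The same template filled with real content costs overhead + content tokens.
filled = template.format(input="Who wrote it?", context="A short passage.")
content_tokens = count_tokens_whitespace(filled) - overhead
print(f"template overhead={overhead}, content tokens={content_tokens}")
```

Knowing the overhead is useful when choosing length filters, since every prompt carries at least that many tokens regardless of the question and context.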

In [6]:
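def list_files():
    # list_objects_v2 omits the 'Contents' key when nothing matches the prefix,
    # so default to an empty list instead of raising a KeyError
    response = s3_client.list_objects_v2(Bucket=config['s3_read_data']['read_bucket'], Prefix=config['s3_read_data']['source_data_prefix'])
    return [obj['Key'] for obj in response.get('Contents', [])]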

# List all files in the bucket and prefix
s3_files = list_files()
logger.info(f"s3 paths of the data set -> {s3_files}")

# Log the files you're going to read
logger.info(f"dataset files = {s3_files}")

# Read and concatenate DataFrames

jsonl_files = [file_key for file_key in s3_files if file_key.replace(config['s3_read_data']['source_data_prefix'] + "/", "") in config['s3_read_data']['source_data_files']]
logger.info(f"jsonl_files={jsonl_files}")
# Read and concatenate only the .jsonl files
df = pd.concat([pd.read_json(io.BytesIO(s3_client.get_object(Bucket=config['s3_read_data']['read_bucket'], Key=file_key)['Body'].read()), lines=True) 
                for file_key in jsonl_files])

# Log the source of the dataset and its shape
logger.info(f"dataset read from {jsonl_files}\nhas shape {df.shape}")

[2024-03-25 13:19:11,748] p43628 {963073385.py:7} INFO - s3 paths of the data set -> ['source_data/2wikimqa.jsonl', 'source_data/2wikimqa_e.jsonl', 'source_data/hotpotqa.jsonl', 'source_data/hotpotqa_e.jsonl', 'source_data/narrativeqa.jsonl', 'source_data/triviaqa.jsonl', 'source_data/triviaqa_e.jsonl']
[2024-03-25 13:19:11,748] p43628 {963073385.py:10} INFO - dataset files = ['source_data/2wikimqa.jsonl', 'source_data/2wikimqa_e.jsonl', 'source_data/hotpotqa.jsonl', 'source_data/hotpotqa_e.jsonl', 'source_data/narrativeqa.jsonl', 'source_data/triviaqa.jsonl', 'source_data/triviaqa_e.jsonl']
[2024-03-25 13:19:11,749] p43628 {963073385.py:15} INFO - jsonl_files=['source_data/2wikimqa.jsonl', 'source_data/2wikimqa_e.jsonl', 'source_data/hotpotqa.jsonl', 'source_data/hotpotqa_e.jsonl', 'source_data/narrativeqa.jsonl', 'source_data/triviaqa.jsonl', 'source_data/triviaqa_e.jsonl']
[2024-03-25 13:19:38,808] p43628 {963073385.py:21} INFO - dataset read from ['source_data/2wikimqa.jsonl', 'sou
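
The key filtering and JSON Lines loading performed in the cell above can be tried without S3 access. This is a sketch under stated assumptions: the keys, file names, and records are made up, and the `fake_objects` dictionary stands in for the bytes that `s3_client.get_object(...)['Body'].read()` would return.

```python
import io

import pandas as pd

# ASSUMPTION: illustrative prefix, file names and keys (not the real bucket contents).
prefix = "source_data"
wanted_files = ["hotpotqa.jsonl", "triviaqa.jsonl"]
s3_keys = [
    "source_data/hotpotqa.jsonl",
    "source_data/notes.txt",
    "source_data/triviaqa.jsonl",
]

# Keep only keys whose name, relative to the prefix, is in the configured list.
# str.removeprefix requires Python 3.9 or later.
selected = [k for k in s3_keys if k.removeprefix(prefix + "/") in wanted_files]

# ASSUMPTION: in-memory stand-in for the object bodies downloaded from S3.
fake_objects = {
    "source_data/hotpotqa.jsonl": b'{"input": "q1", "length": 10}\n{"input": "q2", "length": 20}\n',
    "source_data/triviaqa.jsonl": b'{"input": "q3", "length": 30}\n',
}

# lines=True tells pandas that each line is one JSON record.
frames = [pd.read_json(io.BytesIO(fake_objects[k]), lines=True) for k in selected]
df_demo = pd.concat(frames, ignore_index=True)
print(df_demo)
```

Passing `ignore_index=True` gives the combined frame a unique index. The notebook cell omits it, so each file's original row numbers are kept and index values repeat across files.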

#### Preview the first rows of the df to inspect the inputs, contexts, and other fields in the data

In [7]:
df.head()

| | input | context | answers | length | dataset | language | all_classes | _id |
|---|---|---|---|---|---|---|---|---|
| 0 | Where was the wife of Francis I Rákóczi born? | Passage 1:\nWaldrada of Lotharingia\nWaldrada ... | [Ozalj] | 4696 | 2wikimqa | en | | 41ac2a4beb0af8f58d01863a62b90692f7c7d74b5e3a58d9 |
| 1 | Who is Sobe (Sister Of Saint Anne)'s grandchild? | Passage 1:\nJim Ramel Kjellgren\nJim Love Rame... | [John the Baptist] | 4776 | 2wikimqa | en | | 3924e4ac5039ce3fadda49604bfcb0f5238af81774616e53 |
| 2 | Where does the director of film Man At Bath wo... | Passage 1:\nJason Moore (director)\nJason Moor... | [Cahiers du cinéma] | 4274 | 2wikimqa | en | | 2c952e3e1ca394df975103b3135b3c38e0ee16e25d860258 |
| 3 | Do both Beauty And The Bad Man and Wild Child ... | Passage 1:\nBetty Hall\nBeatrice Perin Barker ... | [no] | 8125 | 2wikimqa | en | | aec83da1f2faf6ec8badfd53d632f525c9ef2090d99d1c6c |
| 4 | What is the date of birth of William Paulet, 3... | Passage 1:\nHenry, Lord Paulet\nLord Henry Pau... | [1510] | 4621 | 2wikimqa | en | | 4b28d517ce1c1e3cfec9282ca7b212c1cb87c254781d7c86 |


#### Display basic statistics on the existing dataset: including count, mean, std, min, etc.

In [8]:
logger.info(f"distribution of the length field in the dataset is as follows ->\n{df.describe()}")

[2024-03-25 13:19:38,837] p43628 {1912450148.py:1} INFO - distribution of the length field in the dataset is as follows ->
             length  all_classes
count   1700.000000          0.0
mean    8221.461176          NaN
std     5876.876131          NaN
min      111.000000          NaN
25%     3892.500000          NaN
50%     7131.500000          NaN
75%    10760.000000          NaN
max    36418.000000          NaN
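
Because `all_classes` is empty for every row, its column in the summary above carries no information. One possible refinement, sketched here on a tiny made-up frame (the values and the `df_demo` name are assumptions, not the real dataset), is to describe only the `length` column and to break the summary down per source dataset.

```python
import pandas as pd

# ASSUMPTION: a tiny illustrative frame, not the real dataset.
df_demo = pd.DataFrame({
    "length": [111, 4696, 8125, 36418],
    "all_classes": [None, None, None, None],
    "dataset": ["a", "a", "b", "b"],
})

# describe() on a single column skips the all-empty 'all_classes' column.
length_stats = df_demo["length"].describe()

# Per-dataset summary of the same field.
per_dataset = df_demo.groupby("dataset")["length"].agg(["count", "mean", "max"])

print(length_stats)
print(per_dataset)
```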


### Convert the dataset elements into prompts as payloads for inference purposes

Next, we convert each record in the dataset into a prompt that can be sent to the deployed model endpoints during testing and benchmarking. Each prompt is built from the prompt template and the record's fields, and its token length is stored alongside it in the `prompt_len` column.
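
As a rough illustration of this conversion step, the sketch below fills a template from each row and records the resulting length. `build_prompt` and `count_tokens_whitespace` are hypothetical stand-ins: the notebook uses `process_item` and the tokenizer-based `count_tokens` from `fmbench.utils`, whose implementations are not shown here. The returned dictionary shape (the row fields plus `prompt` and `prompt_len`) is an assumption inferred from the notebook output.

```python
import pandas as pd

# ASSUMPTION: whitespace splitting stands in for the real tokenizer.
def count_tokens_whitespace(text: str) -> int:
    return len(text.split())

# ASSUMPTION: hypothetical stand-in for fmbench's process_item.
def build_prompt(row: pd.Series, keys: list, template: str) -> dict:
    # Pull the fields named by the template keys out of the row.
    args = {k: row[k] for k in keys}
    prompt = template.format(**args)
    return {**args, "prompt": prompt, "prompt_len": count_tokens_whitespace(prompt)}

template = "Context: {context}\nQuestion: {input}\nAnswer:"
df_demo = pd.DataFrame({
    "input": ["Where was she born?", "Who directed it?"],
    "context": ["Passage 1: a short passage.",
                "Passage 1: another, slightly longer passage."],
})

# One dictionary per row, then the length pulled out into its own column.
df_demo["prompt"] = df_demo.apply(
    lambda r: build_prompt(r, ["input", "context"], template), axis=1
)
df_demo["prompt_len"] = df_demo["prompt"].map(lambda p: p["prompt_len"])
print(df_demo[["input", "prompt_len"]])
```

Storing `prompt_len` as a plain column makes later range filtering a simple comparison instead of a lookup inside each dictionary.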

In [9]:
%%time
df['prompt'] = df.apply(lambda row: process_item(row, config['datasets']['prompt_template_keys'], prompt_template), axis=1)
df['prompt_len'] = df.prompt.map(lambda x: x['prompt_len'])

CPU times: user 6min 17s, sys: 763 ms, total: 6min 18s
Wall time: 6min 19s


In [10]:
# Convert DataFrame to a CSV format string
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_data = csv_buffer.getvalue()
all_prompts_file = config['dir_paths']['all_prompts_file']

# Write to S3 using the write_to_s3 function
write_to_s3(csv_data, config['aws']['bucket'], DATA_DIR, config['dir_paths']['prompts_prefix'], all_prompts_file)

# Log where the prompts are saved
logger.info(f"all prompts dataframe of shape {df.shape} saved to s3://{config['aws']['bucket']}/{DATA_DIR}/{os.path.join(config['dir_paths']['prompts_prefix'], all_prompts_file)}")

[2024-03-25 13:26:13,888] p43628 {1138732359.py:11} INFO - all prompts dataframe of shape (1700, 10) saved to s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/all_prompts.csv


In [11]:
## View some of the prompts 
df.head()

| | input | context | answers | length | dataset | language | all_classes | _id | prompt | prompt_len |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Where was the wife of Francis I Rákóczi born? | Passage 1:\nWaldrada of Lotharingia\nWaldrada ... | [Ozalj] | 4696 | 2wikimqa | en | | 41ac2a4beb0af8f58d01863a62b90692f7c7d74b5e3a58d9 | {'input': 'Where was the wife of Francis I Rák... | 8182 |
| 1 | Who is Sobe (Sister Of Saint Anne)'s grandchild? | Passage 1:\nJim Ramel Kjellgren\nJim Love Rame... | [John the Baptist] | 4776 | 2wikimqa | en | | 3924e4ac5039ce3fadda49604bfcb0f5238af81774616e53 | {'input': 'Who is Sobe (Sister Of Saint Anne)'... | 8051 |
| 2 | Where does the director of film Man At Bath wo... | Passage 1:\nJason Moore (director)\nJason Moor... | [Cahiers du cinéma] | 4274 | 2wikimqa | en | | 2c952e3e1ca394df975103b3135b3c38e0ee16e25d860258 | {'input': 'Where does the director of film Man... | 7784 |
| 3 | Do both Beauty And The Bad Man and Wild Child ... | Passage 1:\nBetty Hall\nBeatrice Perin Barker ... | [no] | 8125 | 2wikimqa | en | | aec83da1f2faf6ec8badfd53d632f525c9ef2090d99d1c6c | {'input': 'Do both Beauty And The Bad Man and ... | 13231 |
| 4 | What is the date of birth of William Paulet, 3... | Passage 1:\nHenry, Lord Paulet\nLord Henry Pau... | [1510] | 4621 | 2wikimqa | en | | 4b28d517ce1c1e3cfec9282ca7b212c1cb87c254781d7c86 | {'input': 'What is the date of birth of Willia... | 8710 |


### Convert Prompts into Payloads for inference purposes
------
Next, we prepare the data for model inference by converting the prompts created above into request payloads. Each payload combines a prompt, built from the model's prompt template, with the inference parameters defined in the config file.

These payloads are tailored to the deployed model endpoints. The conversion groups prompts by token length according to the configured filters, so that the benchmark covers a range of payload sizes.

The goal is a set of well-formatted, parameterized payload requests of various sizes, ready to be sent to the model endpoints for inference, with the responses used for further analysis.

In [12]:
# Function to construct a single request payload based on row prompt data and configuration
def construct_request_payload(row, config: Dict) -> Dict:
    
    # Deep copy inference parameters from the config.yml file - feel free to change this based on the model type you are using
    parameters = copy.deepcopy(config['inference_parameters'])
    truncate = parameters.get('truncate', None)
    if truncate == TRUNCATE_POLICY.AT_PROMPT_TOKEN_LENGTH:
        parameters['truncate'] = row['prompt_len']
        
    # Return the constructed payload
    return dict(inputs=row['prompt']['prompt'], parameters=parameters)

# Function to create a payload file from the given dataset for one filter configuration
def create_dataset_payload_file(df: pd.DataFrame, dataset_info: Dict, config: Dict) -> str:
    
    # First, log the dataset existing information
    logger.info(f"going to create a payload file as dataset_info={json.dumps(dataset_info, indent=2)}")
    
    # Filter the DataFrame based on prompt length and language given below for constructing payloads of various sizes
    df['prompt_len_in_range'] = df.prompt.map(lambda x: x['prompt_len'] >= dataset_info['min_length_in_tokens'] and \
                                                        x['prompt_len'] <= dataset_info['max_length_in_tokens'])
    
    # select prompts that fall between the pre-configured threshold lengths and are in the selected language
    # (.copy() returns an independent DataFrame, so adding the 'request' column below
    # does not trigger pandas' SettingWithCopyWarning)
    if 'language' in df.columns:
        df_filtered = df[(df.language == dataset_info['language']) & (df.prompt_len_in_range)].copy()
    else:
        df_filtered = df[df.prompt_len_in_range].copy()
        
    logger.info(f"after filtering for {json.dumps(dataset_info, indent=2)}, shape of dataframe is {df_filtered.shape}")
    if df_filtered.shape[0] == 0:
        logger.error("did not find any prompts in the dataframe that matched the filtering criteria, exiting")
        return None

    # Here, we construct request payloads for each row in the filtered DataFrame
    df_filtered['request'] = df_filtered.apply(lambda r: construct_request_payload(r, config), axis=1)
    logger.info(f"payload request entry looks like this -> {json.dumps(df_filtered['request'].iloc[0], indent=2)}")
    
    # Convert the 'request' column of the filtered DataFrame to a JSON Lines string
    json_lines_str = df_filtered['request'].to_json(orient='records', lines=True)

    lang = dataset_info['language']
    min_len = dataset_info['min_length_in_tokens']
    max_len = dataset_info['max_length_in_tokens']
    file_name = dataset_info['payload_file'].format(lang=lang, min=min_len, max=max_len)

    prompts_path = os.path.join(DATA_DIR, config['dir_paths']['prompts_prefix'])

    ## defining the s3_path these prompts will go to
    s3_file_path = os.path.join(prompts_path, file_name)

    # Write the JSON Lines string to S3
    # get the bucket name, config vars from config file
    write_to_s3(json_lines_str, config['aws']['bucket'], DATA_DIR, config['dir_paths']['prompts_prefix'], file_name)

    logger.info(f"dataset of different payload file structures saved to s3://{config['aws']['bucket']}/{s3_file_path}")
    return f"s3://{config['aws']['bucket']}/{s3_file_path}"
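
The length-range selection inside `create_dataset_payload_file`, applied once per configured filter, can be sketched on toy data as follows. The frame, the filter values, and the `select_rows` helper are assumptions for illustration. `Series.between` is inclusive at both ends by default, which matches the `>=` and `<=` comparisons used above.

```python
import itertools

import pandas as pd

# ASSUMPTION: toy prompts and filter ranges for illustration only.
df_demo = pd.DataFrame({
    "language": ["en", "en", "en", "zh"],
    "prompt_len": [120, 750, 1800, 300],
})
filters = [
    {"language": "en", "min_length_in_tokens": 1, "max_length_in_tokens": 500},
    {"language": "en", "min_length_in_tokens": 500, "max_length_in_tokens": 1000},
    {"language": "en", "min_length_in_tokens": 3000, "max_length_in_tokens": 4000},
]

# ASSUMPTION: hypothetical helper mirroring only the filtering step.
def select_rows(df: pd.DataFrame, f: dict):
    # Inclusive range check on the prompt length.
    in_range = df["prompt_len"].between(f["min_length_in_tokens"],
                                        f["max_length_in_tokens"])
    subset = df[(df["language"] == f["language"]) & in_range].copy()
    # Mirror the notebook: signal "nothing matched" with None.
    return subset if len(subset) else None

# starmap unpacks each (df, filter) tuple into the function's arguments.
items = ((df_demo, f) for f in filters)
results = list(itertools.starmap(select_rows, items))
counts = [None if r is None else len(r) for r in results]
print(counts)
```

Because both bounds are inclusive, a prompt whose length sits exactly on a shared boundary (for example 500 tokens with ranges 1-500 and 500-1000) would be written to both payload files.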

In [13]:
items = ((df, d, config) for d in config['datasets']['filters'])

# This results in the creation of payload files for each dataset
paths: List = list(itertools.starmap(create_dataset_payload_file, items))

[2024-03-25 13:26:13,923] p43628 {2925969590.py:17} INFO - going to create a payload file as dataset_info={
  "language": "en",
  "min_length_in_tokens": 1,
  "max_length_in_tokens": 500,
  "payload_file": "payload_en_1-500.jsonl"
}
[2024-03-25 13:26:13,928] p43628 {2925969590.py:29} INFO - after filtering for {
  "language": "en",
  "min_length_in_tokens": 1,
  "max_length_in_tokens": 500,
  "payload_file": "payload_en_1-500.jsonl"
}, shape of dataframe is (1, 11)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['request'] = df_filtered.apply(lambda r: construct_request_payload(r, config), axis=1)
[2024-03-25 13:26:13,929] p43628 {2925969590.py:37} INFO - payload request entry looks like this -> {
  "inputs": "Human: You are an assistant for question-answering 

In [14]:
print("\n".join([p for p in paths if p]))

s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_1-500.jsonl
s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_500-1000.jsonl
s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_1000-2000.jsonl
s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_2000-3000.jsonl
s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_3000-4000.jsonl
s3://sagemaker-fmbench-write-121797993273/fmbench-claude-ab3/data/prompts/payload_en_305-3997.jsonl
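
The file names above come from formatting the `payload_file` entry of each filter with `lang`, `min`, and `max`. The sketch below shows how such a name and its S3 URI could be assembled. The template string, bucket name, and directory names are hypothetical, and `posixpath.join` is used here as one way to guarantee forward slashes in the key on any operating system.

```python
import posixpath

# ASSUMPTION: hypothetical template, bucket and directory names.
payload_file_template = "payload_{lang}_{min}-{max}.jsonl"
bucket = "example-bucket"
data_dir = "example-run/data"
prompts_prefix = "prompts"

file_name = payload_file_template.format(lang="en", min=500, max=1000)

# A name without placeholders passes through str.format unchanged,
# because unused keyword arguments are ignored.
literal_name = "payload_en_1-500.jsonl".format(lang="en", min=1, max=500)

# S3 keys use forward slashes regardless of the local operating system.
key = posixpath.join(data_dir, prompts_prefix, file_name)
uri = f"s3://{bucket}/{key}"
print(uri)
```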
