# Bring your own dataset

---------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine.*

### This part of our solution design includes 

- Creating your own `fmbench` compatible dataset from a [HuggingFace dataset](https://huggingface.co/docs/datasets/en/index).

- Creating a prompt payload template compatible with your dataset.

- Uploading the dataset and the prompt payload template to Amazon S3, from where `fmbench` can use them.

In [11]:
# if interactive mode is set to no -> pick up fmbench from the Python installation path
# if interactive mode is set to yes -> pick up fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pick up fmbench from the current path (one level above this notebook)
# the premise is that a non-interactive run can only happen through main.py, which sets interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [12]:
import pandas as pd
from fmbench.utils import *
from fmbench.globals import *
from datasets import load_dataset
config = load_config(CONFIG_FILE)

region_name=us-east-1
role_arn_from_env=None, using current sts caller identity to set arn_string
the sts role is an assumed role, setting arn_string to arn:aws:iam::471112568442:role/fmbench-us-east-1-role


## Convert HuggingFace dataset to jsonl format

`fmbench` works with datasets in the [`JSON Lines`](https://jsonlines.org/) format. So here we show how to convert a HuggingFace dataset into JSON lines format.

Set `ds_id` to the HuggingFace dataset id, for example [`THUDM/LongBench`](https://huggingface.co/datasets/THUDM/LongBench), [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/squad_v2), [`banking77`](https://huggingface.co/datasets/banking77) or other text datasets. Set `ds_name` to the dataset configuration name and `ds_split` to the split to use.
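As a minimal sketch of what the JSON Lines format looks like, the snippet below serializes two sample records using only the Python standard library. The helper names `to_jsonl` and `from_jsonl` are hypothetical and exist only for illustration; the notebook itself relies on `pandas` for the conversion.

```python
import json

# Sample records shaped like rows of the banking77 dataset (text + label).
records = [
    {"text": "I am still waiting on my card?", "label": 11},
    {"text": "Why did my card payment not work?", "label": 25},
]


def to_jsonl(rows):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


def from_jsonl(content):
    """Parse JSON Lines content back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Each line is a complete JSON document on its own, which is what allows a consumer to read such a file one record at a time.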

In [13]:
# ds_id: str = "rajpurkar/squad"
# ds_name: str = "plain_text"
# ds_split: str = "train"
# # Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# # size of random subset of the data
# ds_N: int = 100

# another example
# ds_id: str = "THUDM/LongBench"
# ds_name: str = "2wikimqa"
# ds_split: str = "test"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
# ds_N: int = 200

# another example
ds_id: str = "banking77"
ds_name: str = "default"
ds_split: str = "train"
# Take a random subset of the dataframe; adjust the value of `ds_N` below as appropriate.
# ds_N is the size of the random subset of the data
ds_N: int = 100

# ds_id: str = "Open-Orca/OpenOrca"
# ds_name: str = "default"
# ds_split: str = "train"
# # Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# # size of random subset of the data
# ds_N: int = 100

In [14]:
# Load the dataset from huggingface
dataset = load_dataset(ds_id, name=ds_name)

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3080
    })
})

In [16]:
# convert the dataset to a dataframe, which makes it easy to print and to convert to jsonl
df = pd.DataFrame(dataset[ds_split])

# some datasets contain a field called question, we would like to call it
# input to match it to the prompt template
df.rename(columns={"question": "input"}, inplace=True)

In [17]:
df.head()

| | text | label |
|---|---|---|
| 0 | I am still waiting on my card? | 11 |
| 1 | What can I do if my card still hasn't arrived ... | 11 |
| 2 | I have been waiting over a week. Is the card s... | 11 |
| 3 | Can I track my card while it is in the process... | 11 |
| 4 | How do I know if I will get my card, or if it ... | 11 |


Subset the data

In [18]:
print(f"dataset shape before random subset = {df.shape}")
df = df.sample(n=ds_N)
print(f"dataset shape after random subset = {df.shape}")

dataset shape before random subset = (10003, 2)
dataset shape after random subset = (100, 2)


Convert to json lines format

In [19]:
jsonl_content = df.to_json(orient='records', lines=True)
print(jsonl_content[:1000])

{"text":"I'm still expecting the transaction to be finished","label":66}
{"text":"What are the limits to where my card will be accepted?","label":10}
{"text":"I am not sure where my phone is.","label":42}
{"text":"Still waiting on my refund...","label":51}
{"text":"I'm not satisfied with the services that you are providing.  I would like to end my services and delete my account.","label":55}
{"text":"Should I be seeing a fee applied for my money transfer?","label":64}
{"text":"Why did my card payment not work?","label":25}
{"text":"Can I choose when my card is delivered?","label":12}
{"text":"Would I be charged any fees if I added money to my account using an international card?","label":57}
{"text":"There is a payment made with my card that I don't recognize at all.","label":16}
{"text":"How soon do cards arrive after I order them?","label":12}
{"text":"Why did the ATM swallow my card?","label":18}
{"text":"Hey, I have my card, how do I get it to show in the app?","label":13}
{"text":

## Upload the dataset to s3

In [20]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['source_data_prefix']
file_name: str = f"{ds_id}.jsonl"
write_to_s3(jsonl_content, bucket, prefix, "", file_name)

's3://sagemaker-fmbench-read-us-east-1-471112568442/source_data/banking77.jsonl'
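The returned URI suggests that the object key is the configured prefix joined with the file name. The sketch below shows an equivalent upload under that assumption; it is not the `fmbench` implementation of `write_to_s3`. The helpers `build_key` and `upload_text` are hypothetical, and the client is assumed to be a `boto3` S3 client, of which only the `put_object` method is used.

```python
def build_key(prefix: str, file_name: str) -> str:
    """Join an S3 prefix and a file name without doubling slashes."""
    cleaned = prefix.strip("/")
    return f"{cleaned}/{file_name}" if cleaned else file_name


def upload_text(s3_client, body: str, bucket: str, prefix: str, file_name: str) -> str:
    """Upload a string as an S3 object and return its s3:// URI."""
    key = build_key(prefix, file_name)
    # put_object expects bytes (or a file-like object) as the Body
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return f"s3://{bucket}/{key}"


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   upload_text(boto3.client("s3"), jsonl_content, bucket, prefix, file_name)
```

With this sketch, a dataset id that contains a `/`, such as `THUDM/LongBench`, yields a key with an extra path segment (`<prefix>/THUDM/LongBench.jsonl`), which is worth keeping in mind when the file is referenced later.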

## Create a prompt template and upload it to S3
The prompt template is specific to both the model under test and the dataset being used. The variables used in the template, such as `context` and `input`, must exist in the dataset so that the template can be converted into an actual prompt.
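To illustrate why the template variables must exist in the dataset, here is a minimal sketch that fills a template from a single record and fails early when a field is missing. It assumes Python-style `{name}` substitution; the way `fmbench` itself performs the substitution may differ. The template, the record, and the helpers `template_fields` and `render` are hypothetical.

```python
from string import Formatter

# Hypothetical template; real templates also carry model-specific markers.
template = "Context: {context}\n\nQuestion: {input}\n\nAnswer:"


def template_fields(tmpl):
    """Return the set of named placeholders used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(tmpl) if name}


def render(tmpl, record):
    """Fill the template from a dataset record, failing early on missing fields."""
    missing = template_fields(tmpl) - set(record)
    if missing:
        raise KeyError(f"record is missing fields: {sorted(missing)}")
    return tmpl.format(**record)


record = {
    "context": "Paris is the capital of France.",
    "input": "What is the capital of France?",
}
print(render(template, record))
```

A record from `banking77`, which only has `text` and `label`, would fail this check against a template that uses `{context}` and `{input}`, so either the columns need renaming or the template needs different variables.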

In [21]:
# dictionary containing the prompt templates; it is keyed by the dataset id,
# which forces you to explicitly add your dataset here, otherwise
# no new prompt template will be uploaded and it won't accidentally
# end up overwriting an existing prompt template
prompt_template = {}

In [22]:
# LongBench
prompt_template['THUDM-LongBench-llama2-mistral'] = """<s>[INST] <<SYS>>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>

```
{context}
```

Question: {input}

[/INST]
Answer:
"""

In [23]:
# Open Orca
prompt_template['Open-Orca-OpenOrca-llama2-mistral'] = """<s>[INST] <<SYS>>

{system_prompt}

<</SYS>>

Context and task: {input}

[/INST]
"""

In [24]:
prompt_template['Open-Orca-OpenOrca-llama3'] = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{system_prompt}

Context and task: {input} 

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

In [25]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['prompt_template_dir']
for k in prompt_template.keys():
    file_name: str = f"prompt_template_{k}.txt"
    print(f"writing {file_name} to s3://{bucket}/{prefix}/{file_name}")
    write_to_s3(prompt_template[k], bucket, prefix, "", file_name)

writing prompt_template_THUDM-LongBench-llama2-mistral.txt to s3://sagemaker-fmbench-read-us-east-1-471112568442/prompt_template/prompt_template_THUDM-LongBench-llama2-mistral.txt
writing prompt_template_Open-Orca-OpenOrca-llama2-mistral.txt to s3://sagemaker-fmbench-read-us-east-1-471112568442/prompt_template/prompt_template_Open-Orca-OpenOrca-llama2-mistral.txt
writing prompt_template_Open-Orca-OpenOrca-llama3.txt to s3://sagemaker-fmbench-read-us-east-1-471112568442/prompt_template/prompt_template_Open-Orca-OpenOrca-llama3.txt


## Scratchpad

### Utility function for converting a line from container log to JSON format

The following is a line from the CloudWatch (CW) log of a model container. It provides information about the model that is not available anywhere else (not in the Model, EndpointConfig, or Endpoint description). This information is often needed to find the low-level settings of the model, which may have been set while compiling the model.

In [26]:
line="""model_id_or_path='/tmp/.djl.ai/download/ae03dd100c208acd82b5dbed563c971de864c408' rolling_batch=<RollingBatchEnum.auto: 'auto'> tensor_parallel_degree=8 trust_remote_code=False enable_streaming=<StreamingEnum.false: 'false'> batch_size=4 max_rolling_batch_size=4 dtype=<Dtype.f16: 'fp16'> revision=None output_formatter=None waiting_steps=None is_mpi=False draft_model_id=None spec_length=0 neuron_optimize_level=None enable_mixed_precision_accumulation=False enable_saturate_infinity=False n_positions=4096 unroll=None load_in_8bit=False low_cpu_mem_usage=False load_split_model=True context_length_estimate=None amp='f16' quantize=None compiled_graph_path=None task=None save_mp_checkpoint_path=None group_query_attention=None model_loader=<TnXModelLoaders.tnx: 'tnx'> rolling_batch_strategy=<TnXGenerationStrategy.continuous_batching: 'continuous_batching'> fuse_qkv=False on_device_embedding=False attention_layout=None collectives_layout=None cache_layout=None partition_schema=None all_reduce_dtype=None cast_logits_dtype=None"""
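import re
import json

# split on a space only when the next character is not a quote, so that
# enum representations such as <Dtype.f16: 'fp16'> stay in one piece
pattern = r' (?=[^\'"])'

# Split the string using the pattern
result = re.split(pattern, line)
print("\n".join(result))

# build a dictionary of parameter name -> raw string value
params = {}
for kv in result:
    # split on the first "=" only, in case a value itself contains "="
    k, v = kv.split('=', 1)
    params[k] = v
print(json.dumps(params, indent=2, default=str))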

model_id_or_path='/tmp/.djl.ai/download/ae03dd100c208acd82b5dbed563c971de864c408'
rolling_batch=<RollingBatchEnum.auto: 'auto'>
tensor_parallel_degree=8
trust_remote_code=False
enable_streaming=<StreamingEnum.false: 'false'>
batch_size=4
max_rolling_batch_size=4
dtype=<Dtype.f16: 'fp16'>
revision=None
output_formatter=None
waiting_steps=None
is_mpi=False
draft_model_id=None
spec_length=0
neuron_optimize_level=None
enable_mixed_precision_accumulation=False
enable_saturate_infinity=False
n_positions=4096
unroll=None
load_in_8bit=False
low_cpu_mem_usage=False
load_split_model=True
context_length_estimate=None
amp='f16'
quantize=None
compiled_graph_path=None
task=None
save_mp_checkpoint_path=None
group_query_attention=None
model_loader=<TnXModelLoaders.tnx: 'tnx'>
rolling_batch_strategy=<TnXGenerationStrategy.continuous_batching: 'continuous_batching'>
fuse_qkv=False
on_device_embedding=False
attention_layout=None
collectives_layout=None
cache_layout=None
partition_schema=None
all_reduce_d