# Bring your own dataset

---------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

### This part of our solution design includes 

- Creating your own `fmbench` compatible dataset from a [HuggingFace dataset](https://huggingface.co/docs/datasets/en/index).

- Creating a prompt payload template compatible with your dataset.

- Uploading the dataset and the prompt payload template to Amazon S3, from where `fmbench` can use them.

In [None]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys
import logging
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [None]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
import pandas as pd
from fmbench.utils import *
from fmbench.globals import *
from datasets import load_dataset
config = load_config(CONFIG_FILE)

## Convert HuggingFace dataset to jsonl format

`fmbench` works with datasets in the [`JSON Lines`](https://jsonlines.org/) format, so this section shows how to convert a HuggingFace dataset into that format.

Set `ds_id` to the HuggingFace dataset id, for example [`THUDM/LongBench`](https://huggingface.co/datasets/THUDM/LongBench), [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/squad_v2), [`banking77`](https://huggingface.co/datasets/banking77) or other text datasets.
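
As a rough illustration of the target format, the sketch below writes two made-up records (hypothetical data, not taken from any of the datasets above) as JSON Lines using only the standard library. The notebook itself uses `df.to_json(orient='records', lines=True)` for this step, which is expected to produce the same one-object-per-line layout.

```python
import json

# Hypothetical records, for illustration only
example_records = [
    {"instruction": "Name a primary color.", "context": ""},
    {"instruction": "Summarize the passage.", "context": "Some text."},
]

# JSON Lines: one complete JSON object per line, separated by newlines
example_jsonl = "\n".join(json.dumps(record) for record in example_records)
print(example_jsonl)

# Each line can be parsed on its own, without reading the whole file
example_parsed = [json.loads(line) for line in example_jsonl.splitlines()]
```

Because every line is an independent JSON object, the file can be read record by record.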

#### Standard Text Datasets
---
 
If you are using FMBench to benchmark models on standard text datasets, run the cells below to perform appropriate data preprocessing.

In [None]:
ds_id: str = "rajpurkar/squad"
ds_name: str = "plain_text"
ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
ds_N: int = 100

# another example
# ds_id: str = "THUDM/LongBench"
# ds_name: str = "2wikimqa"
# ds_split: str = "test"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
# ds_N: int = 200

# another example
# ds_id: str = "banking77"
# ds_name: str = "default"
# ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
# ds_N: int = 10000

ds_id: str = "Open-Orca/OpenOrca"
ds_name: str = "default"
ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
ds_N: int = 100

In [None]:
# Helper function to load the first ds_N examples of a dataset as an in-memory
# Hugging Face Dataset, which is later converted into a dataframe and then a jsonl file
import logging
import itertools
from datasets import load_dataset, Dataset

def load_hf_ds_subset(
    ds_id: str,
    ds_split: str = "train",
    ds_N: int = 100
) -> Dataset:
    """
    Loads a subset of a Hugging Face dataset in streaming mode, taking the first ds_N examples.
    
    Args:
        ds_id (str): The dataset identifier, for example 'databricks/databricks-dolly-15k'.
        ds_split (str): The split of the dataset to load (default: 'train').
        ds_N (int): Number of examples to take from the dataset (default: 100).

    Returns:
        Dataset: A Hugging Face Dataset object containing up to ds_N examples.
    """
    logger.info(
        f"Starting to load dataset with id='{ds_id}', split='{ds_split}', "
        f"taking the first {ds_N} examples in streaming mode."
    )
    dataset_stream = load_dataset(
        ds_id,
        split=ds_split,
        streaming=True
    )
    dataset_iter = itertools.islice(dataset_stream, ds_N)
    # Convert to a list
    dataset_list = list(dataset_iter)
    # Convert the list to a regular (in-memory) Hugging Face Dataset
    subset_dataset = Dataset.from_list(dataset_list)
    logger.info(f"Loaded {len(subset_dataset)} examples from the dataset.")
    return subset_dataset
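
Note that the helper above takes the *first* `ds_N` records of the stream rather than a random sample. The sketch below shows the same `itertools.islice` pattern on a hypothetical generator (`fake_stream` is a stand-in for illustration, not part of `datasets` or `fmbench`), so it runs without network access.

```python
import itertools
from typing import Dict, Iterator

def fake_stream() -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for a streaming dataset: yields records lazily, without end."""
    for i in itertools.count():
        yield {"id": i}

# islice stops after 5 items, so the endless stream is never fully consumed
first_five = list(itertools.islice(fake_stream(), 5))
print([record["id"] for record in first_five])  # prints [0, 1, 2, 3, 4]
```

Since only `ds_N` records are materialized, a later `df.sample(n=ds_N)` on the resulting dataframe reorders those same records rather than drawing a new subset.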

### Preprocess the Dolly Dataset
---

In this section of the notebook, we preprocess the Dolly dataset. `databricks/databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. The dataset contains an `instruction` field and a `context` field. However, for some categories, for example open QA, the context is not provided. To make this consistent, we append the context on a new line within the `instruction` field so that the Foundation Models (FMs) can evaluate all records consistently, without empty contexts in some of the calls.

In [None]:
ds_id: str = "databricks/databricks-dolly-15k"
ds_split: str = "train"
# Number of examples to load from the dataset, adjust the value of `ds_N` below as appropriate.
ds_N: int = 100

In [None]:
dataset = load_hf_ds_subset(ds_id, ds_split, ds_N)

In [None]:
# convert the dataset to a dataframe, to print it out and for easy conversion to jsonl
df = pd.DataFrame(dataset)
df.head(10)

In [None]:
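print(f"dataset shape before random subset = {df.shape}")
# never ask for more rows than the dataframe holds, otherwise sample() raises a ValueError
df = df.sample(n=min(ds_N, len(df)))
print(f"dataset shape after random subset = {df.shape}")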

#### Merge the instruction and context
---

Now, we will add the context as a new line to the instruction column in the dataset.

In [None]:
# 1. Append context to instruction only if context is not empty
df["instruction"] = df.apply(
    lambda row: row["instruction"] + "\n" + row["context"] 
                if isinstance(row["context"], str) and row["context"].strip() != "" 
                else row["instruction"],
    axis=1
)
df.drop(columns=["context"], inplace=True)
df.head(10)

In [None]:
logger.info("Preprocessed the Dolly dataset. Going to upload the JSON Lines content to the S3 bucket")
jsonl_content = df.to_json(orient='records', lines=True)
print(jsonl_content[:1000])

In [None]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['source_data_prefix']
file_name: str = f"{ds_id}.jsonl"
write_to_s3(jsonl_content, bucket, prefix, "", file_name)

### Preprocess the OpenOrca Dataset
---

In this section of the notebook, we preprocess the OpenOrca dataset. This dataset contains a `system_prompt` field and a `question` field. However, for some of the questions, the system prompt is not provided. To make this consistent, we add the system prompt as a prefix to the `question` field so that the Foundation Models (FMs) can evaluate all records consistently, without empty system prompts in some of the calls.

In [None]:
ds_id: str = "Open-Orca/OpenOrca"
ds_split: str = "train"
# Number of examples to load from the dataset, adjust the value of `ds_N` below as appropriate.
ds_N: int = 100

In [None]:
dataset = load_hf_ds_subset(ds_id, ds_split, ds_N)

In [None]:
# convert the dataset to a dataframe, to print it out and for easy conversion to jsonl
df = pd.DataFrame(dataset)
df.head(10)

In [None]:
df["question"] = df.apply(
    lambda row: row["system_prompt"] + "\n" + row["question"] 
                if isinstance(row["system_prompt"], str) and row["system_prompt"].strip() != "" 
                else row["question"],
    axis=1
)
df.drop(columns=["system_prompt"], inplace=True)
df.head(10)

In [None]:
logger.info("Preprocessed the OpenOrca dataset. Going to upload the JSON Lines content to the S3 bucket")
jsonl_content = df.to_json(orient='records', lines=True)
print(jsonl_content[:1000])

In [None]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['source_data_prefix']
file_name: str = f"{ds_id}.jsonl"
write_to_s3(jsonl_content, bucket, prefix, "", file_name)

#### Image Datasets
---

If you are using FMBench to benchmark models on an image dataset, run the cells below to load the image dataset and send the data to S3. The `1_generate_data.ipynb` notebook then converts the image column (specified by the user in the configuration file) into `base64`. The `base64` data is used later in the benchmarking test while running inferences against the model endpoint.

In [None]:
# Marqo/marqo-GS-10M DATASET: This is an image dataset without any questions

# import itertools
# from datasets import load_dataset, Dataset

# # ds_id: str = "HuggingFaceM4/WebSight"
# ds_id: str = "Marqo/marqo-GS-10M"
# ds_name: str = "default"
# # ds_name: str = "v0.2"
# ds_split: str = "in_domain"
# ds_N: int = 100

# # Load the dataset in streaming mode so you don't have to load the entire dataset
# dataset = load_dataset(ds_id, name=ds_name, split=ds_split, streaming=True)

# # Take only the first ds_N examples
# dataset_iter = itertools.islice(dataset, ds_N)

# # Convert to a list and then to a regular dataset
# dataset_list = list(dataset_iter)
# dataset = Dataset.from_list(dataset_list)

In [None]:
import itertools
from datasets import load_dataset, Dataset

ds_id: str = "derek-thomas/ScienceQA"
ds_name: str = "default"
ds_split: str = "test"
ds_N: int = 100

# Load the dataset in streaming mode so you don't have to load the entire dataset
dataset = load_dataset(ds_id, name=ds_name, split=ds_split, streaming=True)

# Take only the first ds_N examples
dataset_iter = itertools.islice(dataset, ds_N)

# Convert to a list and then to a regular dataset
dataset_list = list(dataset_iter)
dataset = Dataset.from_list(dataset_list)

logger.info(f"Loaded {len(dataset)} examples")

In [None]:
dataset

In [None]:
# convert the dataset to a dataframe, to print it out and for easy conversion to jsonl
df = pd.DataFrame(dataset)
df.head(10)

In [None]:
# view one of the images in the dataset
df.image[10]

In [None]:
# some datasets contain a field called question, we would like to call it
# input to match it to the prompt template
df.rename(columns={"question": "input"}, inplace=True)
df.head()

In [None]:
import io
import json
from PIL import Image

# PILImageEncoder is a custom JSON encoder that extends the built-in json.JSONEncoder class.
# Its purpose is to enable JSON serialization of PIL (Python Imaging Library) Image objects,
# which are not natively JSON serializable.
#
# It checks whether the object (obj) is an instance of Image.Image (a PIL Image object).
# If it is an Image object, it:
#   a. Creates a BytesIO buffer.
#   b. Saves the image to this buffer in JPEG format.
#   c. Converts the binary data in the buffer to a hexadecimal string.
#   d. Returns a dictionary with two keys: 'hex_data' (the hexadecimal string) and 'format' ('JPEG').
# Any other object is passed on to the default JSONEncoder behavior.

class PILImageEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Image.Image):
            buffered = io.BytesIO()
            obj.save(buffered, format="JPEG")
            hex_data = buffered.getvalue().hex()
            return {
                'hex_data': hex_data,
                'format': 'JPEG'
            }
        return super(PILImageEncoder, self).default(obj)
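
The hexadecimal payload produced by the encoder above can be turned back into bytes and then into `base64`. The following is a minimal sketch using only the standard library; the record contents are stand-in bytes for illustration, and the actual conversion performed in `1_generate_data.ipynb` is not shown in this notebook, so treat this as an assumption about the general approach rather than the `fmbench` implementation.

```python
import base64

# Stand-in bytes for illustration only; a real record would hold full JPEG data
raw_bytes = b"\xff\xd8\xff\xe0 example"
encoded_record = {"hex_data": raw_bytes.hex(), "format": "JPEG"}

# hex string -> original bytes
decoded_bytes = bytes.fromhex(encoded_record["hex_data"])

# original bytes -> base64 text, a form commonly used in JSON inference payloads
b64_image = base64.b64encode(decoded_bytes).decode("utf-8")
print(b64_image)
```

The round trip is lossless: decoding `b64_image` yields the same bytes that were hex-encoded.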

### Subset the data

In [None]:
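print(f"dataset shape before random subset = {df.shape}")
# never ask for more rows than the dataframe holds, otherwise sample() raises a ValueError
df = df.sample(n=min(ds_N, len(df)))
print(f"dataset shape after random subset = {df.shape}")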

### Convert to JSON Lines format

In [None]:
# if the image column is provided in the dataset, then use the PIL image encoder
if config['datasets'].get('image_col') is not None:
    logger.info(f"The data is multimodal. Using the PILImageEncoder to encode the PIL image into jsonl files")
    jsonl_content = df.to_json(orient='records', lines=True, default_handler=PILImageEncoder().default)
else:
    logger.info(f"The data is standard text data, will convert to jsonl files.")
    jsonl_content = df.to_json(orient='records', lines=True)
print(jsonl_content[:1000])

## Upload the dataset to S3

In [None]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['source_data_prefix']
file_name: str = f"{ds_id}.jsonl"
write_to_s3(jsonl_content, bucket, prefix, "", file_name)

## Create a prompt template and upload it to S3
The prompt template is specific to the model under test and also to the dataset being used. The variables used in the template, such as `context` and `input`, must exist in the dataset being used so that this prompt template can be converted into an actual prompt.
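
To make the variable requirement concrete, here is a minimal sketch that lists the placeholders in a template and checks them against one dataset record before filling the template in. The template, the record, and the use of `str.format` are assumptions for illustration; the substitution mechanism used inside `fmbench` is not shown in this notebook.

```python
from string import Formatter

# Hypothetical template and record, for illustration only
example_template = """Context: {context}

Question: {input}

Answer:"""

example_row = {
    "context": "Paris is the capital of France.",
    "input": "What is the capital of France?",
    "answers": "Paris",
}

# Collect the placeholder names that appear in the template
required_fields = {name for _, name, _, _ in Formatter().parse(example_template) if name}

# Fail early if the record lacks a field the template needs
missing_fields = required_fields - set(example_row)
if missing_fields:
    raise KeyError(f"dataset record is missing fields: {sorted(missing_fields)}")

example_prompt = example_template.format(**{k: example_row[k] for k in required_fields})
print(example_prompt)
```

Extra fields in the record, such as `answers` here, are simply ignored by the template.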

In [None]:
# dictionary containing the prompt template, it has a key by the name
# of the dataset id which forces you to explicitly add your dataset here
# otherwise no new prompt template will be uploaded and it won't accidentally
# end up overwriting an existing prompt template
prompt_template = {}

In [None]:
# LongBench
prompt_template['THUDM-LongBench-llama2-mistral'] = """<s>[INST] <<SYS>>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>

```
{context}
```

Question: {input}

[/INST]
Answer:
"""

In [None]:
# Open Orca
prompt_template['Open-Orca-OpenOrca-llama2-mistral'] = """<s>[INST] <<SYS>>

{system_prompt}

<</SYS>>

Context and task: {input}

[/INST]
"""

In [None]:
prompt_template['Open-Orca-OpenOrca-llama3'] = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{system_prompt}

Context and task: {input} 

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

In [None]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['prompt_template_dir']
for k in prompt_template.keys():
    file_name: str = f"prompt_template_{k}.txt"
    print(f"writing {file_name} to s3://{bucket}/{prefix}/{file_name}")
    write_to_s3(prompt_template[k], bucket, prefix, "", file_name)

## Scratchpad

### Utility function for converting a line from container log to JSON format

The following is a line from the CloudWatch (CW) log of a model container. It provides information about the model that is not available anywhere else (not in the Model, EndpointConfig or Endpoint description). This information is often necessary to know the low level settings of the model, which may have been set while compiling the model.

In [None]:
line="""model_id_or_path='/tmp/.djl.ai/download/ae03dd100c208acd82b5dbed563c971de864c408' rolling_batch=<RollingBatchEnum.auto: 'auto'> tensor_parallel_degree=8 trust_remote_code=False enable_streaming=<StreamingEnum.false: 'false'> batch_size=4 max_rolling_batch_size=4 dtype=<Dtype.f16: 'fp16'> revision=None output_formatter=None waiting_steps=None is_mpi=False draft_model_id=None spec_length=0 neuron_optimize_level=None enable_mixed_precision_accumulation=False enable_saturate_infinity=False n_positions=4096 unroll=None load_in_8bit=False low_cpu_mem_usage=False load_split_model=True context_length_estimate=None amp='f16' quantize=None compiled_graph_path=None task=None save_mp_checkpoint_path=None group_query_attention=None model_loader=<TnXModelLoaders.tnx: 'tnx'> rolling_batch_strategy=<TnXGenerationStrategy.continuous_batching: 'continuous_batching'> fuse_qkv=False on_device_embedding=False attention_layout=None collectives_layout=None cache_layout=None partition_schema=None all_reduce_dtype=None cast_logits_dtype=None"""
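import re
import json

# Split on spaces that are not immediately followed by a quote character, so that
# enum values such as <Dtype.f16: 'fp16'> stay in one piece
pattern = r' (?=[^\'"])'

# Split the string using the pattern
result = re.split(pattern, line)
print("\n".join(result))
params = {}
for kv in result:
    # split on the first '=' only, in case a value itself contains '='
    k, v = kv.split('=', 1)
    params[k] = v
print(json.dumps(params, indent=2, default=str))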
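
The `params` dictionary above keeps every value as a string. If typed values are more convenient, a best-effort conversion could look like the sketch below. `coerce_value` is a hypothetical helper introduced here for illustration, and the sample dictionary reuses a few key/value pairs from the log line above.

```python
import ast
import re
from typing import Any

def coerce_value(raw: str) -> Any:
    """Best-effort conversion of a logged value string into a Python value."""
    # Enum reprs such as "<Dtype.f16: 'fp16'>": keep only the quoted value
    enum_match = re.fullmatch(r"<[^:]+: '([^']*)'>", raw)
    if enum_match:
        return enum_match.group(1)
    try:
        # Handles None, True/False, numbers and quoted strings
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything else is left unchanged as a string
        return raw

example_params = {
    "tensor_parallel_degree": "8",
    "trust_remote_code": "False",
    "revision": "None",
    "dtype": "<Dtype.f16: 'fp16'>",
    "amp": "'f16'",
}
typed_params = {k: coerce_value(v) for k, v in example_params.items()}
print(typed_params)
```

With typed values, `json.dumps` emits numbers, booleans and `null` instead of quoted strings.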