# Bring your own dataset

---------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

### This part of our solution design includes 

- Creating your own `fmbench` compatible dataset from a [HuggingFace dataset](https://huggingface.co/docs/datasets/en/index).

- Creating a prompt payload template compatible with your dataset.

- Uploading the dataset and the prompt payload template to Amazon S3, from where they can be used by `fmbench`.

In [None]:
# if interactive mode is set to no -> pick up fmbench from the Python installation path
# if interactive mode is set to yes -> pick up fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pick up fmbench from the current path (one level above this notebook)
# the premise is that a non-interactive run can only happen through main.py, which sets interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [None]:
import pandas as pd
from fmbench.utils import *
from fmbench.globals import *
from datasets import load_dataset
config = load_config(CONFIG_FILE)

## Convert HuggingFace dataset to jsonl format

`fmbench` works with datasets in the [`JSON Lines`](https://jsonlines.org/) format. So here we show how to convert a HuggingFace dataset into JSON lines format.

Set `ds_id` to the HuggingFace dataset id, for example [`THUDM/LongBench`](https://huggingface.co/datasets/THUDM/LongBench), [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/squad_v2), [`banking77`](https://huggingface.co/datasets/banking77) or other text datasets. Set `ds_name` to the dataset configuration name and `ds_split` to the split to use.

In [None]:
ds_id: str = "rajpurkar/squad"
ds_name: str = "plain_text"
ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
ds_N: int = 100

# another example
# ds_id: str = "THUDM/LongBench"
# ds_name: str = "2wikimqa"
# ds_split: str = "test"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
# ds_N: int = 200

# another example
# ds_id: str = "banking77"
# ds_name: str = "default"
# ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
# ds_N: int = 10000

# active example: these values override the ones set at the top of this cell
ds_id: str = "Open-Orca/OpenOrca"
ds_name: str = "default"
ds_split: str = "train"
# Take a random subset of the dataframe, adjust the value of `N` below as appropriate.
# size of random subset of the data
ds_N: int = 100

In [None]:
# Load the dataset from huggingface
dataset = load_dataset(ds_id, name=ds_name)

In [None]:
dataset

In [None]:
# convert the dataset to a dataframe, so it is easy to print and to convert to jsonl
df = pd.DataFrame(dataset[ds_split])

# some datasets contain a field called question, we would like to call it
# input to match it to the prompt template
df.rename(columns={"question": "input"}, inplace=True)

In [None]:
df.head()

Subset the data

In [None]:
print(f"dataset shape before random subset = {df.shape}")
df = df.sample(n=ds_N)
print(f"dataset shape after random subset = {df.shape}")
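
The subset drawn above changes on every run, and `df.sample` raises an error if `ds_N` is larger than the number of rows. The following is a minimal sketch of a repeatable, size-safe variant. The helper name `sample_rows`, the seed value and the toy dataframe are illustrative assumptions and are not part of `fmbench`.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Return up to n randomly chosen rows (hypothetical helper, not in fmbench)."""
    # DataFrame.sample raises ValueError when n exceeds the row count
    # (with the default replace=False), so cap n at the rows available.
    n = min(n, len(df))
    # A fixed random_state makes the same rows come back on every run.
    return df.sample(n=n, random_state=seed)


# Toy data standing in for the dataframe built from the HuggingFace dataset.
toy_df = pd.DataFrame({"input": [f"question {i}" for i in range(10)]})

subset_a = sample_rows(toy_df, 4)
subset_b = sample_rows(toy_df, 4)
print(subset_a.index.tolist() == subset_b.index.tolist())
print(sample_rows(toy_df, 100).shape)
```

Fixing the seed is useful when benchmark runs need to be compared against the same set of prompts.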

Convert to json lines format

In [None]:
jsonl_content = df.to_json(orient='records', lines=True)
print(jsonl_content[:1000])
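
Before uploading, it can help to confirm that the generated text really is valid JSON Lines, meaning one standalone JSON object per line. The sketch below checks this on a small toy dataframe (the column values are made up for illustration) and reads the text back with pandas.

```python
import io
import json

import pandas as pd

# Toy data standing in for the sampled dataframe.
toy_df = pd.DataFrame(
    {
        "input": ["What color is the sky?", "How many days are in a week?"],
        "context": ["The sky is blue.", "A week has seven days."],
    }
)
jsonl_text = toy_df.to_json(orient="records", lines=True)

# Each non-empty line must parse on its own as a JSON object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(f"parsed {len(records)} records, keys = {sorted(records[0].keys())}")

# Reading the text back with lines=True should give the same rows again.
round_trip = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(round_trip.shape)
```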

## Upload the dataset to S3

In [None]:
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['source_data_prefix']
file_name: str = f"{ds_id}.jsonl"
write_to_s3(jsonl_content, bucket, prefix, "", file_name)

## Create a prompt template and upload it to S3
The prompt template is specific to both the model under test and the dataset being used. The variables used in the template, such as `context` and `input`, must exist as fields in the dataset so that the prompt template can be converted into an actual prompt.
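
One way to catch a mismatch between a template and a dataset early is to list the placeholders in the template and compare them with the dataset columns. This is a minimal sketch that assumes the placeholders follow Python's `str.format` syntax, as in `{context}` and `{input}`. The helper names `template_fields` and `missing_fields` and the toy template are hypothetical and are not part of `fmbench`.

```python
from string import Formatter


def template_fields(template: str) -> set:
    """Collect the placeholder names used in a str.format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so filter those out.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template: str, columns) -> set:
    """Return the placeholders that have no matching dataset column."""
    return template_fields(template) - set(columns)


toy_template = "Context: {context}\n\nQuestion: {input}\n\nAnswer:"

# A dataset that has an `input` column but no `context` column.
print(missing_fields(toy_template, ["input", "answers"]))
```

An empty result means every variable in the template can be filled from the dataset.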

In [None]:
# dictionary containing the prompt templates, keyed by a name derived from
# the dataset id. This forces you to explicitly add your dataset here,
# otherwise no new prompt template will be uploaded and an existing
# prompt template won't accidentally be overwritten
prompt_template = {}

In [None]:
# LongBench
prompt_template['THUDM-LongBench-llama2-mistral'] = """<s>[INST] <<SYS>>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>

```
{context}
```

Question: {input}

[/INST]
Answer:
"""

In [None]:
# Open Orca
prompt_template['Open-Orca-OpenOrca-llama2-mistral'] = """<s>[INST] <<SYS>>

{system_prompt}

<</SYS>>

Context and task: {input}

[/INST]
"""

In [None]:
prompt_template['Open-Orca-OpenOrca-llama3'] = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{system_prompt}

Context and task: {input} 

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

In [None]:
# upload every prompt template defined above to S3, one file per template
bucket: str = config['s3_read_data']['read_bucket']
prefix: str = config['s3_read_data']['prompt_template_dir']
for k in prompt_template.keys():
    file_name: str = f"prompt_template_{k}.txt"
    print(f"writing {file_name} to s3://{bucket}/{prefix}/{file_name}")
    write_to_s3(prompt_template[k], bucket, prefix, "", file_name)
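
To preview what a template produces, a single dataset record can be substituted into it locally. The sketch below assumes plain `str.format` substitution. How `fmbench` itself renders prompts is not shown in this notebook, and the record values are made up for illustration.

```python
# A shortened template in the style of the Open Orca template above.
toy_template = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n\n"
    "Context and task: {input}\n\n"
    "[/INST]\n"
)

# One toy record; fields not referenced by the template are simply ignored.
toy_record = {
    "id": "row-1",
    "system_prompt": "You are a helpful assistant.",
    "input": "Summarize the following text in one sentence.",
    "response": "not used by the template",
}

# str.format fills each {name} placeholder from the matching record field
# and raises KeyError if a placeholder has no matching field.
prompt = toy_template.format(**toy_record)
print(prompt)
```

Under this assumption, a template that needs literal curly braces would have to write them doubled (`{{` and `}}`).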