# Preparing Data for Fine-Tuning a Large Language Model

It is critical to prepare quality data in the correct format to fine-tune a large language model.


<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>

<ul>
    <li><b>Part 1:</b> Preparing a sample dataset.</li>
    <li><b>Part 2:</b> Introduction to Ray Data.</li>
    <li><b>Part 3:</b> Migrating to a scalable pipeline.</li>
    <li><b>Part 4:</b> Using the Anyscale Datasets registry.</li>
</ul>
</div>

## Imports

In [None]:
import os
import uuid
from typing import Any

import anyscale
import pandas as pd
import ray
from datasets import load_dataset

In [None]:
ctx = ray.data.DataContext.get_current()
ctx.enable_operator_progress_bars = False

## 1. Preparing a sample dataset

Let's start by preparing a small dataset for fine-tuning a large language model. 

### Dataset

We'll be using the [ViGGO dataset](https://huggingface.co/datasets/GEM/viggo), where the input (`meaning_representation`) is a structured collection of the overall intent (e.g. `inform`) and entities (e.g. `release_year`), and the output (`target`) is an unstructured sentence that incorporates all the structured input information.

For our task, we'll **reverse** this dataset: the input will be the unstructured sentence and the output will be the structured information.

```python
# Input (unstructured sentence):
"Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac."

# Output (function + attributes): 
"inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])"
```
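
To make the output format concrete, here is a minimal sketch of how such a meaning representation string could be split into its function name and attributes. The helper `parse_meaning_representation` is a hypothetical name, not part of the ViGGO dataset or any library, and the regular expressions assume that attribute values never contain a closing square bracket.

```python
import re


def parse_meaning_representation(mr: str) -> tuple[str, dict[str, str]]:
    """Split 'function(attr[value], ...)' into (function, {attr: value})."""
    # The outer pattern captures the function name and everything inside
    # the outermost parentheses.
    match = re.fullmatch(r"(\w+)\((.*)\)", mr.strip())
    if match is None:
        raise ValueError(f"Not a valid meaning representation: {mr!r}")
    function, body = match.groups()
    # Each attribute looks like name[value]; the value may be empty.
    attributes = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
    return function, attributes


example = (
    "inform(name[Dirt: Showdown], release_year[2012], "
    "esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], "
    "platforms[PlayStation, Xbox, PC], available_on_steam[no], "
    "has_linux_release[no], has_mac_release[no])"
)
function, attributes = parse_meaning_representation(example)
print(function)
print(attributes["platforms"])
```

A parser along these lines can be handy later for comparing a fine-tuned model's output with the expected structure attribute by attribute, rather than as one opaque string.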

### Schema

The preprocessing step formats our dataset into the conversation schema required for fine-tuning, with `system`, `user`, and `assistant` messages.

- `system`: description of the behavior or personality of the model. As a best practice, this should be the same for all examples in the fine-tuning dataset, and should remain the same system prompt when moved to production.
- `user`: user message, or "prompt," that provides a request for the model to respond to.
- `assistant`: the response the model should return. In fine-tuning data this holds the intended response for the preceding `user` message; in multi-turn conversations it also stores previous responses.

```python
conversations = [
    {"messages": [
        {'role': 'system', 'content': system_content},
        {'role': 'user', 'content': item['target']},
        {'role': 'assistant', 'content': item['meaning_representation']}
    ]},
    {"messages": [...]},
    ...
]
```
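
Before writing records out, it can help to check that each one follows this schema. The sketch below assumes single-turn conversations with exactly the three roles in the order shown above, which is the layout used in this notebook. `validate_conversation` is a hypothetical helper, not part of LLMForge or any library.

```python
from typing import Any

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_conversation(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record; empty means it looks OK."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' must be a list"]

    problems = []
    # Check that the roles appear in the expected order.
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"expected roles {EXPECTED_ROLES}, got {roles}")
    # Check that every message carries non-empty string content.
    for i, m in enumerate(messages):
        content = m.get("content") if isinstance(m, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


good = {"messages": [
    {"role": "system", "content": "Example system prompt."},
    {"role": "user", "content": "Example sentence."},
    {"role": "assistant", "content": "inform(name[Example])"},
]}
bad = {"messages": [
    {"role": "user", "content": "Example sentence."},
    {"role": "assistant", "content": ""},
]}
print(validate_conversation(good))
print(validate_conversation(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a record at once.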

### Loading a sample dataset

We will use the Hugging Face `datasets` library to load the ViGGO dataset and prepare a sample dataset for fine-tuning.

In [None]:
dataset = load_dataset("GEM/viggo", trust_remote_code=True)

Let's inspect the data splits available in the dataset:

In [None]:
# Data splits
train_set = dataset['train']
val_set = dataset['validation']
test_set = dataset['test']
print(f"train: {len(train_set)}")
print(f"val: {len(val_set)}")
print(f"test: {len(test_set)}")

Here is a single row of the dataset:

In [None]:
# Indexing returns the first row as a dictionary
row = test_set[0]
row

The following function transforms a row into the conversation schema described above.

In [None]:
def to_schema(row: dict[str, Any], system_content: str) -> dict[str, Any]:
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": row["target"]},
        {"role": "assistant", "content": row["meaning_representation"]},
    ]
    return {"messages": messages}

We will use the following system prompt:

In [None]:
# System content
system_content = (
    "Given a target sentence construct the underlying meaning representation of the input "
    "sentence as a single function with attributes and attribute values. This function "
    "should describe the target string accurately and the function must be one of the "
    "following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', "
    "'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes "
    "must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', "
    "'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', "
    "'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']")


We can now convert the data to the schema format.

In [None]:
converted_data = []

for row in train_set:
    row["schema"] = to_schema(row, system_content)
    converted_data.append(row["schema"])

Here is what the schema looks like for a single row:

In [None]:
row["schema"]

We can then use pandas to view our dataset:

In [None]:
converted_df = pd.DataFrame(converted_data)
converted_df.head()

We then store our training dataset, which is now ready for fine-tuning via LLMForge:

In [None]:
converted_df.to_json("train.jsonl", orient="records", lines=True)
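
With `orient="records"` and `lines=True`, `to_json` produces a JSON Lines file: one JSON object per line, each holding a single `messages` list. The following sketch uses only the standard library and a made-up record to show that layout and a simple read-back check. The file name `sample.jsonl` and the record contents are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record in the same shape as one row of the converted data.
records = [
    {
        "messages": [
            {"role": "system", "content": "Example system prompt."},
            {"role": "user", "content": "Example sentence."},
            {"role": "assistant", "content": "inform(name[Example])"},
        ]
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.jsonl"
    # Write one JSON object per line, with no enclosing list.
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line and confirm nothing was lost.
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "record(s) round-tripped:", loaded == records)
```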

## 2. Introduction to Ray Data

<!-- One liner about Ray Data -->
Ray Data is a scalable data processing library for ML workloads.


<!-- Diagram showing streaming and heterogenous cluster -->
Ray Data is particularly useful for streaming data on a heterogeneous cluster:

<img src="https://docs.ray.io/en/latest/_images/stream-example.png" width="600">

Your production pipeline for preparing data for fine-tuning a large language model could require:
1. Loading multi-modal datasets.
2. Running inference against guardrail models to remove low-quality and PII data.
3. Preprocessing data to the schema required for fine-tuning.

You will want to make the most efficient use of your cluster to process this data. Ray Data can help you do this.

### Ray Data's API

Here are the steps to make use of Ray Data:
1. Create a Ray Dataset, usually by pointing to a data source.
2. Apply transformations to the Ray Dataset.
3. Write out the results to a data source.



### Loading Data

Ray Data has a number of [IO connectors](https://docs.ray.io/en/latest/data/api/input_output.html) for the most commonly used formats.

For this introduction, we will use the `from_huggingface` function to read the same ViGGO training split we used in the previous section, this time with streaming enabled.

In [None]:
train_streaming_ds = load_dataset(
    path="GEM/viggo",
    name="default",
    streaming=True, # Enable streaming
    split="train",
)

train_ds = ray.data.from_huggingface(train_streaming_ds)
train_ds

<div class="alert alert-block alert-warning">

<b>Note</b> that we can also stream data directly from Hugging Face or from any other source (e.g. Parquet files on S3).

</div>

### Transforming Data

Datasets can be transformed by applying a row-wise `map` operation. We do this by providing a user-defined function that takes a row as input and returns a row as output.

In [None]:
def to_schema_map(row: dict[str, Any]) -> dict[str, Any]:
    return to_schema(row, system_content=system_content)

train_ds_with_schema = train_ds.map(to_schema_map)

### Lazy execution

By default, `map` is lazy: Ray Data does not run the function until the dataset is consumed. This allows for optimizations like pipelining and fusing of operations.
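
As a rough analogy only, Python's built-in `map` is also lazy, which makes the idea easy to observe without a cluster. The sketch below is plain Python and says nothing about how Ray Data is implemented; `add_one` and `call_log` are hypothetical names used for illustration.

```python
call_log: list[int] = []


def add_one(x: int) -> int:
    """Toy transformation that records every input it is called with."""
    call_log.append(x)
    return x + 1


# Defining the pipeline runs nothing: the built-in map returns a lazy iterator.
lazy_result = map(add_one, range(5))
calls_after_definition = len(call_log)

# Consuming two items triggers exactly two calls to add_one.
first_two = [next(lazy_result), next(lazy_result)]
calls_after_consumption = len(call_log)

print(calls_after_definition, first_two, calls_after_consumption)
```

This prints `0 [1, 2] 2`: no work happens at definition time, and only the consumed items are computed. Ray Data executes on blocks of rows rather than single items, so it may process more rows than you ask for.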

To inspect a few rows of the dataset, you can use the `take` method:

In [None]:
train_ds_with_schema.take(2)

### Writing Data

We can then write the data to disk using the available IO connector methods.

In [None]:
uuid_ = str(uuid.uuid4())
storage_path = f"/mnt/cluster_storage/ray_summit/e2e_llms/{uuid_}"
storage_path


We make use of the `write_json` method to write the dataset to the storage path in a distributed manner.

In [None]:
train_ds_with_schema.write_json(f"{storage_path}/train")

Let's inspect the generated files:

In [None]:
!ls -l --human-readable {storage_path}/train/

### Recap of our Ray Data pipeline

Here is our Ray Data pipeline condensed into the following chained operations:

```python
(
    ray.data.from_huggingface(train_streaming_ds)
    .map(to_schema_map)
    .write_json(f"{storage_path}/train")
)
```

<div class="alert alert-block alert-info">

### Lab activity: Apply more elaborate preprocessing

Assume you want to remove all `give_opinion` messages to avoid fine-tuning on sensitive user opinions.

In a production setting, think of this as applying a guardrail model that detects and filters out poor-quality data or PII.

Complete the `filter_opinions` function in the following code:

```python
def is_give_opinion(conversation):
    sys, user, assistant = conversation
    return "give_opinion" in assistant["content"]


def filter_opinions(row) -> bool:
    # Hint: call is_give_opinion on the row
    ...

(
    ray.data.from_huggingface(train_streaming_ds)
    .map(to_schema_map)
    .filter(filter_opinions)
    .write_json(f"{storage_path}/train_without_opinion")
)
```


</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">

<details>
<summary>Click here to view the solution</summary>

```python
def is_give_opinion(conversation):
    sys, user, assistant = conversation
    return "give_opinion" in assistant["content"]


def filter_opinions(row) -> bool:
    return not is_give_opinion(row["messages"])

(
    ray.data.from_huggingface(train_streaming_ds)
    .map(to_schema_map)
    .filter(filter_opinions)
    .write_json(f"{storage_path}/train_without_opinion")
)
```


</details>

</div>
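
The same `filter` pattern extends to other row-level checks. As a rough stand-in for a guardrail model, the sketch below flags rows in which any message contains something that looks like an email address. The regular expression, the `contains_email` and `keep_row` names, and the sample rows are all illustrative assumptions; a real PII detector would be considerably more thorough.

```python
import re
from typing import Any

# Deliberately simple pattern; it will miss many real-world email formats.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def contains_email(text: str) -> bool:
    """Return True if the text contains something shaped like an email address."""
    return EMAIL_PATTERN.search(text) is not None


def keep_row(row: dict[str, Any]) -> bool:
    """Return True for rows that should stay in the dataset."""
    return not any(contains_email(m["content"]) for m in row["messages"])


rows = [
    {"messages": [{"role": "user", "content": "Dirt: Showdown is a racing game."}]},
    {"messages": [{"role": "user", "content": "Mail me at jane@example.com."}]},
]
kept = [row for row in rows if keep_row(row)]
print(len(kept))
```

A predicate like `keep_row` takes a row and returns a boolean, the same shape as `filter_opinions` in the lab, so it could be passed to `.filter(...)` in the pipeline in the same way.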

## 4. Using the Anyscale Datasets registry

Anyscale Datasets is a managed dataset registry and discovery service that allows you to:

- Centralize dataset storage
- Version datasets
- Track dataset usage
- Manage dataset access

Let's upload our training data to the Anyscale Datasets registry.

In [None]:
anyscale_dataset = anyscale.llm.dataset.upload(
    "train.jsonl",
    name="viggo_train",
    description=(
        "VIGGO dataset for E2E LLM template: train split"
    ),
)

anyscale_dataset

The dataset is now saved to the Anyscale Datasets registry.

To load the Anyscale Dataset back into a Ray Dataset, you can do:

In [None]:
anyscale_dataset = anyscale.llm.dataset.get("viggo_train")
train_ds_with_schema = ray.data.read_json(anyscale_dataset.storage_uri)
train_ds_with_schema

You may also want to download the contents of the Dataset file directly, in this case, a `.jsonl` file.

In [None]:
dataset_contents: bytes = anyscale.llm.dataset.download("viggo_train")
lines = dataset_contents.decode().splitlines()
print("# of rows:", len(lines))

Or version the Dataset:

In [None]:
anyscale_dataset = anyscale.llm.dataset.get("viggo_train")
latest_version = anyscale_dataset.version
anyscale_dataset = anyscale.llm.dataset.upload(
    "train.jsonl",
    name="viggo_train",
    description=(
        f"VIGGO dataset for E2E LLM template: train split, version {latest_version + 1}"
    ),
)

print("Latest version:", anyscale.llm.dataset.get("viggo_train"))
print("Second latest version:", anyscale.llm.dataset.get("viggo_train", version=-1))

Finally, you can use the Anyscale dataset in your LLMForge fine-tuning jobs.