# Amazon Nova Lite 2.0 Data Preparation (for Model Customization)

Preparing high-quality, properly formatted data is a critical first step in the fine-tuning process for large language models. Whether you're using supervised fine-tuning (SFT) or Direct Preference Optimization (DPO), with either full-rank or low-rank adaptation (LoRA) approaches, your data must adhere to specific format requirements to ensure successful model training. This section outlines the necessary data formats, validation methods, and best practices to help you prepare your datasets effectively for fine-tuning Amazon Nova models.

Data preparation is the critical process of collecting, cleaning, formatting, and organizing datasets for machine learning model training. It involves transforming raw data into a structured format that models can effectively learn from, ensuring data quality, consistency, and relevance to the target task.

In this notebook, the tasks below take a raw dataset and transform it into training, validation, and evaluation (test) datasets compatible with Nova 2.0 models.

## 1. Getting Started
For this notebook, we are going to build training, validation, and test datasets to be used with SageMaker Training Jobs (SMTJ) to fine-tune a Nova Lite 2.0 model and then evaluate that model.

This notebook will use the public [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) dataset.  

This open-source dataset focuses on enabling and improving function calling capabilities for large language models (LLMs).

We will use this dataset to train our model, but we must transform this dataset into a schema consumable by SMTJ and Nova.

## 2. Prerequisites and Dependencies

### Dependencies
Several Python packages must be installed to run this notebook. Please review the packages in requirements.txt.

`boto3` and `sagemaker` are required for the training jobs, while the other packages are used to help visualize results.

In [None]:
! pip install -r ./requirements.txt --upgrade

### Credentials, Sessions, Roles, and more!

This section sets up the necessary AWS credentials and SageMaker session to run the notebook. You'll need proper IAM permissions to use SageMaker.


If you are going to use SageMaker in a local environment, you will need access to an IAM role with the required SageMaker permissions. Learn more in the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

For details on the other Nova prerequisites, see the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-general-prerequisites.html).

The code initializes a SageMaker session, sets up the IAM role, and configures the S3 bucket for storing training data artifacts.

In [None]:
import sagemaker
import boto3

# we use the SageMaker session as we need to place the training and eval data into a bucket location where SageMaker has access.

session_sagemaker = sagemaker.Session(boto_session=boto3.Session(region_name='us-east-1'))

sagemaker_session_bucket = None

if sagemaker_session_bucket is None and session_sagemaker is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = session_sagemaker.default_bucket()

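try:
    # Works when the notebook runs inside SageMaker
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the role by name. Replace "sagemaker_execution_role"
    # with the name of an IAM role that has the required SageMaker permissions.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]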


# use the bucket chosen above (falls back to the session default when none was given)
bucket_name = sagemaker_session_bucket
default_prefix = session_sagemaker.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {session_sagemaker.default_bucket()}")
print(f"sagemaker session region: {session_sagemaker.boto_region_name}")

## 3. Data Preparation
Prepare high-quality prompt-response pairs for training. Data should be:
- Consistent in format
- Representative of desired behavior
- Deduplicated and cleaned

### Train and Validation Datasets
Our first task is to convert the datasets into the schema required to train with SMTJ.

That schema is described below, with a `system` property and a `messages` property. Both properties take arrays. The `messages` property uses the familiar "user" -> "assistant" turn paradigm.

For custom training, each data record must be wrangled into a specific format. Then, all records that make up the training dataset are written to a JSONL file. Here is the schema that represents a single record. We call this the `Converse` format.

```
{
    "system": [{"text": Content of the System prompt}],
    "messages": [
        {
            "role": "user",
            "content": [ { "text": Content of the user prompt }]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "reasoningContent": {
                        "reasoningText": { "text": <reasoning text -> Nova Pro COT AND 2.0 reasoning> }
                    }
                },
                { "text": Content of the answer }
            ]
        }
    ]
}
```

"reasoningContent" is optional. It explains why the assistant response is the golden (reference) response. Records that include reasoning content may help the model train better than records without it.
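
As a concrete illustration of the schema above, the following sketch builds one record and serializes it as a single JSONL line. The helper name `make_converse_record` and the sample strings are hypothetical and are not part of the notebook's utilities; only the field layout comes from the schema.

```python
import json


def make_converse_record(system_prompt, user_text, answer_text, reasoning_text=None):
    """Build one training record in the schema shown above (hypothetical helper)."""
    assistant_content = []
    if reasoning_text is not None:
        # The optional reasoning block is placed before the final answer text.
        assistant_content.append(
            {"reasoningContent": {"reasoningText": {"text": reasoning_text}}}
        )
    assistant_content.append({"text": answer_text})
    return {
        "system": [{"text": system_prompt}],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
            {"role": "assistant", "content": assistant_content},
        ],
    }


record = make_converse_record(
    system_prompt="You are a helpful assistant.",
    user_text="What is 2 + 2?",
    answer_text="4",
    reasoning_text="Adding 2 and 2 gives 4.",
)

# Each record occupies exactly one line of the JSONL file.
jsonl_line = json.dumps(record)
print(jsonl_line)
```

Leaving `reasoning_text` as `None` produces a record whose assistant turn contains only the answer text.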

### Test (Eval) Dataset
The schema for a test (evaluation) record is different when using the SageMaker evaluation container. For SageMaker training jobs that use the evaluation container, a single record in the test data has the following schema:

```
{
    "system": "",
    "query": "",
    "response": ""
}
```

All of these records are then aggregated into a single JSONL file, which must be named `gen_qa.jsonl`.
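
A minimal sketch of writing evaluation records to a file with the required name might look like the following. The sample records and the temporary output directory are illustrative assumptions; the notebook itself writes this file later through the `datasets` library.

```python
import json
import os
import tempfile

# Illustrative records that follow the evaluation schema above.
eval_records = [
    {"system": "You are a helpful assistant.", "query": "What is 2 + 2?", "response": "4"},
    {"system": "", "query": "Name a primary color.", "response": "Red"},
]

# A temporary directory keeps this example from touching real data.
out_path = os.path.join(tempfile.mkdtemp(), "gen_qa.jsonl")

with open(out_path, "w", encoding="utf-8") as f:
    for rec in eval_records:
        # One JSON object per line, no enclosing array.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(out_path, encoding="utf-8") as f:
    round_trip = [json.loads(line) for line in f]

print(len(round_trip), "records written to", os.path.basename(out_path))
```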

### Building the Datasets
The data preparation activity within this notebook will:
- take a dataset as input
- wrangle the data and transform each record into the above schema
- aggregate all of these records into a single JSONL file
- write this JSONL file to S3 to be consumed by the customization process found in the respective technique folder

### 3.1. Data Loading

This code loads the first 1,000 examples (`train[:1000]`) of the glaive-function-calling-v2 dataset from Hugging Face and displays a summary of the loaded dataset.


In [None]:
from datasets import load_dataset

dataset = load_dataset("glaiveai/glaive-function-calling-v2", split="train[:1000]")

dataset

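The raw records are first converted to a standard format with the `glaive_to_standard_format` helper. The result is then loaded into a pandas DataFrame, which makes the data easier to inspect and manipulate.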

In [None]:
from utils.preprocessing import glaive_to_standard_format

processed_dataset = glaive_to_standard_format(dataset)

In [None]:
import pandas as pd

df = pd.DataFrame(processed_dataset)

df.head()

### 3.2. Train/Val/Test Split

The dataset is split into training (about 89%), validation (10%), and test (about 1%) sets so that the model can be properly evaluated.
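
When a dataset is split in two steps, the second `test_size` applies only to what remains after the first split, so the effective fractions are products of the arguments rather than the arguments themselves. The sketch below works through that arithmetic; the function name `effective_fractions` and the values passed to it are illustrative assumptions.

```python
def effective_fractions(first_test_size, second_test_size):
    """Return (kept, first_holdout, second_holdout) as fractions of the full dataset."""
    first_holdout = first_test_size
    remainder = 1.0 - first_test_size
    # The second split only sees the remainder of the first split.
    second_holdout = remainder * second_test_size
    kept = remainder - second_holdout
    return kept, first_holdout, second_holdout


# Illustrative values: hold out 10% first, then 20% of what is left.
kept, first, second = effective_fractions(0.1, 0.2)
print(f"kept={kept:.1%}  first holdout={first:.1%}  second holdout={second:.1%}")
```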

In [None]:
from sklearn.model_selection import train_test_split

temp, val = train_test_split(df, test_size=0.1, random_state=42)
train, test = train_test_split(temp, test_size=0.01, random_state=42)

print("Number of train elements: ", len(train))
print("Number of val elements: ", len(val))
print("Number of test elements: ", len(test))

### 3.3. Data Preprocessing

Recall, our objective is to transform each record of the training data into the schema required for SMTJ and the Amazon Nova Lite 2.0 model.

Once each record conforms to the schema, all transformed records are written to a single JSONL file.

As a reminder, here is the target schema for the training and validation data:

```
{
    "system": [{ "text": Content of the System prompt }],
    "messages": [
        {
            "role": "user",
            "content": [{ "text": Content of the user prompt }]
        },
        {
            "role": "assistant",
            "content": [{ "text": Content of the answer }]
        }
    ]
}
```

To help achieve this transformation goal, the notebook defines utility functions to clean the dataset content by removing prefixes and handling special cases:

```python
def clean_prefix(content):
    # Removes prefixes like "USER:", "ASSISTANT:", etc.
    ...

def clean_message_list(message_list):
    # Cleans message lists from None values and converts to proper format
    ...

def clean_numbered_conversation(message_list):
    # Cleans message lists from None values and converts to proper format
    ...
```
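
To give a sense of what prefix cleaning can involve, here is a minimal sketch. It is not the notebook's `clean_prefix` implementation: the name `strip_role_prefix` and the list of role prefixes are assumptions made for illustration.

```python
import re

# Assumed set of role prefixes; the notebook's own utility may handle a different set.
_ROLE_PREFIX = re.compile(
    r"^\s*(?:USER|ASSISTANT|SYSTEM|FUNCTION RESPONSE)\s*:\s*", re.IGNORECASE
)


def strip_role_prefix(content):
    """Remove one leading role prefix such as 'USER:' (hypothetical helper)."""
    if content is None:
        return ""
    # Only a prefix at the very start is removed; later occurrences are kept.
    return _ROLE_PREFIX.sub("", content, count=1).strip()


print(strip_role_prefix("USER: Can you book a flight for me?"))
```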

These functions transform the dataset into the format required by Nova models, handling tool calls and formatting:

```python

def transform_tool_format(tool):
    # Transforms tool format to Nova's expected format
    ...

def prepare_dataset(sample):
    # Prepares dataset in the required format for Nova models
    ...

def prepare_dataset_test(sample):
    # Formats test dataset records for evaluation
    ...
```
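
Before records are written out, it can help to check them against the target schema. The sketch below is one possible check, not an official validator. It assumes that `system` is optional, that turns strictly alternate starting with "user", and that a conversation ends with an assistant turn. The function name `validate_converse_record` is hypothetical.

```python
def validate_converse_record(record):
    """Return a list of problems found in one record; an empty list means none were found."""
    problems = []

    # Assumption: "system" may be omitted; when present it is a list of {"text": ...}.
    system = record.get("system", [])
    if not (isinstance(system, list)
            and all(isinstance(s, dict) and "text" in s for s in system)):
        problems.append("system must be a list of {'text': ...} objects")

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
        return problems

    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        # Assumption: strict user/assistant alternation, starting with "user".
        expected_role = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected_role:
            problems.append(f"messages[{i}] should have role '{expected_role}'")
        content = msg.get("content")
        if not isinstance(content, list) or not content:
            problems.append(f"messages[{i}].content must be a non-empty list")

    if len(messages) % 2 != 0:
        problems.append("conversation should end with an assistant turn")
    return problems


good = {
    "system": [{"text": "You are a helpful assistant."}],
    "messages": [
        {"role": "user", "content": [{"text": "Hi"}]},
        {"role": "assistant", "content": [{"text": "Hello!"}]},
    ],
}
bad = {"messages": [{"role": "assistant", "content": []}]}

print(validate_converse_record(good))
print(validate_converse_record(bad))
```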


### 3.4. Wrangle Dataset into Converse Format
Create the train, validation, and test datasets:

- Train data is data that the model trains on
- Validation (eval) data is data used to measure metrics of the trained model
- Test data is data that the model has never seen or been trained on

In [None]:
from datasets import Dataset, DatasetDict
from random import randint
from utils.data_prep_utils import (
    prepare_dataset,
    clean_message_list,
    prepare_dataset_test,
    make_eval_compatible
)

train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
intermediate_test_dataset = Dataset.from_pandas(test)


dataset = DatasetDict(
    {"train": train_dataset, "test": intermediate_test_dataset, "val": val_dataset}
)

train_dataset = dataset["train"].map(
    prepare_dataset, remove_columns=train_dataset.features
)

train_dataset = train_dataset.to_pandas()

train_dataset["messages"] = train_dataset["messages"].apply(clean_message_list)

# randint is inclusive on both ends, so subtract 1 to stay within bounds
print(train_dataset.iloc[randint(0, len(train_dataset) - 1)].to_json())

val_dataset = dataset["val"].map(prepare_dataset, remove_columns=val_dataset.features)

val_dataset = val_dataset.to_pandas()

val_dataset["messages"] = val_dataset["messages"].apply(clean_message_list)

# randint is inclusive on both ends, so subtract 1 to stay within bounds
print(val_dataset.iloc[randint(0, len(val_dataset) - 1)].to_json())

intermediate_test_dataset = dataset["test"].map(
    prepare_dataset_test, remove_columns=intermediate_test_dataset.features
)


# Transform test_dataset into Evaluation compatible
# test_dataset looks like this:

# {"messages":[{"query":"1","response":"","system":""}]}
# {"messages":[{"query":"2","response":"","system":""}]}
# {"messages":[{"query":"3","response":"","system":""}]}

# and we transform it into this for Evaluation

# {"query":"1","response":"","system":""}
# {"query":"2","response":"","system":""}
# {"query":"3","response":"","system":""}

test_data = make_eval_compatible(intermediate_test_dataset)
test_dataset = Dataset.from_list(test_data)

# randint is inclusive on both ends, so subtract 1 to stay within bounds
print(test_dataset[randint(0, len(test_dataset) - 1)])
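
The flattening step described in the comments above can be sketched as follows. This is a hypothetical stand-in for illustration, not the notebook's `make_eval_compatible` implementation, and the name `flatten_eval_rows` is an assumption.

```python
def flatten_eval_rows(rows):
    """Lift each item inside 'messages' to a top-level record (hypothetical helper)."""
    flat = []
    for row in rows:
        for item in row["messages"]:
            flat.append(
                {
                    "system": item.get("system", ""),
                    "query": item.get("query", ""),
                    "response": item.get("response", ""),
                }
            )
    return flat


nested = [
    {"messages": [{"query": "1", "response": "", "system": ""}]},
    {"messages": [{"query": "2", "response": "", "system": ""}]},
]

for rec in flatten_eval_rows(nested):
    print(rec)
```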

### 3.5. Upload all 3 curated datasets (train, val, test) to Amazon S3

The processed datasets are saved locally and then uploaded to Amazon S3 for use in SageMaker training.

#### Save files locally for review

In [None]:
import os

# Save datasets locally before uploading them to S3
folder = "tmp"
os.makedirs(f"./{folder}/train", exist_ok=True)
os.makedirs(f"./{folder}/val", exist_ok=True)
os.makedirs(f"./{folder}/test", exist_ok=True)

train_dataset.to_json(f"./{folder}/train/dataset.jsonl", orient="records", lines=True)
val_dataset.to_json(f"./{folder}/val/dataset.jsonl", orient="records", lines=True)
test_dataset.to_json(f"./{folder}/test/gen_qa.jsonl")
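
Before uploading, it may be worth confirming that each local file is valid JSON Lines. The following sketch counts the lines that parse and reports the ones that do not. The function name `check_jsonl` and the demo file are assumptions made for illustration.

```python
import json
import os
import tempfile


def check_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSON Lines file."""
    valid, bad = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad.append(line_no)
    return valid, bad


# Demo on a throwaway file with two good lines and one bad line.
demo_path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"query": "1"}\n{"query": "2"}\nnot json\n')

print(check_jsonl(demo_path))
```

The same function could be pointed at the files written in the cell above, for example `./tmp/train/dataset.jsonl`.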


#### Upload Datasets to S3

Define the S3 bucket paths

In [None]:
import boto3

s3_client = boto3.client('s3')

# Build the S3 key prefix, honoring the session's default bucket prefix if one is set
if default_prefix:
    input_path = f"{default_prefix}/datasets/nova-sft-peft"
else:
    input_path = "datasets/nova-sft-peft"

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.jsonl"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/val/dataset.jsonl"
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/gen_qa.jsonl"

# Nothing has been uploaded yet; these are the target locations
print("Training data will be uploaded to:")
print(train_dataset_s3_path)

print("\nValidation data will be uploaded to:")
print(val_dataset_s3_path)

print("\nTest data will be uploaded to:")
print(test_dataset_s3_path)

Push to S3

In [None]:
import shutil

s3_client.upload_file(
    f"./{folder}/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
)

s3_client.upload_file(
    f"./{folder}/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
)

s3_client.upload_file(
    f"./{folder}/test/gen_qa.jsonl", bucket_name, f"{input_path}/test/gen_qa.jsonl"
)

# Uncomment the line below to delete the local data files once you no longer need to review them.
# shutil.rmtree(f"./{folder}")

## 4. Model Training and Customization

Excellent! All done with the Data Preparation!

Now it is time to train the model. Please go to the SFT notebook to take the next steps on this journey.

**---------- BEFORE YOU GO!! ----------**<br><br>
The values below are needed in the SFT (customization) notebook. Copy them before moving on.

In [None]:
print(f"Training data uploaded to:")
print(train_dataset_s3_path)

print(f"\nValidation data uploaded to:")
print(val_dataset_s3_path)

print(f"\nTest data uploaded to:")
print(test_dataset_s3_path)

In [21]:
%store train_dataset_s3_path
%store val_dataset_s3_path
%store test_dataset_s3_path

Stored 'train_dataset_s3_path' (str)
Stored 'val_dataset_s3_path' (str)
Stored 'test_dataset_s3_path' (str)


## 5. Deployment / Inference

Deployment and Inference are found in a different notebook.  Please see the Deployment notebook.