# Data Preparation for Anthropic Claude-3 Haiku Fine-Tuning

This notebook will guide you through the process of creating the necessary resources and preparing the datasets for fine-tuning the Anthropic Claude-3 Haiku model using Amazon Bedrock. By the end of this notebook, you will have created an IAM role, an S3 bucket, and training, validation, and testing datasets in the required format for the fine-tuning process.

### Pre-requisites

#### Custom job role

You can either create a Bedrock service role for customization jobs in the **Creating Role and Policies Required to Run Customization Jobs with Amazon Bedrock** section of this notebook, or skip that section and create the role by following the [instructions on managing permissions for customization jobs](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-iam-role.html). If you want to use an existing customization job role, edit the **customization_role** variable and make sure the role has access to the S3 bucket that contains the dataset.

#### Create IAM Pre-requisites

This notebook requires permissions to:

- create and delete Amazon IAM roles
- create, update and delete Amazon S3 buckets
- access Amazon Bedrock

If you are running this notebook without an Admin role, make sure that your role includes the following managed policies:

- IAMFullAccess

- AmazonS3FullAccess

- AmazonBedrockFullAccess

You can also create a custom model in the Bedrock console following the instructions [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-submit.html).

### Setup

Install and import all the needed libraries and dependencies to complete this notebook.

<div class="alert alert-block alert-warning">
<b>Warning:</b> Please ignore error messages related to pip's dependency resolver.
</div>

In [None]:
!pip install --upgrade pip
!pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"
!pip install -qU --force-reinstall langchain typing_extensions pypdf urllib3==2.1.0
!pip install -qU "ipywidgets>=7,<8"
!pip install jsonlines
!pip install datasets==2.15.0
!pip install pandas==2.1.3
!pip install matplotlib==3.8.2
!pip install py7zr

In [None]:
# restart kernel for packages to take effect
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import warnings
warnings.filterwarnings('ignore')
import json
import os
import sys
import boto3 
import time
import pprint
from datasets import load_dataset
import random
import jsonlines

In [None]:
session = boto3.session.Session()
region = session.region_name
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
s3_suffix = f"{region}-{account_id}"
bucket_name = f"bedrock-haiku-customization-{s3_suffix}"
s3_client = boto3.client('s3')
bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")
iam = boto3.client('iam', region_name=region)

In [None]:
import uuid
suffix = str(uuid.uuid4())
role_name = "BedrockRole-" + suffix
s3_bedrock_finetuning_access_policy="BedrockPolicy-" + suffix
customization_role = f"arn:aws:iam::{account_id}:role/{role_name}"

### Testing boto3 connection

We will list the foundation models that support fine-tuning to test the boto3 connection and make sure the Bedrock client has been created successfully.

In [None]:
for model in bedrock.list_foundation_models(
    byCustomizationType="FINE_TUNING")["modelSummaries"]:
    for key, value in model.items():
        print(key, ":", value)
    print("-----\n")

### Create S3 Bucket

In this step we will create an S3 bucket, which will be used to store the data for Claude-3 Haiku fine-tuning.

In [None]:
# Create the S3 bucket that will hold the fine-tuning datasets.
# Outside us-east-1, S3 requires an explicit LocationConstraint.
if region == "us-east-1":
    s3bucket = s3_client.create_bucket(Bucket=bucket_name)
else:
    s3bucket = s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

### Creating Role and Policies Required to Run Customization Jobs with Amazon Bedrock

This JSON object defines the trust relationship that allows the Bedrock service to assume a role that gives it the ability to talk to other required AWS services. The conditions restrict assumption of the role to a specific account ID and a specific component of the Bedrock service (model customization jobs).

In [None]:
ROLE_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Principal": {{
                "Service": "bedrock.amazonaws.com"
            }},
            "Action": "sts:AssumeRole",
            "Condition": {{
                "StringEquals": {{
                    "aws:SourceAccount": "{account_id}"
                }},
                "ArnEquals": {{
                    "aws:SourceArn": "arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                }}
            }}
        }}
    ]
}}
"""
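Because the trust policy is embedded in an f-string, every literal brace has to be doubled, which is easy to get wrong. As an alternative sketch, the same document can be built as a Python dictionary and serialized with `json.dumps`. The function name `build_trust_policy` and the sample account ID and region are illustrative assumptions, not part of the notebook.

```python
import json

def build_trust_policy(account_id: str, region: str) -> str:
    """Return the Bedrock trust policy as a JSON string (illustrative helper)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnEquals": {
                        "aws:SourceArn": (
                            f"arn:aws:bedrock:{region}:{account_id}"
                            ":model-customization-job/*"
                        )
                    },
                },
            }
        ],
    }
    return json.dumps(policy, indent=4)

# Placeholder values for illustration only
trust_policy_json = build_trust_policy("111122223333", "us-east-1")
print(trust_policy_json)
```

The resulting string could be passed as `AssumeRolePolicyDocument` in the same way as `ROLE_DOC`.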

The following JSON object defines the permissions of the role that Bedrock will assume. It grants access to the S3 bucket we created to hold the fine-tuning datasets and allows a limited set of bucket and object operations.

In [None]:
ACCESS_POLICY_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetBucketAcl",
                "s3:GetBucketNotification",
                "s3:ListBucket",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::{bucket_name}",
                "arn:aws:s3:::{bucket_name}/*"
            ]
        }}
    ]
}}"""

In [None]:
response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=ROLE_DOC,
    Description="Role for Bedrock to access S3 for haiku finetuning",
)
pprint.pp(response)

In [None]:
role_arn = response["Role"]["Arn"]
pprint.pp(role_arn)

In [None]:
response = iam.create_policy(
    PolicyName=s3_bedrock_finetuning_access_policy,
    PolicyDocument=ACCESS_POLICY_DOC,
)
pprint.pp(response)

In [None]:
policy_arn = response["Policy"]["Arn"]
pprint.pp(policy_arn)

In [None]:
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn,
)

### Prepare Dataset for Claude-3 Haiku fine-tuning and Evaluation

The dataset used here is SAMSum, a collection of messenger-like conversations with summaries.

In [None]:
#Load samsum dataset from huggingface
dataset = load_dataset("knkarthick/samsum")

In [None]:
print(dataset)

To fine-tune the Claude-3 Haiku model, the training data must be in JSONL (JSON Lines) format, where each line represents a single training record. Specifically, the training data format aligns with the [Messages API](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-anthropic-claude-messages.html):

<pre style="background-color: #e0e0e0;">
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
</pre>


In each line, the `system` message is optional. It provides context and instructions to the Haiku model, such as a particular goal or role, and is often known as a [system prompt](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts).
The `user` input corresponds to the user’s instruction, and the `assistant` input is the desired response that the fine-tuned Haiku model should provide. 
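To make the format concrete, here is a minimal sketch that builds one record and checks its shape before it is written to a JSONL file. The function name `validate_record` and the sample text are illustrative assumptions; the checks cover only the single-turn layout shown above and are not an exhaustive restatement of the Bedrock requirements.

```python
import json

def validate_record(record: dict) -> None:
    """Raise ValueError if the record does not match the single-turn layout."""
    if "system" in record and not isinstance(record["system"], str):
        raise ValueError("'system' must be a string when present")
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        raise ValueError("'messages' must hold one user and one assistant turn")
    for message, expected_role in zip(messages, ("user", "assistant")):
        if message.get("role") != expected_role:
            raise ValueError(
                f"expected role {expected_role!r}, got {message.get('role')!r}"
            )
        content = message.get("content")
        if not isinstance(content, str) or not content:
            raise ValueError(f"{expected_role} content must be a non-empty string")

# Sample record for illustration only
sample = {
    "system": "You summarize conversations.",
    "messages": [
        {"role": "user", "content": "Summarize: A: Lunch at noon? B: Sure."},
        {"role": "assistant", "content": "A and B agree to have lunch at noon."},
    ],
}
validate_record(sample)

# Each record occupies exactly one line in the JSONL file
print(json.dumps(sample))
```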

A common prompt structure for instruction fine-tuning includes a system prompt, an instruction, and an input that provides additional context. Here we define the system prompt, which goes in the `system` field of each record, and an instruction header that is prepended to each conversation. Together, the header and the conversation form the user content of each datapoint.

In [None]:
system_string = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."

In [None]:
instruction = """instruction:

Summarize the conversation provided below.

input:
"""

For the `assistant` component, we use the summary of each conversation. Each datapoint is transformed with the code below.

In [None]:
# Process the training dataset
datapoints_train=[]
for dp in dataset['train']:
    temp_dict={}
    temp_dict["system"] = system_string
    temp_dict["messages"] = [
        {"role": "user", "content": instruction+dp['dialogue']},
        {"role": "assistant", "content": dp['summary']}
    ]
    datapoints_train.append(temp_dict)


An example of a processed datapoint is printed below.

In [None]:
print(datapoints_train[4])

The same processing is done for the validation and test sets as well.

In [None]:
# Process the validation dataset
datapoints_valid=[]
for dp in dataset['validation']:
    temp_dict={}
    temp_dict["system"] = system_string
    temp_dict["messages"] = [
        {"role": "user", "content": instruction+dp['dialogue']},
        {"role": "assistant", "content": dp['summary']}
    ]
    datapoints_valid.append(temp_dict)


In [None]:
# Process the test dataset
datapoints_test=[]
for dp in dataset['test']:
    temp_dict={}
    temp_dict["system"] = system_string
    temp_dict["messages"] = [
        {"role": "user", "content": instruction+dp['dialogue']},
        {"role": "assistant", "content": dp['summary']}
    ]
    datapoints_test.append(temp_dict)
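The three loops above apply the same transformation to each split. As a sketch of how that repetition could be factored out, the hypothetical helper `to_record` below builds one record from a dialogue and a summary. The `demo_system` and `demo_instruction` strings and the sample rows are stand-ins for illustration only.

```python
def to_record(dialogue: str, summary: str, system: str, instruction: str) -> dict:
    """Build one fine-tuning record in the system/messages layout."""
    return {
        "system": system,
        "messages": [
            {"role": "user", "content": instruction + dialogue},
            {"role": "assistant", "content": summary},
        ],
    }

# Stand-in strings; the notebook defines the real ones earlier
demo_system = "Write a response that appropriately completes the request."
demo_instruction = "instruction:\n\nSummarize the conversation provided below.\n\ninput:\n"

# Stand-in rows shaped like the dataset's 'dialogue' and 'summary' fields
rows = [
    {"dialogue": "Amy: Are you coming tonight?\nBen: Yes, at 8.",
     "summary": "Ben will come at 8."},
    {"dialogue": "Cat: Did you send the file?\nDan: Just now.",
     "summary": "Dan sent the file."},
]
records = [
    to_record(r["dialogue"], r["summary"], demo_system, demo_instruction)
    for r in rows
]
print(records[0]["messages"][0]["content"])
```

In the notebook, the same helper could be applied to each split, for example `[to_record(dp['dialogue'], dp['summary'], system_string, instruction) for dp in dataset['train']]`.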

Here we define helper functions to process the datapoints further. The first limits the number of datapoints in each set and the maximum character length of each datapoint. The second writes a dataset to a JSONL file.

In [None]:
def dp_transform(data_points,num_dps,max_dp_length):
    """
    This function filters and selects a subset of data points from the provided list based on the specified maximum length 
    and desired number of data points.
    """ 
    lines=[]
    for dp in data_points:
        if len(dp['system']+dp['messages'][0]['content']+dp['messages'][1]['content'])<=max_dp_length:
            lines.append(dp)
    random.shuffle(lines)
    lines=lines[:num_dps]
    return lines
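Two details of `dp_transform` are worth noting: the length check counts characters rather than tokens, and `random.shuffle` is unseeded, so each run selects a different subset. If a repeatable selection is preferred, one option is to shuffle with a seeded `random.Random` instance, as in this sketch. The name `dp_transform_seeded` and the `seed` parameter are assumptions introduced for illustration.

```python
import random

def dp_transform_seeded(data_points, num_dps, max_dp_length, seed=42):
    """Filter by total character length, then pick a reproducible random subset."""
    kept = [
        dp for dp in data_points
        if len(
            dp["system"]
            + dp["messages"][0]["content"]
            + dp["messages"][1]["content"]
        ) <= max_dp_length
    ]
    # A private Random instance avoids changing the global random state
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[:num_dps]

# Toy records of increasing length, for illustration only
toy = [
    {
        "system": "s",
        "messages": [
            {"role": "user", "content": "u" * n},
            {"role": "assistant", "content": "a"},
        ],
    }
    for n in range(1, 11)
]
first = dp_transform_seeded(toy, num_dps=3, max_dp_length=8)
second = dp_transform_seeded(toy, num_dps=3, max_dp_length=8)
print(first == second)
```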

In [None]:
def jsonl_converter(dataset,file_name):
    """
    This function writes the provided dataset to a JSONL (JSON Lines) file.
    """
    print(file_name)
    with jsonlines.open(file_name, 'w') as writer:
        for line in dataset:
            writer.write(line)
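After writing, it can be useful to read the file back and confirm that every line parses as JSON and that the record count matches. The sketch below uses only the standard library `json` module instead of `jsonlines`; the function names, sample records, and temporary file are assumptions for illustration.

```python
import json
import os
import tempfile

def write_jsonl(records, file_name):
    """Write one JSON object per line."""
    with open(file_name, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(file_name):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(file_name, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Sample records; note the embedded newline, which JSON escapes as \n
demo_records = [
    {"system": "s", "messages": [{"role": "user", "content": "hi\nthere"},
                                 {"role": "assistant", "content": "hello"}]},
    {"system": "s", "messages": [{"role": "user", "content": "bye"},
                                 {"role": "assistant", "content": "goodbye"}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.jsonl")
    write_jsonl(demo_records, path)
    round_trip = read_jsonl(path)

print(len(round_trip), round_trip == demo_records)
```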

Claude-3 Haiku fine-tuning has the following requirements for your datasets:

- Context length can be up to 32,000 tokens
- The training dataset cannot have more than 10,000 records
- The validation dataset cannot have more than 1,000 records
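The record-count limits above can be checked before a job is submitted. The following sketch hard-codes the limits as stated in this notebook; they may change, so treat the constants as assumptions to verify against the current Bedrock documentation. The function name `check_record_counts` is hypothetical. The 32,000-token context limit is not checked here because that would require a tokenizer.

```python
MAX_TRAIN_RECORDS = 10_000       # limit as stated in this notebook
MAX_VALIDATION_RECORDS = 1_000   # limit as stated in this notebook

def check_record_counts(num_train: int, num_validation: int) -> list:
    """Return a list of problems; an empty list means the counts fit."""
    problems = []
    if num_train > MAX_TRAIN_RECORDS:
        problems.append(
            f"training set has {num_train} records, limit is {MAX_TRAIN_RECORDS}"
        )
    if num_validation > MAX_VALIDATION_RECORDS:
        problems.append(
            f"validation set has {num_validation} records, "
            f"limit is {MAX_VALIDATION_RECORDS}"
        )
    return problems

# Split sizes well under the limits produce no problems
print(check_record_counts(1000, 100))
# An oversized training set is reported
print(check_record_counts(14_000, 100))
```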

For simplicity, we will process the datasets as follows:

In [None]:
train=dp_transform(datapoints_train,1000,20000)
validation=dp_transform(datapoints_valid,100,20000)
test=dp_transform(datapoints_test,10,20000)

### Create Local Directory for Datasets

Save the processed data locally in JSONL format.

In [None]:
dataset_folder="haiku-fine-tuning-datasets-samsum"
train_file_name="train-samsum-1K.jsonl"
validation_file_name="validation-samsum-100.jsonl"
test_file_name="test-samsum-10.jsonl"
# Create the folder if it does not exist yet (safe to re-run)
os.makedirs(dataset_folder, exist_ok=True)
abs_path=os.path.abspath(dataset_folder)

In [None]:
jsonl_converter(train,f'{abs_path}/{train_file_name}')
jsonl_converter(validation,f'{abs_path}/{validation_file_name}')
jsonl_converter(test,f'{abs_path}/{test_file_name}')

### Upload Datasets to S3 Bucket

These code blocks upload the training, validation, and test datasets to the S3 bucket. The training and validation datasets will be used for the Haiku fine-tuning job, and the test dataset will be used to compare the performance of the fine-tuned and base Haiku models.

In [None]:
s3_client.upload_file(f'{abs_path}/{train_file_name}', bucket_name, f'haiku-fine-tuning-datasets/train/{train_file_name}')
s3_client.upload_file(f'{abs_path}/{validation_file_name}', bucket_name, f'haiku-fine-tuning-datasets/validation/{validation_file_name}')
s3_client.upload_file(f'{abs_path}/{test_file_name}', bucket_name, f'haiku-fine-tuning-datasets/test/{test_file_name}')

In [None]:
s3_train_uri=f's3://{bucket_name}/haiku-fine-tuning-datasets/train/{train_file_name}'
s3_validation_uri=f's3://{bucket_name}/haiku-fine-tuning-datasets/validation/{validation_file_name}'
s3_test_uri=f's3://{bucket_name}/haiku-fine-tuning-datasets/test/{test_file_name}'
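The object keys and URIs above follow the same `prefix/split/file` pattern. As a small sketch, a hypothetical helper can produce both from one place so that the upload key and the URI cannot drift apart. The function name and the bucket name below are placeholders.

```python
def s3_key_and_uri(bucket: str, split: str, file_name: str,
                   prefix: str = "haiku-fine-tuning-datasets") -> tuple:
    """Return the object key and the matching s3:// URI for one dataset split."""
    key = f"{prefix}/{split}/{file_name}"
    return key, f"s3://{bucket}/{key}"

# Placeholder bucket name for illustration only
key, uri = s3_key_and_uri("example-bucket", "train", "train-samsum-1K.jsonl")
print(key)
print(uri)
```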

### Storing Variables

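The `%store` magic saves the variables below so that the fine-tuning notebook can restore them with `%store -r`. Make sure to use the same kernel in the Haiku fine-tuning notebook.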

In [None]:
%store role_arn
%store bucket_name
%store role_name
%store policy_arn
%store s3_train_uri
%store s3_validation_uri
%store s3_test_uri