# Setup for running customization notebooks for fine-tuning using Amazon Bedrock with Nova Micro

In this notebook, we will create an IAM role, an access policy, and an S3 bucket for Nova Micro fine-tuning. We'll also prepare the dataset in the format that Nova Micro requires.

> This notebook should work well with the **`Data Science 3.0`** image, the **`Python 3`** kernel, and an **`ml.t3.medium`** instance in SageMaker Studio.

## Prerequisites

### Custom job role

You can either create the Bedrock service role for customization jobs in this notebook (see the **Creating role and policies required to run customization jobs with Amazon Bedrock** section), or skip that section and create the role by following the [instructions on managing permissions for customization jobs](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-iam-role.html). To use an existing role, set the **customization_role** variable to its ARN and make sure the role has access to the S3 bucket that holds the dataset.

#### Create IAM Pre-requisites

This notebook requires permissions to:
- create and delete Amazon IAM roles
- create, update and delete Amazon S3 buckets
- access Amazon Bedrock

If you are running this notebook without an Admin role, make sure that your role includes the following managed policies:
- IAMFullAccess
- AmazonS3FullAccess
- AmazonBedrockFullAccess

You can also create a custom model in the Bedrock console by following the [custom models documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html).

## Setup

Install and import all the needed libraries and dependencies to complete this notebook.

<div class="alert alert-block alert-warning">
<b>Warning:</b> Please ignore error messages related to pip's dependency resolver.
</div>

In [None]:
!pip install --upgrade pip

%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

!pip install -qU --force-reinstall langchain typing_extensions pypdf urllib3==2.1.0
!pip install -qU "ipywidgets>=7,<8"
!pip install jsonlines
!pip install datasets==2.15.0
!pip install pandas==2.1.3
!pip install matplotlib==3.8.2

In [None]:
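# Restart the kernel so the newly installed packages take effect.
# The script below only works in the classic Jupyter Notebook UI; in JupyterLab
# or SageMaker Studio, restart the kernel from the Kernel menu instead.
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")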

In [None]:
import warnings
warnings.filterwarnings('ignore')

import json
import os
import sys
import boto3 
import time
import pprint
from datasets import load_dataset
import random
import jsonlines

In [None]:
session = boto3.session.Session()
region = "us-west-2" # Region needs to be us-west-2
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
s3_suffix = f"{region}-{account_id}"
bucket_name = f"bedrock-customization-{s3_suffix}"
s3_client = boto3.client('s3', region_name=region)
bedrock = boto3.client(service_name="bedrock", region_name=region)
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=region)
iam = boto3.client('iam', region_name=region)

In [None]:
role_name = "AmazonBedrockCustomizationRole1"
s3_bedrock_finetuning_access_policy="AmazonBedrockCustomizationPolicy1"
customization_role = f"arn:aws:iam::{account_id}:role/{role_name}"

## Testing boto3 connection

We will list the foundation models that support fine-tuning to test the boto3 connection and make sure the Bedrock client was created successfully.

In [None]:
for model in bedrock.list_foundation_models(
    byCustomizationType="FINE_TUNING")["modelSummaries"]:
    for key, value in model.items():
        print(key, ":", value)
    print("-----\n")

## Create S3 bucket

In this step we will create an S3 bucket, which will be used to store data for fine-tuning with Nova Micro.

In [None]:
# Create the S3 bucket that will hold the fine-tuning datasets
s3bucket = s3_client.create_bucket(
    Bucket=bucket_name,
    # LocationConstraint is required for every region except us-east-1
    CreateBucketConfiguration={
        'LocationConstraint':region,
    },
)
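
The bucket creation call above assumes a region other than `us-east-1`, which is the one region where S3 rejects an explicit `LocationConstraint`. As a minimal sketch, the hypothetical helper `build_create_bucket_kwargs` (not part of the original notebook) builds the arguments for either case:

```python
def build_create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Return keyword arguments for s3_client.create_bucket().

    us-east-1 is the default S3 location and must not be passed as a
    LocationConstraint, so the configuration block is added only for
    other regions.
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# The account ID below is a placeholder used only for illustration.
print(build_create_bucket_kwargs("bedrock-customization-us-west-2-111122223333", "us-west-2"))

# In this notebook the call would look like this (requires AWS credentials):
# s3_client.create_bucket(**build_create_bucket_kwargs(bucket_name, region))
```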

## Creating role and policies required to run customization jobs with Amazon Bedrock

This JSON object defines the trust relationship that allows the Bedrock service to assume a role giving it access to other required AWS services. The conditions restrict role assumption to a specific account ID and to a specific Bedrock resource type (`model-customization-job`).

In [None]:
ROLE_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Principal": {{
                "Service": "bedrock.amazonaws.com"
            }},
            "Action": "sts:AssumeRole",
            "Condition": {{
                "StringEquals": {{
                    "aws:SourceAccount": "{account_id}"
                }},
                "ArnEquals": {{
                    "aws:SourceArn": "arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                }}
            }}
        }}
    ]
}}
"""

This JSON object defines the permissions of the role that Bedrock will assume. It grants access to the S3 bucket we created for the fine-tuning datasets and allows the listed bucket and object operations.

In [None]:
ACCESS_POLICY_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetBucketAcl",
                "s3:GetBucketNotification",
                "s3:ListBucket",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::{bucket_name}",
                "arn:aws:s3:::{bucket_name}/*"
            ]
        }}
    ]
}}"""  

response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=ROLE_DOC,
    Description="Role for Bedrock to access S3 for finetuning",
)

pprint.pp(response)

In [None]:
role_arn = response["Role"]["Arn"]
pprint.pp(role_arn)

In [None]:
response = iam.create_policy(
    PolicyName=s3_bedrock_finetuning_access_policy,
    PolicyDocument=ACCESS_POLICY_DOC,
)

pprint.pp(response)

In [None]:
policy_arn = response["Policy"]["Arn"]
pprint.pp(policy_arn)

In [None]:
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn,
)
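
The trust policy in this section is written as an f-string, which requires every literal brace to be doubled. One alternative, sketched here under the assumption that the same statement is wanted, is to build the document as a Python dictionary and serialize it with `json.dumps`. The helper name `build_trust_policy` and the placeholder account ID are illustrative and not part of the original notebook.

```python
import json


def build_trust_policy(account_id: str, region: str) -> str:
    """Serialize the Bedrock trust policy for model customization jobs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only jobs from this account may assume the role.
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Only model customization jobs in this region may assume it.
                    "ArnEquals": {
                        "aws:SourceArn": (
                            f"arn:aws:bedrock:{region}:{account_id}"
                            ":model-customization-job/*"
                        )
                    },
                },
            }
        ],
    }
    return json.dumps(policy, indent=4)


# Placeholder account ID for illustration only.
print(build_trust_policy("111122223333", "us-west-2"))
```

The returned string can be passed as `AssumeRolePolicyDocument` in the same way as `ROLE_DOC`.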

The IAM role and S3 bucket needed by the other Nova Micro fine-tuning notebooks are now in place.

## Prepare CNN news article dataset for fine-tuning job and evaluation

The dataset is a collection of CNN news articles and the highlights associated with each article. More information is available on Hugging Face: https://huggingface.co/datasets/cnn_dailymail

In [None]:
# Load the CNN/DailyMail dataset from Hugging Face
dataset = load_dataset("cnn_dailymail", "3.0.0")

In [None]:
# View the structure of the dataset
print(dataset)

## Prepare the Fine-tuning Dataset for Nova Micro

For Nova Micro, we need to use the `bedrock-conversation-2024` schema format:

```json
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "System instruction here"
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "User message here"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Assistant response here"
        }
      ]
    }
  ]
}
```

We'll convert our CNN dataset to this format.
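
Before writing records to disk, it can help to check that each one matches the shape shown above. The following is a rough sketch, not the validation Bedrock itself performs: the function name `validate_conversation_record` is hypothetical, and the rules (alternating `user` and `assistant` turns, non-empty text blocks) are assumptions inferred from the example record.

```python
EXPECTED_SCHEMA = "bedrock-conversation-2024"


def _has_text(blocks):
    """True if blocks is a non-empty list of {"text": <non-blank str>} dicts."""
    return bool(blocks) and all(
        isinstance(b, dict) and isinstance(b.get("text"), str) and b["text"].strip()
        for b in blocks
    )


def validate_conversation_record(record):
    """Return a list of problems found in one record; an empty list means none."""
    problems = []
    if record.get("schemaVersion") != EXPECTED_SCHEMA:
        problems.append(f"schemaVersion should be {EXPECTED_SCHEMA!r}")
    if "system" in record and not _has_text(record["system"]):
        problems.append("system must contain non-empty text blocks")

    messages = record.get("messages") or []
    if len(messages) < 2:
        problems.append("messages needs at least one user and one assistant turn")
    for index, message in enumerate(messages):
        # Assumption: turns alternate, starting with the user.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"messages[{index}].role should be {expected_role!r}")
        if not _has_text(message.get("content")):
            problems.append(f"messages[{index}].content must contain non-empty text blocks")
    return problems


example = {
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{"text": "System instruction here"}],
    "messages": [
        {"role": "user", "content": [{"text": "User message here"}]},
        {"role": "assistant", "content": [{"text": "Assistant response here"}]},
    ],
}
print(validate_conversation_record(example))
```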

In [None]:
# Define the system instruction for summarization
system_instruction = "You are a helpful assistant that summarizes news articles accurately and concisely."

In [None]:
# Function to convert dataset to Nova Micro format
def convert_to_nova_micro_format(data_point):
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [
            {
                "text": system_instruction
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "text": f"Summarize the following news article:\n\n{data_point['article']}"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "text": data_point['highlights']
                    }
                ]
            }
        ]
    }

In [None]:
# Process the datasets
datapoints_train = [convert_to_nova_micro_format(dp) for dp in dataset['train']]
datapoints_valid = [convert_to_nova_micro_format(dp) for dp in dataset['validation']]
datapoints_test = [convert_to_nova_micro_format(dp) for dp in dataset['test']]

In [None]:
# Print an example of a processed data point
print(json.dumps(datapoints_train[4], indent=2))

## Process and filter the dataset

We'll keep only the records whose total text length fits within a character limit, then take a random sample of the requested size. For Nova Micro, the sample is capped at 20,000 records.

In [None]:
def dp_transform(data_points, num_dps, max_dp_length):
    lines = []
    for dp in data_points:
        # Calculate total length of text in the datapoint
        total_length = len(dp['system'][0]['text']) + \
                       len(dp['messages'][0]['content'][0]['text']) + \
                       len(dp['messages'][1]['content'][0]['text'])
        
        if total_length <= max_dp_length:
            lines.append(dp)
    
    random.shuffle(lines)
    lines = lines[:min(num_dps, 20000)]  # Cap at 20,000 samples as specified
    return lines
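
Note that `dp_transform` shuffles with the global `random` state, so each run selects a different subset unless a seed is set beforehand. If a repeatable selection is wanted, one option is a private seeded generator. The sketch below is a hypothetical variant (`select_records`, `text_length`, and `make_toy_record` are illustrative names), and it measures length in characters like the original, not in tokens.

```python
import random


def text_length(dp):
    """Total characters across the system prompt and every message text block."""
    system_chars = sum(len(block["text"]) for block in dp.get("system", []))
    message_chars = sum(
        len(block["text"])
        for message in dp["messages"]
        for block in message["content"]
    )
    return system_chars + message_chars


def select_records(data_points, num_dps, max_dp_length, seed=0):
    """Filter by character length, then pick num_dps records reproducibly."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    kept = [dp for dp in data_points if text_length(dp) <= max_dp_length]
    rng.shuffle(kept)
    return kept[:num_dps]


def make_toy_record(article, summary):
    """Build a tiny record in the conversation format, for demonstration only."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": "sys"}],
        "messages": [
            {"role": "user", "content": [{"text": article}]},
            {"role": "assistant", "content": [{"text": summary}]},
        ],
    }


toy = [make_toy_record("a" * n, "s") for n in (10, 50, 500, 20)]
first = select_records(toy, num_dps=2, max_dp_length=100, seed=7)
second = select_records(toy, num_dps=2, max_dp_length=100, seed=7)
print(first == second)  # the same seed gives the same selection
```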

In [None]:
def jsonl_converter(dataset, file_name):
    print(file_name)
    with jsonlines.open(file_name, 'w') as writer:
        for line in dataset:
            writer.write(line)

In [None]:
# Process data partitions with a character limit of 3,000
train = dp_transform(datapoints_train, 5000, 3000)
validation = dp_transform(datapoints_valid, 999, 3000)
test = dp_transform(datapoints_test, 10, 3000)

### Create local directory for datasets

Please note that your training dataset for fine-tuning cannot be greater than 20K records for Nova Micro, and validation dataset has a maximum limit of 1K records.
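
A small guard can catch oversized datasets before anything is written or uploaded. This is a sketch that uses only the two limits stated above; the function name `check_dataset_limits` is hypothetical.

```python
MAX_TRAIN_RECORDS = 20_000      # training limit for Nova Micro stated above
MAX_VALIDATION_RECORDS = 1_000  # validation limit stated above


def check_dataset_limits(num_train, num_validation):
    """Return a list of limit violations; an empty list means both sizes fit."""
    problems = []
    if num_train > MAX_TRAIN_RECORDS:
        problems.append(
            f"training set has {num_train} records, limit is {MAX_TRAIN_RECORDS}"
        )
    if num_validation > MAX_VALIDATION_RECORDS:
        problems.append(
            f"validation set has {num_validation} records, limit is {MAX_VALIDATION_RECORDS}"
        )
    return problems


# Sizes requested in this notebook: 5,000 training and 999 validation records.
print(check_dataset_limits(5000, 999))

# With the notebook's variables:
# check_dataset_limits(len(train), len(validation))
```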

In [None]:
dataset_folder = "fine-tuning-datasets"
train_file_name = "train-cnn-nova-micro.jsonl"
validation_file_name = "validation-cnn-nova-micro.jsonl"
test_file_name = "test-cnn-nova-micro.jsonl"

!mkdir -p fine-tuning-datasets
abs_path = os.path.abspath(dataset_folder)

In [None]:
# Create JSONL format datasets for Nova Micro fine-tuning
jsonl_converter(train, f'{abs_path}/{train_file_name}')
jsonl_converter(validation, f'{abs_path}/{validation_file_name}')
jsonl_converter(test, f'{abs_path}/{test_file_name}')
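
To confirm that the files were written as valid JSON Lines, each file can be read back and parsed line by line. The sketch below uses only the standard library; `count_jsonl_records` is a hypothetical helper, and the temporary file stands in for the real dataset files.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Parse every non-blank line as JSON and return how many records were read."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from exc
            count += 1
    return count


# Demonstration on a throwaway file instead of the real datasets.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"schemaVersion": "bedrock-conversation-2024"}) + "\n")
    handle.write(json.dumps({"schemaVersion": "bedrock-conversation-2024"}) + "\n")
print(count_jsonl_records(demo_path))

# With the notebook's variables:
# count_jsonl_records(f"{abs_path}/{train_file_name}")
```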

### Upload datasets to S3 bucket

Upload the training, validation, and test datasets.

We will use the training and validation datasets for fine-tuning the model. The test dataset will be used for evaluating the performance of the model on unseen input.

In [None]:
s3_client.upload_file(f'{abs_path}/{train_file_name}', bucket_name, f'fine-tuning-datasets/train/{train_file_name}')
s3_client.upload_file(f'{abs_path}/{validation_file_name}', bucket_name, f'fine-tuning-datasets/validation/{validation_file_name}')
s3_client.upload_file(f'{abs_path}/{test_file_name}', bucket_name, f'fine-tuning-datasets/test/{test_file_name}')
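
After uploading, one way to check that an object arrived intact is to compare its size in S3 with the local file size. This is a minimal sketch: `verify_upload` is a hypothetical helper, it relies on the S3 client's `head_object` call, and it needs AWS credentials when run against a real bucket.

```python
import os


def verify_upload(s3_client, local_path, bucket, key):
    """Return True when s3://bucket/key has the same byte size as local_path.

    head_object raises a ClientError if the object does not exist, so a
    missing upload surfaces as an exception and not as False.
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"] == os.path.getsize(local_path)


# With the notebook's variables:
# verify_upload(
#     s3_client,
#     f"{abs_path}/{train_file_name}",
#     bucket_name,
#     f"fine-tuning-datasets/train/{train_file_name}",
# )
```

A size match does not prove the contents are identical, but it is a cheap first check before starting a customization job.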

In [None]:
s3_train_uri = f's3://{bucket_name}/fine-tuning-datasets/train/{train_file_name}'
s3_validation_uri = f's3://{bucket_name}/fine-tuning-datasets/validation/{validation_file_name}'
s3_test_uri = f's3://{bucket_name}/fine-tuning-datasets/test/{test_file_name}'

## Storing variables to be used in other notebooks

> Please make sure to use the same kernel as used for 01_setup_nova_micro.ipynb for other notebooks on fine-tuning with Nova Micro.

In [None]:
%store role_arn
%store bucket_name
%store role_name
%store policy_arn
%store s3_train_uri
%store s3_validation_uri
%store s3_test_uri

### We are now ready to create a fine-tuning job with Nova Micro on Amazon Bedrock!