# Continued Pre-training Amazon Titan Text G1 Express with Bedrock

This notebook walks through the steps required to customize a model with continued pre-training. With Amazon Bedrock, you can use continued pre-training to adapt a model to domain knowledge that was not present when the base model was trained. You train the model on your own unlabeled domain data (business documents or another business corpus), and you can keep improving it by retraining with more unlabeled data as it becomes available.

## Pre-requisites

In [None]:
#Check that the Python version is greater than 3.8, which LangChain requires (only relevant if you plan to use LangChain)
import sys
sys.version
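
The version check above only prints the version string. A minimal sketch of a programmatic check follows. The `MIN_PYTHON` constant and the `meets_minimum` helper are hypothetical names introduced here, and treating 3.8 itself as acceptable is an assumption based on the comment in the cell.

```python
import sys

# Assumed minimum, taken from the comment in the cell above (hypothetical constant).
MIN_PYTHON = (3, 8)

def meets_minimum(version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if not meets_minimum(sys.version_info):
    found = sys.version.split()[0]
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later is recommended; found {found}")
```

Comparing tuples avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.8".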

## Install the SDK

NOTE:
This notebook requires the Bedrock Python SDK. Install it if you haven't done so yet. Refer to the 00_bedrock_onboarding.ipynb notebook for the steps to install it and to uninstall any previous version.

In [None]:
!pip install datasets

In [None]:
!pip install jinja2

# Create IAM Role and assign Permissions

We need to create two roles:

Role A- Notebook execution
--
This notebook requires permissions to invoke the Bedrock service. Make sure the notebook execution role has a policy similar to the following attached:
    
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Bedrock",
            "Effect": "Allow",
            "Action": "bedrock:*",
            "Resource": "*"
        }
    ]
}
```

Role B- Data access
--
To start the continued pre-training job, Role A passes a second IAM role to Bedrock, which the service assumes to run the job. Create a role with the following trust relationship:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "<account>"
        },
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:bedrock:<region>:<account>:model-customization-job/*"
        }
      }
    }
  ]
}
```
This role also needs permissions on the S3 bucket where the training and validation datasets are located, and on the S3 location where the continued pre-training job output will be written:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:ListObjects"
            ],
            "Resource": [
                "arn:aws:s3:::<train_set_bucket>",
                "arn:aws:s3:::<train_set_bucket>/*",
                "arn:aws:s3:::<test_set_bucket>",
                "arn:aws:s3:::<test_set_bucket>/*",
                "arn:aws:s3:::<job_output_bucket>",
                "arn:aws:s3:::<job_output_bucket>/*"
            ]
        }
    ]
}

```
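
The two Role B documents above contain placeholders (`<account>`, `<region>`, bucket names). A minimal sketch of filling them in programmatically follows. The helper names `build_trust_policy` and `build_s3_access_policy` are hypothetical, and the S3 action list is limited to `s3:GetObject`, `s3:PutObject`, and `s3:ListBucket`, which is an assumption about the minimum the job needs.

```python
import json

def build_trust_policy(account_id, region):
    """Trust policy letting Bedrock assume the role for customization jobs in one account and region."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnEquals": {
                    "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                },
            },
        }],
    }

def build_s3_access_policy(bucket_names):
    """Permissions policy covering each bucket and every object inside it."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")
        resources.append(f"arn:aws:s3:::{name}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Assumed minimal action set; adjust to your own security requirements.
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": resources,
        }],
    }

# Example with made-up values; replace them with your own account, region, and bucket.
print(json.dumps(build_trust_policy("123456789012", "us-west-2"), indent=2))
```

The resulting dictionaries can be serialized with `json.dumps` and supplied wherever a policy document is expected, for example as the `AssumeRolePolicyDocument` argument of the boto3 IAM client's `create_role` call.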


Also ensure that Role A has permission to pass Role B to Bedrock. Define this as an IAM policy similar to the one below:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"iam:GetRole",
				"iam:PassRole"
			],
			"Resource": "arn:aws:iam::<account>:role/*"
		}
	]
}

```


## Restart Kernel

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)  

In [1]:
import sagemaker
import boto3
session = boto3.Session()
sagemaker_session = sagemaker.Session()
studio_region = sagemaker_session.boto_region_name 
#sagemaker_session.get_caller_identity_arn()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
#Check if basic commands work with bedrock client
bedrock = boto3.client('bedrock')
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': 'ecde1195-5a74-4afe-8a44-bbd4ef99d911',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Tue, 23 Jan 2024 15:26:24 GMT',
   'content-type': 'application/json',
   'content-length': '17086',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'ecde1195-5a74-4afe-8a44-bbd4ef99d911'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': [],
   'inferenceTypesSupported': ['ON_DEMAND'],
   'modelLifecycle': {'status': 'ACTIVE'}},
  {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-g1-text-02',
   'modelId': 'amazon.titan-embed-g1-text-02',
   'modelName': 'Titan Text Embeddings v2',
   'providerName': 'Amazon',
   'inp

## Download datasets
In this step, we will use the [SQuAD dataset](https://arxiv.org/abs/1606.05250) for continued pre-training. Each row in this dataset pairs a question with a context paragraph, grouped by category (the `title` field). We will filter for one category and use the filtered set for continued pre-training.

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset("squad")

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
category = 'Solar_energy'
raw_train_set = raw_datasets['train']
raw_test_set = raw_datasets['validation']
train_set = raw_train_set.filter(lambda x: x['title'] == category)
test_set = raw_test_set.filter(lambda x: x['title'] == category)

if len(test_set) == 0: #If there is no test data set available, split the train set for this
    split_set = train_set.train_test_split(test_size=0.1)
    train_set,test_set = split_set['train'],split_set['test']

train_set, test_set

(Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 225
 }),
 Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 25
 }))
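
SQuAD attaches several questions to the same context paragraph, so the filtered rows above are likely to repeat contexts. Because continued pre-training in this notebook uses only the `context` field, it may be worth deduplicating first. The sketch below uses a hypothetical helper, `unique_contexts`, and made-up sample rows rather than the real dataset.

```python
def unique_contexts(rows):
    """Return each distinct 'context' string once, in first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        context = row["context"]
        if context not in seen:
            seen.add(context)
            unique.append(context)
    return unique

# Made-up rows that mimic the SQuAD layout: two questions share one context.
sample_rows = [
    {"context": "Paragraph A", "question": "Question 1"},
    {"context": "Paragraph A", "question": "Question 2"},
    {"context": "Paragraph B", "question": "Question 3"},
]
print(unique_contexts(sample_rows))
```

The helper accepts any iterable of dictionaries with a `context` key, so it can also be applied to `train_set` and `test_set`.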

## Prepare Training Dataset
In this step, we will upload the training data to an S3 bucket. A Bedrock customization job expects training and validation data in JSONL format, and there should not be any newline character at the end of the file.
We will convert the downloaded dataset into a JSONL file.
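
As an illustration of the target format, the sketch below builds JSONL text with `json.dumps` instead of a template. It assumes each record has the shape `{"input": "<text>"}`; the notebook's own template file is not shown here, so treat that shape and the helper name `to_jsonl` as assumptions.

```python
import json

def to_jsonl(contexts):
    """Serialize each context as one JSON record per line, with no trailing newline."""
    # json.dumps escapes quotes and newlines, so every record stays on a single line.
    lines = [json.dumps({"input": text}) for text in contexts]
    return "\n".join(lines)

sample = ['He said "hello".', "First line\nsecond line"]
print(to_jsonl(sample))
```

Letting `json.dumps` handle escaping means double quotes in the text do not have to be stripped beforehand.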

In [6]:
import pathlib , os, json
continued_pretraining_dataset_path = 'data/continued-pretraining'
pathlib.Path(continued_pretraining_dataset_path).mkdir(parents=True, exist_ok=True)

In [7]:
import jinja2

env = jinja2.Environment(loader=jinja2.FileSystemLoader('templates'))
squad_template = env.get_template('continued_pretraining.txt')

def create_data_file(data,file_name):
    data_len = len(data)
    with open(file_name,'w') as f:
        for i, item in enumerate(data):
            c = item['context']
            c = c.replace('"','') #Remove any double quotes in context
            jsonl = squad_template.render(context=c)
            f.write(jsonl)
            if i < (data_len -1):
                f.write('\n')
    print(f'File {file_name} created with {data_len} rows')

In [8]:
from io import StringIO 
from urllib.parse import urlparse
import boto3
import sys
import pandas as pd

def split_s3_path(s3_uri):
    parse_result = urlparse(s3_uri, allow_fragments=False)
    return parse_result.netloc,parse_result.path.lstrip('/')

def s3_csv_to_df(s3_uri):
    s3 = boto3.client('s3')
    bucket, object_key = split_s3_path(s3_uri)
    csv = s3.get_object(Bucket=bucket, Key=object_key)
    csvs = csv['Body'].read().decode('utf-8')
    df = pd.read_csv(StringIO(csvs),sep=',')
    return df


In [9]:
train_file_name = f'{continued_pretraining_dataset_path}/train_data.jsonl'
test_file_name = f'{continued_pretraining_dataset_path}/test_data.jsonl'

In [10]:
create_data_file(train_set,train_file_name)
create_data_file(test_set,test_file_name)

File data/continued-pretraining/train_data.jsonl created with 225 rows
File data/continued-pretraining/test_data.jsonl created with 25 rows


### Upload the files created locally to S3 and get the URIs

In [None]:
bucket = sagemaker_session.default_bucket()
continued_pretraining_prefix = 'bedrock_continued_pretraining'
train_data_s3_path = sagemaker_session.upload_data(train_file_name, bucket=bucket, key_prefix=continued_pretraining_prefix)
test_data_s3_path = sagemaker_session.upload_data(test_file_name, bucket=bucket, key_prefix=continued_pretraining_prefix)

train_data_s3_path,test_data_s3_path

## Start the customization job
We will prepare the inputs to the Continued pretraining (model customization job) and start the job.
Please make sure you have provided necessary IAM permissions explained in previous steps before proceeding further.

Besides the hyperparameters, we define the names (job name and custom model name), attach tags to the job and the custom model for tracking purposes, and set the locations of the training and test sets and the output path where the continued pre-training job saves the training and validation metrics.

We can optionally supply a VPC configuration so that calls from the continued pre-training job to fetch training and validation data are routed through a VPC and PrivateLink to S3. You can use security groups to define access controls for the job.

A continued pre-training job supports the following hyperparameters:

- Epochs
- Batch size
- Learning Rate
- Learning rate warmup steps

In [None]:
from uuid import uuid4
from datetime import datetime

#Define Hyperparameters
hyper_params = {"epochCount" : "1","batchSize":"1", "learningRate": "0.00005","learningRateWarmupSteps":"0"}

#Define names (Job, Model)
base_model = "amazon.titan-text-express-v1"
client_token = str(uuid4())
custom_model_name= category.lower() + '_pre_trained_model'
job_name = f"bedrock-cp-titan-{datetime.now().strftime('%Y%m-%d%H-%M%S')}"

#Define tags for tracking purposes
model_tags = [{"key": "custom_model_type","value": category.lower()}]
job_tags = [{"key": "base_model_type","value": base_model.replace(".","-")}]


output_s3_path = f's3://{bucket}/{continued_pretraining_prefix}/output/'

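#Uses the notebook's execution role by default. This only works if that role also has the trust relationship and S3 permissions described for Role B.
continued_pre_training_role = sagemaker_session.get_caller_identity_arn() 
#continued_pre_training_role = '' #Otherwise, provide the ARN of Role B here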

#Optional Setup VPC configuration
# vpc_config = {
#     "securityGroupIds":["sg-1","sg-2"],
#     "subnetIds":["subnet-a", "subnet-b"]
# }

client_token, custom_model_name, job_name, continued_pre_training_role

### Start the job using create_model_customization_job API
NOTE: This might take up to 30 minutes to complete, depending on the size of the training and validation datasets and on hyperparameters such as the number of epochs.

In [None]:
bedrock.create_model_customization_job(
    customizationType="CONTINUED_PRE_TRAINING",
    baseModelIdentifier = base_model,
    clientRequestToken = client_token,
    customModelName = custom_model_name,
    customModelTags = model_tags,
    jobTags=job_tags, 
    hyperParameters = hyper_params,
    jobName=job_name,
    outputDataConfig = {"s3Uri": output_s3_path},
    trainingDataConfig = {"s3Uri" : train_data_s3_path},
    #validationDataConfig =  {"validators": [ {"s3Uri": test_data_s3_path}]},
    roleArn = continued_pre_training_role
    #vpcConfig = vpc_config
)

## Monitor the Continued Pre-training job
Once the job is submitted, we get the job ARN. We can monitor the job with the list_model_customization_jobs and get_model_customization_job APIs.

In [None]:
bedrock.list_model_customization_jobs(nameContains=job_name)

In [None]:
#job_name = 'bedrock-titan-202308-2316-5550'
job_detail = bedrock.get_model_customization_job(jobIdentifier=job_name)
job_detail

In [None]:
job_detail['status'], job_detail['jobArn']
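
Rather than re-running the cell above by hand, the status can be polled in a loop. The sketch below is one way to do that. The helper name `wait_for_job`, the polling interval, and the set of terminal statuses are assumptions, and the client is passed in so the function can be exercised without calling AWS.

```python
import time

# Statuses after which the job is assumed not to change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_job(client, job_identifier, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_model_customization_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        detail = client.get_model_customization_job(jobIdentifier=job_identifier)
        status = detail["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_identifier} did not finish after {max_polls} polls")

# Usage in this notebook (not executed here):
# final_status = wait_for_job(bedrock, job_name)
```

Bounding the number of polls keeps a stalled job from blocking the notebook indefinitely.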

## Stop Model customization job (Optional)
We can stop a model customization job in progress

In [None]:
#Optional- Uncomment to stop the job

#bedrock.stop_model_customization_job(jobIdentifier=job_detail['jobArn'])

## View Training and Validation Metrics
Wait until the status of the job changes to "Completed" before proceeding further.
In this step, we will view the training and validation metrics.

In [None]:
job_metrics_out_s3_prefix = f"{job_detail['outputDataConfig']['s3Uri']}model-customization-job-{job_detail['jobArn'].split('/')[-1]}"
job_metrics_out_s3_prefix

### Print Step wise training metrics

In [None]:

training_metrics_csv = f'{job_metrics_out_s3_prefix}/training_artifacts/step_wise_training_metrics.csv'
train_metrics_df = s3_csv_to_df(training_metrics_csv)
train_metrics_df

### Print Validation metrics

In [None]:
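#Validation metrics are expected only if validationDataConfig was passed when the job was created (it is commented out in the job request above)
validation_metrics_csv = f'{job_metrics_out_s3_prefix}/validation_artifacts/post_fine_tuning_validation/validation/validation_metrics.csv'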
validation_metrics_df = s3_csv_to_df(validation_metrics_csv)
validation_metrics_df

## List Custom Models

In [None]:
bedrock.list_custom_models()

## Query custom model activation status
In this step we will check the activation status of the model and check if it is ready for taking real-time inference requests

In [None]:
custom_model = bedrock.get_custom_model(modelIdentifier=custom_model_name)
custom_model

## Invoke Custom Model
In this step, we will invoke the custom model trained with the domain-specific data. We will select a random item from the test set to get results.
NOTE: Wait until the model is "ACTIVE" in the real-time inference status. It might take up to 30 minutes for the model to become active, depending on the volume of the test and training datasets.

Amazon Bedrock allows you to run inference on custom models by purchasing provisioned throughput. This guarantees a consistent level of throughput in exchange for a term commitment. You specify the number of model units needed to meet your application’s performance needs. For evaluating custom models initially, you can purchase provisioned throughput hourly with no long-term commitment. With no commitment, a quota of one model unit is available per provisioned throughput. You can create up to two provisioned throughputs per account.
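
A minimal sketch of purchasing no-commitment provisioned throughput for the custom model is shown below. The helper name `create_no_commitment_throughput` and the provisioned model name are hypothetical. The sketch assumes that omitting `commitmentDuration` from `create_provisioned_model_throughput` requests the hourly, no-commitment option, and the client is passed in so the function can be exercised without calling AWS.

```python
def create_no_commitment_throughput(client, custom_model_name, provisioned_name, model_units=1):
    """Request provisioned throughput for a custom model and return its ARN."""
    response = client.create_provisioned_model_throughput(
        modelUnits=model_units,
        provisionedModelName=provisioned_name,
        modelId=custom_model_name,
        # commitmentDuration is omitted on purpose (assumed to mean no commitment).
    )
    return response["provisionedModelArn"]

# Usage in this notebook (not executed here; this starts hourly billing):
# provisioned_arn = create_no_commitment_throughput(
#     bedrock, custom_model_name, f"{custom_model_name}-pt"
# )
```

The returned ARN can then be used as the `modelId` when invoking the model. Provisioned throughput is billed for as long as it exists, so remove it with `delete_provisioned_model_throughput` once the evaluation is finished.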

In [None]:
import random
item_no = random.randint(0,len(test_set) - 1)
test_item = test_set[item_no]
test_item

In [None]:
import json
prompt_template = "Given the context below, answer the specified question. The answer should be extracted directly from the context verbatim. CONTEXT: {context} QUESTION: {question}"

prompt = prompt_template.format(context=test_item['context'],question=test_item['question'])

body = json.dumps({"inputText": prompt})
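#If invocation by model name is rejected, use the ARN of the provisioned throughput created for this model instead
modelId = custom_model_name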
accept = "application/json"
contentType = "application/json"

bedrock_runtime = boto3.client('bedrock-runtime' )

response = bedrock_runtime.invoke_model(
    body=body, modelId=modelId, accept=accept, contentType=contentType
)
response_body = json.loads(response.get("body").read())

print(response_body.get("results")[0].get("outputText"))
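
The request above sends only `inputText`, so generation uses the model's default settings. The sketch below shows one way to set generation parameters through `textGenerationConfig` and to read the output defensively. The helper names and the default values are assumptions chosen for illustration.

```python
import json

def build_titan_request(prompt, max_tokens=256, temperature=0.0, top_p=1.0, stop_sequences=None):
    """Serialize a Titan Text request body with explicit generation settings."""
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "temperature": temperature,
            "topP": top_p,
            "stopSequences": stop_sequences or [],
        },
    })

def extract_output_text(response_body):
    """Return the first result's outputText, or an empty string when no result is present."""
    results = response_body.get("results") or []
    return results[0].get("outputText", "") if results else ""

# Made-up response that mimics the shape read by the cell above.
print(extract_output_text({"results": [{"outputText": "solar panels"}]}))
```

A low temperature suits the extractive prompt used here, because the answer is meant to be copied verbatim from the context.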

## Delete Custom Model (Optional)
We can delete the custom model created above 

In [None]:
#Optional. Uncomment below line to remove the custom model created
#bedrock.delete_custom_model(modelIdentifier = custom_model['modelArn'])