# Continued Pre-Training with Amazon Bedrock

#### GitHub repo
https://github.com/generative-ai-on-aws/generative-ai-on-aws

![](images/github.png)

_Note: This notebook was tested in Amazon SageMaker Studio with the Python 3 (Data Science 3.0) kernel on an ml.t3.medium instance._

-----------

### 1. Setup
### 2. Test the base model
### 3. Prepare the dataset for continued pre-training
### 4. Upload the dataset to S3
### 5. Customize the model with continued pre-training
### 6. Provision the custom model for inference
### 7. Test the custom model
### 8. Delete the provisioned model to save cost

-----------

# 1. Setup

In [3]:
%pip install -q -U --force-reinstall \
    boto3 \
    pandas==2.1.2 \
    langchain==0.0.324 \
    typing_extensions==4.7.1 \
    pypdf==3.16.4

In [4]:
import boto3
import json
import time
from pprint import pprint
from IPython.display import display, HTML
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 2)
pd.set_option('display.max_colwidth', 1000)

In [5]:
# Amazon Bedrock control plane including fine-tuning
bedrock = boto3.client(service_name="bedrock")

# Amazon Bedrock data plane including model inference
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

In [6]:
base_model_id = "amazon.titan-text-express-v1"

# 2. Test the base model

In [7]:
prompt = "Describe the book, 'Generative AI on AWS' by O'Reilly Media."

body = {
    "inputText": prompt,
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "stopSequences": [],
        "temperature": 1,
        "topP": 0.9
    }
}

![](images/gaia_cover_sm.png)

### Amazon Titan Text model

In [8]:
response = bedrock_runtime.invoke_model(
    modelId=base_model_id, # Amazon Titan Text model
    body=json.dumps(body)
)

response_body = response["body"].read().decode('utf8')
print(json.loads(response_body)["results"][0]["outputText"])


Amazon Web Services (AWS) is at the forefront of machine learning (ML) and artificial intelligence (AI), and this book shows you how to leverage the power of AWS to create, train, and deploy generative AI (GAI) models.

Written by AI and ML experts from AWS, this comprehensive guide provides a step-by-step roadmap for building, training, and deploying GAI models using AWS services such as Amazon SageMaker, Amazon Transcribe, and Amazon Rekognition.

The book begins by introducing the basics of GAI and its applications in fields such as natural language processing, image generation, and speech recognition. It then provides an overview of AWS services that can be used to build, train, and deploy GAI models, including Amazon SageMaker, Amazon Transcribe, Amazon Rekognition, and Amazon Transcribe.

Each chapter in the book focuses on a specific aspect of GAI and provides hands-on examples and code snippets that demonstrate how to use AWS services to build, train, and deploy GAI models. Th
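
The request body and the response parsing above are repeated later when we test the custom model. A minimal sketch of two helpers that factor them out is shown below. The names `build_titan_request` and `extract_output_text` are hypothetical (they are not part of boto3 or Bedrock), and the field names are simply the ones used in the cells above.

```python
import json


def build_titan_request(prompt, max_tokens=512, temperature=1.0, top_p=0.9):
    """Serialize a Titan Text request body with the same fields used above."""
    body = {
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "stopSequences": [],
            "temperature": temperature,
            "topP": top_p,
        },
    }
    return json.dumps(body)


def extract_output_text(response_body):
    """Return the generated text from a decoded Titan Text response body."""
    payload = json.loads(response_body)
    results = payload.get("results", [])
    if not results:
        raise ValueError("response contained no results")
    return results[0]["outputText"]
```

With these in place, a call could pass `body=build_titan_request(prompt)` to `invoke_model` and hand the decoded response body to `extract_output_text`.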

# 3. Prepare the dataset for continued pre-training

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/Generative_AI_on_AWS_OReilly.pdf")
document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 20000, # in characters; stays under 4096-token context window * ~6 chars per token ~= 24,576
    chunk_overlap = 2000, # overlap for continuity across chunks
)

docs = text_splitter.split_documents(document)
docs[0]

Document(page_content='Generative AI  \n on AWS\nBuilding Context-Aware  \nMultimodal Reasoning  \nApplications\nChris Fregly, \nAntje Barth &  \nShelbee Eigenbrode', metadata={'source': 'data/Generative_AI_on_AWS_OReilly.pdf', 'page': 0})
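
The chunk size above comes from a rough budget: a 4096-token context window at about 6 characters per token. The sketch below applies that same heuristic to flag chunks that may not fit. The 6 characters-per-token ratio is the notebook's estimate, not a measured tokenizer value, and the helper names are hypothetical.

```python
import math

CONTEXT_WINDOW_TOKENS = 4096  # from the comment in the cell above
CHARS_PER_TOKEN = 6           # rough heuristic from the notebook, not a tokenizer measurement


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def oversized_chunks(texts, max_tokens=CONTEXT_WINDOW_TOKENS):
    """Return the indexes of chunks whose estimated token count exceeds the budget."""
    return [i for i, text in enumerate(texts) if estimate_tokens(text) > max_tokens]
```

For the chunks produced above, `oversized_chunks([d.page_content for d in docs])` would list any chunk worth splitting further.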

In [10]:
contents = ""
for doc in docs:
    content = {"input": doc.page_content}
    contents += (json.dumps(content) + "\n")

In [11]:
with open("./train-continual-pretraining.jsonl", "w") as file:
    # contents is a single string; the with block closes the file on exit
    file.write(contents)

In [12]:
import pandas as pd
df = pd.read_json("./train-continual-pretraining.jsonl", lines=True)
df

Unnamed: 0,input
0,"Generative AI \n on AWS\nBuilding Context-Aware \nMultimodal Reasoning \nApplications\nChris Fregly, \nAntje Barth & \nShelbee Eigenbrode"
1,"DATA“I am very excited about \nthis book —it has a great \nmix of all-important \nbackground/theoretical \ninfo and detailed, \nhands-on code, scripts, \nand walk-throughs. I \nenjoyed reading it, and \nI know that you will too!” \n—Jeff Barr\nVP and Chief Evangelist @ AWSGenerative AI on AWS\nTwitter: @oreillymedia\nlinkedin.com/company/oreilly-media\nyoutube.com/oreillymedia Companies today are moving rapidly to integrate generative \nAI into their products and services. But there’s a great deal \nof hype (and misunderstanding) about the impact and \npromise of this technology. With this book, Chris Fregly, \nAntje Barth, and Shelbee Eigenbrode from AWS help CTOs, \nML practitioners, application developers, business analysts, \ndata engineers, and data scientists find practical ways to \nuse this exciting new technology.\nYou’ll learn the generative AI project life cycle including \nuse case definition, model selection, model fine-tuning, \nretrieval-augmented generation, ..."
2,"Praise for Generative AI on AWS\nI am very excited about this book—it has a great mix of all-important background/\ntheoretical info and detailed, hands-on code, scripts, and walk-throughs. I enjoyed\nreading it, and I know that you will too! Starting from the basics, you will learn about\ngenerative foundation models, prompt engineering, and much more. From there you will\nproceed to large language models (LLMs) and will see how to use them from within\nAmazon SageMaker. After you master the basics, you will have the opportunity to learn\nabout multiple types of fine-tuning, and then you will get to the heart of the book and\nlearn to build applications that have the power to perform context-aware reasoning with\ngenerative models of different modalities including text and images.\n—Jeff Barr, VP and Chief Evangelist @ AWS\nThis book is a comprehensive resource for building generative AI–based solutions\non AWS. Using real-world examples, Chris, Antje, and Shelbee have done a\nspe..."
3,"It’s very rare to find a book that comprehensively covers the full end-to-end process of\nmodel development and deployment! If you’re an ML practitioner, this book is a must!\n—Alejandro Herrera, Data Scientist @ Snowflake\nThis book goes deep into how GenAI models are actually built and used. And it covers\nthe whole life cycle, not just prompt engineering or tuning. If you’re thinking about using\nGenAI for anything nontrivial, you should read this book to understand what skill sets\nand tools you’ll need to be successful.\n—Randy DeFauw, Sr. Principal Solution Architect @ AWS\nThere’s no better book to get started with generative AI. With all the information\non the internet about the topic, it’s extremely overwhelming for anyone. But this\nbook is a clear and structured guide: it goes from the basics all the way to\nadvanced topics like parameter-efficient fine-tuning and LLM deployment. It’s also\nvery practical and covers deployment on AWS too. This book is an extremely\nvalu..."
4,"Generative AI on AWS provides an in-depth look at the innovative techniques for creating\napplications that comprehend diverse data types and make context-driven decisions.\nReaders get a comprehensive view, bridging both the theoretical aspects and practical\ntools needed for generative AI applications. This book is a must-read for those wanting to\nharness the full potential of AWS in the realm of generative AI.\n—Kesha Williams, Director @ Slalom Consulting\nand AWS Machine Learning Hero\nThe generative AI landscape evolves so fast that it’s incredible to see so much relevant\nknowledge condensed into a comprehensive book. Well done!\n—Francesco Mosconi, Head of Data Science @ Catalit"
...,...
300,"greedy versus random sampling, 30\nmax new tokens, 30\ntemperature, 33\ntop-p and top-k random sampling, 31-32\ninput_id as token in model’s vocabulary, 37\nnumeric representations of each token,\n37\nscaling laws, 49-51\ntokenizers, 37\nembeddings, 38-40\ntraining large-language foundation models\nabout, 35\ncompute-optimal models, 51\nembeddings, 38-40\nend-of-sequence (EOS) token, 45\npretrained models publicly available, 36\npretraining datasets, 48\npretraining datasets optimal size, 51\npublicly available, pretrained models’\nvocabulary, 37\nscaling laws, 49-51\ntokenizers, 37\nTransformer architecture, 40-45\ntypes of Transformer-based foundation\nmodels, 46-48\ntoken_ids from tokenizer, 37\ntop-p and top-k random sampling, 31-32\ntoxic language detection\nabout toxic language reduction, 122\nexample model, 115\nreward model by Meta, 123\nreward model by Meta used to fine-tune,\n123, 126-128\nfine-tuning with RLHF, 124-131\ntoxicity baseline compared to fine-tuned\nmodel, 13..."
301,"Transformers-NeuronX library, 145\n“tricking” a model to learn incorrect answer, 23\nTRL library, 127\nPPOTrainer, 127\nU\nU-Net models, 223\nunidirectional causal language modeling\n(CLM), 47\nunimodal models defined, 195\nupdate and deployment strategies, 147\nA/B testing, 148-149\nshadow deployment, 149\nupscaling in image generation, 203\nuse cases and tasks for generative AI, 1-4\nV\nvector stores\nabout, 168, 254\nAWS generative AI, 9\nRAG document search and retrieval, 9, 161,\n167\nexample with FAISS and LangChain,\n168-171\nvector embedding storage options, 168\nsupported by LangChain, 169vectors (see embedding vectors or “embed‐\ndings”)\nvirtual tokens as soft prompts, 106\ngenerating virtual tokens, 107\nvisual language models (VLMs), 209\n(see also multimodal large language models)\nvisual question answering (VQA), 211-216\nevaluating, 219\nVizWiz-VQA dataset, 219\nVQAv2 dataset, 219\nW\nweights (see parameters)\nWiki-40B as a pretraining dataset, 49\nWikipedia as a pr..."
302,"About the Authors\nChris Fregly is a Principal Solutions Architect for generative AI at Amazon Web\nServices based in San Francisco, California. Chris holds every AWS certification. He\nis also cofounder of the global Generative AI on AWS Meetup. Chris regularly speaks\nat AI and machine learning meetups and conferences across the world. Previously,\nChris was an engineer at Databricks and Netflix, where he worked on scalable big\ndata and machine learning products and solutions. He is also coauthor of the O’Reilly\nbook Data Science on AWS .\nAntje Barth is a Principal Developer Advocate for generative AI at Amazon Web\nServices based in San Francisco, California. She is also cofounder of the global Gener‐\native AI on AWS Meetup and the Düsseldorf chapter of Women in Big Data. Antje\nfrequently speaks at AI and machine learning conferences and meetups around the\nworld. Prior to joining AWS, Antje worked in solutions engineering roles at MapR\nand Cisco, helping developers leve..."
303,"with their ability to create novel song patterns—to the great classical composers and\nmodern artists like Kendrick Lamar.\nThough its population size has not been precisely quantified, the brown-backed\nmockingbird has been categorized by the IUCN as a species of least concern. Many\nof the animals on O’Reilly covers are endangered; all of them are important to the\nworld.\nThe cover illustration is by Karen Montgomery, based on an antique line engraving\nfrom Cassell’s Natural History . The cover fonts are Gilroy Semibold and Guardian\nSans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Con‐\ndensed; and the code font is Dalton Maag’s Ubuntu Mono."
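
Before uploading, it can help to confirm that every line of the file is a well-formed record. The sketch below assumes the one-key layout this notebook writes (`{"input": "<text>"}`). The function name `validate_jsonl` is hypothetical, and the check covers only the structure of each record, not any service-side size limits.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) pairs for records that are not {"input": <non-empty str>}."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict) or set(record) != {"input"}:
            problems.append((number, 'expected exactly one key, "input"'))
        elif not isinstance(record["input"], str) or not record["input"].strip():
            problems.append((number, '"input" must be a non-empty string'))
    return problems
```

A possible usage is `validate_jsonl(open("./train-continual-pretraining.jsonl"))`, which returns an empty list when every record passes.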


In [13]:
data = "./train-continual-pretraining.jsonl"

# 4. Upload the dataset to S3

Next, we need to upload our training dataset to S3:

In [14]:
import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session_bucket = sess.default_bucket()

s3_location = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/train-continual-pretraining.jsonl"
s3_output = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/output"

In [15]:
!aws s3 cp train-continual-pretraining.jsonl $s3_location

upload: ./train-continual-pretraining.jsonl to s3://sagemaker-us-east-1-079002598131/bedrock/finetuning/train-continual-pretraining.jsonl
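
If the AWS CLI is not available in the kernel, the same upload can be done from Python. The sketch below only splits an `s3://` URI into the bucket and key that such a call needs. `split_s3_uri` is a hypothetical helper, not a boto3 function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

With the variables defined above, `bucket, key = split_s3_uri(s3_location)` followed by `boto3.client("s3").upload_file(data, bucket, key)` should be equivalent to the CLI command.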


# 5. Customize the model with continued pre-training

In [16]:
timestamp = int(time.time())

job_name = "titan-{}".format(timestamp)
job_name

custom_model_name = "custom-{}".format(job_name)
custom_model_name

'custom-titan-1701126158'

In [17]:
bedrock.create_model_customization_job(
    customizationType="CONTINUED_PRE_TRAINING",
    jobName=job_name,
    customModelName=custom_model_name,
    roleArn=role,
    baseModelIdentifier=base_model_id,
    hyperParameters = {
        "epochCount": "10",
        "batchSize": "1",
        "learningRate": "0.000001"
    },
    trainingDataConfig={"s3Uri": s3_location},
    outputDataConfig={"s3Uri": s3_output},
)

In [21]:
status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]

while status == "InProgress":
    print(status)
    time.sleep(300)
    status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]
    
print(status)

Completed
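
The loop above exits as soon as the status is anything other than `InProgress`, so a failed or stopped job also ends it, and it never times out. A minimal sketch of a reusable polling helper with a timeout follows. `wait_for_status` and its parameters are hypothetical, and the default timeout is an arbitrary assumption.

```python
import time


def wait_for_status(get_status, pending, poll_seconds=300,
                    timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending` or the timeout elapses."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status
```

For the customization job, the first argument could be `lambda: bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]` with `pending="InProgress"`. Checking that the returned value is `"Completed"` before moving on guards against continuing after a failed job.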


In [22]:
custom_model_arn = bedrock.get_custom_model(modelIdentifier=custom_model_name)['modelArn']

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/bedrock/home?region={}#/custom-models/item?arn={}">Custom Model</a></b>'.format(
            region, custom_model_arn
        )
    )
)

# 6. Provision the custom model for inference

In [23]:
provisioned_model_name = "{}-provisioned".format(custom_model_name)

custom_model_arn = bedrock.get_custom_model(modelIdentifier=custom_model_name)['modelArn']

# Must do this manually through the console.
# Use the value of "provisioned_model_name" so the cells below can find the model.
#
# bedrock.create_provisioned_model_throughput(
#     modelUnits = 1,
#     provisionedModelName = provisioned_model_name,
#     modelId = custom_model_arn  # provision the custom model, not the base model
# )

In [24]:
deployment_status = bedrock.get_provisioned_model_throughput(
    provisionedModelId=provisioned_model_name)["status"]

while deployment_status == "Creating":    
    print(deployment_status)
    time.sleep(120)
    deployment_status = bedrock.get_provisioned_model_throughput(
        provisionedModelId=provisioned_model_name)["status"]  
    
print(deployment_status)

Creating
Creating
Creating
Creating
Creating
Creating
InService


In [25]:
provisioned_model_arn = bedrock.get_provisioned_model_throughput(
     provisionedModelId=provisioned_model_name)["provisionedModelArn"]

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/bedrock/home?region={}#/provisioned-throughput/details?arn={}">Custom Model Inference</a></b>'.format(
            region, provisioned_model_arn
        )
    )
)

# 7. Test the custom model

In [26]:
prompt = "Describe the book, 'Generative AI on AWS' by O'Reilly Media."

body = {
    "inputText": prompt,
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "stopSequences": [],
        "temperature": 1,
        "topP": 0.9
    }
}

### Amazon Titan Text model

In [27]:
response = bedrock_runtime.invoke_model(
    modelId=base_model_id, # Amazon Titan Text model
    body=json.dumps(body)
)

response_body = response["body"].read().decode('utf8')
print(json.loads(response_body)["results"][0]["outputText"])


Amazon Web Services (AWS) is at the forefront of machine learning (ML) and artificial intelligence (AI), and this book shows you how to leverage the power of AWS to create, train, and deploy generative AI (GAI) models.

Written by AI and ML experts from AWS, this comprehensive guide provides a step-by-step roadmap for building, training, and deploying GAI models using AWS services such as Amazon SageMaker, Amazon Transcribe, and Amazon Rekognition.

The book begins by introducing the basics of GAI and its applications in fields such as natural language processing, image generation, and speech recognition. It then provides an overview of AWS services that can be used to build, train, and deploy GAI models, including Amazon SageMaker, Amazon Transcribe, Amazon Rekognition, and Amazon Transcribe.

Each chapter in the book focuses on a specific aspect of GAI and provides hands-on examples and code snippets that demonstrate how to use AWS services to build, train, and deploy GAI models. Th

### Our custom pre-trained model

In [28]:
response = bedrock_runtime.invoke_model(
    modelId=provisioned_model_arn, # custom pre-trained model
    body=json.dumps(body)
)

response_body = response["body"].read().decode('utf8')
print(json.loads(response_body)["results"][0]["outputText"])


The book is a thorough introduction to the most advanced generative AI techniques and platforms available, including Amazon Bedrock, for building and deploying useful generative AI applications. The book is targeted at technical professionals who want to learn and deploy AI to real-world applications, providing in-depth explanations of generative AI concepts, methodologies, and frameworks using Python code examples.
Generative AI on AWS is organized into three parts:
1. Data and Data Preparation: This section provides an introduction to the importance of high-quality data for generative AI applications, including how to evaluate and prepare datasets for training and inference. It covers text data, image data, audio data, and video data, as well as common data preparation techniques such as data augmentation and tokenization.
2. Build, Train, and Evaluate Generative Models: This section covers the core concepts of generative AI, including supervised, semi-supervised, and reinforcement 

# 8. Delete the provisioned model to save cost

In [29]:
bedrock.delete_provisioned_model_throughput(
    provisionedModelId = provisioned_model_name
)
