# Continued Pre-Training notebook

## Prepare the dataset

The dataset is built from a PDF glossary containing 1,625 European capital markets and bank finance terms.
Let's first download the PDF from [Latham & Watkins' website](https://www.lw.com/en/book-of-jargon/boj-european-capital-markets-and-bank-finance).

In [83]:
import urllib.request

url = "https://www.lw.com/admin/Upload/Documents/Books%20of%20Jargon/Book-of-Jargon-European-Capital-Markets-and-Bank-Finance-2nd-Edition.pdf"
filename = "capmarkets-jargon"
filename_pdf = "capmarkets-jargon.pdf"

urllib.request.urlretrieve(url, filename_pdf)

('capmarkets-jargon.pdf', <http.client.HTTPMessage at 0x7f5815cbe2f0>)

Since the dataset is a PDF, we need to install the PyPDF2 library to extract its text.

In [None]:
!pip install PyPDF2

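Now, let's define a visitor function that PyPDF2 calls for each piece of text on a page. Text set in the bold term font starts a new JSONL record; all other text is appended to the current record.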

In [87]:
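import json

terms_counter = 0
parts = []

# Construct JSONL from one PDF page

def visitor_body(text, cm, tm, fontDict, fontSize):
    global terms_counter

    if len(text) > 1:
        # Escape quotes, backslashes and line breaks so each record stays valid JSON on one line
        escaped = json.dumps(text)[1:-1]
        # Terms are set in the bold font; everything else belongs to the definition
        if fontDict['/BaseFont'] == '/IYOEDH+CandidaStd-Bold':
            # Close the previous term, if there is one
            if parts:
                parts.append("\"}")
                parts.append("\n")

            terms_counter = terms_counter + 1

            parts.append("{\"input\":" + " \"" + escaped)
        else:
            parts.append(escaped)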
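
An alternative to assembling the JSON text by hand is to collect the text of each term first and let `json.dumps` serialize it, which takes care of quotes and line breaks. The following is only a sketch under assumptions: `chunks_to_jsonl` and the `(text, is_term)` pairs are hypothetical names for illustration, not something PyPDF2 produces directly, and unlike the visitor above it ignores any text that appears before the first term.

```python
import json


def chunks_to_jsonl(chunks):
    """Group (text, is_term) pairs into one JSONL record per term.

    `chunks` is a hypothetical intermediate form: an iterable of
    (text, is_term) tuples, where is_term marks text in the bold term font.
    """
    records = []
    current = None
    for text, is_term in chunks:
        if is_term:
            # A bold chunk starts a new record; store the previous one first
            if current is not None:
                records.append(current)
            current = [text]
        elif current is not None:
            # Non-bold text belongs to the definition of the current term
            current.append(text)
    if current is not None:
        records.append(current)
    # json.dumps escapes quotes and newlines, so each record stays on one line
    return "\n".join(json.dumps({"input": "".join(pieces)}) for pieces in records)


# Placeholder chunks, not taken from the PDF
sample = [
    ("Term A", True),
    (': a "quoted" definition.\n', False),
    ("Term B", True),
    (": another definition.", False),
]
jsonl = chunks_to_jsonl(sample)
print(jsonl)
```

To feed this from the visitor, one could append `(text, is_bold)` tuples to a list instead of pre-formatted strings.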


Next, we run the visitor over every PDF page that contains definitions, join the collected pieces into JSONL, and save the result as capmarkets-jargon.jsonl.

In [88]:
from PyPDF2 import PdfReader

pdfReader = PdfReader(filename_pdf)
# Zero-based page indices; end_page is exclusive in range() below
start_page = 3
end_page = 186
terms_counter = 0

# clear all old items, if any
parts.clear()
for i in range(start_page, end_page):
    page = pdfReader.pages[i]
    page.extract_text(visitor_text=visitor_body)

# close the last term
parts.append("\"}")

print("Terms processed:", terms_counter)

output = "".join(parts)
with open(filename + ".jsonl", "w") as outfile:
    outfile.write(output)


Terms processed: 1624
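
Since JSONL means one JSON object per line, it can be worth checking the generated text before uploading it. The sketch below is one way to do that, assuming every record should be a JSON object with a string `"input"` field, as produced above. `check_jsonl` is a hypothetical helper name, and the sample records are placeholders rather than real glossary entries.

```python
import json


def check_jsonl(text):
    """Return (number of valid records, line numbers of invalid lines)."""
    good, bad = 0, []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        # Assumption: a valid record is an object with a string "input" field
        if isinstance(record, dict) and isinstance(record.get("input"), str):
            good += 1
        else:
            bad.append(number)
    return good, bad


sample = "\n".join([
    '{"input": "Term A: first definition"}',
    '{"input": "Term B: split',      # a raw line break inside a record
    'across two lines"}',
    '{"text": "wrong key"}',
])
good, bad = check_jsonl(sample)
print("valid records:", good, "invalid lines:", bad)

# With the file written above, the call might look like:
# with open(filename + ".jsonl") as f:
#     print(check_jsonl(f.read()))
```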


Finally, let's upload the generated JSONL file to Amazon S3 so that Amazon Bedrock can read it.

In [134]:
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

bucket = sagemaker_session.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [135]:
# Specify the path to the training data
train_data_path = filename + ".jsonl"

# Upload the training data to the specified S3 key prefix 'PreProcessed'
s3_train_data = sagemaker_session.upload_data(path=train_data_path, key_prefix='PreProcessed')

# Print a message indicating the successful upload
print(f"Uploaded {train_data_path} to {s3_train_data}")

Uploaded capmarkets-jargon.jsonl to s3://sagemaker-us-east-1-922650977840/PreProcessed/capmarkets-jargon.jsonl


## Create the training job

In [136]:
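import uuid
import boto3

# Bedrock control-plane client, used for customization jobs and provisioned throughput
bedrock = boto3.client(service_name="bedrock", region_name=region)

# Select the foundation model you want to customize
base_model_id = "amazon.titan-text-express-v1"

job_prefix = "customTitan"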

# Update the role with your ARN
roleArn = 'arn:aws:iam::[account-number]:role/service-role/[role-name]'

# Generate a unique identifier for the job and custom model name
job_uuid = str(uuid.uuid4())[:8]  # Extracting the first 8 characters for brevity
jobName = f"{job_prefix}-{job_uuid}"
customModelName = f"{job_prefix}-{job_uuid}"

jobIdentifier = bedrock.create_model_customization_job(
    customizationType="CONTINUED_PRE_TRAINING",
    jobName=jobName,
    customModelName=customModelName,
    roleArn=roleArn,
    baseModelIdentifier=base_model_id,
    hyperParameters = {
        "epochCount": "5",
        "batchSize": "1",
        "learningRate": "0.00001",
    },
    trainingDataConfig={"s3Uri": s3_train_data},
    outputDataConfig={"s3Uri": f"s3://{bucket}/CustomModel/"},
)

## Monitor the job until its status is "Completed"

In [137]:
pretrain_job = bedrock.get_model_customization_job(jobIdentifier=jobIdentifier['jobArn'])
print(pretrain_job['status'])

InProgress
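
Instead of re-running the cell by hand, the status check can be wrapped in a polling loop. The following is a minimal sketch, not part of the original notebook: `wait_for_terminal_status` is a hypothetical helper, the poll interval is an arbitrary choice, and the status fetcher is passed in as a function so the example runs offline with a stand-in.

```python
import time

# Statuses after which a customization job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status."""
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"still {status!r} after {polls} polls")
        sleep(poll_seconds)


# Stand-in for the real API call so the sketch runs without AWS access
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final)

# With the client and job from the cells above, the fetcher might be:
# get_status = lambda: bedrock.get_model_customization_job(
#     jobIdentifier=jobIdentifier["jobArn"])["status"]
# wait_for_terminal_status(get_status)
```

Passing the fetcher in as a function keeps the loop independent of boto3, which also makes it easy to cap the number of polls for long-running jobs.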


## Create no-commitment Provisioned Throughput for the custom model

Run the following cells only once the status of the job above is shown as "Completed".

In [127]:
customModelId = pretrain_job['outputModelArn']

provisionedModelName = f"{job_prefix}-provisioned-{job_uuid}"

# Create the provisioned capacity without passing any commitment option
provisionedModelArn = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName=provisionedModelName,
    modelId=customModelId
)['provisionedModelArn']

## Check the Provisioned Throughput creation status

In [131]:
# Get the provisioned model status; re-run until it is "InService"
provisionedModelStatus = bedrock.get_provisioned_model_throughput(provisionedModelId=provisionedModelArn)
print(provisionedModelStatus['status'])

Creating


## Run inferences

Run inference on the custom provisioned model and on the base model through Amazon Bedrock and compare the outputs.

In [133]:
import json
import boto3

# Initialize Bedrock Runtime client in the specified region
bedrockRuntime = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

# Sample request body containing the prompt and the parameters for model inference
body = json.dumps({
    "inputText": "Explain what is JLM.",
    "textGenerationConfig": {
        "temperature": 0.01,
        "topP": 0.99,
        "maxTokenCount": 300
    }
})

# Specify content types for request and response
accept = 'application/json'
contentType = 'application/json'

# Invoke custom model with the provided parameters
response = bedrockRuntime.invoke_model(body=body, modelId=provisionedModelArn, accept=accept, contentType=contentType)

# Parse and print the output from the custom model
response_body_custom = json.loads(response.get('body').read())
print("Custom Model Output:")
print(response_body_custom['results'][0]['outputText'])

# Invoke the base model with the same parameters
response = bedrockRuntime.invoke_model(body=body, modelId=base_model_id, accept=accept, contentType=contentType)

# Parse and print the output from the base model
response_body_base = json.loads(response.get('body').read())
print("\n")
print("Base Model Output:")
print(response_body_base['results'][0]['outputText'])


Custom Model Output:

JLM is an abbreviation that stands for Java Language Model. It is a component of the Java programming language used for natural language processing tasks such as language translation, text summarization, and question answering. The JLM is trained on large amounts of text data and can generate human-like responses to various input queries. It is commonly used in chatbots, virtual assistants, and other applications that require natural language processing capabilities.


Base Model Output:

JLM is an abbreviation that stands for "Java Language Model." It refers to a type of artificial intelligence (AI) model developed using the Java programming language. JLM models are designed to understand and generate human language, allowing them to perform tasks such as natural language processing (NLP), text analysis, and chatbot functionality.

One of the most well-known JLM models is called Turing NLG (Turing Natural Language Generation), developed by researchers at the Univ
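
To compare the two models on more than one prompt, the request and response handling above can be factored into small helpers. This is a sketch only: `titan_request_body`, `first_output_text`, and `compare_models` are hypothetical names, the request and response shapes are the ones used in the cell above, and the `invoke` callable is a stand-in so the example runs without AWS access.

```python
import json


def titan_request_body(prompt, temperature=0.01, top_p=0.99, max_tokens=300):
    """Build the JSON request body used in the inference cell above."""
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "temperature": temperature,
            "topP": top_p,
            "maxTokenCount": max_tokens,
        },
    })


def first_output_text(response_body):
    """Pull the generated text out of a parsed response body."""
    return response_body["results"][0]["outputText"]


def compare_models(invoke, prompt, model_ids):
    """Send the same prompt to each model and return {label: output text}."""
    body = titan_request_body(prompt)
    return {label: first_output_text(invoke(model_id, body))
            for label, model_id in model_ids.items()}


# Stand-in for the real call; it just echoes which model was asked
def fake_invoke(model_id, body):
    prompt = json.loads(body)["inputText"]
    return {"results": [{"outputText": f"[{model_id}] answer to: {prompt}"}]}


outputs = compare_models(fake_invoke, "Explain what is JLM.",
                         {"custom": "custom-model-arn", "base": "base-model-id"})
for label, text in outputs.items():
    print(label, "->", text)

# With the runtime client from the cell above, invoke might be:
# def invoke(model_id, body):
#     response = bedrockRuntime.invoke_model(
#         body=body, modelId=model_id,
#         accept="application/json", contentType="application/json")
#     return json.loads(response["body"].read())
```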

## Delete the provisioned capacity and the custom model

In [None]:
# Delete the provisioned capacity
bedrock.delete_provisioned_model_throughput(provisionedModelId=provisionedModelArn)

In [142]:
# Delete the custom model
bedrock.delete_custom_model(modelIdentifier=customModelId)

{'ResponseMetadata': {'RequestId': '074b5a82-0789-42fa-af46-eb353cdb2fc9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sun, 10 Mar 2024 09:27:33 GMT',
   'content-type': 'application/json',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': '074b5a82-0789-42fa-af46-eb353cdb2fc9'},
  'RetryAttempts': 0}}