# Chapter 8 - Custom Models: Continued Pre-training on Amazon Bedrock

## Overview
This notebook demonstrates how to perform continued pre-training on Amazon Bedrock foundation models. We'll explore how to extend pre-trained models with domain-specific knowledge and adapt them for specialized use cases.

## Introduction
Continued pre-training adapts a foundation model to a specific domain, industry, or writing style by training it further on relevant unlabeled text. Unlike fine-tuning, it does not require labeled prompt-completion pairs, and it can reduce the amount of prompt engineering needed for domain-specific tasks.

## Prerequisites
- AWS account with Amazon Bedrock access
- Access to Amazon Titan Text models
- Custom text dataset for continued pre-training
- Appropriate IAM permissions
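
One way to check the model-access prerequisite is to ask Bedrock which foundation models support continued pre-training. The sketch below assumes a control-plane client created with `boto3.client("bedrock")`; the helper name `models_supporting_cpt` is hypothetical and not part of the original notebook.

```python
def models_supporting_cpt(bedrock_client):
    """Return the IDs of foundation models that support continued pre-training.

    `bedrock_client` is assumed to be a boto3 "bedrock" client.
    """
    response = bedrock_client.list_foundation_models(
        byCustomizationType="CONTINUED_PRE_TRAINING"
    )
    return [model["modelId"] for model in response.get("modelSummaries", [])]


# Example usage (needs AWS credentials, so it is left commented out):
# import boto3
# print(models_supporting_cpt(boto3.client("bedrock")))
```

If the Titan Text model you plan to customize does not appear in the result, request model access in the Bedrock console before continuing.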


## Setup

### Install Required Dependencies

In [None]:
# Install required packages for dataset handling
!pip install datasets==2.15.0

### Import Libraries

In [None]:
# Import required libraries
import boto3      # AWS SDK for Python
import json       # JSON handling
import datetime   # Date/time operations
import os         # Operating system interface

## AWS Resource Setup

### Create S3 Bucket and IAM Role

In [None]:
# Initialize AWS clients
iam = boto3.client("iam")
s3 = boto3.client('s3')

# Get current AWS account ID for unique resource naming
account_id = boto3.client('sts').get_caller_identity()['Account']
bucket_name = f"bedrock-pretraining1-{account_id}"

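# Create S3 bucket for storing training data
# Outside us-east-1, S3 requires an explicit LocationConstraint
print(f"Creating S3 bucket: {bucket_name}")
region = s3.meta.region_name
if region in (None, "us-east-1"):
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region}
    )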

# Create IAM role that Bedrock can assume
role_name = f"Bedrock-Pretraining-Role1-{account_id}"
print(f"Creating IAM role: {role_name}")

# Define trust policy allowing Bedrock to assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

role = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)["Role"]["RoleName"]

# Create IAM policy with S3 permissions
policy_name = "Bedrock-Pretraining-Role1-Policy"
print(f"Creating IAM policy: {policy_name}")

# Define policy allowing S3 operations on our bucket
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",      # Read objects from S3
                "s3:PutObject",      # Write objects to S3
                "s3:ListBucket"      # List bucket contents
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",     # Bucket itself
                f"arn:aws:s3:::{bucket_name}/*"    # All objects in bucket
            ]
        }
    ]
}

policy_arn = iam.create_policy(
    PolicyName=policy_name,
    PolicyDocument=json.dumps(s3_policy)
)["Policy"]["Arn"]

# Attach the policy to the role
iam.attach_role_policy(
    RoleName=role,
    PolicyArn=policy_arn
)

print("✅ AWS resources created successfully!")
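
The trust policy above lets the Bedrock service assume the role from any context. A common hardening step is to restrict it to model customization jobs in your own account. The following is a minimal sketch of that idea, not the notebook's own method: the helper name `build_trust_policy` is hypothetical, and the default region is an assumption taken from the `us-east-1` base model ARN used later in this notebook.

```python
def build_trust_policy(account_id, region="us-east-1"):
    """Build a trust policy limited to this account's model customization jobs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only requests made on behalf of this account
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Only model customization jobs in the given region
                    "ArnEquals": {
                        "aws:SourceArn": (
                            f"arn:aws:bedrock:{region}:{account_id}"
                            ":model-customization-job/*"
                        )
                    },
                },
            }
        ],
    }
```

The returned dictionary can be passed to `json.dumps` and used as `AssumeRolePolicyDocument` in place of `trust_policy`.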

## Dataset Preparation

### Load and Process Dataset

In [None]:
# Import the datasets library for data handling
from datasets import load_dataset

# Load the pre-training dataset from JSONL file
# This dataset contains the text data we want to continue pre-training on
print("Loading dataset from JSONL file...")
dataset = load_dataset('json', data_files='data/pretraining_dataset.jsonl', split='train')
print(f"Dataset loaded with {len(dataset)} examples")

### Format and Split Dataset

In [None]:
# Split dataset into training (90%) and validation (10%) sets
# Validation data helps monitor training progress and detect overfitting
# A fixed seed keeps the split reproducible across runs
print("Splitting dataset into train/validation sets...")
train_and_validation_dataset = dataset.train_test_split(test_size=0.1, seed=42)

print(f"Training examples: {len(train_and_validation_dataset['train'])}")
print(f"Validation examples: {len(train_and_validation_dataset['test'])}")

# Directory for the formatted datasets (created on first save)
dataset_dir = "dataset"

def format_save_dataset(filename, dataset):
    """
    Format and save dataset in the format expected by Bedrock.
    Each line contains a JSON object with an 'input' field.
    """
    os.makedirs(dataset_dir, exist_ok=True)
    
    with open(f"{dataset_dir}/{filename}", "w") as f:
        for example in dataset:
            # Extract content from the 'input' field
            content = example["input"]
            
            # Format as required by Bedrock pre-training
            formatted_example = {
                "input": content
            }
            
            # Write as JSONL (one JSON object per line)
            json.dump(formatted_example, f)
            f.write('\n')
    
    print(f"Saved {len(dataset)} examples to {filename}")

# Save formatted datasets
format_save_dataset("train.jsonl", train_and_validation_dataset["train"])
format_save_dataset("validation.jsonl", train_and_validation_dataset["test"])
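
Before uploading, it can help to confirm that every line of the saved files is a JSON object with a non-empty `input` string, since that is the record shape written above. The checker below is a sketch under that assumption; `find_invalid_records` is a hypothetical helper, not a Bedrock API.

```python
import json


def find_invalid_records(path):
    """Return (line_number, reason) pairs for lines that are not valid records."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "invalid JSON"))
                continue
            text = record.get("input") if isinstance(record, dict) else None
            if not isinstance(text, str) or not text.strip():
                problems.append((line_number, "missing or empty 'input' string"))
    return problems
```

An empty list from `find_invalid_records("dataset/train.jsonl")` means every line matched the expected shape.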

### Upload to S3

In [None]:
# Upload formatted datasets to S3 bucket
print("Uploading datasets to S3...")

s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity()['Account']
bucket_name = f"bedrock-pretraining1-{account_id}"  # Must match the bucket created in AWS Resource Setup

# Walk through dataset directory and upload all files
uploaded_files = []
for root, dirs, files in os.walk(dataset_dir):
    for file in files:
        # Get full local path
        full_path = os.path.join(root, file)
        
        # Get relative path for S3 key
        relative_path = os.path.relpath(full_path, dataset_dir)
        
        # Upload to S3
        s3.upload_file(full_path, bucket_name, relative_path)
        uploaded_files.append(relative_path)
        print(f"Uploaded: {relative_path}")

print(f"✅ Successfully uploaded {len(uploaded_files)} files to S3")
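
A quick way to confirm the upload is to list the bucket and compare the keys against what you expected to send. This is a sketch that assumes a boto3 S3 client and a bucket holding fewer than 1,000 objects (a single `list_objects_v2` page); `missing_from_bucket` is a hypothetical helper name.

```python
def missing_from_bucket(s3_client, bucket, expected_keys):
    """Return the expected keys that are not present in the bucket, sorted."""
    # A single call returns at most 1,000 keys, which is enough for this notebook
    response = s3_client.list_objects_v2(Bucket=bucket)
    present = {obj["Key"] for obj in response.get("Contents", [])}
    return sorted(set(expected_keys) - present)
```

For this notebook, `missing_from_bucket(s3, bucket_name, ["train.jsonl", "validation.jsonl"])` should return an empty list once the upload has succeeded.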

## Model Customization Job

### Create Pre-training Job

In [None]:
# Initialize Bedrock client for model customization
bedrock = boto3.client(service_name='bedrock')
account_id = boto3.client('sts').get_caller_identity()['Account']

In [None]:
# Configure training job parameters
datetime_string = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

# Job configuration
customizationType = "CONTINUED_PRE_TRAINING"  # Type of customization
customModelName = "pretrained-titan-lite-model"  # Name for our custom model
jobName = f"Titan-Lite-Pretraining-Job-{datetime_string}"  # Unique job name

# Base model to continue training from
baseModelIdentifier = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-lite-v1:0:4k"

# IAM role for Bedrock to access S3
roleArn = f"arn:aws:iam::{account_id}:role/{role_name}"  # Role created in AWS Resource Setup

# Training hyperparameters
hyperParameters = {
    "epochCount": "1",                    # Number of training epochs
    "batchSize": "1",                    # Batch size for training
    "learningRate": ".0001",              # Learning rate
    "learningRateWarmupSteps": "0"       # Warmup steps
}

print(f"Creating model customization job: {jobName}")
print("Base model: Amazon Titan Text Lite")
print(f"Training epochs: {hyperParameters['epochCount']}")

# Create the model customization job
response_ft = bedrock.create_model_customization_job(
    jobName=jobName,
    customModelName=customModelName,
    customizationType=customizationType,
    roleArn=roleArn,
    baseModelIdentifier=baseModelIdentifier,
    hyperParameters=hyperParameters,
    
    # Training data location
    trainingDataConfig={
        "s3Uri": f"s3://bedrock-pretraining1-{account_id}/train.jsonl"
    },
    
    # Validation data location
    validationDataConfig={
        'validators': [{
            "s3Uri": f"s3://bedrock-pretraining1-{account_id}/validation.jsonl"
        }]
    },
    
    # Output location for training artifacts
    outputDataConfig={
        "s3Uri": f"s3://bedrock-pretraining1-{account_id}/pretraining-output"
    }
)

print("✅ Model customization job created successfully!")

In [None]:
# Get and display the job ARN for reference
jobArn = response_ft.get('jobArn')
print(f"Job ARN: {jobArn}")

## Monitor Training Progress

### Check Job Status

In [None]:
# Check the current status of the training job
# Run this cell periodically to monitor progress

job_details = bedrock.get_model_customization_job(jobIdentifier=jobName)
status = job_details["status"]

print(f"Job Status: {status}")

# Display additional job information
if 'creationTime' in job_details:
    print(f"Started: {job_details['creationTime']}")

if status == "Completed":
    print("🎉 Training completed successfully!")
elif status == "InProgress":
    print("⏳ Training in progress... Please wait and check again later.")
elif status == "Failed":
    print("❌ Training failed. Check the job details for error information.")
else:
    print(f"Current status: {status}")
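
Instead of re-running the status cell by hand, you can poll until the job reaches a terminal state. The loop below is a sketch that assumes the same `get_model_customization_job` call used above; the helper name, the 60-second default interval, and the `max_polls` cap are illustrative choices rather than Bedrock requirements.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_customization_job(bedrock_client, job_identifier,
                               poll_seconds=60, max_polls=None,
                               sleep=time.sleep):
    """Poll a customization job and return its last observed status.

    Stops when the job reaches a terminal status, or after `max_polls`
    status checks if a cap is given.
    """
    polls = 0
    while True:
        job = bedrock_client.get_model_customization_job(
            jobIdentifier=job_identifier
        )
        status = job["status"]
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return status
        sleep(poll_seconds)
```

Calling `wait_for_customization_job(bedrock, jobName)` blocks until the job finishes. Training can take hours, so setting `max_polls` is a reasonable safeguard in an interactive session.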

## Purchase Provisioned Throughput

### Configure Throughput

In [None]:
# Purchase provisioned throughput for the custom model
# This allocates dedicated compute resources for inference
# Run this cell only after the customization job status is "Completed"

provisioned_model_name = "ProvisionedCustomTitanLite"
model_units = 1  # Number of model units to provision

print(f"Creating provisioned throughput: {provisioned_model_name}")
print(f"Model units: {model_units}")

response_pt = bedrock.create_provisioned_model_throughput(
    modelId=customModelName,
    provisionedModelName=provisioned_model_name,
    modelUnits=model_units
)

# Get the ARN of the provisioned model for inference
provisionedModelArn = response_pt.get('provisionedModelArn')
print(f"Provisioned Model ARN: {provisionedModelArn}")

print("✅ Provisioned throughput created successfully!")
print("💰 Note: This will incur ongoing costs until deleted.")

## Test the Custom Model

### Run Inference

In [None]:
# Initialize Bedrock Runtime client for model inference
bedrock_runtime = boto3.client(service_name='bedrock-runtime')

# Define your test prompt here
# Replace with a prompt relevant to your training data
prompt = "<ENTER_PROMPT>"  # TODO: Replace with your actual prompt

# Configure inference parameters using the Amazon Titan Text request schema
inference_config = {
    "inputText": prompt,
    "textGenerationConfig": {
        "temperature": 0.5,      # Controls randomness (0.0 = deterministic, 1.0 = very random)
        "topP": 0.9,             # Top-p (nucleus) sampling parameter
        "maxTokenCount": 512     # Maximum tokens to generate
    }
}

print(f"Testing custom model with prompt: {prompt[:50]}...")

# Invoke the custom model (the provisioned throughput must be InService)
response = bedrock_runtime.invoke_model(
    modelId=provisionedModelArn,  # Use our provisioned custom model
    body=json.dumps(inference_config),
    contentType="application/json",
    accept="application/json"
)

# Parse and display the response; Titan Text returns generations under "results"
response_body = json.loads(response['body'].read())
generated_text = response_body["results"][0]["outputText"]

print("\n=== Model Response ===")
print(generated_text)
print("\n✅ Custom model inference completed!")

## Cleanup (Important!)

⚠️ **Don't forget to clean up resources** to avoid ongoing charges:

1. **Delete provisioned throughput** (incurs hourly costs)
2. **Delete custom model** (if no longer needed)
3. **Delete S3 bucket** and contents
4. **Delete IAM role and policy**

```python
# Example cleanup commands:
bedrock.delete_provisioned_model_throughput(provisionedModelId='ProvisionedCustomTitanLite')
bedrock.delete_custom_model(modelIdentifier=customModelName)
```
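
Steps 3 and 4 of the list above can be sketched the same way. The helpers below assume boto3 `s3` and `iam` clients, a bucket without versioning enabled, and a role with only the one policy attached; the function names are hypothetical.

```python
def delete_bucket_and_contents(s3_client, bucket):
    """Delete every object in the bucket, then the bucket itself."""
    while True:
        # Each call returns at most 1,000 keys, so loop until none remain
        response = s3_client.list_objects_v2(Bucket=bucket)
        objects = [{"Key": obj["Key"]} for obj in response.get("Contents", [])]
        if not objects:
            break
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects})
    s3_client.delete_bucket(Bucket=bucket)


def delete_role_and_policy(iam_client, role_name, policy_arn):
    """Detach the policy from the role, then delete both."""
    # A policy must be detached before it or the role can be deleted
    iam_client.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    iam_client.delete_policy(PolicyArn=policy_arn)
    iam_client.delete_role(RoleName=role_name)
```

With the variables defined earlier in this notebook, the calls would be `delete_bucket_and_contents(s3, bucket_name)` and `delete_role_and_policy(iam, role_name, policy_arn)`.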

## Conclusion

In this notebook, we demonstrated the complete workflow for continued pre-training of foundation models on Amazon Bedrock. This approach enables you to adapt large language models to specific domains, industries, or writing styles by exposing them to targeted text data.

Key accomplishments:

1. **Resource Setup**: We established the necessary AWS infrastructure including secure IAM roles and S3 storage for our training data.

2. **Dataset Preparation**: We loaded, processed, and formatted our custom dataset into the structure required by Bedrock's continued pre-training process.

3. **Model Customization**: We created a customization job that builds upon the Amazon Titan Text Lite foundation model, adapting it to our specific domain using our custom dataset.

4. **Deployment**: We provisioned throughput resources to make our custom model available for inference.

5. **Model Testing**: We demonstrated how to use the custom pre-trained model to generate text that reflects the characteristics of our training data.

Continued pre-training offers several advantages over other customization approaches:

- **Domain Adaptation**: Models become more fluent and knowledgeable about specific domains or industries
- **Style Adaptation**: The model can better match the writing style, tone, and terminology of your organization
- **Knowledge Integration**: The model implicitly learns information embedded in your training texts
- **Efficiency**: Training uses unlabeled text, so you avoid the cost of building the labeled prompt-completion pairs that fine-tuning requires
