# Fine-tuning Llama 3.2 with Vision Capabilities - Data Preparation

## Introduction

Fine-tuning multi-modal models allows you to enhance their capabilities for specific visual understanding tasks. This notebook demonstrates how to prepare data for fine-tuning Meta Llama 3.2 with vision capabilities using Amazon Bedrock. We'll use a subset of the LLaVA-Instruct-150K dataset, together with the COCO images it references, to create training, validation, and test sets in the required format.

The Llama 3.2 vision model can process and understand both text and images, enabling it to answer questions about visual content. Fine-tuning can improve the model's performance on domain-specific visual tasks.

In this notebook, we'll:

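- Download the LLaVA-Instruct-150K dataset and the COCO images it references, then select a subset
- Split the selected images into training, validation, and test sets and upload them to Amazon S3
- Format the data according to the Bedrock conversation schema and upload the resulting JSONL files
- Create the IAM role that the fine-tuning job will assume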

## Prerequisites

Before starting, ensure you have:

- An AWS account with access to Amazon Bedrock
- Appropriate IAM permissions for Bedrock and S3
- A working Python environment with the necessary libraries

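The IAM role or user that runs these notebooks needs the following permissions (replace `YOUR_BUCKET_NAME` and `YOUR_ACCOUNT_ID` with your own values). This notebook also creates an S3 bucket, an IAM role, and an IAM policy, so the same identity additionally needs `s3:CreateBucket`, `iam:CreateRole`, `iam:CreatePolicy`, and `iam:AttachRolePolicy`, which are not listed here: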

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR_BUCKET_NAME",
                "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelCustomizationJob",
                "bedrock:GetModelCustomizationJob",
                "bedrock:ListModelCustomizationJobs",
                "bedrock:StopModelCustomizationJob"
            ],
            "Resource": "arn:aws:bedrock:us-west-2:YOUR_ACCOUNT_ID:model-customization-job/*"
        }
    ]
}
```
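
The placeholders in the policy above can also be filled in programmatically. The following is a minimal sketch, not part of the original notebook: `build_notebook_policy` is a hypothetical helper, and the bucket name and account ID passed to it are made-up example values.

```python
import json


def build_notebook_policy(bucket_name, account_id, region="us-west-2"):
    """Return the policy shown above with its placeholders filled in.

    Hypothetical helper for illustration; the statements mirror the JSON
    policy in this section and add nothing to it.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:CreateModelCustomizationJob",
                    "bedrock:GetModelCustomizationJob",
                    "bedrock:ListModelCustomizationJobs",
                    "bedrock:StopModelCustomizationJob",
                ],
                "Resource": (
                    f"arn:aws:bedrock:{region}:{account_id}:"
                    "model-customization-job/*"
                ),
            },
        ],
    }


# Example values only; substitute your own bucket name and account ID
policy = build_notebook_policy("my-example-bucket", "123456789012")
policy_json = json.dumps(policy, indent=4)
print(policy_json)
```

The resulting `policy_json` string can be pasted into the IAM console or passed to whatever tooling you use to manage policies.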

## Setup

First, let's install and import the necessary libraries:

In [None]:
# Install required libraries
%pip install --upgrade pip
%pip install boto3 datasets pillow tqdm --upgrade --quiet

In [None]:
# Restart kernel to ensure updated packages take effect
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import boto3
import os
import json
import time
import shutil
from tqdm import tqdm
from datasets import load_dataset
from PIL import Image
import io
import uuid
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set AWS region
region = "us-west-2"  # Llama 3.2 fine-tuning is currently only available in us-west-2

In [None]:
# Create AWS clients
session = boto3.session.Session(region_name=region)
s3_client = session.client('s3')
sts_client = session.client('sts')
bedrock = session.client(service_name="bedrock")

# Get account ID
account_id = sts_client.get_caller_identity()["Account"]

# Generate bucket name with account ID for uniqueness
bucket_name = f"llama32-vision-ft-{account_id}-{region}"

print(f"Account ID: {account_id}")
print(f"Bucket name: {bucket_name}")

## Create S3 Bucket

Let's create an S3 bucket to store our images and processed data:

In [None]:
try:
    if region == 'us-east-1':
        s3_client.create_bucket(
            Bucket=bucket_name
        )
    else:
        # For all other regions, specify the LocationConstraint
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region}
        )
    print(f"Bucket {bucket_name} created successfully")
except s3_client.exceptions.BucketAlreadyExists:
    print(f"Bucket name {bucket_name} is already taken by another account; choose a different name")
except s3_client.exceptions.BucketAlreadyOwnedByYou:
    print(f"Bucket {bucket_name} already owned by you")
except Exception as e:
    print(f"Error creating bucket: {e}")

## Download and Prepare the Dataset

For this example, we'll use a subset of the LLaVA-Instruct-150K dataset from Hugging Face, whose examples reference COCO images. We'll limit the data to 1,000 samples for training, 100 for validation, and 100 for testing to keep this demonstration manageable.

<div style="background-color: #FFFFCC; color: #856404; padding: 15px; border-left: 6px solid #FFD700; margin-bottom: 15px;">
<h3 style="margin-top: 0; color: #856404;">⚠️ Large Dataset Warning</h3>
<p>This cell downloads the COCO image dataset which:</p>
<ul>
  <li>Is approximately <b>19.3 GB</b> in size</li>
  <li>May take <b>~10 minutes</b> to download depending on your internet connection</li>
  <li>Requires at least <b>25 GB</b> of free disk space for download, extraction, and processing</li>
</ul>
<p>Please ensure you have sufficient storage and a stable internet connection before proceeding.</p>
</div>

In [None]:
import requests
import zipfile
from tqdm import tqdm

# Create directories to store images and metadata
os.makedirs('llava_images/train', exist_ok=True)
os.makedirs('llava_images/val', exist_ok=True)
os.makedirs('llava_images/test', exist_ok=True)

# Function to download a file with progress bar
def download_file(url, save_path):
    print(f"Downloading {url}...")
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()  # Stop on an HTTP error instead of saving the error page
    total_size = int(response.headers.get('content-length', 0))
    
    with open(save_path, 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_scale=True, desc=os.path.basename(save_path)) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))
    return save_path

# Step 1: Download the LLaVA dataset JSON file
json_path = 'llava_instruct_150k.json'
if not os.path.exists(json_path):
    json_url = "https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/llava_instruct_150k.json"
    download_file(json_url, json_path)

# Step 2: Download COCO images if not already downloaded
coco_zip_path = 'train2017.zip'
images_dir = 'images'
os.makedirs(images_dir, exist_ok=True)

# Only download if the directory is empty
if not os.listdir(images_dir):
    if not os.path.exists(coco_zip_path):
        images_url = "http://images.cocodataset.org/zips/train2017.zip"
        download_file(images_url, coco_zip_path)
    
    print("Extracting images...")
    with zipfile.ZipFile(coco_zip_path, 'r') as zip_ref:
        zip_ref.extractall('.')
    
    # Move images to the images directory
    print("Organizing files...")
    os.makedirs(images_dir, exist_ok=True)
    for img in tqdm(os.listdir('train2017'), desc="Moving images"):
        shutil.move(os.path.join('train2017', img), os.path.join(images_dir, img))
    
    # Clean up extraction directory
    if os.path.exists('train2017'):
        os.rmdir('train2017')
        
print("Loading the LLaVA dataset from JSON...")
# Load the dataset
with open(json_path, 'r') as f:
    dataset = json.load(f)

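# Select a subset for our fine-tuning task
# We want 1200 examples total (1000 train, 100 val, 100 test).
# The full list is kept here: the loop below stops after 1200 successful
# copies, so entries whose image is missing do not shrink the splits.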

# Process and organize the data
dataset_list = []
successful_copies = 0
failed_copies = 0

print("Processing images...")
for example in tqdm(dataset, desc="Processing dataset"):
    if successful_copies >= 1200:
        break
    
    # Determine if this is for train, val, or test
    if successful_copies < 1000:
        subset = 'train'
    elif successful_copies < 1100:
        subset = 'val'
    else:
        subset = 'test'
    
    # Get image filename from the example
    if "image" in example:
        image_path = example["image"]
        image_filename = os.path.basename(image_path)
        
        # Source and destination paths
        source_path = os.path.join(images_dir, image_filename)
        dest_path = f"llava_images/{subset}/{image_filename}"
        
        # Copy the image if it exists
        if os.path.exists(source_path):
            shutil.copy2(source_path, dest_path)
            
            # Update example with local path
            example_copy = dict(example)
            example_copy['image_path'] = dest_path
            dataset_list.append(example_copy)
            successful_copies += 1
        else:
            failed_copies += 1

print("\nProcessing complete:")
print(f"Successful copies: {successful_copies}")
print(f"Failed copies: {failed_copies}")

# Split into train, validation, and test sets
train_data = dataset_list[:1000]
val_data = dataset_list[1000:1100]
test_data = dataset_list[1100:]

print(f"\nNumber of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(val_data)}")
print(f"Number of test examples: {len(test_data)}")
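
Because the splits are taken sequentially from the source list, it is worth checking that no image ends up in more than one split, which would leak training images into validation or test. The sketch below is an optional check, not part of the original pipeline; `find_shared_images` is a hypothetical helper that assumes each example carries the `image` field used above.

```python
import os


def find_shared_images(*splits):
    """Return the image filenames that occur in more than one split.

    Each split is a list of example dicts with an "image" key, as in the
    LLaVA JSON records processed above.
    """
    seen = {}       # filename -> index of the first split it appeared in
    shared = set()
    for index, split in enumerate(splits):
        for example in split:
            name = os.path.basename(example["image"])
            first = seen.setdefault(name, index)
            if first != index:
                shared.add(name)
    return sorted(shared)


# Toy records standing in for train_data, val_data, and test_data
toy_train = [{"image": "000000000001.jpg"}, {"image": "000000000002.jpg"}]
toy_val = [{"image": "000000000002.jpg"}]
toy_test = [{"image": "000000000003.jpg"}]

overlap = find_shared_images(toy_train, toy_val, toy_test)
print(overlap)  # ['000000000002.jpg']
```

In the notebook this would be called as `find_shared_images(train_data, val_data, test_data)`; an empty list means the splits are disjoint by image.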

## Upload Images to S3

Now, let's upload the downloaded images to S3:

In [None]:
def upload_images_to_s3(data_list, subset):
    """Upload images to S3 and return paths"""
    print(f"Uploading {subset} images to S3...")
    
    s3_paths = []
    
    for i, example in enumerate(tqdm(data_list)):
        try:
            # Get local image path
            local_path = example['image_path']
            
            # Create S3 key
            file_name = os.path.basename(local_path)
            s3_key = f"images/{subset}/{file_name}"
            
            # Upload to S3
            s3_client.upload_file(local_path, bucket_name, s3_key)
            
            # Store S3 path
            s3_uri = f"s3://{bucket_name}/{s3_key}"
            s3_paths.append({
                'local_path': local_path,
                's3_uri': s3_uri,
                'example': example
            })
            
        except Exception as e:
            print(f"Error uploading image {i}: {e}")
    
    return s3_paths

# Upload images to S3
train_s3_paths = upload_images_to_s3(train_data, 'train')
val_s3_paths = upload_images_to_s3(val_data, 'val')
test_s3_paths = upload_images_to_s3(test_data, 'test')

print(f"Uploaded {len(train_s3_paths)} training images")
print(f"Uploaded {len(val_s3_paths)} validation images")
print(f"Uploaded {len(test_s3_paths)} test images")
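
Uploading 1,200 images one at a time is dominated by per-request latency. A possible variation, sketched below under stated assumptions, runs the uploads on a thread pool. `upload_many` and its `upload_fn` parameter are hypothetical names; in the notebook `upload_fn` would be `s3_client.upload_file`, while the example here uses a stand-in so that it runs without AWS access.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_many(examples, subset, bucket, upload_fn, max_workers=8):
    """Upload each example's image concurrently and return S3 path records.

    upload_fn(local_path, bucket, key) performs one upload and raises on
    failure. Records are returned in input order; failed uploads are skipped.
    """
    def upload_one(example):
        local_path = example["image_path"]
        key = f"images/{subset}/{os.path.basename(local_path)}"
        upload_fn(local_path, bucket, key)
        return {
            "local_path": local_path,
            "s3_uri": f"s3://{bucket}/{key}",
            "example": example,
        }

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, ex): i for i, ex in enumerate(examples)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as e:
                print(f"Error uploading image {index}: {e}")
    return [results[i] for i in sorted(results)]


# Stand-in uploader that records keys instead of contacting S3
uploaded_keys = []


def fake_upload(local_path, bucket, key):
    if "missing" in local_path:
        raise FileNotFoundError(local_path)
    uploaded_keys.append(key)


toy_examples = [
    {"image_path": "llava_images/train/a.jpg"},
    {"image_path": "llava_images/train/missing.jpg"},
    {"image_path": "llava_images/train/b.jpg"},
]
records = upload_many(toy_examples, "train", "example-bucket", fake_upload)
```

The returned records have the same shape as those produced by `upload_images_to_s3`, so the later formatting step would not need to change. The boto3 documentation describes low-level clients as thread safe, which is what makes sharing one `s3_client` across worker threads reasonable here.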

## Format Data for Fine-tuning

Let's prepare the data in the required format for Bedrock Llama 3.2 fine-tuning:

In [None]:
def create_jsonl_entry(example, s3_uri):
    """Create a JSONL entry in the Bedrock conversation schema format"""
    
    # Extract conversation components
    conversations = example.get('conversations', [])
    
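    if len(conversations) >= 2:
        question = conversations[0].get('value', "What's in this image?")
        answer = conversations[1].get('value', "This is an image.")
    else:
        question = "What's in this image?"
        answer = "This is an image."

    # LLaVA prompts embed an "<image>" placeholder token; the image is passed
    # as a separate content block below, so remove the placeholder from the text
    question = question.replace("<image>", "").strip()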
    
    # Create entry in the required format
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [
            {
                "text": "You are a helpful assistant that can answer questions about images accurately and concisely."
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "text": question
                    },
                    {
                        "image": {
                            "format": "jpeg",  # COCO images are JPEG files
                            "source": {
                                "s3Location": {
                                    "uri": s3_uri,
                                    "bucketOwner": account_id
                                }
                            }
                        }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "text": answer
                    }
                ]
            }
        ]
    }

def prepare_dataset_jsonl(s3_paths, output_file):
    """Prepare dataset in JSONL format for fine-tuning"""
    
    with open(output_file, 'w') as f:
        for item in s3_paths:
            # Create JSONL entry
            entry = create_jsonl_entry(item['example'], item['s3_uri'])
            
            # Write to file
            f.write(json.dumps(entry) + '\n')
    
    print(f"Created {output_file} with {len(s3_paths)} samples")

# Prepare JSONL files
prepare_dataset_jsonl(train_s3_paths, 'train.jsonl')
prepare_dataset_jsonl(val_s3_paths, 'validation.jsonl')
prepare_dataset_jsonl(test_s3_paths, 'test.jsonl')
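
Before uploading, it can help to sanity-check each record, since a malformed line is cheaper to catch locally than in a failed customization job. The sketch below checks only the structure this notebook produces; it is not an official Bedrock validator, and `validate_record`, `validate_jsonl_lines`, and their specific rules are assumptions derived from the entry format above.

```python
import json


def validate_record(record):
    """Return a list of problems found in one conversation record."""
    problems = []
    if record.get("schemaVersion") != "bedrock-conversation-2024":
        problems.append("unexpected schemaVersion")

    messages = record.get("messages", [])
    if not messages:
        problems.append("no messages")
    for i, message in enumerate(messages):
        # The entries built above alternate user and assistant turns
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {i}: expected role {expected_role}")
        for block in message.get("content", []):
            if "text" in block and not block["text"].strip():
                problems.append(f"message {i}: empty text")
            if "image" in block:
                source = block["image"].get("source", {})
                uri = source.get("s3Location", {}).get("uri", "")
                if not uri.startswith("s3://"):
                    problems.append(f"message {i}: image uri is not an S3 URI")
    return problems


def validate_jsonl_lines(lines):
    """Map 1-based line numbers to problems for every invalid line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        try:
            problems = validate_record(json.loads(line))
        except json.JSONDecodeError as e:
            problems = [f"invalid JSON: {e.msg}"]
        if problems:
            report[number] = problems
    return report


# Toy records: one well-formed, one with a wrong role and empty text
good = {
    "schemaVersion": "bedrock-conversation-2024",
    "messages": [
        {"role": "user", "content": [
            {"text": "What is shown?"},
            {"image": {"format": "jpeg", "source": {
                "s3Location": {"uri": "s3://example-bucket/images/train/a.jpg"}}}},
        ]},
        {"role": "assistant", "content": [{"text": "A bus."}]},
    ],
}
bad = {
    "schemaVersion": "bedrock-conversation-2024",
    "messages": [{"role": "assistant", "content": [{"text": ""}]}],
}
report = validate_jsonl_lines([json.dumps(good), json.dumps(bad), "{not json"])
```

Against the real files, the same function can be fed an open file handle, for example `validate_jsonl_lines(open('train.jsonl'))`; an empty report means every line passed these checks.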

## Upload JSONL Files to S3

Let's upload our prepared JSONL files to S3:

In [None]:
# Upload JSONL files to S3
s3_client.upload_file('train.jsonl', bucket_name, 'data/train.jsonl')
s3_client.upload_file('validation.jsonl', bucket_name, 'data/validation.jsonl')
s3_client.upload_file('test.jsonl', bucket_name, 'data/test.jsonl')

# Store S3 URIs for later use
train_data_uri = f"s3://{bucket_name}/data/train.jsonl"
validation_data_uri = f"s3://{bucket_name}/data/validation.jsonl"
test_data_uri = f"s3://{bucket_name}/data/test.jsonl"

print(f"Training data URI: {train_data_uri}")
print(f"Validation data URI: {validation_data_uri}")
print(f"Test data URI: {test_data_uri}")

## Create IAM Role for Model Fine-tuning

Let's create an IAM role that the fine-tuning job will assume:

In [None]:
# Generate policy documents
trust_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": account_id
                },
                "ArnLike": {
                    "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                }
            }
        }
    ]
}

access_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*"
            ]
        }
    ]
}


# Create IAM client
iam = session.client('iam')

# Role name for fine-tuning
role_name = f"Llama32VisionFineTuningRole-{int(time.time())}"
policy_name = f"Llama32VisionFineTuningPolicy-{int(time.time())}"

# Create role
try:
    response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy_doc),
        Description="Role for fine-tuning Llama 3.2 vision model with Amazon Bedrock"
    )
    
    role_arn = response["Role"]["Arn"]
    print(f"Created role: {role_arn}")
    
    # Create policy
    response = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(access_policy_doc)
    )
    
    policy_arn = response["Policy"]["Arn"]
    print(f"Created policy: {policy_arn}")
    
    # Attach policy to role
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=policy_arn
    )
    
    print("Attached policy to role")
    
except Exception as e:
    print(f"Error creating IAM resources: {e}")
    raise  # role_arn and policy_arn are required by the cells below

# Allow time for IAM role propagation
print("Waiting for IAM role to propagate...")
time.sleep(10)
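
A fixed 10-second sleep is a heuristic: IAM changes are eventually consistent, so a newly created role can occasionally take longer to become usable. One alternative, sketched here as an assumption rather than the notebook's method, is to retry the first call that uses the role with exponential backoff. `retry_with_backoff` is a hypothetical helper, and the flaky function below merely simulates a call that fails until the role has propagated.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The delay doubles after each failure (base_delay, 2 * base_delay, ...).
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated call that fails twice before succeeding
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("role not yet usable")
    return "ok"


# Record the delays instead of actually sleeping, to keep the example fast
delays = []
result = retry_with_backoff(flaky_call, sleep=delays.append)
```

In practice `operation` would wrap the call that first passes `role_arn` to Bedrock, and the `except` clause would be narrowed to the specific error indicating that the role cannot yet be assumed.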


## Save Variables for the Next Notebook

Let's save the important variables we'll need in the next notebook:

In [None]:
# Store variables for the next notebook
%store bucket_name
%store train_data_uri
%store validation_data_uri
%store test_data_uri
%store role_arn
%store role_name
%store policy_arn

print("Variables saved for use in the next notebook")
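
`%store` relies on IPython's storemagic, so the next notebook has to run under the same IPython profile to read the values back. Where that is not the case, a plain JSON file is a simple fallback. This is a sketch under that assumption; the helpers, the file name `finetune_vars.json`, and the example values are hypothetical.

```python
import json
import os
import tempfile


def save_variables(path, **variables):
    """Write the given name/value pairs to a JSON file."""
    with open(path, "w") as f:
        json.dump(variables, f, indent=2)


def load_variables(path):
    """Read the name/value pairs back as a dict."""
    with open(path) as f:
        return json.load(f)


# Example values only; in the notebook these would be the real variables
state_path = os.path.join(tempfile.gettempdir(), "finetune_vars.json")
save_variables(
    state_path,
    bucket_name="example-bucket",
    train_data_uri="s3://example-bucket/data/train.jsonl",
)
restored = load_variables(state_path)
```

The next notebook would call `load_variables` with the same path and read `restored["bucket_name"]` and the other keys from the returned dict.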

## Conclusion

In this notebook, we prepared the data needed for fine-tuning a Llama 3.2 multi-modal model. We:

- Downloaded the LLaVA-Instruct-150K dataset and selected a 1,200-example subset with its COCO images
- Uploaded the images to S3
- Formatted the data according to the Bedrock conversation schema and uploaded the JSONL files to S3
- Created an IAM role with the necessary permissions

The data is now ready for fine-tuning, which we'll perform in the next notebook.