# Fine-tuning Llama 3.2 3B Instruct with SageMaker JumpStart

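This notebook demonstrates how to fine-tune Meta's Llama 3.2 3B Instruct model using Amazon SageMaker JumpStart. The workflow is aimed at small datasets (around 100 examples) and uses parameter-efficient fine-tuning techniques such as LoRA and QLoRA. The sample data created below holds only two examples and is a placeholder for your own.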

## Setup Environment

First, let's set up our SageMaker environment and install any required dependencies.

In [None]:
!pip install -U sagemaker boto3 pandas

In [None]:
import boto3
import sagemaker
import pandas as pd
import os
import json
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker import get_execution_role

# Initialize SageMaker session
session = sagemaker.Session()
role = get_execution_role()
region = session.boto_region_name
bucket = session.default_bucket()

print(f"SageMaker Role: {role}")
print(f"AWS Region: {region}")
print(f"Default S3 Bucket: {bucket}")

## Prepare Training Data

Let's create a simple example dataset in JSONL format. For real use cases, you would replace this with your actual data.

In [None]:
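# Create a sample dataset or read from an existing file.
# Chat-style JSONL: one JSON object per line, each holding one conversation.
# Note: the exact schema (for example the name of the top-level key) depends on
# the JumpStart training container, so check the model's fine-tuning documentation.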
sample_data = [
    {
        "messages": [
            {"role": "user", "content": "What are the benefits of fine-tuning language models?"},
            {"role": "assistant", "content": "Fine-tuning language models offers several benefits, including improved performance on domain-specific tasks, better alignment with specific use cases, reduced hallucinations, and more controlled outputs that follow your preferred style and format."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How does parameter-efficient fine-tuning work?"},
            {"role": "assistant", "content": "Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) work by freezing most of the pre-trained model parameters and only updating a small set of adapter parameters. LoRA adds trainable low-rank matrices to certain layers, typically the attention layers, which significantly reduces memory requirements while maintaining performance."}
        ]
    }
]

# Create a local file with the sample data
sample_data_file = "sample_training_data.jsonl"
with open(sample_data_file, "w", encoding="utf-8") as f:
    for item in sample_data:
        f.write(json.dumps(item) + "\n")

print(f"Created sample training data with {len(sample_data)} examples")

# For real data, you might load from a file like this (pandas is imported above):
# df = pd.read_json("your_real_data.jsonl", lines=True)
# print(f"Loaded {len(df)} training examples")
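
Malformed lines are easier to catch locally than in a failed training job. The following is a minimal sketch of a pre-upload check, assuming the `messages` layout used in the sample above with alternating `user` and `assistant` turns. `validate_chat_jsonl` is a hypothetical helper, not part of the SageMaker SDK, and the rules it enforces are assumptions to adapt to whatever schema your training container expects.

```python
import json


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples for chat-style JSONL lines.

    Assumed schema (taken from the sample data, not from SageMaker docs):
    {"messages": [{"role": "user"|"assistant", "content": "<non-empty str>"}, ...]}
    with turns alternating user, assistant, user, ...
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for index, message in enumerate(messages):
            expected_role = "user" if index % 2 == 0 else "assistant"
            if not isinstance(message, dict) or message.get("role") != expected_role:
                problems.append((number, f"turn {index} should have role '{expected_role}'"))
                break
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((number, f"turn {index} has empty content"))
                break
        else:
            # Only reached when every turn passed the checks above.
            if messages[-1].get("role") != "assistant":
                problems.append((number, "conversation should end with an assistant turn"))
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]})
bad = json.dumps({"messages": [{"role": "assistant", "content": "No question asked"}]})
issues = validate_chat_jsonl([good, bad, "not json"])
for line_number, problem in issues:
    print(f"line {line_number}: {problem}")
```

To check the file written above, pass it an open file handle, for example `validate_chat_jsonl(open(sample_data_file, encoding="utf-8"))`, and upload only when the returned list is empty.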

## Upload Training Data to S3

In [None]:
# Upload the training data to S3
prefix = "llama3-finetuning"
train_data_s3_path = session.upload_data(
    path=sample_data_file, 
    bucket=bucket, 
    key_prefix=f"{prefix}/data"
)
print(f"Training data uploaded to: {train_data_s3_path}")
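
The upload above sends a single file, while the hyperparameters in the next section request an evaluation every epoch. If you want that evaluation to run on examples you picked yourself, one option is to hold out a few records before uploading. The sketch below shows a deterministic split; `split_records` is a hypothetical helper, and whether the training job accepts a separate validation file (and under which channel name) is an assumption to verify for your model version.

```python
import random


def split_records(records, validation_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for validation.

    At least one record is held out whenever there are two or more records,
    and at least one record always stays in the training set.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) < 2:
        return shuffled, []
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    n_validation = min(n_validation, len(shuffled) - 1)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for roughly 100 real conversations.
examples = [{"id": i} for i in range(100)]
train_records, validation_records = split_records(examples, validation_fraction=0.1)
print(len(train_records), len(validation_records))
```

A fixed seed keeps the split stable across notebook runs, so metrics from different training jobs stay comparable.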

## Configure Hyperparameters

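Let's set up the hyperparameters for fine-tuning. These values are a reasonable starting point for a small dataset (around 100 examples), not the result of a tuning run. The hyperparameter names that a JumpStart model accepts vary by model and version, so check the keys below against the model's published defaults before launching a job.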

In [None]:
# Define hyperparameters
hyperparameters = {
    # Training parameters
    "epoch": "3",                  # Number of training epochs
    "learning_rate": "5e-5",       # Learning rate
    "per_device_train_batch_size": "2",  # Batch size per GPU for training
    "per_device_eval_batch_size": "2",   # Batch size per GPU for evaluation
    "gradient_accumulation_steps": "4",  # Number of steps to accumulate gradients
    "warmup_steps": "10",          # Number of warmup steps for learning rate scheduler
    "weight_decay": "0.01",        # Weight decay
    
    # LoRA specific parameters
    "use_lora": "True",            # Use LoRA for fine-tuning
    "lora_r": "16",                # LoRA rank (size of the low-rank update matrices)
    "lora_alpha": "32",            # LoRA scaling factor (updates are scaled by lora_alpha / lora_r)
    "lora_dropout": "0.05",        # Dropout probability for LoRA layers
    
    # QLoRA specific parameters (for memory efficiency)
    "use_qlora": "True",           # Use QLoRA for more memory efficiency
    "bnb_4bit_quant_type": "nf4",  # Quantization type
    "bnb_4bit_compute_dtype": "float16",  # Compute dtype
    
    # Other settings
    "max_seq_length": "2048",      # Maximum sequence length
    "save_strategy": "epoch",      # Save strategy
    "evaluation_strategy": "epoch" # Evaluation strategy
}

print("Hyperparameters configured for small dataset fine-tuning")
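
To see what these numbers imply for a dataset of about 100 examples, the sketch below works through the arithmetic: the effective batch size, the approximate number of optimizer steps, and how many parameters LoRA adds to a single weight matrix. The helper names are hypothetical, the step count is approximate (the exact value depends on how the trainer handles the final partial batch), and the 4096 x 4096 matrix is an illustrative shape, not the actual Llama 3.2 3B dimensions.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the count is
    rank * (d_in + d_out) instead of d_in * d_out.
    """
    return rank * (d_in + d_out)


effective_batch, steps_per_epoch, total_steps = optimizer_steps(
    num_examples=100, per_device_batch_size=2, grad_accum_steps=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"warmup share: {10 / total_steps:.0%}")
print(f"LoRA scaling (alpha / r): {32 / 16}")

# Illustrative 4096 x 4096 projection matrix (not the actual Llama 3.2 3B shape).
added = lora_extra_params(4096, 4096, rank=16)
print(f"LoRA params for one matrix: {added:,} vs {4096 * 4096:,} full")
```

With these settings a run has only about 39 optimizer steps, so 10 warmup steps cover roughly a quarter of training. On datasets this small it can be worth lowering `warmup_steps` or raising `epoch` and comparing the results.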

## Create and Start the Training Job

Now we'll use SageMaker JumpStart to create and start the fine-tuning job.

In [None]:
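# Define model ID for Llama 3.2 3B Instruct
model_id = "meta-textgeneration-llama-3-2-3b-instruct"
model_version = "1.0.0"  # Update this version as needed

# Optional: print the hyperparameters this model version accepts and compare
# them with the dictionary above before launching.
# from sagemaker import hyperparameters as jumpstart_hyperparameters
# print(jumpstart_hyperparameters.retrieve_default(model_id=model_id, model_version=model_version))

# Create JumpStart estimator
estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    instance_type="ml.g5.2xlarge",  # Single NVIDIA A10G GPU (24 GB)
    instance_count=1,
    hyperparameters=hyperparameters,
    role=role,
    sagemaker_session=session,
    # Llama models are gated. Set this only after reading and accepting Meta's
    # license; without it the training job is expected to be rejected.
    environment={"accept_eula": "true"},
)

print(f"Created JumpStart estimator for {model_id} version {model_version}")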

In [None]:
import time

# Configure the training job
training_input = {"train": train_data_s3_path}

# Generate a unique job name
job_name = f"llama3-2-3b-finetuning-{int(time.time())}"

# Start the training job
estimator.fit(
    inputs=training_input,
    job_name=job_name,
    wait=False,  # Set to True if you want to wait for job completion in the notebook
    logs=False   # Set to True if you want to see logs in the notebook
)

print(f"Training job '{job_name}' started!")
print("Monitor the job in the SageMaker console, or run "
      "'estimator.latest_training_job.wait()' to block until it completes.")
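
Training job names are restricted: the `CreateTrainingJob` API allows at most 63 characters, limited to letters, digits, and hyphens, with a letter or digit at both ends. The timestamp suffix used above satisfies this, but a base name with a dot or underscore (such as `llama3.2_3b`) would be rejected. The sketch below builds the same kind of name and checks it locally first; `make_job_name` is a hypothetical helper, not an SDK function.

```python
import re
import time

# Pattern and 63-character limit documented for TrainingJobName in CreateTrainingJob.
_JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(base, timestamp=None):
    """Build '<base>-<unix timestamp>' and verify it is a valid training job name."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    name = f"{base}-{timestamp}"
    if not _JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


# A fixed timestamp keeps the example reproducible; omit it to use the current time.
print(make_job_name("llama3-2-3b-finetuning", timestamp=1700000000))
```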

## Monitor Training Job Status

In [None]:
# Get training job info
training_job_name = estimator.latest_training_job.job_name
print(f"Training job name: {training_job_name}")

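# Get training job status (use the same region as the SageMaker session)
sm_client = boto3.client("sagemaker", region_name=region)
response = sm_client.describe_training_job(TrainingJobName=training_job_name)
print(f"Training job status: {response['TrainingJobStatus']}")
if response["TrainingJobStatus"] == "Failed":
    print(f"Failure reason: {response.get('FailureReason', 'not reported')}")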

# If you want to wait for the job to complete
# estimator.latest_training_job.wait()
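
The `describe_training_job` call above returns a single snapshot of the status. If you would rather poll than block on `wait()`, a loop like the following sketch can help. `wait_for_terminal_status` is a hypothetical helper that takes any zero-argument function returning a status string, so it runs here against a stand-in instead of AWS. In the notebook you would pass something like `lambda: sm_client.describe_training_job(TrainingJobName=training_job_name)["TrainingJobStatus"]`. The set of terminal statuses is an assumption based on the documented `TrainingJobStatus` values.

```python
import time


def wait_for_terminal_status(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                             terminal_statuses=("Completed", "Failed", "Stopped"),
                             sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in for a real status call, so the sketch runs without AWS access.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

boto3 also provides a built-in waiter for this, `sm_client.get_waiter("training_job_completed_or_stopped")`, which may be simpler if you do not need custom handling between polls.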

## Deploy the Fine-tuned Model (After Training Completes)

Once the training job completes, you can deploy the model as an endpoint for inference.

In [None]:
# Uncomment and run this after training completes

# # Deploy the fine-tuned model
# predictor = estimator.deploy(
#     initial_instance_count=1,
#     instance_type="ml.g5.xlarge",
#     endpoint_name=f"llama3-2-finetuned-endpoint-{int(time.time())}"
# )
# 
# print(f"Model deployed to endpoint: {predictor.endpoint_name}")

## Test the Fine-tuned Model (After Deployment)

After deployment, you can test your fine-tuned model with inference requests.

In [None]:
# Uncomment and run this after deployment completes

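# # Test the deployed model
# # Note: the request schema depends on the serving container behind the endpoint.
# # Some containers expect "inputs" to be a single prompt string rather than a
# # list of messages; check the model's example payloads if this request fails.
# prompt = [{"role": "user", "content": "What are the key benefits of fine-tuning language models?"}]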
# 
# response = predictor.predict({
#     "inputs": prompt,
#     "parameters": {
#         "max_new_tokens": 256,
#         "top_p": 0.9,
#         "temperature": 0.7
#     }
# })
# 
# print(json.dumps(response, indent=2))
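
If the endpoint expects `inputs` to be a plain string, the conversation has to be rendered with the model's chat template before it is sent. The sketch below shows the basic Llama 3 header-token layout; `format_llama3_chat` is a hypothetical helper, and it is an assumption that your endpoint wants a pre-formatted prompt at all, since some serving containers apply the template themselves. The official Llama 3.2 template can also add a default system block, which this sketch omits.

```python
import json


def format_llama3_chat(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts with Llama 3 header tokens."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave the assistant header open so the model writes the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


chat = [{"role": "user", "content": "What are the key benefits of fine-tuning language models?"}]
prompt_text = format_llama3_chat(chat)

# Same generation parameters as the request above, with a string prompt.
payload = {
    "inputs": prompt_text,
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.7},
}
print(json.dumps(payload, indent=2))
```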

## Clean Up Resources

Don't forget to clean up resources when you're done to avoid unnecessary charges.

In [None]:
# Uncomment and run this when you're done

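# # Delete the model first (it is looked up through the endpoint configuration),
# # then the endpoint and its configuration
# endpoint_name = predictor.endpoint_name
# predictor.delete_model()
# predictor.delete_endpoint(delete_endpoint_config=True)
# print(f"Deleted endpoint: {endpoint_name}")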