# 🚀 Customize and Deploy `meta-llama/Llama-4-Maverick-17B-128E-Instruct` on Amazon SageMaker AI
---
In this notebook, we explore **Llama-4-Maverick-17B-128E-Instruct**, Meta's groundbreaking multimodal model that combines vision and language understanding with expert routing capabilities. You'll learn how to fine-tune this advanced model on multimodal datasets, evaluate its vision-language performance, and deploy it using SageMaker.

**What is Llama-4-Maverick-17B-128E-Instruct?**

Meta's **Llama-4-Maverick-17B-128E-Instruct** is a natively multimodal mixture-of-experts (MoE) model. The name encodes its shape: roughly 17 billion parameters are active per token, routed across 128 experts (128E), while the total parameter count is about 400 billion according to the model card. The model accepts images and text as input and generates text.  
🔗 Model card: [meta-llama/Llama-4-Maverick-17B-128E-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)

---

**Key Specifications**

| Feature | Details |
|---|---|
| **Total Parameters** | ~400 billion (per the model card) |
| **Active Parameters** | ~17 billion per token (via expert routing) |
| **Architecture** | Mixture-of-Experts Transformer with Vision Encoder |
| **Expert Modules** | 128 expert networks (128E) with dynamic routing |
| **Modalities** | Image + Text input → Text output |
| **Context Length** | Up to 1M tokens (per the model card) |
| **Vision Encoder** | Advanced vision transformer for high-resolution image processing |
| **License** | Llama 4 Community License |

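To make the expert-routing rows above concrete, here is a toy, pure-Python sketch of top-k expert routing in general. It is not Llama 4's actual router: the function name `route_token`, the example scores, and the choice of `top_k` are illustrative assumptions.

```python
import math

def route_token(router_logits, top_k=1):
    """Pick the top_k experts for one token and return (expert_index, weight) pairs.

    router_logits: one score per expert (hypothetical values from a router layer).
    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy example with 4 experts instead of 128: only the selected experts
# would run for this token, which is why active parameters << total parameters.
logits = [0.1, 2.0, -1.0, 1.5]
selected = route_token(logits, top_k=2)
print(selected)
```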
---

**Benchmarks & Behavior**

- Meta reports **strong results** for Llama-4-Maverick on multimodal benchmarks covering visual question answering, image captioning, and visual reasoning; see the model card for the published scores.  
- Exceptional **vision-language understanding** with detailed scene analysis and contextual reasoning.  
- Advanced **instruction following** capabilities for complex multimodal tasks.  
- Efficient **expert routing** enables high performance while maintaining computational efficiency.  
- Strong **multilingual and multicultural** understanding across diverse visual contexts.  

---

**Using This Notebook**

Here's what you'll cover:

* Load multimodal datasets and prepare them for vision-language fine-tuning  
* Fine-tune with SageMaker Training Jobs using MoE-optimized configurations  
* Run Model Evaluation on vision-language benchmarks  
* Deploy to SageMaker Endpoints for multimodal inference  


In [1]:
%pip install -Uq sagemaker datasets

In [2]:
import boto3
import sagemaker
from PIL import Image
import torch

In [3]:
region = boto3.Session().region_name

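from sagemaker.local import LocalSession

# Local mode: the training container runs on this machine.
# For managed training, use instead: sess = sagemaker.Session(boto3.Session(region_name=region))
sess = LocalSession()
sess.config = {"local": {"local_code": True}}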

sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

In [4]:
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Data Preparation for Supervised Fine-tuning

### [Visual-TableQA](https://huggingface.co/datasets/AI-4-Everyone/Visual-TableQA)

**Visual-TableQA** is a large-scale benchmark for **open-domain reasoning over table images**, designed to advance research in multimodal understanding of structured data. The dataset provides high-quality synthetic question–answer pairs associated with rendered LaTeX table images, making it well-suited for training and evaluating vision–language models on table reasoning tasks.

**Data Format & Structure**:
- Distributed in **JSON** format.  
- Contains standard splits for training, validation, and testing.  
- Each record includes:  
  - `table_id` – a unique identifier for the table  
  - `image` – rendered PNG image of the LaTeX table  
  - `question` – a natural language query about the table  
  - `answer` – the ground-truth response grounded in the table  

**Dataset Quality**:
- Questions are automatically generated and verified with reasoning-oriented LLMs.  
- Ensures strong alignment between the table structure, visual representation, and annotated answers.  

**License**: Released under the **Apache-2.0** license.  

**Applications**:

The dataset can support a variety of multimodal and structured reasoning tasks, including:  
- Table-based question answering (QA)  
- Document QA and table parsing  
- Multimodal reasoning and visual understanding  
- Benchmarking pipelines for table reasoning tasks  
- Fine-tuning and evaluation of vision–language models on structured visual data  
 

In [5]:
import os
import io
import base64
import json
import pprint
from tqdm import tqdm
from datasets import load_dataset

In [None]:
dataset_parent_path = os.path.join(os.getcwd(), "tmp_cache_local_dataset")
os.makedirs(dataset_parent_path, exist_ok=True)

**Preparing Your Dataset in `messages` format**

This section walks you through creating a conversation-style dataset in the `messages` format, which is required for training LLMs directly with SageMaker AI.

**What Is the `messages` Format?**

The `messages` format structures instances as chat-like exchanges, wrapping each conversation turn into a role-labeled JSON array. It’s widely used by frameworks like TRL.

Example entry:

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "How do I bake sourdough?" },
    { "role": "assistant", "content": "First, you need to create a starter by..." }
  ]
}
```


In [None]:
dataset_name = "AI-4-Everyone/Visual-TableQA"
dataset = load_dataset(dataset_name, split="train[:1000]")

In [None]:
pprint.pp(dataset[0])

In [None]:
print(f"total number of fine-tunable samples: {len(dataset)}")

In [6]:
def pil_to_base64(pil_img):
    """Convert a PIL image to base64-encoded PNG string."""
    buffer = io.BytesIO()
    pil_img.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def convert_to_messages_multimodal(row):
    system_content = (
        "You are a multimodal reasoning assistant. Given a table (and its image if present) "
        "and a question, provide a clear, concise answer followed by a brief explanation of "
        "how the table supports your conclusion. Keep the reasoning grounded in the data and avoid speculation."
    )
    user_content = row["question"]
    assistant_content = row["answer"]
    image_content = row["image"]

    # Normalize to a list so a single image and a list of images share one code path.
    if image_content is None:
        image_items = []
    elif isinstance(image_content, list):
        image_items = image_content
    else:
        image_items = [image_content]

    # Encode each PIL image (anything with a .save method) as a PNG data URL.
    images = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{pil_to_base64(img)}"},
        }
        for img in image_items
        if hasattr(img, "save")
    ]

    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_content}]},
            {"role": "user", "content": images + [{"type": "text", "text": user_content}]},
            {"role": "assistant", "content": [{"type": "text", "text": assistant_content}]}
        ]
    }

dataset = dataset.map(convert_to_messages_multimodal, remove_columns=dataset.column_names)

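Before writing the dataset to disk, it can help to verify that a converted record has the expected shape. The sketch below is one possible check and is not part of the original pipeline: `check_messages_record` and `PNG_SIGNATURE` are hypothetical names, and the sample record is hand-built to mimic the converter's output so the block runs without `datasets` or PIL. The check reads only the keys relevant to each content `type`, so extra keys set to `None` (which may appear after serialization) are ignored.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
DATA_URL_PREFIX = "data:image/png;base64,"

def check_messages_record(record):
    """Return a list of problems found in one `messages` record (empty if none)."""
    problems = []
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["system", "user", "assistant"]:
        problems.append(f"unexpected role order: {roles}")
    for message in messages:
        for part in message.get("content", []):
            kind = part.get("type")
            if kind == "text":
                if not part.get("text"):
                    problems.append(f"empty text in {message.get('role')} message")
            elif kind == "image_url":
                url = (part.get("image_url") or {}).get("url", "")
                if not url.startswith(DATA_URL_PREFIX):
                    problems.append("image url is not a PNG data URL")
                    continue
                raw = base64.b64decode(url[len(DATA_URL_PREFIX):])
                if not raw.startswith(PNG_SIGNATURE):
                    problems.append("decoded image is not a PNG")
            else:
                problems.append(f"unknown content type: {kind}")
    return problems

# Hand-built record; the image payload is just the PNG signature, not a real image.
fake_png = base64.b64encode(PNG_SIGNATURE).decode("utf-8")
sample = {
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": DATA_URL_PREFIX + fake_png}},
            {"type": "text", "text": "What is the total in row 3?"},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": "The total is 42."}]},
    ]
}
print(check_messages_record(sample))  # an empty list means no problems were found
```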
In [None]:
dataset_filename = os.path.join(dataset_parent_path, f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl")
dataset.to_json(dataset_filename, lines=True)

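Because every image is embedded as base64 text, individual lines in the JSON Lines file can become large. The following is a minimal sketch of a sanity check to run before uploading; `summarize_jsonl` is a hypothetical helper, and the demo writes a tiny temporary file rather than reading the real dataset.

```python
import json
import os
import tempfile

def summarize_jsonl(path):
    """Count records and report the largest line size (in bytes) of a JSON Lines file."""
    count, largest = 0, 0
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # raises ValueError on a malformed line
            if "messages" not in record:
                raise KeyError(f"record {count} has no 'messages' key")
            count += 1
            largest = max(largest, len(line))
    return {
        "records": count,
        "largest_line_bytes": largest,
        "file_bytes": os.path.getsize(path),
    }

# Demo on a throwaway file; in the notebook you would pass dataset_filename instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"messages": []}) + "\n")
        f.write(json.dumps({"messages": [{"role": "user", "content": "hi"}]}) + "\n")
    summary = summarize_jsonl(demo_path)
print(summary)
```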
#### Upload file to S3

In [None]:
from sagemaker.s3 import S3Uploader

In [None]:
data_s3_uri = f"s3://{sess.default_bucket()}/dataset"

uploaded_s3_uri = S3Uploader.upload(
    local_path=dataset_filename,
    desired_s3_uri=data_s3_uri
)
print(f"Uploaded {dataset_filename} to {uploaded_s3_uri}")

## Fine-Tune LLMs using SageMaker `Estimator`/`ModelTrainer`

In [None]:
import time
from sagemaker.pytorch import PyTorch
from sagemaker.huggingface import HuggingFace
from getpass import getpass
import yaml
from jinja2 import Template

In [None]:
# Get Hugging Face token for model downloads (if needed)
hf_token = getpass("Hugging Face token: ")

### Training using `PyTorch` Estimator

This approach uses the official PyTorch SageMaker container to run a custom training script built on the Accelerate and DeepSpeed libraries. It is ideal for users who want full control over the training pipeline.

---
**Observability**: SageMaker AI offers [SageMaker MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html), which lets you track experiments and monitor the performance of models and AI applications in a single tool.

To track fine-tuning metrics in real time, include MLflow in your training workflow by specifying an MLflow tracking server ARN.

Optionally, you can report to **tensorboard** or **wandb** instead.

In [8]:
MLFLOW_TRACKING_SERVER_ARN = None # or "arn:aws:sagemaker:us-west-2:<account-id>:mlflow-tracking-server/<server-name>"

if MLFLOW_TRACKING_SERVER_ARN:
    reports_to = "mlflow"
else:
    reports_to = "tensorboard"

In [None]:
job_name = 'meta-llama--Llama-4-Maverick-17B-128E-Instruct'
training_instance_type = "local_gpu"

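SageMaker training job names must match the pattern `[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}`, that is, letters, digits, and hyphens only, at most 63 characters. Model IDs often contain characters such as `/`, `.`, or `_`, so a small sanitizer can help when deriving a job name from a model ID. The sketch below is one way to do this; `to_sagemaker_job_name` is a hypothetical helper, and you may want a smaller `max_len` to leave headroom for any suffix appended to the base job name.

```python
import re

# Pattern documented for TrainingJobName in the CreateTrainingJob API.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

def to_sagemaker_job_name(model_id, max_len=63):
    """Turn a model ID into a string that satisfies the job-name pattern."""
    # Replace every run of disallowed characters with a single hyphen.
    name = re.sub(r"[^a-zA-Z0-9-]+", "-", model_id)
    # Trim to length, then strip hyphens from both ends.
    name = name[:max_len].strip("-")
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"cannot derive a valid job name from {model_id!r}")
    return name

print(to_sagemaker_job_name("meta-llama/Llama-4-Maverick-17B-128E-Instruct"))
```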
In [None]:
# HF_TOKEN is always passed; MLflow settings are added only when a tracking server is configured.
training_env = {"HF_TOKEN": hf_token}
if MLFLOW_TRACKING_SERVER_ARN:
    training_env.update({
        "MLFLOW_EXPERIMENT_NAME": f"exp-{job_name}",
        "MLFLOW_TAGS": '{"source.job": "sm-training-jobs", "source.type": "sft", "source.framework": "pytorch"}',
        "MLFLOW_TRACKING_URI": MLFLOW_TRACKING_SERVER_ARN,
    })

In [None]:
pytorch_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:2.8.0-gpu-py312-cu129-ubuntu22.04-sagemaker"
print(f"Using image: {pytorch_image_uri}")

#### Training strategy: `PEFT/QLoRA`

In [None]:
pytorch_estimator = PyTorch(
    image_uri=pytorch_image_uri,
    entry_point="sm_accelerate_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="sagemaker_code",
    instance_type=training_instance_type,
    instance_count=1,
    base_job_name=f"{job_name}-pytorch",
    role=role,
    volume_size=300,
    py_version="py312",
    keep_alive_period_in_seconds=3600,
    environment=training_env,
    sagemaker_session=sess,
    hyperparameters={
        "config": "hf_recipes/meta-llama/Llama-4-Maverick-17B-128E-Instruct--vanilla-peft-qlora.yaml"
    }
)

# fit or train
pytorch_estimator.fit(
    {"training": uploaded_s3_uri}, 
    wait=False
)

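Because `fit` is called with `wait=False`, on a managed (non-local) instance type the call can return before training has finished. The sketch below shows one way to poll until the job reaches a terminal state. It is a generic helper under stated assumptions: `wait_for_terminal_status` is a hypothetical name, the status source is injected as a callable so the block runs without AWS access, and the status strings are those returned in `TrainingJobStatus` by `describe_training_job`.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demo with a fake status source and a no-op sleep.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None)
print(final)

# Possible real use (assumes boto3 credentials and a job that has been started):
# sm = boto3.client("sagemaker")
# name = pytorch_estimator.latest_training_job.name
# wait_for_terminal_status(
#     lambda: sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
# )
```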
In [None]:
s3_model_data_uri = pytorch_estimator.model_data
print(f"Fine-tuned model location: {s3_model_data_uri}")

#### Training strategy: `Spectrum`

In [None]:
pytorch_estimator = PyTorch(
    image_uri=pytorch_image_uri,
    entry_point="sm_accelerate_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="sagemaker_code",
    instance_type=training_instance_type,
    instance_count=1,
    base_job_name=f"{job_name}-pytorch",
    role=role,
    volume_size=300,
    py_version="py312",
    keep_alive_period_in_seconds=3600,
    environment=training_env,
    sagemaker_session=sess,
    hyperparameters={
        "config": "hf_recipes/meta-llama/Llama-4-Maverick-17B-128E-Instruct--vanilla-spectrum.yaml"
    }
)

# fit or train
pytorch_estimator.fit(
    {"training": uploaded_s3_uri}, 
    wait=False
)

In [None]:
s3_model_data_uri = pytorch_estimator.model_data
print(f"Fine-tuned model location: {s3_model_data_uri}")

#### Training strategy: `Full-Finetuning`

In [None]:
pytorch_estimator = PyTorch(
    image_uri=pytorch_image_uri,
    entry_point="sm_accelerate_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="sagemaker_code",
    instance_type=training_instance_type,
    instance_count=1,
    base_job_name=f"{job_name}-pytorch",
    role=role,
    volume_size=300,
    py_version="py312",
    keep_alive_period_in_seconds=3600,
    environment=training_env,
    sagemaker_session=sess,
    hyperparameters={
        "config": "hf_recipes/meta-llama/Llama-4-Maverick-17B-128E-Instruct--vanilla-full.yaml"
    }
)

# fit or train
pytorch_estimator.fit(
    {"training": uploaded_s3_uri}, 
    wait=False
)

In [None]:
s3_model_data_uri = pytorch_estimator.model_data
print(f"Fine-tuned model location: {s3_model_data_uri}")