# Deploy Llama 4 Scout on SageMaker AI with Hugging Face Text Generation Inference

The Llama 4 collection is a family of multimodal AI models developed by Meta. These models use a mixture-of-experts architecture for industry-leading text and image understanding.

Llama 4 Scout is a model with 17 billion active parameters and 16 experts, released on April 5, 2025. For more information, refer to the [model card](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct).

We can deploy Llama 4 Scout on a SageMaker AI endpoint using Text Generation Inference (TGI), a toolkit developed by Hugging Face for deploying and serving large language models that is available in Amazon SageMaker AI. For more information, refer to the [TGI documentation](https://huggingface.co/docs/text-generation-inference/en/index).

---

To get started, we update the SageMaker Python SDK and configure our role and session information.

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import json
import sagemaker

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker session

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")
print(f"sagemaker version: {sagemaker.__version__}")

We set the ECR image URI for the Hugging Face Text Generation Inference container. Support for the Llama 4 collection was added in TGI 3.2.2, so the container must be at least that version. The image below uses TGI 3.2.3.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Override: pin the container image explicitly instead of resolving it with get_huggingface_llm_image_uri
llm_image = f"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/huggingface-pytorch-tgi-inference:2.6.0-tgi3.2.3-gpu-py311-cu124-ubuntu22.04-v2.0"

print(f"llm image uri: {llm_image}")
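The minimum-version requirement can also be checked in code. The following is a minimal sketch, assuming the image tag always embeds the TGI version in the form `tgi<major>.<minor>.<patch>`; the helper name `tgi_version_from_image_uri` and the `us-east-1` region in the example URI are illustrative and not part of the SageMaker SDK.

```python
import re

MIN_TGI_VERSION = (3, 2, 2)  # minimum TGI version stated above for Llama 4 support


def tgi_version_from_image_uri(image_uri):
    """Extract the TGI version tuple from a tag such as '...-tgi3.2.3-gpu-...'."""
    match = re.search(r"tgi(\d+)\.(\d+)\.(\d+)", image_uri)
    if match is None:
        raise ValueError(f"no TGI version found in image URI: {image_uri}")
    return tuple(int(part) for part in match.groups())


# Example URI with a placeholder region; in the notebook this would be `llm_image`.
example_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.6.0-tgi3.2.3-gpu-py311-cu124-ubuntu22.04-v2.0"
)
version = tgi_version_from_image_uri(example_uri)
assert version >= MIN_TGI_VERSION, f"TGI {version} is older than {MIN_TGI_VERSION}"
print("TGI version:", ".".join(map(str, version)))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically (for example, `"3.10.0" < "3.2.2"` as strings).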

## Deploying an endpoint

We can configure our endpoint with the SageMaker Python SDK to deploy our model. Llama 4 models are gated, so make sure you have been granted access on the Hugging Face Hub and provide a valid access token.
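One way to avoid pasting the token into the notebook is to read it from an environment variable. This is only a sketch: the variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions made for illustration, and the value used below is a dummy string rather than a real token.

```python
import os


def load_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token from an environment mapping, or raise if unset."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} to a token with access to the gated model.")
    return token


# Usage with an explicit mapping (a dummy value, not a real token).
token = load_hf_token({"HF_TOKEN": "hf_dummy_value"})
```

The returned value could then be assigned to the `HUGGING_FACE_HUB_TOKEN` entry of the TGI config instead of a hard-coded string.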

We also need to configure our instance type and environment variables. Llama 4 Scout is a mixture-of-experts (MoE) model with 17 billion active parameters out of 109 billion in total. Because all of these parameters are held in memory, not just the active ones, we need a large amount of GPU memory. Here we select `ml.p4d.24xlarge`, which has 320 GB of GPU memory across 8 A100 GPUs.
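A rough back-of-the-envelope estimate shows why an instance of this size is needed. The sketch below assumes the weights are stored in a 16-bit format (2 bytes per parameter), which is an assumption rather than something stated in this notebook, and it ignores the KV cache and activation memory that must fit in the remaining headroom.

```python
TOTAL_PARAMS = 109e9   # total parameters, from the text above
BYTES_PER_PARAM = 2    # assumption: 16-bit (bf16/fp16) weights
GPU_COUNT = 8          # A100 GPUs on ml.p4d.24xlarge
GPU_MEMORY_GB = 40     # 320 GB total divided across 8 GPUs

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
total_gpu_gb = GPU_COUNT * GPU_MEMORY_GB
headroom_gb = total_gpu_gb - weights_gb

print(f"weights:    {weights_gb:.0f} GB")
print(f"GPU memory: {total_gpu_gb} GB")
print(f"headroom:   {headroom_gb:.0f} GB")
```

Under these assumptions the weights alone take roughly 218 GB, leaving about 102 GB across the 8 GPUs for everything else.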

In [None]:
model_name = sagemaker.utils.name_from_base("llama4-tgi")
endpoint_name = model_name

In [None]:
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.p4d.24xlarge"
number_of_gpu = 8
health_check_timeout = 1800

# TGI config
config = {
    "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "HUGGING_FACE_HUB_TOKEN": "<REPLACE WITH YOUR TOKEN>",
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPUs used per replica
    "MAX_INPUT_LENGTH": json.dumps(4096),  # Max length of the input text, in tokens
    "MAX_TOTAL_TOKENS": json.dumps(8192),  # Max total tokens per request (input plus generated)
}

assert config['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."

# create HuggingFaceModel
llm_model = HuggingFaceModel(
    role = role,
    image_uri = llm_image,
    env = config,
    name = model_name
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = health_check_timeout,
    endpoint_name = endpoint_name,
)
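The two token limits in the TGI config interact: the input must fit within `MAX_INPUT_LENGTH`, and the input plus the requested generation length must fit within `MAX_TOTAL_TOKENS`. The sketch below illustrates that relationship under the assumption that this is how the limits are enforced; `check_request_budget` is a hypothetical helper, and token counts would in practice come from the model's tokenizer.

```python
MAX_INPUT_LENGTH = 4096   # from the TGI config above
MAX_TOTAL_TOKENS = 8192   # from the TGI config above


def check_request_budget(input_tokens, max_new_tokens):
    """Return True if a request fits within the configured token limits."""
    if input_tokens > MAX_INPUT_LENGTH:
        return False
    return input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS


print(check_request_budget(1000, 1024))  # short prompt, moderate generation
print(check_request_budget(4096, 4097))  # one token over the total budget
print(check_request_budget(5000, 100))   # prompt longer than the input limit
```

With these settings, a prompt that uses the full 4096 input tokens leaves at most 4096 tokens for generation.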

## Inference
Once our model is deployed, we can invoke it synchronously with the `predict` method and inspect the response.

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

prompt = "What is Amazon SageMaker?"

res = llm.predict({"inputs": prompt, "parameters": {"temperature": 0.9, "max_new_tokens": 1024}})
print(res[0]["generated_text"])

The Llama 4 models are multimodal, meaning they can work with both image and text inputs. We can use this for image understanding use cases.

To start, we define some helper functions for prediction.

In [None]:
import boto3
from PIL import Image
import requests
from io import BytesIO

runtime = boto3.client('sagemaker-runtime')

def get_image_urls(payload):
    image_urls = []
    payload = json.loads(payload)
    for msg in payload["messages"]:
        if isinstance(msg["content"], list):
            for ms in msg["content"]:
                typ = ms["type"]
                if typ == "image_url":
                    image_url = ms["image_url"]["url"]
                    image_urls.append(image_url)
    return image_urls

def display_images(image_urls):
    """
    Downloads images and displays them side by side using PIL.

    Args:
        image_urls: A list of URLs pointing to the images.
    """
    responses = [requests.get(url) for url in image_urls]
    images = [Image.open(BytesIO(response.content)) for response in responses]
    widths, heights = zip(*(i.size for i in images))

    total_width = sum(widths)
    max_height = max(heights)

    new_image = Image.new('RGB', (total_width, max_height))

    x_offset = 0
    for image in images:
        new_image.paste(image, (x_offset, 0))
        x_offset += image.size[0]

    new_image.show()

def predict(payload, endpoint_name, imgs=False):
    """Invokes the endpoint with a JSON Messages API payload and returns the generated text."""
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=payload)
    if imgs:
        image_urls = get_image_urls(payload)
        display_images(image_urls)
    result = json.loads(response['Body'].read().decode())
    return result["choices"][0]["message"]["content"]

We use the Messages API format to invoke Llama 4 Scout with the URL of an image. On invocation, the endpoint retrieves the image from the URL and the model generates an output based on it.

Below we use it to describe an image of a rabbit.

In [None]:
data = {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in detail please.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png",
                        },
                    },
                ],
            },
        ],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": 512
    }
payload = json.dumps(data)

In [None]:
predict(payload, endpoint_name, imgs=False)
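Since the same payload structure is reused for every image request, it can be wrapped in a small helper. This is a sketch only: `build_image_chat_payload` is a hypothetical function name, and its defaults simply mirror the values used in the request above.

```python
import json


def build_image_chat_payload(text, image_url, system_prompt="You are a helpful assistant",
                             temperature=0.6, top_p=0.9, max_tokens=512):
    """Build a JSON Messages API payload with one text part and one image part."""
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return json.dumps(data)


payload = build_image_chat_payload(
    "Describe this image in detail please.",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png",
)
```

The resulting string can be passed to the `predict` helper in the same way as the hand-written payload.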

We can also encode images in base64 for inference, as shown below. Note that the base64-encoded string must be prefixed with `data:image/png;base64,`.

In [None]:
!curl -o rabbit.png https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png

In [None]:
import base64

def image_to_base64(file_path):
    """Returns the file's contents as a base64 string, without the data URI prefix."""
    with open(file_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")

# Replace 'rabbit.png' with the path to your own PNG file
file_path = 'rabbit.png'
# Base64 payload only; the data URI prefix is added in the request below
data_uri = image_to_base64(file_path)

data = {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in detail please.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{data_uri}",
                        },
                    },
                ],
            },
        ],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": 512
    }
payload = json.dumps(data)

In [None]:
predict(payload, endpoint_name, imgs=False)
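The example above hard-codes the `image/png` prefix. For other image formats, the prefix can be derived from the file extension. The following is a minimal sketch using the standard-library `mimetypes` module; the helper names are hypothetical, and it assumes the serving container accepts the resulting MIME type, which this notebook only demonstrates for PNG.

```python
import base64
import mimetypes


def bytes_to_data_uri(data, mime_type):
    """Encode raw bytes as a data URI with the given MIME type."""
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(file_path):
    """Read an image file and return a data URI, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for: {file_path}")
    with open(file_path, "rb") as img_file:
        return bytes_to_data_uri(img_file.read(), mime_type)


# Tiny demonstration on in-memory bytes (not a real image).
print(bytes_to_data_uri(b"abc", "image/png"))
```

With a real file, `file_to_data_uri("rabbit.png")` would return a string that can be used directly as the `url` value in the `image_url` part of the message.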

## Cleanup

Once we are done, we delete the endpoint, its endpoint configuration, and the model.

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)