## Llama 3.2 11B Vision LMI v12 Deployment Guide

This notebook demonstrates how to deploy the [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) model using the LMI v12 container with the vLLM backend.
The Llama 3.2 vision implementation in vLLM 0.6.2 does not support CUDA graphs, so eager execution is required (`OPTION_ENFORCE_EAGER` is set to `true` below). It also does not support multi-image inputs.

## Install required dependencies

In [None]:
%pip install sagemaker boto3

## Create the SageMaker model object

In [None]:
import sagemaker
from sagemaker import image_uris
from sagemaker.djl_inference import DJLModel

image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124"

role = sagemaker.get_execution_role()

# Once the SageMaker Python SDK PR is merged, we can use image_uris directly
# image_uri = image_uris.retrieve(framework="djl-lmi", version="0.30.0", region="us-west-2")

model = DJLModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "HF_TOKEN": "<huggingface hub token>",
        "OPTION_ROLLING_BATCH": "vllm",
        "OPTION_MAX_MODEL_LEN": "8192", # this can be tuned depending on instance type + memory available
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16", # this can be tuned depending on instance type + memory available
        "OPTION_TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_ENFORCE_EAGER": "true",
    }
)
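
SageMaker passes container environment variables as strings, which is why every value in `env` above is quoted. As a minimal sketch, a helper such as the hypothetical `make_lmi_env` below could build the same dictionary from native Python values. The function name and its parameters are illustrative assumptions and are not part of the SageMaker SDK or the LMI container.

```python
def make_lmi_env(model_id, hf_token, max_model_len=8192, max_batch_size=16,
                 tensor_parallel_degree="max", enforce_eager=True):
    """Build the LMI environment dict used above, coercing every value to a string.

    Hypothetical helper for illustration; the keys mirror the ones in this notebook.
    """
    raw = {
        "HF_MODEL_ID": model_id,
        "HF_TOKEN": hf_token,
        "OPTION_ROLLING_BATCH": "vllm",
        "OPTION_MAX_MODEL_LEN": max_model_len,
        "OPTION_MAX_ROLLING_BATCH_SIZE": max_batch_size,
        "OPTION_TENSOR_PARALLEL_DEGREE": tensor_parallel_degree,
        "OPTION_ENFORCE_EAGER": enforce_eager,
    }
    env = {}
    for key, value in raw.items():
        # Booleans become lowercase "true"/"false" to match the values used above
        if isinstance(value, bool):
            env[key] = "true" if value else "false"
        else:
            env[key] = str(value)
    return env


env = make_lmi_env("meta-llama/Llama-3.2-11B-Vision-Instruct", "<huggingface hub token>")
```

The resulting `env` could then be passed as `DJLModel(role=role, image_uri=image_uri, env=env)`.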

## Deploy the model

In [None]:
predictor = model.deploy(instance_type="ml.g6.12xlarge", initial_instance_count=1)

## Test prompts

The following prompts demonstrate how to use the Llama 3.2 11B Vision model for:
- Text-only inference
- Single-image inference

Multi-image inference is not demonstrated because, as noted above, the Llama 3.2 vision implementation in vLLM 0.6.2 does not accept multiple images in a single prompt. For models that do support it, the number of images allowed per prompt can be tuned with the `OPTION_LIMIT_MM_PER_PROMPT` configuration.

In [None]:
IMAGE_1_KITTEN = "https://resources.djl.ai/images/kitten.jpg"

text_only_payload = {
    "messages": [
        {
            "role": "user",
            "content": "I would like to get better at basketball. Can you provide me a 3 month plan to improve my skills?"
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.6,
    "top_p": 0.9,
}

single_image_payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you describe the following image and tell me what it contains?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": IMAGE_1_KITTEN
                    }
                }
            ]
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.6,
    "top_p": 0.9,
}
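
The two payloads above differ only in the `content` field: a plain string for text-only requests, and a list of typed parts when an image is attached. A minimal sketch of a helper that builds either shape is shown here. `build_chat_payload` is a hypothetical name introduced for illustration, and the defaults simply repeat the sampling parameters used in this notebook.

```python
def build_chat_payload(text, image_url=None, max_tokens=1024, temperature=0.6, top_p=0.9):
    """Return a chat payload with an optional single image, matching the shapes above."""
    if image_url is None:
        # Text-only request: content is a plain string
        content = text
    else:
        # Single-image request: content is a list of typed parts
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


example_payload = build_chat_payload(
    "Can you describe the following image and tell me what it contains?",
    "https://resources.djl.ai/images/kitten.jpg",
)
```

Because only one image is accepted per prompt in this setup, the helper deliberately takes a single `image_url` and not a list.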


### Text Only Inference

In [None]:
print(f"Prompt is:\n {text_only_payload['messages'][0]['content']}")
text_only_output = predictor.predict(text_only_payload)
print("Response is:\n")
print(text_only_output['choices'][0]['message']['content'])
print('----------------------------')
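
`predictor.predict` is convenient inside the notebook, but an application would more likely call the endpoint through the SageMaker runtime API. The sketch below assumes the endpoint returns the same chat-completion JSON shape printed above. `invoke_chat` is a hypothetical helper, and `client` can be any object exposing boto3's `invoke_endpoint` method, which also makes the function easy to exercise with a stub.

```python
import json


def invoke_chat(endpoint_name, payload, client=None):
    """Send a chat payload to a SageMaker endpoint and return the assistant's text."""
    if client is None:
        # Imported lazily so the helper can be tested with a stub client
        import boto3
        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes containing the JSON result
    result = json.loads(response["Body"].read())
    return result["choices"][0]["message"]["content"]
```

With the endpoint from this notebook, a call might look like `invoke_chat(predictor.endpoint_name, text_only_payload)`.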

In [None]:
from PIL import Image
import requests
from io import BytesIO

# Download the test image so it can be displayed alongside the model's response
response_kitten = requests.get(IMAGE_1_KITTEN)
response_kitten.raise_for_status()
img_kitten = Image.open(BytesIO(response_kitten.content))
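
The payloads in this notebook reference the image by a public URL. For images that are only available locally, one common approach in OpenAI-style chat schemas is to embed the bytes as a base64 data URI in the `url` field. Whether the LMI container accepts data URIs is an assumption here and should be checked against the LMI documentation. `to_data_uri` is a hypothetical helper name.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The three bytes below are the start of a JPEG header, used only for illustration
sample_uri = to_data_uri(b"\xff\xd8\xff")
```

Under that assumption, the bytes already downloaded in `response_kitten.content` could be passed to `to_data_uri` and the result used in place of the URL in the image payload.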

### Single Image Inference

In [None]:
print("This is the image provided to the model")
img_kitten.show()
single_image_output = predictor.predict(single_image_payload)
print(single_image_output['choices'][0]['message']['content'])
print('----------------------------')

In [None]:
# clean up resources
predictor.delete_endpoint()
model.delete_model()