# Pixtral vLLM SageMaker Deployment guide - v1

---

This notebook walks through a simple deployment of Pixtral with vLLM. Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark and surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following. Pixtral ingests images at their natural resolution and aspect ratio, which gives the user flexibility over the number of tokens used to process an image. Pixtral can also process any number of images within its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.

Currently, the Pixtral models on Hugging Face are published in the original consolidated format:

- `params.json` and `consolidated.safetensors`

and not in the standard HF format:

- `model-00001-of-00003.safetensors` and `config.json`
    
We cannot deploy the model with TGI at the moment, but Mistral has upstreamed changes in `v0.6.1` of vLLM that allow users to deploy Pixtral in this 'Mistral' format.
We are currently working on the djl-lmi container for vLLM so that the model can be deployed to a SageMaker endpoint for easy inference.
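
To make the difference between the two layouts concrete, here is a minimal sketch that inspects a list of file names and reports which layout it appears to contain. The function name `detect_formats` and its matching rules are assumptions made for illustration; they are not part of vLLM or `huggingface_hub`.

```python
from typing import Iterable, List


def detect_formats(filenames: Iterable[str]) -> List[str]:
    """Return the checkpoint layouts that a list of repo file names appears to contain.

    Hypothetical helper for illustration only; the matching rules are a guess
    based on the file names mentioned in this notebook.
    """
    names = set(filenames)
    formats = []

    # Standard HF layout: config.json plus model*.safetensors weights
    has_hf_weights = any(
        n.startswith("model") and n.endswith(".safetensors") for n in names
    )
    if "config.json" in names and has_hf_weights:
        formats.append("hf")

    # Mistral consolidated layout: params.json plus consolidated*.safetensors
    has_consolidated = any(
        n.startswith("consolidated") and n.endswith(".safetensors") for n in names
    )
    if "params.json" in names and has_consolidated:
        formats.append("mistral")

    return formats


print(detect_formats(["params.json", "consolidated.safetensors"]))
```

In practice the file list could come from `huggingface_hub.list_repo_files(repo_id)`, which requires network access and, for a gated model, a valid token.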

#### Note: Spin up your SageMaker notebook instance on a g5.12xlarge to follow along with this notebook

### Install dependencies

In [None]:
%%writefile requirements.txt
gradio
vllm==0.6.1.post2
ipywidgets

In [None]:
!pip install -U -r requirements.txt --quiet

---
A prerequisite for this notebook is a Hugging Face token with read access, which is needed to download the gated model from Hugging Face.

=> Follow the steps here to get a [HF token](https://huggingface.co/docs/hub/en/security-tokens)

Once you have your access token and have been granted access to the model on Hugging Face, proceed to the login below. If the cell below raises dependency errors, log in to Hugging Face from your CLI instead.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

For this notebook, you need an instance with at least 24 GB of GPU memory to load the model.

The model supports a context window of up to 128K tokens, but on a g5.12xlarge we are only able to store up to 102096 tokens in the KV cache, so utilizing the complete context window requires a larger instance.

In this example we limit the `max_model_len` param in the instance of the vLLM LLM class to 20k for demonstration purposes. Ensure you have the GPU capacity if you would like to utilize the complete context window.
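
Because the prompt (text plus image tokens) and the generated tokens share the same context limit, it can help to work out how much room the prompt has. The sketch below does that arithmetic with the values used in this notebook (`max_model_len=20000` and `max_tokens=8192`); the helper name `prompt_token_budget` is hypothetical and not part of vLLM.

```python
def prompt_token_budget(max_model_len: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt (text plus image tokens) if the full
    generation allowance is to fit inside the context limit.

    Hypothetical helper for illustration; it only does the subtraction.
    """
    if max_new_tokens >= max_model_len:
        raise ValueError("max_new_tokens must be smaller than max_model_len")
    return max_model_len - max_new_tokens


# Values used in this notebook: max_model_len=20000, max_tokens=8192
print(prompt_token_budget(20000, 8192))  # 11808
```

If a prompt is larger than this budget, either lower `max_tokens` in `SamplingParams` or raise `max_model_len`, GPU memory permitting.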

We also set `tensor_parallel_size` to 4 since we are using a g5.12xlarge with 4x NVIDIA A10G GPUs. Change this to match the instance you are using.
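
If you switch instance types, one way to choose `tensor_parallel_size` is to take the largest value that does not exceed the GPU count and evenly divides the model's attention head count, since vLLM expects the heads to split evenly across GPUs. The sketch below assumes 32 attention heads for Pixtral-12B; that number is an assumption to verify against the model's `params.json`, and `pick_tensor_parallel_size` is a hypothetical helper, not a vLLM function.

```python
def pick_tensor_parallel_size(gpu_count: int, num_attention_heads: int = 32) -> int:
    """Largest size <= gpu_count that evenly divides num_attention_heads.

    num_attention_heads=32 is an assumed value for Pixtral-12B; check the
    model's params.json before relying on it.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # A size of 1 always divides the head count, so the candidate set is never empty
    return max(
        size for size in range(1, gpu_count + 1) if num_attention_heads % size == 0
    )


# A g5.12xlarge exposes 4 GPUs
print(pick_tensor_parallel_size(4))
```

On a live instance, `gpu_count` could be read with `torch.cuda.device_count()`.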

In [None]:
#using a g5.12xlarge
import gradio as gr
from vllm import LLM
from vllm.sampling_params import SamplingParams
import torch.multiprocessing as mp
# Set the multiprocessing start method early in the script, to not fork the process
mp.set_start_method('spawn', force=True)
# Define the model and LLM object globally to avoid reloading for every request
model_name = "mistralai/Pixtral-12B-2409"
llm = LLM(model=model_name, tokenizer_mode="mistral", tensor_parallel_size=4, max_model_len=20000)

### Creating our function

In [None]:
# Function for a simple demo
def simple_response(prompt, image_url):
    sampling_params = SamplingParams(max_tokens=8192)
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    }
                },
            ],
        },
    ]
    outputs = llm.chat(messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

# Define the Gradio interface
simple_demo_interface = gr.Interface(
    fn=simple_response, 
    inputs=[
        gr.Textbox(label="Prompt"), 
        gr.Textbox(label="Image URL")
    ], 
    outputs="text",
    title="Pixtral Image Description",
    description="Provide a prompt and an image URL to get a description."
)
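
The introduction notes that Pixtral can process several images within one context window. As a minimal sketch, the message structure used in `simple_response` can be extended to carry more than one image by appending extra `image_url` entries. The helper name `build_messages` is hypothetical, and depending on the vLLM version the per-prompt image limit (the `limit_mm_per_prompt` argument of `LLM`) may need to be raised when the `LLM` object is created.

```python
from typing import Dict, List, Sequence


def build_messages(prompt: str, image_urls: Sequence[str]) -> List[Dict]:
    """Build a single user message with one text part and one part per image URL.

    Hypothetical helper; it mirrors the message layout used in simple_response.
    """
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [{"role": "user", "content": content}]


messages = build_messages(
    "Compare these two images",
    [
        "https://huggingface.co/datasets/nithiyn/bounding-box/resolve/main/bounding-box-ppl.jpg",
        "https://huggingface.co/datasets/nithiyn/bounding-box/resolve/main/mykonos-2.jpeg",
    ],
)
print(len(messages[0]["content"]))  # 3
```

The resulting list can be passed to `llm.chat(messages, sampling_params=sampling_params)` in the same way as in `simple_response`.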

### Gradio interface with Pixtral

In [None]:
demo = gr.TabbedInterface([simple_demo_interface], ["Simple Pixtral Demo"])
# Launch the Gradio app
demo.launch()

---
#### Examples
- Bounding box example: "https://huggingface.co/datasets/nithiyn/bounding-box/resolve/main/bounding-box-ppl.jpg"
  - Prompt: describe in detail, the first three objects within bounding boxes
- Mykonos: "https://huggingface.co/datasets/nithiyn/bounding-box/resolve/main/mykonos-2.jpeg"
  - Prompt: Describe and identify the location in this image

## Conclusion

In this notebook, we loaded Mistral's Pixtral model with vLLM and created a simple Gradio interface to run inference with the model.

----

### Distributors

- Mistral AI
- AWS