# Tutorial: Deploy Qwen2-VL on Trn2 instances

This tutorial provides a step-by-step guide to deploy [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) using NeuronX Distributed (NxD) Inference on a single `trn2.48xlarge` instance.

## Step 1: Set up your development environment

As a prerequisite, this tutorial requires that you have a Trn2 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed.

To set up a Trn2 instance using a Deep Learning AMI with the Neuron SDK pre-installed, see the [NxDI setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html#nxdi-setup). To run a Jupyter (.ipynb) notebook on a Neuron instance, follow this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).

After setting up an instance, use SSH to connect to the Trn2 instance using the key pair that you chose when you launched the instance.

After you are connected, activate the Python virtual environment that includes the Neuron SDK, then list the installed Neuron packages to confirm the setup:

```bash
pip list | grep neuron
```

You should see Neuron packages including
`neuronx-distributed-inference` and `neuronx-cc`.
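
The same check can be done from Python, for example inside a notebook cell. The sketch below is illustrative only: `find_packages` is a hypothetical helper name, and it relies solely on the standard-library `importlib.metadata` module to list installed distributions.

```python
from importlib import metadata


def find_packages(keyword, names=None):
    """Return the sorted, de-duplicated package names that contain `keyword`.

    `names` defaults to every distribution installed in the active environment.
    """
    if names is None:
        names = [dist.metadata["Name"] for dist in metadata.distributions()]
    # Skip entries without a name (broken installs) and match case-insensitively.
    return sorted({name for name in names if name and keyword in name.lower()})


# Print the Neuron-related packages found in the current environment.
for package in find_packages("neuron"):
    print(package)
```

If the list is empty, the Neuron virtual environment is most likely not the active one.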

## Step 2: Install the vLLM version that supports NxD Inference

NxD Inference supports running models with vLLM. This functionality is available in the vLLM-Neuron GitHub repository. Install the latest release branch of the vLLM-Neuron plugin by following the instructions in the [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide-v1.html).

If you are using a new terminal session instead of the one from the connection step above, make sure that the Neuron virtual environment is activated. Then, install the Neuron vLLM fork into the virtual environment.

## Step 3: Download the model from HuggingFace (Optional)

To deploy [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) on Neuron, you can first download the checkpoint from HuggingFace to a local path on the Trn2 instance. For more information on downloading models from HuggingFace, refer to [HuggingFace's guide on Downloading models](https://huggingface.co/docs/hub/en/models-downloading).

After the download, you should see a `config.json` file in the output folder along with weights in `model-xxxx-of-xxxx.safetensors` format.
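
As a minimal sketch of this step, the code below downloads the checkpoint with `snapshot_download` from the `huggingface_hub` package and then checks the output folder for the files described above. The helper names `download_checkpoint` and `verify_checkpoint` and the example target directory are assumptions made for illustration; they are not part of the tutorial's tooling.

```python
from pathlib import Path


def download_checkpoint(repo_id, local_dir):
    """Download all files of a HuggingFace model repository into `local_dir`."""
    # Imported here so that verify_checkpoint works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def verify_checkpoint(local_dir):
    """Return the sorted weight shard names, or raise if expected files are missing."""
    root = Path(local_dir)
    if not (root / "config.json").is_file():
        raise FileNotFoundError(f"config.json not found in {root}")
    shards = sorted(path.name for path in root.glob("model-*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"no model-*.safetensors files found in {root}")
    return shards


# Example usage (hypothetical target directory; requires network access):
# download_checkpoint("Qwen/Qwen2-VL-7B-Instruct", "./Qwen2-VL-7B-Instruct")
# print(verify_checkpoint("./Qwen2-VL-7B-Instruct"))
```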

## Step 4: Compile and deploy Qwen2-VL Inference

In this step, you use the `vllm` command to deploy the model. The `neuronx-distributed-inference` model loader in vLLM performs JIT compilation before deploying the model with the model server. Replace `model_name_or_path` with your local path if you downloaded the model checkpoint from HuggingFace (Step 3).

Here are two examples of running Qwen2-VL with vLLM V1:

* Offline inference: you provide prompts in a Python script and execute it.
* Online inference: you serve the model from an online server and send requests to it.

### Model Configuration Requirements & Examples

There is a known issue with `batch_size` > 1 or `tp_degree` != 4 configurations for Qwen2-VL models. We therefore suggest using `batch_size` = 1 and `tp_degree` = 4, which deploys the `Qwen/Qwen2-VL-7B-Instruct` model on a single Trn2 chip with 4 cores. You can replicate this setting across the `trn2.48xlarge` instance, which consists of 16 chips and 64 cores.
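
To make the replication arithmetic concrete, the sketch below splits the 64 cores into groups of `tp_degree` = 4, which gives 16 replicas. The helper `plan_replicas`, the port numbering, and the idea of pinning each replica to its core range through the Neuron runtime variable `NEURON_RT_VISIBLE_CORES` are assumptions for illustration; the tutorial itself does not prescribe how to launch the replicas.

```python
def plan_replicas(total_cores=64, tp_degree=4, base_port=8080):
    """Assign each replica a contiguous core range and a port."""
    if total_cores % tp_degree != 0:
        raise ValueError("total_cores must be a multiple of tp_degree")
    plans = []
    for index in range(total_cores // tp_degree):
        first_core = index * tp_degree
        last_core = first_core + tp_degree - 1
        plans.append(
            {
                "replica": index,
                # Assumed usage: export this range as NEURON_RT_VISIBLE_CORES.
                "cores": f"{first_core}-{last_core}",
                "port": base_port + index,
            }
        )
    return plans


for plan in plan_replicas():
    print(plan)
```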

NxD Inference supports configurable image sizes for Qwen2-VL and uses `number_of_images` as the unit for vision buckets. For example, in the configuration below, the maximum `number_of_images` is the largest vision bucket, i.e., `128`.
Specify the input image size with `default_image_width` and `default_image_height` in the `vision_neuron_config`. The default image sizes are `default_image_width: 640` and `default_image_height: 320`.

<div class="alert alert-block alert-warning">
<b>Note:</b> Make sure that the number of tokens does not exceed the `max_context_length` in the `text_neuron_config`, i.e., `number_of_prompt_tokens + (default_image_width // 28) * (default_image_height // 28) * number_of_images < max_context_length - max_new_tokens`.
</div>
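
As a worked example of the inequality in the note, the sketch below evaluates it for the default image size. The function names are hypothetical; the divisor `28` and the formula come directly from the note. With 640x320 images, each image contributes `22 * 11 = 242` tokens, so 128 images use 30976 tokens and leave fewer than 1728 tokens for the text prompt when `max_context_length` is 32768 and `max_new_tokens` is 64.

```python
def vision_tokens(width, height, number_of_images):
    """Number of tokens contributed by the images, per the formula in the note."""
    return (width // 28) * (height // 28) * number_of_images


def fits_context(number_of_prompt_tokens, width, height, number_of_images,
                 max_context_length, max_new_tokens):
    """Check the strict inequality from the note."""
    used = number_of_prompt_tokens + vision_tokens(width, height, number_of_images)
    return used < max_context_length - max_new_tokens


print(vision_tokens(640, 320, 1))                       # 242
print(vision_tokens(640, 320, 128))                     # 30976
print(fits_context(1000, 640, 320, 128, 32768, 64))     # True
print(fits_context(1728, 640, 320, 128, 32768, 64))     # False
```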

The configuration below sets the following fields to improve performance. For more details, refer to the [NxD Inference features configurations guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html).
- `sequence_parallel_enabled`: whether to enable sequence parallelism.
- `fused_qkv` and `qkv_kernel_enabled`: whether to use the fused QKV kernel. `qkv_kernel_enabled` is not yet supported in the `vision_neuron_config` for Qwen2-VL.
- `attn_kernel_enabled`: whether to use the optimized attention kernel.

Below we provide the recommended configuration with `batch_size` 1 and `tp_degree` 4.
<div class="alert alert-block alert-warning">
<b>Note:</b> If you encounter an out-of-memory error at runtime, try reducing the size of the vision buckets, because the KV cache grows linearly with batch size and sequence length.
</div>

```python
qwen2_vl_neuron_config = {
    "text_neuron_config": {
        "batch_size": 1,
        "ctx_batch_size": 1,
        "tkg_batch_size": 1,
        "seq_len": 32768,
        "max_new_tokens": 64,
        "max_context_length": 32768,
        "torch_dtype": "float16",
        "skip_sharding": False,
        "save_sharded_checkpoint": True,
        "tp_degree": 4,
        "world_size": 4,
        "enable_bucketing": True,
        "context_encoding_buckets": [2048, 16384, 32768],
        "token_generation_buckets": [2048, 16384, 32768],
        "fused_qkv": True,
        "qkv_kernel_enabled": True,
        "sequence_parallel_enabled": True,
        "attn_kernel_enabled": True,
        "cc_pipeline_tiling_factor": 2,
        "attention_dtype": "float16",
        "rpl_reduce_dtype": "float16",
        "cast_type": "as-declared",
        "logical_neuron_cores": 2,
        "on_device_sampling_config": None,
    },
    "vision_neuron_config": {
        "batch_size": 1,
        "seq_len": 131072,
        "max_context_length": 131072,
        "torch_dtype": "bfloat16",
        "skip_sharding": False,
        "save_sharded_checkpoint": True,
        "tp_degree": 4,
        "world_size": 4,
        "fused_qkv": True,
        "qkv_kernel_enabled": False,
        "attn_kernel_enabled": True,
        "enable_bucketing": True,
        "buckets": [128],
        "cc_pipeline_tiling_factor": 2,
        "attention_dtype": "bfloat16",
        "rpl_reduce_dtype": "bfloat16",
        "cast_type": "as-declared",
        "logical_neuron_cores": 2,
        "default_image_width": 640,
        "default_image_height": 320,
    },
}
```
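
Before starting a long compilation, it can help to check an edited configuration against the constraints stated in this section. The sketch below is a hypothetical helper, not part of NxD Inference; it only encodes the three rules mentioned above (`batch_size` 1, `tp_degree` 4, and no `qkv_kernel_enabled` in the vision configuration) and uses a trimmed-down configuration so that it runs on its own.

```python
def check_neuron_config(config):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    sections = (
        ("text_neuron_config", config["text_neuron_config"]),
        ("vision_neuron_config", config["vision_neuron_config"]),
    )
    for name, section in sections:
        # Known issue: only batch_size 1 and tp_degree 4 are recommended.
        if section.get("batch_size") != 1:
            problems.append(f"{name}: batch_size should be 1")
        if section.get("tp_degree") != 4:
            problems.append(f"{name}: tp_degree should be 4")
    # The fused QKV kernel is not yet supported for the vision encoder.
    if config["vision_neuron_config"].get("qkv_kernel_enabled"):
        problems.append("vision_neuron_config: qkv_kernel_enabled is not supported yet")
    return problems


# Trimmed-down example containing only the fields that the checks read.
example_config = {
    "text_neuron_config": {"batch_size": 1, "tp_degree": 4, "qkv_kernel_enabled": True},
    "vision_neuron_config": {"batch_size": 1, "tp_degree": 4, "qkv_kernel_enabled": False},
}
print(check_neuron_config(example_config))  # []
```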

### Offline Example

```python
import os

os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from transformers import AutoProcessor


def qwen2_vl_offline_test():
    # Local checkpoint directory from Step 3; replace with your own path.
    model_name_or_path = "Qwen/Qwen2-VL-7B-Instruct/"
    # Create an LLM.
    llm = LLM(
        model=model_name_or_path,
        tensor_parallel_size=4,
        max_num_seqs=1,
        max_model_len=32768,
        additional_config=dict(
            override_neuron_config=qwen2_vl_neuron_config  # Use the configuration defined above
        ),
        enable_prefix_caching=False,
        enable_chunked_prefill=False,
    )

    # Sample prompts.
    prompt = "What do you see in these images?"
    # Resize to default image size
    default_image_size = (640, 320)

    images = [
        ImageAsset("blue_flowers").pil_image.resize(default_image_size),
        ImageAsset("bird").pil_image.resize(default_image_size),
    ]

    processor = AutoProcessor.from_pretrained(model_name_or_path)

    # One image placeholder per input image, followed by the text prompt.
    placeholders = [{"type": "image"} for _ in images]
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                *placeholders,
                {
                    "type": "text",
                    "text": prompt,
                },
            ],
        },
    ]

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = {
        "prompt": prompt,
        "multi_modal_data": {
            "image": images,
        },
    }
    outputs = llm.generate([inputs], SamplingParams(top_k=1, max_tokens=64))

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    qwen2_vl_offline_test()
```

Below is an example output:
```bash
Generated text: 'The first image shows a close-up of a flower with blue petals and water droplets on them, set against a dark background. The second image features a vibrant red bird with blue and green wings perched on a branch.'
```

### Online Example

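```python
import json
import os

# Select the NxD Inference backend for the server process started below.
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

# Local checkpoint directory from Step 3; replace with your own path.
model_name_or_path = "Qwen/Qwen2-VL-7B-Instruct/"

additional_neuron_config = json.dumps(dict(override_neuron_config=qwen2_vl_neuron_config))
start_server_cmd = f'''python3 -m vllm.entrypoints.openai.api_server \
   --model=\'{model_name_or_path}\' \
   --tensor-parallel-size=4 \
   --max-num-seqs=1 \
   --max-model-len=32768 \
   --additional-config=\'{additional_neuron_config}\' \
   --no-enable-chunked-prefill \
   --no-enable-prefix-caching \
   --port=8080
'''

# os.system blocks until the server exits, so send requests from a separate session.
os.system(start_server_cmd)
```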
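
Because the model is compiled before the server starts accepting requests, a client may need to wait for some time. The sketch below polls the server until it responds, using only the standard library. It assumes that the vLLM OpenAI-compatible server exposes a `/health` endpoint on the port used above; `wait_for_server` and its default timeout values are hypothetical choices for illustration.

```python
import http.client
import time
import urllib.request


def wait_for_server(base_url="http://localhost:8080", timeout_s=1800.0, poll_s=5.0):
    """Poll `<base_url>/health` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused or an error status: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example usage, from a separate session while the server is starting:
# if wait_for_server():
#     print("Server is ready")
```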

Once the vLLM server is online, submit requests using the example below:

```python
from openai import OpenAI


client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8080/v1")
models = client.models.list()
model_name = models.data[0].id

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    # The image must be resized to {default_image_width}x{default_image_height}.
                    "url": "example_image_url"
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    max_tokens=64,
    temperature=1.0,
    top_p=1.0,
    stream=False,
    extra_body={"top_k": 1},
)

generated_text = response.choices[0].message.content
print(generated_text)
```
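
The request above expects an image that already has the configured size. One possible way to achieve this for a local file is to resize it with Pillow and pass it inline as a base64 data URL in the `url` field. The sketch below illustrates that approach; the helper names are hypothetical, the default size mirrors `default_image_width` and `default_image_height` from the configuration, and it assumes that the server accepts `data:` URLs for `image_url`.

```python
import base64
import io


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def resized_image_data_url(path, size=(640, 320)):
    """Resize the image at `path` to `size` (width, height) and return a JPEG data URL."""
    from PIL import Image  # Pillow, imported lazily so to_data_url has no extra dependency

    with Image.open(path) as image:
        resized = image.convert("RGB").resize(size)
    buffer = io.BytesIO()
    resized.save(buffer, format="JPEG")
    return to_data_url(buffer.getvalue())


# Example usage with a hypothetical local file:
# url = resized_image_data_url("my_image.jpg")
```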

## Conclusion

Congratulations! You now know how to deploy `Qwen/Qwen2-VL-7B-Instruct` on a `trn2.48xlarge` instance. Modify the configuration and deploy the model according to your requirements and use case.