# Tutorial: Deploy Qwen3-VL 8B on Trn2 instances
This tutorial provides a step-by-step guide to deploy [Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking) on a single `trn2.48xlarge` instance using vLLM V1 with the vLLM-Neuron Plugin.

## Examples

- [Offline Example](#offline-example)
- [Online Example](#online-example)

## Step 1: Set up your development environment

As a prerequisite, this tutorial requires that you have a Trn2 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed.

To set up a Trn2 instance using Deep Learning AMI with pre-installed Neuron SDK, see the [NxDI setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html#nxdi-setup). To use a Jupyter (.ipynb) notebook on a Neuron instance, follow this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).

After setting up an instance, use SSH to connect to the Trn2 instance using the key pair that you chose when you launched the instance.

After you are connected, activate the Python virtual environment that includes the Neuron SDK, then confirm that the Neuron packages are installed:

```bash
pip list | grep neuron
```

You should see Neuron packages including `neuronx-distributed-inference` and `neuronx-cc`.
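
The same check can be scripted, for example at the top of a notebook. This is a minimal sketch that uses only the standard library; the helper name `neuron_package_versions` is illustrative and not part of the Neuron SDK.

```python
from importlib import metadata


def neuron_package_versions(names=("neuronx-distributed-inference", "neuronx-cc")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for package, version in neuron_package_versions().items():
        print(f"{package}: {version or 'NOT INSTALLED'}")
```

A `None` entry usually means the wrong virtual environment is active.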

## Step 2: Install the vLLM version that supports NxD Inference

NxD Inference supports running models with vLLM. This functionality is available in the vLLM-Neuron GitHub repository. Install the latest release branch of the vLLM-Neuron plugin by following the instructions in the [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide-v1.html).

If you opened a new terminal after the connection step above, make sure the Neuron virtual environment is activated before you install the vLLM-Neuron plugin into it.

## Step 3: Download the model from HuggingFace (Optional)

To deploy [Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking) on Neuron, download the checkpoint from HuggingFace to a local path on the Trn2 instance. For more information on downloading models from HuggingFace, refer to [the HuggingFace guide on downloading models](https://huggingface.co/docs/hub/en/models-downloading).

After the download, you should see a `config.json` file in the output folder along with weights in `model-xxxx-of-xxxx.safetensors` format.
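
As one possible way to script this step, the sketch below downloads the repository with `snapshot_download` from the `huggingface_hub` package (assumed to be installed) and then checks for the files described above. The helper names `download_checkpoint` and `checkpoint_looks_complete` are illustrative only.

```python
from pathlib import Path


def download_checkpoint(repo_id: str, local_dir: str) -> str:
    """Download a model repository into local_dir (requires network access)."""
    # Imported lazily so the check below also works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    return snapshot_download(repo_id=repo_id, local_dir=str(target))


def checkpoint_looks_complete(local_dir: str) -> bool:
    """Return True if local_dir holds config.json and at least one .safetensors file."""
    root = Path(local_dir).expanduser()
    has_config = (root / "config.json").is_file()
    has_weights = any(root.glob("*.safetensors"))
    return has_config and has_weights
```

For example, `download_checkpoint("Qwen/Qwen3-VL-8B-Thinking", "~/models/Qwen3-VL-8B-Thinking")` followed by `checkpoint_looks_complete("~/models/Qwen3-VL-8B-Thinking")` should return `True` once the download has finished.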

## Step 4: Compile and deploy Qwen3 VL Inference

We provide two examples to run Qwen3 VL with vLLM V1:

* Offline inference: you provide prompts in a Python script and execute it.
* Online inference: you serve the model from an online server and send requests to it.

### Model Compilation and Configuration

The following configurations tune the model's performance at compile time. Adjust them for your specific use case.

- Qwen3 VL consists of a text model and a vision encoder. You must configure each one explicitly through `text_neuron_config` and `vision_neuron_config`.
- `world_size`: maximum number of Neuron cores in the distributed environment. The text and vision models must use the same world size.
- `tp_degree`: degree of tensor parallelism. The text and vision models can use different sharding schemes and therefore different TP degrees.
- `batch_size`: the batch size used to compile the models. For optimized latency, prefill always runs with a batch size of 1, so `ctx_batch_size` in `text_neuron_config` and `batch_size` in `vision_neuron_config` are set to 1. Set `batch_size` and `tkg_batch_size` in `text_neuron_config` to the number of concurrent requests you want to handle (the same value as the vLLM `max-num-seqs` argument). There is currently a known issue with batch sizes greater than 1; this limitation will be addressed in a future release.
- `text_neuron_config`
    - `seq_len`: Set this to the maximum sequence length in your use case. The text model currently supports up to 32768. This is the total length of vision and text tokens, input and output combined.
    - `enable_bucketing`: [Bucketing](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#bucketing) optimizes performance for specific sequence lengths. This tutorial [configures specific buckets](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#configuring-specific-buckets).
    - `context_encoding_buckets`: Buckets for the prefill (context encoding) phase. Set these to cover the different total lengths of vision and text input tokens.
    - Note: in Qwen3 VL, vision embeddings are spatially compressed by a factor of `spatial_merge_size ** 2` before they are fed into the text model. `spatial_merge_size` is defined in the model's `config.json`. The number of tokens that the vision input contributes to the text context is therefore `text_context_len = vision_seq_len // (spatial_merge_size ** 2)`.
    - `token_generation_buckets`: Buckets for the decode (token generation) phase. Each bucket size should reflect a total sequence length, which is the sum of vision tokens, text input tokens, and output tokens.
    - `fused_qkv`: [QKV weight fusion](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#qkv-weight-fusion) concatenates a model’s query, key and value weight matrices to achieve better performance.
    - `qkv_kernel_enabled`: Enable the fused QKV kernel.
    - `mlp_kernel_enabled`: Enable the MLP kernel.
    - `attn_kernel_enabled`: Enable the Flash Attention kernel.
- `vision_neuron_config`
    - `seq_len`: Set this to the maximum vision sequence length in your use case. The vision model currently supports up to 16384. The vision sequence length is calculated as `num_images * (image_height//patch_size) * (image_width//patch_size)`.
    - `buckets`: Set these to cover the different vision sequence lengths in your use case.
    - `fused_qkv`: [QKV weight fusion](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#qkv-weight-fusion) concatenates a model’s query, key and value weight matrices to achieve better performance.
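
To make the sequence-length arithmetic above concrete, here is a small worked sketch. The values `patch_size = 16` and `spatial_merge_size = 2` are assumptions for illustration; read the real values from your checkpoint's `config.json`. The image sizes are assumed to be the sizes after preprocessing, and the "smallest bucket that fits" rule is a simplified model of how a bucket is selected.

```python
def vision_seq_len(num_images, image_height, image_width, patch_size):
    """Vision sequence length: one token per patch, summed over all images."""
    return num_images * (image_height // patch_size) * (image_width // patch_size)


def vision_tokens_in_text_context(vision_len, spatial_merge_size):
    """Tokens the vision input contributes to the text model after merging."""
    return vision_len // (spatial_merge_size ** 2)


def smallest_fitting_bucket(length, buckets):
    """Return the smallest bucket that can hold `length` tokens."""
    for bucket in sorted(buckets):
        if length <= bucket:
            return bucket
    raise ValueError(f"length {length} exceeds the largest bucket {max(buckets)}")


# Assumed values for illustration only; check config.json for your model.
patch_size = 16
spatial_merge_size = 2

vision_len = vision_seq_len(2, 1024, 1024, patch_size)  # 2 * 64 * 64 = 8192
merged_len = vision_tokens_in_text_context(vision_len, spatial_merge_size)  # 8192 // 4 = 2048
prefill_len = merged_len + 50  # plus a hypothetical 50 text input tokens

print("vision bucket:", smallest_fitting_bucket(vision_len, [1024, 16384]))
print("context encoding bucket:", smallest_fitting_bucket(prefill_len, [2048, 5120, 32768]))
```

In this example two 1024x1024 images fill the 16384 vision bucket only halfway, so adding an intermediate vision bucket might reduce padding for that workload.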


```python
text_neuron_config = {
    # Batch Size
    "batch_size": 1,
    "ctx_batch_size": 1,
    "tkg_batch_size": 1,

    # Sequence Lengths
    "seq_len": 32768,
    "max_context_length": 32768,

    # Buckets
    "enable_bucketing": True,
    "context_encoding_buckets": [2048, 5120, 32768],
    "token_generation_buckets": [2048, 5120, 32768],

    # Parallelism
    "world_size": 16,
    "tp_degree": 16,

    # Others
    "torch_dtype": "bfloat16",
    "rpl_reduce_dtype": "bfloat16",
    "attention_dtype": "bfloat16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 2,
    "cc_pipeline_tiling_factor": 2,

    # Kernels
    "fused_qkv": True,
    "qkv_kernel_enabled": True,
    "mlp_kernel_enabled": True,
    "attn_kernel_enabled": True,
}

vision_neuron_config = {
    # Batch Size
    "batch_size": 1,

    # Sequence Lengths
    "seq_len": 16384,
    "max_context_length": 16384,

    # Buckets
    "enable_bucketing": True,
    "buckets": [1024, 16384],

    # Parallelism
    "world_size": 16,
    "tp_degree": 16,

    # Others
    "torch_dtype": "bfloat16",
    "rpl_reduce_dtype": "bfloat16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 2,
    "cc_pipeline_tiling_factor": 2,

    # Kernels
    "fused_qkv": True,
    "attn_kernel_enabled": False,
    "mlp_kernel_enabled": False,
}
```
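
Because compilation can take a while, it may be worth checking the two dictionaries above against the constraints listed in this section before starting. The helper below is a hypothetical sketch, not part of NxD Inference. The rule that no bucket should exceed `seq_len` is an assumption added here for illustration.

```python
def find_config_problems(text_cfg, vision_cfg, max_text_len=32768, max_vision_len=16384):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if text_cfg["world_size"] != vision_cfg["world_size"]:
        problems.append("text and vision world_size must match")
    if text_cfg.get("ctx_batch_size", 1) != 1:
        problems.append("text ctx_batch_size should be 1 (prefill runs with batch size 1)")
    if vision_cfg.get("batch_size", 1) != 1:
        problems.append("vision batch_size should be 1")
    if text_cfg["seq_len"] > max_text_len:
        problems.append(f"text seq_len exceeds {max_text_len}")
    if vision_cfg["seq_len"] > max_vision_len:
        problems.append(f"vision seq_len exceeds {max_vision_len}")

    # Assumption: a bucket larger than seq_len can never be used.
    bucket_keys = [
        ("text", text_cfg, "context_encoding_buckets"),
        ("text", text_cfg, "token_generation_buckets"),
        ("vision", vision_cfg, "buckets"),
    ]
    for label, cfg, key in bucket_keys:
        oversized = [b for b in cfg.get(key, []) if b > cfg["seq_len"]]
        if oversized:
            problems.append(f"{label} {key} has entries above seq_len: {oversized}")
    return problems
```

Calling `find_config_problems(text_neuron_config, vision_neuron_config)` on the dictionaries above should return an empty list.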

### Offline Example

```python
import os

# Configure the environment before importing vLLM.
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["NEURON_RT_DBG_INTRA_RDH_CHANNEL_BUFFER_SIZE"] = "146800640"  # to support 32k sequence length

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Python does not expand "~" in paths on its own, so resolve it explicitly.
model_name_or_path = os.path.expanduser("~/models/Qwen3-VL-8B-Thinking/")

# Create an LLM. text_neuron_config and vision_neuron_config are the
# dictionaries defined in the previous section.
llm = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=16,
    max_num_seqs=1,
    max_model_len=32768,
    additional_config={
        "override_neuron_config": {
            "text_neuron_config": text_neuron_config,
            "vision_neuron_config": vision_neuron_config,
        }
    },
    limit_mm_per_prompt={"image": 20},  # Use the max number of images in your use case
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Sample prompt with two images.
processor = AutoProcessor.from_pretrained(model_name_or_path)

question = "What do you see in these images?"
images = [
    ImageAsset("blue_flowers").pil_image,
    ImageAsset("bird").pil_image,
]

# One image placeholder per image, followed by the text question.
placeholders = [{"type": "image"} for _ in images]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            *placeholders,
            {"type": "text", "text": question},
        ],
    },
]

# Render the chat template to a plain prompt string.
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = {
    "prompt": prompt,
    "multi_modal_data": {"image": images},
}

# top_k=1 makes decoding greedy.
outputs = llm.generate([inputs], SamplingParams(top_k=1, max_tokens=1024))
print(f"Prompt: {prompt!r}, Generated text: {outputs[0].outputs[0].text!r}")
```

Below is an example output:
```text
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|>What do you see in these images?<|im_end|>\n<|im_start|>assistant\n<think>\n', Generated text: "So, let's look at both images. First image: there are blue flowers with water droplets, some pink flowers in the background, and they're in a wet, reflective surface, maybe water. There are bokeh lights (those yellow circles) in the background, so it's a shallow depth of field. Second image: a bird with bright red head and chest, blue wings and tail, perched on a branch. The background is green, blurred, so it's a forest or jungle setting. Need to describe each image clearly.\n\nFirst image details: blue flowers (maybe plumeria?), water droplets on petals, some pink flowers, wet surface (water), reflections, bokeh lights (out of focus yellow circles). Second image: bird with vibrant colors—red body, blue wings/tail, black beak, perched on a brown branch, green background (blurred foliage). Both images have high detail, vibrant colors, nature themes.\n\nSo, summarize each image's content.\n</think>\n\nIn the first image, I see **vibrant blue flowers** (likely plumeria) with water droplets glistening on their petals. These flowers are partially submerged in a reflective, wet surface (possibly water), creating subtle ripples and reflections. In the background, there are soft, out-of-focus pink flowers and warm, golden bokeh lights (blurred circular highlights), which add a dreamy, atmospheric quality to the scene. The overall mood is serene and ethereal, emphasizing the delicate beauty of the flowers and the moisture around them.  \n\nIn the second image, I observe a **colorful bird** perched on a thick, textured brown branch. The bird has a striking combination of colors: a bright red head and chest, vivid blue wings and tail, and a dark beak. 
Its feathers appear detailed and glossy, with the blue wings showing intricate patterns. The background is a blurred, lush green (suggesting a forest or jungle environment), which creates a soft, natural backdrop that highlights the bird’s vibrant plumage. The image captures the bird in sharp focus, emphasizing its vivid colors and the texture of its feathers and the branch it rests on.  \n\nBoth images showcase nature’s beauty with high detail, vibrant colors, and a focus on the interplay of light and texture."
```
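
As the example output shows, the Thinking variant emits its reasoning first and closes it with `</think>` before the final answer. If you only need the answer, a small post-processing step can separate the two. This is a minimal sketch; `split_thinking` is an illustrative helper, not a vLLM or Qwen API.

```python
def split_thinking(generated_text, end_tag="</think>"):
    """Split model output into (reasoning, answer) at the first end_tag.

    If end_tag is absent, the whole text is treated as the answer.
    """
    reasoning, separator, answer = generated_text.partition(end_tag)
    if not separator:
        return "", generated_text.strip()
    return reasoning.strip(), answer.strip()


reasoning, answer = split_thinking("Look at both images first.\n</think>\n\nI see flowers and a bird.")
print(answer)
```

With the offline example, you could pass `outputs[0].outputs[0].text` to `split_thinking`.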

### Online Example

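```python
import json
import os

# Select the NxD Inference backend; assigning a plain Python variable would
# not be visible to the server process, so set the environment variable.
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

# The shell does not expand "~" inside quotes, so resolve the path in Python.
model_path = os.path.expanduser("~/models/Qwen3-VL-8B-Thinking/")

# text_neuron_config and vision_neuron_config are the dictionaries defined
# in the Model Compilation and Configuration section.
additional_neuron_config = json.dumps(
    {
        "override_neuron_config": {
            "text_neuron_config": text_neuron_config,
            "vision_neuron_config": vision_neuron_config,
        }
    }
)
limit_mm_per_prompt_json = json.dumps({"image": 20})

start_server_cmd = f'''vllm serve \
--model="{model_path}" \
--tokenizer="{model_path}" \
--trust-remote-code \
--dtype="bfloat16" \
--tensor-parallel-size=16 \
--max-num-seqs=1 \
--max-model-len=32768 \
--additional-config=\'{additional_neuron_config}\' \
--limit-mm-per-prompt=\'{limit_mm_per_prompt_json}\' \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--port=8080
'''

# os.system blocks until the server exits, so send requests from a
# separate terminal or notebook.
os.system(start_server_cmd)
```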
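
The server is not ready to accept requests until the model has been compiled and loaded. The sketch below, meant to run from a separate process, polls the server until it responds. It assumes the server exposes a `/health` endpoint on the port used above; the helper name `wait_for_server` and the default timeouts are illustrative.

```python
import http.client
import time
import urllib.request


def wait_for_server(health_url, timeout_s=1800.0, interval_s=5.0, request_timeout_s=5.0):
    """Poll health_url until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=request_timeout_s) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused or server still starting; retry after a pause.
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:8080/health")` returns `True` once the server answers, or `False` if the timeout expires first.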

After deploying the model server, you can run inference by sending it requests. The example below sends a text prompt with one image:

```python
from openai import OpenAI

# The server does not check the API key, but the client requires a value.
client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8080/v1")
models = client.models.list()
model_name = models.data[0].id

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
                },
            },
            {
                "type": "text",
                "text": "Describe this image",
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=1.0,
    top_p=1.0,
    stream=False,
    extra_body={"top_k": 1},
)

generated_text = response.choices[0].message.content
print(generated_text)
```

Below is an example output:
```text
So, let's describe this image. First, the main subject is a wild cat, probably a Pallas's cat, in a snowy environment. Let's check the details. The cat has thick, fluffy fur that's a mix of brown, gray, and maybe some lighter shades. Its fur is dusted with snow, so it's in a winter setting. The cat is walking on snow, with one paw lifted, so it's in motion. The background has white birch trees with black bark patterns, typical of a snowy forest. There's also a chain-link fence on the left side, which might indicate a controlled environment like a zoo or wildlife reserve. The snow on the ground is fresh, and there are some small twigs or debris visible. The cat's face has distinctive markings, like the white area around the mouth and the striped pattern on its cheeks. The overall scene is cold, with the snow and the cat's thick fur suggesting it's adapted to cold climates. Let's structure the description: start with the main subject, then details about the cat's appearance, the environment, and the setting.
</think>

The image depicts a **Pallas's cat** (a wild feline species native to Central Asia) walking through a snowy landscape. The cat’s thick, fluffy fur is a mix of brown, gray, and cream tones, dusted with snowflakes, emphasizing its adaptation to cold climates. Its face features distinctive markings: a white patch around the mouth, dark stripes on the cheeks, and a short, rounded muzzle. The cat is captured mid-stride, with one paw lifted, conveying movement across the snow-covered ground.  

In the background, **white-barked birch trees** with dark, irregular bark patterns create a stark, wintry forest scene. To the left, a **chain-link fence** suggests the setting may be a controlled environment like a zoo or wildlife reserve. The snow on the ground is fresh and undisturbed except for the cat’s path, with small twigs and debris scattered nearby. The overall atmosphere is serene and cold, highlighting the cat’s natural camouflage and resilience in a snowy habitat.
```
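
The request above references an image by public URL. For an image stored on local disk, one common approach with OpenAI-style APIs is to embed it as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts such data URLs; `image_to_data_url` is an illustrative helper, not part of vLLM or the OpenAI client.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a data URL for an `image_url` entry."""
    file_path = Path(path).expanduser()
    mime_type, _ = mimetypes.guess_type(file_path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {file_path.name}")
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

You could then use `{"type": "image_url", "image_url": {"url": image_to_data_url("cat.jpeg")}}` in place of the URL entry, where `cat.jpeg` is a hypothetical local file.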

## Conclusion

Congratulations! You now know how to deploy `Qwen/Qwen3-VL-8B-Thinking` on a `trn2.48xlarge` instance. Adjust the configurations to fit your requirements and use case before deploying the model.