
Conversation

@JJJYmmm

@JJJYmmm JJJYmmm commented Oct 26, 2025

This PR adds support for the Qwen3-VL series, including both the dense and MoE variants.
The original implementation was contributed by @yairpatch and @Thireus (see #16207). @LETS-BEE also helped address issues such as weight loading.

In this PR, I’ve fixed several algorithmic implementation details (e.g., deepstack), added support for MRoPE-Interleave, and performed final code cleanup.
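
For readers unfamiliar with the term, here is a rough, self-contained C++ illustration of how an interleaved MRoPE layout differs from the blocked layout used by Qwen2-VL; the pair count, section sizes, and exact cycling pattern below are made-up placeholders for illustration, not code from this PR.

    #include <cstdio>

    int main() {
        const int n_pairs  = 16;        // hypothetical number of rotary dimension pairs
        const int sects[3] = {6, 5, 5}; // hypothetical t/h/w section sizes (sum == n_pairs)

        for (int i = 0; i < n_pairs; i++) {
            // blocked MRoPE (Qwen2-VL style): contiguous chunks of dims per t/h/w section
            const int blocked = i < sects[0] ? 0 : (i < sects[0] + sects[1] ? 1 : 2);
            // interleaved MRoPE: the t/h/w sections cycle across dimension pairs, so each
            // section sees a mix of high- and low-frequency rotary dimensions
            const int interleaved = i % 3;
            printf("pair %2d: blocked section = %d, interleaved section = %d\n", i, blocked, interleaved);
        }
        return 0;
    }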

JJJYmmm and others added 2 commits October 26, 2025 19:18
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
@github-actions github-actions bot added the Nvidia GPU, examples, python, and ggml labels Oct 26, 2025
@taronaeo taronaeo linked an issue Oct 26, 2025 that may be closed by this pull request
@Thireus

Thireus commented Oct 26, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

@ddh0
Contributor

ddh0 commented Oct 27, 2025

Thank you! Looking forward to this so we (myself and @rujialiu) can progress with #16600 :)

@xbl916

xbl916 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

iosub added a commit to iosub/ollama that referenced this pull request Oct 27, 2025
Integrates Qwen3-VL and Qwen3VL-MoE architecture support from upstream.
Implements IMROPE (interleaved multimodal RoPE) for vision models.
Adds deepstack layer support for visual feature processing.

Changes include:
- New architecture types: LLM_ARCH_QWEN3VL, LLM_ARCH_QWEN3VLMOE
- IMROPE rope type for vision position encoding
- Deepstack visual feature handling in clip.cpp
- GGML CUDA kernels for IMROPE
- Tensor mappings for Qwen3VL architecture

Upstream PR: ggml-org/llama.cpp#16780
Contributors: @JJJYmmm @yairpatch @Thireus @LETS-BEE
@theo77186

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

@psi00

psi00 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

I'm still getting an unknown model architecture error here?

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Apps\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Apps\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Apps\llama.cpp\ggml-cpu-haswell.dll
build: 7106 (495c6115) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:07:00.0) - 11240 MiB free
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: loaded meta data with 21 key-value pairs and 399 tensors from C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = Qwen3-VL-8B-Instruct
llama_model_loader: - kv   1:                                    version u32              = 3
llama_model_loader: - kv   2:                               tensor_count u32              = 399
llama_model_loader: - kv   3:                               general.type str              = model
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                               bos_token_id u32              = 151643
llama_model_loader: - kv   6:                               eos_token_id u32              = 151645
llama_model_loader: - kv   7:                                 hidden_act str              = silu
llama_model_loader: - kv   8:                                hidden_size u32              = 4096
llama_model_loader: - kv   9:                          intermediate_size u32              = 12288
llama_model_loader: - kv  10:                    max_position_embeddings u32              = 262144
llama_model_loader: - kv  11:                        num_attention_heads u32              = 32
llama_model_loader: - kv  12:                          num_hidden_layers u32              = 36
llama_model_loader: - kv  13:                        num_key_value_heads u32              = 8
llama_model_loader: - kv  14:                               rms_norm_eps f32              = 0.000001
llama_model_loader: - kv  15:                                 rope_theta f32              = 5000000.000000
llama_model_loader: - kv  16:                             attention_bias bool             = false
llama_model_loader: - kv  17:                                   head_dim u32              = 128
llama_model_loader: - kv  18:                        tie_word_embeddings bool             = false
llama_model_loader: - kv  19:                                 vocab_size u32              = 151936
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_0:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0 (guessed)
print_info: file size   = 4.29 GiB (4.50 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'Qwen3-VL-8B-Instruct'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model

@i4TsU

i4TsU commented Oct 27, 2025

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

They are not; @FMayran and @rujialiu are still figuring out the best way to implement a proper fix, once and for all :). You can cherry-pick the changes from #16745 without any problems and build it yourself as a temporary solution, but make sure to check the issues raised in the last 24-48 hours about why it's not yet a complete fix.

@PaymonHossaini

PaymonHossaini commented Oct 27, 2025

@psi00

I have managed to get Qwen3-VL-30B-A3B-Instruct running on Ubuntu just now (specifically on a Ryzen AI Max+ 395 with Vulkan). Did you convert your own GGUF/mmproj GGUF using convert_hf_to_gguf.py?

Here is how I prepared mine:

huggingface-cli download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir tmp/Qwen3-VL-30B-A3B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

# convert the model
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models

# convert the mmproj
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

# run llama.cpp
build-vulkan/bin/llama-server -m models/Qwen3-VL-30B-A3B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

No GGUFs I found off the shelf were working right until I did this. Hope this helps.

@psi00

psi00 commented Oct 27, 2025


Thank you, @PaymonHossaini. I was using the GGUFs from NexaAI. I'll add, though, that I think the architecture differs between the models (30B/8B/4B, etc.). I will try this anyway; thanks again.

Comment on lines +996 to +1002
feat = ggml_mul_mat(ctx0, merger.fc1_w, feat);
feat = ggml_add(ctx0, feat, merger.fc1_b);

feat = ggml_gelu(ctx0, feat);

feat = ggml_mul_mat(ctx0, merger.fc2_w, feat);
feat = ggml_add(ctx0, feat, merger.fc2_b);
Collaborator

replacing this with build_ffn can improve performance a bit
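
A rough sketch of what this could look like, assuming clip.cpp's build_ffn helper takes (input, up weight/bias, gate weight/bias, down weight/bias, activation op, layer index) and that an FFN_GELU op is available; the exact signature should be checked against clip.cpp before adopting this.

    // assumed build_ffn argument order; verify against the helper's declaration in clip.cpp
    feat = build_ffn(feat,
            merger.fc1_w, merger.fc1_b, // fc1 acts as the "up" projection
            nullptr, nullptr,           // no gate projection in this merger MLP
            merger.fc2_w, merger.fc2_b, // fc2 acts as the "down" projection
            FFN_GELU, il);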

deepstack_features = feat;
} else {
// concat along the feature dimension
deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);
Collaborator

Not very important to optimize right now, but doing ggml_concat across multiple layers can increase memory usage. One trick is to allocate one big tensor, then use ggml_set_rows to copy the intermediate results into the allocated tensor.

cc @ggerganov, do you think this could be a good approach for concatenating multiple tensors?
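
For illustration, a sketch of the preallocate-and-copy idea written with ggml_view_2d + ggml_cpy instead of ggml_set_rows (a simpler variant of the same trick); n_embd, n_tokens, n_deepstack, il_ds, and gf are assumed names standing in for whatever the surrounding graph-build code provides.

    // allocate the full destination once: [n_embd * n_deepstack, n_tokens]
    if (deepstack_features == nullptr) {
        deepstack_features = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd * n_deepstack, n_tokens);
    }
    // view the slice that belongs to deepstack layer il_ds and copy this layer's features into it
    ggml_tensor * slot = ggml_view_2d(ctx0, deepstack_features,
            n_embd, n_tokens,
            deepstack_features->nb[1],                                         // row stride of the big tensor
            (size_t) il_ds * n_embd * ggml_element_size(deepstack_features));  // byte offset within each row
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, feat, slot));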

if (std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) != hparams.deepstack_layers.end()) {
const int deepstack_idx = std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) - hparams.deepstack_layers.begin();
auto & merger = model.deepstack_mergers[deepstack_idx];
ggml_tensor * feat = ggml_dup(ctx0, cur);
Collaborator

this ggml_dup can be redundant
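
A minimal sketch of the simplification, under the assumption that the merger ops below only read their input and never write into it:

    ggml_tensor * feat = cur; // no ggml_dup needed if cur is only read downstream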

int32_t n_wa_pattern = 0;
int32_t spatial_merge_size = 0;

std::vector<int32_t> deepstack_layers; // qwen3vl deepstack layers
Collaborator

Maybe better to convert this to std::vector<bool> is_deepstack_layers, where the vector contains exactly n_layers elements; so, for example, if is_deepstack_layers[il] == true then layer il is a deepstack layer.
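
A short sketch of how the graph-side lookup could then read; the field name follows the proposal above and does not exist in the current code (assumes <algorithm> is included for std::count):

    if (hparams.is_deepstack_layers[il]) {
        // index of this merger = number of deepstack layers before layer il
        const int deepstack_idx = (int) std::count(
                hparams.is_deepstack_layers.begin(),
                hparams.is_deepstack_layers.begin() + il,
                true);
        auto & merger = model.deepstack_mergers[deepstack_idx];
        // ...
    }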

Comment on lines +365 to +372
struct deepstack_merger {
ggml_tensor * norm_w = nullptr;
ggml_tensor * norm_b = nullptr;
ggml_tensor * fc1_w = nullptr;
ggml_tensor * fc1_b = nullptr;
ggml_tensor * fc2_w = nullptr;
ggml_tensor * fc2_b = nullptr;
};
Collaborator

This can be moved into clip_layer with a deepstack_ prefix added to each field (no need for a dedicated struct), for example:

        ggml_tensor * deepstack_norm_w = nullptr;
        ggml_tensor * deepstack_norm_b = nullptr;
        ggml_tensor * deepstack_fc1_w = nullptr;
        ggml_tensor * deepstack_fc1_b = nullptr;
        ggml_tensor * deepstack_fc2_w = nullptr;
        ggml_tensor * deepstack_fc2_b = nullptr;

model.deepstack_mergers.resize(hparams.deepstack_layers.size());
for (size_t i = 0; i < hparams.deepstack_layers.size(); i++) {
auto & merger = model.deepstack_mergers[i];
merger.norm_w = get_tensor(string_format("v.deepstack.%d.norm.weight", (int)i), false);
Collaborator

These tensor names should be #define'd like the other tensor names.
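
For example, following the TN_* #define convention used for the other clip tensor names; the macro name and the .bias suffix below are assumptions for illustration, not existing definitions.

    #define TN_DEEPSTACK_NORM "v.deepstack.%d.norm.%s"

    merger.norm_w = get_tensor(string_format(TN_DEEPSTACK_NORM, (int)i, "weight"), false);
    merger.norm_b = get_tensor(string_format(TN_DEEPSTACK_NORM, (int)i, "bias"),   false);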

@psi00

psi00 commented Oct 27, 2025

@PaymonHossaini,
I get another architecture error when trying to convert:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported

@PaymonHossaini

PaymonHossaini commented Oct 27, 2025


While it's true that the 30B is MoE and the 8B is dense, I was unable to recreate this issue. Make sure your local checkout tracks the PR branch, as there were some changes to that script to make it compatible with these models.

My instructions for the 8B model are below:

huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir tmp/Qwen3-VL-8B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

build-vulkan/bin/llama-server -m models/Qwen3-VL-8B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-8b-Instruct-F16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

I don't believe this issue is a result of the code changes.



Development

Successfully merging this pull request may close these issues.

Feature Request: support qwen3-vl series
