
Conversation

@JJJYmmm

@JJJYmmm JJJYmmm commented Oct 26, 2025

This PR adds support for the Qwen3-VL series, including both the dense and MoE variants.
The original implementation was contributed by @yairpatch and @Thireus (see #16207). @LETS-BEE also helped address issues such as weight loading.

In this PR, I’ve fixed several algorithmic implementation details (e.g., deepstack), added support for MRoPE-Interleave, and performed final code cleanup.
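
For readers unfamiliar with the term, here is a rough, self-contained C++ illustration of how an interleaved MRoPE layout differs from the blocked layout used by Qwen2-VL; the pair count, section sizes, and exact cycling pattern below are made-up placeholders for illustration, not code from this PR.

    #include <cstdio>

    int main() {
        const int n_pairs  = 16;        // hypothetical number of rotary dimension pairs
        const int sects[3] = {6, 5, 5}; // hypothetical t/h/w section sizes (sum == n_pairs)

        for (int i = 0; i < n_pairs; i++) {
            // blocked MRoPE (Qwen2-VL style): contiguous chunks of dims per t/h/w section
            const int blocked = i < sects[0] ? 0 : (i < sects[0] + sects[1] ? 1 : 2);
            // interleaved MRoPE: the t/h/w sections cycle across dimension pairs, so each
            // section sees a mix of high- and low-frequency rotary dimensions
            const int interleaved = i % 3;
            printf("pair %2d: blocked section = %d, interleaved section = %d\n", i, blocked, interleaved);
        }
        return 0;
    }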

JJJYmmm and others added 2 commits October 26, 2025 19:18
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
@github-actions github-actions bot added the Nvidia GPU, examples, python, and ggml labels Oct 26, 2025
@taronaeo taronaeo linked an issue Oct 26, 2025 that may be closed by this pull request
@Thireus

Thireus commented Oct 26, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

@ddh0
Contributor

ddh0 commented Oct 27, 2025

Thank you! Looking forward to this so we (myself and @rujialiu) can progress with #16600 :)

@xbl916

xbl916 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

iosub added a commit to iosub/ollama that referenced this pull request Oct 27, 2025
Integrates Qwen3-VL and Qwen3VL-MoE architecture support from upstream.
Implements IMROPE (interleaved multimodal RoPE) for vision models.
Adds deepstack layer support for visual feature processing.

Changes include:
- New architecture types: LLM_ARCH_QWEN3VL, LLM_ARCH_QWEN3VLMOE
- IMROPE rope type for vision position encoding
- Deepstack visual feature handling in clip.cpp
- GGML CUDA kernels for IMROPE
- Tensor mappings for Qwen3VL architecture

Upstream PR: ggml-org/llama.cpp#16780
Contributors: @JJJYmmm @yairpatch @Thireus @LETS-BEE
@theo77186

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

@psi00

psi00 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

I'm still getting an unknown model architecture error here?

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Apps\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Apps\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Apps\llama.cpp\ggml-cpu-haswell.dll
build: 7106 (495c6115) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:07:00.0) - 11240 MiB free
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: loaded meta data with 21 key-value pairs and 399 tensors from C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = Qwen3-VL-8B-Instruct
llama_model_loader: - kv   1:                                    version u32              = 3
llama_model_loader: - kv   2:                               tensor_count u32              = 399
llama_model_loader: - kv   3:                               general.type str              = model
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                               bos_token_id u32              = 151643
llama_model_loader: - kv   6:                               eos_token_id u32              = 151645
llama_model_loader: - kv   7:                                 hidden_act str              = silu
llama_model_loader: - kv   8:                                hidden_size u32              = 4096
llama_model_loader: - kv   9:                          intermediate_size u32              = 12288
llama_model_loader: - kv  10:                    max_position_embeddings u32              = 262144
llama_model_loader: - kv  11:                        num_attention_heads u32              = 32
llama_model_loader: - kv  12:                          num_hidden_layers u32              = 36
llama_model_loader: - kv  13:                        num_key_value_heads u32              = 8
llama_model_loader: - kv  14:                               rms_norm_eps f32              = 0.000001
llama_model_loader: - kv  15:                                 rope_theta f32              = 5000000.000000
llama_model_loader: - kv  16:                             attention_bias bool             = false
llama_model_loader: - kv  17:                                   head_dim u32              = 128
llama_model_loader: - kv  18:                        tie_word_embeddings bool             = false
llama_model_loader: - kv  19:                                 vocab_size u32              = 151936
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_0:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0 (guessed)
print_info: file size   = 4.29 GiB (4.50 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'Qwen3-VL-8B-Instruct'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model

@i4TsU

i4TsU commented Oct 27, 2025

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

They are not; @FMayran and @rujialiu are still figuring out the best way to implement a proper fix, once and for all :). You can cherry-pick the changes from #16745 without any problems and build it yourself as a temporary solution, but make sure to check the issues raised in the last 24-48 hours about why it's not yet a complete fix.

@PaymonHossaini

PaymonHossaini commented Oct 27, 2025

@psi00

I have managed to get Qwen3-VL-30B-A3B-Instruct running on Ubuntu just now (specifically on a Ryzen AI Max+ 395 with Vulkan). Did you convert your own GGUF/mmproj GGUF using convert_hf_to_gguf.py?

Here is how I prepared mine:

huggingface-cli download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir tmp/Qwen3-VL-30B-A3B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

# convert the model
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models

# convert the mmproj
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

# run llama.cpp
build-vulkan/bin/llama-server -m models/Qwen3-VL-30B-A3B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

No GGUFs I found off the shelf were working right until I did this. Hope this helps.

@psi00

psi00 commented Oct 27, 2025


Thank you, @PaymonHossaini. I was using the GGUFs from NexaAI. I'll add, though, that I think the architecture differs between the models (30B/8B/4B, etc.). I will try this anyway; thanks again.

Comment on lines +996 to +1002
feat = ggml_mul_mat(ctx0, merger.fc1_w, feat);
feat = ggml_add(ctx0, feat, merger.fc1_b);

feat = ggml_gelu(ctx0, feat);

feat = ggml_mul_mat(ctx0, merger.fc2_w, feat);
feat = ggml_add(ctx0, feat, merger.fc2_b);
Collaborator

replacing this with build_ffn can improve performance a bit
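
A rough sketch of what this could look like, assuming clip.cpp's build_ffn helper takes (input, up weight/bias, gate weight/bias, down weight/bias, activation op, layer index) and that an FFN_GELU op is available; the exact signature should be checked against clip.cpp before adopting this.

    // assumed build_ffn argument order; verify against the helper's declaration in clip.cpp
    feat = build_ffn(feat,
            merger.fc1_w, merger.fc1_b, // fc1 acts as the "up" projection
            nullptr, nullptr,           // no gate projection in this merger MLP
            merger.fc2_w, merger.fc2_b, // fc2 acts as the "down" projection
            FFN_GELU, il);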

deepstack_features = feat;
} else {
// concat along the feature dimension
deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);
Collaborator

Not very important to optimize right now, but doing ggml_concat across multiple layers can increase memory usage. One trick is to allocate one big tensor, then use ggml_set_rows to copy the intermediate results into the allocated tensor.

cc @ggerganov, do you think this could be a good approach for concatenating multiple tensors?
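
For illustration, a sketch of the preallocate-and-copy idea written with ggml_view_2d + ggml_cpy instead of ggml_set_rows (a simpler variant of the same trick); n_embd, n_tokens, n_deepstack, il_ds, and gf are assumed names standing in for whatever the surrounding graph-build code provides.

    // allocate the full destination once: [n_embd * n_deepstack, n_tokens]
    if (deepstack_features == nullptr) {
        deepstack_features = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd * n_deepstack, n_tokens);
    }
    // view the slice that belongs to deepstack layer il_ds and copy this layer's features into it
    ggml_tensor * slot = ggml_view_2d(ctx0, deepstack_features,
            n_embd, n_tokens,
            deepstack_features->nb[1],                                         // row stride of the big tensor
            (size_t) il_ds * n_embd * ggml_element_size(deepstack_features));  // byte offset within each row
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, feat, slot));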

if (std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) != hparams.deepstack_layers.end()) {
const int deepstack_idx = std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) - hparams.deepstack_layers.begin();
auto & merger = model.deepstack_mergers[deepstack_idx];
ggml_tensor * feat = ggml_dup(ctx0, cur);
Collaborator

this ggml_dup can be redundant
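
A minimal sketch of the simplification, under the assumption that the merger ops below only read their input and never write into it:

    ggml_tensor * feat = cur; // no ggml_dup needed if cur is only read downstream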

int32_t n_wa_pattern = 0;
int32_t spatial_merge_size = 0;

std::vector<int32_t> deepstack_layers; // qwen3vl deepstack layers
Collaborator

Maybe better to convert this to std::vector<bool> is_deepstack_layers, where the vector contains exactly n_layers elements; so, for example, if is_deepstack_layers[il] == true then layer il is a deepstack layer.
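
A short sketch of how the graph-side lookup could then read; the field name follows the proposal above and does not exist in the current code (assumes <algorithm> is included for std::count):

    if (hparams.is_deepstack_layers[il]) {
        // index of this merger = number of deepstack layers before layer il
        const int deepstack_idx = (int) std::count(
                hparams.is_deepstack_layers.begin(),
                hparams.is_deepstack_layers.begin() + il,
                true);
        auto & merger = model.deepstack_mergers[deepstack_idx];
        // ...
    }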

Comment on lines +365 to +372
struct deepstack_merger {
ggml_tensor * norm_w = nullptr;
ggml_tensor * norm_b = nullptr;
ggml_tensor * fc1_w = nullptr;
ggml_tensor * fc1_b = nullptr;
ggml_tensor * fc2_w = nullptr;
ggml_tensor * fc2_b = nullptr;
};
Collaborator

This can be moved into clip_layer with a deepstack_ prefix added to each field (no need for a dedicated struct), for example:

        ggml_tensor * deepstack_norm_w = nullptr;
        ggml_tensor * deepstack_norm_b = nullptr;
        ggml_tensor * deepstack_fc1_w = nullptr;
        ggml_tensor * deepstack_fc1_b = nullptr;
        ggml_tensor * deepstack_fc2_w = nullptr;
        ggml_tensor * deepstack_fc2_b = nullptr;

model.deepstack_mergers.resize(hparams.deepstack_layers.size());
for (size_t i = 0; i < hparams.deepstack_layers.size(); i++) {
auto & merger = model.deepstack_mergers[i];
merger.norm_w = get_tensor(string_format("v.deepstack.%d.norm.weight", (int)i), false);
Collaborator

These tensor names should be #define'd like the other tensor names.
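
For example, following the TN_* #define convention used for the other clip tensor names; the macro name and the .bias suffix below are assumptions for illustration, not existing definitions.

    #define TN_DEEPSTACK_NORM "v.deepstack.%d.norm.%s"

    merger.norm_w = get_tensor(string_format(TN_DEEPSTACK_NORM, (int)i, "weight"), false);
    merger.norm_b = get_tensor(string_format(TN_DEEPSTACK_NORM, (int)i, "bias"),   false);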

@psi00

psi00 commented Oct 27, 2025

@PaymonHossaini,
I get another architecture error when trying to convert:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported

@PaymonHossaini

PaymonHossaini commented Oct 27, 2025


While it's true that the 30B is MoE and the 8B is dense, I was unable to recreate this issue. Make sure your local checkout tracks the PR branch, as there were some changes to that script to make it compatible with these models.

My instructions for the 8B model are below:

huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir tmp/Qwen3-VL-8B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

build-vulkan/bin/llama-server -m models/Qwen3-VL-8B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-8b-Instruct-F16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

I don't believe this issue is a result of the code changes.



Development

Successfully merging this pull request may close these issues.

Feature Request: support qwen3-vl series
