[model] add support for qwen3vl series #16780
Conversation
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Thank you @JJJYmmm! Test builds: https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.
Integrates Qwen3-VL and Qwen3VL-MoE architecture support from upstream. Implements IMROPE (interleaved multimodal RoPE) for vision models. Adds deepstack layer support for visual feature processing.

Changes include:
- New architecture types: LLM_ARCH_QWEN3VL, LLM_ARCH_QWEN3VLMOE
- IMROPE rope type for vision position encoding
- Deepstack visual feature handling in clip.cpp
- GGML CUDA kernels for IMROPE
- Tensor mappings for Qwen3VL architecture

Upstream PR: ggml-org/llama.cpp#16780
Contributors: @JJJYmmm @yairpatch @Thireus @LETS-BEE
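For context on the IMROPE item above: sectioned MRoPE (as in Qwen2-VL) assigns contiguous blocks of rotary dimensions to the temporal/height/width position components, while the interleaved variant alternates the components across dimensions so each one covers both low and high frequencies. The sketch below is only a conceptual illustration; the section sizes and the modulo rule are assumptions, not the actual GGML kernel logic.

```cpp
#include <array>
#include <cstdio>

// Conceptual illustration only: how a rotary dimension pair might pick its
// position component (temporal / height / width) under the two layouts.
// The section sizes are made up for this example; real values come from
// the model hyperparameters.
enum pos_component { POS_T = 0, POS_H = 1, POS_W = 2 };

// Sectioned MRoPE (Qwen2-VL style): contiguous blocks of dims per component.
static pos_component mrope_sectioned(int dim_pair, const std::array<int, 3> & sections) {
    if (dim_pair < sections[0])               return POS_T;
    if (dim_pair < sections[0] + sections[1]) return POS_H;
    return POS_W;
}

// Interleaved layout (the IMROPE idea): components alternate across dim pairs.
static pos_component mrope_interleaved(int dim_pair) {
    return static_cast<pos_component>(dim_pair % 3);
}

int main() {
    const std::array<int, 3> sections = {16, 24, 24}; // assumed example values
    for (int d = 0; d < 8; ++d) {
        printf("dim pair %d: sectioned=%d interleaved=%d\n",
               d, mrope_sectioned(d, sections), mrope_interleaved(d));
    }
}
```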
The question is: are the fixes in #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

I'm still getting an unknown model architecture error here?
They are not, as @FMayran and @rujialiu are still figuring out the best way to implement a fix properly, once and for all :). You can cherry-pick the changes from #16745 without any problems, though, and then just build it yourself for a temporary implementation. Just make sure to check the issues raised in the last 24-48 hours regarding why it's not a real 100% fix.
I have managed to get Qwen3-VL-30B-A3B-Instruct running on Ubuntu just now (specifically with a Ryzen AI Max+ 395 and Vulkan). Did you compile your own GGUF/mmproj.gguf using the conversion script? How I prepared mine is below.

No GGUFs I found off the shelf were working right until I did this. Hope this helps.
Thank you. I was using the GGUFs from NexaAI. May I add, though, that I think the architecture is different for each model (30B/8B/4B), etc. I will try this though, thanks again.
```cpp
feat = ggml_mul_mat(ctx0, merger.fc1_w, feat);
feat = ggml_add(ctx0, feat, merger.fc1_b);

feat = ggml_gelu(ctx0, feat);

feat = ggml_mul_mat(ctx0, merger.fc2_w, feat);
feat = ggml_add(ctx0, feat, merger.fc2_b);
```
replacing this with build_ffn can improve performance a bit
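If it helps, a rough sketch of what that could look like is below; the exact build_ffn parameter order and the FFN_GELU op name are assumptions based on how the helper is used elsewhere in clip.cpp, so check the current signature before applying:

```cpp
// Sketch only, not the actual patch: the merger MLP (fc1 -> GELU -> fc2) has
// no gate branch, so the gate tensors are passed as nullptr.
feat = build_ffn(feat,
        merger.fc1_w, merger.fc1_b,  // up projection
        nullptr,      nullptr,       // no gate
        merger.fc2_w, merger.fc2_b,  // down projection
        FFN_GELU, il);
```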
```cpp
    deepstack_features = feat;
} else {
    // concat along the feature dimension
    deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);
```
Not very important to optimize right now, but doing ggml_concat on multiple layers can increase memory usage. One trick is to allocate one big tensor, then use ggml_set_rows to copy the intermediate results into the allocated tensor.

cc @ggerganov, do you think this could be a good way to concat multiple tensors?
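A rough sketch of that trick, assuming ggml_set_rows(ctx, dst, src, row_ids) with an I64 index tensor; the shapes and names below are hypothetical and only meant to show the pre-allocation idea:

```cpp
// Hypothetical sketch: pre-allocate one buffer for all deepstack layers and
// scatter each layer's merged features into its slice, instead of chaining
// ggml_concat calls that each materialize a new tensor.
const int64_t n_ds     = (int64_t) hparams.deepstack_layers.size();
const int64_t n_dim    = feat->ne[0];
const int64_t n_tokens = feat->ne[1];

ggml_tensor * ds_all  = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_dim, n_tokens * n_ds);
ggml_tensor * row_ids = ggml_new_tensor_1d(ctx0, GGML_TYPE_I64, n_tokens);
// row_ids would be filled (after graph allocation) with
// [deepstack_idx * n_tokens, ..., (deepstack_idx + 1) * n_tokens - 1]

ds_all = ggml_set_rows(ctx0, ds_all, feat, row_ids);
// ds_all can later be viewed/reshaped back into the concatenated layout
```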
```cpp
if (std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) != hparams.deepstack_layers.end()) {
    const int deepstack_idx = std::find(hparams.deepstack_layers.begin(), hparams.deepstack_layers.end(), il) - hparams.deepstack_layers.begin();
    auto & merger = model.deepstack_mergers[deepstack_idx];
    ggml_tensor * feat = ggml_dup(ctx0, cur);
```
this ggml_dup can be redundant
```cpp
int32_t n_wa_pattern = 0;
int32_t spatial_merge_size = 0;

std::vector<int32_t> deepstack_layers; // qwen3vl deepstack layers
```
Maybe better to convert this to std::vector<bool> is_deepstack_layers, where the vector contains exactly n_layers elements; for example, if is_deepstack_layers[il] == true then layer il is a deepstack layer.
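A minimal sketch of that refactor, with the loader and graph code assumed for illustration:

```cpp
// in clip_hparams: one flag per layer instead of a list of indices
std::vector<bool> is_deepstack_layers; // size == n_layer

// while loading hparams (illustrative): mark the listed layers
// is_deepstack_layers.assign(n_layer, false);
// for (int32_t il : deepstack_layer_indices) is_deepstack_layers[il] = true;

// in the graph builder: O(1) membership test, no std::find needed
if (hparams.is_deepstack_layers[il]) {
    // run the deepstack merger for this layer
}
```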
```cpp
struct deepstack_merger {
    ggml_tensor * norm_w = nullptr;
    ggml_tensor * norm_b = nullptr;
    ggml_tensor * fc1_w = nullptr;
    ggml_tensor * fc1_b = nullptr;
    ggml_tensor * fc2_w = nullptr;
    ggml_tensor * fc2_b = nullptr;
};
```
This can be moved to clip_layer with a deepstack_ prefix added (no need for a dedicated class), for example:
```cpp
ggml_tensor * deepstack_norm_w = nullptr;
ggml_tensor * deepstack_norm_b = nullptr;
ggml_tensor * deepstack_fc1_w = nullptr;
ggml_tensor * deepstack_fc1_b = nullptr;
ggml_tensor * deepstack_fc2_w = nullptr;
ggml_tensor * deepstack_fc2_b = nullptr;
```

```cpp
model.deepstack_mergers.resize(hparams.deepstack_layers.size());
for (size_t i = 0; i < hparams.deepstack_layers.size(); i++) {
    auto & merger = model.deepstack_mergers[i];
    merger.norm_w = get_tensor(string_format("v.deepstack.%d.norm.weight", (int)i), false);
```
These tensor names should be #define'd like the other tensors.
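For illustration, they could follow the TN_* macro convention used for the other tensor names; the macro names (and the fc1/fc2 name strings) below are hypothetical:

```cpp
// hypothetical names following the existing TN_* pattern in clip-impl.h
#define TN_DEEPSTACK_NORM "v.deepstack.%d.norm.%s"
#define TN_DEEPSTACK_FC1  "v.deepstack.%d.fc1.%s"
#define TN_DEEPSTACK_FC2  "v.deepstack.%d.fc2.%s"

// usage in the loader (sketch)
merger.norm_w = get_tensor(string_format(TN_DEEPSTACK_NORM, (int) i, "weight"), false);
merger.norm_b = get_tensor(string_format(TN_DEEPSTACK_NORM, (int) i, "bias"),   false);
```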
@PaymonHossaini,

While it's true that the 30B is MoE and the 8B is dense, I was unable to recreate this issue. Make sure your local checkout tracks the PR branch, as there were some changes to that script to make it compatible with these models. My instructions for using the 8B model are below.

I don't believe this issue is a result of the code changes.

This PR adds support for the Qwen3-VL series, including both the dense and MoE variants.
The original implementation was contributed by @yairpatch and @Thireus (see #16207). @LETS-BEE also helped address issues such as weight loading.
In this PR, I’ve fixed several algorithmic implementation details (e.g., deepstack), added support for MRoPE-Interleave, and performed final code cleanup.
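For readers new to deepstack: as the hunks above show, each deepstack layer's features go through a small merger MLP and are then concatenated to the main visual embeddings along the feature dimension, so each image token's embedding grows by one slice per deepstack layer. A minimal illustration of that final stacking step (variable names are illustrative, not the actual code):

```cpp
// Illustration based on the diff above: deepstack_features accumulates the
// per-layer merger outputs along dim 0 (the feature dimension), and the final
// projector output per token becomes [main | ds_layer_0 | ds_layer_1 | ...].
ggml_tensor * out = cur;
if (deepstack_features != nullptr) {
    out = ggml_concat(ctx0, out, deepstack_features, 0);
}
```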