Conversation

@FMayran FMayran commented Oct 23, 2025

This PR addresses the root cause of issue #13694, which has been stale for quite a long time.
It is an alternative solution to #15474 and is necessary because pull request #15474 causes incorrect positions to be used for positional encoding after an image.
The root issue is discussed in #13694 and #15474 and comes down to this: for models with positional encoding, llama.cpp expects token positions in the KV cache to be strictly increasing for causal masking, but Qwen VL (and possibly other models) produces token positions that do not respect this (they can decrease, for instance).

Although this PR does incur one root change inside ubatch (introducing a new field to store the position at which the token was inserted in the KV cache), positional encodings match the transformers implementation. The causal check uses this new position instead of the token pos.
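For illustration only (the function names below are made up; this is not the actual llama.cpp masking code), the change amounts to:

```
#include <cstdint>

// Before (conceptually): visibility decided from the model's token position,
// which M-RoPE can repeat or compress around an image, so it is not monotonic.
bool is_visible_by_pos(int32_t pos_kv_token, int32_t pos_cur_token) {
    return pos_kv_token <= pos_cur_token;
}

// After (this PR's idea): visibility decided from the order in which tokens
// were inserted into the KV cache, which is monotonic by construction.
bool is_visible_by_kv_position(int32_t kv_pos_token, int32_t kv_pos_cur) {
    return kv_pos_token <= kv_pos_cur;
}
```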

This PR may be necessary to reach perfect accuracy in Qwen3VL, as mentioned in the discussion of #16207.

This is my first PR.
Thanks @ggerganov @ngxson and especially @rujialiu

/*.output =*/ udata->output.data(),
/*.kv_position_of_token=*/ udata->kv_position_of_token.data(),
/*.data =*/ std::move(udata),

Collaborator

Nitpick: reorder so that kv_position_of_token is last.

Author

I am not sure what you mean by that. It can't come after .data because it should be treated like the other members of the update struct.

{
    if (ubatch->kv_position_of_token[i] != -1)
        map_kv_to_batch[ubatch->kv_position_of_token[i]] = i;
}

Collaborator

You're filling a vector of int32_t's but iterating over uint32_t's; this is bound to trigger type-conversion checks. Just iterate over int32_t.

Author

I will check to minimize casts, thanks for the reminder.
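For illustration, a minimal sketch of the loop with a signed index (the surrounding declarations, such as map_kv_to_batch and the loop bounds, are assumed rather than quoted from the PR):

```
// Using int32_t for the index avoids sign conversions: the values stored in
// kv_position_of_token are int32_t, and the batch index i we store is int32_t too.
for (int32_t i = 0; i < (int32_t) ubatch->n_tokens; ++i) {
    if (ubatch->kv_position_of_token[i] != -1) {
        map_kv_to_batch[ubatch->kv_position_of_token[i]] = i;
    }
}
```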

Collaborator

@ngxson ngxson left a comment

Quick question: do we still need to provide the absolute (KV) position via llama_batch? I'm not sure what happens if the image is separated into multiple llama_batch before llama_decode.

std::vector<llama_seq_id> seq_id_unq;
std::vector<int32_t> seq_idx;
std::vector<int8_t> output;
std::vector<int32_t> kv_position_of_token; // when pushed to the KV cache, where the token was pushed (used for causal masking)

Collaborator

I'd suggest a different naming for it, either kv_pos or abs_pos (absolute position). But it's up to @ggerganov to decide the naming here

Author

Yes, I was extra pedantic here :-)

llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
    if (image_tokens->use_mrope_pos) {
-       return 1; // for M-RoPE, the whole image is 1 in temporal dimension
+       return (::std::max)(image_tokens->nx, image_tokens->ny); //assuming image, not video

Collaborator

Suggested change:
-       return (::std::max)(image_tokens->nx, image_tokens->ny); //assuming image, not video
+       return std::max(image_tokens->nx, image_tokens->ny); // for M-RoPE, the image takes max(t, w, h) positions; t is omitted because we don't support video input

Author

Regarding (::std::max), I am being extra careful because on Windows max is sometimes a preprocessor macro and it breaks std::max; the parentheses prevent that. If this is prevented for sure in llama.cpp, we can remove them. And I will remove the redundant :: scope.

Author

Regarding the comment, it is not exactly "t", but rather something like k·t with k a frame-rate-dependent constant; I will update the comment with the actual expression. I have some need for video encoding on my side and may try to implement it in the future.

Collaborator

I am being extra careful because sometimes on windows max is a preprocessor definition and it breaks std::max

std::max is used in multiple places across the llama.cpp code base; there is nothing wrong with it. You should look at how other files use certain #defines to resolve the problem.
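For reference, one common pattern (a sketch of the general technique, not a quote of how llama.cpp actually handles it) is to neutralize the Windows macros before they can interfere:

```
// Sketch: prevent <windows.h> from defining the min/max macros, so that a
// plain std::max call compiles unparenthesized on Windows as well.
#if defined(_WIN32) && !defined(NOMINMAX)
#   define NOMINMAX
#endif
#include <algorithm>

int largest = std::max(3, 5); // resolves to std::max, not a macro
```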

Comment on lines +264 to +268
//if (p0 >= 0) {
//    bool ok = true;
//
//    if (batch.token) {
//        if (seq_pos_min(s) != p0 + 1) {

Collaborator

If this code block is safe to remove, better to remove it altogether.

iosub added a commit to iosub/ollama that referenced this pull request Oct 25, 2025
Applied changes from llama.cpp PR #16745 to fix cache causal masking issues for Qwen2.5 VL models.

Key changes:
- Disabled consecutive position validation in llama-batch.cpp (allows position jumps for vision embeddings)
- Added kv_position_of_token field to track KV cache positions for proper causal masking
- Modified causal masking logic to use batch positions instead of temporal positions
- Updated M-RoPE position calculation to use max(nx, ny) for images

This fix allows Qwen VL models to handle non-consecutive positions in embeddings, which is required for proper vision processing.

Ref: ggml-org/llama.cpp#16745
Author

FMayran commented Oct 25, 2025

Quick question: do we still need to provide the absolute (KV) position via llama_batch? I'm not sure what happens if the image is separated into multiple llama_batch before llama_decode.

From my understanding, the only use of the KV positions is causal masking, which only needs to know whether the considered token is after the current token. The batches should be inserted sequentially into the KV cache, so only the last batch matters, and "after the current token in the current batch" means "after the current token". If apply_ubatch is not called on the last inserted ubatch, we do have a problem, yes, because my code only checks causality within the current batch.

Member

@ggerganov ggerganov left a comment

The PR in #15474 that I reviewed some time ago seemed like a better solution. This PR makes an invalid assumption about the KV cell indices, namely that they can be used for causality checks.

    }

    cells.pos_set(idx, ubatch.pos[i]);
    ubatch.kv_position_of_token[i] = (int32_t) idx; // set the position in the KV cache as a property of this token (needed for proper causal masking)

Member

This does not work - we should never use the idx in the KV cache buffer for causality checks. Even if this works sometimes, in general nothing guarantees that the cache is ordered causally in memory. The simplest scenario in which this breaks is multiple sequences with unified KV cache.

@rujialiu

The PR in #15474 that I reviewed some time ago seemed like a better solution. This PR makes an invalid assumption about the KV cell indices, namely that they can be used for causality checks.

So it looks like we really need, explicitly, another sequence 0, 1, 2, 3, ... (as illustrated in your review) that's independent of the KV cell indices and dedicated to causal masking.

Author

FMayran commented Oct 26, 2025

The PR in #15474 that I reviewed some time ago seemed like a better solution. This PR makes an invalid assumption about the KV cell indices, namely that they can be used for causality checks.

So it looks like we really need, explicitly, another sequence 0, 1, 2, 3, ... (as illustrated in your review) that's independent of the KV cell indices and dedicated to causal masking.

I see what you mean @ggerganov. The main issue is not actually storing the new sequence of positions, but rather the fact that most code assumes that the base value for positional encoding is equal to the token's age in the cache. This in turn means @rujialiu's #15474 solution PLUS modifying mtmd_image_tokens_get_n_pos() to return both the KV cache sequence position and the positional encoding position, and updating everything that depends on these.
Or, alternatively, we could handle both positions in the implementation of the KV cache and add APIs to query both.

iosub added a commit to iosub/ollama that referenced this pull request Oct 26, 2025
This commit introduces comprehensive support for Qwen3-VL vision-language
models, including both the dense variant and the Mixture-of-Experts (MoE)
architecture with DeepStack fusion capabilities.

## Overview

Qwen3-VL represents Alibaba's advanced multimodal models capable of
understanding and reasoning about images alongside text. This implementation
enables running these models for various vision-language tasks including
image understanding, optical character recognition (OCR), visual question
answering, and document analysis.

## Architecture Implementation

### Core Architecture (llama-arch.cpp/h)
- **LLM_ARCH_QWEN3_VL**: Dense vision-language model architecture
- **LLM_ARCH_QWEN3_VL_MOE**: Mixture-of-Experts variant with expert routing
- Complete tensor mapping registration for both architectures
- Architecture-specific parameter handling and validation

### Model Loading (llama-model.cpp)

**Hyperparameter Loading**
- QWEN3_VL: Standard dense model configuration
  * Uses full n_embd dimension throughout
  * 36 layers for 4B parameter variant
- QWEN3_VL_MOE: Expert-based configuration
  * 4x n_embd expansion (n_embd/4 per channel × 4 channels)
  * 48 layers (30B-A3B) or 94 layers (235B-A22B)
  * Expert feed-forward network dimensions

**Multi-axis Rotary Position Embedding (M-RoPE)**
- Configured rope_sections = [24, 20, 20, 0]
  * Temporal dimension: 24 dims
  * Height dimension: 20 dims
  * Width dimension: 20 dims
  * Unused dimension: 0
- Enables spatial awareness for image patch processing
- Added debug logging for MRoPE configuration verification
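
For illustration (values copied from the list above; the variable name is ours, not an actual field in llama.cpp):

```
// M-RoPE splits the rotary dimensions of each head into per-axis sections:
// temporal, height, width, unused (24 + 20 + 20 + 0 = 64).
const int32_t mrope_sections[4] = { 24, 20, 20, 0 };
```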

**Tensor Initialization**
- QWEN3_VL follows QWEN3 dense structure
  * Token embeddings, output projection
  * Per-layer: attention (Q/K/V/O), normalization, FFN
- QWEN3_VL_MOE includes expert-specific tensors
  * Expert gate networks for routing
  * Per-expert FFN weights (gate, down, up)
  * Shared and expert-specific parameters

### Graph Building (llama-graph.cpp/h)

**DeepStack Architecture for MoE**
The Qwen3-VL-MoE variant implements a novel DeepStack fusion mechanism:

1. **Channel Splitting**: Vision embeddings split into 3 processing channels
   - ds0, ds1, ds2 (DeepStack channels 0, 1, 2)
   - Each channel: n_embd/4 dimensions

2. **Per-layer Processing**: Independent expert selection per channel
   - Token-level expert routing
   - Gated mixture-of-experts computation
   - Q/K normalization before attention

3. **Fusion Layers**: Learned merging at early transformer layers
   - Fusion occurs at layers 0, 1, and 2
   - DeepStack merger combines information across channels
   - Only active when vision embeddings present (text-only safe)

**Batch Processing**
- Enhanced position array handling for M-RoPE multi-dimensional positions
- Proper ubatch preparation distinguishing vision vs text tokens
- Conditional graph construction based on modality

### Vision Processing (clip.cpp/clip-impl.h)

**PROJECTOR_TYPE_QWEN3VLMOE**
- New projector type for Qwen3-VL-MoE vision encoder
- Handles projection from vision encoder to language model space

**DeepStack Merger Implementation**
The merger is a learnable 2-layer MLP with normalization:
```
Input (3 channels)
  → LayerNorm(norm_w, norm_b)
  → Linear(fc1_w, fc1_b)
  → GELU activation
  → Linear(fc2_w, fc2_b)
  → Output (fused representation)
```

Components:
- `norm_w`, `norm_b`: Layer normalization parameters
- `fc1_w`, `fc1_b`: First linear projection
- `fc2_w`, `fc2_b`: Second linear projection
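
For illustration, a minimal ggml-style sketch of that forward pass (assuming the tensor names listed above and an existing ggml_context; this mirrors the structure only, not the exact clip.cpp implementation):

```
// DeepStack merger forward pass: LayerNorm -> Linear -> GELU -> Linear.
static ggml_tensor * deepstack_merge(ggml_context * ctx, ggml_tensor * x,
                                     ggml_tensor * norm_w, ggml_tensor * norm_b,
                                     ggml_tensor * fc1_w,  ggml_tensor * fc1_b,
                                     ggml_tensor * fc2_w,  ggml_tensor * fc2_b) {
    ggml_tensor * cur = ggml_norm(ctx, x, 1e-6f);              // LayerNorm (eps assumed)
    cur = ggml_add(ctx, ggml_mul(ctx, cur, norm_w), norm_b);   // scale + shift
    cur = ggml_add(ctx, ggml_mul_mat(ctx, fc1_w, cur), fc1_b); // first projection
    cur = ggml_gelu(ctx, cur);                                 // GELU activation
    cur = ggml_add(ctx, ggml_mul_mat(ctx, fc2_w, cur), fc2_b); // second projection
    return cur;                                                // fused representation
}
```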

**Spatial Operations**
- Fixed spatial merge for vision patch sequences
- Proper handling of patch grid dimensions
- Vision-text boundary management

**Safety Improvements**
- Removed illegal zero-tensor initialization for text-only inputs
- Conditional fusion: only processes when vision embeddings exist
- Prevents memory access violations in text-only inference

### Platform Support (llama-model-loader.cpp)

**Windows File Handle Limit**
- Increased stdio limit to 2048 handles (from default ~512)
- Critical for MoE models with many expert weight files
- Uses `_setmaxstdio()` on Windows platform
- Prevents "too many open files" errors during loading
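
A sketch of the Windows-only adjustment described above (the call itself is standard MSVC CRT; its exact placement in llama-model-loader.cpp is assumed, not quoted):

```
#ifdef _WIN32
#include <stdio.h>

// Raise the C runtime's limit on simultaneously open FILE streams so that MoE
// models with many weight files can be loaded; _setmaxstdio returns -1 on failure.
static void raise_stdio_limit() {
    if (_setmaxstdio(2048) == -1) {
        fprintf(stderr, "warning: failed to raise the stdio handle limit\n");
    }
}
#endif
```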

### Reference Patches (llama/patches/)

Included for transparency and reproducibility:
- `0033-qwen3vl-base-architecture.patch`
- `0034-qwen3vl-deepstack-implementation.patch`
- `0035-qwen3vl-memory-fix.patch`
- `0036-qwen3vl-layer-norm-bias.patch`

## Technical Specifications

### Qwen3-VL (Dense)
- **Type**: Standard transformer with integrated vision encoder
- **Layers**: 36 (4B parameter model)
- **Embedding**: Full n_embd dimension
- **Position Encoding**: M-RoPE with 4 dimensional sections
- **Use Cases**: General vision-language understanding

### Qwen3-VL-MoE (Mixture of Experts)
- **Type**: Sparse MoE with DeepStack fusion
- **Layers**: 48 (30B-A3B) or 94 (235B-A22B)
- **Embedding**: 4-channel architecture (n_embd/4 per channel)
- **Experts**: Multiple expert networks per layer with learned routing
- **Fusion**: 3-layer early fusion (layers 0, 1, 2)
- **Use Cases**: High-quality vision understanding at improved efficiency

### DeepStack Fusion Mechanism

The multi-channel fusion enables:
1. **Parallel Processing**: Different aspects of vision processed independently
2. **Early Integration**: Information merged in early transformer layers
3. **Adaptive Routing**: Expert selection per channel and token
4. **Efficiency**: Sparse activation patterns reduce computation

## Capabilities Enabled

This implementation supports:
- **Multimodal Chat**: Conversational AI with image understanding
- **Image Captioning**: Detailed image descriptions
- **Visual Question Answering**: Answer questions about image content
- **Optical Character Recognition**: Extract text from images
- **Document Understanding**: Analyze documents, tables, charts
- **Image Analysis**: Detailed visual scene understanding

## References and Acknowledgments

This implementation is based on the outstanding work by the community:

**Primary Source Repository**
- Branch: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
- Author: LETS-BEE

**Source Commits** (applied in llama/patches/):
1. Base Architecture
   LETS-BEE/llama.cpp@9971912

2. DeepStack Implementation
   LETS-BEE/llama.cpp@b913e89

3. Memory Access Fix
   LETS-BEE/llama.cpp@de0e3d3

4. Layer Normalization Update
   LETS-BEE/llama.cpp@e45aecb

**Related Discussions and Pull Requests**
- Upstream llama.cpp Discussion:
  ggml-org/llama.cpp#16207 (comment)

- Upstream llama.cpp PR:
  ggml-org/llama.cpp#16745

- Related Ollama PR:
  ollama#12665

**Additional Context**
- OCR-related discussion:
  ggml-org/llama.cpp#16764

## Testing

Tested with:
- Qwen3-VL 4B parameter models (dense)
- Qwen3-VL-MoE 30B-A3B models (MoE)
- Various image understanding tasks
- Text-only and multimodal inference modes

## Future Work

Potential enhancements:
- Additional model size variants
- Performance optimizations for DeepStack fusion
- Extended M-RoPE configuration options
- Enhanced vision preprocessing pipelines

---

Special thanks to the llama.cpp community and all contributors who made
this multimodal vision-language support possible.
@rujialiu

I see what you mean @ggerganov. The main issue is not actually storing the new sequence of positions, but rather the fact that most code assumes that the base value for positional encoding is equal to the token's age in the cache. This in turn means @rujialiu's #15474 solution PLUS modifying mtmd_image_tokens_get_n_pos() to return both the KV cache sequence position and the positional encoding position, and updating everything that depends on these. Or, alternatively, we could handle both positions in the implementation of the KV cache and add APIs to query both.

Yes, that's why I didn't go further. I hoped to come up with a correct solution that has the following properties:

  1. Pure language models (and multi-modal models that don't "abuse" position_id) shouldn't incur any extra cost (time/memory).
  2. The pos-related variables should have consistent meanings.

However, 2 is difficult because we need to go over a lot of code and think about "which pos_id is referred to". I suspect some of the code would force us to add (and maintain) more variables than we initially planned.

And trying to keep 1 is even more difficult.

Author

FMayran commented Oct 27, 2025

@rujialiu I know a way to see the implications of such a change. We change the output of mtmd_image_tokens_get_n_pos() to an opaque struct and see what that breaks, and we start figuring things out by propagating the opaque struct and fixing things as we go.
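
For concreteness, a hypothetical sketch of what such an opaque return type could look like (names are illustrative and not part of the mtmd API; the llama.h / mtmd.h types llama_pos and mtmd_image_tokens are assumed):

```
// Hypothetical only: the two position counters discussed above, kept separate.
struct mtmd_n_pos {
    llama_pos n_pos_cache; // how far the chunk advances the KV-cache sequence position
    llama_pos n_pos_rope;  // how far it advances the positional-encoding (RoPE) position
};

// Hypothetical replacement for mtmd_image_tokens_get_n_pos(); propagating this
// struct would surface every call site that silently assumes the two are equal.
mtmd_n_pos mtmd_image_tokens_get_n_pos_ex(const mtmd_image_tokens * image_tokens);
```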

@rujialiu

@rujialiu I know a way to see the implications of such a change. We change the output of mtmd_image_tokens_get_n_pos() to an opaque struct and see what that breaks, and we start figuring things out by propagating the opaque struct and fixing things as we go.

That's a brilliant idea!
