Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support #16900
Conversation
Force-pushed from d5192bf to d2f8f00
Performance results (benchmark tables not shown): AMD Radeon Pro VII, AMD Radeon RX 6800 XT, Intel A770, RTX 3090
jeffbolznv left a comment:
I only did a quick read through. I'll do some perf testing soon.
As usual, I appear to have caused an llvmpipe issue. I'll look into it.

Some initial perf results: I reran some of the models with the biggest deltas. Most of the differences seem to be noise, except the improvement for gpt-oss MXFP4, which is real.
Force-pushed from b153aac to 1b78909
The funny thing about that is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4, so I'm not sure what is going on there. I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.
Force-pushed from 1b78909 to 937f992
The tuning seems okay now, even though I didn't change anything. @jeffbolznv Please take another look. Did you have any concerns with your benchmarks? Here are updated results (benchmark tables not shown): AMD Radeon 8060S, AMD RX 6800 XT, AMD Radeon Pro VII, Intel A770, Nvidia RTX 3090
Force-pushed from b99726c to 3c22e38
Something's broken in the nvidia-vulkan-cm and cm2 runs; I'll look into it.

I can't reproduce the problem, even on my RTX 3090, with coopmat2, with coopmat, or without coopmat. Not sure what is going on. It looks like incoherence, but for me the example runs just fine. @jeffbolznv any ideas?

I pulled the branch but wasn't able to reproduce the failure. I don't have any great ideas; maybe some missing bounds checking?

Bounds checking is much simpler for MMVQ, since all the inputs come in blocks of 256, 128 or 32. I didn't change how the output is stored, so I don't think that's likely.
Force-pushed from 3c22e38 to e086733
I would like to merge this, but the CI keeps failing in a way I can't reproduce or understand.

I had those Illegal instruction failures once in a PR, and it was related to a bad ccache. Maybe you can clear it and re-run that test.

How do I clear it?

My guess is to find the ccache entry related to this PR in https://github.com/ggml-org/llama.cpp/actions/caches and delete it. I don't have the required permissions, but maybe you do. @slaren did it at the time.
Force-pushed from e086733 to e69d645
Force-pushed from 9d0f9af to 9cbe4f8
So I've done about as much as I can think of to try to figure out what is going on with the failing CI runs. @ggerganov Is anything about the CI Nvidia setup unusual or special? Maybe it's an issue with a specific driver?
Taking a look now
@0cc4m Do you reproduce using these steps?

# converted from https://huggingface.co/Qwen/Qwen3-0.6B-Base
wget https://huggingface.co/ggerganov/qwen-tmp/resolve/main/qwen3-0.6b-base-bf16.gguf
./bin/llama-quantize ./qwen3-0.6b-base-bf16.gguf ./qwen3-0.6b-base-q4_0.gguf q4_0
./bin/llama-save-load-state -m ./qwen3-0.6b-base-q4_0.gguf -ngl 10 -c 1024 -fa off
@ggerganov I basically did that, yes, except that I downloaded https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q4_0.gguf instead of quantizing the base model myself. I just gave your exact steps a try, and it still passes on my hardware across many tries.
The runners are also successful when using https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q4_0.gguf, but they fail with the model that I provided. The difference is that in the model I provided, the input embeddings and the output tensor are merged, while in the other model they are not.
Thank you, that was it! It still doesn't happen on my RTX 3090, but now I was able to reproduce it on a laptop RTX 2060. I guess it's Turing-specific (@jeffbolznv).
This is an extremely specific and fragile issue. It really only happens on Nvidia Turing, with this specific model and quant, with this specific number of offloaded layers (10), and with flash attention off. Even if I just enable Vulkan result checks (which run each operator individually and compare its result to the same operation on CPU), the issue goes away. It also goes away if I disable operator fusion. You can see some kind of incoherence going on with ngl 9 and 10, yet with 8 or 11 layers offloaded it is fixed. If I disable integer dot (which takes the changes in this PR out of the picture), the test runs correctly, but the output still looks incoherent (example outputs not shown here). @ggerganov Is it possible this is some kind of numerical model issue triggered by very small and specific changes in operator outputs? I don't see how this PR can be at fault if even using its changes more often (by increasing ngl or forcing more operators onto the MMVQ path) fixes it.
I was finally able to reproduce it on the AMD RX 8060S, but there it only happens with ngl 21, not 10. It also goes away if I change the sampling seed in the example.
The incoherent texts are most likely due to some numerical issues; I guess this can be expected. The problem I am worried about is that this test is needed to guarantee that restoring a previously saved state leads to the same generation. Somehow, that is not always the case here, which is very strange. The only reason I can see is that the KV cache probably gets slightly mutated during the store and restore for some reason. But I don't see how this could happen, since we do just byte-level copies of the buffers.
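One way to probe the byte-level-copy assumption directly would be a round trip through the state API: save the state, set it back, save it again, and compare the bytes. A minimal sketch, assuming the llama_state_get_size / llama_state_get_data / llama_state_set_data functions from llama.h (signatures as I recall them; worth double-checking against the current header):

```cpp
// Hedged sketch: round-trip the context state through the state API and
// verify the bytes come back unchanged. Check llama.h for the exact
// signatures in the current tree.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>
#include "llama.h"

static bool state_round_trip_is_bitexact(llama_context * ctx) {
    const size_t n = llama_state_get_size(ctx);

    std::vector<uint8_t> a(n), b(n);
    llama_state_get_data(ctx, a.data(), a.size()); // snapshot before
    llama_state_set_data(ctx, a.data(), a.size()); // restore the same state
    llama_state_get_data(ctx, b.data(), b.size()); // snapshot after restore

    const bool ok = memcmp(a.data(), b.data(), n) == 0;
    fprintf(stderr, "state round trip: %s (%zu bytes)\n", ok ? "bit-exact" : "MISMATCH", n);
    return ok;
}
```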
I want to make sure I understand what this test is doing. Is the save/restore happening between tokens, or does it happen in the middle of generating a token? Is it using the get/set_tensor functions to do the save/restore?
I think the part that is failing here is: process the prompt, save the state to a file, generate a number of tokens, restore the state, generate the same number of tokens again, and then check whether the two generated texts are identical.
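In outline, that flow looks like the sketch below. This is not the actual examples/save-load-state code; the Ctx type and the helpers are placeholders standing in for prompt processing, sampling, and the real state save/load calls.

```cpp
// Rough sketch of the failing test's structure (placeholder helpers, not the
// real examples/save-load-state implementation).
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder types/helpers standing in for llama.cpp calls.
struct Ctx { /* model + KV cache + sampler state */ };
static void                 process_prompt(Ctx &, const std::string &)       {}
static std::vector<uint8_t> save_state(const Ctx &)                          { return {}; }
static void                 load_state(Ctx &, const std::vector<uint8_t> &)  {}
static std::string          generate(Ctx &, int /*n_tokens*/)                { return ""; }

int main() {
    Ctx ctx;
    process_prompt(ctx, "The quick brown fox");

    // Save the state right after the prompt, then generate run 1.
    const std::vector<uint8_t> state = save_state(ctx);
    const std::string run1 = generate(ctx, 64);

    // Restore the saved state and generate run 2 from the same point.
    load_state(ctx, state);
    const std::string run2 = generate(ctx, 64);

    // The test's expectation: both runs produce identical text.
    assert(run1 == run2);
    return 0;
}
```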
I built a (WIP) tool in https://github.com/ggml-org/llama.cpp/tree/0cc4m/model-backend-compare to run the CPU and device(s) side by side and compare the intermediate results. I didn't see anything concerning in the results of this model, though (log not reproduced here). The error accumulates every deviation over time inside the model graph, so it's higher than in test-backend-ops, but there's still nothing out of the ordinary, and it looks very similar on the ROCm backend. Let me know if you have suggestions on how to improve the tool. I could also run the CPU side step by step next to the GPU ops, similar to how the Vulkan result checker works; that would give more meaningful error values.
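For the error metric, something like a normalized mean squared error between the reference and device tensors, similar in spirit to what test-backend-ops reports, is one option. A self-contained sketch (not the tool's actual code):

```cpp
// Hedged sketch: normalized mean squared error between a reference (CPU)
// tensor and a device result. Not taken from the model-backend-compare tool.
#include <cstdio>
#include <vector>

static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size() && i < out.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}

int main() {
    const std::vector<float> cpu = {1.0f, 2.0f, 3.0f};
    const std::vector<float> gpu = {1.0f, 2.0001f, 2.9998f};
    printf("NMSE: %g\n", nmse(cpu, gpu)); // tiny value -> results agree closely
    return 0;
}
```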
Yes, this is correct. We store the state of the context. Notice that after restoring the state, most of the initial generated tokens match the generated tokens from the first run; this is the expectation. But in some rare cases, the sequence starts to diverge for some reason.
I was able to reproduce this on my 5090 and am pretty sure I see what's happening. It's caused by the "add_rms_fusion" optimization, which accumulates partial sums into a temporary buffer and uses those to accelerate/parallelize the rms_norm. This optimization does not happen on the first token of the second run, because we only allocate memory for the temporary buffer after having made a pass through the model on a previous token. For the first run, the memory gets allocated during the prompt-processing run on the same context (maybe also during the warmup run? I didn't check that). The difference in the rms_norm calculation leads to a slight precision difference. I think each run is deterministic, but they (arguably) start from a different initial state. FWIW, there's a chance we could ditch this add_rms_fusion optimization; I think a lot of its benefits have moved into other fusion optimizations, though this bug is evidence that it still kicks in sometimes. I'd need to do a bit of perf testing.
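The precision point is easy to demonstrate in isolation: summing the same values in one sequential pass versus accumulating per-workgroup partial sums first can round differently in the last bits, so the rms_norm input changes slightly depending on whether the fused path ran. A toy illustration in plain C++ (not the actual shader code):

```cpp
// Toy illustration of why partial-sum accumulation can differ from a single
// pass: float rounding depends on how the additions are grouped. This only
// demonstrates the effect described above; it is not the Vulkan shader.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> x(4096);
    for (size_t i = 0; i < x.size(); ++i) {
        x[i] = 1.0f + 1e-7f * (float)(i % 97); // values that don't sum exactly
    }

    // Path A: one sequential accumulation (like the non-fused sum).
    float single = 0.0f;
    for (float v : x) single += v;

    // Path B: partial sums per "workgroup" of 256, then a final reduction
    // (like accumulating into a temporary buffer and summing it afterwards).
    float partial_total = 0.0f;
    for (size_t g = 0; g < x.size(); g += 256) {
        float partial = 0.0f;
        for (size_t i = g; i < g + 256; ++i) partial += x[i];
        partial_total += partial;
    }

    printf("single pass : %.9g\n", single);
    printf("partial sums: %.9g\n", partial_total);
    printf("difference  : %.9g\n", single - partial_total);
    return 0;
}
```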
So is it just a coincidence that it triggered on this PR, or did I make it worse in some way?
I don't think you made it worse; it's probably just that the different precision of MMVQ caused it to manifest.
Thank you for looking into it. It's great that we are finally making progress on this; it would be nice to get this PR merged at last.
I checked the models I usually run. Most only hit this optimization once per token, but qwen3moe and deepseek2 still hit it pretty frequently and benefit around 1% from it. Another option might be to just reserve some memory at context creation for this, and either resize immediately when needed or never resize.
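The reserve-at-creation idea could look roughly like this generic sketch (not ggml's actual buffer code): allocate a partial-sums buffer up front, then either grow it on demand or never resize and skip the fused path when it doesn't fit.

```cpp
// Generic sketch of the "reserve at context creation" idea, not ggml code:
// allocate a partial-sums buffer up front so the fused path is available from
// the very first token, instead of only after a sizing pass.
#include <cstddef>
#include <vector>

class PartialSumsBuffer {
public:
    explicit PartialSumsBuffer(size_t initial_floats)
        : data_(initial_floats) {}

    // Option A: grow immediately when a graph needs more than was reserved.
    float * ensure(size_t needed_floats) {
        if (needed_floats > data_.size()) {
            data_.resize(needed_floats);
        }
        return data_.data();
    }

    // Option B: never resize; callers skip the fused path if it doesn't fit.
    float * try_get(size_t needed_floats) {
        return needed_floats <= data_.size() ? data_.data() : nullptr;
    }

private:
    std::vector<float> data_;
};
```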
Add k-quant mul_mat_vec support and enable the MUL_MAT_ID integer dot vector path (see the sketch below for the general idea).
Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.
Q3_K and Q6_K currently don't work well at all; I'm still trying to figure out why.
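For readers unfamiliar with the MMVQ path, the core idea is to quantize the activations into blocks of int8 and compute each block's dot product in integer arithmetic, applying the scales once per block. A deliberately simplified sketch of that idea (toy 32-wide blocks with a single float scale; not ggml's actual k-quant layouts or the Vulkan shader):

```cpp
// Simplified illustration of the integer-dot idea behind MMVQ: accumulate the
// per-block dot product in int32 and scale once per block. The block layout
// here is a toy (32 x int8 with one float scale); ggml's k-quants pack scales
// and minima differently.
#include <cstdint>
#include <cstdio>

struct BlockQ8 {           // toy block: 32 quantized values + 1 scale
    float  d;              // scale so that value ~= d * q
    int8_t q[32];
};

static float block_dot(const BlockQ8 & a, const BlockQ8 & b) {
    // Accumulate the dot product in int32; this is the part that integer
    // dot-product instructions accelerate per block.
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) {
        acc += (int32_t) a.q[i] * (int32_t) b.q[i];
    }
    // Apply both blocks' scales once, outside the inner loop.
    return a.d * b.d * (float) acc;
}

int main() {
    BlockQ8 w{0.05f, {}}, x{0.02f, {}};
    for (int i = 0; i < 32; ++i) { w.q[i] = (int8_t)(i - 16); x.q[i] = (int8_t)(2 * i - 32); }
    printf("dequantized dot of one block: %f\n", block_dot(w, x));
    return 0;
}
```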