
@JohannesGaessler (Collaborator) commented Dec 3, 2025

Fixes #17610.

The problem is that this particular model suffers from numerical overflow in the FP16 VKQ accumulators. In principle the use of FP32 or BF16 accumulators could fix the issue, but that would be problematic in terms of either register pressure or lack of hardware support. For this reason I think the least bad option is to apply an offset to the KQ maximum used as the scale in the softmax. By adding $\ln 8$ to the maximum, both the VKQ accumulators and the KQ sums are effectively reduced by a factor of 8, so the range of representable values is in turn shifted upwards by a factor of 8. The downside is that values up to a larger threshold will be flushed to 0 (the effective underflow threshold goes up from $2^{-14}$ to $2^{-11}$). However, this effect should be negligible for the model outputs. This PR changes the LLaMA 3 8b q4_0 perplexity over the Wikitext 2 test set from 6.717177 to 6.717161.
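
For reference, a minimal sketch of the arithmetic (with symbols of my own choosing: $s_j$ for the raw KQ values, $m$ for their maximum, $v_j$ for the corresponding V rows): shifting the subtracted maximum by $\ln 8$ scales every exponential, and therefore both the VKQ accumulator and the KQ sum, by $\tfrac{1}{8}$, while the final normalized output is unchanged in exact arithmetic:

$$e^{s_j - (m + \ln 8)} = \tfrac{1}{8}\, e^{s_j - m}, \qquad \frac{\sum_j e^{s_j - (m + \ln 8)}\, v_j}{\sum_j e^{s_j - (m + \ln 8)}} = \frac{\tfrac{1}{8}\sum_j e^{s_j - m}\, v_j}{\tfrac{1}{8}\sum_j e^{s_j - m}} = \frac{\sum_j e^{s_j - m}\, v_j}{\sum_j e^{s_j - m}}.$$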

@gabe-l-hart (Collaborator)

Wow, there is a 0% chance that I would have found my way to this solution. Aside from knowing the fattn stack intimately, were there any tools/tricks you used to track down the overflow?

@gabe-l-hart (Collaborator)

Confirmed that this does solve the bug on my GB10, including the issue mentioned in #17610 (comment) where subsequent short-context requests also exhibit error behavior.

@JohannesGaessler (Collaborator, Author)

I was not aware of similar issues with other models or with our tests in test-backend-ops. It could of course simply be that our tests aren't covering a problematic tensor shape. But I suspected numerical issues since at the time of the report the kernel in fattn-mma-f16.cuh had not been touched for a long time and it seemed unlikely to me that a more general bug would have gone undetected for this long. I checked the outputs of the kernel and saw that they contain inf, -inf, and generally large absolute values. So that made me suspect a numerical overflow and I was able to fix the issue with this patch (a factor of 2.73 was already sufficient but I decided to use a larger one to be safe).
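
In case it helps others with similar debugging: the check itself can be as simple as scanning the kernel's FP16 output for inf/NaN and for values close to the FP16 limit after copying it back to the host. A minimal sketch (not the actual code I used; assumes a recent CUDA toolkit so that the FP16 conversion intrinsics are available on the host and the file is compiled with nvcc):

```cpp
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Scan an FP16 buffer for inf/NaN and report the largest absolute value,
// to spot outputs that are close to (or beyond) the FP16 range.
static void check_f16_buffer(const std::vector<__half> & buf, const char * name) {
    int   n_inf   = 0;
    int   n_nan   = 0;
    float max_abs = 0.0f;
    for (const __half & h : buf) {
        const float x = __half2float(h);
        if (std::isinf(x)) { n_inf++; continue; }
        if (std::isnan(x)) { n_nan++; continue; }
        max_abs = std::fmax(max_abs, std::fabs(x));
    }
    printf("%s: inf=%d nan=%d max_abs=%g (FP16 max is 65504)\n", name, n_inf, n_nan, max_abs);
}

int main() {
    // Small self-test: one normal value, one value that overflows FP16, one NaN.
    const std::vector<__half> buf = {__float2half(1.0f), __float2half(70000.0f), __float2half(NAN)};
    check_f16_buffer(buf, "test");
    return 0;
}
```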

@gabe-l-hart (Collaborator) commented Dec 3, 2025

Thanks! We have been seeing numerical overflow issues with a number of these smaller Granite models, especially the full-attention ones. I'll report back to the model team and make sure they're aware of this, in case there's anything they can do on the training side to keep things more friendly to limited numerical ranges.

And now that I type this, I wonder if this patch might also fix some of those issues. I'll dig a bit.

@github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Dec 3, 2025
@ggerganov (Member) left a comment


A bit worried that this change might have some side effects - not convinced that the perplexity test is enough. AFAIU the larger the context, the smaller the values would be. Perplexity runs with a fixed context of ~2048 tokens.

Using F32 accumulators seems like the proper solution, but it could affect the performance. So not really sure.

@JohannesGaessler (Collaborator, Author)

> AFAIU the larger the context, the smaller the values would be.

In principle one could do the softmax in FA by calculating the exponentials of the raw KQ values. But since that has a tendency to result in numerical overflows, one instead subtracts the KQ maximum (so far) from all KQ values. Beyond numerical differences this does not affect the final result, but it forces all input values for the VKQ matrix multiplication to be <= 1. The value being subtracted is essentially arbitrary, however. With this patch an extra 2.079 is subtracted from each KQ value, so post exponentiation the input values for the VKQ matrix multiplication will be <= 0.125. So there is a higher tolerance for numerically large values in the V cache. This is a constant factor and the context size is not relevant in that regard.
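
To make that concrete, here is a toy scalar illustration in plain C++ (made-up numbers, no relation to the actual tile layout in `fattn-mma-f16.cuh`): the same accumulation runs once without an offset and once with an offset of $\ln 8 \approx 2.079$, accumulating in FP32 but flagging any intermediate value that would no longer fit into a finite FP16 value.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int   n_kv     = 256;      // number of KV positions (illustrative)
    const float v_val    = 400.0f;   // every V entry, deliberately large
    const float fp16_max = 65504.0f; // largest finite FP16 value

    const float offsets[2] = {0.0f, std::log(8.0f)};
    for (const float offset : offsets) {
        float vkq = 0.0f; // VKQ accumulator (numerator)
        float sum = 0.0f; // KQ sum (denominator)
        bool  overflow = false;
        for (int i = 0; i < n_kv; i++) {
            // All logits equal their maximum m here, so exp(s - (m + offset))
            // simplifies to exp(-offset).
            const float p = std::exp(-offset);
            vkq += p * v_val;
            sum += p;
            overflow = overflow || std::fabs(vkq) > fp16_max || std::fabs(sum) > fp16_max;
        }
        printf("offset=%.4f  result=%g  FP16 accumulator would overflow: %s\n",
               offset, vkq / sum, overflow ? "yes" : "no");
    }
    return 0;
}
```

The final result is 400 in both cases; the only difference is that without the offset the accumulated numerator would have left the representable FP16 range along the way.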

The only way in which the context size could be relevant is if small, additive contributions post softmax have a large impact on the total result of VKQ. But this seems unlikely to me since the V cache seems to be relatively robust against quantization, i.e. noise. And the values post softmax typically have a few very large values and lots of extremely small values that the model is trying to push towards 0 anyway. For training the patch in this PR would be problematic since it would be zeroing out gradients, but for inference I think it will be fine.

I chose the current offset to be "conservative" in the sense of avoiding overflows, but if you are concerned that this changes the results too much we could also go with a lower offset. An offset of 0.6931 shifts the representable range by a factor of 2 and is already enough to fix the issue for Granite 4 1b + the test prompt. It would in principle also be possible to define these values in the GGUF files and to set them at runtime.

> Perplexity runs with a fixed context of ~2048 tokens.

It is actually a context size of 512 tokens. With a context size of 8192 tokens the LLaMA 3 8b perplexity changes from 5.6651 to 5.6645 with an offset of 2.079 or to 5.6643 with an offset of 0.6931.

> Using F32 accumulators seems like the proper solution, but it could affect the performance. So not really sure.

Each CUDA thread can at most use 255 registers. The MMA kernel runs on 64 Q/VKQ columns in its largest configuration. For a head size of 128 this requires 64 registers simply to store the Q input and VKQ output values during the kernel. With FP32 VKQ accumulators that would increase to 96 registers. For a head size of 256 each thread currently needs 128 registers and the kernel is still viable (with limitations). So for head sizes <= 128 FP32 accumulators could probably be made to work but not for head sizes 256 and 576/512. So it would not be a universal solution.
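
For concreteness, the back-of-the-envelope arithmetic behind those numbers, under the assumption (mine, not stated above) that the 64-column tile is distributed over 128 threads and that FP16 values are packed two per 32-bit register:

$$\frac{64 \cdot 128}{128\ \text{threads}} = 64\ \text{FP16 values per thread} \;\Rightarrow\; 32\ \text{registers for Q} + 32\ \text{for FP16 VKQ} = 64,$$

which grows to $32 + 64 = 96$ registers with FP32 VKQ accumulators; for head size 256 the same calculation gives $64 + 64 = 128$ of the at most 255 registers per thread.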

@ggerganov (Member) commented Dec 4, 2025

> And the values post softmax typically have a few very large values and lots of extremely small values that the model is trying to push towards 0 anyway. For training the patch in this PR would be problematic since it would be zeroing out gradients, but for inference I think it will be fine.

Yes, I agree with your analysis. This point here is my main concern: during training, the model didn't see these values flushed to zero. I understand the argument that they are likely negligible, but at the same time I can't say I have a good intuition for how the ensemble of many small attention weights adds up and what role they play in the end. Hence the concern.

> So for head sizes <= 128 FP32 accumulators could probably be made to work but not for head sizes 256 and 576/512. So it would not be a universal solution.

Yeah, I agree - it's the same issue in the Metal backend. I decided to still go with the F32 accumulators regardless, at the price of negatively affecting the performance of some models such as Gemma (head size = 256).

@JohannesGaessler force-pushed the cuda-fa-fix-vkq-overflow branch from 1dd1272 to 59e6cba on December 4, 2025
@JohannesGaessler (Collaborator, Author)

For now I've reduced the offset to 0.6931 (a factor of 2 post softmax). As long as people like @gabe-l-hart and the llama.cpp maintainers are aware that this is a possible source of numerical issues, the time lost on debugging should be comparatively small, and we can adjust the offset if necessary.

There are no tensor cores for (FP16, FP16) -> BF16 math, but we could in principle emulate it by doing (FP16, FP16) -> FP32 math in combination with bitwise operators. That would fix the issue of the numerical range, but in return we could have issues with reduced numerical precision in the VKQ accumulators.
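
For reference, a sketch of the bitwise part of that idea, i.e. storing an FP32 accumulator value as BF16 by keeping only its upper 16 bits (illustrative code, not from this PR; NaN handling is omitted):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// BF16 shares the FP32 sign and 8-bit exponent, so the conversion only has to
// drop the low 16 mantissa bits; the range stays that of FP32, only the
// precision is reduced.
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFF + ((bits >> 16) & 1); // round to nearest even (not NaN-safe)
    return (uint16_t)(bits >> 16);
}

static float bf16_to_fp32(uint16_t h) {
    const uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    const float x = 70000.0f; // does not fit into FP16 (max 65504) but is fine in BF16
    printf("%g -> bf16 -> %g\n", x, bf16_to_fp32(fp32_to_bf16(x)));
    return 0;
}
```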

@JohannesGaessler merged commit e95d0bc into ggml-org:master on Dec 5, 2025 (78 checks passed)
Development

Successfully merging this pull request may close this issue:

Eval bug: Granite-4.0-h-1B fails long context (>16k); same model in Transformers works, Granite-4.0-h-MicroGGUF works
