
Conversation

JohannesGaessler (Collaborator)

Fixes the issue described in #16528 (comment).

The problem, as far as I can tell, is numerical issues in the rescaling of the VKQ accumulators with the inverse of the KQ sum at the end of the kernel. The input values in test-backend-ops and in the models I tested did not provoke this issue, so I did not detect it in #16492. The fix is to simply use FP32 arithmetic; the impact on performance is negligible since this is only done once per CUDA block.
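To illustrate the kind of failure FP16 arithmetic can produce in a rescaling step like this, here is a hypothetical NumPy sketch (made-up values, not the kernel code; the real kernel operates on CUDA half types): once the KQ sum exceeds the FP16 range, its inverse becomes zero and the rescaled output collapses.

```python
import numpy as np

# Hypothetical illustration of rescaling an accumulator by the inverse
# of a sum, done in FP16 vs. FP32. Values are assumptions for the demo.
kq_sum = 70000.0   # a sum that exceeds the FP16 maximum (~65504)
vkq_acc = 123.0    # some accumulated value

# FP16 path: the sum overflows to inf, its inverse is 0, output becomes 0
inv16 = np.float16(1.0) / np.float16(kq_sum)  # float16(70000) -> inf, 1/inf -> 0
out16 = np.float16(vkq_acc) * inv16

# FP32 path: the rescaling is computed correctly
out32 = np.float32(vkq_acc) / np.float32(kq_sum)

print(out16, out32)  # 0.0 vs. roughly 0.001757
```

Even when the sum stays in range, FP16 subnormals near the bottom of the range lose relative precision, so doing this one division and multiply per block in FP32 is cheap insurance.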

github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Oct 12, 2025
IMbackK (Collaborator) left a comment


The code changes themselves look fine and I think it's a good idea to do this, even if the ultimate precision issue lies elsewhere, which I did not verify.

JohannesGaessler (Collaborator, Author) commented Oct 13, 2025

I definitely did observe issues with the numerical range when I debugged this. There seems to be an additional bug that specifically occurs when running >1 parallel blocks in the ne01 direction with ne01 % ncols1 != 0, where the KQ maximum comes from the zeroed-out padding.
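The padding effect described above can be sketched numerically. This is a hypothetical NumPy illustration (names and values are assumptions, not taken from the kernel) of how a zeroed-out padding slot entering the running KQ maximum can break the softmax when the real logits are strongly negative:

```python
import numpy as np

# Hypothetical sketch: all real KQ logits are strongly negative, while a
# zeroed-out padding slot would contribute 0 to the running maximum.
kq = np.array([-30.0, -25.0], dtype=np.float32)

# Correct: maximum taken over the real logits only
m = kq.max()                                   # -25
p = np.exp(kq - m).astype(np.float16)          # [exp(-5), 1], representable in FP16

# Buggy: the padding zero wins the maximum
m_bad = max(float(kq.max()), 0.0)              # 0
p_bad = np.exp(kq - m_bad).astype(np.float16)  # exp(-30), exp(-25) underflow to 0

kq_sum_bad = p_bad.sum()                       # 0 -> the final rescaling divides by zero
```

With the padding value dominating the maximum, the exponentials of the real logits underflow to zero in FP16, the KQ sum becomes zero, and the rescaling at the end of the kernel divides by zero.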

@ggerganov ggerganov merged commit 7049736 into ggml-org:master Oct 13, 2025
59 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025