Conversation

@JohannesGaessler
Collaborator

Fixes #16976.

The problem is that the CUDA kernel selection logic does not check strides, so it tries to run kernels whose stride requirements the tensors don't meet. The tests don't detect this because the strides are always constructed as 2*ne00.

@ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size, I think it would make sense to add one.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 4, 2025
@am17an
Collaborator

am17an commented Nov 4, 2025

Should we add a test for this as well? Maybe other backends have the same issue.

@ggerganov
Member

> @ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size, I think it would make sense to add one.

Yes - good point. You can either add it, or leave a TODO in llama_context constructor so I don't forget about this.

@JohannesGaessler
Collaborator Author

I changed the logic for views in test-backend-ops to use k/2 for the view instead of creating a tensor with 2*k and then taking a view of size k of that tensor. Only a few tests use views in the first place, so it should be fine.

@slaren
Member

slaren commented Nov 4, 2025

> I changed the logic for views in test-backend-ops to use k/2 for the view instead of creating a tensor with 2*k and then taking a view of size k of that tensor. Only a few tests use views in the first place, so it should be fine.

To me it would be very unintuitive if, for example, the test parameters say k=1024 and then the test is actually run with k=512.

@JohannesGaessler
Collaborator Author

How about this: instead of a boolean, specify some integer value as the view. The default is 0, which means no view; a non-zero value is used as dimension 0 of the view.

@slaren
Member

slaren commented Nov 4, 2025

> How about this: instead of a boolean, specify some integer value as the view. The default is 0, which means no view; a non-zero value is used as dimension 0 of the view.

The other way around. Use the value of k as the dimension of the view, and add another parameter (>=k) to specify the dimension of the parent tensor.

@github-actions github-actions bot added the testing (Everything test related) label on Nov 4, 2025
@JohannesGaessler JohannesGaessler merged commit aa37417 into ggml-org:master Nov 6, 2025
64 of 71 checks passed
```cpp
        if (src0_nb[i] % (2*ts) != 0) {
            return false;
        }
    }
```
Collaborator

I think this disables mmf for batch_size = 1. Is that expected?

Collaborator

Before

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7456.65 ± 45.82 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 146.77 ± 0.08 |

After

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7405.42 ± 53.04 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 129.49 ± 0.68 |

Successfully merging this pull request may close these issues.

Eval bug: CUDA "GGML_ASSERT(stride_row % 2 == 0) failed" when FA off for certain ctx lengths