Conversation

@JohannesGaessler
Collaborator

Fixes #16976.

The problem is that the CUDA kernel selection logic does not check strides, so it tries to run kernels whose stride requirements the tensors don't meet. The tests don't detect this because the strides are always constructed as 2*ne00.

@ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size, I think it would make sense to add one.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 4, 2025
@am17an
Collaborator

am17an commented Nov 4, 2025

Should we add a test for this as well? Maybe other backends have the same issue.

@ggerganov
Member

> @ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size, I think it would make sense to add one.

Yes - good point. You can either add it, or leave a TODO in llama_context constructor so I don't forget about this.

@JohannesGaessler
Collaborator Author

I changed the logic for views in test-backend-ops to use k/2 for the view instead of creating a tensor with 2*k and then taking a view of size k of that tensor. Only a few tests use views in the first place, so it should be fine.

@slaren
Member

slaren commented Nov 4, 2025

> I changed the logic for views in test-backend-ops to use k/2 for the view instead of creating a tensor with 2*k and then taking a view of size k of that tensor. Only a few tests use views in the first place, so it should be fine.

To me it would be very unintuitive if, for example, the test parameters say k=1024 and then the test is actually run with k=512.

@JohannesGaessler
Collaborator Author

How about this: instead of a boolean, specify some integer value as the view. The default is 0, which means no view; a non-zero value is used as dimension 0 of the view.

@slaren
Member

slaren commented Nov 4, 2025

> How about this: instead of a boolean, specify some integer value as the view. The default is 0, which means no view; a non-zero value is used as dimension 0 of the view.

The other way around. Use the value of k as the dimension of the view, and add another parameter (>=k) to specify the dimension of the parent tensor.

@github-actions github-actions bot added the testing (Everything test related) label on Nov 4, 2025
@JohannesGaessler JohannesGaessler merged commit aa37417 into ggml-org:master Nov 6, 2025
64 of 71 checks passed
```cpp
        if (src0_nb[i] % (2*ts) != 0) {
            return false;
        }
    }
```
Collaborator

I think this disables mmf for batch_size = 1. Is that expected?

Collaborator

Before

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7456.65 ± 45.82 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 146.77 ± 0.08 |

After

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7405.42 ± 53.04 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 129.49 ± 0.68 |

Successfully merging this pull request may close these issues.

Eval bug: CUDA "GGML_ASSERT(stride_row % 2 == 0) failed" when FA off for certain ctx lengths