CUDA: some micro-optimizations in mmf.cuh for mul_mat_id #15926
Conversation
My experience with `mmq_ids_helper` has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over `n_expert_used`.
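As an illustrative sketch of what that means (the helper name, signature, and data layout below are hypothetical stand-ins, not the actual `mmq_ids_helper` code): making the expert count a template parameter turns the inner loop's trip count into a compile-time constant, so the compiler can fully unroll it or eliminate it entirely.

```cuda
#include <cstdint>

// Hypothetical helper: with n_expert_used as a template parameter the inner
// loop has a compile-time trip count, so the unroll removes all loop overhead
// (and for n_expert_used == 1 the loop disappears altogether).
template <int n_expert_used>
static __global__ void ids_helper(const int32_t * ids, int32_t * slots, const int n_tokens) {
    const int token = blockIdx.x*blockDim.x + threadIdx.x;
    if (token >= n_tokens) {
        return;
    }
#pragma unroll
    for (int i = 0; i < n_expert_used; ++i) {
        slots[token*n_expert_used + i] = ids[token*n_expert_used + i];
    }
}
```

The cost is one kernel instantiation per supported expert count, dispatched via a host-side switch over the runtime value, which is the compilation-time concern raised later in this conversation.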
Force-pushed from c66fb36 to 94b189b
Unfortunately I still don't see a speedup in my tests; I tried with granite-moe and also test-backend-ops. Also I saw unrolling 16-32 …
Regarding register pressure: that is always the biggest limitation for matrix multiplications. For MMF to scale properly to larger batch sizes, the memory access patterns will need to be changed. Like in MMQ, it will be necessary to load the …

What you could do with less effort is extend the kernel to run more than one CUDA block in parallel for …
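The comment is cut off above; as one plausible reading of the lower-effort option, here is a hedged sketch (hypothetical kernel and names, not the actual mmf.cuh code) of letting several CUDA blocks make progress in parallel by adding a grid dimension that splits the output columns:

```cuda
// Hypothetical sketch: blockIdx.y splits the output columns, so several
// independent blocks work on the same matrix multiplication concurrently
// instead of one block handling all columns.
static __global__ void mmf_sketch(const float * x, const float * y, float * dst,
                                  const int ncols, const int cols_per_block) {
    const int col0    = blockIdx.y*cols_per_block;
    const int col_end = min(col0 + cols_per_block, ncols);

    for (int col = col0 + threadIdx.x; col < col_end; col += blockDim.x) {
        dst[col] = x[col]*y[col]; // placeholder for the per-column work
    }
}

// Launch-side idea: grid.y = (ncols + cols_per_block - 1) / cols_per_block;
```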
I did some prototyping and was also unable to get better performance than master. So for this PR I think we should remove the template parameter for the number of experts used again (to avoid needlessly increasing the compilation time), then I'll approve.
Force-pushed from 94b189b to bf08ea5
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Following #15767, I do not see a noticeable difference in performance, but this change has better memory coalescing and uses all available warps for finding slots. In general, this part of the code does not contribute significantly to the runtime in any case.
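A hedged sketch of the coalescing idea described above (hypothetical names and layout, not the actual code): every thread of every warp scans the token-to-expert ids together, so consecutive threads read consecutive `int32_t` values in a single memory transaction.

```cuda
#include <cstdint>

// Hypothetical slot-finding sketch: a grid-stride loop over the ids array.
// Thread t of a warp reads ids[base + t], so each warp issues one coalesced
// transaction, and all warps of the block participate in the scan.
static __global__ void find_slots(const int32_t * ids, int32_t * slots,
                                  const int n_ids, const int expert) {
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n_ids; i += gridDim.x*blockDim.x) {
        slots[i] = ids[i] == expert ? i : -1; // mark tokens routed to this expert
    }
}
```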
While looking at optimizing the kernel, I noticed that it is overall bound by register pressure, which limits occupancy. I tried adding `#pragma unroll 1` to dial back some of the unrolling, but that only made performance worse.
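For reference, this is the mechanism that was tried (the loop below is an illustrative stand-in, not the actual kernel): `#pragma unroll 1` keeps a loop rolled, which reduces the number of live registers at the cost of instruction-level parallelism.

```cuda
// Illustrative stand-in: forcing the loop to stay rolled with #pragma unroll 1.
// Fewer registers stay live across iterations, but the compiler can no longer
// overlap independent iterations; per the comment above, that trade-off did
// not pay off here.
static __device__ float dot(const float * a, const float * b, const int n) {
    float sum = 0.0f;
#pragma unroll 1
    for (int i = 0; i < n; ++i) {
        sum += a[i]*b[i];
    }
    return sum;
}
```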