Conversation

@am17an (Collaborator) commented Oct 11, 2025

This PR adds a new kernel in mmf for larger MoE batch sizes. It leverages mmq_ids_helper and adds double-buffering for the gather of src1 columns based on ids_src_sorted.

It is currently faster than the cuBLAS fallback up to n_tokens = 512 when ne01 <= 1024; beyond that it is only faster for n_tokens <= 128. Supporting larger ne01 would require a bigger rewrite in which a CTA loads multiple tiles and operates on them. cuBLAS appears to switch to a 128x128 tile once ne01 >= 1024, which keeps the tensor cores better utilized. Other things I tried that didn't work: increasing the rows per block and columns per block, double-buffering the x tiles, and operating on multiple tiles in one kernel.
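(Editorial illustration, not part of the original description: a minimal sketch of the gather that the sorted ids enable. The PR's kernel gathers directly into its tiles rather than materializing a copy; the kernel name and launch shape below are hypothetical.)

```cpp
#include <cstdint>

// Hypothetical standalone gather: once the ids helper has produced src1 row indices
// sorted by expert (plus per-expert offsets), each expert's rows can be read from a
// contiguous range of the sorted index array.
__global__ void gather_src1_rows(
        const float   * __restrict__ src1,       // [n_rows, ne10], row-major
        const int32_t * __restrict__ ids_sorted, // src1 row indices, sorted by expert
        float         * __restrict__ dst,        // gathered rows, contiguous per expert
        const int64_t ne10) {
    const int64_t row_sorted = blockIdx.x;             // position in the sorted order
    const int64_t row_src    = ids_sorted[row_sorted]; // original src1 row
    for (int64_t col = threadIdx.x; col < ne10; col += blockDim.x) {
        dst[row_sorted*ne10 + col] = src1[row_src*ne10 + col];
    }
}
```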

All of the latest Qwen models seem to have ne01 <= 1024, so this should be helpful for them, provided someone wants to run them at the original precision and the default ubatch size.

On an A100 for qwen3:

| Model | Microbatch size | Test | t/s master | t/s patch | Speedup |
|---|---|---|---|---|---|
| qwen3moe 30B.A3B BF16 | 32 | pp4096 | 843.14 | 1113.16 | 1.32 |
| qwen3moe 30B.A3B BF16 | 64 | pp4096 | 974.03 | 1610.87 | 1.65 |
| qwen3moe 30B.A3B BF16 | 128 | pp4096 | 625.70 | 2100.57 | 3.36 |
| qwen3moe 30B.A3B BF16 | 256 | pp4096 | 1001.44 | 1875.85 | 1.87 |
| qwen3moe 30B.A3B BF16 | 512 | pp4096 | 1486.45 | 2312.54 | 1.56 |

For a smaller MoE model like lfm2moe, where ne01 = 1792 (on a 3090):

| Model | Microbatch size | Test | t/s master | t/s cuda_mmf_mmid_opt | Speedup |
|---|---|---|---|---|---|
| lfm2moe 8B.A1B BF16 | 32 | pp4096 | 1054.41 | 1482.72 | 1.41 |
| lfm2moe 8B.A1B BF16 | 64 | pp4096 | 1209.13 | 2243.19 | 1.86 |
| lfm2moe 8B.A1B BF16 | 128 | pp4096 | 1672.92 | 2925.38 | 1.75 |
| lfm2moe 8B.A1B BF16 | 256 | pp4096 | 2651.78 | 2702.05 | 1.02 |
| lfm2moe 8B.A1B BF16 | 512 | pp4096 | 3805.54 | 3861.64 | 1.01 |

Others like granite-4 also benefit:

| Model | Microbatch size | Test | t/s master | t/s cuda_mmf_mmid_opt | Speedup |
|---|---|---|---|---|---|
| granitehybrid ?B F16 | 32 | pp4096 | 1166.16 | 1274.41 | 1.09 |
| granitehybrid ?B F16 | 64 | pp4096 | 1497.29 | 2016.67 | 1.35 |
| granitehybrid ?B F16 | 128 | pp4096 | 926.84 | 2633.70 | 2.84 |
| granitehybrid ?B F16 | 256 | pp4096 | 1525.48 | 2532.24 | 1.66 |
| granitehybrid ?B F16 | 512 | pp4096 | 2160.88 | 3154.89 | 1.46 |

@github-actions bot added labels on Oct 11, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
@am17an force-pushed the cuda_mmf_mmid_opt branch from 089888f to 3183a8e on October 11, 2025 09:22
@JohannesGaessler (Collaborator) left a comment

Without having looked at the kernel in a lot of detail, I am getting the impression that you did not target specific occupancies. If you look at the CUDA documentation, you'll find that the amount of SRAM per SM is not the same across generations. For high compute utilization my recommendation would be to target either a single CUDA block with 256 threads or 2 CUDA blocks with 128 threads per SM. The reasoning behind this is that with maximum register use you can run a total of 256 threads in parallel. With a single CUDA block you can then also use larger tile sizes and get higher arithmetic intensity, but at the cost of underutilization whenever you call __syncthreads().
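(Editorial sketch of the two occupancy targets, assuming the 64K-register-per-SM budget of recent architectures; the kernel names and empty bodies are placeholders, only the __launch_bounds__ hints are the point.)

```cpp
// Option A: one 256-thread block resident per SM. Larger tiles and higher arithmetic
// intensity are possible, but the whole SM idles at every __syncthreads().
__global__ void __launch_bounds__(256, 1) mmf_one_block(/* args */) {
    // with 64K registers per SM, each of the 256 threads can still use up to 255 registers
}

// Option B: two 128-thread blocks resident per SM. Smaller tiles, but one block can
// keep issuing work while the other waits at a barrier.
__global__ void __launch_bounds__(128, 2) mmf_two_blocks(/* args */) {
    // same total register budget: 2 x 128 threads share the SM's register file
}
```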

@am17an (Collaborator, Author) commented Oct 11, 2025

I did try 4 warps instead of 8 when ne01 is larger; it improves performance by about 20-30%, whereas cuBLAS is faster by a larger amount. I only tested on a 3090 and a 4090; on the 3090 the effect of this change is much more pronounced, so much so that the kernel reaches parity with cuBLAS but doesn't beat it. Hence, to keep things simple for the first PR of this kernel, I didn't branch on the architecture. However, if you think there's something worth exploring more regarding the number of warps and the architecture, I'd be happy to do so.

@JohannesGaessler (Collaborator) left a comment

The biggest problem that this kernel has is that it does not improve upon the arithmetic intensity of the preexisting kernel: for optimal compute efficiency you want to do as many floating-point operations as possible per byte of data loaded. For the memory pattern I originally implemented this was not a priority: each warp loads only its own data, so there is no need to synchronize the warps until the end. As a consequence the arithmetic intensity and the maximum achievable compute utilization are low, but the batch size puts a hard limit on that anyways. For large batch sizes, however, the memory pattern should be different: warps should cooperate to load data into SRAM, synchronize, read that data back with multiple warps loading the same data from SRAM multiple times, do the matrix multiplication, synchronize again, and then load the next data.
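(Editorial sketch of that cooperative pattern, not the existing mmf kernel: half inputs, float accumulators, placeholder tile sizes, and dimensions assumed divisible by the tile sizes and the block size.)

```cpp
#include <cstdint>
#include <cuda_fp16.h>

template <int TILE_K, int TILE_N>
__global__ void mm_cooperative_tiles(const half * __restrict__ A,  // [M, K], row-major
                                     const half * __restrict__ B,  // [K, N], row-major
                                     float      * __restrict__ C,  // [M, N], row-major
                                     const int K, const int N) {
    __shared__ half tile_B[TILE_K][TILE_N];

    const int row  = blockIdx.x*blockDim.x + threadIdx.x;  // one output row per thread
    const int col0 = blockIdx.y*TILE_N;                    // this CTA's column block

    float acc[TILE_N] = {0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // 1) all warps of the block cooperate to load one shared tile of B
        for (int i = threadIdx.x; i < TILE_K*TILE_N; i += blockDim.x) {
            tile_B[i/TILE_N][i%TILE_N] = B[(int64_t)(k0 + i/TILE_N)*N + col0 + i%TILE_N];
        }
        __syncthreads();

        // 2) every thread re-reads the shared tile: each B element fetched once from
        //    global memory is consumed blockDim.x times -> higher arithmetic intensity
        for (int kk = 0; kk < TILE_K; ++kk) {
            const float a = __half2float(A[(int64_t)row*K + k0 + kk]);
            for (int j = 0; j < TILE_N; ++j) {
                acc[j] += a*__half2float(tile_B[kk][j]);
            }
        }
        __syncthreads();  // 3) don't overwrite the tile while other warps still read it
    }

    for (int j = 0; j < TILE_N; ++j) {
        C[(int64_t)row*N + col0 + j] = acc[j];
    }
}
```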

How do you want to proceed with this PR? Do you want to merge it with the current memory pattern or implement a different one for large batch sizes? Even without a different memory pattern, compacting the expert ids is obviously beneficial.

Comment on lines +3760 to +3762:

```cpp
void ggml_cuda_launch_mmq_ids_helper(
    const int32_t * ids, int32_t * ids_src1, int32_t * ids_dst, int32_t * expert_bounds,
    int n_experts, int n_tokens, int n_expert_used, int nchannels_y, int si1, int sis1, cudaStream_t stream);
```
I think this kernel should be moved to a separate file like mmid.cu.

@JohannesGaessler (Collaborator) commented:

> if you think there's something worth exploring more regarding the number of warps and the architecture, I'd be happy to do so.

For the current memory pattern the number of warps essentially doesn't matter if the matrices are sufficiently large. Without having looked at NSight Compute, I suspect that the difference between 4 and 8 warps comes from tail effects, where some performance is lost at the end when the GPU runs out of work. I would say not to focus on that until the kernel can be pushed to output tiles of at least 64x64; then it would make sense to look into e.g. a stream-k decomposition to reduce tail effects (this is particularly relevant for datacenter GPUs with lots of SMs).

@JohannesGaessler (Collaborator) commented:

To be clear: my earlier comment about occupancy applies specifically to a kernel with large output tiles, i.e. at least 64x64; that is when you'll have to start thinking about SRAM vs. register limits.

@JohannesGaessler (Collaborator) commented:

Can you explain what exactly you mean by "double-buffering"? I would have thought you meant the use of asynchronous data copies, as is done for example in fattn-mma.cuh via the interface exposed in cp-async.cuh. Notably, on Ampere or newer, asynchronous copies are hardware-accelerated and reduce register pressure even without double-buffering. If you want to add them, my suggestion would be to keep a kernel with synchronous copies for Volta and Turing.

For synchronous copies you can also look into ggml_cuda_memcpy_1. In principle it would allow you to issue fewer instructions for the same amount of data (but I doubt that this is the bottleneck for the current kernel).
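(Editorial sketch of a hardware-accelerated 16-byte async copy, written as hand-rolled PTX rather than the cp-async.cuh interface; the helper names are hypothetical and a synchronous fallback path is shown for Volta/Turing.)

```cpp
#include <cstdint>

// Hypothetical helper: 16-byte copy from global to shared memory, asynchronous on
// Ampere or newer (sm_80+), synchronous int4 copy as the fallback for older GPUs.
static __device__ __forceinline__ void copy_16B(void * dst_shared, const void * src_global) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    const unsigned dst = (unsigned) __cvta_generic_to_shared(dst_shared);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;" :: "r"(dst), "l"(src_global));
#else
    *(int4 *) dst_shared = *(const int4 *) src_global;
#endif
}

// Hypothetical helper: make all previously issued async copies visible to the block.
static __device__ __forceinline__ void copy_wait_all() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;" ::: "memory");
#endif
    __syncthreads();
}
```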

Overall, please tell me whether you intend to make further changes to the kernel in this PR prior to merging.

@am17an (Collaborator, Author) commented Oct 12, 2025

> Can you explain what exactly you mean by "double-buffering"? I would have thought you meant the use of asynchronous data copies, as is done for example in fattn-mma.cuh via the interface exposed in cp-async.cuh. Notably, on Ampere or newer, asynchronous copies are hardware-accelerated and reduce register pressure even without double-buffering. If you want to add them, my suggestion would be to keep a kernel with synchronous copies for Volta and Turing.

The y-tiles have two ping-pong buffers: the global load into one buffer can be issued while the previous buffer is being computed on, without stalling the warp, thanks to instruction-level parallelism. It's not as efficient as cp.async, but it improves performance while keeping the code relatively simple. I will add cp.async in the future. Nsight Compute confirmed that loading the scattered y-tiles was indeed a bottleneck (33% of the warp stalls), and this change removed it.
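(Editorial sketch of the ping-pong pattern, reduced to a scalar dot product over gathered rows; names are hypothetical and this is not the PR's code.)

```cpp
#include <cstdint>

// The next gathered y value is loaded into the other register buffer before the
// current one is consumed, so the global load can overlap with the FMA through
// instruction-level parallelism, without cp.async. Assumes n >= 1.
static __device__ float dot_gathered(const float   * __restrict__ y,
                                     const int32_t * __restrict__ ids_sorted,
                                     const float   * __restrict__ x,
                                     const int n) {
    float buf[2];
    float sum = 0.0f;

    buf[0] = y[ids_sorted[0]];                        // prologue: fill the first buffer
    for (int i = 0; i < n; ++i) {
        if (i + 1 < n) {
            buf[(i + 1) & 1] = y[ids_sorted[i + 1]];  // issue the next load early ...
        }
        sum += x[i]*buf[i & 1];                       // ... while consuming the previous one
    }
    return sum;
}
```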

> Overall, please tell me whether you intend to make further changes to the kernel in this PR prior to merging.

Let me try loading data into shared memory for a 64x64 tile to improve the arithmetic intensity. If that is not helpful, then we can merge as is, since the kernel is still useful for a large variety of models.
