Conversation

@am17an (Collaborator) commented Oct 11, 2025

This PR adds a new kernel in mmf for larger MoE batch sizes. It leverages mmq_ids_helper and adds double-buffering for the gather of src1 columns based on ids_src_sorted.

It is currently faster than the cuBLAS fallback up to n_tokens = 512 when ne01 <= 1024; beyond that it is only faster for n_tokens <= 128. Supporting larger ne01 would require a bigger rewrite in which a CTA loads multiple tiles and operates on them. cuBLAS appears to switch to a 128x128 tile once ne01 >= 1024, which keeps the tensor cores better utilized. Other things I tried that didn't work: increasing the rows per block and columns per block, double-buffering the x tiles, and operating on multiple tiles in one kernel.
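(Editorial illustration, not part of the original description: a minimal sketch of the gather that the sorted ids enable. The PR's kernel gathers directly into its tiles rather than materializing a copy; the kernel name and launch shape below are hypothetical.)

```cpp
#include <cstdint>

// Hypothetical standalone gather: once the ids helper has produced src1 row indices
// sorted by expert (plus per-expert offsets), each expert's rows can be read from a
// contiguous range of the sorted index array.
__global__ void gather_src1_rows(
        const float   * __restrict__ src1,       // [n_rows, ne10], row-major
        const int32_t * __restrict__ ids_sorted, // src1 row indices, sorted by expert
        float         * __restrict__ dst,        // gathered rows, contiguous per expert
        const int64_t ne10) {
    const int64_t row_sorted = blockIdx.x;             // position in the sorted order
    const int64_t row_src    = ids_sorted[row_sorted]; // original src1 row
    for (int64_t col = threadIdx.x; col < ne10; col += blockDim.x) {
        dst[row_sorted*ne10 + col] = src1[row_src*ne10 + col];
    }
}
```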

All of the latest Qwen models seem to have ne01 <= 1024, so this should be helpful for them, provided someone wants to run them at the original precision and the default ubatch size.

On an A100 for qwen3:

| Model | Microbatch size | Test | t/s master | t/s patch | Speedup |
|---|---|---|---|---|---|
| qwen3moe 30B.A3B BF16 | 32 | pp4096 | 843.14 | 1113.16 | 1.32 |
| qwen3moe 30B.A3B BF16 | 64 | pp4096 | 974.03 | 1610.87 | 1.65 |
| qwen3moe 30B.A3B BF16 | 128 | pp4096 | 625.70 | 2100.57 | 3.36 |
| qwen3moe 30B.A3B BF16 | 256 | pp4096 | 1001.44 | 1875.85 | 1.87 |
| qwen3moe 30B.A3B BF16 | 512 | pp4096 | 1486.45 | 2312.54 | 1.56 |

For a smaller MoE model like lfm2moe, where ne01 = 1792 (on a 3090):

| Model | Microbatch size | Test | t/s master | t/s cuda_mmf_mmid_opt | Speedup |
|---|---|---|---|---|---|
| lfm2moe 8B.A1B BF16 | 32 | pp4096 | 1054.41 | 1482.72 | 1.41 |
| lfm2moe 8B.A1B BF16 | 64 | pp4096 | 1209.13 | 2243.19 | 1.86 |
| lfm2moe 8B.A1B BF16 | 128 | pp4096 | 1672.92 | 2925.38 | 1.75 |
| lfm2moe 8B.A1B BF16 | 256 | pp4096 | 2651.78 | 2702.05 | 1.02 |
| lfm2moe 8B.A1B BF16 | 512 | pp4096 | 3805.54 | 3861.64 | 1.01 |

Others like granite-4 also benefit:

| Model | Microbatch size | Test | t/s master | t/s cuda_mmf_mmid_opt | Speedup |
|---|---|---|---|---|---|
| granitehybrid ?B F16 | 32 | pp4096 | 1166.16 | 1274.41 | 1.09 |
| granitehybrid ?B F16 | 64 | pp4096 | 1497.29 | 2016.67 | 1.35 |
| granitehybrid ?B F16 | 128 | pp4096 | 926.84 | 2633.70 | 2.84 |
| granitehybrid ?B F16 | 256 | pp4096 | 1525.48 | 2532.24 | 1.66 |
| granitehybrid ?B F16 | 512 | pp4096 | 2160.88 | 3154.89 | 1.46 |

@github-actions bot added labels on Oct 11, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
@am17an force-pushed the cuda_mmf_mmid_opt branch from 089888f to 3183a8e on October 11, 2025 09:22
@JohannesGaessler (Collaborator) left a comment

Without having looked at the kernel in a lot of detail, I am getting the impression that you did not target specific occupancies. If you look at the CUDA documentation, you'll find that the amount of SRAM per SM is not the same across generations. For high compute utilization my recommendation would be to target either a single CUDA block with 256 threads or 2 CUDA blocks with 128 threads per SM. The reasoning behind this is that with maximum register use you can run a total of 256 threads in parallel. With a single CUDA block you can then also use larger tile sizes and get higher arithmetic intensity, but at the cost of underutilization whenever you call __syncthreads().
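(Editorial sketch of the two occupancy targets, assuming the 64K-register-per-SM budget of recent architectures; the kernel names and empty bodies are placeholders, only the __launch_bounds__ hints are the point.)

```cpp
// Option A: one 256-thread block resident per SM. Larger tiles and higher arithmetic
// intensity are possible, but the whole SM idles at every __syncthreads().
__global__ void __launch_bounds__(256, 1) mmf_one_block(/* args */) {
    // with 64K registers per SM, each of the 256 threads can still use up to 255 registers
}

// Option B: two 128-thread blocks resident per SM. Smaller tiles, but one block can
// keep issuing work while the other waits at a barrier.
__global__ void __launch_bounds__(128, 2) mmf_two_blocks(/* args */) {
    // same total register budget: 2 x 128 threads share the SM's register file
}
```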

@am17an (Collaborator, Author) commented Oct 11, 2025

I did try 4 warps instead of 8 when ne01 is larger; it improves performance by about 20-30%, whereas cuBLAS is faster by a larger amount. I only tested on a 3090 and a 4090; on the 3090 the effect of this change is much more pronounced, so much so that the kernel reaches parity with cuBLAS but doesn't beat it. Hence, to keep things simple for the first PR of this kernel, I didn't branch on the architecture. However, if you think there's something worth exploring more regarding the number of warps and the architecture, I'd be happy to do so.

@JohannesGaessler (Collaborator) left a comment

The biggest problem that this kernel has is that it does not improve upon the arithmetic intensity of the preexisting kernel: for optimal compute efficiency you want to do as many floating-point operations as possible per byte of data loaded. For the memory pattern I originally implemented this was not a priority: each warp loads only its own data, so there is no need to synchronize the warps until the end. As a consequence the arithmetic intensity and the maximum achievable compute utilization are low, but the batch size puts a hard limit on that anyways. For large batch sizes, however, the memory pattern should be different: warps should cooperate to load data into SRAM, synchronize, read that data back with multiple warps loading the same data from SRAM multiple times, do the matrix multiplication, synchronize again, and then load the next data.
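(Editorial sketch of that cooperative pattern, not the existing mmf kernel: half inputs, float accumulators, placeholder tile sizes, and dimensions assumed divisible by the tile sizes and the block size.)

```cpp
#include <cstdint>
#include <cuda_fp16.h>

template <int TILE_K, int TILE_N>
__global__ void mm_cooperative_tiles(const half * __restrict__ A,  // [M, K], row-major
                                     const half * __restrict__ B,  // [K, N], row-major
                                     float      * __restrict__ C,  // [M, N], row-major
                                     const int K, const int N) {
    __shared__ half tile_B[TILE_K][TILE_N];

    const int row  = blockIdx.x*blockDim.x + threadIdx.x;  // one output row per thread
    const int col0 = blockIdx.y*TILE_N;                    // this CTA's column block

    float acc[TILE_N] = {0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // 1) all warps of the block cooperate to load one shared tile of B
        for (int i = threadIdx.x; i < TILE_K*TILE_N; i += blockDim.x) {
            tile_B[i/TILE_N][i%TILE_N] = B[(int64_t)(k0 + i/TILE_N)*N + col0 + i%TILE_N];
        }
        __syncthreads();

        // 2) every thread re-reads the shared tile: each B element fetched once from
        //    global memory is consumed blockDim.x times -> higher arithmetic intensity
        for (int kk = 0; kk < TILE_K; ++kk) {
            const float a = __half2float(A[(int64_t)row*K + k0 + kk]);
            for (int j = 0; j < TILE_N; ++j) {
                acc[j] += a*__half2float(tile_B[kk][j]);
            }
        }
        __syncthreads();  // 3) don't overwrite the tile while other warps still read it
    }

    for (int j = 0; j < TILE_N; ++j) {
        C[(int64_t)row*N + col0 + j] = acc[j];
    }
}
```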

How do you want to proceed with this PR? Do you want to merge it with the current memory pattern or implement a different one for large batch sizes? Even without a different memory pattern, compacting the expert ids is obviously beneficial.

Comment on lines +3760 to +3762:

```cpp
void ggml_cuda_launch_mmq_ids_helper(
    const int32_t * ids, int32_t * ids_src1, int32_t * ids_dst, int32_t * expert_bounds,
    int n_experts, int n_tokens, int n_expert_used, int nchannels_y, int si1, int sis1, cudaStream_t stream);
```
I think this kernel should be moved to a separate file like mmid.cu.

@JohannesGaessler (Collaborator) commented:

> if you think there's something worth exploring more regarding the number of warps and the architecture, I'd be happy to do so.

For the current memory pattern the number of warps essentially doesn't matter if the matrices are sufficiently large. Without having looked at NSight Compute, I suspect that the difference between 4 and 8 warps comes from tail effects, where some performance is lost at the end when the GPU runs out of work. I would say not to focus on that until the kernel can be pushed to output tiles of at least 64x64; then it would make sense to look into e.g. a stream-k decomposition to reduce tail effects (this is particularly relevant for datacenter GPUs with lots of SMs).

@JohannesGaessler (Collaborator) commented:

To be clear: my earlier comment about occupancy applies specifically to a kernel with large output tiles, i.e. at least 64x64; that is when you'll have to start thinking about SRAM vs. register limits.

@JohannesGaessler (Collaborator) commented:

Can you explain what exactly you mean by "double-buffering"? I would have thought you meant the use of asynchronous data copies, as is done for example in fattn-mma.cuh via the interface exposed in cp-async.cuh. Notably, on Ampere or newer, asynchronous copies are hardware-accelerated and reduce register pressure even without double-buffering. If you want to add them, my suggestion would be to keep a kernel with synchronous copies for Volta and Turing.

For synchronous copies you can also look into ggml_cuda_memcpy_1. In principle it would allow you to issue fewer instructions for the same amount of data (but I doubt that this is the bottleneck for the current kernel).
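(Editorial sketch of a hardware-accelerated 16-byte async copy, written as hand-rolled PTX rather than the cp-async.cuh interface; the helper names are hypothetical and a synchronous fallback path is shown for Volta/Turing.)

```cpp
#include <cstdint>

// Hypothetical helper: 16-byte copy from global to shared memory, asynchronous on
// Ampere or newer (sm_80+), synchronous int4 copy as the fallback for older GPUs.
static __device__ __forceinline__ void copy_16B(void * dst_shared, const void * src_global) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    const unsigned dst = (unsigned) __cvta_generic_to_shared(dst_shared);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;" :: "r"(dst), "l"(src_global));
#else
    *(int4 *) dst_shared = *(const int4 *) src_global;
#endif
}

// Hypothetical helper: make all previously issued async copies visible to the block.
static __device__ __forceinline__ void copy_wait_all() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;" ::: "memory");
#endif
    __syncthreads();
}
```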

Overall, please tell me whether you intend to make further changes to the kernel in this PR prior to merging.

@am17an (Collaborator, Author) commented Oct 12, 2025

> Can you explain what exactly you mean by "double-buffering"? I would have thought you meant the use of asynchronous data copies, as is done for example in fattn-mma.cuh via the interface exposed in cp-async.cuh. Notably, on Ampere or newer, asynchronous copies are hardware-accelerated and reduce register pressure even without double-buffering. If you want to add them, my suggestion would be to keep a kernel with synchronous copies for Volta and Turing.

The y-tiles have two ping-pong buffers: the global load into one buffer can be issued while the previous buffer is being computed on, without stalling the warp, thanks to instruction-level parallelism. It's not as efficient as cp.async, but it improves performance while keeping the code relatively simple. I will add cp.async in the future. Nsight Compute confirmed that loading the scattered y-tiles was indeed a bottleneck (33% of the warp stalls), and this change removed it.
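(Editorial sketch of the ping-pong pattern, reduced to a scalar dot product over gathered rows; names are hypothetical and this is not the PR's code.)

```cpp
#include <cstdint>

// The next gathered y value is loaded into the other register buffer before the
// current one is consumed, so the global load can overlap with the FMA through
// instruction-level parallelism, without cp.async. Assumes n >= 1.
static __device__ float dot_gathered(const float   * __restrict__ y,
                                     const int32_t * __restrict__ ids_sorted,
                                     const float   * __restrict__ x,
                                     const int n) {
    float buf[2];
    float sum = 0.0f;

    buf[0] = y[ids_sorted[0]];                        // prologue: fill the first buffer
    for (int i = 0; i < n; ++i) {
        if (i + 1 < n) {
            buf[(i + 1) & 1] = y[ids_sorted[i + 1]];  // issue the next load early ...
        }
        sum += x[i]*buf[i & 1];                       // ... while consuming the previous one
    }
    return sum;
}
```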

> Overall, please tell me whether you intend to make further changes to the kernel in this PR prior to merging.

Let me try loading data into shared memory for a 64x64 tile to improve the arithmetic intensity. If that is not helpful, then we can merge as is, since the kernel is still useful for a large variety of models.
