CUDA: batch out_prod inner loop with cublasSgemmStridedBatched #22651
Merged
JohannesGaessler merged 3 commits into ggml-org:master on May 7, 2026
Conversation
am17an approved these changes on May 4, 2026
JohannesGaessler approved these changes on May 4, 2026
JohannesGaessler approved these changes on May 7, 2026
Overview
Replace the per-element `cublasSgemm` loop in `ggml_cuda_out_prod` with `cublasSgemmStridedBatched` for the common case (`dps2 == 1 && ne2 > 1`), batching the inner `i2` loop into a single cuBLAS call per `i3` and removing the existing `// TODO batched matrix multiplication` comment.

The original loop is kept for `ne2 == 1` (no batching benefit, and it avoids the overhead of `cublasSgemmStridedBatched(..., batchCount=1)`) and for `dps2 > 1` (`src0` is reused/broadcast along dim 2 and cannot be represented as a single fixed-stride batch; the pointer-array `cublasSgemmBatched` variant could cover this in a follow-up).

A small `ne2` sweep is added to `tests/test-backend-ops.cpp` to exercise both the new strided path and the gate boundary at `ne2 == 1`.

Additional information
The strided path narrows `ne2` from `int64_t` to `int` for the cuBLAS `batchCount` argument, so an assert and a named local make the narrowing explicit.

The benchmark cases use small matrices (`m=256`, `n=16`, `k=16`) where per-call cuBLAS overhead dominates the GPU work. The large speedups below are expected for this small-GEMM, many-batch case; for larger matrices the speedup should be smaller, as the GEMM work amortizes the call overhead.

Test environment

`-DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release`, `120a-real`

Correctness
Performance
Command:
`ne2` sweep added by this PR, all with `dps2 == 1`:

- `OUT_PROD(m=256,n=16,k=16,bs=[1,1],nr=[1,1])` (ne2=1, fallback)
- `OUT_PROD(m=256,n=16,k=16,bs=[8,1],nr=[1,1])` (ne2=8, batched)
- `OUT_PROD(m=256,n=16,k=16,bs=[16,1],nr=[1,1])` (ne2=16, batched)
- `OUT_PROD(m=256,n=16,k=16,bs=[32,1],nr=[1,1])` (ne2=32, batched)

The `ne2 == 1` case is unchanged, confirming the fallback gate. Larger `ne2` cases show the expected call-overhead amortization from replacing many small `cublasSgemm` calls with one strided-batched call.