
CUDA: batch out_prod inner loop with cublasSgemmStridedBatched #22651

Merged
JohannesGaessler merged 3 commits into ggml-org:master from leonardHONG:cuda-out-prod-batched
May 7, 2026

Conversation

@leonardHONG
Contributor

Overview

Replace the per-slice cublasSgemm loop in ggml_cuda_out_prod with cublasSgemmStridedBatched for the common case (dps2 == 1 && ne2 > 1). The inner i2 loop is batched into a single cuBLAS call per i3, and the existing // TODO batched matrix multiplication comment is removed.

The original loop is kept for ne2 == 1 (no batching benefit, and avoids the overhead of cublasSgemmStridedBatched(..., batchCount=1)) and for dps2 > 1 (src0 is reused/broadcast along dim 2 and cannot be represented as a single fixed-stride batch; the pointer-array cublasSgemmBatched variant could cover this in a follow-up).

A small ne2 sweep is added to tests/test-backend-ops.cpp to exercise both the new strided path and the gate boundary at ne2 == 1.

Additional information

The strided path narrows ne2 from int64_t to int for the cuBLAS batchCount argument, so an assert and named local make this explicit:

GGML_ASSERT(ne2 <= std::numeric_limits<int>::max());
const int batch_count = (int) ne2;

The benchmark cases use small matrices (m=256, n=16, k=16) where per-call cuBLAS overhead dominates the GPU work. The large speedups below are expected for this small-GEMM / many-batch case; for larger matrices, the speedup should be smaller as the GEMM work amortizes the call overhead.

Test environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
  • CUDA: 12.9.86
  • OS: Ubuntu 22.04.5 LTS
  • Build: -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
  • CUDA arch: 120a-real

Correctness

./build/bin/test-backend-ops -o OUT_PROD -b CUDA0

12199/12199 PASS

Performance

Command:

./build/bin/test-backend-ops perf -o OUT_PROD -b CUDA0

ne2 sweep added by this PR, all with dps2 == 1:

Case                                          ne2  Path      Master GB/s  Branch GB/s  Speedup
OUT_PROD(m=256,n=16,k=16,bs=[1,1],nr=[1,1])     1  fallback         7.64         7.63   0.999x
OUT_PROD(m=256,n=16,k=16,bs=[8,1],nr=[1,1])     8  batched          1.79       116.37    65.0x
OUT_PROD(m=256,n=16,k=16,bs=[16,1],nr=[1,1])   16  batched          1.74       233.84   134.4x
OUT_PROD(m=256,n=16,k=16,bs=[32,1],nr=[1,1])   32  batched          4.61       463.88   100.6x

The ne2 == 1 case is unchanged, confirming the fallback gate. Larger ne2 cases show the expected call-overhead amortization from replacing many small cublasSgemm calls with one strided-batched call.

Requirements

@leonardHONG leonardHONG requested review from a team and ggerganov as code owners May 3, 2026 15:28
@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) May 3, 2026
@leonardHONG leonardHONG requested a review from IMbackK as a code owner May 5, 2026 00:17
@JohannesGaessler JohannesGaessler merged commit 05ff59c into ggml-org:master May 7, 2026
47 checks passed
