
CUDA: batch out_prod inner loop with cublasSgemmStridedBatched #22651

Merged
JohannesGaessler merged 3 commits into ggml-org:master from leonardHONG:cuda-out-prod-batched
May 7, 2026

Conversation

@leonardHONG
Contributor

Overview

Replace the per-slice cublasSgemm loop in ggml_cuda_out_prod with cublasSgemmStridedBatched for the common case (dps2 == 1 && ne2 > 1). The inner i2 loop is batched into a single cuBLAS call per i3, and the existing // TODO batched matrix multiplication comment is removed.

The original loop is kept for ne2 == 1 (no batching benefit, and avoids the overhead of cublasSgemmStridedBatched(..., batchCount=1)) and for dps2 > 1 (src0 is reused/broadcast along dim 2 and cannot be represented as a single fixed-stride batch; the pointer-array cublasSgemmBatched variant could cover this in a follow-up).

A small ne2 sweep is added to tests/test-backend-ops.cpp to exercise both the new strided path and the gate boundary at ne2 == 1.

Additional information

The strided path narrows ne2 from int64_t to int for the cuBLAS batchCount argument, so an assert and named local make this explicit:

GGML_ASSERT(ne2 <= std::numeric_limits<int>::max());
const int batch_count = (int) ne2;

The benchmark cases use small matrices (m=256, n=16, k=16) where per-call cuBLAS overhead dominates the GPU work. The large speedups below are expected for this small-GEMM / many-batch case; for larger matrices, the speedup should be smaller as the GEMM work amortizes the call overhead.

Test environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
  • CUDA: 12.9.86
  • OS: Ubuntu 22.04.5 LTS
  • Build: -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
  • CUDA arch: 120a-real

Correctness

./build/bin/test-backend-ops -o OUT_PROD -b CUDA0

12199/12199 PASS

Performance

Command:

./build/bin/test-backend-ops perf -o OUT_PROD -b CUDA0

ne2 sweep added by this PR, all with dps2 == 1:

Case                                          ne2  Path      Master GB/s  Branch GB/s  Speedup
OUT_PROD(m=256,n=16,k=16,bs=[1,1],nr=[1,1])     1  fallback         7.64         7.63   0.999x
OUT_PROD(m=256,n=16,k=16,bs=[8,1],nr=[1,1])     8  batched          1.79       116.37    65.0x
OUT_PROD(m=256,n=16,k=16,bs=[16,1],nr=[1,1])   16  batched          1.74       233.84   134.4x
OUT_PROD(m=256,n=16,k=16,bs=[32,1],nr=[1,1])   32  batched          4.61       463.88   100.6x

The ne2 == 1 case is unchanged, confirming the fallback gate. Larger ne2 cases show the expected call-overhead amortization from replacing many small cublasSgemm calls with one strided-batched call.

Requirements

@leonardHONG leonardHONG requested review from a team and ggerganov as code owners May 3, 2026 15:28
@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) May 3, 2026
@leonardHONG leonardHONG requested a review from IMbackK as a code owner May 5, 2026 00:17
@JohannesGaessler JohannesGaessler merged commit 05ff59c into ggml-org:master May 7, 2026
47 checks passed
