
Conversation

@pwilkin (Collaborator) commented Nov 28, 2025

Extracted and adapted kernels by @gabe-l-hart from #16623

@pwilkin requested a review from ggerganov as a code owner November 28, 2025 23:15
@github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 28, 2025
@am17an (Collaborator) commented Nov 29, 2025

For cumsum we should use https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html and keep this kernel as a fallback.
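For reference, the standard two-phase cub::DeviceScan::InclusiveSum call looks roughly like this (host-side sketch with hypothetical buffer names; a real integration would go through ggml's CUDA pool allocator and error checking):

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// Host-side sketch of the suggested cub::DeviceScan path.
static void cumsum_device_scan(const float * d_in, float * d_out, int num_items, cudaStream_t stream) {
    void * d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call does no work; it only reports the temporary storage size needed.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);
    cudaFreeAsync(d_temp_storage, stream);
}
```

DeviceScan launches its own kernels internally, which is why it only fits a host-side code path.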

@wsbagnsv1 commented Nov 29, 2025

I have a small optimization for the tri kernel.
Since it's memory-bandwidth bound there isn't much room, but I think these are real improvements, and the Nsight numbers back that up (+18% scheduler utilization). The improved kernel also seems to have less jitter (~56% decrease, though I'm not 100% sure this is real; it could be run-to-run variation). It's not a big change anyway.

Benchmark Results

1. llama.cpp benchmark (50 runs each)

| Device | Dataset | Old Kernel | New Kernel | Delta |
| --- | --- | --- | --- | --- |
| Device 0 (RTX 4070 Ti) | Large (1024) | 476.54 GB/s (±17.79) | 490.05 GB/s (±7.82) | +2.84% |
| | | 527.44 μs | 512.26 μs | -2.88% |
| | Small (256) | 1282.55 GB/s (±53.22) | 1333.17 GB/s (±29.37) | +3.95% |
| | | 6.10 μs | 5.86 μs | -3.93% |
| Device 1 (RTX 2070) | Large (1024) | 490.77 GB/s (±0.15) | 490.52 GB/s (±0.22) | -0.05% |
| | | 511.37 μs | 511.64 μs | +0.05% |
| | Small (256) | 356.65 GB/s (±4.47) | 361.48 GB/s (±7.81) | +1.35% |
| | | 21.91 μs | 21.63 μs | -1.28% |

2. Profiler statistics, RTX 2070 (Nsight)

| Metric | Old Kernel | New Kernel | Delta |
| --- | --- | --- | --- |
| Eligible Warps / Scheduler | 0.390 | 0.460 | +17.95% |
| Warp Cycles / Instruction | 26.87 | 24.92 | -7.24% |
| Physical DRAM Speed | 406.65 GB/s | 406.42 GB/s | -0.05% |
| Executed Instructions | 24.6 M | 26.5 M | +7.44% |
```diff
@@ -1,16 +1,7 @@
 #include "tri.cuh"
 #include "ggml.h"
 
-// Triangle type comparison - determines which elements to keep
-__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
-    switch (type) {
-        case GGML_TRI_TYPE_LOWER:      return i < r;
-        case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
-        case GGML_TRI_TYPE_UPPER:      return i > r;
-        case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
-        default: return false;
-    }
-}
+
 
 template<typename T>
 static __global__ void tri_kernel(
@@ -31,10 +22,22 @@ static __global__ void tri_kernel(
     const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
     T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
 
+    // Optimization: Avoid control flow (switch) inside the hot loop.
+    // Map the 4 triangle types to a generic "split point" and "keep direction" logic.
+    // LOWER / UPPER_DIAG: Split at 'r' (i1). LOWER_DIAG / UPPER: Split at 'r + 1'.
+    int add_to_split = 0;
+    if (ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) {
+        add_to_split = 1;
+    }
+    int64_t split_point = i1 + add_to_split;
+    bool prefix_keep = (ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG);
+
     // Each thread processes elements at stride blockDim.x
     for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
-        dst_row[i0] = tri_compare(i0, i1, ttype)
-            ? src_row[i0] : static_cast<T>(0.f);
+        // If prefix_keep is true, keep (i0 < split_point). Else, keep (i0 >= split_point).
+        bool keep = ((i0 < split_point) == prefix_keep);
+        dst_row[i0] = keep ? src_row[i0] : T(0);
     }
 }
```

```cuda
// Load value and compute prefix sum within warp
float val = static_cast<float>(src_row[i0]);
val = warp_prefix_inclusive_sum(val);
dst_row[i0] = static_cast<T>(val);
```
Collaborator:

It would be much preferable to store the temporary results in registers or shared memory rather than global memory.

Collaborator Author:

Isn't val here already stored in a register though? I'm afraid I'll need some more guidance here.

Collaborator:

`dst_row` is in global memory. With this code you are writing data to VRAM on this line, only to later read it back, add a value to it, and write it out again. That is 3x as much I/O to the comparatively slow VRAM vs. the comparatively fast SRAM or registers, where you could keep the running value until you write the data once at the end of the kernel.
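For illustration, a minimal sketch of the register-carry pattern described above, assuming one warp per row (blockDim.x == 32) and contiguous rows; the kernel and its parameter names are hypothetical, not the PR's actual code:

```cuda
#include <cstdint>

// Minimal sketch, not the PR's code: the running offset ("carry") stays in a
// register, so each dst element is written to global memory exactly once.
// Assumes one warp per row: launch with blockDim.x == 32, gridDim.x == nrows.
template <typename T>
static __global__ void cumsum_row_sketch(const T * src, T * dst, const int64_t ne00) {
    const T * src_row = src + blockIdx.x * ne00;
    T       * dst_row = dst + blockIdx.x * ne00;

    float carry = 0.0f; // sum of all chunks processed so far, kept in a register
    for (int64_t base = 0; base < ne00; base += 32) {
        const int64_t i0 = base + threadIdx.x;
        float val = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;

        // inclusive prefix sum within the warp via shuffles
#pragma unroll
        for (int offset = 1; offset < 32; offset <<= 1) {
            const float up = __shfl_up_sync(0xffffffff, val, offset);
            if (threadIdx.x >= offset) {
                val += up;
            }
        }

        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(val + carry); // single global write per element
        }
        carry += __shfl_sync(0xffffffff, val, 31); // lane 31 holds this chunk's total
    }
}
```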

Collaborator Author:

Now I get it, thanks!

@JohannesGaessler (Collaborator) commented:

Regarding the implementation proposed by @wsbagnsv1: if one were to do something like that, the (in my opinion) correct way would be to calculate start and end points for copying and for zeroing, and to then simply do two loops over those ranges. If at all possible, a conditional statement inside the loop should be avoided. But that would potentially make the kernel less flexible if other patterns for ggml_tri_type are ever implemented (I don't know what the intended use cases are). That is why I did not suggest this change; I very much doubt that GGML_TRI is going to have a meaningful impact on end-to-end performance unless it's very poorly implemented.
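A rough sketch of that two-loop idea, reusing the split-point mapping from the diff above; the names are hypothetical and this is not the code that ended up in the PR:

```cuda
#include <cstdint>
#include "ggml.h" // for ggml_tri_type

// Sketch only: per row, derive the [copy_start, copy_end) and [zero_start, zero_end)
// ranges once, then run two branch-free strided loops over them.
template <typename T>
static __device__ void tri_row_two_loops(const T * src_row, T * dst_row, const int64_t ne00,
                                         const int64_t i1, const ggml_tri_type ttype) {
    // LOWER / UPPER_DIAG split at i1, LOWER_DIAG / UPPER split at i1 + 1
    const int64_t split = i1 + ((ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) ? 1 : 0);
    const bool keep_prefix = ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG;

    const int64_t copy_start = keep_prefix ? 0     : split;
    const int64_t copy_end   = keep_prefix ? split : ne00;
    const int64_t zero_start = keep_prefix ? split : 0;
    const int64_t zero_end   = keep_prefix ? ne00  : split;

    for (int64_t i0 = copy_start + threadIdx.x; i0 < copy_end; i0 += blockDim.x) {
        dst_row[i0] = src_row[i0];
    }
    for (int64_t i0 = zero_start + threadIdx.x; i0 < zero_end; i0 += blockDim.x) {
        dst_row[i0] = static_cast<T>(0.f);
    }
}
```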

@pwilkin (Collaborator, Author) commented Dec 1, 2025

Okay, when I folded in @JohannesGaessler's remark about not computing the comparison inside the kernel loop, @wsbagnsv1's optimizations flowed naturally from it, so I just combined the two.

EDIT: never mind, I had the strides wrong.

@pwilkin (Collaborator, Author) commented Dec 1, 2025

Okay, I implemented the double-loop algorithm. I think the cases that are now templated are the only ones that will be supported, so it's probably fine this way.

@pwilkin (Collaborator, Author) commented Dec 1, 2025

@gabe-l-hart I would be grateful if you could look at the HIP code fixes; I have no idea what I'm doing there (and I'm not able to test beyond the CI either).

@gabe-l-hart (Collaborator) commented:

> would be grateful if you could look at the HIP code fixes

Unfortunately, I'm not much use here as I also don't have any background with HIP. I just tried installing it on my GB10 device, but haven't had any luck.


```cuda
static __device__ __forceinline__ unsigned int get_warp_mask() {
#ifdef __HIP_PLATFORM_AMD__
    return __ballot(1); // HIP equivalent
```
Collaborator:

I know basically nothing about HIP, but according to this doc, it seems like __activemask() should be supported? The main difference referenced there is the warp size of 64 vs. 32, which I could absolutely imagine being accidentally hard-coded somewhere.

Collaborator:

Specifically, I see #define WARP_SIZE 32 at the top of this file.

Collaborator:

cc/ @IMbackK

Collaborator:

The WARP_SIZE constant is deprecated and the remaining uses should only be in places that affect performance, not correctness; the non-deprecated equivalent is ggml_cuda_get_physical_warp_size.

__activemask is indeed supported and works, but I will need to check for how long it has been; I will do that later.

We will need to change the return type of this and of the kernel below (the lane mask is 64-bit for AMD's 64-wide wavefronts). @pwilkin, you can do so, or skip the kernel on HIP and I will fix it in a follow-up.
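A minimal sketch of what that return-type change could look like, assuming HIP's __activemask() returns a 64-bit lane mask on AMD; the type name is hypothetical, not the final patch:

```cuda
#include <cstdint>

// Sketch only: widen the mask on AMD, where a wavefront has 64 lanes.
#if defined(__HIP_PLATFORM_AMD__)
typedef uint64_t ggml_warp_mask_t; // __activemask() yields a 64-bit lane mask here
#else
typedef uint32_t ggml_warp_mask_t;
#endif

static __device__ __forceinline__ ggml_warp_mask_t get_warp_mask() {
    return __activemask();
}
```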

Collaborator Author:

@IMbackK okay, I'll comment it out then and add a TODO; I'd rather leave it to someone who knows what they're doing than leave an untested, vibe-coded patch :)

@am17an (Collaborator) commented Dec 2, 2025

@pwilkin not sure if you missed my comment, but CUB should be superior for most cases

@pwilkin (Collaborator, Author) commented Dec 2, 2025

> @pwilkin not sure if you missed my comment, but CUB should be superior for most cases

Ah, completely forgot about that one! Yeah, will do.

@pwilkin (Collaborator, Author) commented Dec 2, 2025

All right, I implemented a CUB-compatible version per @am17an's request and removed the global memory access per @JohannesGaessler's request. (I'd be lying if I said I figured all of that out on my own; fortunately, it turns out the new DeepSeek 3.2 Speciale is quite good at both optimizing kernels and explaining them.)

After all the optimizations, the biggest case especially improved a lot; also, the fallback implementation is very close performance-wise to the BlockScan implementation.

@am17an (Collaborator) commented Dec 2, 2025

What I meant was to use the out-of-the-box function https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html for the prefix sum.

@pwilkin (Collaborator, Author) commented Dec 2, 2025

@am17an Yeah, I ended up using https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockScan.html instead, since DeviceScan can't be called from inside kernels and only handles single-array cumulative sums. The function (InclusiveSum) is pretty much the same.
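For reference, a hedged sketch of the cub::BlockScan pattern (hypothetical kernel, not necessarily the PR's exact code): each block scans one row in tiles of BLOCK_SIZE elements, and a register carry links the tiles together.

```cuda
#include <cstdint>
#include <cub/block/block_scan.cuh>

// Sketch: one block per row, scanned in tiles of BLOCK_SIZE elements.
template <typename T, int BLOCK_SIZE>
static __global__ void cumsum_blockscan_sketch(const T * src, T * dst, const int64_t ne00) {
    using BlockScan = cub::BlockScan<float, BLOCK_SIZE>;
    __shared__ typename BlockScan::TempStorage temp_storage;
    __shared__ float tile_total; // inclusive sum of the current tile's last element

    const T * src_row = src + blockIdx.x * ne00;
    T       * dst_row = dst + blockIdx.x * ne00;

    float carry = 0.0f; // running sum of all previous tiles, kept in a register
    for (int64_t base = 0; base < ne00; base += BLOCK_SIZE) {
        const int64_t i0 = base + threadIdx.x;
        const float x   = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;

        float scanned;
        BlockScan(temp_storage).InclusiveSum(x, scanned); // block-wide inclusive scan
        if (threadIdx.x == BLOCK_SIZE - 1) {
            tile_total = scanned; // the last inclusive value is the tile's total
        }
        __syncthreads(); // publish tile_total; temp_storage may be reused afterwards

        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(scanned + carry);
        }
        carry += tile_total;
        __syncthreads(); // everyone has consumed tile_total before the next tile overwrites it
    }
}
```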

@pwilkin (Collaborator, Author) commented Dec 3, 2025

@am17an done

@pwilkin (Collaborator, Author) commented Dec 3, 2025

Now I need to wait for the HIP CI job to finish so that I know what to comment out :)

@pwilkin (Collaborator, Author) commented Dec 3, 2025

Okay, since we're not supporting F16/BF16 on the CPU anyway, I'll comment them out, as there are some errors on other platforms with the bfloat16 conversions.

@CISC (Collaborator) commented Dec 3, 2025

> Okay, since we're not supporting F16/BF16 in CPU anyway, I'll comment them out since there are some errors on other platforms with the bfloat16 conversions.

Using ggml_cuda_cast from convert.cuh would have fixed your issue.
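For context, a hedged illustration of that suggestion, assuming ggml_cuda_cast<dst_t>(src) is the device-side helper in convert.cuh that handles the float/half/nv_bfloat16 conversions; the wrapper function here is hypothetical:

```cuda
#include <cstdint>
#include "convert.cuh" // for ggml_cuda_cast (assumed)

// Illustration only: use the convert.cuh helper instead of a bare static_cast,
// which fails to build for nv_bfloat16 on some toolchains.
template <typename T>
static __device__ __forceinline__ void store_value(T * dst_row, const int64_t i0, const float val) {
    dst_row[i0] = ggml_cuda_cast<T>(val); // instead of static_cast<T>(val)
}
```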
