
Conversation

@pwilkin (Collaborator) commented Nov 28, 2025

Extracted and adapted kernels by @gabe-l-hart from #16623

@pwilkin requested a review from ggerganov as a code owner November 28, 2025 23:15
@github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 28, 2025
@am17an (Collaborator) commented Nov 29, 2025

For cumsum we should use https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html and keep this kernel as a fallback.
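For reference, the standard two-phase cub::DeviceScan::InclusiveSum call looks roughly like this (host-side sketch with hypothetical buffer names; a real integration would go through ggml's CUDA pool allocator and error checking):

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// Host-side sketch of the suggested cub::DeviceScan path.
static void cumsum_device_scan(const float * d_in, float * d_out, int num_items, cudaStream_t stream) {
    void * d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call does no work; it only reports the temporary storage size needed.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);
    cudaFreeAsync(d_temp_storage, stream);
}
```

DeviceScan launches its own kernels internally, which is why it only fits a host-side code path.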

@wsbagnsv1 commented Nov 29, 2025

I have a small optimization for the tri kernel.
Since it's memory-bandwidth bound there isn't much room, but I think these are real improvements, and the Nsight numbers back that up (+18% scheduler utilization). The improved kernel also seems to have less jitter (~56% decrease, though I'm not 100% sure this is real; it could be run-to-run variation). It's not a big change anyway.

Benchmark Results

1. llama.cpp benchmark (50 runs each)

| Device | Dataset | Old Kernel | New Kernel | Delta |
| --- | --- | --- | --- | --- |
| Device 0 (RTX 4070 Ti) | Large (1024) | 476.54 GB/s (±17.79) | 490.05 GB/s (±7.82) | +2.84% |
| | | 527.44 μs | 512.26 μs | -2.88% |
| | Small (256) | 1282.55 GB/s (±53.22) | 1333.17 GB/s (±29.37) | +3.95% |
| | | 6.10 μs | 5.86 μs | -3.93% |
| Device 1 (RTX 2070) | Large (1024) | 490.77 GB/s (±0.15) | 490.52 GB/s (±0.22) | -0.05% |
| | | 511.37 μs | 511.64 μs | +0.05% |
| | Small (256) | 356.65 GB/s (±4.47) | 361.48 GB/s (±7.81) | +1.35% |
| | | 21.91 μs | 21.63 μs | -1.28% |

2. Profiler statistics, RTX 2070 (Nsight)

| Metric | Old Kernel | New Kernel | Delta |
| --- | --- | --- | --- |
| Eligible Warps / Scheduler | 0.390 | 0.460 | +17.95% |
| Warp Cycles / Instruction | 26.87 | 24.92 | -7.24% |
| Physical DRAM Speed | 406.65 GB/s | 406.42 GB/s | -0.05% |
| Executed Instructions | 24.6 M | 26.5 M | +7.44% |
```diff
@@ -1,16 +1,7 @@
 #include "tri.cuh"
 #include "ggml.h"
 
-// Triangle type comparison - determines which elements to keep
-__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
-    switch (type) {
-        case GGML_TRI_TYPE_LOWER:      return i < r;
-        case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
-        case GGML_TRI_TYPE_UPPER:      return i > r;
-        case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
-        default: return false;
-    }
-}
+
 
 template<typename T>
 static __global__ void tri_kernel(
@@ -31,10 +22,22 @@ static __global__ void tri_kernel(
     const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
     T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
 
+    // Optimization: Avoid control flow (switch) inside the hot loop.
+    // Map the 4 triangle types to a generic "split point" and "keep direction" logic.
+    // LOWER / UPPER_DIAG: Split at 'r' (i1). LOWER_DIAG / UPPER: Split at 'r + 1'.
+    int add_to_split = 0;
+    if (ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) {
+        add_to_split = 1;
+    }
+    int64_t split_point = i1 + add_to_split;
+    bool prefix_keep = (ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG);
+
     // Each thread processes elements at stride blockDim.x
     for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
-        dst_row[i0] = tri_compare(i0, i1, ttype)
-            ? src_row[i0] : static_cast<T>(0.f);
+        // If prefix_keep is true, keep (i0 < split_point). Else, keep (i0 >= split_point).
+        bool keep = ((i0 < split_point) == prefix_keep);
+        dst_row[i0] = keep ? src_row[i0] : T(0);
     }
 }
```

```cuda
// Load value and compute prefix sum within warp
float val = static_cast<float>(src_row[i0]);
val = warp_prefix_inclusive_sum(val);
dst_row[i0] = static_cast<T>(val);
```
Collaborator:

It would be much preferable to store the temporary results in registers or shared memory rather than global memory.

Collaborator Author:

Isn't val here already stored in a register though? I'm afraid I'll need some more guidance here.

Collaborator:

`dst_row` is in global memory. With this code you are writing data to VRAM on this line, only to later read it back, add a value to it, and write it out again. That is 3x as much I/O to the comparatively slow VRAM vs. the comparatively fast SRAM or registers, where you could keep the running value until you write the data once at the end of the kernel.
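For illustration, a minimal sketch of the register-carry pattern described above, assuming one warp per row (blockDim.x == 32) and contiguous rows; the kernel and its parameter names are hypothetical, not the PR's actual code:

```cuda
#include <cstdint>

// Minimal sketch, not the PR's code: the running offset ("carry") stays in a
// register, so each dst element is written to global memory exactly once.
// Assumes one warp per row: launch with blockDim.x == 32, gridDim.x == nrows.
template <typename T>
static __global__ void cumsum_row_sketch(const T * src, T * dst, const int64_t ne00) {
    const T * src_row = src + blockIdx.x * ne00;
    T       * dst_row = dst + blockIdx.x * ne00;

    float carry = 0.0f; // sum of all chunks processed so far, kept in a register
    for (int64_t base = 0; base < ne00; base += 32) {
        const int64_t i0 = base + threadIdx.x;
        float val = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;

        // inclusive prefix sum within the warp via shuffles
#pragma unroll
        for (int offset = 1; offset < 32; offset <<= 1) {
            const float up = __shfl_up_sync(0xffffffff, val, offset);
            if (threadIdx.x >= offset) {
                val += up;
            }
        }

        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(val + carry); // single global write per element
        }
        carry += __shfl_sync(0xffffffff, val, 31); // lane 31 holds this chunk's total
    }
}
```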

Collaborator Author:

Now I get it, thanks!

@JohannesGaessler (Collaborator) commented:

Regarding the implementation proposed by @wsbagnsv1: if one were to do something like that, the (in my opinion) correct way would be to calculate start and end points for copying and for zeroing, and to then simply do two loops over those ranges. If at all possible, a conditional statement inside the loop should be avoided. But that would potentially make the kernel less flexible if other patterns for ggml_tri_type are ever implemented (I don't know what the intended use cases are). That is why I did not suggest this change; I very much doubt that GGML_TRI is going to have a meaningful impact on end-to-end performance unless it's very poorly implemented.
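A rough sketch of that two-loop idea, reusing the split-point mapping from the diff above; the names are hypothetical and this is not the code that ended up in the PR:

```cuda
#include <cstdint>
#include "ggml.h" // for ggml_tri_type

// Sketch only: per row, derive the [copy_start, copy_end) and [zero_start, zero_end)
// ranges once, then run two branch-free strided loops over them.
template <typename T>
static __device__ void tri_row_two_loops(const T * src_row, T * dst_row, const int64_t ne00,
                                         const int64_t i1, const ggml_tri_type ttype) {
    // LOWER / UPPER_DIAG split at i1, LOWER_DIAG / UPPER split at i1 + 1
    const int64_t split = i1 + ((ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) ? 1 : 0);
    const bool keep_prefix = ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG;

    const int64_t copy_start = keep_prefix ? 0     : split;
    const int64_t copy_end   = keep_prefix ? split : ne00;
    const int64_t zero_start = keep_prefix ? split : 0;
    const int64_t zero_end   = keep_prefix ? ne00  : split;

    for (int64_t i0 = copy_start + threadIdx.x; i0 < copy_end; i0 += blockDim.x) {
        dst_row[i0] = src_row[i0];
    }
    for (int64_t i0 = zero_start + threadIdx.x; i0 < zero_end; i0 += blockDim.x) {
        dst_row[i0] = static_cast<T>(0.f);
    }
}
```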

@pwilkin (Collaborator, Author) commented Dec 1, 2025

Okay, when I folded in @JohannesGaessler's remark about not computing the comparison inside the kernel loop, @wsbagnsv1's optimizations flowed naturally from it, so I just combined the two.

EDIT: never mind, I had the strides wrong.

@pwilkin (Collaborator, Author) commented Dec 1, 2025

Okay, I implemented the double-loop algorithm. I think the cases that are now templated are the only ones that will be supported, so it's probably fine this way.

@pwilkin (Collaborator, Author) commented Dec 1, 2025

@gabe-l-hart I would be grateful if you could look at the HIP code fixes; I have no idea what I'm doing there (and I'm not able to test beyond the CI either).

@gabe-l-hart (Collaborator) commented:

> would be grateful if you could look at the HIP code fixes

Unfortunately, I'm not much use here as I also don't have any background with HIP. I just tried installing it on my GB10 device, but haven't had any luck.


```cuda
static __device__ __forceinline__ unsigned int get_warp_mask() {
#ifdef __HIP_PLATFORM_AMD__
    return __ballot(1); // HIP equivalent
```
Collaborator:

I know basically nothing about HIP, but according to this doc, it seems like __activemask() should be supported? The main difference referenced there is the warp size of 64 vs. 32, which I could absolutely imagine being accidentally hard-coded somewhere.

Collaborator:

Specifically, I see #define WARP_SIZE 32 at the top of this file.

Collaborator:

cc/ @IMbackK

Collaborator:

The WARP_SIZE constant is deprecated and the remaining uses should only be in places that affect performance, not correctness; the non-deprecated equivalent is ggml_cuda_get_physical_warp_size.

__activemask is indeed supported and works, but I will need to check for how long it has been; I will do that later.

We will need to change the return type of this and of the kernel below (the lane mask is 64-bit for AMD's 64-wide wavefronts). @pwilkin, you can do so, or skip the kernel on HIP and I will fix it in a follow-up.
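A minimal sketch of what that return-type change could look like, assuming HIP's __activemask() returns a 64-bit lane mask on AMD; the type name is hypothetical, not the final patch:

```cuda
#include <cstdint>

// Sketch only: widen the mask on AMD, where a wavefront has 64 lanes.
#if defined(__HIP_PLATFORM_AMD__)
typedef uint64_t ggml_warp_mask_t; // __activemask() yields a 64-bit lane mask here
#else
typedef uint32_t ggml_warp_mask_t;
#endif

static __device__ __forceinline__ ggml_warp_mask_t get_warp_mask() {
    return __activemask();
}
```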

Collaborator Author:

@IMbackK okay, I'll comment it out then and add a TODO; I'd rather leave it to someone who knows what they're doing than leave an untested, vibe-coded patch :)

@am17an (Collaborator) commented Dec 2, 2025

@pwilkin not sure if you missed my comment, but CUB should be superior for most cases

@pwilkin (Collaborator, Author) commented Dec 2, 2025

> @pwilkin not sure if you missed my comment, but CUB should be superior for most cases

Ah, completely forgot about that one! Yeah, will do.

@pwilkin (Collaborator, Author) commented Dec 2, 2025

All right, I implemented a CUB-compatible version per @am17an's request and removed the global memory access per @JohannesGaessler's request. (I'd be lying if I said I figured all of that out on my own; fortunately, it turns out the new DeepSeek 3.2 Speciale is quite good at both optimizing kernels and explaining them.)

After all the optimizations, the biggest case especially improved a lot; also, the fallback implementation is very close performance-wise to the BlockScan implementation.

@am17an (Collaborator) commented Dec 2, 2025

What I meant was to use the out-of-the-box function https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html for the prefix sum.

@pwilkin (Collaborator, Author) commented Dec 2, 2025

@am17an Yeah, I ended up using https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockScan.html instead, since DeviceScan can't be called from inside kernels and only handles single-array cumulative sums. The function (InclusiveSum) is pretty much the same.
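For reference, a hedged sketch of the cub::BlockScan pattern (hypothetical kernel, not necessarily the PR's exact code): each block scans one row in tiles of BLOCK_SIZE elements, and a register carry links the tiles together.

```cuda
#include <cstdint>
#include <cub/block/block_scan.cuh>

// Sketch: one block per row, scanned in tiles of BLOCK_SIZE elements.
template <typename T, int BLOCK_SIZE>
static __global__ void cumsum_blockscan_sketch(const T * src, T * dst, const int64_t ne00) {
    using BlockScan = cub::BlockScan<float, BLOCK_SIZE>;
    __shared__ typename BlockScan::TempStorage temp_storage;
    __shared__ float tile_total; // inclusive sum of the current tile's last element

    const T * src_row = src + blockIdx.x * ne00;
    T       * dst_row = dst + blockIdx.x * ne00;

    float carry = 0.0f; // running sum of all previous tiles, kept in a register
    for (int64_t base = 0; base < ne00; base += BLOCK_SIZE) {
        const int64_t i0 = base + threadIdx.x;
        const float x   = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;

        float scanned;
        BlockScan(temp_storage).InclusiveSum(x, scanned); // block-wide inclusive scan
        if (threadIdx.x == BLOCK_SIZE - 1) {
            tile_total = scanned; // the last inclusive value is the tile's total
        }
        __syncthreads(); // publish tile_total; temp_storage may be reused afterwards

        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(scanned + carry);
        }
        carry += tile_total;
        __syncthreads(); // everyone has consumed tile_total before the next tile overwrites it
    }
}
```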

@pwilkin (Collaborator, Author) commented Dec 3, 2025

@am17an done

@pwilkin (Collaborator, Author) commented Dec 3, 2025

Now I need to wait for the HIP CI job to finish so that I know what to comment out :)

@pwilkin (Collaborator, Author) commented Dec 3, 2025

Okay, since we're not supporting F16/BF16 on the CPU anyway, I'll comment them out, as there are some errors on other platforms with the bfloat16 conversions.

@CISC (Collaborator) commented Dec 3, 2025

> Okay, since we're not supporting F16/BF16 in CPU anyway, I'll comment them out since there are some errors on other platforms with the bfloat16 conversions.

Using ggml_cuda_cast from convert.cuh would have fixed your issue.
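For context, a hedged illustration of that suggestion, assuming ggml_cuda_cast<dst_t>(src) is the device-side helper in convert.cuh that handles the float/half/nv_bfloat16 conversions; the wrapper function here is hypothetical:

```cuda
#include <cstdint>
#include "convert.cuh" // for ggml_cuda_cast (assumed)

// Illustration only: use the convert.cuh helper instead of a bare static_cast,
// which fails to build for nv_bfloat16 on some toolchains.
template <typename T>
static __device__ __forceinline__ void store_value(T * dst_row, const int64_t i0, const float val) {
    dst_row[i0] = ggml_cuda_cast<T>(val); // instead of static_cast<T>(val)
}
```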
