Conversation

pwilkin (Collaborator) commented Nov 23, 2025

I've managed to actually poke enough LLMs in the correct direction to end up with this:

CPU:

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                    32760 runs -    37.34 us/run -      152 kB/run -    3.88 GB/s

CUDA:

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                    49140 runs -    22.52 us/run -      152 kB/run -    6.44 GB/s

This can most certainly be improved by someone who knows what they're doing, but at least it does the bare minimum by supplying a CUDA kernel that's around twice as fast as the optimized CPU implementation.
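For reference, these numbers come from test-backend-ops in perf mode; assuming a standard CMake build, something along the lines of ./build/bin/test-backend-ops perf -o SOLVE_TRI should reproduce them (the exact binary path depends on the build setup).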

@pwilkin pwilkin requested a review from slaren as a code owner November 23, 2025 22:55
@pwilkin pwilkin mentioned this pull request Nov 23, 2025
@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Nov 23, 2025
wsbagnsv1 commented Nov 24, 2025

Hey, I've created a small OpenEvolve framework for this kernel and ran it for about 40 iterations (I plan to do around 600), and it already got around an 8% improvement on the kernel below. I'm pretty sure I've covered all test cases, but please check it for correctness in case I missed something. Anyway, here is the performance improvement on my old RTX 2070:

Performance improvement

Oh, and this was reproducible across multiple runs over time and showed a consistent improvement (;

This is the kernel:

#include <cuda_fp16.h>

#define MAX_N_FAST 64
#define MAX_K_FAST 32
#define WARP_SIZE 32

// Warp reduction helper; reduces across the currently active lanes
static __inline__ __device__ float warpReduceSum(float val) {
    // __activemask() returns the mask of lanes that are active at this point
    unsigned mask = __activemask();
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;
}

// Optimized kernel focusing on coalesced access and warp-level parallelism
extern "C" __global__ void solve_tri_f32_fast(
    const float* __restrict__ A,
    const float* __restrict__ B,
    float* __restrict__ X,
    int n, int k,
    int64_t ne02, int64_t ne03,
    size_t nb02, size_t nb03,
    size_t nb12, size_t nb13,
    size_t nb2, size_t nb3)
{
    const int batch_idx = blockIdx.x;
    const int lane      = threadIdx.x;
    const int col_idx   = threadIdx.y;
    const int tid       = threadIdx.x + threadIdx.y * blockDim.x;

    // Early exit for excess warps
    if (col_idx >= k) {
        return;
    }

    // Calculate batch indices
    const int64_t i03 = batch_idx / ne02;
    const int64_t i02 = batch_idx % ne02;

    // Get pointers for this batch
    const float* const A_batch = (const float*)((const char *)A + i02 * nb02 + i03 * nb03);
    const float* const B_batch = (const float*)((const char *)B + i02 * nb12 + i03 * nb13);
    float*             X_batch = (float*)      ((char *)X + i02 * nb2  + i03 * nb3);

    // Shared memory for A and B matrices
    __shared__ float sA[MAX_N_FAST * MAX_N_FAST];
    __shared__ float sX[MAX_N_FAST * MAX_K_FAST];

    // Coalesced loading of A matrix
    // Each thread loads multiple elements to improve bandwidth utilization
    const int total_elements_A = n * n;
    const int stride_A = blockDim.x * blockDim.y;
    for (int i = tid; i < total_elements_A; i += stride_A) {
        sA[i] = A_batch[i];
    }

    // Coalesced loading of B matrix  
    const int total_elements_B = n * k;
    const int stride_B = blockDim.x * blockDim.y;
    for (int i = tid; i < total_elements_B; i += stride_B) {
        sX[i] = B_batch[i];
    }
    __syncthreads();

    // Forward substitution with warp-level parallelism
    // Each warp processes one column of the solution
    for (int row = 0; row < n; ++row) {
        float sum = 0.0f;

        // Use register accumulation for better ILP
        float sum_part = 0.0f;
        
        // Parallel computation of dot product
        // Each thread in the warp processes a subset of elements
        for (int j = lane; j < row; j += WARP_SIZE) {
            // Load row of A and column of X from shared memory
            sum_part += sA[row * n + j] * sX[j * k + col_idx];
        }

        // Warp-level reduction to get the final sum
        sum = warpReduceSum(sum_part);

        // Lane 0 performs the final computation and stores result
        if (lane == 0) {
            const float b_val = sX[row * k + col_idx];
            const float a_diag = sA[row * n + row];
            
            // The safe exact check:
            if (a_diag != 0.0f) {
                sX[row * k + col_idx] = (b_val - sum) / a_diag;
            } else {
                sX[row * k + col_idx] = 0.0f; // Only catch true division by zero
            }
        }
        
        // Synchronize to ensure all threads see the updated value
        __syncthreads();
    }

    // Coalesced write back to global memory
    for (int i = tid; i < total_elements_B; i += stride_B) {
        X_batch[i] = sX[i];
    }
}
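For context, a host-side launch consistent with this kernel's indexing (an inference from the code above, not part of the posted snippet) would use one block per (i02, i03) batch and one warp per column of X, i.e. up to 32*32 = 1024 threads per block when k = 32:

// Hypothetical launcher; the name and the cudaStream_t parameter are illustrative.
static void launch_solve_tri_f32_fast(const float * A, const float * B, float * X,
                                       int n, int k, int64_t ne02, int64_t ne03,
                                       size_t nb02, size_t nb03, size_t nb12, size_t nb13,
                                       size_t nb2, size_t nb3, cudaStream_t stream) {
    const dim3 block(WARP_SIZE, k);                 // threadIdx.x = lane, threadIdx.y = column
    const dim3 grid((unsigned int)(ne02 * ne03));   // blockIdx.x = flattened batch index
    solve_tri_f32_fast<<<grid, block, 0, stream>>>(A, B, X, n, k,
                                                   ne02, ne03, nb02, nb03,
                                                   nb12, nb13, nb2, nb3);
}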

For those curious, OpenEvolve implements the AlphaEvolve approach from Google (or at least it started as that): using LLMs to iteratively evolve and optimize algorithms.

Also, I'll upload the framework to my GitHub tomorrow for anyone interested (;

pwilkin (Collaborator, Author) commented Nov 24, 2025

@am17an something like this?

am17an (Collaborator) commented Nov 24, 2025

@pwilkin not quite, something like this https://github.com/pwilkin/llama.cpp/compare/solve_tri_cuda...am17an:llama.cpp:solve_tri_cuda_opt?expand=1

With this I get

SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                    57330 runs -    19.13 us/run -      152 kB/run -    7.58 GB/s

am17an (Collaborator) commented Nov 24, 2025

One other thing: you don't need an entirely separate function for the general case; you can pass 0 as the template parameter and do an if constexpr on the unrolled parts.

__shared__ float sX[MAX_N_FAST * MAX_K_FAST];

// Load A into shared memory (coalesced)
#pragma unroll
Collaborator review comment:

These cannot be unrolled, so you can remove the pragma.

Suggested change: delete the #pragma unroll line.

Collaborator review comment:

I think this whole function can go away, it should be something like

if constexpr(n == 0) { 
   //take this path
} else {
  #pragma unroll 
  //the fast loop
}
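A slightly fuller sketch of that shape (illustrative names only, not the PR's actual code), where n_fixed == 0 selects the general runtime-sized path and any other value the fully unrolled one:

template <int n_fixed>
static __global__ void solve_tri_f32(const float * A, const float * B, float * X,
                                     int n, int k) {
    if constexpr (n_fixed == 0) {
        // general path: the bound is only known at runtime, so no unrolling
        for (int row = 0; row < n; ++row) {
            // ... forward substitution for this row ...
        }
    } else {
        // fast path: the bound is a compile-time constant, so the loop unrolls
#pragma unroll
        for (int row = 0; row < n_fixed; ++row) {
            // ... same body ...
        }
    }
}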


pwilkin (Collaborator, Author) commented Nov 24, 2025

@am17an I'm not very experienced with this, but I believe this is what you had in mind?

am17an (Collaborator) left a comment:

Yeah, looks good. You can run it through clang-format once. Also, let's get @JohannesGaessler to take a look as well; he usually has the best ideas re performance.

theo77186 (Contributor) commented Nov 24, 2025

For some reason, when testing with test-backend-ops, this case fails: SOLVE_TRI(type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]), with wildly different NMSE (can be from 10^-4 to very large values). It may even occasionally pass. It only fails on my 4060Ti (sm89) but not on my 3060 (sm86). I don't see any reason the kernel would behave differently, though.

logs

[SOLVE_TRI] NMSE = 16.742420406 > 0.000000100   SOLVE_TRI(type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]): FAIL

edit: would appreciate if anyone with a sm89 GPU could reproduce this

pwilkin (Collaborator, Author) commented Nov 24, 2025

For some reason, when testing with test-backend-ops, this case fails: SOLVE_TRI(type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]), with wildly different NMSE (can be from 10^-4 to very large values). It may even occasionally pass. It only fails on my 4060Ti (sm89) but not on my 3060 (sm86). I don't see any reason the kernel would behave differently, though.

logs

[SOLVE_TRI] NMSE = 16.742420406 > 0.000000100   SOLVE_TRI(type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]): FAIL

edit: would appreciate if anyone with a sm89 GPU could reproduce this

Can confirm. My 3080 (which is my CUDA0) works correctly, but my 5060 fails with the pattern you described. @am17an any ideas? Looks like some race condition, but why only on 89+ arch cards?

pwilkin (Collaborator, Author) commented Nov 24, 2025

Never mind; I needed to add a guard. Apparently the 30x0 didn't mind its absence.

Comment on lines 42 to 44
const float * const A_batch = (const float *) ((const char *) A + i02 * nb02 + i03 * nb03);
const float * const B_batch = (const float *) ((const char *) B + i02 * nb12 + i03 * nb13);
float * X_batch = (float *) ((char *) X + i02 * nb2 + i03 * nb3);
Collaborator review comment:

Generally speaking, it is preferable to pass the strides in units of float instead of char.
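For illustration only (an assumed variant, not the merged code), with strides already converted to elements on the host the char* casts disappear:

// s02, s03, s12, s13, s2, s3 are the byte strides divided by sizeof(float)
const float * A_batch = A + i02 * s02 + i03 * s03;
const float * B_batch = B + i02 * s12 + i03 * s13;
float *       X_batch = X + i02 * s2  + i03 * s3;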

@@ -0,0 +1,5 @@
#include "common.cuh"

#define CUDA_SOLVE_TRI_BLOCK_SIZE 256
Collaborator review comment:

The kernel does not respect this define. More generally, you are launching a kernel with up to 32*32=1024 threads which is in principle still possible but becomes problematic in terms of register pressure. My recommendation would be to launch at most 256 threads, to specify this upper limit via __launch_bounds__, and to handle the cases which currently use > 8 warps with a loop.
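A rough sketch of that suggestion (hypothetical shape and names, not the final code):

#define CUDA_SOLVE_TRI_BLOCK_SIZE 256

// At most 8 warps per block; when k > 8, each warp loops over several columns
// instead of the launch providing one warp per column.
static __global__ void __launch_bounds__(CUDA_SOLVE_TRI_BLOCK_SIZE)
    solve_tri_f32_capped(const float * A, const float * B, float * X, int n, int k) {
    const int lane = threadIdx.x;  // 0..31
    for (int col = threadIdx.y; col < k; col += blockDim.y) {
        // ... per-column forward substitution as before, indexed by `lane` and `col` ...
    }
}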

JohannesGaessler (Collaborator) commented:

Never mind; I needed to add a guard. Apparently the 30x0 didn't mind its absence.

I think the reason you needed to add a guard is because the number of warps can be != a power of 2.

jeffbolznv (Collaborator) commented:

Have a look at #17486; on my system the perf is >2x better than this CUDA kernel. There's probably still some room for improvement when K is small.

pwilkin (Collaborator, Author) commented Nov 25, 2025

Have a look at #17486; on my system the perf is >2x better than this CUDA kernel. There's probably still some room for improvement when K is small.

And here I thought I could get away with it easily ;) Oh well, I guess it's time to learn how to use Nsight.

@pwilkin pwilkin requested a review from ggerganov as a code owner November 26, 2025 21:55
pwilkin (Collaborator, Author) commented Nov 26, 2025

I managed to get a further ~30% improvement by removing the division-by-zero guard (it doesn't make much sense there anyway). Now I'm in this bizarre state where my 3080 is twice as slow as my 5060 and I don't know why:

  Device description: NVIDIA GeForce RTX 3080
  Device memory: 9871 MB (11009 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                    65520 runs -    15.42 us/run -      152 kB/run -    9.40 GB/s
  
    Device description: NVIDIA GeForce RTX 5060 Ti
  Device memory: 15848 MB (11091 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   131040 runs -     7.87 us/run -      152 kB/run -   18.41 GB/s

Any idea what could be the reason?

pwilkin (Collaborator, Author) commented Nov 26, 2025

Got another 20% by moving the second sync outside of the loop, but the difference between the 3080 and the 5060 persists.

wsbagnsv1 commented Nov 27, 2025

I've found some more optimizations for your kernel which reliably improve performance by a good amount on both my GPUs.

Current PR:

Device description: NVIDIA GeForce RTX 4070 Ti
Device memory: 12281 MB (11036 MB free)

SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):           90090 runs -   11.26 us/run -   152 kB/run -   12.87 GB/s

Device description: NVIDIA GeForce RTX 2070
Device memory: 8191 MB (7144 MB free)

SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):           57330 runs -   18.01 us/run -   152 kB/run -    8.05 GB/s

My improved kernel:

Device description: NVIDIA GeForce RTX 4070 Ti
Device memory: 12281 MB (11036 MB free)

SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):          147420 runs -    6.93 us/run -   152 kB/run -   20.91 GB/s

Device description: NVIDIA GeForce RTX 2070
Device memory: 8191 MB (7144 MB free)

SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):           81900 runs -   13.28 us/run -   152 kB/run -   10.91 GB/s
Device        Metric     Baseline     Improved     Gain
RTX 4070 Ti   Bandwidth  12.87 GB/s   20.91 GB/s   +62.5%
RTX 4070 Ti   Latency    11.26 us      6.93 us     -38.5%
RTX 2070      Bandwidth   8.05 GB/s   10.91 GB/s   +35.5%
RTX 2070      Latency    18.01 us     13.28 us     -26.3%

I've done more than 25 runs and the gain is consistent.

Here are the proposed changes, all in solve_tri.cu:

@@ -47,7 +47,8 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
     float *             X_batch = (float *) ((char *) X + i02 * nb2 + i03 * nb3);
 
     __shared__ float sA[MAX_N_FAST * MAX_N_FAST];
-    __shared__ float sX[MAX_N_FAST * (MAX_K_FAST + 1)];
+    __shared__ float sX[MAX_N_FAST * MAX_K_FAST];
+    __shared__ float sXt[MAX_K_FAST * MAX_N_FAST];
 
     const int offset = threadIdx.x + threadIdx.y * blockDim.x;
 
@@ -60,7 +61,7 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
     }
 
 #pragma unroll
-    for (int i = 0; i < n * (k + 1); i += k * WARP_SIZE) {
+    for (int i = 0; i < n * k; i += k * WARP_SIZE) {
         int i0 = i + offset;
         if (i0 < n * k) {
             sX[i0] = B_batch[i0];
@@ -69,22 +70,52 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
 
     __syncthreads();
 
+    // Transpose sX (row-major) to sXt (column-major per column)
+#pragma unroll 2
+    for (int rr = 0; rr < 2; ++rr) {
+        int row = rr * WARP_SIZE + lane;
+        if (row < n) {
+            sXt[col_idx * n + row] = sX[row * k + col_idx];
+        }
+    }
+
+    __syncthreads();
+
+    // Forward substitution
 #pragma unroll
     for (int row = 0; row < n; ++row) {
         float sum = 0.0f;
 
-#pragma unroll
-        for (int j = lane; j < row; j += WARP_SIZE) {
-            sum += sA[row * n + j] * sX[j * k + col_idx];
+        // First warp
+        {
+            int j = lane;
+            if (j < row) {
+                sum += sA[row * n + j] * sXt[col_idx * n + j];
+            }
+        }
+        // Second warp
+        if (row >= WARP_SIZE) {
+            int j = WARP_SIZE + lane;
+            if (j < row) {
+                sum += sA[row * n + j] * sXt[col_idx * n + j];
+            }
         }
 
         sum = warp_reduce_sum(sum);
 
         if (lane == 0) {
-            const float b_val  = sX[row * k + col_idx];  // Value from B
-            const float a_diag = sA[row * n + row];
-            // no safeguards for division by zero because that indicates corrupt data anyway
-            sX[row * k + col_idx] = (b_val - sum) / a_diag;
+            const float diag = sA[row * n + row];
+            const float b_val = sXt[col_idx * n + row];
+            sXt[col_idx * n + row] = (b_val - sum) / diag;
+        }
+    }
+
+    // Transpose back sXt to sX (row-major)
+#pragma unroll 2
+    for (int rr = 0; rr < 2; ++rr) {
+        int row = rr * WARP_SIZE + lane;
+        if (row < n) {
+            sX[row * k + col_idx] = sXt[col_idx * n + row];
         }
     }

As for how I was able to improve the kernel: I created a new framework for OpenEvolve that uses Nsight to improve the kernel via evolution and uses llama.cpp itself as the benchmark, and I got this kernel after only 6 iterations.

I hope this helps!
(I'm going to let it run again; maybe we'll see even more improvements (; )

pwilkin (Collaborator, Author) commented Nov 27, 2025

@wsbagnsv1 nice one, I was literally going to do the transpose trick as the next optimization :)

pwilkin (Collaborator, Author) commented Nov 27, 2025

@wsbagnsv1 btw, care to upload your openevolve evaluator somewhere?

pwilkin (Collaborator, Author) commented Nov 27, 2025

@wsbagnsv1 BTW it's always good to look at the generated kernels as well :)

Your kernel has completely unneeded writes to an extra array: you only need to write to sXt and then write back to X_batch directly (transposing on the fly); you don't need sX at all under that approach.
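For illustration, a write-back along those lines (a sketch assuming sXt holds the solved values column-major, as in the posted diff) could look like:

// Each warp owns column col_idx of the solution in sXt and writes it straight
// to the row-major X_batch, transposing on the fly; no sX buffer is needed.
for (int rr = 0; rr < (n + WARP_SIZE - 1) / WARP_SIZE; ++rr) {
    const int row = rr * WARP_SIZE + lane;
    if (row < n) {
        X_batch[row * k + col_idx] = sXt[col_idx * n + row];
    }
}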

pwilkin (Collaborator, Author) commented Nov 27, 2025

@JohannesGaessler @am17an I think this is as far as I can push this; I believe it's already pretty well optimized.

am17an (Collaborator) commented Nov 27, 2025

Nice! I think you need another clang-format pass, and preferably pass the strides in units of float rather than bytes.

pwilkin (Collaborator, Author) commented Nov 27, 2025

Cleaned it up. Final results:

  Device description: NVIDIA GeForce RTX 3080
  Device memory: 9871 MB (15969 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   106470 runs -     9.51 us/run -      152 kB/run -   15.24 GB/s
  
  Device description: NVIDIA GeForce RTX 5060 Ti
  Device memory: 15848 MB (15958 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   147420 runs -     7.11 us/run -      152 kB/run -   20.38 GB/s

am17an (Collaborator) left a comment:

Just a few nitpicks; we can merge after that.

wsbagnsv commented Nov 27, 2025

@wsbagnsv1 btw, care to upload your openevolve evaluator somewhere?

Yeah, I have to clean it up a bit and then I'll put it on my GitHub; I had to hardcode the paths for the compilation etc. 😅
It's not that hard to run: basically it compiles llama.cpp each time with the new kernel and runs perf as well as Nsight on it (; I aim to do that either today or tomorrow (;

wsbagnsv commented Nov 27, 2025

@wsbagnsv1 BTW it's always good to look at the generated kernels as well :)

Your kernel has completely unneeded writes to an extra array: you only need to write to sXt and then write back to X_batch directly (transposing on the fly); you don't need sX at all under that approach.

Probably because it was a rather early iteration; I'll check the later ones once I'm back at my PC. But yeah, you're right; I didn't have much time left before sleep, saw that massive improvement, and basically just copy-pasted it so it's at least online 😅

jeffbolznv (Collaborator) commented:

Perf on my system:

Backend 1/3: CUDA0
  Device description: NVIDIA GeForce RTX 5090
  Device memory: 32606 MB (30991 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   139230 runs -     7.20 us/run -      152 kB/run -   20.15 GB/s
  SOLVE_TRI(type=f32,ne_lhs=[128,128,4,1],ne_rhs=[8,128,4,1]): not supported
  Backend CUDA0: OK
Backend 2/3: CUDA1
  Device description: NVIDIA GeForce RTX 4070
  Device memory: 12281 MB (11106 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   147420 runs -     6.91 us/run -      152 kB/run -   20.99 GB/s
  SOLVE_TRI(type=f32,ne_lhs=[128,128,4,1],ne_rhs=[8,128,4,1]): not supported

theo77186 (Contributor) commented:

My own results:

4060 Ti and 3060 performance:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 23731724
Backend 1/4: CUDA0
  Device description: NVIDIA GeForce RTX 4060 Ti
  Device memory: 15982 MB (23175 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   155610 runs -     6.67 us/run -      152 kB/run -   21.73 GB/s
  SOLVE_TRI(type=f32,ne_lhs=[128,128,4,1],ne_rhs=[8,128,4,1]): not supported
  Backend CUDA0: OK
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 23597052
Backend 2/4: CUDA1
  Device description: NVIDIA GeForce RTX 3060
  Device memory: 11947 MB (23043 MB free)

  SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]):                   122850 runs -     8.69 us/run -      152 kB/run -   16.69 GB/s
  SOLVE_TRI(type=f32,ne_lhs=[128,128,4,1],ne_rhs=[8,128,4,1]): not supported
  Backend CUDA1: OK

Curiously, it seems there isn't much scaling within a given GPU family...

pwilkin (Collaborator, Author) commented Nov 27, 2025

@theo77186 it's a small kernel that doesn't utilize all the cores, hence no scaling. It would probably scale more with a tiled version.

wsbagnsv1 commented Nov 27, 2025

@theo77186 it's a small kernel that doesn't utilize all the cores, hence no scaling. It would probably scale more with a tiled version.

I've just created my fork of OpenEvolve and uploaded my changes with the solve_tri kernel. Maybe it can be used to find a properly optimized tiled version of this kernel too, if we even need it, though we'd have to change the parser in the evaluator and the config.yaml. But that's for another PR, I guess 😉
https://github.com/wsbagnsv1/openevolve-cuda-trisolve

pwilkin (Collaborator, Author) commented Nov 27, 2025

@am17an Aight, CI tests are fine; I think we can merge.

am17an merged commit cd0e3a7 into ggml-org:master on Nov 28, 2025 (72 of 74 checks passed).
float * X_batch = (float *) (X + i02 * nb2 / sizeof(float) + i03 * nb3 / sizeof(float));

__shared__ float sA[MAX_N_FAST * MAX_N_FAST];
__shared__ float sXt[MAX_N_FAST * (MAX_K_FAST + 1)];
Collaborator review comment:

Simply changing the size of a 1D allocation like this does nothing to fix shared memory bank conflicts. You have to actually access elements with the padded stride. One way to do this automatically is to change the array shape to be 2D and to pad the last dimension.
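A minimal sketch of that 2D padded layout (an assumed fix, not what was merged; which dimension carries the padding depends on the indexing actually used):

// Padding the innermost dimension by one float shifts consecutive rows to
// different banks, and every access picks up the padded stride automatically
// through normal 2D indexing, e.g. sXt[col_idx][row] instead of sXt[col_idx * n + row].
__shared__ float sXt[MAX_K_FAST][MAX_N_FAST + 1];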

pwilkin (Collaborator, Author) replied:
@JohannesGaessler You're right, I didn't think this one through :) I'll try to fix it and submit a separate PR.

JohannesGaessler (Collaborator) commented:

Sorry, I forgot to press the submit button on my review. It is one more optimization that could be done, though for any optimizations you should first check what percentage of the runtime the operation takes up in the first place (because that is the maximum percentage that you can shave off).
