CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

am17an · 2026-05-24T14:02:34Z

Overview

Implement the fast Walsh-Hadamard transform (FWHT) for CUDA. This gives a speedup in cases where the KV cache is quantized.
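For readers unfamiliar with the transform, here is a minimal CPU reference sketch of an in-place, unnormalized FWHT. It is not the kernel from this PR: the function name `fwht_reference` is hypothetical, and the CUDA implementation may order or normalize its output differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized fast Walsh-Hadamard transform of one row.
// Hypothetical CPU reference for illustration, not the CUDA kernel from the PR.
inline void fwht_reference(std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0); // length must be a power of two

    // Each pass combines elements that are h apart: (a, b) -> (a + b, a - b).
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

The loop runs log2(n) passes of n/2 butterflies each, which is where the O(n log n) cost comes from.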

Performance on an RTX 5090 with `-ctk q8_0 -ctv q8_0`:

Model	Test	t/s master	t/s cuda-fwt	Speedup
gemma4 26B.A4B Q4_K_M	pp2048	13587.89	13809.20	1.02
gemma4 26B.A4B Q4_K_M	pp2048@d1024	12425.01	12553.32	1.01
gemma4 26B.A4B Q4_K_M	pp2048@d2048	12158.21	12291.42	1.01
gemma4 26B.A4B Q4_K_M	pp2048@d4096	11710.89	11913.97	1.02
gemma4 26B.A4B Q4_K_M	pp2048@d8192	10982.21	11214.12	1.02
gemma4 26B.A4B Q4_K_M	pp2048@d16384	9702.60	9776.75	1.01
gemma4 26B.A4B Q4_K_M	tg128	223.81	243.90	1.09
gemma4 26B.A4B Q4_K_M	tg128@d1024	210.06	228.02	1.09
gemma4 26B.A4B Q4_K_M	tg128@d2048	217.53	235.28	1.08
gemma4 26B.A4B Q4_K_M	tg128@d4096	216.76	234.05	1.08
gemma4 26B.A4B Q4_K_M	tg128@d8192	209.40	226.06	1.08
gemma4 26B.A4B Q4_K_M	tg128@d16384	204.54	219.74	1.07
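As a reading aid, the Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. The helper below is a hypothetical sketch under that assumption and is not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Ratio of branch to master throughput (tokens per second), rounded to two
// decimals. Hypothetical helper that assumes the table's Speedup column is
// computed this way.
inline double speedup(double tps_master, double tps_branch) {
    assert(tps_master > 0.0);
    return std::round(tps_branch / tps_master * 100.0) / 100.0;
}
```

For the tg128 row, `speedup(223.81, 243.90)` gives 1.09, matching the table.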

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, for review after initial implementation

JohannesGaessler · 2026-05-25T07:59:21Z

+
+    cudaStream_t                         stream = ctx.stream();
+    dim3                                 grid_dims(num_blocks, 1, 1);
+    dim3                                 block_dims(WARP_SIZE, rows_per_block, 1);


Suggested change

dim3 block_dims(WARP_SIZE, rows_per_block, 1);

dim3 block_dims(WARP_SIZE, rows_per_block, 1); // TODO support for warp size 64

Unless you want to implement it in this PR. It would need a bit of extra logic for warp size selection due to potential out-of-bounds memory accesses for e.g. head size 96.

I think the code would only pass pow-of-2 N here

Oh right, in that case it should be unproblematic to use a warp size of 64. It should be a simple change so it would make sense to include it from the get-go - I'll push a quick commit.
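The divisibility argument in this thread can be sketched on the host side. The helpers below are hypothetical and assume that each lane owns `n / warp_size` elements of a row, which is an assumption about the kernel layout that the thread does not confirm.

```cpp
#include <cassert>

// True when n is a nonzero power of two.
inline bool is_pow2(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Number of row elements each lane would own if a row of length n is split
// evenly across one warp. Returns 0 when the split is not even, which is the
// case where a fixed per-lane loop could read past the end of the row.
// Hypothetical helper; the per-lane layout is an assumption.
inline int elems_per_lane(int n, int warp_size) {
    assert(is_pow2(warp_size));
    return (n >= warp_size && n % warp_size == 0) ? n / warp_size : 0;
}
```

Under this assumption a head size of 96 splits evenly across 32 lanes but not across 64, while any power-of-2 row length of at least 64 splits evenly across both warp sizes.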

JohannesGaessler · 2026-05-25T10:35:34Z

Sorry, I accidentally pushed to the wrong branch. I'm currently still prototyping because the performance impact on CDNA seems to be negative both with the original kernel and with the warp size 64 patch I made.

am17an · 2026-05-25T10:54:57Z

That may be due to register spilling, I guess.

JohannesGaessler · 2026-05-25T13:02:19Z

Sorry, the previous report was wrong. I had accidentally swapped the commits when I compared the performance so I incorrectly thought the code had gotten slower. This is the correct performance:

Performance

GPU	Model	Microbatch size	Test	t/s `5d246a7`	t/s `d034c9d`	Speedup
MI60 / MI50	gemma 2B Q4_0	512	pp512	3564.39	3615.72	1.01
MI60 / MI50	gemma 2B Q4_0	512	tg128	180.67	197.38	1.09
MI60 / MI50	gemma4 26B.A4B Q4_0	512	pp512	1457.31	1508.11	1.03
MI60 / MI50	gemma4 26B.A4B Q4_0	512	tg128	73.25	84.36	1.15
MI60 / MI50	llama 1B Q4_0	512	pp512	6178.74	6360.82	1.03
MI60 / MI50	llama 1B Q4_0	512	tg128	284.59	338.01	1.19
MI60 / MI50	llama 8B Q4_0	512	pp512	1203.37	1219.37	1.01
MI60 / MI50	llama 8B Q4_0	512	tg128	83.24	92.24	1.11
MI100	gemma 2B Q4_0	512	pp512	8499.77	8513.13	1.00
MI100	gemma 2B Q4_0	512	tg128	159.60	231.11	1.45
MI100	gemma4 26B.A4B Q4_0	512	pp512	2461.92	2483.88	1.01
MI100	gemma4 26B.A4B Q4_0	512	tg128	67.78	90.41	1.33
MI100	llama 1B Q4_0	512	pp512	14556.48	15051.33	1.03
MI100	llama 1B Q4_0	512	tg128	274.45	390.36	1.42
MI100	llama 8B Q4_0	512	pp512	2789.66	2810.10	1.01
MI100	llama 8B Q4_0	512	tg128	87.89	122.86	1.40
P40	gemma 2B Q4_0	512	pp512	3216.77	3267.77	1.02
P40	gemma 2B Q4_0	512	tg128	110.04	114.97	1.04
P40	gemma4 26B.A4B Q4_0	512	pp512	1145.17	1167.92	1.02
P40	gemma4 26B.A4B Q4_0	512	tg128	53.56	55.66	1.04
P40	llama 1B Q4_0	512	pp512	5747.50	5849.62	1.02
P40	llama 1B Q4_0	512	tg128	209.89	219.48	1.05
P40	llama 8B Q4_0	512	pp512	1016.64	1028.18	1.01
P40	llama 8B Q4_0	512	tg128	47.97	49.19	1.03
Radeon 8060S Graphics	gemma 2B Q4_0	512	pp512	1474.01	1479.63	1.00
Radeon 8060S Graphics	gemma 2B Q4_0	512	tg128	81.38	82.74	1.02
Radeon 8060S Graphics	gemma4 26B.A4B Q4_0	512	pp512	397.44	407.53	1.03
Radeon 8060S Graphics	gemma4 26B.A4B Q4_0	512	tg128	37.22	38.47	1.03
Radeon 8060S Graphics	llama 1B Q4_0	512	pp512	3046.45	3058.22	1.00
Radeon 8060S Graphics	llama 1B Q4_0	512	tg128	143.56	147.61	1.03
Radeon 8060S Graphics	llama 8B Q4_0	512	pp512	417.97	420.70	1.01
Radeon 8060S Graphics	llama 8B Q4_0	512	tg128	35.85	36.80	1.03
RTX 3090	gemma 2B Q4_0	512	pp512	15062.79	15462.32	1.03
RTX 3090	gemma 2B Q4_0	512	tg128	321.99	340.46	1.06
RTX 3090	gemma4 26B.A4B Q4_0	512	pp512	4375.72	4535.69	1.04
RTX 3090	gemma4 26B.A4B Q4_0	512	tg128	140.57	148.73	1.06
RTX 3090	llama 1B Q4_0	512	pp512	24218.15	24309.18	1.00
RTX 3090	llama 1B Q4_0	512	tg128	554.35	598.74	1.08
RTX 3090	llama 8B Q4_0	512	pp512	5340.53	5451.62	1.02
RTX 3090	llama 8B Q4_0	512	tg128	141.14	150.28	1.06
RTX 4090	gemma 2B Q4_0	512	pp512	30312.75	30493.18	1.01
RTX 4090	gemma 2B Q4_0	512	tg128	389.80	405.38	1.04
RTX 4090	gemma4 26B.A4B Q4_0	512	pp512	9722.05	9925.02	1.02
RTX 4090	gemma4 26B.A4B Q4_0	512	tg128	178.72	186.33	1.04
RTX 4090	llama 1B Q4_0	512	pp512	45541.02	48395.74	1.06
RTX 4090	llama 1B Q4_0	512	tg128	654.37	704.52	1.08
RTX 4090	llama 8B Q4_0	512	pp512	12592.45	12802.23	1.02
RTX 4090	llama 8B Q4_0	512	tg128	165.29	172.37	1.04
RTX 5090	gemma 2B Q4_0	512	pp512	37668.66	38438.07	1.02
RTX 5090	gemma 2B Q4_0	512	tg128	544.23	586.94	1.08
RTX 5090	gemma4 26B.A4B Q4_0	512	pp512	11911.24	12233.33	1.03
RTX 5090	gemma4 26B.A4B Q4_0	512	tg128	235.92	258.06	1.09
RTX 5090	llama 8B Q4_0	512	pp512	15788.55	15960.27	1.01
RTX 5090	llama 8B Q4_0	512	tg128	248.27	266.44	1.07
RX 6800	gemma 2B Q4_0	512	pp512	3046.83	3113.75	1.02
RX 6800	gemma 2B Q4_0	512	tg128	123.03	128.29	1.04
RX 6800	gemma4 26B.A4B Q4_0	512	pp512	1137.03	1180.53	1.04
RX 6800	gemma4 26B.A4B Q4_0	512	tg128	54.51	59.23	1.09
RX 6800	llama 1B Q4_0	512	pp512	5101.82	5255.71	1.03
RX 6800	llama 1B Q4_0	512	tg128	221.69	235.64	1.06
RX 6800	llama 8B Q4_0	512	pp512	947.85	962.97	1.02
RX 6800	llama 8B Q4_0	512	tg128	65.16	70.21	1.08
RX 9060 XT	gemma 2B Q4_0	512	pp512	7159.41	7466.79	1.04
RX 9060 XT	gemma 2B Q4_0	512	tg128	104.27	119.32	1.14
RX 9060 XT	gemma4 26B.A4B Q4_0	512	pp512	2035.16	2119.07	1.04
RX 9060 XT	gemma4 26B.A4B Q4_0	512	tg128	49.49	53.86	1.09
RX 9060 XT	llama 1B Q4_0	512	pp512	11331.98	12018.36	1.06
RX 9060 XT	llama 1B Q4_0	512	tg128	207.11	211.99	1.02
RX 9060 XT	llama 8B Q4_0	512	pp512	2556.26	2671.17	1.04
RX 9060 XT	llama 8B Q4_0	512	tg128	56.45	58.03	1.03
V100-PCIE-32GB	gemma 2B Q4_0	512	pp512	8708.15	8940.56	1.03
V100-PCIE-32GB	gemma 2B Q4_0	512	tg128	230.05	245.31	1.07
V100-PCIE-32GB	gemma4 26B.A4B Q4_0	512	pp512	1562.38	1607.19	1.03
V100-PCIE-32GB	gemma4 26B.A4B Q4_0	512	tg128	88.77	94.66	1.07
V100-PCIE-32GB	llama 1B Q4_0	512	pp512	14161.96	14459.13	1.02
V100-PCIE-32GB	llama 1B Q4_0	512	tg128	357.49	386.40	1.08
V100-PCIE-32GB	llama 8B Q4_0	512	pp512	2979.80	3022.35	1.01
V100-PCIE-32GB	llama 8B Q4_0	512	tg128	108.40	115.30	1.06

LLaMA 3 1b has a head size of 64, LLaMA 3 8b a head size of 128, Gemma 2b 256, and Gemma 4 26b 512. All tests were run with `-ctk q8_0 -ctv q8_0`. In most cases the changes provide a small but appreciable speedup. For some AMD GPUs the difference is quite substantial though, particularly for the MI100 (1.33x to 1.45x on tg128). I think that without the new kernel the code is running into poorly optimized GEMM variants.
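To illustrate the GEMM remark: a Hadamard transform can be written either as a dense matrix product or as a butterfly. The sketch below shows both on the CPU and assumes the Sylvester (natural) ordering of the Hadamard matrix. All function names are hypothetical, and whether the path without the new kernel really uses a dense product is the comment's guess, not something this sketch confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Entry (i, j) of the Sylvester-ordered Hadamard matrix: -1 when i & j has an
// odd number of set bits, +1 otherwise.
inline float hadamard_entry(std::size_t i, std::size_t j) {
    std::size_t bits = i & j;
    int parity = 0;
    while (bits != 0) {
        parity ^= 1;
        bits &= bits - 1; // clear the lowest set bit
    }
    return parity ? -1.0f : 1.0f;
}

// Dense O(n^2) product y = H * x, the GEMM-style formulation.
inline std::vector<float> hadamard_dense(const std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0); // power-of-two length only
    std::vector<float> y(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            y[i] += hadamard_entry(i, j) * x[j];
        }
    }
    return y;
}

// O(n log n) butterfly formulation of the same product (takes a copy of x).
inline std::vector<float> hadamard_fast(std::vector<float> x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
    return x;
}
```

For a head size of 512 the dense form needs 512 × 512 = 262144 multiply-adds per row, while the butterfly needs 512 × 9 = 4608 additions and subtractions.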

JohannesGaessler · 2026-05-25T13:04:31Z

I forgot: on the MI100 I implemented a warp size of 64, but this only provided a speedup of about 1%. The bulk of the speedup comes from the work of @am17an.

ServeurpersoCom · 2026-05-25T18:49:40Z

Sorry, I ended up here in my bisect for a regression.
Bisect brackets it strictly: 5a4126a good, c1f1e28 bad. All models output garbage on CUDA, and reverting this commit on top of HEAD restores clean output.
I'm digging to see what it is...

ServeurpersoCom · 2026-05-25T19:18:22Z

Repro / narrow down:

./build/bin/llama-completion -m ../gemma-4-E2B-it-UD-Q8_K_XL.gguf -ngl 999 -ctk q8_0 -ctv q8_0 -fa on --jinja -p "salut" -n 32 --temp 0 -s 1
Garbage output:
( // \work수tot ofhas c Wait- =~MW1:}\achron^{-嫌ท้ายوا in요-rangian<lower acaso-h

./build/bin/llama-completion -m ../gemma-4-E2B-it-UD-Q8_K_XL.gguf -ngl 999 -fa on --jinja -p "salut" -n 32 --temp 0 -s 1
Output OK (first 32 tokens)

So it's the KV cache quantization that triggers the bug!

CUDA: add fast walsh-hadamard transform

8299d15

am17an requested review from a team and ggerganov as code owners May 24, 2026 14:02

github-actions bot added the testing, Nvidia GPU, and ggml labels May 24, 2026

CISC approved these changes May 24, 2026

JohannesGaessler reviewed May 25, 2026

am17an and others added 2 commits May 25, 2026 16:16

review: add unrolls + change size_t -> int

6ee12a2

warp size 64

d034c9d

JohannesGaessler force-pushed the cuda-fwt branch from 22c360f to 6ee12a2 May 25, 2026 10:36

JohannesGaessler approved these changes May 25, 2026

am17an added the merge ready label May 25, 2026

CISC approved these changes May 25, 2026

am17an merged commit c1f1e28 into ggml-org:master May 25, 2026
22 of 50 checks passed

am17an deleted the cuda-fwt branch May 25, 2026 13:12

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request May 25, 2026

Revert "CUDA: add fast walsh-hadamard transform (ggml-org#23615)"

b8dedf0

This reverts commit c1f1e28.

CUDA: add fast walsh-hadamard transform #23615: am17an merged 3 commits into ggml-org:master from am17an:cuda-fwt
