CUDA: add fast walsh-hadamard transform #23615

Merged
am17an merged 3 commits into ggml-org:master from am17an:cuda-fwt
May 25, 2026


@am17an am17an commented May 24, 2026

Overview

Implement the fast Walsh-Hadamard transform (FWHT) for CUDA. This speeds up inference when the KV cache is quantized.
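For readers unfamiliar with the transform, here is a minimal CPU reference sketch of an unnormalized, in-place FWHT in C++. It is not the CUDA kernel from this PR: the name `fwht_inplace` is hypothetical, and the sketch assumes the row length is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the kernel added by this PR.
// In-place, unnormalized fast Walsh-Hadamard transform (natural/Sylvester order).
// Assumption: x.size() is a power of two.
void fwht_inplace(std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // log2(n) stages; in each stage, elements that are h apart form a butterfly.
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b; // sum goes to the lower index
                x[j + h] = a - b; // difference goes to the upper index
            }
        }
    }
}
```

For the input `{1, 0, 1, 0}` this produces `{2, 2, 0, 0}`. A GPU version would typically spread the inner butterflies across threads, but the data flow is the same.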

Performance on an RTX 5090 with -ctk q8_0 -ctv q8_0:

| Model | Test | t/s master | t/s cuda-fwt | Speedup |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q4_K_M | pp2048 | 13587.89 | 13809.20 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d1024 | 12425.01 | 12553.32 | 1.01 |
| gemma4 26B.A4B Q4_K_M | pp2048@d2048 | 12158.21 | 12291.42 | 1.01 |
| gemma4 26B.A4B Q4_K_M | pp2048@d4096 | 11710.89 | 11913.97 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d8192 | 10982.21 | 11214.12 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d16384 | 9702.60 | 9776.75 | 1.01 |
| gemma4 26B.A4B Q4_K_M | tg128 | 223.81 | 243.90 | 1.09 |
| gemma4 26B.A4B Q4_K_M | tg128@d1024 | 210.06 | 228.02 | 1.09 |
| gemma4 26B.A4B Q4_K_M | tg128@d2048 | 217.53 | 235.28 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d4096 | 216.76 | 234.05 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d8192 | 209.40 | 226.06 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d16384 | 204.54 | 219.74 | 1.07 |

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for review after initial implementation

@am17an am17an requested review from a team and ggerganov as code owners May 24, 2026 14:02
@github-actions github-actions Bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 24, 2026
Comment thread ggml/src/ggml-cuda/fwht.cu Outdated
Comment thread ggml/src/ggml-cuda/fwht.cu
Comment thread ggml/src/ggml-cuda/fwht.cu Outdated

cudaStream_t stream = ctx.stream();
dim3 grid_dims(num_blocks, 1, 1);
dim3 block_dims(WARP_SIZE, rows_per_block, 1);

@JohannesGaessler JohannesGaessler May 25, 2026

Suggested change
dim3 block_dims(WARP_SIZE, rows_per_block, 1);
dim3 block_dims(WARP_SIZE, rows_per_block, 1); // TODO support for warp size 64

Unless you want to implement it in this PR. It would need a bit of extra logic for warp size selection due to potential out-of-bounds memory accesses for e.g. head size 96.

@am17an

I think the code only passes power-of-2 N here.

@JohannesGaessler

Oh right, in that case it should be unproblematic to use a warp size of 64. It should be a simple change so it would make sense to include it from the get-go - I'll push a quick commit.
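The concern in this thread can be illustrated with a small C++ sketch. It assumes, hypothetically, that each lane of a warp handles `n / warp_size` elements of a row; the actual work distribution in the kernel may differ, and the helper names `is_pow2` and `row_fits_warp` are not from the PR.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only, not taken from the PR.
constexpr bool is_pow2(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// True if a row of n elements splits evenly across the lanes of one warp,
// so that every lane owns the same number of in-bounds elements.
constexpr bool row_fits_warp(std::size_t n, std::size_t warp_size) {
    return n >= warp_size && n % warp_size == 0;
}

// Head size 96 divides evenly over 32 lanes (3 elements per lane) ...
static_assert(row_fits_warp(96, 32), "96 splits evenly over 32 lanes");
// ... but not over 64 lanes, where some lanes would index past the row.
static_assert(!row_fits_warp(96, 64), "96 does not split evenly over 64 lanes");
// A power-of-2 row length of at least 64 always splits evenly over 64 lanes.
static_assert(is_pow2(128) && row_fits_warp(128, 64), "128 splits evenly over 64 lanes");
```

Under that assumption, restricting the kernel to power-of-2 N is what makes a warp size of 64 safe for rows of at least 64 elements.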

@JohannesGaessler

Sorry, I accidentally pushed to the wrong branch. I'm currently still prototyping because the performance impact on CDNA seems to be negative both with the original kernel and with the warp size 64 patch I made.

@am17an am17an commented May 25, 2026

That may be due to register spilling, I guess.

@JohannesGaessler

Sorry, the previous report was wrong. I had accidentally swapped the commits when I compared the performance so I incorrectly thought the code had gotten slower. This is the correct performance:

Performance
| GPU | Model | Microbatch size | Test | t/s 5d246a7 | t/s d034c9d | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| MI60 / MI50 | gemma 2B Q4_0 | 512 | pp512 | 3564.39 | 3615.72 | 1.01 |
| MI60 / MI50 | gemma 2B Q4_0 | 512 | tg128 | 180.67 | 197.38 | 1.09 |
| MI60 / MI50 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 1457.31 | 1508.11 | 1.03 |
| MI60 / MI50 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 73.25 | 84.36 | 1.15 |
| MI60 / MI50 | llama 1B Q4_0 | 512 | pp512 | 6178.74 | 6360.82 | 1.03 |
| MI60 / MI50 | llama 1B Q4_0 | 512 | tg128 | 284.59 | 338.01 | 1.19 |
| MI60 / MI50 | llama 8B Q4_0 | 512 | pp512 | 1203.37 | 1219.37 | 1.01 |
| MI60 / MI50 | llama 8B Q4_0 | 512 | tg128 | 83.24 | 92.24 | 1.11 |
| MI100 | gemma 2B Q4_0 | 512 | pp512 | 8499.77 | 8513.13 | 1.00 |
| MI100 | gemma 2B Q4_0 | 512 | tg128 | 159.60 | 231.11 | 1.45 |
| MI100 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 2461.92 | 2483.88 | 1.01 |
| MI100 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 67.78 | 90.41 | 1.33 |
| MI100 | llama 1B Q4_0 | 512 | pp512 | 14556.48 | 15051.33 | 1.03 |
| MI100 | llama 1B Q4_0 | 512 | tg128 | 274.45 | 390.36 | 1.42 |
| MI100 | llama 8B Q4_0 | 512 | pp512 | 2789.66 | 2810.10 | 1.01 |
| MI100 | llama 8B Q4_0 | 512 | tg128 | 87.89 | 122.86 | 1.40 |
| P40 | gemma 2B Q4_0 | 512 | pp512 | 3216.77 | 3267.77 | 1.02 |
| P40 | gemma 2B Q4_0 | 512 | tg128 | 110.04 | 114.97 | 1.04 |
| P40 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 1145.17 | 1167.92 | 1.02 |
| P40 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 53.56 | 55.66 | 1.04 |
| P40 | llama 1B Q4_0 | 512 | pp512 | 5747.50 | 5849.62 | 1.02 |
| P40 | llama 1B Q4_0 | 512 | tg128 | 209.89 | 219.48 | 1.05 |
| P40 | llama 8B Q4_0 | 512 | pp512 | 1016.64 | 1028.18 | 1.01 |
| P40 | llama 8B Q4_0 | 512 | tg128 | 47.97 | 49.19 | 1.03 |
| Radeon 8060S Graphics | gemma 2B Q4_0 | 512 | pp512 | 1474.01 | 1479.63 | 1.00 |
| Radeon 8060S Graphics | gemma 2B Q4_0 | 512 | tg128 | 81.38 | 82.74 | 1.02 |
| Radeon 8060S Graphics | gemma4 26B.A4B Q4_0 | 512 | pp512 | 397.44 | 407.53 | 1.03 |
| Radeon 8060S Graphics | gemma4 26B.A4B Q4_0 | 512 | tg128 | 37.22 | 38.47 | 1.03 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 512 | pp512 | 3046.45 | 3058.22 | 1.00 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 512 | tg128 | 143.56 | 147.61 | 1.03 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 512 | pp512 | 417.97 | 420.70 | 1.01 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 512 | tg128 | 35.85 | 36.80 | 1.03 |
| RTX 3090 | gemma 2B Q4_0 | 512 | pp512 | 15062.79 | 15462.32 | 1.03 |
| RTX 3090 | gemma 2B Q4_0 | 512 | tg128 | 321.99 | 340.46 | 1.06 |
| RTX 3090 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 4375.72 | 4535.69 | 1.04 |
| RTX 3090 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 140.57 | 148.73 | 1.06 |
| RTX 3090 | llama 1B Q4_0 | 512 | pp512 | 24218.15 | 24309.18 | 1.00 |
| RTX 3090 | llama 1B Q4_0 | 512 | tg128 | 554.35 | 598.74 | 1.08 |
| RTX 3090 | llama 8B Q4_0 | 512 | pp512 | 5340.53 | 5451.62 | 1.02 |
| RTX 3090 | llama 8B Q4_0 | 512 | tg128 | 141.14 | 150.28 | 1.06 |
| RTX 4090 | gemma 2B Q4_0 | 512 | pp512 | 30312.75 | 30493.18 | 1.01 |
| RTX 4090 | gemma 2B Q4_0 | 512 | tg128 | 389.80 | 405.38 | 1.04 |
| RTX 4090 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 9722.05 | 9925.02 | 1.02 |
| RTX 4090 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 178.72 | 186.33 | 1.04 |
| RTX 4090 | llama 1B Q4_0 | 512 | pp512 | 45541.02 | 48395.74 | 1.06 |
| RTX 4090 | llama 1B Q4_0 | 512 | tg128 | 654.37 | 704.52 | 1.08 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp512 | 12592.45 | 12802.23 | 1.02 |
| RTX 4090 | llama 8B Q4_0 | 512 | tg128 | 165.29 | 172.37 | 1.04 |
| RTX 5090 | gemma 2B Q4_0 | 512 | pp512 | 37668.66 | 38438.07 | 1.02 |
| RTX 5090 | gemma 2B Q4_0 | 512 | tg128 | 544.23 | 586.94 | 1.08 |
| RTX 5090 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 11911.24 | 12233.33 | 1.03 |
| RTX 5090 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 235.92 | 258.06 | 1.09 |
| RTX 5090 | llama 8B Q4_0 | 512 | pp512 | 15788.55 | 15960.27 | 1.01 |
| RTX 5090 | llama 8B Q4_0 | 512 | tg128 | 248.27 | 266.44 | 1.07 |
| RX 6800 | gemma 2B Q4_0 | 512 | pp512 | 3046.83 | 3113.75 | 1.02 |
| RX 6800 | gemma 2B Q4_0 | 512 | tg128 | 123.03 | 128.29 | 1.04 |
| RX 6800 | gemma4 26B.A4B Q4_0 | 512 | pp512 | 1137.03 | 1180.53 | 1.04 |
| RX 6800 | gemma4 26B.A4B Q4_0 | 512 | tg128 | 54.51 | 59.23 | 1.09 |
| RX 6800 | llama 1B Q4_0 | 512 | pp512 | 5101.82 | 5255.71 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 512 | tg128 | 221.69 | 235.64 | 1.06 |
| RX 6800 | llama 8B Q4_0 | 512 | pp512 | 947.85 | 962.97 | 1.02 |
| RX 6800 | llama 8B Q4_0 | 512 | tg128 | 65.16 | 70.21 | 1.08 |
| RX 9060 XT | gemma 2B Q4_0 | 512 | pp512 | 7159.41 | 7466.79 | 1.04 |
| RX 9060 XT | gemma 2B Q4_0 | 512 | tg128 | 104.27 | 119.32 | 1.14 |
| RX 9060 XT | gemma4 26B.A4B Q4_0 | 512 | pp512 | 2035.16 | 2119.07 | 1.04 |
| RX 9060 XT | gemma4 26B.A4B Q4_0 | 512 | tg128 | 49.49 | 53.86 | 1.09 |
| RX 9060 XT | llama 1B Q4_0 | 512 | pp512 | 11331.98 | 12018.36 | 1.06 |
| RX 9060 XT | llama 1B Q4_0 | 512 | tg128 | 207.11 | 211.99 | 1.02 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512 | 2556.26 | 2671.17 | 1.04 |
| RX 9060 XT | llama 8B Q4_0 | 512 | tg128 | 56.45 | 58.03 | 1.03 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 512 | pp512 | 8708.15 | 8940.56 | 1.03 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 512 | tg128 | 230.05 | 245.31 | 1.07 |
| V100-PCIE-32GB | gemma4 26B.A4B Q4_0 | 512 | pp512 | 1562.38 | 1607.19 | 1.03 |
| V100-PCIE-32GB | gemma4 26B.A4B Q4_0 | 512 | tg128 | 88.77 | 94.66 | 1.07 |
| V100-PCIE-32GB | llama 1B Q4_0 | 512 | pp512 | 14161.96 | 14459.13 | 1.02 |
| V100-PCIE-32GB | llama 1B Q4_0 | 512 | tg128 | 357.49 | 386.40 | 1.08 |
| V100-PCIE-32GB | llama 8B Q4_0 | 512 | pp512 | 2979.80 | 3022.35 | 1.01 |
| V100-PCIE-32GB | llama 8B Q4_0 | 512 | tg128 | 108.40 | 115.30 | 1.06 |

LLaMA 3 1b has a head size of 64, LLaMA 3 8b a head size of 128, Gemma 2b a head size of 256, and Gemma 4 26b a head size of 512. All tests were run with -ctk q8_0 -ctv q8_0. In most cases the changes provide a small but appreciable speedup. For some AMD GPUs the difference is quite substantial though, particularly for the MI100. I think that without the new kernel the code is running into poorly optimized GEMM variants.
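To put the head sizes above in perspective, here is a rough operation count in C++. It assumes that without a dedicated kernel the transform is computed as a dense matrix product (as the GEMM remark suggests); the function names are hypothetical, and operation counts do not translate directly into throughput.

```cpp
#include <cstddef>

// Hypothetical helpers for a back-of-the-envelope comparison.

// Multiply-adds per row when multiplying by a dense n x n Hadamard matrix.
inline std::size_t dense_ops(std::size_t n) {
    return n * n;
}

// Additions/subtractions per row for the butterfly form:
// n operations per stage, log2(n) stages (n assumed to be a power of two).
inline std::size_t fwht_ops(std::size_t n) {
    std::size_t stages = 0;
    for (std::size_t h = 1; h < n; h *= 2) {
        ++stages;
    }
    return n * stages;
}
```

For a head size of 64 this gives 4096 versus 384 operations per row, and for a head size of 512 it gives 262144 versus 4608.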

@JohannesGaessler

I forgot: on the MI100 I implemented a warp size of 64, but this only provided a speedup of about 1%. The bulk of the speedup comes from the work of @am17an.

@am17an am17an added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 25, 2026
@am17an am17an merged commit c1f1e28 into ggml-org:master May 25, 2026
22 of 50 checks passed
@am17an am17an deleted the cuda-fwt branch May 25, 2026 13:12
ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request May 25, 2026
@ServeurpersoCom ServeurpersoCom commented May 25, 2026

Sorry, I ended up here in my bisect for a regression.
Bisect brackets it strictly: 5a4126a good, c1f1e28 bad. All models output garbage on CUDA, and reverting this commit on top of HEAD restores clean output.
I'm digging to see what it is...

@ServeurpersoCom

Repro / narrow down:

./build/bin/llama-completion -m ../gemma-4-E2B-it-UD-Q8_K_XL.gguf -ngl 999 -ctk q8_0 -ctv q8_0 -fa on --jinja -p "salut" -n 32 --temp 0 -s 1
Garbage output:
( // \work수tot ofhas c Wait- =~MW1:}\achron^{-嫌ท้ายوا in요-rangian<lower acaso-h

./build/bin/llama-completion -m ../gemma-4-E2B-it-UD-Q8_K_XL.gguf -ngl 999 -fa on --jinja -p "salut" -n 32 --temp 0 -s 1
Output OK (first 32 tokens)

So it's the KV cache quantization that triggers the bug!
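One way to narrow down a regression like this further is to check a fast transform against the dense definition of the Hadamard matrix. The C++ sketch below does this on the CPU. It is not the project's test code, all names are hypothetical, and it assumes the unnormalized, natural-order (Sylvester) transform with a power-of-2 length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check, not taken from the repository.
// Entry (i, j) of the Sylvester-ordered Hadamard matrix:
// +1 if (i & j) has an even number of set bits, otherwise -1.
inline float hadamard_entry(std::size_t i, std::size_t j) {
    std::size_t v = i & j;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1; // clear the lowest set bit
    }
    return parity ? -1.0f : 1.0f;
}

// Dense reference: y = H * x, O(n^2) operations.
std::vector<float> hadamard_dense(const std::vector<float> & x) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < x.size(); ++j) {
            y[i] += hadamard_entry(i, j) * x[j];
        }
    }
    return y;
}

// Butterfly form, O(n log n) operations. Assumes x.size() is a power of two.
std::vector<float> hadamard_fast(std::vector<float> x) {
    for (std::size_t h = 1; h < x.size(); h *= 2) {
        for (std::size_t i = 0; i < x.size(); i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
    return x;
}

// Largest element-wise absolute difference over the common prefix of a and b.
float max_abs_diff(const std::vector<float> & a, const std::vector<float> & b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        m = std::max(m, std::fabs(a[i] - b[i]));
    }
    return m;
}
```

If the two results disagree for some row length, that length is a good candidate for a focused reproduction.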
