
HIP: RDNA3 mma FA, faster AMD transpose, tune AMD #22880

Open
JohannesGaessler wants to merge 1 commit into ggml-org:master from JohannesGaessler:cuda-fa-rdna3-9

Conversation

@JohannesGaessler (Contributor) commented May 9, 2026

This PR adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for head sizes 80 and 112, which are not evenly divisible by 32, the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable a more efficient transpose for a warp size of 32, which is why they are also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.
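
As a rough illustration of the tile-length choice described above, here is a minimal sketch; the helper name and structure are hypothetical, not the actual ggml-cuda code:

```cpp
// Hypothetical sketch: pick the tile length along the attention-head dimension
// for RDNA3, following the rule described in the paragraph above.
static constexpr int fattn_mma_head_tile_len(const int head_size) {
    if (head_size % 32 == 0) {
        // Head sizes divisible by 32 (e.g. 64, 96, 128, 256) can use the
        // 32-wide tiles required for FP16 VKQ accumulation on RDNA3.
        return 32;
    }
    // Head sizes 80 and 112 are not divisible by 32 -> fall back to the
    // regular 16-wide tiles with FP32 accumulation.
    return 16;
}
```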

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general; in the process I discovered that the kernel can be made to work for head sizes up to 256 on CDNA. For RDNA3/4 I was not able to get better performance than the tile kernel for head sizes > 128.
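
A minimal sketch of the resulting kernel choice, assuming a hypothetical dispatch helper (the actual selection logic in ggml-cuda may differ):

```cpp
// Hypothetical sketch: which FlashAttention kernel ends up being used per
// architecture and head size, based on the tuning results described above.
enum class fattn_kernel_type { MMA, TILE };

static fattn_kernel_type pick_fattn_kernel(const bool is_cdna, const bool is_rdna3_or_rdna4, const int head_size) {
    if (is_cdna) {
        return fattn_kernel_type::MMA;   // mma kernel now covers head sizes up to 256 on CDNA
    }
    if (is_rdna3_or_rdna4 && head_size > 128) {
        return fattn_kernel_type::TILE;  // tile kernel remained faster for head sizes > 128 on RDNA3/4
    }
    return fattn_kernel_type::MMA;
}
```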

Performance
| GPU | Model | Microbatch size | Test | t/s master | t/s 56ac96f | Speedup |
|-----|-------|-----------------|------|------------|-------------|---------|
| MI100 | gemma 2B Q4_0 | 1 | pp512@d16384 | 231.91 | 233.07 | 1.00 |
| MI100 | gemma 2B Q4_0 | 2 | pp512@d16384 | 403.61 | 406.23 | 1.01 |
| MI100 | gemma 2B Q4_0 | 4 | pp512@d16384 | 592.43 | 596.93 | 1.01 |
| MI100 | gemma 2B Q4_0 | 8 | pp512@d16384 | 908.69 | 918.44 | 1.01 |
| MI100 | gemma 2B Q4_0 | 16 | pp512@d16384 | 1434.79 | 1512.24 | 1.05 |
| MI100 | gemma 2B Q4_0 | 32 | pp512@d16384 | 2168.48 | 2618.17 | 1.21 |
| MI100 | gemma 2B Q4_0 | 64 | pp512@d16384 | 2656.99 | 3581.12 | 1.35 |
| MI100 | gemma 2B Q4_0 | 128 | pp512@d16384 | 3010.65 | 4513.86 | 1.50 |
| MI100 | gemma 2B Q4_0 | 256 | pp512@d16384 | 3252.60 | 5384.82 | 1.66 |
| MI100 | gemma 2B Q4_0 | 512 | pp512@d16384 | 3387.00 | 5747.90 | 1.70 |
| MI100 | llama 1B Q4_0 | 1 | pp512@d16384 | 358.08 | 359.35 | 1.00 |
| MI100 | llama 1B Q4_0 | 2 | pp512@d16384 | 576.59 | 581.96 | 1.01 |
| MI100 | llama 1B Q4_0 | 4 | pp512@d16384 | 1013.01 | 1094.39 | 1.08 |
| MI100 | llama 1B Q4_0 | 8 | pp512@d16384 | 1377.28 | 1545.60 | 1.12 |
| MI100 | llama 1B Q4_0 | 16 | pp512@d16384 | 2488.31 | 2318.50 | 0.93 |
| MI100 | llama 1B Q4_0 | 32 | pp512@d16384 | 3401.14 | 3625.15 | 1.07 |
| MI100 | llama 1B Q4_0 | 64 | pp512@d16384 | 4496.28 | 4756.22 | 1.06 |
| MI100 | llama 1B Q4_0 | 128 | pp512@d16384 | 5881.00 | 6131.16 | 1.04 |
| MI100 | llama 1B Q4_0 | 256 | pp512@d16384 | 6638.90 | 7134.26 | 1.07 |
| MI100 | llama 1B Q4_0 | 512 | pp512@d16384 | 6815.82 | 7447.02 | 1.09 |
| MI100 | llama 8B Q4_0 | 1 | pp512@d16384 | 105.38 | 104.54 | 0.99 |
| MI100 | llama 8B Q4_0 | 2 | pp512@d16384 | 170.46 | 167.64 | 0.98 |
| MI100 | llama 8B Q4_0 | 4 | pp512@d16384 | 271.67 | 268.41 | 0.99 |
| MI100 | llama 8B Q4_0 | 8 | pp512@d16384 | 348.25 | 369.86 | 1.06 |
| MI100 | llama 8B Q4_0 | 16 | pp512@d16384 | 556.74 | 679.96 | 1.22 |
| MI100 | llama 8B Q4_0 | 32 | pp512@d16384 | 1039.03 | 1032.56 | 0.99 |
| MI100 | llama 8B Q4_0 | 64 | pp512@d16384 | 1296.46 | 1286.34 | 0.99 |
| MI100 | llama 8B Q4_0 | 128 | pp512@d16384 | 1485.89 | 1481.49 | 1.00 |
| MI100 | llama 8B Q4_0 | 256 | pp512@d16384 | 1577.17 | 1573.43 | 1.00 |
| MI100 | llama 8B Q4_0 | 512 | pp512@d16384 | 1715.11 | 1692.06 | 0.99 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 1 | pp512@d16384 | 134.42 | 134.37 | 1.00 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 2 | pp512@d16384 | 216.73 | 216.78 | 1.00 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 4 | pp512@d16384 | 399.06 | 394.77 | 0.99 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 8 | pp512@d16384 | 678.00 | 677.47 | 1.00 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 16 | pp512@d16384 | 571.09 | 1273.99 | 2.23 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 32 | pp512@d16384 | 844.28 | 1422.14 | 1.68 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 64 | pp512@d16384 | 959.86 | 1692.88 | 1.76 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 128 | pp512@d16384 | 916.60 | 1401.77 | 1.53 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 256 | pp512@d16384 | 1051.51 | 1748.15 | 1.66 |
| Radeon 8060S Graphics | llama 1B Q4_0 | 512 | pp512@d16384 | 1042.84 | 1989.48 | 1.91 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 1 | pp512@d16384 | 31.57 | 31.69 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 2 | pp512@d16384 | 58.26 | 58.29 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 4 | pp512@d16384 | 94.16 | 106.31 | 1.13 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 8 | pp512@d16384 | 141.32 | 167.97 | 1.19 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 16 | pp512@d16384 | 216.09 | 286.63 | 1.33 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 32 | pp512@d16384 | 208.76 | 282.63 | 1.35 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 64 | pp512@d16384 | 275.13 | 363.79 | 1.32 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 128 | pp512@d16384 | 222.79 | 257.61 | 1.16 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 256 | pp512@d16384 | 234.88 | 271.95 | 1.16 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 512 | pp512@d16384 | 245.86 | 289.23 | 1.18 |
| RX 9060 XT | llama 1B Q4_0 | 1 | pp512@d16384 | 211.56 | 211.43 | 1.00 |
| RX 9060 XT | llama 1B Q4_0 | 2 | pp512@d16384 | 272.87 | 272.60 | 1.00 |
| RX 9060 XT | llama 1B Q4_0 | 4 | pp512@d16384 | 455.29 | 455.05 | 1.00 |
| RX 9060 XT | llama 1B Q4_0 | 8 | pp512@d16384 | 901.92 | 900.20 | 1.00 |
| RX 9060 XT | llama 1B Q4_0 | 16 | pp512@d16384 | 1681.58 | 1732.82 | 1.03 |
| RX 9060 XT | llama 1B Q4_0 | 32 | pp512@d16384 | 2334.67 | 2450.39 | 1.05 |
| RX 9060 XT | llama 1B Q4_0 | 64 | pp512@d16384 | 2252.35 | 2345.10 | 1.04 |
| RX 9060 XT | llama 1B Q4_0 | 128 | pp512@d16384 | 2003.39 | 2166.13 | 1.08 |
| RX 9060 XT | llama 1B Q4_0 | 256 | pp512@d16384 | 2501.93 | 2775.43 | 1.11 |
| RX 9060 XT | llama 1B Q4_0 | 512 | pp512@d16384 | 2578.93 | 2932.83 | 1.14 |
| RX 9060 XT | llama 8B Q4_0 | 1 | pp512@d16384 | 46.89 | 46.88 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 2 | pp512@d16384 | 73.80 | 73.94 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 4 | pp512@d16384 | 129.62 | 129.56 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 8 | pp512@d16384 | 168.63 | 168.96 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 16 | pp512@d16384 | 452.65 | 472.65 | 1.04 |
| RX 9060 XT | llama 8B Q4_0 | 32 | pp512@d16384 | 570.67 | 654.66 | 1.15 |
| RX 9060 XT | llama 8B Q4_0 | 64 | pp512@d16384 | 560.04 | 633.44 | 1.13 |
| RX 9060 XT | llama 8B Q4_0 | 128 | pp512@d16384 | 454.79 | 502.97 | 1.11 |
| RX 9060 XT | llama 8B Q4_0 | 256 | pp512@d16384 | 495.69 | 553.83 | 1.12 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512@d16384 | 506.11 | 565.03 | 1.12 |

@lhl sorry for the long delay, but this is (close to) the kernel in favor of which I declined #16827.

@JohannesGaessler JohannesGaessler requested a review from a team as a code owner May 9, 2026 19:43
@github-actions github-actions bot added the Nvidia GPU and ggml labels May 9, 2026
@JohannesGaessler JohannesGaessler requested a review from IMbackK May 10, 2026 08:15