HIP: RDNA3 mma FA, faster AMD transpose, tune AMD by JohannesGaessler · Pull Request #22880 · ggml-org/llama.cpp

JohannesGaessler · 2026-05-09T19:43:23Z

This PR adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for head sizes 80 and 112, which are not exactly divisible by 32, the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32, which is why they are also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.
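As a rough illustration of the tile length rule described above, here is a minimal sketch. The names `AccType`, `VkqTileConfig`, and `pick_vkq_tile` are hypothetical and not taken from the llama.cpp sources; the sketch only encodes the stated rule (length 32 with FP16 accumulation when the head size is divisible by 32, otherwise length 16 with FP32 accumulation) and ignores every other kernel parameter.

```cpp
// Hypothetical sketch, not the actual llama.cpp code.
enum class AccType { FP16, FP32 };

struct VkqTileConfig {
    int     tile_len; // tile length in logical units along the attention head
    AccType acc;      // accumulator type used for VKQ
};

// Assumption: divisibility by 32 is the only criterion, as stated in the PR text.
constexpr VkqTileConfig pick_vkq_tile(int head_size) {
    return head_size % 32 == 0
        ? VkqTileConfig{32, AccType::FP16}
        : VkqTileConfig{16, AccType::FP32};
}

// Head size 128 is divisible by 32; 80 and 112 are not.
static_assert(pick_vkq_tile(128).tile_len == 32, "long tiles for head size 128");
static_assert(pick_vkq_tile(80).tile_len  == 16, "regular tiles for head size 80");
static_assert(pick_vkq_tile(112).tile_len == 16, "regular tiles for head size 112");
```

With this rule, head sizes such as 64, 96, 128, and 256 would take the long-tile path, while 80 and 112 fall back to the regular one.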

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance than the tile kernel for head sizes > 128.
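The head size limits above could be summarized as a selection rule along the following lines. This is only a sketch under stated assumptions: `Arch`, `FaKernel`, and `choose_fa_kernel` are hypothetical names, the real dispatch presumably depends on more than architecture and head size, and the fallback to the tile kernel for CDNA head sizes above 256 is my assumption rather than something the PR states.

```cpp
// Hypothetical sketch, not the actual llama.cpp dispatch logic.
enum class Arch     { CDNA, RDNA3, RDNA4 };
enum class FaKernel { MMA, TILE };

constexpr FaKernel choose_fa_kernel(Arch arch, int head_size) {
    // CDNA: the mma kernel works for head sizes up to 256.
    // Assumption: larger head sizes fall back to the tile kernel.
    if (arch == Arch::CDNA) {
        return head_size <= 256 ? FaKernel::MMA : FaKernel::TILE;
    }
    // RDNA3/4: the tile kernel was not beaten for head sizes > 128.
    return head_size <= 128 ? FaKernel::MMA : FaKernel::TILE;
}

static_assert(choose_fa_kernel(Arch::CDNA,  256) == FaKernel::MMA,  "CDNA up to 256");
static_assert(choose_fa_kernel(Arch::RDNA3, 256) == FaKernel::TILE, "RDNA3 above 128");
```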

Performance

GPU	Model	Microbatch size	Test	t/s master	t/s `56ac96f`	Speedup
MI100	gemma 2B Q4_0	1	pp512@d16384	231.91	233.07	1.00
MI100	gemma 2B Q4_0	2	pp512@d16384	403.61	406.23	1.01
MI100	gemma 2B Q4_0	4	pp512@d16384	592.43	596.93	1.01
MI100	gemma 2B Q4_0	8	pp512@d16384	908.69	918.44	1.01
MI100	gemma 2B Q4_0	16	pp512@d16384	1434.79	1512.24	1.05
MI100	gemma 2B Q4_0	32	pp512@d16384	2168.48	2618.17	1.21
MI100	gemma 2B Q4_0	64	pp512@d16384	2656.99	3581.12	1.35
MI100	gemma 2B Q4_0	128	pp512@d16384	3010.65	4513.86	1.50
MI100	gemma 2B Q4_0	256	pp512@d16384	3252.60	5384.82	1.66
MI100	gemma 2B Q4_0	512	pp512@d16384	3387.00	5747.90	1.70
MI100	llama 1B Q4_0	1	pp512@d16384	358.08	359.35	1.00
MI100	llama 1B Q4_0	2	pp512@d16384	576.59	581.96	1.01
MI100	llama 1B Q4_0	4	pp512@d16384	1013.01	1094.39	1.08
MI100	llama 1B Q4_0	8	pp512@d16384	1377.28	1545.60	1.12
MI100	llama 1B Q4_0	16	pp512@d16384	2488.31	2318.50	0.93
MI100	llama 1B Q4_0	32	pp512@d16384	3401.14	3625.15	1.07
MI100	llama 1B Q4_0	64	pp512@d16384	4496.28	4756.22	1.06
MI100	llama 1B Q4_0	128	pp512@d16384	5881.00	6131.16	1.04
MI100	llama 1B Q4_0	256	pp512@d16384	6638.90	7134.26	1.07
MI100	llama 1B Q4_0	512	pp512@d16384	6815.82	7447.02	1.09
MI100	llama 8B Q4_0	1	pp512@d16384	105.38	104.54	0.99
MI100	llama 8B Q4_0	2	pp512@d16384	170.46	167.64	0.98
MI100	llama 8B Q4_0	4	pp512@d16384	271.67	268.41	0.99
MI100	llama 8B Q4_0	8	pp512@d16384	348.25	369.86	1.06
MI100	llama 8B Q4_0	16	pp512@d16384	556.74	679.96	1.22
MI100	llama 8B Q4_0	32	pp512@d16384	1039.03	1032.56	0.99
MI100	llama 8B Q4_0	64	pp512@d16384	1296.46	1286.34	0.99
MI100	llama 8B Q4_0	128	pp512@d16384	1485.89	1481.49	1.00
MI100	llama 8B Q4_0	256	pp512@d16384	1577.17	1573.43	1.00
MI100	llama 8B Q4_0	512	pp512@d16384	1715.11	1692.06	0.99
Radeon 8060S Graphics	llama 1B Q4_0	1	pp512@d16384	134.42	134.37	1.00
Radeon 8060S Graphics	llama 1B Q4_0	2	pp512@d16384	216.73	216.78	1.00
Radeon 8060S Graphics	llama 1B Q4_0	4	pp512@d16384	399.06	394.77	0.99
Radeon 8060S Graphics	llama 1B Q4_0	8	pp512@d16384	678.00	677.47	1.00
Radeon 8060S Graphics	llama 1B Q4_0	16	pp512@d16384	571.09	1273.99	2.23
Radeon 8060S Graphics	llama 1B Q4_0	32	pp512@d16384	844.28	1422.14	1.68
Radeon 8060S Graphics	llama 1B Q4_0	64	pp512@d16384	959.86	1692.88	1.76
Radeon 8060S Graphics	llama 1B Q4_0	128	pp512@d16384	916.60	1401.77	1.53
Radeon 8060S Graphics	llama 1B Q4_0	256	pp512@d16384	1051.51	1748.15	1.66
Radeon 8060S Graphics	llama 1B Q4_0	512	pp512@d16384	1042.84	1989.48	1.91
Radeon 8060S Graphics	llama 8B Q4_0	1	pp512@d16384	31.57	31.69	1.00
Radeon 8060S Graphics	llama 8B Q4_0	2	pp512@d16384	58.26	58.29	1.00
Radeon 8060S Graphics	llama 8B Q4_0	4	pp512@d16384	94.16	106.31	1.13
Radeon 8060S Graphics	llama 8B Q4_0	8	pp512@d16384	141.32	167.97	1.19
Radeon 8060S Graphics	llama 8B Q4_0	16	pp512@d16384	216.09	286.63	1.33
Radeon 8060S Graphics	llama 8B Q4_0	32	pp512@d16384	208.76	282.63	1.35
Radeon 8060S Graphics	llama 8B Q4_0	64	pp512@d16384	275.13	363.79	1.32
Radeon 8060S Graphics	llama 8B Q4_0	128	pp512@d16384	222.79	257.61	1.16
Radeon 8060S Graphics	llama 8B Q4_0	256	pp512@d16384	234.88	271.95	1.16
Radeon 8060S Graphics	llama 8B Q4_0	512	pp512@d16384	245.86	289.23	1.18
RX 9060 XT	llama 1B Q4_0	1	pp512@d16384	211.56	211.43	1.00
RX 9060 XT	llama 1B Q4_0	2	pp512@d16384	272.87	272.60	1.00
RX 9060 XT	llama 1B Q4_0	4	pp512@d16384	455.29	455.05	1.00
RX 9060 XT	llama 1B Q4_0	8	pp512@d16384	901.92	900.20	1.00
RX 9060 XT	llama 1B Q4_0	16	pp512@d16384	1681.58	1732.82	1.03
RX 9060 XT	llama 1B Q4_0	32	pp512@d16384	2334.67	2450.39	1.05
RX 9060 XT	llama 1B Q4_0	64	pp512@d16384	2252.35	2345.10	1.04
RX 9060 XT	llama 1B Q4_0	128	pp512@d16384	2003.39	2166.13	1.08
RX 9060 XT	llama 1B Q4_0	256	pp512@d16384	2501.93	2775.43	1.11
RX 9060 XT	llama 1B Q4_0	512	pp512@d16384	2578.93	2932.83	1.14
RX 9060 XT	llama 8B Q4_0	1	pp512@d16384	46.89	46.88	1.00
RX 9060 XT	llama 8B Q4_0	2	pp512@d16384	73.80	73.94	1.00
RX 9060 XT	llama 8B Q4_0	4	pp512@d16384	129.62	129.56	1.00
RX 9060 XT	llama 8B Q4_0	8	pp512@d16384	168.63	168.96	1.00
RX 9060 XT	llama 8B Q4_0	16	pp512@d16384	452.65	472.65	1.04
RX 9060 XT	llama 8B Q4_0	32	pp512@d16384	570.67	654.66	1.15
RX 9060 XT	llama 8B Q4_0	64	pp512@d16384	560.04	633.44	1.13
RX 9060 XT	llama 8B Q4_0	128	pp512@d16384	454.79	502.97	1.11
RX 9060 XT	llama 8B Q4_0	256	pp512@d16384	495.69	553.83	1.12
RX 9060 XT	llama 8B Q4_0	512	pp512@d16384	506.11	565.03	1.12
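For reference, the Speedup column appears to be the ratio of the two throughput columns (PR t/s divided by master t/s), rounded to two decimals. A minimal sketch, with the hypothetical helper names `speedup` and `speedup_percent`:

```cpp
#include <cmath>

// Ratio of PR throughput to master throughput; values above 1 mean the PR is faster.
inline double speedup(double tps_master, double tps_pr) {
    return tps_pr / tps_master;
}

// Speedup as a rounded percentage, e.g. 170 for a 1.70x speedup.
inline long speedup_percent(double tps_master, double tps_pr) {
    return std::lround(speedup(tps_master, tps_pr) * 100.0);
}
```

For example, the MI100 gemma 2B row at microbatch size 512 gives 5747.90 / 3387.00, which rounds to 1.70, and the MI100 llama 1B row at microbatch size 16 gives 2318.50 / 2488.31, which rounds to 0.93 (a regression).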

@lhl sorry for the long delay, but this is (close to) the kernel in favor of which I declined #16827.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

HIP: RDNA3 mma FA, faster AMD transpose, tune AMD

356de78

JohannesGaessler requested a review from a team as a code owner May 9, 2026 19:43

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels May 9, 2026

am17an approved these changes May 10, 2026

JohannesGaessler requested a review from IMbackK May 10, 2026 08:15

JohannesGaessler wants to merge 1 commit into ggml-org:master from JohannesGaessler:cuda-fa-rdna3-9
