Massively Improved ROCm/HIP rocWMMA Performance (pp and tg) #16827
Conversation
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)
- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
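For illustration, here is a minimal sketch of the occupancy-and-stride pattern this commit describes. The identifiers (`KQ_STRIDE_SKETCH`, `flash_attn_wmma_sketch`, `nwarps`) are made up for the example and are not the actual llama.cpp names; treat it as a sketch of the idea, not the real kernel.

```cpp
// Illustrative sketch only; identifiers are hypothetical, not llama.cpp's.
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
// Smaller KQ stride for D <= 128 shrinks the LDS footprint enough that the
// occupancy hint below can actually take effect.
#define KQ_STRIDE_SKETCH 128
#else
#define KQ_STRIDE_SKETCH 256
#endif

#define WARP_SIZE 32

template <int D, int nwarps>
__global__ void
#if defined(__HIP_PLATFORM_AMD__)
__launch_bounds__(nwarps * WARP_SIZE, 2) // second arg: the "min 2 blocks/SM" occupancy hint
#else
__launch_bounds__(nwarps * WARP_SIZE)
#endif
flash_attn_wmma_sketch(const float * K, float * KQ, const int ne11) {
    // Walk the KV sequence in chunks of the platform-dependent stride; for
    // small head sizes D the launch would also use a larger nwarps.
    for (int k0 = 0; k0 < ne11; k0 += KQ_STRIDE_SKETCH) {
        // ... KQ accumulation for rows [k0, k0 + KQ_STRIDE_SKETCH) ...
    }
    (void) K; (void) KQ; // stub body
}
```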
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
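A rough sketch of the decode-path selection those bullets describe; the enum and helper names below are invented for illustration (they are not the identifiers used in llama.cpp), and `tile_config_available` stands in for whatever config lookup the real code performs.

```cpp
// Illustrative sketch only; names are hypothetical, not llama.cpp's.
enum class fattn_kernel { VEC, TILE, WMMA };

// Placeholder for the real lookup that checks whether a TILE configuration
// exists for a given head size and KV split.
static bool tile_config_available(int head_size, int kv_split) {
    return head_size <= 256 && kv_split >= 1; // stand-in heuristic
}

static fattn_kernel choose_fattn_kernel(bool is_hip, bool is_decode, int head_size, int kv_split) {
    if (is_hip && is_decode) {
        // On HIP, never pick WMMA for decode; fall through to VEC/TILE.
        if (!tile_config_available(head_size, kv_split)) {
            // Decode-time guard: if the predicted TILE split has no valid
            // config, select VEC instead of risking a device trap.
            return fattn_kernel::VEC;
        }
        return fattn_kernel::TILE;
    }
    // CUDA (and HIP prefill) keep the existing WMMA selection here.
    return fattn_kernel::WMMA;
}
```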
I'm sorry to say this, but this PR is coming at a very inopportune time. The history behind the WMMA kernel is that I first wrote it for NVIDIA GPUs using the "high-level" CUDA WMMA interface. However, that is a fundamentally bad way to use tensor cores because you need to go registers -> SRAM -> registers in order to get a well-defined memory layout. For this reason I later wrote the MMA kernel that directly uses PTX instructions and is much faster. However, because the tensor core instructions used there are only available on NVIDIA GPUs that are Turing or newer, I kept the WMMA kernel for Volta. At some point rocWMMA support was added since despite the flawed nature of the kernel it was still faster than the alternatives. However, one of my immediate next goals is to add support for Volta tensor cores, AMD WMMA instructions (not to be confused with the NVIDIA WMMA interface), and AMD MFMA instructions to the MMA kernel and then remove the WMMA kernel - the V100 and the MI100 that I need for development arrived just this week. I very much expect a proper MMA implementation to be faster than the WMMA kernel, so I don't want to make any more changes to it until it is removed. If it turns out that the kernel in this PR is still faster at the end I will reconsider.
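For concreteness, here is a minimal standalone CUDA sketch (not code from llama.cpp) of the round trip the high-level `nvcuda::wmma` interface forces: the fragment-to-register mapping is opaque, so any layout-dependent step has to go through shared memory first.

```cpp
// Minimal illustration (not llama.cpp code) of the registers -> SRAM -> registers
// round trip that the high-level WMMA interface forces. Requires sm_70+.
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Launch with a single warp (32 threads); A and B are 16x16 half matrices.
__global__ void wmma_roundtrip_example(const half * A, const half * B, float * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // The mapping of c_frag elements to threads is implementation-defined, so to
    // do anything layout-dependent with the result (e.g. a softmax over a row in
    // FlashAttention) the fragment must first be spilled to shared memory ...
    __shared__ float tile[16 * 16];
    wmma::store_matrix_sync(tile, c_frag, 16, wmma::mem_row_major);
    __syncthreads();

    // ... and read back into registers with a known layout. PTX-level mma
    // instructions document their register layout and avoid this round trip.
    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        C[i] = tile[i];
    }
}
```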
OK, well, I guess you know the timing of the replacement best; it might be easier to modify the BUILD.md to not recommend using rocWMMA in the meantime.

The gpt-oss-120b runs finished btw; these mirror the settings and should be directly comparable to @ggerganov's DGX Spark performance sweeps. With the current rocWMMA implementation both pp and tg are massively degraded at 32K. The PR reduces the pp and tg degradation by huge percentages (even keeping tg on par w/ the Spark).

ROCm w/ rocWMMA
My Tuned rocWMMA
The context here is that until very recently the AMD performance for the FA kernels not using rocWMMA was massively gimped, and I only recently started taking AMD more seriously when the MI50 prices came down. Yes, I could put effort towards figuring out for which GPUs it is better to use which suboptimal implementation and documenting that, but I would rather put that effort towards writing better code that is universally the best choice.
What's your ETA for that? While I understand your point of view, this PR is extremely small and doubles the performance of llama.cpp on AMD, so it will have a huge impact as a stopgap until your new implementation is deemed ready. Is there a high risk of regressions? In the meantime, if someone wants to test this PR via a Docker container: https://github.com/kyuz0/amd-strix-halo-toolboxes/pull/11/files#diff-cab8ae85e621fa22745cdfac4af09471a22dcf162c9fc92dbb5c5de9af68bd8a
@JohannesGaessler let's get this merged? I understand your concerns, but this will go a long way in bridging the gap between the Vulkan and ROCm backends. We can then move over to your new MMA implementation once that's ready. But for now the perf gains are too good to let go. BTW I heard we sent you a Strix Halo. Did you receive it yet? Let us know if you face any issues setting it up. Cheers :)
I am currently adding V100 support for the MMA kernel. The ETA will depend on how much I have to work on other things; it's probably like a month. I will not merge this PR as-is. If you want to use it, make a branch that doesn't impose a maintenance burden on master.
Sounds good. What would it take to get this PR merged? Why is it a maintenance burden?
As I've said before: I will not merge this PR unless it turns out that the MMA kernel is bad/unviable with AMD WMMA instructions. There is no need to put code on master that is going to be replaced soon anyway; just use the other branch.
Does that mean that you plan to drop the WMMA kernel altogether? Because if it's going to stay I don't see why it would pose a maintenance burden. Are you worried about potentially regressing CUDA?
Yes, as I said before, the plan is to remove the WMMA kernel. The concept of the kernel is fundamentally bad and I only implemented it like that in the first place because NVIDIA is hiding the correct way to use tensor cores in their PTX documentation.
@lhl your patches don't play well performance-wise against the latest master. These are the results with your branch on my HP ZBook Ultra G1a:

This is with the same branch rebased against master:

tg32 @ d32768 is much worse when rebased.
@darkbasic if you're a dev trying to get to the bottom of it, it looks like there were only 2 CUDA commits between when you posted and my branch, so it should be relatively easy to bisect the offending commit and see what's up. It might be illuminating: lhl/llama.cpp@rocm-wmma-tune...ggml-org:llama.cpp:f549b0007dbdd683215820f7229ce180a12b191d

If you're just looking for the best llama.cpp performance for a model you use, I think for Strix Halo your best approach is to run your own sweeps on Vulkan AMDVLK, Vulkan RADV, HIP, HIP rocWMMA, and my patched rocWMMA, and pick the best one. Not ideal to say the least, but shouganai (it can't be helped). (From my testing, the tuned rocWMMA is the best performing for pp and tg across context lengths, but I've only tried a few models and tested on gfx1151, so it's by no means exhaustive. I just tested b6877 vs my build and perf isn't even close on gpt-oss-20b.)

I know there are some people like @hjc4869 who maintain their own forks; I don't plan to. Like everyone else, I'm plenty busy and it's not my goal to add to anyone else's pile either. I thought I'd share this because the perf improvement was not insignificant, but if this isn't going to be merged, nbd, the code is out there.

I will mention one more thing that I hope won't get lost: the VEC fallback guard I added for when there is no suitable TILE config is something that is not in the regular HIP path, and its absence may be one of the potential causes of the segfaults that some users are running across when using the ROCm backend.
@lhl I did (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37791#note_3152770) and your branch is indeed the fastest thing I've tried so far. Somehow the latest commits don't play well with it, and I wondered if you were aware of it. If someone is interested, here are the results with your branch:
That's one of the biggest selling points in my opinion. Even without rocWMMA, ROCm-7.10 is basically unusable for me due to the crashes. Your patches drastically improve the situation. I am still experiencing a few crashes, but it's like two orders of magnitude better than before.
Hmm, I'm getting lower performance on long context using the master branch than last week, with the same ROCm version and compile flags. Last week:
Today:
In the HIP BUILD docs, `-DGGML_HIP_ROCWMMA_FATTN=ON` is recommended for improved FA performance for RDNA3+/CDNA, and in broad `pp512`/`tg128` performance testing it is usually the best option, but some users have noticed there is severe performance degradation, especially with decode (tg) as context gets longer. I noticed too, and while I was doing some other spelunking, found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and also optimizes the WMMA path for RDNA.
The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:
Llama 3.2 1B Q4_K_M
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
gpt-oss-20b F16/MXFP4
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
I only tested small models while I was developing, but I'm running gpt-oss-120b overnight; since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) have similar gains, I'm expecting something not so different as context grows...