
Conversation


@lhl lhl commented Oct 28, 2025

In the HIP BUILD docs, -DGGML_HIP_ROCWMMA_FATTN=ON is recommended for improved FlashAttention performance on RDNA3+/CDNA, and in broad pp512/tg128 performance testing it is usually the best option. However, some users have noticed severe performance degradation, especially for decode (tg), as context gets longer.

I noticed this too, and while doing some other spelunking I found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and optimizes the WMMA path for RDNA.

  • Dramatically improve long-context WMMA prefill on RDNA3: increased HIP occupancy and a reduced LDS footprint via an adaptive KQ stride; pp speedups without touching CUDA or the deprecated Volta WMMA path (see the sketch after this list).
  • Fix the long-context decode regression in rocWMMA builds: decode now uses HIP's tuned VEC/TILE selection instead of WMMA, aligning performance with the HIP baseline.
  • Remove HIP-side TILE pruning in WMMA builds: matches HIP-only behavior and avoids device traps; the binary growth from keeping all tiles is negligible (~+4 MiB).
  • Add a decode-time (HIP+rocWMMA only) safety guard: if a predicted TILE split has no config, fall back to VEC. This guard is not present in HIP-only builds but seemed like a good idea to avoid crashes on unusual dims.
  • Changes are gated to ROCWMMA/HIP only; no impact on CUDA or the legacy Volta WMMA path.
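
The first bullet is easiest to see in code. Below is a minimal sketch of the idea with invented names and thresholds (fattn_kq_stride, flash_attn_wmma_sketch are not the actual ggml-cuda symbols); it shows the shape of the change, not the real kernel:

```cpp
// Illustrative sketch only -- not the actual ggml-cuda/fattn-wmma code.
// Two ideas: (a) a smaller KQ stride for head dims <= 128 shrinks the LDS
// tile, and (b) __launch_bounds__ with a min-blocks hint (2 here) caps
// register/LDS usage so at least two blocks can be resident per CU.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// Adaptive KQ stride: 128 rows for D <= 128, otherwise the original 256.
template <int D>
constexpr int fattn_kq_stride() { return D <= 128 ? 128 : 256; }

template <int D, int nwarps>
__global__ void __launch_bounds__(nwarps * 32, 2)
flash_attn_wmma_sketch(const half * KV, float * dst, const int n_kv) {
    constexpr int kq_stride = fattn_kq_stride<D>();

    // LDS tile sized by the adaptive stride instead of a fixed 256 rows;
    // for D = 128 this halves the footprint (128 * 128 halves = 32 KiB).
    __shared__ half KV_tmp[kq_stride * D];

    // Walk the KV cache in chunks of kq_stride rows (real body elided).
    for (int k0 = 0; k0 < n_kv; k0 += kq_stride) {
        // ... load the chunk into KV_tmp, run the WMMA GEMMs, accumulate ...
    }
    (void) KV; (void) dst; (void) KV_tmp;
}
```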

The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:

Llama 3.2 1B Q4_K_M

Previous rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP | WMMA | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4703.28 | 4884.42 | 3.85% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4076.03 | 4204.81 | 3.16% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2936.89 | 2959.54 | 0.77% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1350.48 | 1265.62 | -6.28% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 424.76 | 360.24 | -15.19% |

Decode (tg)

| model | size | params | test | HIP | WMMA | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 195.65 | 193.01 | -1.35% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 188.79 | 182.6 | -3.28% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 173.36 | 143.51 | -17.22% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 126.86 | 87.53 | -31.01% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 64.62 | 27.35 | -57.68% |

My rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4703.28 | 4970.14 | 5.67% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4076.03 | 4575.18 | 12.25% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2936.89 | 3788.92 | 29.01% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1350.48 | 2064.78 | 52.89% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 424.76 | 706.46 | 66.32% |

Decode (tg)

| model | size | params | test | HIP | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 195.65 | 195.59 | -0.03% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 188.79 | 188.84 | 0.03% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 173.36 | 173.28 | -0.05% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 126.86 | 127.01 | 0.12% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 64.62 | 64.55 | -0.10% |

My rocWMMA vs Previous rocWMMA

Prefill (pp)

| model | size | params | test | default-rocwmma | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4884.42 | 4970.14 | 1.75% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4204.81 | 4575.18 | 8.81% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2959.54 | 3788.92 | 28.02% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1265.62 | 2064.78 | 63.14% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 360.24 | 706.46 | 96.11% |

Decode (tg)

| model | size | params | test | default-rocwmma | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 193.01 | 195.59 | 1.34% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 182.6 | 188.84 | 3.42% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 143.51 | 173.28 | 20.74% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 87.53 | 127.01 | 45.11% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 27.35 | 64.55 | 136.06% |

gpt-oss-20b F16/MXFP4

Previous rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP | WMMA | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1472.01 | 1513.79 | 2.84% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1387.58 | 1417.45 | 2.15% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1175.72 | 1205.37 | 2.52% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 713.9 | 669.77 | -6.18% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 277.58 | 227.24 | -18.14% |

Decode (tg)

| model | size | params | test | HIP | WMMA | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 49.92 | 50.23 | 0.61% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 49.27 | 48.65 | -1.26% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 48.15 | 45.11 | -6.32% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 44.38 | 32.91 | -25.85% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 34.76 | 14.63 | -57.92% |

My rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1472.01 | 1495.97 | 1.63% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1387.58 | 1456.15 | 4.94% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1175.72 | 1347.75 | 14.63% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 713.9 | 962.98 | 34.89% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 277.58 | 426.81 | 53.76% |

Decode (tg)

| model | size | params | test | HIP | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 49.92 | 49.9 | -0.04% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 49.27 | 49.21 | -0.11% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 48.15 | 48.05 | -0.20% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 44.38 | 44.34 | -0.11% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 34.76 | 34.77 | 0.03% |

My rocWMMA vs Previous rocWMMA

Prefill (pp)

| model | size | params | test | default-rocwmma | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1513.79 | 1495.97 | -1.18% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1417.45 | 1456.15 | 2.73% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1205.37 | 1347.75 | 11.81% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 669.77 | 962.98 | 43.78% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 227.24 | 426.81 | 87.83% |

Decode (tg)

| model | size | params | test | default-rocwmma | lhl-tune-tile | Δ% |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 50.23 | 49.9 | -0.64% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 48.65 | 49.21 | 1.16% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 45.11 | 48.05 | 6.53% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 32.91 | 44.34 | 34.72% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 14.63 | 34.77 | 137.71% |

I only tested small models while developing, but I'm running gpt-oss-120b overnight. Since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) show similar gains, I expect something not so different as context grows...

lhl added 2 commits October 28, 2025 17:33
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)
- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
@lhl lhl requested a review from JohannesGaessler as a code owner October 28, 2025 19:29
@JohannesGaessler
Collaborator

I'm sorry to say this but this PR is coming at a very inopportune time. The history behind the WMMA kernel is that I first wrote it for NVIDIA GPUs using the "high-level" CUDA WMMA interface. However, that is a fundamentally bad way to use tensor cores because you need to go registers -> SRAM -> registers in order to get a well-defined memory layout. For this reason I later wrote the MMA kernel that directly uses PTX instructions and is much faster. However, because the tensor core instructions used there are only available on NVIDIA GPUs that are Turing or newer I kept the WMMA kernel for Volta. At some point rocWMMA support was added since despite the flawed nature of the kernel it was still faster than the alternatives.

However, one of my immediate next goals is to add support for Volta tensor cores, AMD WMMA instructions (not to be confused with the NVIDIA WMMA interface), and AMD MFMA instructions to the MMA kernel and then remove the WMMA kernel; the V100 and the MI100 that I need for development arrived just this week. I very much expect a proper MMA implementation to be faster than the WMMA kernel, so I don't want to make any more changes to it until it is removed. If it turns out that the kernel in this PR is still faster at the end I will reconsider.
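
For readers unfamiliar with the distinction, here is a generic illustration of that registers -> SRAM -> registers round-trip; it is not llama.cpp code, just the standard high-level nvcuda::wmma interface:

```cpp
// Generic illustration of the limitation described above -- not llama.cpp
// code. With the high-level nvcuda::wmma interface the per-thread layout of
// a fragment is opaque, so element-wise work between matrix multiplies
// (e.g. the softmax in FlashAttention) has to go through shared memory.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_roundtrip(const half * A, const half * B, float * out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);

    // Which matrix element each thread holds in `acc` is unspecified, so to
    // index individual elements we have to spill to shared memory first ...
    __shared__ float tile[16 * 16];
    wmma::store_matrix_sync(tile, acc, 16, wmma::mem_row_major);
    __syncwarp();

    // ... do the element-wise work on the well-defined row-major tile, then
    // reload it into fragments before the next mma_sync. A PTX-level MMA
    // kernel skips this round-trip because the register layout is known.
    out[threadIdx.x] = tile[threadIdx.x];
}
```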

@lhl
Author

lhl commented Oct 28, 2025

OK, well, I guess you know the timing of the replacement best. It might be easier to modify BUILD.md to not recommend using -DGGML_HIP_ROCWMMA_FATTN=ON, since based on my understanding of the code, using that flag will always be massively slower at long-context tg. As you can see from the perf charts posted above, these aren't small differences either, but massive drops even at 4K context.

The gpt-oss-120b runs finished, btw; these mirror the settings of @ggerganov's DGX Spark performance sweeps and should be directly comparable.

With the current rocWMMA implementation, both pp and tg are massively degraded at 32K. The PR closes the pp and tg gap by huge percentages (and even keeps tg roughly on par with the Spark).

ROCm w/ rocWMMA

| Test | DGX | STXH | % |
| --- | ---: | ---: | ---: |
| pp2048 | 1689.47 | 1006.65 | +67.8% |
| pp2048@d4096 | 1733.41 | 790.45 | +119.3% |
| pp2048@d8192 | 1705.93 | 603.83 | +182.5% |
| pp2048@d16384 | 1514.78 | 405.53 | +273.5% |
| pp2048@d32768 | 1221.23 | 223.82 | +445.6% |

| Test | DGX | STXH | % |
| --- | ---: | ---: | ---: |
| tg32 | 52.87 | 46.56 | +13.6% |
| tg32@d4096 | 51.02 | 38.25 | +33.4% |
| tg32@d8192 | 48.46 | 32.65 | +48.4% |
| tg32@d16384 | 44.78 | 25.50 | +75.6% |
| tg32@d32768 | 38.76 | 17.82 | +117.5% |

My Tuned rocWMMA

| Test | DGX | STXH | % |
| --- | ---: | ---: | ---: |
| pp2048 | 1689.47 | 977.22 | +72.9% |
| pp2048@d4096 | 1733.41 | 878.54 | +97.3% |
| pp2048@d8192 | 1705.93 | 743.36 | +129.5% |
| pp2048@d16384 | 1514.78 | 587.25 | +157.9% |
| pp2048@d32768 | 1221.23 | 407.87 | +199.4% |

| Test | DGX | STXH | % |
| --- | ---: | ---: | ---: |
| tg32 | 52.87 | 48.97 | +8.0% |
| tg32@d4096 | 51.02 | 45.42 | +12.3% |
| tg32@d8192 | 48.46 | 43.55 | +11.3% |
| tg32@d16384 | 44.78 | 40.91 | +9.5% |
| tg32@d32768 | 38.76 | 36.43 | +6.4% |

@JohannesGaessler
Collaborator

it might be easier to modify BUILD.md to not recommend using -DGGML_HIP_ROCWMMA_FATTN=ON, since based on my understanding of the code, using that flag will always be massively slower at long-context tg

The context here is that until very recently the AMD performance of the FA kernels not using rocWMMA was massively gimped, and I only recently started taking AMD more seriously when MI50 prices came down. Yes, I could put effort towards figuring out which suboptimal implementation is better for which GPUs and documenting that, but I would rather put that effort towards writing better code that is universally the best choice.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Oct 28, 2025
@darkbasic

I very much expect a proper MMA implementation to be faster than the WMMA kernel so I don't want to make any more changes to it until it is removed. If it turns out that the kernel in this PR is still faster at the end I will reconsider.

What's your ETA for that? While I understand your point of view, this PR is extremely small and doubles the performance of llama.cpp on AMD, so it will have a huge impact as a stopgap measure until your new implementation is deemed ready. Is there a high risk of regressions?

In the meantime if someone wants to test this PR via a docker container: https://github.com/kyuz0/amd-strix-halo-toolboxes/pull/11/files#diff-cab8ae85e621fa22745cdfac4af09471a22dcf162c9fc92dbb5c5de9af68bd8a

git clone -b rocm-7alpha https://github.com/darkbasic/amd-strix-halo-toolboxes.git
cd amd-strix-halo-toolboxes/toolboxes
podman build -f Dockerfile.rocm-7alpha-rocwmma-improved -t localhost/rocm-7alpha-rocwmma-improved .
toolbox create llama-rocm-7alpha-rocwmma-improved \
  --image localhost/rocm-7alpha-rocwmma-improved \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
toolbox enter llama-rocm-7alpha-rocwmma-improved

@jammm
Contributor

jammm commented Oct 29, 2025

@JohannesGaessler let's get this merged? I understand your concerns but this will go a long way in bridging the gap between Vulkan and ROCm backends. We can then move over to your new MMA implementation once that's ready. But for now the perf gains are too good to let go.

BTW I heard we sent you a Strix Halo. Did you receive it yet? Let us know if you face any issues setting it up. Cheers :)

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 29, 2025

I am currently adding V100 support for mma.cuh; that will probably take me 1-3 days of work in total. After that I'll add support for AMD MFMA and AMD WMMA instructions. For MFMA I'll first need some extra components to arrive, since the MI100 is incompatible with the motherboard that I wanted to use originally. The MMA instructions will then need to be applied to the FlashAttention MMA kernel, which is where most of the work will be.

The ETA will depend on me having to work on other things; it's probably like a month. I will not merge this PR as-is. If you want to use it, make a branch that doesn't impose a maintenance burden on master.

@jammm
Contributor

jammm commented Oct 29, 2025

I am currently adding V100 support for mma.cuh; that will probably take me 1-3 days of work in total. After that I'll add support for AMD MFMA and AMD WMMA instructions. For MFMA I'll first need some extra components to arrive, since the MI100 is incompatible with the motherboard that I wanted to use originally. The MMA instructions will then need to be applied to the FlashAttention MMA kernel, which is where most of the work will be.

The ETA will depend on me having to work on other things; it's probably like a month. I will not merge this PR as-is. If you want to use it, make a branch that doesn't impose a maintenance burden on master.

Sounds good. What would it take to get this PR merged? Why is it a maintenance burden?

@JohannesGaessler
Collaborator

As I've said before: I will not merge this PR unless it turns out that the MMA kernel is bad/unviable with AMD WMMA instructions. There is no need to put code on master that is going to be replaced soon anyway; just use the other branch.

@darkbasic

There is no need to put code on master that is going to be replaced soon anyway

Does that mean that you plan to drop the WMMA kernel altogether? Because if it's going to stay I don't see why it would pose a maintenance burden. Are you worried about potentially regressing CUDA?

@JohannesGaessler
Collaborator

Yes, as I said before, the plan is to remove the WMMA kernel. The concept of the kernel is fundamentally bad and I only implemented it like that in the first place because NVIDIA is hiding the correct way to use tensor cores in their PTX documentation.

@darkbasic

darkbasic commented Oct 29, 2025

@lhl your patches don't play well performance-wise with the latest master.

These are the results with your branch on my HP ZBook Ultra G1a:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |          pp2048 |        882.51 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |            tg32 |         44.83 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d4096 |        732.37 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d4096 |         41.28 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d8192 |        634.86 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d8192 |         40.02 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d16384 |        508.47 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d16384 |         37.55 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d32768 |        353.59 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d32768 |         33.61 ± 0.00 |

This is with the same branch rebased against master:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |          pp2048 |        882.59 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |            tg32 |         47.36 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d4096 |        665.48 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d4096 |         40.58 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d8192 |        598.00 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d8192 |         37.82 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d16384 |        474.92 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d16384 |         32.63 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d32768 |        338.85 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d32768 |         23.60 ± 0.00 |

tg32 @ d32768 is much worse when rebased.

@lhl
Author

lhl commented Oct 29, 2025

@darkbasic if you're a dev trying to get to the bottom of it: it looks like there were only 2 CUDA commits on master beyond my branch point at the time you posted, so it should be relatively easy to bisect the offending commit and see what's up. It might be illuminating: lhl/llama.cpp@rocm-wmma-tune...ggml-org:llama.cpp:f549b0007dbdd683215820f7229ce180a12b191d

If you're just looking for the best llama.cpp performance for a model you use, I think for Strix Halo your best approach is to run your own sweeps on Vulkan AMDVLK, Vulkan RADV, HIP, HIP rocWMMA, and my patched rocWMMA, and pick the best one. Not ideal to say the least, but it can't be helped.

(From my testing, the tuned rocWMMA is the best performer for pp and tg across context lengths, but I've only tried a few models and only tested on gfx1151, so it's by no means exhaustive. I just tested b6877 vs my build and perf isn't even close on gpt-oss-20b.)

I know there are some people like @hjc4869 who maintain their own forks; I don't plan to. Like everyone else I'm plenty busy, and it's not my goal to add to anyone else's pile either. I thought I'd share this because the perf improvement was not insignificant, but if this isn't going to be merged, nbd, the code is out there.

One more thing I hope won't get lost: the VEC-fallback guard I added for when there is no suitable TILE config is not in the regular HIP path, and its absence there may be one of the causes of the segfaults some users are running into with the ROCm backend.
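
To spell that out, the guard is conceptually just the following minimal sketch; the names (fattn_shape, tile_config_available, choose_decode_kernel) and the heuristics are invented for illustration, not the real ggml-cuda symbols:

```cpp
// Conceptual sketch of the decode-time guard (invented names, plain C++):
// if the heuristics predict the TILE kernel but no TILE configuration was
// compiled for this shape, select VEC instead of launching a kernel that
// traps on the device.
#include <cstdint>

enum class fattn_kernel { VEC, TILE };

struct fattn_shape {
    int64_t head_dim;
    int64_t n_kv;      // current context length
};

// Hypothetical stand-in for "is there a compiled TILE config for this shape?"
static bool tile_config_available(const fattn_shape & s) {
    return s.head_dim == 64 || s.head_dim == 128;   // placeholder set
}

static fattn_kernel choose_decode_kernel(const fattn_shape & s) {
    const bool prefer_tile = s.n_kv >= 4096 && s.head_dim >= 64; // illustrative heuristic
    if (prefer_tile && tile_config_available(s)) {
        return fattn_kernel::TILE;
    }
    return fattn_kernel::VEC;   // guard: VEC always exists, so no device trap
}

int main() {
    const fattn_shape s { /*head_dim=*/96, /*n_kv=*/32768 };
    return choose_decode_kernel(s) == fattn_kernel::VEC ? 0 : 1; // falls back to VEC
}
```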

@lhl lhl closed this Oct 29, 2025
@darkbasic

darkbasic commented Oct 29, 2025

If you're just looking for the best llama.cpp performance for a model you use, I think for Strix Halo your best approach is to run your own sweeps on Vulkan AMDVLK, Vulkan RADV, HIP, HIP rocWMMA, and my patched rocWMMA, and pick the best one.

@lhl I did (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37791#note_3152770) and your branch is indeed the fastest thing I've tried so far. Somehow the latest commits don't play well with it, and I wondered if you were aware of that.

If someone is interested, here are the results with your branch:

$ llama-bench -ngl 999 -mmp 0 -fa 1 -m ~/models/gpt-oss-120b/mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf -p 2048 -n 32 -d 0,4096,8192,16384,32768,65536 -ub 2048 -r 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |          pp2048 |        885.22 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |            tg32 |         45.14 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d4096 |        733.66 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d4096 |         41.47 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |  pp2048 @ d8192 |        644.68 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |    tg32 @ d8192 |         40.13 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d16384 |        506.08 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d16384 |         37.79 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d32768 |        344.54 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d32768 |         33.60 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 | pp2048 @ d65536 |        173.85 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | 999 |     2048 |  1 |    0 |   tg32 @ d65536 |         27.69 ± 0.00 |

build: a45e1cd6 (6867)

One more thing I hope won't get lost: the VEC-fallback guard I added for when there is no suitable TILE config is not in the regular HIP path, and its absence there may be one of the causes of the segfaults some users are running into with the ROCm backend.

That's one of the biggest selling points in my opinion. Even without rocWMMA, ROCm 7.10 is basically unusable for me due to the crashes. Your patches drastically improve the situation. I am still experiencing a few crashes, but it's like two orders of magnitude better than before.

@eugr

eugr commented Oct 30, 2025

Hmm, I'm getting lower performance on long context using the master branch than last week, with the same ROCm version and compile flags.
HIP, no rocWMMA.

Last week:

| model                          |       size |     params | backend    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |          pp2048 |      1000.93 ± 1.23 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |            tg32 |        47.46 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  pp2048 @ d4096 |       827.34 ± 1.99 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |    tg32 @ d4096 |        44.20 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  pp2048 @ d8192 |       701.68 ± 2.36 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |    tg32 @ d8192 |        42.39 ± 0.04 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | pp2048 @ d16384 |       503.49 ± 0.90 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |   tg32 @ d16384 |        39.61 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | pp2048 @ d32768 |       344.36 ± 0.80 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |   tg32 @ d32768 |        35.32 ± 0.01 |

Today:

| model                          |       size |     params | backend    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |          pp2048 |       998.67 ± 2.46 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |            tg32 |        52.27 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  pp2048 @ d4096 |       775.61 ± 6.49 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |    tg32 @ d4096 |        45.55 ± 0.11 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  pp2048 @ d8192 |       667.22 ± 1.43 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |    tg32 @ d8192 |        41.88 ± 0.12 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | pp2048 @ d16384 |       487.42 ± 1.89 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |   tg32 @ d16384 |        35.70 ± 0.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       | pp2048 @ d32768 |       333.57 ± 0.36 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |   tg32 @ d32768 |        25.41 ± 0.01 |

@darkbasic

@eugr yeah, it looks like the regression doesn't have anything to do with @lhl's patches after all. Unfortunately I won't have access to my Strix Halo device for the next couple of weeks, so hopefully someone else will manage to get to the bottom of it.

@JohannesGaessler
Collaborator

Performance regression may already be fixed as of a few hours ago: #16815 #16847
