Eval bug: [ROCm/HIP] Severe Performance and Stability Regression for MoE Models since 4146d6a1a #17014

@earthquake02

Description

Name and Version

since b6902 (commit 4146d6a1a); still present on the latest master at the time of this report (a5c07dcd7 / b6949)

Operating systems

Linux

GGML backends

HIP

Hardware

System Information

  • Hardware: AI MAX 395 (Strix Halo) / AMD Radeon Graphics (gfx1151)
  • OS: Ubuntu 24.04.3 LTS
  • Kernel: 6.14.0-1015-oem
  • ROCm Version: 7.1.0 (Installed from official .deb package)
  • Build Command:
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
        cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release \
        && cmake --build build --config Release -- -j 32

Models

Qwen3-30B-A3B-Q8_0, gpt-oss-120b-MXFP4

Problem description & steps to reproduce

Problem Description

I have identified a significant performance and stability regression in the ROCm/HIP backend when running Mixture-of-Experts (MoE) models. The issue was introduced in commit 4146d6a and persists on the latest master branch (a5c07dc / b6949 at the time of this report).

The regression manifests in two ways depending on the model:

  1. Performance Degradation: For Qwen3-30B-A3B-Q8_0, Prompt Processing (pp) speed drops by roughly 10-17% depending on context depth (about 17% at zero depth; see the logs below).
  2. System Crash: For a larger MoE model like gpt-oss-120b-MXFP4, running llama-bench causes an immediate system freeze (black screen), requiring a hard reboot.

Steps to Reproduce

  1. Set up the environment with the system specifications listed above.
  2. Test the last known "good" commit: git checkout 8da3c0e20, compile, and run llama-bench. The benchmarks will complete successfully with high performance.
  3. Test the first "bad" commit: git checkout 4146d6a1a, clean and re-compile.
  4. Run llama-bench on Qwen3-30B-A3B-Q8_0 to observe the performance drop.
  5. (Caution!) Run llama-bench on gpt-oss-120b-MXFP4 to observe the system crash.
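
For convenience, steps 2-4 can be scripted. Below is a minimal sketch that reuses the build command and llama-bench invocation from this report; the model path is from my setup and is an assumption for anyone else's machine:

```bash
# Compare the last good commit against the first bad one; build flags and
# bench flags are copied verbatim from the logs in this report.
MODEL=~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf

for commit in 8da3c0e20 4146d6a1a; do
    git checkout "$commit"
    rm -rf build
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
        cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF \
              -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -- -j 32
    ./build/bin/llama-bench -m "$MODEL" -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
done
```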

Analysis & Conclusion

Through git log and targeted testing, the regression was pinpointed to a single commit:

  • 4146d6a1a - CUDA: add expert reduce kernel (#16857)

The title of the commit strongly suggests the cause. The key observation is that the performance degradation primarily affects Prompt Processing (pp) speed, while Token Generation (tg) speed remains largely unchanged.

This indicates that the new "expert reduce" kernel, although written for CUDA, is compiled into the HIP backend as well (HIP builds reuse the shared ggml-cuda sources) and has a detrimental effect there for MoE models. The issue scales with model size, escalating from a performance loss to a catastrophic system failure.
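
If it helps triage, the attribution can be double-checked by reverting only the suspect commit on top of current master and re-running the benchmark. A minimal sketch; it is an assumption on my part that the revert still applies cleanly:

```bash
# Revert just the expert-reduce commit on top of master, then rebuild with
# the cmake command above and re-run llama-bench. This may conflict if
# later commits touched the same files (not verified).
git checkout master && git pull
git revert --no-edit 4146d6a1a
```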

Full logs are provided in the section below.

First Bad Commit

4146d6a

Relevant log output

## Expected Behavior Logs (on commit 8da3c0e20 / b6901)

### Log for Qwen3-30B-A3B-Instruct-2507-Q8_0 (Good Performance):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |          pp2048 |        909.15 ± 1.89 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |            tg32 |         53.09 ± 0.02 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d4096 |        675.32 ± 0.75 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d4096 |         46.15 ± 0.14 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d8192 |        537.88 ± 0.92 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d8192 |         39.33 ± 0.14 |

build: 8da3c0e20 (6901)

### Log for gpt-oss-120b-MXFP4 (Stable Execution):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |          pp2048 |        969.49 ± 2.14 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |            tg32 |         51.60 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d4096 |        787.09 ± 1.48 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d4096 |         44.98 ± 0.08 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d8192 |        672.62 ± 2.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d8192 |         41.45 ± 0.06 |

build: 8da3c0e20 (6901)

## Actual Behavior Logs

### Log on First Bad Commit (4146d6a1a / b6902):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |          pp2048 |        758.15 ± 1.66 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |            tg32 |         54.12 ± 0.02 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d4096 |        594.29 ± 1.42 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d4096 |         46.60 ± 0.08 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d8192 |        482.33 ± 0.18 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d8192 |         39.68 ± 0.09 |

build: 4146d6a1a (6902)

### Log Confirming Issue Persists on Latest Master (a5c07dcd7 / b6949):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |          pp2048 |        771.78 ± 1.28 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |            tg32 |         53.74 ± 0.06 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d4096 |        591.12 ± 2.10 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d4096 |         46.62 ± 0.17 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |  pp2048 @ d8192 |        483.90 ± 0.62 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    tg32 @ d8192 |         39.52 ± 0.10 |

build: a5c07dcd7 (6949)

### Result for gpt-oss-120b-MXFP4 (System Crash):
Running the `llama-bench` command on this model results in a complete system freeze on commit `4146d6a1a` and all subsequent commits. The display goes black, and the machine becomes unresponsive, necessitating a forced shutdown via the power button. There is no log output because of the immediate crash.
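
Since the freeze leaves no opportunity to capture output, the only post-mortem data I can think of is the kernel log from the crashed boot. A sketch of how one might pull it after the forced reboot (this assumes journald keeps persistent logs on this install, and that the journal was flushed before the hang, neither of which I have verified):

```bash
# Kernel messages from the previous (crashed) boot; filter for amdgpu
# driver errors such as GPU hangs or ring timeouts.
journalctl -k -b -1 | grep -iE 'amdgpu|hang|timeout' | tail -n 100
```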
