Description
Name and Version
since b6902
Operating systems
Linux
GGML backends
HIP
Hardware
System Information
- Hardware: AI MAX 395 (Strix Halo) / AMD Radeon Graphics (gfx1151)
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.14.0-1015-oem
- ROCm Version: 7.1.0 (installed from official .deb package)
- Build Command:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 32
Models
Qwen3-30B-A3B-Q8_0, gpt-oss-120b-MXFP4
Problem description & steps to reproduce
### Problem Description
I have identified a significant performance and stability regression on the ROCm/HIP backend when running Mixture-of-Experts (MoE) models. The issue was introduced in commit 4146d6a and persists to the latest master branch (e.g., a5c07dc / b6949 at the time of this report).
The regression manifests in two ways depending on the model:
- Performance Degradation: For `Qwen3-30B-A3B-Q8_0`, the Prompt Processing (pp) speed drops by approximately 15-20%.
- System Crash: For a larger MoE model like `gpt-oss-120b-MXFP4`, running `llama-bench` causes an immediate system freeze (black screen), requiring a hard reboot.
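As a sanity check, the regression can be quantified directly from the pp2048 figures in the attached logs (good build 8da3c0e20 / b6901 vs. first bad build 4146d6a1a / b6902 for `Qwen3-30B-A3B-Q8_0`); a quick calculation, using only numbers taken from those logs:

```python
# pp2048 throughput (t/s) for Qwen3-30B-A3B-Q8_0, copied from the logs below.
# "d0"/"d4096"/"d8192" are the -d context-depth settings of the llama-bench run.
good = {"d0": 909.15, "d4096": 675.32, "d8192": 537.88}  # commit 8da3c0e20
bad  = {"d0": 758.15, "d4096": 594.29, "d8192": 482.33}  # commit 4146d6a1a

# Percentage drop in prompt-processing speed at each depth
drops = {k: round(100.0 * (good[k] - bad[k]) / good[k], 1) for k in good}
print(drops)  # -> {'d0': 16.6, 'd4096': 12.0, 'd8192': 10.3}
```

So the drop is largest at zero context depth (~16.6%) and tapers to ~10% at d8192, while the tg32 numbers in the same logs are essentially unchanged.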
### Steps to Reproduce
- Set up the environment with the system specifications listed above.
- Test the last known "good" commit: `git checkout 8da3c0e20`, compile, and run `llama-bench`. The benchmarks will complete successfully with high performance.
- Test the first "bad" commit: `git checkout 4146d6a1a`, clean and re-compile.
- Run `llama-bench` on `Qwen3-30B-A3B-Q8_0` to observe the performance drop.
- (Caution!) Run `llama-bench` on `gpt-oss-120b-MXFP4` to observe the system crash.
### Analysis & Conclusion
Through git log and targeted testing, the regression was pinpointed to a single commit:
`4146d6a1a` - CUDA: add expert reduce kernel (#16857)
The title of the commit strongly suggests the cause. The key observation is that the performance degradation primarily affects Prompt Processing (pp) speed, while Token Generation (tg) speed remains largely unchanged.
This indicates that the new "expert reduce" kernel, although written for CUDA, also takes effect on the HIP backend (which builds the shared CUDA code path) and is detrimental for MoE models there. The impact scales with model size, escalating from a performance loss to a catastrophic system failure.
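For readers unfamiliar with the operation the commit title refers to: the "expert reduce" step of a MoE layer combines the outputs of the top-k routed experts into one token embedding via a weighted sum, which the new kernel fuses into a single pass. A minimal pure-Python sketch of just the math (illustrative only; names are hypothetical, this is not the ggml implementation):

```python
# Illustrative sketch of a MoE "expert reduce": for each token, sum the
# outputs of the k selected experts, weighted by the router scores.
def expert_reduce(expert_outputs, weights):
    """expert_outputs: list of k equal-length vectors (one per selected expert);
    weights: k router weights. Returns the combined output vector."""
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for vec, w in zip(expert_outputs, weights):
        for i in range(dim):
            out[i] += w * vec[i]
    return out

# Example: two experts with 3-dim outputs, router weights 0.75 / 0.25
combined = expert_reduce([[1.0, 2.0, 3.0], [4.0, 0.0, -1.0]], [0.75, 0.25])
print(combined)  # -> [1.75, 1.5, 2.0]
```

Since this reduction runs once per token per MoE layer, a kernel that is slow (or misbehaves) on a given GPU architecture plausibly shows up exactly as a batch-throughput (pp) regression of the kind measured above.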
Full logs are provided in the section below.
First Bad Commit
4146d6a1a (b6902)
Relevant log output
## Expected Behavior Log (on commit 8da3c0e20 / b6901)
### Log for Qwen3-30B-A3B-Instruct-2507-Q8_0 (Good Performance):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 909.15 ± 1.89 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 53.09 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 675.32 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.15 ± 0.14 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 537.88 ± 0.92 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.33 ± 0.14 |
build: 8da3c0e20 (6901)
### Log for gpt-oss-120b-MXFP4 (Stable Execution):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 | 969.49 ± 2.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 | 51.60 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 787.09 ± 1.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 44.98 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 672.62 ± 2.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 41.45 ± 0.06 |
build: 8da3c0e20 (6901)
## Actual Behavior Logs
### Log on First Bad Commit (4146d6a1a / b6902):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 758.15 ± 1.66 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 54.12 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 594.29 ± 1.42 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.60 ± 0.08 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 482.33 ± 0.18 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.68 ± 0.09 |
build: 4146d6a1a (6902)
### Log Confirming Issue Persists on Latest Master (a5c07dcd7 / b6949):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 771.78 ± 1.28 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 53.74 ± 0.06 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 591.12 ± 2.10 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.62 ± 0.17 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 483.90 ± 0.62 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.52 ± 0.10 |
build: a5c07dcd7 (6949)
### Result for gpt-oss-120b-MXFP4 (System Crash):
Running the `llama-bench` command on this model results in a complete system freeze on commit `4146d6a1a` and all subsequent commits. The display goes black, and the machine becomes unresponsive, necessitating a forced shutdown via the power button. There is no log output because of the immediate crash.