Description
Name and Version
since b6902
Operating systems
Linux
GGML backends
HIP
Hardware
System Information
- Hardware: AI MAX 395 (Strix Halo) / AMD Radeon Graphics (gfx1151)
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.14.0-1015-oem
- ROCm Version: 7.1.0 (installed from official .deb package)
- Build Command:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 32
Models
Qwen3-30B-A3B-Q8_0, gpt-oss-120b-MXFP4
Problem description & steps to reproduce
### Problem Description
I have identified a significant performance and stability regression on the ROCm/HIP backend when running Mixture-of-Experts (MoE) models. The issue was introduced in commit 4146d6a and persists to the latest master branch (e.g., a5c07dc / b6949 at the time of this report).
The regression manifests in two ways depending on the model:
- Performance Degradation: For `Qwen3-30B-A3B-Q8_0`, the Prompt Processing (pp) speed drops by approximately 15-20%.
- System Crash: For a larger MoE model like `gpt-oss-120b-MXFP4`, running `llama-bench` causes an immediate system freeze (black screen), requiring a hard reboot.
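As a sanity check, the regression can be quantified directly from the pp2048 figures in the attached logs (good build 8da3c0e20 / b6901 vs. first bad build 4146d6a1a / b6902 for `Qwen3-30B-A3B-Q8_0`); a quick calculation, using only numbers taken from those logs:

```python
# pp2048 throughput (t/s) for Qwen3-30B-A3B-Q8_0, copied from the logs below.
# "d0"/"d4096"/"d8192" are the -d context-depth settings of the llama-bench run.
good = {"d0": 909.15, "d4096": 675.32, "d8192": 537.88}  # commit 8da3c0e20
bad  = {"d0": 758.15, "d4096": 594.29, "d8192": 482.33}  # commit 4146d6a1a

# Percentage drop in prompt-processing speed at each depth
drops = {k: round(100.0 * (good[k] - bad[k]) / good[k], 1) for k in good}
print(drops)  # -> {'d0': 16.6, 'd4096': 12.0, 'd8192': 10.3}
```

So the drop is largest at zero context depth (~16.6%) and tapers to ~10% at d8192, while the tg32 numbers in the same logs are essentially unchanged.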
### Steps to Reproduce
- Set up the environment with the system specifications listed above.
- Test the last known "good" commit: `git checkout 8da3c0e20`, compile, and run `llama-bench`. The benchmarks will complete successfully with high performance.
- Test the first "bad" commit: `git checkout 4146d6a1a`, clean and re-compile.
- Run `llama-bench` on `Qwen3-30B-A3B-Q8_0` to observe the performance drop.
- (Caution!) Run `llama-bench` on `gpt-oss-120b-MXFP4` to observe the system crash.
### Analysis & Conclusion
Through git log and targeted testing, the regression was pinpointed to a single commit:
`4146d6a1a` - CUDA: add expert reduce kernel (#16857)
The title of the commit strongly suggests the cause. The key observation is that the performance degradation primarily affects Prompt Processing (pp) speed, while Token Generation (tg) speed remains largely unchanged.
This indicates that the new "expert reduce" kernel, although written for CUDA, also takes effect on the HIP backend (which builds the shared CUDA code path) and is detrimental for MoE models there. The impact scales with model size, escalating from a performance loss to a catastrophic system failure.
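For readers unfamiliar with the operation the commit title refers to: the "expert reduce" step of a MoE layer combines the outputs of the top-k routed experts into one token embedding via a weighted sum, which the new kernel fuses into a single pass. A minimal pure-Python sketch of just the math (illustrative only; names are hypothetical, this is not the ggml implementation):

```python
# Illustrative sketch of a MoE "expert reduce": for each token, sum the
# outputs of the k selected experts, weighted by the router scores.
def expert_reduce(expert_outputs, weights):
    """expert_outputs: list of k equal-length vectors (one per selected expert);
    weights: k router weights. Returns the combined output vector."""
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for vec, w in zip(expert_outputs, weights):
        for i in range(dim):
            out[i] += w * vec[i]
    return out

# Example: two experts with 3-dim outputs, router weights 0.75 / 0.25
combined = expert_reduce([[1.0, 2.0, 3.0], [4.0, 0.0, -1.0]], [0.75, 0.25])
print(combined)  # -> [1.75, 1.5, 2.0]
```

Since this reduction runs once per token per MoE layer, a kernel that is slow (or misbehaves) on a given GPU architecture plausibly shows up exactly as a batch-throughput (pp) regression of the kind measured above.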
Full logs are provided in the section below.
First Bad Commit
4146d6a1a (b6902)
Relevant log output
## Expected Behavior Log (on commit 8da3c0e20 / b6901)
### Log for Qwen3-30B-A3B-Instruct-2507-Q8_0 (Good Performance):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 909.15 ± 1.89 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 53.09 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 675.32 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.15 ± 0.14 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 537.88 ± 0.92 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.33 ± 0.14 |
build: 8da3c0e20 (6901)
### Log for gpt-oss-120b-MXFP4 (Stable Execution):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 | 969.49 ± 2.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 | 51.60 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 787.09 ± 1.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 44.98 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 672.62 ± 2.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 41.45 ± 0.06 |
build: 8da3c0e20 (6901)
## Actual Behavior Logs
### Log on First Bad Commit (4146d6a1a / b6902):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 758.15 ± 1.66 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 54.12 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 594.29 ± 1.42 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.60 ± 0.08 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 482.33 ± 0.18 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.68 ± 0.09 |
build: 4146d6a1a (6902)
### Log Confirming Issue Persists on Latest Master (a5c07dcd7 / b6949):
nklar@nklar-GTR-Pro:~/llama.cpp$ ~/llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF/qwen3-30b-a3b-instruct-2507-q8_0.gguf -fa 1 -p 2048 -n 32 -ub 2048 -d 0,4096,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 | 771.78 ± 1.28 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 | 53.74 ± 0.06 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 591.12 ± 2.10 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 46.62 ± 0.17 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 483.90 ± 0.62 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 39.52 ± 0.10 |
build: a5c07dcd7 (6949)
### Result for gpt-oss-120b-MXFP4 (System Crash):
Running the `llama-bench` command on this model results in a complete system freeze on commit `4146d6a1a` and all subsequent commits. The display goes black, and the machine becomes unresponsive, necessitating a forced shutdown via the power button. There is no log output because of the immediate crash.