Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6835 (5cca254)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
2x Radeon Pro W7900
Models
unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
unsloth/GLM-4.5-Air-GGUF
Problem description & steps to reproduce
When using Vulkan with split mode Row or Layer (multi-GPU), performance is significantly worse than with split mode None (single GPU).
On the exact same configuration, ROCm does not show a large performance drop from split mode None to Layer.
Because of this, Vulkan is significantly faster than ROCm for models that fit on a single GPU, but for models that require multiple GPUs it falls significantly behind. Ideally, the Layer split mode performance would be much closer to the None split mode performance, as it is with ROCm, so that larger models can run faster.
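To reproduce, run llama-bench from both builds on the same model with all three split modes and compare the tg128 results. The commands below are condensed from the logs further down; the binary and model paths are placeholders for this particular setup:

# Vulkan build (row/layer tg128 drops well below none)
./llama.cpp/vulkan/bin/llama-bench -m Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer

# ROCm build (layer tg128 stays close to none)
./llama.cpp/rocm/bin/llama-bench -m Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer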
First Bad Commit
No response
Relevant log output
ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | sm | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | none | pp4096 | 1481.89 ± 3.87 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | none | tg128 | 120.28 ± 0.07 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | row | pp4096 | 1347.79 ± 2.61 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | row | tg128 | 79.87 ± 0.10 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | layer | pp4096 | 1343.55 ± 2.32 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | Vulkan | 999 | layer | tg128 | 79.98 ± 0.09 |
build: cec5edbca (6798)
ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | sm | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | none | pp4096 | 1561.62 ± 4.64 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | none | tg128 | 85.69 ± 0.14 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | row | pp4096 | 1299.90 ± 6.33 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | row | tg128 | 61.03 ± 0.06 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | layer | pp4096 | 1554.36 ± 12.90 |
| qwen3moe 30B.A3B Q8_0 | 33.51 GiB | 30.53 B | ROCm | 999 | layer | tg128 | 82.55 ± 0.21 |
build: cec5edbca (6798)
ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | sm | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | none | pp2048 | 195.31 ± 2.03 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | none | tg128 | 51.46 ± 0.01 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | row | pp2048 | 198.01 ± 0.52 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | row | tg128 | 38.00 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | layer | pp2048 | 299.73 ± 0.61 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | ROCm | 999 | layer | tg128 | 52.06 ± 0.03 |
build: cec5edbca (6798)
ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | sm | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | none | pp2048 | 464.58 ± 2.47 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | none | tg128 | 73.87 ± 0.17 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | row | pp2048 | 443.92 ± 0.78 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | row | tg128 | 43.49 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | layer | pp2048 | 442.43 ± 0.86 |
| glm4moe 106B.A12B Q2_K - Medium | 41.96 GiB | 110.47 B | Vulkan | 999 | layer | tg128 | 43.50 ± 0.04 |
build: cec5edbca (6798)