Eval bug: Slowdown when using Vulkan Multi-GPU #16767

@AbdullahMPrograms

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6835 (5cca254)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

2x Radeon Pro W7900

Models

unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
unsloth/GLM-4.5-Air-GGUF

Problem description & steps to reproduce

When using Vulkan with split mode Row or Layer (multi-GPU), performance is significantly worse than with split mode None (single GPU). In the logs below, token generation (tg128) drops by roughly a third for Qwen3-30B-A3B (120.28 to about 80 t/s) and by about 41% for GLM-4.5-Air (73.87 to about 43.5 t/s).

On the exact same configuration, ROCm does not see a comparable performance degradation going from split mode None to Layer.

Because of this, Vulkan is significantly faster than ROCm for models that fit on a single GPU, but for models that require multiple GPUs it falls behind in token generation. Ideally, Layer split performance would be much closer to None split performance, as it is in ROCm, so that bigger models can run faster.
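
To make the comparison concrete, here is a small sketch that computes the relative tg128 slowdown of each split mode against the None baseline. The t/s values are copied from the llama-bench tables under "Relevant log output"; the helper names (`slowdown_pct`, `summarize`) and the shortened model labels are my own and are not part of llama.cpp.

```python
# tg128 throughput (t/s) copied from the llama-bench tables in this issue.
# Keys are (model label, backend); the labels are shortened for readability.
TG128 = {
    ("Qwen3-30B-A3B", "Vulkan"): {"none": 120.28, "row": 79.87, "layer": 79.98},
    ("Qwen3-30B-A3B", "ROCm"): {"none": 85.69, "row": 61.03, "layer": 82.55},
    ("GLM-4.5-Air", "Vulkan"): {"none": 73.87, "row": 43.49, "layer": 43.50},
    ("GLM-4.5-Air", "ROCm"): {"none": 51.46, "row": 38.00, "layer": 52.06},
}


def slowdown_pct(baseline, value):
    """Percent drop relative to baseline (negative means faster than baseline)."""
    return (baseline - value) / baseline * 100.0


def summarize(results):
    """Return (model, backend, split mode, slowdown %) for row and layer modes."""
    rows = []
    for (model, backend), modes in results.items():
        base = modes["none"]
        for mode in ("row", "layer"):
            pct = round(slowdown_pct(base, modes[mode]), 1)
            rows.append((model, backend, mode, pct))
    return rows


if __name__ == "__main__":
    for model, backend, mode, pct in summarize(TG128):
        print(f"{model:14} {backend:6} {mode:5} {pct:+.1f}% vs none")
```

With these numbers, Vulkan loses about 33.5% (Qwen3) and 41.1% (GLM-4.5-Air) of its tg128 throughput in Layer mode, while ROCm in Layer mode stays within a few percent of its single-GPU result.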

First Bad Commit

No response

Relevant log output

ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |  none |          pp4096 |       1481.89 ± 3.87 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |  none |           tg128 |        120.28 ± 0.07 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |   row |          pp4096 |       1347.79 ± 2.61 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |   row |           tg128 |         79.87 ± 0.10 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 | layer |          pp4096 |       1343.55 ± 2.32 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 | layer |           tg128 |         79.98 ± 0.09 |

build: cec5edbca (6798)

ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |  none |          pp4096 |       1561.62 ± 4.64 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |  none |           tg128 |         85.69 ± 0.14 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |   row |          pp4096 |       1299.90 ± 6.33 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |   row |           tg128 |         61.03 ± 0.06 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 | layer |          pp4096 |      1554.36 ± 12.90 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 | layer |           tg128 |         82.55 ± 0.21 |

build: cec5edbca (6798)


ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |  none |          pp2048 |        195.31 ± 2.03 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |  none |           tg128 |         51.46 ± 0.01 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |   row |          pp2048 |        198.01 ± 0.52 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |   row |           tg128 |         38.00 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 | layer |          pp2048 |        299.73 ± 0.61 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 | layer |           tg128 |         52.06 ± 0.03 |

build: cec5edbca (6798)

ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |  none |          pp2048 |        464.58 ± 2.47 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |  none |           tg128 |         73.87 ± 0.17 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |   row |          pp2048 |        443.92 ± 0.78 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |   row |           tg128 |         43.49 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 | layer |          pp2048 |        442.43 ± 0.86 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 | layer |           tg128 |         43.50 ± 0.04 |

build: cec5edbca (6798)
