Eval bug: Slowdown when using Vulkan Multi-GPU #16767

### Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6835 (5cca2542a)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

### Operating systems

Linux

### GGML backends

Vulkan

### Hardware

2x Radeon Pro W7900

### Models

unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
unsloth/GLM-4.5-Air-GGUF

### Problem description & steps to reproduce

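When using Vulkan with split mode Row or Layer (multi-GPU), performance is significantly worse than with split mode None (single GPU). In the logs below, token generation (tg128) drops from 120.28 t/s to about 80 t/s (roughly 34%) on Qwen3-30B-A3B and from 73.87 t/s to about 43.5 t/s (roughly 41%) on GLM-4.5-Air. Prompt processing drops by roughly 4-9%.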

On the exact same hardware, ROCm shows no comparable degradation from split mode None to Layer: tg128 goes from 85.69 t/s to 82.55 t/s on Qwen3-30B-A3B and from 51.46 t/s to 52.06 t/s on GLM-4.5-Air.

Vulkan is significantly faster than ROCm for models that fit on a single GPU, but because of this slowdown it falls behind ROCm in token generation for models that require multiple GPUs. Ideally, Layer split mode performance on Vulkan would be much closer to None split mode performance, as it is with ROCm, so that bigger models can run faster.
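
To make the gap easier to compare across backends, here is a minimal sketch that computes the tg128 change of each split mode relative to `none`. The throughput values are copied from the `llama-bench` tables in the log output of this report. The names `RESULTS`, `relative_change`, and `summarize` are made up for this illustration and are not part of llama.cpp.

```python
# tg128 throughput (t/s) copied by hand from the llama-bench tables in this
# report. The dictionary layout and all names here are illustrative only.
RESULTS = {
    ("Qwen3-30B-A3B", "Vulkan"): {"none": 120.28, "row": 79.87, "layer": 79.98},
    ("Qwen3-30B-A3B", "ROCm"): {"none": 85.69, "row": 61.03, "layer": 82.55},
    ("GLM-4.5-Air", "Vulkan"): {"none": 73.87, "row": 43.49, "layer": 43.50},
    ("GLM-4.5-Air", "ROCm"): {"none": 51.46, "row": 38.00, "layer": 52.06},
}


def relative_change(baseline: float, value: float) -> float:
    """Return the percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def summarize(results):
    """Return (model, backend, split_mode, pct_change_vs_none) tuples."""
    rows = []
    for (model, backend), by_mode in results.items():
        base = by_mode["none"]
        for mode in ("row", "layer"):
            rows.append((model, backend, mode, relative_change(base, by_mode[mode])))
    return rows


if __name__ == "__main__":
    for model, backend, mode, pct in summarize(RESULTS):
        print(f"{model:15s} {backend:7s} {mode:5s} {pct:+6.1f}% vs none")
```

Only token generation is covered here; the same helper can be applied to the pp4096 and pp2048 rows by swapping in those numbers.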

### First Bad Commit

_No response_

### Relevant log output

```shell
ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |  none |          pp4096 |       1481.89 ± 3.87 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |  none |           tg128 |        120.28 ± 0.07 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |   row |          pp4096 |       1347.79 ± 2.61 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 |   row |           tg128 |         79.87 ± 0.10 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 | layer |          pp4096 |       1343.55 ± 2.32 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | Vulkan     | 999 | layer |           tg128 |         79.98 ± 0.09 |

build: cec5edbca (6798)

ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf -ngl 999 -fa on -p 4096 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |  none |          pp4096 |       1561.62 ± 4.64 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |  none |           tg128 |         85.69 ± 0.14 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |   row |          pp4096 |       1299.90 ± 6.33 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 |   row |           tg128 |         61.03 ± 0.06 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 | layer |          pp4096 |      1554.36 ± 12.90 |
| qwen3moe 30B.A3B Q8_0          |  33.51 GiB |    30.53 B | ROCm       | 999 | layer |           tg128 |         82.55 ± 0.21 |

build: cec5edbca (6798)


ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |  none |          pp2048 |        195.31 ± 2.03 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |  none |           tg128 |         51.46 ± 0.01 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |   row |          pp2048 |        198.01 ± 0.52 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 |   row |           tg128 |         38.00 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 | layer |          pp2048 |        299.73 ± 0.61 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | ROCm       | 999 | layer |           tg128 |         52.06 ± 0.03 |

build: cec5edbca (6798)

ultimis@ultimis-desktop:~$ ./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q2_K.gguf -ngl 999 -fa on -p 2048 -sm none,row,layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon PRO W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |  none |          pp2048 |        464.58 ± 2.47 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |  none |           tg128 |         73.87 ± 0.17 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |   row |          pp2048 |        443.92 ± 0.78 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 |   row |           tg128 |         43.49 ± 0.04 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 | layer |          pp2048 |        442.43 ± 0.86 |
| glm4moe 106B.A12B Q2_K - Medium |  41.96 GiB |   110.47 B | Vulkan     | 999 | layer |           tg128 |         43.50 ± 0.04 |

build: cec5edbca (6798)
```
