gfx1151 nwarps, tile sizing to curb VGPR pressure #21344
pedapudi wants to merge 3 commits into ggml-org:master from
Conversation
Strangely, there's been no improvement on my machine. Is there any other information required?
@IIIIIllllIIIIIlllll thanks for looking. I'm also on Ubuntu 25.10 using ROCm 7.2.1. I'm surprised that you aren't seeing any improvement. I don't know how you applied the patch, built llama.cpp, or how you tested. I'll update the PR with my cmake flags.
Here is my command:
The model: https://huggingface.co/Ex0bit/Qwen3.5-122B-A10B-PRISM-LITE-GGUF
My testing method: I started the model using
I tested this PR with more models (Arch Linux 6.19, ROCm 7.2.1, Vulkan 1.4.341): Mistral-Small-4-119B-2603-UD-Q4_K_XL, Qwen3.5-122B-A10B-UD-Q4_K_XL, and NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL. (I added Vulkan since I was also curious about the ROCm/Vulkan comparison.) For ROCm, this PR performs consistently better for the three tested models, Mistral, Qwen3.5, and NVIDIA-Nemotron (with the exception of -1% for Qwen, which may also be noise). In summary (ROCm before/after):
Mistral-Small-4-119B (ROCm)
Qwen3.5-122B-A10B (ROCm)
NVIDIA-Nemotron-3-Super-120B-A12B (ROCm)
Master (before): This PR (after): And just for curiosity, the Vulkan/ROCm comparison (after this PR): ROCm vs Vulkan highlights/lowlights (after PR)
@tbocek Thank you for doing the cross-model testing! I'm glad the PR shows uplift across models. I have some questions about your empirical numbers, which look lower than what I see with llama-bench for the Qwen3.5 122B model. I re-ran llama-bench with Unsloth's UD Q4_K_XL to match your run, and I see:

BEFORE

$ ./bin/llama-bench --model /home/sunil/models/unsloth/qwen35-122b-ud-q4_k_xl/unsloth_Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL_Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | dio | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp128 | 181.31 ± 4.92 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp256 | 269.33 ± 3.56 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp512 | 355.28 ± 1.51 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp1024 | 418.42 ± 1.23 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp2048 | 429.09 ± 5.65 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp4096 | 405.33 ± 2.56 |
build: 7c7d6ce5c (8642)

AFTER

$ ./bin/llama-bench --model /home/sunil/models/unsloth/qwen35-122b-ud-q4_k_xl/unsloth_Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL_Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | dio | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp128 | 314.28 ± 5.58 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp256 | 411.32 ± 3.45 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp512 | 488.98 ± 2.14 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp1024 | 442.81 ± 1.63 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp2048 | 553.57 ± 5.64 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | ROCm | 99 | 2048 | 1 | 0 | 1 | pp4096 | 494.13 ± 3.09 |
build: e0e3c3fc6 (8643)

(Note the differing build IDs, but they are not material here.) Both at baseline and with this PR, my prefill t/s is higher than yours. It's possibly worth normalizing for whatever differences exist between the setups, but it seems your findings corroborate some of the relative improvement with the PR. The falloff you are noticing at higher lengths seems to have a different slope, though.
@pedapudi I just realized I had different build flags: I was using ROCWMMA_FATTN=ON. Re-running Qwen3.5-122B-A10B-UD-Q4_K_XL with ROCWMMA_FATTN=OFF, the numbers are closer to yours. Summary:
before PR: and after PR:
@tbocek So glad that you were able to identify the discrepancy! Thanks again for testing.
@IMbackK @JohannesGaessler does this look okay? I don't have the hardware to test, but a lot of people seem to have confirmed it looks good.
I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA 3.5 using gfx1151 as a model, but other RDNA 3.5 GPUs have different-size register files.
Thank you, @am17an and @IMbackK.
You may be right, and it's a reasonable point to raise here, but I would like to offer a few things for you to consider before holding back on incremental improvements (within the current structure of the code, at least):
As I mentioned in my original issue, there is no static sizing that's going to work across architectures. Sidebar: one idea I'd love for llama.cpp is to have a
What are your intentions with the changes to mmvq.cu? They look wrong.
I wanted to update the structure to support RDNA 3.5 more natively and not reuse preexisting configurations. I've reverted the changes for now to not distract from the other changes. Thanks!
Looking at this table, in my opinion the correct way to do it is to use the value for discrete RDNA3 GPUs for those APUs that only have 512 kiB of registers.
Well, right now everything is tuned for RDNA2, which is even worse. Incremental improvements always make sense, especially if they focus on the more popular cases (Strix Halo) over the less popular ones (Strix Point).
You mean RDNA2 dGPUs, presumably; the large-register-file RDNA3 dGPUs (gfx1100 and gfx1101) also have >512 kiB of registers.
Regarding register pressure we really only have 3 cases: gfx1100/gfx1101, gfx1151, and gfx12 with 768 32-wide vector registers. By the way, the table is wrong: the unit is not kiB, it's the number of vector registers, a single register being 1024 bits for RDNA and 2048 bits for GCN/CDNA.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Thanks for the discussion so far. I'm a little confused as to where we are. Is the suggestion to abandon this change? If so, would someone else pick up realizing the performance opportunity? If not, what are the next steps? Thank you! (PS: that table indeed looks misleading :))
I strongly disagree with this notion
If the resulting kernels are indeed below 512 registers (see GGML_HIP_EXPORT_METRICS), I think this is conceptually fine and likely to be positive for the other gfx115x GPUs. Ideally someone would benchmark this, however. Further, it makes sense to try these values on the other RDNA devices, as they would also spill, and to widen the filter further; I will check this on gfx1100 soon. This doesn't have to be part of this PR.
Thank you, @IMbackK
Yes, reasonable people can disagree :) I respect your position, especially from a maintainer, that no hardware will be alienated. After all, that is the crux of this PR as well. For a later time: I think there is a structural opportunity in the implementation to support different GPU architectures with less maintenance burden, e.g., organizing different config files, or an adaptive approach like the one my prior comment was conceptualizing, using a "probe the host hardware" step.
Thank you, this is all very reasonable. AFAIK we're targeting 256 VGPRs for gfx1151 (ideally slightly below, to leave room for system registers). I attached the remarks from GGML_HIP_EXPORT_METRICS (which has been a useful utility in validating some additional changes I'm trying outside this PR as well!). There is still some VGPR spill (especially with IQ quantization). At least with mmq_x, lowering it further (even for just Q8) does not seem to yield any improvement in performance and has high potential for regressions.
Performance changes
In my testing the performance changes from this PR are very inconsistent across batch sizes and data types. It cannot be merged like this.
@JohannesGaessler thank you for the sweep! I understand your position that the variance is not desirable. Asking naively: is the sweep impacted by the issue described in this PR: #21282? Let me know if you have suggestions on how to reduce the variance. Thanks. I'd also appreciate your eyes on the sweep @tbocek did, showing material benefit on more modern and larger models than Llama 8B.
Perhaps corroborating your findings: this PR appears to be a benefit for MoE models, but a wildcard for dense models (e.g., Qwen 3.5 9B). If this change is categorically not beneficial for dense models, there might be a path forward in simply reverting to the old values in the mmq code path when the model is dense. I don't quite know why this isn't beneficial (I have not had a chance to look more closely). What's your opinion?
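The dense-fallback idea amounts to a one-flag dispatch on the host side. A hypothetical sketch; the struct, the `is_moe` predicate, and the "old" dense values are illustrative, not llama.cpp's actual code (only the MoE-side numbers come from this PR):

```cpp
// Tile configuration a kernel launch would be specialized on
struct tile_cfg {
    int mmq_x_max;
    int mmq_y;
    int nwarps;
};

// Pick the RDNA 3.5 tile config based on model type. is_moe could be
// derived from the loaded model's expert count (n_expert > 1).
constexpr tile_cfg rdna35_cfg(bool is_moe) {
    return is_moe ? tile_cfg{48, 64, 4}   // this PR's RDNA 3.5 values
                  : tile_cfg{64, 64, 8};  // placeholder "old" dense values
}
```

The cost of such a split is that both kernel variants must be compiled, so binary size and build time grow; that trade-off would need weighing against the dense-model regressions.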
👋 In all these benchmarks, main is
First, here are some other benchmark results with 3 MoE models at various context sizes: Gemma4 26B-A4B, Nemotron3-Super, and GLM4.7-Flash. They are looking great! Then, here are some results with Qwen 3.5 9B; I've tried to use roughly the same params you were using in the Llama 8B benchmarks you shared earlier, @JohannesGaessler, varying the quants and the ubatch size. I'm wondering why you're interested in the different ubatch sizes, if you have some pointers? HTH
Tested this on a Framework Desktop (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB RAM, Fedora 43, ROCm 7.2.1) with Qwen3.5-122B-A10B-REAP-20 Q6_K. I used Codex to build a patched ROCm container from the PR diff and compared it against the stock ROCm
So at least on gfx1151 + a large MoE model, this looks very real and very useful. |
I have added a toolbox with this PR and run the benchmark: https://kyuz0.github.io/amd-strix-halo-toolboxes/ The benefits seem to be mostly for short context; if you switch to the 30k-context tests, the results do not look great: sometimes better, sometimes worse, but not by much either way.
Applies the six-edit ggml-cuda/mmq.cuh change from upstream PR ggml-org#21344 (pedapudi/llama.cpp@gfx1151-opt) that gives RDNA 3.5 its own MMQ tile and warp sizing — mmq_x_max=48, mmq_y=64, nwarps=4 — instead of inheriting the discrete RDNA3 values tuned for 7900 XTX-class hardware. Hypothesis, expected numbers (from kyuz0's independent A/B logs), and bench plan in strix-halo/mmq-rdna3_5.md. Also includes the previously uncommitted docs (codex-insights, rocm-config) and updates to NOTES, README, uma-integrated reflecting the UMA-deprioritization decision, plus a .gitignore entry for useful-repos/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Follow up on issue #21284
1. MMQ:
   a. mmq_x_max=48, mmq_y=64, nwarps=4 for RDNA3_5, to balance VGPR usage and occupancy.
   b. Note: I took the opportunity for a minor refactor replacing nested ternary operators, to improve readability and reduce opportunity for errors (especially after I made a mistake while piling on the ternary operations).
2. mmvq: added RDNA3_5 to mmvq_parameter_table_id instead of falling back to RDNA2.
   a. Falling back to RDNA2 results in the nwarps calculation falling to 1.
1 is more important than 2, but 2 is still helpful on the mmvq paths. And it sets up for future per-quant tuning.
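To make the nested-ternary refactor concrete, here is a sketch of the shape such a selection could take — an explicit per-architecture switch instead of chained ternaries. This is not the actual mmq.cuh code; only the RDNA 3.5 numbers (mmq_x_max=48, mmq_y=64, nwarps=4) come from this PR, and the other entries are placeholders:

```cpp
// Hypothetical arch enum; llama.cpp's real code keys off compile-time
// __gfx*__ defines instead.
enum class amd_arch { rdna2, rdna3, rdna3_5 };

struct mmq_config {
    int mmq_x_max;
    int mmq_y;
    int nwarps;
};

// One switch per architecture is easier to audit and extend than a chain
// of nested ternaries: each case is a visible, self-describing row.
constexpr mmq_config pick_mmq_config(amd_arch arch) {
    switch (arch) {
        case amd_arch::rdna3_5: return {48, 64, 4};  // this PR's values
        case amd_arch::rdna3:   return {64, 64, 8};  // placeholder
        case amd_arch::rdna2:
        default:                return {64, 64, 8};  // placeholder
    }
}
```

Because the function is constexpr, the compiler still folds the selection away for a fixed architecture, so the readability gain costs nothing at runtime.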
Benchmarks
Built with cmake flags
Before (build 7c7d6ce / 8642)
After (build 955df3551 / 8643)
Speedup
Requirements