
gfx1151 nwarps, tile sizing to curb VGPR pressure #21344

Open
pedapudi wants to merge 3 commits into ggml-org:master from pedapudi:gfx1151-opt

Conversation


@pedapudi pedapudi commented Apr 3, 2026

Follow-up to issue #21284

  1. Tune MMQ tile sizes, warp counts, and MMVQ parameter tables for RDNA3_5 gfx1151
    a. MMQ: mmq_x_max=48, mmq_y=64, nwarps=4 for RDNA3_5 to balance VGPR usage and occupancy
    b. Note: I took the opportunity to do a minor refactor, replacing nested ternary operators to improve readability and reduce the chance of errors (I made a mistake myself while piling on the ternaries).
  2. RDNA3_5 gets its own mmvq_parameter_table_id instead of falling back to RDNA2
    a. Results in the nwarps calculation falling to 1.

Change 1 is more important than change 2, but 2 still helps on the mmvq paths and sets up future per-quant tuning.
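For illustration, the ternary-to-switch cleanup described above can be sketched like this. This is a minimal sketch with hypothetical names and fallback values, not the actual ggml-cuda code, which uses its own macros and per-architecture tables:

```cpp
// Illustrative GPU-architecture tags (hypothetical; the real code has its own).
enum class gpu_arch { RDNA2, RDNA3, RDNA3_5, OTHER };

struct mmq_config {
    int mmq_x_max; // maximum tile width
    int mmq_y;     // tile height
    int nwarps;    // warps per thread block
};

// One case per architecture instead of nested ternaries: each entry is
// readable on its own, and new targets can be added without touching others.
constexpr mmq_config get_mmq_config(gpu_arch arch) {
    switch (arch) {
        case gpu_arch::RDNA3_5:
            // Values from this PR: smaller tiles and 4 warps to curb VGPR
            // pressure on gfx1151 while keeping occupancy reasonable.
            return {48, 64, 4};
        default:
            // Illustrative fallback only; the real table is per-architecture.
            return {64, 128, 8};
    }
}
```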

Benchmarks

Built with the following CMake flags:

cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" -DGGML_HIP_ROCWMMA_FATTN=OFF -DCMAKE_BUILD_TYPE=Release -DLLAMA_OPENSSL=ON -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" -DHIP_PLATFORM=amd -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF

Before (build 7c7d6ce / 8642)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        181.71 ± 4.97 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        267.51 ± 3.68 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        369.82 ± 1.01 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        415.54 ± 3.10 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        496.87 ± 7.36 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        474.49 ± 1.84 |

build: 7c7d6ce5c (8642)

After (build 955df3551 / 8643)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        314.45 ± 5.96 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        411.83 ± 3.54 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        491.57 ± 1.53 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        487.68 ± 3.70 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        544.94 ± 5.87 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        509.05 ± 2.25 |

build: 955df3551 (8643)

Speedup

| Test   | Before (t/s)  | After (t/s)   | Change |
| ------ | ------------: | ------------: | -----: |
| pp128  | 181.71 ± 4.97 | 314.45 ± 5.96 | +73.0% |
| pp256  | 267.51 ± 3.68 | 411.83 ± 3.54 | +53.9% |
| pp512  | 369.82 ± 1.01 | 491.57 ± 1.53 | +32.9% |
| pp1024 | 415.54 ± 3.10 | 487.68 ± 3.70 | +17.4% |
| pp2048 | 496.87 ± 7.36 | 544.94 ± 5.87 |  +9.7% |
| pp4096 | 474.49 ± 1.84 | 509.05 ± 2.25 |  +7.3% |
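The change column is just the relative delta between the mean t/s values; a tiny helper (illustrative, not part of the PR) reproduces the numbers:

```cpp
// Percent change of `after` relative to `before` (both in tokens/sec).
// e.g. pct_change(181.71, 314.45) is roughly +73.0, matching the pp128 row.
double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```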

Requirements

  • I have read and agree with the contributing guidelines: Yes, of course.
  • AI usage disclosure: Yes, of course. This is a straightforward change to implement, but the PR description is formatted with LLM assistance.

@pedapudi pedapudi requested a review from a team as a code owner April 3, 2026 00:29
Author

pedapudi commented Apr 3, 2026

cc @am17an @IMbackK

Thank you both for encouraging me to send this change. These are just the minimal changes needed to curb register pressure.

@github-actions bot added labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Apr 3, 2026
@IIIIIllllIIIIIlllll

Strangely, there's been no improvement on my machine.
My operating environment:

--- OS Info ---
Distributor ID:	Ubuntu
Description:	Ubuntu 25.10
Release:	25.10
Codename:	questing
Linux MarkPC 6.18.20-061820-generic #202603251312 SMP PREEMPT_DYNAMIC Wed Mar 25 17:11:38 UTC 2026 x86_64 GNU/Linux


--- amd-smi Info ---
+------------------------------------------------------------------------------+
| AMD-SMI 26.2.2+e1a6bc5663    amdgpu version: 6.18.20-061820 ROCm version: 7.2.1    |
| VBIOS version: 00107962                                                      |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c6:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              497/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0      62645  llama-server          33.1 GB  410.2 KB    33.9 GB  N/A     |
+------------------------------------------------------------------------------+

Is there any other information required?

Author

pedapudi commented Apr 3, 2026

@IIIIIllllIIIIIlllll thanks for looking. I'm also on Ubuntu 25.10 using ROCm 7.2.1. I'm surprised that you aren't seeing any improvement.

I don't know how you applied the patch, built llama.cpp, or how you tested. I'll update the PR with my cmake flags.

@IIIIIllllIIIIIlllll

Here is my command:

/home/mark/App/llama.cpp/llamacpp/llama.cpp-gfx1151/build/bin/llama-server -m /home/mark/Models/Q4/Qwen3.5-122B-A10B-Q4_K_M/Qwen3.5-122B-A10B-PRISM-LITE-Dynamic.gguf \
--port 8090 \
--mmproj /home/mark/Models/Q4/Qwen3.5-122B-A10B-Q4_K_M/mmproj-F32.gguf \
--main-gpu 0 \
--ctx-size 262144 \
--flash-attn on \
--no-mmap \
--temp 1 \
--top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.02 --frequency-penalty 0.0 \
--batch-size 4096 --ubatch-size 4096 \
--parallel 2 \
--cache-ram -1 \
--cache-type-k f16 --cache-type-v f16 \
--threads -1 \
--seed -1 \
-dio \
--no-webui \
--chat-template-file /home/mark/App/llama.cpp/cache/Qwen3.5-122B-A10B-Q4_K_M.jinja \
--metrics \
--slot-save-path /home/mark/App/llama.cpp/cache \
--alias Qwen3.5-122B-A10B-Q4_K_M \
--timeout 36000 --host 0.0.0.0

The model: https://huggingface.co/Ex0bit/Qwen3.5-122B-A10B-PRISM-LITE-GGUF

My testing method: I started the model with llama-server and submitted an 8192-token request (prepared precisely with the model's tokenizer) to test its performance.



tbocek commented Apr 3, 2026

I tested this PR with more models (Arch Linux 6.19, ROCm 7.2.1, Vulkan 1.4.341): Mistral-Small-4-119B-2603-UD-Q4_K_XL, Qwen3.5-122B-A10B-UD-Q4_K_XL, and NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL. (I added Vulkan since I was also curious about the ROCm/Vulkan comparison.) For ROCm, this PR performs consistently better across the three tested models (Mistral, Qwen3.5, NVIDIA-Nemotron), with the exception of -1% for Qwen at pp4096, which may just be noise.

In summary (ROCm before/after):

Mistral-Small-4-119B (ROCm)

| Test   | Before (t/s) | After (t/s) |    Diff |
| ------ | -----------: | ----------: | ------: |
| pp128  |       151.30 |      303.14 | +100.4% |
| pp256  |       244.01 |      428.78 |  +75.7% |
| pp512  |       366.67 |      567.00 |  +54.6% |
| pp1024 |       441.42 |      562.80 |  +27.5% |
| pp2048 |       418.10 |      442.79 |   +5.9% |
| pp4096 |       288.80 |      307.12 |   +6.3% |

Qwen3.5-122B-A10B (ROCm)

| Test   | Before (t/s) | After (t/s) |   Diff |
| ------ | -----------: | ----------: | -----: |
| pp128  |       176.54 |      267.37 | +51.4% |
| pp256  |       255.82 |      386.09 | +50.9% |
| pp512  |       319.26 |      419.97 | +31.5% |
| pp1024 |       358.75 |      393.37 |  +9.6% |
| pp2048 |       359.18 |      397.16 | +10.6% |
| pp4096 |       300.95 |      297.92 |  -1.0% |

NVIDIA-Nemotron-3-Super-120B-A12B (ROCm)

| Test   | Before (t/s) | After (t/s) |   Diff |
| ------ | -----------: | ----------: | -----: |
| pp128  |       124.14 |      212.07 | +70.8% |
| pp256  |       191.08 |      277.29 | +45.1% |
| pp512  |       271.97 |      347.18 | +27.7% |
| pp1024 |       345.20 |      383.91 | +11.2% |
| pp2048 |       366.87 |      377.38 |  +2.9% |
| pp4096 |       350.06 |      352.29 |  +0.6% |

Master: (Before)

./build/bin/llama-bench --model /mnt/models/Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        151.30 ± 1.96 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        244.01 ± 6.83 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        366.67 ± 3.17 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        441.42 ± 7.76 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        418.10 ± 0.91 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        288.80 ± 1.49 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        161.56 ± 3.66 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        251.50 ± 4.28 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        376.11 ± 1.03 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        533.79 ± 1.82 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        657.96 ± 0.71 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        605.62 ± 6.37 |

build: 0c58ba336 (8646)
./build/bin/llama-bench --model /mnt/models/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        176.54 ± 4.67 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        255.82 ± 3.65 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        319.26 ± 1.92 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        358.75 ± 2.75 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        359.18 ± 3.07 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        300.95 ± 1.03 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        174.99 ± 1.82 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        230.27 ± 2.11 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        293.34 ± 1.95 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        337.85 ± 2.68 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        368.81 ± 2.46 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        378.01 ± 1.54 |

build: 0c58ba336 (8646)
./build/bin/llama-bench --model /mnt/models/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        124.14 ± 2.06 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        191.08 ± 1.84 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        271.97 ± 1.00 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        345.20 ± 0.76 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        366.87 ± 1.80 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        350.06 ± 0.98 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        106.06 ± 1.61 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        147.87 ± 0.90 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        208.72 ± 1.37 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        263.62 ± 0.78 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        278.56 ± 0.22 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        279.49 ± 0.73 |

build: 0c58ba336 (8646)

This PR: (After)

./build/bin/llama-bench --model /mnt/models/Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        303.14 ± 3.49 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |       428.78 ± 10.34 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        567.00 ± 5.69 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |       562.80 ± 11.20 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        442.79 ± 4.12 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        307.12 ± 1.13 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        163.24 ± 3.95 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        254.25 ± 3.54 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        379.97 ± 1.61 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        537.16 ± 1.76 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        661.61 ± 1.03 |
| mistral4 ?B Q4_K - Medium      |  68.72 GiB |   118.97 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        610.18 ± 4.81 |

build: e0e3c3fc6 (8643)
./build/bin/llama-bench --model /mnt/models/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |       267.37 ± 79.41 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        386.09 ± 3.41 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        419.97 ± 2.61 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        393.37 ± 1.21 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        397.16 ± 4.76 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        297.92 ± 1.44 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        176.12 ± 2.25 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        230.28 ± 2.05 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        299.17 ± 2.00 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        345.02 ± 3.12 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        379.44 ± 2.18 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        382.89 ± 1.47 |

build: e0e3c3fc6 (8643)
./build/bin/llama-bench --model /mnt/models/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        212.07 ± 1.94 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        277.29 ± 1.95 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        347.18 ± 0.46 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        383.91 ± 0.89 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        377.38 ± 0.48 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        352.29 ± 1.36 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        104.82 ± 1.55 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        128.66 ± 0.49 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        168.68 ± 1.01 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        214.23 ± 0.68 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        235.00 ± 0.97 |
| nemotron_h_moe 120B.A12B Q4_K - Medium |  78.02 GiB |   120.67 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        283.27 ± 0.28 |

build: e0e3c3fc6 (8643)

And just for curiosity, the comparison Vulkan/ROCm (after this PR):

ROCm vs Vulkan highlights/lowlights (after PR)

| Model                | Test   | ROCm (t/s) | Vulkan (t/s) | Delta          |
| -------------------- | ------ | ---------: | -----------: | -------------- |
| Mistral-Small-4-119B | pp128  |     303.14 |       163.24 | ROCm +85.7%    |
| Mistral-Small-4-119B | pp4096 |     307.12 |       610.18 | Vulkan +98.7%  |
| Qwen3.5-122B-A10B    | pp256  |     386.09 |       230.28 | ROCm +67.6%    |
| Qwen3.5-122B-A10B    | pp4096 |     297.92 |       382.89 | Vulkan +28.5%  |
| Nemotron-120B-A12B   | pp256  |     277.29 |       128.66 | ROCm +115.5%   |
| Nemotron-120B-A12B   | pp4096 |     352.29 |       283.27 | ROCm +24.4%    |

Author

pedapudi commented Apr 3, 2026

@tbocek Thank you for doing the cross-model testing! I'm glad the PR shows uplift across models. I have some questions about your numbers, which look lower than what I see with llama-bench for the Qwen3.5 122B model. I re-ran llama-bench with Unsloth's UD Q4_K_XL to match your run, and I see:

BEFORE

$ ./bin/llama-bench --model /home/sunil/models/unsloth/qwen35-122b-ud-q4_k_xl/unsloth_Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL_Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        181.31 ± 4.92 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        269.33 ± 3.56 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        355.28 ± 1.51 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        418.42 ± 1.23 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        429.09 ± 5.65 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        405.33 ± 2.56 |

build: 7c7d6ce5c (8642)

AFTER

./bin/llama-bench --model /home/sunil/models/unsloth/qwen35-122b-ud-q4_k_xl/unsloth_Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL_Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        314.28 ± 5.58 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        411.32 ± 3.45 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        488.98 ± 2.14 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        442.81 ± 1.63 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        553.57 ± 5.64 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        494.13 ± 3.09 |

build: e0e3c3fc6 (8643)

(Note the differing build IDs; they are not material here.)

Both at baseline and with this PR, the prefill t/s is higher here. It's possibly worth normalizing for whatever differences exist between the setups, but your findings seem to corroborate some of the relative improvement from the PR. The falloff you're noticing at higher lengths seems to have a different slope, though.

@tbocek

tbocek commented Apr 3, 2026

@pedapudi I just realized I had different build flags: I was using ROCWMMA_FATTN=ON. Rerunning Qwen3.5-122B-A10B-UD-Q4_K_XL with ROCWMMA_FATTN=OFF, the numbers are now closer to yours.

Summary:

| Test | Before PR, WMMA OFF | After PR, WMMA OFF | Δ% |
| ---- | ------------------: | -----------------: | --: |
| pp128 | 180.48 | 311.26 | +72.5% |
| pp256 | 264.46 | 400.74 | +51.5% |
| pp512 | 343.20 | 453.72 | +32.2% |
| pp1024 | 395.79 | 453.36 | +14.5% |
| pp2048 | 449.01 | 476.48 | +6.1% |
| pp4096 | 455.48 | 530.14 | +16.4% |
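The Δ% column is just the relative throughput change, computed as (after / before - 1) * 100. A quick sketch:

```python
def delta_pct(before_tps: float, after_tps: float) -> float:
    """Relative t/s change as a percentage."""
    return (after_tps / before_tps - 1) * 100

# pp128 row from the summary above:
print(f"{delta_pct(180.48, 311.26):+.1f}%")  # +72.5%
```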

before PR:

./build/bin/llama-bench --model /mnt/models/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        180.48 ± 4.79 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        264.46 ± 3.32 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        343.20 ± 1.81 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        395.79 ± 2.76 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        449.01 ± 8.23 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        455.48 ± 3.18 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        176.31 ± 2.27 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        234.34 ± 2.29 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        302.47 ± 2.65 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        347.72 ± 0.88 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        381.04 ± 2.65 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        380.39 ± 1.71 |

build: 7c7d6ce5c (8642)

and after PR:

./build/bin/llama-bench --model /mnt/models/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5 --device ROCM0,Vulkan0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | dev          | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------ | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp128 |        311.26 ± 6.10 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp256 |        400.74 ± 3.65 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |           pp512 |        453.72 ± 2.08 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp1024 |        453.36 ± 3.34 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp2048 |        476.48 ± 4.58 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | ROCm0        |    0 |   1 |          pp4096 |        530.14 ± 5.57 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp128 |        173.87 ± 2.08 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp256 |        232.94 ± 1.94 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |           pp512 |        293.93 ± 2.89 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp1024 |        334.94 ± 0.97 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp2048 |        342.86 ± 0.83 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | ROCm,Vulkan |  99 |     2048 |  1 | Vulkan0      |    0 |   1 |          pp4096 |        373.97 ± 1.80 |

build: e0e3c3fc6 (8643)

@pedapudi
Author

pedapudi commented Apr 3, 2026

@tbocek So glad that you were able to identify the discrepancy! Thanks again for testing.

@am17an
Contributor

am17an commented Apr 5, 2026

@IMbackK @JohannesGaessler does this look okay? I don't have the hardware to test but a lot of people seem to have confirmed it looks good.

@IMbackK
Collaborator

IMbackK commented Apr 5, 2026

I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA3.5 using gfx1151 as a model, but other RDNA3.5 GPUs have different-size register files.

@pedapudi
Author

pedapudi commented Apr 5, 2026

Thank you, @am17an and @IMbackK.

I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA3.5 using gfx1151 as a model, but other RDNA3.5 GPUs have different-size register files.

You may be right, and it's a reasonable point to raise here, but I would like to offer a few things for you to consider before holding back on incremental improvements (within the current structure of the code, at least):

  • The proposed RDNA 3.5 values are better for gfx1150 than HEAD :) They are verified to be better on gfx1151, and this should be a net improvement for both gfx1150 and gfx1151.
  • The current solution leaves a fair amount of VGPR headroom in the mul mat kernels (i.e., it's not close to the limits for gfx1151). Even though gfx1150 has fewer (and smaller) registers, this change has enough breathing room that it's about as good as it gets while staying within "multiple of 16" tile sizes. The alternative (shrinking the tiles further) would sacrifice performance. (PS. The "headroom" here is undesirable from my perspective because it's the result of me failing to increase occupancy. The silver lining is that gfx1150 likely isn't going to suffer.)
  • gfx1152 does not appear to change the equation here, but if someone has more context, please correct me.
  • Opinion: I'd guess that gfx1151 is more important to optimize than gfx1150 because it is the "halo" product of its line. (Someone from AMD is welcome to weigh in more authoritatively, of course :))
    • Relatedly, llama.cpp likely has many more Strix Halo users than Strix Point users, so optimizing for that user base is more impactful.

As I mentioned in my original issue, there is no static sizing that's going to work across architectures.

Sidebar: One idea I'd love for llama.cpp is to have a llama-tune tool that can probe a host and self-configure things like tile sizes, but that is obviously a separate discussion.
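A minimal sketch of what such a probe loop could do (all names here are hypothetical; a real tool would run actual kernels on the host and persist the winning configuration):

```python
import time

def autotune(candidates, run_workload):
    """Return the candidate config with the best measured throughput.
    `run_workload` is a hypothetical callback that executes a fixed
    token workload with the given config and returns the token count."""
    best, best_tps = None, 0.0
    for cfg in candidates:
        t0 = time.perf_counter()
        tokens = run_workload(cfg)
        tps = tokens / (time.perf_counter() - t0)
        if tps > best_tps:
            best, best_tps = cfg, tps
    return best

# Toy stand-in workload that is fastest when mmq_x == 48.
def fake_workload(cfg):
    time.sleep(0.002 if cfg["mmq_x"] != 48 else 0.001)
    return 1000

print(autotune([{"mmq_x": 32}, {"mmq_x": 48}, {"mmq_x": 64}], fake_workload))
```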

Comment thread on ggml/src/ggml-cuda/mmq.cuh (Outdated)

Contributor

What are your intentions with the changes to mmvq.cu? They look wrong.

Author

I wanted to update the structure to support RDNA 3.5 more natively rather than reuse preexisting configurations. I've reverted the changes for now so they don't distract from the other changes. Thanks!

@JohannesGaessler
Contributor

I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA3.5 using gfx1151 as a model, but other RDNA3.5 GPUs have different-size register files.

Looking at this table, the correct way to do it, in my opinion, is to use the values for discrete RDNA3 GPUs for those APUs that only have 512 kiB of registers.

@Mushoz

Mushoz commented Apr 5, 2026

I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA3.5 using gfx1151 as a model, but other RDNA3.5 GPUs have different-size register files.

Well, right now everything is tuned for RDNA2, which is even worse. Incremental improvements always make sense, especially if they focus on the more popular cases (Strix Halo) over the less popular ones (Strix Point).

@IMbackK
Collaborator

IMbackK commented Apr 5, 2026

I have yet to find the time for a proper review, but on a surface level I'm not convinced by the current state of this. For starters, this is trying to solve register pressure by tuning the values for RDNA3.5 using gfx1151 as a model, but other RDNA3.5 GPUs have different-size register files.

Looking at this table, the correct way to do it, in my opinion, is to use the values for discrete RDNA3 GPUs for those APUs that only have 512 kiB of registers.

You mean RDNA2 dGPUs, presumably; the large-register-file RDNA3 dGPUs (gfx1100 and gfx1101) also have >512 kiB of registers.

@IMbackK
Collaborator

IMbackK commented Apr 5, 2026

Regarding register pressure, we really only have three cases:

  • gfx1100, gfx1101, gfx1151, and gfx12, with 768 32-wide vector registers
  • gfx10 and gfx1102+, with 512 32-wide registers
  • CDNA/GCN, with 256 64-wide registers

Btw, the table is wrong: the unit is not kiB, it's the number of vector registers. A single register is 1024 bits for RDNA and 2048 bits for GCN/CDNA.
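A quick sanity check of those three cases (a sketch; a 32-wide RDNA VGPR is 32 lanes x 32 bits = 1024 bits, a 64-wide GCN/CDNA VGPR is 2048 bits):

```python
# Per-SIMD vector register file sizes implied by the cases above.
def regfile_kib(num_regs: int, reg_bits: int) -> float:
    """Total vector register file per SIMD in KiB."""
    return num_regs * reg_bits / 8 / 1024

cases = {
    "gfx1100/gfx1101/gfx1151/gfx12": regfile_kib(768, 1024),  # 96 KiB
    "gfx10 and gfx1102+":            regfile_kib(512, 1024),  # 64 KiB
    "CDNA/GCN":                      regfile_kib(256, 2048),  # 64 KiB
}

for name, kib in cases.items():
    print(f"{name}: {kib:.0f} KiB per SIMD")
```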

pedapudi and others added 2 commits April 6, 2026 18:47
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@pedapudi
Author

pedapudi commented Apr 7, 2026

Thanks for the discussion so far.

I'm a little confused about where we are. Is the suggestion to abandon this change? If so, would someone else pick up this performance opportunity? If not, what are the next steps?

Thank you!

(PS. That table indeed looks misleading :))

@IMbackK
Collaborator

IMbackK commented Apr 7, 2026

Incremental improvements always make sense especially if they focus on the more popular cases (Strix Halo) over the less popular ones (Strix Point)

I strongly disagree with this notion

I'm a little confused as to where we are. Is the suggestion to abandon this change? If so, would someone else pick up realizing the performance opportunity? If not, what are the next steps?

If the resulting kernels are indeed below 512 registers (see GGML_HIP_EXPORT_METRICS), I think this is conceptually fine and likely to be positive for the other gfx115x GPUs. Ideally someone would benchmark this, however.

Further, it makes sense to try these values on the other RDNA devices, as they would also spill, and to widen the filter further; I will check this on gfx1100 soon. This doesn't have to be part of this PR.

@pedapudi
Author

pedapudi commented Apr 8, 2026

Thank you, @IMbackK

Incremental improvements always make sense especially if they focus on the more popular cases (Strix Halo) over the less popular ones (Strix Point)

I strongly disagree with this notion

Yes, reasonable people can disagree :) I respect your position, especially coming from a maintainer, that no hardware should be left behind. After all, that is the crux of this PR as well.

For a later time: I think there is a structural opportunity in the implementation to support different GPU architectures with less maintenance burden, e.g., organizing per-architecture config files, or an adaptive approach like the "probe the host hardware" step my earlier comment sketched out.

I'm a little confused as to where we are. Is the suggestion to abandon this change? If so, would someone else pick up realizing the performance opportunity? If not, what are the next steps?

If the resulting kernels are indeed below 512 registers (see GGML_HIP_EXPORT_METRICS), I think this is conceptually fine and likely to be positive for the other gfx115x GPUs. Ideally someone would benchmark this, however.

Further, it makes sense to try these values on the other RDNA devices, as they would also spill, and to widen the filter further; I will check this on gfx1100 soon. This doesn't have to be part of this PR.

Thank you, this is all very reasonable. AFAIK we're targeting 256 VGPRs for gfx1151 (ideally slightly below, to leave room for system registers). I attached the remarks from GGML_HIP_EXPORT_METRICS (which has been a useful utility for validating some additional changes I'm trying outside this PR as well!). There is still some VGPR spill (especially with IQ quantizations). At least with mmq_x, lowering it further (even just for Q8) does not seem to yield any improvement and has high potential for regressions.
remarks.txt
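The 256-VGPR target follows from simple occupancy arithmetic (a rough sketch that ignores LDS limits and register allocation granularity):

```python
def waves_per_simd(vgprs_per_wave: int, regfile_vgprs: int = 768) -> int:
    """Rough upper bound on concurrent waves per SIMD, limited only by VGPRs."""
    return regfile_vgprs // vgprs_per_wave

# On gfx1151 (768 VGPRs per SIMD), a kernel using 256 VGPRs per wave
# can keep 3 waves in flight; any spill past 256 drops that to 2.
print(waves_per_simd(256), waves_per_simd(257))
```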

@JohannesGaessler
Contributor

Performance changes
GPU Model Microbatch size Test t/s b8642 t/s 7957de9 Speedup
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 16 pp2048 438.68 466.96 1.06
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 32 pp2048 454.74 457.73 1.01
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 64 pp2048 313.94 531.23 1.69
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 128 pp2048 845.24 491.28 0.58
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 256 pp2048 881.14 525.70 0.60
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 512 pp2048 910.45 568.66 0.62
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 1024 pp2048 963.00 562.52 0.58
Radeon 8060S Graphics llama 8B IQ1_S - 1.5625 bpw 2048 pp2048 956.29 560.47 0.59
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 16 pp2048 311.78 331.05 1.06
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 32 pp2048 408.92 408.96 1.00
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 64 pp2048 308.36 460.04 1.49
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 128 pp2048 730.49 393.17 0.54
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 256 pp2048 754.84 422.19 0.56
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 512 pp2048 758.00 456.59 0.60
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 1024 pp2048 802.82 455.28 0.57
Radeon 8060S Graphics llama 8B IQ2_S - 2.5 bpw 2048 pp2048 805.24 453.23 0.56
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 16 pp2048 310.57 326.34 1.05
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 32 pp2048 407.04 401.34 0.99
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 64 pp2048 301.54 447.59 1.48
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 128 pp2048 818.72 398.99 0.49
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 256 pp2048 853.12 429.05 0.50
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 512 pp2048 871.31 463.56 0.53
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 1024 pp2048 916.03 464.17 0.51
Radeon 8060S Graphics llama 8B IQ2_XS - 2.3125 bpw 2048 pp2048 926.06 463.68 0.50
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 16 pp2048 358.34 376.63 1.05
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 32 pp2048 386.58 397.47 1.03
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 64 pp2048 649.67 462.13 0.71
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 128 pp2048 412.19 374.11 0.91
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 256 pp2048 427.58 408.50 0.96
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 512 pp2048 441.95 445.16 1.01
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 1024 pp2048 443.88 442.61 1.00
Radeon 8060S Graphics llama 8B IQ2_XXS - 2.0625 bpw 2048 pp2048 442.43 448.44 1.01
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 16 pp2048 319.54 335.23 1.05
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 32 pp2048 451.20 481.56 1.07
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 64 pp2048 594.72 576.69 0.97
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 128 pp2048 393.40 403.06 1.02
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 256 pp2048 400.63 427.06 1.07
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 512 pp2048 411.94 466.67 1.13
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 1024 pp2048 417.92 467.35 1.12
Radeon 8060S Graphics llama 8B IQ3_S - 3.4375 bpw 2048 pp2048 420.51 469.87 1.12
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 16 pp2048 327.72 346.60 1.06
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 32 pp2048 456.79 487.36 1.07
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 64 pp2048 599.42 581.19 0.97
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 128 pp2048 413.12 425.77 1.03
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 256 pp2048 430.49 452.31 1.05
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 512 pp2048 432.88 492.82 1.14
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 1024 pp2048 451.68 498.99 1.10
Radeon 8060S Graphics llama 8B IQ3_S mix - 3.66 bpw 2048 pp2048 449.59 498.91 1.11
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 16 pp2048 329.19 349.33 1.06
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 32 pp2048 477.12 497.45 1.04
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 64 pp2048 612.34 593.17 0.97
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 128 pp2048 395.13 413.80 1.05
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 256 pp2048 408.93 439.60 1.08
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 512 pp2048 411.48 480.91 1.17
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 1024 pp2048 423.19 479.21 1.13
Radeon 8060S Graphics llama 8B IQ3_XS - 3.3 bpw 2048 pp2048 416.71 485.92 1.17
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 16 pp2048 335.32 357.09 1.06
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 32 pp2048 478.87 487.53 1.02
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 64 pp2048 539.94 583.22 1.08
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 128 pp2048 423.11 423.94 1.00
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 256 pp2048 439.06 449.36 1.02
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 512 pp2048 439.70 491.91 1.12
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 1024 pp2048 445.50 490.72 1.10
Radeon 8060S Graphics llama 8B IQ3_XXS - 3.0625 bpw 2048 pp2048 445.95 491.52 1.10
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 16 pp2048 442.84 464.27 1.05
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 32 pp2048 491.15 504.31 1.03
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 64 pp2048 557.37 605.14 1.09
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 128 pp2048 369.53 436.72 1.18
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 256 pp2048 385.66 457.67 1.19
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 512 pp2048 394.82 499.66 1.27
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 1024 pp2048 403.76 499.53 1.24
Radeon 8060S Graphics llama 8B IQ4_NL - 4.5 bpw 2048 pp2048 396.45 505.99 1.28
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 16 pp2048 468.16 481.04 1.03
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 32 pp2048 503.95 508.53 1.01
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 64 pp2048 555.81 615.74 1.11
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 128 pp2048 374.90 440.64 1.18
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 256 pp2048 390.18 461.20 1.18
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 512 pp2048 399.19 502.34 1.26
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 1024 pp2048 405.21 503.99 1.24
Radeon 8060S Graphics llama 8B IQ4_XS - 4.25 bpw 2048 pp2048 404.14 510.02 1.26
Radeon 8060S Graphics llama 8B Q2_K_M 16 pp2048 252.08 274.40 1.09
Radeon 8060S Graphics llama 8B Q2_K_M 32 pp2048 378.55 368.09 0.97
Radeon 8060S Graphics llama 8B Q2_K_M 64 pp2048 383.31 426.62 1.11
Radeon 8060S Graphics llama 8B Q2_K_M 128 pp2048 649.61 388.86 0.60
Radeon 8060S Graphics llama 8B Q2_K_M 256 pp2048 695.94 516.38 0.74
Radeon 8060S Graphics llama 8B Q2_K_M 512 pp2048 877.72 633.61 0.72
Radeon 8060S Graphics llama 8B Q2_K_M 1024 pp2048 1012.61 697.89 0.69
Radeon 8060S Graphics llama 8B Q2_K_M 2048 pp2048 1079.09 719.41 0.67
Radeon 8060S Graphics llama 8B Q3_K_S 16 pp2048 303.27 358.38 1.18
Radeon 8060S Graphics llama 8B Q3_K_S 32 pp2048 400.04 401.91 1.00
Radeon 8060S Graphics llama 8B Q3_K_S 64 pp2048 282.53 461.90 1.63
Radeon 8060S Graphics llama 8B Q3_K_S 128 pp2048 944.51 393.53 0.42
Radeon 8060S Graphics llama 8B Q3_K_S 256 pp2048 990.42 426.78 0.43
Radeon 8060S Graphics llama 8B Q3_K_S 512 pp2048 1025.99 473.82 0.46
Radeon 8060S Graphics llama 8B Q3_K_S 1024 pp2048 1058.69 475.26 0.45
Radeon 8060S Graphics llama 8B Q3_K_S 2048 pp2048 1061.71 478.40 0.45
Radeon 8060S Graphics llama 8B Q4_0 16 pp2048 393.96 427.41 1.08
Radeon 8060S Graphics llama 8B Q4_0 32 pp2048 381.99 412.99 1.08
Radeon 8060S Graphics llama 8B Q4_0 64 pp2048 543.81 475.80 0.87
Radeon 8060S Graphics llama 8B Q4_0 128 pp2048 361.92 260.93 0.72
Radeon 8060S Graphics llama 8B Q4_0 256 pp2048 381.71 271.42 0.71
Radeon 8060S Graphics llama 8B Q4_0 512 pp2048 388.27 295.29 0.76
Radeon 8060S Graphics llama 8B Q4_0 1024 pp2048 402.68 294.29 0.73
Radeon 8060S Graphics llama 8B Q4_0 2048 pp2048 403.31 297.37 0.74
Radeon 8060S Graphics llama 8B Q4_1 16 pp2048 402.56 419.29 1.04
Radeon 8060S Graphics llama 8B Q4_1 32 pp2048 380.07 398.34 1.05
Radeon 8060S Graphics llama 8B Q4_1 64 pp2048 244.36 462.19 1.89
Radeon 8060S Graphics llama 8B Q4_1 128 pp2048 914.90 380.42 0.42
Radeon 8060S Graphics llama 8B Q4_1 256 pp2048 969.23 404.01 0.42
Radeon 8060S Graphics llama 8B Q4_1 512 pp2048 978.88 443.29 0.45
Radeon 8060S Graphics llama 8B Q4_1 1024 pp2048 1043.85 448.60 0.43
Radeon 8060S Graphics llama 8B Q4_1 2048 pp2048 1052.86 451.97 0.43
Radeon 8060S Graphics llama 8B Q4_K_S 16 pp2048 413.22 437.53 1.06
Radeon 8060S Graphics llama 8B Q4_K_S 32 pp2048 500.08 502.41 1.00
Radeon 8060S Graphics llama 8B Q4_K_S 64 pp2048 624.42 605.12 0.97
Radeon 8060S Graphics llama 8B Q4_K_S 128 pp2048 957.36 768.80 0.80
Radeon 8060S Graphics llama 8B Q4_K_S 256 pp2048 1011.68 822.97 0.81
Radeon 8060S Graphics llama 8B Q4_K_S 512 pp2048 1026.75 897.58 0.87
Radeon 8060S Graphics llama 8B Q4_K_S 1024 pp2048 1080.94 908.09 0.84
Radeon 8060S Graphics llama 8B Q4_K_S 2048 pp2048 1090.02 916.21 0.84
Radeon 8060S Graphics llama 8B Q5_0 16 pp2048 351.31 367.75 1.05
Radeon 8060S Graphics llama 8B Q5_0 32 pp2048 313.10 322.95 1.03
Radeon 8060S Graphics llama 8B Q5_0 64 pp2048 505.68 379.42 0.75
Radeon 8060S Graphics llama 8B Q5_0 128 pp2048 325.03 305.09 0.94
Radeon 8060S Graphics llama 8B Q5_0 256 pp2048 343.77 322.88 0.94
Radeon 8060S Graphics llama 8B Q5_0 512 pp2048 357.24 350.18 0.98
Radeon 8060S Graphics llama 8B Q5_0 1024 pp2048 362.72 353.22 0.97
Radeon 8060S Graphics llama 8B Q5_0 2048 pp2048 360.77 351.87 0.98
Radeon 8060S Graphics llama 8B Q5_1 16 pp2048 291.16 301.24 1.03
Radeon 8060S Graphics llama 8B Q5_1 32 pp2048 255.17 253.32 0.99
Radeon 8060S Graphics llama 8B Q5_1 64 pp2048 211.09 303.02 1.44
Radeon 8060S Graphics llama 8B Q5_1 128 pp2048 878.51 266.62 0.30
Radeon 8060S Graphics llama 8B Q5_1 256 pp2048 941.06 286.39 0.30
Radeon 8060S Graphics llama 8B Q5_1 512 pp2048 977.66 314.82 0.32
Radeon 8060S Graphics llama 8B Q5_1 1024 pp2048 1023.66 318.03 0.31
Radeon 8060S Graphics llama 8B Q5_1 2048 pp2048 1036.43 320.65 0.31
Radeon 8060S Graphics llama 8B Q5_K_S 16 pp2048 413.63 420.21 1.02
Radeon 8060S Graphics llama 8B Q5_K_S 32 pp2048 361.24 351.14 0.97
Radeon 8060S Graphics llama 8B Q5_K_S 64 pp2048 234.88 416.96 1.78
Radeon 8060S Graphics llama 8B Q5_K_S 128 pp2048 928.36 388.64 0.42
Radeon 8060S Graphics llama 8B Q5_K_S 256 pp2048 976.75 405.59 0.42
Radeon 8060S Graphics llama 8B Q5_K_S 512 pp2048 994.83 439.69 0.44
Radeon 8060S Graphics llama 8B Q5_K_S 1024 pp2048 1040.16 439.73 0.42
Radeon 8060S Graphics llama 8B Q5_K_S 2048 pp2048 1060.45 445.01 0.42
Radeon 8060S Graphics llama 8B Q6_K 16 pp2048 329.35 336.87 1.02
Radeon 8060S Graphics llama 8B Q6_K 32 pp2048 124.09 120.63 0.97
Radeon 8060S Graphics llama 8B Q6_K 64 pp2048 615.70 136.98 0.22
Radeon 8060S Graphics llama 8B Q6_K 128 pp2048 707.78 623.12 0.88
Radeon 8060S Graphics llama 8B Q6_K 256 pp2048 737.48 641.76 0.87
Radeon 8060S Graphics llama 8B Q6_K 512 pp2048 773.75 775.26 1.00
Radeon 8060S Graphics llama 8B Q6_K 1024 pp2048 912.76 905.56 0.99
Radeon 8060S Graphics llama 8B Q6_K 2048 pp2048 997.03 991.04 0.99
Radeon 8060S Graphics llama 8B Q8_0 16 pp2048 310.12 323.85 1.04
Radeon 8060S Graphics llama 8B Q8_0 32 pp2048 411.66 390.63 0.95
Radeon 8060S Graphics llama 8B Q8_0 64 pp2048 534.93 462.41 0.86
Radeon 8060S Graphics llama 8B Q8_0 128 pp2048 344.33 355.86 1.03
Radeon 8060S Graphics llama 8B Q8_0 256 pp2048 355.69 372.63 1.05
Radeon 8060S Graphics llama 8B Q8_0 512 pp2048 369.97 402.62 1.09
Radeon 8060S Graphics llama 8B Q8_0 1024 pp2048 381.59 401.03 1.05
Radeon 8060S Graphics llama 8B Q8_0 2048 pp2048 384.16 404.36 1.05

In my testing the performance changes from this PR are very inconsistent across batch sizes and data types. It cannot be merged like this.

@pedapudi
Author

pedapudi commented Apr 8, 2026

@JohannesGaessler thank you for the sweep! I understand your position that the variance is not desirable.

Asking naively: is the sweep affected by the issue described in PR #21282?

Let me know if you have suggestions on how to reduce the variance. Thanks! I'd also appreciate your eyes on the sweep @tbocek ran, which shows material benefit on larger, more modern models than Llama 8B.

@pedapudi
Author

pedapudi commented Apr 9, 2026

@JohannesGaessler

Perhaps corroborating your findings: this PR appears to benefit MoE models, but is a wildcard for dense models (e.g., Qwen 3.5 9B).

If this change is categorically not beneficial for dense models, there might be a path forward: in the mmq code path, simply revert to the old values when the model is dense. I don't quite know why it isn't beneficial (I have not had a chance to look more closely). What's your opinion?
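A hypothetical version of that fallback could look like the sketch below (illustrative only; `TileConfig`, `pick_tile_config`, and the MoE flag are invented names, not llama.cpp API, and the "legacy" numbers are placeholders rather than the actual values in mmq.cuh):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TileConfig:
    mmq_x_max: int
    mmq_y: int
    nwarps: int

# Values from this PR for RDNA3.5; LEGACY uses placeholder numbers.
RDNA3_5_TUNED = TileConfig(mmq_x_max=48, mmq_y=64, nwarps=4)
LEGACY        = TileConfig(mmq_x_max=64, mmq_y=128, nwarps=8)

def pick_tile_config(is_rdna3_5: bool, is_moe: bool) -> TileConfig:
    """Use the new tiles only where they were observed to help
    (RDNA3.5 running an MoE model); otherwise keep the old values."""
    return RDNA3_5_TUNED if (is_rdna3_5 and is_moe) else LEGACY

print(pick_tile_config(True, True))
```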

@remeh

remeh commented Apr 10, 2026

👋
On a Strix Halo, I ran some benchmarks with 3 different MoE models and with Qwen 3.5 9B.

In all these benchmarks, main is 25eec6f32 (master from April 6), and custom is 25eec6f32 plus the patch from this PR. llama.cpp compilation flags, with ROCm 7.12:

cmake -S . -B build \
        -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
        -DGGML_HIP=ON \
        -DHIP_PLATFORM=amd \
        -DGGML_CUDA_FA_ALL_QUANTS=ON \
        -DGGML_HIP_ROCWMMA_FATTN=OFF \
        -DGPU_TARGETS=gfx1151 \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLAMA_OPENSSL=ON \
        -DLLAMA_BUILD_EXAMPLES=OFF \
        --fresh

First, here are benchmark results with 3 MoE models at various context sizes: Gemma4 26B-A4B, Nemotron3-Super, and GLM4.7-Flash. They look great!

| Model                             | Test           | Main (t/s) | Custom (t/s) | Delta (%) |
| --------------------------------- | -------------- | ---------: | -----------: | --------: |
| GLM4.7-Flash Q4_K_M               | pp512 @ d500   |     837.93 |      1037.35 |    +23.8% |
| GLM4.7-Flash Q4_K_M               | pp512 @ d5000  |     399.56 |       443.86 |    +11.1% |
| GLM4.7-Flash Q4_K_M               | pp512 @ d10000 |     256.08 |       274.03 |     +7.0% |
| GLM4.7-Flash Q4_K_M               | pp512 @ d20000 |     148.13 |       154.00 |     +4.0% |
| Gemma-4-26B-A4B Q6_K_L            | pp512 @ d500   |    1146.08 |      1386.00 |    +20.9% |
| Gemma-4-26B-A4B Q6_K_L            | pp512 @ d5000  |     913.15 |      1053.38 |    +15.4% |
| Gemma-4-26B-A4B Q6_K_L            | pp512 @ d10000 |     751.92 |       852.80 |    +13.4% |
| Gemma-4-26B-A4B Q6_K_L            | pp512 @ d20000 |     629.00 |       693.00 |    +10.2% |
| Nemotron-3-Super-120B-A12B Q5_K_M | pp512 @ d500   |     234.03 |       334.16 |    +42.8% |
| Nemotron-3-Super-120B-A12B Q5_K_M | pp512 @ d5000  |     225.21 |       314.10 |    +39.5% |
| Nemotron-3-Super-120B-A12B Q5_K_M | pp512 @ d10000 |     219.19 |       303.46 |    +38.4% |
| Nemotron-3-Super-120B-A12B Q5_K_M | pp512 @ d20000 |     210.96 |       285.72 |    +35.4% |

Then, here are some results with Qwen 3.5 9B. I tried to use roughly the same params you were using in the Llama 8B benchmarks you shared earlier @JohannesGaessler, varying the quants and the ubatch size. I'm curious why you're interested in the different ubatch sizes; do you have any pointers?
Unfortunately, these do not look great.

./build/bin/llama-bench --model [...] -ngl 999 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 16,32,64,128,256,512,1024,2048 --batch-size 2048 -p 2048 -n 0 -r 5
| model                          |       size |     params | backend | n_ubatch |            test |                  t/s (main) |                  t/s (custom) | diff % |
| ------------------------------ | ---------: | ---------: | ------- | -------: | --------------: | --------------------------: | ----------------------------: | -----: |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        269.30 ± 0.52 |        306.54 ± 0.31 | +13.83% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        484.15 ± 0.47 |        524.67 ± 0.39 | +8.37% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        658.36 ± 0.32 |        633.19 ± 0.29 | -3.82% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        867.36 ± 0.37 |        671.28 ± 0.24 | -22.60% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        953.10 ± 0.27 |        727.43 ± 0.07 | -23.68% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        948.09 ± 0.54 |        766.46 ± 0.57 | -19.15% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |        979.33 ± 3.79 |        774.11 ± 2.03 | -20.95% |
| qwen35 9B IQ2_S - 2.5 bpw      |   3.23 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |        995.85 ± 1.42 |        761.93 ± 0.70 | -23.49% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        312.56 ± 0.07 |        324.17 ± 0.09 | +3.71% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        492.62 ± 0.38 |        530.46 ± 0.19 | +7.68% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        748.53 ± 0.54 |        650.56 ± 0.15 | -13.09% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        909.79 ± 4.44 |        804.11 ± 0.53 | -11.62% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        989.23 ± 2.28 |        873.73 ± 0.72 | -11.67% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        988.74 ± 1.24 |        931.09 ± 0.82 | -5.83% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1027.97 ± 4.62 |        914.93 ± 3.57 | -10.99% |
| qwen35 9B IQ3_S mix - 3.66 bpw |   4.34 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |        985.60 ± 7.82 |        929.42 ± 4.90 | -5.70% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        375.01 ± 0.14 |        391.74 ± 0.11 | +4.46% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        467.51 ± 0.10 |        497.40 ± 0.20 | +6.39% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        834.80 ± 0.44 |        619.05 ± 0.26 | -25.84% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |      128 |          pp2048 |       1033.95 ± 0.55 |        933.30 ± 0.57 | -9.74% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |      256 |          pp2048 |       1148.83 ± 1.14 |       1022.15 ± 0.49 | -11.03% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |      512 |          pp2048 |       1152.68 ± 1.11 |       1085.20 ± 0.85 | -5.86% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1160.98 ± 2.05 |       1085.15 ± 2.19 | -6.53% |
| qwen35 9B IQ4_XS - 4.25 bpw    |   4.84 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1171.66 ± 1.90 |       1085.44 ± 2.67 | -7.36% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        215.73 ± 0.02 |        209.64 ± 0.05 | -2.82% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        392.28 ± 0.07 |        389.85 ± 0.06 | -0.62% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        569.54 ± 0.24 |        512.97 ± 0.19 | -9.93% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        700.48 ± 0.34 |        524.62 ± 0.14 | -25.10% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        699.38 ± 0.64 |        695.53 ± 1.93 | -0.55% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        856.65 ± 1.03 |        843.56 ± 0.92 | -1.53% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |        962.25 ± 3.09 |        916.03 ± 4.51 | -4.80% |
| qwen35 9B Q2_K - Medium        |   4.64 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1024.81 ± 2.65 |        993.85 ± 4.50 | -3.02% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        282.20 ± 0.04 |        318.68 ± 0.09 | +12.93% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        549.85 ± 0.91 |        576.63 ± 0.32 | +4.87% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        769.48 ± 0.35 |        723.11 ± 0.20 | -6.03% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        939.80 ± 0.48 |        833.42 ± 0.31 | -11.32% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |      256 |          pp2048 |       1031.24 ± 0.51 |        908.94 ± 0.40 | -11.85% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |      512 |          pp2048 |       1029.60 ± 0.37 |        967.00 ± 0.60 | -6.08% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1046.62 ± 8.23 |        979.06 ± 2.87 | -6.45% |
| qwen35 9B Q3_K - Medium        |   4.52 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1035.41 ± 4.65 |        977.38 ± 2.71 | -5.60% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        350.44 ± 0.05 |        379.83 ± 0.05 | +8.39% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        327.12 ± 0.07 |        418.54 ± 0.50 | +27.95% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        825.10 ± 0.31 |        506.30 ± 0.12 | -38.64% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |      128 |          pp2048 |       1017.04 ± 0.92 |        919.44 ± 0.26 | -9.60% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |      256 |          pp2048 |       1125.79 ± 0.56 |       1009.01 ± 0.37 | -10.37% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |      512 |          pp2048 |       1129.42 ± 1.56 |       1056.94 ± 1.04 | -6.42% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1112.29 ± 2.75 |       1072.47 ± 3.07 | -3.58% |
| qwen35 9B Q4_0                 |   5.06 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1154.69 ± 1.31 |       1053.06 ± 2.62 | -8.80% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        322.60 ± 0.13 |        343.07 ± 0.07 | +6.34% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        525.22 ± 0.44 |        573.34 ± 0.44 | +9.16% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        729.52 ± 0.38 |        703.25 ± 0.16 | -3.60% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        885.56 ± 0.34 |        808.43 ± 0.16 | -8.71% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        965.19 ± 0.47 |        883.44 ± 0.55 | -8.47% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        946.71 ± 0.85 |        959.60 ± 1.46 | +1.36% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1002.28 ± 4.05 |        984.98 ± 2.50 | -1.72% |
| qwen35 9B Q4_K - Medium        |   5.48 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1030.76 ± 1.32 |       1005.47 ± 2.28 | -2.45% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        335.56 ± 0.05 |        354.96 ± 0.48 | +5.78% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        554.94 ± 0.32 |        601.98 ± 0.65 | +8.48% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        783.62 ± 0.26 |        762.95 ± 0.40 | -2.64% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        955.86 ± 0.85 |        873.55 ± 0.28 | -8.61% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |      256 |          pp2048 |       1050.56 ± 0.68 |        965.18 ± 0.30 | -8.13% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |      512 |          pp2048 |       1039.32 ± 0.84 |       1028.08 ± 1.09 | -1.08% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1080.35 ± 3.77 |       1056.10 ± 1.94 | -2.24% |
| qwen35 9B Q4_K - Small         |   5.19 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1068.03 ± 0.85 |       1040.37 ± 3.32 | -2.59% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        315.51 ± 0.06 |        330.29 ± 0.08 | +4.68% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        511.30 ± 0.32 |        547.34 ± 0.15 | +7.05% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        765.90 ± 0.28 |        682.83 ± 0.53 | -10.84% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        903.56 ± 0.58 |        841.38 ± 0.25 | -6.88% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        999.82 ± 0.82 |        920.34 ± 0.40 | -7.95% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        999.64 ± 1.28 |        967.92 ± 1.43 | -3.17% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1005.29 ± 3.33 |        992.12 ± 3.45 | -1.31% |
| qwen35 9B Q5_K - Medium        |   6.38 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1030.49 ± 1.05 |       1011.65 ± 2.35 | -1.83% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        284.80 ± 0.26 |        298.27 ± 0.13 | +4.73% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        448.14 ± 0.18 |        479.44 ± 0.09 | +6.99% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        635.53 ± 0.25 |        573.79 ± 0.26 | -9.71% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        757.23 ± 0.51 |        687.38 ± 0.37 | -9.22% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |      256 |          pp2048 |        836.38 ± 0.36 |        732.24 ± 0.20 | -12.45% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |      512 |          pp2048 |        796.02 ± 0.96 |        785.80 ± 1.47 | -1.28% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |        911.91 ± 3.04 |        882.34 ± 5.92 | -3.24% |
| qwen35 9B Q6_K                 |   7.58 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |        960.51 ± 5.12 |        948.52 ± 5.60 | -1.25% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |       16 |          pp2048 |        272.09 ± 0.07 |        283.55 ± 0.05 | +4.21% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |       32 |          pp2048 |        407.91 ± 0.36 |        435.75 ± 0.10 | +6.82% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |       64 |          pp2048 |        739.65 ± 0.37 |        546.70 ± 0.20 | -26.09% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |      128 |          pp2048 |        951.41 ± 0.90 |        837.72 ± 0.57 | -11.95% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |      256 |          pp2048 |       1061.03 ± 0.54 |        913.86 ± 0.50 | -13.87% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |      512 |          pp2048 |       1063.48 ± 0.78 |        979.07 ± 0.90 | -7.94% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |     1024 |          pp2048 |       1084.12 ± 3.87 |        984.91 ± 2.14 | -9.15% |
| qwen35 9B Q8_0                 |   8.88 GiB |     8.95 B | ROCm    |     2048 |          pp2048 |       1099.03 ± 3.99 |        973.19 ± 2.36 | -11.45% |

HTH

@0xSero

0xSero commented Apr 14, 2026

Tested this on a Framework Desktop (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB RAM, Fedora 43, ROCm 7.2.1) with: Qwen3.5-122B-A10B-REAP-20 Q6_K.

I used Codex to build a patched ROCm container from the PR diff and compared it against the stock ROCm 7.2.1 toolbox build. On my machine the patch significantly improves prefill and is basically neutral for decode:

| test            | stock ROCm  | patched ROCm | delta |
| --------------- | ----------: | -----------: | ----: |
| pp512           | 262.18 t/s  |  354.57 t/s  |  +35% |
| tg128           |  20.83 t/s  |   21.00 t/s  |   +1% |
| pp2048+tg128    | 162.84 t/s  |  194.36 t/s  |  +19% |
| pp8192+tg128    | 228.66 t/s  |  302.44 t/s  |  +32% |
| pp16384+tg128   | 234.14 t/s  |  314.70 t/s  |  +34% |
| pp32768+tg128   | 221.31 t/s  |  291.03 t/s  |  +31% |
| pp65536+tg128   | 190.37 t/s  |  238.28 t/s  |  +25% |
| pp131072+tg128  | 144.44 t/s  |  171.66 t/s  |  +19% |

So at least on gfx1151 + a large MoE model, this looks very real and very useful.

@kyuz0

kyuz0 commented Apr 15, 2026

I have added a toolbox with this PR and run the benchmark:

https://kyuz0.github.io/amd-strix-halo-toolboxes/

The benefits seem to be mostly at short context; if you switch to the 30k-context tests, the results are mixed: sometimes better, sometimes worse, but not by much.

justinappler added a commit to justinappler/llama.cpp-strix-halo that referenced this pull request Apr 19, 2026
Applies the six-edit ggml-cuda/mmq.cuh change from upstream PR ggml-org#21344
(pedapudi/llama.cpp@gfx1151-opt) that gives RDNA 3.5 its own MMQ tile
and warp sizing — mmq_x_max=48, mmq_y=64, nwarps=4 — instead of
inheriting the discrete RDNA3 values tuned for 7900 XTX-class hardware.

Hypothesis, expected numbers (from kyuz0's independent A/B logs), and
bench plan in strix-halo/mmq-rdna3_5.md. Also includes the previously
uncommitted docs (codex-insights, rocm-config) and updates to NOTES,
README, uma-integrated reflecting the UMA-deprioritization decision,
plus a .gitignore entry for useful-repos/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
