llama-bench performance drop since build b9437 #24708

KaySees · 2026-06-16T20:34:40Z

I have been testing performance tuning for Qwen3.6-35B-A3B and Gemma-4-26B-A4B with llama-bench over the past few days on build b9434, and got decent numbers back. Since build b9437, however, performance has dropped significantly, and I can hear my GPU's fan spin up to full speed when running the tests.

Anyone got any ideas?

Answered by Kononnable

Kangaroux · 2026-06-17T00:34:22Z

Are you using Vulkan? I saw roughly a 6x drop in prefill speed with the new Vulkan build (9672, coming from 9209). I switched to ROCm and it's fine.

1 reply

KaySees Jun 17, 2026
Author

I'm using CUDA on an RTX 3070 Ti. My drop is roughly 20x: PP went from 481 down to 18 tps, and TG from 22 down to less than 2 tps.
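
For reference, the slowdown factors implied by those figures can be worked out directly. This is a minimal sketch using only the numbers quoted above; the function name is purely illustrative, and because TG is reported as "less than 2 tps", the TG factor is a lower bound.

```python
def slowdown(before_tps: float, after_tps: float) -> float:
    """Return how many times slower the 'after' throughput is than 'before'."""
    if after_tps <= 0:
        raise ValueError("after_tps must be positive")
    return before_tps / after_tps


# Prompt processing (PP): 481 tps before, 18 tps after.
pp_factor = slowdown(481, 18)

# Token generation (TG): 22 tps before, "less than 2" after,
# so dividing by 2 gives a lower bound on the real factor.
tg_factor = slowdown(22, 2)

print(f"PP: {pp_factor:.1f}x slower, TG: at least {tg_factor:.1f}x slower")
```

So "about 20x" sits between the two: roughly 27x for PP and at least 11x for TG.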

Kononnable · 2026-06-17T07:34:39Z

Can you test with different options for flash attention? #23714 changed the default for llama-bench:
-fa, --flash-attn <on|off|auto> (default: auto)

4 replies

KaySees Jun 17, 2026
Author

The result I posted was run with -fa on. I did run a test to check the difference between -fa on and off, and the results are very close, with fa off slightly faster: 170 tps PP (on) vs 174 tps PP (off), and the TG difference is only in the decimals. These are with the b9437+ builds. I ran the same command line against b9434, b9436, b9437, b9646 and b967x (the 6/16 build). The symptom starts at b9437.

Kononnable Jun 17, 2026

I don't see many changes between b9434 and b9437 that could justify that, especially if you also tested b9436 and it worked correctly.
b9434...b9437

You could:

Check whether the difference is also observable in llama-server (or any tool other than llama-bench), just as a sanity check that those changes didn't trigger some unforeseen problem.
Try building both versions locally and see whether the regression appears in only one of them; that should rule out a dependency upgrade as the cause.

KaySees Jun 17, 2026
Author

Unfortunately I'm not a coder, but I'll see what I can do about building them locally. As for llama-server, I do see a drop in performance there too, though not as drastic as in llama-bench. PP dropped to around 12 tps from 300+ tps, and TG dropped from ~25 tps to less than 2 tps. That is what prompted me to investigate and check when the issue started.

KaySees Jun 17, 2026
Author

In case the command line helps, this is what I used with llama-bench:
./llama/llama-bench.exe -m d:/LLM-models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -m d:/LLM-models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -b 512 -t 8 -fa on -ctk q8_0 -fitt 2048 -fitc 96000 -d 20480 -p 2048 -o csv > llama-bench-gemma4_v_Qwen36-test-20260616=1.csv
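
To compare builds side by side, it can help to generate the same invocation for each build and each -fa value. The following is a rough sketch, not a tested harness: the build directory names are hypothetical, and only flags that already appear in the command above (-m, -fa, -p, -o csv) are used.

```python
from pathlib import Path


def bench_commands(build_dirs, model, fa_values=("on", "off")):
    """Build one llama-bench argument list per (build directory, -fa value) pair."""
    commands = []
    for build_dir in build_dirs:
        # Assumes each directory holds one unpacked release (hypothetical layout).
        exe = str(Path(build_dir) / "llama-bench.exe")
        for fa in fa_values:
            commands.append([exe, "-m", model, "-fa", fa, "-p", "2048", "-o", "csv"])
    return commands


if __name__ == "__main__":
    model = "d:/LLM-models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
    for cmd in bench_commands(["./b9436", "./b9437"], model):
        # Print only; pass each list to subprocess.run() to actually execute it.
        print(" ".join(cmd))
```

As the resolution further down this thread notes, builds up to 9436 expect 0 or 1 for -fa, so for those builds it is safer to pass fa_values=("1", "0") instead of the defaults.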

slewsys · 2026-06-17T15:38:21Z

I have also seen nearly an order-of-magnitude drop in benchmark performance of GPT 120b between b7761 and b9669+ on an RTX 3090. I'm currently on Fedora 44, building against CUDA 13.3 and Intel oneAPI (with GCC v15). I intend to git bisect llama.cpp when time permits. But yes, it might be evolving too fast for consistency :) Over that same period, I've seen llama.cpp's performance improve on Strix Halo running ROCm.
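
The idea behind git bisect is a binary search for the first bad revision. Here is a minimal sketch of the same search over an ordered list of builds, assuming the oldest build is known good and the newest is known bad. The `is_bad` predicate is a hypothetical stand-in for "run the benchmark and check whether throughput collapsed"; the build numbers in the example are illustrative only.

```python
def first_bad(builds, is_bad):
    """Return the first bad build.

    Assumes builds are ordered oldest to newest, builds[0] is good,
    builds[-1] is bad, and every build after the first bad one is also bad.
    """
    if len(builds) < 2:
        raise ValueError("need at least one good and one bad build")
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the first bad build is at mid or earlier
        else:
            lo = mid  # the first bad build is after mid
    return builds[hi]


if __name__ == "__main__":
    # Illustrative only: pretend everything from 9437 onward is slow.
    builds = list(range(9434, 9440))
    print(first_bad(builds, lambda b: b >= 9437))
```

Each step halves the remaining range, so even a span of a couple of thousand builds needs only around a dozen benchmark runs.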

0 replies

KaySees · 2026-06-19T20:17:01Z

OK, I finally found the culprit, and it is actually my fault. Build 9436 and earlier don't error out on -fa on; however, since they expect either 1 or 0 as input, they set it to 0. The newer builds recognize on, off, and auto as input, so when I run llama-bench with them, it now really runs with fa set to on.

After changing to off, everything seems to return to normal. The difference between llama-bench and llama-server, however, is that even with -fa on, llama-server still returns over 600 tps for PP and maintains ~17-18 tps on average for TG, while llama-bench with -fa 1 drops PP and TG to the numbers I saw as the degradation... weird.
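
A minimal sketch of how a non-numeric value like "on" can silently become 0. This assumes a C-style atoi conversion, which is one common way this happens; the actual parsing code in those builds has not been checked here, and both function names below are hypothetical.

```python
def atoi_like(text: str) -> int:
    """Mimic C's atoi: read an optional sign and leading digits, else return 0."""
    s = text.lstrip()
    i = 0
    sign = 1
    if i < len(s) and s[i] in "+-":
        sign = -1 if s[i] == "-" else 1
        i += 1
    value = 0
    while i < len(s) and s[i] in "0123456789":
        value = value * 10 + int(s[i])
        i += 1
    return sign * value  # no digits found -> 0, with no error raised


def parse_fa_strict(text: str) -> str:
    """Hypothetical strict parser: reject anything that is not a known value."""
    allowed = {"on", "off", "auto"}
    if text not in allowed:
        raise ValueError(f"invalid -fa value: {text!r}")
    return text


print(atoi_like("on"))   # lenient parse: "on" has no leading digits
print(atoi_like("1"))    # numeric input is read as expected
```

With a lenient parse like this, `-fa on` and `-fa off` both end up as 0, which would explain why the old builds looked fast with "-fa on": flash attention was never actually enabled.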

0 replies
