llama-bench performance drop since build b9437 #24708

Are you using Vulkan? I saw roughly a 6x drop in prefill speed with the new Vulkan build (9672, coming from 9209). Switched to ROCm and it's fine.

Can you test with different options for flash attention? #23714 changed the default for llama-bench.

I have also seen nearly an order-of-magnitude drop in benchmark performance of GPT 120b between b7761 and b9669+ with an RTX 3090. I'm currently on Fedora 44, building against CUDA 13.3 and Intel oneAPI (with GCC v15). I do intend to git bisect llama.cpp when time permits. But yes, it might be evolving too fast for consistency :) Over that same period, I've seen llama.cpp's performance improve on Strix Halo running ROCm.

OK, finally found the culprit, and it is actually my fault. Build 9436 and earlier don't error out on -fa on; since they expect either 1 or 0 as input, they set it to 0. The newer builds recognize on, off, and auto as input, so the same llama-bench command now runs with fa set to on. After changing it to off, everything seems to return to normal. The odd part is the difference between llama-bench and llama-server: even with -fa on in llama-server, it still returns over 600 tps for PP and maintains ~17-18 tps on average for TG, while in llama-bench, setting -fa 1 drops both PP and TG tps, which is what I saw as a degradation... weird.
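
To make the flag behavior described above concrete, here is a minimal Python sketch of the two parsing styles. It is a model of what the thread reports, not llama.cpp's actual code: the function names `parse_fa_legacy` and `parse_fa_current` are hypothetical, and the mechanism (a failed numeric parse falling back to 0) is an assumption. The thread also does not say whether newer builds still accept 0 and 1, so the second function only models the three values that were named.

```python
def parse_fa_legacy(value: str) -> int:
    """Hypothetical model of builds up to 9436: expects 0 or 1.

    Assumption: a value that is not a number does not raise an error
    and quietly ends up as 0, which matches the "-fa on" report.
    """
    try:
        return 1 if int(value) != 0 else 0
    except ValueError:
        return 0  # "on" is silently treated as flash attention disabled


def parse_fa_current(value: str) -> str:
    """Hypothetical model of newer builds: accepts on, off, or auto.

    Assumption: anything else is rejected here only because the thread
    does not describe how other values are handled.
    """
    if value in ("on", "off", "auto"):
        return value
    raise ValueError(f"unsupported flash attention value in this sketch: {value!r}")


if __name__ == "__main__":
    # The same command-line value resolves to different settings.
    print("legacy :", parse_fa_legacy("on"))
    print("current:", parse_fa_current("on"))
```

Under these assumptions, the same `-fa on` argument means "disabled" to the older parser and "enabled" to the newer one, which would explain why an unchanged benchmark command produced different numbers after upgrading.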


-fa, --flash-attn <on|off|auto> (default: auto)
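
One way to act on the flash attention question above is to run the same benchmark once per option and compare the results. The sketch below only builds and prints the command lines; it does not run anything. The binary name on `PATH` and the model path `model.gguf` are placeholders, and the helper `build_fa_sweep` is a hypothetical name. Only the `-m` and `-fa` flags are used.

```python
import shlex


def build_fa_sweep(model_path, fa_values=("on", "off", "auto"), binary="llama-bench"):
    """Return one llama-bench argument list per flash attention setting.

    model_path and binary are placeholders; adjust them to your setup.
    """
    return [[binary, "-m", model_path, "-fa", value] for value in fa_values]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell.
    for argv in build_fa_sweep("model.gguf"):
        print(shlex.join(argv))
```

On builds at or before 9436, the thread indicates the flag expects 0 or 1, so `fa_values=("0", "1")` would be the equivalent sweep there.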