Name and Version
I am doing CPU inference with a Rockchip RK3588 processor (ARMv8, Debian 12.11) using the llama.cpp server (Llama 3B and Gemma 4B, Q4_0 quants). Sometime between August 21st and the latest build, performance fell off a cliff with this configuration, going from about 8 tokens/s to about 5 tokens/s at a context of ~2k tokens with Llama 3B.
I confirmed that Q4_0 runtime repack is happening in both cases:
repack: repack tensor blk.0.attn_q.weight with q4_0_4x4
repack: repack tensor blk.0.attn_k.weight with q4_0_4x4
...
I am adding the "-DGGML_CPU_AARCH64=ON" compile flag:
cmake -B build
...
CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release --target llama-server
...
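For reference, to my knowledge CMAKE_ARGS is an environment-variable convention used by some wrapper builds and is not read by cmake --build itself; the equivalent configuration with the flag passed at the configure step would be:
cmake -B build -DGGML_CPU_AARCH64=ON
cmake --build build --config Release --target llama-server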
When I hard-reset to a commit from August 21st (git reset --hard 54a241f505d515d625767b993bfd573ecee306b9), then compile and run, performance is much better:
srv log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id 0 | task 1472 | processing task
slot update_slots: id 0 | task 1472 | new prompt, n_ctx_slot = 10816, n_keep = 0, n_prompt_tokens = 1551
slot update_slots: id 0 | task 1472 | kv cache rm [1544, end)
slot update_slots: id 0 | task 1472 | prompt processing progress, n_past = 1551, n_tokens = 7, progress = 0.004513
slot update_slots: id 0 | task 1472 | prompt done, n_past = 1551, n_tokens = 7
slot release: id 0 | task 1472 | stop processing: n_past = 2220, truncated = 0
slot print_timing: id 0 | task 1472 |
prompt eval time = 404.42 ms / 7 tokens ( 57.77 ms per token, 17.31 tokens per second)
eval time = 83339.42 ms / 670 tokens ( 124.39 ms per token, 8.04 tokens per second)
total time = 83743.83 ms / 677 tokens
Using the latest build, it is about 40% slower:
srv log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id 0 | task 1472 | processing task
slot update_slots: id 0 | task 1472 | new prompt, n_ctx_slot = 11008, n_keep = 0, n_prompt_tokens = 1552
slot update_slots: id 0 | task 1472 | kv cache rm [1545, end)
slot update_slots: id 0 | task 1472 | prompt processing progress, n_past = 1552, n_tokens = 7, progress = 0.004510
slot update_slots: id 0 | task 1472 | prompt done, n_past = 1552, n_tokens = 7
slot release: id 0 | task 1472 | stop processing: n_past = 2245, truncated = 0
slot print_timing: id 0 | task 1472 |
prompt eval time = 720.48 ms / 7 tokens ( 102.93 ms per token, 9.72 tokens per second)
eval time = 121327.90 ms / 694 tokens ( 174.82 ms per token, 5.72 tokens per second)
total time = 122048.37 ms / 701 tokens
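Comparing the two eval lines: 174.82 ms per token / 124.39 ms per token ≈ 1.41, so each generated token takes roughly 40% longer on the latest build, matching the drop from 8.04 to 5.72 tokens per second (about 29% lower throughput).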
I was also getting different, unrelated errors with builds from early September, so I don't know exactly when it broke, but something happened between August 21st and now that makes performance much, much worse.
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./build/bin/llama-server --model Llama-3.2-3B-Instruct-Q4_0.gguf --slot-save-path ~/Development/llama_cpp_server_state --port 8080 --ctx-size 10800 --threads 4
Problem description & steps to reproduce
Run inference with this configuration on a build from August 21st and then on the latest build, and compare the generation speed.
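Roughly how the server is exercised for each measurement (the prompt here is only a placeholder; the real prompts are around 1.5k tokens, as in the logs above):
curl -s http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "<long prompt, ~1.5k tokens>", "n_predict": 512}'
Then compare the eval tokens-per-second reported in the server timings between the two builds.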
First Bad Commit
I was unable to pin it down because builds from early September had other, unrelated errors, but the August 21st commit (54a241f505d515d625767b993bfd573ecee306b9) was working.
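If it helps narrow it down, a bisect between the known-good commit and current master would look roughly like this (rebuilding llama-server and re-measuring eval speed at each step):
git bisect start
git bisect bad master
git bisect good 54a241f505d515d625767b993bfd573ecee306b9
# rebuild, run the same prompt, then mark the commit:
git bisect good   # still ~8 tokens per second
git bisect bad    # drops to ~5-6 tokens per second
git bisect skip   # for the early-September commits that do not build or run
git bisect reset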