
Misc. bug: severe performance degradation starting in September in ARM CPU inference (compile flag "-DGGML_CPU_AARCH64=ON") #16242

@ekcrisp

Description


Name and Version

I am doing CPU inference on a Rockchip RK3588 processor (ARMv8, Debian 12.11) with llama.cpp server, using Llama 3B and Gemma 4B Q4_0 quants. Sometime between August 21st and the latest build, performance fell off a cliff with this configuration: Llama 3B dropped from about 8 tps to 5 tps with a context of ~2k.

I confirmed that the Q4_0 runtime repack is happening in both cases:

repack: repack tensor blk.0.attn_q.weight with q4_0_4x4
repack: repack tensor blk.0.attn_k.weight with q4_0_4x4
...

I am building with the "-DGGML_CPU_AARCH64=ON" compile flag (passed at configure time):

cmake -B build -DGGML_CPU_AARCH64=ON
...
cmake --build build --config Release --target llama-server
...
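To verify the option actually took effect (it must be given at configure time, not to cmake --build), the CMake cache can be checked; this assumes the default build/ directory:

grep GGML_CPU_AARCH64 build/CMakeCache.txt
# expected, assuming the flag was applied: GGML_CPU_AARCH64:BOOL=ON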

When I hard-reset to a commit from August 21st (git reset --hard 54a241f505d515d625767b993bfd573ecee306b9), then compile and run, performance is much better:

srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 1472 | processing task
slot update_slots: id  0 | task 1472 | new prompt, n_ctx_slot = 10816, n_keep = 0, n_prompt_tokens = 1551
slot update_slots: id  0 | task 1472 | kv cache rm [1544, end)
slot update_slots: id  0 | task 1472 | prompt processing progress, n_past = 1551, n_tokens = 7, progress = 0.004513
slot update_slots: id  0 | task 1472 | prompt done, n_past = 1551, n_tokens = 7
slot      release: id  0 | task 1472 | stop processing: n_past = 2220, truncated = 0
slot print_timing: id  0 | task 1472 | 
prompt eval time =     404.42 ms /     7 tokens (   57.77 ms per token,    17.31 tokens per second)
       eval time =   83339.42 ms /   670 tokens (  124.39 ms per token,     8.04 tokens per second)
      total time =   83743.83 ms /   677 tokens

Using the latest build, the same workload is roughly 40% slower per token in eval (124.39 ms vs. 174.82 ms per token, i.e. 8.04 tps down to 5.72 tps):

srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 1472 | processing task
slot update_slots: id  0 | task 1472 | new prompt, n_ctx_slot = 11008, n_keep = 0, n_prompt_tokens = 1552
slot update_slots: id  0 | task 1472 | kv cache rm [1545, end)
slot update_slots: id  0 | task 1472 | prompt processing progress, n_past = 1552, n_tokens = 7, progress = 0.004510
slot update_slots: id  0 | task 1472 | prompt done, n_past = 1552, n_tokens = 7
slot      release: id  0 | task 1472 | stop processing: n_past = 2245, truncated = 0
slot print_timing: id  0 | task 1472 | 
prompt eval time =     720.48 ms /     7 tokens (  102.93 ms per token,     9.72 tokens per second)
       eval time =  121327.90 ms /   694 tokens (  174.82 ms per token,     5.72 tokens per second)
      total time =  122048.37 ms /   701 tokens
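For a more controlled A/B measurement than server timings, llama-bench built from the same tree should reproduce the gap; the -p/-n sizes below are arbitrary choices, not from the original runs:

./build/bin/llama-bench -m Llama-3.2-3B-Instruct-Q4_0.gguf -t 4 -p 512 -n 128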

I was actually getting different errors with builds from early September, so I don't know exactly when it broke, but something happened between August 21st and now that made performance much worse.

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./build/bin/llama-server --model Llama-3.2-3B-Instruct-Q4_0.gguf   --slot-save-path ~/Development/llama_cpp_server_state   --port 8080   --ctx-size 10800   --threads 4

Problem description & steps to reproduce

Run inference with this configuration on a build from August 21st and then on the latest build, and compare the reported tokens per second.

First Bad Commit

I was unable to pin it down because of other errors I was getting in builds from early September, but on August 21st it was working.
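If the broken early-September builds can be skipped over, git bisect should still be able to pin down the first bad commit; a rough sketch, assuming the slow tip is currently checked out and the August 21st commit above is the known-good point:

git bisect start
git bisect bad        # current checkout is the slow build
git bisect good 54a241f505d515d625767b993bfd573ecee306b9
# at each step: rebuild, re-run the benchmark, then mark the commit
git bisect good       # or: git bisect bad, or git bisect skip for builds that fail to run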

