Name and Version
I am doing CPU inference with a Rockchip RK3588 processor (ARMv8, Debian 12.11) using the llama.cpp server (Llama 3B and Gemma 4B, Q4_0 quants). Sometime between August 21st and the latest build, performance fell off a cliff with this configuration, going from about 8 tokens/s to about 5 tokens/s at a context of ~2k tokens with Llama 3B.
I confirmed that Q4_0 runtime repack is happening in both cases:
repack: repack tensor blk.0.attn_q.weight with q4_0_4x4
repack: repack tensor blk.0.attn_k.weight with q4_0_4x4
...
I am adding the "-DGGML_CPU_AARCH64=ON" compile flag:
cmake -B build
...
CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release --target llama-server
...
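For reference, to my knowledge CMAKE_ARGS is an environment-variable convention used by some wrapper builds and is not read by cmake --build itself; the equivalent configuration with the flag passed at the configure step would be:
cmake -B build -DGGML_CPU_AARCH64=ON
cmake --build build --config Release --target llama-server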
When I hard-reset to a commit from August 21st (git reset --hard 54a241f505d515d625767b993bfd573ecee306b9), then compile and run, performance is much better:
srv log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id 0 | task 1472 | processing task
slot update_slots: id 0 | task 1472 | new prompt, n_ctx_slot = 10816, n_keep = 0, n_prompt_tokens = 1551
slot update_slots: id 0 | task 1472 | kv cache rm [1544, end)
slot update_slots: id 0 | task 1472 | prompt processing progress, n_past = 1551, n_tokens = 7, progress = 0.004513
slot update_slots: id 0 | task 1472 | prompt done, n_past = 1551, n_tokens = 7
slot release: id 0 | task 1472 | stop processing: n_past = 2220, truncated = 0
slot print_timing: id 0 | task 1472 |
prompt eval time = 404.42 ms / 7 tokens ( 57.77 ms per token, 17.31 tokens per second)
eval time = 83339.42 ms / 670 tokens ( 124.39 ms per token, 8.04 tokens per second)
total time = 83743.83 ms / 677 tokens
Using the latest build, it is about 40% slower:
srv log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id 0 | task 1472 | processing task
slot update_slots: id 0 | task 1472 | new prompt, n_ctx_slot = 11008, n_keep = 0, n_prompt_tokens = 1552
slot update_slots: id 0 | task 1472 | kv cache rm [1545, end)
slot update_slots: id 0 | task 1472 | prompt processing progress, n_past = 1552, n_tokens = 7, progress = 0.004510
slot update_slots: id 0 | task 1472 | prompt done, n_past = 1552, n_tokens = 7
slot release: id 0 | task 1472 | stop processing: n_past = 2245, truncated = 0
slot print_timing: id 0 | task 1472 |
prompt eval time = 720.48 ms / 7 tokens ( 102.93 ms per token, 9.72 tokens per second)
eval time = 121327.90 ms / 694 tokens ( 174.82 ms per token, 5.72 tokens per second)
total time = 122048.37 ms / 701 tokens
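Comparing the two eval lines: 174.82 ms per token / 124.39 ms per token ≈ 1.41, so each generated token takes roughly 40% longer on the latest build, matching the drop from 8.04 to 5.72 tokens per second (about 29% lower throughput).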
I was also getting different, unrelated errors with builds from early September, so I don't know exactly when it broke, but something happened between August 21st and now that makes performance much, much worse.
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./build/bin/llama-server --model Llama-3.2-3B-Instruct-Q4_0.gguf --slot-save-path ~/Development/llama_cpp_server_state --port 8080 --ctx-size 10800 --threads 4
Problem description & steps to reproduce
Run inference with this configuration on a build from August 21st and then on the latest build, and compare the generation speed.
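Roughly how the server is exercised for each measurement (the prompt here is only a placeholder; the real prompts are around 1.5k tokens, as in the logs above):
curl -s http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "<long prompt, ~1.5k tokens>", "n_predict": 512}'
Then compare the eval tokens-per-second reported in the server timings between the two builds.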
First Bad Commit
I was unable to pin it down because builds from early September had other, unrelated errors, but the August 21st commit (54a241f505d515d625767b993bfd573ecee306b9) was working.
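If it helps narrow it down, a bisect between the known-good commit and current master would look roughly like this (rebuilding llama-server and re-measuring eval speed at each step):
git bisect start
git bisect bad master
git bisect good 54a241f505d515d625767b993bfd573ecee306b9
# rebuild, run the same prompt, then mark the commit:
git bisect good   # still ~8 tokens per second
git bisect bad    # drops to ~5-6 tokens per second
git bisect skip   # for the early-September commits that do not build or run
git bisect reset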