
Conversation

Alcpz (Contributor) commented Nov 13, 2025

This is a continuation of #17030 after a performance regression was reported.

Perplexity Comparison (Repack vs Non-Repack)

Command:

MODELS="unsloth/Qwen3-8B-128K-GGUF:Q4_0 ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF:Q4_0 LiquidAI/LFM2-700M-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0"
for d in build-cpu-aarm64 build-cpu-aarm64-norepack; do
    for model in $MODELS; do
        ${d}/bin/llama-perplexity -hf "$model" -f ./wikitext-2-raw/wiki.test.raw --chunks 20 -dev none
    done
done
| Model | Repack PPL | Non-Repack PPL |
| --- | --- | --- |
| LFM2-700M Q4_0 | 20.3324 ± 0.87133 | 20.3324 ± 0.87133 |
| LFM2-1.2B Q4_0 | 15.7524 ± 0.63304 | 15.7524 ± 0.63304 |
| Meta-Llama-3.1-8B-Instruct Q4_0 | 8.6578 ± 0.30323 | 8.6578 ± 0.30323 |
| Qwen3-8B-128K Q4_0 | 11.1735 ± 0.48175 | 11.1735 ± 0.48175 |

Llama-bench

(M4 Max)

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | pp256 | 148.88 ± 0.60 |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | tg128 | 47.71 ± 0.35 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | pp256 | 151.26 ± 1.94 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | tg128 | 43.47 ± 0.78 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | pp256 | 3248.97 ± 32.82 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | tg128 | 562.68 ± 7.35 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | pp256 | 1585.66 ± 13.60 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | tg128 | 349.23 ± 2.42 |

build: c77bafd (6967) THIS PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | pp256 | 148.80 ± 0.18 |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | tg128 | 48.50 ± 0.81 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | pp256 | 160.24 ± 0.76 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | tg128 | 45.60 ± 0.17 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | pp256 | 3269.37 ± 22.99 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | tg128 | 595.18 ± 3.34 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | pp256 | 1606.13 ± 8.51 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | tg128 | 362.24 ± 3.19 |

build: 2776db6 (7047) MASTER

Alcpz (Contributor, Author) commented Nov 13, 2025

@max-krasnyansky can you please give this PR a shot and let me know if the perf is fixed? I've simplified the chunking a lot (essentially left it as-is for 2D tensors, "iterating" over planes).
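
For anyone skimming, here is a minimal, self-contained sketch of the plane-iteration idea described above. The names and the naive float kernel are illustrative only, not the actual ggml-cpu repack code: the point is that the existing 2D chunked path is kept as-is and simply invoked once per outer plane of a 3D tensor.

```cpp
#include <cstdint>
#include <vector>

// Naive stand-in for the existing 2D (chunked, repacked) matmul path:
// per plane, dst[c][r] = sum_k src0[r][k] * src1[c][k].
static void mul_mat_2d(const float * src0, const float * src1, float * dst,
                       int64_t ne00, int64_t ne01, int64_t ne11) {
    for (int64_t r = 0; r < ne01; ++r) {
        for (int64_t c = 0; c < ne11; ++c) {
            float sum = 0.0f;
            for (int64_t k = 0; k < ne00; ++k) {
                sum += src0[r*ne00 + k] * src1[c*ne00 + k];
            }
            dst[c*ne01 + r] = sum;
        }
    }
}

// 3D handling: iterate over the outer "plane" dimension and call the 2D
// kernel once per plane; the per-plane chunking/work split is untouched.
static void mul_mat_3d(const float * src0, const float * src1, float * dst,
                       int64_t ne00, int64_t ne01, int64_t ne11, int64_t ne02) {
    for (int64_t i2 = 0; i2 < ne02; ++i2) {
        mul_mat_2d(src0 + i2*ne00*ne01,
                   src1 + i2*ne00*ne11,
                   dst  + i2*ne01*ne11,
                   ne00, ne01, ne11);
    }
}

int main() {
    const int64_t ne00 = 4, ne01 = 3, ne11 = 2, ne02 = 2; // k, rows, cols, planes
    std::vector<float> a(ne02*ne01*ne00, 1.0f);
    std::vector<float> b(ne02*ne11*ne00, 2.0f);
    std::vector<float> d(ne02*ne11*ne01, 0.0f);
    mul_mat_3d(a.data(), b.data(), d.data(), ne00, ne01, ne11, ne02);
    return d.back() == 8.0f ? 0 : 1; // each element is ne00 * 1.0 * 2.0 = 8
}
```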

Alcpz changed the title from "Alcpz/batched repack mul mat" to "ggml-cpu: handle 3d tensors in repack mat_mul" Nov 13, 2025
max-krasnyansky (Collaborator) commented Nov 13, 2025

> @max-krasnyansky can you please give this PR a shot and let me know if the perf is fixed? I've simplified the chunking a lot (essentially left it as-is for 2D tensors, "iterating" over planes).

Yep. Looks great! Thanks for the quick follow-up.
I tested llama-3.2-1B/3B and qwen3-0.6B/4B with chunking instrumentation, and it generates the same number of chunks as before. The performance is the same as well; I checked 2, 4, and 6 threads on Snapdragons.

I'm marking it as ready to merge and approving.

By the way, if you have some more time/energy, it'd be great to add chunking to the repacked mul_mat_id for the MoE models.
And we should revisit the non-repacked mul_mat and mul_mat_id chunking to use this n_threads * 4 formula for the number of chunks instead of the arbitrary 16/64 that we have now.
That was/is on my TODO list after updating flash_attn and repacked mul_mat, but my list is a little too long at the moment :)
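
For context, a hedged sketch of the chunk-count heuristic being suggested (illustrative names, not the actual ggml-cpu code): aim for roughly four chunks per thread so the scheduler can balance load across threads, instead of a fixed 16 or 64 chunks regardless of thread count.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative heuristic: aim for ~4 chunks per thread, clamped so we never
// create more chunks than there are rows to split.
static int64_t choose_n_chunks(int64_t n_rows, int n_threads) {
    int64_t n_chunks = (int64_t) n_threads * 4;
    n_chunks = std::min(n_chunks, n_rows);
    return std::max<int64_t>(n_chunks, 1);
}

int main() {
    std::printf("%lld\n", (long long) choose_n_chunks(4096, 6)); // prints 24
    return 0;
}
```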

max-krasnyansky marked this pull request as ready for review November 13, 2025 17:00
Alcpz (Contributor, Author) commented Nov 13, 2025

Thanks @max-krasnyansky. I'll be collaborating every now and then, but I have a couple of implementations of the repacked q4_K to address first. Not sure if you are able to merge; if not, I'll just ping gerganov once CI passes. Thanks again!

Edit: Not sure if by marking it "ready to merge" you meant to merge once CI passed.

max-krasnyansky (Collaborator) commented Nov 13, 2025

> Thanks @max-krasnyansky. I'll be collaborating every now and then, but I have a couple of implementations of the repacked q4_K to address first. Not sure if you are able to merge; if not, I'll just ping gerganov once CI passes. Thanks again!
>
> Edit: Not sure if by marking it "ready to merge" you meant to merge once CI passed.

I meant switching from "Draft" to "Ready" :)
I can merge it. No worries.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Nov 13, 2025
max-krasnyansky merged commit becc481 into ggml-org:master Nov 13, 2025
63 of 67 checks passed