Conversation

@Alcpz (Contributor) commented Nov 26, 2025

For small shapes where the number of columns is small (e.g. 16), the current logic skipped some chunks due to rounding.

The issue was observed with NB_COLS 8 and ne01 16, and could potentially happen with NB_COLS 4 and other thread/shape combinations. This also affected the corner case where chunking is disabled.
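
For context, here is a minimal standalone sketch of the rounding problem (simplified chunking with hypothetical variable names, not the actual llama.cpp code): if the per-thread chunk width is floored to a multiple of NB_COLS, a small ne01 can yield a width of 0 and the trailing columns are never processed, whereas ceiling-based sizing covers the full range.

```c
/*
 * Illustrative sketch only (simplified chunking with hypothetical names,
 * not the actual llama.cpp code).
 */
#include <stdio.h>

int main(void) {
    const int NB_COLS = 8;  /* columns processed per kernel tile      */
    const int ne01    = 16; /* total output columns (the small shape) */
    const int nth     = 3;  /* worker threads splitting the columns   */

    /* Floor-based sizing: divide the columns across threads, then round
     * the chunk width *down* to a multiple of NB_COLS. For ne01 = 16 and
     * nth = 3 this yields 0, so no chunk covers any column at all. */
    int chunk_floor = (ne01 / nth) / NB_COLS * NB_COLS;

    /* Ceiling-based sizing rounds *up* instead, so every column falls
     * inside some chunk (the last chunk is simply clamped to ne01). */
    int chunk_ceil = ((ne01 + nth - 1) / nth + NB_COLS - 1) / NB_COLS * NB_COLS;

    printf("floor-based chunk width: %d\n", chunk_floor); /* prints 0 */
    printf("ceil-based  chunk width: %d\n", chunk_ceil);  /* prints 8 */
    return 0;
}
```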

@max-krasnyansky I checked the performance here and didn't see any issues. Let me know if you'd like me to run any particular test.

Performance

RPi 5

| model | test | 2f416b2 (7162) t/s | 3e18dba (7161) t/s |
| --- | --- | ---: | ---: |
| lfm2 350M Q4_0 | pp256 | 174.46 ± 0.07 | 173.41 ± 0.64 |
| lfm2 350M Q4_0 | tg128 | 51.58 ± 0.03 | 51.38 ± 0.26 |
| lfm2 700M Q4_0 | pp256 | 81.79 ± 0.01 | 82.55 ± 0.03 |
| lfm2 700M Q4_0 | tg128 | 25.78 ± 0.00 | 25.86 ± 0.00 |

M4 Max

| model | test | 2f416b2 (7162) t/s | 3e18dba (7161) t/s |
| --- | --- | ---: | ---: |
| lfm2 1.2B Q4_K Medium | pp256 | 682.39 ± 3.23 | 682.82 ± 2.97 |
| lfm2 1.2B Q4_K Medium | tg128 | 233.77 ± 4.45 | 234.96 ± 0.57 |
| lfm2 700M Q4_K Medium | pp256 | 1070.08 ± 2.77 | 1067.29 ± 7.14 |
| lfm2 700M Q4_K Medium | tg128 | 331.12 ± 1.27 | 333.13 ± 1.32 |
| llama 8B Q4_K Medium | pp256 | 100.26 ± 0.11 | 96.65 ± 1.75 |
| llama 8B Q4_K Medium | tg128 | 43.10 ± 0.50 | 41.69 ± 0.72 |
| qwen3 8B Q4_K Medium | pp256 | 94.40 ± 0.33 | 90.45 ± 0.34 |
| qwen3 8B Q4_K Medium | tg128 | 40.92 ± 0.33 | 40.29 ± 0.27 |

@max-krasnyansky (Collaborator) commented

Looks good to me. It's funny how many little corner cases we ended up having to deal with.
The original logic I added (i.e. 4x chunks per thread) seemed so simple and bulletproof :)

Tested on my Snapdragon Gen5 with a bunch of models (llama-3.2-1/2B, qwen3-0.6B .. 8B, LFM2s, ...).
nchunk selection looks good and the overall performance is the same. Merging ...

@max-krasnyansky merged commit 5449367 into ggml-org:master Nov 26, 2025
70 of 74 checks passed
@Alcpz (Contributor, Author) commented Nov 27, 2025

Yeah, totally. I guess these smaller cases aren't representative of the models that are out there, which is why we don't run into them. Thanks for the review and merge!

@Alcpz deleted the Alcpz/mul_mat_chunk_fix branch November 27, 2025 12:04
am17an pushed a commit to am17an/llama.cpp that referenced this pull request Nov 27, 2025

Labels

ggml (changes relating to the ggml tensor library for machine learning)
