Conversation

@Alcpz (Contributor) commented Nov 25, 2025

This PR improves the q4_k_q8_k GEMM and GEMV on arm64 using only dotprod (e.g. RPi 5).

  • Introduces Q8_Kx4 for convenience with the dotprod intrinsics. Q4_K remains unchanged, as its scales already fit nicely in memory this way. A rough sketch of the idea is included below.
  • Also adds a missing DOTPROD guard for the GEMV introduced in #16739.
  • Removes a TODO: tests showed no meaningful difference in performance.
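
For context, here is a rough sketch of the Q8_K layout and of what a 4-way interleaved variant could look like. The `_example` types are illustrative only (assuming QK_K = 256); they are not the exact structs or interleaving added by this PR.

```cpp
// Illustrative only: the _example structs sketch the idea of packing four Q8_K
// blocks together for dotprod-friendly access; the PR's actual layout may differ.
#include <cstdint>

constexpr int QK_K_EXAMPLE = 256;

// Standard Q8_K block in ggml: one scale, 256 int8 quants, and per-16-value sums
// (the sums are used to fold the Q4_K mins into the result).
struct block_q8_K_example {
    float   d;                              // delta (scale)
    int8_t  qs[QK_K_EXAMPLE];               // quants
    int16_t bsums[QK_K_EXAMPLE / 16];       // sums of groups of 16 quants
};

// Hypothetical 4-way interleaved variant: the quants of 4 consecutive rows are
// stored side by side so matching groups can be consumed by a single vdotq_s32.
struct block_q8_Kx4_example {
    float   d[4];                           // one delta per interleaved row
    int8_t  qs[4 * QK_K_EXAMPLE];           // quants of the 4 rows, interleaved in small groups
    int16_t bsums[4 * (QK_K_EXAMPLE / 16)]; // per-row group sums
};
```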

The PR also had to introduce generic versions of the 8x4 GEMM and GEMVs. The outputs of the GEMMs and GEMVs were tested, generic vs. no-repack, using a few matrix shapes (given as [ne00, ne01, ne10, ne11, ne12]; see the sketch after this list):

  • Various ne11 with fixed dimensions: [128, 128, 128, ne11, 1]
  • ne11 with different batch sizes: [128, 128, 128, ne11, ne12]
  • Various ne00 (= ne10): [ne00, 128, ne00, 16, 2]
  • Various output rows ne01: [128, ne01, 128, 16, 2]
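
A minimal sketch of this kind of sweep, under stated assumptions: `run_and_compare` is a hypothetical placeholder for running the mat-mul with both paths on identical inputs and comparing outputs, and the loop ranges are illustrative rather than the exact values used.

```cpp
// Hypothetical shape sweep; run_and_compare() stands in for "evaluate the
// mul_mat with and without repack on the same inputs and compare the outputs".
#include <cstdint>

struct mm_shape { int64_t ne00, ne01, ne10, ne11, ne12; };

static void sweep_shapes(void (*run_and_compare)(const mm_shape &)) {
    for (int64_t ne11 = 1; ne11 <= 16; ++ne11) {
        run_and_compare({128, 128, 128, ne11, 1});    // various ne11, fixed dims
        run_and_compare({128, 128, 128, ne11, 2});    // same shape, with a batch dimension
    }
    for (int64_t k = 256; k <= 1024; k += 256) {
        run_and_compare({k, 128, k, 16, 2});          // various ne00 (= ne10)
    }
    for (int64_t ne01 = 8; ne01 <= 256; ne01 *= 2) {
        run_and_compare({128, ne01, 128, 16, 2});     // various output rows ne01
    }
}
```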

Performance

M4 Max (GGML_NATIVE=OFF, -mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve)

| model | threads | test | new t/s | ref t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K Medium | 8 | pp256 | 667.63 | 344.36 | 1.94x |
| lfm2 1.2B Q4_K Medium | 8 | tg128 | 231.36 | 209.05 | 1.11x |
| llama 8B Q4_K Medium | 8 | pp256 | 96.09 | 51.11 | 1.88x |
| llama 8B Q4_K Medium | 8 | tg128 | 43.52 | 36.72 | 1.18x |
| qwen3 8B Q4_K Medium | 8 | pp256 | 90.38 | 50.35 | 1.79x |
| qwen3 8B Q4_K Medium | 8 | tg128 | 40.77 | 35.98 | 1.13x |

new: 690ac9a (7166) (GGML_CPU_REPACK=ON)
ref: 690ac9a (7166) (GGML_CPU_REPACK=OFF)

RPi 5 (GGML_NATIVE=ON)

| model | threads | test | new t/s | ref t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 350M Q4_K Medium | 4 | pp256 | 223.99 | 148.55 | 1.51x |
| lfm2 350M Q4_K Medium | 4 | tg128 | 48.23 | 49.66 | 0.97x |
| lfm2 700M Q4_K Medium | 4 | pp256 | 107.23 | 69.62 | 1.54x |
| lfm2 700M Q4_K Medium | 4 | tg128 | 24.41 | 24.76 | 0.99x |

Perplexity

| model | generic | no-repack | repack |
| --- | --- | --- | --- |
| LFM2 1.2B Q4_K_M | 15.5768 ± 0.62498 | 15.5833 ± 0.62558 | 15.5653 ± 0.62430 |
| LFM2 700M Q4_K_M | 20.2397 ± 0.86927 | 20.2207 ± 0.86775 | 20.2665 ± 0.86979 |
| Qwen3 8B 128K Q4_K_M | 10.8743 ± 0.46634 | 10.8862 ± 0.46691 | 10.8863 ± 0.46732 |
| Meta Llama 3.1 8B Instruct Q4_K_M | n/a | 8.3882 ± 0.29233 | 8.3918 ± 0.29254 |

Note: the generic run for Llama was skipped, as the Qwen3 run already took quite some time.


To check the generic path, I used the patch below:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
index c6e723d40..a683853c0 100644
--- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
+++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
@@ -512,6 +512,7 @@ void ggml_gemv_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);

+#undef __ARM_NEON
 #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
     constexpr int    col_groups = ncols_interleaved / 4; // 0123 and 4567
     const uint8x16_t m4b        = vdupq_n_u8(0x0f);
@@ -628,6 +629,7 @@ void ggml_gemv_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     }  // for x
     return;
 #endif  // #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
+#define __ARM_NEON
     ggml_gemv_q4_K_8x4_q8_K_generic(n, s, bs, vx, vy, nr, nc);
 }

@@ -2217,6 +2219,7 @@ void ggml_gemm_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);

+#undef __ARM_NEON
 #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
     constexpr int    q8_k_blocklen = 4;
     constexpr int    acc_size  = 2 * 4;  // 2 row pairs × 4 col pairs
@@ -2400,6 +2403,7 @@ void ggml_gemm_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     }  // for y
     return;
 #endif  // defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
+#define __ARM_NEON
     ggml_gemm_q4_K_8x4_q8_K_generic(n, s, bs, vx, vy, nr, nc);
 }
```
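
For readers unfamiliar with the intrinsics behind the `__ARM_FEATURE_DOTPROD` guards, here is a standalone, compilable illustration of the nibble-unpack + SDOT pattern that dotprod-only kernels rely on. This is not the PR's actual GEMM/GEMV: the nibble ordering is assumed for the example and the scales/mins handling is omitted.

```cpp
// Illustration only: dot product of 32 packed 4-bit weights against 32 int8
// activations using the dotprod extension. Build on arm64 with e.g.
// -march=armv8.2-a+dotprod (or -mcpu=cortex-a76+dotprod).
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

// q4: 16 bytes holding 32 nibbles. Layout assumed here: low nibbles are
// weights 0..15, high nibbles are weights 16..31 (illustrative, not
// necessarily the exact Q4_K ordering).
// q8: 32 signed 8-bit activations.
static int32_t dot_q4_q8_32(const uint8_t * q4, const int8_t * q8) {
    const uint8x16_t m4b  = vdupq_n_u8(0x0f);
    const uint8x16_t bits = vld1q_u8(q4);

    // Unpack the nibbles into two vectors of values 0..15.
    const int8x16_t lo = vreinterpretq_s8_u8(vandq_u8(bits, m4b));
    const int8x16_t hi = vreinterpretq_s8_u8(vshrq_n_u8(bits, 4));

    // vdotq_s32 (SDOT) accumulates 4-way int8 dot products per 32-bit lane.
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, lo, vld1q_s8(q8));
    acc = vdotq_s32(acc, hi, vld1q_s8(q8 + 16));
    return vaddvq_s32(acc);  // horizontal sum of the 4 lanes
}

int main() {
    uint8_t q4[16];
    int8_t  q8[32];
    for (int i = 0; i < 16; ++i) q4[i] = (uint8_t) ((i & 0x0f) | ((15 - i) << 4));
    for (int i = 0; i < 32; ++i) q8[i] = (int8_t) (i - 16);
    // In a real kernel the result would still be multiplied by the Q4_K/Q8_K scales.
    printf("dot = %d\n", (int) dot_q4_q8_32(q4, q8));
    return 0;
}
```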

@Alcpz requested a review from ggerganov as a code owner November 25, 2025 13:14
@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) Nov 25, 2025
@Alcpz (Contributor, Author) commented Nov 25, 2025

All CI failures are due to

FileNotFoundError: File not found: ../models-mnt/qwen3/0.6B/tokenizer.model

in convert_hf_to_gguf.py, so they are not related to this PR.

@ggerganov (Member)

The CI failures were due to #17453 (comment). Should be fixed now - restarted the workflows.

The perplexity tests are good for verifying the GEMM kernels are correct. The GEMV kernels are not exercised during perplexity, so make sure you don't see any issues during token generation.

@Alcpz (Contributor, Author) commented Nov 26, 2025

> The CI failures were due to #17453 (comment). Should be fixed now - restarted the workflows.
>
> The perplexity tests are good for verifying the GEMM kernels are correct. The GEMV kernels are not exercised during perplexity, so make sure you don't see any issues during token generation.

I'm confident about the GEMV outputs; I ran a couple of tests regarding this:

  • I checked a lot of mat_muls directly, with and without a non-multiple-of-4 number of blocks, so GEMV was also triggered for the final rows of the prefill.
  • I compared the tensor outputs (vs. non-repack). The error is below the NMSE threshold defined in llama-bench (relative error < 5e-4); see the sketch after this list.
  • I checked the outputs of llama-server and llama-cli (though I reckon this is not a really good metric, it has helped me catch glaring errors in the past).
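
A minimal sketch of the kind of comparison meant above (a hypothetical helper, not code from llama.cpp or this PR):

```cpp
// Hypothetical comparison helper: normalized mean squared error between a
// reference output and the repacked-kernel output.
#include <cstddef>

static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err += d * d;
        den += (double) ref[i] * (double) ref[i];
    }
    return den > 0.0 ? err / den : err;  // the check passes if this stays below ~5e-4
}
```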

I am not sure if there is a better mechanism to test token generation output.

The testing above did help me detect an issue with chunking, though it's not related to this PR: it affects repack in general and was most likely introduced inadvertently when chunking was extended to 3D tensors. See #17526

@ggerganov merged commit cd8370b into ggml-org:master Nov 27, 2025
63 of 65 checks passed
@Alcpz deleted the Alcpz/arm_q4_k_repack_dotprod branch November 27, 2025 12:04
am17an pushed a commit to am17an/llama.cpp that referenced this pull request Nov 27, 2025
…only) (ggml-org#17494)

* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>