Conversation

@Alcpz (Contributor) commented Nov 25, 2025

This PR improves the q4_k_q8_k GEMM and GEMV on arm64 using only dotprod (e.g. RPi 5).

  • Introduces Q8_Kx4 for convenience with the dotprod intrinsics. Q4_K remains unchanged, as its scales already fit nicely in memory this way. A rough sketch of the idea is included below.
  • Also adds a missing DOTPROD guard for the GEMV introduced in #16739.
  • Removes a TODO: tests showed no meaningful difference in performance.
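
For context, here is a rough sketch of the Q8_K layout and of what a 4-way interleaved variant could look like. The `_example` types are illustrative only (assuming QK_K = 256); they are not the exact structs or interleaving added by this PR.

```cpp
// Illustrative only: the _example structs sketch the idea of packing four Q8_K
// blocks together for dotprod-friendly access; the PR's actual layout may differ.
#include <cstdint>

constexpr int QK_K_EXAMPLE = 256;

// Standard Q8_K block in ggml: one scale, 256 int8 quants, and per-16-value sums
// (the sums are used to fold the Q4_K mins into the result).
struct block_q8_K_example {
    float   d;                              // delta (scale)
    int8_t  qs[QK_K_EXAMPLE];               // quants
    int16_t bsums[QK_K_EXAMPLE / 16];       // sums of groups of 16 quants
};

// Hypothetical 4-way interleaved variant: the quants of 4 consecutive rows are
// stored side by side so matching groups can be consumed by a single vdotq_s32.
struct block_q8_Kx4_example {
    float   d[4];                           // one delta per interleaved row
    int8_t  qs[4 * QK_K_EXAMPLE];           // quants of the 4 rows, interleaved in small groups
    int16_t bsums[4 * (QK_K_EXAMPLE / 16)]; // per-row group sums
};
```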

The PR also had to introduce generic versions of the 8x4 GEMM and GEMVs. The outputs of the GEMMs and GEMVs were tested, generic vs. no-repack, using a few matrix shapes (given as [ne00, ne01, ne10, ne11, ne12]; see the sketch after this list):

  • Various ne11 with fixed dimensions: [128, 128, 128, ne11, 1]
  • ne11 with different batch sizes: [128, 128, 128, ne11, ne12]
  • Various ne00 (= ne10): [ne00, 128, ne00, 16, 2]
  • Various output rows ne01: [128, ne01, 128, 16, 2]
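
A minimal sketch of this kind of sweep, under stated assumptions: `run_and_compare` is a hypothetical placeholder for running the mat-mul with both paths on identical inputs and comparing outputs, and the loop ranges are illustrative rather than the exact values used.

```cpp
// Hypothetical shape sweep; run_and_compare() stands in for "evaluate the
// mul_mat with and without repack on the same inputs and compare the outputs".
#include <cstdint>

struct mm_shape { int64_t ne00, ne01, ne10, ne11, ne12; };

static void sweep_shapes(void (*run_and_compare)(const mm_shape &)) {
    for (int64_t ne11 = 1; ne11 <= 16; ++ne11) {
        run_and_compare({128, 128, 128, ne11, 1});    // various ne11, fixed dims
        run_and_compare({128, 128, 128, ne11, 2});    // same shape, with a batch dimension
    }
    for (int64_t k = 256; k <= 1024; k += 256) {
        run_and_compare({k, 128, k, 16, 2});          // various ne00 (= ne10)
    }
    for (int64_t ne01 = 8; ne01 <= 256; ne01 *= 2) {
        run_and_compare({128, ne01, 128, 16, 2});     // various output rows ne01
    }
}
```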

Performance

M4 Max (GGML_NATIVE=OFF, -mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve)

| model | threads | test | new t/s | ref t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K Medium | 8 | pp256 | 667.63 | 344.36 | 1.94x |
| lfm2 1.2B Q4_K Medium | 8 | tg128 | 231.36 | 209.05 | 1.11x |
| llama 8B Q4_K Medium | 8 | pp256 | 96.09 | 51.11 | 1.88x |
| llama 8B Q4_K Medium | 8 | tg128 | 43.52 | 36.72 | 1.18x |
| qwen3 8B Q4_K Medium | 8 | pp256 | 90.38 | 50.35 | 1.79x |
| qwen3 8B Q4_K Medium | 8 | tg128 | 40.77 | 35.98 | 1.13x |

new: 690ac9a (7166) (GGML_CPU_REPACK=ON)
ref: 690ac9a (7166) (GGML_CPU_REPACK=OFF)

RPi 5 (GGML_NATIVE=ON)

| model | threads | test | new t/s | ref t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 350M Q4_K Medium | 4 | pp256 | 223.99 | 148.55 | 1.51x |
| lfm2 350M Q4_K Medium | 4 | tg128 | 48.23 | 49.66 | 0.97x |
| lfm2 700M Q4_K Medium | 4 | pp256 | 107.23 | 69.62 | 1.54x |
| lfm2 700M Q4_K Medium | 4 | tg128 | 24.41 | 24.76 | 0.99x |

Perplexity

| model | generic | no-repack | repack |
| --- | --- | --- | --- |
| LFM2 1.2B Q4_K_M | 15.5768 ± 0.62498 | 15.5833 ± 0.62558 | 15.5653 ± 0.62430 |
| LFM2 700M Q4_K_M | 20.2397 ± 0.86927 | 20.2207 ± 0.86775 | 20.2665 ± 0.86979 |
| Qwen3 8B 128K Q4_K_M | 10.8743 ± 0.46634 | 10.8862 ± 0.46691 | 10.8863 ± 0.46732 |
| Meta Llama 3.1 8B Instruct Q4_K_M | n/a | 8.3882 ± 0.29233 | 8.3918 ± 0.29254 |

Note: the generic run for Llama was skipped, as the Qwen3 run already took quite some time.


To check the generic path, I used the patch below:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
index c6e723d40..a683853c0 100644
--- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
+++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
@@ -512,6 +512,7 @@ void ggml_gemv_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);

+#undef __ARM_NEON
 #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
     constexpr int    col_groups = ncols_interleaved / 4; // 0123 and 4567
     const uint8x16_t m4b        = vdupq_n_u8(0x0f);
@@ -628,6 +629,7 @@ void ggml_gemv_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     }  // for x
     return;
 #endif  // #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
+#define __ARM_NEON
     ggml_gemv_q4_K_8x4_q8_K_generic(n, s, bs, vx, vy, nr, nc);
 }

@@ -2217,6 +2219,7 @@ void ggml_gemm_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);

+#undef __ARM_NEON
 #if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
     constexpr int    q8_k_blocklen = 4;
     constexpr int    acc_size  = 2 * 4;  // 2 row pairs × 4 col pairs
@@ -2400,6 +2403,7 @@ void ggml_gemm_q4_K_8x4_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     }  // for y
     return;
 #endif  // defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
+#define __ARM_NEON
     ggml_gemm_q4_K_8x4_q8_K_generic(n, s, bs, vx, vy, nr, nc);
 }
```
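
For readers unfamiliar with the intrinsics behind the `__ARM_FEATURE_DOTPROD` guards, here is a standalone, compilable illustration of the nibble-unpack + SDOT pattern that dotprod-only kernels rely on. This is not the PR's actual GEMM/GEMV: the nibble ordering is assumed for the example and the scales/mins handling is omitted.

```cpp
// Illustration only: dot product of 32 packed 4-bit weights against 32 int8
// activations using the dotprod extension. Build on arm64 with e.g.
// -march=armv8.2-a+dotprod (or -mcpu=cortex-a76+dotprod).
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

// q4: 16 bytes holding 32 nibbles. Layout assumed here: low nibbles are
// weights 0..15, high nibbles are weights 16..31 (illustrative, not
// necessarily the exact Q4_K ordering).
// q8: 32 signed 8-bit activations.
static int32_t dot_q4_q8_32(const uint8_t * q4, const int8_t * q8) {
    const uint8x16_t m4b  = vdupq_n_u8(0x0f);
    const uint8x16_t bits = vld1q_u8(q4);

    // Unpack the nibbles into two vectors of values 0..15.
    const int8x16_t lo = vreinterpretq_s8_u8(vandq_u8(bits, m4b));
    const int8x16_t hi = vreinterpretq_s8_u8(vshrq_n_u8(bits, 4));

    // vdotq_s32 (SDOT) accumulates 4-way int8 dot products per 32-bit lane.
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, lo, vld1q_s8(q8));
    acc = vdotq_s32(acc, hi, vld1q_s8(q8 + 16));
    return vaddvq_s32(acc);  // horizontal sum of the 4 lanes
}

int main() {
    uint8_t q4[16];
    int8_t  q8[32];
    for (int i = 0; i < 16; ++i) q4[i] = (uint8_t) ((i & 0x0f) | ((15 - i) << 4));
    for (int i = 0; i < 32; ++i) q8[i] = (int8_t) (i - 16);
    // In a real kernel the result would still be multiplied by the Q4_K/Q8_K scales.
    printf("dot = %d\n", (int) dot_q4_q8_32(q4, q8));
    return 0;
}
```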

@Alcpz requested a review from ggerganov as a code owner November 25, 2025 13:14
@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) Nov 25, 2025
@Alcpz (Contributor, Author) commented Nov 25, 2025

All CI failures are due to

FileNotFoundError: File not found: ../models-mnt/qwen3/0.6B/tokenizer.model

in convert_hf_to_gguf.py, so they are not related to this PR.

@ggerganov (Member)

The CI failures were due to #17453 (comment). Should be fixed now - restarted the workflows.

The perplexity tests are good for verifying the GEMM kernels are correct. The GEMV kernels are not exercised during perplexity, so make sure you don't see any issues during token generation.

@Alcpz (Contributor, Author) commented Nov 26, 2025

> The CI failures were due to #17453 (comment). Should be fixed now - restarted the workflows.
>
> The perplexity tests are good for verifying the GEMM kernels are correct. The GEMV kernels are not exercised during perplexity, so make sure you don't see any issues during token generation.

I'm confident about the GEMV outputs; I ran a couple of tests regarding this:

  • I checked a lot of mat_muls directly, with and without a non-multiple-of-4 number of blocks, so GEMV was also triggered for the final rows of the prefill.
  • I compared the tensor outputs (vs. non-repack). The error is below the NMSE threshold defined in llama-bench (relative error < 5e-4); see the sketch after this list.
  • I checked the outputs of llama-server and llama-cli (though I reckon this is not a really good metric, it has helped me catch glaring errors in the past).
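
A minimal sketch of the kind of comparison meant above (a hypothetical helper, not code from llama.cpp or this PR):

```cpp
// Hypothetical comparison helper: normalized mean squared error between a
// reference output and the repacked-kernel output.
#include <cstddef>

static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err += d * d;
        den += (double) ref[i] * (double) ref[i];
    }
    return den > 0.0 ? err / den : err;  // the check passes if this stays below ~5e-4
}
```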

I am not sure if there is a better mechanism to test token generation output.

The testing above did help me detect an issue with chunking, though it's not related to this PR: it affects repack in general and was most likely introduced inadvertently when chunking was extended to 3D tensors. See #17526

@ggerganov merged commit cd8370b into ggml-org:master Nov 27, 2025
63 of 65 checks passed
@Alcpz deleted the Alcpz/arm_q4_k_repack_dotprod branch November 27, 2025 12:04
am17an pushed a commit to am17an/llama.cpp that referenced this pull request Nov 27, 2025
…only) (ggml-org#17494)

* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>