ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) #17494
Conversation
All CI failures are due to an unrelated, pre-existing issue.
The CI failures were due to #17453 (comment). Should be fixed now; I restarted the workflows. The perplexity tests are good for verifying that the GEMM kernels are correct. The GEMV kernels are not exercised during perplexity, so make sure you don't see any issues during token generation.
I'm confident about the GEMV outputs; I did a couple of tests regarding this:
I am not sure if there is a better mechanism to test token generation output. The testing above did help me detect an issue with chunking, but it's not related to this PR: it affects repack in general and was most likely introduced inadvertently when extending chunking to 3D tensors. See #17526
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (ggml-org#17494)

* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
* Undo acc rename
* Q4_K arm dotprod gemm
* Fix: q4_qs reinterpret from uint to int
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR improves the q4_k_q8_k GEMM and GEMV on arm64 using only dotprod (e.g. rpi5).
It introduces Q8_Kx4 for convenience with the dotprod intrinsics. Q4_K remains the same, as the scales already fit nicely in memory this way.
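A minimal sketch (not the PR's kernel) of the ARM dotprod intrinsic that motivates the Q8_Kx4 interleave: `vdotq_s32` multiplies 16 pairs of int8 values and accumulates them into 4 int32 lanes, 4 products per lane, so arranging the quantized data in groups of 4 bytes lines up with the instruction. The helper name below is illustrative only.

```c
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_DOTPROD)
// Illustrative helper: dot product of two 32-element int8 vectors using SDOT.
static int32_t dot_i8_32(const int8_t * a, const int8_t * b) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 32; i += 16) {
        const int8x16_t va = vld1q_s8(a + i);
        const int8x16_t vb = vld1q_s8(b + i);
        // each of the 4 lanes of acc accumulates the sum of 4 int8*int8 products
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc); // horizontal add of the 4 lanes
}
#endif
```

In the actual kernels, integer partial sums like these are then combined with the Q4_K scales/mins and the Q8_K deltas to produce the float result.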
Also added a missing DOTPROD guard for the gemv introduced in #16739.
Removed a TODO: tests showed no meaningful difference in performance.
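As a rough sketch of what such a feature guard looks like, with hypothetical function names rather than the actual ggml-cpu symbols: the dotprod kernel is only selected when the compiler advertises the feature, and everything else falls back to the generic 8x4 implementation.

```c
// Hypothetical declarations for illustration; the real code lives in the
// ggml-cpu repack sources and uses different names/signatures.
void gemv_q4_K_8x8_q8_K_dotprod(int n, float * s, const void * vx, const void * vy, int nr, int nc);
void gemv_q4_K_8x8_q8_K_generic(int n, float * s, const void * vx, const void * vy, int nr, int nc);

void gemv_q4_K_8x8_q8_K(int n, float * s, const void * vx, const void * vy, int nr, int nc) {
#if defined(__ARM_FEATURE_DOTPROD)
    // fast path: only taken when the target supports the SDOT/UDOT instructions
    gemv_q4_K_8x8_q8_K_dotprod(n, s, vx, vy, nr, nc);
#else
    // portable fallback keeps non-dotprod builds (and other architectures) working
    gemv_q4_K_8x8_q8_K_generic(n, s, vx, vy, nr, nc);
#endif
}
```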
The PR also had to introduce generic versions of the 8x4 GEMM and GEMVs. The output of the GEMMs and GEMVs was tested, generic vs no repack, using a couple of matrix shapes (for [ne00, ne01, ne10, ne11, ne12]):
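The specific shapes and diffs are not reproduced here. Below is a rough sketch of the kind of element-wise check a generic vs no repack comparison implies; the helper name and tolerance are illustrative, not taken from the PR.

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

// Compare a kernel's output against a reference: report the maximum absolute
// difference and the normalized mean squared error, and check the latter
// against a tolerance.
static int outputs_match(const float * ref, const float * out, size_t n, double tol) {
    double max_abs = 0.0, sum_sq_err = 0.0, sum_sq_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        if (fabs(d) > max_abs) {
            max_abs = fabs(d);
        }
        sum_sq_err += d * d;
        sum_sq_ref += (double) ref[i] * (double) ref[i];
    }
    const double nmse = sum_sq_ref > 0.0 ? sum_sq_err / sum_sq_ref : sum_sq_err;
    printf("max |diff| = %g, NMSE = %g\n", max_abs, nmse);
    return nmse <= tol;
}
```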
Performance

M4 Max (GGML_NATIVE=OFF, -mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve)
new: 690ac9a (7166) (GGML_CPU_REPACK=ON)
ref: 690ac9a (7166) (GGML_CPU_REPACK=OFF)
Rpi5 (GGML_NATIVE=ON)
Perplexity
Note: the generic run for llama was skipped, as it already took quite some time to get the Qwen3 numbers.
To check the generic path I used the patch below:
Click to expand patch