
metal : add Q8_0 support #2763

Merged: 3 commits into master from metal-add-q8_0 on Aug 24, 2023

Conversation

@ggerganov (Owner) commented on Aug 24, 2023:

Closes #2508.

Add Q8_0 support for Metal

I haven't tested whether this is the optimal way to implement the mat x vec kernel, so there may be room for optimization in the future.
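For context, Q8_0 stores weights in blocks of 32 signed 8-bit quants with a single per-block scale, so dequantization is a plain scale-and-multiply. A minimal C sketch of the operation the new `dequantize_q8_0` kernel performs (simplified: ggml stores the scale as fp16, and the real kernel runs this per-thread on the GPU):

```c
#include <stdint.h>

#define QK8_0 32

// Q8_0 block: one scale plus 32 signed 8-bit quants.
// ggml stores the scale as fp16; a plain float is used here for simplicity.
typedef struct {
    float  d;           // block scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

// Dequantize one block: y[i] = d * qs[i].
static void dequantize_q8_0(const block_q8_0 * x, float * y) {
    for (int i = 0; i < QK8_0; ++i) {
        y[i] = x->d * (float) x->qs[i];
    }
}
```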

Benchmarks on M2 Ultra:

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA v2 7B mostly F16 | Metal | 1 | pp 512 | 664.05 ± 0.22 |
| LLaMA v2 7B mostly Q4_0 | Metal | 1 | pp 512 | 632.16 ± 0.47 |
| LLaMA v2 7B mostly Q4_1 | Metal | 1 | pp 512 | 634.40 ± 0.40 |
| LLaMA v2 7B mostly Q8_0 | Metal | 1 | pp 512 | 630.26 ± 0.11 |
| LLaMA v2 7B mostly Q2_K | Metal | 1 | pp 512 | 580.58 ± 0.22 |
| LLaMA v2 7B mostly Q3_K - Medium | Metal | 1 | pp 512 | 580.74 ± 0.26 |
| LLaMA v2 7B mostly Q4_K - Medium | Metal | 1 | pp 512 | 587.62 ± 0.19 |
| LLaMA v2 7B mostly Q5_K - Medium | Metal | 1 | pp 512 | 560.96 ± 0.15 |
| LLaMA v2 7B mostly Q6_K | Metal | 1 | pp 512 | 561.99 ± 0.15 |
| LLaMA v2 7B mostly F16 | Metal | 1 | tg 128 | 29.38 ± 0.11 |
| LLaMA v2 7B mostly Q4_0 | Metal | 1 | tg 128 | 86.17 ± 0.05 |
| LLaMA v2 7B mostly Q4_1 | Metal | 1 | tg 128 | 81.30 ± 0.08 |
| LLaMA v2 7B mostly Q8_0 | Metal | 1 | tg 128 | 61.16 ± 0.05 |
| LLaMA v2 7B mostly Q2_K | Metal | 1 | tg 128 | 74.89 ± 0.05 |
| LLaMA v2 7B mostly Q3_K - Medium | Metal | 1 | tg 128 | 76.22 ± 0.06 |
| LLaMA v2 7B mostly Q4_K - Medium | Metal | 1 | tg 128 | 79.64 ± 0.08 |
| LLaMA v2 7B mostly Q5_K - Medium | Metal | 1 | tg 128 | 68.91 ± 0.04 |
| LLaMA v2 7B mostly Q6_K | Metal | 1 | tg 128 | 68.46 ± 0.07 |

build: 1202e06 (1049)

@ggerganov marked this pull request as ready for review on August 24, 2023 at 12:51.
@ggerganov merged commit d67777c into master on Aug 24, 2023 (3 checks passed).
@ggerganov deleted the metal-add-q8_0 branch on August 24, 2023 at 13:20.
@ggerganov (Owner, Author) commented:
@lshzh-ww I'll probably try to also add Q5_0 and Q5_1 later today - just don't want to overlap in case you have started doing it.

@lshzh-ww (Collaborator) commented:
I do have a template for all matrix-vector multiplication kernels. However, it requires careful tuning of the dequantize_q_n functions to achieve maximum performance for both matrix-vector and matrix-matrix multiplication. So far I have only finished reimplementing dequantize_q2_k and dequantize_q3_k; for those, the new template achieves better, or at least similar, performance compared to the master branch. I may submit the PR this weekend or early next week.

So if you feel it's urgent, please go ahead, though it may not be worth spending too much time optimizing the kernel. Alternatively, we can wait a few more days and add Q5_0 and Q5_1 support for Metal then.
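To illustrate the template idea described above: one shared matrix-vector loop is parameterized by a per-format dequantize function, so only the dequantizers need per-format tuning. A plain-C sketch of the shared structure (the names `mat_vec_q` and `dequantize_fn` are hypothetical, and the real Metal template distributes rows across threadgroups rather than looping serially):

```c
#include <stddef.h>

// A per-format dequantizer expands one quantized block into block_size floats.
typedef void (*dequantize_fn)(const void * block, float * out);

// Generic mat x vec: y = A * x, with A stored row-major as quantized blocks.
// Serial loop for clarity only; the GPU version parallelizes over rows.
static void mat_vec_q(const void * A, const float * x, float * y,
                      size_t rows, size_t cols,
                      size_t block_size, size_t block_bytes,
                      dequantize_fn dequantize) {
    float tmp[256]; // scratch for one dequantized block (assumes block_size <= 256)
    const unsigned char * a = (const unsigned char *) A;
    for (size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (size_t c = 0; c < cols; c += block_size) {
            dequantize(a, tmp);                 // format-specific step
            for (size_t i = 0; i < block_size; ++i) {
                sum += tmp[i] * x[c + i];       // shared dot-product step
            }
            a += block_bytes;
        }
        y[r] = sum;
    }
}
```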

@ggerganov (Owner, Author) commented:
No rush - will wait for the new kernels then. Thanks!

akawrykow pushed a commit to akawrykow/llama.cpp referencing this pull request on Aug 29, 2023:

* metal : add dequantize_q8_0 kernel
* metal : add mul_mat_q8_0_f32 kernel
* metal : add Q8_0 mul_mm kernel
@sukualam commented on Sep 4, 2023:

Is this for M1/M2 only, not AMD GPUs? I can't run with GPU acceleration using my AMD card on macOS (it supports MPS, by the way).

Sam2much96 pushed a commit to Sam2much96/llama.cpp referencing this pull request on Sep 11, 2023 (the same three commits as above).
Successfully merging this pull request closed the following issue: "Running with Metal for llama-2-13b-chat.ggmlv3.q8_0.bin with -ngl throw unimplemented error" (#2508).