
metal: Q3_K speedup #2995

Merged: ggerganov merged 4 commits into `master` from `ik/metal_q3k` on Sep 8, 2023

Conversation

@ikawrakow
Contributor

When running on a modern GPU with CUDA, Q3_K token generation (TG) is faster than Q4_0 by a comfortable margin. On Metal, it is much slower than Q4_0. Assuming an optimal implementation, TG is memory bound, so Q3_K should always be faster given the smaller model size. I have tried various approaches, including the exact same matrix x vector multiplication kernel as on CUDA, to no avail.

This PR brings a ~15% speedup for Q3_K TG on Metal. It is still ~20% slower than Q4_0, but it is at least some progress.

Prompt processing is not affected by the PR, so the table gives performance for TG only, measured on a 30-core M2 Max:

| model | backend | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 32 | 46.14 ± 0.01 | 52.77 ± 0.03 | 1.144 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 64 | 45.15 ± 0.55 | 52.60 ± 0.09 | 1.165 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 128 | 44.52 ± 0.05 | 52.26 ± 0.03 | 1.174 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 256 | 44.19 ± 0.09 | 51.31 ± 0.24 | 1.161 |

There is also a mini speedup for Q5_K (~2%).
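The memory-bound argument above can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, with assumed numbers: the ~400 GB/s usable bandwidth for an M2 Max and the ~3.56 GiB 7B Q4_0 size are my assumptions; the 2.75 GiB Q3_K_S size appears in the benchmark tables in this thread.

```python
# Sketch: if TG re-reads every weight once per token, throughput is capped
# by bandwidth / model size. All device numbers are illustrative assumptions.
GB = 1024**3

bandwidth = 400 * GB          # assumed usable memory bandwidth, bytes/s
size_q4_0 = 3.56 * GB         # assumed 7B Q4_0 model size
size_q3_k_s = 2.75 * GB       # 7B Q3_K_S model size

def tg_ceiling(model_bytes, bytes_per_s):
    """Upper bound on tokens/s when every weight is read once per token."""
    return bytes_per_s / model_bytes

print(f"Q4_0   ceiling: {tg_ceiling(size_q4_0, bandwidth):6.1f} t/s")
print(f"Q3_K_S ceiling: {tg_ceiling(size_q3_k_s, bandwidth):6.1f} t/s")
```

In a purely memory-bound regime the smaller Q3_K_S file gives the higher ceiling, so the observed Q3_K < Q4_0 ordering on Metal points at a compute or implementation bottleneck rather than bandwidth.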

Iwan Kawrakow added 4 commits on September 3, 2023 at 21:51:

- Combined with the previous commit, we are now +9.6% for TG.
- PP is not affected as this happens via the matrix multiplication templates.
- We are now 13% faster than master.
@ikawrakow ikawrakow requested a review from lshzh-ww September 3, 2023 19:12
@ikawrakow ikawrakow changed the title to metal: Q3_K speedup Sep 3, 2023
@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

Unfortunately, on the M1 Max, this kernel is slower than master. I suspect that this is still due to each thread directly reading from device memory. It is not a significant issue for the M2 series, but it affects M1 series performance because the M1 series has a weaker memory controller. Would you mind trying to first copy blocks in whole to threadgroup memory and then letting each thread read from there? I would be happy to test it for you.

| model | backend | ngl | threads | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama-30b q3_K_S | Metal | 1 | 4 | tg 128 | 11.52 ± 0.01 | 10.49 ± 0.00 | 0.911 |

To address the issue of Q3_K being slower than Q4_0, I believe that Q3_K is actually compute-bound on M-series chips because they have lower FLOPS per memory bandwidth when compared to Nvidia GPUs.
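The FLOPS-per-bandwidth point can be made concrete with a roofline-style comparison. The device figures below are ballpark numbers I am assuming purely for illustration, not measurements:

```python
# Roofline balance point: the FLOP/byte ratio above which a kernel becomes
# compute-bound rather than memory-bound. Device numbers are assumptions.
def balance_point(tflops, gb_per_s):
    """Arithmetic intensity (FLOP/byte) where compute and bandwidth roofs meet."""
    return (tflops * 1e12) / (gb_per_s * 1e9)

m1_max = balance_point(10.4, 400)     # assumed ~10.4 TFLOPS fp32, ~400 GB/s
rtx_4080 = balance_point(49.0, 717)   # assumed ~49 TFLOPS fp32, ~717 GB/s

print(f"M1 Max:   compute-bound above ~{m1_max:.0f} FLOP/byte")
print(f"RTX 4080: compute-bound above ~{rtx_4080:.0f} FLOP/byte")
```

The lower balance point means a kernel that spends many operations per byte on dequantization, as with Q3_K's bit fiddling, hits the compute roof much sooner on an M-series GPU than on an RTX 4080.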

@ggerganov
Member

ggerganov commented Sep 4, 2023

• M2 Ultra

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 32 | 568.81 ± 7.26 | 568.92 ± 6.53 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 64 | 763.14 ± 1.27 | 762.14 ± 1.39 | 0.999 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 128 | 888.56 ± 1.71 | 888.15 ± 1.96 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 256 | 907.66 ± 1.04 | 907.54 ± 0.72 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 512 | 876.53 ± 0.60 | 876.46 ± 0.17 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 16 | 75.91 ± 0.08 | 83.43 ± 0.13 | 1.099 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 32 | 75.75 ± 0.03 | 83.31 ± 0.13 | 1.100 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 64 | 75.46 ± 0.04 | 83.11 ± 0.08 | 1.101 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 128 | 74.72 ± 0.07 | 82.15 ± 0.13 | 1.099 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 256 | 73.88 ± 0.06 | 81.20 ± 0.07 | 1.099 |

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 16 | 22.08 ± 0.01 | 24.62 ± 0.01 | 1.115 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 32 | 22.09 ± 0.01 | 24.62 ± 0.02 | 1.115 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 64 | 22.02 ± 0.02 | 24.57 ± 0.03 | 1.116 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 128 | 21.78 ± 0.02 | 24.38 ± 0.01 | 1.119 |

• M1 Pro

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| llama2 7B Q3_K_S | 2.75 GiB | tg 16 | 26.37 ± 0.04 | 28.00 ± 0.02 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 32 | 26.33 ± 0.01 | 27.96 ± 0.01 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 64 | 26.26 ± 0.01 | 27.87 ± 0.01 | 1.061 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 128 | 26.12 ± 0.01 | 27.74 ± 0.01 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 256 | 25.94 ± 0.01 | 27.54 ± 0.00 | 1.062 |

The M1 Pro also benefits from this change, at least for the small 7B model.

@lshzh-ww

Do you at least see a benefit for 7B Q3_K_S?

@ikawrakow
Contributor Author

I don't see a big difference in the performance improvement between 7B and 30B. The gain is ~12-14% for 30B vs ~14-17% for 7B:

| model | backend | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 32 | 11.38 ± 0.02 | 12.93 ± 0.03 | 1.136 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 64 | 11.33 ± 0.00 | 12.85 ± 0.01 | 1.134 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 128 | 11.20 ± 0.10 | 12.81 ± 0.04 | 1.144 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 256 | 11.22 ± 0.01 | 12.57 ± 0.17 | 1.120 |

@ikawrakow
Contributor Author

@lshzh-ww

> To address the issue of Q3_K being slower than Q4_0, I believe that Q3_K is actually compute-bound on M-series chips because they have lower FLOPS per memory bandwidth when compared to Nvidia GPUs.

If that were true, we wouldn't see the more than ~8X difference in t/s between TG and PP that we observe on the M series, which applies to all quantization types. On a modern GPU the difference is even more pronounced because more compute is available (e.g., on my RTX-4080 TG-128 is ~130 t/s while PP-512 is ~3400 t/s, a ~26X difference). On M2, Q3_K PP performance is ~85% of Q4_0 PP performance, which is easily understandable considering how much more bit fiddling per quant is necessary for Q3_K compared to Q4_0. On the M series, if implemented optimally, TG ought to become memory bound, with perhaps some small influence from the available compute performance.
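As a sketch, the ratio argument can be checked with the RTX-4080 figures quoted in this comment (treating them as approximate):

```python
# If TG were compute-bound the same way PP is, batching could not buy a
# ~26x per-token gain: the FLOPs per weight are the same either way.
# The gap exists because PP amortizes each weight load over many tokens,
# while TG must re-read the whole model for every generated token.
tg = 130.0    # t/s, TG-128 on an RTX-4080 (figure from the comment above)
pp = 3400.0   # t/s, PP-512 on an RTX-4080 (figure from the comment above)

ratio = pp / tg
print(f"PP/TG ratio: {ratio:.1f}x")
```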

@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

> If that were true, we wouldn't see the more than ~8X difference in t/s between TG and PP that we observe on the M series, which applies to all quantization types. On a modern GPU the difference is even more pronounced because more compute is available (e.g., on my RTX-4080 TG-128 is ~130 t/s while PP-512 is ~3400 t/s, a ~26X difference).

I am sorry, but we shouldn't compare TG and PP performance to estimate the compute pressure of dequantization. In the matrix-matrix multiplication kernels, we split the dst matrix into tiles of 64x32. By doing this, we reduce the memory load pressure and the dequantization compute to 1/32: each weight in src0 is loaded once, dequantized once, and reused 32 times, whereas in the matrix-vector multiplication kernels each weight in src0 is loaded once, dequantized once, and used once.
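A quick sketch of the reuse arithmetic behind the tiling described above (the 64x32 tile shape comes from the comment; the cost model is a deliberate simplification):

```python
# With dst split into 64x32 tiles, each src0 weight that is loaded and
# dequantized is reused for 32 output columns, so the per-multiply
# dequantization/load cost drops to 1/32 of the mat-vec case.
def dequant_cost_per_mac(reuse_factor):
    """Relative load+dequant work per multiply-accumulate (mat-vec = 1.0)."""
    return 1.0 / reuse_factor

matvec = dequant_cost_per_mac(1)    # TG path: each weight used once
matmat = dequant_cost_per_mac(32)   # PP path: each weight reused 32 times

print(f"PP dequant work per MAC is {matmat / matvec:.3f}x of TG's")
```

This is why PP can be compute-heavy yet fast while TG, which cannot amortize dequantization across columns, exposes the full per-weight cost.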

If this doesn't convince you, here are the profiling results for the master branch :).
M1 Max 32c GPU, 7B Q3_K, tg 128
[screenshot: profiling results]

@ggerganov
M1 Max 32c GPU, 7B model:

| model | backend | ngl | threads | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| codellama-7b q3_K_S | Metal | 1 | 4 | tg 128 | 45.95 ± 0.01 | 43.38 ± 0.00 | 0.944 |

By the way, are your results from the M1 Pro with 16 GPU cores or 14 GPU cores? Since Q3_K is more compute-bound, it may be worth collecting more benchmark results from various GPU core configurations.

@ggerganov
Member

@lshzh-ww

It has 16 GPU cores and 32 GB of memory, running macOS 13.4.1.


@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

Thank you for checking this!

@ikawrakow
Contributor Author

@ggerganov What do we do with this? It is faster on M2 Max, M2 Ultra, and M1 Pro, but somehow slower on M1 Max, based on a single datapoint and with no further feedback.

@ggerganov
Member

@lshzh-ww

I don't understand the provided screenshot about M1 Max performance on master: what is the relevant information to look at, and what is its meaning?

I don't see how this PR increases the memory pressure. AFAICT it even reduces it since we now put 32 floats from src1 into local memory, instead of 16.

> Would you mind trying to first copy blocks in whole to threadgroup memory and then letting each thread read from there?

I agree that we should try this strategy; my feeling is also that it would help if done properly. But we can experiment with this starting from master.

@ikawrakow

I'll merge this PR later today as it mostly has a positive effect even if it is not across the entire M-series line-up.
In the meantime, if anyone would like to help by providing more datapoints, that would be appreciated!

@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Sep 8, 2023
@ggerganov ggerganov merged commit ba7ffbb into master Sep 8, 2023
@ikawrakow ikawrakow mentioned this pull request Sep 9, 2023
@ikawrakow ikawrakow deleted the ik/metal_q3k branch September 24, 2023 16:10