
metal: Q3_K speedup #2995

Merged: ggerganov merged 4 commits into `master` from `ik/metal_q3k` on Sep 8, 2023

Conversation

@ikawrakow
Contributor

When running on a modern GPU with CUDA, Q3_K token generation (TG) is faster than Q4_0 by a comfortable margin. On Metal, it is much slower than Q4_0. Assuming an optimal implementation, TG is memory bound, so Q3_K should always be faster given the smaller model size. I have tried various approaches, including the exact same matrix x vector multiplication kernel as on CUDA, to no avail.

This PR brings a ~15% speedup for Q3_K TG on Metal. It is still ~20% slower than Q4_0, but it is at least some progress.

Prompt processing is not affected by the PR, so the table gives performance for TG only, measured on a 30-core M2 Max:

| model | backend | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 32 | 46.14 ± 0.01 | 52.77 ± 0.03 | 1.144 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 64 | 45.15 ± 0.55 | 52.60 ± 0.09 | 1.165 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 128 | 44.52 ± 0.05 | 52.26 ± 0.03 | 1.174 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 256 | 44.19 ± 0.09 | 51.31 ± 0.24 | 1.161 |

There is also a mini speedup for Q5_K (~2%).
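The memory-bound argument above can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, with assumed numbers: the ~400 GB/s usable bandwidth for an M2 Max and the ~3.56 GiB 7B Q4_0 size are my assumptions; the 2.75 GiB Q3_K_S size appears in the benchmark tables in this thread.

```python
# Sketch: if TG re-reads every weight once per token, throughput is capped
# by bandwidth / model size. All device numbers are illustrative assumptions.
GB = 1024**3

bandwidth = 400 * GB          # assumed usable memory bandwidth, bytes/s
size_q4_0 = 3.56 * GB         # assumed 7B Q4_0 model size
size_q3_k_s = 2.75 * GB       # 7B Q3_K_S model size

def tg_ceiling(model_bytes, bytes_per_s):
    """Upper bound on tokens/s when every weight is read once per token."""
    return bytes_per_s / model_bytes

print(f"Q4_0   ceiling: {tg_ceiling(size_q4_0, bandwidth):6.1f} t/s")
print(f"Q3_K_S ceiling: {tg_ceiling(size_q3_k_s, bandwidth):6.1f} t/s")
```

In a purely memory-bound regime the smaller Q3_K_S file gives the higher ceiling, so the observed Q3_K < Q4_0 ordering on Metal points at a compute or implementation bottleneck rather than bandwidth.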

Iwan Kawrakow added 4 commits on September 3, 2023 at 21:51:

- Combined with the previous commit, we are now +9.6% for TG.
- PP is not affected as this happens via the matrix multiplication templates.
- We are now 13% faster than master.
@ikawrakow ikawrakow requested a review from lshzh-ww September 3, 2023 19:12
@ikawrakow ikawrakow changed the title to metal: Q3_K speedup Sep 3, 2023
@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

Unfortunately, on the M1 Max, this kernel is slower than master. I suspect that this is still due to each thread directly reading from device memory. It is not a significant issue for the M2 series, but it affects M1 series performance because the M1 series has a weaker memory controller. Would you mind trying to first copy blocks in whole to threadgroup memory and then letting each thread read from there? I would be happy to test it for you.

| model | backend | ngl | threads | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama-30b q3_K_S | Metal | 1 | 4 | tg 128 | 11.52 ± 0.01 | 10.49 ± 0.00 | 0.911 |

To address the issue of Q3_K being slower than Q4_0, I believe that Q3_K is actually compute-bound on M-series chips because they have lower FLOPS per memory bandwidth when compared to Nvidia GPUs.
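The FLOPS-per-bandwidth point can be made concrete with a roofline-style comparison. The device figures below are ballpark numbers I am assuming purely for illustration, not measurements:

```python
# Roofline balance point: the FLOP/byte ratio above which a kernel becomes
# compute-bound rather than memory-bound. Device numbers are assumptions.
def balance_point(tflops, gb_per_s):
    """Arithmetic intensity (FLOP/byte) where compute and bandwidth roofs meet."""
    return (tflops * 1e12) / (gb_per_s * 1e9)

m1_max = balance_point(10.4, 400)     # assumed ~10.4 TFLOPS fp32, ~400 GB/s
rtx_4080 = balance_point(49.0, 717)   # assumed ~49 TFLOPS fp32, ~717 GB/s

print(f"M1 Max:   compute-bound above ~{m1_max:.0f} FLOP/byte")
print(f"RTX 4080: compute-bound above ~{rtx_4080:.0f} FLOP/byte")
```

The lower balance point means a kernel that spends many operations per byte on dequantization, as with Q3_K's bit fiddling, hits the compute roof much sooner on an M-series GPU than on an RTX 4080.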

@ggerganov
Member

ggerganov commented Sep 4, 2023

• M2 Ultra

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 32 | 568.81 ± 7.26 | 568.92 ± 6.53 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 64 | 763.14 ± 1.27 | 762.14 ± 1.39 | 0.999 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 128 | 888.56 ± 1.71 | 888.15 ± 1.96 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 256 | 907.66 ± 1.04 | 907.54 ± 0.72 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | pp 512 | 876.53 ± 0.60 | 876.46 ± 0.17 | 1.000 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 16 | 75.91 ± 0.08 | 83.43 ± 0.13 | 1.099 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 32 | 75.75 ± 0.03 | 83.31 ± 0.13 | 1.100 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 64 | 75.46 ± 0.04 | 83.11 ± 0.08 | 1.101 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 128 | 74.72 ± 0.07 | 82.15 ± 0.13 | 1.099 |
| LLaMA 7B Q3_K_S | 2.75 GiB | tg 256 | 73.88 ± 0.06 | 81.20 ± 0.07 | 1.099 |

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 16 | 22.08 ± 0.01 | 24.62 ± 0.01 | 1.115 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 32 | 22.09 ± 0.01 | 24.62 ± 0.02 | 1.115 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 64 | 22.02 ± 0.02 | 24.57 ± 0.03 | 1.116 |
| LLaMA 30B Q3_K_S | 13.10 GiB | tg 128 | 21.78 ± 0.02 | 24.38 ± 0.01 | 1.119 |

• M1 Pro

| model | size | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- |
| llama2 7B Q3_K_S | 2.75 GiB | tg 16 | 26.37 ± 0.04 | 28.00 ± 0.02 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 32 | 26.33 ± 0.01 | 27.96 ± 0.01 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 64 | 26.26 ± 0.01 | 27.87 ± 0.01 | 1.061 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 128 | 26.12 ± 0.01 | 27.74 ± 0.01 | 1.062 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 256 | 25.94 ± 0.01 | 27.54 ± 0.00 | 1.062 |

The M1 Pro also benefits from this change, at least for the small 7B model.

@lshzh-ww

Do you at least see a benefit for 7B Q3_K_S?

@ikawrakow
Contributor Author

I don't see a big difference in the performance improvement between 7B and 30B. The gain is ~12-14% for 30B vs ~14-17% for 7B:

| model | backend | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 32 | 11.38 ± 0.02 | 12.93 ± 0.03 | 1.136 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 64 | 11.33 ± 0.00 | 12.85 ± 0.01 | 1.134 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 128 | 11.20 ± 0.10 | 12.81 ± 0.04 | 1.144 |
| LLaMA 30B mostly Q3_K - Small | Metal | tg 256 | 11.22 ± 0.01 | 12.57 ± 0.17 | 1.120 |

@ikawrakow
Contributor Author

@lshzh-ww

> To address the issue of Q3_K being slower than Q4_0, I believe that Q3_K is actually compute-bound on M-series chips because they have lower FLOPS per memory bandwidth when compared to Nvidia GPUs.

If that were true, we wouldn't see the more than ~8X difference in t/s between TG and PP that we observe on the M series, which applies to all quantization types. On a modern GPU the difference is even more pronounced because more compute is available (e.g., on my RTX-4080 TG-128 is ~130 t/s while PP-512 is ~3400 t/s, a ~26X difference). On M2, Q3_K PP performance is ~85% of Q4_0 PP performance, which is easily understandable considering how much more bit fiddling per quant is necessary for Q3_K compared to Q4_0. On the M series, if implemented optimally, TG ought to become memory bound, with perhaps some small influence from the available compute performance.
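As a sketch, the ratio argument can be checked with the RTX-4080 figures quoted in this comment (treating them as approximate):

```python
# If TG were compute-bound the same way PP is, batching could not buy a
# ~26x per-token gain: the FLOPs per weight are the same either way.
# The gap exists because PP amortizes each weight load over many tokens,
# while TG must re-read the whole model for every generated token.
tg = 130.0    # t/s, TG-128 on an RTX-4080 (figure from the comment above)
pp = 3400.0   # t/s, PP-512 on an RTX-4080 (figure from the comment above)

ratio = pp / tg
print(f"PP/TG ratio: {ratio:.1f}x")
```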

@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

> If that were true, we wouldn't see the more than ~8X difference in t/s between TG and PP that we observe on the M series, which applies to all quantization types. On a modern GPU the difference is even more pronounced because more compute is available (e.g., on my RTX-4080 TG-128 is ~130 t/s while PP-512 is ~3400 t/s, a ~26X difference).

I am sorry, but we shouldn't compare TG and PP performance to estimate the compute pressure of dequantization. In the matrix-matrix multiplication kernels, we split the dst matrix into tiles of 64x32. By doing this, we reduce the memory load pressure and the dequantization compute to 1/32: each weight in src0 is loaded once, dequantized once, and reused 32 times, whereas in the matrix-vector multiplication kernels each weight in src0 is loaded once, dequantized once, and used once.
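A quick sketch of the reuse arithmetic behind the tiling described above (the 64x32 tile shape comes from the comment; the cost model is a deliberate simplification):

```python
# With dst split into 64x32 tiles, each src0 weight that is loaded and
# dequantized is reused for 32 output columns, so the per-multiply
# dequantization/load cost drops to 1/32 of the mat-vec case.
def dequant_cost_per_mac(reuse_factor):
    """Relative load+dequant work per multiply-accumulate (mat-vec = 1.0)."""
    return 1.0 / reuse_factor

matvec = dequant_cost_per_mac(1)    # TG path: each weight used once
matmat = dequant_cost_per_mac(32)   # PP path: each weight reused 32 times

print(f"PP dequant work per MAC is {matmat / matvec:.3f}x of TG's")
```

This is why PP can be compute-heavy yet fast while TG, which cannot amortize dequantization across columns, exposes the full per-weight cost.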

If this doesn't convince you, here are the profiling results for the master branch :).
M1 Max 32c GPU, 7B Q3_K, tg 128
[screenshot: profiling results]

@ggerganov
M1 Max 32c GPU, 7B model:

| model | backend | ngl | threads | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| codellama-7b q3_K_S | Metal | 1 | 4 | tg 128 | 45.95 ± 0.01 | 43.38 ± 0.00 | 0.944 |

By the way, are your results from the M1 Pro with 16 GPU cores or 14 GPU cores? Since Q3_K is more compute-bound, it may be worth collecting more benchmark results from various GPU core configurations.

@ggerganov
Member

@lshzh-ww

It has 16 GPU cores and 32 GB of memory, running macOS 13.4.1.


@lshzh-ww
Contributor

lshzh-ww commented Sep 4, 2023

Thank you for checking this!

@ikawrakow
Contributor Author

@ggerganov What do we do with this? It is faster on M2 Max, M2 Ultra, and M1 Pro, but somehow slower on M1 Max, based on a single datapoint and with no further feedback.

@ggerganov
Member

@lshzh-ww

I don't understand the provided screenshot about M1 Max performance on master: what is the relevant information to look at, and what is its meaning?

I don't see how this PR increases the memory pressure. AFAICT it even reduces it since we now put 32 floats from src1 into local memory, instead of 16.

> Would you mind trying to first copy blocks in whole to threadgroup memory and then letting each thread read from there?

I agree that we should try this strategy; my feeling is also that it would help if done properly. But we can experiment with this starting from master.

@ikawrakow

I'll merge this PR later today as it mostly has a positive effect even if it is not across the entire M-series line-up.
In the meantime, if anyone would like to help by providing more datapoints, that would be appreciated!

@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Sep 8, 2023
@ggerganov ggerganov merged commit ba7ffbb into master Sep 8, 2023
@ikawrakow ikawrakow mentioned this pull request Sep 9, 2023
@ikawrakow ikawrakow deleted the ik/metal_q3k branch September 24, 2023 16:10