Conversation
Combined with the previous commit, we are now +9.6% for TG. PP is not affected, as this happens via the matrix multiplication templates.
We are now 13% faster than master.
Unfortunately, on the M1 Max, this kernel is slower than …
To address the issue of …
The M1 Pro also benefits from this change - at least for the small 7B model. Do you at least see a benefit for 7B?
I don't see a big difference in the performance improvement between 7B and 30B. The gain is ~12-14% for 30B vs ~14-17% for 7B:
If that were true, we wouldn't see the more than ~8X difference in t/s between TG and PP that we observe on the M series, which applies to all quantization types. On a modern GPU the performance difference is even more pronounced because more compute is available (e.g., on my RTX 4080, TG-128 is ~130 t/s while PP-512 is ~3400 t/s, a ~26X difference). On M2 …
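The TG/PP gap described above follows from a simple roofline argument: TG does one matrix x vector multiply per token (every weight is read once per token, so it is memory bound), while PP reuses each weight read across the whole batch (compute bound). A quick back-of-the-envelope sketch, with all hardware and model numbers being illustrative assumptions rather than measurements:

```python
# Roofline-style ceilings for token generation (TG) vs prompt
# processing (PP).  All numbers below are illustrative assumptions.

def tg_ceiling(model_bytes, mem_bw):
    """Memory-bound ceiling in t/s: one full pass over the weights per token."""
    return mem_bw / model_bytes

def pp_ceiling(n_params, flops):
    """Compute-bound ceiling in t/s: ~2 FLOPs (mul + add) per weight per token."""
    return flops / (2 * n_params)

# Hypothetical 7B model at ~4.5 bits/weight, on a GPU with
# 1 TB/s memory bandwidth and 40 TFLOPS of compute.
n_params = 7e9
model_bytes = n_params * 4.5 / 8

tg = tg_ceiling(model_bytes, 1e12)   # ~254 t/s
pp = pp_ceiling(n_params, 40e12)     # ~2857 t/s
print(f"TG ~{tg:.0f} t/s, PP ~{pp:.0f} t/s, ratio ~{pp / tg:.0f}x")
```

Even with these rough numbers, an order-of-magnitude TG/PP gap falls out of the bandwidth/compute ratio alone, independent of the quantization type.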
I am sorry, but we shouldn't compare the performance between TG and PP to estimate the compute pressure from dequantization. In matrix-matrix multiplication kernels, we split the … If this doesn't convince you, here are the profiling results for the …
@ggerganov
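The amortization point can be illustrated with a toy quantized matmul (a simplified sketch with a made-up block format, not the actual llama.cpp kernels): in the matrix-matrix case each weight block is dequantized once and reused for every column of the activations, so the dequantization count is the same for a batch of 1 and a batch of 512.

```python
import numpy as np

QK = 32  # weights per quantized block (Q4_0-style block size)

def quantize(row):
    """Per-block absmax quantization: one scale plus integer codes per block."""
    blocks = row.reshape(-1, QK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7
    codes = np.round(blocks / scales).astype(np.int8)
    return scales, codes

dequant_calls = 0

def dequant(scale, code_block):
    """Reconstruct one block of floats; count how often we do it."""
    global dequant_calls
    dequant_calls += 1
    return scale * code_block

def quantized_matmul(q_rows, X):
    """Multiply a quantized matrix by X (k x n): each block is
    dequantized once, then reused for all n columns of X."""
    out = []
    for scales, codes in q_rows:
        row = np.concatenate([dequant(s, c) for s, c in zip(scales, codes)])
        out.append(row @ X)
    return np.array(out)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 64))  # 4 rows of 2 blocks each
q = [quantize(r) for r in W]

quantized_matmul(q, rng.standard_normal((64, 1)))    # TG-like: 1 column
tg_calls = dequant_calls
dequant_calls = 0
quantized_matmul(q, rng.standard_normal((64, 512)))  # PP-like: 512 columns
pp_calls = dequant_calls
print(tg_calls, pp_calls)  # same dequant count despite 512x the math
```

So the per-multiply dequantization cost shrinks with the batch size, which is why PP throughput says little about dequantization pressure in the TG path.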
By the way, are your results from the M1 Pro with 16 GPU cores or 14 GPU cores? Since …
It has 16 GPU cores. A 32GB model, running macOS 13.4.1.
Thank you for checking this!
@ggerganov What do we do with this? It is faster on the M2 Max, M2 Ultra, and M1 Pro, but somehow slower on the M1 Max, based on one data point and no further feedback.
I don't understand the provided screenshot about M1 Max performance on … I don't see how this PR increases the memory pressure. AFAICT it even reduces it, since we now put 32 floats from …
I agree that we should try this strategy - my feeling is also that it would help if done properly. I'll merge this PR later today, as it mostly has a positive effect, even if not across the entire M-series line-up.


When running on a modern GPU using CUDA, `Q3_K` token generation (TG) is faster than `Q4_0` by a comfortable margin. On Metal, it is much slower than `Q4_0`. Assuming an optimal implementation, TG is memory bound, so `Q3_K` should always be faster, given the smaller model size. I have tried various approaches, including the exact same matrix x vector multiplication kernel as on CUDA, to no avail.

This PR brings a ~15% speedup for `Q3_K` TG on Metal. It is still ~20% slower than `Q4_0`, but at least some progress. Prompt processing is not affected by the PR, so the table gives performance for TG only, on a 30-core M2 Max:

There is also a mini speedup for `Q5_K` (~2%).
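The memory-bound expectation can be put in rough numbers from the nominal block sizes of the two quant types (`Q4_0`: 18 bytes per 32 weights, `Q3_K`: 110 bytes per 256 weights; treat these as approximate):

```python
# If TG is purely memory bound, tokens/s scale inversely with
# model bytes, i.e. with bits per weight (bpw).

q4_0_bpw = 18 * 8 / 32    # 4.5 bpw
q3_k_bpw = 110 * 8 / 256  # ~3.44 bpw

expected_speedup = q4_0_bpw / q3_k_bpw
print(f"Q3_K should be ~{(expected_speedup - 1) * 100:.0f}% faster than Q4_0")
```

So in the ideal memory-bound limit `Q3_K` TG would be roughly 30% faster than `Q4_0`, which is why being ~20% slower on Metal points at a compute bottleneck in the kernel rather than at bandwidth.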