
metal : support MTLGPUFamily < Apple7, formatting, style #3524

Merged (5 commits) on Oct 8, 2023

Conversation

ggerganov (Owner) commented on Oct 7, 2023

Edit:

ref #3129

The scope of this PR changed - it is now mostly a formatting change. Improved batched decoding will be investigated in a future PR.

Obsolete info below


ref #3479

In Metal, we have 2 matrix multiplication kernels:

  • matrix-matrix
  • matrix-vector

Depending on the batch size, one of the 2 kernels is faster.

This PR adds logic for choosing which kernel to use depending on the batch size. The numbers were determined empirically on M2 Ultra. I'm not sure whether they translate to the optimal values for other chips, but they certainly would not affect the performance tests we have been doing so far, since we have been testing either a batch size of 1 or a batch size of 512.
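
To illustrate the idea, here is a minimal C sketch of batch-size-based kernel selection. The threshold constant and the dispatch helpers are hypothetical stand-ins, not the actual symbols in ggml-metal.m, and the cut-off is only the value suggested by the M2 Ultra numbers below:

```c
#include <stdio.h>

// Hypothetical stand-ins for the real Metal kernel dispatch calls.
static void encode_mul_mat_mm(void) { printf("matrix-matrix kernel\n"); }
static void encode_mul_mat_mv(void) { printf("matrix-vector kernel\n"); }

// Break-even batch size suggested by the M2 Ultra measurements; the name and
// the exact value used in the actual code may differ.
#define MM_MIN_BATCH 16

// Pick a kernel based on the batch size (number of rows of the second operand).
static void encode_mul_mat(int n_batch) {
    if (n_batch >= MM_MIN_BATCH) {
        encode_mul_mat_mm(); // large batches amortize the matrix-matrix kernel's overhead
    } else {
        encode_mul_mat_mv(); // small batches are faster with the matrix-vector kernel
    }
}

int main(void) {
    encode_mul_mat(1);  // single-token decoding -> matrix-vector
    encode_mul_mat(32); // prompt processing     -> matrix-matrix
    return 0;
}
```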

This change improves batched decoding performance for non-F16 types. For F16 there is no difference, although a similar analysis should be performed on the CUDA kernels to see where the break-even point between the two kernels lies.

make -j && ../scripts/run-all-perf.sh llama-13b-v2 "q4_0" "-ngl 1 -t 4 -p 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256 -n 128"
| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 1 | 48.31 ± 24.64 | 48.69 ± 24.48 | 1.008 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 2 | 88.30 ± 0.97 | 88.81 ± 0.90 | 1.006 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 4 | 42.94 ± 0.15 | 108.12 ± 0.77 | 2.518 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 5 | 53.43 ± 0.12 | 113.97 ± 0.79 | 2.133 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 6 | 64.03 ± 0.20 | 126.26 ± 0.65 | 1.972 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 7 | 74.53 ± 0.09 | 130.34 ± 0.79 | 1.749 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 8 | 84.85 ± 0.22 | 127.16 ± 0.50 | 1.499 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 9 | 95.41 ± 0.05 | 143.48 ± 0.80 | 1.504 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 10 | 105.56 ± 0.13 | 146.24 ± 0.41 | 1.385 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 11 | 115.74 ± 0.10 | 142.48 ± 0.70 | 1.231 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 12 | 125.90 ± 0.12 | 145.26 ± 0.43 | 1.154 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 13 | 135.86 ± 0.10 | 148.30 ± 0.45 | 1.092 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 14 | 146.83 ± 0.26 | 155.42 ± 0.59 | 1.059 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 15 | 156.65 ± 0.45 | 156.63 ± 0.33 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 16 | 165.56 ± 0.10 | 165.38 ± 0.31 | 0.999 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 32 | 333.24 ± 1.00 | 332.12 ± 0.39 | 0.997 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 64 | 474.08 ± 0.61 | 474.25 ± 0.92 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 128 | 596.25 ± 1.05 | 596.23 ± 1.49 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 256 | 630.68 ± 2.13 | 629.23 ± 1.68 | 0.998 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | tg 128 | 58.55 ± 0.23 | 58.82 ± 0.07 | 1.005 |

build: 99ed03a (1343)

make -j && ../scripts/run-all-perf.sh llama-7b-v2 "q4_0" "-ngl 1 -t 4 -p 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256 -n 128"
| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 1 | 76.45 ± 42.71 | 82.15 ± 41.40 | 1.075 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 2 | 153.05 ± 2.75 | 152.31 ± 3.73 | 0.995 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 4 | 73.90 ± 0.30 | 195.84 ± 3.21 | 2.650 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 5 | 91.90 ± 0.18 | 207.80 ± 1.06 | 2.261 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 6 | 109.99 ± 0.47 | 231.24 ± 2.18 | 2.102 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 7 | 128.03 ± 0.54 | 238.55 ± 2.61 | 1.863 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 8 | 146.67 ± 0.89 | 240.39 ± 0.53 | 1.639 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 9 | 164.05 ± 0.69 | 263.59 ± 1.70 | 1.607 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 10 | 181.87 ± 0.18 | 268.87 ± 2.57 | 1.478 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 11 | 200.08 ± 0.61 | 264.98 ± 2.13 | 1.324 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 12 | 217.81 ± 0.78 | 270.22 ± 1.90 | 1.241 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 13 | 234.61 ± 0.82 | 274.98 ± 0.56 | 1.172 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 14 | 252.92 ± 1.04 | 284.32 ± 2.01 | 1.124 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 15 | 270.12 ± 1.27 | 287.72 ± 0.72 | 1.065 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 16 | 288.98 ± 0.96 | 288.78 ± 1.11 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 32 | 569.31 ± 2.22 | 568.82 ± 2.76 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 64 | 840.19 ± 1.89 | 840.68 ± 0.23 | 1.001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 128 | 1064.88 ± 2.15 | 1067.44 ± 2.81 | 1.002 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 256 | 1165.27 ± 1.40 | 1166.87 ± 1.66 | 1.001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | tg 128 | 98.42 ± 0.09 | 98.34 ± 0.07 | 0.999 |

Sample results for parallel example

Generating 64 sequences using a system prompt, serving 4 requests in parallel

# 13B Q8_0, n_parallel = 4
make -j && ./parallel -m ./models/llama-13b-v2/ggml-model-q8_0.gguf -f "prompts/parallel-questions.txt" -n 512 -t 1 -s 3456 -ngl 100 -c 8192 -np 4 -ns 64 -cb
  • master
main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 305
External prompt file: prompts/parallel-questions.txt
Model and path used:  ./models/llama-13b-v2/ggml-model-q8_0.gguf

Total prompt tokens:    895, speed:  9.59 t/s
Total gen tokens:      3437, speed: 36.84 t/s
Total speed (AVG):           speed: 46.43 t/s
Cache misses:             0


llama_print_timings:        load time =     788.18 ms
llama_print_timings:      sample time =    2339.45 ms /  3501 runs   (    0.67 ms per token,  1496.51 tokens per second)
llama_print_timings: prompt eval time =   90551.29 ms /  4635 tokens (   19.54 ms per token,    51.19 tokens per second)
llama_print_timings:        eval time =      55.12 ms /     2 runs   (   27.56 ms per token,    36.29 tokens per second)
llama_print_timings:       total time =   93307.56 ms
  • PR
main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 305
External prompt file: prompts/parallel-questions.txt
Model and path used:  ./models/llama-13b-v2/ggml-model-q8_0.gguf

Total prompt tokens:    895, speed: 13.49 t/s
Total gen tokens:      3497, speed: 52.71 t/s
Total speed (AVG):           speed: 66.20 t/s
Cache misses:             0


llama_print_timings:        load time =     778.88 ms
llama_print_timings:      sample time =    2392.38 ms /  3561 runs   (    0.67 ms per token,  1488.48 tokens per second)
llama_print_timings: prompt eval time =   63470.54 ms /  4693 tokens (   13.52 ms per token,    73.94 tokens per second)
llama_print_timings:        eval time =     111.81 ms /     4 runs   (   27.95 ms per token,    35.77 tokens per second)
llama_print_timings:       total time =   66341.93 ms

ggerganov added the performance (Speed related topics) and need feedback (Testing and feedback with results are needed) labels on Oct 7, 2023
jhen0409 (Sponsor, Collaborator) commented on Oct 7, 2023

On M2 (10-core GPU), it looks like the speed slows down in some cases:

| model | size | backend | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 1 | 19.80 ± 5.49 | 17.70 ± 9.79 | 0.8944 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 2 | 36.79 ± 0.18 | 36.66 ± 0.05 | 0.9965 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 4 | 21.54 ± 0.03 | 36.77 ± 0.06 | 1.7079 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 5 | 26.76 ± 0.02 | 36.96 ± 0.16 | 1.3809 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 6 | 32.03 ± 0.07 | 39.96 ± 0.04 | 1.2483 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 7 | 37.26 ± 0.07 | 40.11 ± 0.07 | 1.0768 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 8 | 42.55 ± 0.04 | 38.42 ± 0.03 | 0.9030 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 9 | 47.65 ± 0.10 | 41.45 ± 0.05 | 0.8698 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 10 | 52.77 ± 0.08 | 41.57 ± 0.05 | 0.7878 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 11 | 57.77 ± 0.08 | 38.75 ± 0.13 | 0.6709 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 12 | 62.84 ± 0.06 | 39.08 ± 0.20 | 0.6218 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 13 | 67.62 ± 0.11 | 39.23 ± 0.09 | 0.5802 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 14 | 72.84 ± 0.04 | 41.79 ± 0.07 | 0.5739 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 15 | 77.63 ± 0.14 | 41.85 ± 0.05 | 0.5394 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 16 | 82.73 ± 0.17 | 82.65 ± 0.22 | 0.9990 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 32 | 166.89 ± 0.32 | 166.90 ± 0.31 | 1.0001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 64 | 182.12 ± 0.28 | 182.21 ± 0.17 | 1.0005 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 128 | 185.47 ± 0.08 | 185.67 ± 0.17 | 1.0011 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 256 | 183.72 ± 0.19 | 183.76 ± 0.19 | 1.0002 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | tg 128 | 21.92 ± 0.01 | 21.89 ± 0.01 | 0.9986 |

build: 99ed03a (1343)

ggerganov (Owner, Author) commented

Yes, I've also realized that the break-even point is not as trivial as currently proposed. It is a function of the matrix sizes.
I'm still trying to find a general way to determine it.

ggerganov (Owner, Author) commented

Bummer. I can't figure out a universal way to determine which kernel to use when. The break-even point for the number of batches at which the matrix-matrix kernel becomes more performant than the matrix-vector kernel depends both on hardware specifics that are not queryable (number of cores, memory bandwidth, FLOPs) and on the model / matrix sizes.

Based on the tests here, there is significant performance to be gained for quantized low-batch (< 16) decoding, which would be quite important for speculative approaches. But I can't figure out a way to choose the optimal kernel. Any suggestions?

KerfuffleV2 (Collaborator) commented

> although a similar analysis should be performed on the CUDA kernels

Are there actually already two kernels for CUDA? I wanted to mess with this to see if it helped with my issue where parallel generation gets slower and slower, but it didn't look like there was that kind of logic in ggml-cuda.cu.

> But I can't figure out a way to choose the optimal kernel. Any suggestions?

You didn't say "good suggestions". The simplest thing that comes to mind is to just do a mini benchmark that runs a few operations and stores the result into the context. Maybe that could even be part of the warmup stuff that already exists. There might be other stuff that could benefit from that sort of thing as well.
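
For illustration, a minimal C sketch of such a one-time micro-benchmark; the kernel callbacks and the synthetic cost models are made up just so the sketch runs, and the real version would hook into the existing warmup and cache the result in the context:

```c
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(int n_batch); // runs one multiplication at the given batch size

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + ts.tv_nsec*1e-9;
}

// Find the smallest batch size at which `mat_mat` beats `mat_vec`.
// The callbacks stand in for the actual kernel dispatch, which is not shown here.
static int find_mm_break_even(kernel_fn mat_mat, kernel_fn mat_vec, int max_batch) {
    for (int n = 2; n <= max_batch; ++n) {
        double t0 = now_sec(); mat_mat(n); const double t_mm = now_sec() - t0;
        double t1 = now_sec(); mat_vec(n); const double t_mv = now_sec() - t1;
        if (t_mm < t_mv) {
            return n; // matrix-matrix wins from this batch size onward
        }
    }
    return max_batch + 1; // matrix-vector always won in the tested range
}

// Synthetic stand-ins with made-up cost models, only so the example runs:
// the mat-vec path scales with the batch size, the mat-mat path has a larger,
// mostly fixed cost.
static volatile double sink;
static void fake_mat_vec(int n_batch) { for (int i = 0; i < n_batch*200000; ++i) sink += 1.0; }
static void fake_mat_mat(int n_batch) { (void) n_batch; for (int i = 0; i < 3000000; ++i) sink += 1.0; }

int main(void) {
    printf("break-even batch size: %d\n", find_mm_break_even(fake_mat_mat, fake_mat_vec, 64));
    return 0;
}
```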

slaren (Collaborator) commented on Oct 7, 2023

> Are there actually already two kernels for CUDA?

CUDA has different kernels for matrix-vector and matrix-matrix multiplication. To do the same kind of batch-size-based selection there, the matrix-vector kernels would first need to be updated to support matrix-matrix multiplication.

ggerganov changed the title from "metal : improve decoding speed for batches of 2-16" to "metal : support MTLGPUFamily < Apple7, formatting, style" on Oct 8, 2023
ggerganov merged commit b0ec521 into master on Oct 8, 2023 (10 of 11 checks passed)
ibehnam (Contributor) commented on Oct 23, 2023

> Bummer. I can't figure out a universal way to determine which kernel to use when. The break-even point for the number of batches at which the matrix-matrix kernel becomes more performant than the matrix-vector kernel depends both on hardware specifics that are not queryable (number of cores, memory bandwidth, FLOPs) and on the model / matrix sizes.

Optionally, we can ask the user to pass in their hardware specs. In the docs, there's already a hint to set --threads equal to the number of physical CPU cores, implying that the user can obtain this info about their system.
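
As a rough sketch of that idea in C, a user-supplied override could look something like the following; the environment variable name is hypothetical and not an existing llama.cpp option:

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical override: let the user supply the break-even batch size for
// their hardware, in the same spirit as hand-tuning --threads today.
static int mm_break_even_from_env(int default_value) {
    const char * s = getenv("GGML_METAL_MM_MIN_BATCH"); // illustrative name only
    if (s != NULL) {
        const int v = atoi(s);
        if (v > 0) {
            return v;
        }
    }
    return default_value;
}

int main(void) {
    // falls back to the default when the variable is unset or invalid
    printf("break-even batch size: %d\n", mm_break_even_from_env(16));
    return 0;
}
```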
