
metal : support MTLGPUFamily < Apple7, formatting, style #3524

Merged (5 commits) on Oct 8, 2023

Conversation

ggerganov (Owner) commented on Oct 7, 2023

Edit:

ref #3129

The scope of this PR changed - it is now mostly a formatting change. Improved batched decoding will be investigated in a future PR.

Obsolete info below


ref #3479

In Metal, we have 2 matrix multiplication kernels:

  • matrix-matrix
  • matrix-vector

Depending on the batch size, one of the 2 kernels is faster.

This PR adds logic for choosing which kernel to use depending on the batch size. The numbers were determined empirically on M2 Ultra. I'm not sure whether they translate to the optimal values for other chips, but they certainly would not affect the performance tests we have been doing so far, since we have been testing either a batch size of 1 or a batch size of 512.
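
To illustrate the idea, here is a minimal C sketch of batch-size-based kernel selection. The threshold constant and the dispatch helpers are hypothetical stand-ins, not the actual symbols in ggml-metal.m, and the cut-off is only the value suggested by the M2 Ultra numbers below:

```c
#include <stdio.h>

// Hypothetical stand-ins for the real Metal kernel dispatch calls.
static void encode_mul_mat_mm(void) { printf("matrix-matrix kernel\n"); }
static void encode_mul_mat_mv(void) { printf("matrix-vector kernel\n"); }

// Break-even batch size suggested by the M2 Ultra measurements; the name and
// the exact value used in the actual code may differ.
#define MM_MIN_BATCH 16

// Pick a kernel based on the batch size (number of rows of the second operand).
static void encode_mul_mat(int n_batch) {
    if (n_batch >= MM_MIN_BATCH) {
        encode_mul_mat_mm(); // large batches amortize the matrix-matrix kernel's overhead
    } else {
        encode_mul_mat_mv(); // small batches are faster with the matrix-vector kernel
    }
}

int main(void) {
    encode_mul_mat(1);  // single-token decoding -> matrix-vector
    encode_mul_mat(32); // prompt processing     -> matrix-matrix
    return 0;
}
```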

This change improves batched decoding performance for non-F16 types. For F16 there is no difference, although a similar analysis should be performed on the CUDA kernels to see where the break-even point between the two kernels lies.

make -j && ../scripts/run-all-perf.sh llama-13b-v2 "q4_0" "-ngl 1 -t 4 -p 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256 -n 128"
| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 1 | 48.31 ± 24.64 | 48.69 ± 24.48 | 1.008 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 2 | 88.30 ± 0.97 | 88.81 ± 0.90 | 1.006 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 4 | 42.94 ± 0.15 | 108.12 ± 0.77 | 2.518 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 5 | 53.43 ± 0.12 | 113.97 ± 0.79 | 2.133 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 6 | 64.03 ± 0.20 | 126.26 ± 0.65 | 1.972 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 7 | 74.53 ± 0.09 | 130.34 ± 0.79 | 1.749 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 8 | 84.85 ± 0.22 | 127.16 ± 0.50 | 1.499 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 9 | 95.41 ± 0.05 | 143.48 ± 0.80 | 1.504 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 10 | 105.56 ± 0.13 | 146.24 ± 0.41 | 1.385 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 11 | 115.74 ± 0.10 | 142.48 ± 0.70 | 1.231 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 12 | 125.90 ± 0.12 | 145.26 ± 0.43 | 1.154 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 13 | 135.86 ± 0.10 | 148.30 ± 0.45 | 1.092 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 14 | 146.83 ± 0.26 | 155.42 ± 0.59 | 1.059 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 15 | 156.65 ± 0.45 | 156.63 ± 0.33 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 16 | 165.56 ± 0.10 | 165.38 ± 0.31 | 0.999 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 32 | 333.24 ± 1.00 | 332.12 ± 0.39 | 0.997 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 64 | 474.08 ± 0.61 | 474.25 ± 0.92 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 128 | 596.25 ± 1.05 | 596.23 ± 1.49 | 1.000 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | pp 256 | 630.68 ± 2.13 | 629.23 ± 1.68 | 0.998 |
| llama 13B Q4_0 | 6.86 GiB | Metal | 1 | 4 | tg 128 | 58.55 ± 0.23 | 58.82 ± 0.07 | 1.005 |

build: 99ed03a (1343)

make -j && ../scripts/run-all-perf.sh llama-7b-v2 "q4_0" "-ngl 1 -t 4 -p 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256 -n 128"
| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 1 | 76.45 ± 42.71 | 82.15 ± 41.40 | 1.075 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 2 | 153.05 ± 2.75 | 152.31 ± 3.73 | 0.995 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 4 | 73.90 ± 0.30 | 195.84 ± 3.21 | 2.650 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 5 | 91.90 ± 0.18 | 207.80 ± 1.06 | 2.261 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 6 | 109.99 ± 0.47 | 231.24 ± 2.18 | 2.102 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 7 | 128.03 ± 0.54 | 238.55 ± 2.61 | 1.863 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 8 | 146.67 ± 0.89 | 240.39 ± 0.53 | 1.639 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 9 | 164.05 ± 0.69 | 263.59 ± 1.70 | 1.607 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 10 | 181.87 ± 0.18 | 268.87 ± 2.57 | 1.478 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 11 | 200.08 ± 0.61 | 264.98 ± 2.13 | 1.324 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 12 | 217.81 ± 0.78 | 270.22 ± 1.90 | 1.241 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 13 | 234.61 ± 0.82 | 274.98 ± 0.56 | 1.172 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 14 | 252.92 ± 1.04 | 284.32 ± 2.01 | 1.124 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 15 | 270.12 ± 1.27 | 287.72 ± 0.72 | 1.065 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 16 | 288.98 ± 0.96 | 288.78 ± 1.11 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 32 | 569.31 ± 2.22 | 568.82 ± 2.76 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 64 | 840.19 ± 1.89 | 840.68 ± 0.23 | 1.001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 128 | 1064.88 ± 2.15 | 1067.44 ± 2.81 | 1.002 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 256 | 1165.27 ± 1.40 | 1166.87 ± 1.66 | 1.001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | tg 128 | 98.42 ± 0.09 | 98.34 ± 0.07 | 0.999 |

Sample results for parallel example

Generating 64 sequences using a system prompt, serving 4 requests in parallel

# 13B Q8_0, n_parallel = 4
make -j && ./parallel -m ./models/llama-13b-v2/ggml-model-q8_0.gguf -f "prompts/parallel-questions.txt" -n 512 -t 1 -s 3456 -ngl 100 -c 8192 -np 4 -ns 64 -cb
  • master
main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 305
External prompt file: prompts/parallel-questions.txt
Model and path used:  ./models/llama-13b-v2/ggml-model-q8_0.gguf

Total prompt tokens:    895, speed:  9.59 t/s
Total gen tokens:      3437, speed: 36.84 t/s
Total speed (AVG):           speed: 46.43 t/s
Cache misses:             0


llama_print_timings:        load time =     788.18 ms
llama_print_timings:      sample time =    2339.45 ms /  3501 runs   (    0.67 ms per token,  1496.51 tokens per second)
llama_print_timings: prompt eval time =   90551.29 ms /  4635 tokens (   19.54 ms per token,    51.19 tokens per second)
llama_print_timings:        eval time =      55.12 ms /     2 runs   (   27.56 ms per token,    36.29 tokens per second)
llama_print_timings:       total time =   93307.56 ms
  • PR
main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 305
External prompt file: prompts/parallel-questions.txt
Model and path used:  ./models/llama-13b-v2/ggml-model-q8_0.gguf

Total prompt tokens:    895, speed: 13.49 t/s
Total gen tokens:      3497, speed: 52.71 t/s
Total speed (AVG):           speed: 66.20 t/s
Cache misses:             0


llama_print_timings:        load time =     778.88 ms
llama_print_timings:      sample time =    2392.38 ms /  3561 runs   (    0.67 ms per token,  1488.48 tokens per second)
llama_print_timings: prompt eval time =   63470.54 ms /  4693 tokens (   13.52 ms per token,    73.94 tokens per second)
llama_print_timings:        eval time =     111.81 ms /     4 runs   (   27.95 ms per token,    35.77 tokens per second)
llama_print_timings:       total time =   66341.93 ms

ggerganov added the performance (Speed related topics) and need feedback (Testing and feedback with results are needed) labels on Oct 7, 2023
jhen0409 (Sponsor, Collaborator) commented on Oct 7, 2023

On M2 (10-core GPU), it looks like the speed slows down in some cases:

| model | size | backend | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 1 | 19.80 ± 5.49 | 17.70 ± 9.79 | 0.8944 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 2 | 36.79 ± 0.18 | 36.66 ± 0.05 | 0.9965 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 4 | 21.54 ± 0.03 | 36.77 ± 0.06 | 1.7079 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 5 | 26.76 ± 0.02 | 36.96 ± 0.16 | 1.3809 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 6 | 32.03 ± 0.07 | 39.96 ± 0.04 | 1.2483 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 7 | 37.26 ± 0.07 | 40.11 ± 0.07 | 1.0768 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 8 | 42.55 ± 0.04 | 38.42 ± 0.03 | 0.9030 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 9 | 47.65 ± 0.10 | 41.45 ± 0.05 | 0.8698 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 10 | 52.77 ± 0.08 | 41.57 ± 0.05 | 0.7878 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 11 | 57.77 ± 0.08 | 38.75 ± 0.13 | 0.6709 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 12 | 62.84 ± 0.06 | 39.08 ± 0.20 | 0.6218 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 13 | 67.62 ± 0.11 | 39.23 ± 0.09 | 0.5802 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 14 | 72.84 ± 0.04 | 41.79 ± 0.07 | 0.5739 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 15 | 77.63 ± 0.14 | 41.85 ± 0.05 | 0.5394 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 16 | 82.73 ± 0.17 | 82.65 ± 0.22 | 0.9990 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 32 | 166.89 ± 0.32 | 166.90 ± 0.31 | 1.0001 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 64 | 182.12 ± 0.28 | 182.21 ± 0.17 | 1.0005 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 128 | 185.47 ± 0.08 | 185.67 ± 0.17 | 1.0011 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | pp 256 | 183.72 ± 0.19 | 183.76 ± 0.19 | 1.0002 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | tg 128 | 21.92 ± 0.01 | 21.89 ± 0.01 | 0.9986 |

build: 99ed03a (1343)

ggerganov (Owner, Author) commented

Yes, I've also realized that the break-even point is not as trivial as currently proposed. It is a function of the matrix sizes.
I'm still trying to find a general way to determine it.

ggerganov (Owner, Author) commented

Bummer. I can't figure out a universal way to determine which kernel to use when. The break-even point for the number of batches at which the matrix-matrix kernel becomes more performant than the matrix-vector kernel depends both on hardware specifics that are not queryable (number of cores, memory bandwidth, FLOPs) and on the model / matrix sizes.

Based on the tests here, there is significant performance to be gained for quantized low-batch (< 16) decoding, which would be quite important for speculative approaches. But I can't figure out a way to choose the optimal kernel. Any suggestions?

KerfuffleV2 (Collaborator) commented

> although a similar analysis should be performed on the CUDA kernels

Are there actually already two kernels for CUDA? I wanted to mess with this to see if it helped with my issue where parallel generation gets slower and slower, but it didn't look like there was that kind of logic in ggml-cuda.cu.

> But I can't figure out a way to choose the optimal kernel. Any suggestions?

You didn't say "good suggestions". The simplest thing that comes to mind is to just do a mini benchmark that runs a few operations and stores the result into the context. Maybe that could even be part of the warmup stuff that already exists. There might be other stuff that could benefit from that sort of thing as well.
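
For illustration, a minimal C sketch of such a one-time micro-benchmark; the kernel callbacks and the synthetic cost models are made up just so the sketch runs, and the real version would hook into the existing warmup and cache the result in the context:

```c
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(int n_batch); // runs one multiplication at the given batch size

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + ts.tv_nsec*1e-9;
}

// Find the smallest batch size at which `mat_mat` beats `mat_vec`.
// The callbacks stand in for the actual kernel dispatch, which is not shown here.
static int find_mm_break_even(kernel_fn mat_mat, kernel_fn mat_vec, int max_batch) {
    for (int n = 2; n <= max_batch; ++n) {
        double t0 = now_sec(); mat_mat(n); const double t_mm = now_sec() - t0;
        double t1 = now_sec(); mat_vec(n); const double t_mv = now_sec() - t1;
        if (t_mm < t_mv) {
            return n; // matrix-matrix wins from this batch size onward
        }
    }
    return max_batch + 1; // matrix-vector always won in the tested range
}

// Synthetic stand-ins with made-up cost models, only so the example runs:
// the mat-vec path scales with the batch size, the mat-mat path has a larger,
// mostly fixed cost.
static volatile double sink;
static void fake_mat_vec(int n_batch) { for (int i = 0; i < n_batch*200000; ++i) sink += 1.0; }
static void fake_mat_mat(int n_batch) { (void) n_batch; for (int i = 0; i < 3000000; ++i) sink += 1.0; }

int main(void) {
    printf("break-even batch size: %d\n", find_mm_break_even(fake_mat_mat, fake_mat_vec, 64));
    return 0;
}
```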

slaren (Collaborator) commented on Oct 7, 2023

> Are there actually already two kernels for CUDA?

CUDA has different kernels for matrix-vector and matrix-matrix multiplication. To do the same kind of batch-size-based selection there, the matrix-vector kernels would first need to be updated to support matrix-matrix multiplication.

ggerganov changed the title from "metal : improve decoding speed for batches of 2-16" to "metal : support MTLGPUFamily < Apple7, formatting, style" on Oct 8, 2023
ggerganov merged commit b0ec521 into master on Oct 8, 2023 (10 of 11 checks passed)
ibehnam (Contributor) commented on Oct 23, 2023

> Bummer. I can't figure out a universal way to determine which kernel to use when. The break-even point for the number of batches at which the matrix-matrix kernel becomes more performant than the matrix-vector kernel depends both on hardware specifics that are not queryable (number of cores, memory bandwidth, FLOPs) and on the model / matrix sizes.

Optionally, we can ask the user to pass in their hardware specs. In the docs, there's already a hint to set --threads equal to the number of physical CPU cores, implying that the user can obtain this info about their system.
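
As a rough sketch of that idea in C, a user-supplied override could look something like the following; the environment variable name is hypothetical and not an existing llama.cpp option:

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical override: let the user supply the break-even batch size for
// their hardware, in the same spirit as hand-tuning --threads today.
static int mm_break_even_from_env(int default_value) {
    const char * s = getenv("GGML_METAL_MM_MIN_BATCH"); // illustrative name only
    if (s != NULL) {
        const int v = atoi(s);
        if (v > 0) {
            return v;
        }
    }
    return default_value;
}

int main(void) {
    // falls back to the default when the variable is unset or invalid
    printf("break-even batch size: %d\n", mm_break_even_from_env(16));
    return 0;
}
```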
