
ggml-cuda : perform cublas mat mul of quantized types as f16 #3412

Merged: 3 commits into master from cublas-q-f16 on Sep 30, 2023

Conversation

@slaren (Collaborator) commented Sep 30, 2023

Improves prompt processing speed for quantized types, but only when mmq is disabled (-nommq).

Essentially this is the same as #3370, extended to quantized types by dequantizing to fp16.
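
For context, here is a minimal sketch of the approach, assuming a simplified Q8_0-style block layout; the struct, kernel, and helper below are illustrative stand-ins, not the actual ggml-cuda code. The quantized weights are expanded to fp16 by a small CUDA kernel, and the matrix multiplication is then done by cublasGemmEx entirely in fp16, which lets cuBLAS use the tensor cores on Volta and newer GPUs:

```cpp
// Sketch only: dequantize a simplified Q8_0-style layout to fp16, then run the
// GEMM with cuBLAS using fp16 inputs, outputs, and compute type.
#include <cublas_v2.h>
#include <cuda_fp16.h>

#define QK 32                      // values per quantized block

struct block_q8 {                  // simplified stand-in for a quantized block
    half   d;                      // per-block scale
    int8_t qs[QK];                 // quantized values
};

// Hypothetical kernel: expand quantized blocks into a contiguous fp16 matrix.
// Launch as: dequantize_q8_to_f16<<<(n + 255) / 256, 256>>>(x, y, n);
__global__ void dequantize_q8_to_f16(const block_q8 * x, half * y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // output element index
    if (i >= n) return;
    const block_q8 & b = x[i / QK];
    y[i] = __float2half(__half2float(b.d) * (float) b.qs[i % QK]);
}

// fp16 GEMM via cuBLAS: C (m x n) = A (m x k) * B (k x n), column-major as cuBLAS expects.
void gemm_f16(cublasHandle_t handle, const half * A, const half * B, half * C,
              int m, int n, int k) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    // CUDA_R_16F operands plus CUBLAS_COMPUTE_16F allow the tensor-core GEMM kernels.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

In the real code the fp16 result may still need to be converted back to fp32 for the rest of the graph; see the discussion about dst conversion further down.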

| model | size | mmq | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q2_K | 2.63 GiB | 0 | pp 512 | 1740.05 ± 2.18 | 3422.52 ± 26.59 | 1.97 |
| llama 7B mostly Q3_K - Large | 3.35 GiB | 0 | pp 512 | 1704.44 ± 36.02 | 3434.08 ± 21.88 | 2.02 |
| llama 7B mostly Q3_K - Medium | 3.07 GiB | 0 | pp 512 | 1725.62 ± 2.40 | 3423.26 ± 44.50 | 1.98 |
| llama 7B mostly Q3_K - Small | 2.75 GiB | 0 | pp 512 | 1720.28 ± 17.54 | 3415.61 ± 15.86 | 1.98 |
| llama 7B mostly Q4_0 | 3.56 GiB | 0 | pp 512 | 1705.19 ± 5.29 | 3230.66 ± 16.41 | 1.89 |
| llama 7B mostly Q4_1 | 3.95 GiB | 0 | pp 512 | 1696.79 ± 16.01 | 3241.20 ± 25.87 | 1.91 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 0 | pp 512 | 1718.43 ± 16.33 | 3507.96 ± 8.80 | 2.04 |
| llama 7B mostly Q4_K - Small | 3.59 GiB | 0 | pp 512 | 1727.00 ± 4.03 | 3413.81 ± 97.65 | 1.98 |
| llama 7B mostly Q5_0 | 4.33 GiB | 0 | pp 512 | 1695.06 ± 6.91 | 3172.95 ± 15.14 | 1.87 |
| llama 7B mostly Q5_1 | 4.72 GiB | 0 | pp 512 | 1697.97 ± 4.07 | 3179.81 ± 35.89 | 1.87 |
| llama 7B mostly Q5_K - Medium | 4.45 GiB | 0 | pp 512 | 1721.78 ± 2.12 | 3460.38 ± 34.02 | 2.01 |
| llama 7B mostly Q5_K - Small | 4.33 GiB | 0 | pp 512 | 1722.66 ± 4.93 | 3474.46 ± 36.03 | 2.02 |
| llama 7B mostly Q6_K | 5.15 GiB | 0 | pp 512 | 1712.02 ± 3.62 | 3468.20 ± 23.51 | 2.03 |
| llama 7B mostly Q8_0 | 6.67 GiB | 0 | pp 512 | 1685.94 ± 8.08 | 3176.80 ± 51.69 | 1.88 |

For comparison, this is the performance that I get with mmq enabled (the default):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q2_K | 2.63 GiB | 6.74 B | CUDA | 99 | pp 512 | 1814.65 ± 4.73 |
| llama 7B mostly Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | pp 512 | 1922.97 ± 19.27 |
| llama 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | pp 512 | 2009.16 ± 8.20 |
| llama 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 1901.53 ± 31.49 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2420.71 ± 9.51 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | pp 512 | 2099.10 ± 31.81 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2220.84 ± 1.78 |
| llama 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | pp 512 | 2181.68 ± 54.97 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 99 | pp 512 | 2191.70 ± 5.07 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | pp 512 | 1945.15 ± 6.50 |
| llama 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | pp 512 | 2070.01 ± 5.23 |
| llama 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | pp 512 | 2044.41 ± 12.79 |
| llama 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | pp 512 | 2125.72 ± 7.12 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2346.21 ± 35.02 |

@ggerganov (Owner) left a comment

Great. Can’t test atm, but if ppl looks ok we should merge

@slaren (Collaborator, Author) commented Sep 30, 2023

Perplexity looks good:

| model | ppl |
| --- | --- |
| 7B/ggml-model-f16.gguf | 5.9073 +/- 0.03309 |
| 7B/ggml-model-Q2_K.gguf | 6.5864 +/- 0.03755 |
| 7B/ggml-model-Q3_K_S.gguf | 6.4524 +/- 0.03672 |
| 7B/ggml-model-Q3_K.gguf | 6.1548 +/- 0.03456 |
| 7B/ggml-model-Q3_K_L.gguf | 6.0866 +/- 0.03416 |
| 7B/ggml-model-Q4_0.gguf | 6.1159 +/- 0.03504 |
| 7B/ggml-model-Q4_1.gguf | 6.0655 +/- 0.03400 |
| 7B/ggml-model-Q4_K_S.gguf | 6.0067 +/- 0.03371 |
| 7B/ggml-model-Q4_K.gguf | 5.9616 +/- 0.03342 |
| 7B/ggml-model-Q5_0.gguf | 5.9814 +/- 0.03409 |
| 7B/ggml-model-Q5_1.gguf | 5.9418 +/- 0.03328 |
| 7B/ggml-model-Q5_K_S.gguf | 5.9463 +/- 0.03330 |
| 7B/ggml-model-Q5_K.gguf | 5.9196 +/- 0.03317 |
| 7B/ggml-model-Q6_K.gguf | 5.9076 +/- 0.03309 |
| 7B/ggml-model-Q8_0.gguf | 5.9078 +/- 0.03309 |

@Ph0rk0z commented Sep 30, 2023

Will this murder P40? Also, what if I am running a model on 3090s and also P40s together?

@slaren (Collaborator, Author) commented Sep 30, 2023

This is only used on Volta and up.

@Ph0rk0z commented Sep 30, 2023

Right, but what happens if one GPU is Pascal and another is Ampere? Will it use the lowest compute capability for all of them?

@slaren (Collaborator, Author) commented Sep 30, 2023

The fp16 path was already used only on the main GPU, but I think even that may not have worked properly when converting dst to fp32, due to synchronization issues. So it is now completely disabled with multiple GPUs: the fp32 mat mul is used whenever more than one GPU is in use.
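
To make the gating concrete, here is an illustrative sketch of the kind of dispatch being described; the helper names and the n_gpus parameter are placeholders, not the actual ggml-cuda code, though CC_VOLTA is the constant this PR renames from CC_TURING:

```cpp
// Illustrative dispatch only, not the actual ggml-cuda code.
#include <cuda_runtime.h>

static const int CC_VOLTA = 700;   // minimum compute capability for the fp16 cuBLAS path

void mul_mat_dispatch(int main_device, int n_gpus /*, tensor arguments ... */) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, main_device);
    const int cc = 100 * prop.major + 10 * prop.minor;   // e.g. compute capability 7.0 -> 700

    if (cc >= CC_VOLTA && n_gpus == 1) {
        // Volta or newer, single GPU: dequantize src0 to fp16 and run the cuBLAS GEMM in fp16.
        // mul_mat_cublas_f16(...);   // placeholder
    } else {
        // Older GPUs, or more than one GPU in use: dequantize to fp32 and run the GEMM in fp32.
        // mul_mat_cublas_f32(...);   // placeholder
    }
}
```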

@ggerganov (Owner) commented Sep 30, 2023

I updated the A100 numbers using this PR: #3359

slaren merged commit f5ef5cf into master Sep 30, 2023
34 checks passed
slaren deleted the cublas-q-f16 branch September 30, 2023 16:13
@Dampfinchen commented Sep 30, 2023

This increases VRAM usage for some reason. With this build and using --nommap, my Q4_K_S model no longer fits in VRAM and it slows down dramatically.

Edit: Apologies, I misread. I was confusing the new mul mat kernels (MMQ) with MMAP, so the higher VRAM usage is expected. The difference MMQ makes is dramatic in my case:

llama_print_timings:        load time =  2527.29 ms
llama_print_timings:      sample time =   111.18 ms /   180 runs   (    0.62 ms per token,  1618.94 tokens per second)
llama_print_timings: prompt eval time = 35581.01 ms /  1849 tokens (   19.24 ms per token,    51.97 tokens per second)
llama_print_timings:        eval time =  6702.92 ms /   179 runs   (   37.45 ms per token,    26.70 tokens per second)
llama_print_timings:       total time = 42690.89 ms

llama_print_timings:        load time =  2477.21 ms
llama_print_timings:      sample time =   109.03 ms /   180 runs   (    0.61 ms per token,  1650.92 tokens per second)
llama_print_timings: prompt eval time =  4031.07 ms /  1849 tokens (    2.18 ms per token,   458.69 tokens per second)
llama_print_timings:        eval time =  6651.27 ms /   179 runs   (   37.16 ms per token,    26.91 tokens per second)
llama_print_timings:       total time = 11085.66 ms

Hopefully this change can be made to work with mmq as well.

@slaren (Collaborator, Author) commented Sep 30, 2023

> Hopefully this change can be made to work with mmq as well.

Once support for tensor cores is added to mmq, it will be as fast or faster than cublas again, while still using less VRAM. For now, cublas is the only way to use tensor cores.

@Dampfinchen commented Sep 30, 2023

> > Hopefully this change can be made to work with mmq as well.
>
> Once support for tensor cores is added to mmq, it will be as fast or faster than cublas again, while still using less VRAM. For now, cublas is the only way to use tensor cores.

Just ran Nsight and can confirm that the tensor cores are, for the first time ever, used to their full extent.

(attached screenshot: tensor core utilization in Nsight)

Awesome work! Now fingers crossed it's easy to enable tensor core support for mmq as well. If mmq (which was a lot faster than cublas before this commit) can benefit from TC support as well, then we are definitely in for another revolution here. Exciting stuff!

@YellowRoseCx (Contributor) commented

Would be interesting to see how the changes affect AMD users

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
@JohannesGaessler (Collaborator) commented

> Once support for tensor cores is added to mmq, it will be as fast or faster than cublas again, while still using less VRAM.

About two weeks ago I did a prototype implementation of mmq using tensor cores and was not able to get better performance. From what I can tell, a prerequisite to getting good tensor core utilization would be to load data asynchronously. As of right now, the mmq compute pipeline utilization (without tensor cores) is only ~50%.
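
For readers wondering what tensor core support in a hand-written kernel involves, below is a bare illustration of the warp-level primitive (nvcuda::wmma) that such a kernel would be built around; it is not the prototype mentioned above and not an optimized GEMM. Note that the loads in the inner loop are synchronous, which is exactly the limitation described: overlapping them with the mma_sync calls (for example via cp.async double buffering on Ampere) is what asynchronous data loading would address.

```cpp
// Minimal wmma illustration: each warp computes one 16x16 tile of C = A * B.
// A is M x K row-major fp16, B is K x N column-major fp16, C is M x N row-major fp32.
// Requires compute capability >= 7.0 (compile with -arch=sm_70 or newer);
// M, N, K must be multiples of 16.
// Launch as: wmma_gemm_tile<<<dim3(N / 16, M / 16), 32>>>(A, B, C, M, N, K);
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_gemm_tile(const half * A, const half * B, float * C,
                               int M, int N, int K) {
    const int tile_m = blockIdx.y;   // tile row index
    const int tile_n = blockIdx.x;   // tile column index

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Synchronous loads: the warp stalls here instead of overlapping the next
        // load with the current mma_sync, which limits tensor core utilization.
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + tile_n * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N, wmma::mem_row_major);
}
```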

yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…ov#3412)

* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16

* rename CC_TURING to CC_VOLTA

* disable fp16 mat mul completely with multi GPU