
ggml-cuda : perform cublas fp16 matrix multiplication as fp16 #3370

Merged 3 commits into master from cublas-f16 on Sep 28, 2023

Conversation

@slaren (Collaborator) commented on Sep 27, 2023

Improves prompt processing performance with fp16 models.

3090 Ti/WSL2:

| model | backend | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly F16 | CUDA | 99 | pp 512 | 1661.19 ± 3.09 | 3984.28 ± 25.45 | 2.40 |
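
For context, here is a minimal, self-contained sketch of the idea (not the actual ggml-cuda.cu diff): run the GEMM with fp16 inputs, fp16 output, and an fp16 compute type instead of upcasting everything to fp32 first, which is presumably where the pp 512 speedup above comes from. Sizes and variable names are illustrative.

```cpp
// A minimal sketch of an fp16 cublasGemmEx call (illustrative, not the PR diff).
// Build with: nvcc -o gemm_f16 gemm_f16.cu -lcublas
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int m = 512, n = 512, k = 512;

    half *A, *B, *C;  // device buffers, column-major
    cudaMalloc(&A, (size_t)m * k * sizeof(half));
    cudaMalloc(&B, (size_t)k * n * sizeof(half));
    cudaMalloc(&C, (size_t)m * n * sizeof(half));
    cudaMemset(A, 0, (size_t)m * k * sizeof(half));
    cudaMemset(B, 0, (size_t)k * n * sizeof(half));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // With CUBLAS_COMPUTE_16F the scaling factors are also fp16.
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // C = alpha * A * B + beta * C, computed entirely in fp16 (CUDA 11+ enum names).
    cublasStatus_t status = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, m,
        B, CUDA_R_16F, k,
        &beta,
        C, CUDA_R_16F, m,
        CUBLAS_COMPUTE_16F,
        CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    printf("cublasGemmEx status: %d\n", (int)status);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```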

@slaren (Collaborator, Author) commented on Sep 27, 2023

Is this actually correct? I believe compute capability 7.0 is Volta, not Turing.

#define CC_TURING 700

@bobqianic (Contributor)

> Is this actually correct? I believe compute capability 7.0 is Volta, not Turing.
>
> #define CC_TURING 700

The compute capability of Turing is 7.5, while that of Volta is 7.0. However, Volta also supports FP16.

[image attachment]
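
To make the capability question concrete, a hedged sketch of the kind of guard being discussed (the constant name and helper below are illustrative, not the exact ggml-cuda.cu code): the fp16 path should be gated on compute capability 7.0 (Volta) and up, rather than being tied to Turing (7.5).

```cpp
// Illustrative guard (not the exact merged code): enable the fp16 cuBLAS
// path from Volta (compute capability 7.0) onward, which also covers
// Turing (7.5) and Ampere (8.x).
#define CC_VOLTA 700

static bool fp16_mat_mul_supported(int compute_capability) {
    return compute_capability >= CC_VOLTA;
}
```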

@ggerganov merged commit da04003 into master on Sep 28, 2023 (34 of 35 checks passed)
@ggerganov (Owner) commented on Sep 28, 2023

Hm, this change might actually degrade the TG performance:

Before:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 3373.39 ± 3.54 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.33 ± 0.32 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2084.74 ± 0.08 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2015.38 ± 0.69 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2042.62 ± 0.34 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1887.36 ± 0.40 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.73 ± 0.42 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1745.14 ± 3.15 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1674.69 ± 4.07 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1816.83 ± 4.45 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1759.79 ± 3.11 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1503.72 ± 1.09 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1314.71 ± 0.08 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 72.73 ± 0.02 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.59 ± 0.03 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 143.87 ± 0.05 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 142.90 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 124.20 ± 0.03 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 125.25 ± 0.03 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.77 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 119.19 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 122.88 ± 0.04 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 128.46 ± 0.03 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 133.71 ± 0.02 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.58 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 100.40 ± 0.07 |

build: 99115f3 (1273)

After:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 5223.21 ± 11.60 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2027.73 ± 1.49 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2074.45 ± 1.03 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2004.63 ± 0.62 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2032.11 ± 0.54 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1878.42 ± 0.38 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2029.70 ± 1.46 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1741.54 ± 2.36 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1684.32 ± 5.19 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1799.69 ± 3.52 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1749.17 ± 2.88 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1497.79 ± 2.87 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1310.20 ± 0.14 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 69.43 ± 0.01 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 91.49 ± 0.02 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 137.75 ± 0.02 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 136.58 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.14 ± 0.01 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.90 ± 0.01 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 92.56 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.45 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 104.18 ± 0.01 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 111.96 ± 0.02 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 116.10 ± 0.01 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.30 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 94.84 ± 0.00 |

build: da04003 (1280)

Still testing to verify

@ggerganov (Owner)

False alarm - forgot to build with LLAMA_CUDA_MMV_Y=2

@slaren deleted the cublas-f16 branch on September 28, 2023, 10:46
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…nov#3370)

* ggml-cuda : perform cublas fp16 matrix multiplication as fp16

* try to fix rocm build

* restrict fp16 mat mul to volta and up
@whoreson (Contributor)

This commit broke llama.cpp on CUDA 10.

identifier "CUBLAS_COMPUTE_16F" is undefined

@whoreson (Contributor)

Let's fix this, OK? I can provide SSH access if needed.

@cebtenzzre (Collaborator)

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

Old CUDA versions seem to be a low priority, but you could open a new issue to track this and maybe someone will fix it eventually.

@ByerRA commented on Nov 1, 2023

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

I am seeing this as well, along with a "CUBLAS_TF32_TENSOR_OP_MATH" is undefined error, when compiling with CUDA 10. It would be nice to get it fixed, or at least to have a workaround so we can get something working now.
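
For anyone stuck on CUDA 10, one possible direction is a compatibility shim that maps the missing identifiers back to their pre-CUDA-11 equivalents. This is an untested sketch under the assumption that cublasGemmEx on CUDA 10 accepts a cudaDataType_t as its compute type; it is not a patch that was merged.

```cpp
// Untested compatibility sketch for CUDA < 11 (an assumption, not a merged fix):
// the cublasComputeType_t enum and TF32 math modes do not exist on CUDA 10,
// so fall back to the older constants.
#include <cuda_runtime.h>
#include <cublas_v2.h>

#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
#define CUBLAS_COMPUTE_16F         CUDA_R_16F
#define CUBLAS_COMPUTE_32F         CUDA_R_32F
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#endif
```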
