
ggml-cuda : perform cublas fp16 matrix multiplication as fp16 #3370

Merged 3 commits into master from cublas-f16 on Sep 28, 2023

Conversation

@slaren (Collaborator) commented on Sep 27, 2023

Improves prompt processing performance with fp16 models.

3090 Ti/WSL2:

| model | backend | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly F16 | CUDA | 99 | pp 512 | 1661.19 ± 3.09 | 3984.28 ± 25.45 | 2.40 |
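
For context, here is a minimal, self-contained sketch of the idea (not the actual ggml-cuda.cu diff): run the GEMM with fp16 inputs, fp16 output, and an fp16 compute type instead of upcasting everything to fp32 first, which is presumably where the pp 512 speedup above comes from. Sizes and variable names are illustrative.

```cpp
// A minimal sketch of an fp16 cublasGemmEx call (illustrative, not the PR diff).
// Build with: nvcc -o gemm_f16 gemm_f16.cu -lcublas
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int m = 512, n = 512, k = 512;

    half *A, *B, *C;  // device buffers, column-major
    cudaMalloc(&A, (size_t)m * k * sizeof(half));
    cudaMalloc(&B, (size_t)k * n * sizeof(half));
    cudaMalloc(&C, (size_t)m * n * sizeof(half));
    cudaMemset(A, 0, (size_t)m * k * sizeof(half));
    cudaMemset(B, 0, (size_t)k * n * sizeof(half));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // With CUBLAS_COMPUTE_16F the scaling factors are also fp16.
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // C = alpha * A * B + beta * C, computed entirely in fp16 (CUDA 11+ enum names).
    cublasStatus_t status = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, m,
        B, CUDA_R_16F, k,
        &beta,
        C, CUDA_R_16F, m,
        CUBLAS_COMPUTE_16F,
        CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    printf("cublasGemmEx status: %d\n", (int)status);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```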

@slaren (Collaborator, Author) commented on Sep 27, 2023

Is this actually correct? I believe compute capability 7.0 is Volta, not Turing.

#define CC_TURING 700

@bobqianic (Contributor)

> Is this actually correct? I believe compute capability 7.0 is Volta, not Turing.
>
> #define CC_TURING 700

The compute capability of Turing is 7.5, while that of Volta is 7.0. However, Volta also supports FP16.

[image attachment]
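
To make the capability question concrete, a hedged sketch of the kind of guard being discussed (the constant name and helper below are illustrative, not the exact ggml-cuda.cu code): the fp16 path should be gated on compute capability 7.0 (Volta) and up, rather than being tied to Turing (7.5).

```cpp
// Illustrative guard (not the exact merged code): enable the fp16 cuBLAS
// path from Volta (compute capability 7.0) onward, which also covers
// Turing (7.5) and Ampere (8.x).
#define CC_VOLTA 700

static bool fp16_mat_mul_supported(int compute_capability) {
    return compute_capability >= CC_VOLTA;
}
```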

@ggerganov merged commit da04003 into master on Sep 28, 2023 (34 of 35 checks passed)
@ggerganov (Owner) commented on Sep 28, 2023

Hm, this change might actually degrade the TG performance:

Before:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 3373.39 ± 3.54 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.33 ± 0.32 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2084.74 ± 0.08 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2015.38 ± 0.69 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2042.62 ± 0.34 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1887.36 ± 0.40 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.73 ± 0.42 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1745.14 ± 3.15 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1674.69 ± 4.07 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1816.83 ± 4.45 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1759.79 ± 3.11 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1503.72 ± 1.09 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1314.71 ± 0.08 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 72.73 ± 0.02 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.59 ± 0.03 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 143.87 ± 0.05 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 142.90 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 124.20 ± 0.03 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 125.25 ± 0.03 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.77 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 119.19 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 122.88 ± 0.04 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 128.46 ± 0.03 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 133.71 ± 0.02 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.58 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 100.40 ± 0.07 |

build: 99115f3 (1273)

After:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 5223.21 ± 11.60 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2027.73 ± 1.49 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2074.45 ± 1.03 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2004.63 ± 0.62 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2032.11 ± 0.54 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1878.42 ± 0.38 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2029.70 ± 1.46 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1741.54 ± 2.36 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1684.32 ± 5.19 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1799.69 ± 3.52 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1749.17 ± 2.88 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1497.79 ± 2.87 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1310.20 ± 0.14 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 69.43 ± 0.01 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 91.49 ± 0.02 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 137.75 ± 0.02 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 136.58 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.14 ± 0.01 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.90 ± 0.01 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 92.56 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.45 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 104.18 ± 0.01 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 111.96 ± 0.02 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 116.10 ± 0.01 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.30 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 94.84 ± 0.00 |

build: da04003 (1280)

Still testing to verify

@ggerganov (Owner)

False alarm - forgot to build with LLAMA_CUDA_MMV_Y=2

@slaren deleted the cublas-f16 branch on September 28, 2023, 10:46
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…nov#3370)

* ggml-cuda : perform cublas fp16 matrix multiplication as fp16

* try to fix rocm build

* restrict fp16 mat mul to volta and up
@whoreson (Contributor)

This commit broke llama.cpp on CUDA 10.

identifier "CUBLAS_COMPUTE_16F" is undefined

@whoreson (Contributor)

Let's fix this, OK? I can provide SSH access if needed.

@cebtenzzre (Collaborator)

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

Old CUDA versions seem to be a low priority, but you could open a new issue to track this and maybe someone will fix it eventually.

@ByerRA commented on Nov 1, 2023

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

I am seeing this as well, along with a "CUBLAS_TF32_TENSOR_OP_MATH" is undefined error, when compiling with CUDA 10. It would be nice to get it fixed, or at least to have a workaround so we can get something working now.
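
For anyone stuck on CUDA 10, one possible direction is a compatibility shim that maps the missing identifiers back to their pre-CUDA-11 equivalents. This is an untested sketch under the assumption that cublasGemmEx on CUDA 10 accepts a cudaDataType_t as its compute type; it is not a patch that was merged.

```cpp
// Untested compatibility sketch for CUDA < 11 (an assumption, not a merged fix):
// the cublasComputeType_t enum and TF32 math modes do not exist on CUDA 10,
// so fall back to the older constants.
#include <cuda_runtime.h>
#include <cublas_v2.h>

#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
#define CUBLAS_COMPUTE_16F         CUDA_R_16F
#define CUBLAS_COMPUTE_32F         CUDA_R_32F
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#endif
```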
