
cuBLAS: use host pinned memory and dequantize while copying #1207

Merged (5 commits) on Apr 29, 2023

Conversation

@slaren (Collaborator) commented on Apr 27, 2023

Copying data to the GPU from pageable host memory is slow because it forces CUDA to first copy the buffer into a non-pageable staging buffer before it can be DMA'd to the GPU. It also means that cudaMemcpyAsync is effectively synchronous.

By allocating the ggml context in non-pageable (pinned) memory, this additional copy is avoided and cudaMemcpyAsync becomes truly asynchronous. This also makes it possible to dequantize one matrix on the GPU while the data for the other matrix is still being copied.
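
The mechanism is roughly the following (a minimal standalone sketch, not the actual ggml-cuda code: the kernel is a stand-in for the real dequantization kernels, and all names and sizes are placeholders):

```cpp
// Pinned host buffers let cudaMemcpyAsync DMA directly to the GPU, so the upload
// of one matrix can overlap with the dequantization of the other on a second stream.
#include <cuda_runtime.h>
#include <cstdio>

#define CUDA_CHECK(x) do { cudaError_t err_ = (x); if (err_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); return 1; } } while (0)

// Placeholder for the real block-wise dequantization kernels.
__global__ void dequantize_placeholder(const unsigned char * q, float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float) q[i];
}

int main() {
    const int n = 1 << 20;

    // Pinned (page-locked) host memory: no staging copy, truly asynchronous uploads.
    unsigned char * a_quant_host; float * b_host;
    CUDA_CHECK(cudaMallocHost((void **) &a_quant_host, n));
    CUDA_CHECK(cudaMallocHost((void **) &b_host, n * sizeof(float)));

    unsigned char * a_quant_dev; float * a_dequant_dev; float * b_dev;
    CUDA_CHECK(cudaMalloc((void **) &a_quant_dev, n));
    CUDA_CHECK(cudaMalloc((void **) &a_dequant_dev, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **) &b_dev, n * sizeof(float)));

    cudaStream_t s0, s1;
    CUDA_CHECK(cudaStreamCreate(&s0));
    CUDA_CHECK(cudaStreamCreate(&s1));

    // Stream 0: upload the quantized matrix and dequantize it on the GPU.
    CUDA_CHECK(cudaMemcpyAsync(a_quant_dev, a_quant_host, n, cudaMemcpyHostToDevice, s0));
    dequantize_placeholder<<<(n + 255) / 256, 256, 0, s0>>>(a_quant_dev, a_dequant_dev, n);

    // Stream 1: upload the other (f32) matrix concurrently with the work on stream 0.
    CUDA_CHECK(cudaMemcpyAsync(b_dev, b_host, n * sizeof(float), cudaMemcpyHostToDevice, s1));

    // Both streams must complete before the cuBLAS GEMM uses the dequantized data.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaStreamDestroy(s0)); CUDA_CHECK(cudaStreamDestroy(s1));
    CUDA_CHECK(cudaFreeHost(a_quant_host)); CUDA_CHECK(cudaFreeHost(b_host));
    CUDA_CHECK(cudaFree(a_quant_dev)); CUDA_CHECK(cudaFree(a_dequant_dev)); CUDA_CHECK(cudaFree(b_dev));
    return 0;
}
```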

To observe most of the benefit, this has to be used with --no-mmap; otherwise the weights are stored in pageable, memory-mapped memory. With mmap enabled there is still some gain from the non-weight matrices. In the future this will be addressed by caching the weights in GPU memory, avoiding the copy entirely.
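
(For reference, the perplexity run is then launched with mmap disabled, along the lines of `./perplexity -m <model> -f <test text> --no-mmap`; the model and text file paths are placeholders, --no-mmap is the relevant part.)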

To avoid adding a CUDA-only function to the ggml interface, llama.cpp has been modified to include ggml-cuda.h when cuBLAS is enabled.
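
Concretely, this amounts to a conditional include guarded by the build-time macro (a sketch; the macro name is assumed from the cuBLAS build option):

```cpp
// Sketch: when the cuBLAS build option defines GGML_USE_CUBLAS, llama.cpp pulls in
// ggml-cuda.h so the pinned host memory helpers are visible without touching ggml.h.
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif
```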

For me, this represents a ~30% speedup in perplexity times with cuBLAS.

PR: [screenshot of perplexity timings with this PR]

Master: [screenshot of perplexity timings on master]

@dfyz (Collaborator) commented on Apr 28, 2023

I think these changes look great. You said elsewhere that this stuff might cause "some friction", but I think it turned out to be very non-intrusive. The CUDA code is still relatively self-contained and separated from the ggml core.

Of course, @ggerganov might have a different opinion, but I think this should be merged as-is.

@SlyEcho (Sponsor, Collaborator) commented on Apr 28, 2023

Unrelated stuff

On AMD I'm noticing something funny: it creates 64 additional GPU threads. If I use --memory_f32, it doesn't.

[screenshot showing the additional GPU threads]

Otherwise it works too. I will add the additional definitions to my port so it can be merged.

EDIT: 5.07 seconds per pass, ETA 55 minutes; let's see in an hour or so.

@slaren (Collaborator, Author) commented on Apr 28, 2023

@SlyEcho are you sure that this is with this branch and not cuda-f16f32? That one does create 64 additional streams.

@SlyEcho (Sponsor, Collaborator) commented on Apr 28, 2023

@slaren, you are quite right, this is slaren/cuda-f16f32.

But it does have the same changes included?

Anyway, perplexity on Q4_0 was [655]6.2838

@slaren (Collaborator, Author) commented on Apr 28, 2023

> But it does have the same changes included?

Yes, that branch is built on top of this one, with additional changes to the f16 x f32 mat mul.

@ggerganov (Owner) left a comment

Probably have to merge #1164 first since @0cc4m has been waiting for a while.
Will take a look now
