
cuBLAS: use host pinned memory and dequantize while copying #1207

Merged (5 commits) on Apr 29, 2023

Conversation

@slaren (Collaborator) commented on Apr 27, 2023

Copying data to the GPU from pageable host memory is slow because it forces CUDA to first copy the buffer into a non-pageable staging buffer before it can be DMA'd to the GPU. It also means that cudaMemcpyAsync is effectively synchronous.

By allocating the ggml context in non-pageable (pinned) memory, this additional copy is avoided and cudaMemcpyAsync becomes truly asynchronous. This also makes it possible to dequantize one matrix on the GPU while the data for the other matrix is still being copied.
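
The mechanism is roughly the following (a minimal standalone sketch, not the actual ggml-cuda code: the kernel is a stand-in for the real dequantization kernels, and all names and sizes are placeholders):

```cpp
// Pinned host buffers let cudaMemcpyAsync DMA directly to the GPU, so the upload
// of one matrix can overlap with the dequantization of the other on a second stream.
#include <cuda_runtime.h>
#include <cstdio>

#define CUDA_CHECK(x) do { cudaError_t err_ = (x); if (err_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); return 1; } } while (0)

// Placeholder for the real block-wise dequantization kernels.
__global__ void dequantize_placeholder(const unsigned char * q, float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float) q[i];
}

int main() {
    const int n = 1 << 20;

    // Pinned (page-locked) host memory: no staging copy, truly asynchronous uploads.
    unsigned char * a_quant_host; float * b_host;
    CUDA_CHECK(cudaMallocHost((void **) &a_quant_host, n));
    CUDA_CHECK(cudaMallocHost((void **) &b_host, n * sizeof(float)));

    unsigned char * a_quant_dev; float * a_dequant_dev; float * b_dev;
    CUDA_CHECK(cudaMalloc((void **) &a_quant_dev, n));
    CUDA_CHECK(cudaMalloc((void **) &a_dequant_dev, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **) &b_dev, n * sizeof(float)));

    cudaStream_t s0, s1;
    CUDA_CHECK(cudaStreamCreate(&s0));
    CUDA_CHECK(cudaStreamCreate(&s1));

    // Stream 0: upload the quantized matrix and dequantize it on the GPU.
    CUDA_CHECK(cudaMemcpyAsync(a_quant_dev, a_quant_host, n, cudaMemcpyHostToDevice, s0));
    dequantize_placeholder<<<(n + 255) / 256, 256, 0, s0>>>(a_quant_dev, a_dequant_dev, n);

    // Stream 1: upload the other (f32) matrix concurrently with the work on stream 0.
    CUDA_CHECK(cudaMemcpyAsync(b_dev, b_host, n * sizeof(float), cudaMemcpyHostToDevice, s1));

    // Both streams must complete before the cuBLAS GEMM uses the dequantized data.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaStreamDestroy(s0)); CUDA_CHECK(cudaStreamDestroy(s1));
    CUDA_CHECK(cudaFreeHost(a_quant_host)); CUDA_CHECK(cudaFreeHost(b_host));
    CUDA_CHECK(cudaFree(a_quant_dev)); CUDA_CHECK(cudaFree(a_dequant_dev)); CUDA_CHECK(cudaFree(b_dev));
    return 0;
}
```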

To observe most of the benefit, this has to be used with --no-mmap; otherwise the weights are stored in pageable, memory-mapped memory. With mmap enabled there is still some gain from the non-weight matrices. In the future this will be addressed by caching the weights in GPU memory, avoiding the copy entirely.
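
(For reference, the perplexity run is then launched with mmap disabled, along the lines of `./perplexity -m <model> -f <test text> --no-mmap`; the model and text file paths are placeholders, --no-mmap is the relevant part.)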

To avoid adding a CUDA-only function to the ggml interface, llama.cpp has been modified to include ggml-cuda.h when cuBLAS is enabled.
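
Concretely, this amounts to a conditional include guarded by the build-time macro (a sketch; the macro name is assumed from the cuBLAS build option):

```cpp
// Sketch: when the cuBLAS build option defines GGML_USE_CUBLAS, llama.cpp pulls in
// ggml-cuda.h so the pinned host memory helpers are visible without touching ggml.h.
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif
```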

For me, this represents a ~30% speedup in perplexity times with cuBLAS.

PR: [screenshot of perplexity timings with this PR]

Master: [screenshot of perplexity timings on master]

@dfyz (Collaborator) commented on Apr 28, 2023

I think these changes look great. You said elsewhere that this stuff might cause "some friction", but I think it turned out to be very non-intrusive. The CUDA code is still relatively self-contained and separated from the ggml core.

Of course, @ggerganov might have a different opinion, but I think this should be merged as-is.

@SlyEcho (Sponsor, Collaborator) commented on Apr 28, 2023

Unrelated stuff

On AMD I'm noticing something funny: it creates 64 additional GPU threads. If I use --memory_f32, it doesn't.

[screenshot showing the additional GPU threads]

Otherwise it works too. I will add the additional definitions to my port so it can be merged.

EDIT: 5.07 seconds per pass, ETA 55 minutes; let's see in an hour or so.

@slaren (Collaborator, Author) commented on Apr 28, 2023

@SlyEcho are you sure that this is with this branch and not cuda-f16f32? That one does create 64 additional streams.

@SlyEcho (Sponsor, Collaborator) commented on Apr 28, 2023

@slaren, you are quite right, this is slaren/cuda-f16f32.

But it does have the same changes included?

Anyway, perplexity on Q4_0 was [655]6.2838

@slaren (Collaborator, Author) commented on Apr 28, 2023

> But it does have the same changes included?

Yes, that branch is built on top of this one, with additional changes to the f16 x f32 mat mul.

@ggerganov (Owner) left a comment

Probably have to merge #1164 first since @0cc4m has been waiting for a while.
Will take a look now
