
CLBlast support #1164

Merged 20 commits into ggerganov:master on Apr 28, 2023
Conversation

0cc4m (Collaborator) commented Apr 24, 2023

Add CLBlast support as an alternative to cuBLAS to speed up context processing.

The advantage of CLBlast over cuBLAS is that it is vendor-agnostic: it runs on basically any GPU (even some phones). It is also a much smaller library than the proprietary cuBLAS, while managing to be nearly as fast.

Resolves #1059

0cc4m changed the title from "Clblast llama cpp" to "CLBlast support" on Apr 24, 2023
ghost commented Apr 24, 2023

This patch works fine for me on my Intel HD530 iGPU. CLBlast is slower than the CPU (OpenBLAS) path, with prompt ingestion speeds of ~330 ms/token vs ~150 ms/token on OpenBLAS.

rabidcopy (Contributor) commented Apr 25, 2023

Comparison between latest master with OpenBLAS processing dan.txt versus this PR with CLBlast.
OpenBLAS on Ryzen 2600:
llama_print_timings: prompt eval time = 35540.49 ms / 399 tokens ( 89.07 ms per token)
CLBlast on RX 570:
llama_print_timings: prompt eval time = 20087.81 ms / 399 tokens ( 50.35 ms per token)

LostRuins (Collaborator) commented:

In case anyone is concerned: 0cc4m is the main developer of the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged upstream under the MIT license. So there will not be any licensing incompatibilities with KoboldCpp.

SlyEcho (Sponsor Collaborator) commented Apr 25, 2023

I have some thoughts.

I think the header ggml-opencl.h should not have all that implementation-specific stuff in it. That should be moved to ggml-opencl.cpp; only the two function declarations that ggml.c uses should stay.

something like this: SlyEcho@9ff5ce8

* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read

Co-authored-by: Henri Vasserman <henv@hot.ee>
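A minimal sketch of what such a trimmed-down ggml-opencl.h could look like (illustrative only: the function names and parameter lists below are assumptions, not necessarily the PR's final API; the point is that everything OpenCL/CLBlast-specific stays in the implementation file):

    // ggml-opencl.h: hypothetical trimmed header exposing only what ggml.c needs.
    // All cl_* handles, kernel sources and CLBlast includes live in the implementation file.
    #pragma once

    #ifdef __cplusplus
    extern "C" {
    #endif

    // One-time setup of the OpenCL platform, device, context, queue and kernels.
    void ggml_cl_init(void);

    // Quantization-aware SGEMM wrapper: dequantizes A on the device if needed, then
    // computes C = alpha*A*B + beta*C via CLBlast. Parameter names are illustrative.
    void ggml_cl_sgemm_wrapper(int order, int trans_a, int trans_b,
                               int m, int n, int k,
                               float alpha, const void * a, int lda,
                               const float * b, int ldb,
                               float beta, float * c, int ldc,
                               int a_type);

    #ifdef __cplusplus
    }
    #endif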
0cc4m (Collaborator, Author) commented Apr 26, 2023

Thank you, I added most of those suggestions. You also found some leftover code from previous implementations that I hadn't caught.

Review thread on ggml-opencl.c (outdated):

    clReleaseEvent(ev_a);
    clReleaseEvent(ev_b);
    if (dequant) {
        clReleaseEvent(ev_qb);
    }
Sponsor Collaborator: I think this could be done right after it is used in clEnqueueNDRangeKernel(), because clEnqueueNDRangeKernel() will increase the reference count and take ownership of the event.

Sponsor Collaborator: Maybe; I'm not a CL expert.

Collaborator (Author): I tested it and it works.
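For readers following along, a small self-contained sketch of the pattern settled on here (variable and helper names are made up, not the PR's code): the host releases its references to the wait-list events right after the enqueue that depends on them, and the runtime keeps the events alive until the dependent command has finished.

    #include <CL/cl.h>

    // Sketch: enqueue a kernel that waits on two transfer events, then drop our
    // references to those events immediately instead of holding them until the end.
    static cl_int enqueue_with_deps(cl_command_queue queue, cl_kernel kernel,
                                    size_t global, size_t local,
                                    cl_event ev_a, cl_event ev_b) {
        cl_event deps[2] = { ev_a, ev_b };
        cl_event ev_kernel;

        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                            2, deps, &ev_kernel);
        if (err != CL_SUCCESS) {
            return err;
        }

        // Our references are no longer needed; the pending kernel still waits on them.
        clReleaseEvent(ev_a);
        clReleaseEvent(ev_b);

        // Wait for the kernel itself, then release its event as well.
        err = clWaitForEvents(1, &ev_kernel);
        clReleaseEvent(ev_kernel);
        return err;
    }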

SlyEcho (Sponsor Collaborator) commented Apr 27, 2023

Yesterday I performed the CLBlast tuning for the Steam Deck. I can check if there is a difference; it takes a few hours to do.

0cc4m (Collaborator, Author) commented Apr 27, 2023

I'll have to rebase onto a newer version soon and implement the dequantization functions that have been added in the meantime. Should I do that or leave the PR as-is and add dequant kernels in a future PR?

0cc4m (Collaborator, Author) commented Apr 27, 2023

> Please add CUBLAS and CLBLAS to these lines:
>
> const char * llama_print_system_info(void) {

I think that output will get pretty crowded if we just add everything to it. Considering we are just adding a bunch of BLAS backends, I think it's fine if it just shows that BLAS is enabled, not which specific backend.
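For illustration (a sketch, not the actual llama.cpp change): the existing ggml_cpu_has_blas() helper already collapses all of these backends into a single flag, so the system-info line can stay generic.

    #include <stdio.h>
    #include "ggml.h"   // declares int ggml_cpu_has_blas(void)

    // Hypothetical helper in the spirit of llama_print_system_info(): one generic
    // "BLAS" entry instead of one entry per backend (OpenBLAS, cuBLAS, CLBlast, ...).
    const char * blas_system_info(void) {
        static char buf[32];
        snprintf(buf, sizeof(buf), "BLAS = %d | ", ggml_cpu_has_blas());
        return buf;
    }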

0cc4m (Collaborator, Author) commented Apr 28, 2023

@ggerganov @slaren Anything else that's required here? I think we have reached a good state.

SlyEcho (Sponsor Collaborator) commented Apr 28, 2023

I was thinking that all the different BLAS backends could be abstracted away from ggml.c, so there would only be generic calls like ggml_blas_alloc_mem(), ggml_blas_memcpy_host_device(), ggml_blas_dequantize() and ggml_blas_sgemm(). This would work for OpenBLAS too, because the allocation and memory copy would be no-ops.

That being said, I think it's better if this PR were merged first.
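A rough sketch of what such a backend-neutral layer could look like (the four function names are taken from the comment above; the signatures, the opaque handle type and the extra free function are invented here for illustration):

    #include <stddef.h>
    #include <stdbool.h>

    // Hypothetical interface that ggml.c would call; OpenBLAS, cuBLAS and CLBlast would
    // each provide their own implementation behind it.
    typedef void * ggml_blas_buf; // plain host pointer for OpenBLAS, device memory otherwise

    ggml_blas_buf ggml_blas_alloc_mem(size_t size);                                      // malloc/no-op for OpenBLAS
    void ggml_blas_memcpy_host_device(ggml_blas_buf dst, const void * src, size_t size); // no-op copy for OpenBLAS
    void ggml_blas_dequantize(ggml_blas_buf dst_f32, ggml_blas_buf src_q, int type, size_t n);
    void ggml_blas_sgemm(bool trans_a, bool trans_b, int m, int n, int k,
                         float alpha, ggml_blas_buf a, int lda,
                         ggml_blas_buf b, int ldb,
                         float beta, ggml_blas_buf c, int ldc);
    void ggml_blas_free_mem(ggml_blas_buf buf); // not in the original list, but presumably needed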

Folko-Ven (Contributor) commented:

@0cc4m Hi, I came across this conversation and I have a question: if I use an iGPU, is the useless copying of data between system RAM and the RAM dedicated to the iGPU avoided?

0cc4m (Collaborator, Author) commented Apr 28, 2023

@Folko-Ven Sadly that is not the case. I tried implementing that to test it, using Intel's recommendations, but found that it slowed Nvidia down, led to OOM errors on Intel and was straight up not implemented for AMD. I am not sure if I did something wrong or if it is simply not well-supported on OpenCL. If you are interested in specifics of what I tried, you can look at the clblast-llama-cpp-igpu branch on my fork.
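For readers wondering what was attempted: the usual zero-copy recommendation for integrated GPUs is to let the driver allocate host-visible memory and map it, instead of copying through clEnqueueWriteBuffer. A sketch of that general pattern is below (illustrative only; this is not the code from the clblast-llama-cpp-igpu branch):

    #include <CL/cl.h>
    #include <string.h>

    // Allocate a host-visible buffer and fill it in place via map/unmap. On a
    // shared-memory iGPU this can avoid an extra copy; on discrete GPUs it often
    // performs worse, which matches the behaviour described above.
    static cl_mem alloc_and_fill(cl_context ctx, cl_command_queue queue,
                                 const float * src, size_t n, cl_int * err) {
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                    n * sizeof(float), NULL, err);
        if (*err != CL_SUCCESS) return NULL;

        float * mapped = (float *) clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                                      0, n * sizeof(float), 0, NULL, NULL, err);
        if (*err != CL_SUCCESS) { clReleaseMemObject(buf); return NULL; }

        memcpy(mapped, src, n * sizeof(float));
        *err = clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
        return buf;
    }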

Folko-Ven (Contributor) commented:

Too bad. I'm less worried about the extra performance than about the extra memory used. Looks like I'll have to look for a laptop with a dGPU. And I want to thank you again for this CLBlast implementation.

@@ -10902,7 +10936,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 } else if (node->src0->type == GGML_TYPE_F32 && node->src1->type == GGML_TYPE_F32) {
     cur = 0;
 } else if (ggml_is_quantized(node->src0->type) && node->src1->type == GGML_TYPE_F32) {
-#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
+#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
ggerganov (Owner): Reuse ggml_cpu_has_blas()
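For context, ggml.h already exposes a runtime helper, int ggml_cpu_has_blas(void), that reports whether any BLAS backend was compiled in. A rough illustration of the idea (not the actual ggml.c change):

    #include "ggml.h"

    // Hypothetical scratch-size estimate for a quantized-src0 x F32-src1 matmul: when any
    // BLAS backend is available, reserve an F32 buffer for the dequantized src0, without
    // repeating the ACCELERATE/OPENBLAS/CUBLAS/CLBLAST define list at every call site.
    static size_t mul_mat_scratch_bytes(size_t ne0, size_t ne1) {
        if (ggml_cpu_has_blas()) {
            return sizeof(float) * ne0 * ne1;
        }
        return 0; // the non-BLAS path sizes its scratch differently (omitted here)
    }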

ggerganov (Owner) left a comment:
Same comment as for the cuBLAS support:
This addition is great, since it speeds up perplexity computation a lot.
But in the long term, we will be looking into alternative GPU support strategies that are not strongly coupled with ggml (see #914). It's still questionable whether such a strategy can work, but if it does, we will probably drop these BLAS implementations.

Excerpt from the separate OpenCL kernel-source file under review:

    result[index + 1] = (vi >> 4) * d + m;
}
);
ggerganov (Owner):
I prefer to have this inlined in ggml-opencl.c and avoid this extra file. We can do this later; it's not a problem.

The way I see it is that at this point of development we want to have as few files as possible.
It can seem like a weird constraint and requirement, but I really think that we benefit a lot when we have everything in one place. It is more difficult for a new person to understand the structure of the code, but after they get used to it, it becomes a benefit.

In the future, we will split the library into the proper source files and directory structure, but at the start I think it is a better strategy to have everything packed in one place.
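For readers unfamiliar with the layout being discussed: keeping the OpenCL kernel source inlined in ggml-opencl.c simply means embedding it as a C string and compiling it at runtime. A generic sketch (the kernel body here is a placeholder, not the PR's dequantization kernel):

    #include <CL/cl.h>

    // Kernel source embedded directly in the .c file; no separate .cl file to ship.
    static const char * kernel_src =
        "__kernel void scale_f32(__global float * x, const float s) {\n"
        "    const int i = get_global_id(0);\n"
        "    x[i] *= s;\n"
        "}\n";

    static cl_program build_program(cl_context ctx, cl_device_id dev, cl_int * err) {
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, err);
        if (*err != CL_SUCCESS) return NULL;
        *err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return prog;
    }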

ggerganov merged commit 7296c96 into ggerganov:master on Apr 28, 2023
rabidcopy (Contributor) commented:

Wanted to add: it appears OpenCL performance on AMD is actually better with the opencl-mesa package than with the opencl-amd package on Arch.
llama_print_timings: prompt eval time = 15324.17 ms / 399 tokens ( 38.41 ms per token) (roughly 10 ms per token faster than opencl-amd)

0cc4m deleted the clblast-llama-cpp branch on April 28, 2023 at 15:26
0cc4m (Collaborator, Author) commented Apr 28, 2023

@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?

rabidcopy (Contributor) commented Apr 28, 2023


No idea, honestly. I'm using an RX 570, which is not ancient but not new either.
Platform: Clover, Device: AMD Radeon RX 570 Series (polaris10, LLVM 15.0.7, DRM 3.49, 6.2.7-zen1-1-zen)

rabidcopy (Contributor) commented:

Has anyone compared speeds between Clover and rusticl OpenCL? Apparently rusticl OpenCL is getting merged into Mesa soon. Kinda curious if it would be worth going through the trouble to build Mesa from source or just wait.

0cc4m (Collaborator, Author) commented Apr 29, 2023

@rabidcopy I tried, but Clover doesn't support my RX 6800 XT. I'll try to get rusticl to work and compare it with AMD's pro driver.

0cc4m (Collaborator, Author) commented Apr 29, 2023

I got it to work, but rusticl was approximately 2x slower than the rocm-opencl-runtime for me.

rabidcopy (Contributor) commented:

Huh, very strange. I can't even use rocm-opencl-runtime, as my card is too old.

SlyEcho mentioned this pull request on May 24, 2023
Eliastrt commented Jun 6, 2023

@0cc4m Are there plans to add multi-GPU support like in the CUDA refactor? https://github.com/ggerganov/llama.cpp/pull/1607/commits
