CLBlast support #1164
…for context processing
Add buffer reuse code (adapted from slaren's cuda implementation)
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Fix compile warnings
This patch works fine for me on my Intel HD530 iGPU. CLBlast is slower than CPU with prompt ingestion speeds of ~330ms/token vs ~150ms/token on OpenBLAS.
Comparison between latest master with OpenBLAS processing dan.txt versus this PR with CLBlast.
In case anyone is concerned - Occ4m is the main developer for the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged upstream under the MIT license. So there will not be any licensing incompatibilities with KoboldCpp.
I have some thoughts. I think the header ggml-opencl.h should not have all that implementation-specific stuff in there. It should be moved to ggml-opencl.cpp; only the two function definitions that ggml.c uses should stay. Something like this: SlyEcho@9ff5ce8
* Move internal stuff out of header * Use internal enums instead of CLBlast enums * Remove leftover C++ includes and defines * Make event use easier to read Co-authored-by: Henri Vasserman <henv@hot.ee>
Thank you, I added most of those suggestions. You also found some leftover code from previous implementations that I hadn't caught.
ggml-opencl.c
clReleaseEvent(ev_a);
clReleaseEvent(ev_b);
if (dequant) {
    clReleaseEvent(ev_qb);
I think this could be done right after it's used in the clEnqueueNDRangeKernel(), because clEnqueueNDRangeKernel() will increase the reference count and take ownership over the event.
Maybe, I'm not a CL expert.
I tested it and it works.
Yesterday I performed the CLBlast tuning for the Steam Deck, I can check if there is a difference, it takes a few hours to do.
I'll have to rebase onto a newer version soon and implement the dequantization functions that have been added in the meantime. Should I do that or leave the PR as-is and add dequant kernels in a future PR?
I think that output will get pretty crowded if we just add everything to it. Considering we are just adding a bunch of BLAS backends, I think it's fine if it just shows that BLAS is enabled, not which specific backend.
@ggerganov @slaren Anything else that's required here? I think we have reached a good state.
What could be done is: I was thinking that all the different BLAS backends could be abstracted away from ggml.c, so there would only be generic calls. That being said, I think it's better if this PR were merged first.
@Folko-Ven Sadly that is not the case. I tried implementing that to test it, using Intel's recommendations, but found that it slowed Nvidia down, led to OOM errors on Intel, and was straight up not implemented for AMD. I am not sure if I did something wrong or if it is simply not well-supported on OpenCL. If you are interested in specifics of what I tried, you can look at the
Too bad. I'm not so much worried about the extra performance as about the extra memory used. Looks like I'll have to look for a laptop with a dGPU. And I want to thank you again for this CLBlast implementation.
@@ -10902,7 +10936,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         } else if (node->src0->type == GGML_TYPE_F32 && node->src1->type == GGML_TYPE_F32) {
             cur = 0;
         } else if (ggml_is_quantized(node->src0->type) && node->src1->type == GGML_TYPE_F32) {
-#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
+#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
Reuse ggml_cpu_has_blas()
Same comment as for the cuBLAS support: this addition is great since it speeds up perplexity computation a lot. But in the long term, we will be looking into alternative GPU support strategies that are not strongly coupled with ggml (see #914). It's still questionable whether such a strategy can work, but if it does, we will probably drop these BLAS implementations.
        result[index + 1] = (vi >> 4) * d + m;
    }
);
I prefer to have this inlined in ggml-opencl.c and avoid this extra file. We can do this later - it's not a problem.
The way I see it is that at this point of the development we want to have as few files as possible.
It can seem like a weird constraint and requirement, but I really think that we benefit a lot when we have everything in one place. It is more difficult for a new person to understand the structure of the code, but after they get used to it, it becomes a benefit.
In the future, we will split the library in the proper source files and directory structure, but at the start I think it is a better strategy to have everything packed in one place.
Wanted to add, it appears OpenCL performance on AMD is actually better with the opencl-mesa package instead of the opencl-amd package on Arch.
@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?
No idea honestly. Using an RX 570, which is not ancient but not new either.
Has anyone compared speeds between Clover and rusticl OpenCL? Apparently rusticl is getting merged into Mesa soon. Kinda curious if it would be worth going through the trouble of building Mesa from source or just waiting.
@rabidcopy I tried, but Clover doesn't support my RX 6800 XT. I'll try to get rusticl to work and compare it with AMD's pro driver.
I got it to work, but rusticl was approximately 2x slower than the rocm-opencl-runtime for me.
Huh, very strange. I can't even use rocm-opencl-runtime myself, as my card is too old.
@0cc4m are there plans to add multi-GPU support like in the CUDA refactor? https://github.com/ggerganov/llama.cpp/pull/1607/commits
Add CLBlast support as an alternative to cuBLAS to speed up context processing.
The advantage of CLBlast over cuBLAS is that it is vendor-agnostic: it runs on basically any GPU (even some phones). It is also a much smaller library than the proprietary cuBLAS, while managing to be nearly as fast.
Resolves #1059