CLBlast support #1164
…for context processing
Add buffer reuse code (adapted from slaren's cuda implementation)
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Fix compile warnings
This patch works fine for me on my Intel HD530 iGPU. CLBlast is slower than CPU with prompt ingestion speeds of ~330ms/token vs ~150ms/token on OpenBLAS.
Comparison between latest master with OpenBLAS processing dan.txt versus this PR with CLBlast.
In case anyone is concerned - Occ4m is the main developer for the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged upstream under the MIT license. So there will not be any licensing incompatibilities with KoboldCpp.
I have some thoughts. I think the header ggml-opencl.h should not have all that implementation-specific stuff in there. It should be moved to ggml-opencl.cpp; only the two function definitions that ggml.c uses should stay. Something like this: SlyEcho@9ff5ce8
* Move internal stuff out of header * Use internal enums instead of CLBlast enums * Remove leftover C++ includes and defines * Make event use easier to read Co-authored-by: Henri Vasserman <henv@hot.ee>
Thank you, I added most of those suggestions. You also found some leftover code from previous implementations that I hadn't caught.
ggml-opencl.c
clReleaseEvent(ev_a);
clReleaseEvent(ev_b);
if (dequant) {
    clReleaseEvent(ev_qb);
I think this could be done right after it's used in the clEnqueueNDRangeKernel(), because clEnqueueNDRangeKernel() will increase the reference count and take ownership over the event.
Maybe, I'm not a CL expert.
I tested it and it works.
Yesterday I performed the CLBlast tuning for the Steam Deck, I can check if there is a difference, it takes a few hours to do.
I'll have to rebase onto a newer version soon and implement the dequantization functions that have been added in the meantime. Should I do that or leave the PR as-is and add dequant kernels in a future PR?
I think that output will get pretty crowded if we just add everything to it. Considering we are just adding a bunch of BLAS backends, I think it's fine if it just shows that BLAS is enabled, not which specific backend.
@ggerganov @slaren Anything else that's required here? I think we have reached a good state.
What could be done is: I was thinking that all the different BLAS backends could be abstracted away from ggml.c, so there would only be generic calls. That being said, I think it's better if this PR were merged first.
@Folko-Ven Sadly that is not the case. I tried implementing that to test it, using Intel's recommendations, but found that it slowed Nvidia down, led to OOM errors on Intel, and was straight up not implemented for AMD. I am not sure if I did something wrong or if it is simply not well-supported on OpenCL. If you are interested in specifics of what I tried, you can look at the
Too bad. I'm not so much worried about the extra performance as about the extra memory used. Looks like I'll have to look for a laptop with a dGPU. And I want to thank you again for this CLBlast implementation.
@@ -10902,7 +10936,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         } else if (node->src0->type == GGML_TYPE_F32 && node->src1->type == GGML_TYPE_F32) {
             cur = 0;
         } else if (ggml_is_quantized(node->src0->type) && node->src1->type == GGML_TYPE_F32) {
-#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
+#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
Reuse ggml_cpu_has_blas()
Same comment as for the cuBLAS support: this addition is great since it speeds up perplexity computation a lot. But in the long term, we will be looking into alternative GPU support strategies that are not strongly coupled with ggml (see #914). It's still questionable whether such a strategy can work, but if it does, we will probably drop these BLAS implementations.
        result[index + 1] = (vi >> 4) * d + m;
    }
);
I prefer to have this inlined in ggml-opencl.c and avoid this extra file. We can do this later - it's not a problem.
The way I see it is that at this point of the development we want to have as few files as possible.
It can seem like a weird constraint and requirement, but I really think that we benefit a lot when we have everything in one place. It is more difficult for a new person to understand the structure of the code, but after they get used to it, it becomes a benefit.
In the future, we will split the library in the proper source files and directory structure, but at the start I think it is a better strategy to have everything packed in one place.
Wanted to add, it appears OpenCL performance on AMD is actually better with the opencl-mesa package instead of the opencl-amd package on Arch.
@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?
No idea honestly. Using an RX 570, which is not ancient but not new either.
Has anyone compared speeds between Clover and rusticl OpenCL? Apparently rusticl is getting merged into Mesa soon. Kinda curious if it would be worth going through the trouble of building Mesa from source or just waiting.
@rabidcopy I tried, but Clover doesn't support my RX 6800 XT. I'll try to get rusticl to work and compare it with AMD's pro driver.
I got it to work, but rusticl was approximately 2x slower than the rocm-opencl-runtime for me.
Huh, very strange. I can't even use rocm-opencl-runtime myself, as my card is too old.
@0cc4m are there plans to add multi-GPU support like in the CUDA refactor? https://github.com/ggerganov/llama.cpp/pull/1607/commits
Add CLBlast support as an alternative to cuBLAS to speed up context processing.
The advantage of CLBlast over cuBLAS is that it is vendor-agnostic: it runs on basically any GPU (even some phones). It is also a much smaller library than the proprietary cuBLAS, while managing to be nearly as fast.
Resolves #1059