OpenCL: Fixes for older devices. #1435
Conversation
I managed to crash my GPU when I tried using Clover. Maybe we should have a platform blocklist? Or could it be made to work…
Thanks very much for working on this @SlyEcho. It is now picking the right device the first time. But now I get a new error message. Full log:
Sorry, I have some difficulties running OpenCL right now because my home desktop is completely locked up from the broken Mesa Clover driver. I have put this in draft for now and will change it back when I have run tests on this code.
Ohh OK, sorry, I understand now. It's because I had a short prompt; I forgot that CLBlast is only for prompt evaluation at the moment? I just wrote a really long prompt like you had, and then I saw some GPU usage. Thank you again @SlyEcho!
Yes, it's currently only for prompt evaluation, and only activates when it reaches a specific length, I think. 😄
It doesn't do the generating on OpenCL right now, like it does for CUDA. Maybe if I get my setup running again I can take a shot at it. Probably in a different PR.
Understood. It's cool just seeing it using the GPU at all! But of course it'd be amazing if it could one day do the same as CUBLAS as well. Thanks for all your work on this.
Testing
I got my Steam Deck to run containers again and was able to run the code there. Model files:
Testing command:
for q in q4_0 q4_1 q5_0 q5_1 q8_0 f16; do
    ./bin/perplexity -m ../models/llama-7b-$q.bin --no-mmap -f ../models/wiki.test.mini;
done
Results:
7B Q4_0
7B Q4_1
7B Q5_0
7B Q5_1
7B Q8_0
7B F16
I think Q8_0 is now fixed.
I got relatively far with that already. I'll make a PR soon.
OK, I tested it on an old Mac as well and it "works". It needs a small batch size (32) or CLBlast will not get enough memory. It is also 4 times slower than the CPU.
Nice! I tested it and it works as intended, mostly. It still selects my CPU by default, even though your code looks like it should pick the GPU. I'll take a closer look later today unless you figure it out first.
@0cc4m, feel free to take my really complex platform and device init code and experiment. I think I tried to follow the "default" logic where I pass in NULL. I think it is kind of similar to what

We could iterate all platforms and devices and select the first GPU, although I would also try to skip the Clover platform, because for me it is even worse than just not working: it causes the GPU to lock up, requiring the GPU to be reset in the best case and power cycled in the worst case.
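A minimal sketch of that kind of enumeration (illustrative only, not the code in this PR; the 16-entry limit and the function name are made up): walk every platform, skip Clover by name, and return the first GPU found, falling back to whatever device turned up first.

#include <string.h>
#include <CL/cl.h>

#define MAX_CL_ENTRIES 16

static cl_device_id pick_default_device(void) {
    cl_platform_id platforms[MAX_CL_ENTRIES];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(MAX_CL_ENTRIES, platforms, &n_platforms);

    cl_device_id fallback = NULL;
    for (cl_uint p = 0; p < n_platforms; p++) {
        char pname[128] = {0};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);
        if (strstr(pname, "Clover") != NULL) {
            continue; // never auto-select Clover, it can lock up the GPU
        }
        cl_device_id devices[MAX_CL_ENTRIES];
        cl_uint n_devices = 0;
        cl_int err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, MAX_CL_ENTRIES, devices, &n_devices);
        if (err != CL_SUCCESS) {
            continue; // includes CL_DEVICE_NOT_FOUND for platforms with no devices
        }
        for (cl_uint d = 0; d < n_devices; d++) {
            cl_device_type type = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            if (type & CL_DEVICE_TYPE_GPU) {
                return devices[d]; // first GPU wins
            }
            if (fallback == NULL) {
                fallback = devices[d]; // remember a CPU/other device as a last resort
            }
        }
    }
    return fallback;
}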
Since we don't access the value directly, it might as well be a

It really does depend so much on the platform itself, since the code has to be compiled at run time. Older drivers have worse compiler support, while newer ones have something like modern LLVM. Would be nice to have SPIR-V or something...
Platform: AMD Accelerated Parallel Processing

It's an old Radeon HD 8570 with 1 GB of VRAM. I was mainly using it just for testing purposes; I found that its performance on prompt evaluation was worse than the CPU, so supporting it, if it's the only one exhibiting the issue, is not a need for me. Just thought I would make y'all aware of it in case it affects other users. This card was a nightmare to get set up: only one particular version of AMD's software works on one particular version of Ubuntu (20.04), and I still needed to fiddle with library paths and names.
The latest version of ROCm is 5.5; did you mean Ubuntu 20.04?
AMD's documentation is confusing, sorry. I meant (I think) Radeon Software / amdgpu: https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-40
@0cc4m, now everything (and I mean everything) is wrapped in CL_CHECK.

I figured out a way to even handle the calls that return their error by reference; I do that by (ab)using the comma operator. Maybe with C++ there would be a better way with lambdas or whatever.
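Roughly, the idea looks like this (a sketch, not the exact macro from the PR): CL_CHECK takes any expression that evaluates to a cl_int, and for functions like clCreateBuffer that return their status through a pointer argument, the comma operator lets the same macro check that status.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CL_CHECK(err)                                          \
    do {                                                       \
        cl_int err_ = (err);                                   \
        if (err_ != CL_SUCCESS) {                              \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",      \
                    err_, __FILE__, __LINE__);                 \
            exit(1);                                           \
        }                                                      \
    } while (0)

static cl_mem alloc_buffer(cl_context context, size_t size) {
    cl_int err;
    cl_mem buf;
    // clCreateBuffer returns the buffer and writes its status into err;
    // the comma operator runs the call first, then hands err to CL_CHECK.
    CL_CHECK((buf = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err), err));
    return buf;
}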
I like the CL_CHECK improvements. Only some oddities with the device selection code left. It's getting a little too complicated for my taste, but I see the purpose.
clang-tidy made some suggestions
I have now rewritten the selection logic: it first scans all devices in all platforms (well, up to 16 platforms and 16 devices...), then the user selection is applied. I also match by the platform vendor string. It now tries to choose GPU devices by default.

GGML_OPENCL_PLATFORM=AMD ./main     # select AMD and look for GPUs there
GGML_OPENCL_PLATFORM=pocl ./main    # choose pocl, but it has no GPU, so show a warning
GGML_OPENCL_PLATFORM=rusticl ./main # abort because there are no devices
GGML_OPENCL_PLATFORM=0 ./main       # pocl, because it's the first platform
GGML_OPENCL_DEVICE=Intel ./main     # find a device named Intel, which pocl has
GGML_OPENCL_DEVICE=gfx900 ./main    # use the Vega from the AMD OpenCL platform
GGML_OPENCL_DEVICE=Vega ./main      # use the Vega from Mesa Clover 💩
# you can apply both filters too, but the device numbers are not per platform any more.
GGML_OPENCL_PLATFORM=pocl GGML_OPENCL_DEVICE=Intel ./main

What strings to use? Well, I didn't include the listing of the devices, but you can see it from the

Device numbers are now absolute, so you can't select, say, the second device from the 3rd platform. If you filter by name, it will select the first match, but it is possible to give the absolute number of the device you specifically want. llama.cpp can only use one device right now anyway.
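For illustration, one way such a selector could be interpreted (a hypothetical helper, not the PR's code): a purely numeric value is treated as the absolute index, anything else as a substring of the platform or device name.

#include <stdlib.h>
#include <string.h>

// Returns nonzero if `selector` (e.g. the value of GGML_OPENCL_DEVICE from
// getenv) picks the entry with this name and absolute index.
static int selector_matches(const char * selector, const char * name, unsigned index) {
    if (selector == NULL || selector[0] == '\0') {
        return 1; // no filter set: everything matches
    }
    char * end = NULL;
    unsigned long wanted = strtoul(selector, &end, 10);
    if (end != selector && *end == '\0') {
        return wanted == index; // purely numeric: absolute device/platform number
    }
    return strstr(name, selector) != NULL; // otherwise: substring match on the name
}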
clang-tidy made some suggestions
ggml-opencl.c
Outdated
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
warning: statement should be inside braces [readability-braces-around-statements]
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0;
} else CL_CHECK(clGetDeviceIDsError);
ggml-opencl.c
Outdated
cl_device_id device_ids[NDEV];
cl_int clGetDeviceIDsError = clGetDeviceIDs(p->id, CL_DEVICE_TYPE_ALL, NDEV, device_ids, &p->n_devices);
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
warning: statement should be inside braces [readability-braces-around-statements]
else { CL_CHECK(clGetDeviceIDsError); }
else { CL_CHECK(clGetDeviceIDsError);
}
Nice, LGTM
So I can confirm it's doing "something" on my Ubuntu 20.04 system: 32 GB DDR4, i9-9500T (3.2 GHz @ 6c, with mitigations off and the power limit increased). No specific errors, but odd/horrible performance, which I kind of expected. The model is VicUnlocked 30B q4_1 with the latest llama.cpp as of writing.

It happily detects my iGPU, which I installed the NEO drivers on a while ago:

Normal CPU GGML (6 threads) is about 350 ms/token with 28 GB/s IMC read during compute and 0% GPU usage. Using the GPU, the token time went to 1400 ms/token with an eval time of 720 ms/token, with 100% CPU usage but 70% GPU 3D usage via intel_gpu_top. Sample and eval time were the same for both the CPU-only and the OpenCL-enabled build. I tried a 13B q4_1 model too, with the same huge speed reduction; I wonder if it's an architectural thing or the iGPU is just way too weak.
Remove constant in array definitions; we can't use defines because the C preprocessor does not process the kernel code normally with the way it is included in the code.

The platform and device selection is also improved: it is now possible to use a string to match platforms and devices:
GGML_OPENCL_PLATFORM
GGML_OPENCL_DEVICE
etc. But I changed the name of the env variables, because they are not really related to CLBlast itself, only OpenCL in general.
Issue: #1429
Two mallocs and frees removed as well :)
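As a simplified illustration of the define problem (not the actual ggml-opencl.c source; the kernel body is made up): the kernel text is only a string to the host compiler, so a host-side #define never gets substituted inside it, and sizes have to be spelled out in the kernel itself.

#define QK 32  // visible to the host C compiler only

// The OpenCL compiler receives this string verbatim at run time; writing
// "QK" inside it would stay as the literal text "QK", so the size is
// written out instead.
static const char * kernel_src =
    "__kernel void scale(__global float * dst, __global const float * src) {\n"
    "    const int qk = 32;\n"
    "    int i = get_global_id(0);\n"
    "    dst[i] = src[i] * (1.0f / qk);\n"
    "}\n";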