
Using CLBlast to call the GPU on an Android device, what is the relationship between the ngl parameter and model output correctness? #6562

Closed
qtyandhasee opened this issue Apr 9, 2024 · 4 comments

Comments


qtyandhasee commented Apr 9, 2024

With the help of issue #2169, I successfully got CLBlast calling the GPU on my Qualcomm device (Adreno 740 v2).

But I found something interesting when I tried model inference: with the model stories260K.gguf, the question-and-answer output was normal, but the GPU was hardly used (utilization showed 1% or even 0%).
For the models llama-2-7b-chat.Q4_K_M.gguf and llama-2-7b-chat.Q5_K_S.gguf, I could get output, but the results were incorrect. In those runs the GPU utilization was about 40%.
For the models llama-2-13b-chat.Q2_K.gguf and llama-2-7b-chat.Q2_K.gguf, I got normal, satisfactory responses when the ngl parameter was set to 2. But when I set ngl close to the full number of offloadable layers, for example 40 of 41, the model's output became random again. In that case the GPU utilization showed around 50%.
When ngl is 2 or 10 (not very large), the answers look fine; when ngl is set to 40 (40/41), the answer is ridiculous.

The command I use is as follows:
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./bin/main -t 8 -m /data/local/tmp/llama_cpu/llama-2-7b-chat.Q4_K_M.gguf --color -c 2048 -ngl 2 --temp 0.7 -n -1 -i -ins
I didn't change any parameters other than -ngl and the model.
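For anyone trying to reproduce or bisect this, here is a minimal sketch of a sweep over -ngl values to find the layer count at which output degrades. It reuses the binary and model path from the command above; the prompt, the -n 64 cutoff, and the env-var overrides are my own assumptions, and this only runs on a device where those paths exist.

```shell
#!/bin/sh
# Hypothetical helper: sweep -ngl to find where output quality degrades.
# MAIN and MODEL default to the paths used in the command above; override via env.
MAIN=${MAIN:-./bin/main}
MODEL=${MODEL:-/data/local/tmp/llama_cpu/llama-2-7b-chat.Q4_K_M.gguf}
for NGL in 0 2 10 20 40; do
  echo "=== ngl=$NGL ==="
  # -n 64 keeps each run short; sampling settings match the original command.
  GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 \
    "$MAIN" -t 8 -m "$MODEL" -c 2048 -ngl "$NGL" --temp 0.7 -n 64 \
    -p "Once upon a time" || echo "(run failed for ngl=$NGL)"
done
```

Comparing the generations side by side should show roughly where coherent text turns into random output.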

This looks interesting, and I wonder whether CLBlast is making some kind of error in its GPU calls.
Has anyone else run into this situation? I'd like to know which direction to take to track down this bug.

Jeximo (Contributor) commented Apr 9, 2024

Unfortunately, OpenCL for Android under-performs, and yes, the output can even be incorrect: likely a memory alignment/padding issue.

You'll likely see wild results if you run the perplexity tool with CLBlast on Android.

Related: CLBlast is more of an OpenCL library than an actual backend.
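The perplexity check mentioned above can be invoked roughly like this: a sketch, not a definitive recipe. The corpus path wiki.test.raw is an assumption (any plain-text file works), and the model path is taken from the command earlier in the thread; a perplexity value that jumps sharply when -ngl is raised would confirm the GPU path is corrupting output.

```shell
#!/bin/sh
# Hypothetical perplexity run with CLBlast on Android.
MODEL=${MODEL:-/data/local/tmp/llama_cpu/llama-2-7b-chat.Q4_K_M.gguf}
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 \
  ./bin/perplexity -t 8 -m "$MODEL" -f wiki.test.raw -ngl 40 \
  || echo "(perplexity run failed; requires the device paths above)"
```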

qtyandhasee (Author) commented Apr 10, 2024

> Unfortunately, OpenCL for Android under-performs, and yes, the output can even be incorrect: likely a memory alignment/padding issue.
>
> You'll likely see wild results if you run the perplexity tool with CLBlast on Android.
>
> Related: CLBlast is more of an OpenCL library than an actual backend.

@Jeximo Thank you very much for your answer. Can we simply take it that llama.cpp's support for GPU offloading on SoCs is imperfect at present? Or is it that no SoC's OpenCL driver currently supports LLM-style inference?

Jeximo (Contributor) commented Apr 10, 2024

> support of llama.cpp for GPU call on SoC is not perfect at present?

Yes, it's imperfect.

> Or is it because none of SoC's OpenCL driver support currently supports LLM-like reasoning?

Yes, OpenCL for Android is buggy, and no one is currently developing it. Vulkan is better developed, but it's not optimized for Android either, as it also produces bad output.

To put it simply, there's still a lot of progress to be made for LLMs on Android GPUs.


This issue was closed because it has been inactive for 14 days since being marked as stale.
