GPU not being utilized on Windows #3806
Comments
EDIT: I have tested all the way back to https://github.com/ggerganov/llama.cpp/releases/tag/b1116 (August 29) and I'm experiencing the same behavior. I'm running main.exe with
I am also experiencing the same behavior with the latest build release, https://github.com/ggerganov/llama.cpp/releases/tag/b1429: low to zero GPU utilization with CUDA 12.2 on an RTX 2070, and zero utilization on a GTX 1070. Benchmark here:
I may have found a partial answer. First I updated my NVIDIA driver. When I ran a prompt, I immediately noticed that the dedicated GPU memory filled up almost to the max; my RAM was maxed out as usual. I ran a 14GB model and a 4GB model. It appears that the 4GB model fits into GPU memory, but the 14GB model clearly does not, and it spills over into RAM. As a result, I got faster outputs, with up to 30% GPU utilization. Diagram with results below:
So the key is that when
Yes, I've reproduced this behavior on multiple machines. It appears that overflowing the GPU memory causes 100% of the activity to shift to the CPU.
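The arithmetic behind the observation above (a 4GB model fits in an 8GB card, a 14GB model does not) can be sketched as a quick estimate of how many layers to offload with -ngl before VRAM overflows. This is an illustrative sketch, not code from llama.cpp: the layer counts and the 1 GiB reserve for KV cache/scratch buffers are assumptions.

```python
def max_offload_layers(model_bytes: int, n_layers: int,
                       vram_bytes: int, reserve_bytes: int = 1 << 30) -> int:
    """Estimate the largest -ngl value whose layers fit in dedicated VRAM.

    Assumes layers are roughly equal in size and reserves some VRAM
    (reserve_bytes) for the KV cache and scratch buffers.
    """
    per_layer = model_bytes // n_layers          # approximate bytes per layer
    usable = max(vram_bytes - reserve_bytes, 0)  # VRAM left for weights
    return min(n_layers, usable // per_layer)

# A hypothetical 14 GB model with 40 layers on an 8 GB card:
print(max_offload_layers(14 * 1024**3, 40, 8 * 1024**3))  # -> 20 (partial offload)

# A hypothetical 4 GB model with 32 layers fits entirely:
print(max_offload_layers(4 * 1024**3, 32, 8 * 1024**3))   # -> 32 (full offload)
```

Anything beyond the returned layer count either stays on the CPU or, under the Windows driver behavior discussed below, silently spills into system RAM.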
This is a feature of the NVIDIA drivers under Windows, they allow allocating more memory than available. As far as I know there is no way to disable this. Linux should not be affected. |
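One way to watch this oversubscription happen is to poll dedicated memory and GPU utilization side by side while a prompt runs (a sketch, assuming nvidia-smi is on the PATH; these query fields are standard nvidia-smi options):

```
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
```

If memory.used approaches memory.total while utilization.gpu stays near zero, the allocation has likely spilled into system RAM.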
@atonalfreerider I think these are separate issues - I get poor performance even with
These are all with 017efe8.
017efe8
ff5a3f0
There's spiky activity at 017efe8, but it stays steadily in the single digits, whereas ff5a3f0 usage stays around 40%.
It looks like this issue is specifically with LLAMA_NATIVE; I get normal performance again with
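If LLAMA_NATIVE is indeed the culprit, one way to test that hypothesis is to reconfigure with it explicitly disabled (a sketch; LLAMA_NATIVE and LLAMA_CUBLAS are real llama.cpp CMake options, but the thread does not confirm this exact command):

```
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_NATIVE=OFF
cmake --build . --config Release
```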
Resolved by #3906. Not sure what's going on with the low GPU usage; maybe the CPU was just bottlenecking.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
GPU usage goes up with -ngl and decent inference performance. Expect to see around 170 ms/tok.
Current Behavior
GPU memory usage goes up but activity stays at 0, only CPU usage increases. Getting around 2500 ms/tok.
Environment and Context
Windows 11 - 3070 RTX
Attempting to run codellama-13b-instruct.Q6_K.gguf
I ran a git bisect, which identified 017efe899d8 as the first bad commit. I see roughly a 15x drop in performance between ff5a3f0 and 017efe8, from 170 ms/tok to 2500 ms/tok.
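The bisect described above follows the standard git bisect workflow (the commit hashes are from this thread; the build-and-measure step at each iteration is manual):

```
git bisect start
git bisect bad 017efe8     # slow: ~2500 ms/tok
git bisect good ff5a3f0    # fast: ~170 ms/tok
# at each step git checks out a candidate commit: rebuild, measure ms/tok,
# then mark it with `git bisect good` or `git bisect bad`
git bisect reset           # finish and return to the original branch
```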
Steps to Reproduce
cmake .. -DLLAMA_CUBLAS=ON
.\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096 -ngl 24