Unable to run with Metal on M1 Mac (normal works fine) #2048
Comments
I may be wrong, but status 5 means you are running out of memory. Metal implementations cannot use more than roughly half of the unified memory (or however much is reported as `recommendedMaxWorkingSetSize`), I think. In your logs:
Try a smaller model.
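As a rough sanity check of the point above, here is a minimal sketch (not part of llama.cpp) that estimates whether a model is likely to fit under the ~50% unified-memory Metal budget. The 10% context overhead is a hypothetical margin I chose for illustration, not a measured figure.

```python
def fits_in_metal_budget(model_size_gb: float, total_ram_gb: float,
                         overhead_fraction: float = 0.10) -> bool:
    """Return True if the model (plus a rough context overhead) is expected
    to fit under the approximate Metal working-set cap of half the RAM.
    overhead_fraction is an illustrative assumption, not a measured value."""
    budget_gb = total_ram_gb / 2  # approximate recommendedMaxWorkingSetSize
    needed_gb = model_size_gb * (1 + overhead_fraction)
    return needed_gb <= budget_gb

# A 7B q4_0 model is roughly 3.8 GB on disk; an 8 GB M1 has a ~4 GB budget,
# so it sits right at the edge, which is consistent with the status-5 failure.
print(fits_in_metal_budget(3.8, 8.0))   # -> False
print(fits_in_metal_budget(2.9, 8.0))   # -> True (a q2_K-sized file)
```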
Okay, thanks 😎. If anybody ends up able to run the 7B q4_0-quantised model with Metal enabled on an 8 GB RAM M1 Mac, let me know!!
I think it's not possible for now, due to hardware limits. My M1 MacBook Air (8 GB) is only able to load a 3B model using GPU inference (MPS), and I'm happy with it :) — much faster than CPU.
Have you tried this branch: #2011?
I've just given it a go; still the same error message.
I managed to get it working with the smallest model (llama-2-7b-chat.ggmlv3.q2_K.bin) from TheBloke. I tried llama-2-7b-chat.ggmlv3.q4_K_S too, but it ran out of memory.
I was then able to chat with Llama 2 using that. The streaming is not smooth, but not too bad. Here is the data:
Thanks! The 3-bit quantisation works as well :)
What would be nice is if reducing `--n-gpu-layers` fixed the problem, so that only the layers Metal is actually running counted towards `recommendedMaxWorkingSetSize`. I have an M2 MacBook Pro with 64 GB (recommendedMaxWorkingSetSize = 49152.00 MB), so I can run Llama 2 70B 6-bit quantized on the CPU (rather slowly), but not on the GPU. I guess I'll need to redownload one of the 5-bit quantized versions...
@cheese-melted, reducing `n_gpu_layers` can reduce the memory pressure on the GPU, which may solve your problem. But it will use the CPU more and thus slow you down.
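To make the trade-off above concrete, here is a hedged back-of-the-envelope sketch (the function name, the even-split assumption, and the example numbers are mine, not llama.cpp's): it picks the largest number of layers whose weights would fit in a given Metal budget, assuming weights are spread evenly across layers.

```python
def max_gpu_layers(model_size_gb: float, n_layers: int,
                   metal_budget_gb: float) -> int:
    """Approximate how many transformer layers could be offloaded to the GPU,
    assuming each layer holds an equal share of the model weights (a
    simplifying assumption; embeddings and the KV cache are ignored)."""
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(metal_budget_gb / per_layer_gb))

# E.g. a ~26 GB 70B quantized model with 80 layers against a 24 GB budget:
print(max_gpu_layers(26.0, 80, 24.0))   # -> 73, so pass -ngl 73 and keep
                                        #    the remaining layers on the CPU
```

Anything that doesn't fit on the GPU runs on the CPU, which is why a smaller `-ngl` avoids the status-5 failure at the cost of speed.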
From my experience:
Device: MacBook Pro 2017, 13-inch, Intel Iris GPU, 16 GB RAM (Metal not supported).
I tried running llama-2-7b.Q2_K.gguf (2.5 GB) and got errors including:
ggml_metal_graph_compute: command buffer 0 failed with status 5
My solution was to re-compile using make LLAMA_NO_METAL=1. It now works, even though it is not very fast.
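For reference, the CPU-only rebuild described above looks roughly like this, assuming you are in the llama.cpp checkout (this is a sketch of the workflow the commenter describes, not an official recipe):

```sh
# Rebuild llama.cpp without Metal so inference falls back to the CPU
make clean
make LLAMA_NO_METAL=1

# Run as before; -ngl is no longer useful since there is no GPU offload
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```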
Prerequisites
Behavior
Able to run
$./main -m ./models/7B/ggml-model-q4_0.bin -n 128
Unable to run
$./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
Environment and Context
Failure Information (for bugs)
I notice:
during the build.
On running:
$./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
I get (see logs below for full output):
Steps to Reproduce
Failure Logs