[User] GPU Memory problem on Apple M2 Max 64GB #1870
Comments
Is the model supposed to work with llama.cpp?
This is a known Metal issue: it is unable to allocate enough memory for the GPU. The current limit is approximately half of the available physical memory.
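For anyone who wants to check their own machine, here is a minimal Swift sketch (not part of llama.cpp) that queries the limit Metal reports, the same recommendedMaxWorkingSetSize value that shows up in the ggml logs, and compares it to total physical memory:

```swift
import Foundation
import Metal

// Minimal sketch: query the working-set limit Metal reports and compare it
// to total physical memory. On Apple Silicon the limit is typically well
// below the full unified memory size.
if let device = MTLCreateSystemDefaultDevice() {
    let limitGB = Double(device.recommendedMaxWorkingSetSize) / 1_073_741_824
    let physGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    print("recommendedMaxWorkingSetSize: \(limitGB) GB")
    print("physical memory:              \(physGB) GB")
}
```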
Oh no! That really sucks. I've just invested almost 5K in a 64GB Max model because everyone pointed me in that direction for local LLMs. Now I can't use the extra unified memory for the GPU? Is there anything we can do to fix this? Please tell me there is a solution.
@CyborgArmy83 If you are eager to experiment with 65B using your GPU, 65B Q3_K_S is the recommended choice. However, it is computationally demanding and generally not optimal due to excessive compression. Nevertheless, it still outperforms any 33B model. The 65B Q4_K_S model is one of the easiest to compute and has significantly better quality than Q3_K_S. However, due to its larger size, it can only be run on the CPU, as it exceeds half of your RAM's capacity. You should compare the speed and quality of the two options and choose whichever you prefer.
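A rough back-of-the-envelope sketch of why Q3_K_S fits under the GPU limit on a 64GB machine while Q4_K_S does not; the bits-per-weight figures here are approximations for illustration, not exact GGML numbers:

```swift
// Back-of-the-envelope sizing (bits-per-weight values are approximate):
// does a quantized 65B model fit under a ~32 GB Metal working-set limit
// on a 64 GB machine?
let params = 65e9
let bitsPerWeight: [String: Double] = ["Q3_K_S": 3.5, "Q4_K_S": 4.6]
let metalLimitGB = 32.0  // roughly half of 64 GB

for (quant, bpw) in bitsPerWeight.sorted(by: { $0.key < $1.key }) {
    let sizeGB = params * bpw / 8 / 1e9
    print("\(quant): ~\(Int(sizeGB)) GB, fits under \(Int(metalLimitGB)) GB GPU limit: \(sizeGB < metalLimitGB)")
}
```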
Thanks! I did find the Metal performance difference night and day on the 7B, 13B, and some 30B models I could run! So I am not sure why you say it's not that much of a difference? Could be twice as fast, right? And how can you even know the true speed difference at this moment, since it's not working for anyone? Not trying to be disrespectful here, just trying to understand your reasoning. I will look into 65B Q4_K_S as you've suggested. Q3 seems too low quality for my needs. I really hope we can fix this as a community of smart developers!
@CyborgArmy83 Many people who own M2 Max 96GB and M1/M2 Ultra machines have reported 65B speeds when using the GPU. It is also relatively easy to estimate 65B speed from the performance of smaller models.
@CyborgArmy83
I followed the issue closely, but thanks for bringing it to my attention! I will check it out, but reading the conclusion here #1826 (comment) - it seems like there are still some big obstacles to be overcome. That's why I am also advocating for mentioning the limitations/remaining issues either in a new issue or discussion topic so others can join in on any new ideas.
@CyborgArmy83
Do you mean that the new software changes from today should be better for 64GB machines? I am left wondering if there is a way around these limitations. Maybe we can contact the guy from the video; he apparently works for Apple's GPU engineering team. It might not be possible to contact him directly, but I am willing to give it a try.
@CyborgArmy83
The new changes have resulted in some models now loading, but the limit is still rather annoying. At around 47GB or higher I get problems loading models on my 64GB M2 Max. I also had an M1 Max 64GB to test and the behavior is exactly the same. It seems like there is no current discussion on the matter, so I have reached out to the Apple engineering team. I hope some other people will also investigate this problem further! There might be a way to use more of our unified memory, which would be very welcome! Running 65B models on the CPU on Apple Silicon is a lot slower when you have a Max chip with double the amount of GPU power. I've also read that the current Metal implementation for local LLM inference is capped at 250 GB/s, while the memory can go up to 400 GB/s for the Max and 800 GB/s for the Ultra. A recent study showed that memory bandwidth might be more important than raw processing power. Hopefully it will get faster on current Apple silicon in the (near) future. Anyone else care to share their experience?
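On the bandwidth point, a rough sketch of the usual upper-bound estimate: token generation is largely memory-bandwidth bound, so tokens per second is at most roughly the effective bandwidth divided by the bytes streamed per token (about the model size). The bandwidth figures below are the nominal/quoted numbers from the comment above, not measurements:

```swift
import Foundation

// Rough upper bound for memory-bandwidth-limited token generation:
// tokens/sec <= effective bandwidth / bytes streamed per token (~ model size).
// Bandwidth figures are nominal or quoted above, not measured here.
let modelSizeGB = 37.0  // e.g. a ~37 GB 65B Q4_K_S file (assumed)
let bandwidthsGBps = [
    ("quoted Metal cap", 250.0),
    ("M2 Max nominal", 400.0),
    ("M2 Ultra nominal", 800.0),
]
for (label, bw) in bandwidthsGBps {
    print("\(label): at most \(String(format: "%.1f", bw / modelSizeGB)) tok/s")
}
```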
Just to add to the list of broken ones: the 70B Stable Beluga (Q6_K), where around 49GB is the recommended allocation size but the model's requirement was higher (~54GB):

llama_model_load_internal: mem required = 54749.41 MB
...
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00

It also breaks with ggml_metal_graph_compute: command buffer 6 failed with status 5, running on an M1 Max with 64GB RAM.
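For anyone decoding that last log line: in the Metal API, a command buffer status of 5 corresponds to MTLCommandBufferStatus.error, i.e. the submitted GPU work did not complete. A minimal Swift sketch of that mapping (this is just the framework enum, not llama.cpp code):

```swift
import Metal

// "command buffer ... failed with status 5": raw value 5 maps to
// MTLCommandBufferStatus.error, i.e. the GPU work did not complete.
let status = MTLCommandBufferStatus(rawValue: 5)
print(status == .error)  // true
// A buffer that ends in .error also carries an NSError in its `error`
// property, which in this scenario is typically an out-of-memory condition.
```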
Btw, using @mbosc's fork, I could run the 70B over the M1 Max GPU. The speed difference is noticeable, over 5x.
CPU stats / GPU stats (screenshots not reproduced here).
What would be nice is if reducing --gpulayers fixed the problem, so that only the layers Metal was actually running counted towards the recommendedMaxWorkingSetSize. I have an M2 MacBook Pro with 64GB, recommendedMaxWorkingSetSize = 49152.00 MB, so I can run Llama 2 70B 6-bit quantized on the CPU (rather slowly), but not on the GPU. I guess I'll need to redownload one of the 5-bit quantized versions, rather than just shifting a few layers to CPU threads. Also, it would be nice if hitting this situation gave a more comprehensible error message than:
without the end user having to first fiddle with the code to get GGML Metal debug logging turned back on and then decipher the meaning of the rather opaque:
perhaps something like: "Current allocated size is greater than the recommended max working set size: Metal ran out of memory for running your model on the GPU (try a smaller context length or a smaller model)".
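A sketch of the kind of check being suggested, assuming Metal's currentAllocatedSize and recommendedMaxWorkingSetSize device properties; this is illustrative only, not the actual ggml-metal implementation:

```swift
import Metal

// Sketch of the kind of pre-flight check being suggested (illustrative,
// not the actual ggml-metal code): warn in plain language when Metal's
// allocations already exceed the recommended working-set limit.
func warnIfOverBudget(device: MTLDevice) {
    let allocated = UInt64(device.currentAllocatedSize)
    let limit = device.recommendedMaxWorkingSetSize
    if allocated > limit {
        print("Current allocated size (\(allocated / 1_048_576) MB) is greater than the "
            + "recommended max working set size (\(limit / 1_048_576) MB): Metal ran out of "
            + "memory for running your model on the GPU (try a smaller context length or a smaller model).")
    }
}

if let device = MTLCreateSystemDefaultDevice() {
    warnIfOverBudget(device: device)
}
```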
I wonder, is this an OS-set limit? Does Sonoma (macOS 14) change anything?
It is enforced by the OS. There have been somewhat successful attempts to change the memory allocation with a custom kernel/module, but they come at a price: significantly reduced security. Sonoma hasn't improved or degraded the current situation. See: #2182
So we have to ask Apple to change that. I think it is just a parameter some developer set so that GPU-using apps don't eat up all main memory. On a 96 GB Mac that means 24 GB are not available for data to be processed on the Metal cores. What might be the best way to influence Apple?
Does this imply that if a model exceeds the GPU's memory capacity, it cannot offload the excess to the CPU-managed memory for execution? For instance, for a 70B 6-bit Llama2 requiring about 52.5GB RAM, is it possible that 48GB is available for GPU processing with Metal acceleration, leaving the remaining 4.5GB for CPU execution?
^ I have that model and the 6-bit OOMs. I can only run the 5-bit at default context size, and increasing either causes the error 5. Q4_K_M is the go-to if you want to expand the context a bit, I've found.
So just to be clear, on an M1 Max with 64GB of RAM, even though the working set size is larger than the mem required, we should still expect it to fail?
edit: never mind -- apparently llama_cpp.server doesn't handle this as well as using ./main for some reason.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I cannot seem to load models larger than 30 billion parameters on my system; it gives some kind of memory error, even though I should have plenty of RAM. I have tried using the new K-quants as well as the older ones. They only seem to load on the CPU (which is extremely slow and almost unbearable to use for inference).
Current Behavior
When loading models, I can see with another utility that it will not load more than 27~29 GB into memory.
Environment and Context
Apple M2 Max 64GB, macOS 13.4 (latest)
If this issue is related to something that is already known, I apologize. There seem to be so many different issues that I am not entirely sure anymore which is related to which. I am just a webdev, so C++ is quite new to me.