Feature Request: run large gguf file in low RAM machine #5207
Comments
This is already possible. llama.cpp uses mmap by default, which is a Linux syscall that maps the model file into memory and lets the kernel page data between RAM and disk on demand. There are obvious performance tradeoffs here, but you can run pretty much any model as long as it fits on disk.
It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. For an SSD I wouldn't recommend it because constant reads/writes will shorten the life of the disk. The best option is to buy more RAM or push some layers to the GPU if the GPU has enough memory.
Swap is not necessary. llama.cpp by default allows you to run models off disk and memory at the same time. There is nothing to configure: you just run ./main and pick whatever model you want. It will be slow, but it will work. Additionally, an SSD will not lose lifespan, as the models are only read, not written to. Reading doesn't affect SSD lifespan, only writing does.
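For context, here is a minimal sketch (my own illustration, not code from llama.cpp's loader) of the read-only mmap behaviour described above: the kernel faults pages in from disk as they are touched and can discard those clean pages under memory pressure without writing anything back, which is why this access pattern does not wear an SSD.

```c
/* Minimal sketch of a read-only mmap of a model file (illustrative only;
 * this is not llama.cpp's loader). Pages are faulted in from disk on demand
 * and can be dropped by the kernel without any writes to the drive. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_PRIVATE: the mapping is read-only; the kernel reads
     * pages from disk as they are touched and may discard clean pages under
     * memory pressure instead of swapping them out. */
    void *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching a byte triggers a page fault that reads it from disk. */
    printf("first byte: 0x%02x, file size: %lld bytes\n",
           ((const unsigned char *)data)[0], (long long)st.st_size);

    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```

The same mapping strategy is what lets ./main start without loading the whole file up front: only the pages that are actually touched during inference get pulled into RAM.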
Thanks for your answer. I did some experiments on my 8 GB Mac using llama-2-13b-chat.Q2_K.gguf, which is a 5.43 GB file, but I got the following errors. Is there anything I did wrong? Here is the error:
(base) lianggaoquan@MacBook-Pro llama.cpp-master % ./main -m models/llama-2-13b-chat.Q2_K.gguf -p "The color of sky is " -n 400 -e
system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
The color of sky is ggml_metal_graph_compute: command buffer 3 failed with status 5
Found the same problem in #2048.
That's odd; it really should work from what I know. Maybe mmap works differently on devices with unified memory? The errors you're getting are complaining about a lack of memory. You can always try setting the …; if that doesn't help, I'm not sure what's wrong in this scenario.
Swap is actually necessary; I tested it myself before writing this comment. Otherwise it will trigger the OOM killer. Mmap just loads the weights as "cache" in system RAM, at least on Linux machines.
I don't have a Mac, but I think it may be because the model is offloaded to the GPU via Metal, so it's actually copied to RAM (the Mac uses "unified" memory, which is a fancy way of saying that the CPU and GPU share the same memory). Maybe you can try with …
I wasn't aware; I thought mmap and swap were different things. Good to know.
Just curious if anyone has looked into Meta Device or AirLLM. I came across an article saying a 70B model could run on a 4 GB GPU.
This issue is stale because it has been open for 30 days with no activity. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Enable large GGUF files (e.g. mixtral-8x7b-instruct-v0.1.Q2_K.gguf, about 15.6 GB) to run on low-RAM machines, such as an 8 GB MacBook, using memory swap.
Motivation
Allow users who do not have enough memory to run more powerful LLMs.
Possible Implementation
Maybe we can split the GGUF file into its individual layers and load them separately to reduce memory use. For example, Mixtral 8x7B only uses two expert layers for each token, so we don't have to load them all into memory.
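As a rough sketch of this idea (my own illustration; the layer byte ranges and the process_layer() helper below are hypothetical, not part of llama.cpp), one could map the whole GGUF file read-only, touch only the ranges of the layers needed for the current token, and then advise the kernel that those pages may be reclaimed:

```c
/* Hypothetical sketch of per-layer, on-demand loading from a memory-mapped
 * GGUF file. The layer byte ranges and process_layer() are placeholders,
 * not real llama.cpp structures. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical byte range of one layer inside the file. */
typedef struct { size_t offset; size_t size; } layer_range;

static void process_layer(const unsigned char *weights, size_t size) {
    /* Placeholder for the real matmuls; touch one byte per page so the
     * kernel faults the layer's pages in from disk. */
    volatile unsigned char acc = 0;
    for (size_t i = 0; i < size; i += 4096) acc ^= weights[i];
    (void)acc;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    unsigned char *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pretend the file holds two "expert" layers needed for the current token. */
    layer_range layers[2] = {
        { 0,                      (size_t)st.st_size / 4 },
        { (size_t)st.st_size / 2, (size_t)st.st_size / 4 },
    };

    long page = sysconf(_SC_PAGESIZE);
    for (int i = 0; i < 2; i++) {
        process_layer(base + layers[i].offset, layers[i].size);

        /* After the layer is used, tell the kernel these pages can be
         * reclaimed; they will be re-read from disk if touched again. */
        size_t start = layers[i].offset & ~((size_t)page - 1);
        madvise(base + start, layers[i].size + (layers[i].offset - start), MADV_DONTNEED);
    }

    munmap(base, (size_t)st.st_size);
    close(fd);
    return 0;
}
```

Whether this helps in practice depends on how often each expert is revisited; if the same experts are needed again on the next token, their pages simply get re-read from disk.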