Feature Request: run large gguf file in low RAM machine #5207

Closed
liangDarwin2 opened this issue Jan 30, 2024 · 11 comments
Labels: enhancement (New feature or request), stale

Comments

@liangDarwin2

liangDarwin2 commented Jan 30, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Enable large GGUF files (e.g. mixtral-8x7b-instruct-v0.1.Q2_K.gguf, about 15.6 GB) to run on low-RAM machines, such as an 8 GB MacBook, by using memory swap.

Motivation

Let users who do not have enough RAM run more powerful LLMs.

Possible Implementation

Maybe we can split the GGUF file into separate layers and load them independently to reduce memory use. For example, Mixtral 8x7B only uses two expert layers per token, so we don't have to keep all of them in memory.
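A very rough sketch of the loading pattern I have in mind (hypothetical file layout, path, and offsets; no real GGUF parsing):

// Rough sketch of the idea: keep only a small index of where each expert's
// weights live in the file, and pread() just the experts the router picks
// for the current token. Layout, path, and sizes here are hypothetical.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

struct expert_loc { off_t offset; size_t size; };   // hypothetical per-expert index entry

// Read a single expert's weights from disk into a freshly allocated buffer.
static void *load_expert(int fd, struct expert_loc loc) {
    void *buf = malloc(loc.size);
    if (buf == NULL) return NULL;
    if (pread(fd, buf, loc.size, loc.offset) != (ssize_t) loc.size) {
        free(buf);
        return NULL;
    }
    return buf;
}

int main(void) {
    int fd = open("model.gguf", O_RDONLY);           // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    // In a real implementation this index would come from the GGUF metadata.
    struct expert_loc index[8] = { { 0, 4096 }, { 4096, 4096 } /* ... */ };

    // Pretend the router selected experts 0 and 1 for the current token:
    // only those two are pulled into RAM, the other six stay on disk.
    int picked[2] = { 0, 1 };
    for (int i = 0; i < 2; i++) {
        void *w = load_expert(fd, index[picked[i]]);
        if (w == NULL) { fprintf(stderr, "failed to read expert %d\n", picked[i]); break; }
        // ... run this expert's feed-forward pass with w, then release it ...
        free(w);
    }

    close(fd);
    return 0;
}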

@liangDarwin2 liangDarwin2 added the enhancement New feature or request label Jan 30, 2024
@Azeirah
Contributor

Azeirah commented Jan 30, 2024

This is already possible. llama.cpp uses mmap by default, which maps the model file into virtual memory so the OS pages data in from disk on demand and can evict it again under memory pressure.

There are obvious performance tradeoffs here, but you can run pretty much any model as long as it fits on disk.
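For illustration, here is a minimal sketch of the mechanism (plain POSIX mmap, not llama.cpp's actual loader):

// Minimal sketch of demand-paged file access via POSIX mmap. This is only
// an illustration of the mechanism llama.cpp relies on: pages are read from
// disk lazily on first access and can be evicted again under memory pressure.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; no physical RAM is committed yet.
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Touching a byte faults the containing page in from disk; the kernel
    // may drop it later and transparently re-read it if touched again.
    printf("first byte: 0x%02x, file size: %lld bytes\n", data[0], (long long) st.st_size);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}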

@qnixsynapse
Contributor

It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. I wouldn't recommend an SSD either, because the constant reads/writes will shorten the life of the disk.

Best is to buy more RAM or push some layers to the GPU if the GPU has enough memory.

@Azeirah
Contributor

Azeirah commented Feb 2, 2024

> It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. I wouldn't recommend an SSD either, because the constant reads/writes will shorten the life of the disk.
>
> Best is to buy more RAM or push some layers to the GPU if the GPU has enough memory.

Swap is not necessary. llama.cpp by default allows you to run models off disk and in memory at the same time. There is nothing to configure: you just run ./main and pick whatever model you want. It will be slow, but it will work.

Additionally, an SSD will not lose lifespan, because the model is only read, never written. Reading doesn't affect SSD lifespan; only writing does.
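For example, with the Mixtral file from the original request (prompt and token count here are just placeholders), the default run already uses mmap; if I remember the flags right, --no-mmap is what turns it off and forces the whole file into memory:

./main -m models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf -p "Hello" -n 64
./main -m models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf -p "Hello" -n 64 --no-mmap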

@liangDarwin2
Author

liangDarwin2 commented Feb 2, 2024

> It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. I wouldn't recommend an SSD either, because the constant reads/writes will shorten the life of the disk.
> Best is to buy more RAM or push some layers to the GPU if the GPU has enough memory.
>
> Swap is not necessary. llama.cpp by default allows you to run models off disk and in memory at the same time. There is nothing to configure: you just run ./main and pick whatever model you want. It will be slow, but it will work.
>
> Additionally, an SSD will not lose lifespan, because the model is only read, never written. Reading doesn't affect SSD lifespan; only writing does.

Thanks for your answer. I did some experiments on my 8 GB Mac using llama-2-13b-chat.Q2_K.gguf (a 5.43 GB file), but I got the following errors. Is there anything I did wrong? Here is the output:

(base) lianggaoquan@MacBook-Pro llama.cpp-master % ./main -m models/llama-2-13b-chat.Q2_K.gguf -p "The color of sky is " -n 400 -e
Log start
main: build = 0 (unknown)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.0.0
main: seed = 1706876664
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b-chat.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q2_K: 81 tensors
llama_model_loader: - type q3_K: 200 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 5.06 GiB (3.34 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 4096.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 1158.08 MiB, offs = 4160536576, ( 5254.14 / 5461.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 5125.86 MiB
llm_load_tensors: CPU buffer size = 51.27 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/lianggaoquan/Downloads/temp/llama.cpp-master/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 400.00 MiB, ( 5655.70 / 5461.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_kv_cache_init: Metal KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU input buffer size = 11.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 5655.72 / 5461.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 82.52 MiB, ( 5738.22 / 5461.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: Metal compute buffer size = 82.50 MiB
llama_new_context_with_model: CPU compute buffer size = 11.00 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_metal_graph_compute: command buffer 3 failed with status 5

system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

The color of sky is ggml_metal_graph_compute: command buffer 3 failed with status 5
vojggml_metal_graph_compute: command buffer 3 failed with status 5
gamesggml_metal_graph_compute: command buffer 3 failed with status 5
Üggml_metal_graph_compute: command buffer 3 failed with status 5
ashingtonggml_metal_graph_compute: command buffer 3 failed with status 5
handlerggml_metal_graph_compute: command buffer 3 failed with status 5
vareggml_metal_graph_compute: command buffer 3 failed with status 5
rectggml_metal_graph_compute: command buffer 3 failed with status 5
mannerggml_metal_graph_compute: command buffer 3 failed with status 5
ufactggml_metal_graph_compute: command buffer 3 failed with status 5
hersggml_metal_graph_compute: command buffer 3 failed with status 5
handlerggml_metal_graph_compute: command buffer 3 failed with status 5
Surggml_metal_graph_compute: command buffer 3 failed with status 5
Helggml_metal_graph_compute: command buffer 3 failed with status 5
annéesggml_metal_graph_compute: command buffer 3 failed with status 5
...

@liangDarwin2
Author

Found the same problem in #2048.

@Azeirah
Contributor

Azeirah commented Feb 2, 2024

That's odd; it really should work from what I know. Maybe mmap works differently on devices with unified memory?

The errors you're getting are complaining about a lack of memory.

You can always try the --mlock parameter; it forces llama.cpp to load the entire model into RAM and lock it there. Any overflow will cause your OS to swap. I assume it's a lot slower than the mmap approach, but it should work.
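Based on the command from your log, that would be something like:

./main -m models/llama-2-13b-chat.Q2_K.gguf -p "The color of sky is " -n 400 -e --mlock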

If not, I'm not sure what's wrong in this scenario.

@qnixsynapse
Contributor

> It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. I wouldn't recommend an SSD either, because the constant reads/writes will shorten the life of the disk.
> Best is to buy more RAM or push some layers to the GPU if the GPU has enough memory.
>
> Swap is not necessary. llama.cpp by default allows you to run models off disk and in memory at the same time. There is nothing to configure: you just run ./main and pick whatever model you want. It will be slow, but it will work.
>
> Additionally, an SSD will not lose lifespan, because the model is only read, never written. Reading doesn't affect SSD lifespan; only writing does.

Swap is actually necessary, because I tested it myself before writing this comment; otherwise it will trigger the OOM killer. mmap just loads the weights as "cache" in system RAM, at least on Linux machines.

@ngxson
Collaborator

ngxson commented Feb 4, 2024

I don't have a Mac, but I think it may be because the model is offloaded to the GPU via Metal, so it's actually copied into RAM (Macs use "unified" memory, which is a fancy way of saying that the CPU and GPU share the same memory).

Maybe you can try with -ngl 10 to offload only 10 layers to the GPU?
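For example, reusing the command from your earlier log:

./main -m models/llama-2-13b-chat.Q2_K.gguf -p "The color of sky is " -n 400 -e -ngl 10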

@Azeirah
Contributor

Azeirah commented Feb 4, 2024

> It is possible, but it will need a large amount of swap space (if you are on a Linux machine). The performance is, well, impractical on a slow HDD. I wouldn't recommend an SSD either, because the constant reads/writes will shorten the life of the disk.
> Best is to buy more RAM or push some layers to the GPU if the GPU has enough memory.
>
> Swap is not necessary. llama.cpp by default allows you to run models off disk and in memory at the same time. There is nothing to configure: you just run ./main and pick whatever model you want. It will be slow, but it will work.
> Additionally, an SSD will not lose lifespan, because the model is only read, never written. Reading doesn't affect SSD lifespan; only writing does.
>
> Swap is actually necessary, because I tested it myself before writing this comment; otherwise it will trigger the OOM killer. mmap just loads the weights as "cache" in system RAM, at least on Linux machines.

I wasn't aware; I thought mmap and swap were different things. Good to know.

@inventionstore

Just curious if anyone has looked into Meta Device or AirLLM. I came across this article saying a 70B model could run on a 4 GB GPU.
https://huggingface.co/blog/lyogavin/airllm#:~:text=The%2070B%20large%20language%20model,for%20complex%20%E2%80%9Cattention%E2%80%9D%20calculations.


This issue is stale because it has been open for 30 days with no activity.
