-
This has been supported forever in
-
Doesn't airllm benefit a lot from Apple unified memory in particular? That seems like it would lower memory use and speed up inference.
-
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed.
https://github.com/lyogavin/Anima/tree/main/air_llm
It's an interesting method for reducing memory usage during inference:
https://huggingface.co/blog/lyogavin/airllm
I was wondering if this could be implemented in llama.cpp in some way, so that GPUs with small VRAM become usable. A rough sketch of the idea is below.
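As I understand it from the blog post, the core trick is layer-by-layer ("layered") inference: only one transformer layer's weights live on the GPU at a time, so peak VRAM stays around the size of a single layer plus activations. Here is a minimal PyTorch sketch of that idea; the file layout, layer count, and the assumption that each layer takes only hidden states are mine for illustration, not AirLLM's or llama.cpp's actual API.

```python
# Minimal sketch of layer-by-layer ("layered") inference as described in the
# AirLLM blog post. File names, layer count, and the layer call signature are
# illustrative assumptions, not AirLLM's real implementation.
import torch

NUM_LAYERS = 80        # e.g. a 70B LLaMA-style model
LAYER_DIR = "layers"   # assumed directory of per-layer checkpoints

def run_layered_inference(hidden_states: torch.Tensor) -> torch.Tensor:
    """Push hidden states through all layers, keeping only one layer in VRAM."""
    for i in range(NUM_LAYERS):
        # Load just this layer's weights from disk, then move them to the GPU.
        layer = torch.load(f"{LAYER_DIR}/layer_{i}.pt", map_location="cpu")
        layer = layer.to("cuda")

        with torch.no_grad():
            hidden_states = layer(hidden_states)

        # Free the layer before loading the next one, so peak VRAM stays
        # roughly at one layer's weights plus the activations.
        del layer
        torch.cuda.empty_cache()
    return hidden_states
```

The trade-off is that every forward pass re-reads the whole model from disk, so it is much slower than keeping the model resident, but it makes the memory footprint nearly independent of total model size.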