Memory estimation utilities #4315
Comments
This feature is required to bring llama.cpp to production environments.

I agree that this is important, but first we need to allow allocations to fail without crashing the application. Instead, we should return an error when an allocation fails. Let's fix that first, and then we can think about a memory estimation API.

This issue was closed because it has been inactive for 14 days since being marked as stale.
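For context on the second comment above, the pattern it asks for looks roughly like the following. This is a generic illustration of propagating allocation failure as an error instead of aborting, not llama.cpp's actual allocation code:

```c
#include <stdlib.h>

// Generic illustration: report a failed allocation to the caller as an
// error code so it can recover (e.g. by freeing a cached context and
// retrying), instead of crashing the application.
int buffer_init(void ** out, size_t size) {
    void * buf = malloc(size);
    if (buf == NULL) {
        return -1; // propagate the failure instead of aborting
    }
    *out = buf;
    return 0;
}
```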
Feature Description
I'd like to have utility functions exposed in `llama.h` that I can use to estimate the memory utilization of a `llama_context` and a `llama_model`, and to get information about the available resources that `llama.cpp` can use.

For example, I'd like to be able to do these things (a usage sketch follows the list):

- Estimate the memory needed for a `llama_context` instantiated for a given `llama_context_params`, without actually instantiating it
- Estimate the memory needed for a `llama_model` instantiated for a given model file path and `llama_model_params`, without actually loading it
- Get the total memory `llama.cpp` has access to across all devices, and also per device that `llama.cpp` has access to
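To make the intent concrete, here is a rough usage sketch. The estimation and memory-query functions are hypothetical placeholders for whatever API `llama.cpp` might expose (see the declarations under Possible Implementation below); only `llama_model_default_params` is an existing `llama.h` function:

```c
#include <stdio.h>
#include "llama.h"

int main(void) {
    struct llama_model_params mparams = llama_model_default_params();

    // Hypothetical call: estimate the model's memory footprint from its
    // file and params, without loading it.
    size_t model_bytes = llama_estimate_model_size("model.gguf", mparams);

    // Hypothetical call: total and free memory across all devices.
    size_t total = 0, free_mem = 0;
    llama_get_available_memory(&total, &free_mem);

    if (model_bytes <= free_mem) {
        printf("model fits: needs %zu of %zu free bytes\n", model_bytes, free_mem);
    } else {
        printf("model does not fit; offload fewer layers or pick a smaller model\n");
    }
    return 0;
}
```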
Motivation
I'm actively working on developing `node-llama-cpp`, and I intend it to provide a high-level API that can be used without necessarily having to delve into the implementation details of `llama.cpp` (I intend it to be highly customizable, but with good defaults).

I'd like to make it easy for users to load a model without having to configure anything: by default, the library will try to provide the maximum capabilities that fit the user's hardware, so it'll decide on the right split between RAM and VRAM by itself, set a good context size by itself, etc.
This library will know when to load things into memory and when to free them to make room for other things the user needs on demand.

For example, when a user tries to create another context for the same model, the library will dump the existing context state to a file and erase its KV cache so the memory can be reused for another purpose; when the user then tries to use the old context again, it'll dump and erase the current context and load the old context state back into memory (a sketch of the swap-to-disk step is below).

I'd like to do this for entire models as well, not only for contexts.
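The swap-to-disk step is already possible with the state-serialization functions in recent versions of `llama.h` (`llama_state_get_size` / `llama_state_get_data` / `llama_state_set_data`; older versions expose the same functionality as `llama_get_state_size` / `llama_copy_state_data` / `llama_set_state_data` with slightly different signatures). A minimal sketch, assuming the recent names:

```c
#include <stdio.h>
#include <stdlib.h>
#include "llama.h"

// Swap a context's state out to a file so its memory can be reused.
// Error handling is kept minimal for brevity.
static int swap_context_to_file(struct llama_context * ctx, const char * path) {
    const size_t size = llama_state_get_size(ctx);
    uint8_t * buf = (uint8_t *) malloc(size);
    if (buf == NULL) {
        return -1;
    }
    llama_state_get_data(ctx, buf, size);

    FILE * f = fopen(path, "wb");
    if (f == NULL || fwrite(buf, 1, size, f) != size) {
        if (f) fclose(f);
        free(buf);
        return -1;
    }
    fclose(f);
    free(buf);
    // The caller can now free the context (llama_free) and later restore
    // the saved state into a fresh context with llama_state_set_data.
    return 0;
}
```

What is missing is the other half: knowing in advance how large a context or model will be, which is what this issue asks for.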
To be able to do this smart memory management, I need to be able to estimate the resource usage of `llama.cpp` objects, and to know what the available resources and their current utilization are.

I'd like to have this implemented as part of `llama.cpp`: since `llama.cpp` supports many backends, I'd like to get this information for whichever backends `llama.cpp` was compiled with.

Possible Implementation
I have a general idea of what the API for this could look like:
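As a minimal sketch of such an API: every name and signature below is hypothetical, illustrating the shape of possible additions to `llama.h` rather than an existing or agreed-upon interface (only the `LLAMA_API` macro and the params/model types are real):

```c
// Hypothetical additions to llama.h -- all names and signatures here are
// illustrative sketches, not an existing API.

// Estimated memory, in bytes, needed to load the model at the given path
// with the given params, without actually loading it.
LLAMA_API size_t llama_estimate_model_size(
        const char               * path_model,
        struct llama_model_params  params);

// Estimated memory, in bytes, needed to create a context for the given
// model with the given params, without actually creating it.
LLAMA_API size_t llama_estimate_context_size(
        const struct llama_model   * model,
        struct llama_context_params  params);

// Total and currently free memory, in bytes, across all devices
// llama.cpp was compiled to use.
LLAMA_API void llama_get_available_memory(size_t * total, size_t * free);

// Per-device variants; device is an index in [0, llama_get_device_count()).
LLAMA_API int32_t llama_get_device_count(void);
LLAMA_API void    llama_get_device_memory(int32_t device, size_t * total, size_t * free);
```

Returning plain byte counts would keep the API backend-agnostic, and per-device queries would let a wrapper like `node-llama-cpp` decide how many layers to offload before committing any memory.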
Thank you for this awesome project :)