Memory estimation utilities #4315

Closed
giladgd opened this issue Dec 3, 2023 · 3 comments
Labels
enhancement (New feature or request), stale

Comments

giladgd (Contributor) commented Dec 3, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

I'd like to have utility functions exposed in llama.h that I can use to estimate the memory utilization of a llama_context and llama_model, and also get information about the available resources that llama.cpp can use.

For example, I'd like to be able to do these things:

  • Get an estimation of how much memory will be allocated for a llama_context instantiated for a given llama_context_params without actually instantiating it
  • Get an estimation of how much memory will be used by a llama_model instantiated for a given model file path and llama_model_params without actually loading it
  • Get the total amount of available VRAM that llama.cpp has access to across all devices, and also for each device
  • Get the total used VRAM across all devices, and also for each device (given these APIs, users can calculate the free VRAM available for allocation; see the sketch after this list)
  • Get the total RAM that llama.cpp has access to
  • Get the total used RAM
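
To make the free-memory point concrete, here is a rough sketch of how free memory per device could be derived from such total/used queries. Everything in it is hypothetical: the functions and the llama_device_resource_usage type are the ones proposed in the Possible Implementation section below, and the NULL-terminated array is an assumption made only for this sketch.

#include <stdio.h>

// hypothetical usage of the proposed per-device queries - none of these
// functions or types exist in llama.h today; the returned array is assumed
// (for this sketch only) to end with an entry whose device_name is NULL
static void print_free_memory_per_device(void) {
    const llama_device_resource_usage * total = llama_get_total_devices_memory();
    const llama_device_resource_usage * used  = llama_get_used_devices_memory();

    for (size_t i = 0; total[i].device_name != NULL; i++) {
        const size_t free_bytes = total[i].usage - used[i].usage;
        printf("%s: %zu bytes of free %s\n",
            total[i].device_name, free_bytes, total[i].is_gpu ? "VRAM" : "RAM");
    }
}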

Motivation

I'm actively working on node-llama-cpp, and I intend it to provide a high-level API that can be used without having to delve into the implementation details of llama.cpp (highly customizable, but with good defaults).

I'd like to make it easy for users to load a model without having to configure anything: by default, the library will try to provide the maximum capabilities that fit into the user's hardware, deciding on the right split between RAM and VRAM, choosing a good context size, and so on.

The library will know when to load things into memory and when to free them to make room for other things the user needs on demand.
For example, when a user creates another context against the same model, the library can dump the existing context state to a file and free its KV cache so that memory can be reused; when the user later returns to the old context, the library dumps and frees the current one and loads the old context state back into memory.

I'd like to do this for entire models as well, not only for contexts.
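
For the context part of this flow, the dump/restore step can already be expressed with the state API that exists in llama.h today. A minimal sketch, assuming the current llama_get_state_size / llama_copy_state_data / llama_set_state_data functions (exact names and signatures may vary across versions), and using a memory buffer instead of a file for brevity:

#include <stdint.h>
#include <stdlib.h>

#include "llama.h"

// dump a context's state into a heap buffer so its memory can be reclaimed;
// llama_get_state_size returns an upper bound, llama_copy_state_data returns
// the number of bytes actually written
static uint8_t * dump_context_state(struct llama_context * ctx, size_t * size_out) {
    const size_t max_size = llama_get_state_size(ctx);
    uint8_t * buf = malloc(max_size);
    if (buf != NULL) {
        *size_out = llama_copy_state_data(ctx, buf);
    }
    return buf;
}

// later, restore the previously dumped state into a (re-created) context
static void restore_context_state(struct llama_context * ctx, uint8_t * buf) {
    llama_set_state_data(ctx, buf);
}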

To do this smart memory management, I need to be able to estimate the resource usage of llama.cpp objects and to know what resources are available and how much of them is currently in use.
I'd like to have this implemented as part of llama.cpp itself, since llama.cpp supports many backends and I want this information for whichever backends llama.cpp was compiled with.

Possible Implementation

I have a general idea of what the API for this might look like:

typedef struct llama_resource_usage {
    size_t ram;
    size_t vram;
} llama_resource_usage;

typedef struct llama_device_resource_usage {
    const char * device_name;
    bool         is_unified_memory;
    bool         is_gpu;
    size_t       usage;
} llama_device_resource_usage;

typedef struct llama_layer_info {
    // ...
} llama_layer_info;

typedef struct llama_model_info {
    e_model     type;
    llm_arch    arch;
    llama_ftype ftype;
    size_t      vocab_size;
    const llama_layer_info * layers;
} llama_model_info;

// read the model metadata without loading the weights
LLAMA_API llama_model_info llama_get_model_info(const char * path_model);

// estimate the memory needed for a model without actually loading it
LLAMA_API llama_resource_usage llama_estimate_model_memory_usage(
    const struct llama_model_info * model_info,
    struct llama_model_params       mparams
);

// estimate the memory needed for a context without loading the model
LLAMA_API llama_resource_usage llama_estimate_context_memory_usage(
    const struct llama_model_info * model_info,
    struct llama_model_params       mparams,
    struct llama_context_params     cparams
);

// estimate the memory needed for a context of an already loaded model
LLAMA_API llama_resource_usage llama_estimate_context_memory_usage_for_model(
    const struct llama_model *  model,
    struct llama_context_params cparams
);

LLAMA_API const llama_device_resource_usage * llama_get_total_devices_memory(void);
LLAMA_API const llama_device_resource_usage * llama_get_used_devices_memory(void);
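
A rough sketch of how a client like node-llama-cpp could use this proposed API to check whether a model plus a context would fit into the available memory before loading anything (all of these functions are hypothetical, as proposed above; free_ram and free_vram would come from the device queries):

#include <stdbool.h>

// hypothetical caller-side flow for the proposed estimation API
static bool can_fit(
        const char                * path_model,
        struct llama_model_params   mparams,
        struct llama_context_params cparams,
        size_t                      free_ram,
        size_t                      free_vram) {
    llama_model_info info = llama_get_model_info(path_model);

    llama_resource_usage model_mem = llama_estimate_model_memory_usage(&info, mparams);
    llama_resource_usage ctx_mem   = llama_estimate_context_memory_usage(&info, mparams, cparams);

    return model_mem.ram  + ctx_mem.ram  <= free_ram &&
           model_mem.vram + ctx_mem.vram <= free_vram;
}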

Thank you for this awesome project :)

giladgd added the enhancement label on Dec 3, 2023
giladgd changed the title from "feat: memory estimation utilities" to "Feature request: memory estimation utilities" on Dec 11, 2023
giladgd changed the title from "Feature request: memory estimation utilities" to "Memory estimation utilities" on Dec 14, 2023
giladgd (Contributor, Author) commented Dec 27, 2023

This feature is required to bring llama.cpp to production environments

slaren (Collaborator) commented Dec 28, 2023

I agree that this is important, but first we need to allow allocations to fail without crashing the application. Instead, we should return an error when an allocation fails. Let's fix that first, and then we can think about a memory estimation API.

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
