Memory estimation utilities #4315

Closed
giladgd opened this issue Dec 3, 2023 · 3 comments
Labels
enhancement (New feature or request), stale

Comments

giladgd (Contributor) commented Dec 3, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

I'd like to have utility functions exposed in llama.h that I can use to estimate the memory utilization of a llama_context and llama_model, and also get information about the available resources that llama.cpp can use.

For example, I'd like to be able to do these things:

  • Get an estimation of how much memory will be allocated for a llama_context instantiated for a given llama_context_params without actually instantiating it
  • Get an estimation of how much memory will be used by a llama_model instantiated for a given model file path and llama_model_params without actually loading it
  • Get the total amount of available VRAM that llama.cpp has access to across all devices, and also for each device
  • Get the total used VRAM across all devices, and also for each device (given these APIs, users can calculate the free VRAM available for allocation; see the sketch after this list)
  • Get the total RAM that llama.cpp has access to
  • Get the total used RAM
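
To make the free-memory point concrete, here is a rough sketch of how free memory per device could be derived from such total/used queries. Everything in it is hypothetical: the functions and the llama_device_resource_usage type are the ones proposed in the Possible Implementation section below, and the NULL-terminated array is an assumption made only for this sketch.

#include <stdio.h>

// hypothetical usage of the proposed per-device queries - none of these
// functions or types exist in llama.h today; the returned array is assumed
// (for this sketch only) to end with an entry whose device_name is NULL
static void print_free_memory_per_device(void) {
    const llama_device_resource_usage * total = llama_get_total_devices_memory();
    const llama_device_resource_usage * used  = llama_get_used_devices_memory();

    for (size_t i = 0; total[i].device_name != NULL; i++) {
        const size_t free_bytes = total[i].usage - used[i].usage;
        printf("%s: %zu bytes of free %s\n",
            total[i].device_name, free_bytes, total[i].is_gpu ? "VRAM" : "RAM");
    }
}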

Motivation

I'm actively working on node-llama-cpp, and I intend it to provide a high-level API that can be used without having to delve into the implementation details of llama.cpp (highly customizable, but with good defaults).

I'd like to make it easy for users to load a model without having to configure anything: by default, the library will try to provide the maximum capabilities that fit into the user's hardware, deciding on the right split between RAM and VRAM, choosing a good context size, and so on.

The library will know when to load things into memory and when to free them to make room for other things the user needs on demand.
For example, when a user creates another context against the same model, the library can dump the existing context state to a file and free its KV cache so that memory can be reused; when the user later returns to the old context, the library dumps and frees the current one and loads the old context state back into memory.

I'd like to do this for entire models as well, not only for contexts.
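
For the context part of this flow, the dump/restore step can already be expressed with the state API that exists in llama.h today. A minimal sketch, assuming the current llama_get_state_size / llama_copy_state_data / llama_set_state_data functions (exact names and signatures may vary across versions), and using a memory buffer instead of a file for brevity:

#include <stdint.h>
#include <stdlib.h>

#include "llama.h"

// dump a context's state into a heap buffer so its memory can be reclaimed;
// llama_get_state_size returns an upper bound, llama_copy_state_data returns
// the number of bytes actually written
static uint8_t * dump_context_state(struct llama_context * ctx, size_t * size_out) {
    const size_t max_size = llama_get_state_size(ctx);
    uint8_t * buf = malloc(max_size);
    if (buf != NULL) {
        *size_out = llama_copy_state_data(ctx, buf);
    }
    return buf;
}

// later, restore the previously dumped state into a (re-created) context
static void restore_context_state(struct llama_context * ctx, uint8_t * buf) {
    llama_set_state_data(ctx, buf);
}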

To do this smart memory management, I need to be able to estimate the resource usage of llama.cpp objects and to know what resources are available and how much of them is currently in use.
I'd like to have this implemented as part of llama.cpp itself, since llama.cpp supports many backends and I want this information for whichever backends llama.cpp was compiled with.

Possible Implementation

I have a general idea of what the API for this might look like:

typedef struct llama_resource_usage {
    size_t ram;
    size_t vram;
} llama_resource_usage;

typedef struct llama_device_resource_usage {
    const char * device_name;
    bool         is_unified_memory;
    bool         is_gpu;
    size_t       usage;
} llama_device_resource_usage;

typedef struct llama_layer_info {
    // ...
} llama_layer_info;

typedef struct llama_model_info {
    e_model     type;
    llm_arch    arch;
    llama_ftype ftype;
    size_t      vocab_size;
    const llama_layer_info * layers;
} llama_model_info;

// read the model metadata without loading the weights
LLAMA_API llama_model_info llama_get_model_info(const char * path_model);

// estimate the memory needed for a model without actually loading it
LLAMA_API llama_resource_usage llama_estimate_model_memory_usage(
    const struct llama_model_info * model_info,
    struct llama_model_params       mparams
);

// estimate the memory needed for a context without loading the model
LLAMA_API llama_resource_usage llama_estimate_context_memory_usage(
    const struct llama_model_info * model_info,
    struct llama_model_params       mparams,
    struct llama_context_params     cparams
);

// estimate the memory needed for a context of an already loaded model
LLAMA_API llama_resource_usage llama_estimate_context_memory_usage_for_model(
    const struct llama_model *  model,
    struct llama_context_params cparams
);

LLAMA_API const llama_device_resource_usage * llama_get_total_devices_memory(void);
LLAMA_API const llama_device_resource_usage * llama_get_used_devices_memory(void);
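
A rough sketch of how a client like node-llama-cpp could use this proposed API to check whether a model plus a context would fit into the available memory before loading anything (all of these functions are hypothetical, as proposed above; free_ram and free_vram would come from the device queries):

#include <stdbool.h>

// hypothetical caller-side flow for the proposed estimation API
static bool can_fit(
        const char                * path_model,
        struct llama_model_params   mparams,
        struct llama_context_params cparams,
        size_t                      free_ram,
        size_t                      free_vram) {
    llama_model_info info = llama_get_model_info(path_model);

    llama_resource_usage model_mem = llama_estimate_model_memory_usage(&info, mparams);
    llama_resource_usage ctx_mem   = llama_estimate_context_memory_usage(&info, mparams, cparams);

    return model_mem.ram  + ctx_mem.ram  <= free_ram &&
           model_mem.vram + ctx_mem.vram <= free_vram;
}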

Thank you for this awesome project :)

giladgd added the enhancement label on Dec 3, 2023
giladgd changed the title from "feat: memory estimation utilities" to "Feature request: memory estimation utilities" on Dec 11, 2023
giladgd changed the title from "Feature request: memory estimation utilities" to "Memory estimation utilities" on Dec 14, 2023
giladgd (Contributor, Author) commented Dec 27, 2023

This feature is required to bring llama.cpp to production environments

slaren (Collaborator) commented Dec 28, 2023

I agree that this is important, but first we need to allow allocations to fail without crashing the application. Instead, we should return an error when an allocation fails. Let's fix that first, and then we can think about a memory estimation API.

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
