ggml : improve memory management #288
Comments
The main goal is to reduce the memory usage of intermediate results, and additionally to simplify the overall memory management in ggml.

By default, tensors are immediately allocated in the context memory and never freed until the context is deleted. This is great for constants such as the model weights and input parameters, but not so great for intermediate results of computations. Scratch buffers can alleviate this somewhat, but they are not optimal and require a lot of tedious, error-prone work from the user.

The proposed solution would be to delay the memory allocation of intermediate results until the graph has been completely built. Once we have the graph, we can determine the point at which each tensor is no longer needed and free its memory. In principle, we could use only strictly as much memory as is necessary in this way, but in practice it is likely that some memory will still be lost to fragmentation.

Tensors such as the weights and input parameters would still need to be allocated immediately so that the user can modify their values. This could be done by having two types of […].

To determine how much memory has to be allocated for intermediate results, we would have a function that takes a graph and returns the required size of the compute buffer. Initialization of the compute buffers could typically be done by creating a graph of the maximum batch size and running it through this function. This can be done very quickly, and I would expect the entire process to take less than one millisecond (2).

To make this work, all the dependencies between tensors would need to be represented in the graph. Currently, this is sometimes avoided by calling ggml_build_forward_expand, as in:

```c
// store key and value to memory
// k and v are views of kv_self.k and kv_self.v
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Kcur, k));
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Vcur, v));
// code below uses kv_self.k and kv_self.v
```

Instead, it would be done such as:

```c
k = ggml_dependency(ctx, kv_self.k, ggml_cpy(ctx0, Kcur, k));
v = ggml_dependency(ctx, kv_self.v, ggml_cpy(ctx0, Vcur, v));
// code below uses k and v
```

We could also avoid this and other copies entirely if we had a way to specify the output tensor of an operation. For example, we could add a function ggml_set_output:

```c
struct ggml_tensor * Kcur = ggml_rope(ctx, ...);
struct ggml_tensor * k = ggml_view_1d(ctx, kv_self.k, ...);
ggml_set_output(Kcur, k);
// Kcur will no longer be allocated in the intermediate buffer; instead,
// the result will be stored in the same memory as k, and the ggml_cpy is no longer necessary
```

(1): At least, not until the graph has been executed. Even then, it would not be guaranteed that their memory has not been overwritten unless they are the last operation in the graph. A better solution would be to allocate output tensors in an immediate context and copy the result to them, or use […].
I don't think reusing the memory can give any performance improvement; on the contrary, it will make the memory hard to manage.
An alternative to the […]. Similarly, ops could be made in-place automatically when possible. The […]. I would also prefer if […].
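On the idea of making ops in-place automatically: a graph-time allocator could reuse an input's buffer for the result when that input is not needed by any later node. The following is only a minimal sketch of such a check, using simplified hypothetical structures rather than the real ggml tensor/graph types.

```c
#include <stdbool.h>
#include <stddef.h>

// Simplified stand-ins for graph nodes; not the actual ggml structures.
struct node {
    struct node * src0;       // first input of the op
    size_t        nbytes;     // size of the tensor data
    int           n_children; // ops that still need to read this tensor
    bool          is_param;   // weights/inputs must never be overwritten
    void *        data;
};

// Decide whether the result of `dst` may be written directly into the
// memory of its first input, making the op in-place automatically.
static bool can_make_inplace(const struct node * dst) {
    const struct node * src = dst->src0;
    return src != NULL
        && !src->is_param              // never clobber weights or user inputs
        && src->n_children == 1        // only this op still reads src
        && src->nbytes >= dst->nbytes; // result fits in the same buffer
}

// During allocation: reuse the parent's buffer when allowed.
static void alloc_node(struct node * dst) {
    if (can_make_inplace(dst)) {
        dst->data = dst->src0->data; // in-place: no new allocation
    } else {
        // otherwise allocate dst->nbytes from the compute buffer (not shown)
    }
}
```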
One of the biggest problems with ggml currently is that the user needs to manually pre-calculate the necessary sizes for all the ggml_context objects that they create. This is a result of the goal to have as few memory allocations as possible during runtime. However, it has resulted in an unpleasant experience and needs to be improved.

Additionally, the "scratch buffer" mechanism is also very difficult to use and needs to be overhauled as well.
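For context, this is roughly what the current pattern looks like: the caller has to estimate mem_size before any tensors exist. The size formula below is only an illustrative guess (n_tensors, n_embd and n_vocab are placeholders), not a recommended calculation.

```c
#include "ggml.h"

// Today the caller has to pre-calculate how much memory the context will
// need before creating any tensors.
struct ggml_context * create_model_ctx(int n_tensors, int n_embd, int n_vocab) {
    size_t ctx_size = 0;
    ctx_size += n_tensors * ggml_tensor_overhead();                          // per-tensor metadata
    ctx_size += (size_t) n_embd * n_vocab * ggml_type_size(GGML_TYPE_F32);   // tensor data
    ctx_size += 1u << 20;                                                    // safety margin, a guess

    struct ggml_init_params params = {
        .mem_size   = ctx_size, // fixed up front, before any tensor exists
        .mem_buffer = NULL,     // let ggml allocate the buffer
        .no_alloc   = false,
    };

    // If ctx_size is underestimated, later tensor creation fails at runtime;
    // if overestimated, the extra memory is simply wasted.
    return ggml_init(params);
}
```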
This will be quite a big change to the core library and there are many different ways to approach it, so for now I will keep the description of the issue short. Tagging @slaren, as he has shared some nice ideas regarding this topic, which we can discuss further here and decide on a good strategy to implement them.