ggml : improve memory management #288
Comments
The main goal is to reduce the memory usage of intermediate results, and additionally to simplify the overall memory management in ggml.

By default, tensors are immediately allocated in the context memory and never freed until the context is deleted. This is great for constants such as the model weights and input parameters, but not so great for intermediate results of computations. Scratch buffers can alleviate this somewhat, but they are not optimal and require a lot of tedious, error-prone work from the user.

The proposed solution would be to delay the memory allocation of intermediate results until the graph has been completely built. Once we have the graph, we can determine the point at which each tensor is no longer needed and free its memory. In principle, we could use only strictly as much memory as is necessary in this way, but in practice it is likely that some memory will still be lost to fragmentation.

Tensors such as the weights and input parameters would still need to be allocated immediately so that the user can modify their values. This could be done by having two types of […].

To determine how much memory has to be allocated for intermediate results, we would have a function that takes a graph and returns the required size of the compute buffer. Initialization of the compute buffers could typically be done by creating a graph of the maximum batch size and running it through this function. This can be done very quickly, and I would expect the entire process to take less than one millisecond (2).

To make this work, all the dependencies between tensors would need to be represented in the graph. Currently, this is sometimes avoided by calling ggml_build_forward_expand, as in:

```c
// store key and value to memory
// k and v are views of kv_self.k and kv_self.v
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Kcur, k));
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Vcur, v));
// code below uses kv_self.k and kv_self.v
```

Instead, it would be done such as:

```c
k = ggml_dependency(ctx, kv_self.k, ggml_cpy(ctx0, Kcur, k));
v = ggml_dependency(ctx, kv_self.v, ggml_cpy(ctx0, Vcur, v));
// code below uses k and v
```

We could also avoid this and other copies entirely if we had a way to specify the output tensor of an operation. For example, we could add a function ggml_set_output:

```c
struct ggml_tensor * Kcur = ggml_rope(ctx, ...);
struct ggml_tensor * k = ggml_view_1d(ctx, kv_self.k, ...);
ggml_set_output(Kcur, k);
// Kcur will no longer be allocated in the intermediate buffer; instead,
// the result will be stored in the same memory as k, and the ggml_cpy is no longer necessary
```

(1): At least, not until the graph has been executed. Even then, it would not be guaranteed that their memory has not been overwritten unless they are the last operation in the graph. A better solution would be to allocate output tensors in an immediate context and copy the result to them, or use […].
I don't think reusing the memory can give any performance improvement; on the contrary, it will make the memory hard to manage.
An alternative to the […]. Similarly, ops could be made in-place automatically when possible. The […]. I would also prefer if […].
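On the idea of making ops in-place automatically: a graph-time allocator could reuse an input's buffer for the result when that input is not needed by any later node. The following is only a minimal sketch of such a check, using simplified hypothetical structures rather than the real ggml tensor/graph types.

```c
#include <stdbool.h>
#include <stddef.h>

// Simplified stand-ins for graph nodes; not the actual ggml structures.
struct node {
    struct node * src0;       // first input of the op
    size_t        nbytes;     // size of the tensor data
    int           n_children; // ops that still need to read this tensor
    bool          is_param;   // weights/inputs must never be overwritten
    void *        data;
};

// Decide whether the result of `dst` may be written directly into the
// memory of its first input, making the op in-place automatically.
static bool can_make_inplace(const struct node * dst) {
    const struct node * src = dst->src0;
    return src != NULL
        && !src->is_param              // never clobber weights or user inputs
        && src->n_children == 1        // only this op still reads src
        && src->nbytes >= dst->nbytes; // result fits in the same buffer
}

// During allocation: reuse the parent's buffer when allowed.
static void alloc_node(struct node * dst) {
    if (can_make_inplace(dst)) {
        dst->data = dst->src0->data; // in-place: no new allocation
    } else {
        // otherwise allocate dst->nbytes from the compute buffer (not shown)
    }
}
```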
One of the biggest problems with ggml currently is that the user needs to manually pre-calculate the necessary sizes for all the ggml_context objects that they create. This is a result of the goal to have as few memory allocations as possible during runtime. However, it has resulted in an unpleasant experience and needs to be improved.

Additionally, the "scratch buffer" mechanism is also very difficult to use and needs to be overhauled as well.
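For context, this is roughly what the current pattern looks like: the caller has to estimate mem_size before any tensors exist. The size formula below is only an illustrative guess (n_tensors, n_embd and n_vocab are placeholders), not a recommended calculation.

```c
#include "ggml.h"

// Today the caller has to pre-calculate how much memory the context will
// need before creating any tensors.
struct ggml_context * create_model_ctx(int n_tensors, int n_embd, int n_vocab) {
    size_t ctx_size = 0;
    ctx_size += n_tensors * ggml_tensor_overhead();                          // per-tensor metadata
    ctx_size += (size_t) n_embd * n_vocab * ggml_type_size(GGML_TYPE_F32);   // tensor data
    ctx_size += 1u << 20;                                                    // safety margin, a guess

    struct ggml_init_params params = {
        .mem_size   = ctx_size, // fixed up front, before any tensor exists
        .mem_buffer = NULL,     // let ggml allocate the buffer
        .no_alloc   = false,
    };

    // If ctx_size is underestimated, later tensor creation fails at runtime;
    // if overestimated, the extra memory is simply wasted.
    return ggml_init(params);
}
```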
This will be quite a big change to the core library and there are many different ways to approach it, so for now I will keep the description of the issue short. Tagging @slaren, as he has shared some nice ideas regarding this topic, which we can discuss further here and decide on a good strategy to implement them.