Implement automatic NGL detection #6502
base: master
Conversation
When loading a 13B model into a 6 GB GPU
@@ -836,7 +836,12 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
         invalid_param = true;
         return true;
     }
-    params.n_gpu_layers = std::stoi(argv[i]);
+    std::string argValue = argv[i];
+    if (argValue == "auto" || argValue == "a") {
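The hunk above is cut off; a minimal sketch of how the rest of the branch could look (the -2 sentinel meaning "detect automatically" is an assumption for illustration, not necessarily what this PR uses):

std::string argValue = argv[i];
if (argValue == "auto" || argValue == "a") {
    // hypothetical sentinel: resolve the actual layer count later, at model load time
    params.n_gpu_layers = -2;
} else {
    params.n_gpu_layers = std::stoi(argValue);
}

Passing -ngl auto (or -ngl a) on the command line would then trigger the detection path instead of setting a fixed layer count.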
I agree it can be a breaking change, but I would prefer to have this approach as the default, i.e. if -ngl is not passed: automatically offload the maximum possible layers to VRAM.
Could be. If someone doesn't want that, they could simply pass -ngl 0 or just not compile with GPU support.
Yes, but this is just my personal point of view. @ggerganov or @slaren would have a better global view.
This is an important feature, but I would be opposed to merging an implementation that just assumes a value for the KV and compute buffer sizes, doesn't correctly calculate the size of the layers, ignores the CUDA pool size, ignores multiple GPUs, or ignores any other factors that would need to be taken into account. I would rather have the user test different values manually than give them a wrong default that they will assume is correct. Realistically, this cannot be implemented correctly without making significant changes to the architecture of llama.cpp.
Yes, vLLM is doing it with a GPU memory target ratio, but you often get CUDA malloc issues and need to decrease the target. So this is equivalent to our current
Assuming KV was just to get a prototype done; calculating the layers is nearly done on my side (the guesstimation was just for prototyping as well). I'm pretty sure that it can be implemented for single-GPU use without any major changes.
As slaren said, this would definitely be a welcome addition, but there are many edge cases that need to be taken into account.
llama.cpp (Outdated)
static int llm_determine_max_ngl(const llama_model_loader & ml, const llama_model & model, const int main_gpu) {
    const auto & hparams = model.hparams;

    size_t available_gpu_memory = llama_get_available_device_memory(main_gpu);
The program logic here is inconsistent with the --help. The --help says "based on VRAM size", which I would interpret as total VRAM, but here you are using how much free VRAM there is.
llama.cpp (Outdated)
    available_gpu_memory = available_gpu_memory - total_buf_size; // buffer size

    // Calculate the maximum number of layers that can fit into the GPU memory
    int32_t max_ngl = std::floor(static_cast<float>(available_gpu_memory) / memory_per_layer);
You should leave a small amount of headroom when the number of layers is determined automatically; you would want to avoid a scenario where an application OOMs because llama.cpp only left 5 MB of VRAM.
I thought about doing that; thanks for reminding me.
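A minimal sketch of the headroom idea, building on the snippet above (the 5% / 256 MiB figures are placeholder assumptions, not values from this PR):

#include <algorithm>
#include <cstdint>

// Sketch only: reserve some VRAM before dividing by the per-layer footprint,
// so that other applications are not pushed into an out-of-memory situation.
static int32_t max_ngl_with_headroom(size_t available_gpu_memory, size_t memory_per_layer) {
    // placeholder policy: keep the larger of 5% of available memory or 256 MiB free
    const size_t headroom = std::max<size_t>(available_gpu_memory / 20, 256u * 1024 * 1024);
    if (available_gpu_memory <= headroom || memory_per_layer == 0) {
        return 0;
    }
    return (int32_t) ((available_gpu_memory - headroom) / memory_per_layer);
}

Whether the headroom should be a fixed amount, a percentage, or user-configurable is exactly the policy question raised further down in the thread.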
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
KV still needs to be appropriately calculated (will do soon). Otherwise, it should be fine now. Anything I missed?
Quite frankly, there are still many edge cases that are not yet being considered. Just off the top of my head:
The design also needs to take into account possible changes to other parts of the code. It is not OK to just duplicate the calculations in this function. This will become outdated very quickly, and then it will become an unacceptable maintenance burden, or just plain wrong and misleading. And this cannot be done without very significant changes to the llama.cpp and backend interfaces. I already mentioned this: realistically, this cannot be done without significant changes to the architecture.
It would be extremely nice if this were not "fully automatic, one size fits all", but if one could instead specify absolute limits per GPU, so one can e.g. leave enough space for other workloads (such as the OS).
@slaren I understand this is a sensitive feature, but FYI we are very strongly interested in this one. Let us know if there is anything we could do. Best.
Are you saying you're interested in implementing the feature or using it?
I'm willing to resume this in the future if I either find an easier way to handle it or there are compromises you are willing to accept. I haven't had much time recently; that's why this comes a little late...
I am interested in this feature and can explain our use case. We have a client app for Windows, Fusion Quill. It uses llama.cpp and runs on end-user Windows 11/10 PCs. There is no easy way to tell the user the optimal number of layers to offload to the GPU, so having llama.cpp calculate it is the best solution for our use case. It can be a new optional parameter so it does not interfere with current use cases. Multi-GPU support can be a future implementation; I have dual GPUs, but most of my end users don't. Also, if someone has ideas on how this could be done via code outside of llama.cpp, I would be interested in hearing them.
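One rough way to approximate this outside of llama.cpp (a sketch only; the per-layer estimate has to come from the application itself, e.g. model file size divided by the layer count from the GGUF metadata, and none of this is provided by llama.cpp here):

#include <nvml.h>
#include <cstdint>

// Sketch only: read free VRAM via NVML and derive an -ngl value before launching llama.cpp.
static int estimate_ngl(unsigned device_index, uint64_t bytes_per_layer, int n_layers) {
    if (nvmlInit() != NVML_SUCCESS) return -1;            // fall back to manual -ngl
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    int ngl = -1;
    if (nvmlDeviceGetHandleByIndex(device_index, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS && bytes_per_layer > 0) {
        ngl = (int) (mem.free / bytes_per_layer);         // free VRAM / estimated layer size
        if (ngl > n_layers) ngl = n_layers;               // never more than the model has
    }
    nvmlShutdown();
    return ngl;                                           // pass to llama.cpp as -ngl
}

This does not account for the KV cache or compute buffers, which is exactly the concern raised earlier in the thread, so it would still need headroom on top.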
This is not ready for merging; I still want to change/improve some stuff.
I implemented the option to pass "a" or "auto" with the -ngl parameter to automatically detect the maximum number of layers that fit into VRAM. Some stuff is still hard-coded or implemented weirdly; I'll improve that in the next commit(s).
Feedback is most definitely appreciated.