fix: only reset LoRa configs when they have changed from previous batch #19280
ggerganov merged 2 commits into ggml-org:master from …
Conversation
ggerganov
left a comment
Better to move the check inside llama_context::set_adapter_lora() and not set the sched_need_reserve flag if the same adapters are passed.
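A minimal sketch of what that suggestion could look like, using simplified stand-in types (only `set_adapter_lora` and the `sched_need_reserve` flag are taken from the comment above; the real `llama_context` members differ):

```cpp
#include <map>

// Stand-in types for illustration only; not the actual llama.cpp declarations.
struct llama_adapter_lora; // opaque adapter handle

struct llama_context_sketch {
    std::map<llama_adapter_lora *, float> loras; // currently applied adapters -> scale
    bool sched_need_reserve = false;

    void set_adapter_lora(llama_adapter_lora * adapter, float scale) {
        // If this adapter is already applied with the same scale, the call is a
        // no-op: no state change and, crucially, no scheduler re-reserve.
        auto it = loras.find(adapter);
        if (it != loras.end() && it->second == scale) {
            return;
        }
        loras[adapter] = scale;
        sched_need_reserve = true; // reserve only when the adapter set actually changed
    }
};
```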
Force-pushed 8edc85d to b131ad4
Had to implement the check in the … Added some helpers to validate whether the LoRA configs are the same, and if they are, skip clearing and skip reserving the scheduler.
Can't we update the logic in llama.cpp/src/llama-context.cpp, lines 991 to 1001 (at 6ab881b)?
Apologies because I might be missing something. In the `common_set_adapter_lora` function, we call `…` to clear the LoRAs first. We could just remove that call (or move it to `llama_context::set_adapter_lora`), but I wasn't sure if there is some reason we wanted that in the common functionality. My latest change attempts to do what you're looking for, just in the common code.
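For illustration, here is a rough stand-in for the clear-then-reapply pattern being discussed (types and names are simplified, not the actual `common`/`llama` declarations); the point is that clearing unconditionally marks the scheduler for a reserve on every call, even when the requested adapters are identical to what is already applied:

```cpp
#include <vector>

// Simplified stand-ins; not the real llama.cpp/common types.
struct lora_request { int adapter_id; float scale; };

struct ctx_state {
    std::vector<lora_request> applied;
    bool need_reserve = false;
};

// Shape of the current behaviour under discussion: clear everything, then
// re-apply the requested adapters. Every call forces a reserve, even if
// `requested` matches what was already applied.
void set_adapters_clear_then_apply(ctx_state & ctx, const std::vector<lora_request> & requested) {
    ctx.applied.clear();
    ctx.need_reserve = true; // clearing already counts as a change
    for (const auto & r : requested) {
        if (r.scale != 0.0f) {
            ctx.applied.push_back(r);
        }
    }
}
```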
Force-pushed 483b739 to 3b82c3d
@ggerganov I was able to refactor by creating a new `…`. Is this more along the lines of what you had in mind?
@ggerganov I can confirm this fixes the issues I had; throughput goes from 8 tps to 360 tps for my small 1.2B-parameter model. @agent-enemy-2 great work, thank you 😄
@agent-enemy-2 @tugot17 Could you confirm that the last version that I just pushed works as expected?
@ggerganov Just tested your latest commit and it LGTM! @tugot17 Thanks for testing!
…-org#19280)" This reverts commit 2d8015e.
Overview
Fix for #19217
Currently we set the LoRA config for every token request, even if the configuration has not changed between batches. It appears that in this PR, at llama-context.cpp line 1078, scheduler reserving was added for when new LoRA configs are set, so now, whenever a LoRA config is in use, we reserve the scheduler for every batch we decode.
This change adds a field to the server slot that holds the previous batch's LoRA config. For each batch, we check whether the config is the same as the previous batch's, and only set it if it has changed; otherwise we proceed without resetting it. A minimal sketch of that idea is shown below.
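The sketch uses hypothetical field and type names (the actual server slot and LoRA config types in the repo differ):

```cpp
#include <vector>

// Hypothetical stand-ins for the server-side LoRA config and slot state.
struct lora_config { int adapter_id; float scale; };

inline bool operator==(const lora_config & a, const lora_config & b) {
    return a.adapter_id == b.adapter_id && a.scale == b.scale;
}

struct server_slot_sketch {
    // New field: the LoRA config applied for the previous batch.
    std::vector<lora_config> prev_lora;
};

// Apply the requested LoRA config only when it differs from what the slot
// already has; otherwise skip the clear/set and the scheduler reserve.
template <typename ApplyFn>
void maybe_set_lora(server_slot_sketch & slot,
                    const std::vector<lora_config> & requested,
                    ApplyFn apply) {
    if (slot.prev_lora == requested) {
        return; // unchanged since the previous batch
    }
    apply(requested);           // e.g. clear + set adapters on the context
    slot.prev_lora = requested; // remember for the next batch
}
```

Comparing against the previous batch's config is what avoids the repeated scheduler reserve that caused the slowdown.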
Testing
I was able to reproduce the issue relatively easily with some debug logs, and could see the scheduler reserve being triggered endlessly along with huge memory spikes. After these changes, the issue no longer appears and memory usage remains stable.