fix: only reset LoRa configs when they have changed from previous batch #19280

Merged

ggerganov merged 2 commits into ggml-org:master from agent-enemy-2:bug-fix/memory-leak on Feb 14, 2026

Conversation

agent-enemy-2 (Contributor) commented Feb 3, 2026

Overview

Fix for #19217

Currently we set the LoRA config for every token request, even when the configuration has not changed between batches. It appears that in this PR, in llama-context.cpp (line 1078), scheduler reserving was added whenever new LoRA configs are set, so now, when a LoRA config is present, we end up reserving the scheduler for every batch we decode.

This change adds a field to the server slot that holds the previous batch's LoRA config. For each batch, we check whether the config matches the previous batch's and only set it (triggering a scheduler reserve) when it differs; otherwise we leave it alone and proceed.
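Roughly, the idea looks like the sketch below. This is illustrative only: the prev_lora field, the lora_equal helper, and the maybe_set_lora wrapper are assumed names rather than the PR's exact code, while common_adapter_lora_info and common_set_adapter_lora are the existing helpers from llama.cpp's common library.

#include <vector>

#include "common.h" // common_adapter_lora_info, common_set_adapter_lora (existing helpers)

// Sketch of the per-slot state: remember what was applied for the previous batch.
struct server_slot {
    // ... existing slot state ...
    std::vector<common_adapter_lora_info> lora;      // config requested for the current batch
    std::vector<common_adapter_lora_info> prev_lora; // config applied for the previous batch
};

// Two configs are "the same" if they reference the same adapters with the same scales.
static bool lora_equal(const std::vector<common_adapter_lora_info> & a,
                       const std::vector<common_adapter_lora_info> & b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].ptr != b[i].ptr || a[i].scale != b[i].scale) {
            return false;
        }
    }
    return true;
}

// Per batch: only touch the context (and thereby trigger a scheduler reserve) on a real change.
static void maybe_set_lora(llama_context * ctx, server_slot & slot) {
    if (!lora_equal(slot.lora, slot.prev_lora)) {
        common_set_adapter_lora(ctx, slot.lora);
        slot.prev_lora = slot.lora;
    }
}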

Testing

I was able to reproduce the issue relatively easily with some debug logs:

./llama-server -hf bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0 --lora ~/Downloads/LoRA-Llama-3.1-8B-MultiReflection-f16.gguf -c 1024 --parallel 1 --host 0.0.0.0 --port 8080 --verbose

and saw the following endlessly, along with huge memory spikes:

Setting sched_need_reserve to TRUE in the set_adapter_lora function
set_embeddings: value = 0
Reserving memory during decoding

sched_reserve: reserving ...
sched_reserve: max_nodes = 3692
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" requests"}}],"created":1770077607,"id":"chatcmpl-1gqsKPcqCDHRRMZFNqfWqnl557ARXVc9","model":"bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0","system_fingerprint":"b7916-0dfcd3b60","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":109,"prompt_ms":628.118,"prompt_per_token_ms":5.762550458715597,"prompt_per_second":173.5342722227352,"predicted_n":2,"predicted_ms":180.088,"predicted_per_token_ms":90.044,"predicted_per_second":11.105681666740704}}


sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:       MTL0 compute buffer size =   509.12 MiB
sched_reserve:        CPU compute buffer size =    18.01 MiB
sched_reserve: graph nodes  = 1903
sched_reserve: graph splits = 2

After these changes, the issue no longer appears and memory remains stable.

ggerganov (Member) left a comment


Better to move the check inside llama_context::set_adapter_lora() and not set the sched_need_reserve flag if the same adapters are passed.

agent-enemy-2 force-pushed the bug-fix/memory-leak branch 3 times, most recently from 8edc85d to b131ad4, on February 4, 2026 at 05:36
agent-enemy-2 (Contributor, Author) commented Feb 4, 2026

Had to implement the check in common_set_adapter_lora because that method was clearing the LoRAs every time and then also setting the sched_need_reserve flag.

Added some helpers to check whether the LoRA configs are the same; if they are, we skip clearing the adapters and skip reserving the scheduler.
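A rough sketch of what that early return could look like, assuming the last-applied config is cached somewhere the function can compare against; the prev cache and the lora_equal-style comparison are illustrative rather than the PR's actual helpers, and the clear-then-apply loop mirrors the existing common_set_adapter_lora.

void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora) {
    // Hypothetical cache; a real implementation would track this per context, not in a static.
    static std::vector<common_adapter_lora_info> prev;

    if (lora_equal(lora, prev)) {
        return; // unchanged config: skip the clear and the scheduler reserve it would trigger
    }

    llama_clear_adapter_lora(ctx); // drop the currently applied adapters
    for (auto & la : lora) {
        if (la.scale != 0.0f) {
            llama_set_adapter_lora(ctx, la.ptr, la.scale);
        }
    }

    prev = lora;
}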

ggerganov (Member) commented Feb 4, 2026

Had to implement the check in common_set_adapter_lora because that method was clearing the LoRAs every time and then also setting the sched_need_reserve flag.

Can't we update the logic in llama_set_adapter_lora() to first check whether they are the same and, if they are, do nothing? We already have such logic in other functions, for example:

void llama_context::set_causal_attn(bool value) {
    LLAMA_LOG_DEBUG("%s: value = %d\n", __func__, value);

    if (cparams.causal_attn == value) {
        return;
    }

    cparams.causal_attn = value;

    sched_need_reserve = true;
}

agent-enemy-2 (Contributor, Author) commented

Apologies because I might be missing something.

In the common_set_adapter_lora function we call llama_clear_adapter_lora, which prevents any check in llama_context::set_adapter_lora from telling whether the LoRA adapter being passed in is the same as what is currently in the context, since by that point there is no current adapter.

We could just remove that call to clear the LoRAs (or move it into llama_context::set_adapter_lora), but I wasn't sure whether there is some reason we want it in the common functionality.

My latest change attempts to do what you're looking for, just in the common_set_adapter_lora function, prior to clearing the context LoRAs.

agent-enemy-2 (Contributor, Author) commented

@ggerganov I was able to refactor by creating a new llama_context::put_adapter_loras function that first checks whether the provided LoRA configs are the same as what already exists on the context and, if so, does nothing. Otherwise it clears the LoRA adapters from the context and re-applies them (same logic as before).

Is this more along the lines of what you had in mind?
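For illustration, a minimal sketch of that shape, assuming the context stores its applied adapters as a map from adapter pointer to scale; the member and method names are assumptions, not the merged implementation.

void llama_context::put_adapter_loras(const std::map<llama_adapter_lora *, float> & new_loras) {
    if (loras == new_loras) {
        return; // same adapters and scales: keep existing buffers, no re-reserve needed
    }

    loras = new_loras; // "clear and re-apply" collapses to a single assignment here

    sched_need_reserve = true; // graph/buffer sizes may change, so reserve again on the next decode
}

This follows the set_causal_attn pattern quoted earlier: bail out on no-op updates so sched_need_reserve is only raised when the graph actually needs re-reserving.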

tugot17 commented Feb 11, 2026

@ggerganov I can confirm this fixes the issues I had: it goes from 8 tps to 360 tps for my small 1.2B-parameter model.

@agent-enemy-2 great work, thank you 😄

ggerganov (Member) commented

@agent-enemy-2 @tugot17 Could you confirm that the last version that I just pushed works as expected?

agent-enemy-2 (Contributor, Author) commented Feb 13, 2026

@ggerganov Just tested your latest commit and it LGTM!

@tugot17 Thanks for testing!

ggerganov merged commit 2d8015e into ggml-org:master on Feb 14, 2026
75 of 78 checks passed
ggerganov added the "breaking change" label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on Feb 14, 2026
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Feb 14, 2026
