Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
./llama-server --grammar-file fuck-emdashes.gbnf -ngl 99 --host 0.0.0.0 -c 190000 -m Qwen3-4B-Instruct-2507-Q8_0.gguf -fa auto -cram 0 --slots --slot-save-path kv/qwen3-4b --no-mmap
Problem description & steps to reproduce
slot update_slots: id 1 | task 9700 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.033170
decode: failed to find a memory slot for batch of size 2048
srv try_purge_id: purging slot 2 with 1724 tokens
srv update_slots: failed to find free space in the KV cache, retrying with smaller batch size, i = 0, n_batch = 2048, ret = 1
decode: failed to find a memory slot for batch of size 2048
srv try_purge_id: purging slot 3 with 184145 tokens
srv update_slots: failed to find free space in the KV cache, retrying with smaller batch size, i = 0, n_batch = 2048, ret = 1
slot update_slots: id 1 | task 9700 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 1 | task 9700 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, prog
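The exact requests aren't shown above, but a loop along these lines hits the same path (the /completion endpoint and the id_slot, cache_prompt, and n_predict fields are standard llama-server request fields; the prompt file, request count, and port are placeholders):

# Hypothetical reproduction loop: repeated long-prompt completions against one slot.
# long-prompt.txt is a placeholder and must be JSON-safe (no unescaped quotes/newlines).
for i in $(seq 1 5); do
  curl -s http://localhost:8080/completion \
    --data '{"id_slot": 1, "cache_prompt": true, "n_predict": 32, "prompt": "'"$(cat long-prompt.txt)"'"}' \
    > /dev/null
done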
There aren't any other slots in use, yet after 4-5 requests the server seemingly at random decides it is out of KV cache space and purges the slot's cached content.
Could you please not do that? Best regards.
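Slot occupancy can be double-checked through the /slots endpoint, which the --slots flag in the command line above enables:

# Returns per-slot JSON state, including whether each slot is currently processing.
curl -s http://localhost:8080/slots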
Also, you might as well revert the RAM caching misfeature: tmpfs (or any other RAM disk) already exists for this, and llama.cpp can't detect the total memory size anyway, so the server just gets OOM-killed every time.
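For example, a tmpfs mount pointed at --slot-save-path keeps the saved KV state RAM-resident without any in-process RAM cache (the mount point and size below are illustrative, not from the original report):

# Illustrative only: back the slot save path with tmpfs instead of the built-in RAM cache.
sudo mount -t tmpfs -o size=32G tmpfs /mnt/kv
./llama-server -m Qwen3-4B-Instruct-2507-Q8_0.gguf --slots --slot-save-path /mnt/kv/qwen3-4b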
First Bad Commit
No response
Relevant log output