Description
Name and Version
LLAMA API (Sunday, Oct 5, 13:43pm UTC)
DLLs built by github workers :p
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Problem description & steps to reproduce
Actual bugs:
- `n_seq_max` determines the actual number of sequences, instead of just being a max. In addition, it splits the context between the n seqs, effectively giving each seq `context_size / n_seq_max`.
- Specifying `n_seq_max = 2` but passing 3 seqs throws an appropriate exception during decode, but leaving it at the default of `1` just returns bad decode results while still allowing you to continue anyway.
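A minimal repro sketch of the second point (assuming the standard `llama.h` C API; the model path and token id are placeholders):

```cpp
#include "llama.h"

int main() {
    llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 4096;
    cparams.n_seq_max = 1; // the default -- supposedly just an upper bound

    llama_context * ctx = llama_init_from_model(model, cparams);

    // Spread 3 tokens over 3 different sequence ids, exceeding n_seq_max.
    llama_batch batch = llama_batch_init(/*n_tokens*/ 3, /*embd*/ 0, /*n_seq_max*/ 1);
    batch.n_tokens = 3;
    for (int i = 0; i < 3; ++i) {
        batch.token[i]     = 1;  // placeholder token id
        batch.pos[i]       = 0;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = i;  // seq ids 0, 1, 2
        batch.logits[i]    = 1;
    }

    // With n_seq_max = 2 this fails loudly during decode; with the default
    // n_seq_max = 1 it returns as if successful, but the results are bad.
    const int ret = llama_decode(ctx, batch);
    (void) ret;

    llama_batch_free(batch);
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```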
Actual problem: the existence of `n_seq_max` in `llama_context_params`
This does a lot of magic, like splitting the total context size into `n_seq_max` partitions, even if some seqs are unused.
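For concreteness, a tiny sketch of what that split means in practice (the numbers are made up; the behavior is as described in this report):

```cpp
// Hypothetical numbers; behavior as described above.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx     = 8192;
cparams.n_seq_max = 4;
// The cache is partitioned up front: each of the 4 seqs effectively gets
// 8192 / 4 = 2048 cells of context, even if only one seq is ever used.
```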
A few months ago, the context would automatically & dynamically manage its sequence count during updates.
Now, the context is initialized with the specified `n_seq_max` number of seqs and stays locked to that forever.
The `n_seq_max` handling was fixed, but I find that a flexibility downgrade from what used to exist.
In addition, this completely disallows scheduling on the same `llama_context`. I'll expand on this below.
Currently, to dynamically support a different number of batches, or any scheduling, one would have to:
- Create a context with, say, 10 batches (the current load) and start decoding.
- Say a new request arrives while inferencing. Now we want 11 batches.
- Create a new context and copy over the old one -- the cache is lost & the memory handle can't be reused (sketched below).
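A rough sketch of that workaround, assuming the `llama_state_seq_*` functions in `llama.h` are used to carry each sequence's cache into the new context (error handling and context creation omitted; the function name and counts are placeholders):

```cpp
#include "llama.h"
#include <vector>

// Move every sequence from ctx_old (created with n_seq_max = n_seq_old) into a
// freshly created ctx_new with a larger n_seq_max. The old context and its
// KV cache memory are thrown away afterwards -- nothing is reused in place.
static void migrate_sequences(llama_context * ctx_old, llama_context * ctx_new, int n_seq_old) {
    for (llama_seq_id s = 0; s < n_seq_old; ++s) {
        const size_t sz = llama_state_seq_get_size(ctx_old, s);
        std::vector<uint8_t> buf(sz);
        llama_state_seq_get_data(ctx_old, buf.data(), buf.size(), s);
        llama_state_seq_set_data(ctx_new, buf.data(), buf.size(), s);
    }
    llama_free(ctx_old); // the old cache's memory cannot be handed over directly
}
```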
Proposal:
Get rid of `n_seq_max` completely, or, in the worst case, make it just a safety net for context manipulation, instead of having it decide the actual number of sequences that will always persist.
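Purely for illustration, a hypothetical shape of what that could look like (this is not the current API; the "0 = no fixed partitioning" semantics are made up to show the intent):

```cpp
// Hypothetical -- not the current llama.cpp API.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx     = 8192;
cparams.n_seq_max = 0; // made-up meaning: no up-front partitioning, no hard cap
// A decode with seq ids 0..10 would then just work, with the context growing
// and shrinking its per-sequence bookkeeping dynamically, as it used to.
```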