Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Trying to run embedding with a batch of inputs à la #2199. This should work; instead it fails as soon as an input in the batch is longer than 256 tokens.
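For reference, a minimal sketch of the call pattern that triggers this (model path, `n_ctx`, and the texts are placeholders; any GGUF embedding model and any input longer than the per-sequence slice reproduces it):

```python
from llama_cpp import Llama

# Placeholder model path; any GGUF embedding model shows the same behaviour.
llm = Llama(
    model_path="./embedding-model.gguf",
    embedding=True,
    n_ctx=4096,  # illustrative; with kv_unified=False each of the 256
                 # sequences only gets n_ctx / n_seq_max tokens of context
)

# Batch embedding as in #2199: fails once any single input exceeds the
# per-sequence context slice described below.
texts = ["a short document", "a much longer document " * 200]
embeddings = llm.create_embedding(texts)
```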
The problem is that we currently use `Llama.context_params.kv_unified = False` (the default). According to https://github.com/ggml-org/llama.cpp/blob/master/src/llama-context.cpp#L183-L197, this causes `n_ctx_seq` to be computed as `n_ctx / n_seq_max`, which leaves `n_ctx_seq` (the context length per individual input sequence) at 256 😥 (it is also reported as 256 in the verbose output during context initialisation).
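To make the arithmetic concrete (values are illustrative; the division mirrors the linked llama-context.cpp logic):

```python
# Illustrative values only; the point is the division, not the exact numbers.
n_ctx = 65536        # total context / KV cache size requested
n_seq_max = 256      # hard-coded in llama-cpp-python since #2206
kv_unified = False   # current default

if not kv_unified:
    # Each parallel sequence gets its own fixed slice of the context.
    n_ctx_seq = n_ctx // n_seq_max   # 65536 // 256 = 256 tokens per input
```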
Current Behavior
`n_seq_max` is fixed at 256 (as of #2206, included in 0.3.23). This makes `n_ctx_seq` very small, so embedding with batches is essentially unusable: inputs are rarely shorter than 256 tokens.
Failure Logs
```
decode: n_ctx is not divisible by n_seq_max - rounding down to XXX
```