Need to set kv_unified=True to enable batch processing #2216

@mlisovyi

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code (0.3.23). Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Trying to run embedding with a batch of inputs, à la #2199. This fails whenever an input in the batch is longer than 256 tokens.
The problem is that we currently use Llama.context_params.kv_unified = False (the default). According to https://github.com/ggml-org/llama.cpp/blob/master/src/llama-context.cpp#L183-L197, this makes llama.cpp compute n_ctx_seq as n_ctx / n_seq_max, so n_ctx_seq (the context length available to each individual input sequence) ends up as 256 😥 (it is also reported as 256 in the verbose output during context initialisation).
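For reference, if I read the linked C++ correctly, it boils down to roughly the following (a Python paraphrase with illustrative numbers, not the actual llama.cpp code; n_ctx = 65536 is just an example that yields the 256 I observed):

```python
# Rough paraphrase of llama-context.cpp#L183-L197 (illustrative only).
n_ctx = 65536        # total context requested at context creation (example value)
n_seq_max = 256      # hard-coded maximum number of parallel sequences (see #2206)
kv_unified = False   # current llama-cpp-python default

if kv_unified:
    # Unified KV cache: every sequence can address the full context.
    n_ctx_seq = n_ctx
else:
    # Split KV cache: the context is divided evenly across the sequences.
    n_ctx_seq = n_ctx // n_seq_max   # 65536 // 256 = 256 tokens per input sequence

print("n_ctx_seq =", n_ctx_seq)
```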

Current Behavior

n_seq_max is fixed at 256 (as of #2206, included in 0.3.23). This leaves n_ctx_seq VERY small, so embedding with batches is essentially unusable: inputs are rarely shorter than 256 tokens. A minimal reproduction sketch follows.
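
For concreteness (model path, n_ctx and the documents below are placeholders, not from my actual run; the batched embed call follows #2199):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/embedding-model.gguf",  # placeholder path
    embedding=True,
    n_ctx=65536,   # with kv_unified=False this gets split across 256 sequences -> 256 tokens each
    verbose=True,  # n_ctx_seq is printed in the context initialisation output
)

# Batch embedding as in #2199. Any input longer than n_ctx_seq (256 here) tokens
# makes the decode fail, because each sequence only gets n_ctx / n_seq_max slots.
documents = [
    "first document, realistically well over 256 tokens ...",
    "second document, also longer than 256 tokens ...",
]
embeddings = llm.embed(documents)
```

With kv_unified=True (what this issue asks for), each sequence should be able to use the full n_ctx instead, so batches of realistic-length inputs would fit.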

Failure Logs

decode: n_ctx is not divisible by n_seq_max - rounding down to XXX
