[Inference Backend] Enable attention Key/value caching #72
Comments
Hi @lbeurerkellner, thanks for the quick response. I am referring to caching the LLM's key/value attention pairs for sequential variable value generation. So for instance in the above example, first the LLM populates the …
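To make the idea concrete, here is a minimal sketch of key/value cache reuse across sequential generation steps, assuming a HuggingFace `transformers` backend (`gpt2` and the prompt are placeholders, not part of this project):

```python
# Sketch: reuse the attention key/value cache so the shared prompt prefix
# is only encoded once, and later steps only process the newly added tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Name a color and a fruit.\nColor:"
inputs = tokenizer(prompt, return_tensors="pt")

# First step: run the full prompt and keep its key/value cache.
out = model(**inputs, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Next step: feed only the new token; attention over the prefix is served
# from `past` instead of being recomputed from scratch.
out = model(input_ids=next_token, past_key_values=past, use_cache=True)
```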
I see, thanks for the article, this clarifies it for me. I was primed on the wrong abstraction level, not thinking of transformer internals. Yes, this is definitely on the list of things we plan to do. All in all, a much deeper integration on the inference side (beyond just token masking) is possible and something we are working on. HuggingFace already provides a lot of the infrastructure for this, but we are also exploring other options, especially Python-independent inference backends. Multi-part prompts in general should probably be more transparent to the inference side, to enable cross-call optimizations. Happy to also hear your thoughts on what other optimizations might be interesting from a production perspective.
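For reference, the "token masking" level of integration mentioned above roughly corresponds to a per-step logits mask. A hypothetical illustration (the class name and allow-list are made up, this is not the project's actual implementation):

```python
# Illustration of per-step token masking via a transformers LogitsProcessor:
# every token outside the allowed set gets -inf, so it can never be sampled.
import torch
from transformers import LogitsProcessor

class AllowListProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0
        return scores + mask
```

Deeper integration would go beyond this, e.g. by also controlling how the KV cache is built and reused across the parts of a multi-part prompt.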
The PagedAttention mechanism from vLLM may help: https://github.com/vllm-project/vllm/blob/49b26e2cec8c56594668905e853fe4af34336b05/vllm/model_executor/layers/attention.py#L16
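vLLM manages the KV cache internally through PagedAttention, so a caller only sees the high-level generate API. A small usage sketch (model name and prompt are placeholders):

```python
# Sketch: vLLM handles KV cache paging and batching behind llm.generate().
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["Q: Name a color.\nColor:"], params)
print(outputs[0].outputs[0].text)
```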
Any progress on this?
There is a proof-of-concept implementation on a feature branch, but making it work with batching, padding, and multi-part prompting still requires some work. It may be worth factoring out support for non-batched KV caching for now.
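As a rough sketch of what the non-batched case could look like (names such as `CachedDecoder` are hypothetical, and the cache is only reused when the new prompt strictly extends the previous one by at least one token):

```python
# Sketch: single-sequence KV cache reuse across calls with a shared prefix.
import torch

class CachedDecoder:
    def __init__(self, model):
        self.model = model
        self.cached_ids = None   # list of token ids already processed
        self.past = None         # their key/value cache

    @torch.no_grad()
    def step(self, prompt_ids):
        # Reuse the cache only if the new prompt starts with the cached prefix.
        if (self.cached_ids is not None
                and prompt_ids[: len(self.cached_ids)] == self.cached_ids):
            new_ids = prompt_ids[len(self.cached_ids):]
        else:
            new_ids, self.past = prompt_ids, None

        out = self.model(
            input_ids=torch.tensor([new_ids]),
            past_key_values=self.past,
            use_cache=True,
        )
        self.cached_ids, self.past = prompt_ids, out.past_key_values
        return out.logits[:, -1]  # next-token scores
```

Padding and batching are exactly what this simple version sidesteps, which is why factoring it out first seems reasonable.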
Is there a plan to incorporate key/value caching to improve generation efficiency significantly? See e.g. Guidance's acceleration.