[Inference Backend] Enable attention Key/value caching #72

Open
arnaudvl opened this issue May 17, 2023 · 6 comments
Labels: enhancement (New feature or request), question (Questions about using LMQL)

Comments

@arnaudvl

Is there a plan to incorporate key/value caching to improve generation efficiency significantly? See e.g. Guidance's acceleration.

@lbeurerkellner added the enhancement and question labels and removed the enhancement label May 18, 2023
@lbeurerkellner
Collaborator

lbeurerkellner commented May 18, 2023

As far as I understand Guidance's approach, the key idea is to only call the LLM to complete template variables rather than re-generating the entire template. LMQL has implemented this form of acceleration since its very first release.

For instance, in the query shown below, only the concrete variable values are actually predicted by the LLM, whereas the surrounding template is automatically inserted by the runtime. The number of LLM calls/forward passes required to run the query corresponds exactly to the number of value tokens completed in the template, not to the template itself.

[image: screenshot of an example LMQL query with template variables such as ID and DESCRIPTION; only the bracketed variable values are generated by the LLM]
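
For reference, a rough sketch of what such a query might look like via LMQL's Python integration (the function name `item_record`, the prompt text, constraints, and the model identifier are illustrative assumptions, not the exact query from the screenshot):

```python
import lmql

# Sketch only: the bracketed template variables (ID, DESCRIPTION) are the
# only parts generated by the LLM; the surrounding prompt text is inserted
# by the LMQL runtime without additional forward passes.
@lmql.query
async def item_record():
    '''
    argmax
        "id: [ID]\n"
        "description: [DESCRIPTION]"
    from
        "openai/text-davinci-003"
    where
        STOPS_AT(ID, "\n") and len(TOKENS(DESCRIPTION)) < 40
    '''
```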

What is currently not possible is to provide default values in order to avoid re-generating values that are already known, e.g. if you already have a value for DESCRIPTION in the query above. However, we plan to add this in a future release, as it has been brought up repeatedly by the community.

Does this answer your question? What aspect of Guidance's key/value caching would you be looking for in LMQL? We are very interested in feedback and ideas surrounding this.

@lbeurerkellner added the enhancement label May 18, 2023
@arnaudvl
Author

Hi @lbeurerkellner, thanks for the quick response. I am referring to caching the LLM's key/value attention pairs across sequential variable value generations. For instance, in the above example, the LLM first populates ID. On the next call it generates DESCRIPTION. At that point the LLM's key/value pairs for the template up until ID have already been computed and should not have to be recomputed. The longer the template to be filled in becomes, the more important (faster + cheaper) this caching becomes. For production use cases this is a very significant benefit. Check this post for more detail. Of course, (for now) this is only possible for self-hosted models, not external API calls such as the OpenAI models.
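
To make the mechanism concrete, here is a minimal sketch of the idea using HuggingFace `transformers` (the model choice `gpt2`, the prompt text, and the fixed-length decode loop are illustrative assumptions, not LMQL's implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustration only: reuse the attention key/value cache across two
# sequential template-variable completions instead of re-encoding the
# shared prefix each time.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

with torch.no_grad():
    # 1) Encode the template prefix once and keep the key/value cache.
    prefix = tok("id: ", return_tensors="pt")
    out = model(**prefix, use_cache=True)
    past = out.past_key_values  # cached keys/values for every layer
    next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # 2) Decode ID token by token, feeding only the newly generated token
    #    plus the cache, so the prefix is never recomputed.
    for _ in range(5):  # toy fixed-length decode instead of a stop criterion
        out = model(input_ids=next_tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # 3) When the template continues with "description: ", append only the
    #    last generated token and the new template tokens to the same cache
    #    rather than re-running the whole prompt, then decode DESCRIPTION
    #    the same way.
    cont = tok("\ndescription: ", return_tensors="pt")
    step = torch.cat([next_tok, cont.input_ids], dim=-1)
    out = model(input_ids=step, past_key_values=past, use_cache=True)
    past = out.past_key_values
```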

@lbeurerkellner
Collaborator

I see, thanks for the article; that clarifies it for me. I was thinking at the wrong abstraction level, not about transformer internals.

Yes, this is definitely on the list of things we plan to do. All in all, a much deeper integration on the inference side (beyond just token masking) is possible and is something we are working on. HuggingFace already provides a lot of the infrastructure for this, but we are also exploring other options, especially python-independent inference backends.

Multi-part prompts in general should probably be more transparent to the inference side, to enable cross-call optimizations. Happy to hear your thoughts on what other optimizations might be interesting from a production perspective.

@lbeurerkellner changed the title from "Key/value caching" to "[Inference Backend] Enable attention Key/value caching" May 18, 2023
@doxav

doxav commented Jul 2, 2023

@kongjiellx

Any progress on this?

@lbeurerkellner
Collaborator

There is a proof-of-concept implementation on a feature branch, but making it work with batching, padding and multi-part prompting still requires some work. It may be worth factoring out support for non-batched KV caching for now.
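
Purely as an illustration of what non-batched (batch size 1) KV caching could mean here (this is not the feature-branch implementation; the class name `PrefixKVCache` and its interface are invented for the example): a cache keyed by token prefixes, so a follow-up call that shares a prefix with an earlier call can skip re-encoding it.

```python
from typing import Any, List, Tuple

# Hypothetical sketch, not LMQL's actual design: a batch-size-1 KV cache
# store keyed by token prefixes. A new request looks up the longest cached
# prefix of its prompt and only needs to encode the remaining tokens.
class PrefixKVCache:
    def __init__(self) -> None:
        self._entries: List[Tuple[Tuple[int, ...], Any]] = []

    def lookup(self, tokens: List[int]) -> Tuple[int, Any]:
        """Return (matched_length, past_key_values) for the longest cached prefix."""
        best_len, best_past = 0, None
        for prefix, past in self._entries:
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_past = n, past
        return best_len, best_past

    def store(self, tokens: List[int], past_key_values: Any) -> None:
        self._entries.append((tuple(tokens), past_key_values))
```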
