[Inference Backend] Enable attention Key/value caching #72
Comments
Hi @lbeurerkellner, thanks for the quick response. I am referring to caching the LLM's key/value attention pairs for sequential variable value generation. So for instance in the above example, first the LLM populates the …
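To make the idea concrete, here is a minimal sketch of key/value cache reuse across sequential generation steps, assuming a HuggingFace `transformers` backend (`gpt2` and the prompt are placeholders, not part of this project):

```python
# Sketch: reuse the attention key/value cache so the shared prompt prefix
# is only encoded once, and later steps only process the newly added tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Name a color and a fruit.\nColor:"
inputs = tokenizer(prompt, return_tensors="pt")

# First step: run the full prompt and keep its key/value cache.
out = model(**inputs, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Next step: feed only the new token; attention over the prefix is served
# from `past` instead of being recomputed from scratch.
out = model(input_ids=next_token, past_key_values=past, use_cache=True)
```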
I see, thanks for the article, this clarifies it for me. I was primed on the wrong abstraction level, not thinking of transformer internals. Yes, this is definitely on the list of things we plan to do. All in all, a much deeper integration on the inference side (beyond just token masking) is possible and something we are working on. HuggingFace already provides a lot of the infrastructure for this, but we are also exploring other options, especially Python-independent inference backends. Multi-part prompts in general should probably be more transparent to the inference side, to enable cross-call optimizations. Happy to also hear your thoughts on what other optimizations might be interesting from a production perspective.
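For reference, the "token masking" level of integration mentioned above roughly corresponds to a per-step logits mask. A hypothetical illustration (the class name and allow-list are made up, this is not the project's actual implementation):

```python
# Illustration of per-step token masking via a transformers LogitsProcessor:
# every token outside the allowed set gets -inf, so it can never be sampled.
import torch
from transformers import LogitsProcessor

class AllowListProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0
        return scores + mask
```

Deeper integration would go beyond this, e.g. by also controlling how the KV cache is built and reused across the parts of a multi-part prompt.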
The PagedAttention mechanism from vLLM may help: https://github.com/vllm-project/vllm/blob/49b26e2cec8c56594668905e853fe4af34336b05/vllm/model_executor/layers/attention.py#L16
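vLLM manages the KV cache internally through PagedAttention, so a caller only sees the high-level generate API. A small usage sketch (model name and prompt are placeholders):

```python
# Sketch: vLLM handles KV cache paging and batching behind llm.generate().
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["Q: Name a color.\nColor:"], params)
print(outputs[0].outputs[0].text)
```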
Any progress on this?
There is a proof-of-concept implementation on a feature branch, but making it work with batching, padding, and multi-part prompting still requires some work. It may be worth factoring out support for non-batched KV caching for now.
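As a rough sketch of what the non-batched case could look like (names such as `CachedDecoder` are hypothetical, and the cache is only reused when the new prompt strictly extends the previous one by at least one token):

```python
# Sketch: single-sequence KV cache reuse across calls with a shared prefix.
import torch

class CachedDecoder:
    def __init__(self, model):
        self.model = model
        self.cached_ids = None   # list of token ids already processed
        self.past = None         # their key/value cache

    @torch.no_grad()
    def step(self, prompt_ids):
        # Reuse the cache only if the new prompt starts with the cached prefix.
        if (self.cached_ids is not None
                and prompt_ids[: len(self.cached_ids)] == self.cached_ids):
            new_ids = prompt_ids[len(self.cached_ids):]
        else:
            new_ids, self.past = prompt_ids, None

        out = self.model(
            input_ids=torch.tensor([new_ids]),
            past_key_values=self.past,
            use_cache=True,
        )
        self.cached_ids, self.past = prompt_ids, out.past_key_values
        return out.logits[:, -1]  # next-token scores
```

Padding and batching are exactly what this simple version sidesteps, which is why factoring it out first seems reasonable.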
Is there a plan to incorporate key/value caching to improve generation efficiency significantly? See e.g. Guidance's acceleration.