KV Cache Quantization #85

Closed
Interpause opened this issue May 9, 2024 · 4 comments

@Interpause

Hey, thanks for your work. I saw https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16/discussions/2 about how 8-bit KV cache quantization can be enabled on vLLM. I am not entirely sure how the KV cache is implemented for AQLM in Transformers, but would KV cache quantization be theoretically possible? It might address some of the concerns about high VRAM usage for context from https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/bringing_2bit_llms_to_production_new_aqlm_models/.

To be specific, would 4-bit cache quantization be possible? Turboderp managed to achieve negligible ppl loss somehow: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md. For reference, turboderp/exllamav2@324404e.

Thanks!

@justheuristic
Collaborator

Hi! I am not an author, but a contributor, and I have some familiarity with the issue.

As you correctly describe, AQLM does not do cache quantization itself; it relies on the standard transformers code.
If you plug in any data-free cache quantization, e.g. 8-bit KV compression, the impact is likely to be the same as using KV quantization with any other model.
As for the 4-bit variant: I have not tried that specific one, but judging by your description and the code, it should be easy to implement with AQLM.

One way to do this is by extending the transformers Cache class, roughly as follows (a sketch follows below):

  1. In __init__, create storage for quantized KVs, similarly to StaticCache.
  2. During update, de-quantize the cached items and return the de-quantized cache to the user.
  3. As a side effect of update, quantize the user's key_states and value_states and write them to your KV storage.

This should give you the expected memory saving, since only one attention layer is de-quantized at a time.
As for speed-ups, this is unlikely to run any faster than 16-bit attention. If you want speed-ups, you will need custom attention kernels that accept KV inputs in 4- or 8-bit precision.
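
Below is a minimal sketch of that approach, assuming the transformers Cache interface in which update(key_states, value_states, layer_idx, cache_kwargs) returns the key/value tensors to attend over. The int8 round-to-nearest scheme and the names Int8KVCache, quantize_int8, and dequantize_int8 are illustrative stand-ins rather than AQLM code, and the exact Cache API differs between transformers versions:

import torch
from transformers.cache_utils import Cache


def quantize_int8(x):
    # Per-token absmax quantization over the head dimension.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.round(x / scale).to(torch.int8), scale


def dequantize_int8(q, scale, dtype):
    return q.to(dtype) * scale.to(dtype)


class Int8KVCache(Cache):
    # Keys/values are stored in int8; only the requested layer is de-quantized in update().
    def __init__(self):
        super().__init__()
        self._keys = []    # per-layer (int8 tensor, fp scale) pairs
        self._values = []

    def get_seq_length(self, layer_idx=0):
        if len(self._keys) <= layer_idx:
            return 0
        return self._keys[layer_idx][0].shape[-2]

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        dtype = key_states.dtype
        # Step 3: quantize the incoming key/value states before storing them.
        qk, sk = quantize_int8(key_states)
        qv, sv = quantize_int8(value_states)
        if len(self._keys) <= layer_idx:
            # Step 1: lazily create the quantized storage for this layer.
            self._keys.append((qk, sk))
            self._values.append((qv, sv))
        else:
            # Append the new tokens along the sequence axis.
            pk, psk = self._keys[layer_idx]
            pv, psv = self._values[layer_idx]
            self._keys[layer_idx] = (torch.cat([pk, qk], dim=-2), torch.cat([psk, sk], dim=-2))
            self._values[layer_idx] = (torch.cat([pv, qv], dim=-2), torch.cat([psv, sv], dim=-2))
        # Step 2: return the de-quantized cache for this layer only.
        qk, sk = self._keys[layer_idx]
        qv, sv = self._values[layer_idx]
        return dequantize_int8(qk, sk, dtype), dequantize_int8(qv, sv, dtype)

The storage grows via torch.cat for brevity; a pre-allocated buffer in the spirit of StaticCache would avoid the repeated copies and matches step 1 more closely.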

Since the NeurIPS deadline is soon, it is unlikely that the paper authors will be able to implement this anytime soon. However, if you try it and share your observations, we would be glad to take a look. In turn, if you run into any issues with AQLM while doing so, please tell us.


github-actions bot commented Jun 9, 2024

This issue is stale because it has been open for 30 days with no activity.

The github-actions bot added the stale label on Jun 9, 2024.
@Interpause
Author

For anyone curious, Hugging Face has done it: https://huggingface.co/blog/kv-cache-quantization

@justheuristic
Collaborator

TWIMC: @Vahe1994 recently also verified that the standard transformers KV cache quantization works with AQLM-quantized models.

He modified this notebook with the following code:

!pip install quanto
out = quantized_model.generate(
    tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(),
    do_sample=False,
    min_new_tokens=512,
    max_new_tokens=512,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)

This is not using AQLM for the KV cache itself, but using AQLM for the weights in combination with the default cache quantization from quanto.
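
For context, a self-contained version of that usage might look like the sketch below. The model identifier is simply the AQLM checkpoint linked in the first comment, and the loading arguments (torch_dtype="auto", device_map="auto", with the aqlm package installed) are assumptions rather than a quote from the notebook:

# pip install aqlm[gpu] quanto transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16"  # AQLM-quantized weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Weights stay AQLM-quantized; the KV cache is quantized to 4 bits by quanto at generation time.
out = quantized_model.generate(
    tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(),
    do_sample=False,
    min_new_tokens=512,
    max_new_tokens=512,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))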
