KV Cache Quantization #85

Closed
Interpause opened this issue May 9, 2024 · 4 comments

@Interpause

Hey, thanks for your work. I saw https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16/discussions/2 about how 8-bit KV cache quantization can be enabled on vLLM. I am not entirely sure how the KV cache is implemented for AQLM in Transformers, but would KV cache quantization be theoretically possible? It might address some of the concerns about high VRAM usage for context from https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/bringing_2bit_llms_to_production_new_aqlm_models/.

To be specific, would 4-bit cache quantization be possible? Turboderp managed to achieve negligible ppl loss somehow: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md. For reference, turboderp/exllamav2@324404e.

Thanks!

@justheuristic
Collaborator

Hi! I am not an author, but a contributor, and I have some familiarity with the issue.

As you correctly describe, AQLM does not do cache quantization itself; it relies on the standard transformers code.
If you plug in any data-free cache quantization, e.g. 8-bit KV compression, the impact is likely to be the same as using KV quantization with any other model.
As for the 4-bit variant: I have not tried that specific one, but judging by your description and the code, it should be easy to implement with AQLM.

One way to do this is by extending the transformers Cache class, roughly as follows (a sketch follows below):

  1. In __init__, create storage for quantized KVs, similarly to StaticCache.
  2. During update, de-quantize the cached items and return the de-quantized cache to the user.
  3. As a side effect of update, quantize the user's key_states and value_states and write them to your KV storage.

This should give you the expected memory saving, since only one attention layer is de-quantized at a time.
As for speed-ups, this is unlikely to run any faster than 16-bit attention. If you want speed-ups, you will need custom attention kernels that accept KV inputs in 4- or 8-bit precision.
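
Below is a minimal sketch of that approach, assuming the transformers Cache interface in which update(key_states, value_states, layer_idx, cache_kwargs) returns the key/value tensors to attend over. The int8 round-to-nearest scheme and the names Int8KVCache, quantize_int8, and dequantize_int8 are illustrative stand-ins rather than AQLM code, and the exact Cache API differs between transformers versions:

import torch
from transformers.cache_utils import Cache


def quantize_int8(x):
    # Per-token absmax quantization over the head dimension.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.round(x / scale).to(torch.int8), scale


def dequantize_int8(q, scale, dtype):
    return q.to(dtype) * scale.to(dtype)


class Int8KVCache(Cache):
    # Keys/values are stored in int8; only the requested layer is de-quantized in update().
    def __init__(self):
        super().__init__()
        self._keys = []    # per-layer (int8 tensor, fp scale) pairs
        self._values = []

    def get_seq_length(self, layer_idx=0):
        if len(self._keys) <= layer_idx:
            return 0
        return self._keys[layer_idx][0].shape[-2]

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        dtype = key_states.dtype
        # Step 3: quantize the incoming key/value states before storing them.
        qk, sk = quantize_int8(key_states)
        qv, sv = quantize_int8(value_states)
        if len(self._keys) <= layer_idx:
            # Step 1: lazily create the quantized storage for this layer.
            self._keys.append((qk, sk))
            self._values.append((qv, sv))
        else:
            # Append the new tokens along the sequence axis.
            pk, psk = self._keys[layer_idx]
            pv, psv = self._values[layer_idx]
            self._keys[layer_idx] = (torch.cat([pk, qk], dim=-2), torch.cat([psk, sk], dim=-2))
            self._values[layer_idx] = (torch.cat([pv, qv], dim=-2), torch.cat([psv, sv], dim=-2))
        # Step 2: return the de-quantized cache for this layer only.
        qk, sk = self._keys[layer_idx]
        qv, sv = self._values[layer_idx]
        return dequantize_int8(qk, sk, dtype), dequantize_int8(qv, sv, dtype)

The storage grows via torch.cat for brevity; a pre-allocated buffer in the spirit of StaticCache would avoid the repeated copies and matches step 1 more closely.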

Since the NeurIPS deadline is soon, it is unlikely that the paper authors will be able to implement this anytime soon. However, if you try it and share your observations, we would be glad to take a look. In turn, if you run into any issues with AQLM while doing so, please tell us.


github-actions bot commented Jun 9, 2024

This issue is stale because it has been open for 30 days with no activity.

The github-actions bot added the stale label on Jun 9, 2024.
@Interpause
Author

For anyone curious, Hugging Face has done it: https://huggingface.co/blog/kv-cache-quantization

@justheuristic
Collaborator

TWIMC: @Vahe1994 recently also verified that the standard transformers KV cache quantization works with AQLM-quantized models.

He modified this notebook with the following code:

!pip install quanto
out = quantized_model.generate(
    tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(),
    do_sample=False,
    min_new_tokens=512,
    max_new_tokens=512,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)

This is not using AQLM for the KV cache itself, but using AQLM for the weights in combination with the default cache quantization from quanto.
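
For context, a self-contained version of that usage might look like the sketch below. The model identifier is simply the AQLM checkpoint linked in the first comment, and the loading arguments (torch_dtype="auto", device_map="auto", with the aqlm package installed) are assumptions rather than a quote from the notebook:

# pip install aqlm[gpu] quanto transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16"  # AQLM-quantized weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Weights stay AQLM-quantized; the KV cache is quantized to 4 bits by quanto at generation time.
out = quantized_model.generate(
    tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(),
    do_sample=False,
    min_new_tokens=512,
    max_new_tokens=512,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))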
