How to use low-bit KV Cache in flashinfer? #125

zhaoyang-star · 2024-02-18T01:10:21Z

From the blog, I noticed that FlashInfer implements low-precision attention kernels that achieve a speedup nearly linear in the compression ratio (~4x for 4-bit, ~2x for 8-bit). This feature is great, and I would like to try it, but there is no demo or toy code showing how to use it. Could you please share more details?
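For a rough sense of where those ratios come from, here is a back-of-the-envelope sketch of KV cache size at different bit widths. It assumes decode attention is dominated by reading the KV cache, so fewer bytes translate fairly directly into less time. The model shape is hypothetical, and the storage for quantization scales and zero-points is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to store K and V for one sequence (scales/zero-points ignored)."""
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return num_values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(bits_per_value=16, **shape)
for bits in (8, 4):
    low = kv_cache_bytes(bits_per_value=bits, **shape)
    print(f"{bits}-bit: {low / 2**20:.0f} MiB, {fp16 / low:.0f}x smaller than fp16")
```

In practice the achievable speedup is somewhat below these ratios, since dequantization and scale metadata add overhead.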

yzh119 · 2024-02-18T03:58:20Z

I haven't exposed low-bit KV-Cache in the PyTorch APIs yet (it is available in the C++ APIs). I will do it tomorrow :)

zhaoyang-star · 2024-02-18T12:20:11Z

Glad to hear that! I cannot wait to try it out. I think quantizing the KV Cache from float16/bfloat16 to 4 bits will need calibration. It would be better if the feature were released with a demo and benchmark results (latency, throughput, or accuracy).

BTW, someone is already trying to port flashinfer to vLLM (see #2772) to speed up the decode phase. I also ported FlashAttention to vLLM (see #2744) and plan to benchmark FA and flashinfer in the vLLM framework.

yzh119 · 2024-02-18T21:35:44Z

Thanks for letting me know. It's interesting to see that FlashAttention has started supporting paged kv-cache.

yzh119 · 2024-02-18T21:37:01Z

> It would be better if the feature were released with a demo and benchmark results (latency, throughput, or accuracy).

You can check our manuscript: Atom: Low-bit Quantization for Efficient and Accurate LLM Serving.

yzh119 · 2024-03-05T15:07:29Z

PyTorch APIs for fp8 kv-cache are exposed in #156.

I'm finalizing the int4/int8 fused-dequant attention kernels with some optimizations such as fast int4/int8-to-float16 conversions. I expect to merge these changes by this Thursday.
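To illustrate what a fused-dequant kernel has to undo when it loads the cache, here is a minimal pure-Python sketch of int4 packing and dequantization. It assumes symmetric quantization with one float scale per group and low-nibble-first packing. These choices and the function names are illustrative assumptions, not the layout used by FlashInfer or Atom.

```python
def quantize_int4(values):
    """Symmetric quantization of one group to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    assert len(q) % 2 == 0
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_dequantize(packed, scale):
    """Unpack nibbles, sign-extend them, and rescale to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out


group = [0.7, -0.3, 0.1, -0.7]  # toy values standing in for one K/V group
q, scale = quantize_int4(group)
packed = pack_int4(q)           # 4 values stored in 2 bytes
restored = unpack_dequantize(packed, scale)
```

A GPU kernel would do the unpack and rescale step in registers right before the attention dot products, so the full-precision values never need to be written back to memory.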

zhyncs · 2024-03-28T12:03:24Z

> PyTorch APIs for fp8 kv-cache are exposed in #156.
>
> I'm finalizing the int4/int8 fused-dequant attention kernels with some optimizations such as fast int4/int8-to-float16 conversions. I expect to merge these changes by this Thursday.

Hi @yzh119, as mentioned in https://flashinfer.ai/2024/02/02/introduce-flashinfer.html:

> Our next release will include the 4-bit fused dequantize+attention operators proposed in Atom and LoRA operators used in Punica.

When is Atom quantization expected to be fully integrated into FlashInfer? Is there a detailed timeline available? Thanks.

yzh119 added the enhancement New feature or request label Feb 18, 2024

yzh119 mentioned this issue Feb 27, 2024

[Roadmap] 0.0.3 Release Checklist #138

Closed

3 tasks

yzh119 mentioned this issue Mar 5, 2024

feat: pytorch api of fp8 kv-cache #156

Merged

yzh119 added a commit that referenced this issue Mar 5, 2024

feat: pytorch api of fp8 kv-cache (#156)

66ee066

requested in #150 #155 #125

yzh119 mentioned this issue Mar 5, 2024

quant support #150

Closed
