v1.2.0
What's New
- vLLM upstream PR: Native `--kv-cache-dtype tq4` submitted to vllm-project/vllm as PR #39008 (3.76x compression vs FP16 with 4-bit nibble packing)
- README: Added vLLM upstream integration section
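Packing two 4-bit values per byte caps the saving at 4x vs FP16, so some per-vector metadata presumably explains why the quoted figure is 3.76x. The sketch below is a rough sanity check, not the PR's actual layout. It assumes a head dimension of 128 and one FP32 scale stored per token-head. Neither is stated in these notes, and `bytes_tq4` and `compression_ratio` are hypothetical helpers.

```python
# Hypothetical sanity check of the quoted compression ratio.
# ASSUMPTIONS (not stated in the release notes): one FP32 scale (4 bytes)
# is stored per token-head next to the packed 4-bit values.

def bytes_fp16(head_dim: int) -> int:
    """Bytes for one token-head vector stored in FP16 (2 bytes per element)."""
    return head_dim * 2

def bytes_tq4(head_dim: int, scale_bytes: int = 4) -> int:
    """Bytes for one token-head vector with two 4-bit values per uint8,
    plus a per-token-head scale (hypothetical layout)."""
    packed = (head_dim + 1) // 2  # two int4 values share one byte
    return packed + scale_bytes

def compression_ratio(head_dim: int, scale_bytes: int = 4) -> float:
    """FP16 size divided by the assumed 4-bit packed size."""
    return bytes_fp16(head_dim) / bytes_tq4(head_dim, scale_bytes)

if __name__ == "__main__":
    for d in (64, 128, 256):
        print(d, round(compression_ratio(d), 2))
```

Under these assumptions, a 128-wide head costs 256 bytes in FP16 and 64 + 4 = 68 bytes packed, which gives 256 / 68 ≈ 3.76. Larger heads amortize the scale better and move closer to the 4x ceiling.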
vLLM PR Highlights
The upstream PR adds:
- `--kv-cache-dtype tq4` CLI option
- Triton kernel for 4-bit nibble packing (two int4 values per uint8)
- Rotation pre-processing for near-optimal quantization MSE
- Per-token-head dynamic scaling
- 22 unit tests
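The three numerical pieces listed above (rotation, per-token-head dynamic scaling, and two-int4-per-uint8 packing) can be illustrated with a pure-Python round trip on a single token-head vector. This is a minimal sketch, not the PR's Triton kernel. The notes do not name the rotation, so a normalized Walsh-Hadamard transform is swapped in here because it is orthogonal and self-inverse. The symmetric int4 range, the offset-by-8 nibble encoding, the low-nibble-first byte order, and the helper names `quantize_pack` and `unpack_dequantize` are all assumptions made for illustration.

```python
import math

def hadamard(x):
    """Normalized fast Walsh-Hadamard transform (stand-in rotation).
    len(x) must be a power of two; the transform is orthogonal and
    self-inverse, so applying it twice returns the input."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]

def quantize_pack(vec):
    """Rotate, scale by this token-head's max |value|, and pack to 4 bits."""
    rot = hadamard(vec)
    amax = max(abs(v) for v in rot)
    scale = amax / 7.0 if amax > 0 else 1.0  # dynamic per-token-head scale
    # Symmetric int4 in [-8, 7], stored as unsigned codes 0..15 (offset 8).
    codes = [max(-8, min(7, round(v / scale))) + 8 for v in rot]
    if len(codes) % 2:
        codes.append(8)  # pad with the code for zero (only when len(vec) == 1)
    # Two nibbles per byte: even element in the low nibble, odd in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale

def unpack_dequantize(packed, scale, n):
    """Unpack nibbles, rescale, and undo the rotation."""
    codes = []
    for b in packed:
        codes.append(b & 0x0F)  # low nibble
        codes.append(b >> 4)    # high nibble
    rot = [(c - 8) * scale for c in codes[:n]]
    return hadamard(rot)  # the normalized transform is its own inverse

if __name__ == "__main__":
    vec = [0.5, -1.25, 3.0, 0.0, 2.5, -0.75, 1.0, -2.0]
    packed, scale = quantize_pack(vec)
    approx = unpack_dequantize(packed, scale, len(vec))
    print(len(packed), "bytes for", len(vec), "values")
    print(max(abs(a - b) for a, b in zip(vec, approx)))
```

Because the rotation is orthogonal, each rotated coordinate is off by at most half a quantization step. The reconstruction error in the original basis is therefore bounded by `sqrt(n) * scale / 2`. Rotating first is commonly motivated by spreading outlier channels across all coordinates, so a max-based scale wastes less of the 16-level range.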
Until the PR merges, use `aither-kvcache[vllm]` for hook/plugin integration.