v1.2.0

wizzense released this 05 Apr 03:43

3010510

What's New

vLLM upstream PR: native --kv-cache-dtype tq4 support submitted to vllm-project/vllm as PR #39008. It gives 3.76x KV cache compression vs FP16 using 4-bit nibble packing
README: Added vLLM upstream integration section
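
As a rough sanity check on the 3.76x figure, the sketch below computes the compression ratio for one (token, head) vector. The head dimension of 128 and the 32 bits of per-vector scaling metadata are assumptions chosen for illustration, not values taken from the PR; they happen to reproduce the quoted number, and the PR may account for overhead differently.

```python
def kv_compression_ratio(head_dim: int,
                         code_bits: int = 4,
                         metadata_bits: int = 32,
                         baseline_bits: int = 16) -> float:
    """Ratio of FP16 storage to 4-bit storage for one (token, head) vector.

    head_dim and metadata_bits are illustrative assumptions, not PR values.
    """
    baseline = head_dim * baseline_bits                  # FP16: 16 bits per element
    compressed = head_dim * code_bits + metadata_bits    # 4-bit codes + scale metadata
    return baseline / compressed


print(round(kv_compression_ratio(128), 2))  # 2048 / 544 -> 3.76
```

Without any metadata the ratio would be exactly 4x; the per-vector scale is what pulls it slightly below that.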

vLLM PR Highlights

The upstream PR adds:

--kv-cache-dtype tq4 CLI option
Triton kernel for 4-bit nibble packing (two int4 per uint8)
Rotation pre-processing for near-optimal MSE
Per-token-head dynamic scaling
22 unit tests
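
The packing and scaling items above can be illustrated with a small NumPy sketch. This is not the PR's Triton kernel: the function names are hypothetical, uniform symmetric int4 quantization is an assumption (the PR's actual quantizer may differ), and the rotation step is left out. It only shows how one scale per (token, head) row and two 4-bit codes per uint8 fit together.

```python
import numpy as np


def quantize_pack_int4(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantize rows of x to int4 and pack two codes per uint8.

    x has shape (rows, head_dim) with even head_dim; each row stands for
    one (token, head) vector. Hypothetical helper, not the vLLM API.
    """
    # Dynamic per-row scale: map the largest magnitude in each row to 7.
    absmax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -8, 7).astype(np.int8)
    # Shift to unsigned 0..15 so each code fits in one nibble.
    u = (q + 8).astype(np.uint8)
    # Even-indexed elements go in the low nibble, odd-indexed in the high nibble.
    packed = (u[..., 0::2] | (u[..., 1::2] << 4)).astype(np.uint8)
    return packed, scale


def unpack_dequantize_int4(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Invert quantize_pack_int4, returning a float32 approximation of x."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    q[..., 0::2] = lo
    q[..., 1::2] = hi
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 128)).astype(np.float32)
packed, scale = quantize_pack_int4(x)
x_hat = unpack_dequantize_int4(packed, scale)
print(packed.shape, packed.dtype)  # half as many bytes as there are elements
```

With this scheme the reconstruction error of each element is bounded by half the row's scale, which is why a per-row scale that tracks the row's own range matters.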

Until the PR merges, use aither-kvcache[vllm] for hook/plugin integration.
