v1.2.0

@wizzense wizzense released this 05 Apr 03:43
· 10 commits to main since this release

What's New

  • vLLM upstream PR: Native --kv-cache-dtype tq4 support submitted to vllm-project/vllm as PR #39008. It compresses the KV cache 3.76x relative to FP16 using 4-bit nibble packing
  • README: Added vLLM upstream integration section
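
The 3.76x figure sits slightly below the ideal 4x that a 16-bit to 4-bit reduction would give, which is consistent with some per-vector metadata being stored alongside the packed values. As a rough check, the sketch below computes the ratio under two assumptions that are not stated in these notes: a head dimension of 128 and 4 bytes of metadata per token-head. The function name and both parameters are hypothetical.

```python
def compression_ratio(head_dim: int, meta_bytes: int) -> float:
    """Ratio of FP16 storage to packed 4-bit storage for one (token, head) vector.

    head_dim and meta_bytes are illustrative assumptions, not values taken
    from the release notes or the PR.
    """
    fp16_bytes = head_dim * 2                   # 2 bytes per FP16 element
    packed_bytes = head_dim // 2 + meta_bytes   # two 4-bit codes per byte, plus metadata
    return fp16_bytes / packed_bytes


print(round(compression_ratio(128, 4), 2))  # 3.76
print(compression_ratio(128, 0))            # 4.0 (no metadata overhead)
```

Other combinations of head size and metadata layout may produce the same figure; the PR itself is the authoritative source for the actual layout.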

vLLM PR Highlights

The upstream PR adds:

  • --kv-cache-dtype tq4 CLI option
  • Triton kernel for 4-bit nibble packing (two int4 per uint8)
  • Rotation pre-processing for near-optimal MSE
  • Per-token-head dynamic scaling
  • 22 unit tests
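
To illustrate the nibble packing and per-token-head scaling listed above, here is a minimal pure-Python reference sketch. It is not the PR's Triton kernel: the function names, the symmetric signed int4 range, the low-nibble-first byte order, and the max-abs scale rule are all assumptions made for illustration, and the rotation pre-processing step is omitted.

```python
def quantize_row(values):
    """Quantize one (token, head) vector to signed 4-bit codes.

    Assumption: symmetric quantization with a single dynamic scale
    derived from the vector's largest absolute value.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_nibbles(codes):
    """Pack pairs of signed int4 codes into bytes (assumed order: low nibble first)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Recover signed int4 codes from packed bytes."""
    codes = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return codes


def dequantize_row(packed, scale):
    """Approximate the original vector from packed codes and its scale."""
    return [c * scale for c in unpack_nibbles(packed)]


# Usage: one vector in, half as many bytes out, plus one scale.
row = [1.0, -0.4, 0.25, 0.0]
codes, scale = quantize_row(row)
packed = pack_nibbles(codes)
restored = dequantize_row(packed, scale)
```

Computing the scale per token and per head keeps a single outlier from inflating the quantization step for unrelated vectors, at the cost of storing one scale value with each packed vector.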

Until the PR merges, use the aither-kvcache[vllm] extra for hook/plugin integration with vLLM.