v1.2.1

@wizzense wizzense released this 05 Apr 13:12
· 9 commits to main since this release

Bug Fixes

  • Custom op aliasing error: tq::decode_step and tq::hybrid_decode_step now return output.clone(), satisfying PyTorch's custom op aliasing requirements (a returned tensor must not alias an input) on vLLM 0.15.1+
  • Causal mask for chunked prefill: An explicit causal mask is now built when q_len < kv_len. The previous path used is_causal=False, which let queries attend to future tokens
  • Prefill path: Prefill attention now goes through the original vLLM forward, which handles chunked prefill correctly; TQ encode runs alongside it
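
To illustrate the causal mask fix, here is a minimal pure-Python sketch of a mask for the q_len < kv_len case. It is not the project's implementation: the function name chunked_causal_mask is hypothetical, and it assumes the query chunk covers the last q_len positions of the kv_len cached tokens.

```python
def chunked_causal_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """Return a q_len x kv_len mask; True means the query may attend to that key.

    Assumption (not from the release notes): the query chunk holds the last
    q_len positions of the sequence, so query row q sits at absolute
    position (kv_len - q_len) + q.
    """
    if q_len > kv_len:
        raise ValueError("q_len must not exceed kv_len")
    offset = kv_len - q_len  # absolute position of the first query in the chunk
    # A query may see every key at or before its own absolute position.
    return [[k <= offset + q for k in range(kv_len)] for q in range(q_len)]


mask = chunked_causal_mask(q_len=2, kv_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# xxxx.
# xxxxx
```

With no mask at all, the first query row would also see the final key, which is the future token leakage described above. In PyTorch, a boolean mask of this shape can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention, where True marks positions that take part in attention.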

Shadow Mode (verified working)

AITHER_TQ_MODE=shadow now produces correct output on stock vLLM 0.15.1:

pip install "aither-kvcache[vllm]"
export AITHER_TQ_MODE=shadow AITHER_TQ_BITS=4
# Use sitecustomize hook for auto-patching
vllm serve your-model --attention-backend TRITON_ATTN

Tested on Qwen2.5-1.5B-Instruct; output matches the baseline.

Primary Mode (experimental)

AITHER_TQ_MODE=tq4-primary (3.8x compression) requires a custom attention backend to handle the uint8 packed cache during chunked prefill. This is what vLLM PR #38479 provides natively. Primary mode via hooks remains experimental until that PR merges or we build a standalone custom backend.
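
For readers unfamiliar with a uint8 packed cache, the sketch below shows one common way to store two 4-bit codes per byte. It is only an illustration: the helper names pack_nibbles and unpack_nibbles are hypothetical, and the low-nibble-first layout is an assumption, not necessarily the layout aither-kvcache or the vLLM PR uses.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, first code in the low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed: bytes) -> list[int]:
    """Inverse of pack_nibbles: split each byte back into two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble was the first code
        codes.append(byte >> 4)    # high nibble was the second code
    return codes


codes = [1, 15, 0, 7]
packed = pack_nibbles(codes)
assert len(packed) == len(codes) // 2
assert unpack_nibbles(packed) == codes
```

A stock attention backend reads the cache as ordinary floating-point tensors, so it cannot interpret bytes laid out like this without an unpack and dequantize step, which is why primary mode depends on a custom backend.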