v1.2.1
Bug Fixes
- Custom op aliasing error: tq::decode_step and tq::hybrid_decode_step now return output.clone() to satisfy torch's custom op aliasing requirements on vLLM 0.15.1+
- Causal mask for chunked prefill: Builds an explicit causal mask when q_len < kv_len instead of using is_causal=False, which allowed future token leakage
- Prefill path: Uses the original vLLM forward for prefill attention (handles chunked prefill correctly); TQ encode runs alongside
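To illustrate the causal mask fix, here is a minimal pure-Python sketch of an explicit mask for the q_len < kv_len case. It assumes the query chunk covers the last q_len positions of the kv_len-long sequence; the function name chunked_causal_mask is hypothetical and is not taken from the package.

```python
def chunked_causal_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """Return a q_len x kv_len mask; True means the query may attend to the key.

    Assumption: the q_len queries are the last q_len positions of the
    kv_len-long sequence (the usual chunked-prefill layout).
    """
    if not 0 < q_len <= kv_len:
        raise ValueError("expected 0 < q_len <= kv_len")
    offset = kv_len - q_len  # absolute position of the first query token
    return [[k <= offset + q for k in range(kv_len)] for q in range(q_len)]


# Two new query tokens attending over five cached positions.
mask = chunked_causal_mask(q_len=2, kv_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxxx.
# xxxxx
```

In PyTorch, a boolean mask of this shape can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention with is_causal left at False; True entries mark positions that take part in attention.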
Shadow Mode (verified working)
AITHER_TQ_MODE=shadow now produces correct output on stock vLLM 0.15.1:
pip install aither-kvcache[vllm]
export AITHER_TQ_MODE=shadow AITHER_TQ_BITS=4
# Use sitecustomize hook for auto-patching
vllm serve your-model --attention-backend TRITON_ATTN
Tested on Qwen2.5-1.5B-Instruct: output matches baseline.
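One way to run a "matches baseline" check yourself is to compare the token IDs from a baseline run and a shadow-mode run of the same prompt under greedy decoding. The sketch below is only an illustration: first_divergence and the sample token IDs are hypothetical, and these notes do not say how the original test was performed.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    if len(baseline) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(baseline), len(candidate))
    return None


# Hypothetical token IDs from two greedy-decoding runs of the same prompt.
baseline_ids = [101, 7, 42, 9]
shadow_ids = [101, 7, 42, 9]
print(first_divergence(baseline_ids, shadow_ids))  # None
```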
Primary Mode (experimental)
AITHER_TQ_MODE=tq4-primary (3.8x compression) requires a custom attention backend to handle the uint8 packed cache during chunked prefill. This is what vLLM PR #38479 provides natively. Primary mode via hooks remains experimental until that PR merges or we build a standalone custom backend.
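For readers unfamiliar with a uint8 packed cache, the sketch below stores two 4-bit codes per byte. The low-nibble-first layout and the names pack_4bit and unpack_4bit are assumptions for illustration; the actual layout used by aither-kvcache or the vLLM PR is not described in these notes.

```python
def pack_4bit(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    padded = codes + [0] * (len(codes) % 2)  # pad odd lengths with a zero code
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_4bit(packed: bytes, count: int) -> list[int]:
    """Inverse of pack_4bit; count drops the padding code, if any."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble holds the first code
        codes.append(byte >> 4)    # high nibble holds the second code
    return codes[:count]


codes = [1, 15, 0, 7, 9]
packed = pack_4bit(codes)
print(len(packed))                      # 3
print(unpack_4bit(packed, len(codes)))  # [1, 15, 0, 7, 9]
```

Packing 16-bit values down to 4 bits would give exactly 4x. The quoted 3.8x is slightly lower, which would be consistent with some per-block metadata such as scales, though the notes do not say where the difference comes from.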