Release v1.2.1 · Aitherium/aitherkvcache

Bug Fixes

Custom op aliasing error: tq::decode_step and tq::hybrid_decode_step now return output.clone() to satisfy PyTorch's custom op aliasing requirements on vLLM 0.15.1+.
Causal mask for chunked prefill: an explicit causal mask is now built when q_len < kv_len. The previous code passed is_causal=False, which let queries attend to future tokens.
Prefill path: prefill attention now goes through the original vLLM forward, which handles chunked prefill correctly, while the TQ encode runs alongside it.
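
To illustrate the causal mask fix, here is a minimal sketch in plain Python of the kind of mask needed when q_len < kv_len. It is not the code shipped in this release. It assumes the chunk's queries are the last q_len positions of the key/value sequence, and the helper name chunked_causal_mask is hypothetical. True means "may attend", which matches the convention for a boolean attn_mask in torch.nn.functional.scaled_dot_product_attention.

```python
def chunked_causal_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """Return a q_len x kv_len mask; True means the query may attend to the key.

    Assumption: the queries are the LAST q_len positions of the kv sequence,
    so query row i sits at absolute position (kv_len - q_len + i).
    """
    if q_len > kv_len:
        raise ValueError("q_len must not exceed kv_len")
    offset = kv_len - q_len
    # Key j is visible to query i only if it is not in the query's future.
    return [[j <= offset + i for j in range(kv_len)] for i in range(q_len)]


mask = chunked_causal_mask(q_len=2, kv_len=5)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 11110
# 11111
```

With no mask at all, the first query in this example would also see key 4, which lies one position ahead of it. That is the leakage described above.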

Shadow Mode (verified working)

AITHER_TQ_MODE=shadow now produces correct output on stock vLLM 0.15.1:

pip install aither-kvcache[vllm]
export AITHER_TQ_MODE=shadow AITHER_TQ_BITS=4
# Use sitecustomize hook for auto-patching
vllm serve your-model --attention-backend TRITON_ATTN

Tested on Qwen2.5-1.5B-Instruct — output matches baseline.

Primary Mode (experimental)

AITHER_TQ_MODE=tq4-primary (3.8x compression) requires a custom attention backend to handle the uint8 packed cache during chunked prefill. This is what vLLM PR #38479 provides natively. Primary mode via hooks remains experimental until that PR merges or we build a standalone custom backend.
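
For readers unfamiliar with packed caches, the sketch below shows one common way to store two 4-bit codes in a single uint8 byte. The actual layout used by aither-kvcache is not described in these notes, so the nibble order and the helper names pack_nibbles and unpack_nibbles are assumptions made for illustration. The point is that an attention kernel reading such a buffer has to unpack it first, which a stock backend does not do.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack pairs of 4-bit codes into bytes (first code in the low nibble)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits (0..15)")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed: bytes) -> list[int]:
    """Inverse of pack_nibbles: recover the 4-bit codes in original order."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble holds the first code
        codes.append(byte >> 4)    # high nibble holds the second code
    return codes


codes = [1, 2, 15, 0]
packed = pack_nibbles(codes)
assert unpack_nibbles(packed) == codes
print(len(codes), "codes stored in", len(packed), "bytes")
```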
