v0.4.0 — Phase 2: TritonAttentionImpl subclass + TQGPUCache
What's New
Phase 2 vLLM backend — complete rewrite of the attention implementation.
Backend (turboquant.vllm.backend)
- TurboQuantImpl now subclasses TritonAttentionImpl: the standard attention path is identical to TRITON_ATTN (zero risk)
- TQ encode runs after each forward pass, never blocks attention
- Lazy class creation via module __getattr__: zero circular imports during sitecustomize
- accept_output_buffer = True for proper vLLM output pre-allocation
- supported_kv_cache_dtypes includes fp8_e4m3 / fp8_e5m2
- get_kv_cache_shape() implemented for V1 engine
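
The lazy-creation and encode-after-forward points can be illustrated with a minimal sketch. Everything named here is hypothetical (BaseAttentionImpl, LazyQuantImpl, demo_backend); it stands in for the real classes and is not the code shipped in turboquant.vllm.backend. It only shows how a module-level __getattr__ can defer building a subclass until first access, and how that subclass can leave the base forward path untouched while doing extra work afterwards.

```python
import types


class BaseAttentionImpl:
    """Hypothetical stand-in for the real base class."""

    def forward(self, query):
        return [2 * q for q in query]


def _build_impl_class():
    # The base class is only needed here, at first access, not at import time.
    class LazyQuantImpl(BaseAttentionImpl):
        def __init__(self):
            self.encoded = []

        def forward(self, query):
            out = super().forward(query)       # unchanged base attention path
            self.encoded.append(tuple(query))  # placeholder for a post-forward encode
            return out

    return LazyQuantImpl


# A throwaway module object so the example runs as a single file.
backend = types.ModuleType("demo_backend")


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "LazyQuantImpl":
        cls = _build_impl_class()
        setattr(backend, name, cls)  # cache: later lookups bypass __getattr__
        return cls
    raise AttributeError(f"module {backend.__name__!r} has no attribute {name!r}")


backend.__getattr__ = _module_getattr

impl = backend.LazyQuantImpl()
print(impl.forward([1, 2]))  # [2, 4]
```

In a real package the __getattr__ function would be defined at the top level of the module file itself; the types.ModuleType object is used here only to keep the sketch self-contained.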
GPU Cache (turboquant.vllm.cache)
- TQGPUCache: GPU-resident TQ-compressed KV storage with DDR5 cold tier
- spill_blocks() / warm_blocks() for VRAM ↔ DDR5 transfer
- WSL2-safe pin_memory detection (avoids segfault in Docker on WSL2)
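
As a rough illustration of the two-tier idea, the sketch below tracks blocks in a hot and a cold dictionary and moves them between tiers. It is not the TQGPUCache implementation: TwoTierBlockCache and looks_like_wsl are hypothetical names, the payloads are plain bytes rather than GPU tensors, and the WSL check (looking for "microsoft" in the kernel release string) is a common heuristic assumed here, not necessarily the detection the release uses.

```python
import platform


def looks_like_wsl(release=None):
    """Heuristic (assumption): WSL kernels report 'microsoft' in their release string."""
    release = platform.uname().release if release is None else release
    return "microsoft" in release.lower()


class TwoTierBlockCache:
    """Hypothetical bookkeeping for a hot (VRAM) tier and a cold (host RAM) tier."""

    def __init__(self, pin_memory=None):
        # Only request pinned host memory when the platform looks safe for it.
        self.pin_memory = (not looks_like_wsl()) if pin_memory is None else pin_memory
        self.hot = {}   # block_id -> payload, stands in for GPU-resident blocks
        self.cold = {}  # block_id -> payload, stands in for host-memory blocks

    def put(self, block_id, payload):
        self.hot[block_id] = payload

    def spill_blocks(self, block_ids):
        """Move the given hot blocks to the cold tier; return how many moved."""
        moved = 0
        for block_id in block_ids:
            if block_id in self.hot:
                self.cold[block_id] = self.hot.pop(block_id)
                moved += 1
        return moved

    def warm_blocks(self, block_ids):
        """Move the given cold blocks back to the hot tier; return how many moved."""
        moved = 0
        for block_id in block_ids:
            if block_id in self.cold:
                self.hot[block_id] = self.cold.pop(block_id)
                moved += 1
        return moved

    def get(self, block_id):
        # Raises KeyError if the block is cold or unknown; callers warm it first.
        return self.hot[block_id]
```

A real cache would copy tensors between devices instead of moving dictionary entries, and would use the pin_memory flag when allocating the host-side buffers.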
Activation
vllm serve model --attention-backend CUSTOM --enforce-eager
Env vars: AITHER_TQ_BITS=4, AITHER_TQ_FUSED=1
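
For scripted launches, the same command and environment can be assembled in Python. This is a sketch only: build_launch is a hypothetical helper, the flags and variable names are copied from the lines above, and writing "0" to turn AITHER_TQ_FUSED off is an assumption that these notes do not confirm.

```python
import os
import shlex


def build_launch(model, bits=4, fused=True, base_env=None):
    """Return (argv, env) for the activation command described above."""
    env = dict(os.environ if base_env is None else base_env)
    env["AITHER_TQ_BITS"] = str(bits)
    env["AITHER_TQ_FUSED"] = "1" if fused else "0"  # "0" for off is an assumption
    cmd = ["vllm", "serve", model, "--attention-backend", "CUSTOM", "--enforce-eager"]
    return cmd, env


cmd, env = build_launch("model", base_env={})
print(shlex.join(cmd))
# To actually start the server, inherit the real environment instead:
#   cmd, env = build_launch("model")
#   subprocess.run(cmd, env=env, check=True)
```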
Note
--enforce-eager is required — vLLM's torch.compile pipeline doesn't support @torch.compiler.disable graph breaks in custom attention impls yet. The Triton attention kernels themselves are still fully compiled as custom ops.