
v0.4.0 — Phase 2: TritonAttentionImpl subclass + TQGPUCache

wizzense released this 28 Mar 07:13 · 29 commits to main since this release

What's New

Phase 2 of the vLLM backend: a complete rewrite of the attention implementation.

Backend (turboquant.vllm.backend)

  • TurboQuantImpl now subclasses TritonAttentionImpl, so the standard attention path is identical to TRITON_ATTN and adds no risk to it
  • TQ encode runs after each forward pass and never blocks attention
  • Lazy class creation via module __getattr__ avoids circular imports during sitecustomize
  • accept_output_buffer = True for proper vLLM output pre-allocation
  • supported_kv_cache_dtypes includes fp8_e4m3/fp8_e5m2
  • get_kv_cache_shape() implemented for the V1 engine

GPU Cache (turboquant.vllm.cache)

  • TQGPUCache: GPU-resident TQ-compressed KV storage with DDR5 cold tier
  • spill_blocks() / warm_blocks() for VRAM ↔ DDR5 transfer
  • WSL2-safe pin_memory detection (avoids segfault in Docker on WSL2)
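
One way a WSL2-safe pin_memory check could work is to inspect the kernel release string, which containers on WSL2 share with the host. This is a hedged sketch under that assumption; the function names and the "microsoft" substring heuristic are illustrative and are not taken from the release.

```python
from pathlib import Path
from typing import Optional


def running_under_wsl(kernel_release: Optional[str] = None) -> bool:
    """Return True if the kernel release string looks like a WSL kernel.

    Assumption: WSL kernels report a release containing "microsoft",
    for example "5.15.90.1-microsoft-standard-WSL2".
    """
    if kernel_release is None:
        try:
            kernel_release = Path("/proc/sys/kernel/osrelease").read_text()
        except OSError:
            # Not Linux, or /proc is unavailable: treat as not WSL.
            return False
    return "microsoft" in kernel_release.lower()


def should_pin_memory(kernel_release: Optional[str] = None) -> bool:
    # Hypothetical policy: skip pinned host memory whenever WSL is detected.
    return not running_under_wsl(kernel_release)
```

The result would then be passed wherever host buffers are allocated, for example as the pin_memory argument when creating CPU tensors.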

Activation

vllm serve model --attention-backend CUSTOM --enforce-eager

Env vars: AITHER_TQ_BITS=4, AITHER_TQ_FUSED=1
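
For scripted launches, the command and environment variables above can be assembled in Python. This is a sketch only: the helper name build_launch_spec is hypothetical, and using "0" to turn off AITHER_TQ_FUSED is an assumption, since the release only documents the value 1.

```python
import os
from typing import Dict, List, Tuple


def build_launch_spec(
    model: str, bits: int = 4, fused: bool = True
) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and environment for the activation command in these notes."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["AITHER_TQ_BITS"] = str(bits)
    # Assumption: "0" disables the fused path; only "1" appears in the notes.
    env["AITHER_TQ_FUSED"] = "1" if fused else "0"
    argv = [
        "vllm", "serve", model,
        "--attention-backend", "CUSTOM",
        "--enforce-eager",
    ]
    return argv, env
```

The returned pair can be handed to `subprocess.run(argv, env=env, check=True)` to start the server.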

Note

--enforce-eager is required because vLLM's torch.compile pipeline does not yet support @torch.compiler.disable graph breaks in custom attention implementations. The Triton attention kernels themselves are still fully compiled as custom ops.