Release v0.4.0 — Phase 2: TritonAttentionImpl subclass + TQGPUCache · Aitherium/aitherkvcache

What's New

Phase 2 vLLM backend — complete rewrite of the attention implementation.

Backend (`turboquant.vllm.backend`)

TurboQuantImpl now subclasses TritonAttentionImpl, so the standard attention path is identical to TRITON_ATTN and existing attention behavior is unchanged
TQ encode runs after each forward pass and never blocks attention
The class is created lazily via a module-level __getattr__, which avoids circular imports during sitecustomize
accept_output_buffer = True for proper vLLM output pre-allocation
supported_kv_cache_dtypes includes fp8_e4m3/fp8_e5m2
get_kv_cache_shape() implemented for V1 engine

GPU Cache (`turboquant.vllm.cache`)

TQGPUCache: GPU-resident TQ-compressed KV storage with DDR5 cold tier
spill_blocks() / warm_blocks() for VRAM ↔ DDR5 transfer
WSL2-safe pin_memory detection (avoids segfault in Docker on WSL2)
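
To illustrate the two-tier idea, here is a toy sketch in plain Python. Only the method names `spill_blocks` and `warm_blocks` come from these notes; the class `TieredBlockStore`, its signatures, the use of `bytes` in place of GPU tensors, and the `should_pin_memory` heuristic (checking for "microsoft" in the kernel release string, which WSL kernels report) are assumptions for illustration and may differ from what TQGPUCache actually does.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


def should_pin_memory(kernel_release: str) -> bool:
    """Assumed heuristic: skip pinned memory when the kernel looks like WSL."""
    return "microsoft" not in kernel_release.lower()


@dataclass
class TieredBlockStore:
    """Toy store: 'hot' stands in for VRAM, 'cold' for host DDR5."""

    pin_cold_tier: bool = False
    hot: Dict[int, bytes] = field(default_factory=dict)
    cold: Dict[int, bytes] = field(default_factory=dict)

    def put(self, block_id: int, payload: bytes) -> None:
        # New blocks always land in the hot tier.
        self.hot[block_id] = payload

    def spill_blocks(self, block_ids: Iterable[int]) -> int:
        """Move blocks hot -> cold; returns how many were moved."""
        moved = 0
        for block_id in block_ids:
            if block_id in self.hot:
                self.cold[block_id] = self.hot.pop(block_id)
                moved += 1
        return moved

    def warm_blocks(self, block_ids: Iterable[int]) -> int:
        """Move blocks cold -> hot; returns how many were moved."""
        moved = 0
        for block_id in block_ids:
            if block_id in self.cold:
                self.hot[block_id] = self.cold.pop(block_id)
                moved += 1
        return moved


# Example wiring: decide pinning from a kernel release string.
store = TieredBlockStore(
    pin_cold_tier=should_pin_memory("5.15.153.1-microsoft-standard-WSL2")
)
store.put(0, b"compressed-kv-0")
store.put(1, b"compressed-kv-1")
store.spill_blocks([0])  # block 0 now lives in the cold tier
```

In a real implementation the cold tier would hold host tensors, and pinning them is what makes host-to-device copies fast, which is why the pin decision is made once up front.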

Activation

vllm serve model --attention-backend CUSTOM --enforce-eager

Env vars: AITHER_TQ_BITS=4, AITHER_TQ_FUSED=1
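
For reference, a small sketch of how a process might read these two variables. The variable names come from the line above; everything else is an assumption: the `TQEnvConfig` class, the `read_tq_env` function, the fallback to 4 and 1 when a variable is unset, and the set of strings treated as "true" are all hypothetical and not taken from the package.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TQEnvConfig:
    bits: int    # value of AITHER_TQ_BITS
    fused: bool  # whether AITHER_TQ_FUSED is switched on


def read_tq_env(env: Mapping[str, str] = os.environ) -> TQEnvConfig:
    # Assumed defaults: fall back to the values shown in the release notes.
    raw_bits = env.get("AITHER_TQ_BITS", "4")
    raw_fused = env.get("AITHER_TQ_FUSED", "1")

    try:
        bits = int(raw_bits)
    except ValueError as exc:
        raise ValueError(
            f"AITHER_TQ_BITS must be an integer, got {raw_bits!r}"
        ) from exc
    if bits <= 0:
        raise ValueError(f"AITHER_TQ_BITS must be positive, got {bits}")

    # Assumed convention for truthy flag values.
    fused = raw_fused.strip().lower() in {"1", "true", "yes", "on"}
    return TQEnvConfig(bits=bits, fused=fused)
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check in isolation, for example `read_tq_env({"AITHER_TQ_BITS": "4", "AITHER_TQ_FUSED": "1"})`.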

Note

--enforce-eager is required — vLLM's torch.compile pipeline doesn't support @torch.compiler.disable graph breaks in custom attention impls yet. The Triton attention kernels themselves are still fully compiled as custom ops.
