Release v2.1.0: Heterogeneous Head Dimension Support · Aitherium/aitherkvcache

What's New

Heterogeneous attention head dimensions: models like Gemma 4 26B-A4B, which use different head_dim values across layers (256 for local sliding-window attention, 512 for global attention), now work correctly with TQ KV cache compression.

Changes

Per-head-dim quantizer dict: _tq_quantizers[head_dim] replaces the global _tq_quantizer singleton. Each unique head dimension gets its own TurboQuant instance, created and cached on first encounter.
Page-size alignment: New _align_page_size() rounds page sizes to the next power of 2, ensuring vLLM's unify_kv_cache_spec_page_size() succeeds when layers have different page sizes.
512 head_dim support: Added to get_supported_head_sizes() in the custom backend.
All encode/decode/fused functions now derive the quantizer from the actual head_size parameter instead of a global, making them heterogeneous-safe.

Backward Compatible

Uniform models (single head_dim, e.g. DeepSeek, Nemotron) behave exactly as before; the dict simply holds one entry.

Fixed Bug

NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.

This crash occurred during vLLM startup with kv_cache_dtype=tq-t4nc on Gemma 4. It was caused by uniform-head-dimension assumptions at three separate levels of the code.
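
To see why rounding page sizes to a power of 2 avoids this error, consider the sketch below. It is not the actual implementation: align_page_size and can_unify are hypothetical helpers, can_unify is a simplified model of the divisibility check (an assumption about what vLLM verifies), and the byte counts 132 and 260 are made-up values chosen only to show the failure mode. The sketch also assumes a page size that is already a power of 2 is left unchanged.

```python
from typing import Sequence


def align_page_size(page_size: int) -> int:
    """Round a positive page size up to the nearest power of two."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # (n - 1).bit_length() is the exponent of the smallest power of two >= n.
    return 1 << (page_size - 1).bit_length()


def can_unify(page_sizes: Sequence[int]) -> bool:
    """Simplified check: every page size must divide the largest one."""
    largest = max(page_sizes)
    return all(largest % size == 0 for size in page_sizes)


# Hypothetical per-layer page sizes in bytes, for illustration only.
raw_sizes = [132, 260]
aligned_sizes = [align_page_size(size) for size in raw_sizes]

print(can_unify(raw_sizes))      # False: 260 is not a multiple of 132
print(aligned_sizes)             # [256, 512]
print(can_unify(aligned_sizes))  # True: powers of two always divide larger ones
```

The trade-off is some padding per page in exchange for page sizes that can always be unified across layers.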
