v2.1.0: Heterogeneous Head Dimension Support

@wizzense wizzense released this 15 Apr 00:35
· 2 commits to main since this release

What's New

Heterogeneous attention head dimensions: models like Gemma 4 26B-A4B that use different head_dim values across layers (256 for local/sliding-window attention, 512 for global attention) now work correctly with TurboQuant (TQ) KV cache compression.

Changes

  • Per-head-dim quantizer dict: _tq_quantizers[head_dim] replaces the global _tq_quantizer singleton. Each unique head dimension gets its own TurboQuant instance, created and cached on first encounter.
  • Page-size alignment: New _align_page_size() rounds each page size up to the next power of 2, so that vLLM's unify_kv_cache_spec_page_size() succeeds when layers have different page sizes.
  • 512 head_dim support: Added to get_supported_head_sizes() in the custom backend.
  • All encode/decode/fused functions now derive the quantizer from the actual head_size parameter instead of a global, making them heterogeneous-safe.

Backward Compatible

Uniform models (single head_dim, e.g. DeepSeek, Nemotron) work identically — the dict just has one entry.

Fixed Bug

NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.

This crash occurred during vLLM startup with kv_cache_dtype=tq-t4nc on Gemma 4, caused by three layers of uniform-head assumptions.
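
To see why power-of-2 rounding avoids this error, consider the sketch below. It assumes only the rounding rule stated under Changes; the real _align_page_size() may differ in signature and details, and the byte counts are made-up values chosen for illustration, not measurements from Gemma 4.

```python
def align_page_size(page_size: int) -> int:
    """Round page_size up to the next power of 2 (exact powers are unchanged).

    Illustrative stand-in for the _align_page_size() mentioned in the notes.
    """
    if page_size < 1:
        raise ValueError("page_size must be a positive integer")
    return 1 << (page_size - 1).bit_length()


# Hypothetical per-layer page sizes in bytes for two head dimensions.
local_page = 4608
global_page = 8704

# Unaligned, the larger size is not a multiple of the smaller one.
unaligned_ok = max(local_page, global_page) % min(local_page, global_page) == 0

# After alignment both sizes are powers of 2, so the larger one is always
# a multiple of the smaller one.
aligned_local = align_page_size(local_page)
aligned_global = align_page_size(global_page)
aligned_ok = aligned_global % aligned_local == 0
```

The trade-off in this approach is some padding per page in exchange for guaranteed divisibility between any two layers' page sizes.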