What's New
Heterogeneous attention head dimensions — models like Gemma 4 26B-A4B that use different head_dim values across layers (256 for local/sliding-window, 512 for global attention) now work correctly with TQ KV cache compression.
Changes
- Per-head-dim quantizer dict: `_tq_quantizers[head_dim]` replaces the global `_tq_quantizer` singleton. Each unique head dimension gets its own `TurboQuant` instance, created and cached on first encounter.
- Page-size alignment: New `_align_page_size()` rounds page sizes to the next power of 2, ensuring vLLM's `unify_kv_cache_spec_page_size()` succeeds when layers have different page sizes.
- 512 head_dim support: Added to `get_supported_head_sizes()` in the custom backend.
- All encode/decode/fused functions now derive the quantizer from the actual `head_size` parameter instead of a global, making them heterogeneous-safe.
Backward Compatible
Uniform models (single head_dim, e.g. DeepSeek, Nemotron) work identically — the dict just has one entry.
Fixed Bug
NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.
This crash occurred during vLLM startup with kv_cache_dtype=tq-t4nc on Gemma 4, caused by three layers of uniform-head assumptions.
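
To see why power-of-two alignment avoids the divisibility error, consider the sketch below. It is an assumption-laden illustration: `align_page_size()` is a hypothetical reimplementation (the real `_align_page_size()` may differ, for example in how it treats values that are already powers of two), and the byte counts are made-up numbers, not actual Gemma 4 page sizes.

```python
def align_page_size(page_size_bytes: int) -> int:
    """Round a positive page size up to a power of two (hypothetical helper).

    Values that are already powers of two are returned unchanged.
    """
    if page_size_bytes <= 0:
        raise ValueError("page size must be positive")
    return 1 << (page_size_bytes - 1).bit_length()


# Illustrative page sizes for two layer types with different head dimensions.
small_page, large_page = 1536, 2560

# Unaligned: the larger size is not a multiple of the smaller one,
# which is the situation the error message complains about.
assert large_page % small_page != 0

# Aligned: any two powers of two divide evenly, larger by smaller.
aligned_small = align_page_size(small_page)
aligned_large = align_page_size(large_page)
assert aligned_large % aligned_small == 0
```

The trade-off in this sketch is some padding per page in exchange for page sizes that are always mutually divisible.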