fix: gemma-4 with exllamav3 #1672
Code Review
This pull request updates the EXL3 quantization logic to support models with K=V attention, such as Gemma 4, where the v_proj tensor may be absent and duplicated from k_proj. Feedback was provided to optimize the implementation by using short-circuiting and reducing redundant lookups to improve model initialization performance.
    has_q = self._is_exl3_prefix(f"{base}.q_proj")
    has_k = self._is_exl3_prefix(f"{base}.k_proj")
    has_v = self._is_exl3_prefix(f"{base}.v_proj")
    # Gemma 4 full-attention layers can use K=V attention and store
    # only q_proj/k_proj tensors. The model loader duplicates K into
    # V, so the fused qkv_proj still needs EXL3 parameters.
    return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)
The current implementation performs redundant lookups and lacks short-circuiting, which can negatively impact model initialization performance. Specifically, _is_exl3_prefix is called for all three projections regardless of the results, and _storage_entry is called twice for v_proj when it is missing from the storage. Since _storage_entry involves multiple string operations and dictionary lookups, it is more efficient to use short-circuiting and store the entry for v_proj to avoid redundant work.
Suggested change:
-    has_q = self._is_exl3_prefix(f"{base}.q_proj")
-    has_k = self._is_exl3_prefix(f"{base}.k_proj")
-    has_v = self._is_exl3_prefix(f"{base}.v_proj")
-    # Gemma 4 full-attention layers can use K=V attention and store
-    # only q_proj/k_proj tensors. The model loader duplicates K into
-    # V, so the fused qkv_proj still needs EXL3 parameters.
-    return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)
+    if not (self._is_exl3_prefix(f"{base}.q_proj") and
+            self._is_exl3_prefix(f"{base}.k_proj")):
+        return False
+    v_entry = self._storage_entry(f"{base}.v_proj")
+    # Gemma 4 full-attention layers can use K=V attention and store
+    # only q_proj/k_proj tensors. The model loader duplicates K into
+    # V, so the fused qkv_proj still needs EXL3 parameters.
+    return v_entry is None or v_entry.get("quant_format") == "exl3"
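To make the lookup counts in this comment concrete, here is a minimal, self-contained sketch that runs both variants against a toy index and counts calls to `_storage_entry`. `ToyExl3Index`, the dict-backed storage, and the `lookups` counter are hypothetical stand-ins. The sketch also assumes `_is_exl3_prefix` resolves the entry through `_storage_entry` and checks `quant_format == "exl3"`; the real helper may do more.

```python
from typing import Optional


class ToyExl3Index:
    """Hypothetical stand-in for the loader class; not the project's real code."""

    def __init__(self, storage: dict):
        self._storage = storage
        self.lookups = 0  # number of _storage_entry calls made so far

    def _storage_entry(self, prefix: str) -> Optional[dict]:
        self.lookups += 1
        return self._storage.get(prefix)

    def _is_exl3_prefix(self, prefix: str) -> bool:
        # Assumption: the real helper resolves the entry, then checks its format.
        entry = self._storage_entry(prefix)
        return entry is not None and entry.get("quant_format") == "exl3"

    def fused_qkv_original(self, base: str) -> bool:
        # All three prefixes are resolved before any result is inspected.
        has_q = self._is_exl3_prefix(f"{base}.q_proj")
        has_k = self._is_exl3_prefix(f"{base}.k_proj")
        has_v = self._is_exl3_prefix(f"{base}.v_proj")
        return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)

    def fused_qkv_suggested(self, base: str) -> bool:
        # Stop as soon as q_proj or k_proj fails; resolve v_proj only once.
        if not (self._is_exl3_prefix(f"{base}.q_proj")
                and self._is_exl3_prefix(f"{base}.k_proj")):
            return False
        v_entry = self._storage_entry(f"{base}.v_proj")
        return v_entry is None or v_entry.get("quant_format") == "exl3"


def run(storage: dict, method_name: str, base: str = "layer"):
    """Return (result, number of storage lookups) for one variant."""
    index = ToyExl3Index(storage)
    return getattr(index, method_name)(base), index.lookups


EXL3 = {"quant_format": "exl3"}
# K=V layer: v_proj is absent from storage.
KV_SHARED = {"layer.q_proj": EXL3, "layer.k_proj": EXL3}
# q_proj is not quantized at all.
NO_Q = {"layer.k_proj": EXL3}

if __name__ == "__main__":
    for name, storage in (("K=V layer", KV_SHARED), ("missing q_proj", NO_Q)):
        for method in ("fused_qkv_original", "fused_qkv_suggested"):
            print(name, method, run(storage, method))
```

In this toy model both variants return the same result for each case. The K=V layer costs four lookups with the original code and three with the suggestion, and the missing `q_proj` case drops from three lookups to one. Whether the two are equivalent in the real code depends on what `_is_exl3_prefix` actually checks.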