fix: gemma-4 with exllamav3 by AlpinDale · Pull Request #1672 · dphnAI/aphrodite-engine

AlpinDale · 2026-05-07T02:11:50Z

No description provided.

gemini-code-assist

Code Review

This pull request updates the EXL3 quantization logic to support models with K=V attention, such as Gemma 4, where the v_proj tensor may be absent and duplicated from k_proj. Feedback was provided to optimize the implementation by using short-circuiting and reducing redundant lookups to improve model initialization performance.
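As a rough illustration of the K=V case described in the summary, the sketch below duplicates `k_proj` into `v_proj` when the checkpoint has no V entry. The helper name `fill_missing_v_proj`, the tensor names, and the plain-list "tensors" are all hypothetical and are not taken from aphrodite-engine; the real loader may handle this quite differently.

```python
from typing import Dict, List


def fill_missing_v_proj(
    tensors: Dict[str, List[float]], base: str
) -> Dict[str, List[float]]:
    """Return a copy of `tensors` in which `{base}.v_proj` falls back to k_proj.

    Hypothetical helper for illustration only; it is not the project's loader.
    """
    out = dict(tensors)
    k_name = f"{base}.k_proj"
    v_name = f"{base}.v_proj"
    if v_name not in out and k_name in out:
        # K=V attention: the checkpoint stores no V weights, so reuse K.
        out[v_name] = list(out[k_name])
    return out


# A checkpoint layer that ships only q_proj and k_proj (assumed layout).
checkpoint = {
    "layers.0.self_attn.q_proj": [0.1, 0.2],
    "layers.0.self_attn.k_proj": [0.3, 0.4],
}
loaded = fill_missing_v_proj(checkpoint, "layers.0.self_attn")
```

Because V is synthesized at load time, a check that asks "are q, k and v all quantized?" has to treat a missing `v_proj` as acceptable, which is what the change in this pull request does.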

gemini-code-assist · 2026-05-07T02:13:09Z

+            has_q = self._is_exl3_prefix(f"{base}.q_proj")
+            has_k = self._is_exl3_prefix(f"{base}.k_proj")
+            has_v = self._is_exl3_prefix(f"{base}.v_proj")
+            # Gemma 4 full-attention layers can use K=V attention and store
+            # only q_proj/k_proj tensors. The model loader duplicates K into
+            # V, so the fused qkv_proj still needs EXL3 parameters.
+            return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)


The current implementation performs redundant lookups and lacks short-circuiting, which can negatively impact model initialization performance. Specifically, _is_exl3_prefix is called for all three projections regardless of the results, and _storage_entry is called twice for v_proj when it is missing from the storage. Since _storage_entry involves multiple string operations and dictionary lookups, it is more efficient to use short-circuiting and store the entry for v_proj to avoid redundant work.

Suggested change

-            has_q = self._is_exl3_prefix(f"{base}.q_proj")
-            has_k = self._is_exl3_prefix(f"{base}.k_proj")
-            has_v = self._is_exl3_prefix(f"{base}.v_proj")
-            # Gemma 4 full-attention layers can use K=V attention and store
-            # only q_proj/k_proj tensors. The model loader duplicates K into
-            # V, so the fused qkv_proj still needs EXL3 parameters.
-            return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)
+            if not (self._is_exl3_prefix(f"{base}.q_proj") and
+                    self._is_exl3_prefix(f"{base}.k_proj")):
+                return False
+            v_entry = self._storage_entry(f"{base}.v_proj")
+            # Gemma 4 full-attention layers can use K=V attention and store
+            # only q_proj/k_proj tensors. The model loader duplicates K into
+            # V, so the fused qkv_proj still needs EXL3 parameters.
+            return v_entry is None or v_entry.get("quant_format") == "exl3"
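The trade-off raised in this review comment can be sketched with a toy class that counts storage lookups. Everything here is an assumption made for illustration: `ToyStorage`, the dict-shaped entries, and the idea that `_is_exl3_prefix` is simply a `quant_format == "exl3"` check on the result of `_storage_entry`. The real methods in the repository may do more work or behave differently, in which case the two versions would not necessarily be equivalent.

```python
from typing import Dict, Optional, Tuple


class ToyStorage:
    """Hypothetical stand-in for the quantization config; counts lookups."""

    def __init__(self, entries: Dict[str, dict]) -> None:
        self.entries = entries
        self.lookups = 0

    def _storage_entry(self, prefix: str) -> Optional[dict]:
        self.lookups += 1
        return self.entries.get(prefix)

    def _is_exl3_prefix(self, prefix: str) -> bool:
        # Assumption: an EXL3 entry is a dict tagged quant_format == "exl3".
        entry = self._storage_entry(prefix)
        return entry is not None and entry.get("quant_format") == "exl3"

    def qkv_is_exl3_original(self, base: str) -> bool:
        # Mirrors the merged diff: all three checks run unconditionally.
        has_q = self._is_exl3_prefix(f"{base}.q_proj")
        has_k = self._is_exl3_prefix(f"{base}.k_proj")
        has_v = self._is_exl3_prefix(f"{base}.v_proj")
        return has_q and has_k and (
            has_v or self._storage_entry(f"{base}.v_proj") is None
        )

    def qkv_is_exl3_suggested(self, base: str) -> bool:
        # Mirrors the suggestion: bail out early, look up v_proj only once.
        if not (self._is_exl3_prefix(f"{base}.q_proj") and
                self._is_exl3_prefix(f"{base}.k_proj")):
            return False
        v_entry = self._storage_entry(f"{base}.v_proj")
        return v_entry is None or v_entry.get("quant_format") == "exl3"


def count_lookups(entries: Dict[str, dict], method: str,
                  base: str = "attn") -> Tuple[bool, int]:
    """Run one variant on a fresh store and report (result, lookup count)."""
    store = ToyStorage(entries)
    return getattr(store, method)(base), store.lookups


EXL3 = {"quant_format": "exl3"}
kv_shared = {"attn.q_proj": EXL3, "attn.k_proj": EXL3}  # v_proj absent
mixed = {**kv_shared, "attn.v_proj": {"quant_format": "other"}}
```

Under these assumptions both variants return the same answer for every case above; the suggested form only saves one lookup per K=V layer and two when `q_proj` is not EXL3, so the gain is small unless `_storage_entry` is expensive in practice.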

AlpinDale added 3 commits May 7, 2026 06:19

fix: gemma-4 with exllamav3

470aca2

temporarily remove the ascii art logo

f327076

Revert "temporarily remove the ascii art logo"

21f5d66

This reverts commit f327076.

AlpinDale merged commit c16f370 into main May 7, 2026
1 check failed

AlpinDale deleted the fix/gemma4-exl3 branch May 7, 2026 02:12

gemini-code-assist Bot reviewed May 7, 2026
