Implement Doge model with dynamic attention mechanism #92
Conversation
Introduces a new transformer model architecture featuring a dynamic masked attention (DMA) mechanism that adaptively masks tokens based on learned importance scores. The implementation includes support for mixture of experts (MoE) with cross-domain routing, RoPE positional embeddings, and flexible attention backends. Key features include configurable sliding window attention, dynamic masking for sparse attention patterns, and efficient expert routing for scalable model capacity.
Pull Request Overview
This PR introduces a new Doge transformer model architecture that incorporates dynamic masked attention (DMA), a cross-domain mixture of experts (CDMoE), and flexible attention backends to improve model scalability and efficiency. The implementation includes a complete model architecture with support for both standard MLP and mixture-of-experts configurations.
Key changes:
- Introduces DogeAttention with dynamic attention masking that selectively attends to top-k tokens based on learned parameters (a generic sketch of the idea follows this list)
- Implements Cross-Domain Mixture of Experts (CDMoE) using shared and routed experts with retrieval-based routing
- Adds comprehensive model classes including DogeModel, DogeForCausalLM, and DogeForSequenceClassification
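For readers new to the idea, here is a minimal, self-contained sketch of top-k dynamic masking as described in the first bullet. It illustrates the general pattern only, not the PR's actual `DogeAttention` code; the tensor names and the `keep_k` parameter are placeholders.

```python
import torch
import torch.nn.functional as F

def dynamic_masked_attention(q, k, v, importance, keep_k):
    # q, k, v: (batch, heads, seq_len, head_dim); importance: (batch, heads, key_len)
    # learned per-key importance scores decide which keys each head may attend to.
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    if k.shape[-2] > keep_k:
        # Keep only the keep_k most important key positions per head.
        topk_idx = importance.topk(keep_k, dim=-1).indices   # (batch, heads, keep_k)
        mask = torch.full_like(importance, float("-inf"))
        mask.scatter_(-1, topk_idx, 0.0)                     # 0 for kept keys, -inf otherwise
        scores = scores + mask.unsqueeze(-2)                 # broadcast over query positions
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```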
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| examples/modeling/modeling_doge.py | Complete model implementation with attention, MLP/MoE layers, and model classes |
| examples/modeling/configuration_doge.py | Configuration class defining all model hyperparameters and architecture options |
```python
    attention_mask=attention_mask,
)

attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
```
Copilot AI commented on Aug 7, 2025
The hard-coded backend='flex' parameter may cause issues if the flex backend is not available. Consider making this configurable or adding fallback logic to handle missing backends.
Suggested change:

```diff
-attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
+backend = "flex" if is_torch_flex_attn_available() else None
+attention_interface: Callable = (
+    flash_dmattn_func_auto(backend=backend) if backend else flash_dmattn_func_auto()
+)
```
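A runtime fallback is another option. The sketch below only assumes that `flash_dmattn_func_auto` raises when the requested backend is unavailable, which is an assumption about the library's behavior rather than documented fact:

```python
# Prefer the "flex" backend, but fall back to the library's automatic selection
# if requesting it fails (assumption: an unavailable backend raises an exception).
try:
    attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
except Exception:
    attention_interface = flash_dmattn_func_auto()
```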
```python
attn_output = attn_output.reshape(*input_shape, -1).contiguous()
attn_output = self.o_proj(attn_output)
return attn_output, None
```
Copilot AI commented on Aug 7, 2025
Returning None for attention weights is inconsistent with other attention implementations that return actual attention weights. This could break compatibility with code expecting attention weights.
Suggested change:

```diff
-return attn_output, None
+# Return a zero tensor for attention weights to maintain compatibility
+attn_weights = torch.zeros(
+    query_states.shape[0],  # batch size
+    query_states.shape[1],  # num heads
+    query_states.shape[2],  # query length
+    key_states.shape[2],  # key length
+    device=attn_output.device,
+    dtype=attn_output.dtype,
+)
+return attn_output, attn_weights
```
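Worth weighing before adopting this: a zero tensor of shape (batch, heads, query_len, key_len) is not the real attention weights and costs memory quadratic in sequence length. An alternative is to keep returning `None` and document that this backend does not materialize attention weights, a convention some fused-attention integrations follow.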
```python
router_logits = self.router_gate(hidden_states).view(2, bsz * seq_len, -1)

# get experts with the highest routing logits
(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)
```
Copilot AI commented on Aug 7, 2025
This line attempts to unpack router_logits.topk() into nested tuples, but topk() returns a tuple of (values, indices). The unpacking pattern should be 'scores, indices = router_logits.topk()' and then handle the tensor dimensions appropriately.
Suggested change:

```diff
-(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)
+scores, indices = router_logits.topk(self.num_keys, dim=-1)
+scores_x, scores_y = scores[0], scores[1]
+indices_x, indices_y = indices[0], indices[1]
```
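For context on what the two score/index pairs are typically used for: in product-key style cross-domain routing, they are combined into joint scores over `num_keys * num_keys` candidate experts. The sketch below shows that generic pattern, not the PR's CDMoE code; `num_experts_per_tok` is a placeholder name.

```python
# Combine the per-dimension top-k results into joint expert scores and ids
# (generic product-key routing sketch; names are illustrative).
all_scores = scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)                      # (tokens, k, k)
all_indices = indices_x.unsqueeze(-1) * self.num_keys + indices_y.unsqueeze(-2)   # flat expert ids
all_scores = all_scores.flatten(-2)                                               # (tokens, k * k)
all_indices = all_indices.flatten(-2)
expert_scores, pos = all_scores.topk(num_experts_per_tok, dim=-1)
expert_indices = all_indices.gather(-1, pos)
```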
```python
hidden_states = self.mlp(hidden_states)
if isinstance(hidden_states, tuple):
    hidden_states, _ = hidden_states
```
Copilot AI commented on Aug 7, 2025
The MLP forward method conditionally returns a tuple (for MoE) or tensor (for standard MLP). This inconsistency should be handled more explicitly, perhaps by ensuring both return the same type or using a more structured approach.
Suggested change:

```diff
-hidden_states = self.mlp(hidden_states)
-if isinstance(hidden_states, tuple):
-    hidden_states, _ = hidden_states
+hidden_states = _unpack_mlp_output(self.mlp(hidden_states))
```
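Note that `_unpack_mlp_output` does not exist in the diff; if this route were taken, it would need a small helper along these lines (name taken from the suggestion above, definition is a sketch):

```python
def _unpack_mlp_output(output):
    """Normalize layer outputs: MoE layers may return (hidden_states, router_logits)."""
    if isinstance(output, tuple):
        return output[0]
    return output
```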
```python
(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
```
Copilot AI commented on Aug 7, 2025
Same issue as line 410 - incorrect unpacking pattern for topk() return value. This will cause a ValueError at runtime.
Suggested change:

```diff
-(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
+scores, indices = layer_gate_logits.topk(num_keys, dim=-1)
+scores_x, scores_y = scores.unbind(-1)
+indices_x, indices_y = indices.unbind(-1)
```
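One detail worth double-checking: `unbind(-1)` splits along the last (`num_keys`) dimension, while the earlier `router_logits` suggestion splits along the leading dimension with `scores[0], scores[1]`. If `layer_gate_logits` shares that `(2, tokens, num_keys)` layout, indexing the leading dimension is likely the intended fix here as well.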
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a new transformer architecture featuring dynamic masked attention, mixture of experts, and flexible attention backends to enhance model scalability and efficiency.