Implement Doge model with dynamic attention mechanism #92
Conversation
Introduces a new transformer model architecture featuring a dynamic masked attention (DMA) mechanism that adaptively masks tokens based on learned importance scores. The implementation includes support for mixture of experts (MoE) with cross-domain routing, RoPE positional embeddings, and flexible attention backends. Key features include configurable sliding window attention, dynamic masking for sparse attention patterns, and efficient expert routing for scalable model capacity.
Pull Request Overview
This PR introduces a new Doge transformer model architecture that incorporates dynamic masked attention (DMA), a cross-domain mixture of experts (CDMoE), and flexible attention backends to improve model scalability and efficiency. The implementation includes a complete model architecture with support for both standard MLP and mixture-of-experts configurations.
Key changes:
- Introduces DogeAttention with dynamic attention masking that selectively attends to top-k tokens based on learned parameters (a generic sketch of the idea follows this list)
- Implements Cross-Domain Mixture of Experts (CDMoE) using shared and routed experts with retrieval-based routing
- Adds comprehensive model classes including DogeModel, DogeForCausalLM, and DogeForSequenceClassification
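For readers new to the idea, here is a minimal, self-contained sketch of top-k dynamic masking as described in the first bullet. It illustrates the general pattern only, not the PR's actual `DogeAttention` code; the tensor names and the `keep_k` parameter are placeholders.

```python
import torch
import torch.nn.functional as F

def dynamic_masked_attention(q, k, v, importance, keep_k):
    # q, k, v: (batch, heads, seq_len, head_dim); importance: (batch, heads, key_len)
    # learned per-key importance scores decide which keys each head may attend to.
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    if k.shape[-2] > keep_k:
        # Keep only the keep_k most important key positions per head.
        topk_idx = importance.topk(keep_k, dim=-1).indices   # (batch, heads, keep_k)
        mask = torch.full_like(importance, float("-inf"))
        mask.scatter_(-1, topk_idx, 0.0)                     # 0 for kept keys, -inf otherwise
        scores = scores + mask.unsqueeze(-2)                 # broadcast over query positions
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```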
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| examples/modeling/modeling_doge.py | Complete model implementation with attention, MLP/MoE layers, and model classes |
| examples/modeling/configuration_doge.py | Configuration class defining all model hyperparameters and architecture options |
```python
    attention_mask=attention_mask,
)

attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
```
Copilot AI commented on Aug 7, 2025
The hard-coded backend='flex' parameter may cause issues if the flex backend is not available. Consider making this configurable or adding fallback logic to handle missing backends.
Suggested change:

```diff
-attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
+backend = "flex" if is_torch_flex_attn_available() else None
+attention_interface: Callable = (
+    flash_dmattn_func_auto(backend=backend) if backend else flash_dmattn_func_auto()
+)
```
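A runtime fallback is another option. The sketch below only assumes that `flash_dmattn_func_auto` raises when the requested backend is unavailable, which is an assumption about the library's behavior rather than documented fact:

```python
# Prefer the "flex" backend, but fall back to the library's automatic selection
# if requesting it fails (assumption: an unavailable backend raises an exception).
try:
    attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
except Exception:
    attention_interface = flash_dmattn_func_auto()
```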
```python
attn_output = attn_output.reshape(*input_shape, -1).contiguous()
attn_output = self.o_proj(attn_output)
return attn_output, None
```
Copilot AI commented on Aug 7, 2025
Returning None for attention weights is inconsistent with other attention implementations that return actual attention weights. This could break compatibility with code expecting attention weights.
Suggested change:

```diff
-return attn_output, None
+# Return a zero tensor for attention weights to maintain compatibility
+attn_weights = torch.zeros(
+    query_states.shape[0],  # batch size
+    query_states.shape[1],  # num heads
+    query_states.shape[2],  # query length
+    key_states.shape[2],  # key length
+    device=attn_output.device,
+    dtype=attn_output.dtype,
+)
+return attn_output, attn_weights
```
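Worth weighing before adopting this: a zero tensor of shape (batch, heads, query_len, key_len) is not the real attention weights and costs memory quadratic in sequence length. An alternative is to keep returning `None` and document that this backend does not materialize attention weights, a convention some fused-attention integrations follow.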
```python
router_logits = self.router_gate(hidden_states).view(2, bsz * seq_len, -1)

# get experts with the highest routing logits
(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)
```
Copilot AI commented on Aug 7, 2025
This line attempts to unpack router_logits.topk() into nested tuples, but topk() returns a tuple of (values, indices). The unpacking pattern should be 'scores, indices = router_logits.topk()' and then handle the tensor dimensions appropriately.
Suggested change:

```diff
-(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)
+scores, indices = router_logits.topk(self.num_keys, dim=-1)
+scores_x, scores_y = scores[0], scores[1]
+indices_x, indices_y = indices[0], indices[1]
```
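For context on what the two score/index pairs are typically used for: in product-key style cross-domain routing, they are combined into joint scores over `num_keys * num_keys` candidate experts. The sketch below shows that generic pattern, not the PR's CDMoE code; `num_experts_per_tok` is a placeholder name.

```python
# Combine the per-dimension top-k results into joint expert scores and ids
# (generic product-key routing sketch; names are illustrative).
all_scores = scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)                      # (tokens, k, k)
all_indices = indices_x.unsqueeze(-1) * self.num_keys + indices_y.unsqueeze(-2)   # flat expert ids
all_scores = all_scores.flatten(-2)                                               # (tokens, k * k)
all_indices = all_indices.flatten(-2)
expert_scores, pos = all_scores.topk(num_experts_per_tok, dim=-1)
expert_indices = all_indices.gather(-1, pos)
```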
```python
hidden_states = self.mlp(hidden_states)
if isinstance(hidden_states, tuple):
    hidden_states, _ = hidden_states
```
Copilot AI commented on Aug 7, 2025
The MLP forward method conditionally returns a tuple (for MoE) or tensor (for standard MLP). This inconsistency should be handled more explicitly, perhaps by ensuring both return the same type or using a more structured approach.
Suggested change:

```diff
-hidden_states = self.mlp(hidden_states)
-if isinstance(hidden_states, tuple):
-    hidden_states, _ = hidden_states
+hidden_states = _unpack_mlp_output(self.mlp(hidden_states))
```
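Note that `_unpack_mlp_output` does not exist in the diff; if this route were taken, it would need a small helper along these lines (name taken from the suggestion above, definition is a sketch):

```python
def _unpack_mlp_output(output):
    """Normalize layer outputs: MoE layers may return (hidden_states, router_logits)."""
    if isinstance(output, tuple):
        return output[0]
    return output
```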
```python
(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
```
Copilot AI commented on Aug 7, 2025
Same issue as line 410 - incorrect unpacking pattern for topk() return value. This will cause a ValueError at runtime.
Suggested change:

```diff
-(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
+scores, indices = layer_gate_logits.topk(num_keys, dim=-1)
+scores_x, scores_y = scores.unbind(-1)
+indices_x, indices_y = indices.unbind(-1)
```
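One detail worth double-checking: `unbind(-1)` splits along the last (`num_keys`) dimension, while the earlier `router_logits` suggestion splits along the leading dimension with `scores[0], scores[1]`. If `layer_gate_logits` shares that `(2, tokens, num_keys)` layout, indexing the leading dimension is likely the intended fix here as well.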
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a new transformer architecture featuring dynamic masked attention, mixture of experts, and flexible attention backends to enhance model scalability and efficiency.