
Conversation

@LoserCheems
Collaborator

Introduce a new transformer architecture featuring dynamic masked attention, mixture of experts, and flexible attention backends to enhance model scalability and efficiency.

Introduces a new transformer model architecture featuring a dynamic masked attention (DMA) mechanism that adaptively masks tokens based on learned importance scores. The implementation also supports mixture of experts (MoE) with cross-domain routing, RoPE positional embeddings, and flexible attention backends.

Key features include configurable sliding window attention, dynamic masking for sparse attention patterns, and efficient expert routing for scalable model capacity.
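
For orientation, here is a minimal sketch of the top-k dynamic masking idea described above: per-key importance scores are turned into an additive attention bias that keeps only the highest-scoring keys. The shapes and the names dynamic_mask_sketch and keep_window_size are illustrative, not identifiers from this PR.

import torch

def dynamic_mask_sketch(importance_scores: torch.Tensor, keep_window_size: int) -> torch.Tensor:
    """Turn per-key importance scores into an additive attention bias.

    importance_scores: (batch, num_heads, key_len) learned importance per key token.
    Returns a bias of shape (batch, num_heads, 1, key_len): 0 for the kept top-k keys,
    -inf for masked keys, broadcastable over the query dimension.
    """
    topk_indices = importance_scores.topk(keep_window_size, dim=-1).indices
    bias = torch.full_like(importance_scores, float("-inf"))
    bias.scatter_(-1, topk_indices, 0.0)  # unmask the selected keys
    return bias.unsqueeze(-2)

# Example: batch of 2, 4 heads, 128 keys, keep 32 keys per head.
attn_bias = dynamic_mask_sketch(torch.randn(2, 4, 128), keep_window_size=32)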
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new Doge transformer model architecture that incorporates dynamic masked attention (DMA), a cross-domain mixture of experts (CDMoE), and flexible attention backends to improve model scalability and efficiency. The implementation provides a complete model architecture supporting both standard MLP and mixture-of-experts configurations.

Key changes:

  • Introduces DogeAttention with dynamic attention masking that selectively attends to top-k tokens based on learned parameters
  • Implements Cross Domain Mixture of Experts (CDMoE) using shared and routed experts with retrieval-based routing (a sketch of the routing follows this list)
  • Adds comprehensive model classes including DogeModel, DogeForCausalLM, and DogeForSequenceClassification
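
As referenced in the CDMoE bullet above, here is a rough sketch of retrieval-based (product-key style) routing, written against the router_gate/topk pattern visible in the diff below. The function name, the top_k argument, and the exact shapes are assumptions, not the PR's code.

import torch

def retrieve_experts_sketch(hidden: torch.Tensor, router_gate: torch.nn.Linear,
                            num_keys: int, top_k: int):
    """Score two sub-key sets, combine them pairwise, and keep top_k experts per token."""
    bsz, seq_len, _ = hidden.shape
    # router_gate is assumed to project to 2 * num_keys scores per token,
    # giving two score sets of shape (2, tokens, num_keys), one per sub-key dimension.
    logits = router_gate(hidden).view(2, bsz * seq_len, num_keys)
    scores, indices = logits.topk(num_keys, dim=-1)
    scores_x, scores_y = scores[0], scores[1]
    indices_x, indices_y = indices[0], indices[1]
    # Pairwise combination yields num_keys * num_keys candidate experts per token.
    all_scores = (scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)).view(-1, num_keys * num_keys)
    all_indices = (indices_x.unsqueeze(-1) * num_keys + indices_y.unsqueeze(-2)).view(-1, num_keys * num_keys)
    # Keep the top_k routed experts per token.
    expert_scores, position = all_scores.topk(top_k, dim=-1)
    expert_indices = all_indices.gather(-1, position)
    return expert_scores, expert_indices

# Example: a gate projecting hidden size 64 to 2 * num_keys scores.
gate = torch.nn.Linear(64, 2 * 16)
scores, experts = retrieve_experts_sketch(torch.randn(2, 8, 64), gate, num_keys=16, top_k=4)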

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Files and descriptions:

  • examples/modeling/modeling_doge.py: Complete model implementation with attention, MLP/MoE layers, and model classes
  • examples/modeling/configuration_doge.py: Configuration class defining all model hyperparameters and architecture options

attention_mask=attention_mask,
)

attention_interface: Callable = flash_dmattn_func_auto(backend="flex")

Copilot AI Aug 7, 2025


The hard-coded backend='flex' parameter may cause issues if the flex backend is not available. Consider making this configurable or adding fallback logic to handle missing backends.

Suggested change
attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
backend = "flex" if is_torch_flex_attn_available() else None
attention_interface: Callable = (
    flash_dmattn_func_auto(backend=backend) if backend else flash_dmattn_func_auto()
)
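
An alternative would be a try/except fallback. This sketch assumes flash_dmattn_func_auto raises when the requested backend is unavailable; the exception types used here are an assumption, not documented behavior.

try:
    attention_interface: Callable = flash_dmattn_func_auto(backend="flex")
except (ValueError, RuntimeError):
    # Assumed failure modes; fall back to the dispatcher's default backend selection.
    attention_interface = flash_dmattn_func_auto()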


attn_output = attn_output.reshape(*input_shape, -1).contiguous()
attn_output = self.o_proj(attn_output)
return attn_output, None

Copilot AI Aug 7, 2025


Returning None for attention weights is inconsistent with other attention implementations that return actual attention weights. This could break compatibility with code expecting attention weights.

Suggested change
return attn_output, None
# Return a zero tensor for attention weights to maintain compatibility
attn_weights = torch.zeros(
    query_states.shape[0],  # batch size
    query_states.shape[1],  # num heads
    query_states.shape[2],  # query length
    key_states.shape[2],    # key length
    device=attn_output.device,
    dtype=attn_output.dtype,
)
return attn_output, attn_weights

router_logits = self.router_gate(hidden_states).view(2, bsz * seq_len, -1)

# get experts with the highest routing logits
(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)

Copilot AI Aug 7, 2025


This line attempts to unpack router_logits.topk() into nested tuples, but topk() returns a tuple of (values, indices). The unpacking pattern should be 'scores, indices = router_logits.topk()' and then handle the tensor dimensions appropriately.

Suggested change
(scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(self.num_keys, dim=-1)
scores, indices = router_logits.topk(self.num_keys, dim=-1)
scores_x, scores_y = scores[0], scores[1]
indices_x, indices_y = indices[0], indices[1]
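
For reference, a minimal standalone demonstration of the (values, indices) contract of torch.topk that the suggestion relies on:

import torch

x = torch.arange(24, dtype=torch.float32).view(2, 3, 4)
values, indices = x.topk(2, dim=-1)   # values and indices both have shape (2, 3, 2)
# Indexing the leading dimension afterwards splits the two routing branches.
values_x, values_y = values[0], values[1]
indices_x, indices_y = indices[0], indices[1]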

Comment on lines +476 to +478
hidden_states = self.mlp(hidden_states)
if isinstance(hidden_states, tuple):
    hidden_states, _ = hidden_states

Copilot AI Aug 7, 2025


The MLP forward method conditionally returns a tuple (for MoE) or tensor (for standard MLP). This inconsistency should be handled more explicitly, perhaps by ensuring both return the same type or using a more structured approach.

Suggested change
hidden_states = self.mlp(hidden_states)
if isinstance(hidden_states, tuple):
    hidden_states, _ = hidden_states
hidden_states = _unpack_mlp_output(self.mlp(hidden_states))
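
The _unpack_mlp_output helper named in the suggestion is not defined in this PR; a minimal sketch of what it could look like (name and behavior are hypothetical):

def _unpack_mlp_output(output):
    # MoE layers may return (hidden_states, router_logits); plain MLPs return a tensor.
    if isinstance(output, tuple):
        return output[0]
    return output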

Comment on lines +645 to +646
(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)


Copilot AI Aug 7, 2025


Same issue as line 410: incorrect unpacking pattern for the topk() return value. This will cause a ValueError at runtime.

Suggested change
(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
scores, indices = layer_gate_logits.topk(num_keys, dim=-1)
scores_x, scores_y = scores.unbind(-1)
indices_x, indices_y = indices.unbind(-1)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
LoserCheems merged commit 0773bf5 into main on Aug 7, 2025.