[BUG] Attention mask in example is incorrect

Currently, in the flexattention variant (`modeling_doge`), the "KV score" (`zoh`) is computed over the whole sequence rather than over the causal part of the sequence.
https://github.com/SmallDoges/flash-dmattn/blob/801e8161e30eae64d49da1fb9d5139f14bb5aa2f/examples/modeling/modeling_doge.py#L326
```python
        attn_bias = dt_states[:, :, None, :].expand(
            -1, -1, hidden_states.shape[1], -1
        )  # [batch_size, num_heads, query_len, key_len]
```
The causal mask is missing here (it should be applied right after this step). As a result, the top-k KV selection is the same for all queries, which reduces the attention range of earlier tokens.
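
To illustrate the effect, here is a minimal pure-Python sketch, not the repository's code. It assumes one score per key position (a stand-in for one head of `dt_states`) and assumes a causal mask is applied only after the top-k selection. The helper names `topk_indices` and `select_keys` and the score values are made up for the illustration.

```python
import math


def topk_indices(scores, k):
    """Return the indices of the k largest finite scores, highest first."""
    valid = [i for i, s in enumerate(scores) if s != -math.inf]
    return sorted(valid, key=lambda i: scores[i], reverse=True)[:k]


def select_keys(key_scores, k, causal):
    """Pick top-k key positions for every query position.

    key_scores: one score per key position (hypothetical stand-in for dt_states).
    causal: if True, query q may only select keys at positions <= q.
    """
    selected = []
    for q in range(len(key_scores)):
        # Every query row starts from the same per-key scores (the "expand" step).
        row = list(key_scores)
        if causal:
            # Mask future keys before the selection, not after it.
            row = [s if j <= q else -math.inf for j, s in enumerate(row)]
        selected.append(topk_indices(row, k))
    return selected


key_scores = [0.1, 0.2, 0.3, 0.9, 0.8]  # made-up values for illustration

# Selection without a causal mask: identical for every query.
shared = select_keys(key_scores, k=2, causal=False)

# Keys that remain if a causal mask is applied only after the selection.
usable = [[j for j in row if j <= q] for q, row in enumerate(shared)]

# Selection with the causal mask applied first: each query keeps k valid keys
# whenever at least k keys are visible to it.
per_query = select_keys(key_scores, k=2, causal=True)

print(shared)
print(usable)
print(per_query)
```

In this toy case the first three queries end up with no usable keys when the mask comes after the selection, while masking first gives each query its own top-k among the keys it can see. In a tensor implementation the same idea would presumably be to fill future positions of `attn_bias` with a very negative value before the top-k call.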

I am unfamiliar with CUDA, so I don't know whether this issue is also present in the other versions.


[BUG] Attention mask in example is incorrect #146
