Closed
Labels: bug (Something isn't working)
Description
Currently, in the FlexAttention variant of modeling_doge, the KV score (`zoh`) is computed over the whole sequence instead of only the causal prefix.
https://github.com/SmallDoges/flash-dmattn/blob/801e8161e30eae64d49da1fb9d5139f14bb5aa2f/examples/modeling/modeling_doge.py#L326
attn_bias = dt_states[:, :, None, :].expand(-1, -1, hidden_states.shape[1], -1)  # [batch_size, num_heads, query_len, key_len]
The causal mask is missing here (and after this step), so the top-K KV selection is identical across all queries. Earlier tokens can therefore "select" future keys that the causal mask later zeroes out, which reduces their effective attention range.
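A minimal NumPy sketch of the problem and one possible fix (all shapes and names here are illustrative, mirroring the snippet above, not the actual repo code): masking future keys with `-inf` before the top-K step makes the selection depend on the query position.

```python
import numpy as np

# Hypothetical small shapes for illustration.
batch_size, num_heads, seq_len = 1, 1, 4

rng = np.random.default_rng(0)
dt_states = rng.standard_normal((batch_size, num_heads, seq_len))

# As in the linked code: per-key scores broadcast over the query axis,
# so every query row is identical at this point.
attn_bias = np.broadcast_to(
    dt_states[:, :, None, :],
    (batch_size, num_heads, seq_len, seq_len),
).copy()  # [batch_size, num_heads, query_len, key_len]

# Proposed fix (a sketch): apply the causal mask BEFORE top-K selection,
# so query i can only rank keys at positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
attn_bias[:, :, ~causal_mask] = -np.inf

# Top-2 key indices per query, highest score first.
k = 2
topk = np.argsort(attn_bias, axis=-1)[..., ::-1][..., :k]
```

With the mask applied, the first query's highest-ranked key is necessarily position 0, whereas without it every query would rank the same (possibly future) keys.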
I am unfamiliar with CUDA, so I don't know whether this issue also exists in the other kernel variants.