
Support gradients to attention bias #636

Closed
EBGU opened this issue Jan 12, 2023 · 1 comment

EBGU commented Jan 12, 2023

🚀 Feature

When I use memory_efficient_attention with a bias given as a torch.Tensor, I get the error "No operator found for this attention". I then found a remark in xformers.ops.fmha, class _fMHA: "Only gradients to Q/K/V is implemented. For instance, it's not possible to backpropagate through the attention mask". I take this to mean that gradients for the attention bias are not supported. Could you add a feature to support gradients for the attention bias?
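
For reference, here is a rough sketch of the setup I have in mind (the tensor shapes, dtype, and attn_bias layout below are my assumptions, not a verified reproducer):

```python
import torch
import xformers.ops as xops

B, M, H, K = 2, 128, 4, 64  # batch, sequence length, heads, head dim
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)

# Additive bias, e.g. derived from an AlphaFold2-style pairwise representation
bias = torch.randn(B, H, M, M, device="cuda", dtype=torch.half, requires_grad=True)

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
out.sum().backward()  # desired: bias.grad gets populated; today a bias that
                      # requires grad is rejected by the operator dispatch
```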

Motivation

In models like AlphaFold2, biased attention is used heavily, and the pairwise representation that produces the bias needs to receive gradients.

Pitch

Support backpropagation through the attention bias in memory_efficient_attention.

Additional context

I also didn't see any argument for key_padding_mask. Is it appropriate to masked_fill the attention bias with float('-inf') to achieve the same effect as a key padding mask?
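
Concretely, I mean something like this (the mask shape, names, and values are placeholders on my part):

```python
import torch

B, H, M = 2, 4, 128
# True marks padded key positions, as in nn.MultiheadAttention's key_padding_mask
key_padding_mask = torch.zeros(B, M, dtype=torch.bool, device="cuda")
key_padding_mask[:, 100:] = True

# Turn the padding mask into an additive attention bias
bias = torch.zeros(B, H, M, M, device="cuda", dtype=torch.half)
bias = bias.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
# Passing this tensor as attn_bias should give ~zero weight to padded keys
```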

@danthe3rd
Contributor

Hi @EBGU
Thanks for your report.
There is an open PR adding support for exactly that: #587
We plan to merge it after we release 0.0.16, hopefully next week.

EBGU closed this as completed Jan 12, 2023