[RTX 3090] Raise NotImplementedError: No operator found for this attention: Inputs when I backward the loss #628
Oh I see, this is related to #517.
I tried to use fp16 to run the demo; this is what it returned:
I used half() to enable fp16 in demo.py.
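(For reference, a minimal sketch of what that typically looks like; `build_model` and the input shape are placeholders, not the actual demo.py code:)

```python
import torch

# Hypothetical setup; the real demo.py model and inputs are not shown in this thread.
model = build_model().cuda().half()  # cast parameters to fp16 via half()
x = torch.randn(1, 3, 512, 512, device="cuda", dtype=torch.float16)

with torch.no_grad():
    out = model(x)
```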
My environment, from conda list:
It looks like you are doing the right things.
Thanks
@danthe3rd I have one reproducible situation where it works, and one where it doesn't. How can I help drill down to solve this issue?

Works:

```python
def AttentionBase(features: int, head_features: int, num_heads: int) -> nn.Module:
    mid_features = head_features * num_heads
    to_out = nn.Linear(in_features=mid_features, out_features=features, bias=False)

    def forward(
        q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None
    ) -> Tensor:
        # Use memory-efficient attention
        out = xformers.ops.memory_efficient_attention(q, k, v)
        return to_out(out)

    return Module([to_out], forward)
```

Doesn't work:

```python
def OldLinearAttentionBase(features: int, head_features: int, num_heads: int) -> nn.Module:
    scale = head_features**-0.5
    num_heads = num_heads
    mid_features = head_features * num_heads
    to_out = nn.Linear(in_features=mid_features, out_features=features, bias=False)

    # Supposed to be functionally equivalent to memory_efficient_attention
    # source: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention:~:text=to%20be%201-,Equivalent%20pytorch%20code,-scale%20%3D%201
    def atten(query, key, value):
        scale = 1 / query.shape[-1] ** 0.5
        query = query * scale
        attn = query @ key.transpose(-2, -1)
        attn = attn.softmax(-1)
        return attn @ value

    def forward(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
        q, k, v = map(lambda t: rearrange(t, "b t c -> b c t").contiguous(), (q, k, v))
        # Attending over the channel dim
        attn = xformers.ops.memory_efficient_attention(q, k, v)  # crashes during the backward pass
        # attn = atten(q, k, v)  # works fine
        attn = rearrange(attn, "b c t -> b t c")
        return to_out(attn)

    return Module([to_out], forward)
```
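One thing worth noting (my reading of the snippet above, not something confirmed by the maintainers): after the `"b t c -> b c t"` rearrange, the last dimension that memory_efficient_attention sees is the sequence length rather than the per-head channel size, so the effective embedding dim can easily exceed 128. A small sketch with made-up sizes:

```python
import torch
from einops import rearrange

b, t, c = 2, 1024, 64                      # made-up batch, time, and channel sizes
q = torch.randn(b, t, c)

# Time-wise attention: last dim is the channel size (64)
print(q.shape)                             # torch.Size([2, 1024, 64])

# Channel-wise attention: last dim becomes the sequence length (1024),
# i.e. K > 128 as far as the memory-efficient kernels are concerned
q_ch = rearrange(q, "b t c -> b c t").contiguous()
print(q_ch.shape)                          # torch.Size([2, 64, 1024])
```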
Can you also provide the inputs that lead to the NaNs?
I'm not sure of the best way to share large raw matrices over the internet. Is there a standard way to do so? In the meantime, I inserted a print statement here (cutlass.py, line 183):

```python
@classmethod
def apply(cls, ctx: Context, inp: Inputs, grad: torch.Tensor) -> Gradients:
    if inp.attn_bias is not None and not isinstance(
        inp.attn_bias, LowerTriangularMask
    ):
        raise NotImplementedError("Unsupported attn_bias type")
    causal = isinstance(inp.attn_bias, LowerTriangularMask)
    dtype = inp.query.dtype
    print("grad: ", grad.shape)
    force_pad_inf = torch.cuda.get_device_capability(inp.query.device) == (7, 5)
    (grad_q, grad_k, grad_v,) = cls.OPERATOR(
        grad.to(dtype),
        inp.query,
        inp.key,
        inp.value,
        ctx.get_padded_lse(32, force_pad_inf=force_pad_inf),
        ctx.out.to(dtype),
        causal=causal,
        scale=inp.scale,
    )
    return Gradients(dq=grad_q, dk=grad_k, dv=grad_v)
```

I also enabled CUDA_LAUNCH_BLOCKING, and this was the result: https://pastebin.com/dXjkDgXe
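(One conventional way to share such tensors, if that's wanted, is to serialize them with torch.save and attach the file; a minimal sketch with an illustrative path, dropped into the patched apply() above:)

```python
import torch

# Dump the offending inputs from inside the patched apply() so they can be
# re-loaded in a standalone repro script; the output path is illustrative.
torch.save(
    {
        "query": inp.query.detach().cpu(),
        "key": inp.key.detach().cpu(),
        "value": inp.value.detach().cpu(),
        "grad": grad.detach().cpu(),
    },
    "/tmp/xformers_repro.pt",
)
```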
Wait, you have
It is possible I messed something up in the channel-wise Linear Attention function above. The idea is to apply attention channel-wise instead of time-wise. I added more print statements for the other args going into
How do I make it gracefully fall back to a compatible implementation for just the backward pass when the shared memory can't handle it? It works fine both ways for most of the attention units I have, and at least for the forward pass on the others. EDIT: I tried forcing all the other implementations (low K, flash) and it didn't work with those either, so I'm now leaning towards this being an issue with my model. I agree a more descriptive error message with possible solutions could be helpful for others in the future.
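For reference, a minimal sketch of a manual shape-based fallback (this is not an official xformers feature; the K <= 128 threshold is taken from the maintainer comments below, and the fallback path is the "equivalent pytorch code" from the xformers docs):

```python
import torch
import xformers.ops


def attention_maybe_memeff(q, k, v, max_k: int = 128):
    """Use the memory-efficient kernel only for shapes whose backward pass is
    assumed to be supported here (last dim <= max_k); otherwise fall back to
    plain PyTorch attention so autograd never hits the unsupported kernel."""
    if q.shape[-1] <= max_k:
        return xformers.ops.memory_efficient_attention(q, k, v)
    # Equivalent PyTorch attention from the xformers docs
    scale = 1 / q.shape[-1] ** 0.5
    attn = (q * scale) @ k.transpose(-2, -1)
    attn = attn.softmax(-1)
    return attn @ v
```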
A few things:
Related issue: #517
It would be great if xFormers could detect cases that are only supported by the vanilla PyTorch implementation and fall back to it, so we keep the speed/memory benefits for the vast majority of attention calls that are within those bounds (and also gain the speedups when we're allocated A100s).
Yes, but it might not happen for a little while unfortunately; it needs some reworking to have a unique code path for half precision, K <= 128, and SM80. In general, these changes apply to you if you have head_dim > 128.
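A quick way to check the two conditions mentioned above on a given machine (a sketch; the thresholds come from this thread and may change across xformers versions):

```python
import torch

# RTX 3090 reports sm_86, A100 reports sm_80; they have different shared-memory budgets.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm_{major}{minor}")

head_dim = 256  # example value; in the channel-wise attention above this is the sequence length
if head_dim > 128:
    print("head_dim > 128: the memory-efficient backward pass is likely unsupported on this GPU")
```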
🐛 Bug
Command
To Reproduce
Steps to reproduce the behavior:
I initialize a UNetModel from Stable Diffusion and simulate an input to get the output of the UNet model.
Then I initialize a fake_label with the same shape as the output.
Finally, I use an MSE loss function to get the loss and call backward(). However, while I can get the output from the UNet network, when I call backward() it raises:
```
Traceback (most recent call last):
  File "/home/anaconda/envs/pyDF/lib/python3.9/site-packages/torch/autograd/function.py", line 399, in wrapper
    outputs = fn(ctx, *args)
  File "/home/anaconda/envs/pyDF/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 111, in backward
    grads = _memory_efficient_attention_backward(
  File "/home/anaconda/envs/pyDF/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 376, in _memory_efficient_attention_backward
    op = _dispatch_bw(inp)
  File "/home/anaconda/envs/pyDF/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 68, in _dispatch_bw
    raise NotImplementedError(f"No operator found for this attention: {inp}")
NotImplementedError: No operator found for this attention: Inputs(query=tensor([[[[ 0.1457,  0.8941, -0.0281,  ..., -0.0386, -0.2712,  0.9171]],
python-BaseException
```
Here is my code: I downloaded the Stable Diffusion project and use ldm.modules.diffusionmodules.openaimodel.
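(The snippet itself is not reproduced in this thread; a rough sketch of the reproduction described above, where the UNetModel arguments are placeholders that would come from the Stable Diffusion config:)

```python
import torch
import torch.nn as nn
from ldm.modules.diffusionmodules.openaimodel import UNetModel

unet = UNetModel(...).cuda()  # placeholder args; the real ones come from the SD yaml config

x = torch.randn(1, 4, 64, 64, device="cuda")     # simulated latent input
t = torch.randint(0, 1000, (1,), device="cuda")  # simulated timesteps
out = unet(x, t)                                 # forward pass succeeds

fake_label = torch.randn_like(out)               # same shape as the output
loss = nn.functional.mse_loss(out, fake_label)
loss.backward()                                  # raises NotImplementedError here
```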
python -m xformers.info