[RTX/A6000 GPUs] NaNs in backward pass when training with the huggingface diffusers-style trainer and unet. #631
Comments
Hi. In any case, I also opened #632 to make it easier to debug what's happening in those cases.
@danthe3rd You caught me as I was rolling forward to commit:
Also note: I'm building from source using
Edit:
I don't understand what's going on... Can you go back to master, and print the name of the kernel used for the FW / BW?
xformers/xformers/ops/fmha/__init__.py, line 326 in 6cd1b36
xformers/xformers/ops/fmha/__init__.py, line 381 in 6cd1b36
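(For anyone following along: one user-level way to narrow down which kernel misbehaves, rather than editing the dispatch code as suggested above, is to pin the op explicitly. This is only a sketch; the shapes and the assumption that your build exports MemoryEfficientAttentionCutlassOp / MemoryEfficientAttentionFlashAttentionOp are mine.)

```python
import torch
import xformers.ops as xops

# Head dim 80 in bf16, roughly matching the SD unet case discussed in this thread.
q = torch.randn(2, 4096, 8, 80, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

for name, op in [
    ("cutlass", xops.MemoryEfficientAttentionCutlassOp),
    ("flash", xops.MemoryEfficientAttentionFlashAttentionOp),
]:
    try:
        out = xops.memory_efficient_attention(q, k, v, op=op)
        out.sum().backward()
        print(name, "NaNs in grad_q:", torch.isnan(q.grad).any().item())
        q.grad = k.grad = v.grad = None
    except Exception as err:  # e.g. flash rejecting K=80 on this hardware
        print(name, "not usable:", err)
```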
I made the print statements give a bit more info:
So the one causing the NaNs should be the last one, i.e. the op defined at xformers/xformers/ops/fmha/cutlass.py, lines 184 to 193 in 6cd1b36.
Can you dump all the values there when there is a nan and send that as a pickle file, for instance?
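(In case it helps anyone who needs to produce such a dump: below is a rough sketch of one way to capture those tensors from user code, not the maintainers' exact request; the wrapper name and dump path are made up. It hooks the gradients around the attention call and saves everything with torch.save when a NaN shows up in grad_q.)

```python
import torch
import xformers.ops as xops


def attention_with_nan_dump(q, k, v, dump_path="nan_attn_dump.pt"):
    """Wraps memory_efficient_attention and dumps its tensors if grad_q contains NaNs."""
    out = xops.memory_efficient_attention(q, k, v)
    captured = {}

    def _capture_grad_out(grad_out):
        # Fires before the attention backward runs, so we can stash grad_out here.
        captured["grad_out"] = grad_out.detach().cpu()

    def _check_grad_q(grad_q):
        if torch.isnan(grad_q).any():
            torch.save(
                {
                    "q": q.detach().cpu(),
                    "k": k.detach().cpu(),
                    "v": v.detach().cpu(),
                    "out": out.detach().cpu(),
                    "grad_out": captured.get("grad_out"),
                    "grad_q": grad_q.detach().cpu(),
                },
                dump_path,
            )

    if out.requires_grad:
        out.register_hook(_capture_grad_out)
    if q.requires_grad:
        q.register_hook(_check_grad_q)
    return out
```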
Thanks for reporting and trying to solve the issue. Just wanted to report that I also hit the same issue with an A4000, in case this helps prioritization. I tried rolling back to
Also having this issue; the package is the new one installed with:
Running the same dreambooth example from diffusers, the loss goes to NaN around 120 iterations in my case; I can also confirm that removing xformers fixed the problem. The issue might be with their specific implementation of xformers and how it works with the latest version (idk). I wasted nearly a day trying to figure this out; I didn't think the problem was xformers since everything seemed to be working well, just getting black outputs at infer time. Looking forward to a fix!
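(Not from the thread, but for anyone hitting the same silent failure: a cheap guard in the training loop catches the divergence at the step it happens instead of only showing up as black images at inference time. The function name is mine.)

```python
import torch


def assert_finite_loss(loss: torch.Tensor, step: int) -> None:
    # Call right after computing the loss; fail fast if it has gone to NaN/Inf.
    if not torch.isfinite(loss).all():
        raise RuntimeError(
            f"Loss became {loss.item()} at step {step}; "
            "suspect the attention backward (see this issue)."
        )
```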
xformers-0.0.15.dev343 (1b1fd8a) was the last working published revision on my 3090. Unfortunately, pre-compiled wheels are no longer available.
I'll be away next week, but after that I want to start working on making it compatible with Sm86.
I'll admit this is speculative, but for what it's worth... I was having a hell of a time getting the Dreambooth extension working in A1111's stable diffusion UI. I was having some unrelated issues, so I rolled forward to the latest versions of everything (iirc, this included an xformers update). I quickly started to get NaN bombed when training the model. This bug report tipped me off that it might be xformers. I changed only one parameter in the training, switching from xformers to flash_attention as my memory attention option, and the NaN bombs are now gone. I'm still too dumb about how all of this works, but it points my suspicion at xformers (at the very least, there is some incompatibility issue, but that may not be entirely due to xformers). xformers version 0.0.14.dev0 is the one I had installed (which I'm just now realizing... why do I have a dev build?). I'm running an RTX 4090 if that seems relevant, and with bf16 precision. On Windows 11. More details from the log:
RTX 4090 as well, but I have seen the issue across the full spectrum of consumer cards from the reports we have been getting.
@Zuxier thanks for providing a way to get around this issue! May I ask how to build v0.14dev0? I tried checking the tags, releases, PyPI package versions, and searching in the commit messages, and couldn't find 0.0.14dev0 or 0.14dev0. Did you build it using a certain commit id? Thanks a lot!
I found an A10G GPU to work on this. Do we have specific steps to reproduce or inputs I could try it on? (Also interested if you have the input shapes.)
Sorry for not following back up; I only have access to a limited number of GPUs and needed to use them for other jobs.
The reason is that the Stable Diffusion attention head dims are too large for FlashAttention on A6000s (>64; the actual head size is 80, iirc).
I'll try to set something up, but it might be a while (>5 days).
Oh okay. That's weird, because with the current code I would expect a CUDA error rather than silent NaNs... I'll try to push a fix and let you know so you can test it again.
Here is the original diffusers example, and this is part of their documentation.
Here's an example that's failing for me. I've removed the specific thing I'm working on (the densenet) and replaced it with a dummy MSELoss, and it still fails in the same way. One weird thing is that if I skip the densenet201 instantiation, it will return:
I'm working with:
Example:
I've got something working with any value of K for Sm86/Sm89, and it's reasonably fast for K<=96 now. I hope to be able to push it this week or next week.
It's landed as part of 82d5881
Awesome, I'll try it out this morning. We can control the batch size for training. I think most people are using 1-2, but some of the people with 24GB VRAM cards use a batch size of 10 or higher.
If you have 16 heads, you should use a batch size of at least 108/16 ≈ 7 (for the memory_efficient backward pass). Below that, the kernel won't entirely use the GPU (it wasn't optimized for this use-case).
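(My reading of the rule above is that the backward parallelizes across batch × heads, so batch × heads should cover the GPU's SMs; the snippet below just redoes that arithmetic, with the 108 SM figure taken from the comment.)

```python
import math

num_sms = 108      # SM count used in the comment above
num_heads = 16

# Minimum batch size so that batch * heads covers every SM at least once.
min_batch = math.ceil(num_sms / num_heads)
print(min_batch)   # -> 7
```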
First test, it seems to be working 👍
My testing looks good too. Thanks Dan! For anyone else reading this, the wheels have been posted already.
Is there any chance you could add Torch 2 builds to the published wheels? We are building our own atm, but it would be nice to have an official version of it.
Thanks for the feedback :)
Unfortunately, PyTorch's ABI is not stable. This means that an XFormers binary built for one PyTorch version will not be compatible with the next one. A binary build would only be compatible with a single PT nightly, so we would need to build one every day, and this would require some additional work to manage. Of course, once PyTorch publishes 2.0 officially, we will change the binary builds to be for 2.0 instead of 1.13.1.
@danthe3rd Major props here. I spent hours trying to debug this on an A10G yesterday. You came through at the perfect time.
Thanks!
@danthe3rd I'm still running tests with different head shapes and sizes, but so far it's looking good for me! If I don't report back in another day or two, then you can consider my report resolved. Thanks!
Closing as fixed :)
Sorry if this bug report is sub-optimal; I'm going to append additional information. This happens when training a diffusion model using a training script similar to (or identical to) the huggingface diffusers example, with the diffusers u-net.
See https://github.com/harubaru/waifu-diffusion/tree/main/trainer for the trainer.
🐛 Bug
With a randomly-initialized unet, bfloat16 and xformers enabled for training, and pytorch's anomaly detection turned on, I get the following on the very first run of the model:
RuntimeError: Function '_fMHABackward' returned nan values in its 0th output.
If anomaly detection is disabled and the trainer is left running long enough, the loss eventually devolves into NaNs.
Hardware: A6000 (sm_86) GPUs.
To Reproduce
Steps to reproduce the behavior:
1. Enable torch.autograd.set_detect_anomaly(True)
2. Set the data type to bfloat16.
3. Train on a data set of images...
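(A rough sketch of those three steps in code; this is not the reporter's trainer, which is linked above. The pretrained checkpoint, shapes, dummy loss, and the enable_xformers_memory_efficient_attention() call are assumptions on my part; the reporter used a randomly initialized unet.)

```python
import torch
from diffusers import UNet2DConditionModel

torch.autograd.set_detect_anomaly(True)  # step 1: anomaly detection

# Step 2: bfloat16 unet with xformers attention enabled.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).to("cuda", dtype=torch.bfloat16)
unet.enable_xformers_memory_efficient_attention()
unet.train()

# Step 3: a single dummy training step standing in for "train on a data set of images".
latents = torch.randn(4, 4, 64, 64, device="cuda", dtype=torch.bfloat16)
timesteps = torch.randint(0, 1000, (4,), device="cuda")
text_embeds = torch.randn(4, 77, 768, device="cuda", dtype=torch.bfloat16)

noise_pred = unet(latents, timesteps, encoder_hidden_states=text_embeds).sample
loss = torch.nn.functional.mse_loss(noise_pred, torch.randn_like(noise_pred))
loss.backward()  # on affected versions, this is where the anomaly detector trips
```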
Expected behavior
Not getting NaNs in the backward pass when running with bfloat16.
Environment
Installation method (conda, pip, source): pip
NVIDIA-SMI 525.60.11, Driver Version: 525.60.11, CUDA Version: 12.0 (nvcc reports cuda_11.8.r11.8/compiler.31833905_0)
xformers.info output when NaNs occur (latest commit):
Transformers version: 4.22.2
Diffusers version: 0.11.1
Additional Information
With commit c733c99 (Nov 29th) and earlier, the NaNs no longer occur (torch AMP with the anomaly detector no longer trips).
With affe4da (Dec 9th) and later, the NaNs occur.
Note: I do not know if c733c99 is the inflection point for when the NaNs begin to reliably occur; however, it is the most recent version which I know works and have tested.
Edit x1: Stepping forward to commit 1924b196 (Dec 6), NaNs do not occur. Will drill into it more later.