CUDA error: Invalid argument when training #1146
Comments
Can you try rebooting your device?
I have, yes, and I also reinstalled CUDA to test that. The issue persists. I noticed that setting the precision to FP16 "sometimes" works, but that seems to depend on the model. Even then I still get the same error: training runs for a number of steps, flashes the progress bar, then returns to the same error. It doesn't seem to be running out of memory, as I still have overhead. After stalling, it takes a while before it rolls back to the same error, whereas if the precision is off it errors before it even starts. Automatic1111 itself doesn't have this issue, even when training. I've used DB before with no problems; before torch 2.0, xformers worked. Could this be an issue with SDP? It won't run at all without SDP, though, so I can't test that.
Hmm, I don't have any great answers for this atm. I'm using:
ops:
on native Windows 11, RTX 3080 Ti, and everything is working fine. You could always try xformers; the 0.0.17 wheel is built with torch 2.
xformers won't work at all with DB for me. Automatic1111 rolls it back for some reason, and DB then errors. All my repos are at the latest version. I did install xformers 0.0.17 and tried it with both torch 1 and 2. The code doesn't have many assertions around CUDA, so it's hard for me to trace it myself, even though I am a programmer. Thanks
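On the tracing difficulty: CUDA kernels launch asynchronously, so the Python frame shown in the stack trace is often not the call that actually failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before the process initializes CUDA makes launches synchronous, which usually gives a more accurate trace at the cost of slower training. Below is a minimal sketch, assuming the web UI is started as a subprocess. `build_debug_env` and `run_with_blocking_launches` are hypothetical helper names, and the command you pass in is a placeholder for however you normally launch Automatic1111.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Standard CUDA debugging switch: errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(command):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 set."""
    return subprocess.run(command, env=build_debug_env(), check=False)


if __name__ == "__main__":
    # Pass your usual launch command after this script's name (placeholder).
    if len(sys.argv) > 1:
        run_with_blocking_launches(sys.argv[1:])
```

On Windows, running `set CUDA_LAUNCH_BLOCKING=1` in the same command prompt before launching should have the same effect.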
The stack trace is a broad PyTorch error, so you could try searching their GitHub page, but there's not much to go on. xformers should work fine as long as you're using the right version combination. I'd recommend Torch 2 + xformers 0.0.17 or 0.0.18.
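To confirm which combination the venv actually ends up with (for example after Automatic1111 rolls a package back), a small check like the following could be run inside it. This is only a sketch: `installed_version` and `matches_recommended` are hypothetical helpers, and the accepted versions are simply the ones suggested in this thread, not an official compatibility table.

```python
import re
from importlib import metadata

# Versions suggested in this thread: torch 2.x with xformers 0.0.17 or 0.0.18.
TORCH_PATTERN = re.compile(r"2\.")
XFORMERS_PATTERN = re.compile(r"0\.0\.(17|18)(?!\d)")


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_recommended(torch_version, xformers_version):
    """Return True if both versions match the combination suggested above."""
    if torch_version is None or xformers_version is None:
        return False
    # Suffixes such as "+cu117" or ".dev123" after the matched prefix are ignored.
    return bool(TORCH_PATTERN.match(torch_version)) and bool(
        XFORMERS_PATTERN.match(xformers_version)
    )


if __name__ == "__main__":
    torch_v = installed_version("torch")
    xformers_v = installed_version("xformers")
    print(f"torch: {torch_v}, xformers: {xformers_v}")
    print("matches suggested combination:", matches_recommended(torch_v, xformers_v))
```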
Still getting the same errors with xformers on this version; will try yet another fresh install next chance.
I'd guess your issue is separate from the other thread; that one is a Mac-specific issue. I'd recommend checking the torch GitHub; this doesn't seem like an issue caused by the extension code.
This issue is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Kindly read and fill this form in its entirety.
1. Please find the following lines in the console and paste them below.
2. Describe the bug
I get this error when training on a fresh install of both Automatic1111 and sd_dreambooth_extension:
![image](https://user-images.githubusercontent.com/84904462/228694762-c67875ad-447c-49c1-a007-0725e793d23d.png)
It happens after it's preprocessed/generated the images. It doesn't matter how I create a model or if I select an existing one.
I am using CUDA 11.7 with the correct torch and torchvision versions. My command line uses `--opt-sdp-no-mem-attention`, and I get the same error with `--opt-sdp-attention`.
Automatic1111 itself does not error, so maybe it's the way the extension is using CUDA? I have the latest version of the DB repo.
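For anyone wanting to double-check the torch and CUDA pairing in the same venv, a short script along these lines might help. It is a sketch only: `summarize_torch_env` is a hypothetical helper name, and it reports just what the current interpreter sees (it returns `{"torch": None}` if torch is not importable).

```python
def summarize_torch_env():
    """Return a dict describing the torch/CUDA setup of the current interpreter."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in summarize_torch_env().items():
        print(f"{key}: {value}")
```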
Screenshots/Config
Here is the model's db_config:
db_config.txt
I've tried all kinds of parameters, including the defaults, to no avail.
Running `python -m torch.utils.collect_env` in the venv also reports the correct CUDA version, matching the above.
3. Provide logs
If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred.
4. Environment
What OS? Windows 10
If Windows - WSL or native? Native
What GPU are you using? RTX 3090