
CUDA error: Invalid argument when training #1146

Closed
KC2021 opened this issue Mar 30, 2023 · 8 comments
Labels
new (Just added, you should probably sort this.), Stale

Comments


KC2021 commented Mar 30, 2023

⚠️If you do not follow the template, your issue may be closed without a response ⚠️

Kindly read and fill this form in its entirety.

1. Please find the following lines in the console and paste them below.

#######################################################################################################
Initializing Dreambooth
Dreambooth revision: 3324b6ab7fa661cf7d6b5ef186227dc5e8ad1878
Successfully installed accelerate-0.17.1 fastapi-0.90.1 gitpython-3.1.31 google-auth-oauthlib-0.4.6 requests-2.28.2 starlette-0.23.1 transformers-4.26.1

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.

[!] xformers NOT installed.
[+] torch version 2.0.0+cu117 installed.
[+] torchvision version 0.15.1+cu117 installed.
[+] accelerate version 0.17.1 installed.
[+] diffusers version 0.14.0 installed.
[+] transformers version 4.26.1 installed.
[+] bitsandbytes version 0.35.4 installed.
#######################################################################################################

2. Describe the bug

I get this error when training on a fresh install of both Automatic1111 and sd_dreambooth_extension:
(screenshot of the error attached)

It happens after the images have been preprocessed/generated. It doesn't matter how I create a model or whether I select an existing one.
I am using CUDA 11.7 with the matching torch and torchvision versions. My command line is --opt-sdp-no-mem-attention, and I get the same error with --opt-sdp-attention.
Automatic1111 itself does not error, so maybe it's the way the extension is using CUDA? I have the latest version of the DB repo.

Screenshots/Config
Here is the model's db_config:
db_config.txt
I've tried all kinds of parameters including the default, to no avail.

Running `python -m torch.utils.collect_env` in the venv also reports the correct CUDA version, as above.
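For reference, the same check can be done directly from Python inside the venv. This is a minimal sketch (the helper name is mine, not part of torch or the extension); it only reports what CUDA toolkit the torch wheel was built against, not whether the driver accepts it:

```python
import importlib.util

def torch_cuda_build():
    """Return (torch version, CUDA build version) reported by torch,
    or None if torch is not importable in this environment."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    # torch.version.cuda is the CUDA toolkit the wheel was built with
    # (e.g. "11.7" for 2.0.0+cu117); torch.cuda.is_available() would
    # additionally check that the installed driver can initialize it.
    return torch.__version__, torch.version.cuda

print(torch_cuda_build())
```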

3. Provide logs

If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred.

Initializing bucket counter!
  ***** Running training *****
  Num batches each epoch = 688
  Num Epochs = 125
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Text Encoder Epochs: 0
  Total optimization steps = 72000
  Total training steps = 86000
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: False, Optimizer: 8bit AdamW, Prec: no
  Gradient Checkpointing: True
  EMA: False
  UNET: True
  Freeze CLIP Normalization Layers: False
  LR: 1e-06
  V2: False
Steps:   0%|                                                                                 | 0/86000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\ui_functions.py", line 724, in start_training
    result = main(class_gen_method=class_gen_method)
  File "C:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1429, in main
    return inner_loop()
  File "C:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "C:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1295, in inner_loop
    accelerator.backward(loss)
  File "C:\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 1636, in backward
    loss.backward(**kwargs)
  File "C:\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "C:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 274, in apply
    return user_fn(self, *args)
  File "C:\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "C:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
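As the error message itself notes, CUDA kernel errors are reported asynchronously, so the trace above may point at the wrong call. A minimal way to get an accurate trace, assuming the variable is set before torch initializes CUDA (e.g. at the top of launch.py, or in the shell before starting webui):

```python
import os

# Force synchronous CUDA kernel launches so a kernel failure raises at the
# actual call site instead of at some later, unrelated API call.
# Must be set before the first CUDA operation runs in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Training will run noticeably slower with this set, so it is only worth enabling while reproducing the crash.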

4. Environment

What OS? Windows 10

If Windows - WSL or native? Native

What GPU are you using? RTX 3090

KC2021 added the 'new' label Mar 30, 2023
ArrowM (Collaborator) commented Mar 30, 2023

Can you try rebooting your device?

KC2021 (Author) commented Mar 30, 2023

Yes, I have; I also reinstalled CUDA to rule that out. The issue persists. I've noticed that setting FP16 "sometimes" works, but that seems to depend on the particular model. Even then I still hit the same error: training runs for some number of steps, the progress bar flashes, and then it falls back to the same error. It doesn't seem to be running out of memory, as I still have headroom. After stalling it takes a while before rolling back to that same error, whereas if the precision is wrong it errors before training even starts.

Automatic1111 itself doesn't have this issue, even when training. I used DB without problems before torch 2.0, when xformers worked. Could this be an issue with SDP? It won't run at all without SDP, though, so I can't test that.

ArrowM (Collaborator) commented Mar 30, 2023

Hmm, I don't have any great answers for this atm. I'm using:

Dreambooth revision: 7380dcc8b37a19416d0244cf1c6c8fb9fd9ba139
[!] xformers NOT installed.
[+] torch version 2.1.0.dev20230329+cu118 installed.
[+] torchvision version 0.16.0.dev20230329+cu118 installed.
[+] accelerate version 0.18.0 installed.
[+] diffusers version 0.14.0 installed.
[+] transformers version 4.26.1 installed.
[+] bitsandbytes version 0.35.4 installed.

opts:

COMMANDLINE_ARGS=--opt-sdp-attention --no-half-vae

on native Windows 11, RTX 3080ti and everything is working fine.

You could always try xformers; the 0.0.17 wheel is built with torch 2.

KC2021 (Author) commented Mar 31, 2023

xformers won't work with DB at all for me. Automatic1111 rolls it back for some reason, and DB then errors. All my repos are up to date. I did install xformers 0.0.17 and tried it with both torch 1 and torch 2.
That's why I'm using SDP; it's faster on my GPU anyway. I've fought with xformers for days, which is especially unfun over remote access. Others have had these issues too, so at least SDP is simpler.

The code doesn't have many assertions around the CUDA calls, so it's hard for me to trace the failure myself, even though I am a programmer.
Perhaps there is a way for the code to give more detail on odd GPU errors like these? The precision issue lacks assertions too, so I presume some expectation is failing whenever CUDA receives the calls.
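To illustrate the kind of assertion meant here, something like the following hypothetical helper (not existing extension code) could be dropped in after each major CUDA-facing step, so an asynchronous failure surfaces with a readable label instead of a misleading trace:

```python
import importlib.util

def cuda_checkpoint(label):
    """Synchronize the GPU so any pending asynchronous CUDA error surfaces
    here, re-raised with a human-readable label for the failing stage.
    No-op when torch or a CUDA device is unavailable."""
    if importlib.util.find_spec("torch") is None:
        return
    import torch
    if not torch.cuda.is_available():
        return
    try:
        torch.cuda.synchronize()
    except RuntimeError as err:
        raise RuntimeError(
            f"CUDA failure surfaced at checkpoint '{label}': {err}"
        ) from err

# Usage sketch: cuda_checkpoint("after accelerator.backward")
```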

Thanks

ArrowM (Collaborator) commented Mar 31, 2023

The stacktrace is a broad pytorch error, so you could try searching their GitHub page, but there's not much to go on. Xformers should work fine as long as you're using the right version combination; I'd recommend torch 2 + xformers 0.0.17 or 0.0.18.

KC2021 (Author) commented Apr 3, 2023

Still getting the same errors with xformers on this version; I'll try yet another fresh install next chance I get.
I see from the mentioning thread, #1160 (comment), that others hit the same error midway, but I haven't seen the triton error, etc.

ArrowM (Collaborator) commented Apr 3, 2023

I'd guess your issue is separate from the other thread; that one is a Mac-specific issue. I'd recommend checking the torch GitHub; this doesn't seem like an issue caused by the extension code.

github-actions bot commented

This issue is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions bot added the Stale label Apr 18, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 23, 2023