Does not run on 10GB GPUs and below #13

Closed
Raibeat opened this issue Nov 7, 2022 · 26 comments
Labels: OOM Issues (It's DEEPSPEED, Baby!!)

Comments

Raibeat commented Nov 7, 2022

I've enabled all the suggested flags to reduce VRAM (8-bit Adam, fp16, gradient checkpointing, don't cache latents), but the out-of-memory error remains. I have 10GB of VRAM. Is it possible to run with 10GB?

AmmBe commented Nov 7, 2022

I also failed at 12 GB

Starting Dreambooth training...
 VRAM cleared.
 Allocated: 0.0GB
 Reserved: 0.0GB

 Loaded model.
 Allocated: 0.0GB
 Reserved: 0.0GB

The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
 Scheduler Loaded
 Allocated: 0.2GB
 Reserved: 0.2GB

  Total target lifetime optimization steps = 1000
 CPU: False Adam: False, Prec: fp16, Prior: False, Grad: True, TextTr: True
 Allocated: 3.8GB
 Reserved: 3.9GB

Steps:   0%|                                                                                  | 0/1000 [00:00<?, ?it/s]Error completing request
Arguments: ('House', 'D:\\Generate\\Dreambooth\\House\\IMG', '', 'house', '', '', '', 1.0, 7.5, 20.0, 0, 512, False, True, 1, 1, 1, 1000, 1, True, 5e-06, False, 'constant', 0, False, 0.9, 0.999, 0.01, 1e-08, 1, 100, 500, 'fp16', True, '', False) {}

LaikaSa commented Nov 7, 2022

I am using this on Windows right now and have the same problem with my RTX 3090 (24GB VRAM)...
Specifically, I got this error:

Traceback (most recent call last):
  File "E:\Stable\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "E:\Stable\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "E:\Stable\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "E:\Stable\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 766, in main
    accelerator.backward(loss)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 5.06 GiB (GPU 0; 24.00 GiB total capacity; 12.03 GiB already allocated; 234.32 MiB free; 21.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
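For reference, a minimal sketch of the allocator tweak that error message suggests; it has to land in the environment before torch initializes CUDA, and the 128MB value here is an illustrative assumption, not a recommendation from this thread:

    # Sketch: set the allocator option the OOM message mentions. This must be
    # in the environment before torch's first CUDA allocation (e.g. at the top
    # of launch.py, or exported before starting the webui).
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # the CUDA caching allocator reads the variable on first use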

d8ahazard (Owner) commented

If you guys are having issues with OOM, please post the settings output from the console before training starts. There are a lot of config options.

LaikaSa commented Nov 7, 2022

This is my output from the console. 22.5GB of VRAM filled immediately, but I still got the OOM error above and no actual training took place; it just froze. I was trying to train 2 concepts with a JSON file, by the way.

Starting Dreambooth training...
 VRAM cleared.
 Allocated: 0.0GB
 Reserved: 0.0GB

Trying to parse: E:\Stable\stable-diffusion-webui\Datasets\Jackie+hekapoo.json
Unable to load concepts as JSON, trying as file.
 Loaded model.
 Allocated: 0.0GB
 Reserved: 0.0GB

The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
 Scheduler Loaded
 Allocated: 0.4GB
 Reserved: 0.4GB

  Total target lifetime optimization steps = 1100
 CPU: False Adam: True, Prec: fp16, Prior: True, Grad: True, TextTr: False
 Allocated: 3.6GB
 Reserved: 3.7GB

Steps:   0%|                                                                                  | 0/1100 [00:00<?, ?it/s]Error completing request
Arguments: ('JackiehekapooNAI1100', '', '', '*', '*', '', '', 1.0, 7.5, 50.0, 1500, 768, False, False, 1, 1, 1, 1100, 1, True, 5e-05, False, 'constant', 0, True, 0.9, 0.999, 0.01, 1e-08, 1, 200, 200, 'fp16', True, 'E:\\Stable\\stable-diffusion-webui\\Datasets\\Jackie+hekapoo.json', False) {}

d8ahazard (Owner) commented

> This is my output from the console. 22.5GB of VRAM filled immediately... [LaikaSa's comment and console output quoted above]

I suspect the multiple concepts are the issue. More concepts == more stuff in VRAM.

You could try unchecking "train text encoder" to save some VRAM, enabling 8-bit Adam, and/or setting precision to fp16.

You can also refer to the README at https://github.com/d8ahazard/sd_dreambooth_extension#readme for more tips on optimizing memory.

Last, an option I haven't explored much yet is running 'accelerate config' as described here: https://github.com/bmaltais/kohya_ss

And then modifying the "webui.bat" of stable-diffusion-webui so that it launches with accelerate, like so:

[screenshot: webui.bat modified to launch via accelerate; not preserved]
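Since the screenshot was not preserved, a hypothetical sketch of that kind of webui.bat change; the exact lines are an assumption, not the contents of the original image:

    rem webui.bat (sketch): launch the webui through accelerate instead of
    rem calling Python directly. Assumes accelerate is installed in the venv
    rem and "accelerate config" has been run once.
    call venv\Scripts\activate.bat
    accelerate launch launch.py %*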

AmmBe commented Nov 7, 2022

When I uncheck "train text encoder" I get this error during training:

Arguments: ('House', 'D:\\Generate\\Dreambooth\\House\\IMG', '', 'house', '', '', '', 1.0, 7.5, 20.0, 0, 512, False, False, 1, 1, 1, 1000, 1, True, 5e-06, False, 'constant', 0, False, 0.9, 0.999, 0.01, 1e-08, 1, 25, 500, 'fp16', True, '', False) {}
Traceback (most recent call last):
  File "D:\Programs\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "D:\Programs\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "D:\Programs\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "D:\Programs\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 745, in main
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 722, in forward
    return self.text_model(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 643, in forward
    encoder_outputs = self.encoder(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 574, in forward
    layer_outputs = encoder_layer(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 317, in forward
    hidden_states, attn_weights = self.self_attn(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 257, in forward
    attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float

When I check "Use 8bit Adam", I get this instead:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary D:\Programs\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
Exception importing 8bit adam: [WinError 193] %1 is not a Win32 application
The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.

rabidcopy commented

On the subject of VRAM, I was only able to get training to start by unchecking "train text encoder", using 8-bit Adam, and touching nothing else. 16GB VRAM on an A4000. Anything involving the text encoder always ran me OOM. And trying to set mixed precision to fp16 in combination with "train text encoder" leads to this:

Traceback (most recent call last):
  File "/notebooks/SDW/modules/ui.py", line 180, in f
    res = list(func(*args, **kwargs))
  File "/notebooks/SDW/webui.py", line 55, in f
    res = func(*args, **kwargs)
  File "/notebooks/SDW/extensions/sd_dreambooth_extension/dreambooth/dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "/notebooks/SDW/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 745, in main
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 722, in forward
    return self.text_model(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 643, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 574, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 317, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 257, in forward
    attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float
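For reference, a minimal sketch of the autocast workaround suggested later in this thread (see sgsdxzy's and HiPhiSch's comments below) for this Half-vs-Float mismatch; the model name and setup here are illustrative assumptions, not the extension's actual code:

    # Sketch: wrap the text-encoder forward pass in autocast so fp32 tensors
    # meeting fp16 weights no longer raise "expected scalar type Half but
    # found Float" inside CLIP's attention. Assumes a CUDA GPU and the
    # transformers package; the checkpoint name is illustrative.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").half().cuda()

    input_ids = tokenizer("a photo of a house", return_tensors="pt").input_ids.cuda()

    # autocast picks a safe dtype per op instead of requiring manual casts
    with torch.autocast("cuda", dtype=torch.float16):
        encoder_hidden_states = text_encoder(input_ids)[0]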

rabidcopy commented Nov 8, 2022

I did want to add that when I successfully trained a style for 1000 steps at 4e-6, it appears to have worked well. Also (this belongs in its own issue, but) the custom preview prompts don't appear to have worked: all the images in logging look like they were generated from just "sks style". The same preview prompt used on the final ckpt produces the anticipated results.

chrisparkernz commented Nov 8, 2022

My 12GB card on Windows 11 always fails the first time I attempt training with a CUDA out-of-memory error; however, on the second attempt with the settings below it will work. Is it something to do with the allocated/reserved memory amounts? They appear to be different every time I run the training.

Settings that sometimes work are:

Training steps 500
Don't cache latents = true
Train Text encoder = false
Use 8bit Adam = true
Gradient Checkpointing = true
Mixed Precision = no (it has never worked when switched to FP16 in my testing)

Everything else at defaults. I hadn't run accelerate config when this worked, and running it doesn't seem to have impacted the failures.

*Edit - I also do not use class images, to keep under the memory limit.
*Edit 2 - The number of training steps does not seem to matter.

chrisparkernz commented

It looks like using class images is tipping me into a CUDA out-of-memory error with my 12GB card.

sgsdxzy (Collaborator) commented Nov 8, 2022

I got training on a 3080 Ti 12GB working and wrote up my experience here: AUTOMATIC1111/stable-diffusion-webui#4436

GForceWeb commented

Haven't had any luck training on my RTX 3080 10GB as yet. Following @sgsdxzy's guide I set the following:

  • Classification images disabled
  • 500 training steps
  • Don't cache latents disabled
  • Use 8bit Adam enabled
  • fp16 mixed precision

In the WebUI settings I also checked both "Move VAE and CLIP to RAM when training if possible. Saves VRAM." and "Move face restoration model from VRAM into RAM after processing".

I'm getting to about 9.2GB allocated when I run out of memory:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.00 GiB total capacity; 9.16 GiB already allocated; 0 bytes free; 9.27 GiB reserved in total by PyTorch)

Not sure what else I can close/disable in Windows to try and eke out a bit more VRAM for SD to use.

sgsdxzy (Collaborator) commented Nov 8, 2022

> Haven't had any luck training on my RTX 3080 10GB as yet... [GForceWeb's comment quoted above]

I need around 11.8GB of 12GB when training, so it probably cannot work on 10GB yet. I am trying to get xformers working, which is reported to reduce VRAM usage by ~1GB.

sgsdxzy (Collaborator) commented Nov 8, 2022

> On the subject of VRAM, I was only able to get training to start by unchecking "train text encoder"... [rabidcopy's comment and traceback quoted above]

I have the same issue when training with mixed precision = fp16 and "train text encoder" unchecked. It works for me with mixed precision = fp16 and "train text encoder" checked.

sgsdxzy (Collaborator) commented Nov 8, 2022

And in the original repo, training with 8GB only works with DeepSpeed, which offloads part of the VRAM to RAM and requires around 25GB of RAM.

GForceWeb commented

> And in the original repo, training with 8GB only works with DeepSpeed, which offloads part of the VRAM to RAM and requires around 25GB of RAM.

Is it possible to use this with this extension? I've got 128GB of RAM, so I've got plenty spare if there's a way to offload from VRAM.

sgsdxzy (Collaborator) commented Nov 8, 2022

> Is it possible to use this with this extension? [...] [quoted above]

It seems DeepSpeed is not implemented in this extension yet. You can check out the original repo: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth#training-on-a-8-gb-gpu

zark119 commented Nov 8, 2022

On a 3060 12GB it stops after 1 step due to CUDA out of memory:

CUDA SETUP: Loading binary C:\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Scheduler Loaded
Allocated: 1.7GB
Reserved: 1.7GB

Total target lifetime optimization steps = 1000
CPU: False Adam: True, Prec: no, Prior: False, Grad: True, TextTr: False
Allocated: 4.9GB
Reserved: 4.9GB

Steps: 0%| | 1/1000 [00:12<3:20:53, 12.07s/it, loss=0.247, lr=5e-6]Error completing request

HiPhiSch commented Nov 9, 2022

Ok, it is possible to get it running with a 3060 Ti 8GB on Windows (more or less). Sadly I cannot recall everything I did, but this is roughly what finally led to success. Don't expect this to be an easy tutorial, but it might help if you are willing to tinker.

  1. Install Windows 11 22H2 (no, Windows 10 does not work with DeepSpeed); you also need at least 32GB of RAM
  2. Install WSL2 and a Linux subsystem (I used Ubuntu 20.04 LTS); configure WSL2 to get as much RAM as possible
  3. Install the CUDA 11.6.2 toolkit for Windows (might not be necessary?)
  4. In Ubuntu, install Python 3.10 (you need to add another repository to apt)
  5. Get pip for Python with something like python3.10 -m ensurepip
  6. Get the CUDA 11.6 toolkit by following NVIDIA's instructions, being careful not to kill the WSL toolkit
  7. Install git, get the stable diffusion webui, set up Python 3.10 as the interpreter, and install this extension
  8. I updated torch (not sure if it is required?); update all packages which do not work with the new version (pip install --upgrade ...)
  9. Activate the stable diffusion venv (. venv/bin/activate) while inside the webui folder
  10. Install DeepSpeed (via pip) <-- this is the reason you need WSL/Linux
  11. Configure accelerate via accelerate config; enable ZeRO optimization at stage 2, fp16, and CPU offloading where possible
  12. Install xformers (needed manual compilation in my case) - it does not work out of the box and required some digging to get working
  13. Launch the webui via accelerate launch launch.py --xformers
  14. Modify DeepSpeed's stage_1_and_2.py: in _create_param_mapping(self), add if lp in self.param_names: before lp_name = self.param_names[lp] and indent the next two lines (see the sketch after this list). This is a workaround for the fact that not all model parameters are trained/mentioned in the initialization, and DeepSpeed cannot work like this.
  15. Start the Dreambooth training: no text-encoder training, disable "do not cache gradients", mixed precision: fp16
  16. Open train_dreambooth.py of this extension and change the parameters of from_pretrained(...) to not include subfolder= but use the os.path.join(...) form (sketched after the "Current state" notes below)
  17. Wait for crashes because of incompatible parameters (CPU <> GPU, Half vs Float)
  18. Force autocasting by adding a with torch.autocast("cuda"): around that part of train_dreambooth.py - this probably breaks CPU training; repeat from step 17
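A sketch of the step-14 workaround; the surrounding DeepSpeed code is paraphrased from memory and may differ between DeepSpeed versions:

    # deepspeed/runtime/zero/stage_1_and_2.py, inside _create_param_mapping(self).
    # The added guard skips parameters DeepSpeed never registered instead of
    # raising a KeyError. Surrounding lines are paraphrased, not exact.
    for lp in self.bit16_groups[i]:
        if lp._hp_mapping is not None:
            if lp in self.param_names:  # <-- added guard (step 14)
                lp_name = self.param_names[lp]
                param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()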

Current state:

  • Image generation still crashes. Set it to more steps than you want to train; there is also a bug where it crashes immediately if you put 0 as the parameter (division by zero).
  • It still crashes after the first checkpoint is generated; I have to look into it.
  • However, it trains at roughly 5 s per iteration and saves a single checkpoint afterwards.
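A sketch of the step-16 change; the variable and class names are illustrative assumptions rather than the extension's exact code:

    # Step 16 sketch: load a submodule by joining the path manually instead of
    # passing subfolder=. pretrained_model_path is a placeholder.
    import os

    from transformers import CLIPTextModel

    pretrained_model_path = "models/dreambooth/my_model/working"  # placeholder path

    # before:
    #   text_encoder = CLIPTextModel.from_pretrained(pretrained_model_path, subfolder="text_encoder")
    # after:
    text_encoder = CLIPTextModel.from_pretrained(os.path.join(pretrained_model_path, "text_encoder"))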

d8ahazard (Owner) commented

> Ok, it is possible to get it running with a 3060 Ti 8GB on Windows (more or less)... [HiPhiSch's write-up quoted above]

Not to muddy the waters, but one observation here:

You can now actually use DeepSpeed on Windows without WSL, although I'm not sure how successful it will be. I had it working, but didn't have 8-bit Adam going. I now have native 8-bit Adam support working on Windows.

microsoft/DeepSpeed#2428

I also created a PR for the main repo to allow adding a "set ACCELERATE="True"" flag to the webui-user.bat script, which should allow proper running of "accelerate launch", which in turn can summon up deepspeed, etc.

You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via venv), you might be able to use deepspeed on windows.

AUTOMATIC1111/stable-diffusion-webui#4527

d8ahazard (Owner) commented

Oh, also, I forgot to mention the --xformers flag. Adding that to webui-user.bat, along with accelerate (when merged), might help.

[screenshot: webui-user.bat with the --xformers flag added; not preserved]
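A hypothetical sketch of those webui-user.bat additions (the screenshot was not preserved, and the ACCELERATE flag only takes effect once the linked PR is merged):

    rem webui-user.bat (sketch): flags described above; exact file contents assumed.
    set COMMANDLINE_ARGS=--xformers
    set ACCELERATE="True"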

d8ahazard added the OOM Issues label Nov 9, 2022

HiPhiSch commented Nov 9, 2022

> You can now actually use DeepSpeed on Windows without WSL, although I'm not sure how successful it will be. I had it working, but didn't have 8-bit Adam going. I now have native 8-bit Adam support working on Windows.

This is what I tried first; it did not work then. But the relevant merge had not yet made it into the pip release (0.7.4). If this truly works out, it would be great. As of now I am more than happy to have a mostly working WSL version for training.

CypherQube commented Nov 10, 2022

> You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via venv), you might be able to use deepspeed on windows.

Could you please elaborate? How do you run "accelerate config"? Sorry, I'm a noob here.

HiPhiSch commented

Accelerate is a Python module which allows you, e.g., to split deep learning across multiple GPUs or to use DeepSpeed instead of the standard configuration. It is automatically installed by the Dreambooth extension, but if you want to use it fully it needs to be configured. The aforementioned branch of the webui injects a call to accelerate for your task, which then uses the configured features.

Configuration can be done from a console window. First activate the venv by calling venv\Scripts\activate.bat from the stable diffusion directory within the console, then call accelerate config and answer the questions (see the sketch below).
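Roughly, assuming a default install layout:

    rem From a console window in the stable-diffusion-webui directory:
    call venv\Scripts\activate.bat
    accelerate config
    rem ...then answer the interactive questions (e.g. DeepSpeed, ZeRO stage, fp16, offloading).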

However, if this is to save VRAM by offloading to the CPU and system RAM, you will need DeepSpeed installed in the same environment. Currently the Windows version only supports using the model, not training it - and that is if you manage to compile it at all. So sadly it won't work that way.

Setting everything up correctly in WSL2 (Windows Subsystem for Linux) works, but it is anything but straightforward as of now.

d8ahazard (Owner) commented

I'm going to close this and direct everyone to https://github.com/d8ahazard/sd_dreambooth_extension/discussions/77, the discussion on optimization for <=12GB GPUs.

kou201 commented Nov 10, 2022

> This is my output from the console. 22.5GB of VRAM filled immediately... [LaikaSa's comment and d8ahazard's reply quoted above]

[screenshot: kou201's console output; not preserved]

Is this normal?
