Does not run on 10GB GPUs and below #13

Closed
Raibeat opened this issue Nov 7, 2022 · 26 comments
Labels: OOM Issues (It's DEEPSPEED, Baby!!)

Comments

Raibeat commented Nov 7, 2022

I've enabled all the suggested flags to reduce VRAM (8-bit Adam, fp16, gradient checkpointing, don't cache latents), but the out-of-memory error remains. I have 10GB of VRAM. Is it possible to run with 10GB?

AmmBe commented Nov 7, 2022

I also failed at 12 GB

Starting Dreambooth training...
 VRAM cleared.
 Allocated: 0.0GB
 Reserved: 0.0GB

 Loaded model.
 Allocated: 0.0GB
 Reserved: 0.0GB

The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
 Scheduler Loaded
 Allocated: 0.2GB
 Reserved: 0.2GB

  Total target lifetime optimization steps = 1000
 CPU: False Adam: False, Prec: fp16, Prior: False, Grad: True, TextTr: True
 Allocated: 3.8GB
 Reserved: 3.9GB

Steps:   0%|                                                                                  | 0/1000 [00:00<?, ?it/s]Error completing request
Arguments: ('House', 'D:\\Generate\\Dreambooth\\House\\IMG', '', 'house', '', '', '', 1.0, 7.5, 20.0, 0, 512, False, True, 1, 1, 1, 1000, 1, True, 5e-06, False, 'constant', 0, False, 0.9, 0.999, 0.01, 1e-08, 1, 100, 500, 'fp16', True, '', False) {}

LaikaSa commented Nov 7, 2022

I am using this on Windows right now and have the same problem with my RTX 3090 (24GB VRAM)...
Specifically, I got this error:

Traceback (most recent call last):
  File "E:\Stable\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "E:\Stable\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "E:\Stable\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "E:\Stable\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 766, in main
    accelerator.backward(loss)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "E:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 5.06 GiB (GPU 0; 24.00 GiB total capacity; 12.03 GiB already allocated; 234.32 MiB free; 21.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
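For reference, a minimal sketch of the allocator tweak that error message suggests; it has to land in the environment before torch initializes CUDA, and the 128MB value here is an illustrative assumption, not a recommendation from this thread:

    # Sketch: set the allocator option the OOM message mentions. This must be
    # in the environment before torch's first CUDA allocation (e.g. at the top
    # of launch.py, or exported before starting the webui).
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # the CUDA caching allocator reads the variable on first use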

d8ahazard (Owner) commented

If you guys are having issues with OOM, please post the settings output from the console before training starts. There are a lot of config options.

LaikaSa commented Nov 7, 2022

This is my output from the console. 22.5GB of VRAM filled immediately, but I still got the OOM error above and no actual training took place; it just froze. I was trying to train 2 concepts with a JSON file, by the way.

Starting Dreambooth training...
 VRAM cleared.
 Allocated: 0.0GB
 Reserved: 0.0GB

Trying to parse: E:\Stable\stable-diffusion-webui\Datasets\Jackie+hekapoo.json
Unable to load concepts as JSON, trying as file.
 Loaded model.
 Allocated: 0.0GB
 Reserved: 0.0GB

The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
 Scheduler Loaded
 Allocated: 0.4GB
 Reserved: 0.4GB

  Total target lifetime optimization steps = 1100
 CPU: False Adam: True, Prec: fp16, Prior: True, Grad: True, TextTr: False
 Allocated: 3.6GB
 Reserved: 3.7GB

Steps:   0%|                                                                                  | 0/1100 [00:00<?, ?it/s]Error completing request
Arguments: ('JackiehekapooNAI1100', '', '', '*', '*', '', '', 1.0, 7.5, 50.0, 1500, 768, False, False, 1, 1, 1, 1100, 1, True, 5e-05, False, 'constant', 0, True, 0.9, 0.999, 0.01, 1e-08, 1, 200, 200, 'fp16', True, 'E:\\Stable\\stable-diffusion-webui\\Datasets\\Jackie+hekapoo.json', False) {}

d8ahazard (Owner) commented

> This is my output from the console. 22.5GB of VRAM filled immediately... [LaikaSa's comment and console output quoted above]

I suspect the multiple concepts are the issue. More concepts == more stuff in VRAM.

You could try unchecking "train text encoder" to save some VRAM, enabling 8-bit Adam, and/or setting precision to fp16.

You can also refer to the README at https://github.com/d8ahazard/sd_dreambooth_extension#readme for more tips on optimizing memory.

Last, an option I haven't explored much yet is running 'accelerate config' as described here: https://github.com/bmaltais/kohya_ss

And then modifying the "webui.bat" of stable-diffusion-webui so that it launches with accelerate, like so:

[screenshot: webui.bat modified to launch via accelerate; not preserved]
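Since the screenshot was not preserved, a hypothetical sketch of that kind of webui.bat change; the exact lines are an assumption, not the contents of the original image:

    rem webui.bat (sketch): launch the webui through accelerate instead of
    rem calling Python directly. Assumes accelerate is installed in the venv
    rem and "accelerate config" has been run once.
    call venv\Scripts\activate.bat
    accelerate launch launch.py %*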

AmmBe commented Nov 7, 2022

When I uncheck "train text encoder" I get this error during training:

Arguments: ('House', 'D:\\Generate\\Dreambooth\\House\\IMG', '', 'house', '', '', '', 1.0, 7.5, 20.0, 0, 512, False, False, 1, 1, 1, 1000, 1, True, 5e-06, False, 'constant', 0, False, 0.9, 0.999, 0.01, 1e-08, 1, 25, 500, 'fp16', True, '', False) {}
Traceback (most recent call last):
  File "D:\Programs\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "D:\Programs\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "D:\Programs\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "D:\Programs\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 745, in main
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 722, in forward
    return self.text_model(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 643, in forward
    encoder_outputs = self.encoder(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 574, in forward
    layer_outputs = encoder_layer(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 317, in forward
    hidden_states, attn_weights = self.self_attn(
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Programs\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 257, in forward
    attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float

When I check "Use 8bit Adam", I get this instead:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary D:\Programs\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
Exception importing 8bit adam: [WinError 193] %1 is not a Win32 application
The config attributes {'set_alpha_to_one': False, 'skip_prk_steps': True, 'steps_offset': 1} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.

rabidcopy commented

On the subject of VRAM, I was only able to get training to start by unchecking "train text encoder", using 8-bit Adam, and touching nothing else. 16GB VRAM on an A4000. Anything involving the text encoder always ran me OOM. And trying to set mixed precision to fp16 in combination with "train text encoder" leads to this:

Traceback (most recent call last):
  File "/notebooks/SDW/modules/ui.py", line 180, in f
    res = list(func(*args, **kwargs))
  File "/notebooks/SDW/webui.py", line 55, in f
    res = func(*args, **kwargs)
  File "/notebooks/SDW/extensions/sd_dreambooth_extension/dreambooth/dreambooth.py", line 256, in start_training
    trained_steps = main(config)
  File "/notebooks/SDW/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 745, in main
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 722, in forward
    return self.text_model(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 643, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 574, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 317, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/clip/modeling_clip.py", line 257, in forward
    attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float
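For reference, a minimal sketch of the autocast workaround suggested later in this thread (see sgsdxzy's and HiPhiSch's comments below) for this Half-vs-Float mismatch; the model name and setup here are illustrative assumptions, not the extension's actual code:

    # Sketch: wrap the text-encoder forward pass in autocast so fp32 tensors
    # meeting fp16 weights no longer raise "expected scalar type Half but
    # found Float" inside CLIP's attention. Assumes a CUDA GPU and the
    # transformers package; the checkpoint name is illustrative.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").half().cuda()

    input_ids = tokenizer("a photo of a house", return_tensors="pt").input_ids.cuda()

    # autocast picks a safe dtype per op instead of requiring manual casts
    with torch.autocast("cuda", dtype=torch.float16):
        encoder_hidden_states = text_encoder(input_ids)[0]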

rabidcopy commented Nov 8, 2022

I did want to add that when I successfully trained a style for 1000 steps at 4e-6, it appears to have worked well. Also (this belongs in its own issue, but) the custom preview prompts don't appear to have worked: all the images in logging look like they were generated from just "sks style". The same preview prompt used on the final ckpt produces the anticipated results.

chrisparkernz commented Nov 8, 2022

My 12GB card on Windows 11 always fails the first time I attempt training with a CUDA out-of-memory error; however, on the second attempt with the settings below it will work. Is it something to do with the allocated/reserved memory amounts? They appear to be different every time I run the training.

Settings that sometimes work are:

Training steps 500
Don't cache latents = true
Train Text encoder = false
Use 8bit Adam = true
Gradient Checkpointing = true
Mixed Precision = no (it has never worked when switched to FP16 in my testing)

Everything else at defaults. I hadn't run accelerate config when this worked, and running it doesn't seem to have impacted the failures.

*Edit - I also do not use class images, to keep under the memory limit.
*Edit 2 - The number of training steps does not seem to matter.

chrisparkernz commented

It looks like using class images is tipping me into a CUDA out-of-memory error with my 12GB card.

sgsdxzy (Collaborator) commented Nov 8, 2022

I got training on a 3080 Ti 12GB working and wrote up my experience here: AUTOMATIC1111/stable-diffusion-webui#4436

GForceWeb commented

Haven't had any luck training on my RTX 3080 10GB as yet. Following @sgsdxzy's guide I set the following:

  • Classification images disabled
  • 500 training steps
  • Don't cache latents disabled
  • Use 8bit Adam enabled
  • fp16 mixed precision

In the WebUI settings I also checked both "Move VAE and CLIP to RAM when training if possible. Saves VRAM." and "Move face restoration model from VRAM into RAM after processing".

I'm getting to about 9.2GB allocated when I run out of memory:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.00 GiB total capacity; 9.16 GiB already allocated; 0 bytes free; 9.27 GiB reserved in total by PyTorch)

Not sure what else I can close/disable in Windows to try and eke out a bit more VRAM for SD to use.

sgsdxzy (Collaborator) commented Nov 8, 2022

> Haven't had any luck training on my RTX 3080 10GB as yet... [GForceWeb's comment quoted above]

I need around 11.8GB of 12GB when training, so it probably cannot work on 10GB yet. I am trying to get xformers working, which is reported to reduce VRAM usage by ~1GB.

sgsdxzy (Collaborator) commented Nov 8, 2022

> On the subject of VRAM, I was only able to get training to start by unchecking "train text encoder"... [rabidcopy's comment and traceback quoted above]

I have the same issue when training with mixed precision = fp16 and "train text encoder" unchecked. It works for me with mixed precision = fp16 and "train text encoder" checked.

sgsdxzy (Collaborator) commented Nov 8, 2022

And in the original repo, training with 8GB only works with DeepSpeed, which offloads part of the VRAM to RAM and requires around 25GB of RAM.

GForceWeb commented

> And in the original repo, training with 8GB only works with DeepSpeed, which offloads part of the VRAM to RAM and requires around 25GB of RAM.

Is it possible to use this with this extension? I've got 128GB of RAM, so I've got plenty spare if there's a way to offload from VRAM.

sgsdxzy (Collaborator) commented Nov 8, 2022

> Is it possible to use this with this extension? [...] [quoted above]

It seems DeepSpeed is not implemented in this extension yet. You can check out the original repo: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth#training-on-a-8-gb-gpu

zark119 commented Nov 8, 2022

On a 3060 12GB it stops after 1 step due to CUDA out of memory:

CUDA SETUP: Loading binary C:\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Scheduler Loaded
Allocated: 1.7GB
Reserved: 1.7GB

Total target lifetime optimization steps = 1000
CPU: False Adam: True, Prec: no, Prior: False, Grad: True, TextTr: False
Allocated: 4.9GB
Reserved: 4.9GB

Steps: 0%| | 1/1000 [00:12<3:20:53, 12.07s/it, loss=0.247, lr=5e-6]Error completing request

HiPhiSch commented Nov 9, 2022

Ok, it is possible to get it running with a 3060 Ti 8GB on Windows (more or less). Sadly I cannot recall everything I did, but this is roughly what finally led to success. Don't expect this to be an easy tutorial, but it might help if you are willing to tinker.

  1. Install Windows 11 22H2 (no, Windows 10 does not work with DeepSpeed); you also need at least 32GB of RAM
  2. Install WSL2 and a Linux subsystem (I used Ubuntu 20.04 LTS); configure WSL2 to get as much RAM as possible
  3. Install the CUDA 11.6.2 toolkit for Windows (might not be necessary?)
  4. In Ubuntu, install Python 3.10 (you need to add another repository to apt)
  5. Get pip for Python with something like python3.10 -m ensurepip
  6. Get the CUDA 11.6 toolkit by following NVIDIA's instructions, being careful not to kill the WSL toolkit
  7. Install git, get the stable diffusion webui, set up Python 3.10 as the interpreter, and install this extension
  8. I updated torch (not sure if it is required?); update all packages which do not work with the new version (pip install --upgrade ...)
  9. Activate the stable diffusion venv (. venv/bin/activate) while inside the webui folder
  10. Install DeepSpeed (via pip) <-- this is the reason you need WSL/Linux
  11. Configure accelerate via accelerate config; enable ZeRO optimization at stage 2, fp16, and CPU offloading where possible
  12. Install xformers (needed manual compilation in my case) - it does not work out of the box and required some digging to get working
  13. Launch the webui via accelerate launch launch.py --xformers
  14. Modify DeepSpeed's stage_1_and_2.py: in _create_param_mapping(self), add if lp in self.param_names: before lp_name = self.param_names[lp] and indent the next two lines (see the sketch after this list). This is a workaround for the fact that not all model parameters are trained/mentioned in the initialization, and DeepSpeed cannot work like this.
  15. Start the Dreambooth training: no text-encoder training, disable "do not cache gradients", mixed precision: fp16
  16. Open train_dreambooth.py of this extension and change the parameters of from_pretrained(...) to not include subfolder= but use the os.path.join(...) form (sketched after the "Current state" notes below)
  17. Wait for crashes because of incompatible parameters (CPU <> GPU, Half vs Float)
  18. Force autocasting by adding a with torch.autocast("cuda"): around that part of train_dreambooth.py - this probably breaks CPU training; repeat from step 17
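A sketch of the step-14 workaround; the surrounding DeepSpeed code is paraphrased from memory and may differ between DeepSpeed versions:

    # deepspeed/runtime/zero/stage_1_and_2.py, inside _create_param_mapping(self).
    # The added guard skips parameters DeepSpeed never registered instead of
    # raising a KeyError. Surrounding lines are paraphrased, not exact.
    for lp in self.bit16_groups[i]:
        if lp._hp_mapping is not None:
            if lp in self.param_names:  # <-- added guard (step 14)
                lp_name = self.param_names[lp]
                param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()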

Current state:

  • Image generation still crashes. Set it to more steps than you want to train; there is also a bug where it crashes immediately if you put 0 as the parameter (division by zero).
  • It still crashes after the first checkpoint is generated; I have to look into it.
  • However, it trains at roughly 5 s per iteration and saves a single checkpoint afterwards.
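A sketch of the step-16 change; the variable and class names are illustrative assumptions rather than the extension's exact code:

    # Step 16 sketch: load a submodule by joining the path manually instead of
    # passing subfolder=. pretrained_model_path is a placeholder.
    import os

    from transformers import CLIPTextModel

    pretrained_model_path = "models/dreambooth/my_model/working"  # placeholder path

    # before:
    #   text_encoder = CLIPTextModel.from_pretrained(pretrained_model_path, subfolder="text_encoder")
    # after:
    text_encoder = CLIPTextModel.from_pretrained(os.path.join(pretrained_model_path, "text_encoder"))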

d8ahazard (Owner) commented

> Ok, it is possible to get it running with a 3060 Ti 8GB on Windows (more or less)... [HiPhiSch's write-up quoted above]

Not to muddy the waters, but one observation here:

You can now actually use DeepSpeed on Windows without WSL, although I'm not sure how successful it will be. I had it working, but didn't have 8-bit Adam going. I now have native 8-bit Adam support working on Windows.

microsoft/DeepSpeed#2428

I also created a PR for the main repo to allow adding a "set ACCELERATE="True"" flag to the webui-user.bat script, which should allow proper running of "accelerate launch", which in turn can summon up deepspeed, etc.

You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via venv), you might be able to use deepspeed on windows.

AUTOMATIC1111/stable-diffusion-webui#4527

d8ahazard (Owner) commented

Oh, also, I forgot to mention the --xformers flag. Adding that to webui-user.bat, along with accelerate (when merged), might help.

[screenshot: webui-user.bat with the --xformers flag added; not preserved]
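A hypothetical sketch of those webui-user.bat additions (the screenshot was not preserved, and the ACCELERATE flag only takes effect once the linked PR is merged):

    rem webui-user.bat (sketch): flags described above; exact file contents assumed.
    set COMMANDLINE_ARGS=--xformers
    set ACCELERATE="True"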

d8ahazard added the OOM Issues label Nov 9, 2022

HiPhiSch commented Nov 9, 2022

> You can now actually use DeepSpeed on Windows without WSL, although I'm not sure how successful it will be. I had it working, but didn't have 8-bit Adam going. I now have native 8-bit Adam support working on Windows.

This is what I tried first; it did not work then. But the relevant merge had not yet made it into the pip release (0.7.4). If this truly works out, it would be great. As of now I am more than happy to have a mostly working WSL version for training.

CypherQube commented Nov 10, 2022

> You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via venv), you might be able to use deepspeed on windows.

Could you please elaborate? How do you run "accelerate config"? Sorry, I'm a noob here.

HiPhiSch commented

Accelerate is a Python module which allows you, e.g., to split deep learning across multiple GPUs or to use DeepSpeed instead of the standard configuration. It is automatically installed by the Dreambooth extension, but if you want to use it fully it needs to be configured. The aforementioned branch of the webui injects a call to accelerate for your task, which then uses the configured features.

Configuration can be done from a console window. First activate the venv by calling venv\Scripts\activate.bat from the stable diffusion directory within the console, then call accelerate config and answer the questions (see the sketch below).
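Roughly, assuming a default install layout:

    rem From a console window in the stable-diffusion-webui directory:
    call venv\Scripts\activate.bat
    accelerate config
    rem ...then answer the interactive questions (e.g. DeepSpeed, ZeRO stage, fp16, offloading).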

However, if this is to save VRAM by offloading to the CPU and system RAM, you will need DeepSpeed installed in the same environment. Currently the Windows version only supports using the model, not training it - and that is if you manage to compile it at all. So sadly it won't work that way.

Setting everything up correctly in WSL2 (Windows Subsystem for Linux) works, but it is anything but straightforward as of now.

d8ahazard (Owner) commented

I'm going to close this and direct everyone to https://github.com/d8ahazard/sd_dreambooth_extension/discussions/77, the discussion on optimization for <=12GB GPUs.

kou201 commented Nov 10, 2022

> This is my output from the console. 22.5GB of VRAM filled immediately... [LaikaSa's comment and d8ahazard's reply quoted above]

[screenshot: kou201's console output; not preserved]

Is this normal?
