Does not run on 10GB GPUs and below #13
I also failed at 12 GB
I am using this on Windows right now and have the same problem with my RTX 3090 with 24 GB of VRAM...
If you guys are having issues with OOM, please post the settings output from the console before training starts. There are a lot of config options.
This is my output from the console. 22.5 GB of VRAM filled immediately, but I still got the OOM error above and no actual training was in progress; it just froze. I was trying to train two concepts with a JSON file, by the way.
I suspect the multiple concepts are the issue: more concepts means more data in VRAM. You could try unchecking "train text encoder" to save some VRAM, enabling 8-bit Adam, and/or setting precision to fp16. You can also refer to the https://github.com/d8ahazard/sd_dreambooth_extension#readme for more tips on optimizing memory. Lastly, an option I haven't explored much yet is running 'accelerate config' as described here: https://github.com/bmaltais/kohya_ss and then modifying the "webui.bat" of stable-diffusion-webui so that it launches with accelerate.
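As a rough illustration of why 8-bit Adam and fp16 help, here is a back-of-envelope estimate. The ~860M parameter count for the SD v1 UNet is an approximation on my part, and real usage adds activations, gradients, and CUDA overhead on top, so treat the numbers as indicative only:

```python
# Back-of-envelope VRAM estimate (illustrative assumptions, not measurements)
GB = 1024 ** 3
params = 860_000_000  # approximate SD v1 UNet parameter count (assumption)

weights_fp32 = params * 4 / GB   # 4 bytes per fp32 weight  -> ~3.2 GB
weights_fp16 = params * 2 / GB   # 2 bytes per fp16 weight  -> ~1.6 GB
adam_fp32 = params * 2 * 4 / GB  # Adam keeps two fp32 states per param -> ~6.4 GB
adam_8bit = params * 2 * 1 / GB  # 8-bit Adam quantizes those states    -> ~1.6 GB

print(f"weights: fp32 {weights_fp32:.2f} GB, fp16 {weights_fp16:.2f} GB")
print(f"Adam states: fp32 {adam_fp32:.2f} GB, 8-bit {adam_8bit:.2f} GB")
```

Swapping fp32 Adam for 8-bit Adam alone would save on the order of 5 GB under these assumptions, which is why it is usually the first thing to try.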
When I uncheck "train text encoder" I get this error when training.
When I check "Use 8bit Adam", then this.
On the subject of VRAM, I was only able to get training to start by unchecking train text encoder, using 8-bit Adam, and touching nothing else. 16 GB VRAM on an A4000. Anything involving the text encoder always ran me out of memory, and trying to set mixed precision to fp16 in combination with train text encoder leads to this.
I did want to add: when I successfully trained a style for 1000 steps at 4e-6, it appears to have worked well. Also, this belongs in its own issue, but the custom preview prompts don't appear to have worked. All the images in logging look like they were generated from "sks style" alone, while the same preview prompt used on the final ckpt produces the anticipated results.
My 12 GB card on Windows 11 always fails the first time I attempt training with a CUDA out of memory error, but on the second attempt with the settings below it works. Is it something to do with the allocated/reserved memory amounts? They appear to be different every time I run the training. Settings that sometimes work: training steps 500, everything else at defaults. I hadn't run "accelerate config" when this worked, and running it doesn't seem to have affected the failures. *Edit: I also do not use class images, to keep under the memory limit.
It looks like using the class images is tipping me into a CUDA out of memory error with my 12 GB card.
I got training on a 3080 Ti 12 GB working, and wrote up my experience here: AUTOMATIC1111/stable-diffusion-webui#4436
Haven't had any luck training on my RTX 3080 10 GB as yet. Following @sgsdxzy's guide I set the following:
In the WebUI settings I also checked both "Move VAE and CLIP to RAM when training if possible. Saves VRAM." and "Move face restoration model from VRAM into RAM after processing". I'm getting to about 9.2 GB allocated when I run out of memory.
Not sure what else I can close/disable in Windows to try and edge out a bit more VRAM for SD to use.
I need around 11.8 GB of 12 GB when training, so it probably cannot work on 10 GB yet. I am trying to get xformers working, which is reported to reduce VRAM usage by ~1 GB.
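For reference, AUTOMATIC1111's webui normally enables xformers via the `--xformers` launch flag in webui-user.bat. This is a sketch of that file; whether the flag helps here depends on your webui version and on having a working xformers build for your CUDA/Python combination:

```bat
rem webui-user.bat (excerpt; --xformers enables memory-efficient attention
rem if a compatible xformers build is installed in the venv)
set COMMANDLINE_ARGS=--xformers
call webui.bat
```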
I have the same issue when training with mixed precision = fp16 and NOT training the text encoder. It works for me with mixed precision = fp16 and train text encoder enabled.
And in the original repo, training with 8 GB only works with DeepSpeed, which offloads part of the VRAM contents to RAM and requires around 25 GB of RAM.
Is it possible to use that with this extension? I've got 128 GB of RAM, so I've got plenty to spare if there's a way to offload from VRAM.
It seems DeepSpeed is not implemented in this extension yet. You can check out the original repo: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth#training-on-a-8-gb-gpu
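For the curious, the DeepSpeed setup in that repo is driven by an accelerate configuration roughly like the following (a sketch of the YAML that `accelerate config` writes; key names are from accelerate's DeepSpeed integration and the exact values should be taken from the linked readme, not from here):

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (sketch)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu   # optimizer states go to system RAM
  offload_param_device: cpu       # parameters offloaded to system RAM
  zero_stage: 2
mixed_precision: fp16
```

The CPU offload settings are what trade the ~25 GB of system RAM for lower VRAM usage.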
On a 3060 12 GB it stops after 1 step due to CUDA out of memory:
CUDA SETUP: Loading binary C:\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Total target lifetime optimization steps = 1000
Steps: 0%| | 1/1000 [00:12<3:20:53, 12.07s/it, loss=0.247, lr=5e-6]
Error completing request
OK, it is possible to get it running with a 3060 Ti 8 GB on Windows (more or less). Sadly I cannot recall everything I did, but this is roughly what finally led to success. Don't expect this to be an easy tutorial, but it might help if you are willing to tinker.
Current state:
Not to muddy the waters, but one observation here: you can now actually use DeepSpeed on Windows without WSL, although I'm not sure how successful it will be. I had it working, but didn't have 8-bit Adam going; I now have native 8-bit Adam support working on Windows. I also created a PR for the main repo to allow adding a "set ACCELERATE="True"" flag to the webui-user.bat script, which should allow properly running "accelerate launch", which in turn can summon up DeepSpeed, etc. You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via the venv), you might be able to use DeepSpeed on Windows.
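Based on that description, the webui-user.bat change might look like the following. This is a sketch of the flag described in the comment above; the exact handling depends on the PR actually being merged into your webui version:

```bat
rem webui-user.bat (sketch, per the PR described above)
rem ACCELERATE="True" makes the launcher use "accelerate launch"
set ACCELERATE="True"
set COMMANDLINE_ARGS=
call webui.bat
```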
This is what I tried first. It did not work then, but the relevant merge was not yet in the pip release (0.7.4). If this truly works out, it would be great. As of now I am more than happy to have a mostly working WSL setup for training.
"You would still need to run "accelerate config" once to store settings, else the launch throws an error, but after configuring (via the venv), you might be able to use DeepSpeed on Windows." Could you please elaborate? How do you run "accelerate config"? Sorry, I'm a noob here.
Accelerate is a Python module that allows you, e.g., to split deep learning across multiple GPUs or to use DeepSpeed instead of the standard configuration. It is automatically installed by the Dreambooth extension, but if you want to fully use it, it needs to be configured. The aforementioned branch of the webui injects a call to accelerate for your task, which then uses the configured features. Configuration can be done from a console window after activating the venv.
However, if this is supposed to save VRAM by offloading to the CPU and system RAM, you will need to have DeepSpeed installed in the same environment. Currently, the Windows version of DeepSpeed only supports using a model, not training it. That is, if you manage to compile it at all. So sadly it won't work that way. Setting everything up correctly in WSL2 (Windows Subsystem for Linux) works, but is anything but straightforward as of now.
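Concretely, the configuration step could look like this from the stable-diffusion-webui folder on Windows. The venv path is the webui default and an assumption on my part; adjust it if your install differs:

```bat
rem activate the webui's virtual environment, then run the interactive
rem accelerate configuration wizard (answers are stored for later launches)
venv\Scripts\activate
accelerate config
```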
I'm going to close this and direct it to the discussion at https://github.com/d8ahazard/sd_dreambooth_extension/discussions/77 on optimization for <=12 GB GPUs.
Is this normal? |
I've enabled all the suggested flags to reduce VRAM (8-bit Adam, fp16, Gradient Checkpointing, Don't Cache Latents), but the out of memory error remains. I have 10 GB of VRAM. Is it possible to run with 10 GB?