What can a 40gb+ vram gpu train that a 24gb vram gpu can't? #912
I think you are unlikely to get a detailed answer drawn from real experience with 48 GB or 80 GB cards. People with a 3090 or 3090 Ti are already quite rare, and those who really understand their capabilities for training number not even in the dozens, but just a few.

I can give you the example of my move from a 3060 to a 3090. On the 3060, training with fp16 and 8-bit Adam, the maximum batch size I could set was 3 without gradient_checkpointing and 10 with it. On the 3090, I can set a batch size of 20 without gradient_checkpointing and 70 with it.

As for batch size, its efficiency does not scale linearly. The most efficient value is, of course, a batch size equal to the number of images, but the more images you have, the bigger the batch needs to be. I measured roughly that with fp16, 8-bit Adam, and gradient_checkpointing, increasing the batch size by 20 raises memory consumption by about 4 GB. Starting from a batch size of 1, the efficiency gain grows fastest up to about 10; beyond that, the growth rate falls off more and more.

In theory, with that much memory you could train in fp64 without any problems, but technically this is not implemented yet, since Adam optimizers do not support that format. Perhaps, in theory, this amount of memory would also let you work with vid2vid networks, for example the recently shown Gen-1 from Runway. But here again there are many questions: when will it be released? Will it be publicly available? What are the requirements? And so on.
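To make the memory figures above concrete, here is a rough back-of-the-envelope sketch. It assumes the linear relationship I measured (batch size +20 ≈ +4 GB with fp16, 8-bit Adam, and gradient_checkpointing) and a hypothetical fixed baseline of 10 GB for weights, optimizer state, and activations, chosen so the model reproduces the batch-70 figure on a 24 GB 3090. Treat the baseline and the extrapolation to bigger cards as illustrative only.

```python
# Rough VRAM model from the figures above (fp16, 8-bit Adam, gradient_checkpointing):
# each +20 batch costs ~4 GB, i.e. 5 images per GB beyond a fixed baseline.
IMAGES_PER_GB = 20 / 4.0   # ~5 extra images per GB of VRAM (measured slope)
BASELINE_GB = 10.0         # assumed fixed cost (weights + optimizer state); illustrative

def max_batch_size(vram_gb: float) -> int:
    """Largest batch size that fits, under the linear model above."""
    return int((vram_gb - BASELINE_GB) * IMAGES_PER_GB)

for vram in (24, 48, 80):
    print(f"{vram} GB VRAM -> batch size ~ {max_batch_size(vram)}")
```

With these assumptions, a 48 GB card would fit roughly batch 190 and an 80 GB card roughly batch 350, though real scaling also depends on resolution and model size.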
I'm considering buying another GPU in the future: either another RTX 3090 (24 GB) or an RTX A6000 (48 GB). But the A6000 is 5x as expensive at basically the same speed, and an 80 GB GPU is 15x as expensive.
Since it's for my company, I'm willing to invest if a 40+ GB VRAM GPU can deliver a higher-quality model. By higher quality I mean: more detailed images, higher resolution, and pictures that more accurately represent what I'm trying to create. Training speed is not a concern; I only care about quality.
Does anybody have experience that can tell me what I'll be able to do with a 40+ GB VRAM GPU that I can't already do with my 24 GB VRAM GPU?