lowering gpu requirement hyperparameter setting #16

Closed
StolasIn opened this issue Jul 1, 2022 · 8 comments
Comments

StolasIn commented Jul 1, 2022

First of all, thank you very much for this great work; it has really helped me a lot. I have a few questions I would like to ask.

I'm a beginner in the text-to-image synthesis task, so my questions may be a little naive.

  1. Regarding the training environment: I only have one GPU (an RTX 3090) available right now. I think this will not influence gamma (batch size = 16 on one GPU) or the other hyper-parameters such as itc and itd (on the COCO dataset with the ground-truth setting). Am I correct? If not, is there any tuning advice you can give me in this situation?

(screenshot attached)
I think gamma is only sensitive to the dataset and the batch size per GPU.

  2. In the paper, are you using the ada='noaug' setting because the training data is sufficient (after x-flip)?

@drboog
Copy link
Owner

drboog commented Jul 1, 2022

Hi, my suggestion is to start with gamma=10 and try different itc and itd. The reasons are: 1) I only roughly tuned the hyper-parameters, so the provided values are not optimal; if you are looking for better performance, you may want to tune them yourself. 2) Although each GPU has 16 samples, the total mini-batch size differs when you use a different number of GPUs, which I think will influence the hyper-parameter settings; still, I think gamma=10 will lead to promising results.
I used ada='noaug' because: 1) for a fair comparison, I wanted to compare with previous methods, which did not use augmentation, under the same setting, to show our effectiveness; 2) I ran some experiments on different datasets (although the hyper-parameters may not have been well tuned), and ADA augmentation is not guaranteed to improve performance on all datasets, so I did not use it in the final experiments for simplicity.
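
As a concrete starting point, here is a minimal single-GPU grid-search sketch over itd and itc with gamma fixed at 10, following the suggestion above. The flag names (--gamma, --itd, --itc) and the dataset/output paths are assumptions based on this thread, so check `python train.py --help` in the Lafite repo before running anything:

```python
import itertools
import subprocess

# Hypothetical sweep over itd/itc with gamma fixed at 10 on one GPU.
# Flag names and paths are assumptions; verify them against train.py.
for itd, itc in itertools.product([5, 10, 20], [5, 10, 20]):
    subprocess.run([
        "python", "train.py",
        "--outdir", f"runs/itd{itd}_itc{itc}",
        "--data", "datasets/coco_train.zip",   # placeholder dataset path
        "--gpus", "1",
        "--gamma", "10",
        "--itd", str(itd),
        "--itc", str(itc),
    ], check=True)
```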

StolasIn commented Jul 1, 2022

Thanks for your advice; it's a great help to me.

I still have some questions about hyper-parameter tuning and batch size.

  1. In the paper, you select itd and itc from 0 to 50. My question is: should those hyper-parameters still fall in this range, perhaps close to the original setting (e.g. itd = 5, itc = 10)?

  2. Recent models that use contrastive learning usually set their batch size quite large (>256). Does a large batch size result in better performance in Lafite, or does a smaller batch size have some benefits?

drboog commented Jul 1, 2022

Yes, I think you can tune the hyper-parameters by searching in this range. A larger batch size should lead to a performance improvement, because it provides better discriminative information for training. But in that case you may also need to tune the contrastive-loss hyper-parameters (--temp=0.5 and --lam=0. were tuned with batch size = 16 per GPU).
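
For intuition on why the batch size and --temp interact, here is a generic symmetric InfoNCE-style contrastive loss in PyTorch. This is only a sketch under the assumption that Lafite's contrastive loss has this general form; it is not the repository's exact implementation, and the `temp=0.5` default simply mirrors the --temp value mentioned above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feat, txt_feat, temp=0.5):
    # img_feat, txt_feat: (B, D) paired image/text features.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temp          # (B, B) similarity matrix
    labels = torch.arange(img_feat.size(0), device=img_feat.device)
    # Every off-diagonal entry acts as a negative, so a larger batch
    # gives each positive pair more (and typically harder) negatives.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Because the number of negatives per sample grows with the batch size, a temperature and loss weight tuned for 16 samples per GPU may no longer be the best choice at a much larger batch.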

StolasIn commented Jul 2, 2022

I appreciate you answering my questions. I will close this issue and run the experiments mentioned above.

StolasIn closed this as completed on Jul 2, 2022
Cwj1212 commented Sep 12, 2022

@StolasIn Have you reproduced the results with one GPU yet? I have reproduced the results of the paper under the four-GPU setting (batch=32, batch_gpu=8 even gets better results than the paper). But when I try to experiment with one GPU, I only get poor results.
@drboog, thank you so much for your work, about which I have to bother you yet again. As for why the hyper-parameters need to be re-tuned for one GPU, my observation is that with the {gather: false} setting the contrastive loss in the code is computed separately on each GPU. I don't know what else could cause the difference between one GPU and four GPUs.
What confuses me is that I modified the contrastive-loss calculation under the one-GPU setting to simulate the {gather: false} behaviour (splitting a batch of samples into four parts and computing their contrastive losses separately), but I still only get poor results.
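
To make the described splitting concrete, here is a minimal PyTorch sketch comparing one loss over the whole batch with a per-chunk loss that mimics {gather: false} on a single GPU. The function names are hypothetical and the loss is a generic InfoNCE form (same as the sketch earlier in this thread) rather than Lafite's exact code; the main effect it illustrates is that each sample then only sees B/n_chunks - 1 negatives instead of B - 1:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temp=0.5):
    # Same generic symmetric InfoNCE as in the earlier sketch.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    y = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y))

def per_gpu_style_loss(img_feat, txt_feat, n_chunks=4, temp=0.5):
    # gather=False analogue: split the batch into n_chunks parts (one per
    # simulated GPU) and average the per-chunk losses, so each sample is
    # contrasted against far fewer negatives than in the global loss.
    losses = [info_nce(i, t, temp)
              for i, t in zip(img_feat.chunk(n_chunks), txt_feat.chunk(n_chunks))]
    return torch.stack(losses).mean()
```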

drboog commented Sep 12, 2022

The performance is related to many things: batch size, learning rate, regularizer, and so on. For example, even for StyleGAN2 without the contrastive loss (plain image generation rather than text-to-image generation), the number of GPUs still matters a lot:
https://github.com/NVlabs/stylegan2-ada/blob/main/docs/stylegan2-ada-training-curves.png

drboog commented Sep 12, 2022

Assume that under the 4-GPU setting each GPU has N samples, resulting in a total batch size of 4N. Are you using a batch size of 4N when using one GPU?

Cwj1212 commented Sep 12, 2022

@drboog Yes, I did, but the performance with one card is still significantly worse than with four cards. Thank you very much for providing that picture. I originally thought the difference was caused by cfg=auto choosing different hyper-parameters for different numbers of GPUs.
But as a beginner, I still can't understand why the forward and backward passes of the network are not equivalent in this case. Are they equivalent in theory?
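
On the cfg=auto point: below is a rough reconstruction, from memory, of the 'auto' heuristics in stylegan2-ada-pytorch's train.py (which Lafite builds on). Treat the exact constants as assumptions rather than the actual source, but they show why the total batch size, and therefore gamma and the EMA length, change with the number of GPUs:

```python
def auto_config(gpus: int, res: int) -> dict:
    # Sketch of the cfg='auto' heuristics (reconstructed from memory;
    # check stylegan2-ada-pytorch's train.py for the exact values).
    mb = max(min(gpus * min(4096 // res, 32), 64), gpus)  # total batch size
    return dict(
        batch=mb,
        mbstd=min(mb // gpus, 4),               # minibatch-std group size
        fmaps=1 if res >= 512 else 0.5,         # feature-map multiplier
        lrate=0.002 if res >= 1024 else 0.0025,
        gamma=0.0002 * (res ** 2) / mb,         # R1 weight scales as 1/batch
        ema_kimg=mb * 10 / 32,                  # generator EMA length
    )

# For example, auto_config(4, 256) and auto_config(1, 256) pick total batch
# sizes of 64 and 16 respectively, which changes gamma by a factor of 4.
```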
