lowering gpu requirement hyperparameter setting #16

Closed
StolasIn opened this issue Jul 1, 2022 · 8 comments
Comments

StolasIn commented Jul 1, 2022

First of all, thank you very much for this great work; it has really helped me a lot. I have a few questions I would like to ask.

I'm a beginner in the text-to-image synthesis task, so my questions may be a little naive.

  1. Regarding the training environment: I only have one GPU (an RTX 3090) available right now. I think this will not influence gamma (batch size = 16 on one GPU) or the other hyper-parameters such as itc and itd (on the COCO dataset with the ground-truth setting). Am I correct? If not, is there any tuning advice you can give me in this situation?

(screenshot attached)
I think gamma is only sensitive to the dataset and the batch size per GPU.

  2. In the paper, are you using the ada='noaug' setting because the training data is sufficient (after x-flip)?

@drboog
Copy link
Owner

drboog commented Jul 1, 2022

Hi, my suggestion is to start with gamma=10 and try different itc and itd. The reasons are: 1) I only roughly tuned the hyper-parameters, so the provided values are not optimal; if you are looking for better performance, you may want to tune them yourself. 2) Although each GPU has 16 samples, the total mini-batch size differs when you use a different number of GPUs, which I think will influence the hyper-parameter settings; still, I think gamma=10 will lead to promising results.
I used ada='noaug' because: 1) for a fair comparison, I wanted to compare with previous methods, which did not use augmentation, under the same setting, to show our effectiveness; 2) I ran some experiments on different datasets (although the hyper-parameters may not have been well tuned), and ADA augmentation is not guaranteed to improve performance on all datasets, so I did not use it in the final experiments for simplicity.
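
As a concrete starting point, here is a minimal single-GPU grid-search sketch over itd and itc with gamma fixed at 10, following the suggestion above. The flag names (--gamma, --itd, --itc) and the dataset/output paths are assumptions based on this thread, so check `python train.py --help` in the Lafite repo before running anything:

```python
import itertools
import subprocess

# Hypothetical sweep over itd/itc with gamma fixed at 10 on one GPU.
# Flag names and paths are assumptions; verify them against train.py.
for itd, itc in itertools.product([5, 10, 20], [5, 10, 20]):
    subprocess.run([
        "python", "train.py",
        "--outdir", f"runs/itd{itd}_itc{itc}",
        "--data", "datasets/coco_train.zip",   # placeholder dataset path
        "--gpus", "1",
        "--gamma", "10",
        "--itd", str(itd),
        "--itc", str(itc),
    ], check=True)
```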

StolasIn commented Jul 1, 2022

Thanks for your advice; it's a great help to me.

I still have some questions about hyper-parameter tuning and batch size.

  1. In the paper, you select itd and itc from 0 to 50. My question is: should those hyper-parameters still fall in this range, perhaps close to the original setting (e.g. itd = 5, itc = 10)?

  2. Recent models that use contrastive learning usually set their batch size quite large (>256). Does a large batch size result in better performance in Lafite, or does a smaller batch size have some benefits?

drboog commented Jul 1, 2022

Yes, I think you can tune the hyper-parameters by searching in this range. A larger batch size should lead to a performance improvement, because it provides better discriminative information for training. But in that case you may also need to tune the contrastive-loss hyper-parameters (--temp=0.5 and --lam=0. were tuned with batch size = 16 per GPU).
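
For intuition on why the batch size and --temp interact, here is a generic symmetric InfoNCE-style contrastive loss in PyTorch. This is only a sketch under the assumption that Lafite's contrastive loss has this general form; it is not the repository's exact implementation, and the `temp=0.5` default simply mirrors the --temp value mentioned above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feat, txt_feat, temp=0.5):
    # img_feat, txt_feat: (B, D) paired image/text features.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temp          # (B, B) similarity matrix
    labels = torch.arange(img_feat.size(0), device=img_feat.device)
    # Every off-diagonal entry acts as a negative, so a larger batch
    # gives each positive pair more (and typically harder) negatives.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Because the number of negatives per sample grows with the batch size, a temperature and loss weight tuned for 16 samples per GPU may no longer be the best choice at a much larger batch.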

StolasIn commented Jul 2, 2022

I appreciate you answering my questions. I will close this issue and run the experiments mentioned above.

StolasIn closed this as completed on Jul 2, 2022
Cwj1212 commented Sep 12, 2022

@StolasIn Have you reproduced the results with one GPU yet? I have reproduced the results of the paper under the four-GPU setting (batch=32, batch_gpu=8 even gets better results than the paper). But when I try to experiment with one GPU, I only get poor results.
@drboog, thank you so much for your work, about which I have to bother you yet again. As for why the hyper-parameters need to be re-tuned for one GPU, my observation is that with the {gather: false} setting the contrastive loss in the code is computed separately on each GPU. I don't know what else could cause the difference between one GPU and four GPUs.
What confuses me is that I modified the contrastive-loss calculation under the one-GPU setting to simulate the {gather: false} behaviour (splitting a batch of samples into four parts and computing their contrastive losses separately), but I still only get poor results.
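
To make the described splitting concrete, here is a minimal PyTorch sketch comparing one loss over the whole batch with a per-chunk loss that mimics {gather: false} on a single GPU. The function names are hypothetical and the loss is a generic InfoNCE form (same as the sketch earlier in this thread) rather than Lafite's exact code; the main effect it illustrates is that each sample then only sees B/n_chunks - 1 negatives instead of B - 1:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temp=0.5):
    # Same generic symmetric InfoNCE as in the earlier sketch.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    y = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y))

def per_gpu_style_loss(img_feat, txt_feat, n_chunks=4, temp=0.5):
    # gather=False analogue: split the batch into n_chunks parts (one per
    # simulated GPU) and average the per-chunk losses, so each sample is
    # contrasted against far fewer negatives than in the global loss.
    losses = [info_nce(i, t, temp)
              for i, t in zip(img_feat.chunk(n_chunks), txt_feat.chunk(n_chunks))]
    return torch.stack(losses).mean()
```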

drboog commented Sep 12, 2022

The performance is related to many things: batch size, learning rate, regularizer, and so on. For example, even for StyleGAN2 without the contrastive loss (plain image generation rather than text-to-image generation), the number of GPUs still matters a lot:
https://github.com/NVlabs/stylegan2-ada/blob/main/docs/stylegan2-ada-training-curves.png

drboog commented Sep 12, 2022

Assume that under the 4-GPU setting each GPU has N samples, resulting in a total batch size of 4N. Are you using a batch size of 4N when using one GPU?

Cwj1212 commented Sep 12, 2022

@drboog Yes, I did, but the performance with one card is still significantly worse than with four cards. Thank you very much for providing that picture. I originally thought the difference was caused by cfg=auto choosing different hyper-parameters for different numbers of GPUs.
But as a beginner, I still can't understand why the forward and backward passes of the network are not equivalent in this case. Are they equivalent in theory?
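
On the cfg=auto point: below is a rough reconstruction, from memory, of the 'auto' heuristics in stylegan2-ada-pytorch's train.py (which Lafite builds on). Treat the exact constants as assumptions rather than the actual source, but they show why the total batch size, and therefore gamma and the EMA length, change with the number of GPUs:

```python
def auto_config(gpus: int, res: int) -> dict:
    # Sketch of the cfg='auto' heuristics (reconstructed from memory;
    # check stylegan2-ada-pytorch's train.py for the exact values).
    mb = max(min(gpus * min(4096 // res, 32), 64), gpus)  # total batch size
    return dict(
        batch=mb,
        mbstd=min(mb // gpus, 4),               # minibatch-std group size
        fmaps=1 if res >= 512 else 0.5,         # feature-map multiplier
        lrate=0.002 if res >= 1024 else 0.0025,
        gamma=0.0002 * (res ** 2) / mb,         # R1 weight scales as 1/batch
        ema_kimg=mb * 10 / 32,                  # generator EMA length
    )

# For example, auto_config(4, 256) and auto_config(1, 256) pick total batch
# sizes of 64 and 16 respectively, which changes gamma by a factor of 4.
```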
