What's the point if we do not gather all outputs in different GPUs to compute contrastive loss #20

BaohaoLiao · 2021-09-02T18:29:57Z

Hi,

this is a really great work. However, I have a general question for the contrastive loss.

In your code, you use 8GPUs for a total batch size of 256. It means 32 samples in one GPU. You compute the contrastive loss of these 32 samples on the same GPU firstly, then gather the loss from different GPUs to compute the final gradient.

However, it makes little sense for me to use this way to increase the batch size. One challenge for the contrastive loss is to find hard negative. Normally we increase the batch size on one single GPU to handle this problem. Since larger batch size offer us more possibility to find hard negatives. But if we use DDP, this kind of larger total batch size is not useful.

For example, I use 16 GPUs for a total batch size of 512. This will result in the same number of samples (32) on one GPU as above. Would it better to gather all of the output embeddings from different GPUs to one GPU to compute the contrastive loss?

In Table 2 of your paper, how do your change the batch size? Increasing the samples on a single GPU and fix the number of GPUs, or increasing the number of GPUs and fix the number of samples on a single GPU? The result is a little weird for me, total batch size of 4096 is the worst.

endernewton · 2021-09-03T15:16:00Z

Is it posted on the wrong repo? For SimSiam, we use l2-loss (cosine similarity), we do not use contrastive loss.

endernewton · 2021-09-03T15:16:59Z

4096 batch size being worse is due to the difficulty of training large batch size, it is observed in other training (e.g. supervised ImageNet) as well.

BaohaoLiao · 2021-09-04T00:03:57Z

Thank you for your response. It seems I mix your method with moco. Sorry for this

endernewton closed this as completed Sep 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss #20

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss #20

BaohaoLiao commented Sep 2, 2021

endernewton commented Sep 3, 2021

endernewton commented Sep 3, 2021

BaohaoLiao commented Sep 4, 2021

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss #20

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss #20

Comments

BaohaoLiao commented Sep 2, 2021

endernewton commented Sep 3, 2021

endernewton commented Sep 3, 2021

BaohaoLiao commented Sep 4, 2021