
loss stuck in multi-gpu #7

Closed
Zhuysheng opened this issue Dec 19, 2020 · 3 comments


@Zhuysheng

When training the SSL model in a multi-GPU setting, the loss gets stuck at around 15, but on a single GPU the loss decreases normally. My environment is PyTorch 1.5, CUDA 10.2, GeForce RTX 2080 Ti.

@BestJuly
Owner

Hi, @Zhuysheng. Thank you for your interest.

In your case, are the multi-GPU settings, such as batch size and learning rate, exactly the same as in single-GPU mode?

I also tried the multi-GPU setting in one of my experimental environments: PyTorch 1.4, CUDA 10.1, V100. The logs are shown below.

Logs when using 2 GPUs

Using 2 GPUs
Train: [1/240][590/596] 	loss 15.278 (17.058)	1_p -0.006 (-0.004)	2_p 0.000 (0.000))
Train: [2/240][590/596] 	loss 15.249 (15.257)	1_p 0.007 (0.002))	2_p -0.000 (0.001)
Train: [3/240][590/596] 	loss 15.245 (15.254)	1_p 0.011 (0.002))	2_p 0.002 (0.002))
Train: [4/240][590/596] 	loss 15.242 (15.253)	1_p 0.021 (0.003))	2_p -0.002 (0.005)
Train: [5/240][590/596] 	loss 15.199 (15.248)	1_p -0.013 (0.004)	2_p 0.088 (0.021))
Train: [6/240][590/596] 	loss 15.241 (15.239)	1_p 0.013 (0.021))	2_p 0.027 (0.047))
Train: [7/240][590/596] 	loss 15.220 (15.222)	1_p 0.058 (0.069)	2_p 0.025 (0.063))
Train: [8/240][590/596] 	loss 15.232 (15.204)	1_p 0.064 (0.109)	2_p 0.029 (0.096)
Train: [9/240][590/596] 	loss 15.214 (15.171)	1_p 0.144 (0.172)	2_p 0.044 (0.136)
Train: [10/240][590/596]	loss 15.246 (15.132)	1_p 0.162 (0.251)	2_p 0.107 (0.201)
Train: [11/240][590/596]	loss 15.073 (15.075)	1_p 0.485 (0.372)	2_p 0.366 (0.288)
Train: [12/240][590/596]	loss 4.763 (14.979)	1_p 1.307 (0.618)	2_p 0.443 (0.399)
...

Logs when using 1 GPU (the training script is the same)

Train: [1/240][590/596] 	loss 15.279 (17.406))   1_p -0.005 (-0.001)     2_p -0.009 (-0.005)
Train: [2/240][590/596] 	loss 15.252 (15.252)    1_p -0.000 (0.004)      2_p 0.014 (0.005))
Train: [3/240][590/596] 	loss 15.253 (15.250)    1_p 0.007 (0.006))      2_p 0.009 (0.008))
Train: [4/240][590/596] 	loss 15.253 (15.248)    1_p 0.004 (0.012))      2_p 0.006 (0.011))
Train: [5/240][590/596] 	loss 15.268 (15.248)    1_p 0.011 (0.026))      2_p 0.018 (0.019))
Train: [6/240][590/596] 	loss 15.262 (15.243)    1_p 0.024 (0.041))      2_p 0.007 (0.031))
Train: [7/240][590/596] 	loss 15.259 (15.235)    1_p 0.039 (0.072)       2_p 0.017 (0.056))
Train: [8/240][590/596] 	loss 15.254 (15.217)    1_p 0.100 (0.119)       2_p 0.021 (0.086)
...

It seems to run fine for me: the loss decreases as the number of epochs increases.
The multi-GPU part here uses 'nn.DataParallel'. One thing I observed is that with multi-GPU, the performance is different. I searched for this and found this. However, it still cannot explain or solve the problem you met.

I am not sure whether your situation is affected by the environment. For multi-GPU training, you could also try distributed training, and you can also try different training settings such as adjusting the learning rate.
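
For reference, a minimal sketch of what distributed training with 'DistributedDataParallel' could look like instead of 'nn.DataParallel'. This is not this repo's actual training script: build_model(), build_dataset(), and the optimizer/learning-rate values are placeholders.

```python
# Minimal DistributedDataParallel (DDP) sketch, one process per GPU.
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
# (older PyTorch: python -m torch.distributed.launch --use_env --nproc_per_node=4 train_ddp.py)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)        # placeholder: the SSL model
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                     # placeholder: the video clip dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=6,    # per-process batch size
                        sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # assumed optimizer and lr

    for epoch in range(240):
        sampler.set_epoch(epoch)                  # reshuffle shards across processes each epoch
        for clips in loader:
            clips = clips.cuda(local_rank, non_blocking=True)
            loss = model(clips)                   # assumes the model's forward returns the SSL loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    main()
```

Unlike 'nn.DataParallel', each process holds its own copy of the model and only gradients are synchronized, so the effective batch size is batch_size multiplied by the number of GPUs.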

@Zhuysheng
Author

@BestJuly Thanks for your quick reply.

On a single RTX 2080 Ti, I set batch_size=6 for C3D.
On 4 RTX 2080 Ti with 'nn.DataParallel', batch_size=24.

I think the main differences are the environment and the batch size; I will try your suggestions.

By the way, does batch_size affect SSL learning a lot?

@BestJuly
Owner

BestJuly commented Dec 20, 2020

@Zhuysheng
In many papers, batch size affects SSL learning results. However, in our case, I do not have ablation studies/experiments on batch size. Usually, if you change the batch size, the learning rate should be changed accordingly, following new_lr = old_lr * new_batch_size / old_batch_size.
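
For example, with the batch sizes mentioned in this thread (6 on a single GPU vs. 24 on 4 GPUs), the linear scaling rule works out as below; the base learning rate here is an assumed placeholder, not the repo's default.

```python
# Linear scaling rule for the learning rate when the batch size changes.
old_batch_size = 6      # single RTX 2080 Ti (from this thread)
new_batch_size = 24     # 4x RTX 2080 Ti with nn.DataParallel (from this thread)
old_lr = 0.001          # assumed base learning rate; substitute the value you actually train with

new_lr = old_lr * new_batch_size / old_batch_size
print(new_lr)           # 0.004 -> 4x larger batch, 4x larger learning rate
```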

If you are asking whether the learning rate affects the performance of SSL, my answer is YES.
I do not have ablation studies/experiments on the code of this repo either. However, on a modified experimental version of the code, I found that a different learning rate (with the same batch size) does affect performance: 4% improvement on video retrieval and 1% improvement on video recognition can be achieved by changing to a different learning rate. Therefore, you can try changing the learning rate.
