
loss stuck in multi-gpu #7

Closed
Zhuysheng opened this issue Dec 19, 2020 · 3 comments


@Zhuysheng

When training the SSL model in a multi-GPU setting, the loss gets stuck at around 15, but on a single GPU the loss decreases normally. My environment is PyTorch 1.5, CUDA 10.2, GeForce RTX 2080 Ti.

@BestJuly
Owner

Hi, @Zhuysheng. Thank you for your interest.

In your case, are the multi-GPU settings, such as batch size and learning rate, exactly the same as in single-GPU mode?

I also tried the multi-GPU setting in one of my experimental environments: PyTorch 1.4, CUDA 10.1, V100. The logs are shown below.

Logs when using 2 GPUs

Using 2 GPUs
Train: [1/240][590/596] 	loss 15.278 (17.058)	1_p -0.006 (-0.004)	2_p 0.000 (0.000))
Train: [2/240][590/596] 	loss 15.249 (15.257)	1_p 0.007 (0.002))	2_p -0.000 (0.001)
Train: [3/240][590/596] 	loss 15.245 (15.254)	1_p 0.011 (0.002))	2_p 0.002 (0.002))
Train: [4/240][590/596] 	loss 15.242 (15.253)	1_p 0.021 (0.003))	2_p -0.002 (0.005)
Train: [5/240][590/596] 	loss 15.199 (15.248)	1_p -0.013 (0.004)	2_p 0.088 (0.021))
Train: [6/240][590/596] 	loss 15.241 (15.239)	1_p 0.013 (0.021))	2_p 0.027 (0.047))
Train: [7/240][590/596] 	loss 15.220 (15.222)	1_p 0.058 (0.069)	2_p 0.025 (0.063))
Train: [8/240][590/596] 	loss 15.232 (15.204)	1_p 0.064 (0.109)	2_p 0.029 (0.096)
Train: [9/240][590/596] 	loss 15.214 (15.171)	1_p 0.144 (0.172)	2_p 0.044 (0.136)
Train: [10/240][590/596]	loss 15.246 (15.132)	1_p 0.162 (0.251)	2_p 0.107 (0.201)
Train: [11/240][590/596]	loss 15.073 (15.075)	1_p 0.485 (0.372)	2_p 0.366 (0.288)
Train: [12/240][590/596]	loss 4.763 (14.979)	1_p 1.307 (0.618)	2_p 0.443 (0.399)
...

Logs when using 1 GPU (the training script is the same)

Train: [1/240][590/596] 	loss 15.279 (17.406))   1_p -0.005 (-0.001)     2_p -0.009 (-0.005)
Train: [2/240][590/596] 	loss 15.252 (15.252)    1_p -0.000 (0.004)      2_p 0.014 (0.005))
Train: [3/240][590/596] 	loss 15.253 (15.250)    1_p 0.007 (0.006))      2_p 0.009 (0.008))
Train: [4/240][590/596] 	loss 15.253 (15.248)    1_p 0.004 (0.012))      2_p 0.006 (0.011))
Train: [5/240][590/596] 	loss 15.268 (15.248)    1_p 0.011 (0.026))      2_p 0.018 (0.019))
Train: [6/240][590/596] 	loss 15.262 (15.243)    1_p 0.024 (0.041))      2_p 0.007 (0.031))
Train: [7/240][590/596] 	loss 15.259 (15.235)    1_p 0.039 (0.072)       2_p 0.017 (0.056))
Train: [8/240][590/596] 	loss 15.254 (15.217)    1_p 0.100 (0.119)       2_p 0.021 (0.086)
...

It seems to run fine for me: the loss decreases as the number of epochs increases.
The multi-GPU part here uses 'nn.DataParallel'. One thing I observed is that with multi-GPU, the performance is different. I searched for this and found this. However, it still cannot explain or solve the problem you met.

I am not sure whether your situation is affected by the environment. For multi-GPU training, you could also try distributed training, and you can also try different training settings such as adjusting the learning rate.
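
For reference, a minimal sketch of what distributed training with 'DistributedDataParallel' could look like instead of 'nn.DataParallel'. This is not this repo's actual training script: build_model(), build_dataset(), and the optimizer/learning-rate values are placeholders.

```python
# Minimal DistributedDataParallel (DDP) sketch, one process per GPU.
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
# (older PyTorch: python -m torch.distributed.launch --use_env --nproc_per_node=4 train_ddp.py)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)        # placeholder: the SSL model
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                     # placeholder: the video clip dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=6,    # per-process batch size
                        sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # assumed optimizer and lr

    for epoch in range(240):
        sampler.set_epoch(epoch)                  # reshuffle shards across processes each epoch
        for clips in loader:
            clips = clips.cuda(local_rank, non_blocking=True)
            loss = model(clips)                   # assumes the model's forward returns the SSL loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    main()
```

Unlike 'nn.DataParallel', each process holds its own copy of the model and only gradients are synchronized, so the effective batch size is batch_size multiplied by the number of GPUs.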

@Zhuysheng
Author

@BestJuly Thanks for your quick reply.

On a single RTX 2080 Ti, I set batch_size=6 for C3D.
On 4 RTX 2080 Ti with 'nn.DataParallel', batch_size=24.

I think the main differences are the environment and the batch size; I will try your suggestions.

By the way, does batch_size affect SSL learning a lot?

@BestJuly
Owner

BestJuly commented Dec 20, 2020

@Zhuysheng
In many papers, batch size affects SSL learning results. However, in our case, I do not have ablation studies/experiments on batch size. Usually, if you change the batch size, the learning rate should be changed accordingly, following new_lr = old_lr * new_batch_size / old_batch_size.
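
For example, with the batch sizes mentioned in this thread (6 on a single GPU vs. 24 on 4 GPUs), the linear scaling rule works out as below; the base learning rate here is an assumed placeholder, not the repo's default.

```python
# Linear scaling rule for the learning rate when the batch size changes.
old_batch_size = 6      # single RTX 2080 Ti (from this thread)
new_batch_size = 24     # 4x RTX 2080 Ti with nn.DataParallel (from this thread)
old_lr = 0.001          # assumed base learning rate; substitute the value you actually train with

new_lr = old_lr * new_batch_size / old_batch_size
print(new_lr)           # 0.004 -> 4x larger batch, 4x larger learning rate
```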

If you are asking whether the learning rate affects the performance of SSL, my answer is YES.
I do not have ablation studies/experiments on the code of this repo either. However, on a modified experimental version of the code, I found that a different learning rate (with the same batch size) does affect performance: 4% improvement on video retrieval and 1% improvement on video recognition can be achieved by changing to a different learning rate. Therefore, you can try changing the learning rate.
