loss stuck in multi-gpu #7
When training the SSL model in a multi-GPU setting, the loss gets stuck at around 15, but on a single GPU the loss decreases normally. My environment is PyTorch 1.5, CUDA 10.2, GeForce RTX 2080 Ti.

Comments
Hi, @Zhuysheng. Thank you for your interest. In your case, are the multi-GPU settings, such as batch size and learning rate, exactly the same as in single-GPU mode? I also tried the multi-GPU setting in one of my experimental environments: PyTorch 1.4, CUDA 10.1, V100. The logs are below.

[Logs when using 2 GPUs]
[Logs when using 1 GPU; the training script is the same]

It runs fine for me: the loss decreases as the number of epochs increases. I am not sure whether your situation is caused by the environment. For multi-GPU training, you can also try distributed training, and you can experiment with different training settings, such as adjusting the learning rate.
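For reference, here is a minimal sketch of the distributed-training suggestion using torch.nn.parallel.DistributedDataParallel. The Linear model and random TensorDataset are stand-ins for the repo's actual C3D backbone and video dataset, and the LOCAL_RANK handling assumes a `torchrun`/`torch.distributed.launch --use_env` style launcher; it is not this repo's training script.

```python
# Minimal DistributedDataParallel sketch; the Linear model and random
# TensorDataset are stand-ins for the actual C3D model and video dataset.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # LOCAL_RANK is set by `torchrun` (or `torch.distributed.launch --use_env`).
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Stand-in model; replace with the SSL backbone (e.g. C3D).
    model = torch.nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Stand-in dataset; DistributedSampler shards it so each process sees
    # a distinct subset, unlike nn.DataParallel, which splits each batch.
    dataset = TensorDataset(torch.randn(1024, 512),
                            torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=6, sampler=sampler)

    criterion = torch.nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle differently every epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()       # DDP all-reduces gradients here
            optimizer.step()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=2 train_ddp.py`, or `python -m torch.distributed.launch --use_env --nproc_per_node=2 train_ddp.py` on the PyTorch 1.x versions discussed in this thread.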
@BestJuly Thanks for your quick reply. On a single RTX 2080 Ti, I set batch_size=6 for C3D. I think the main differences are the environment and the batch size; I will try your suggestions. By the way, does batch size affect SSL learning a lot?
@Zhuysheng And if you are asking whether the learning rate affects the performance of SSL, my answer is YES.
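Since batch size and learning rate interact when moving to multiple GPUs, one common heuristic is the linear scaling rule (Goyal et al., 2017): scale the learning rate in proportion to the effective (global) batch size. The sketch below assumes a 0.01 single-GPU learning rate for illustration; that value is not taken from this repo.

```python
# Linear scaling rule: scale lr with the effective (global) batch size.
# base_lr is an assumed value, not the repo's documented setting.
base_lr = 0.01         # assumed single-GPU learning rate
base_batch_size = 6    # per-GPU batch size mentioned above for C3D

def scaled_lr(batch_size_per_gpu: int, num_gpus: int) -> float:
    """Return lr scaled linearly with the effective batch size."""
    effective_batch = batch_size_per_gpu * num_gpus
    return base_lr * effective_batch / base_batch_size

print(scaled_lr(6, 1))  # 0.01 -> unchanged on one GPU
print(scaled_lr(6, 2))  # 0.02 -> doubled for two GPUs
```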