🐛 Bug Report
Since the `average_metrics` function is called from `backward`, which must run on every GPU, the device created here should match the current rank; otherwise `torch.distributed.all_reduce` will be stuck forever.
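A minimal sketch of the issue (not the repo's actual code; the function name and signature are borrowed for illustration). Every rank must build its tensor on its own GPU; if each process builds it on the same device, e.g. a hard-coded `cuda:0`, the NCCL collective never completes:

```python
import torch
import torch.distributed as dist

def average_metrics(metrics, count=1.0):
    """Weighted average of a list of metric values across all ranks.

    The tensor must live on the GPU assigned to the current rank,
    otherwise dist.all_reduce hangs waiting for the other devices.
    """
    device = torch.device('cuda', torch.cuda.current_device())
    tensor = torch.tensor([m * count for m in metrics] + [count],
                          device=device, dtype=torch.float64)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return (tensor[:-1] / tensor[-1]).cpu().tolist()
```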
We use Dora for all of our experiments, which typically calls `torch.cuda.set_device` at the beginning of training with the proper device. That allows using `'cuda'` everywhere afterwards without worrying about the rank of the GPU.
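A hedged sketch of the pattern described above (Dora's internals are not shown here; the `LOCAL_RANK` environment variable is an assumption, matching what launchers like `torchrun` set). Once `torch.cuda.set_device` is called, a bare `'cuda'` with no index resolves to that GPU for every later allocation:

```python
import os
import torch

# Assumed to be set by the process launcher (e.g. torchrun).
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

# From here on, plain 'cuda' picks this rank's GPU, so collectives
# see exactly one tensor per device as expected.
x = torch.zeros(4, device='cuda')
assert x.device.index == local_rank
```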