🐛 Bug Report
Since the `average_metrics` function is called from `backward`, which must run on every GPU, the device created here should match the current rank; otherwise `torch.distributed.all_reduce` will be stuck forever.
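A minimal sketch of the issue (not the repo's actual code; the function name and signature are borrowed for illustration). Every rank must build its tensor on its own GPU; if each process builds it on the same device, e.g. a hard-coded `cuda:0`, the NCCL collective never completes:

```python
import torch
import torch.distributed as dist

def average_metrics(metrics, count=1.0):
    """Weighted average of a list of metric values across all ranks.

    The tensor must live on the GPU assigned to the current rank,
    otherwise dist.all_reduce hangs waiting for the other devices.
    """
    device = torch.device('cuda', torch.cuda.current_device())
    tensor = torch.tensor([m * count for m in metrics] + [count],
                          device=device, dtype=torch.float64)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return (tensor[:-1] / tensor[-1]).cpu().tolist()
```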
We use Dora for all of our experiments, which typically calls `torch.cuda.set_device` at the beginning of training with the proper device. That allows using `'cuda'` everywhere afterwards without worrying about the rank of the GPU.
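A hedged sketch of the pattern described above (Dora's internals are not shown here; the `LOCAL_RANK` environment variable is an assumption, matching what launchers like `torchrun` set). Once `torch.cuda.set_device` is called, a bare `'cuda'` with no index resolves to that GPU for every later allocation:

```python
import os
import torch

# Assumed to be set by the process launcher (e.g. torchrun).
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

# From here on, plain 'cuda' picks this rank's GPU, so collectives
# see exactly one tensor per device as expected.
x = torch.zeros(4, device='cuda')
assert x.device.index == local_rank
```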