
Training Never Starts in Single GPU machine (Solution) #26

Closed
TheSeriousProgrammer opened this issue Aug 16, 2022 · 1 comment

@TheSeriousProgrammer

Hi, I don't use PyTorch Lightning much and haven't done distributed training before (noob here), but I found that training never starts on a single-GPU, single-node configuration.

The fix I found was to set the num_nodes parameter in the train configuration to 1. If the value is greater than 1, PyTorch Lightning presumably waits for the other nodes to join, so training never starts. See the sketch below.
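To illustrate what I mean, here is a minimal sketch (not the repo's actual training script; the trainer arguments shown are assumptions about a typical Lightning setup). The point is only that num_nodes has to match the number of machines you actually launch on, which is 1 on a single machine.

```python
# Minimal sketch: with num_nodes > 1, Lightning waits for the other nodes
# to connect before training, so on a single machine nothing ever happens.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,      # one GPU on this machine
    num_nodes=1,    # must be 1 on a single machine; >1 makes Lightning wait for peers
    max_epochs=30,  # placeholder value
)
# trainer.fit(model, datamodule)  # appears to hang at startup if num_nodes > available nodes
```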

It took me a while to get this right, so I'm putting it out there for fellow noobs :)

Thanks for sharing such incredible work with the community!

@gwkrsrch
Collaborator

Hi @TheSeriousProgrammer, thanks for sharing your experience!
d3f759b should help prevent this kind of misunderstanding. Thanks a lot :)
