How to use distributed training? #1

haoyan14 · 2023-08-16T09:27:53Z

When I use distributed training, the run gets stuck at this line: "model = torch.nn.parallel.DistributedDataParallel(model.to(local_rank), device_ids=[local_rank], output_device=local_rank, find_unused_parameters=False)". Why does this happen?

boheumd · 2023-08-17T18:07:44Z

Hello, I have not come across this problem before. Can you provide more information about the issue? Does the same problem occur if you switch to single-GPU training?

haoyan14 · 2023-08-19T01:21:54Z

Sorry, I have solved the problem. The cause was that I mixed the use of "torchrun" and "-d".
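For anyone hitting the same hang: one plausible explanation is that constructing DistributedDataParallel synchronizes with the other ranks, so it may block when the processes that actually started do not match what the process group expects. Below is a minimal sketch of a fail-fast check that could run before the process group is initialized. It relies only on the environment variables that torchrun sets for each worker (LOCAL_RANK, RANK, WORLD_SIZE). The helper names and the `cli_devices` argument are hypothetical; `cli_devices` merely stands in for a script-level launcher option, since the thread does not say what "-d" does in this repository.

```python
import os

# Environment variables that torchrun sets for every worker process.
TORCHRUN_VARS = ("LOCAL_RANK", "RANK", "WORLD_SIZE")


def read_torchrun_env(env=None):
    """Return (local_rank, rank, world_size) if launched by torchrun, else None."""
    env = os.environ if env is None else env
    present = [name for name in TORCHRUN_VARS if name in env]
    if not present:
        return None  # the script was not started by torchrun
    if len(present) != len(TORCHRUN_VARS):
        missing = sorted(set(TORCHRUN_VARS) - set(present))
        raise RuntimeError(f"incomplete torchrun environment, missing: {missing}")
    return tuple(int(env[name]) for name in TORCHRUN_VARS)


def check_launch_mode(cli_devices, env=None):
    """Raise if torchrun and a script-level launcher option are combined.

    cli_devices is a hypothetical stand-in for an option such as "-d";
    pass None when the option was not given on the command line.
    """
    torchrun_env = read_torchrun_env(env)
    if torchrun_env is not None and cli_devices is not None:
        raise RuntimeError(
            "launched with torchrun and a script-level device option at the "
            "same time; use only one launch mechanism"
        )
    return torchrun_env
```

Calling `check_launch_mode(...)` at the top of the training script, before `torch.distributed.init_process_group`, would turn a silent hang into an immediate error message. Whether this matches the exact conflict in this repository is an assumption.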
