This repository has been archived by the owner on Aug 18, 2020. It is now read-only.

broken pipe when sending distributed run #3

Closed
mrT23 opened this issue Nov 25, 2019 · 1 comment

Comments


mrT23 commented Nov 25, 2019

I am trying to send a distributed run via the 'launch' script:
python -m fastai2.launch train.py --model_name=xresnet
On a 1x V100 machine, the run succeeds.
On a 2x V100 machine, the run fails and I get an error message:

File "/opt/conda/lib/python3.6/site-packages/fastai2/distributed.py", line 91, in begin_fit
    self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe

I pulled the latest version of fastai2 on 25.11.

@sgugger
Contributor

sgugger commented Nov 25, 2019

This error message is too vague and could come from multiple things (it basically just says there was an error...). Unless you debug it further and point us to something in the library that is the cause, there is nothing we can do to help.
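Editor's note, for readers hitting the same wall: "Broken pipe" is not specific to fastai2 or DistributedDataParallel. It is the generic error raised when one end of an inter-process channel writes after the other end has died, which in a distributed run usually means one of the worker processes crashed earlier for some other reason. A minimal, library-free sketch (plain subprocess, no torch; the function name is ours) reproducing the error class:

```python
import subprocess
import sys

def write_after_peer_exit() -> str:
    """Reproduce the generic 'broken pipe' condition: write to a pipe
    whose reading process has already exited."""
    # Spawn a child that exits immediately, without ever reading its stdin.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        stdin=subprocess.PIPE,
        bufsize=0,  # unbuffered, so the write hits the OS pipe directly
    )
    child.wait()  # reader is now gone; the read end of the pipe is closed
    try:
        child.stdin.write(b"hello")
        return "write succeeded"
    except BrokenPipeError:
        return "BrokenPipeError: peer process already exited"

if __name__ == "__main__":
    print(write_after_peer_exit())
```

So the broken pipe on the surviving rank is a symptom, not the cause: a useful first debugging step is to capture each worker process's own stderr (e.g. one log file per rank) and look for the real traceback in the rank that died.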

@sgugger sgugger closed this as completed Nov 25, 2019