This repository has been archived by the owner on Aug 18, 2020. It is now read-only.

broken pipe when sending distributed run #3

Closed
mrT23 opened this issue Nov 25, 2019 · 1 comment

Comments


mrT23 commented Nov 25, 2019

I am trying to send a distributed run via the 'launch' script:
python -m fastai2.launch train.py --model_name=xresnet
On a 1x V100 machine, the run succeeds.
On a 2x V100 machine, the run fails and I get an error message:

File "/opt/conda/lib/python3.6/site-packages/fastai2/distributed.py", line 91, in begin_fit
    self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe

I pulled the latest version of fastai2 on 25.11.

@sgugger
Contributor

sgugger commented Nov 25, 2019

This error message is too vague and could come from multiple things (it basically just says there was an error...). Unless you debug it further and point us to something in the library that is the cause, there is nothing we can do to help.
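Editor's note, for readers hitting the same wall: "Broken pipe" is not specific to fastai2 or DistributedDataParallel. It is the generic error raised when one end of an inter-process channel writes after the other end has died, which in a distributed run usually means one of the worker processes crashed earlier for some other reason. A minimal, library-free sketch (plain subprocess, no torch; the function name is ours) reproducing the error class:

```python
import subprocess
import sys

def write_after_peer_exit() -> str:
    """Reproduce the generic 'broken pipe' condition: write to a pipe
    whose reading process has already exited."""
    # Spawn a child that exits immediately, without ever reading its stdin.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        stdin=subprocess.PIPE,
        bufsize=0,  # unbuffered, so the write hits the OS pipe directly
    )
    child.wait()  # reader is now gone; the read end of the pipe is closed
    try:
        child.stdin.write(b"hello")
        return "write succeeded"
    except BrokenPipeError:
        return "BrokenPipeError: peer process already exited"

if __name__ == "__main__":
    print(write_after_peer_exit())
```

So the broken pipe on the surviving rank is a symptom, not the cause: a useful first debugging step is to capture each worker process's own stderr (e.g. one log file per rank) and look for the real traceback in the rank that died.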

@sgugger sgugger closed this as completed Nov 25, 2019