This repository has been archived by the owner on Aug 18, 2020. It is now read-only.
I am trying to launch a distributed run via the 'launch' script: python -m fastai2.launch train.py --model_name=xresnet
On a 1xV100 machine, the run succeeds.
On a 2xV100 machine, the run fails with the following error:
```
  File "/opt/conda/lib/python3.6/site-packages/fastai2/distributed.py", line 91, in begin_fit
    self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe
```
I pulled the latest version of fastai2 (as of 25.11).
This error message is too vague and could come from multiple things (it basically just says there was an error...). Unless you debug it further and point us to something in the library that is the cause, there is nothing we can do to help.
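One way to narrow it down is to check whether plain PyTorch `DistributedDataParallel` works outside fastai2, since the traceback dies inside DDP's parameter broadcast, not in fastai2 code. Below is a minimal sketch of such a smoke test; the helper name `ddp_smoke_test` and the choice of the `gloo` backend (so it also runs without GPUs) are my own illustrative assumptions, not anything from fastai2. To reproduce the 2xV100 setup, run one copy per process with `backend="nccl"` and the matching `rank`/`world_size`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_smoke_test(rank=0, world_size=1, backend="gloo"):
    """Wrap a tiny model in DDP and run one forward pass.

    Wrapping in DDP triggers the same coalesced broadcast of parameters
    and buffers that raised "Broken pipe" in the report above. The
    helper name and defaults are illustrative, not fastai2 API.
    """
    # Rendezvous settings normally provided by the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        model = torch.nn.Linear(4, 2)
        # On GPU you would pass device_ids=[rank] and move the model first;
        # on CPU with gloo, DDP works without device_ids.
        ddp_model = DDP(model)
        out = ddp_model(torch.randn(8, 4))
        return tuple(out.shape)
    finally:
        dist.destroy_process_group()
```

If this fails on the 2xV100 box with the same broken pipe, the problem is in the PyTorch/NCCL setup (drivers, NCCL version, blocked ports) rather than in fastai2.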