Let's start with the workaround that fixes the issue:
--- a/fastai/distributed.py
+++ b/fastai/distributed.py
@@ -29,7 +29,7 @@ class DistributedTrainer(LearnerCallback):
         return old_dl,new_dl,sampler
 
     def on_train_begin(self, **kwargs):
-        self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
+        self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id, find_unused_parameters=True)
         shuffle = self.data.train_dl.init_kwargs['shuffle'] if hasattr(self.data.train_dl, 'init_kwargs') else True
         self.old_train_dl,self.data.train_dl,self.train_sampler = self._change_dl(self.data.train_dl, shuffle)
         if hasattr(self.data, 'valid_dl') and self.data.valid_dl is not None:
That is, I added find_unused_parameters=True to the DistributedDataParallel call, but I have no idea what it does or whether it camouflages some other problem.
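To make sense of the flag, here is a toy sketch of my reading of the error message (my own hypothetical ToyModel on CPU/gloo, not fastai code): a module with parameters that never contribute to the loss, which DDP's reducer trips over on the second iteration unless the flag is set.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Toy module where branch_b's parameters never contribute to the loss --
# exactly the situation the RuntimeError describes.
class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.branch_a = torch.nn.Linear(4, 4)
        self.branch_b = torch.nn.Linear(4, 4)  # never used in forward

    def forward(self, x):
        return self.branch_a(x)

# Single-process gloo group, just enough to construct DDP on CPU.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

# Without find_unused_parameters=True, the second iteration dies with the
# "Expected to have finished reduction ..." RuntimeError, because
# branch_b's gradient hook never fires and its bucket never reduces.
# With the flag, the reducer traverses the autograd graph from the forward
# outputs and marks unused parameters as ready, so nothing waits forever.
model = DistributedDataParallel(ToyModel(), find_unused_parameters=True)
for _ in range(2):
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()
dist.destroy_process_group()

If that reading is right, then AWD_LSTM with pretrained=False must end up with some parameters that don't participate in producing the loss on a given step.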
So language_model_learner(data_lm, AWD_LSTM, pretrained=False) runs fine in single-GPU mode, but crashes in distributed mode with:
Traceback (most recent call last):
File "./mimic_lm_distr.py", line 69, in <module>
seed: Param("Random seed", int)=42,
File "/mnt/nvme1/fast.ai-1/br/fastai/master/fastai/script.py", line 40, in call_parse
func(**args.__dict__)
File "./mimic_lm_distr.py", line 105, in main
learn.fit_one_cycle(10, slice(1e-2), moms=moms)
File "/mnt/nvme1/fast.ai-1/br/fastai/master/fastai/train.py", line 22, in fit_one_cycle
learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
File "/mnt/nvme1/fast.ai-1/br/fastai/master/fastai/basic_train.py", line 200, in fit
fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
File "/mnt/nvme1/fast.ai-1/br/fastai/master/fastai/basic_train.py", line 101, in fit
loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
File "/mnt/nvme1/fast.ai-1/br/fastai/master/fastai/basic_train.py", line 26, in loss_batch
out = model(*xb)
File "/home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 401, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch-nightly_1559452046329/work/torch/csrc/distributed/c10d/reducer.cpp:410)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fe928977265 in /home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x61b (0x7fe9577cca1b in /home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x7116d8 (0x7fe9577c26d8 in /home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x216716 (0x7fe9572c7716 in /home/stas/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x55da2c35a6e4 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x55da2c35a801 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #6: _PyEval_EvalFrameDefault + 0x537e (0x55da2c3b67ae in /home/stas/anaconda3/envs/fastai/bin/python)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #8: _PyFunction_FastCallDict + 0x1d5 (0x55da2c2f85d5 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #9: _PyObject_Call_Prepend + 0x63 (0x55da2c30fc43 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #10: PyObject_Call + 0x6e (0x55da2c30495e in /home/stas/anaconda3/envs/fastai/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x1e20 (0x55da2c3b3250 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #13: _PyFunction_FastCallDict + 0x1d5 (0x55da2c2f85d5 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #14: _PyObject_Call_Prepend + 0x63 (0x55da2c30fc43 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #15: <unknown function> + 0x17116a (0x55da2c35216a in /home/stas/anaconda3/envs/fastai/bin/python)
frame #16: PyObject_Call + 0x6e (0x55da2c30495e in /home/stas/anaconda3/envs/fastai/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x1e20 (0x55da2c3b3250 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #19: _PyFunction_FastCallKeywords + 0x325 (0x55da2c3599c5 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x416 (0x55da2c3b1846 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0x387 (0x55da2c359a27 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x14ce (0x55da2c3b28fe in /home/stas/anaconda3/envs/fastai/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0xbb9 (0x55da2c2f7db9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x55da2c359a27 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x55da2c3b28fe in /home/stas/anaconda3/envs/fastai/bin/python)
frame #27: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #28: _PyFunction_FastCallKeywords + 0x387 (0x55da2c359a27 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x14ce (0x55da2c3b28fe in /home/stas/anaconda3/envs/fastai/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #31: _PyFunction_FastCallDict + 0x400 (0x55da2c2f8800 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x1e20 (0x55da2c3b3250 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #33: _PyFunction_FastCallKeywords + 0xfb (0x55da2c35979b in /home/stas/anaconda3/envs/fastai/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x416 (0x55da2c3b1846 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x2f9 (0x55da2c2f74f9 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #36: PyEval_EvalCodeEx + 0x44 (0x55da2c2f83c4 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #37: PyEval_EvalCode + 0x1c (0x55da2c2f83ec in /home/stas/anaconda3/envs/fastai/bin/python)
frame #38: <unknown function> + 0x22f874 (0x55da2c410874 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #39: PyRun_FileExFlags + 0xa1 (0x55da2c41ab81 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #40: PyRun_SimpleFileExFlags + 0x1c3 (0x55da2c41ad73 in /home/stas/anaconda3/envs/fastai/bin/python)
frame #41: <unknown function> + 0x23ae5f (0x55da2c41be5f in /home/stas/anaconda3/envs/fastai/bin/python)
frame #42: _Py_UnixMain + 0x3c (0x55da2c41bf7c in /home/stas/anaconda3/envs/fastai/bin/python)
frame #43: __libc_start_main + 0xe7 (0x7fe9714d9b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: <unknown function> + 0x1e0122 (0x55da2c3c1122 in /home/stas/anaconda3/envs/fastai/bin/python)
I followed the instructions in the RuntimeError and added find_unused_parameters=True as it suggested. But instead of reporting unused params, it just worked. The distributed training worked.
I don't yet know anything about this argument; I hope perhaps you do. If not, I will investigate tomorrow.
Note that this problem doesn't exist with pretrained=True.
This is all with git master.
I can also make a script to reproduce the problem if it helps. It's really just the staple LM from the lesson, but with pretrained=False, run with python -m torch.distributed.launch --nproc_per_node=2 ./script.py. A sketch of what I have in mind is below.
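Roughly (untested sketch; IMDB_SAMPLE stands in for my actual data, and the hyperparameters are placeholders):

# script.py -- hypothetical minimal repro; run with:
#   python -m torch.distributed.launch --nproc_per_node=2 ./script.py
from fastai.script import *
from fastai.text import *
from fastai.distributed import *

@call_parse
def main(local_rank: Param("Rank passed by torch.distributed.launch", int) = 0):
    # Standard fastai distributed setup: one process per GPU.
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    path = untar_data(URLs.IMDB_SAMPLE)            # stand-in corpus
    data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
    # pretrained=False is the trigger; pretrained=True trains fine.
    learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
    learn = learn.to_distributed(local_rank)
    learn.fit_one_cycle(1, 1e-2)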
Thank you!
P.S. It looks like find_unused_parameters=True was added sometime in pytorch 1.2.0.dev2 (i.e. it's not in 1.0.1.post2).
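If it helps anyone checking their own install, a quick introspection check (nothing fastai-specific) of whether the local build accepts the flag:

import inspect
import torch
from torch.nn.parallel import DistributedDataParallel

print(torch.__version__)
# True if this build's DistributedDataParallel accepts find_unused_parameters
sig = inspect.signature(DistributedDataParallel.__init__)
print('find_unused_parameters' in sig.parameters)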