Trouble with training #6

Closed
liaolianye666 opened this issue Apr 1, 2022 · 5 comments

@liaolianye666

(PVDNet):CUDA_VISIBLE_DEVICES=0 python -B run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
Traceback (most recent call last):
File "run.py", line 263, in
init_dist()
File "run.py", line 193, in init_dist
rank = int(os.environ['RANK'])
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/os.py", line 675, in getitem
raise KeyError(key) from None
KeyError: 'RANK'

@codeslake
Owner

Try the following command:

CUDA_VISIBLE_DEVICES=0 python -B -m torch.distributed.launch --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
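For context: the KeyError above comes from init_dist() in run.py reading the rank via os.environ['RANK'], which is only set when the script is started through a distributed launcher such as torch.distributed.launch or torchrun. A minimal sketch of that pattern (the environment-variable defaults and master address/port fallbacks below are illustrative assumptions, not PVDNet's actual code, which simply raises KeyError when RANK is missing):

import os
import torch
import torch.distributed as dist

def init_dist(backend="nccl"):
    # RANK and WORLD_SIZE are exported per process by torch.distributed.launch / torchrun.
    rank = int(os.environ.get("RANK", 0))              # fallback 0 for a plain single-process run
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # fallback 1 for a plain single-process run
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # needed by the default env:// init method
    os.environ.setdefault("MASTER_PORT", "9000")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # on a single node, rank and local rank coincide
    return rank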

@liaolianye666
Author

(PVDNet)CUDA_VISIBLE_DEVICES=0 python -B -m torch.distributed.launch --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
Are you sure to delete the logs (y/n):
y
Laoding Config...
Project : PVDNet_TOG2021
Mode : PVDNet_DVD
Config: config_PVDNet
Network: PVDNet
Trainer: trainer
Loading Model...
initializing deblurring network
/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1646755853042/work/aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Warning! No positional inputs found for a module, assuming batch size is 1.
BIMNet loaded:
BIMNet fixed
Building Optim...
Building Loss...
Loading Learning Rate Scheduler...
Cosine annealing scheduler...
Loading Data Loader...
Building Dist Parallel Model...
Computing model complexity...
Computational complexity (Macs): 1004.12195896 B
Number of parameters: 10.514548 M
Max Epoch: 11539

=========== TRAINING START ============
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3668) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run.py", line 296, in
trainer.train()
File "run.py", line 79, in train
self.iteration(epoch, state, is_log)
File "run.py", line 144, in iteration
for inputs in data_loader:
File "/home/liao/PVDNet-main/data_loader/FastDataLoader.py", line 24, in iter
yield next(self.iterator)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1163, in _get_data
success, data = self._try_get_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3668) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3634) of binary: /home/liao/anaconda3/envs/PVDNet/bin/python
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-04-01_16:30:45
host : liao-GI5CN54
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3634)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@codeslake
Owner

I think it is a problem with the dataset loading. Try the following first:

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
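As the FutureWarning in the log above notes, torchrun sets --use_env by default, so a training script should read the local rank from the LOCAL_RANK environment variable rather than from a --local_rank command-line argument. A minimal sketch of that pattern (illustrative only; whether run.py needs this change depends on how it parses its arguments):

import os
import torch

# LOCAL_RANK is exported by torchrun for each spawned process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)  # bind this process to its GPU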

@liaolianye666
Author

You are right, but it failed the second time I tried it. It failed several more times and then it worked. Could it be that my computer doesn't have enough memory?
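A note on this failure mode: "DataLoader worker ... killed by signal: Killed" usually means the Linux out-of-memory killer terminated a worker process, so running out of host RAM is a plausible explanation. Lowering the number of loader workers (the -th flag presumably maps to this in PVDNet) or the batch size reduces memory pressure. A self-contained sketch of the relevant DataLoader arguments (the dummy tensors only stand in for the DVD video frames):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data in place of real frames; only the loader arguments matter here.
dummy = TensorDataset(torch.zeros(16, 3, 64, 64))
loader = DataLoader(
    dummy,
    batch_size=2,      # matches -b 2
    num_workers=2,     # fewer than -th 8: each worker holds its own copies in RAM
    pin_memory=False,  # pinned buffers consume extra host memory
    shuffle=True,
)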

@liaolianye666
Author

I appreciate your help, thank you.
