Trouble with training #6

Closed
liaolianye666 opened this issue Apr 1, 2022 · 5 comments

@liaolianye666

(PVDNet):CUDA_VISIBLE_DEVICES=0 python -B run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
Traceback (most recent call last):
File "run.py", line 263, in
init_dist()
File "run.py", line 193, in init_dist
rank = int(os.environ['RANK'])
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/os.py", line 675, in getitem
raise KeyError(key) from None
KeyError: 'RANK'

@codeslake
Owner

Try the following command:

CUDA_VISIBLE_DEVICES=0 python -B -m torch.distributed.launch --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
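For context: the KeyError above comes from init_dist() in run.py reading the rank via os.environ['RANK'], which is only set when the script is started through a distributed launcher such as torch.distributed.launch or torchrun. A minimal sketch of that pattern (the environment-variable defaults and master address/port fallbacks below are illustrative assumptions, not PVDNet's actual code, which simply raises KeyError when RANK is missing):

import os
import torch
import torch.distributed as dist

def init_dist(backend="nccl"):
    # RANK and WORLD_SIZE are exported per process by torch.distributed.launch / torchrun.
    rank = int(os.environ.get("RANK", 0))              # fallback 0 for a plain single-process run
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # fallback 1 for a plain single-process run
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # needed by the default env:// init method
    os.environ.setdefault("MASTER_PORT", "9000")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # on a single node, rank and local rank coincide
    return rank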

@liaolianye666
Author

(PVDNet)CUDA_VISIBLE_DEVICES=0 python -B -m torch.distributed.launch --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
Are you sure to delete the logs (y/n):
y
Laoding Config...
Project : PVDNet_TOG2021
Mode : PVDNet_DVD
Config: config_PVDNet
Network: PVDNet
Trainer: trainer
Loading Model...
initializing deblurring network
/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1646755853042/work/aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Warning! No positional inputs found for a module, assuming batch size is 1.
BIMNet loaded:
BIMNet fixed
Building Optim...
Building Loss...
Loading Learning Rate Scheduler...
Cosine annealing scheduler...
Loading Data Loader...
Building Dist Parallel Model...
Computing model complexity...
Computational complexity (Macs): 1004.12195896 B
Number of parameters: 10.514548 M
Max Epoch: 11539

=========== TRAINING START ============
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3668) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run.py", line 296, in
trainer.train()
File "run.py", line 79, in train
self.iteration(epoch, state, is_log)
File "run.py", line 144, in iteration
for inputs in data_loader:
File "/home/liao/PVDNet-main/data_loader/FastDataLoader.py", line 24, in iter
yield next(self.iterator)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1163, in _get_data
success, data = self._try_get_data()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3668) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3634) of binary: /home/liao/anaconda3/envs/PVDNet/bin/python
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liao/anaconda3/envs/PVDNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-04-01_16:30:45
host : liao-GI5CN54
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3634)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@codeslake
Owner

I think it is a problem with the dataset loading. Try the following first:

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=9000 run.py --is_train --mode PVDNet_DVD --config config_PVDNet --trainer trainer --data DVD --LRS CA -b 2 -th 8 -dl -ss -dist
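As the FutureWarning in the log above notes, torchrun sets --use_env by default, so a training script should read the local rank from the LOCAL_RANK environment variable rather than from a --local_rank command-line argument. A minimal sketch of that pattern (illustrative only; whether run.py needs this change depends on how it parses its arguments):

import os
import torch

# LOCAL_RANK is exported by torchrun for each spawned process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)  # bind this process to its GPU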

@liaolianye666
Author

You are right, but it failed the second time I tried it. It failed several more times and then it worked. Could it be that my computer doesn't have enough memory?
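A note on this failure mode: "DataLoader worker ... killed by signal: Killed" usually means the Linux out-of-memory killer terminated a worker process, so running out of host RAM is a plausible explanation. Lowering the number of loader workers (the -th flag presumably maps to this in PVDNet) or the batch size reduces memory pressure. A self-contained sketch of the relevant DataLoader arguments (the dummy tensors only stand in for the DVD video frames):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data in place of real frames; only the loader arguments matter here.
dummy = TensorDataset(torch.zeros(16, 3, 64, 64))
loader = DataLoader(
    dummy,
    batch_size=2,      # matches -b 2
    num_workers=2,     # fewer than -th 8: each worker holds its own copies in RAM
    pin_memory=False,  # pinned buffers consume extra host memory
    shuffle=True,
)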

@liaolianye666
Author

I appreciate your help, thank you.
