(Minseok) ubuntu@DESKTOP-SMIU2JP:~/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main$ python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5
/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/main.py", line 643, in <module>
    main(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/main.py", line 281, in main
    dist.init_distributed_mode(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/util/dist.py", line 220, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(The same traceback is raised by each of the seven failing worker ranks; their output is interleaved in the raw log.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8606 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8607) of binary: /home/ubuntu/anaconda3/envs/Minseok/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================
Root Cause:
[0]:
time: 2021-09-26_17:57:42
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 8607)
error_file: <N/A>
msg: Process failed with exitcode 1
Other Failures:
[1]:
time: 2021-09-26_17:57:42
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 8608)
error_file: <N/A>
msg: Process failed with exitcode 1
[2]:
time: 2021-09-26_17:57:42
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 8609)
error_file: <N/A>
msg: Process failed with exitcode 1
[3]:
time: 2021-09-26_17:57:42
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 8610)
error_file: <N/A>
msg: Process failed with exitcode 1
[4]:
time: 2021-09-26_17:57:42
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 8611)
error_file: <N/A>
msg: Process failed with exitcode 1
[5]:
time: 2021-09-26_17:57:42
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 8612)
error_file: <N/A>
msg: Process failed with exitcode 1
[6]:
time: 2021-09-26_17:57:42
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 8613)
error_file: <N/A>
msg: Process failed with exitcode 1
*****************************************
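(For reference, the FutureWarning at the top of the log suggests moving to the torchrun launcher. Assuming a PyTorch build that ships torchrun, where --use_env behavior is the default, the equivalent launch would be something like:

torchrun --nproc_per_node=8 main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5)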
I hit this error when I try to run the MDETR model with a ResNet-101 backbone on a single node with 8 GPUs.
It seems I got multiple errors at once, and I have no idea how to fix them.
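For what it's worth, "CUDA error: invalid device ordinal" from torch.cuda.set_device(args.gpu) generally means a worker process was handed a GPU index that does not exist on the machine, e.g. launching with --nproc_per_node=8 on a box with fewer than 8 visible GPUs. A minimal sketch for checking what PyTorch actually sees (plain PyTorch, nothing mdetr-specific):

import torch

# If this prints fewer than 8 devices, launching with --nproc_per_node=8
# will assign some workers a device ordinal that does not exist.
print("CUDA available:", torch.cuda.is_available())
print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))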
Can anyone help me fix this?