Error when I run on a single node with 8 gpus #44

Closed
panda15963 opened this issue Sep 26, 2021 · 2 comments


@panda15963

```
(Minseok) ubuntu@DESKTOP-SMIU2JP:~/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main$ python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5
/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

| distributed init (rank 0): env://
```
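(Aside on the FutureWarning above: torchrun sets `--use_env` by default, so the launcher only exposes each worker's index through the environment. A minimal, generic sketch of reading it that way, not MDETR-specific code:)

```python
import os

import torch

# Under torchrun (or torch.distributed.launch --use_env), each worker process
# receives its index through the LOCAL_RANK environment variable instead of a
# --local_rank command-line argument.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Bind this worker to the GPU matching its local rank.
torch.cuda.set_device(local_rank)
```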
Every worker process then fails with the same traceback:

```
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/main.py", line 643, in <module>
    main(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/main.py", line 281, in main
    dist.init_distributed_mode(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/Portfolio/TeamProject/mdetr-main/util/dist.py", line 220, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
The launcher then shuts the workers down and reports:

```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8606 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8607) of binary: /home/ubuntu/anaconda3/envs/Minseok/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/Minseok/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

          main.py FAILED
=====================================
Root Cause:
[0]:
  time: 2021-09-26_17:57:42
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 8607)
  error_file: <N/A>
  msg: Process failed with exitcode 1
Other Failures:
[1]:
  time: 2021-09-26_17:57:42
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 8608)
  error_file: <N/A>
  msg: Process failed with exitcode 1
[2]:
  time: 2021-09-26_17:57:42
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 8609)
  error_file: <N/A>
  msg: Process failed with exitcode 1
[3]:
  time: 2021-09-26_17:57:42
  rank: 4 (local_rank: 4)
  exitcode: 1 (pid: 8610)
  error_file: <N/A>
  msg: Process failed with exitcode 1
[4]:
  time: 2021-09-26_17:57:42
  rank: 5 (local_rank: 5)
  exitcode: 1 (pid: 8611)
  error_file: <N/A>
  msg: Process failed with exitcode 1
[5]:
  time: 2021-09-26_17:57:42
  rank: 6 (local_rank: 6)
  exitcode: 1 (pid: 8612)
  error_file: <N/A>
  msg: Process failed with exitcode 1
[6]:
  time: 2021-09-26_17:57:42
  rank: 7 (local_rank: 7)
  exitcode: 1 (pid: 8613)
  error_file: <N/A>
  msg: Process failed with exitcode 1
*************************************
```
I hit this error when I try to run the MDETR model (ResNet-101) on a single node with 8 GPUs.
It looks like all the worker processes fail at once with the same error, and I have no idea how to fix it.
Can anyone help me fix this?
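For reference, the failing call is `torch.cuda.set_device(args.gpu)` raising `CUDA error: invalid device ordinal`, which as far as I understand means a worker asked for a GPU index the machine does not expose. A quick sanity check I plan to run (generic PyTorch, not MDETR-specific):

```python
import torch

# "invalid device ordinal" usually means torch.cuda.set_device() was asked for
# a GPU index that CUDA does not expose, so compare the launcher setting
# (--nproc_per_node=8) against what PyTorch actually sees.
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

If the visible count comes back lower than 8 (for example because of `CUDA_VISIBLE_DEVICES`, or a WSL setup that only exposes one GPU), the extra workers would fail with exactly this error, and lowering `--nproc_per_node` to match would presumably get past this point.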

@nikky4D

nikky4D commented Feb 2, 2022

were you able to fix this error? I'm running into something similar

@panda15963 (Author)

> were you able to fix this error? I'm running into something similar

I guess it's an incomplete project, so I gave up on this attempt.
