Why does mine keep getting stuck here: RANK and WORLD_SIZE in environ: 0/1 #41

zhangfujunaaa opened this issue Mar 31, 2024 · 1 comment



zhangfujunaaa commented Mar 31, 2024

No description provided.

zhangfujunaaa closed this as not planned on Mar 31, 2024
zhangfujunaaa (Author) commented:

/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "train.py", line 885, in
main(args)
File "train.py", line 851, in main
_ = train(model, epoch, train_dataloader, optimizer, lr_scheduler, scaler, args, output_dir=args.out_dir, output_prefix=args.tag)
File "train.py", line 757, in train
loss_len = nnf.cross_entropy(len_out, mask.sum(dim=-1).to(torch.long) - 1)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2824, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f90b676ca22 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7f90b69cdaa3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f90b69cf147 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f90b67565a4 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7f915b57422a in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7f915b5742c1 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1932c6 (0x55f8b191b2c6 in /root/miniconda3/bin/python)
frame #7: + 0x158415 (0x55f8b18e0415 in /root/miniconda3/bin/python)
frame #8: + 0x15878b (0x55f8b18e078b in /root/miniconda3/bin/python)
frame #9: + 0x158415 (0x55f8b18e0415 in /root/miniconda3/bin/python)
frame #10: + 0x15893b (0x55f8b18e093b in /root/miniconda3/bin/python)
frame #11: + 0x193141 (0x55f8b191b141 in /root/miniconda3/bin/python)
frame #12: + 0x15893b (0x55f8b18e093b in /root/miniconda3/bin/python)
frame #13: + 0x193141 (0x55f8b191b141 in /root/miniconda3/bin/python)
frame #14: + 0x1592ac (0x55f8b18e12ac in /root/miniconda3/bin/python)
frame #15: + 0x158e77 (0x55f8b18e0e77 in /root/miniconda3/bin/python)
frame #16: + 0x158e60 (0x55f8b18e0e60 in /root/miniconda3/bin/python)
frame #17: + 0x176057 (0x55f8b18fe057 in /root/miniconda3/bin/python)
frame #18: PyDict_SetItemString + 0x61 (0x55f8b191f3c1 in /root/miniconda3/bin/python)
frame #19: PyImport_Cleanup + 0x9d (0x55f8b195daad in /root/miniconda3/bin/python)
frame #20: Py_FinalizeEx + 0x79 (0x55f8b198fa49 in /root/miniconda3/bin/python)
frame #21: Py_RunMain + 0x183 (0x55f8b1991893 in /root/miniconda3/bin/python)
frame #22: Py_BytesMain + 0x39 (0x55f8b1991ca9 in /root/miniconda3/bin/python)
frame #23: __libc_start_main + 0xe7 (0x7f915debcbf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: + 0x1e21c7 (0x55f8b196a1c7 in /root/miniconda3/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1573) of binary: /root/miniconda3/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_0cy1_ymo/none_380u4lyc/attempt_1/0/error.json
RANK and WORLD_SIZE in environ: 0/1
I switched to my own Chinese image-captioning dataset, and this error was raised while running train.
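The device-side assert "t >= 0 && t < n_classes" in ClassNLLCriterion is the usual symptom of a cross_entropy target falling outside [0, n_classes). Here the target is mask.sum(dim=-1) - 1 at train.py line 757, so with the new dataset a caption length likely exceeds the number of length classes that len_out was built for. Below is a minimal sanity-check sketch, assuming the tensor names from the traceback (len_out as [batch, n_len_classes] logits, mask as the caption padding mask); the actual shapes in this repository may differ.

import torch
import torch.nn.functional as nnf

def length_loss_with_check(len_out: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Same target expression as train.py:757.
    target = mask.sum(dim=-1).to(torch.long) - 1
    n_classes = len_out.size(-1)
    # Flag any target that would trip the device-side assert in cross_entropy.
    bad = (target < 0) | (target >= n_classes)
    if bad.any():
        raise ValueError(
            f"length target out of range: min={int(target.min())}, "
            f"max={int(target.max())}, but len_out only has {n_classes} classes"
        )
    return nnf.cross_entropy(len_out, target)

Running once with CUDA_LAUNCH_BLOCKING=1 (or briefly on CPU) also makes the failing call report its exact location instead of the deferred device-side assert shown above.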

zhangfujunaaa reopened this on Apr 1, 2024