Why does mine keep getting stuck here: RANK and WORLD_SIZE in environ: 0/1 #41

zhangfujunaaa opened this issue Mar 31, 2024 · 1 comment



zhangfujunaaa commented Mar 31, 2024

No description provided.

zhangfujunaaa closed this as not planned on Mar 31, 2024
zhangfujunaaa (Author) commented:

/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "train.py", line 885, in
main(args)
File "train.py", line 851, in main
_ = train(model, epoch, train_dataloader, optimizer, lr_scheduler, scaler, args, output_dir=args.out_dir, output_prefix=args.tag)
File "train.py", line 757, in train
loss_len = nnf.cross_entropy(len_out, mask.sum(dim=-1).to(torch.long) - 1)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2824, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f90b676ca22 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7f90b69cdaa3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f90b69cf147 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f90b67565a4 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7f915b57422a in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7f915b5742c1 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1932c6 (0x55f8b191b2c6 in /root/miniconda3/bin/python)
frame #7: + 0x158415 (0x55f8b18e0415 in /root/miniconda3/bin/python)
frame #8: + 0x15878b (0x55f8b18e078b in /root/miniconda3/bin/python)
frame #9: + 0x158415 (0x55f8b18e0415 in /root/miniconda3/bin/python)
frame #10: + 0x15893b (0x55f8b18e093b in /root/miniconda3/bin/python)
frame #11: + 0x193141 (0x55f8b191b141 in /root/miniconda3/bin/python)
frame #12: + 0x15893b (0x55f8b18e093b in /root/miniconda3/bin/python)
frame #13: + 0x193141 (0x55f8b191b141 in /root/miniconda3/bin/python)
frame #14: + 0x1592ac (0x55f8b18e12ac in /root/miniconda3/bin/python)
frame #15: + 0x158e77 (0x55f8b18e0e77 in /root/miniconda3/bin/python)
frame #16: + 0x158e60 (0x55f8b18e0e60 in /root/miniconda3/bin/python)
frame #17: + 0x176057 (0x55f8b18fe057 in /root/miniconda3/bin/python)
frame #18: PyDict_SetItemString + 0x61 (0x55f8b191f3c1 in /root/miniconda3/bin/python)
frame #19: PyImport_Cleanup + 0x9d (0x55f8b195daad in /root/miniconda3/bin/python)
frame #20: Py_FinalizeEx + 0x79 (0x55f8b198fa49 in /root/miniconda3/bin/python)
frame #21: Py_RunMain + 0x183 (0x55f8b1991893 in /root/miniconda3/bin/python)
frame #22: Py_BytesMain + 0x39 (0x55f8b1991ca9 in /root/miniconda3/bin/python)
frame #23: __libc_start_main + 0xe7 (0x7f915debcbf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: + 0x1e21c7 (0x55f8b196a1c7 in /root/miniconda3/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1573) of binary: /root/miniconda3/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_0cy1_ymo/none_380u4lyc/attempt_1/0/error.json
RANK and WORLD_SIZE in environ: 0/1
I switched to my own Chinese image-captioning dataset, and this error was raised while running train.
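The device-side assert "t >= 0 && t < n_classes" in ClassNLLCriterion is the usual symptom of a cross_entropy target falling outside [0, n_classes). Here the target is mask.sum(dim=-1) - 1 at train.py line 757, so with the new dataset a caption length likely exceeds the number of length classes that len_out was built for. Below is a minimal sanity-check sketch, assuming the tensor names from the traceback (len_out as [batch, n_len_classes] logits, mask as the caption padding mask); the actual shapes in this repository may differ.

import torch
import torch.nn.functional as nnf

def length_loss_with_check(len_out: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Same target expression as train.py:757.
    target = mask.sum(dim=-1).to(torch.long) - 1
    n_classes = len_out.size(-1)
    # Flag any target that would trip the device-side assert in cross_entropy.
    bad = (target < 0) | (target >= n_classes)
    if bad.any():
        raise ValueError(
            f"length target out of range: min={int(target.min())}, "
            f"max={int(target.max())}, but len_out only has {n_classes} classes"
        )
    return nnf.cross_entropy(len_out, target)

Running once with CUDA_LAUNCH_BLOCKING=1 (or briefly on CPU) also makes the failing call report its exact location instead of the deferred device-side assert shown above.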

zhangfujunaaa reopened this on Apr 1, 2024