Stucked at the beginning of training #21

psu1 opened this issue May 30, 2020 · 14 comments


psu1 commented May 30, 2020

Hi, I am trying to run DETR on my local machine, but the training process gets stuck at the beginning stage, as follows:

I am using PyTorch 1.5 and torchvision 0.6, and the Faster R-CNN model can be trained on the COCO dataset without any problem.

I suspect the problem may come from the DataLoader part. Could you provide some hints on this? Thanks!

fmassa commented May 30, 2020


Are you using distributed training? If yes, on how many GPUs / nodes?

My first guess is that the deadlock you are facing might be due to a synchronisation issue in DistributedDataParallel, but we would need a bit more information to be sure.

psu1 commented May 31, 2020

Yes. 4 GPUs, 1 node.

I have also tried running the evaluation code on one GPU, and it gets stuck in the same place.

fmassa commented May 31, 2020

What is the command that you are using to train your model? Can you run a standard CUDA code with your environment?

Without further information it is very hard to understand what is going on and to be able to give more precise help.

psu1 commented Jun 1, 2020

Yes. I can run the standard CUDA code.

I use the training command of "python -m torch.distributed.launch --nproc_per_node=4 --use_env
--lr_drop 400 --epochs 500
--coco_path /path/to/coco"

fmassa commented Jun 1, 2020

@psu1 if you run your code with only

python --lr_drop 400 --epochs 500 --coco_path /path/to/coco

does it also deadlock?
If yes, distributed training is probably not at fault. You can press CTRL+C while it is hanging, and the resulting traceback will show where the code is spending its time.

One possibility is that it is hanging in the data loading part. To be sure, you could also pass --num_workers 0 and press CTRL+C if it hangs. With no worker processes, the traceback will point much more precisely to where the code is stuck.

Once you have this information, can you share it here so that we can debug further?
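
If pressing CTRL+C by hand is inconvenient (for example in a batch job), a similar stack dump can be produced automatically with Python's standard `faulthandler` module. The following is only a minimal sketch: `run_with_hang_watchdog` and `slow_step` are hypothetical names that are not part of DETR, and `slow_step` merely stands in for whatever call appears to hang, such as fetching the first batch from the data loader.

```python
import faulthandler
import tempfile
import time


def run_with_hang_watchdog(fn, timeout_s, log_file):
    """Run fn(); if it is still running after timeout_s seconds, dump the
    stack of every thread to log_file. The process itself keeps running."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for a call that seems to hang,
    # e.g. pulling the first batch from the data loader.
    time.sleep(0.5)
    return "done"


with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_hang_watchdog(slow_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

# The dump lists the frames that were active when the timeout expired.
print(dump)
```

In a real run the timeout would be much longer (minutes, not a fraction of a second) and the dump could simply go to `sys.stderr`, which is the default when no `file` argument is given.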

psu1 commented Jun 3, 2020


Running python --lr_drop 400 --epochs 500 --coco_path /path/to/coco still gets stuck, and pressing CTRL+C gives the following traceback:

Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/media/jaden/jaden/DeepLearningCode/data/coco2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=500, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=400, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
number of params: 41302368
loading annotations into memory...
Done (t=14.59s)
creating index...
index created!
loading annotations into memory...
Done (t=0.54s)
creating index...
index created!
Start training
Traceback (most recent call last):
File "", line 249, in
File "", line 199, in main
File "/media/jaden/jaden/DeepLearningCode/detr/", line 28, in train_one_epoch
for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
File "/media/jaden/jaden/DeepLearningCode/detr/util/", line 223, in log_every
for obj in iterable:
File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 345, in next
data = self._next_data()
File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 841, in _next_data
idx, data = self._get_data()
File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 808, in _get_data
success, data = self._try_get_data()
File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/", line 104, in get
if not self._poll(timeout):
File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/", line 257, in poll
return self._poll(timeout)
File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/", line 414, in _poll
r = wait([self], timeout)
File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/", line 920, in wait
ready =
File "/home/jaden/anaconda3/lib/python3.7/", line 415, in select
fd_event_list = self._selector.poll(timeout)

But running python --lr_drop 400 --epochs 500 --coco_path /path/to/coco --num_workers 0 does not hang.

psu1 commented Jun 3, 2020

Distributed training on multiple GPUs works fine when setting --num_workers 0. The problem seems to lie in the torch DataLoader worker processes.
Thanks very much @fmassa!

psu1 closed this as completed Jun 3, 2020

hvudeshi commented Jun 8, 2020

Hello @psu1 and @fmassa,
I am running the command "python3 -m torch.distributed.launch --nproc_per_node=8 --use_env --coco_path /path/to/coco --epochs 25 --num_workers 0" on 8 GPUs and 1 node, on my custom object detection dataset. I am getting the same error as the one shown by @psu1. How can I fix this?

PS: PyTorch version 1.5 and torchvision 0.6

fmassa commented Jun 8, 2020

@hardik22317 do you mean that your code gets stuck and, when you press CTRL+C, it shows an error in the dataloader?

Can you try running with --num_workers 0?

EDIT: Just saw that you are running with num_workers 0, can you paste which error you are facing?

hvudeshi commented Jun 8, 2020

Command:- python3 -m torch.distributed.launch --nproc_per_node=8 --use_env --coco_path /home/ubuntu/newdarknet/coco_gmr_dataset/ --epochs 25 --num_workers 0
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 4): env://
| distributed init (rank 2): env://
| distributed init (rank 6): env://
| distributed init (rank 7): env://
| distributed init (rank 0): env://
| distributed init (rank 5): env://
  sha: be9d447ea3208e91069510643f75dadb7e9d163d, status: clean, branch: master

Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/home/ubuntu/newdarknet/coco_gmr_dataset/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=25, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=0, output_dir='', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Traceback (most recent call last):
  File "", line 248, in <module>
  File "", line 198, in main
  File "/home/ubuntu/detr/", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
number of params: 41302368
loading annotations into memory...
Done (t=0.05s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Start training

fmassa commented Jun 8, 2020

@hardik22317 your error is different:

  File "/home/ubuntu/detr/", line 28, in train_one_epoch
    return [self.imgs[id] for id in ids]
KeyError: 'v'

This is most probably caused by an issue with the path to your COCO data, or by annotations that are not in the expected format.
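
A quick sanity check of the annotation file can help narrow this down. The sketch below is not part of DETR or pycocotools: `check_coco_annotations` is a hypothetical helper, and the idea that string image ids could produce a `KeyError` on a single character such as `'v'` is an assumption that this thread does not confirm. It only checks the basic COCO layout: top-level `images`, `annotations` and `categories` lists, integer image ids, and annotations that point to existing images.

```python
import json


def check_coco_annotations(coco):
    """Return a list of problems found in a COCO-style annotation dict."""
    problems = []
    for key in ("images", "annotations", "categories"):
        if not isinstance(coco.get(key), list):
            problems.append(f"missing or non-list top-level key: {key!r}")
    if problems:
        return problems

    image_ids = set()
    for img in coco["images"]:
        img_id = img.get("id")
        if isinstance(img_id, int):
            image_ids.add(img_id)
        else:
            # Assumption: non-integer ids (e.g. file names) may be what
            # triggers lookups on single characters further down the line.
            problems.append(f"image id is not an int: {img_id!r}")

    for ann in coco["annotations"]:
        ref = ann.get("image_id")
        if not isinstance(ref, int) or ref not in image_ids:
            problems.append(
                f"annotation {ann.get('id')!r} has unknown image_id {ref!r}"
            )
    return problems


# With a real file (the path below is only a placeholder):
#   with open("annotations.json") as f:
#       print(check_coco_annotations(json.load(f)))
good = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 1,
                     "bbox": [0, 0, 5, 5]}],
    "categories": [{"id": 1, "name": "thing"}],
}
bad = {
    "images": [{"id": "val_0001", "file_name": "a.jpg"}],
    "annotations": [{"id": 10, "image_id": "val_0001", "category_id": 1,
                     "bbox": [0, 0, 5, 5]}],
    "categories": [{"id": 1, "name": "thing"}],
}
good_problems = check_coco_annotations(good)
bad_problems = check_coco_annotations(bad)
print(good_problems)
print(bad_problems)
```

An empty list only means that these basic checks passed; it does not guarantee that the file is fully valid for training.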

amsword commented Jun 21, 2020

I have a similar issue. It works on PyTorch 1.5.1, but not on 1.4.

I am running into the hanging issue. The code gets stuck at the all_reduce statement in SetCriterion's forward method. I am not sure how to resolve this. I am simply training on the COCO dataset.

Ferry7z commented Aug 3, 2023

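I am running into the hanging issue. The code gets stuck at the all_reduce statement in SetCriterion's forward method. I am not sure how to resolve this. I am simply training on the COCO dataset.

Hello, have you solved the problem? I have run into it too. Thanks.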

