Stuck at the beginning of training #21
Comments
Hi, are you using distributed training? If yes, on how many GPUs / nodes? My first guess is that the deadlock you are facing might be due to a synchronisation issue in DistributedDataParallel, but we would need a bit more information to be sure.
Yes. 4 GPUs, 1 node. I have tried to run the evaluation code with one GPU, which also got stuck in the same place.
What is the command that you are using to train your model? Can you run standard CUDA code in your environment? Without further information it is very hard to understand what is going on and to give more precise help.
Yes, I can run standard CUDA code. I use the training command "python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py …".
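(For reference, a minimal sanity check of the distributed setup can help separate an NCCL/launcher problem from a DETR problem. The sketch below is my own, not from the repo; the filename check_dist.py is hypothetical, and it assumes the same torch.distributed.launch --use_env style as the command above.)

```python
# check_dist.py -- hypothetical standalone script, not part of the DETR repo.
# Minimal NCCL sanity check: if this also hangs, the problem is in the
# distributed/NCCL setup rather than in DETR itself.
# Launch the same way as main.py, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 --use_env check_dist.py
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch --use_env exports RANK, WORLD_SIZE, LOCAL_RANK, etc.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # A single all_reduce over one scalar; the result should equal world_size.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```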
@psu1 if you run your code with only a single GPU (without the distributed launcher), does it also deadlock? One possibility is that it could be hanging at the data loading part. To be sure, you could also check exactly where the process is hanging. Once you have this information, can you share it here so that we can debug further?
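(One generic way to see where a hung process is blocked, offered here as a suggestion rather than a reconstruction of the elided step, is to dump the Python stack of the stuck process:)

```python
# Optional debugging aid (my own sketch, not from the repo): register a signal
# handler so a hung process can dump the Python stack of every thread on demand.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
# While the training run hangs, run `kill -USR1 <pid>` from another shell to
# print where each thread is blocked. Alternatively, `py-spy dump --pid <pid>`
# gives a similar stack dump without modifying the code.
```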
Thanks! Running it without the distributed launcher gives: Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/media/jaden/jaden/DeepLearningCode/data/coco2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=500, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=400, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1). But running …
Distributed training on multiple GPUs works fine when setting …
Hello @psu1 and @fmassa, I am running into a similar issue. PS: PyTorch version 1.5 and torchvision 0.6.
@hardik22317 you mean that your code gets stuck, and when you do Ctrl+C it shows an error in the dataloader? Can you try running with …? EDIT: just saw that you are running with …
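(A common way to rule out a deadlock in the dataloader worker processes is to disable them entirely. Judging from the Namespace dump above, the script exposes a num_workers option; something along these lines should work, assuming the flags are spelled --coco_path and --num_workers and the path is replaced with your own:)

```
python main.py --coco_path /path/to/coco --num_workers 0
```

With zero workers all data loading happens in the main process, so a worker-side hang either disappears or turns into a traceback that can be shared.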
@hardik22317 your error is different: most probably there is an issue with the path to your COCO data, or the annotations are not in the format that it expects.
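(If the data path is the suspect, a quick way to validate the annotation files independently of the training script is to load them with pycocotools, which DETR already depends on. The path below is a placeholder for your local COCO 2017 layout:)

```python
# Hypothetical sanity check for the COCO annotations (path is a placeholder).
from pycocotools.coco import COCO

ann_file = "/path/to/coco/annotations/instances_train2017.json"
coco = COCO(ann_file)  # fails if the file is missing or not valid COCO-format JSON
print(f"{len(coco.getImgIds())} images, {len(coco.getCatIds())} categories")
```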
I have similar issues. It works on PyTorch 1.5.1, but not on 1.4.
I am running into the hanging issue. The code gets stuck at the all_reduce statement in SetCriterion's forward method. Not sure how to resolve this. I am simply training on the COCO dataset.
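(For context, the pattern involved looks roughly like the sketch below; this is a paraphrase, not the repo's exact code. all_reduce is a collective operation, so it blocks until every rank in the process group reaches it: if one rank is still stuck in data loading or has died, all the others wait at that line forever.)

```python
import torch
import torch.distributed as dist


def reduce_num_boxes(local_num_boxes: float, device: torch.device) -> float:
    """Average a per-rank count across all workers (paraphrased sketch).

    dist.all_reduce is a collective call: it returns only once every rank in
    the process group has called it. A single rank that never reaches this
    point (e.g. stuck in its dataloader) makes every other rank hang here.
    """
    num_boxes = torch.as_tensor([local_num_boxes], dtype=torch.float, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_boxes)
        num_boxes = num_boxes / dist.get_world_size()
    return torch.clamp(num_boxes, min=1).item()
```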
Hello, have you solved the problem? I have met this problem too. Thanks.
Hi, I am trying to run DETR on my local machine, but both the training and evaluation processes get stuck at the beginning stage, as follows:
I am using PyTorch 1.5 and torchvision 0.6, and a Faster R-CNN model can be trained on the COCO dataset without this problem.
I am wondering whether the problem may come from the DataLoader part. Could you provide some hints on this? Thanks!
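(One way to test that hypothesis is to pull a few samples directly from a COCO dataset, outside the training loop and with no DataLoader workers; if this also stalls, the problem is in data loading rather than in the model or the distributed setup. The paths below are placeholders:)

```python
import time

from torchvision.datasets import CocoDetection

# Hypothetical check (paths are placeholders): read a few samples straight from
# the dataset, with no DataLoader workers involved.
root = "/path/to/coco/train2017"
ann_file = "/path/to/coco/annotations/instances_train2017.json"
dataset = CocoDetection(root, ann_file)

start = time.time()
for i in range(10):
    img, target = dataset[i]  # image decoding happens here
    print(f"sample {i} ready after {time.time() - start:.1f}s")
```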