This repository has been archived by the owner on Mar 12, 2024. It is now read-only.

Stucked at the beginning of training #21

Closed
psu1 opened this issue May 30, 2020 · 14 comments
Comments

@psu1

psu1 commented May 30, 2020

Hi, I am trying to run DETR on my local machine, but the training process gets stuck at the very beginning, as shown in the screenshot:
[screenshot: bug]

I am using PyTorch 1.5 and torchvision 0.6. The Faster R-CNN model can be trained on the COCO dataset without problems.

I suspect the problem may come from the DataLoader part. Could you provide some hints on this? Thanks!

@fmassa
Contributor

fmassa commented May 30, 2020

Hi,

Are you using distributed training? If yes, on how many GPUs / nodes?

My first guess is that the deadlock you are facing might be due to a synchronisation issue in DistributedDataParallel, but we would need a bit more information to be sure.

@psu1
Author

psu1 commented May 31, 2020

Yes. 4 GPUs, 1 node.

I have also tried running the evaluation code with one GPU, and it gets stuck in the same place.

@fmassa
Contributor

fmassa commented May 31, 2020

What is the command that you are using to train your model? Can you run a standard CUDA code with your environment?

Without further information it is very hard to understand what is going on and to be able to give more precise help.

@psu1
Author

psu1 commented Jun 1, 2020

Yes. I can run the standard CUDA code.

I use the following training command:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --lr_drop 400 --epochs 500 --coco_path /path/to/coco

@fmassa
Contributor

fmassa commented Jun 1, 2020

@psu1 if you run your code with only

python main.py --lr_drop 400 --epochs 500 --coco_path /path/to/coco

does it also deadlock?
If yes, distributed training is probably not at fault. In that case, press CTRL+C while it is hanging to get a traceback that shows where it is spending time.

One possibility is that it is hanging in the data loading part. To be sure, you could also pass --num_workers 0 and press CTRL+C if it hangs; this will point out more precisely where the code is stuck.

Once you have this information, can you share it here so that we can debug further?
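As an alternative to pressing CTRL+C by hand, Python's standard `faulthandler` module can dump the stack of every thread after a timeout, which shows where a process is waiting. The following is only a minimal sketch: `watch_for_hang` and `slow_step` are hypothetical names that stand in for a real training loop and are not part of DETR.

```python
import faulthandler
import tempfile
import time


def watch_for_hang(timeout_s, log_file):
    """Dump all thread stacks to log_file if still running after timeout_s seconds."""
    # exit=False keeps the process alive after the dump is written.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)


def slow_step(seconds):
    """Hypothetical stand-in for a training step that blocks."""
    time.sleep(seconds)


def run_demo():
    """Arm the watchdog, block longer than the timeout, and return the dump text."""
    with tempfile.TemporaryFile(mode="w+") as log_file:
        watch_for_hang(0.2, log_file)
        slow_step(0.6)  # longer than the timeout, so a dump is written
        faulthandler.cancel_dump_traceback_later()
        log_file.seek(0)
        return log_file.read()


if __name__ == "__main__":
    print(run_demo())
```

In a real script one would more likely pass `sys.stderr` as the file and a timeout of several minutes, armed just before the training loop starts.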

@psu1
Author

psu1 commented Jun 3, 2020

Thanks!

Running python main.py --lr_drop 400 --epochs 500 --coco_path /path/to/coco still gets stuck, and pressing CTRL+C gives the following traceback:

Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/media/jaden/jaden/DeepLearningCode/data/coco2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=500, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=400, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
number of params: 41302368
loading annotations into memory...
Done (t=14.59s)
creating index...
index created!
loading annotations into memory...
Done (t=0.54s)
creating index...
index created!
Start training
Traceback (most recent call last):
  File "main.py", line 249, in <module>
    main(args)
  File "main.py", line 199, in main
    args.clip_max_norm)
  File "/media/jaden/jaden/DeepLearningCode/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/media/jaden/jaden/DeepLearningCode/detr/util/misc.py", line 223, in log_every
    for obj in iterable:
  File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/jaden/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/jaden/anaconda3/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/home/jaden/anaconda3/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

But running python main.py --lr_drop 400 --epochs 500 --coco_path /path/to/coco --num_workers 0 does not hang.
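The traceback above ends in `self._data_queue.get(timeout=timeout)`: the main process is waiting for a worker to deliver a batch that never arrives. The sketch below is a simplified standard-library model of that wait, not the actual PyTorch implementation; `get_batch`, `poll_interval_s`, and `max_wait_s` are hypothetical names. It polls a queue and raises after a bounded wait instead of blocking indefinitely.

```python
import queue
import threading
import time


def get_batch(data_queue, poll_interval_s=0.05, max_wait_s=0.3):
    """Poll data_queue until an item arrives or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            return data_queue.get(timeout=poll_interval_s)
        except queue.Empty:
            # No item yet; without a deadline this loop would never end.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no batch received within {max_wait_s} s")


def healthy_worker(data_queue):
    """Worker that delivers one batch (here just a list of indices)."""
    data_queue.put([0, 1])


if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=healthy_worker, args=(q,)).start()
    print(get_batch(q))  # the worker delivers, so a batch is returned

    try:
        get_batch(queue.Queue())  # nothing ever feeds this queue
    except TimeoutError as exc:
        print("stuck:", exc)
```

`torch.utils.data.DataLoader` accepts a `timeout` argument that plays a similar role: with a positive value, waiting for a batch from the workers raises an error after that many seconds rather than hanging silently.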

@psu1
Author

psu1 commented Jun 3, 2020

Distributed training on multiple GPUs works fine when setting --num_workers 0. It seems to be a problem with the torch DataLoader.
Thanks very much @fmassa !

psu1 closed this as completed Jun 3, 2020

@hvudeshi

hvudeshi commented Jun 8, 2020

Hello @psu1 and @fmassa,
I am running the command "python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco --epochs 25 --num_workers 0" on 8 GPUs and 1 node with my custom object detection dataset. I am getting the same error as the one shown by @psu1. How can I fix this?

PS: PyTorch version 1.5 and torchvision 0.6

@fmassa
Contributor

fmassa commented Jun 8, 2020

@hardik22317 you mean that your code gets stuck, and when you press CTRL+C it shows an error in the dataloader?

Can you try running with --num_workers 0?

EDIT: Just saw that you are running with num_workers 0, can you paste which error you are facing?

@hvudeshi

hvudeshi commented Jun 8, 2020

Command: python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /home/ubuntu/newdarknet/coco_gmr_dataset/ --epochs 25 --num_workers 0
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 4): env://
| distributed init (rank 2): env://
| distributed init (rank 6): env://
| distributed init (rank 7): env://
| distributed init (rank 0): env://
| distributed init (rank 5): env://
git:
  sha: be9d447ea3208e91069510643f75dadb7e9d163d, status: clean, branch: master

Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/home/ubuntu/newdarknet/coco_gmr_dataset/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=25, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=0, output_dir='', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
Traceback (most recent call last):
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
  File "main.py", line 248, in <module>
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    main(args)
  File "main.py", line 198, in main
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    return [self.imgs[id] for id in ids]
KeyError: 'v'
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'v'
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'r'
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'S'
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'S'
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'r'
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'r'
number of params: 41302368
loading annotations into memory...
Done (t=0.05s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Start training
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 198, in main
    args.clip_max_norm)
  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/ubuntu/detr/util/misc.py", line 222, in log_every
    for obj in iterable:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/detr/datasets/coco.py", line 24, in __getitem__
    img, target = super(CocoDetection, self).__getitem__(idx)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torchvision/datasets/coco.py", line 114, in __getitem__
    path = coco.loadImgs(img_id)[0]['file_name']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in loadImgs
    return [self.imgs[id] for id in ids]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pycocotools/coco.py", line 229, in <listcomp>
    return [self.imgs[id] for id in ids]
KeyError: 'r'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', '/home/ubuntu/newdarknet/coco_gmr_dataset/', '--epochs', '25', '--num_workers', '0']' returned non-zero exit status 1.

@fmassa
Contributor

fmassa commented Jun 8, 2020

@hardik22317 your error is different:

  File "/home/ubuntu/detr/engine.py", line 28, in train_one_epoch
    return [self.imgs[id] for id in ids]
KeyError: 'v'

most probably because there might be an issue with the path to your COCO data, or the annotations are not in the format that it expects
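The failing keys `'v'`, `'r'`, and `'S'` are single characters, which is what one would see if an image id were a string that got iterated character by character during lookup. This is only a guess from the traceback and is not confirmed in this thread. A minimal sketch for checking a COCO-style annotation file, assuming the usual `images` and `annotations` lists; `find_bad_ids` and `check_file` are hypothetical helpers, not part of DETR or pycocotools:

```python
import json


def find_bad_ids(coco):
    """Return a list of problems found in the ids of a COCO-style dict."""
    problems = []
    image_ids = set()
    for img in coco.get("images", []):
        img_id = img.get("id")
        if not isinstance(img_id, int):
            problems.append(f"image id {img_id!r} is not an int")
        image_ids.add(img_id)
    for ann in coco.get("annotations", []):
        if ann.get("image_id") not in image_ids:
            problems.append(
                f"annotation {ann.get('id')!r} refers to unknown image "
                f"{ann.get('image_id')!r}"
            )
    return problems


def check_file(path):
    """Load an annotation JSON file and report id problems."""
    with open(path) as f:
        return find_bad_ids(json.load(f))


if __name__ == "__main__":
    # Hypothetical sample with a string image id and a dangling annotation.
    sample = {
        "images": [{"id": "val_001", "file_name": "val_001.jpg"}],
        "annotations": [{"id": 1, "image_id": 7, "bbox": [0, 0, 10, 10]}],
    }
    for problem in find_bad_ids(sample):
        print(problem)
```

Running `check_file` on both the training and validation annotation files before starting a long job is a cheap way to rule this class of error in or out.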

@amsword

amsword commented Jun 21, 2020

I have a similar issue. It works on PyTorch 1.5.1, but not on 1.4.

@gulzainali98

I am running into the hanging issue. The code gets stuck at the all_reduce statement in SetCriterion's forward method. I am not sure how to resolve this. I am simply training on the COCO dataset.

@Ferry7z

Ferry7z commented Aug 3, 2023

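I am running into the hanging issue. The code gets stuck at the all_reduce statement in SetCriterion's forward method. I am not sure how to resolve this. I am simply training on the COCO dataset.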

Hello, have you solved the problem? I have run into this problem too. Thanks.
