
distributed error encountered #318

txytju opened this issue Jan 4, 2019 · 34 comments

@txytju

txytju commented Jan 4, 2019

❓ Questions and Help

I tried to use just P2-P4 of the FPN and modified only a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.

Traceback (most recent call last):
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 251, in <module>
    main()
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 244, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 153, in train
    arguments,
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 81, in do_train
    losses.backward()
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 384, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 413, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ffb97f8c180>, [[tensor([[[[0.]],

The main modification I made is in the forward function of fpn.py:

        # just use P2-P4 rather than P2-P5
        # use_P5 is bool, FPN outputs P2-P5 when use_P5==True and P2-P4 when False
        if not self.use_P5:
            results.pop()
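
For context, a toy, hypothetical sketch (not the repository's actual fpn.py) of where this kind of change sits. Note that once the last level is popped, the convolution that produced it no longer reaches the loss, so its parameters receive no gradients during backward; as later comments in this thread suggest, that is exactly the situation that trips the distributed reducer.

import torch
import torch.nn as nn

class ToyFPN(nn.Module):
    """Illustrative only -- a stripped-down FPN-like head with a use_P5 switch."""

    def __init__(self, channels=8, use_P5=True):
        super().__init__()
        self.use_P5 = use_P5
        # One output conv per pyramid level (P2-P5 in this toy example).
        self.layer_blocks = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)
        )

    def forward(self, feats):
        # feats: list of 4 feature maps, highest resolution first.
        results = [conv(f) for conv, f in zip(self.layer_blocks, feats)]
        if not self.use_P5:
            # Drop the coarsest level; the conv that produced it now has no
            # path to the loss, so its parameters end up with grad == None
            # after backward -- the condition discussed later in this thread.
            results.pop()
        return tuple(results)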
@fmassa
Contributor

fmassa commented Jan 7, 2019

Hi,

I think you could achieve something like that by just removing the last element of

POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)

so that it becomes (0.25, 0.125, 0.0625).
It might work out of the box, but I'm not 100% sure right now.

Can you try that first?
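
If it helps, a minimal sketch of that change, assuming the standard maskrcnn-benchmark/yacs config key MODEL.ROI_BOX_HEAD.POOLER_SCALES (the mask head has its own POOLER_SCALES key if you use it); the values are just the ones quoted above:

# Hedged sketch: trim the box-head pooler scales so RoI pooling only uses
# three FPN levels. Equivalent to editing the same key in the YAML config.
from maskrcnn_benchmark.config import cfg

cfg.merge_from_list([
    "MODEL.ROI_BOX_HEAD.POOLER_SCALES", (0.25, 0.125, 0.0625),
])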

@txytju
Author

txytju commented Jan 10, 2019

Should the length of POOLER_SCALES be the same as that of ANCHOR_STRIDE? And should both match the number of feature maps output by the FPN? Is that right?

@fmassa
Contributor

fmassa commented Jan 10, 2019

The length of ANCHOR_STRIDE should match the number of feature maps in the FPN.
But I believe we can limit the number of pooled feature maps by just reducing POOLER_SCALES.

Let me know if this doesn't work, I might be missing something.

@HOPEver1991

I also met this problem. Have you solved it?

@fmassa
Contributor

fmassa commented Jan 30, 2019

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

@HOPEver1991

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

Thank you for your reply !

I have a feature tensor, and I need to apply several different convolution operations to it. Similar to @txytju, the code works well on a single GPU but fails in the distributed environment. The error is as follows:

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 76, in train
    arguments,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f75aa3e0ae8>, [[tensor([[[[0.]],

@fmassa
Contributor

fmassa commented Jan 31, 2019

@HOPEver1991 maybe one of the GPUs has a different computation graph?

Having a minimal reproducible example would also help a lot in identifying the issue.

@mikigom

mikigom commented Feb 1, 2019

@fmassa Hi, I encountered the same problem. I tried to use SENet as the backbone network, implemented on top of your repo.

When the whole model is trained on a single GPU, it works totally fine.
However, with more than one GPU, the same error that @txytju and @HOPEver1991 referred to is raised.

I'm not asking you to fix the problem, but if you want to figure out the cause of this issue, I'd be happy to share my current code with you. If you need it, please let me know.

@fmassa
Contributor

fmassa commented Feb 1, 2019

@mikigom sharing the code would be very helpful, but unfortunately I might not have the time to dig too much into it in the near future.

@Lausannen

Hi, I have met the same problem with distributed training; the same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs, but it failed on backward. I will try to provide a minimal reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate it if you could provide some suggestions! Thanks!

@fmassa
Contributor

fmassa commented Feb 4, 2019

@Lausannen When you say it failed on backward, does this mean that it raised an error or was it stuck?

@Lausannen

@fmassa Thank you! Sorry for my late reply. "It failed on backward" means that it raised an error. The error info was the same as txytju's: "TypeError: _queue_reduction(): incompatible function arguments."

When I tried to solve the problem, I found something that may be helpful for you. In pytorch/pytorch#13273, they discussed DDP support, and someone suggested the NVIDIA distributed module wrapper, apex.parallel.DistributedDataParallel. I gave it a try, and the code ran successfully. I think my model may also have some layers or parameters that are not used; since I am new to deep learning, I am not sure about this. Hopefully my discovery can help you. If you need me to provide any other information, please let me know. Thank you for your quick reply again!
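
For anyone else landing here, a minimal sketch of that workaround, assuming apex is installed and the script is launched with torch.distributed.launch; build_model() is a placeholder for your own model construction, not a real API:

import argparse

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as ApexDDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

model = build_model().cuda()   # build_model() is a placeholder for your own model
model = ApexDDP(model)         # instead of torch.nn.parallel.DistributedDataParallel(model, ...)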

@mikigom

mikigom commented Feb 7, 2019

After I saw @Lausannen's reply, I tried removing all unused layers in my new backbone and it completely solved my problem (referring to pytorch/pytorch#13273). Thus, for my case, it is PyTorch's issue rather than this repo's issue. Thank you. @fmassa

@Lausannen

@mikigom Hi, thank you for your test and reply. If it does not bother you, can you tell me how to determine which layers are not used in a model?

@fmassa
Contributor

fmassa commented Feb 7, 2019

Awesome, good to know that this was the issue!

@mikigom

mikigom commented Feb 8, 2019

@Lausannen I strongly recommend that you check the forward() of your nn.Module backbone. In ordinary usage, only the parameters used in forward() are practically used by the nn.Module. Compare all declared class variables with the variables actually used in forward().

Note that nn.Module.parameters() and nn.Module.named_parameters() return all parameters registered on the nn.Module, whether they are used in forward() or not.

@Lausannen

@mikigom Thank you for your reply! I will check my code.

@chengyangfu
Contributor

I also met the same problem, and the solution is to remove the unused parameters.

Try adding the following lines to the code:

for name, param in model.named_parameters():
    print(name, param.grad is not None)

After backward(), if a parameter's grad is still None, the parameter is either frozen or not used in the forward pass.

@fmassa
Contributor

fmassa commented Feb 18, 2019

Thanks for the comment @chengyangfu!

This is indeed a problem, and apparently one potential solution is also to switch to apex DDP, as discussed in pytorch/pytorch#13273.

@xllau

xllau commented Feb 20, 2019

I have also met this problem, and I am trying to reconfigure the environment, but it does not work. Thanks all. And I have another question: can anyone provide a new version of the multi-GPU training code without the deprecated code?

@fmassa
Contributor

fmassa commented Feb 20, 2019

@xllau the new version of the codebase uses the new distributed backend of PyTorch

@xllau

xllau commented Feb 21, 2019

Hi, I have found a solution. I installed the PyTorch nightly with version pytorch-nightly 1.0.0.dev20190207 py3.6_cuda9.0.176_cudnn7.4.2_0; this does not work. I have also tried Python 3.6.1, 3.6.3, 3.6.5, and 3.7.x, and none of them work. Finally, I arrived at the following config, which works:
python 3.6.8 h0371630_0 defaults
pytorch-nightly 1.0.0.dev20190128 py3.6_cuda9.0.176_cudnn7.4.1_0 pytorch
I have a full conda-env config at this link: https://github.com/xllau/maskrcnn-benchmark/blob/master/conda_mb.yaml . Download it and run conda create --file ./conda_mb.yaml, and it will be installed automatically.

@fmassa
Contributor

fmassa commented Feb 22, 2019

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

@xllau

xllau commented Mar 4, 2019

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

Yes, version compatibility is something nasty!

@fmassa
Contributor

fmassa commented Mar 5, 2019

@xllau and the error you get with a recent PyTorch is exactly the same one as in the description of this issue?

@moinnadeem

Does anyone know the performance impact of figuring out which gradients are zero, and setting those to be not trainable?

The problem is that I do multi-task training, so the tasks that aren't being trained at the time are unused parts of the model. Is this an acceptable patch?

@densechen

I met the same issue. By removing the unused layers, everything works well. However, I need to train the model with these layers in some epochs and skip them in other epochs to get a better-trained model. Is there a better coding method to remove and add the layers dynamically?
Thanks in advance.

@Lausannen

@LittleLampChen I recommend you use NVIDIA Apex to wrap your model, setting delay_allreduce=True. In this mode, Apex can collect the variables whose gradients should be computed after each epoch finishes, and it can adjust the computation graph each epoch. I think this can help with your situation.
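
A minimal sketch of that flag (my understanding is that delay_allreduce=True makes apex postpone gradient communication to the end of the backward pass, which tolerates the set of used parameters changing between iterations); build_model() is a placeholder:

from apex.parallel import DistributedDataParallel as ApexDDP

model = build_model().cuda()                 # build_model() is a placeholder
# delay_allreduce=True defers the gradient all-reduce to the end of backward(),
# instead of overlapping it bucket-by-bucket, so a graph that changes between
# iterations (or has unused parameters) is tolerated.
model = ApexDDP(model, delay_allreduce=True)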

@densechen

@Lausannen Thank you very much! I will have a try.

@chengyangfu
Contributor

Hi @LittleLampChen,
Another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well as a temporary fix. First, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.
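
A hedged sketch of that trick; the loss names and the active_tasks set here are made up for illustration:

def reduce_losses(loss_dict, active_tasks):
    """Sum all losses, zeroing out the ones not trained this iteration.

    Multiplying by 0 (instead of dropping the term) keeps every head in the
    autograd graph, so distributed training still sees a gradient for every
    parameter.
    """
    return sum(
        loss if name in active_tasks else loss * 0.0
        for name, loss in loss_dict.items()
    )

# Usage inside the training loop (loss_dict comes from the model's forward):
#   losses = reduce_losses(loss_dict, active_tasks={"loss_cls"})
#   losses.backward()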

@densechen

@chengyangfu This may be a better way.

@linhuaiyuan

@fmassa I also met the same problem. Can you help me? Thank you.
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 60, in train
    start_iter=arguments["iteration"],
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 154, in make_data_loader
    datasets = build_dataset(dataset_list, transforms, DatasetCatalog, is_train)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 44, in build_dataset
    dataset = factory(**args)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\datasets\coco.py", line 43, in __init__
    super(COCODataset, self).__init__(root, ann_file)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\torchvision\datasets\coco.py", line 97, in __init__
    self.coco = COCO(annFile)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\pycocotools\coco.py", line 85, in __init__
    dataset = json.load(open(annotation_file, 'r'))
FileNotFoundError: [Errno 2] No such file or directory: 'datasets\coco/annotations/instances_train2017.json'

@chenjoya
Contributor

chenjoya commented Jun 7, 2019

Hi @LittleLampChen,
Another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well as a temporary fix. First, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.

Yeah. Also, we can return the zeroed loss, e.g. return loss.zero_()

@samson-wang

@fmassa I've implemented multi-GPU training code in my project using only the torch.distributed.all_reduce function. In some cases, even when tensor.requires_grad is True, tensor.grad is None. The all-reduce should not be applied to such tensors.

I think this is what leads to the error above.

TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

The List[List[at::Tensor]] requirement breaks because a None grad gets involved. So should the torch/nn/parallel/distributed.py package handle the None-grad case?
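
For illustration, a minimal sketch of a manual gradient all-reduce that skips parameters whose grad is None (averaging by world size is just one common convention, not necessarily what any particular project does):

import torch.distributed as dist

def allreduce_gradients(model, world_size):
    """Average gradients across ranks, skipping frozen/unused parameters."""
    for param in model.parameters():
        # Parameters that were frozen or never used in forward() still have
        # grad == None after backward(); skip them rather than reducing them.
        if param.grad is None:
            continue
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data.div_(world_size)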
