
distributed error encountered #318

txytju opened this issue Jan 4, 2019 · 34 comments

@txytju

txytju commented Jan 4, 2019

❓ Questions and Help

I tried to use just P2-P4 of the FPN and modified only a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.

Traceback (most recent call last):
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 251, in <module>
    main()
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 244, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 153, in train
    arguments,
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 81, in do_train
    losses.backward()
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 384, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 413, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ffb97f8c180>, [[tensor([[[[0.]],

The main modification I made is in the forward function of fpn.py:

        # just use P2-P4 rather than P2-P5
        # use_P5 is bool, FPN outputs P2-P5 when use_P5==True and P2-P4 when False
        if not self.use_P5:
            results.pop()
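
For context, a toy, hypothetical sketch (not the repository's actual fpn.py) of where this kind of change sits. Note that once the last level is popped, the convolution that produced it no longer reaches the loss, so its parameters receive no gradients during backward; as later comments in this thread suggest, that is exactly the situation that trips the distributed reducer.

import torch
import torch.nn as nn

class ToyFPN(nn.Module):
    """Illustrative only -- a stripped-down FPN-like head with a use_P5 switch."""

    def __init__(self, channels=8, use_P5=True):
        super().__init__()
        self.use_P5 = use_P5
        # One output conv per pyramid level (P2-P5 in this toy example).
        self.layer_blocks = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)
        )

    def forward(self, feats):
        # feats: list of 4 feature maps, highest resolution first.
        results = [conv(f) for conv, f in zip(self.layer_blocks, feats)]
        if not self.use_P5:
            # Drop the coarsest level; the conv that produced it now has no
            # path to the loss, so its parameters end up with grad == None
            # after backward -- the condition discussed later in this thread.
            results.pop()
        return tuple(results)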
@fmassa
Contributor

fmassa commented Jan 7, 2019

Hi,

I think you could achieve something like that by just removing the last element of

POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)

so that it becomes (0.25, 0.125, 0.0625).
It might work out of the box, but I'm not 100% sure right now.

Can you try that first?
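
If it helps, a minimal sketch of that change, assuming the standard maskrcnn-benchmark/yacs config key MODEL.ROI_BOX_HEAD.POOLER_SCALES (the mask head has its own POOLER_SCALES key if you use it); the values are just the ones quoted above:

# Hedged sketch: trim the box-head pooler scales so RoI pooling only uses
# three FPN levels. Equivalent to editing the same key in the YAML config.
from maskrcnn_benchmark.config import cfg

cfg.merge_from_list([
    "MODEL.ROI_BOX_HEAD.POOLER_SCALES", (0.25, 0.125, 0.0625),
])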

@txytju
Author

txytju commented Jan 10, 2019

Should the length of POOLER_SCALES be the same as that of ANCHOR_STRIDE? And should both match the number of feature maps output by the FPN? Is that right?

@fmassa
Contributor

fmassa commented Jan 10, 2019

The length of ANCHOR_STRIDE should match the number of feature maps in the FPN.
But I believe we can limit the number of pooled feature maps by just reducing POOLER_SCALES.

Let me know if this doesn't work, I might be missing something.

@HOPEver1991

I also met this problem. Have you solved it?

@fmassa
Contributor

fmassa commented Jan 30, 2019

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

@HOPEver1991

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

Thank you for your reply !

I have a feature tensor, and I need to apply several different convolution operations to it. Similar to @txytju, the code works well on a single GPU but fails in the distributed environment. The error is as follows:

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 76, in train
    arguments,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f75aa3e0ae8>, [[tensor([[[[0.]],

@fmassa
Contributor

fmassa commented Jan 31, 2019

@HOPEver1991 maybe one of the GPUs has a different computation graph?

Having a minimal reproducible example would also help a lot in identifying the issue.

@mikigom

mikigom commented Feb 1, 2019

@fmassa Hi, I encountered the same problem. I tried to use SENet as the backbone network, implemented on top of your repo.

When the whole model is trained on a single GPU, it works totally fine.
However, with more than one GPU, the same error that @txytju and @HOPEver1991 referred to is raised.

I'm not asking you to fix the problem, but if you want to figure out the cause of this issue, I'd be happy to share my current code with you. If you need it, please let me know.

@fmassa
Contributor

fmassa commented Feb 1, 2019

@mikigom sharing the code would be very helpful, but unfortunately I might not have the time to dig too much into it in the near future.

@Lausannen

Hi, I have met the same problem with distributed training; the same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs, but it failed on backward. I will try to provide a minimal reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate it if you could provide some suggestions! Thanks!

@fmassa
Contributor

fmassa commented Feb 4, 2019

@Lausannen When you say it failed on backward, does this mean that it raised an error or was it stuck?

@Lausannen

@fmassa Thank you! Sorry for my late reply. "It failed on backward" means that it raised an error. The error info was the same as txytju's: "TypeError: _queue_reduction(): incompatible function arguments."

When I tried to solve the problem, I found something that may be helpful for you. In pytorch/pytorch#13273, they discussed DDP support, and someone suggested the NVIDIA distributed module wrapper, apex.parallel.DistributedDataParallel. I gave it a try, and the code ran successfully. I think my model may also have some layers or parameters that are not used; since I am new to deep learning, I am not sure about this. Hopefully my discovery can help you. If you need me to provide any other information, please let me know. Thank you for your quick reply again!
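
For anyone else landing here, a minimal sketch of that workaround, assuming apex is installed and the script is launched with torch.distributed.launch; build_model() is a placeholder for your own model construction, not a real API:

import argparse

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as ApexDDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

model = build_model().cuda()   # build_model() is a placeholder for your own model
model = ApexDDP(model)         # instead of torch.nn.parallel.DistributedDataParallel(model, ...)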

@mikigom

mikigom commented Feb 7, 2019

After I saw @Lausannen's reply, I tried removing all unused layers in my new backbone and it completely solved my problem (referring to pytorch/pytorch#13273). Thus, for my case, it is PyTorch's issue rather than this repo's issue. Thank you. @fmassa

@Lausannen

@mikigom Hi, thank you for your test and reply. If it does not bother you, can you tell me how to determine which layers are not used in a model?

@fmassa
Contributor

fmassa commented Feb 7, 2019

Awesome, good to know that this was the issue!

@mikigom

mikigom commented Feb 8, 2019

@Lausannen I strongly recommend that you check the forward() of your nn.Module backbone. In ordinary usage, only the parameters used in forward() are practically used by the nn.Module. Compare all declared class variables with the variables actually used in forward().

Note that nn.Module.parameters() and nn.Module.named_parameters() return all parameters registered on the nn.Module, whether they are used in forward() or not.

@Lausannen

@mikigom Thank you for your reply! I will check my code.

@chengyangfu
Contributor

I also met the same problem, and the solution is to remove the unused parameters.

Try adding the following lines to the code:

for name, param in model.named_parameters():
    print(name, param.grad is not None)

After backward(), if a parameter's grad is still None, the parameter is either frozen or not used in the forward pass.

@fmassa
Contributor

fmassa commented Feb 18, 2019

Thanks for the comment @chengyangfu!

This is indeed a problem, and apparently one potential solution is also to switch to apex DDP, as discussed in pytorch/pytorch#13273.

@xllau

xllau commented Feb 20, 2019

I have also met this problem, and I am trying to reconfigure the environment, but it does not work. Thanks all. And I have another question: can anyone provide a new version of the multi-GPU training code without the deprecated code?

@fmassa
Contributor

fmassa commented Feb 20, 2019

@xllau the new version of the codebase uses the new distributed backend of PyTorch

@xllau

xllau commented Feb 21, 2019

Hi, I have found a solution. I installed the PyTorch nightly with version pytorch-nightly 1.0.0.dev20190207 py3.6_cuda9.0.176_cudnn7.4.2_0; this does not work. I have also tried Python 3.6.1, 3.6.3, 3.6.5, and 3.7.x, and none of them work. Finally, I arrived at the following config, which works:
python 3.6.8 h0371630_0 defaults
pytorch-nightly 1.0.0.dev20190128 py3.6_cuda9.0.176_cudnn7.4.1_0 pytorch
I have a full conda-env config at this link: https://github.com/xllau/maskrcnn-benchmark/blob/master/conda_mb.yaml . Download it and run conda create --file ./conda_mb.yaml, and it will be installed automatically.

@fmassa
Contributor

fmassa commented Feb 22, 2019

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

@xllau

xllau commented Mar 4, 2019

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

Yes, version compatibility is something nasty!

@fmassa
Contributor

fmassa commented Mar 5, 2019

@xllau and the error you get with a recent PyTorch is exactly the same one as in the description of this issue?

@moinnadeem

Does anyone know the performance impact of figuring out which gradients are zero, and setting those to be not trainable?

The problem is that I do multi-task training, so the tasks that aren't being trained at the time are unused parts of the model. Is this an acceptable patch?

@densechen

I met the same issue. By removing the unused layers, everything works well. However, I need to train the model with these layers in some epochs and skip them in other epochs to get a better-trained model. Is there a better coding method to remove and add the layers dynamically?
Thanks in advance.

@Lausannen

@LittleLampChen I recommend you use NVIDIA Apex to wrap your model, setting delay_allreduce=True. In this mode, Apex can collect the variables whose gradients should be computed after each epoch finishes, and it can adjust the computation graph each epoch. I think this can help with your situation.
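
A minimal sketch of that flag (my understanding is that delay_allreduce=True makes apex postpone gradient communication to the end of the backward pass, which tolerates the set of used parameters changing between iterations); build_model() is a placeholder:

from apex.parallel import DistributedDataParallel as ApexDDP

model = build_model().cuda()                 # build_model() is a placeholder
# delay_allreduce=True defers the gradient all-reduce to the end of backward(),
# instead of overlapping it bucket-by-bucket, so a graph that changes between
# iterations (or has unused parameters) is tolerated.
model = ApexDDP(model, delay_allreduce=True)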

@densechen

@Lausannen Thank you very much! I will have a try.

@chengyangfu
Contributor

Hi @LittleLampChen,
Another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well as a temporary fix. First, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.
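
A hedged sketch of that trick; the loss names and the active_tasks set here are made up for illustration:

def reduce_losses(loss_dict, active_tasks):
    """Sum all losses, zeroing out the ones not trained this iteration.

    Multiplying by 0 (instead of dropping the term) keeps every head in the
    autograd graph, so distributed training still sees a gradient for every
    parameter.
    """
    return sum(
        loss if name in active_tasks else loss * 0.0
        for name, loss in loss_dict.items()
    )

# Usage inside the training loop (loss_dict comes from the model's forward):
#   losses = reduce_losses(loss_dict, active_tasks={"loss_cls"})
#   losses.backward()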

@densechen

@chengyangfu This may be a better way.

@linhuaiyuan

@fmassa I also met the same problem. Can you help me? Thank you.
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 60, in train
    start_iter=arguments["iteration"],
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 154, in make_data_loader
    datasets = build_dataset(dataset_list, transforms, DatasetCatalog, is_train)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 44, in build_dataset
    dataset = factory(**args)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\datasets\coco.py", line 43, in __init__
    super(COCODataset, self).__init__(root, ann_file)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\torchvision\datasets\coco.py", line 97, in __init__
    self.coco = COCO(annFile)
  File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\pycocotools\coco.py", line 85, in __init__
    dataset = json.load(open(annotation_file, 'r'))
FileNotFoundError: [Errno 2] No such file or directory: 'datasets\coco/annotations/instances_train2017.json'

@chenjoya
Contributor

chenjoya commented Jun 7, 2019

Hi @LittleLampChen,
Another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well as a temporary fix. First, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.

Yeah. Also, we can return the zeroed loss, e.g. return loss.zero_()

@samson-wang

@fmassa I've implemented multi-GPU training code in my project using only the torch.distributed.all_reduce function. In some cases, even when tensor.requires_grad is True, tensor.grad is None. The all-reduce should not be applied to such tensors.

I think this is what leads to the error above.

TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

The List[List[at::Tensor]] requirement breaks because a None grad gets involved. So should the torch/nn/parallel/distributed.py package handle the None-grad case?
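
For illustration, a minimal sketch of a manual gradient all-reduce that skips parameters whose grad is None (averaging by world size is just one common convention, not necessarily what any particular project does):

import torch.distributed as dist

def allreduce_gradients(model, world_size):
    """Average gradients across ranks, skipping frozen/unused parameters."""
    for param in model.parameters():
        # Parameters that were frozen or never used in forward() still have
        # grad == None after backward(); skip them rather than reducing them.
        if param.grad is None:
            continue
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data.div_(world_size)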
