
How to use multiple GPUs for training #5

Closed
StephanPan opened this issue Sep 6, 2021 · 13 comments

@StephanPan

StephanPan commented Sep 6, 2021

When I use nn.DataParallel(model), I get the following error:
Traceback (most recent call last):
File "main.py", line 67, in <module>
main(args)
File "main.py", line 31, in main
_train(model, args)
File "MIMO-UNet/train.py", line 53, in _train
pred_img = model(input_img)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices

@chosj95
Owner

chosj95 commented Sep 6, 2021

When I use nn.DataParallel(model), I get the following error: [...] RuntimeError: inputs must be on unique devices

We trained our networks on a single GPU, so we haven't applied that function ourselves.
If you solve the problem, could you share the code?
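
For anyone else hitting this, here is a minimal sketch of the usual nn.DataParallel pattern, not the repository's own code: the import and the `build_net` call are assumptions and may need adjusting. The error text itself says the target devices must be unique, so make sure each GPU index appears only once in device_ids:

```python
import torch
import torch.nn as nn

# Assumed import; replace with the repository's actual model builder.
from models.MIMOUNet import build_net

model = build_net('MIMO-UNet').cuda()  # put the base model on the default GPU first

# Wrap only when more than one GPU is visible, and list each GPU index exactly once;
# a duplicated device list (e.g. [0, 0]) triggers
# "RuntimeError: inputs must be on unique devices".
if torch.cuda.device_count() > 1:
    device_ids = list(range(torch.cuda.device_count()))  # e.g. [0, 1]
    model = nn.DataParallel(model, device_ids=device_ids)
```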

@StephanPan
Author

Can I shorten the training time by reducing the number of epochs? 3,000 epochs seems too long to finish.

@chosj95
Owner

chosj95 commented Sep 7, 2021

Can I shorten the training time by reducing the number of epochs? 3,000 epochs seems too long to finish.

You can design a new training schedule, but it may degrade performance.
When I trained our network for 1,000 epochs, the performance was below that of the original model.

If you have enough RAM, loading all the images once and keeping them in a list may shorten the training time.
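
A minimal sketch of that kind of in-memory caching, written as a hypothetical wrapper around an existing Dataset (not part of the repository); if the base dataset applies random crops or flips inside __getitem__, cache the decoded full images instead so augmentation stays random:

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wrap an existing dataset and keep every sample in RAM after the first
    access, so disk reads and image decoding happen only once per sample.
    Note: with DataLoader(num_workers > 0) each worker holds its own copy."""

    def __init__(self, base_dataset):
        self.base = base_dataset
        self.cache = [None] * len(base_dataset)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if self.cache[idx] is None:
            self.cache[idx] = self.base[idx]  # load once, reuse afterwards
        return self.cache[idx]
```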

@chosj95 chosj95 closed this as completed Sep 7, 2021
@StephanPan
Author

It seems that not every input image shape works with the model. What shape constraints should the input image satisfy?

@chosj95
Owner

chosj95 commented Sep 16, 2021

It seems that not every input image shape works with the model. What shape constraints should the input image satisfy?

I don't fully understand your question.
The part of the code that loads images and converts them to tensors is here.

At a minimum, the width and height of the input images must be multiples of 4 (because of the pooling and upsampling).
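
One way to satisfy that constraint at test time, as a minimal sketch (hypothetical helper, not from the repository): reflect-pad the input up to the next multiple of 4, run the network, and crop the prediction back:

```python
import torch.nn.functional as F

def run_padded(model, x):
    """Reflect-pad an NCHW tensor so H and W are multiples of 4, run the
    model, and crop the prediction back to the original size."""
    _, _, h, w = x.shape
    pad_h = (4 - h % 4) % 4
    pad_w = (4 - w % 4) % 4
    x = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')
    pred = model(x)
    if isinstance(pred, (list, tuple)):  # assumption: the last output is the full-resolution one
        pred = pred[-1]
    return pred[..., :h, :w]
```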

@StephanPan
Author

Thanks for the reply. In my experiments, I find that the MSFR loss is much larger than the L1 loss. Is that expected?

@StephanPan
Author

I trained the model on a custom dataset where the images are 1080p. The model works well when the test image is 1080p or smaller, but it fails when the test image is 2K or larger. It seems the model is not scale-invariant. Could you give some suggestions? Thanks.

@chosj95
Owner

chosj95 commented Nov 16, 2021

Thanks for the reply. In my experiments, I find that the MSFR loss is much larger than the L1 loss. Is that expected?

Yes, that is why we set lambda = 0.1.
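
For context, a minimal sketch of how such a weighting is typically combined, assuming the frequency term is an L1 distance in the FFT domain (hypothetical code, not the repository's loss implementation):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, gt, lam=0.1):
    """Sketch: pixel-domain L1 plus a lambda-weighted frequency-domain L1.
    The frequency term (L1 distance between the 2D FFTs of prediction and
    target) is typically much larger in magnitude, hence the small weight."""
    content = F.l1_loss(pred, gt)
    freq = (torch.fft.fft2(pred) - torch.fft.fft2(gt)).abs().mean()
    return content + lam * freq
```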

I trained the model on a custom dataset where the images are 1080p. The model works well when the test image is 1080p or smaller, but it fails when the test image is 2K or larger. It seems the model is not scale-invariant. Could you give some suggestions? Thanks.

In my experience, 2K images can be tested.
Could you send me the error message?

@StephanPan
Author

Sorry, maybe I misunderstood you. My problem is that the performance on 2K test images is largely degraded.

@chosj95
Owner

chosj95 commented Nov 16, 2021

Sorry, maybe I misunderstood you. My problem is that the performance on 2K test images is largely degraded.

Yes, that can happen. I think simply feeding higher-resolution images will not work, because the checkpoint we provide was trained only on 1280 x 720 images.

@StephanPan
Author

That's right. Would it be advisable to randomly resize the input image to a larger resolution in the dataloader during training?

@chosj95
Owner

chosj95 commented Nov 17, 2021

I'm not sure, but it could work.
If you try it, please let me know the result.
Thanks.
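
A minimal sketch of such a random-resize step in the dataloader, with hypothetical function and parameter names (the scale set is an example, not a recommendation from the authors):

```python
import random
import torch.nn.functional as F

def random_resize_pair(blur, sharp, scales=(1.0, 1.25, 1.5)):
    """Resize a blurred/sharp NCHW pair by the same randomly chosen factor,
    so the network sees blur patterns at more than one spatial scale."""
    s = random.choice(scales)
    if s != 1.0:
        blur = F.interpolate(blur, scale_factor=s, mode='bilinear', align_corners=False)
        sharp = F.interpolate(sharp, scale_factor=s, mode='bilinear', align_corners=False)
    return blur, sharp
```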

@StephanPan
Author

I'm not sure, but it could work. If you try it, please let me know the result. Thanks.

Adding random resize did bring an improvement.
By the way, I've run into another problem: the failure cases (such as distortion and artifacts) are especially severe when the blur is heavy. Are there any suggestions for alleviating this?
