
How to use multiple GPUs for training #5

Closed
StephanPan opened this issue Sep 6, 2021 · 13 comments

@StephanPan

StephanPan commented Sep 6, 2021

When I use nn.DataParallel(model), I get the following error:
Traceback (most recent call last):
File "main.py", line 67, in <module>
main(args)
File "main.py", line 31, in main
_train(model, args)
File "MIMO-UNet/train.py", line 53, in _train
pred_img = model(input_img)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices

@chosj95
Owner

chosj95 commented Sep 6, 2021

When I use nn.DataParallel(model), I get the following error: [...] RuntimeError: inputs must be on unique devices

We trained our networks on a single GPU, so we haven't applied that function ourselves.
If you solve the problem, could you share the code?
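
For anyone else hitting this, here is a minimal sketch of the usual nn.DataParallel pattern, not the repository's own code: the import and the `build_net` call are assumptions and may need adjusting. The error text itself says the target devices must be unique, so make sure each GPU index appears only once in device_ids:

```python
import torch
import torch.nn as nn

# Assumed import; replace with the repository's actual model builder.
from models.MIMOUNet import build_net

model = build_net('MIMO-UNet').cuda()  # put the base model on the default GPU first

# Wrap only when more than one GPU is visible, and list each GPU index exactly once;
# a duplicated device list (e.g. [0, 0]) triggers
# "RuntimeError: inputs must be on unique devices".
if torch.cuda.device_count() > 1:
    device_ids = list(range(torch.cuda.device_count()))  # e.g. [0, 1]
    model = nn.DataParallel(model, device_ids=device_ids)
```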

@StephanPan
Author

Can I shorten the training time by reducing the number of epochs? 3,000 epochs seems too long to finish.

@chosj95
Owner

chosj95 commented Sep 7, 2021

Can I shorten the training time by reducing the number of epochs? 3,000 epochs seems too long to finish.

You can design a new training schedule, but it may degrade performance.
When I trained our network for 1,000 epochs, the performance was below that of the original model.

If you have enough RAM, loading all the images once and keeping them in a list may shorten the training time.
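
A minimal sketch of that kind of in-memory caching, written as a hypothetical wrapper around an existing Dataset (not part of the repository); if the base dataset applies random crops or flips inside __getitem__, cache the decoded full images instead so augmentation stays random:

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wrap an existing dataset and keep every sample in RAM after the first
    access, so disk reads and image decoding happen only once per sample.
    Note: with DataLoader(num_workers > 0) each worker holds its own copy."""

    def __init__(self, base_dataset):
        self.base = base_dataset
        self.cache = [None] * len(base_dataset)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if self.cache[idx] is None:
            self.cache[idx] = self.base[idx]  # load once, reuse afterwards
        return self.cache[idx]
```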

@chosj95 chosj95 closed this as completed Sep 7, 2021
@StephanPan
Author

It seems that not every input image shape works with the model. What shape constraints should the input image satisfy?

@chosj95
Owner

chosj95 commented Sep 16, 2021

It seems that not every input image shape works with the model. What shape constraints should the input image satisfy?

I don't fully understand your question.
The part of the code that loads images and converts them to tensors is here.

At a minimum, the width and height of the input images must be multiples of 4 (because of the pooling and upsampling).
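
One way to satisfy that constraint at test time, as a minimal sketch (hypothetical helper, not from the repository): reflect-pad the input up to the next multiple of 4, run the network, and crop the prediction back:

```python
import torch.nn.functional as F

def run_padded(model, x):
    """Reflect-pad an NCHW tensor so H and W are multiples of 4, run the
    model, and crop the prediction back to the original size."""
    _, _, h, w = x.shape
    pad_h = (4 - h % 4) % 4
    pad_w = (4 - w % 4) % 4
    x = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')
    pred = model(x)
    if isinstance(pred, (list, tuple)):  # assumption: the last output is the full-resolution one
        pred = pred[-1]
    return pred[..., :h, :w]
```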

@StephanPan
Author

Thanks for the reply. In my experiments, I find that the MSFR loss is much larger than the L1 loss. Is that expected?

@StephanPan
Author

I trained the model on a custom dataset where the images are 1080p. The model works well when the test image is 1080p or smaller, but it fails when the test image is 2K or larger. It seems the model is not scale-invariant. Could you give some suggestions? Thanks.

@chosj95
Owner

chosj95 commented Nov 16, 2021

Thanks for the reply. In my experiments, I find that the MSFR loss is much larger than the L1 loss. Is that expected?

Yes, that is why we set lambda = 0.1.
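
For context, a minimal sketch of how such a weighting is typically combined, assuming the frequency term is an L1 distance in the FFT domain (hypothetical code, not the repository's loss implementation):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, gt, lam=0.1):
    """Sketch: pixel-domain L1 plus a lambda-weighted frequency-domain L1.
    The frequency term (L1 distance between the 2D FFTs of prediction and
    target) is typically much larger in magnitude, hence the small weight."""
    content = F.l1_loss(pred, gt)
    freq = (torch.fft.fft2(pred) - torch.fft.fft2(gt)).abs().mean()
    return content + lam * freq
```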

I trained the model on a custom dataset where the images are 1080p. The model works well when the test image is 1080p or smaller, but it fails when the test image is 2K or larger. It seems the model is not scale-invariant. Could you give some suggestions? Thanks.

In my experience, 2K images can be tested.
Could you send me the error message?

@StephanPan
Author

Sorry, maybe I misunderstood you. My problem is that the performance on 2K test images is largely degraded.

@chosj95
Owner

chosj95 commented Nov 16, 2021

Sorry, maybe I misunderstood you. My problem is that the performance on 2K test images is largely degraded.

Yes, that can happen. I think simply feeding higher-resolution images will not work, because the checkpoint we provide was trained only on 1280 x 720 images.

@StephanPan
Author

That's right. Would it be advisable to randomly resize the input image to a larger resolution in the dataloader during training?

@chosj95
Owner

chosj95 commented Nov 17, 2021

I'm not sure, but it could work.
If you try it, please let me know the result.
Thanks.
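
A minimal sketch of such a random-resize step in the dataloader, with hypothetical function and parameter names (the scale set is an example, not a recommendation from the authors):

```python
import random
import torch.nn.functional as F

def random_resize_pair(blur, sharp, scales=(1.0, 1.25, 1.5)):
    """Resize a blurred/sharp NCHW pair by the same randomly chosen factor,
    so the network sees blur patterns at more than one spatial scale."""
    s = random.choice(scales)
    if s != 1.0:
        blur = F.interpolate(blur, scale_factor=s, mode='bilinear', align_corners=False)
        sharp = F.interpolate(sharp, scale_factor=s, mode='bilinear', align_corners=False)
    return blur, sharp
```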

@StephanPan
Author

I'm not sure, but it could work. If you try it, please let me know the result. Thanks.

Adding random resize did bring an improvement.
By the way, I've run into another problem: the failure cases (such as distortion and artifacts) are especially severe when the blur is heavy. Are there any suggestions for alleviating this?
