how to use multi-gpus for training #5
We trained our networks on a single GPU.
Can I shorten the training time by reducing the number of epochs? 3000 epochs seems too long to finish.
You can design a new training schedule, but it may degrade performance. If you have enough RAM, loading all images into a list up front might reduce the training time.
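The caching suggestion above can be sketched as a thin wrapper dataset. This is a hypothetical illustration (not the repository's own loader): it eagerly materializes every sample of an underlying dataset into a Python list so that epochs after the first skip disk I/O and decoding entirely.

```python
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Keep every decoded sample of `base_dataset` in RAM.

    Trades memory for speed: useful when the whole training set
    fits in host memory, as suggested in the thread above.
    """
    def __init__(self, base_dataset):
        # Eagerly pull every sample once; later epochs read from this list.
        self.samples = [base_dataset[i] for i in range(len(base_dataset))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```

Any indexable dataset (including a plain list of tensors) can be wrapped this way before being handed to a `DataLoader`.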
It seems that not all input image shapes are suitable for the model. What shape constraints apply to the input images?
I don't fully understand your question. At the least, the width and height of input images must be multiples of 4 (because of the pooling and upsampling).
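The multiple-of-4 constraint can be handled at inference time by padding the input and cropping the output back. A minimal sketch (the helper name and reflect-padding choice are my own, not from the repository):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=4):
    """Reflect-pad an NCHW tensor so H and W become multiples of `multiple`.

    Returns the padded tensor and the original (h, w), so the network
    output can be cropped back with out[..., :h, :w].
    """
    _, _, h, w = x.shape
    ph = (multiple - h % multiple) % multiple  # extra rows needed
    pw = (multiple - w % multiple) % multiple  # extra columns needed
    x = F.pad(x, (0, pw, 0, ph), mode="reflect")
    return x, (h, w)

x = torch.randn(1, 3, 719, 1281)       # neither dimension divisible by 4
xp, (h, w) = pad_to_multiple(x)        # xp has shape (1, 3, 720, 1284)
```

After running the model on `xp`, crop the prediction back to `(h, w)` so the result matches the original image size.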
Thanks for the reply. In my experiments I find that the MSFR loss is much larger than the L1 loss; is that expected?
I trained the model on a custom dataset at 1080p resolution. The model works well when the test image is 1080p or smaller, but it fails when the test image is 2K or larger. It seems the model is not scale-invariant; could you give some suggestions? Thanks.
Yes, so we set lambda = 0.1.
In my experience, 2K images can be tested.
Sorry, maybe I misunderstood you. My problem is that performance on 2K test images is largely degraded.
Yes, it can. I think simply feeding a higher-resolution image will not work, because the checkpoint we provide was trained only on images of size 1280×720.
That's right. Would it be advisable to randomly resize input images to a larger resolution in the dataloader during training?
I'm not sure but it can work.
Adding random resizing did bring an improvement.
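The random-resize augmentation discussed above might look like the following sketch. The scale set and function name are illustrative assumptions, not the exact recipe used in the thread; the key point is that the blurred input and sharp target must be resized by the same factor.

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pair(blur, sharp, scales=(1.0, 1.5, 2.0)):
    """Randomly upscale a CHW blurred/sharp training pair by one shared
    factor, so the network sees blur kernels at multiple spatial scales."""
    s = random.choice(scales)
    if s != 1.0:
        # F.interpolate expects a batch dimension, hence unsqueeze/squeeze.
        blur = F.interpolate(blur.unsqueeze(0), scale_factor=s,
                             mode="bilinear", align_corners=False).squeeze(0)
        sharp = F.interpolate(sharp.unsqueeze(0), scale_factor=s,
                              mode="bilinear", align_corners=False).squeeze(0)
    return blur, sharp
```

In practice one would follow this with a fixed-size random crop so every batch still has uniform dimensions.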
When I use nn.DataParallel(model), I get the following error:
Traceback (most recent call last):
File "main.py", line 67, in
main(args)
File "main.py", line 31, in main
_train(model, args)
File "MIMO-UNet/train.py", line 53, in _train
pred_img = model(input_img)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices
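The "inputs must be on unique devices" error typically means the replica list maps two logical ids onto the same physical GPU (for example, when only one GPU is actually visible, or `device_ids` contains duplicates). A hedged sketch of the usual multi-GPU setup; `nn.Linear` stands in for the MIMO-UNet model, which the repository does not wrap this way itself:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the actual MIMO-UNet model

if torch.cuda.device_count() > 1:
    # Parameters must start on the first device of device_ids,
    # and each id must refer to a distinct physical GPU.
    model = model.cuda(0)
    model = nn.DataParallel(
        model, device_ids=list(range(torch.cuda.device_count())))
elif torch.cuda.is_available():
    model = model.cuda()
```

It is also worth checking `CUDA_VISIBLE_DEVICES`: if it exposes fewer GPUs than the `device_ids` you pass, several replicas collapse onto one device and trigger exactly this RuntimeError.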