CUDA out of memory during training before evaluation #9

Closed
hiyyg opened this issue May 2, 2020 · 6 comments

hiyyg commented May 2, 2020

Hi @m-niemeyer, I tried to train with configs/single_view_reconstruction/multi_view_supervision/ours_combined.yaml, and I have reduced the batch size for both training and testing to 16. However, during training, every time right before evaluation, this runtime error occurs:

```
RuntimeError: CUDA out of memory. Tried to allocate 7.06 GiB (GPU 0; 10.76 GiB total capacity; 7.46 GiB already allocated; 2.06 GiB free; 7.83 GiB reserved in total by PyTorch)
```

What could be the problem?

hiyyg commented May 2, 2020

I can only run evaluation with a batch size <= 2. Why does evaluation cost so much memory?

m-niemeyer (Collaborator) commented

Hi @hiyyg, thanks a lot for your interest!
First, the configs for the large models are optimized for GPUs with 32 GB of memory. Here are some ideas for how to reduce the memory load:

1. Reduce the training and validation batch size in the config, e.g.

   ```yaml
   training:
     batch_size: 16
     batch_size_val: 4
   ```

2. Reduce the maximum number of points processed in parallel in the depth prediction step, e.g.

   ```yaml
   model:
     depth_function_kwargs:
       max_points: 10000
   ```

3. Reduce the hidden dimension of the model when training new models. A smaller model also trains much faster. Set e.g.

   ```yaml
   model:
     decoder_kwargs:
       hidden_size: 128
   ```

4. Reduce the number of training and evaluation points, e.g.

   ```yaml
   training:
     n_training_points: 512
     n_eval_points: 512
   ```

I hope this helps and you can find a setting which is suitable for your hardware! Regarding your question, the validation step requires more GPU memory in the early stages of training because we adaptively increase the ray sampling resolution and start with a small one (16) which is increased over time (up to 128). However, the validation step is always performed on the high resolution (128). You can see this in the depth function implementation.
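For intuition, the sketch below shows how such a schedule changes the number of rays processed per batch. The milestone iterations and the intermediate resolutions (32, 64) are made-up illustration values; only the starting (16) and final (128) resolutions come from the explanation above, and this is not the repository's actual code.

```python
# Illustration only (not the repository's implementation): training starts at a
# coarse ray sampling resolution and increases it over time, while validation
# always runs at the full resolution.

def train_ray_resolution(iteration,
                         milestones=(50_000, 100_000, 250_000),  # made-up milestones
                         resolutions=(16, 32, 64, 128)):         # 16 and 128 from the reply
    """Ray sampling resolution used for training at a given iteration."""
    passed = sum(iteration >= m for m in milestones)
    return resolutions[passed]

def rays_per_batch(batch_size, resolution):
    """Rays processed per batch; memory grows roughly with this number."""
    return batch_size * resolution ** 2

if __name__ == "__main__":
    for it in (0, 120_000, 300_000):
        r = train_ray_resolution(it)
        print(f"iter {it:>7}: train {r:>3}^2 rays/img x 16 imgs = {rays_per_batch(16, r):>7} rays; "
              f"val 128^2 rays/img x 4 imgs = {rays_per_batch(4, 128):>6} rays")
```

Under these assumptions, an early training step processes about 16 x 16^2 = 4,096 rays per batch while a validation step processes about 4 x 128^2 = 65,536, which matches the out-of-memory error appearing only at evaluation time.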

hiyyg commented May 2, 2020

Thanks for your reply. Does that mean the training batch size should be set to the same value as the testing batch size, since otherwise training will run out of memory later on?

m-niemeyer (Collaborator) commented

You are right that the memory consumption will increase later during training, but it is not exactly the same as what is needed for the validation step; it also depends on the number of training/validation points you use (see point 4 above). All of the points mentioned in the previous message reduce the memory load for both the training and the validation step. Good luck with your research!

hiyyg closed this as completed May 2, 2020

hiyyg commented May 7, 2020

Hi @m-niemeyer, may I ask: for single-view reconstruction with multi-view supervision on the 3D-R2N2 ShapeNet dataset, what are the final evaluation loss values of your model once it has converged?

m-niemeyer (Collaborator) commented

Hi @hiyyg, it should be `loss_depth_eval: 0.033` for our 2.5D supervised model and `mask_intersection: 0.973` for our 2D supervised model. I hope this is what you were looking for!
