CUDA out of memory during training before evaluation #9

Closed
hiyyg opened this issue May 2, 2020 · 6 comments

hiyyg commented May 2, 2020

Hi @m-niemeyer, I tried to train with configs/single_view_reconstruction/multi_view_supervision/ours_combined.yaml, and I have reduced the batch size for both training and testing to 16. However, during training, every time right before evaluation, this runtime error occurs:

```
RuntimeError: CUDA out of memory. Tried to allocate 7.06 GiB (GPU 0; 10.76 GiB total capacity; 7.46 GiB already allocated; 2.06 GiB free; 7.83 GiB reserved in total by PyTorch)
```

What could be the problem?

hiyyg commented May 2, 2020

I can only run evaluation with a batch size <= 2. Why does evaluation cost so much memory?

m-niemeyer (Collaborator) commented

Hi @hiyyg, thanks a lot for your interest!
First, the configs for the large models are optimized for GPUs with 32 GB of memory. Here are some ideas for how to reduce the memory load:

1. Reduce the training and validation batch size in the config, e.g.

   ```yaml
   training:
     batch_size: 16
     batch_size_val: 4
   ```

2. Reduce the maximum number of points processed in parallel in the depth prediction step, e.g.

   ```yaml
   model:
     depth_function_kwargs:
       max_points: 10000
   ```

3. Reduce the hidden dimension of the model when training new models. A smaller model also trains much faster. Set e.g.

   ```yaml
   model:
     decoder_kwargs:
       hidden_size: 128
   ```

4. Reduce the number of training and evaluation points, e.g.

   ```yaml
   training:
     n_training_points: 512
     n_eval_points: 512
   ```

I hope this helps and you can find a setting which is suitable for your hardware! Regarding your question, the validation step requires more GPU memory in the early stages of training because we adaptively increase the ray sampling resolution and start with a small one (16) which is increased over time (up to 128). However, the validation step is always performed on the high resolution (128). You can see this in the depth function implementation.
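For intuition, the sketch below shows how such a schedule changes the number of rays processed per batch. The milestone iterations and the intermediate resolutions (32, 64) are made-up illustration values; only the starting (16) and final (128) resolutions come from the explanation above, and this is not the repository's actual code.

```python
# Illustration only (not the repository's implementation): training starts at a
# coarse ray sampling resolution and increases it over time, while validation
# always runs at the full resolution.

def train_ray_resolution(iteration,
                         milestones=(50_000, 100_000, 250_000),  # made-up milestones
                         resolutions=(16, 32, 64, 128)):         # 16 and 128 from the reply
    """Ray sampling resolution used for training at a given iteration."""
    passed = sum(iteration >= m for m in milestones)
    return resolutions[passed]

def rays_per_batch(batch_size, resolution):
    """Rays processed per batch; memory grows roughly with this number."""
    return batch_size * resolution ** 2

if __name__ == "__main__":
    for it in (0, 120_000, 300_000):
        r = train_ray_resolution(it)
        print(f"iter {it:>7}: train {r:>3}^2 rays/img x 16 imgs = {rays_per_batch(16, r):>7} rays; "
              f"val 128^2 rays/img x 4 imgs = {rays_per_batch(4, 128):>6} rays")
```

Under these assumptions, an early training step processes about 16 x 16^2 = 4,096 rays per batch while a validation step processes about 4 x 128^2 = 65,536, which matches the out-of-memory error appearing only at evaluation time.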

hiyyg commented May 2, 2020

Thanks for your reply. Does that mean the training batch size should be set to the same value as the testing batch size, since otherwise training will run out of memory later on?

m-niemeyer (Collaborator) commented

You are right that the memory consumption will increase later during training, but it is not exactly the same as what is needed for the validation step; it also depends on the number of training/validation points you use (see point 4 above). All of the points mentioned in the previous message reduce the memory load for both the training and the validation step. Good luck with your research!

hiyyg closed this as completed May 2, 2020

hiyyg commented May 7, 2020

Hi @m-niemeyer, may I ask: for single-view reconstruction with multi-view supervision on the 3D-R2N2 ShapeNet dataset, what are the final evaluation loss values of your model once it has converged?

m-niemeyer (Collaborator) commented

Hi @hiyyg, it should be `loss_depth_eval: 0.033` for our 2.5D supervised model and `mask_intersection: 0.973` for our 2D supervised model. I hope this is what you were looking for!
