CUDA out of memory during training before evaluation #9
I can only run evaluation with a batch size <= 2. Why does evaluation cost so much memory?
Hi @hiyyg , thanks a lot for your interest.
Regarding your question, the validation step requires more GPU memory in the early stages of training because we adaptively increase the ray sampling resolution: we start with a small resolution (16) and increase it over time (up to 128). The validation step, however, is always performed at the full resolution (128). You can see this in the depth function implementation. I hope this helps and you can find a setting which is suitable for your hardware!
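The coarse-to-fine schedule described above can be sketched roughly as follows. Note that the function name, doubling interval, and iteration counts here are illustrative assumptions, not the repository's actual code:

```python
def train_resolution(iteration, start=16, maximum=128, double_every=50000):
    """Coarse-to-fine schedule: the ray sampling resolution starts small
    and doubles every `double_every` iterations, capped at `maximum`.
    All constants are illustrative, not the repo's actual values."""
    return min(start * (2 ** (iteration // double_every)), maximum)

VAL_RESOLUTION = 128  # validation always runs at the full resolution

# Early in training, validation is therefore far more expensive than a
# training step: 16**2 = 256 rays per image vs 128**2 = 16384 rays.
print(train_resolution(0))       # 16
print(train_resolution(120000))  # 64
print(train_resolution(999999))  # 128 (capped)
```

This is why the out-of-memory error can appear at the first validation step even though training itself fits comfortably at that point.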
Thanks for your reply. Does that mean the largest training batch size should be set equal to the testing batch size? Otherwise, will training run out of memory later?
You are right that memory consumption will increase later in training, but it is not exactly the same as what the validation step needs; it also depends on the number of training / validation points you use (see Point 4.) from before). All the points mentioned in the previous message can be used to reduce the memory load for both the training and the testing step. Good luck with your research!
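As a rough back-of-the-envelope check, the memory needed for one rendering pass scales with batch size × (sampling resolution)² × points per ray. A hypothetical helper, where the function name and the bytes-per-point constant are assumptions for illustration only:

```python
def approx_render_memory_mb(batch_size, resolution, points_per_ray,
                            bytes_per_point=64):
    """Very rough memory estimate for one rendering pass.
    `bytes_per_point` is a made-up constant standing in for the
    intermediate activations per sampled point; tune it for your model."""
    n_points = batch_size * resolution ** 2 * points_per_ray
    return n_points * bytes_per_point / 1024 ** 2

# An early training step (resolution 16) vs a validation step (128),
# same batch size and points per ray:
train_mb = approx_render_memory_mb(16, 16, 64)
val_mb = approx_render_memory_mb(16, 128, 64)
print(round(val_mb / train_mb))  # 64: validation needs ~64x the memory
```

The (128/16)² = 64x gap is why reducing the batch size alone may not be enough; reducing the number of validation points shrinks the validation peak directly.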
Hi @m-niemeyer , may I ask: for single-view reconstruction with multi-view supervision on the 3D-R2N2 ShapeNet dataset, what are the final evaluation loss values of your model when it has converged?
Hi @hiyyg , it should be loss_depth_eval: 0.033 for our 2.5D supervised model and mask_intersection: 0.973 for our 2D supervised model. I hope this is what you were looking for! |
Hi @m-niemeyer , I tried to train with configs/single_view_reconstruction/multi_view_supervision/ours_combined.yaml and reduced the batch size for both training and testing to 16. However, every time right before evaluation, a runtime error occurred. What could be the problem?
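If the out-of-memory error happens specifically at evaluation time, one common workaround (independent of this repository; `model`, `rays`, and the chunk size below are hypothetical placeholders, not the repo's actual API) is to render the validation rays in chunks so peak memory stays bounded:

```python
def render_in_chunks(model, rays, chunk_size=4096):
    """Apply `model` to `rays` chunk by chunk and concatenate the results.
    In a real PyTorch setup you would also wrap the loop in
    `torch.no_grad()` so no activations are retained for backprop.
    `model` and `rays` are placeholders, not this repo's actual API."""
    outputs = []
    for start in range(0, len(rays), chunk_size):
        outputs.extend(model(rays[start:start + chunk_size]))
    return outputs

# Toy usage with a dummy "model" that doubles its inputs:
doubled = render_in_chunks(lambda xs: [2 * x for x in xs],
                           list(range(10)), chunk_size=3)
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Chunking trades a little speed for a hard cap on memory, which is usually acceptable for validation since no gradients are needed there.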