
Last level training #7

Closed
songuke opened this issue Apr 1, 2020 · 4 comments
Labels
OOM (Out-of-memory issue) · question (Further information is requested) · Training (Training issues)

Comments

songuke commented Apr 1, 2020

I encountered an issue when training at the last level. When I execute the command

python train_model.py level=0 batch_size=10 num_train_iters=100000

I get the following error:
Loading training data...
Killed

Any ideas?

songuke (Author) commented Apr 4, 2020

It seems that last-level training requires more than 32 GB of CPU memory to load the training data. I used another machine with 64 GB of RAM and the training launched successfully.

SanoPan commented Apr 8, 2020

Hi, did you run into the CUDA "resource exhausted" error? I have to reduce the batch size when training at level 1 or 0, but it affects the final result: I can't reach the same quality as the author's original model.

XGBoost commented Apr 9, 2020

@SanoPan
The proposed model is very large. The author trained it on a Tesla V100, which has 16 GB of memory, while common GPU cards (1080 Ti, 2080 Ti) have only 11 or 12 GB, so using the default batch size runs out of memory. When I set the batch size to 1 per GPU, training is very slow and I cannot reproduce the accuracy reported in the paper.

aiff22 (Owner) commented Apr 12, 2020

@songuke, yes, you are basically getting an out-of-memory error when loading the data. To avoid it, just reduce the number of loaded patches using the train_size option.
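
For example, a command along these lines (the train_size value is only an illustration; pick whatever fits your machine's RAM):

python train_model.py level=0 batch_size=10 train_size=5000 num_train_iters=100000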

@XGBoost, if your GPU has 11/12GB of RAM, you should be able to use a batch size of 6-8, not one. Besides that, you can also extract smaller random crops from the input/target patches, which will allow you to significantly increase the batch size.
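
A minimal sketch of that random-crop idea, assuming the loaded patches are NumPy arrays and that the target (DSLR) patch is scale times larger than the input (RAW) patch; the function name, crop size, and shapes below are hypothetical and not part of this repository's code:

import numpy as np

def random_crop_pair(raw_patch, dslr_patch, crop_size=128, scale=2):
    # Pick a random top-left corner inside the RAW patch.
    h, w = raw_patch.shape[:2]
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    raw_crop = raw_patch[y:y + crop_size, x:x + crop_size]
    # Crop the corresponding (scale-times larger) region from the target patch.
    dslr_crop = dslr_patch[y * scale:(y + crop_size) * scale,
                           x * scale:(x + crop_size) * scale]
    return raw_crop, dslr_crop

# Illustrative shapes only: a 224x224x4 RAW input patch and a 448x448x3 DSLR target patch.
raw = np.zeros((224, 224, 4), dtype=np.float32)
dslr = np.zeros((448, 448, 3), dtype=np.float32)
raw_small, dslr_small = random_crop_pair(raw, dslr, crop_size=128, scale=2)

Smaller crops reduce the per-sample memory footprint on the GPU, so a larger batch size fits in the same amount of memory.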

aiff22 added the OOM, question, and Training labels on Apr 12, 2020
aiff22 closed this as completed on Apr 21, 2020