
Last level training #7

Closed
songuke opened this issue Apr 1, 2020 · 4 comments
Labels
OOM (Out-of-memory issue) · question (Further information is requested) · Training (Training issues)

Comments

songuke commented Apr 1, 2020

I encountered an issue when training at the last level. When I execute the command

python train_model.py level=0 batch_size=10 num_train_iters=100000

I get the following error:
Loading training data...
Killed

Any ideas?

songuke (Author) commented Apr 4, 2020

It seems that last-level training requires more than 32 GB of CPU memory to load the training data. I used another machine with 64 GB of RAM and the training launched successfully.

SanoPan commented Apr 8, 2020

Hi, did you run into the CUDA "resource exhausted" error? I have to reduce the batch size when training at level 1 or 0, but it affects the final result: I can't reach the same quality as the author's original model.

XGBoost commented Apr 9, 2020

@SanoPan
The proposed model is very large. The author trained it on a Tesla V100, which has 16 GB of memory, while common GPU cards (1080 Ti, 2080 Ti) have only 11 or 12 GB, so using the default batch size runs out of memory. When I set the batch size to 1 per GPU, training is very slow and I cannot reproduce the accuracy reported in the paper.

aiff22 (Owner) commented Apr 12, 2020

@songuke, yes, you are basically getting an out-of-memory error when loading the data. To avoid it, just reduce the number of loaded patches using the train_size option.
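
For example, a command along these lines (the train_size value is only an illustration; pick whatever fits your machine's RAM):

python train_model.py level=0 batch_size=10 train_size=5000 num_train_iters=100000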

@XGBoost, if your GPU has 11/12GB of RAM, you should be able to use a batch size of 6-8, not one. Besides that, you can also extract smaller random crops from the input/target patches, which will allow you to significantly increase the batch size.
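
A minimal sketch of that random-crop idea, assuming the loaded patches are NumPy arrays and that the target (DSLR) patch is scale times larger than the input (RAW) patch; the function name, crop size, and shapes below are hypothetical and not part of this repository's code:

import numpy as np

def random_crop_pair(raw_patch, dslr_patch, crop_size=128, scale=2):
    # Pick a random top-left corner inside the RAW patch.
    h, w = raw_patch.shape[:2]
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    raw_crop = raw_patch[y:y + crop_size, x:x + crop_size]
    # Crop the corresponding (scale-times larger) region from the target patch.
    dslr_crop = dslr_patch[y * scale:(y + crop_size) * scale,
                           x * scale:(x + crop_size) * scale]
    return raw_crop, dslr_crop

# Illustrative shapes only: a 224x224x4 RAW input patch and a 448x448x3 DSLR target patch.
raw = np.zeros((224, 224, 4), dtype=np.float32)
dslr = np.zeros((448, 448, 3), dtype=np.float32)
raw_small, dslr_small = random_crop_pair(raw, dslr, crop_size=128, scale=2)

Smaller crops reduce the per-sample memory footprint on the GPU, so a larger batch size fits in the same amount of memory.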

aiff22 added the OOM, question, and Training labels on Apr 12, 2020
aiff22 closed this as completed on Apr 21, 2020