It seems training the last level requires more than 32 GB of CPU memory just to load the training data. I used another machine with 64 GB and the training launched successfully.
Hi, did you run into CUDA resource-exhausted errors? I have to reduce the batch size when training the model at level 1 or 0, but it affects the final result: I can't reach the same accuracy as the author's original model.
@SanoPan
The model proposed by the author is very big. The author trains it on a Tesla V100, which has 16 GB of memory, while common GPU cards (1080 Ti, 2080 Ti) only have 11 or 12 GB. With the default batch size, training runs out of memory. When I set the batch size to 1 per GPU, training is very slow and I cannot reproduce the accuracy reported in the paper.
@songuke, yes, you are basically getting an out-of-memory error when loading the data. To avoid it, just reduce the number of loaded patches using the train_size option.
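For example, something along these lines should work (5000 is just an illustrative value; pick whatever fits into your RAM):
python train_model.py level=0 batch_size=10 train_size=5000 num_train_iters=100000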
@XGBoost, if your GPU has 11/12GB of RAM, you should be able to use a batch size of 6-8, not one. Besides that, you can also extract smaller random crops from the input/target patches, which will allow you to significantly increase the batch size.
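This is not the repo's actual data loader, just a minimal sketch of what taking matched random crops could look like, assuming the loaded input/target patches are numpy arrays and the target (DSLR) patch is scale× the spatial size of the input; random_crop_pair, crop_size, and scale are placeholder names:
import numpy as np

def random_crop_pair(raw_patch, dslr_patch, crop_size=128, scale=2):
    # pick a random top-left corner inside the raw (input) patch
    h, w = raw_patch.shape[:2]
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    raw_crop = raw_patch[y:y + crop_size, x:x + crop_size]
    # crop the matching region from the target, which is scale times larger
    dslr_crop = dslr_patch[y * scale:(y + crop_size) * scale,
                           x * scale:(x + crop_size) * scale]
    return raw_crop, dslr_crop
Applying something like this per sample when assembling each batch keeps only small crops in GPU memory, which is what makes the larger batch size possible.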
I encountered an issue when training at the last level. When I execute the command
python train_model.py level=0 batch_size=10 num_train_iters=100000
I got the following error:
Loading training data...
Killed
Any ideas?