
About RuntimeError: CUDA out of memory #41

Closed
Txrachel opened this issue May 31, 2019 · 4 comments


@Txrachel

Hi,
Thanks for your wonderful work and detailed tutorial.
I am new here. When I try to retrain the model, I get a RuntimeError. I then set Batch_Size (config.py) to [1] * 3, but it still fails. I wonder if you have ever run into this problem?
Could you please help me?
Thanks in advance!

INFO:main: Train epoch: 0 [0/795] Avg. Loss: 3.751 Avg. Time: 1.046
Traceback (most recent call last):
File "src/train.py", line 425, in <module>
main()
File "src/train.py", line 409, in main
args.freeze_bn[task_idx])
File "src/train.py", line 273, in train_segmenter
output = segmenter(input_var)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/SS/light-weight-refinenet/models/resnet.py", line 237, in forward
x1 = self.mflow_conv_g4_pool(x1)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/SS/light-weight-refinenet/utils/layer_factory.py", line 72, in forward
top = self.maxpool(top)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/pooling.py", line 146, in forward
self.return_indices)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/_jit_internal.py", line 133, in fn
return if_false(*args, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 1.96 GiB total capacity; 1.14 GiB already allocated; 20.06 MiB free; 41.52 MiB cached)
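
The figures in the OOM message are worth reading closely: on this 1.96 GiB card, PyTorch's allocator holds 1.14 GiB plus 41.52 MiB cached, yet only 20.06 MiB is reported free, so roughly 0.76 GiB is consumed outside the allocator (CUDA context, display, or other processes). A small sketch that pulls the numbers out of such a message; the `parse_oom` helper and its field names are my own, not part of this repository:

```python
import re

# Hypothetical helper: extract the memory figures (normalised to MiB)
# from a PyTorch "CUDA out of memory" message.
def parse_oom(message):
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "capacity":  r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free":      r"([\d.]+) (KiB|MiB|GiB) free",
        "cached":    r"([\d.]+) (KiB|MiB|GiB) cached",
    }
    to_mib = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = float(match.group(1)) * to_mib[match.group(2)]
    return figures

msg = ("CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; "
       "1.96 GiB total capacity; 1.14 GiB already allocated; "
       "20.06 MiB free; 41.52 MiB cached)")
stats = parse_oom(msg)
# Memory that neither the allocator nor the free pool accounts for
# (CUDA context, display server, other processes), in MiB:
outside = stats["capacity"] - stats["allocated"] - stats["cached"] - stats["free"]
```

If `outside` is large, as here, lowering the batch size alone may not help much; something else on the GPU is eating the budget.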

@arindamrc

I'm observing similar behaviour as well. I can only train with a batch size of 1, and even then the GPU memory isn't fully utilized. I'm training on a GTX 1080 with 8 GB of VRAM.

@DrSleep
Owner

DrSleep commented Jun 29, 2019

Can't help with this one; I would suggest making sure that no other GPU processes are running alongside.
I think a 1080 should be enough with a batch size of 1; for reference, I am using a 1080Ti with a batch size of 6.
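
The "other GPU processes" suggestion can be checked with `nvidia-smi`'s per-process query. Below is a small sketch of parsing its `csv,noheader` output; the sample text and the `parse_compute_apps` helper are invented for illustration:

```python
# On a real machine, `text` would come from:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
def parse_compute_apps(text):
    """Turn 'pid, process_name, used_memory' lines into (pid, name, mem) tuples."""
    procs = []
    for line in text.strip().splitlines():
        pid, name, mem = (field.strip() for field in line.split(","))
        procs.append((int(pid), name, mem))
    return procs

# Invented sample output for demonstration:
sample = ("2481, /usr/bin/python3, 1843 MiB\n"
          "3012, /opt/conda/bin/python, 1112 MiB")
procs = parse_compute_apps(sample)
```

Anything other than the training process that shows up here takes its memory straight out of the budget the training run sees.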

@rfairhurst

I realize you can't help, but I am also getting this error. I am using an Nvidia Quadro P4000 with 8 GB of VRAM.

The Task Manager shows very low GPU memory usage until the program prints:

Train epoch: 0 [0/132] Avg. Loss: 3.711 Avg. Time: 2.425

Then the GPU memory usage jumps to over 90% in under a second and the program throws the error:

File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 425, in <module>
main()
File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 409, in main
args.freeze_bn[task_idx])
File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 280, in train_segmenter
loss.backward()
File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 230.00 MiB (GPU 0; 8.00 GiB total capacity; 5.81 GiB already allocated; 159.27 MiB free; 333.44 MiB cached)

I believe my batch size is set to 1. Anyway, I will search Google for this error to see if there is anything I can try.

@rfairhurst

Apparently I was wrong about my batch size setting. It must have been set to 6 or higher: once I made sure it was 5 or less, training ran successfully, but it failed again when I set the batch size to 6.
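
That trial-and-error can be automated. A minimal sketch, assuming a `train_step(batch_size)` callable that runs one forward/backward pass and raises PyTorch's usual `RuntimeError` on OOM; `largest_safe_batch` and `fake_step` are my own names, not part of this repository:

```python
def largest_safe_batch(train_step, start=8):
    """Try batch sizes from `start` downward; return the first that survives
    one training step without a CUDA out-of-memory error."""
    for batch in range(start, 0, -1):
        try:
            # In real use you would also call torch.cuda.empty_cache()
            # after each failed attempt before retrying.
            train_step(batch)
            return batch
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: don't swallow it
    return 0

def fake_step(batch):
    """Stand-in trainer: pretend anything above batch size 5 exhausts the GPU."""
    if batch > 5:
        raise RuntimeError("CUDA out of memory. Tried to allocate 230.00 MiB")

print(largest_safe_batch(fake_step))  # → 5
```

With the stand-in above this reproduces the situation in this thread: batch size 6 fails, 5 succeeds.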

@DrSleep DrSleep closed this as completed Aug 11, 2019