
About RuntimeError: CUDA out of memory #41

Closed
Txrachel opened this issue May 31, 2019 · 4 comments


@Txrachel

Hi,
Thanks for your wonderful work and detailed tutorial.
I am new here. When I try to retrain the model, I get a RuntimeError. I then set Batch_Size (config.py) to [1] * 3, but it still fails. I wonder if you have ever run into this problem?
Could you please help me?
Thanks in advance!

INFO:main: Train epoch: 0 [0/795] Avg. Loss: 3.751 Avg. Time: 1.046
Traceback (most recent call last):
File "src/train.py", line 425, in <module>
main()
File "src/train.py", line 409, in main
args.freeze_bn[task_idx])
File "src/train.py", line 273, in train_segmenter
output = segmenter(input_var)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/SS/light-weight-refinenet/models/resnet.py", line 237, in forward
x1 = self.mflow_conv_g4_pool(x1)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/SS/light-weight-refinenet/utils/layer_factory.py", line 72, in forward
top = self.maxpool(top)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/pooling.py", line 146, in forward
self.return_indices)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/_jit_internal.py", line 133, in fn
return if_false(*args, **kwargs)
File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 1.96 GiB total capacity; 1.14 GiB already allocated; 20.06 MiB free; 41.52 MiB cached)
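
The figures in the OOM message are worth reading closely: on this 1.96 GiB card, PyTorch's allocator holds 1.14 GiB plus 41.52 MiB cached, yet only 20.06 MiB is reported free, so roughly 0.76 GiB is consumed outside the allocator (CUDA context, display, or other processes). A small sketch that pulls the numbers out of such a message; the `parse_oom` helper and its field names are my own, not part of this repository:

```python
import re

# Hypothetical helper: extract the memory figures (normalised to MiB)
# from a PyTorch "CUDA out of memory" message.
def parse_oom(message):
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "capacity":  r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free":      r"([\d.]+) (KiB|MiB|GiB) free",
        "cached":    r"([\d.]+) (KiB|MiB|GiB) cached",
    }
    to_mib = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = float(match.group(1)) * to_mib[match.group(2)]
    return figures

msg = ("CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; "
       "1.96 GiB total capacity; 1.14 GiB already allocated; "
       "20.06 MiB free; 41.52 MiB cached)")
stats = parse_oom(msg)
# Memory that neither the allocator nor the free pool accounts for
# (CUDA context, display server, other processes), in MiB:
outside = stats["capacity"] - stats["allocated"] - stats["cached"] - stats["free"]
```

If `outside` is large, as here, lowering the batch size alone may not help much; something else on the GPU is eating the budget.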

@arindamrc

I'm observing similar behaviour as well. I can only train with a batch size of 1, and even then the GPU memory isn't fully utilized. I'm training on a GTX 1080 with 8 GB of VRAM.

@DrSleep
Owner

DrSleep commented Jun 29, 2019

Can't help with this one; I would suggest making sure that no other GPU processes are running alongside.
I think a 1080 should be enough with a batch size of 1; for reference, I am using a 1080Ti with a batch size of 6.
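
The "other GPU processes" suggestion can be checked with `nvidia-smi`'s per-process query. Below is a small sketch of parsing its `csv,noheader` output; the sample text and the `parse_compute_apps` helper are invented for illustration:

```python
# On a real machine, `text` would come from:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
def parse_compute_apps(text):
    """Turn 'pid, process_name, used_memory' lines into (pid, name, mem) tuples."""
    procs = []
    for line in text.strip().splitlines():
        pid, name, mem = (field.strip() for field in line.split(","))
        procs.append((int(pid), name, mem))
    return procs

# Invented sample output for demonstration:
sample = ("2481, /usr/bin/python3, 1843 MiB\n"
          "3012, /opt/conda/bin/python, 1112 MiB")
procs = parse_compute_apps(sample)
```

Anything other than the training process that shows up here takes its memory straight out of the budget the training run sees.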

@rfairhurst

I realize you can't help, but I am also getting this error. I am using an Nvidia Quadro P4000 with 8 GB of VRAM.

The Task Manager shows very low GPU memory usage until the program prints:

Train epoch: 0 [0/132] Avg. Loss: 3.711 Avg. Time: 2.425

Then the GPU memory usage jumps to over 90% in under a second and the program throws the error:

File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 425, in <module>
main()
File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 409, in main
args.freeze_bn[task_idx])
File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 280, in train_segmenter
loss.backward()
File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 230.00 MiB (GPU 0; 8.00 GiB total capacity; 5.81 GiB already allocated; 159.27 MiB free; 333.44 MiB cached)

I believe my batch size is set to 1. Anyway, I will search Google for this error to see if there is anything I can try.

@rfairhurst

Apparently I was wrong about my batch size setting. It must have been set to 6 or higher: once I made sure it was 5 or less, training ran successfully, but it failed again when I set the batch size to 6.
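
That trial-and-error can be automated. A minimal sketch, assuming a `train_step(batch_size)` callable that runs one forward/backward pass and raises PyTorch's usual `RuntimeError` on OOM; `largest_safe_batch` and `fake_step` are my own names, not part of this repository:

```python
def largest_safe_batch(train_step, start=8):
    """Try batch sizes from `start` downward; return the first that survives
    one training step without a CUDA out-of-memory error."""
    for batch in range(start, 0, -1):
        try:
            # In real use you would also call torch.cuda.empty_cache()
            # after each failed attempt before retrying.
            train_step(batch)
            return batch
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: don't swallow it
    return 0

def fake_step(batch):
    """Stand-in trainer: pretend anything above batch size 5 exhausts the GPU."""
    if batch > 5:
        raise RuntimeError("CUDA out of memory. Tried to allocate 230.00 MiB")

print(largest_safe_batch(fake_step))  # → 5
```

With the stand-in above this reproduces the situation in this thread: batch size 6 fails, 5 succeeds.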

@DrSleep DrSleep closed this as completed Aug 11, 2019