Training on a pre-trained model: RuntimeError: CUDA error: out of memory #238

BelhalK · 2018-11-30T17:02:48Z

🐛 Bug

I am launching training from a pretrained model on a two-class, COCO-like dataset.

To Reproduce

Steps to reproduce the behavior:

Run training with this command line:

python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

Here myconfig.yaml points to mymodel.pth like this:
WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel"
mymodel.pth is a pretrained model with the right keys deleted, as suggested in #15.

Expected behavior

Training should start and complete.

Environment

PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch
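
The listing above shows two PyTorch installs (pip3 torch 0.4.1 and conda pytorch-nightly 1.0.0.dev20181123), so it may be worth confirming which one the interpreter actually imports. A minimal sketch; the helper name `module_origin` is made up for illustration and is not part of PyTorch or maskrcnn-benchmark:

```python
import importlib


def module_origin(name):
    """Return (version, file path) of the module that `import name` resolves to.

    Returns None when the module cannot be imported at all.
    `module_origin` is a hypothetical helper, not a library function.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines __version__ or __file__, so fall back to None.
    version = getattr(module, "__version__", None)
    path = getattr(module, "__file__", None)
    return version, path


if __name__ == "__main__":
    # Run this inside the same environment used for training.
    print(module_origin("torch"))
```

If the printed path is outside the `maskrcnn` conda environment, the pip3 install is probably shadowing the nightly build.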

Returned Error

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory


zimenglan-sysu-512 · 2018-12-01T06:54:09Z

If you use a single GPU to train a model, you should make IMS_PER_BATCH small enough (e.g. IMS_PER_BATCH=2).

fmassa · 2018-12-03T19:48:12Z

As @zimenglan-sysu-512 pointed out, you are training on a single GPU with a batch size of 10, which is quite large in general. Try decreasing the batch size.
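
Since the traceback stops at `model.to(device)`, before any batch is built, it may also help to check how much GPU memory is already in use by other processes. A rough sketch that shells out to `nvidia-smi`; the function names are illustrative, and the parser assumes every field is a plain integer (no `[N/A]` values):

```python
import subprocess

# Ask nvidia-smi for per-GPU memory figures in MiB, as bare CSV rows.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn rows like '0, 10980, 11178' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"gpu": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi and parse its output (needs the NVIDIA driver tools)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for row in query_gpu_memory():
        print(row)
```

If a card is already nearly full before training starts, lowering the batch size alone is unlikely to help.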

BelhalK · 2018-12-03T20:07:28Z

Actually, I also tried this command line (setting SOLVER.IMS_PER_BATCH to 1):
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
and I still get a similar error, which seems weird:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

fmassa · 2018-12-03T20:09:27Z

Can you run

import torch
print(torch.rand(1, device="cuda"))

in your interpreter?

BelhalK · 2018-12-03T20:12:45Z

Hm, interesting, it returns

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

BelhalK · 2018-12-03T20:13:46Z

Only the CPU version works:

import torch
print(torch.rand(1, device="cpu"))

fmassa · 2018-12-03T21:40:13Z

It looks like there is a problem with your setup / gpu. Maybe a reboot would help?
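
The one-element allocation used above can be wrapped into a small triage script that separates a memory problem from a driver or setup problem. This is only a sketch: the category names and the string matching are heuristics chosen for illustration, based on the two error messages in this thread.

```python
import importlib.util


def classify_cuda_failure(message):
    """Rough triage of a CUDA error string (heuristic, not exhaustive)."""
    text = message.lower()
    if "out of memory" in text:
        return "memory"
    if "unknown error" in text or "cuda runtime error (30)" in text:
        return "driver-or-setup"
    return "other"


def cuda_smoke_test():
    """Try the smallest possible CUDA allocation and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    if not torch.cuda.is_available():
        return "cuda-not-available"
    try:
        torch.rand(1, device="cuda")
    except RuntimeError as exc:
        return classify_cuda_failure(str(exc))
    return "ok"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If even this one-element tensor fails, the batch size is not the cause and the driver or GPU state is the more likely culprit.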

BelhalK · 2018-12-06T15:30:11Z

That was it, thank you!

Dongximing · 2021-05-31T21:48:56Z

Do you have a solution?

LiuWenJia-ops · 2021-09-23T11:31:55Z

I ran into the same problem, but I can't reboot the machine because it's a public server.

arachid1 · 2021-11-24T15:03:28Z

Are we rebooting the GPU? How do you do that safely? Is there a solution, by chance?

lalalafloat · 2022-10-17T11:34:51Z

I solved this problem by rebooting the server.

fmassa closed this as completed Dec 3, 2018
