This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Training on a pre-trained model: RuntimeError: CUDA error: out of memory #238

Closed
BelhalK opened this issue Nov 30, 2018 · 12 comments

Comments

@BelhalK

BelhalK commented Nov 30, 2018

🐛 Bug

I am launching training from a pretrained model on a 2-class COCO-like dataset.

To Reproduce

Steps to reproduce the behavior:

  1. Run training with this command line

python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

Here, myconfig.yaml points to mymodel.pth like this:
WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel"
mymodel.pth is a pretrained model with the appropriate keys deleted, as suggested in #15.

Expected behavior

Training should start and complete.

Environment

PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch

Returned Error

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
@zimenglan-sysu-512
Contributor

If you train on a single GPU, you should set IMS_PER_BATCH to something small enough (e.g. IMS_PER_BATCH=2).
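When the batch size changes, the learning rate and schedule are usually rescaled with it. Here is a minimal sketch of that linear scaling. It assumes a reference 8-GPU schedule of IMS_PER_BATCH 16, BASE_LR 0.02, MAX_ITER 90000 and STEPS (60000, 80000); those numbers are an assumption on my side and are not stated in this thread. `scale_schedule` is a hypothetical helper, not part of the repository.

```python
# Hypothetical helper, not part of maskrcnn-benchmark.
# Assumed reference schedule (8 GPUs, 16 images per batch in total).
REF_IMS_PER_BATCH = 16
REF_BASE_LR = 0.02
REF_MAX_ITER = 90000
REF_STEPS = (60000, 80000)


def scale_schedule(ims_per_batch):
    """Linearly rescale learning rate and schedule for a new batch size."""
    if ims_per_batch <= 0:
        raise ValueError("ims_per_batch must be positive")
    # The learning rate shrinks in proportion to the batch size.
    base_lr = REF_BASE_LR * ims_per_batch / REF_IMS_PER_BATCH
    # Fewer images per iteration means proportionally more iterations.
    max_iter = REF_MAX_ITER * REF_IMS_PER_BATCH // ims_per_batch
    steps = tuple(s * REF_IMS_PER_BATCH // ims_per_batch for s in REF_STEPS)
    return base_lr, max_iter, steps


if __name__ == "__main__":
    lr, max_iter, steps = scale_schedule(2)
    print("SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR", lr,
          "SOLVER.MAX_ITER", max_iter, "SOLVER.STEPS", steps)
```

Under that assumption, IMS_PER_BATCH=2 gives the same BASE_LR, MAX_ITER and STEPS values that the command line in the original report already uses, so only the batch size there looks out of step.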

@fmassa
Contributor

fmassa commented Dec 3, 2018

As @zimenglan-sysu-512 pointed out, you are training on a single GPU with a batch size of 10, which is quite large in general. Try decreasing the batch size.
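Note that the traceback stops at model.to(device), i.e. while the weights are being copied to the GPU, so it may also be worth checking how much memory is already in use on each card before training starts. A rough sketch, assuming nvidia-smi is on the PATH; `query_gpu_memory` and `parse_gpu_memory` are hypothetical helper names, and the function returns None when the tool is missing or fails:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows in MiB (one row per GPU) into tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or None if nvidia-smi is unusable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        )
        return parse_gpu_memory(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Tool missing, non-zero exit, timeout, or unexpected output.
        return None


if __name__ == "__main__":
    memory = query_gpu_memory()
    if memory is None:
        print("nvidia-smi is not available or failed to run")
    else:
        for index, (used, total) in enumerate(memory):
            print("GPU {}: {} / {} MiB in use".format(index, used, total))
```

If a card is already nearly full before the script starts, lowering the batch size alone is unlikely to help.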

fmassa closed this as completed Dec 3, 2018
@BelhalK
Author

BelhalK commented Dec 3, 2018

Actually, I also tried this command line (setting SOLVER.IMS_PER_BATCH to 1):
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
and I still get a similar error, which does seem odd:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

@fmassa
Contributor

fmassa commented Dec 3, 2018

Can you run

import torch
print(torch.rand(1, device="cuda"))

in your interpreter?
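If you want something you can drop into a script, here is a slightly more defensive sketch of the same check. `cuda_smoke_test` is a hypothetical helper name, and the status strings are just illustrative:

```python
def cuda_smoke_test():
    """Return a short status string describing whether CUDA is usable."""
    try:
        import torch
    except (ImportError, OSError):
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-unavailable"
    try:
        # Same check as above: allocate one random number on the GPU.
        torch.rand(1, device="cuda")
    except RuntimeError as exc:
        # CUDA initialisation or allocation failed; keep the message.
        return "cuda-error: {}".format(exc)
    return "ok"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

Anything other than "ok" here would point at the environment rather than at the training configuration.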

@BelhalK
Author

BelhalK commented Dec 3, 2018

Hm, interesting, it returns

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

@BelhalK
Author

BelhalK commented Dec 3, 2018

only

import torch
print(torch.rand(1, device="cpu"))

works

@fmassa
Contributor

fmassa commented Dec 3, 2018

It looks like there is a problem with your setup / gpu. Maybe a reboot would help?

@BelhalK
Author

BelhalK commented Dec 6, 2018

That was it, thank you!

@Dongximing

Do you have a solution?

@LiuWenJia-ops

I ran into the same problem, but I can't reboot the machine because it's a public server.

@arachid1

Are we rebooting the GPU? How do you do that safely? Is there a solution, by chance?

@lalalafloat

I solved this problem by rebooting the server.
