CUDA error: out of memory #22

Closed
hansonkd opened this issue Mar 27, 2021 · 3 comments
Labels
bug Something isn't working

Comments


hansonkd commented Mar 27, 2021

Hello,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/AgentZoo/ElegantRL-MultiGPU/run.py", line 522, in mp_explore
    agent.init(net_dim, state_dim, action_dim)
  File "/usr/local/lib/python3.8/dist-packages/elegantrl/agent.py", line 687, in init
    self.act = ActorPPO(net_dim, state_dim, action_dim).to(self.device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory

This error persists no matter what batch size or net size I specify, and it does not matter whether I use the Multi-GPU version or the main elegantrl/run.py file.

I am running this on an NVIDIA Quadro 4000 with 8 GB of GPU memory.

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
>>> torch.__version__
'1.8.0'

To get the examples to work at all, I have to specify a GPU ID of "-1".
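
For reference, this is my reading of how the GPU ID turns into a torch device (so "-1" ends up on the CPU), plus the bare allocation test I ran to confirm CUDA itself works outside ElegantRL. The device-selection line is an assumption based on skimming agent.py, not copied from it:

import torch

gpu_id = -1  # the only value that works for me; 0 should select the Quadro

# My reading of how ElegantRL maps gpu_id to a device (an assumption,
# not verbatim from agent.py): negative ids fall back to the CPU.
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() and gpu_id >= 0 else "cpu")
print(device)

# Sanity check: a bare tensor allocation on cuda:0 outside ElegantRL.
x = torch.zeros(1024, 1024, device="cuda:0")
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")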

EDIT:
If I set rollout_num to 1, the error changes to this:

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/AgentZoo/ElegantRL-MultiGPU/run.py", line 546, in mp_explore
    agent.explore_env(env, buffer, exp_step, reward_scale, gamma)
  File "/code/ElegantRL/elegantrl/agent.py", line 714, in explore_env
    action, noise = self.select_action(state)
  File "/code/ElegantRL/elegantrl/agent.py", line 703, in select_action
    actions, noises = self.act.get_action_noise(states)
  File "/code/ElegantRL/elegantrl/net.py", line 144, in get_action_noise
    a_avg = self.net(state)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

From what I can find, this still appears to be a memory problem, but as far as I can tell the "explore" process should only take about 1.5 GB of memory, and I have almost 4 GB free.
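
For reference, this is roughly how I am checking memory: nvidia-smi gives the device-wide numbers (the "almost 4 GB free" figure), and the torch calls are the per-process view from inside the worker. Just a sketch; none of this is part of ElegantRL:

import subprocess
import torch

# Device-wide view of GPU memory:
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.free", "--format=csv"]
).decode())

# Per-process view from PyTorch's caching allocator, once the worker has
# allocated something on the GPU:
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated by this process")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the caching allocator")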

Yonv1943 (Collaborator) commented:

Hi,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

We haven't finished checking the multi-GPU version yet, so we put these files in elegantrl/AgentZoo/ElegantRL-MultiGPU.
Once all the checks are done, we will move them directly into the elegantrl directory.


hansonkd commented Mar 29, 2021

Hi,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

We haven't finished checking the multi-GPU version yet, so we put these files in elegantrl/AgentZoo/ElegantRL-MultiGPU.
Once all the checks are done, we will move them directly into the elegantrl directory.

I just wanted to make sure you saw that I referred to both the main directory and the MultiGPU directory. Out of the box, elegantrl/run.py does not work for me either; it fails with the same out-of-memory error shown above. I tried both to see whether either example would work.

I am unable to find a configuration that works. I have tried lowering the net size, batch size, rollout size, and so on (see the sketch after the traceback below).

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/run.py", line 414, in mp_train
    agent.init(net_dim, state_dim, action_dim)
  File "/home/kyle/trading/erl2/ElegantRL/elegantrl/agent.py", line 687, in init
    self.act = ActorPPO(net_dim, state_dim, action_dim).to(self.device)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
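
For concreteness, these are the knobs I have been lowering. The attribute names follow my reading of elegantrl/run.py and should be treated as assumptions rather than the library's exact API; the stand-in object is only there to make the sketch self-contained:

from types import SimpleNamespace

# Illustrative stand-in for the settings configured in run.py
# (hypothetical values; attribute names are assumptions based on run.py):
args = SimpleNamespace(
    net_dim=2 ** 7,     # actor/critic network width, lowered from the default
    batch_size=2 ** 7,  # minibatch size per update, lowered
    rollout_num=1,      # number of exploration worker processes
    gpu_id=0,           # the single Quadro in this machine
)
print(args)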

Yonv1943 (Collaborator) commented Sep 6, 2021

We have fully upgraded ElegantRL, and it now supports multi-GPU training (1 to 8 GPUs).
We have also optimized the architecture of the library so that it uses less GPU memory than before.

The problem you mentioned has now been resolved. I'm sorry that we have been busy developing the 80-GPU (cloud platform) version of ElegantRL and were unable to reply to you sooner.

I will close this issue in 3 days.

@YangletLiu added the bug label on Jan 17, 2022