CUDA error: out of memory #22

Closed
hansonkd opened this issue Mar 27, 2021 · 3 comments
Labels
bug Something isn't working

Comments


hansonkd commented Mar 27, 2021

Hello,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/AgentZoo/ElegantRL-MultiGPU/run.py", line 522, in mp_explore
    agent.init(net_dim, state_dim, action_dim)
  File "/usr/local/lib/python3.8/dist-packages/elegantrl/agent.py", line 687, in init
    self.act = ActorPPO(net_dim, state_dim, action_dim).to(self.device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory

This error persists no matter what batch size or net size I specify, and it does not matter whether I use the Multi-GPU version or the main elegantrl/run.py file.

I am running this on an NVIDIA Quadro 4000 with 8 GB of GPU memory.

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
>>> torch.__version__
'1.8.0'

To get the examples to work at all, I have to specify a GPU ID of "-1".
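
For reference, this is my reading of how the GPU ID turns into a torch device (so "-1" ends up on the CPU), plus the bare allocation test I ran to confirm CUDA itself works outside ElegantRL. The device-selection line is an assumption based on skimming agent.py, not copied from it:

import torch

gpu_id = -1  # the only value that works for me; 0 should select the Quadro

# My reading of how ElegantRL maps gpu_id to a device (an assumption,
# not verbatim from agent.py): negative ids fall back to the CPU.
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() and gpu_id >= 0 else "cpu")
print(device)

# Sanity check: a bare tensor allocation on cuda:0 outside ElegantRL.
x = torch.zeros(1024, 1024, device="cuda:0")
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")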

EDIT:
If I set rollout_num to 1, the error changes to this:

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/AgentZoo/ElegantRL-MultiGPU/run.py", line 546, in mp_explore
    agent.explore_env(env, buffer, exp_step, reward_scale, gamma)
  File "/code/ElegantRL/elegantrl/agent.py", line 714, in explore_env
    action, noise = self.select_action(state)
  File "/code/ElegantRL/elegantrl/agent.py", line 703, in select_action
    actions, noises = self.act.get_action_noise(states)
  File "/code/ElegantRL/elegantrl/net.py", line 144, in get_action_noise
    a_avg = self.net(state)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

From what I can find, this still appears to be a memory problem, but as far as I can tell the "explore" process should only take about 1.5 GB of memory, and I have almost 4 GB free.
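
For reference, this is roughly how I am checking memory: nvidia-smi gives the device-wide numbers (the "almost 4 GB free" figure), and the torch calls are the per-process view from inside the worker. Just a sketch; none of this is part of ElegantRL:

import subprocess
import torch

# Device-wide view of GPU memory:
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.free", "--format=csv"]
).decode())

# Per-process view from PyTorch's caching allocator, once the worker has
# allocated something on the GPU:
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated by this process")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the caching allocator")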

Yonv1943 (Collaborator) commented:

Hi,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

We haven't finished checking the multi-GPU version yet, so we put these files in elegantrl/AgentZoo/ElegantRL-MultiGPU.
Once all the checks are done, we will move them directly into the elegantrl directory.


hansonkd commented Mar 29, 2021

Hi,

Running run.py in both the main directory and the MultiGPU directory gives me an error:

We haven't finished checking the multi-GPU version yet, so we put these files in elegantrl/AgentZoo/ElegantRL-MultiGPU.
Once all the checks are done, we will move them directly into the elegantrl directory.

I just wanted to make sure you saw that I referred to both the main directory and the MultiGPU directory. Out of the box, elegantrl/run.py does not work for me either; it fails with the same out-of-memory error shown above. I tried both to see whether either example would work.

I am unable to find a configuration that works. I have tried lowering the net size, batch size, rollout size, and so on (see the sketch after the traceback below).

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "elegantrl/run.py", line 414, in mp_train
    agent.init(net_dim, state_dim, action_dim)
  File "/home/kyle/trading/erl2/ElegantRL/elegantrl/agent.py", line 687, in init
    self.act = ActorPPO(net_dim, state_dim, action_dim).to(self.device)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/kyle/.virtualenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
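
For concreteness, these are the knobs I have been lowering. The attribute names follow my reading of elegantrl/run.py and should be treated as assumptions rather than the library's exact API; the stand-in object is only there to make the sketch self-contained:

from types import SimpleNamespace

# Illustrative stand-in for the settings configured in run.py
# (hypothetical values; attribute names are assumptions based on run.py):
args = SimpleNamespace(
    net_dim=2 ** 7,     # actor/critic network width, lowered from the default
    batch_size=2 ** 7,  # minibatch size per update, lowered
    rollout_num=1,      # number of exploration worker processes
    gpu_id=0,           # the single Quadro in this machine
)
print(args)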

Yonv1943 (Collaborator) commented Sep 6, 2021

We have fully upgraded ElegantRL, and it now supports multi-GPU training (1 to 8 GPUs).
We have also optimized the architecture of the library so that it uses less GPU memory than before.

The problem you mentioned has now been resolved. I'm sorry that we have been busy developing the 80-GPU (cloud platform) version of ElegantRL and were unable to reply to you sooner.

I will close this issue in 3 days.

@YangletLiu added the bug label on Jan 17, 2022