Getting "RuntimeError: CUDA error: out of memory" when trying to train on multiple GPUs #10

Closed
yimingsu01 opened this issue Sep 20, 2022 · 5 comments


@yimingsu01

Hi,

I'm getting this error when trying to train my own model on multiple GPUs.

This is my command:
python src/run.py +experiment=[blobgan,local,jitter] wandb.name='10-blob BlobGAN on bedrooms'

This is my local.yaml:

dataset:
  path: /home/yimingsu/blobgan/resizedout # Change to your path
  resolution: 128
  #dataloader:
    #batch_size: 24
trainer:
  gpus: 4  # Change to your number of GPUs
wandb:  # Fill in your settings
  group:  Group
  project: Blobgan investigation
  entity: yimingsu
model:
  #layout_net:
    #feature_dim: 256
  #generator:
    #override_c_in: 256
  #dim: 256
  fid_stats_name: mitbedroom # this is my custom dataset.

This is the error log:

Error executing job with overrides: ['+experiment=[blobgan,local,jitter]', 'wandb.name=10-blob BlobGAN on bedrooms']
Traceback (most recent call last):
  File "/home/yimingsu/blobgan/src/run.py", line 81, in run
    trainer.fit(model, datamodule=datamodule)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1138, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1435, in _call_setup_hook
    self.training_type_plugin.barrier("pre_setup")
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 403, in barrier
    torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I'm training on 4 RTX 2080 Tis, so I don't think memory is the actual issue.

Any help would be appreciated!

@dave-epstein
Owner

It isn't clear from what you posted what batch size you're using, since 24 is commented out, but the 2080 Ti only has around 20GB of memory, if I'm not mistaken. Try a batch size of 4 or 8 with gradient accumulation.
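
For reference, a minimal sketch of local.yaml with a smaller per-GPU batch size, following the layout of the config you posted (the value 8 is just a starting point, not a tuned recommendation):

dataset:
  path: /home/yimingsu/blobgan/resizedout
  resolution: 128
  dataloader:
    batch_size: 8  # per-GPU batch size; lower it further if you still hit OOM
trainer:
  gpus: 4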

@yimingsu01
Author

Sorry, I was being cryptic. I didn't specify the batch size; judging from the log it should be 22. Each 2080 Ti has 11GB, so 4 of them should give me 44GB of memory. How can I enable gradient accumulation?

@dave-epstein
Owner

That batch size is much too large for 11GB per GPU. GPU memory is not pooled in DDP training. Try 4 or 8. You can set trainer.accumulate_grad_batches to 2 or 4 for stability.
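
For example, assuming the trainer config group shown in your local.yaml, you could add it there:

trainer:
  gpus: 4
  accumulate_grad_batches: 4  # accumulate gradients over 4 batches before each optimizer step

or pass it as a Hydra command-line override, the same way you pass wandb.name:

python src/run.py +experiment=[blobgan,local,jitter] dataset.dataloader.batch_size=8 trainer.accumulate_grad_batches=4 wandb.name='10-blob BlobGAN on bedrooms'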

@yimingsu01
Author

[2022-09-20 10:27:24,717][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
[2022-09-20 10:27:24,730][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-09-20 10:27:24,731][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[2022-09-20 10:27:24,733][torch.distributed.distributed_c10d][INFO] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,734][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,740][torch.distributed.distributed_c10d][INFO] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.

For some reason the terminal got stuck here and seems to be frozen.

@dave-epstein
Owner

Sorry, I'm not sure how I can help debug this without more information :) This is likely to be a GPU memory or other system issue.
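
Some general first checks (not specific to BlobGAN): make sure no other processes are already holding memory on those GPUs, and rerun with verbose NCCL logging to see where initialization stalls:

# check existing processes and memory usage on each GPU
nvidia-smi

# rerun training with NCCL debug output enabled
NCCL_DEBUG=INFO python src/run.py +experiment=[blobgan,local,jitter] wandb.name='10-blob BlobGAN on bedrooms'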
