Getting "RuntimeError: CUDA error: out of memory" when trying to train on multiple GPUs #10

Closed
yimingsu01 opened this issue Sep 20, 2022 · 5 comments


@yimingsu01

Hi,

I'm getting this error when trying to train my own model on multiple GPUs.

This is my command:
python src/run.py +experiment=[blobgan,local,jitter] wandb.name='10-blob BlobGAN on bedrooms'

This is my local.yaml:

dataset:
  path: /home/yimingsu/blobgan/resizedout # Change to your path
  resolution: 128
  #dataloader:
    #batch_size: 24
trainer:
  gpus: 4  # Change to your number of GPUs
wandb:  # Fill in your settings
  group:  Group
  project: Blobgan investigation
  entity: yimingsu
model:
  #layout_net:
    #feature_dim: 256
  #generator:
    #override_c_in: 256
  #dim: 256
  fid_stats_name: mitbedroom # this is my custom dataset.

This is the error log:

Error executing job with overrides: ['+experiment=[blobgan,local,jitter]', 'wandb.name=10-blob BlobGAN on bedrooms']
Traceback (most recent call last):
  File "/home/yimingsu/blobgan/src/run.py", line 81, in run
    trainer.fit(model, datamodule=datamodule)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1138, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1435, in _call_setup_hook
    self.training_type_plugin.barrier("pre_setup")
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 403, in barrier
    torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
  File "/home/yimingsu/blobganenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I'm training on 4 RTX 2080 Tis, so I don't think memory is the actual issue.

Any help would be appreciated!

@dave-epstein
Owner

It isn't clear from what you posted what batch size you're using, since 24 is commented out, but the 2080 Ti only has around 20GB of memory, if I'm not mistaken. Try a batch size of 4 or 8 with gradient accumulation.
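
For reference, a minimal sketch of local.yaml with a smaller per-GPU batch size, following the layout of the config you posted (the value 8 is just a starting point, not a tuned recommendation):

dataset:
  path: /home/yimingsu/blobgan/resizedout
  resolution: 128
  dataloader:
    batch_size: 8  # per-GPU batch size; lower it further if you still hit OOM
trainer:
  gpus: 4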

@yimingsu01
Author

Sorry, I was being cryptic. I didn't specify the batch size; judging from the log it should be 22. Each 2080 Ti has 11GB, so 4 of them should give me 44GB of memory. How can I enable gradient accumulation?

@dave-epstein
Owner

That batch size is much too large for 11GB per GPU. GPU memory is not pooled in DDP training. Try 4 or 8. You can set trainer.accumulate_grad_batches to 2 or 4 for stability.
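
For example, assuming the trainer config group shown in your local.yaml, you could add it there:

trainer:
  gpus: 4
  accumulate_grad_batches: 4  # accumulate gradients over 4 batches before each optimizer step

or pass it as a Hydra command-line override, the same way you pass wandb.name:

python src/run.py +experiment=[blobgan,local,jitter] dataset.dataloader.batch_size=8 trainer.accumulate_grad_batches=4 wandb.name='10-blob BlobGAN on bedrooms'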

@yimingsu01
Author

[2022-09-20 10:27:24,717][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
[2022-09-20 10:27:24,730][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-09-20 10:27:24,731][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[2022-09-20 10:27:24,733][torch.distributed.distributed_c10d][INFO] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,734][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,740][torch.distributed.distributed_c10d][INFO] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.

For some reason the terminal got stuck here and seems to be frozen.

@dave-epstein
Owner

Sorry, I'm not sure how I can help debug this without more information :) This is likely to be a GPU memory or other system issue.
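
Some general first checks (not specific to BlobGAN): make sure no other processes are already holding memory on those GPUs, and rerun with verbose NCCL logging to see where initialization stalls:

# check existing processes and memory usage on each GPU
nvidia-smi

# rerun training with NCCL debug output enabled
NCCL_DEBUG=INFO python src/run.py +experiment=[blobgan,local,jitter] wandb.name='10-blob BlobGAN on bedrooms'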
