Getting "RuntimeError: CUDA error: out of memory" when trying to train on multiple GPUs #10
Comments
It isn't clear to me what the batch size you're using is from what you posted since 24 is commented out, but 2080Ti only has around 20GB of memory, if I'm not mistaken. Try batch size 4 or 8 with gradient accumulation. |
Sorry I was being cryptic, I didn't specify the batch size and it should be 22 from the log. Each 2080Ti has 11GB so 4 of them should give me 44GB memory. How can I enable gradient accumulation? |
That batch size is much too large for 11GB per GPU. GPU memory is not pooled in DDP training. Try 4 or 8. You can set |
For some reason the terminal got stuck here and seems to be frozen. |
Sorry, I'm not sure how I can help debug this without more information :) This is likely to be a GPU memory or other system issue. |
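For reference, here is a minimal sketch of gradient accumulation in a plain PyTorch training loop. The model, optimizer, and loader below are hypothetical stand-ins, not BlobGAN's actual code, which may expose this as a config option instead:

```python
import torch
from torch import nn

# Hypothetical stand-ins so the sketch runs; substitute the real model/data.
model = nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [torch.randn(4, 128) for _ in range(32)]  # micro-batches of size 4

accum_steps = 8  # effective batch size = 4 * 8 = 32

optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(batch).mean()
    # Scale the loss so the accumulated gradients match one large batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()  # update once per accum_steps micro-batches
        optimizer.zero_grad()
```

Only the micro-batch (here size 4) has to fit in GPU memory at once, so this trades extra forward/backward passes for a smaller memory footprint at the same effective batch size.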
For some reason the terminal got stuck here and seems to be frozen.

Sorry, I'm not sure how I can help debug this without more information :) This is likely to be a GPU memory or other system issue.
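As a quick sanity check on the pooling point above, per-device capacity can be printed with PyTorch's CUDA utilities; under DDP each process is confined to its own card, so four 11GB cards never behave like one 44GB pool:

```python
import torch

# Print each visible GPU's total memory. Under DDP, every process is
# limited to the single device it runs on, regardless of how many exist.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name}: {props.total_memory / 1024**3:.1f} GiB")
```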