
Got 'RuntimeError: CUDA out of memory' #63

Closed
ntyoshi opened this issue Mar 19, 2021 · 6 comments

Comments

@ntyoshi

ntyoshi commented Mar 19, 2021

Hi there,

I tried bash launch_dns.sh with the default parameters you gave us but I got the error messages below:

$ bash launch_dns.sh 
[2021-03-19 00:39:32,334][__main__][INFO] - For logs, checkpoints and samples check /data/workspace/ntyoshi/outputs/exp_demucs.causal=1,demucs.hidden=64,demucs.resample=4,dset=dns
[2021-03-19 00:39:35,850][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2021-03-19 00:39:35,850][denoiser.solver][INFO] - Training...
Warning: Error detected in GluBackward. No forward pass information available. Enable detect anomaly during forward pass for more information. (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:42)
[2021-03-19 00:39:38,123][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 104, in main
    _main(args)
  File "train.py", line 98, in _main
    run(args)
  File "train.py", line 79, in run
    solver.train()
  File "/data/home/ntyoshi/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/data/home/ntyoshi/denoiser/denoiser/solver.py", line 226, in _run_one_epoch
    loss.backward()
  File "/home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.96 GiB (GPU 0; 23.65 GiB total capacity; 20.24 GiB already allocated; 1.53 GiB free; 21.27 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f79e7c5d536 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1cf1e (0x7f79e7ea6f1e in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1df9e (0x7f79e7ea7f9e in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: THCStorage_resize + 0x96 (0x7f79e911c3d6 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: THCTensor_resizeNd + 0x441 (0x7f79e912d591 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: THNN_CudaGatedLinear_updateGradInput + 0x100 (0x7f79e99dacb0 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x100a4f6 (0x7f79e90c34f6 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xf95376 (0x7f79e904e376 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x10c25b3 (0x7f7a2598d5b3 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2d39136 (0x7f7a27604136 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x10c25b3 (0x7f7a2598d5b3 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::GluBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x19d (0x7f7a2716f5cd in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2d89705 (0x7f7a27654705 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f7a27651a03 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f7a276527e2 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f7a2764ae59 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f7a33f92ac8 in /home/ntyoshi/anaconda3/envs/denoiser/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xbd6df (0x7f7a34e426df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: <unknown function> + 0x76db (0x7f7a37a466db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7f7a3776f71f in /lib/x86_64-linux-gnu/libc.so.6)

The GPU I'm using is a Quadro RTX 6000 (24 GB of memory).
I also tried using 2 GPUs, with batch_size=1 and segment=1 (I've seen #19), but I got the same error.

Is training possible on this setup?
I'd appreciate any advice on how to resolve this error.
Thank you!

@adefossez
Contributor

Are the GPUs completely empty when you start training? 24 GB should be sufficient, especially with a small segment or batch size.
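
For reference, a quick way to check this is a standard nvidia-smi query (nothing specific to this repo):

$ nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
$ watch -n 1 nvidia-smi   # live view of memory usage while a job is running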

@ntyoshi
Author

ntyoshi commented Mar 19, 2021

Technically one process is using about 4 MB, but I don't think that affects the computation.
I was able to run launch_valentini.sh to completion, so I suspect something is wrong with the settings or parameters of the DNS script, or with the dataset itself.
Can you tell what the cause is from the error message?

@adefossez
Contributor

When you said you tried batch_size=1, did you edit it in the script or in the original config file? You need to edit it in the script, as it will be overridden otherwise. If you also pass verbose=1 in the script, it will print more debug information that could help me help you. Sorry for not replying sooner.
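
For example, the train.py call inside launch_dns.sh would end up looking roughly like this (a sketch only, showing just the overrides mentioned in this thread; the real script may pass more options):

# hypothetical excerpt of launch_dns.sh; only the overrides discussed here are shown
python train.py \
  dset=dns \
  demucs.causal=1 demucs.hidden=64 demucs.resample=4 \
  batch_size=1 \
  segment=1 \
  verbose=1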

@ntyoshi
Author

ntyoshi commented Apr 8, 2021

I edited launch_dns.sh and the config file, and it worked! I set batch_size=50.
One thing I'm concerned about is that, compared to Valentini (batch_size=128), training is very slow: one epoch took 6405.02 s.
The hardware environments are exactly the same, and I used the default settings of launch_dns.sh and launch_valentini.sh, except for batch_size in the DNS case.
I suspect the speed difference comes from the dataset, but does it look odd to you?

@adefossez
Contributor

Yes, epochs are long on DNS because the dataset is quite large. I think we trained on 8 to 16 GPUs, and even then it took a few days to fully converge, so on 2 GPUs I would expect it to be even slower.

@ntyoshi
Author

ntyoshi commented Apr 8, 2021

@adefossez
I see. Thank you!

@ntyoshi ntyoshi closed this as completed Apr 8, 2021