This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

RuntimeError: CUDA out of memory #19

Closed
youssefavx opened this issue Oct 1, 2020 · 13 comments

Comments

@youssefavx

youssefavx commented Oct 1, 2020

Hey guys, in trying to go from the first 'hello world' to training/fine-tuning this model, I replaced the contents of the debug files noisy.json and clean.json with my own JSON pointing to my own dataset. The dataset contains around 2.5K files and is around 1 GB at 44 kHz (and smaller at 16 kHz, as expected).

The problem is that when I try to run this on Colab (which worked with the original toy dataset provided), I now get this unexpected error:

[2020-10-01 19:57:18,722][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64
[2020-10-01 19:57:23,614][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 19:57:23,615][denoiser.solver][INFO] - Training...
[2020-10-01 19:57:26,054][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 184, in forward
    x = decode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 914, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0; 14.73 GiB total capacity; 13.03 GiB already allocated; 467.88 MiB free; 13.49 GiB reserved in total by PyTorch)

I have different versions of my dataset at 16 kHz, 22 kHz, 32 kHz, and 44.1 kHz. Every time I try one, I get a variant of the same error above. For example, when I try 44.1 kHz:

!python3 train.py demucs.hidden=64 sample_rate=44100

I get:

[2020-10-01 20:00:59,516][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64,sample_rate=44100
[2020-10-01 20:01:04,753][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 20:01:04,754][denoiser.solver][INFO] - Training...
[2020-10-01 20:01:09,170][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 176, in forward
    x = encode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 448, in forward
    return F.glu(input, self.dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 946, in glu
    return torch._C._nn.glu(input, dim)
RuntimeError: CUDA out of memory. Tried to allocate 2.69 GiB (GPU 0; 14.73 GiB total capacity; 11.28 GiB already allocated; 2.65 GiB free; 11.30 GiB reserved in total by PyTorch)

Whereas when I try 16kHz, I get:

[2020-10-01 19:57:18,722][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64
[2020-10-01 19:57:23,614][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 19:57:23,615][denoiser.solver][INFO] - Training...
[2020-10-01 19:57:26,054][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 184, in forward
    x = decode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 914, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0; 14.73 GiB total capacity; 13.03 GiB already allocated; 467.88 MiB free; 13.49 GiB reserved in total by PyTorch)

The amount of memory it tries to allocate varies, and the free memory is always just under what it needs.

The fact that the free memory varies when I change the dataset (the versions have different sizes) makes me think this is something other than CUDA genuinely being out of memory, though I could be wrong.

I'm running PyTorch 1.4.0 and torchaudio 0.4.0, because otherwise I get an error saying the CUDA driver is out of date.

I get this error both when I try to train and when I try to fine-tune.

Am I doing something wrong in setting this up? Should I arrange my files differently from the debug ones? I tried to place everything in its correct directory and point everything to the right paths.

@youssefavx
Author

I also tried:

import torch
torch.cuda.empty_cache()

before running !python3 train.py demucs.hidden=64, and it had no effect.

@youssefavx
Author

youssefavx commented Oct 1, 2020

I replaced the noisy/clean JSON files by creating them manually. So instead of having entries like this:

    [
        "/content/denoiser/dataset/debug/clean/p287_006.wav",
        81271
    ]

Through Python, I created ones like this:

[
        "/content/denoiser_dataset/audiofile noise 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_100.wav",
        100
    ],

and for the clean files I would give each file the corresponding index (for the corresponding clean audio file):

[
        "/content/denoiser_dataset/audiofile clean 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_100.wav",
        100
    ],

etc.

Here the second number is the index of the file. I'm not sure whether that's what's responsible for the problem.

@youssefavx
Author

To be clear: in this same session, the debug noisy and clean JSON files with their dataset work great.

But when I swap in my dataset and the corresponding clean and noisy JSON files, I get this error.

@youssefavx
Author

I'm also curious what the test-set files look like. For example, for dns and valentini those seem to be tt, but what's inside them? Also noisy and clean folders? And if so, what's inside those? A subset of the files from the original noisy and clean folders? And if so, do they have to be different?

@adiyoss
Contributor

adiyoss commented Oct 2, 2020

Hi @youssefavx,
Try using a smaller batch size; I think that might be the problem.

@youssefavx
Author

Thanks @adiyoss ! I'll try and see.

@youssefavx
Author

youssefavx commented Oct 5, 2020

So, if I understood what you meant by 'batch size', I thought you were referring to the number of files. It turns out I still get this error when I run it on only 200 files.

However, I discovered something: in the config file, when I change the segment value from 4 to 2, it actually trains! (But only at 16 kHz; I get the same error when I go back to 44.1 kHz, even if I reduce segment to 1.) So I think this is unrelated to dataset size, though I could be wrong.

What does this segment value refer to? Number of seconds? My examples are around 5 seconds; is it advisable to keep it at 2?
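
If segment is the excerpt length in seconds (an assumption here; the thread doesn't quote the config docs), then the number of samples per training example, and with it the activation memory, would scale roughly as sample_rate * segment:

```python
# Sketch, assuming segment is the training excerpt length in seconds.
def samples_per_example(sample_rate, segment_seconds):
    """Samples in one training excerpt under that assumption."""
    return sample_rate * segment_seconds

print(samples_per_example(16000, 4))   # 64000 samples -> OOMs here
print(samples_per_example(16000, 2))   # 32000 samples -> half as big, trains
print(samples_per_example(44100, 1))   # 44100 samples -> still above 32000
```

Which would be consistent with 44.1 kHz at segment=1 still failing where 16 kHz at segment=2 succeeds.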

@Omenranr

batch_size is the number of files given to the model at each batch iteration, not the number of files in the train directory. Even if you reduce the number of files in the directory to 200, since batch_size is 64 (in the config.yaml file), you'll still get the error on the first batch.
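
The batching described above can be sketched like this (a toy illustration, not the actual loader code):

```python
# Sketch: why shrinking the dataset to 200 files doesn't help when
# batch_size=64. GPU memory depends on the batch, not the directory size.
def batches(files, batch_size):
    """Yield successive batches of at most batch_size files."""
    for i in range(0, len(files), batch_size):
        yield files[i:i + batch_size]

files = [f"file_{i}.wav" for i in range(200)]
steps = list(batches(files, 64))
print(len(steps))       # 4 batches per epoch
print(len(steps[0]))    # the first batch still holds 64 files
```

Each of those first batches is as large as it would be with 2.5K files, so the out-of-memory error hits at the same point.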

@youssefavx
Author

Thank you so much for taking the time to explain! Much clearer now.

@adefossez
Contributor

I replaced the noisy/clean JSON files by creating them manually. So instead of having entries like this:

    [
        "/content/denoiser/dataset/debug/clean/p287_006.wav",
        81271
    ]

Through Python, I created ones like this:

[
        "/content/denoiser_dataset/audiofile noise 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_100.wav",
        100
    ],

and for the clean files I would give each file the corresponding index (for the corresponding clean audio file):

[
        "/content/denoiser_dataset/audiofile clean 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_100.wav",
        100
    ],

etc.

Here the second number is the index of the file. I'm not sure whether that's what's responsible for the problem.

The second number should always be the size of the file; you are going to get issues otherwise! It should be generated automatically, as in https://github.com/facebookresearch/denoiser/blob/master/make_debug.sh#L13:

python3 -m denoiser.audio PATH_TO_WAV_FOLDER > PATH_TO_JSON_FILE.json
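
For illustration only, a minimal sketch of what such a manifest contains: each entry is a path plus the file's length in frames, read here with the stdlib wave module (the actual denoiser.audio implementation may differ):

```python
import json
import wave

def build_manifest(wav_paths):
    """Build [path, num_frames] entries like the noisy/clean JSON files."""
    entries = []
    for path in wav_paths:
        with wave.open(path, "rb") as f:
            entries.append([path, f.getnframes()])
    return entries

# Example: write a tiny 1-second mono 16 kHz WAV, then index it.
with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

manifest = build_manifest(["example.wav"])
print(json.dumps(manifest))  # [["example.wav", 16000]]
```

Hand-written indices like 1, 10, 100 in place of these frame counts would break any code that relies on the second field being the audio length.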

Then you will indeed need to reduce the batch size. Don't forget to set the sample_rate parameter as well when using a 44.1 kHz dataset. And finally, you might want to add demucs.resample=2 when training :)

Good luck!

@youssefavx
Author

@adefossez Thank you so much! You guys are awesome. I'm gonna try this next. I actually did get the generation of the noisy and clean json files to work. Now I'm curious what demucs.resample=2 does.

@adefossez
Contributor

By default, Demucs upsamples the audio by a factor of 4 before feeding it to the model, so effectively our model handles audio at 64 kHz. But because you have audio at 44.1 kHz, it's not a great idea to upsample 4 times, as it becomes very expensive.

@youssefavx
Copy link
Author

Thank you! makes sense now.
