This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

RuntimeError: CUDA out of memory #19

Closed
youssefavx opened this issue Oct 1, 2020 · 13 comments

Comments

@youssefavx

youssefavx commented Oct 1, 2020

Hey guys, in trying to go from the first 'hello world' to training/fine-tuning this model, I replaced the contents of the debug files noisy.json and clean.json with my own JSON pointing to my own dataset. The dataset contains around 2.5K files and is around 1 GB at 44 kHz (and smaller at 16 kHz, as expected).

The problem is that when I try to run this on Colab (which worked with the original toy dataset provided), I now get this unexpected error:

[2020-10-01 19:57:18,722][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64
[2020-10-01 19:57:23,614][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 19:57:23,615][denoiser.solver][INFO] - Training...
[2020-10-01 19:57:26,054][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 184, in forward
    x = decode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 914, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0; 14.73 GiB total capacity; 13.03 GiB already allocated; 467.88 MiB free; 13.49 GiB reserved in total by PyTorch)

I have different versions of my dataset at 16 kHz, 22 kHz, 32 kHz, and 44.1 kHz. Every time I try one, I get a variant of the same error above. For example, when I try 44.1 kHz:

!python3 train.py demucs.hidden=64 sample_rate=44100

I get:

[2020-10-01 20:00:59,516][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64,sample_rate=44100
[2020-10-01 20:01:04,753][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 20:01:04,754][denoiser.solver][INFO] - Training...
[2020-10-01 20:01:09,170][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 176, in forward
    x = encode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 448, in forward
    return F.glu(input, self.dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 946, in glu
    return torch._C._nn.glu(input, dim)
RuntimeError: CUDA out of memory. Tried to allocate 2.69 GiB (GPU 0; 14.73 GiB total capacity; 11.28 GiB already allocated; 2.65 GiB free; 11.30 GiB reserved in total by PyTorch)

Whereas when I try 16kHz, I get:

[2020-10-01 19:57:18,722][__main__][INFO] - For logs, checkpoints and samples check /content/denoiser/outputs/exp_demucs.hidden=64
[2020-10-01 19:57:23,614][denoiser.solver][INFO] - ----------------------------------------------------------------------
[2020-10-01 19:57:23,615][denoiser.solver][INFO] - Training...
[2020-10-01 19:57:26,054][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 99, in main
    _main(args)
  File "train.py", line 93, in _main
    run(args)
  File "train.py", line 76, in run
    solver.train()
  File "/content/denoiser/denoiser/solver.py", line 137, in train
    train_loss = self._run_one_epoch(epoch)
  File "/content/denoiser/denoiser/solver.py", line 207, in _run_one_epoch
    estimate = self.dmodel(noisy)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/denoiser/denoiser/demucs.py", line 184, in forward
    x = decode(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 914, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0; 14.73 GiB total capacity; 13.03 GiB already allocated; 467.88 MiB free; 13.49 GiB reserved in total by PyTorch)

The amount of memory it tries to allocate varies, and the free memory is always just under what it needs.

The fact that the free memory varies when I change the dataset (the versions have different sizes) makes me think this is something other than CUDA genuinely being out of memory, though I could be wrong.

I'm running PyTorch 1.4.0 and torchaudio 0.4.0, because otherwise I get an error saying the CUDA driver is out of date.

I get this error both when I try to train and when I try to fine-tune.

Am I doing something wrong in setting this up? Should I arrange my files differently from the debug ones? I tried to place everything in its correct directory and point everything to the right paths.

@youssefavx
Author

I also tried:

import torch
torch.cuda.empty_cache()

before running !python3 train.py demucs.hidden=64, and it had no effect.

@youssefavx
Author

youssefavx commented Oct 1, 2020

I replaced the noisy/clean JSON files by creating them manually. So instead of having entries like this:

    [
        "/content/denoiser/dataset/debug/clean/p287_006.wav",
        81271
    ]

Through Python, I created ones like this:

[
        "/content/denoiser_dataset/audiofile noise 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_100.wav",
        100
    ],

and for the clean files I would give each file the corresponding index (for the corresponding clean audio file):

[
        "/content/denoiser_dataset/audiofile clean 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_100.wav",
        100
    ],

etc.

Here the second number is the index of the file. I'm not sure whether that's what's responsible for the problem.

@youssefavx
Author

To be clear: in this same session, the debug noisy and clean JSON files with their dataset work great.

But when I swap in my dataset and the corresponding clean and noisy JSON files, I get this error.

@youssefavx
Author

I'm also curious what the test-set files look like. For example, for dns and valentini those seem to be tt, but what's inside them? Also noisy and clean folders? And if so, what's inside those? A subset of the files from the original noisy and clean folders? And if so, do they have to be different?

@adiyoss
Contributor

adiyoss commented Oct 2, 2020

Hi @youssefavx,
Try using a smaller batch size; I think that might be the problem.

@youssefavx
Author

Thanks @adiyoss ! I'll try and see.

@youssefavx
Author

youssefavx commented Oct 5, 2020

So, if I understood what you meant by 'batch size', I thought you were referring to the number of files. It turns out I still get this error when I run it on only 200 files.

However, I discovered something: in the config file, when I change the segment value from 4 to 2, it actually trains! (But only at 16 kHz; I get the same error when I go back to 44.1 kHz, even if I reduce segment to 1.) So I think this is unrelated to dataset size, though I could be wrong.

What does this segment value refer to? Number of seconds? My examples are around 5 seconds; is it advisable to keep it at 2?
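
If segment is the excerpt length in seconds (an assumption here; the thread doesn't quote the config docs), then the number of samples per training example, and with it the activation memory, would scale roughly as sample_rate * segment:

```python
# Sketch, assuming segment is the training excerpt length in seconds.
def samples_per_example(sample_rate, segment_seconds):
    """Samples in one training excerpt under that assumption."""
    return sample_rate * segment_seconds

print(samples_per_example(16000, 4))   # 64000 samples -> OOMs here
print(samples_per_example(16000, 2))   # 32000 samples -> half as big, trains
print(samples_per_example(44100, 1))   # 44100 samples -> still above 32000
```

Which would be consistent with 44.1 kHz at segment=1 still failing where 16 kHz at segment=2 succeeds.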

@Omenranr

batch_size is the number of files given to the model at each batch iteration, not the number of files in the train directory. Even if you reduce the number of files in the directory to 200, since batch_size is 64 (in the config.yaml file), you'll still get the error on the first batch.
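
The batching described above can be sketched like this (a toy illustration, not the actual loader code):

```python
# Sketch: why shrinking the dataset to 200 files doesn't help when
# batch_size=64. GPU memory depends on the batch, not the directory size.
def batches(files, batch_size):
    """Yield successive batches of at most batch_size files."""
    for i in range(0, len(files), batch_size):
        yield files[i:i + batch_size]

files = [f"file_{i}.wav" for i in range(200)]
steps = list(batches(files, 64))
print(len(steps))       # 4 batches per epoch
print(len(steps[0]))    # the first batch still holds 64 files
```

Each of those first batches is as large as it would be with 2.5K files, so the out-of-memory error hits at the same point.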

@youssefavx
Author

Thank you so much for taking the time to explain! Much clearer now.

@adefossez
Contributor

I replaced the noisy/clean JSON files by creating them manually. So instead of having entries like this:

    [
        "/content/denoiser/dataset/debug/clean/p287_006.wav",
        81271
    ]

Through Python, I created ones like this:

[
        "/content/denoiser_dataset/audiofile noise 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile noise 1_100.wav",
        100
    ],

and for the clean files I would give each file the corresponding index (for the corresponding clean audio file):

[
        "/content/denoiser_dataset/audiofile clean 1_1.wav",
        1
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_10.wav",
        10
    ],
    [
        "/content/denoiser_dataset/audiofile clean 1_100.wav",
        100
    ],

etc.

Here the second number is the index of the file. I'm not sure whether that's what's responsible for the problem.

The second number should always be the size of the file; you are going to get issues otherwise! It should be generated automatically, as in https://github.com/facebookresearch/denoiser/blob/master/make_debug.sh#L13:

python3 -m denoiser.audio PATH_TO_WAV_FOLDER > PATH_TO_JSON_FILE.json
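
For illustration only, a minimal sketch of what such a manifest contains: each entry is a path plus the file's length in frames, read here with the stdlib wave module (the actual denoiser.audio implementation may differ):

```python
import json
import wave

def build_manifest(wav_paths):
    """Build [path, num_frames] entries like the noisy/clean JSON files."""
    entries = []
    for path in wav_paths:
        with wave.open(path, "rb") as f:
            entries.append([path, f.getnframes()])
    return entries

# Example: write a tiny 1-second mono 16 kHz WAV, then index it.
with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

manifest = build_manifest(["example.wav"])
print(json.dumps(manifest))  # [["example.wav", 16000]]
```

Hand-written indices like 1, 10, 100 in place of these frame counts would break any code that relies on the second field being the audio length.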

Then you will indeed need to reduce the batch size. Don't forget to set the sample_rate parameter as well when using a 44.1 kHz dataset. And finally, you might want to add demucs.resample=2 when training :)

Good luck!

@youssefavx
Author

@adefossez Thank you so much! You guys are awesome. I'm gonna try this next. I actually did get the generation of the noisy and clean json files to work. Now I'm curious what demucs.resample=2 does.

@adefossez
Contributor

By default, Demucs upsamples the audio by a factor of 4 before feeding it to the model, so effectively our model handles audio at 64 kHz. But because you have audio at 44.1 kHz, it's not a great idea to upsample 4 times, as it becomes very expensive.

@youssefavx
Copy link
Author

Thank you! makes sense now.
