Hints on improvements for training and matching #33

Closed
asusdisciple opened this issue Nov 16, 2023 · 3 comments

asusdisciple commented Nov 16, 2023

First of all, thanks for the great model! I have tested it extensively by now and ran across a few problems and performance issues which you might be able to help with.

  1. Matching takes a lot of time with big datasets (1000+ two-minute files), since it is not multi-GPU. Do you intend to change that in the future?

  2. General behaviour: for training, it seems to be better to have a few big files rather than many small files (2 min vs. 10 s). I think this might be related to the overhead introduced by all the small .pt "models". Can you confirm this, or do you at least find it plausible?

  3. My biggest issue so far is when I try to fine-tune the HiFi-GAN vocoder. My notebook with an A4500 seems to be on par with, and even outperforming, my DGX Station with 4 x V100 32 GB GPUs, which is strange.

I identified the following things:

During validation the operation is performed on all files (1000+) rather than on a single batch. The station and the notebook are about equally fast at validating all files. However, the station uses 4 GPUs that are working at 100% the whole time, so it should be a lot faster. Since this is really slow, how often do you think I should perform validation?

During batch training the notebook also outperforms the station slightly, completing one epoch in about 40 s (station: 48 s).
However, when I look at nvidia-smi on the station, the GPU usage is at 0% all the time.

Unfortunately there seem to be some serious issues with the multi-GPU approach. If I only use one GPU on the station, one epoch takes about 17 seconds. Maybe you have an idea of what goes wrong here?

Edit:
When I monitored the epochs in a multi-GPU setting, the epoch itself was trained really fast, in about 5 seconds. However, before the progress bar appears there seems to be some loading going on that takes the other 50 seconds. Do you know which process is responsible for that time gap, or how to minimize it?


What I tried so far:
I tried different batch sizes and adjusted the number of workers in the config, but it did not really change the results much.


asusdisciple commented Nov 20, 2023

I was able to solve the issue. The problem lies with the train_loader: most of the "training time" is spent on heavy data loading in a multi-GPU setting. By default the workers are dismissed after every epoch, so the dataset needs to be loaded again. By using persistent_workers=True the training time can be reduced to 13 s/epoch on 4 GPUs. It's still not optimal, since actual training only happens during a few seconds of that time frame, but it is still a large improvement.
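
For reference, a minimal sketch of what that change looks like, assuming the loader is a standard torch.utils.data.DataLoader (train_dataset, batch_size and num_workers are placeholders for whatever the config actually provides):

    from torch.utils.data import DataLoader

    # Sketch only: train_dataset, batch_size and num_workers stand in for the
    # values read from the training config.
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,    # must be > 0 for persistent_workers to apply
        persistent_workers=True,    # keep worker processes alive between epochs
    )

Without persistent_workers=True, the DataLoader shuts its worker processes down once an epoch's iterator is exhausted and spawns fresh ones at the start of the next epoch, re-running the dataset setup each time, which matches the long gap before the progress bar appears.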

Another thing I noticed was that it helps to set batch_size down to a rather small value, since the loading to the device in

    for i, batch in pb:
        if rank == 0:
            start_b = time.time()
        x, y, _, y_mel = batch
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        y_mel = y_mel.to(device, non_blocking=True)
        y = y.unsqueeze(1)

took a long time. Now one epoch was trained in 7 seconds.
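
One related note, as an assumption on my side rather than something verified against this repo: the non_blocking=True copies above can only overlap with GPU compute when the batches come out of pinned (page-locked) host memory, which is what pin_memory=True on the DataLoader provides; otherwise they behave like ordinary synchronous copies. A minimal sketch:

    # Assumption: enabling pinned memory alongside persistent workers; without
    # pin_memory, .to(device, non_blocking=True) does not actually run asynchronously.
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                              num_workers=num_workers, pin_memory=True,
                              persistent_workers=True)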
Just wanted to let you guys know, in case you run into these problems.

@HninLwin-byte

While fine-tuning the model with my own dataset, I got the error below. If someone has encountered the same error, please share how you solved it.
checkpoints directory : /content/drive/MyDrive/data/knn-vc/pertained_model
/usr/local/lib/python3.10/dist-packages/torchaudio/transforms/_transforms.py:580: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
warnings.warn(
Epoch: 1: 0% 0/100 [00:00<?, ?it/s]
0% 0/31 [00:00<?, ?it/s]Before padding - Wav shape: torch.Size([1, 7040])
After padding - Wav shape: torch.Size([1, 7744])
Before padding - Wav shape: torch.Size([1, 7040])
After padding - Wav shape: torch.Size([1, 7744])
Before padding - Wav shape: torch.Size([1, 0])
Before padding - Wav shape: torch.Size([1, 7040])
After padding - Wav shape: torch.Size([1, 7744])
0% 0/31 [00:01<?, ?it/s]
Epoch: 1: 0% 0/100 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 342, in <module>
main()
File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 338, in main
train(0, a, h)
File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 145, in train
for i, batch in pb:
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1182, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/meldataset.py", line 203, in __getitem__
mel_loss = self.alt_melspec(audio)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/meldataset.py", line 77, in forward
wav = F.pad(wav, ((self.n_fft - self.hop_size) // 2, (self.n_fft - self.hop_size) // 2), "reflect")
RuntimeError: Expected 2D or 3D (batch mode) tensor with possibly 0 batch size and other non-zero dimensions for input, but got: [1, 0]


RF5 commented Nov 23, 2023

Hi @asusdisciple, thanks for the suggestions. We have added the persistent-workers trick to the training code now.

And @HninLwin-byte, thanks for your issue. It looks like one of your audio files might be corrupt or shorter than the minimum allowable length (about 160 ms). I would double-check that all files in your dataset are readable (not corrupt) and at least 160 ms long.
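
A quick way to find such files (just a sketch, assuming the dataset is a folder of .wav files and that torchaudio is available; data_dir is a placeholder and the 160 ms threshold comes from the comment above):

    from pathlib import Path
    import torchaudio

    MIN_SECONDS = 0.16                         # ~160 ms, per the comment above
    data_dir = Path('/path/to/your/dataset')   # placeholder path

    for wav_path in sorted(data_dir.rglob('*.wav')):
        try:
            info = torchaudio.info(str(wav_path))  # reads header without decoding all audio
        except Exception as e:
            print(f'unreadable: {wav_path} ({e})')
            continue
        duration = info.num_frames / info.sample_rate
        if duration < MIN_SECONDS:
            print(f'too short ({duration * 1000:.0f} ms): {wav_path}')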

RF5 closed this as completed Dec 4, 2023