Hints on improvements for training and matching #33
I was able to solve the issue. The problem lies with the train_loader class. Most of the "training time" results from the heavy data loading in a multi-GPU setting. By default the workers are dismissed after every epoch, so the dataset needs to be loaded again; keeping the workers persistent avoids this. Another thing I noticed was to set the batch_size down to a rather small size, since loading to the device took a long time. Now one epoch was trained in 7 seconds.
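The persistent-workers trick described above can be sketched in plain PyTorch like this; the `TensorDataset` is a stand-in for the repo's actual dataset class (illustrative only, not the project's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real (expensive-to-load) one.
dataset = TensorDataset(torch.randn(64, 10))

# persistent_workers=True keeps worker processes alive between epochs,
# so the dataset is not re-initialized at the start of every epoch.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,
    persistent_workers=True,  # requires num_workers > 0
)

for epoch in range(2):   # the same workers survive across these epochs
    for (batch,) in loader:
        pass  # training step would go here
```

Note that `persistent_workers=True` raises a `ValueError` if `num_workers` is 0, and the workers hold their copy of the dataset in memory for the lifetime of the loader.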
I was fine-tuning the model with my own dataset when I got this error. If someone has encountered the same error, please share how you solved it.
Hi @asusdisciple, thanks for the suggestions. We have added the persistent-workers trick to the training code now. And @HninLwin-byte, thanks for your issue; it looks like one of your audio files might be corrupt or shorter than the minimum allowable length (about 160 ms). I would double-check that all files in your dataset are readable, not corrupt, and at least 160 ms long.
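A quick way to run the check the maintainer suggests is to scan the dataset for WAV files that are unreadable or shorter than the ~160 ms minimum. This is a minimal, dependency-free sketch using only the standard library (the 160 ms threshold is taken from the reply above; everything else is illustrative):

```python
import os
import wave

MIN_DURATION_S = 0.160  # minimum allowable length cited above (~160 ms)

def check_wav(path):
    """Return 'ok', 'too_short', or 'corrupt' for a single WAV file."""
    try:
        with wave.open(path, "rb") as f:
            duration = f.getnframes() / f.getframerate()
    except (wave.Error, EOFError, OSError):
        return "corrupt"
    return "too_short" if duration < MIN_DURATION_S else "ok"

def scan_dataset(root):
    """Map every .wav file under root to its status."""
    report = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.lower().endswith(".wav"):
                path = os.path.join(dirpath, name)
                report[path] = check_wav(path)
    return report
```

Files in other formats (FLAC, MP3, etc.) would need a library such as soundfile or torchaudio instead of the `wave` module, but the length check is the same.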
First of all, thanks for the great model! I have tested it extensively by now and ran across a few problems and performance issues which you might be able to help with.
Matching takes a lot of time with big datasets (1000+ 2min files), since it is not multi-gpu, do you intend to change that in the future?
General behaviour: for training in general, it seems to be better to have a few big files rather than many small files (2 min vs 10 sec). I think this might be related to the overhead introduced by all the small .pt "models". Can you confirm this, or does it seem plausible to you?
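The per-file-overhead hypothesis is easy to probe: reading N small serialized files incurs N open/parse round-trips, whereas one concatenated file incurs a single one. A rough, framework-free sketch (plain pickle stands in for the repo's .pt files; all names here are illustrative):

```python
import os
import pickle
import tempfile

def save_many(dirpath, chunks):
    """Store each chunk in its own small file (zero-padded for ordering)."""
    for i, chunk in enumerate(chunks):
        with open(os.path.join(dirpath, f"chunk_{i:05d}.pkl"), "wb") as f:
            pickle.dump(chunk, f)

def load_many(dirpath):
    """Load and concatenate all small files, one open() per chunk."""
    data = []
    for name in sorted(os.listdir(dirpath)):
        with open(os.path.join(dirpath, name), "rb") as f:
            data.extend(pickle.load(f))
    return data

def save_one(path, chunks):
    """Store all chunks concatenated in a single file."""
    with open(path, "wb") as f:
        pickle.dump([x for chunk in chunks for x in chunk], f)

def load_one(path):
    """Load everything with a single open()."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Timing `load_many` against `load_one` with `time.perf_counter()` on a realistic number of chunks would show whether the many-small-files overhead is large enough to explain the training-time difference.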
My biggest issue so far is when I try to fine-tune the HiFi-GAN vocoder. My notebook with an A4500 seems to be on par with, even outperforming, my DGX Station with 4 x V100 32 GB GPUs, which is strange.
I identified the following things:
During validation the operation is performed on all files rather than on a batch (1000+ files). The station and the notebook are both about equally fast at validating all files. However, the station uses 4 GPUs which are working at 100% all the time and should be a lot faster. Since this is really slow, how often do you think I should perform validation?
During batch training the notebook also outperforms the station slightly, completing one epoch in about 40 s (station: 48 s).
However, when I look at nvidia-smi on the station, the GPU usage is at 0% all the time.
Unfortunately there seem to be some serious issues with the multi-GPU approach. If I use only one GPU on the station, one epoch takes about 17 seconds. Maybe you have an idea of what goes wrong here?
Edit:
When I monitored the epochs in a multi-GPU setting, the epoch itself was trained really fast, in 5 seconds. However, before the progress bar appears, some loading seems to happen which takes the other 50 seconds. Do you know which process is responsible for that time gap, or how to minimize it?
What I tried so far:
I tried different batch sizes and adjusted the number of workers in the config, but it did not really change the results much.