Training halts after the first epoch #1
Comments
Hey! Sorry for the late reply. That does seem odd: training indeed gets stuck at the queue's get() function, but only after the first epoch. Normally that indicates something has gone wrong with the workers that load audio examples in the background; they stop filling the internal queue, which then hangs while waiting for new examples. But it seems like your cache is being filled properly at the start of the new epoch. Does this problem occur every time you run the code? I am going to try to replicate it on my end. Are you using the correct versions of the packages listed in the requirements? It may be that the caching system is for some reason not reset properly between epochs. As for multiple GPUs, a slight extension is necessary to use them (see "towers" in the TensorFlow manual), which I did not implement here. If you do implement this, it would be great if you submitted a pull request so everyone can benefit from it.
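For readers hitting the same symptom, here is a minimal, self-contained sketch (illustrative only, not the repository's actual code; all names are made up) of the producer/consumer pattern described above: worker processes push examples into a shared queue, and the training loop blocks on get(), so a stalled or dead worker makes the consumer hang exactly like in the reported trace.

```python
import multiprocessing as mp

import numpy as np


def loader_worker(q, num_examples):
    # Stand-in for a process that decodes and preprocesses audio snippets.
    for _ in range(num_examples):
        q.put(np.random.randn(16384).astype(np.float32))
    q.put(None)  # sentinel telling the consumer this worker is finished


def consume(q):
    while True:
        example = q.get()  # blocks indefinitely if the workers stop producing
        if example is None:
            break
        # ... this is where a batch would be fed into the training step ...


if __name__ == "__main__":
    q = mp.Queue(maxsize=32)
    p = mp.Process(target=loader_worker, args=(q, 100))
    p.start()
    consume(q)
    p.join()
```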
I tried to replicate this on my end, both with the old code and the newest revision I just committed, but could not reproduce the issue; training proceeds normally for me. I tested with epochs_it reduced to 100 so that an epoch goes by quickly and I can test easily. You might want to check out the latest code and see if your problem persists. It is also worth checking whether your dataset is intact (no file corruption, file paths set up correctly), since one thing that could cause this is the individual data loader processes hanging and no longer filling the queue. A rough integrity check is sketched below.
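One way to do such a check (a sketch that assumes the dataset has been converted to WAV files and that the soundfile package is installed; neither is part of the repository itself) is to walk the dataset directory and confirm every file can be opened:

```python
import glob
import os

import soundfile as sf  # assumed to be installed; any audio reader works here

dataset_dir = "/path/to/musdb"  # adjust to your local dataset location

for path in sorted(glob.glob(os.path.join(dataset_dir, "**", "*.wav"), recursive=True)):
    try:
        info = sf.info(path)  # raises if the file is missing or corrupted
        if info.frames == 0:
            print(f"Empty file: {path}")
    except Exception as exc:
        print(f"Could not read {path}: {exc}")
```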
Thanks for the follow-up @f90. The multi-GPU part is definitely my mistake; I initially thought TF could handle it automatically.
Investigating this more, there was indeed a problem with training not continuing after a certain number of epochs. Weirdly enough, the problem came down to the librosa resampling procedure, which for some reason simply froze at the start of an epoch when the workers started loading audio, so the cache's queue was never filled and waited on the workers forever. I changed the resampling procedure to use scipy's instead. Thanks for reporting it, and glad that you were able to make good use of the code overall!
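For reference, here is a hedged sketch of resampling with scipy rather than librosa; the exact call used in the actual fix may differ, but scipy.signal.resample_poly is one common choice for integer-ratio sample-rate conversion.

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 44100, 22050  # example rates only
audio = np.random.randn(orig_sr * 2).astype(np.float32)  # two seconds of noise

# Polyphase resampling by the ratio target_sr / orig_sr.
g = np.gcd(orig_sr, target_sr)
resampled = resample_poly(audio, up=target_sr // g, down=orig_sr // g)
print(len(audio), "->", len(resampled))
```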
Hello,
I tried to train your model in full_multi_instrument mode using 4 GPUs (NVIDIA Tesla P100) and with the same dataset (musdb). It took 2 hours to finish the first epoch, followed by a very long hang with no progress. I stopped the script manually; the full stack trace can be found in this [stack trace gist](https://gist.github.com/leoybkim/789d367a0ee2c63db8a513613270b017).
It seems like something failed while fetching the next batch from the queue. I also found your TODO comment about empty queues on the update_cache_from_queue() function in multistreamcache.py, and I wonder whether it is related to this.
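One way (just a guess on my part, not the project's actual behaviour) to make this kind of hang visible instead of blocking forever might be to poll the queue with a timeout and report when nothing arrives, which would point directly at dead loader workers:

```python
import queue


def get_with_timeout(q, timeout_s=60.0):
    """Fetch the next example, but fail loudly instead of hanging forever."""
    try:
        return q.get(timeout=timeout_s)  # works for multiprocessing.Queue too
    except queue.Empty:
        raise RuntimeError(
            "No example arrived within %.0f s - the loader workers may have "
            "died or stalled (e.g. while resampling audio)." % timeout_s
        )
```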