
Training halts after the first epoch #1

Closed
leoybkim opened this issue Jul 17, 2018 · 4 comments
Labels
bug Something isn't working


@leoybkim

Hello,

I tried to train your model in full_multi_instrument mode using 4 GPUs (NVIDIA Tesla P100) and the same dataset (musdb). It took 2 hours to finish the first epoch, followed by a very long hang with no progress.

Here is the stack trace after stopping the script manually:



2018-07-16 16:21:31.186201: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186236: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186247: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186262: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186272: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186285: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186297: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186309: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186322: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186334: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186347: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186359: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186384: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/beta1: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186406: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/beta2: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186427: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/epsilon: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186440: I tensorflow/core/common_runtime/placer.cc:886] sep_loss/tags: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186451: I tensorflow/core/common_runtime/placer.cc:886] save/Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186464: I tensorflow/core/common_runtime/placer.cc:886] save/SaveV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186477: I tensorflow/core/common_runtime/placer.cc:886] save/SaveV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186490: I tensorflow/core/common_runtime/placer.cc:886] save/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186503: I tensorflow/core/common_runtime/placer.cc:886] save/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186516: I tensorflow/core/common_runtime/placer.cc:886] save_1/Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186528: I tensorflow/core/common_runtime/placer.cc:886] save_1/SaveV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186542: I tensorflow/core/common_runtime/placer.cc:886] save_1/SaveV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186554: I tensorflow/core/common_runtime/placer.cc:886] save_1/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186567: I tensorflow/core/common_runtime/placer.cc:886] save_1/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0


^CWARNING - Waveunet - Aborted after 6:05:45!
Traceback (most recent call last):
  File "Training.py", line 326, in <module>
    @ex.automain
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 137, in automain
    self.run_commandline()
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 260, in run_commandline
    return self.run(cmd_name, config_updates, named_configs, {}, args)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 209, in run
    run()
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/run.py", line 221, in __call__
    self.result = self.main_function(*args)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 373, in dsd_100_experiment
    sup_model_path, sup_loss = optimise(dataset=dataset)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 311, in optimise
    model_path = train(sup_dataset=dataset["train_sup"], load_model=model_path)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 269, in train
    sup_batch = sup_batch_gen.get_batch()
  File "/home/leo/Wave-U-Net/Input/batchgenerators.py", line 123, in get_batch
    self.cache.update_cache_from_queue()
  File "/home/leo/Wave-U-Net/Input/multistreamcache.py", line 84, in update_cache_from_queue
    self.update_next_cache_item(self.communication_queue.get())
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
KeyboardInterrupt
^C

The full stack trace can be found in this [stack trace gist](https://gist.github.com/leoybkim/789d367a0ee2c63db8a513613270b017).

It looks like something went wrong while fetching the next batch from the queue. I also noticed your TODO comment about empty queues on the update_cache_from_queue() function in multistreamcache.py. I wonder whether it is related to this.
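
In case it helps with debugging, here is a minimal standalone sketch (hypothetical names, not the project's actual classes) of the pattern I suspect is involved: if the background workers stop putting items into the queue, a bare queue.get() in the training loop blocks forever. Using get() with a timeout at least turns the silent hang into a visible error.

```python
# Minimal sketch of a consumer hanging on queue.get() once the background
# workers stop producing (hypothetical names, not the Wave-U-Net classes).
import multiprocessing as mp
import queue  # on Python 2.7 this module is called Queue


def worker(q, n_items):
    # Simulates a data loader that fills the queue and then silently stops,
    # e.g. because resampling froze or the process died.
    for i in range(n_items):
        q.put("example_%d" % i)


if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q, 3))
    p.start()

    while True:
        try:
            # A bare q.get() would block forever once the worker is done;
            # a timeout surfaces the problem instead of hanging silently.
            batch = q.get(timeout=10)
            print("got:", batch)
        except queue.Empty:
            print("No new examples for 10s - workers may have died or hung.")
            break

    p.join()
```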

@f90
Owner

f90 commented Aug 24, 2018

Hey! Sorry for the late reply.

That seems odd: it does indeed get stuck in the queue's get() call, but only after the first epoch. Normally that indicates something has gone wrong with the workers that load audio examples in the background, so they stop filling the internal queue and get() hangs while waiting for new examples. But it seems like your cache is being filled properly at the start of the new epoch.

Does this problem occur every time you execute the code? I am going to try to replicate it on my end.

Are you using the package versions indicated in the requirements?

It must be that the caching system is, for some reason, not being reset properly between epochs.

To use multiple GPUs, a slight extension is necessary (see "Towers" in the TensorFlow manual), which I did not implement here. If you do implement it, it would be great if you submitted a pull request so everyone can benefit from it.
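
For reference, here is a rough sketch of what the tower pattern could look like under the TF 1.x API (standalone toy example; build_tower_loss and the shapes are placeholders, none of this is code from this repo):

```python
# Toy sketch of TF 1.x multi-GPU "towers": one model replica per GPU,
# shared variables via scope reuse, gradients averaged and applied once.
import tensorflow as tf

NUM_GPUS = 4


def build_tower_loss(batch_x, batch_y):
    # Stand-in for the real separator network: one conv layer and an L2 loss.
    pred = tf.layers.conv1d(batch_x, filters=1, kernel_size=3, padding="same")
    return tf.reduce_mean(tf.square(pred - batch_y))


def average_gradients(tower_grads):
    # tower_grads holds one list of (grad, var) pairs per GPU.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grads_and_vars]
        mean_grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        averaged.append((mean_grad, grads_and_vars[0][1]))
    return averaged


x = tf.placeholder(tf.float32, [None, 16384, 1])  # input audio snippets
y = tf.placeholder(tf.float32, [None, 16384, 1])  # target source
x_shards = tf.split(x, NUM_GPUS)
y_shards = tf.split(y, NUM_GPUS)

optimizer = tf.train.AdamOptimizer(1e-4)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i), tf.variable_scope("separator", reuse=(i > 0)):
        loss = build_tower_loss(x_shards[i], y_shards[i])
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply them once so all GPUs share weights.
train_op = optimizer.apply_gradients(average_gradients(tower_grads))
```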

@f90
Owner

f90 commented Aug 26, 2018

I tried to replicate this on my end, both with the old code and with the newest revision I just committed, but could not reproduce the issue; training proceeds normally for me. I tested with epochs_it reduced to 100 so that an epoch goes by quickly and I can test easily.

You might want to check out the latest code and see whether your problem persists. It is also worth checking that your dataset is intact (no file corruption, file paths set up correctly), since one possible cause is that the individual data loader processes hang and stop filling the queue.

@leoybkim
Author

Thanks for the follow-up, @f90.

The multi-GPU setup was definitely my mistake; I initially thought TF could handle it automatically.
In the end I removed the batch generation and multistream cache scripts and used a different approach. I'm going to go ahead and close the issue, since I don't think I can replicate it either at this point 😅
Thanks again for providing the source code and for following up. Greatly appreciated, and I learned a lot!

@f90 f90 reopened this Aug 27, 2018
@f90
Owner

f90 commented Aug 27, 2018

Investigating this further, there was indeed a problem with training not continuing after a certain number of epochs. Oddly enough, the problem came down to the librosa resampling procedure, which simply froze for some reason at the start of an epoch, when the workers started loading audio, so the cache's queue was never filled and the training loop waited for the workers forever.

I changed the resampling procedure to use scipy's resample_poly, which should work well enough for most sample rate conversions. This fixes the issue.
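
For anyone who wants to try the same conversion outside this repo, here is a minimal standalone example of resampling with scipy.signal.resample_poly (dummy data and rates; only the resample_poly call itself reflects the change described above):

```python
# Minimal example of sample rate conversion with scipy's polyphase resampler.
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

orig_sr = 44100
target_sr = 22050

audio = np.random.randn(orig_sr * 2).astype(np.float32)  # 2 s of dummy audio

# Express the rate change as a rational up/down factor and resample.
ratio = Fraction(target_sr, orig_sr).limit_denominator(1000)
resampled = resample_poly(audio, ratio.numerator, ratio.denominator)

print(resampled.shape)  # (44100,), i.e. 2 s at 22050 Hz
```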

Thanks for reporting it, and glad that you were able to make good use of the code overall!
