
Training halts after the first epoch #1

Closed
leoybkim opened this issue Jul 17, 2018 · 4 comments
Labels
bug Something isn't working


@leoybkim

Hello,

I tried to train your model in full_multi_instrument mode using 4 GPUs (NVIDIA Tesla P100) and the same dataset (musdb). It took 2 hours to finish the first epoch, followed by a very long hang with no progress.

Here is the stack trace after stopping the script manually:



2018-07-16 16:21:31.186201: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186236: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186247: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186262: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_25/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186272: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186285: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186297: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186309: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_26/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186322: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/kernel/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186334: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/kernel/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186347: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/bias/Adam/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186359: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/separator/conv1d_27/bias/Adam_1/Initializer/zeros: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186384: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/beta1: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186406: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/beta2: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186427: I tensorflow/core/common_runtime/placer.cc:886] separator_solver/Adam/epsilon: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2018-07-16 16:21:31.186440: I tensorflow/core/common_runtime/placer.cc:886] sep_loss/tags: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186451: I tensorflow/core/common_runtime/placer.cc:886] save/Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186464: I tensorflow/core/common_runtime/placer.cc:886] save/SaveV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186477: I tensorflow/core/common_runtime/placer.cc:886] save/SaveV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186490: I tensorflow/core/common_runtime/placer.cc:886] save/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186503: I tensorflow/core/common_runtime/placer.cc:886] save/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186516: I tensorflow/core/common_runtime/placer.cc:886] save_1/Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186528: I tensorflow/core/common_runtime/placer.cc:886] save_1/SaveV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186542: I tensorflow/core/common_runtime/placer.cc:886] save_1/SaveV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186554: I tensorflow/core/common_runtime/placer.cc:886] save_1/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2018-07-16 16:21:31.186567: I tensorflow/core/common_runtime/placer.cc:886] save_1/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0


^CWARNING - Waveunet - Aborted after 6:05:45!
Traceback (most recent call last):
  File "Training.py", line 326, in <module>
    @ex.automain
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 137, in automain
    self.run_commandline()
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 260, in run_commandline
    return self.run(cmd_name, config_updates, named_configs, {}, args)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/experiment.py", line 209, in run
    run()
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/run.py", line 221, in __call__
    self.result = self.main_function(*args)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 373, in dsd_100_experiment
    sup_model_path, sup_loss = optimise(dataset=dataset)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 311, in optimise
    model_path = train(sup_dataset=dataset["train_sup"], load_model=model_path)
  File "/home/leo/.local/lib/python2.7/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "Training.py", line 269, in train
    sup_batch = sup_batch_gen.get_batch()
  File "/home/leo/Wave-U-Net/Input/batchgenerators.py", line 123, in get_batch
    self.cache.update_cache_from_queue()
  File "/home/leo/Wave-U-Net/Input/multistreamcache.py", line 84, in update_cache_from_queue
    self.update_next_cache_item(self.communication_queue.get())
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
KeyboardInterrupt
^C

The full stack trace can be found in this [stack trace gist](https://gist.github.com/leoybkim/789d367a0ee2c63db8a513613270b017).

It looks like something went wrong while fetching the next batch from the queue. I also noticed your TODO comment about empty queues on the update_cache_from_queue() function in multistreamcache.py. I wonder whether it is related to this.
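
In case it helps with debugging, here is a minimal standalone sketch (hypothetical names, not the project's actual classes) of the pattern I suspect is involved: if the background workers stop putting items into the queue, a bare queue.get() in the training loop blocks forever. Using get() with a timeout at least turns the silent hang into a visible error.

```python
# Minimal sketch of a consumer hanging on queue.get() once the background
# workers stop producing (hypothetical names, not the Wave-U-Net classes).
import multiprocessing as mp
import queue  # on Python 2.7 this module is called Queue


def worker(q, n_items):
    # Simulates a data loader that fills the queue and then silently stops,
    # e.g. because resampling froze or the process died.
    for i in range(n_items):
        q.put("example_%d" % i)


if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q, 3))
    p.start()

    while True:
        try:
            # A bare q.get() would block forever once the worker is done;
            # a timeout surfaces the problem instead of hanging silently.
            batch = q.get(timeout=10)
            print("got:", batch)
        except queue.Empty:
            print("No new examples for 10s - workers may have died or hung.")
            break

    p.join()
```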

@f90
Owner

f90 commented Aug 24, 2018

Hey! Sorry for the late reply.

That seems odd: it does indeed get stuck in the queue's get() call, but only after the first epoch. Normally that indicates something has gone wrong with the workers that load audio examples in the background, so they stop filling the internal queue and get() hangs while waiting for new examples. But it seems like your cache is being filled properly at the start of the new epoch.

Does this problem occur every time you execute the code? I am going to try to replicate it on my end.

Are you using the package versions indicated in the requirements?

It must be that the caching system is, for some reason, not being reset properly between epochs.

To use multiple GPUs, a slight extension is necessary (see "Towers" in the TensorFlow manual), which I did not implement here. If you do implement it, it would be great if you submitted a pull request so everyone can benefit from it.
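
For reference, here is a rough sketch of what the tower pattern could look like under the TF 1.x API (standalone toy example; build_tower_loss and the shapes are placeholders, none of this is code from this repo):

```python
# Toy sketch of TF 1.x multi-GPU "towers": one model replica per GPU,
# shared variables via scope reuse, gradients averaged and applied once.
import tensorflow as tf

NUM_GPUS = 4


def build_tower_loss(batch_x, batch_y):
    # Stand-in for the real separator network: one conv layer and an L2 loss.
    pred = tf.layers.conv1d(batch_x, filters=1, kernel_size=3, padding="same")
    return tf.reduce_mean(tf.square(pred - batch_y))


def average_gradients(tower_grads):
    # tower_grads holds one list of (grad, var) pairs per GPU.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grads_and_vars]
        mean_grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        averaged.append((mean_grad, grads_and_vars[0][1]))
    return averaged


x = tf.placeholder(tf.float32, [None, 16384, 1])  # input audio snippets
y = tf.placeholder(tf.float32, [None, 16384, 1])  # target source
x_shards = tf.split(x, NUM_GPUS)
y_shards = tf.split(y, NUM_GPUS)

optimizer = tf.train.AdamOptimizer(1e-4)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i), tf.variable_scope("separator", reuse=(i > 0)):
        loss = build_tower_loss(x_shards[i], y_shards[i])
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply them once so all GPUs share weights.
train_op = optimizer.apply_gradients(average_gradients(tower_grads))
```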

@f90
Owner

f90 commented Aug 26, 2018

I tried to replicate this on my end, both with the old code and with the newest revision I just committed, but could not reproduce the issue; training proceeds normally for me. I tested with epochs_it reduced to 100 so that an epoch goes by quickly and I can test easily.

You might want to check out the latest code and see whether your problem persists. It is also worth checking that your dataset is intact (no file corruption, file paths set up correctly), since one possible cause is that the individual data loader processes hang and stop filling the queue.

@leoybkim
Author

Thanks for the follow-up, @f90.

The multi-GPU setup was definitely my mistake; I initially thought TF could handle it automatically.
In the end I removed the batch generation and multistream cache scripts and used a different approach. I'm going to go ahead and close the issue, since I don't think I can replicate it either at this point 😅
Thanks again for providing the source code and for following up. Greatly appreciated, and I learned a lot!

@f90 f90 reopened this Aug 27, 2018
@f90
Owner

f90 commented Aug 27, 2018

Investigating this further, there was indeed a problem with training not continuing after a certain number of epochs. Oddly enough, the problem came down to the librosa resampling procedure, which simply froze for some reason at the start of an epoch, when the workers started loading audio, so the cache's queue was never filled and the training loop waited for the workers forever.

I changed the resampling procedure to use scipy's resample_poly, which should work well enough for most sample rate conversions. This fixes the issue.
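
For anyone who wants to try the same conversion outside this repo, here is a minimal standalone example of resampling with scipy.signal.resample_poly (dummy data and rates; only the resample_poly call itself reflects the change described above):

```python
# Minimal example of sample rate conversion with scipy's polyphase resampler.
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

orig_sr = 44100
target_sr = 22050

audio = np.random.randn(orig_sr * 2).astype(np.float32)  # 2 s of dummy audio

# Express the rate change as a rational up/down factor and resample.
ratio = Fraction(target_sr, orig_sr).limit_denominator(1000)
resampled = resample_poly(audio, ratio.numerator, ratio.denominator)

print(resampled.shape)  # (44100,), i.e. 2 s at 22050 Hz
```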

Thanks for reporting it, and glad that you were able to make good use of the code overall!
