
Division by Zero when training #12

Closed · LukeB42 opened this issue Jan 13, 2018 · 10 comments

LukeB42 commented Jan 13, 2018

  File "samplernn-pytorch/trainer/__init__.py", line 45, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/usr/local/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

This is with PyTorch 0.3.0.post4.

sbl commented Jan 21, 2018

Same behavior here:

  File "train.py", line 337, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 235, in main
    trainer.run(params['epoch_limit'])
  File "/home/stephen/src/samplernn-pytorch/trainer/__init__.py", line 57, in run
    self.call_plugins('epoch', self.epochs)
  File "/home/stephen/src/samplernn-pytorch/trainer/__init__.py", line 44, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/home/stephen/anaconda3/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

koz4k (Member) commented Jan 25, 2018

Duplicate of #10.

The problem is that for validation we discard the last (incomplete) minibatch so it doesn't skew the result, as it might be smaller than the rest and we average the loss over minibatches with equal weights. Specifically, if you only have one minibatch, it tries to average over an empty set, hence division by zero. This could be handled better and we're planning to do that in the near future.
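As a minimal sketch of that failure mode (the values and names below are illustrative, not the repo's actual code):

# Hypothetical illustration: with drop_last-style batching, a dataset
# smaller than one batch yields zero complete minibatches per epoch.
dataset_size = 50
batch_size = 128

num_batches = dataset_size // batch_size  # 0: the only (incomplete) batch is dropped
loss_sum = 0.0                            # epoch_stats[0]: no losses accumulated
epoch_mean = loss_sum / num_batches       # epoch_stats[1] == 0 -> ZeroDivisionError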

LukeB42 (Author) commented Jan 31, 2018

@koz4k Thanks for the response, but what do you suggest for fixing this myself in the meantime?

Returning early when args is empty doesn't work, and wrapping the function body in a try/except causes the program to exit after around 1,000 exceptions.

koz4k (Member) commented Jan 31, 2018

Sorry, I was wrong: this is related to the size of the training set, not the validation set. Either way, the solution is to lower the batch size or use a bigger dataset. I would recommend a bigger dataset, because with such a small one you might not be able to achieve good results anyway.
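To make the arithmetic concrete (the numbers are hypothetical): the training loop only counts complete minibatches, so

40 // 128 == 0   # batch size larger than the dataset: mean over an empty set
40 // 32  == 1   # smaller batch size: at least one complete minibatch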

LukeB42 (Author) commented Feb 1, 2018

@koz4k OK, thanks for explaining that.

LukeB42 (Author) commented Feb 1, 2018

@koz4k Following your suggestion, using

python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset custom --batch_size 64

I'm getting the following result:

Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "pytorch-samplernn/trainer/__init__.py", line 56, in run
    self.train()
  File "pytorch-samplernn/trainer/__init__.py", line 61, in train
    enumerate(self.dataset, self.iterations + 1):
  File "pytorch-samplernn/dataset.py", line 51, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 188, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 96, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/usr/local/lib/python3.6/site-packages/torch/functional.py", line 64, in stack
    return torch.cat(inputs, dim)
RuntimeError: inconsistent tensor sizes at /pytorch/torch/lib/TH/generic/THTensorMath.c:2864

What do you suggest I do to fix this for the time being?

comeweber commented

Are you sure that all the .wav files in your dataset directory have the same duration?
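
A quick way to check is the standard-library wave module (a minimal sketch; the directory path is a placeholder for your dataset folder):

import glob
import wave

# Print the length in frames of every .wav file in the dataset directory;
# for torch.stack to work in the collate step, all counts must be identical.
for path in sorted(glob.glob('datasets/custom/*.wav')):
    with wave.open(path, 'rb') as f:
        print(path, f.getnframes())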

LukeB42 (Author) commented Feb 1, 2018

@comeweber @koz4k Many thanks for your help, both of you. It's now training stably, using wav files that are 8 seconds long and a --batch_size of 32.

niuqun commented May 11, 2018

@LukeB42 Could you share the file structure of your custom folder? I cannot use youtube-dl to generate the training data right now, so I downloaded an audio file myself. Although I have 8-second chunks, training fails with the following error:

Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "/root/Documents/samplernn-pytorch-master/trainer/__init__.py", line 56, in run
    self.train()
  File "/root/Documents/samplernn-pytorch-master/trainer/__init__.py", line 61, in train
    enumerate(self.dataset, self.iterations + 1):
  File "/root/Documents/samplernn-pytorch-master/dataset.py", line 51, in __iter__
    for batch in super().__iter__():
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 115, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 353344 and 352320 in dimension 1 at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/TH/generic/THTensorMath.c:3586

koz4k (Member) commented Sep 21, 2018

You most likely have chunks that are not exactly equal in length; many tools for chunking audio files tend to do that. You can use ffmpeg, which cuts the files cleanly. See the downloading script for an example.
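
Something along these lines works (an illustrative command, not the exact one from the downloading script; the last segment may come out shorter and should be deleted afterwards):

ffmpeg -i recording.wav -f segment -segment_time 8 chunk%04d.wav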

koz4k closed this as completed Sep 21, 2018