
Floating point exception (core dumped) #49

Closed
ZohaibAhmed opened this issue Apr 9, 2019 · 10 comments

@ZohaibAhmed

I tried to train the tacotron model on top of the LJ pretrained checkpoint you provide. I just ran train_tacotron.py, but when I run gen_tacotron.py, I get the following:

Initialising WaveRNN Model...

Trainable Parameters: 4.481M

Loading Weights: "checkpoints/lj.wavernn/latest_weights.pyt"


Initialising Tacotron Model...

Trainable Parameters: 11.078M

Loading Weights: "checkpoints/lj.tacotron/latest_weights.pyt"

+---------+----------+---+-----------------+----------------+-----------------+
| WaveRNN | Tacotron | r | Generation Mode | Target Samples | Overlap Samples |
+---------+----------+---+-----------------+----------------+-----------------+
|  804k   |   197k   | 1 |     Batched     |     11000      |       550       |
+---------+----------+---+-----------------+----------------+-----------------+
 

| Generating 1/6
Floating point exception (core dumped)

Any ideas on how I can go on debugging this?

@fatchord
Owner

fatchord commented Apr 9, 2019

@ZohaibAhmed Unfortunately I don't see the same error on my end - can you do me a small favor? If you have an IDE with breakpoints, can you check which function is causing that in gen_tacotron.py (it should be somewhere in the loop starting on line 91)?

If you don't have breakpoints, you can just add print('a', True), print('b', True) after each function call in that loop to see which one is throwing the error.

Thanks.
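
A minimal sketch of that marker-print idea (the two generate calls below are placeholders standing in for whatever gen_tacotron.py actually calls in that loop, not the real signatures; flush=True just makes sure each marker shows up before a hard crash):

# Sketch only: print a marker after each step of the generation loop, so the
# last marker seen before the crash points at the call that failed.
# tts_generate / voc_generate are placeholders for the real calls.
for i, text in enumerate(input_texts, 1):
    print(f'| Generating {i}/{len(input_texts)}', flush=True)
    mel = tts_generate(text)      # Tacotron: text -> mel spectrogram
    print('a', flush=True)        # reached: Tacotron finished
    wav = voc_generate(mel)       # WaveRNN: mel -> waveform
    print('b', flush=True)        # reached: vocoder finished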

@ZohaibAhmed
Author

ZohaibAhmed commented Apr 9, 2019

Looks like the issue is in the vocoder's generate function in fatchord_wavernn, specifically when it calls:

h1 = rnn1(x, h1)

Note that using the pretrained model out of the box seems to work; the error only occurs when I train the model further.

More details about my setup:

ubuntu16.04
pytorch=1.0.0
cuda10.0
cudnn7.4.1_1
GPU: RTX 2080 Ti
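
One thing worth ruling out (just a sketch, not code from this repo, and it assumes the .pyt file is a plain state_dict) is NaN/Inf values creeping into the fine-tuned weights, since that would only appear after further training and could crash inside the rnn1 call:

import torch

# Sketch: scan the fine-tuned checkpoint for NaN/Inf parameters before
# blaming the rnn1 call itself. Path taken from the log output above.
state = torch.load('checkpoints/lj.tacotron/latest_weights.pyt', map_location='cpu')
for name, tensor in state.items():
    if torch.is_floating_point(tensor):
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            print('non-finite values in', name)

# The same check applies to the mel tensor Tacotron hands to the vocoder:
# printing its shape and torch.isnan(mel).any() right before the vocoder's
# generate call would show whether the mel itself is already broken.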

@fatchord
Owner

@ZohaibAhmed can I get the exact steps you went through to get that error? Have you tried training a fresh model for a couple of epochs and then tried generating?

Also is there no other error message besides "Floating point exception (core dumped)"?

@ZohaibAhmed
Author

@fatchord - training a model from scratch seems to work.

The exact steps I did were as follows:

  1. Take your pretrained models.
  2. Get a different dataset and run the preprocessor on it (the dataset is structured exactly like LJ):
     Input File     : '100.wav'
     Channels       : 1
     Sample Rate    : 22050
     Precision      : 16-bit
     Duration       : 00:00:03.42 = 75411 samples ~ 256.5 CDDA sectors
     File Size      : 151k
     Bit Rate       : 353k
     Sample Encoding: 16-bit Signed Integer PCM
  3. Run train_tacotron.py for a bit.
  4. Run gen_tacotron.py after the first checkpoint (I made it save after 500 steps instead of the default).

And that's how I get to that error. Even if I keep the WaveRNN as the pretrained model, it still results in the Floating point exception (core dumped). There's no other stack trace.

@fatchord
Owner

@ZohaibAhmed can you try training LJ from scratch to see if you get the same error?

@ZohaibAhmed
Author

ZohaibAhmed commented Apr 12, 2019

@fatchord training Tacotron from scratch makes it work. But I don't have enough data for my own dataset to effectively train the model.

Have you had any success with fine-tuning?

EDIT: The main issue seems to be that the decoder is producing all silent values.

It looks like the shape of the output from the original pretrained model is different than when I train on top of it:

Original:
torch.Size([1, 80, 338])

Tuned:
torch.Size([1, 80, 1])

Looks like I hit the condition that breaks when silent frames are present:

if (mel_frames < -3.8).all() : break

This is what the alignment plot looks like while training tacotron:

[alignment plot image attached]

@candlewill

candlewill commented Apr 14, 2019

@ZohaibAhmed I ran into the same error. The reason is that the first frame of mel_frames is all silence (< -3.8), which makes the Tacotron output empty. You could fix it with the following code:

if (mel_frames < -3.8).all() and i > 10 : break
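
For context, that condition sits inside Tacotron's autoregressive decoder loop; a rough sketch of the patched loop (decoder_step and steps are made-up names here, only the break condition itself is from this thread):

# Sketch of the decoder loop with the patched stop condition.
# decoder_step / steps are placeholders, not names from the repo.
mel_outputs = []
for i in range(steps):
    mel_frames = decoder_step(i)   # one decoder step -> a block of mel frames
    mel_outputs.append(mel_frames)
    # The old check broke as soon as a block was entirely "silent" (< -3.8),
    # which can fire on the very first frame and leave a [1, 80, 1] output.
    # Requiring i > 10 prevents that premature stop.
    if (mel_frames < -3.8).all() and i > 10:
        break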

@fatchord
Owner

@candlewill Nice catch, I'll push a fix for that later today.

@ZohaibAhmed
Author

@candlewill - I still largely get silence (with some static). Did you try to train your model on top of the checkpoint that @fatchord provided? Or did you just train it from scratch?

@fatchord
Owner

Tacotron has been updated to fix the premature stopping of generation.
