Loss becomes nan after a while #7

Open
singlasahil14 opened this issue Dec 3, 2016 · 15 comments

@singlasahil14

[screenshot: training log where the loss becomes nan, 2016-12-03]

Why is this happening, and how can I solve it?

@Ka-ya

Ka-ya commented Dec 4, 2016

Try using the Keras optimizer rather than Lasagne's.

@singlasahil14
Author

What would be the exact changes in the code?

@aglotero
Contributor

aglotero commented Dec 5, 2016

Same here. I edited model.py:
Comment out import lasagne
Uncomment from keras.optimizers import SGD
Comment out grads = lasagne.updates.total_norm_constraint...
Comment out updates = lasagne.updates.nesterov_momentum...
Uncomment optimizer = SGD(nesterov=True, lr=learning_rate,...
Uncomment updates = optimizer.get_updates(...
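For anyone looking for the concrete shape of that edit, here is a minimal, self-contained sketch of the swapped-in Keras path. The toy loss and parameter below stand in for the real network, and names like learning_rate and momentum are placeholders, not necessarily the repo's exact identifiers; the get_updates call uses the Keras 1.x signature.

```python
import theano
import theano.tensor as T
from keras.optimizers import SGD   # replaces the lasagne.updates path

# Placeholder hyperparameters and a toy model standing in for the CTC network.
learning_rate, momentum = 2e-4, 0.9
W = theano.shared(0.0, name='W')          # toy trainable parameter
x, y = T.scalar('x'), T.scalar('y')
cost = T.mean((W * x - y) ** 2)           # toy loss in place of the real cost
trainable_vars = [W]

# Keras SGD in place of lasagne.updates.total_norm_constraint +
# lasagne.updates.nesterov_momentum; clipnorm caps the gradient norm.
optimizer = SGD(nesterov=True, lr=learning_rate, momentum=momentum, clipnorm=100)
updates = optimizer.get_updates(trainable_vars, [], cost)   # Keras 1.x signature
train_fn = theano.function([x, y], cost, updates=updates)
```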

@dylanbfox

I'm seeing this issue too. I switched to using the Keras optimizer instead of Lasagne's, making the same changes that @aglotero cited above.

For the first 8990 (out of 12188) iterations the loss was behaving properly. Then, starting at iteration 9000, I started seeing nan:

...
2016-12-07 04:38:33,080 INFO    (__main__) Epoch: 0, Iteration: 8960, Loss: 148.405151367
2016-12-07 04:40:38,369 INFO    (__main__) Epoch: 0, Iteration: 8970, Loss: 356.538299561
2016-12-07 04:42:43,709 INFO    (__main__) Epoch: 0, Iteration: 8980, Loss: 382.034057617
2016-12-07 04:44:49,189 INFO    (__main__) Epoch: 0, Iteration: 8990, Loss: 310.213592529
2016-12-07 04:58:47,111 INFO    (__main__) Epoch: 0, Iteration: 9000, Loss: nan

Interestingly, the loss spiked at iteration 8960. Here is the plot for the first 9000 iterations.

[plot: training loss over the first 9000 iterations]

Some notes: I am using dropout on the RNN layers (hence the shape of the plot), and I increased the amount of data being trained on by raising the max duration to 15.0 seconds. My mini-batch size is 24.
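(For context, "dropout on the RNN layers" means something like the minimal sketch below. It assumes a Keras 1.x functional model; the 0.2 rate, 1000-unit width, and 161-dimensional input features are illustrative values, not the ones used here.)

```python
from keras.layers import Input, GRU, Dropout
from keras.models import Model

# Minimal illustration: a Dropout layer inserted between stacked GRUs.
feats = Input(shape=(None, 161))               # (time, features), sizes assumed
h = GRU(1000, return_sequences=True)(feats)
h = Dropout(0.2)(h)                            # the extra dropout referred to above
h = GRU(1000, return_sequences=True)(h)
model = Model(input=feats, output=h)
```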

@aglotero
Contributor

aglotero commented Dec 9, 2016

Using the SGD optimizer with the clipnorm=1 option may be a solution.
I was getting a nan cost at the 400th iteration; now I'm at the 3690th iteration and still running.
[plot: training loss with clipnorm=1]
I saw a similar issue at keras-team/keras#1244
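For reference, clipnorm is just an argument to the Keras optimizer that rescales gradients whose norm exceeds the given threshold, so the change amounts to something like the line below (the learning-rate and momentum values are placeholders, not necessarily this repo's defaults):

```python
from keras.optimizers import SGD

# clipnorm caps the gradient norm before the update is applied;
# lr and momentum here are placeholder values.
optimizer = SGD(lr=2e-4, momentum=0.9, nesterov=True, clipnorm=1)
```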

@dylanbfox

FWIW I fixed this by dropping the learning rate and removing the dropout layers I added. I left the clipnorm value at 100.

@a00achild1

Hi!
I changed the clipnorm to 1 like @aglotero said, but with more GRU layers (1 convolutional, 7 GRU with 1000 nodes each, 1 fully connected).
[plot: training loss with 7 GRU layers]
I found that the loss was converging but got stuck at about 300, and the visualized result for testing is really bad!
Does that mean the structure is not deep enough? Or should I train for more epochs?
Thanks!

@dylanbfox

@a00achild1 I don't think you want clipnorm set to 1. Were you getting NaNs before with the clipnorm set to a higher value (~100)?

@a00achild1

@dylanbfox thanks for the quick response!
When the clipnorm is the default value (100), I get NaNs after some iterations.
Then I read this issue and tried training the model with clipnorm 1.

Why do you think setting clipnorm to 1 is not a good idea? Is performance hurt by the small clipnorm value?

@dylanbfox

What is your learning rate? Try dropping that and keeping the clipnorm higher.

@a00achild1

a00achild1 commented Dec 23, 2016

@dylanbfox my learning rate is 2e-4, the default value.
In my experience that is already really small, but maybe I am wrong.
I will try a smaller value while keeping the clipnorm higher.
Thanks

@a00achild1

a00achild1 commented Dec 26, 2016

I set the learning rate to 2e-4 and the clipnorm back to 100, training on LibriSpeech clean-100. My model structure is 1 convolutional layer, 7 GRU layers with 1000 nodes each, and 1 fully connected layer, following Baidu's paper.

While the training loss seemed to keep dropping, the validation loss started to diverge. The prediction for a test file is better than before, but it still can't predict correct words.
Has anyone trained a good model for speech recognition, or does anyone have a suggestion? Any suggestion will be really appreciated!

P.S. Could the problem still be the clipnorm? I've been searching for a while, but there doesn't seem to be a good way to determine the clip value.
[plot: training and validation loss with 7 GRU layers]
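For reference, here is a rough Keras 1.x sketch of the stack described above (1 convolutional layer, 7 GRU layers of 1000 units each, 1 fully connected output). The filter length, stride, 161-dimensional input features, and 29 output labels are illustrative assumptions, not necessarily the exact hyperparameters used:

```python
from keras.layers import Input, Convolution1D, GRU, TimeDistributed, Dense
from keras.models import Model

# Illustrative only: 1 conv + 7 GRU(1000) + time-distributed dense softmax.
acoustic_input = Input(shape=(None, 161), name='acoustic_input')
x = Convolution1D(1000, 11, subsample_length=2, activation='relu')(acoustic_input)
for _ in range(7):
    x = GRU(1000, return_sequences=True, activation='relu')(x)
softmax_out = TimeDistributed(Dense(29, activation='softmax'))(x)
model = Model(input=acoustic_input, output=softmax_out)
```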

@kisuksung

I have a similar problem to what the other comments describe, but my model produces a NaN value after the 1st iteration.
I tried changing the optimizer (Keras and Lasagne), the clipnorm (1 and 100), and the learning rate (2e-4 and 0.01), but the cost is still NaN.
Can anyone advise on this problem? I would really appreciate a solution.
I am using Keras 1.0.7 and Theano rel-0.8.2. If you think these versions are not appropriate, please let me know.

Example: Keras, learning_rate=2e-4, clipnorm=1
2017-01-09 01:27:52,611 INFO (main) Epoch: 0, Iteration: 0, Loss: 241.261184692
2017-01-09 01:28:00,360 INFO (main) Epoch: 0, Iteration: 1, Loss: nan
2017-01-09 01:28:07,864 INFO (main) Epoch: 0, Iteration: 2, Loss: nan
2017-01-09 01:28:15,374 INFO (main) Epoch: 0, Iteration: 3, Loss: nan
2017-01-09 01:28:23,191 INFO (main) Epoch: 0, Iteration: 4, Loss: nan
2017-01-09 01:28:31,301 INFO (main) Epoch: 0, Iteration: 5, Loss: nan
2017-01-09 01:28:39,587 INFO (main) Epoch: 0, Iteration: 6, Loss: nan
2017-01-09 01:28:48,127 INFO (main) Epoch: 0, Iteration: 7, Loss: nan
2017-01-09 01:28:56,824 INFO (main) Epoch: 0, Iteration: 8, Loss: nan
2017-01-09 01:29:05,442 INFO (main) Epoch: 0, Iteration: 9, Loss: nan
2017-01-09 01:29:14,783 INFO (main) Epoch: 0, Iteration: 10, Loss: nan
2017-01-09 01:29:23,937 INFO (main) Epoch: 0, Iteration: 11, Loss: nan

@kisuksung

I just found a solution! We should use Keras 1.1.0 or above.

@varunravi

@a00achild1 Hey. Did you find out why your graph turned out like that? I'm currently at that stage.
