Loss becomes nan after a while #7

Open
singlasahil14 opened this issue Dec 3, 2016 · 15 comments

@singlasahil14

[screenshot: training log where the loss becomes nan, 2016-12-03]

Why is this happening, and how can I solve it?

@Ka-ya

Ka-ya commented Dec 4, 2016

Try using the Keras optimizer rather than Lasagne's.

@singlasahil14
Author

What would be the exact changes in the code?

@aglotero
Contributor

aglotero commented Dec 5, 2016

Same here. I edited model.py:
Comment out import lasagne
Uncomment from keras.optimizers import SGD
Comment out grads = lasagne.updates.total_norm_constraint...
Comment out updates = lasagne.updates.nesterov_momentum...
Uncomment optimizer = SGD(nesterov=True, lr=learning_rate,...
Uncomment updates = optimizer.get_updates(...
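For anyone looking for the concrete shape of that edit, here is a minimal, self-contained sketch of the swapped-in Keras path. The toy loss and parameter below stand in for the real network, and names like learning_rate and momentum are placeholders, not necessarily the repo's exact identifiers; the get_updates call uses the Keras 1.x signature.

```python
import theano
import theano.tensor as T
from keras.optimizers import SGD   # replaces the lasagne.updates path

# Placeholder hyperparameters and a toy model standing in for the CTC network.
learning_rate, momentum = 2e-4, 0.9
W = theano.shared(0.0, name='W')          # toy trainable parameter
x, y = T.scalar('x'), T.scalar('y')
cost = T.mean((W * x - y) ** 2)           # toy loss in place of the real cost
trainable_vars = [W]

# Keras SGD in place of lasagne.updates.total_norm_constraint +
# lasagne.updates.nesterov_momentum; clipnorm caps the gradient norm.
optimizer = SGD(nesterov=True, lr=learning_rate, momentum=momentum, clipnorm=100)
updates = optimizer.get_updates(trainable_vars, [], cost)   # Keras 1.x signature
train_fn = theano.function([x, y], cost, updates=updates)
```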

@dylanbfox

I'm seeing this issue too. I switched to using the Keras optimizer instead of Lasagne's, making the same changes that @aglotero cited above.

For the first 8990 (out of 12188) iterations the loss was behaving properly. Then, starting at iteration 9000, I started seeing nan:

...
2016-12-07 04:38:33,080 INFO    (__main__) Epoch: 0, Iteration: 8960, Loss: 148.405151367
2016-12-07 04:40:38,369 INFO    (__main__) Epoch: 0, Iteration: 8970, Loss: 356.538299561
2016-12-07 04:42:43,709 INFO    (__main__) Epoch: 0, Iteration: 8980, Loss: 382.034057617
2016-12-07 04:44:49,189 INFO    (__main__) Epoch: 0, Iteration: 8990, Loss: 310.213592529
2016-12-07 04:58:47,111 INFO    (__main__) Epoch: 0, Iteration: 9000, Loss: nan

Interestingly, the loss spiked at iteration 8960. Here is the plot for the first 9000 iterations.

[plot: training loss over the first 9000 iterations]

Some notes: I am using dropout on the RNN layers (hence the shape of the plot), and I increased the amount of data being trained on by raising the max duration to 15.0 seconds. My mini-batch size is 24.
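(For context, "dropout on the RNN layers" means something like the minimal sketch below. It assumes a Keras 1.x functional model; the 0.2 rate, 1000-unit width, and 161-dimensional input features are illustrative values, not the ones used here.)

```python
from keras.layers import Input, GRU, Dropout
from keras.models import Model

# Minimal illustration: a Dropout layer inserted between stacked GRUs.
feats = Input(shape=(None, 161))               # (time, features), sizes assumed
h = GRU(1000, return_sequences=True)(feats)
h = Dropout(0.2)(h)                            # the extra dropout referred to above
h = GRU(1000, return_sequences=True)(h)
model = Model(input=feats, output=h)
```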

@aglotero
Contributor

aglotero commented Dec 9, 2016

Using the SGD optimizer with the clipnorm=1 option may be a solution.
I was getting a nan cost at the 400th iteration; now I'm at the 3690th iteration and still running.
[plot: training loss with clipnorm=1]
I saw a similar issue at keras-team/keras#1244
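For reference, clipnorm is just an argument to the Keras optimizer that rescales gradients whose norm exceeds the given threshold, so the change amounts to something like the line below (the learning-rate and momentum values are placeholders, not necessarily this repo's defaults):

```python
from keras.optimizers import SGD

# clipnorm caps the gradient norm before the update is applied;
# lr and momentum here are placeholder values.
optimizer = SGD(lr=2e-4, momentum=0.9, nesterov=True, clipnorm=1)
```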

@dylanbfox

FWIW I fixed this by dropping the learning rate and removing the dropout layers I added. I left the clipnorm value at 100.

@a00achild1

Hi!
I changed the clipnorm to 1 like @aglotero said, but with more GRU layers (1 convolutional, 7 GRU with 1000 nodes each, 1 fully connected).
[plot: training loss with 7 GRU layers]
I found that the loss was converging but got stuck at about 300, and the visualized result for testing is really bad!
Does that mean the structure is not deep enough? Or should I train for more epochs?
Thanks!

@dylanbfox

@a00achild1 I don't think you want clipnorm set to 1. Were you getting NaNs before with the clipnorm set to a higher value (~100)?

@a00achild1

@dylanbfox thanks for the quick response!
When the clipnorm is the default value (100), I get NaNs after some iterations.
Then I read this issue and tried training the model with clipnorm 1.

Why do you think setting clipnorm to 1 is not a good idea? Is performance hurt by the small clipnorm value?

@dylanbfox

What is your learning rate? Try dropping that and keeping the clipnorm higher.

@a00achild1

a00achild1 commented Dec 23, 2016

@dylanbfox my learning rate is 2e-4, the default value.
In my experience that is already really small, but maybe I am wrong.
I will try a smaller value while keeping the clipnorm higher.
Thanks

@a00achild1

a00achild1 commented Dec 26, 2016

I set the learning rate to 2e-4 and the clipnorm back to 100, training on LibriSpeech clean-100. My model structure is 1 convolutional layer, 7 GRU layers with 1000 nodes each, and 1 fully connected layer, following Baidu's paper.

While the training loss seemed to keep dropping, the validation loss started to diverge. The prediction for a test file is better than before, but it still can't predict correct words.
Has anyone trained a good model for speech recognition, or does anyone have a suggestion? Any suggestion will be really appreciated!

P.S. Could the problem still be the clipnorm? I've been searching for a while, but there doesn't seem to be a good way to determine the clip value.
[plot: training and validation loss with 7 GRU layers]
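For reference, here is a rough Keras 1.x sketch of the stack described above (1 convolutional layer, 7 GRU layers of 1000 units each, 1 fully connected output). The filter length, stride, 161-dimensional input features, and 29 output labels are illustrative assumptions, not necessarily the exact hyperparameters used:

```python
from keras.layers import Input, Convolution1D, GRU, TimeDistributed, Dense
from keras.models import Model

# Illustrative only: 1 conv + 7 GRU(1000) + time-distributed dense softmax.
acoustic_input = Input(shape=(None, 161), name='acoustic_input')
x = Convolution1D(1000, 11, subsample_length=2, activation='relu')(acoustic_input)
for _ in range(7):
    x = GRU(1000, return_sequences=True, activation='relu')(x)
softmax_out = TimeDistributed(Dense(29, activation='softmax'))(x)
model = Model(input=acoustic_input, output=softmax_out)
```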

@kisuksung

I have a similar problem to what the other comments describe, but my model produces a NaN value after the 1st iteration.
I tried changing the optimizer (Keras and Lasagne), the clipnorm (1 and 100), and the learning rate (2e-4 and 0.01), but the cost is still NaN.
Can anyone advise on this problem? I would really appreciate a solution.
I am using Keras 1.0.7 and Theano rel-0.8.2. If you think these versions are not appropriate, please let me know.

Example: Keras, learning_rate=2e-4, clipnorm=1
2017-01-09 01:27:52,611 INFO (main) Epoch: 0, Iteration: 0, Loss: 241.261184692
2017-01-09 01:28:00,360 INFO (main) Epoch: 0, Iteration: 1, Loss: nan
2017-01-09 01:28:07,864 INFO (main) Epoch: 0, Iteration: 2, Loss: nan
2017-01-09 01:28:15,374 INFO (main) Epoch: 0, Iteration: 3, Loss: nan
2017-01-09 01:28:23,191 INFO (main) Epoch: 0, Iteration: 4, Loss: nan
2017-01-09 01:28:31,301 INFO (main) Epoch: 0, Iteration: 5, Loss: nan
2017-01-09 01:28:39,587 INFO (main) Epoch: 0, Iteration: 6, Loss: nan
2017-01-09 01:28:48,127 INFO (main) Epoch: 0, Iteration: 7, Loss: nan
2017-01-09 01:28:56,824 INFO (main) Epoch: 0, Iteration: 8, Loss: nan
2017-01-09 01:29:05,442 INFO (main) Epoch: 0, Iteration: 9, Loss: nan
2017-01-09 01:29:14,783 INFO (main) Epoch: 0, Iteration: 10, Loss: nan
2017-01-09 01:29:23,937 INFO (main) Epoch: 0, Iteration: 11, Loss: nan

@kisuksung

I just found a solution! We should use Keras 1.1.0 or above.

@varunravi

@a00achild1 Hey. Did you find out why your graph turned out like that? I'm currently at that stage.
