
Add readout_activation param to models #199

Closed

Conversation

gibrown commented Jun 30, 2017

This makes it possible to avoid getting stuck in a NaN loss hole during training. This workaround lets the user fix #189

Example usage:

model = Seq2Seq(input_dim=in_dim, input_length=MAXLENGTH, hidden_dim=HIDDEN_SIZE,
                output_length=MAXLENGTH, output_dim=out_dim, depth=LAYERS,
                peek=True, readout_activation='softmax')

I do not fully understand the implications of using softmax as the output activation, but in my own project (https://github.com/Automattic/wp-translate) setting it with this code does seem to have gotten me past getting stuck with NaN during training.

- setting the readout_activation to something like softmax can avoid NaN losses during training
- see farizrahman4u#189
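
For context, here is roughly how this fits into a full training run in my project. This is a minimal sketch: the sizes, optimizer, and toy data are placeholders, not part of this patch.

import numpy as np
from seq2seq.models import Seq2Seq

# Placeholder sizes, not part of this patch
MAXLENGTH, HIDDEN_SIZE, LAYERS = 20, 128, 2
in_dim, out_dim = 50, 50  # one-hot vocabulary sizes

model = Seq2Seq(input_dim=in_dim, input_length=MAXLENGTH, hidden_dim=HIDDEN_SIZE,
                output_length=MAXLENGTH, output_dim=out_dim, depth=LAYERS,
                peek=True, readout_activation='softmax')

# categorical_crossentropy expects probability-like outputs, which is why a
# softmax readout seems to help with the NaN losses reported in #189
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Toy one-hot data, just to show the expected 3D shapes
X = np.zeros((32, MAXLENGTH, in_dim)); X[:, :, 0] = 1
Y = np.zeros((32, MAXLENGTH, out_dim)); Y[:, :, 0] = 1
model.fit(X, Y, epochs=1)
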
@ChristopherLu

Does this change include model.add(TimeDistributed(Dense(output_dim)))?


gibrown commented Jul 3, 2017

@ChristopherLu no, it only lets you change the activation function of the decoder output.
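
If it helps clarify the difference: the plain-Keras equivalent of adding that readout yourself would look roughly like the sketch below. This is just illustrative, not what this patch does, and the sizes are placeholders.

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

MAXLENGTH, HIDDEN_SIZE, in_dim, out_dim = 20, 128, 50, 50  # placeholders

model = Sequential()
model.add(LSTM(HIDDEN_SIZE, input_shape=(MAXLENGTH, in_dim)))     # encoder
model.add(RepeatVector(MAXLENGTH))                                 # repeat encoding for each output step
model.add(LSTM(HIDDEN_SIZE, return_sequences=True))                # decoder
model.add(TimeDistributed(Dense(out_dim, activation='softmax')))   # explicit softmax readout
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
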

BTW, I have since found that this change did not completely solve my problem. Training ran for longer, but I still eventually hit NaN losses. I am not sure yet whether the problem is in the data I am feeding the model or in how the network itself fits together.

@ChristopherLu

@gibrown Exactly. I eventually hit the NaN problem as the number of training epochs went up. I suspect it's a vanishing-gradient problem, and it also depends on the learning rate you set.

So can we say that the softmax activation only alleviates the NaN problem, rather than solving it at the root?
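
One thing that might push the NaN further out (untested on my side; the learning rate and clip value below are arbitrary) is clipping gradients and lowering the learning rate, e.g.:

from keras.optimizers import RMSprop
from seq2seq.models import Seq2Seq

# Untested idea, arbitrary values: clip gradients and use a smaller learning
# rate so a single bad batch can't push the weights to NaN.
model = Seq2Seq(input_dim=50, input_length=20, hidden_dim=128, output_length=20,
                output_dim=50, depth=2, peek=True, readout_activation='softmax')
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=1e-4, clipnorm=1.0))
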


gibrown commented Jul 7, 2017

@ChristopherLu yes, that is my conclusion. Based on this thread and the common problems across applications, I am guessing that something inherent in how the pieces of the model fit together makes it easy to end up in this condition with some data.

The other possibility is that I have a bug in generating my training data, but I've been looking at that for a while and haven't found it. My next plan (when I get back to this, probably in a few weeks) is to try the TensorFlow seq2seq directly: https://www.tensorflow.org/tutorials/seq2seq

I had tried that method in the past (about a year ago) but was unable to get it to work. I think it has been significantly updated since then, and that tutorial looks improved. Maybe that model could be ported into this lib if it works.

@ChristopherLu

@gibrown Thanks. I'm about to try the TF seq2seq as well.


gibrown commented Jul 11, 2017

@ChristopherLu I've reworked my application to use https://github.com/google/seq2seq and that seems to be working well so far.

@ChristopherLu

@gibrown Thanks for the recommendation, I will give it a try.


cpury commented Aug 7, 2017

This did not make my training converge :(


gibrown commented Aug 9, 2017

Yeah, I don't think this is a real solution to the original problem, so I'm just going to close this PR.

gibrown closed this Aug 9, 2017
Successfully merging this pull request may close these issues.

using categorical_crossentropy get loss result is nan