LSTM node #729

Merged
merged 45 commits into from Jul 24, 2017

@msperber (Collaborator) commented Jul 21, 2017

This attempts to reduce the per-timestep LSTM memory consumption to a minimum by avoiding the creation of intermediate nodes. The memory allocated is now 6*hidden_dim for the forward and backward passes respectively: the hidden state, the cell state, and the result of the 4 matrix multiplies + nonlinearities. Ideally we wouldn't have to save the latter in the backward pass, but due to the complex LSTM dependencies I don't see a good way to accomplish that without hardcoding the complete loop over the sequence, which would make things very inflexible.
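As a back-of-the-envelope check (a hypothetical helper, not part of the PR), the per-timestep count described above can be tallied like this:

```python
def lstm_step_floats(hidden_dim):
    """Count the floats kept alive per timestep (per pass) under this PR:
    hidden state h_t, cell state c_t, and the fused result of the
    four matrix multiplies + nonlinearities (the i, f, o, g gates)."""
    h_t = hidden_dim          # hidden state
    c_t = hidden_dim          # cell state
    gates = 4 * hidden_dim    # i, f, o, g after the nonlinearities
    return h_t + c_t + gates

assert lstm_step_floats(512) == 6 * 512  # 6*hidden_dim, as stated above
```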



The vanilla LSTM, dropout, and weight noise are implemented. Dropout and weight noise require no additional memory. Preliminary experiments indicate a smaller overall memory footprint at a slightly slower speed.



The above is achieved via 3 new DyNet nodes:

gates_t = vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b, dropout_mask_x, dropout_mask_h, weightnoise_std)

c_t = vanilla_lstm_c(c_tm1, gates_t)

h_t = vanilla_lstm_h(c_t, gates_t)
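For reference, a minimal pure-Python sketch of what these three fused nodes compute (the i, f, o, g gate ordering is an assumption, and the dropout-mask / weight-noise arguments are omitted; the actual nodes operate on DyNet tensors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b):
    """One fused node: affine transform of the input and the previous hidden
    state, followed by the gate nonlinearities. Returns the packed
    [i, f, o, g] vector of length 4*hidden_dim (gate order assumed)."""
    hidden_dim = len(b) // 4
    pre = [b[j]
           + sum(Wx[j][k] * x_t[k] for k in range(len(x_t)))
           + sum(Wh[j][k] * h_tm1[k] for k in range(hidden_dim))
           for j in range(4 * hidden_dim)]
    # sigmoid for the i, f, o gates; tanh for the candidate g
    return ([sigmoid(v) for v in pre[:3 * hidden_dim]]
            + [math.tanh(v) for v in pre[3 * hidden_dim:]])

def vanilla_lstm_c(c_tm1, gates):
    """c_t = f * c_tm1 + i * g (elementwise)."""
    d = len(c_tm1)
    i, f, g = gates[:d], gates[d:2 * d], gates[3 * d:]
    return [f[j] * c_tm1[j] + i[j] * g[j] for j in range(d)]

def vanilla_lstm_h(c_t, gates):
    """h_t = o * tanh(c_t) (elementwise)."""
    d = len(c_t)
    o = gates[2 * d:3 * d]
    return [o[j] * math.tanh(c_t[j]) for j in range(d)]
```

With all weights and biases zero, the i, f, o gates come out as sigmoid(0) = 0.5 and the candidate as tanh(0) = 0, so c_t is simply half of c_tm1.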

@jcyk (Contributor) commented Jul 22, 2017

Great! Excited about it!
@msperber What is the memory allocation in the current version?

@msperber (Collaborator) commented Jul 22, 2017

If I counted correctly, it should currently be at least 18*hidden_dim, and 20*hidden_dim if using dropout.

// non-linearities
fx.tbvec().slice(indices_i, sizes_3).device(*dev.edevice) = fx.tbvec().slice(indices_i, sizes_3).unaryExpr(scalar_logistic_sigmoid_op<float>());

@pmichel31415 (Collaborator) commented Jul 22, 2017

Very minor comment: I think you can do .sigmoid() here

@pmichel31415 (Collaborator) commented Jul 22, 2017

Thanks Matthias!

I have a quick question: any guess as to why it is slower? Is this fixable?

@msperber (Collaborator) commented Jul 22, 2017

Hmm, I don't see a particular reason why it should be slower, so I assume it is just a matter of doing some more profiling / code-level optimization. If anything, we could probably make it slightly faster, because we have more opportunity for batching things together, such as the 3 sigmoids.

@msperber (Collaborator) commented Jul 22, 2017

Actually, now that you made me think about it, Paul: a likely reason is that the i, f, o, g parts are no longer separated in memory, so the pointwise operations stride over the memory. If that slows things down, it should be easy to fix by copying them to separate regions of scratch memory (the same as the currently implemented LSTM does, except that we release the memory afterwards).
In any case, I guess we can save this for a future commit, as I won't have time to look into this for the next couple of weeks, and the speed differences are not very big anyway.
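To illustrate the layout issue described above (a hypothetical sketch, not the DyNet code): with i, f, o, g packed into one 4*hidden_dim buffer, each gate is a sub-range of the shared buffer, and the suggested fix amounts to copying each sub-range to its own contiguous scratch region before the pointwise ops:

```python
def gate_offsets(hidden_dim):
    """In the packed layout, gate j occupies [j*hidden_dim, (j+1)*hidden_dim)
    inside the single 4*hidden_dim buffer, so pointwise ops on one gate
    index into the shared buffer rather than a tensor of its own."""
    return [(j * hidden_dim, (j + 1) * hidden_dim) for j in range(4)]

def copy_gates_to_scratch(gates, hidden_dim):
    """The suggested fix: copy each gate into its own contiguous scratch
    region before the pointwise ops, and release the scratch afterwards."""
    return [gates[lo:hi] for (lo, hi) in gate_offsets(hidden_dim)]

# packed [i, f, o, g] buffer with hidden_dim = 2
i, f, o, g = copy_gates_to_scratch(list(range(8)), 2)
```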

@neubig (Contributor) commented Jul 24, 2017

Thanks, this is great! I'm going to merge this for now, and we can take a look at the remaining speed issues in the future.

@neubig neubig merged commit c2fefdf into clab:master Jul 24, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

shuheik added a commit to shuheik/dynet that referenced this pull request Sep 14, 2017

neubig added a commit that referenced this pull request Sep 14, 2017

RNN-related enhancements to Scala bindings (#895)
* fixed file paths

* Apply changes in #547 and #729 to Scala bindings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment