
The LayerNorm implementation #30

Closed
egg-west opened this issue Oct 23, 2018 · 4 comments
Labels
invalid (This doesn't seem right), question (Further information is requested)

Comments

@egg-west

I am wondering why you don't use the standard nn version of LayerNorm?
I notice the difference is in the denominator: nn.LayerNorm uses sqrt(variance + epsilon) rather than (standard deviation + epsilon).

Could you clarify the difference between these two approaches?
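
To make the two denominators concrete, here they are in my own notation (not copied from either codebase):

$$
y_{\text{nn.LayerNorm}} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
\qquad\text{vs.}\qquad
y_{\text{repo}} = \frac{x - \mu}{\sigma + \epsilon}
$$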

@codertimo
Owner

codertimo commented Oct 23, 2018

@egg-west Well, the reason I used this layer norm is that the Annotated Transformer implementation of "Attention Is All You Need" used this code, and I just copied it from there. So if anyone can answer this question, that would be seriously awesome.
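
For reference, this is roughly the LayerNorm from the Annotated Transformer that I copied (paraphrased from memory, so names may differ slightly from the current notebook):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization as in the Annotated Transformer: epsilon is added to the std."""
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # gain
        self.b_2 = nn.Parameter(torch.zeros(features))  # bias
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```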

@codertimo codertimo added the invalid (This doesn't seem right) and question (Further information is requested) labels Oct 23, 2018
@briandw

briandw commented Oct 29, 2018

I believe they should do similar things; however, there is a difference in implementation.

For a given input:
x = torch.tensor([1.,0.,0.,0.])
The Annotated Transformer version gives the output:
tensor([ 1.5000, -0.5000, -0.5000, -0.5000], grad_fn=<ThAddBackward>)

While torch.nn.LayerNorm gives:
tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=<AddcmulBackward>)

The layer_norm implementation in PyTorch is here:
https://github.com/pytorch/pytorch/blob/cca247635c6edb323176eeac7a18d3e9ab71c558/caffe2/python/helpers/normalization.py
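
A minimal snippet to reproduce both numbers (eps chosen small enough not to affect the comparison):

```python
import torch
import torch.nn as nn

x = torch.tensor([1., 0., 0., 0.])
eps = 1e-6

# Annotated Transformer style: (x - mean) / (std + eps).
# Note that torch.std uses the unbiased (n - 1) estimator by default.
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
print((x - mean) / (std + eps))   # ≈ [ 1.5000, -0.5000, -0.5000, -0.5000]

# torch.nn.LayerNorm style: (x - mean) / sqrt(var + eps), with biased variance.
# elementwise_affine=False so the output is just the normalized values.
layer_norm = nn.LayerNorm(4, eps=eps, elementwise_affine=False)
print(layer_norm(x))              # ≈ [ 1.7321, -0.5774, -0.5774, -0.5774]
```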

@codertimo
Owner

@egg-west Is your question solved? 👍

@egg-west
Author

Thank you for the clarification. I guess pulling the epsilon out of the sqrt may speed up the computation.
But yes, they do the same thing.
