I am wondering why you don't use the standard nn version of LayerNorm?
I notice the difference is in the denominator: nn.LayerNorm uses {sqrt of (variance + epsilon)} rather than {standard deviation + epsilon}.
Could you clarify these two approaches?
@egg-west Well, the reason I used this layer norm is that the Annotated Transformer implementation of "Attention Is All You Need" used this code, and I just copied it from there. So if anyone can answer this question, that would be seriously awesome.
I believe they should do similar things; however, there is a difference in implementation.
For a given input: x = torch.tensor([1.,0.,0.,0.])
The Annotated Transformer version gives the output: tensor([ 1.5000, -0.5000, -0.5000, -0.5000], grad_fn=<ThAddBackward>)
While torch.nn.LayerNorm gives: tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=<AddcmulBackward>)
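A minimal sketch reproducing the comparison above. Note that besides the placement of epsilon, `torch.std` defaults to the unbiased estimator (dividing by n-1), while `nn.LayerNorm` normalizes with the biased variance (dividing by n) — with a tiny epsilon, that estimator difference is what accounts for most of the 1.5 vs 1.7320 gap:

```python
import torch
import torch.nn as nn

x = torch.tensor([1., 0., 0., 0.])
eps = 1e-6

# Annotated Transformer style: (x - mean) / (std + eps)
# torch.std here uses the unbiased estimator (divides by n-1) by default
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
at_out = (x - mean) / (std + eps)

# nn.LayerNorm: (x - mean) / sqrt(var + eps), with biased variance (divides by n);
# default affine parameters are weight=1, bias=0, so they don't change the output
ln = nn.LayerNorm(4, eps=eps)
ln_out = ln(x)

print(at_out)  # ≈ tensor([ 1.5000, -0.5000, -0.5000, -0.5000])
print(ln_out)  # ≈ tensor([ 1.7320, -0.5773, -0.5773, -0.5773])
```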