Not sure if it's necessary to scale input embedding before adding positional encoding #2797
Comments
Thanks for your report.
@pengchengguo, if you have a comment about it, please let us know.
I didn't test the transformer model. I'm just curious about the reason for doing such scaling, and whether it is strongly related to the problem of dropping pieces of words from a complete sentence.
Hi @keshawnhsieh, the scaling method is used in both the transformer and the conformer. I think it may be a training trick, but it is not mentioned in the transformer paper or in some open-source implementations. I am not sure about it, and I am trying to find which paper mentions this trick. BTW, I notice that the missing characters in your examples are always capital letters. Do you see the same in other utterances?
Yes, I also noticed that there is no clue in the original paper or in third-party implementations like Google's tensor2tensor and Facebook's fairseq. As for the capital-letters phenomenon, that is just Kaldi-style formatting of the results. I copied the result from result.wrd.txt; the script used in espnet to evaluate WER automatically renders the differing parts in capital letters. The ground truth is actually all capital letters.
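For reference, the scaling being discussed is the `x * sqrt(d_model)` step applied before adding the sinusoidal positional encoding. A minimal NumPy sketch of this pattern (an illustration of the common tensor2tensor/fairseq-style formulation, not espnet's exact code) looks like:

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Standard sinusoidal positional encoding (sin on even dims, cos on odd)."""
    pos = np.arange(length)[:, None]               # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                      # every entry lies in [-1, 1]

def embed(x, d_model):
    """Scale the input embedding by sqrt(d_model), then add the PE."""
    return x * np.sqrt(d_model) + sinusoidal_pe(x.shape[0], d_model)
```

Since every PE entry is a sin/cos value in [-1, 1], the `sqrt(d_model)` factor on `x` is what raises the question of whether the PE gets drowned out.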
I looked back at the commit history of espnet/nets/pytorch_backend/transformer/embedding.py and found that the first version already introduced this scaling trick, by @ShigekiKarita. Any hints about the mechanism behind this scaling would be helpful. Thanks.
I remember I forked OpenNMT's implementation, because it was one of the first transformer implementations in PyTorch.
Thanks to @ShigekiKarita for the additional context. Although I didn't find a very strong reason for this scaling, I guess it may be a trick from when APE was first introduced into the transformer for NMT tasks. I also recently found a paper that really discusses this weakness of APE in the transformer; I attach it here for your information. For now, I am closing this issue. Thank you all again.
Recently I noticed that the conformer model would occasionally lose some words within a sentence, as shown below, especially when the sentence is long.
Then I found that the conformer (and transformer) implemented in espnet scales the input embedding before adding the positional encoding. The scale is calculated at
espnet/espnet/nets/pytorch_backend/transformer/embedding.py
Line 51 in 53f6aa1
espnet/espnet/nets/pytorch_backend/transformer/embedding.py
Line 91 in 53f6aa1
The scale value is the square root of attention_dim, which is quite large (e.g. 512), so even after the square root we still get about 22.6.
So I worried that, after this scaling, the positional encoding would be overwhelmed, and I wondered whether this could be a reason for the problem described above.
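As a quick sanity check on the magnitudes involved (a sketch, assuming attention_dim = 512 as in the issue, and unit-magnitude embedding entries):

```python
import math

# attention_dim = 512, as in the issue; PE entries are sin/cos values in [-1, 1]
attention_dim = 512
scale = math.sqrt(attention_dim)
print(round(scale, 2))        # 22.63

# A unit-magnitude embedding entry becomes 'scale' after scaling, so the
# positional encoding contributes at most 1/scale of that magnitude.
print(round(1.0 / scale, 3))  # 0.044
```

Under this assumption, the PE amounts to at most a few percent of the scaled embedding's per-entry magnitude, which is the concern raised here.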