
Not sure if it's necessary to scale input embedding before adding positional encoding #2797

Closed
keshawnhsieh opened this issue Dec 22, 2020 · 8 comments
Labels: ASR (Automatic speech recognition), Question

Comments

@keshawnhsieh commented Dec 22, 2020

Recently I noticed that the conformer model occasionally drops some words within a sentence, as shown below, especially when the sentence is long.

REF: what do you have on fridays I HAVE A PE CLASS do you often play SPORTS NO I DON'T I DON'T LIKE sports do you often read books on the weekend no but i often sleep
HYP: what do you have on fridays * **** * ** ***** do you often play ****** ** * ***** * ***** **** sports do you often read books on the weekend no but i often sleep
Eval: D D D D D D D D D D D D

Then I found that the conformer (and the transformer) implemented in espnet scales the input embedding before adding the positional encoding. The scale factor is computed as

self.xscale = math.sqrt(self.d_model)

x = x * self.xscale + self.pe[:, : x.size(1)]

The scale value is tied directly to attention_dim, which is quite large (e.g. 512), so even after the square root we get about 22.6.
After this scaling I worry that the positional encoding will be overwhelmed, and I wonder whether this could be a reason for the problem described above.
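To make the magnitude gap concrete, here is a toy sketch (a random unit-scale embedding, not the actual ESPnet code) comparing the norms for d_model = 512:

```python
# Toy sketch (not ESPnet code): compare the norm of a scaled embedding
# against the norm of the sinusoidal positional encoding for d_model = 512.
import math
import torch

d_model = 512
xscale = math.sqrt(d_model)  # ~22.6

# Sinusoidal positional encoding for a single position, as in "Attention Is All You Need".
pos = 10
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(d_model)
pe[0::2] = torch.sin(pos * div_term)
pe[1::2] = torch.cos(pos * div_term)

# A toy frame embedding with roughly unit-scale entries.
x = torch.randn(d_model)

print(f"||x * xscale|| = {torch.norm(x * xscale).item():.1f}")  # roughly xscale * sqrt(d_model), ~512
print(f"||pe||         = {torch.norm(pe).item():.1f}")          # roughly sqrt(d_model / 2), ~16
```

In this toy example the scaled embedding norm is on the order of 500 while the positional encoding norm is around 16; without the scaling the two would be comparable. This is what makes me worry the positional information gets drowned out.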

@sw005320 added the ASR (Automatic speech recognition) and Question labels on Dec 22, 2020
@sw005320 (Contributor)

Thanks for your report.
Do you mean it did not happen in the transformer model?
Do you have a recognition example of the same utterance with the transformer model?

@sw005320 (Contributor)

@pengchengguo, if you have a comment about it, please let us know.

@keshawnhsieh (Author)

> Thanks for your report.
> Do you mean it did not happen in the transformer model?
> Do you have a recognition example of the same utterance with the transformer model?

I didn't test the transformer model. I'm just curious about the reason for doing such scaling, and whether it is strongly related to the problem of words being dropped from a complete sentence.

@pengchengguo (Collaborator)

Hi @keshawnhsieh,

The scaling is used in both the transformer and the conformer. I think it may be a training trick, but it is not mentioned in the transformer paper or in some open-source implementations, so I am not sure about it. I am trying to find which paper mentions this trick.

BTW, I noticed that the missing characters in your example are always capital letters. Do you see the same in other utterances?

@keshawnhsieh (Author)

> Hi @keshawnhsieh,
>
> The scaling is used in both the transformer and the conformer. I think it may be a training trick, but it is not mentioned in the transformer paper or in some open-source implementations, so I am not sure about it. I am trying to find which paper mentions this trick.
>
> BTW, I noticed that the missing characters in your example are always capital letters. Do you see the same in other utterances?

Yes, I also noticed that there is no clue in the original paper or in third-party implementations such as Google's tensor2tensor and Facebook's fairseq.

As for the capital letters, that is just Kaldi-style formatting for showing results. I copied the result from result.wrd.txt. The script espnet uses to evaluate WER automatically capitalizes the differing parts; the ground truth is actually all capital letters.

@keshawnhsieh (Author)

I looked back at the commit history of /espnet/nets/pytorch_backend/transformer/embedding.py and found that the first version, by @ShigekiKarita, already introduced this scaling trick. Any hints about the purpose of this scaling would be helpful. Thanks.

@ShigekiKarita (Member) commented Dec 23, 2020

I remember I forked OpenNMT's implementation because it was one of the first transformer implementations in PyTorch:
https://github.com/OpenNMT/OpenNMT-py/blob/f46a4da48efd3df5f7abd7ef7bc2b8860a6f7025/onmt/modules/embeddings.py#L54
I think it is for weight sharing between the embedding and the pre-softmax projection (link). Tensor2tensor (I think it is the official implementation) also performs this scaling for the sharing:
tensorflow/tensor2tensor#1718
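For what it's worth, here is a rough sketch of that weight-sharing argument (a hypothetical module, not the ESPnet or OpenNMT code): when the embedding matrix is reused as the pre-softmax projection, the embedding output is multiplied by sqrt(d_model), as described in Section 3.4 of "Attention Is All You Need".

```python
# Hypothetical sketch (not ESPnet/OpenNMT code): tie the input embedding and the
# pre-softmax projection, and scale the embedding output by sqrt(d_model).
import math
import torch
import torch.nn as nn


class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight  # share one weight matrix

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        # One common rationale: scaling keeps the embedding magnitude comparable
        # to the hidden states that the shared matrix later maps to vocabulary logits.
        return self.embed(tokens) * math.sqrt(self.d_model)

    def decode(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


tied = TiedEmbedding(vocab_size=1000, d_model=512)
tokens = torch.randint(0, 1000, (2, 7))    # (batch, time)
logits = tied.decode(tied.encode(tokens))  # (2, 7, 1000)
```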

@keshawnhsieh (Author)

Thanks to @ShigekiKarita for the additional details. Although I didn't find a very strong reason for this scaling, I guess it is a kind of trick from when absolute positional encoding (APE) was first introduced in the transformer for NMT tasks.

Also, I recently found a paper that discusses exactly this weakness of APE in the transformer. I attach it here for your information.

For now, I am closing this issue. Thank you all again.
