
Not sure if it's necessary to scale input embedding before adding positional encoding #2797

Closed
keshawnhsieh opened this issue Dec 22, 2020 · 8 comments
Labels: ASR (Automatic speech recognition), Question

Comments

@keshawnhsieh commented Dec 22, 2020

Recently I noticed that the conformer model occasionally drops some words within a sentence, as shown below, especially when the sentence is long.

REF: what do you have on fridays I HAVE A PE CLASS do you often play SPORTS NO I DON'T I DON'T LIKE sports do you often read books on the weekend no but i often sleep
HYP: what do you have on fridays * **** * ** ***** do you often play ****** ** * ***** * ***** **** sports do you often read books on the weekend no but i often sleep
Eval: D D D D D D D D D D D D

Then I found that the conformer (and the transformer) implemented in espnet scales the input embedding before adding the positional encoding. The scale factor is computed as

self.xscale = math.sqrt(self.d_model)

x = x * self.xscale + self.pe[:, : x.size(1)]

The scale value is tied directly to attention_dim, which is quite large (e.g. 512), so even after the square root we get about 22.6.
After this scaling I worry that the positional encoding will be overwhelmed, and I wonder whether this could be a reason for the problem described above.
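To make the magnitude gap concrete, here is a toy sketch (a random unit-scale embedding, not the actual ESPnet code) comparing the norms for d_model = 512:

```python
# Toy sketch (not ESPnet code): compare the norm of a scaled embedding
# against the norm of the sinusoidal positional encoding for d_model = 512.
import math
import torch

d_model = 512
xscale = math.sqrt(d_model)  # ~22.6

# Sinusoidal positional encoding for a single position, as in "Attention Is All You Need".
pos = 10
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(d_model)
pe[0::2] = torch.sin(pos * div_term)
pe[1::2] = torch.cos(pos * div_term)

# A toy frame embedding with roughly unit-scale entries.
x = torch.randn(d_model)

print(f"||x * xscale|| = {torch.norm(x * xscale).item():.1f}")  # roughly xscale * sqrt(d_model), ~512
print(f"||pe||         = {torch.norm(pe).item():.1f}")          # roughly sqrt(d_model / 2), ~16
```

In this toy example the scaled embedding norm is on the order of 500 while the positional encoding norm is around 16; without the scaling the two would be comparable. This is what makes me worry the positional information gets drowned out.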

@sw005320 added the ASR (Automatic speech recognition) and Question labels on Dec 22, 2020
@sw005320 (Contributor)

Thanks for your report.
Do you mean it did not happen in the transformer model?
Do you have a recognition example of the same utterance with the transformer model?

@sw005320 (Contributor)

@pengchengguo, if you have a comment about it, please let us know.

@keshawnhsieh (Author)

> Thanks for your report.
> Do you mean it did not happen in the transformer model?
> Do you have a recognition example of the same utterance with the transformer model?

I didn't test the transformer model. I'm just curious about the reason for doing such scaling, and whether it is strongly related to the problem of words being dropped from a complete sentence.

@pengchengguo (Collaborator)

Hi @keshawnhsieh,

The scaling is used in both the transformer and the conformer. I think it may be a training trick, but it is not mentioned in the transformer paper or in some open-source implementations, so I am not sure about it. I am trying to find which paper mentions this trick.

BTW, I noticed that the missing characters in your example are always capital letters. Do you see the same in other utterances?

@keshawnhsieh (Author)

> Hi @keshawnhsieh,
>
> The scaling is used in both the transformer and the conformer. I think it may be a training trick, but it is not mentioned in the transformer paper or in some open-source implementations, so I am not sure about it. I am trying to find which paper mentions this trick.
>
> BTW, I noticed that the missing characters in your example are always capital letters. Do you see the same in other utterances?

Yes, I also noticed that there is no clue in the original paper or in third-party implementations such as Google's tensor2tensor and Facebook's fairseq.

As for the capital letters, that is just Kaldi-style formatting for showing results. I copied the result from result.wrd.txt. The script espnet uses to evaluate WER automatically capitalizes the differing parts; the ground truth is actually all capital letters.

@keshawnhsieh (Author)

I looked back at the commit history of /espnet/nets/pytorch_backend/transformer/embedding.py and found that the first version, by @ShigekiKarita, already introduced this scaling trick. Any hints about the purpose of this scaling would be helpful. Thanks.

@ShigekiKarita (Member) commented Dec 23, 2020

I remember I forked OpenNMT's implementation because it was one of the first transformer implementations in PyTorch:
https://github.com/OpenNMT/OpenNMT-py/blob/f46a4da48efd3df5f7abd7ef7bc2b8860a6f7025/onmt/modules/embeddings.py#L54
I think it is for weight sharing between the embedding and the pre-softmax projection (link). Tensor2tensor (I think it is the official implementation) also performs this scaling for the sharing:
tensorflow/tensor2tensor#1718
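For what it's worth, here is a rough sketch of that weight-sharing argument (a hypothetical module, not the ESPnet or OpenNMT code): when the embedding matrix is reused as the pre-softmax projection, the embedding output is multiplied by sqrt(d_model), as described in Section 3.4 of "Attention Is All You Need".

```python
# Hypothetical sketch (not ESPnet/OpenNMT code): tie the input embedding and the
# pre-softmax projection, and scale the embedding output by sqrt(d_model).
import math
import torch
import torch.nn as nn


class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight  # share one weight matrix

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        # One common rationale: scaling keeps the embedding magnitude comparable
        # to the hidden states that the shared matrix later maps to vocabulary logits.
        return self.embed(tokens) * math.sqrt(self.d_model)

    def decode(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


tied = TiedEmbedding(vocab_size=1000, d_model=512)
tokens = torch.randint(0, 1000, (2, 7))    # (batch, time)
logits = tied.decode(tied.encode(tokens))  # (2, 7, 1000)
```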

@keshawnhsieh (Author)

Thanks to @ShigekiKarita for the additional details. Although I didn't find a very strong reason for this scaling, I guess it is a kind of trick from when absolute positional encoding (APE) was first introduced in the transformer for NMT tasks.

Also, I recently found a paper that discusses exactly this weakness of APE in the transformer. I attach it here for your information.

For now, I am closing this issue. Thank you all again.
