
Is the model a transformer encoder-decoder or just decoder? #7

Closed
ehsan-soe opened this issue Sep 25, 2019 · 8 comments

@ehsan-soe

ehsan-soe commented Sep 25, 2019

Hi,

Thank you for the very nice and interesting work.
I have a question regarding the model. The paper mentions that you used the same architecture as GPT, which is a transformer decoder. However, you also talk about an input encoder and how you configure the input to the encoder.
I am a bit confused: do you have both an encoder and a decoder, or is it a language model, in which case it would be just a decoder? And if it is just a decoder, how do you encode [X_s, X_r]?
Could you please clarify this? Many thanks.

@atcbosselut
Owner

It's a transformer language model. The confusion stems from the fact that the transformer language model is a mix of the original transformer encoder and decoder from the original transformer paper: https://arxiv.org/abs/1706.03762

It has the same architecture as the transformer encoder since there's no cross-attention, but it also masks future tokens as a transformer decoder would do. I used the term "encoder" since the architecture of each block of the transformer language model is the same as the transformer encoder.
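The "encoder block plus future-token masking" idea can be shown with a minimal NumPy sketch; the function names here are illustrative, not from the repo:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to
    positions j <= i, which is exactly the decoder-style masking
    applied inside an otherwise encoder-shaped block."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(scores, mask):
    """Set scores for masked-out (future) positions to -inf so they
    receive zero weight after the softmax."""
    return np.where(mask, scores, -np.inf)
```

With this mask, the softmax over each row assigns zero probability to future tokens, so each position's representation depends only on tokens up to and including itself.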

The input during training is [X_s, X_r, X_o] where future tokens are masked when computing the self-attention of a particular token.

During generation, the input is [X_s, X_r, Y_o] where Y_o is the set of generated tokens of the phrase object up to that point.
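The generation procedure described above (append each predicted object token to the context and re-run the model) can be sketched as a greedy decoding loop. This is a simplified illustration, not the repo's actual decoding code; `model`, `max_len`, and `end_token` are assumed names:

```python
def generate(model, x_s, x_r, max_len, end_token):
    """Greedy decoding sketch: feed [X_s, X_r, Y_o] to the language
    model and append the argmax next token to Y_o until the end token
    (or the length limit) is reached."""
    y_o = []
    for _ in range(max_len):
        logits = model(x_s + x_r + y_o)   # scores for the next token
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == end_token:
            break
        y_o.append(next_token)
    return y_o
```

In practice one would also use sampling or beam search rather than pure argmax, but the input layout [X_s, X_r, Y_o] is the same.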

@ehsan-soe
Author

@atcbosselut Thanks
Can you correct me if I am wrong? During training you take [X_s; X_r] as the seed text and start predicting from the end of X_r? That is, you compute self-attention for all tokens up to the current time step, but only include X_o in the loss function, right?

@atcbosselut
Owner

In theory, this is what happens. In practice, it's all computed in parallel: a loss value is computed for every token in [X_s, X_r, X_o] (except the first token of X_s), and the loss values for the tokens of X_s and X_r are masked (i.e., multiplied by 0) before backpropagation, effectively zeroing the gradients of those intermediate losses.
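A tiny NumPy sketch of this loss masking; the position split (first 4 tokens = X_s and X_r) and the per-token loss values are made up for illustration:

```python
import numpy as np

# Per-token losses for one sequence [X_s, X_r, X_o]; suppose X_s and
# X_r together span the first 4 positions, X_o the last 3.
token_losses = np.array([[0.9, 1.1, 0.7, 0.5, 2.0, 1.5, 0.8]])
loss_mask    = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])

# Multiplying by 0 drops the X_s and X_r losses; only X_o contributes.
final_loss = (token_losses * loss_mask).sum(1)
```

Only the three X_o positions survive the mask, so the summed loss (and hence the gradient) ignores the conditioning tokens entirely.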

@ehsan-soe
Author

@atcbosselut Can you kindly point me to where you mask out the loss computed for the tokens in X_s and X_r in your code? Many thanks.

@atcbosselut
Owner

final_loss = (loss * loss_mask).sum(1)

@guotong1988

Same question. Thank you.

@atcbosselut
Owner

Is your question answered by the discussion above?

@guotong1988

Yes. Thank you.
