
Is the model a transformer encoder-decoder or just decoder? #7

Closed
ehsan-soe opened this issue Sep 25, 2019 · 8 comments

@ehsan-soe

ehsan-soe commented Sep 25, 2019

Hi,

Thank you for the very nice and interesting work.
I have a question regarding the model. The paper mentions that you used the same architecture as GPT, which is a transformer decoder. However, you also talk about an input encoder and how you configure the input to the encoder.
I am a bit confused: do you have both an encoder and a decoder, or is it a language model, in which case it would be just a decoder? And if it is just a decoder, how do you encode [X_s, X_r]?
Could you please clarify this? Many thanks.

@atcbosselut
Owner

It's a transformer language model. The confusion stems from the fact that the transformer language model is a mix of the original transformer encoder and decoder from the original transformer paper: https://arxiv.org/abs/1706.03762

It has the same architecture as the transformer encoder since there's no cross-attention, but it also masks future tokens as a transformer decoder would do. I used the term "encoder" since the architecture of each block of the transformer language model is the same as the transformer encoder.
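The "encoder block plus future-token masking" idea can be shown with a minimal NumPy sketch; the function names here are illustrative, not from the repo:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to
    positions j <= i, which is exactly the decoder-style masking
    applied inside an otherwise encoder-shaped block."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(scores, mask):
    """Set scores for masked-out (future) positions to -inf so they
    receive zero weight after the softmax."""
    return np.where(mask, scores, -np.inf)
```

With this mask, the softmax over each row assigns zero probability to future tokens, so each position's representation depends only on tokens up to and including itself.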

The input during training is [X_s, X_r, X_o] where future tokens are masked when computing the self-attention of a particular token.

During generation, the input is [X_s, X_r, Y_o] where Y_o is the set of generated tokens of the phrase object up to that point.
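The generation procedure described above (append each predicted object token to the context and re-run the model) can be sketched as a greedy decoding loop. This is a simplified illustration, not the repo's actual decoding code; `model`, `max_len`, and `end_token` are assumed names:

```python
def generate(model, x_s, x_r, max_len, end_token):
    """Greedy decoding sketch: feed [X_s, X_r, Y_o] to the language
    model and append the argmax next token to Y_o until the end token
    (or the length limit) is reached."""
    y_o = []
    for _ in range(max_len):
        logits = model(x_s + x_r + y_o)   # scores for the next token
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == end_token:
            break
        y_o.append(next_token)
    return y_o
```

In practice one would also use sampling or beam search rather than pure argmax, but the input layout [X_s, X_r, Y_o] is the same.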

@ehsan-soe
Author

@atcbosselut Thanks
Can you correct me if I am wrong? During training you take [X_s; X_r] as the seed text and start predicting from the end of X_r? That is, you compute self-attention for all tokens up to the current time step, but only include X_o in the loss function, right?

@atcbosselut
Owner

In theory, this is what happens. In practice, it's all computed in parallel: a loss value is computed for every token in [X_s, X_r, X_o] (except the first token of X_s), and the loss values for the tokens of X_s and X_r are masked (i.e., multiplied by 0) before backpropagation, effectively zeroing the gradients of those intermediate losses.
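A tiny NumPy sketch of this loss masking; the position split (first 4 tokens = X_s and X_r) and the per-token loss values are made up for illustration:

```python
import numpy as np

# Per-token losses for one sequence [X_s, X_r, X_o]; suppose X_s and
# X_r together span the first 4 positions, X_o the last 3.
token_losses = np.array([[0.9, 1.1, 0.7, 0.5, 2.0, 1.5, 0.8]])
loss_mask    = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])

# Multiplying by 0 drops the X_s and X_r losses; only X_o contributes.
final_loss = (token_losses * loss_mask).sum(1)
```

Only the three X_o positions survive the mask, so the summed loss (and hence the gradient) ignores the conditioning tokens entirely.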

@ehsan-soe
Author

@atcbosselut Can you kindly point me to where you mask out the loss computed for the tokens in X_s and X_r in your code? Many thanks.

@atcbosselut
Owner

final_loss = (loss * loss_mask).sum(1)

@guotong1988

Same question. Thank you.

@atcbosselut
Owner

Is your question answered by the discussion above?

@guotong1988

Yes. Thank you.
