Is the model a transformer encoder-decoder or just decoder? #7
Comments
It's a transformer language model. The confusion stems from the fact that the transformer language model mixes the encoder and decoder of the original transformer paper (https://arxiv.org/abs/1706.03762): it has the same architecture as the transformer encoder, since there's no cross-attention, but it also masks future tokens as a transformer decoder would. I used the term "encoder" because the architecture of each block of the transformer language model is the same as that of a transformer encoder block.

The input during training is [X_s, X_r, X_o], where future tokens are masked when computing the self-attention of a particular token. During generation, the input is [X_s, X_r, Y_o], where Y_o is the set of tokens of the object phrase generated up to that point.
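For concreteness, here's a minimal single-head sketch of that masked self-attention in PyTorch. This is an illustration of the idea, not the repo's code; `x`, `w_q`, `w_k`, and `w_v` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with future tokens masked.

    x:            (seq_len, d_model), e.g. the embedded [X_s, X_r, X_o] sequence
    w_q/w_k/w_v:  (d_model, d_model) projection matrices (hypothetical names)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.size(-1) ** 0.5)          # (seq_len, seq_len)

    # Causal mask: token i may only attend to tokens j <= i. This is the
    # single "decoder" ingredient; there is no cross-attention, so each
    # block otherwise looks exactly like a transformer encoder block.
    seq_len = x.size(0)
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))

    return F.softmax(scores, dim=-1) @ v
```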
@atcbosselut Thanks
In theory, this is what happens. In practice, everything is computed in parallel, so a loss value is computed for every token in [X_s, X_r, X_o] (except the first token of X_s), and you simply mask the loss values computed for the tokens of X_s and X_r (i.e., multiply them by 0) before performing backpropagation, effectively zeroing the gradients from those intermediate losses.
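A minimal sketch of that loss masking in PyTorch (an illustration of the idea, not the actual `batch.py` code; `prefix_len` and the other names are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, prefix_len):
    """Cross-entropy over the whole sequence, then zero out the loss on
    the X_s / X_r positions so only X_o tokens contribute gradients.

    logits:     (seq_len, vocab_size) next-token predictions
    targets:    (seq_len,) gold next-token ids (inputs shifted by one)
    prefix_len: number of target positions covering X_s and X_r
                (the name is an assumption, not the repo's variable)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (seq_len,)

    # 0 for prefix positions, 1 for object-phrase positions; multiplying
    # by 0 effectively zeroes the gradients of those intermediate losses.
    keep = (torch.arange(targets.size(0)) >= prefix_len).float()
    per_token = per_token * keep

    return per_token.sum() / keep.sum()  # average over X_o tokens only
```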
@atcbosselut Can you kindly point me to where you mask out the loss computed for tokens in X_s and X_r in your code? Many thanks
See comet-commonsense/src/train/batch.py, line 46 (commit 0a8a94b).
Same question. Thank you.
Is your question answered by the discussion above? |
Yes. Thank you. |
Hi,
Thank you for the very nice and interesting work.
I have a question regarding the model. In the paper it is mentioned that you used the same architecture as GPT, which is a transformer decoder. However, you also talk about the input encoder and how you configure the input to the encoder.
I was a bit confused: do you have both an encoder and a decoder, or is it a language model, in which case it would be just a decoder? And if it is a decoder, how do you encode [X_s, X_r]?
Could you please clarify this? Many thanks