It seems that you're trying to decode auto-regressively using BERT representations as a drop-in replacement for word embeddings. But BERT is bidirectional: the representation at token i contains information about all tokens j > i, so the model already knows what it needs to predict before it predicts it.
For this to be correct, you need to mask attention to all tokens j > i, which I don't think you do currently.
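For reference, here is a minimal sketch of such a causal (upper-triangular) attention mask in plain PyTorch; the function name is illustrative, and exactly where to apply the mask depends on how the model code computes attention:

```python
import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # True entries mark positions that must NOT be attended to:
    # position i may only attend to positions j <= i.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Example: a mask like this can be passed as `attn_mask` to
# torch.nn.MultiheadAttention so that each step cannot see future tokens.
mask = causal_attention_mask(4)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```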