Bug in self-attention? #47
Hi Andreas,
your observation is correct. It's not exactly the standard attention mechanism. I haven't thoroughly compared the two, but the current code was written this way on purpose. The reason is that we have to manipulate features of size (bs, n, n, de) anyway, so using vector attention scores instead of scalar ones does not create a strong memory bottleneck.
It would be interesting to investigate this further, though.
… On 31 May 2023, at 17:33, AndreasBergmeister ***@***.***> wrote:
Hi Clement, in file src/models/transformer_model.py line 159, you intend to compute the unnormalized attention scores, i.e. the dot product of the query and key vectors. However, in the code, just the query and key vectors are multiplied, without summing over the feature dimension. This effectively computes a separate attention score for each feature dimension.
On line 184 you comment that the shape of attn is 'bs, n, n, n_head', although it actually is 'bs, n, n, n_head, df', which can be seen on line 191, where attn is multiplied with a vector of shape '(bs, 1, n, n_head, df)'.
I couldn't find any comments on this in the paper, so I'm wondering if this is on purpose or a bug.
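The distinction discussed above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical shapes (bs, n, n_head, df), not the repository's actual code: standard attention sums the query–key product over the feature dimension to get one scalar score per head, while the element-wise product keeps a separate score per feature.

```python
import numpy as np

# Hypothetical shapes mirroring the discussion: batch bs, nodes n,
# heads n_head, per-head feature dim df.
bs, n, n_head, df = 2, 4, 3, 5
rng = np.random.default_rng(0)
Q = rng.standard_normal((bs, n, n_head, df))
K = rng.standard_normal((bs, n, n_head, df))

# Standard scalar attention scores: dot product over the feature
# dimension, giving shape (bs, n, n, n_head).
scalar_attn = np.einsum('bihd,bjhd->bijh', Q, K)

# Element-wise product without the sum, as described in the issue:
# one score per feature dimension, shape (bs, n, n, n_head, df).
vector_attn = Q[:, :, None] * K[:, None, :]

# Summing the vector scores over df recovers the scalar scores.
assert np.allclose(vector_attn.sum(-1), scalar_attn)
```

Broadcasting against a value tensor of shape (bs, 1, n, n_head, df), as on line 191, only works with the vector form, which is consistent with the code being intentional.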
Alright, many thanks for the clarification and quick response!