
Scale problem #5

Closed · lucastononrodrigues opened this issue Jun 18, 2021 · 3 comments

Comments

@lucastononrodrigues commented Jun 18, 2021
Hey, I'm a little confused about the scale.

Inside SineSPE() you already handle the scale (dividing by both d^0.25 and num_realizations^0.25).
On the other hand, in the PyTorch example, after applying the filter you divide by sqrt(num_realizations) again. Why is that?
https://github.com/aliutkus/spe/blob/main/src/pytorch/examples/test_spe.ipynb

@cifkao (Collaborator) commented Jun 19, 2021

The division by (num_realizations * keys_dim) ** 0.25 corresponds to eq. (16)–(17) in the paper, then the division by sqrt(num_realizations) is what happens normally in scaled dot-product attention (since num_realizations is the new dimension of the keys). All this leads to num_realizations * sqrt(keys_dim) in the denominator, which is what we want according to eq. (15).
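As a quick sanity check, here is a toy sketch of how the two scalings compose (made-up shapes, not the repo's code):

```python
import torch

R, d, n = 64, 32, 10  # R = num_realizations, d = keys_dim (toy values)
q = torch.randn(n, R)
k = torch.randn(n, R)

# Eq. (16)-(17): SPE divides queries and keys by (R * d) ** 0.25 each.
qhat = q / (R * d) ** 0.25
khat = k / (R * d) ** 0.25

# Scaled dot-product attention then divides by sqrt(R), the new key dimension.
logits = qhat @ khat.T / R ** 0.5

# Combined: (R * d) ** 0.25 applied twice, times R ** 0.5, gives R * sqrt(d)
# in the denominator, i.e. eq. (15).
assert torch.allclose(logits, q @ k.T / (R * d ** 0.5))
```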

@lucastononrodrigues (Author)
I see, thank you for the fast reply!
I didn't realize the remaining sqrt(num_realizations) was needed to complete it.
On the other hand, consider applying it in, for instance, eq. (10):

y ← D^(-1) phi(Q̂) [phi(K̂)^T V]    (10)

Q̂ and K̂ each have a (num_realizations)**0.25 * (D)**0.25 normalization,
therefore Q̂ [K̂]^T would have (num_realizations)**0.5 * (D)**0.5;
dividing it further by (num_realizations)**0.5 then gives the 1/(R*sqrt(D)) from eq. (15).
But how should I normalize when I have kernels such as phi(Q̂) [phi(K̂)]^T?
Should I just divide eq. (10) by sqrt(R), or should I do it before the kernel?
In the Performer, for instance, the normalization takes place before sending the features to the kernel function.
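For concreteness, here is a toy sketch of eq. (10), using an elu+1 stand-in for phi (an arbitrary choice for illustration, not the paper's feature map):

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Stand-in feature map (elu + 1); any positive feature map works here.
    return F.elu(x) + 1

n, R, dv = 10, 64, 32
qhat = torch.randn(n, R)  # SPE-modified queries
khat = torch.randn(n, R)  # SPE-modified keys
v = torch.randn(n, dv)

# Eq. (10): y = D^(-1) phi(Qhat) [phi(Khat)^T V], linear in sequence length.
num = phi(qhat) @ (phi(khat).T @ v)                     # phi(Qhat) [phi(Khat)^T V]
den = phi(qhat) @ phi(khat).sum(dim=0, keepdim=True).T  # D = diag(phi(Qhat) phi(Khat)^T 1)
y = num / den
```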

@cifkao (Collaborator) commented Jun 19, 2021

SPE assumes that the attention mechanism will normalize QK^T by R^(1/2). In the Performer you never actually compute QK^T, so what you do instead is normalize both Q and K by R^(1/4), which is equivalent. This should normally happen automatically when you plug our modified keys/queries into an existing Performer implementation, e.g. here (where softmax_temp=1/sqrt(query_dimensions), so in the end you divide by sqrt(sqrt(query_dimensions))).

So you should check if your implementation of phi already does this normalization or not.
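For example (a sketch only, again with an elu+1 stand-in for phi rather than the Performer's random-feature map):

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Stand-in feature map; the real Performer uses softmax-kernel random
    # features, for which scaling the *inputs* is what sets the softmax
    # temperature (there is no explicit QK^T to divide).
    return F.elu(x) + 1

n, R = 10, 64
qhat = torch.randn(n, R)  # SPE-modified queries
khat = torch.randn(n, R)  # SPE-modified keys

# If phi does not already rescale its input (check for something like a
# softmax_temp argument), divide both sides by R ** 0.25 before phi; this
# plays the role of dividing Qhat Khat^T by sqrt(R).
q_feat = phi(qhat / R ** 0.25)
k_feat = phi(khat / R ** 0.25)
# q_feat and k_feat then go into eq. (10) as before.
```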
