
Scale problem #5

Closed · lucastononrodrigues opened this issue Jun 18, 2021 · 3 comments

Comments

@lucastononrodrigues commented Jun 18, 2021
Hey, I'm a little confused about the scale.

Inside SineSPE() you already handle the scale (dividing by both d^0.25 and num_realizations^0.25).
On the other hand, in the PyTorch example, after applying the filter you divide by sqrt(num_realizations) again. Why is that?
https://github.com/aliutkus/spe/blob/main/src/pytorch/examples/test_spe.ipynb

@cifkao (Collaborator) commented Jun 19, 2021

The division by (num_realizations * keys_dim) ** 0.25 corresponds to eq. (16)–(17) in the paper, then the division by sqrt(num_realizations) is what happens normally in scaled dot-product attention (since num_realizations is the new dimension of the keys). All this leads to num_realizations * sqrt(keys_dim) in the denominator, which is what we want according to eq. (15).
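As a quick sanity check, here is a toy sketch of how the two scalings compose (made-up shapes, not the repo's code):

```python
import torch

R, d, n = 64, 32, 10  # R = num_realizations, d = keys_dim (toy values)
q = torch.randn(n, R)
k = torch.randn(n, R)

# Eq. (16)-(17): SPE divides queries and keys by (R * d) ** 0.25 each.
qhat = q / (R * d) ** 0.25
khat = k / (R * d) ** 0.25

# Scaled dot-product attention then divides by sqrt(R), the new key dimension.
logits = qhat @ khat.T / R ** 0.5

# Combined: (R * d) ** 0.25 applied twice, times R ** 0.5, gives R * sqrt(d)
# in the denominator, i.e. eq. (15).
assert torch.allclose(logits, q @ k.T / (R * d ** 0.5))
```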

@lucastononrodrigues (Author)
I see, thank you for the fast reply!
I didn't realize the remaining sqrt(num_realizations) was needed to complete it.
On the other hand, consider applying it in, for instance, eq. (10):

y ← D^(-1) phi(Q̂) [phi(K̂)^T V]    (10)

Q̂ and K̂ each have a (num_realizations)**0.25 * (D)**0.25 normalization,
therefore Q̂ [K̂]^T would have (num_realizations)**0.5 * (D)**0.5;
dividing it further by (num_realizations)**0.5 then gives the 1/(R*sqrt(D)) from eq. (15).
But how should I normalize when I have kernels such as phi(Q̂) [phi(K̂)]^T?
Should I just divide eq. (10) by sqrt(R), or should I do it before the kernel?
In the Performer, for instance, the normalization takes place before sending the features to the kernel function.
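For concreteness, here is a toy sketch of eq. (10), using an elu+1 stand-in for phi (an arbitrary choice for illustration, not the paper's feature map):

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Stand-in feature map (elu + 1); any positive feature map works here.
    return F.elu(x) + 1

n, R, dv = 10, 64, 32
qhat = torch.randn(n, R)  # SPE-modified queries
khat = torch.randn(n, R)  # SPE-modified keys
v = torch.randn(n, dv)

# Eq. (10): y = D^(-1) phi(Qhat) [phi(Khat)^T V], linear in sequence length.
num = phi(qhat) @ (phi(khat).T @ v)                     # phi(Qhat) [phi(Khat)^T V]
den = phi(qhat) @ phi(khat).sum(dim=0, keepdim=True).T  # D = diag(phi(Qhat) phi(Khat)^T 1)
y = num / den
```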

@cifkao (Collaborator) commented Jun 19, 2021

SPE assumes that the attention mechanism will normalize QK^T by R^(1/2). In the Performer you never actually compute QK^T, so what you do instead is normalize both Q and K by R^(1/4), which is equivalent. This should normally happen automatically when you plug our modified keys/queries into an existing Performer implementation, e.g. here (where softmax_temp=1/sqrt(query_dimensions), so in the end you divide by sqrt(sqrt(query_dimensions))).

So you should check if your implementation of phi already does this normalization or not.
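For example (a sketch only, again with an elu+1 stand-in for phi rather than the Performer's random-feature map):

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Stand-in feature map; the real Performer uses softmax-kernel random
    # features, for which scaling the *inputs* is what sets the softmax
    # temperature (there is no explicit QK^T to divide).
    return F.elu(x) + 1

n, R = 10, 64
qhat = torch.randn(n, R)  # SPE-modified queries
khat = torch.randn(n, R)  # SPE-modified keys

# If phi does not already rescale its input (check for something like a
# softmax_temp argument), divide both sides by R ** 0.25 before phi; this
# plays the role of dividing Qhat Khat^T by sqrt(R).
q_feat = phi(qhat / R ** 0.25)
k_feat = phi(khat / R ** 0.25)
# q_feat and k_feat then go into eq. (10) as before.
```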
