Apply to linear attention and got "Nan" issue #2

Open
Yogurt928 opened this issue Dec 23, 2021 · 2 comments

@Yogurt928

Hi,

I tried applying this method to linear attention ("Transformers are RNNs": http://proceedings.mlr.press/v119/katharopoulos20a/katharopoulos20a.pdf) with the following code:

    # PermuteFormer - apply the position-aware permutation P to q and k
    q = q.gather(-1, self.permutation[:, :, :q.shape[2]].expand_as(q))
    k = k.gather(-1, self.permutation[:, :, :k.shape[2]].expand_as(k))

    # Apply the feature map to the queries and keys
    Q = torch.nn.functional.elu(q) + 1
    K = torch.nn.functional.elu(k) + 1

    # PermuteFormer - scale queries by r^i and keys by r^(-i), per head
    Q *= (self.ratio.unsqueeze(-1) ** torch.arange(Q.shape[2], device=Q.device).unsqueeze(0)).unsqueeze(-1)
    K *= ((1 / self.ratio).unsqueeze(-1) ** torch.arange(K.shape[2], device=K.device).unsqueeze(0)).unsqueeze(-1)

    if mask is not None:
        K.masked_fill_(mask.unsqueeze(1).unsqueeze(-1), 0.0)

    # Compute the KV matrix
    KV = torch.einsum("nhsd,nhsm->nhmd", K, v)

    # Compute the normalizer
    Z = 1/(torch.einsum("nhld,nhd->nlh", Q, K.sum(dim=2))+self.eps)

    # Finally compute and return the new values
    V = torch.einsum("nhld,nhmd,nlh->nlhm", Q, KV, Z)

But I always get NaNs after 1~5 steps.
From my perspective, this may be caused by this step:

    Q *= (self.ratio.unsqueeze(-1) ** torch.arange(Q.shape[2], device=Q.device).unsqueeze(0)).unsqueeze(-1)
    K *= ((1 / self.ratio).unsqueeze(-1) ** torch.arange(K.shape[2], device=K.device).unsqueeze(0)).unsqueeze(-1)

which multiplies Q by a very small number and K by a very big number when the position index is large.
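
For instance, a quick check of the scale involved (scalar values, assuming a ratio of 0.99 and position 4096 purely for illustration):

    import torch

    # ratio ** (-i) grows exponentially with position i
    r, i = torch.tensor(0.99), 4096.0
    print(r ** -i)           # ~7.6e17, still representable in float32
    print((r ** -i).half())  # inf in float16 -> NaN downstream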

Am I integrating it the right way? Or do you have any suggestions?

Thanks.

@cpcp1998
Owner

cpcp1998 commented Jan 1, 2022

Yes, it is caused by very large K.

A trivial solution is to set the ratio to something very close to 1 so that K does not explode as the sequence length grows.
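
A minimal sketch of that choice (hypothetical numbers; the idea is to pick r so that r^(-L) stays bounded at the maximum sequence length L):

    import torch

    num_heads, max_len = 8, 4096
    # with this choice the farthest key is scaled by at most 2x
    ratio = 0.5 ** (1.0 / max_len) * torch.ones(num_heads)  # ~0.99983
    print(ratio[0] ** -max_len)  # = 2.0, safely bounded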

As for the paper Transformers are RNNs, there is a better solution. Instead of multiplying K by large numbers and Q by small numbers, modifying equations (18) and (19) to s_i = r * s_{i-1} + ... and z_i = r * z_{i-1} + ... has the same effect. But AFAIK this requires modifying the CUDA code in the fast-transformers package.
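
A slow, pure-PyTorch sketch of that modified recurrence (shapes and names are assumed here, not taken from fast-transformers; the decay r is folded into the running states, so Q and K are never scaled by r^(-i)):

    import torch

    def decayed_linear_attention(Q, K, V, r, eps=1e-6):
        # Q, K: (n, h, L, d) feature-mapped, permuted queries/keys
        # V:    (n, h, L, m) values
        # r:    (h,) per-head decay ratio in (0, 1]
        n, h, L, d = Q.shape
        m = V.shape[-1]
        s = Q.new_zeros(n, h, d, m)  # s_i = r * s_{i-1} + k_i v_i^T
        z = Q.new_zeros(n, h, d)     # z_i = r * z_{i-1} + k_i
        out = []
        for i in range(L):
            s = (r[None, :, None, None] * s
                 + torch.einsum("nhd,nhm->nhdm", K[:, :, i], V[:, :, i]))
            z = r[None, :, None] * z + K[:, :, i]
            num = torch.einsum("nhd,nhdm->nhm", Q[:, :, i], s)
            den = torch.einsum("nhd,nhd->nh", Q[:, :, i], z) + eps
            out.append(num / den.unsqueeze(-1))
        return torch.stack(out, dim=2)  # (n, h, L, m)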

@Yogurt928
Author

Thanks for replying.
I still have one more question.
In the paper, to ensure that the similarity function depends only on relative positions rather than absolute ones,
[two images: the paper's equations stating the relative-position property of the similarity function]
this property is a restriction on the result of the matrix product of q and k.
But it seems that k and v are multiplied first, and only then q.
So how is the restriction on q x k guaranteed to carry over to q x (k x v)?
Thanks a lot.
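
For reference, expanding the linear-attention output (written here in LaTeX; standard associativity, not specific to this repo) shows where the q-k products end up after the reordering:

    V_i' = \frac{\phi(q_i)^\top \left( \sum_j \phi(k_j)\, v_j^\top \right)}
                {\phi(q_i)^\top \sum_j \phi(k_j)}
         = \frac{\sum_j \left( \phi(q_i)^\top \phi(k_j) \right) v_j^\top}
                {\sum_j \phi(q_i)^\top \phi(k_j)}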
