Problem about cross-attention during training #43

Open

dorianzhang7 opened this issue Aug 10, 2023 · 1 comment

dorianzhang7 commented Aug 10, 2023

Hello @Fantasy-Studio,
I noticed something odd when trying to train the network with the uploaded code.
After training for some iterations, I inspected the parameters of the cross-attention modules in the saved model and found that only the to_v weights had changed; the to_q and to_k weights never changed, no matter how many iterations I trained. I therefore recorded the backpropagated gradients of the cross-attention parameters, shown below:
to_k: [gradient plot for to_k]
to_v: [gradient plot for to_v]
This matches what I have observed. After debugging the code, I found that the CLIP encoder used in the paper extracts only outputs.pooler_output as the condition cond, which has shape 1×1024. After the cross-attention projections, the q vector is 4096×40, while the k and v vectors are each 1×40.
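To make the shapes concrete, here is a minimal sketch (the 320-dim image feature size and the single 40-dim head are assumptions for illustration; the other numbers come from my debugging):

```python
import torch
import torch.nn as nn

# Shapes from the debugging above; the 320-dim image feature size is an
# assumption (a single 40-dim head is used here for simplicity).
x = torch.randn(1, 4096, 320)        # flattened 64x64 image features
cond = torch.randn(1, 1, 1024)       # CLIP outputs.pooler_output, one token

to_q = nn.Linear(320, 40, bias=False)
to_k = nn.Linear(1024, 40, bias=False)
to_v = nn.Linear(1024, 40, bias=False)

q, k, v = to_q(x), to_k(cond), to_v(cond)
print(q.shape, k.shape, v.shape)
# torch.Size([1, 4096, 40]) torch.Size([1, 1, 40]) torch.Size([1, 1, 40])
```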
According to the cross-attention formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
the product of q and kᵀ is a 4096×1 vector. Because the softmax is then taken over a single element, every attention weight becomes exactly 1. The attention mechanism therefore degenerates: the output is simply the v vector broadcast to every query position, independent of k and q.
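A quick numeric check of this degeneracy, using plain tensors rather than the actual model weights:

```python
import torch

q = torch.randn(4096, 40)
k = torch.randn(1, 40)               # a single key
v = torch.randn(1, 40)               # a single value

weights = ((q @ k.T) / 40 ** 0.5).softmax(dim=-1)         # (4096, 1)
print(torch.allclose(weights, torch.ones_like(weights)))  # True: all ones

out = weights @ v                                         # (4096, 40)
print(torch.allclose(out, v.expand(4096, -1)))            # True: output is just v
```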
The above is my analysis and verification of the situation. However, when I compared the cross-attention parameters of sd-v1-4.ckpt with those of the pre-trained Paint-by-Example model uploaded by the author, I found that to_q, to_k, and to_v all differ between the two checkpoints, which confuses me. Have you encountered the same problem? Thank you very much for your reply!

Do other researchers working on related topics run into this when training this part of the code? Thank you for your answers.

dorianzhang7 (Author) commented:

Since the conditioning vector from CLIP is a single 1×1024 class embedding, the attention-map values in the cross-attention network are all equal to 1. So I doubt whether the to_q and to_k parameters can be trained at all. How does the pre-trained model manage it?
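Here is a minimal sketch that checks this directly: with a single key the softmax output is the constant 1, its local derivative p(1 − p) is 0, and so no gradient reaches to_q or to_k (the 320-dim feature size and single-head setup are the same assumptions as above):

```python
import torch
import torch.nn as nn

# With one key, softmax(z) == 1 for any logit z, so its derivative
# p * (1 - p) == 0 and no gradient can reach to_q or to_k.
to_q = nn.Linear(320, 40, bias=False)
to_k = nn.Linear(1024, 40, bias=False)
to_v = nn.Linear(1024, 40, bias=False)

x = torch.randn(1, 4096, 320)
cond = torch.randn(1, 1, 1024)

q, k, v = to_q(x), to_k(cond), to_v(cond)
weights = ((q @ k.transpose(-2, -1)) / 40 ** 0.5).softmax(dim=-1)
(weights @ v).sum().backward()

print(to_q.weight.grad.abs().max())   # tensor(0.) -- no learning signal
print(to_k.weight.grad.abs().max())   # tensor(0.) -- no learning signal
print(to_v.weight.grad.abs().max())   # nonzero   -- only to_v trains
```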
