Problem about cross-attention during training #43

Open

dorianzhang7 opened this issue Aug 10, 2023 · 1 comment

dorianzhang7 commented Aug 10, 2023

Hello @Fantasy-Studio,
I noticed something odd when trying to train the network with the uploaded code.
After training for some iterations, I inspected the parameters of the cross-attention modules in the saved model and found that only the to_v weights had changed; the to_q and to_k weights never changed, no matter how many iterations I trained. I therefore recorded the backpropagated gradients of the cross-attention parameters, shown below:
to_k: [gradient plot for to_k]
to_v: [gradient plot for to_v]
This matches what I have observed. After debugging the code, I found that the CLIP encoder used in the paper extracts only outputs.pooler_output as the condition cond, which has shape 1×1024. After the cross-attention projections, the q vector is 4096×40, while the k and v vectors are each 1×40.
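To make the shapes concrete, here is a minimal sketch (the 320-dim image feature size and the single 40-dim head are assumptions for illustration; the other numbers come from my debugging):

```python
import torch
import torch.nn as nn

# Shapes from the debugging above; the 320-dim image feature size is an
# assumption (a single 40-dim head is used here for simplicity).
x = torch.randn(1, 4096, 320)        # flattened 64x64 image features
cond = torch.randn(1, 1, 1024)       # CLIP outputs.pooler_output, one token

to_q = nn.Linear(320, 40, bias=False)
to_k = nn.Linear(1024, 40, bias=False)
to_v = nn.Linear(1024, 40, bias=False)

q, k, v = to_q(x), to_k(cond), to_v(cond)
print(q.shape, k.shape, v.shape)
# torch.Size([1, 4096, 40]) torch.Size([1, 1, 40]) torch.Size([1, 1, 40])
```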
According to the cross-attention formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
the product of q and kᵀ is a 4096×1 vector. Because the softmax is then taken over a single element, every attention weight becomes exactly 1. The attention mechanism therefore degenerates: the output is simply the v vector broadcast to every query position, independent of k and q.
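A quick numeric check of this degeneracy, using plain tensors rather than the actual model weights:

```python
import torch

q = torch.randn(4096, 40)
k = torch.randn(1, 40)               # a single key
v = torch.randn(1, 40)               # a single value

weights = ((q @ k.T) / 40 ** 0.5).softmax(dim=-1)         # (4096, 1)
print(torch.allclose(weights, torch.ones_like(weights)))  # True: all ones

out = weights @ v                                         # (4096, 40)
print(torch.allclose(out, v.expand(4096, -1)))            # True: output is just v
```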
The above is my analysis and verification of the situation. However, when I compared the cross-attention parameters of sd-v1-4.ckpt with those of the pre-trained Paint-by-Example model uploaded by the author, I found that to_q, to_k, and to_v all differ between the two checkpoints, which confuses me. Have you encountered the same problem? Thank you very much for your reply!

Do other researchers working on related topics run into this when training this part of the code? Thank you for your answers.

dorianzhang7 (Author) commented:

Since the conditioning vector from CLIP is a single 1×1024 class embedding, the attention-map values in the cross-attention network are all equal to 1. So I doubt whether the to_q and to_k parameters can be trained at all. How does the pre-trained model manage it?
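Here is a minimal sketch that checks this directly: with a single key the softmax output is the constant 1, its local derivative p(1 − p) is 0, and so no gradient reaches to_q or to_k (the 320-dim feature size and single-head setup are the same assumptions as above):

```python
import torch
import torch.nn as nn

# With one key, softmax(z) == 1 for any logit z, so its derivative
# p * (1 - p) == 0 and no gradient can reach to_q or to_k.
to_q = nn.Linear(320, 40, bias=False)
to_k = nn.Linear(1024, 40, bias=False)
to_v = nn.Linear(1024, 40, bias=False)

x = torch.randn(1, 4096, 320)
cond = torch.randn(1, 1, 1024)

q, k, v = to_q(x), to_k(cond), to_v(cond)
weights = ((q @ k.transpose(-2, -1)) / 40 ** 0.5).softmax(dim=-1)
(weights @ v).sum().backward()

print(to_q.weight.grad.abs().max())   # tensor(0.) -- no learning signal
print(to_k.weight.grad.abs().max())   # tensor(0.) -- no learning signal
print(to_v.weight.grad.abs().max())   # nonzero   -- only to_v trains
```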
