Unable to understand positional encoding and masks. #25
Comments
Hi,

Question 1

Positional encoding takes an xy coordinate in [0, 1] and converts it into a vector of 256 elements. The encodings for x and y are computed the same way, so for simplicity let's look only at the x part.

In detr/models/position_encoding.py (lines 30 to 31 in 0fb754c) we create an image tensor that is similar in spirit to meshgrid, but that supports images of different sizes (read: masks) within a batch.

This gives us a grid of xy coordinates, which we then normalize so that they lie between 0 and 1 (in this case we also scale by 2 * pi, but that's a detail); see detr/models/position_encoding.py, lines 32 to 35 in 0fb754c.
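To make the grid construction concrete, here is a minimal NumPy sketch of the idea described above. It is not the repository code: the function name `normalized_grid`, the `eps` value, and the use of a cumulative sum over the non-padded positions are assumptions made for illustration. The mask is taken to be True on padded pixels.

```python
import numpy as np

def normalized_grid(mask, eps=1e-6):
    """Build per-image y/x coordinate grids from a padding mask.

    mask: bool array of shape (batch, H, W), True where the pixel is padding.
    Returns (y_embed, x_embed), each of shape (batch, H, W), scaled to (0, 2*pi].
    """
    not_mask = ~mask
    # Counting valid pixels along an axis acts like a meshgrid that
    # stops increasing once the padding starts.
    y_embed = np.cumsum(not_mask, axis=1, dtype=np.float32)
    x_embed = np.cumsum(not_mask, axis=2, dtype=np.float32)
    # Normalize by the last value along each axis (the valid size of the
    # image), then scale by 2*pi.
    scale = 2 * np.pi
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale
    return y_embed, x_embed

# One 2x4 image whose last column is padding.
mask = np.zeros((1, 2, 4), dtype=bool)
mask[0, :, 3] = True
y_embed, x_embed = normalized_grid(mask)
```

In this toy case the three valid columns get evenly spaced x coordinates ending at 2 * pi, and the padded column simply repeats the last valid value.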
Then, in detr/models/position_encoding.py (lines 40 to 43 in 0fb754c), we compute the sine and cosine embeddings from these normalized coordinates.
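The step from a coordinate to an embedding vector can be sketched as follows. This is a hedged NumPy illustration of a standard sine/cosine encoding, not a copy of the repository code: the function name `sine_encoding`, the 128 features per axis, and the temperature of 10000 are assumptions, chosen so that the x and y parts together give the 256 elements mentioned above.

```python
import numpy as np

def sine_encoding(embed, num_pos_feats=128, temperature=10000.0):
    """Map coordinates of shape (batch, H, W) to (batch, H, W, num_pos_feats)."""
    i = np.arange(num_pos_feats, dtype=np.float32)
    # Consecutive pairs of channels share one frequency.
    dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
    pos = embed[..., None] / dim_t
    # Even channels get sin, odd channels get cos; stacking on a new last
    # axis and reshaping interleaves them as sin, cos, sin, cos, ...
    out = np.stack((np.sin(pos[..., 0::2]), np.cos(pos[..., 1::2])), axis=-1)
    return out.reshape(*embed.shape, num_pos_feats)

# Toy 3x5 grid of already-normalized coordinates in (0, 2*pi].
h, w = 3, 5
y_embed, x_embed = np.meshgrid(
    np.arange(1, h + 1, dtype=np.float32) / h * 2 * np.pi,
    np.arange(1, w + 1, dtype=np.float32) / w * 2 * np.pi,
    indexing="ij",
)
y_embed, x_embed = y_embed[None], x_embed[None]  # add a batch dimension

# Concatenate the y and x parts and move channels first: (batch, 256, H, W).
pos = np.concatenate((sine_encoding(y_embed), sine_encoding(x_embed)), axis=-1)
pos = pos.transpose(0, 3, 1, 2)
```

Note that nothing in this computation reads the feature values themselves; only the grid shape (and, in the batched case, the mask) enters.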
The positional embeddings depend only on the feature map shapes and the masks (since there can be padding between different images), not on the content of the feature maps.

Question 2

The masks are calculated in Line 299 in 0fb754c
Basically, every position that corresponds to the zero padding added so that all images in a batch have the same size is set to True in the mask.
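As an illustration of that rule, the following NumPy sketch pads a list of differently sized images to a common size and builds the matching mask. The helper name `pad_batch` and the (C, H, W) layout are assumptions for this example and are not taken from the repository.

```python
import numpy as np

def pad_batch(images):
    """Zero-pad (C, H, W) images to a common size and build the padding mask.

    Returns (batch, mask): batch has shape (N, C, max_H, max_W) and mask has
    shape (N, max_H, max_W), with True on padded pixels and False on real ones.
    """
    c = images[0].shape[0]
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = np.zeros((len(images), c, max_h, max_w), dtype=images[0].dtype)
    mask = np.ones((len(images), max_h, max_w), dtype=bool)
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img   # copy the image into the top-left corner
        mask[i, :h, :w] = False     # real pixels are not masked
    return batch, mask

images = [
    np.ones((3, 2, 3), dtype=np.float32),  # 2x3 image
    np.ones((3, 4, 2), dtype=np.float32),  # 4x2 image
]
batch, mask = pad_batch(images)
```

Here both images are padded to 4x3. In this sketch the mask marks padding only; it carries no information about image content.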
I believe I have answered your questions, and as such I'm closing the issue, but let us know if you have further questions.

Perfect explanation!

Hi @fmassa, I have one question: what is the input format for the sine positional encoding? Is tensor_list.mask a mask of 0s and 1s, where 1 marks the bounding box area and 0 marks the area outside the bbox? And is that the mask used to compute the positional embedding?

Can someone please explain to me how you calculate the positional encoding?
I know what positional encoding is, but models/position_encoding.py is a bit overwhelming. I want to know what is used as the positional encoding when working with images. Is it calculated for the feature maps or for something else?
How do you calculate masks when using images in transformers?
I know what masks are, but how do we calculate them when dealing with images?
I found no answers to these questions anywhere, so I'm posting them here.