
Unable to understand positional encoding and masks. #25

Closed
saahiluppal opened this issue May 31, 2020 · 3 comments
Labels: question (Further information is requested)

Comments


saahiluppal commented May 31, 2020

  1. Can someone please explain how you calculated the positional encoding?
    I know what positional encoding is, but models.positional_encoding.py is a bit overwhelming. I want to know what is considered a positional encoding when working with images. Is it computed over the feature maps or something else?

  2. How do you calculate masks when using images in transformers?
    I know what masks are, but how do we calculate these when dealing with images?

I found no answers to these questions anywhere, so I'm posting them here.


fmassa commented May 31, 2020

Hi,

Question 1

I want to know what are considered as positional encoding while working with images.

The positional encoding takes an xy coordinate in [0, 1] and converts it into a vector of 256 elements. The encoding for x and y is computed in the same way, so for the sake of simplicity let's only look at the x part.
In

y_embed = not_mask.cumsum(1, dtype=torch.float32)
x_embed = not_mask.cumsum(2, dtype=torch.float32)

we create an image-shaped tensor which is similar in spirit to meshgrid, but which supports images of different sizes (via the masks) within each batch. This way we have a grid of xy coordinates, which we then normalize so that they lie between 0 and 1 (in this case we also scale by 2 * pi, but that's a detail):
if self.normalize:
    eps = 1e-6
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
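To make this concrete, here is a toy example (the batch size, canvas size and amount of padding below are made up for illustration) showing how the cumulative sums over not_mask produce per-pixel coordinates that count only real pixels, and how the normalization maps them into (0, 2 * pi]:

import math
import torch

# one image of 2 x 3 valid pixels inside a 3 x 4 padded canvas (made-up sizes)
mask = torch.ones(1, 3, 4, dtype=torch.bool)   # True = padding
mask[:, :2, :3] = False                        # False = real pixels
not_mask = ~mask

y_embed = not_mask.cumsum(1, dtype=torch.float32)  # row coordinate, counts only real pixels
x_embed = not_mask.cumsum(2, dtype=torch.float32)  # column coordinate, counts only real pixels

eps, scale = 1e-6, 2 * math.pi
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale
# valid pixels now carry x/y coordinates in (0, 2 * pi]; the padded positions are
# excluded from attention by the mask anyway, so their values do not matter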

Then, in
pos_x = x_embed[:, :, :, None] / dim_t
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
we apply the standard sine embedding in a vectorized fashion to x and y separately, and afterwards concatenate the two, yielding the spatial positional embedding.
The positional embeddings only depend on the feature map shapes and the masks (as there could be padding between different images), and not on the content of the feature maps.
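Putting the pieces together, a rough self-contained sketch of the whole computation looks like the following (128 features per axis matches the 256-element figure above, and the temperature of 10000 is the usual choice for sine embeddings; both are assumptions here rather than values read off the config):

import math
import torch

def sine_position_embedding(mask, num_pos_feats=128, temperature=10000):
    # mask: (B, H, W) bool tensor, True where the pixel is padding
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)
    x_embed = not_mask.cumsum(2, dtype=torch.float32)

    eps, scale = 1e-6, 2 * math.pi
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale

    # frequencies: pairs of (sin, cos) channels with geometrically increasing wavelength
    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / num_pos_feats)

    pos_x = x_embed[:, :, :, None] / dim_t
    pos_y = y_embed[:, :, :, None] / dim_t
    pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
    pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)

    # concatenate the y and x halves along the channel axis: (B, 2 * num_pos_feats, H, W)
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

# e.g. a 32 x 42 feature map with no padding gives a (1, 256, 32, 42) embedding
pos = sine_position_embedding(torch.zeros(1, 32, 42, dtype=torch.bool))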

Question 2

How do you calculate masks when using images in transformers?

Those are calculated in

mask = torch.ones((b, h, w), dtype=torch.bool, device=device)

Basically, every location that corresponds to the zero padding used to give all images in the batch the same size is filled with True in the mask.
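As a quick illustration of where that mask comes from (the image sizes below are made up): when images of different sizes are batched together, each one is zero-padded up to the largest size, and the mask is set to True exactly on the padded pixels:

import torch

# two images of different (made-up) sizes
imgs = [torch.rand(3, 480, 600), torch.rand(3, 520, 560)]
b = len(imgs)
h = max(im.shape[1] for im in imgs)
w = max(im.shape[2] for im in imgs)

batch = torch.zeros(b, 3, h, w)                # zero-padded batch
mask = torch.ones(b, h, w, dtype=torch.bool)   # start with everything marked as padding
for i, im in enumerate(imgs):
    _, ih, iw = im.shape
    batch[i, :, :ih, :iw] = im                 # copy the real image into the top-left corner
    mask[i, :ih, :iw] = False                  # real pixels -> False, padding stays True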

I believe I have answered your questions, and as such I'm closing the issue, but let us know if you have further questions.

fmassa closed this as completed May 31, 2020
fmassa added the question label May 31, 2020
saahiluppal (Author)

Perfect explanation!


vigneshgig commented Jul 19, 2021


Hi @fmassa, I have one doubt. For the sine positional encoding, what is the input format? Is tensor_list.mask a map of 0s and 1s where 1 marks the bounding-box area and 0 marks everything outside the bbox, and is the positional embedding computed from that mask? Is that right?
I have implemented position encoding for my project in order to extract spatial positional features. Currently I just use a one-hot encoding by dividing the image into a grid: if the bounding box overlaps a grid cell I set it to one, and to zero otherwise, and so on. But I came across this sine positional encoding and am planning to add it. If possible, could you please explain the difference between the grid one-hot encoding and this positional encoding?
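For concreteness, a rough sketch of the grid one-hot scheme I described would be something like this (the 8 x 8 grid and the pixel-coordinate box format are just illustrative choices); the sine encoding instead gives every pixel a continuous code rather than a single hard cell:

import torch

def grid_one_hot(box, img_h, img_w, grid=8):
    # box = (x0, y0, x1, y1) in pixels; mark every grid cell the box overlaps with 1
    x0, y0, x1, y1 = box
    cell_h, cell_w = img_h / grid, img_w / grid
    code = torch.zeros(grid, grid)
    r0, r1 = int(y0 // cell_h), min(int(y1 // cell_h), grid - 1)
    c0, c1 = int(x0 // cell_w), min(int(x1 // cell_w), grid - 1)
    code[r0:r1 + 1, c0:c1 + 1] = 1.0
    return code.flatten()   # a hard, discrete spatial code per box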
Thanks
