
Unable to understand positional encoding and masks. #25

Closed
saahiluppal opened this issue May 31, 2020 · 3 comments
Labels: question (Further information is requested)

Comments


saahiluppal commented May 31, 2020

  1. Can someone please explain how you calculated the positional encoding?
    I know what positional encoding is, but models.positional_encoding.py is a bit overwhelming. I want to know what is considered a positional encoding when working with images. Is it computed over the feature maps or something else?

  2. How do you calculate masks when using images in transformers?
    I know what masks are, but how do we calculate these when dealing with images?

I found no answers to these questions anywhere, so I'm posting them here.


fmassa commented May 31, 2020

Hi,

Question 1

I want to know what are considered as positional encoding while working with images.

The positional encoding takes an xy coordinate in [0, 1] and converts it into a vector of 256 elements. The encoding for x and y is computed in the same way, so for the sake of simplicity let's only look at the x part.
In

y_embed = not_mask.cumsum(1, dtype=torch.float32)
x_embed = not_mask.cumsum(2, dtype=torch.float32)

we create an image-shaped tensor which is similar in spirit to meshgrid, but which supports images of different sizes (via the masks) within each batch. This way we have a grid of xy coordinates, which we then normalize so that they lie between 0 and 1 (in this case we also scale by 2 * pi, but that's a detail):
if self.normalize:
    eps = 1e-6
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
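To make this concrete, here is a toy example (the batch size, canvas size and amount of padding below are made up for illustration) showing how the cumulative sums over not_mask produce per-pixel coordinates that count only real pixels, and how the normalization maps them into (0, 2 * pi]:

import math
import torch

# one image of 2 x 3 valid pixels inside a 3 x 4 padded canvas (made-up sizes)
mask = torch.ones(1, 3, 4, dtype=torch.bool)   # True = padding
mask[:, :2, :3] = False                        # False = real pixels
not_mask = ~mask

y_embed = not_mask.cumsum(1, dtype=torch.float32)  # row coordinate, counts only real pixels
x_embed = not_mask.cumsum(2, dtype=torch.float32)  # column coordinate, counts only real pixels

eps, scale = 1e-6, 2 * math.pi
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale
# valid pixels now carry x/y coordinates in (0, 2 * pi]; the padded positions are
# excluded from attention by the mask anyway, so their values do not matter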

Then, in
pos_x = x_embed[:, :, :, None] / dim_t
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
we apply the standard sine embedding in a vectorized fashion to x and y separately, and afterwards concatenate the two, yielding the spatial positional embedding.
The positional embeddings only depend on the feature map shapes and the masks (as there could be padding between different images), and not on the content of the feature maps.
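Putting the pieces together, a rough self-contained sketch of the whole computation looks like the following (128 features per axis matches the 256-element figure above, and the temperature of 10000 is the usual choice for sine embeddings; both are assumptions here rather than values read off the config):

import math
import torch

def sine_position_embedding(mask, num_pos_feats=128, temperature=10000):
    # mask: (B, H, W) bool tensor, True where the pixel is padding
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)
    x_embed = not_mask.cumsum(2, dtype=torch.float32)

    eps, scale = 1e-6, 2 * math.pi
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale

    # frequencies: pairs of (sin, cos) channels with geometrically increasing wavelength
    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / num_pos_feats)

    pos_x = x_embed[:, :, :, None] / dim_t
    pos_y = y_embed[:, :, :, None] / dim_t
    pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
    pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)

    # concatenate the y and x halves along the channel axis: (B, 2 * num_pos_feats, H, W)
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

# e.g. a 32 x 42 feature map with no padding gives a (1, 256, 32, 42) embedding
pos = sine_position_embedding(torch.zeros(1, 32, 42, dtype=torch.bool))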

Question 2

How do you calculate masks when using images in transformers?

Those are calculated in

mask = torch.ones((b, h, w), dtype=torch.bool, device=device)

Basically, every location that corresponds to the zero padding used to give all images in the batch the same size is filled with True in the mask.
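As a quick illustration of where that mask comes from (the image sizes below are made up): when images of different sizes are batched together, each one is zero-padded up to the largest size, and the mask is set to True exactly on the padded pixels:

import torch

# two images of different (made-up) sizes
imgs = [torch.rand(3, 480, 600), torch.rand(3, 520, 560)]
b = len(imgs)
h = max(im.shape[1] for im in imgs)
w = max(im.shape[2] for im in imgs)

batch = torch.zeros(b, 3, h, w)                # zero-padded batch
mask = torch.ones(b, h, w, dtype=torch.bool)   # start with everything marked as padding
for i, im in enumerate(imgs):
    _, ih, iw = im.shape
    batch[i, :, :ih, :iw] = im                 # copy the real image into the top-left corner
    mask[i, :ih, :iw] = False                  # real pixels -> False, padding stays True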

I believe I have answered your questions, and as such I'm closing the issue, but let us know if you have further questions.

fmassa closed this as completed May 31, 2020
fmassa added the question label May 31, 2020
saahiluppal (Author)

Perfect explanation!


vigneshgig commented Jul 19, 2021


Hi @fmassa, I have one doubt. For the sine positional encoding, what is the input format? Is tensor_list.mask a map of 0s and 1s where 1 marks the bounding-box area and 0 marks everything outside the bbox, and is the positional embedding computed from that mask? Is that right?
I have implemented position encoding for my project in order to extract spatial positional features. Currently I just use a one-hot encoding by dividing the image into a grid: if the bounding box overlaps a grid cell I set it to one, and to zero otherwise, and so on. But I came across this sine positional encoding and am planning to add it. If possible, could you please explain the difference between the grid one-hot encoding and this positional encoding?
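For concreteness, a rough sketch of the grid one-hot scheme I described would be something like this (the 8 x 8 grid and the pixel-coordinate box format are just illustrative choices); the sine encoding instead gives every pixel a continuous code rather than a single hard cell:

import torch

def grid_one_hot(box, img_h, img_w, grid=8):
    # box = (x0, y0, x1, y1) in pixels; mark every grid cell the box overlaps with 1
    x0, y0, x1, y1 = box
    cell_h, cell_w = img_h / grid, img_w / grid
    code = torch.zeros(grid, grid)
    r0, r1 = int(y0 // cell_h), min(int(y1 // cell_h), grid - 1)
    c0, c1 = int(x0 // cell_w), min(int(x1 // cell_w), grid - 1)
    code[r0:r1 + 1, c0:c1 + 1] = 1.0
    return code.flatten()   # a hard, discrete spatial code per box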
Thanks
