Questions about decoder input and positional encoding #7

Closed
TimHo0331 opened this issue Jun 18, 2021 · 7 comments

@TimHo0331

Hi,

  1. On page 4, it is said that 'The decoder inputs are the trajectory proposals, which are initialized by a set of learnable positional encoding'. And on page 9, it is said that 'The decoder receives proposals (randomly initialized), positional encoding of proposals, as well as encoder memory...'
     So, what is the input of the first decoder layer? Is it the randomly initialized proposals plus a learnable positional encoding? And what is the initialization distribution?
  2. On page 9, it is said that 'In encoder, spatial positional encoding are added to the queries and keys at each MHSA layer'.
     Is the positional encoding in the encoder fixed or learnable? And is it used in all three of the motion extractor, map aggregator, and social constructor, or only in one of them?

Thank you.
@sparshgarg23

sparshgarg23 commented Jun 23, 2021

Don't know if this helps, but from what I could figure out: the decoder's input is a set of predefined anchors/waypoints/trajectories, randomly initialized (or learned from heuristics), along with the positional encoding of the proposals. These two, together with the set of trajectory histories, serve as the input for the first transformer, which outputs a list of candidate proposals.

Now let's talk about point 2. In the map stage we have 3 inputs:

  1. The context from the map
  2. The motion extractor's output
  3. The positional encoding (this can be skipped, and it is only used in the map aggregator and motion extractor). And yes, it's learnable.

Of course, it would be better if the authors of the paper or anyone else could correct me. See the sketch below for how I picture the map stage.
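
If it helps, here is a rough PyTorch sketch of how I picture the map stage (purely my own guess, with made-up names and sizes, not the authors' code):

```python
import torch
import torch.nn as nn

class MapAggregator(nn.Module):
    """My guess at the map stage: proposal features coming out of the
    motion extractor cross-attend to the encoded map context."""

    def __init__(self, d_model=128, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, proposal_feats, map_memory, pos=None):
        # proposal_feats: (batch, K, d_model), the motion extractor's output
        # map_memory:     (batch, M, d_model), encoded map context
        # pos: optional (learnable) positional encoding; can be skipped
        if pos is not None:
            proposal_feats = proposal_feats + pos
        return self.decoder(proposal_feats, map_memory)
```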

@jzhang538
Collaborator

Sorry for the late reply.

For the first question, I should clarify that the trajectory proposals are learnable parameters initialized with torch.nn.Parameter(). They keep updating during the training process and become more and more meaningful. The input of the first decoder layer (the motion extractor) is only the initialized trajectory proposals, together with the encoder memory of the trajectory history.

For the second question, similar to DETR, the initialized proposals (created via nn.Parameter) are also used as the positional encoding for each decoder layer. This is done in all three transformer modules with no distinction.
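
For concreteness, here is a minimal PyTorch sketch of that setup (illustrative only, not the released code; all module names and sizes are made up). Note that the stock nn.TransformerDecoder has no separate slot for a query positional encoding, so this sketch simply feeds the learnable proposals in as the decoder input; DETR's reference implementation instead adds the learnable embedding to the queries and keys inside each attention layer.

```python
import torch
import torch.nn as nn

class ProposalDecoder(nn.Module):
    """Sketch: K learnable trajectory proposals fed to a transformer
    decoder, DETR-style. Illustrative, not the authors' code."""

    def __init__(self, num_proposals=6, d_model=128, nhead=8, num_layers=3):
        super().__init__()
        # Registered as a parameter, so the optimizer updates the proposals
        # every step and they become more meaningful over training.
        self.proposals = nn.Parameter(torch.randn(num_proposals, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, history_memory):
        # history_memory: (batch, seq_len, d_model), the encoder memory
        # of the trajectory history.
        batch = history_memory.size(0)
        # The same learnable tensor serves as the initial decoder input
        # (and, per the reply above, as the proposals' positional encoding).
        queries = self.proposals.unsqueeze(0).expand(batch, -1, -1)
        return self.decoder(queries, history_memory)  # (batch, K, d_model)
```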

@jzhang538
Collaborator

Thanks so much for the kind reply! For point 2, we also add a positional encoding to each trajectory proposal, which is its initialized representation.

@sparshgarg23

sparshgarg23 commented Jul 7, 2021

Thanks for the reply. Any idea when the code will be made public?
Maybe you could release some data visualization code first and then slowly make the rest public.

@TimHo0331
Author

Thanks for your kind reply.
For the second question, I think you misunderstood my meaning. I wanted to ask: in the encoder of all three transformer modules, do you use positional encoding? Is it fixed or learnable?

@jzhang538
Collaborator

According to our experiments, the positional encoding in the encoder part doesn't affect the final results much. You can ignore it for simplicity and the result will be almost the same.
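
For anyone who wants to see this in code, here is a minimal sketch (not the actual implementation; names and sizes are made up) of one encoder MHSA layer with an optional learnable spatial positional encoding added to the queries and keys only, as in DETR. Adding it to q and k but not to v biases where attention looks without changing the content that gets mixed in.

```python
import torch
import torch.nn as nn

class EncoderSelfAttention(nn.Module):
    """Sketch of one encoder MHSA layer. With use_pos=False the positional
    encoding is skipped, which per the reply above barely changes results."""

    def __init__(self, seq_len, d_model=128, nhead=8, use_pos=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.pos = nn.Parameter(torch.randn(seq_len, d_model)) if use_pos else None

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # The encoding is added to the queries and keys but not the values.
        qk = x if self.pos is None else x + self.pos
        out, _ = self.attn(qk, qk, x)
        return out
```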

@jzhang538
Collaborator

Thanks so much for the suggestion. We will try to make the visualization code public soon.
