Position Id #7

Open
Wangpeiyi9979 opened this issue Sep 16, 2021 · 1 comment

Comments


Wangpeiyi9979 commented Sep 16, 2021

Hi, thanks for your nice work.
While reading the source code, I had a simple question about the position ids used in the code, as follows:

parameters['position_ids'][0]

tensor([ 2, 47,  3,  4,  5,  6,  7, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 89,
        90, 91, 92, 93,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
        22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
        69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
        87, 88, 94,  1,  1,  1], device='cuda:0')

I see that the position ids are not in order. What are the benefits of such position ids?

mahnerak (Member) commented Sep 23, 2021

@Wangpeiyi9979 As you know, in the case of Transformers, the only way to provide the self-attention layers with positional information is through position_ids. If you take a sentence and rearrange both the token ids and the position ids with the very same permutation, the result will not change.
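
A minimal sketch of this equivariance (assuming a Hugging Face roberta-base checkpoint; this is not the repo's code):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

enc = tok("a dog drops a red disc on a beach", return_tensors="pt")
input_ids = enc["input_ids"]                      # shape (1, seq_len)
seq_len = input_ids.shape[1]
# RoBERTa's absolute positions start at padding_idx + 1 = 2
position_ids = torch.arange(2, 2 + seq_len).unsqueeze(0)

perm = torch.randperm(seq_len)
with torch.no_grad():
    out = model(input_ids=input_ids,
                position_ids=position_ids).last_hidden_state
    out_perm = model(input_ids=input_ids[:, perm],
                     position_ids=position_ids[:, perm]).last_hidden_state

# each token's hidden state is the same, just reordered (up to float noise)
print(torch.allclose(out[:, perm], out_perm, atol=1e-4))  # True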

So why do we rearrange them in this way?
This is done to make the insertion of prompts during training more efficient (especially with batches).

For example, suppose in the same batch you have two samples for NLI (or another sentence-pair classification task), and the prompt tokens [P_1], [P_2], [MASK], [P_3] and [P_4] are inserted between and after the sentences:

  • [CLS] ▁a ▁dog ▁drops ▁a ▁red ▁disc ▁on ▁a ▁beach . [P_1] [P_2] [MASK] [P_3] ▁a ▁dog ▁drops ▁a ▁red ▁disc [P_4]
    and
  • [CLS] ▁three ▁biker s ▁stop ▁in ▁town . [P_1] [P_2] [MASK] [P_3] ▁biker s ▁stop ▁for ▁gas [P_4]

In the same batch they will appear as follows:

index         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
input_ids     [CLS] ▁a ▁dog ▁drops ▁a ▁red ▁disc ▁on ▁a ▁beach . [P_1] [P_2] [MASK] [P_3] ▁a ▁dog ▁drops ▁a ▁red ▁disc [P_4]
position_ids  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

index         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
input_ids     [CLS] ▁three ▁biker s ▁stop ▁in ▁town . [P_1] [P_2] [MASK] [P_3] ▁biker s ▁stop ▁for ▁gas [P_4]
position_ids  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Now you can notice that the special tokens are not aligned across the samples, so it is not efficient to insert prompt embeddings at such positions.
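
To make this concrete, here is a hypothetical illustration (shapes and positions only, not the actual implementation): with unaligned prompt slots, injecting the learned prompt embeddings into the input embeddings needs a per-sample fancy-index scatter.

import torch

batch, seq_len, hidden = 2, 22, 768
inputs_embeds = torch.randn(batch, seq_len, hidden)
prompt_embeds = torch.randn(4, hidden)            # embeddings of [P_1]..[P_4]

# prompt positions differ per sample (taken from the table above)
prompt_positions = torch.tensor([[11, 12, 14, 21],
                                 [ 8,  9, 11, 17]])
batch_idx = torch.arange(batch).unsqueeze(1).expand(-1, 4)
inputs_embeds[batch_idx, prompt_positions] = prompt_embeds

# after reordering, every sample keeps its prompts in slots 2..5,
# so a single slice assignment would do: inputs_embeds[:, 2:6] = prompt_embeds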

However, if we permute the tokens, all the special tokens are aligned. Moreover, not only is [CLS] accessible with encodings[:, :, 0], but the embeddings for [MASK] are accessible with encodings[:, :, 1]:

index         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
input_ids     [CLS] [MASK] [P_1] [P_2] [P_3] [P_4] ▁a ▁dog ▁drops ▁a ▁red ▁disc ▁on ▁a ▁beach . ▁a ▁dog ▁drops ▁a ▁red ▁disc
position_ids  0 13 11 12 14 21 1 2 3 4 5 6 7 8 9 10 15 16 17 18 19 20

index         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
input_ids     [CLS] [MASK] [P_1] [P_2] [P_3] [P_4] ▁three ▁biker s ▁stop ▁in ▁town . ▁biker s ▁stop ▁for ▁gas
position_ids  0 10 8 9 11 17 1 2 3 4 5 6 7 12 13 14 15 16

This trick is performed when the reorder_optimized flag is enabled. Training with it is equivalent to training without it, just much faster.
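
A rough sketch of the reordering idea (a hypothetical helper, not the actual implementation): the special and prompt tokens are moved to fixed leading slots, while position_ids keep each token's index in the original, unpermuted sequence.

def reorder(tokens, special_idx):
    # tokens: the original token sequence; special_idx: the original indices of
    # [CLS], [MASK], [P_1] ... [P_4], in the order they should appear up front
    rest = [i for i in range(len(tokens)) if i not in special_idx]
    order = special_idx + rest            # the permutation applied to the sample
    input_ids = [tokens[i] for i in order]
    position_ids = order                  # original position of every token
    return input_ids, position_ids

tokens = ["[CLS]", "▁three", "▁biker", "s", "▁stop", "▁in", "▁town", ".",
          "[P_1]", "[P_2]", "[MASK]", "[P_3]", "▁biker", "s", "▁stop",
          "▁for", "▁gas", "[P_4]"]
ids, pos = reorder(tokens, [0, 10, 8, 9, 11, 17])
# ids -> [CLS] [MASK] [P_1] [P_2] [P_3] [P_4] ▁three ▁biker s ▁stop ▁in ▁town . ▁biker s ▁stop ▁for ▁gas
# pos -> [0, 10, 8, 9, 11, 17, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16]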

BTW: in RoBERTa models, 1 serves as the padding id; in most other transformer models you will see 0 used for padding.
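
A quick check with the Hugging Face tokenizers (just an illustration, not this repo's code):

from transformers import AutoTokenizer

print(AutoTokenizer.from_pretrained("roberta-base").pad_token_id)       # 1
print(AutoTokenizer.from_pretrained("bert-base-uncased").pad_token_id)  # 0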

@mahnerak mahnerak pinned this issue Oct 3, 2021