## Input IDs: the numericalized input sequence

In [1]:
# Here's an example using the BERT tokenizer, which is a WordPiece (sub-word unit) tokenizer.
# See the WordPiece model in Google's Neural Machine Translation System (2016); its goal is to improve the representation of rare words.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"



In [2]:
# As the output shows, the WordPiece tokenizer splits some words into sub-word pieces
tokenized_sequence = tokenizer.tokenize(sequence)
assert tokenized_sequence == ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']

In [3]:
# Convert the sequence into IDs; encode() also adds the special tokens [CLS] (101) and [SEP] (102)
encoded_sequence = tokenizer.encode(sequence)
assert encoded_sequence == [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
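
To make the splitting in the cells above more concrete, here is a minimal sketch of greedy longest-match-first sub-word tokenization, the matching strategy commonly described for WordPiece at inference time. `TOY_VOCAB` and `wordpiece_tokenize` are hypothetical names used only for illustration, and the toy IDs are not BERT's real vocabulary indices.

```python
# Hypothetical toy vocabulary (token -> ID); NOT the real bert-base-cased vocabulary.
TOY_VOCAB = {tok: i for i, tok in enumerate(
    ["[UNK]", "A", "Titan", "R", "##T", "##X", "has", "V", "##RA", "##M"]
)}

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one whitespace-delimited word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until the vocab has it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

tokens = [piece
          for word in "A Titan RTX has VRAM".split()
          for piece in wordpiece_tokenize(word, TOY_VOCAB)]
toy_ids = [TOY_VOCAB[tok] for tok in tokens]  # input IDs are plain vocabulary lookups
print(tokens)
print(toy_ids)
```

With this toy vocabulary, `RTX` and `VRAM` fall apart into pieces in the same way as in the real tokenizer output above, because neither word is a vocabulary entry on its own.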

## Attention mask: distinguishing padding tokens

Sequences in a batch differ in length, so the shorter ones must be padded; the padding tokens should not be attended to.

In [4]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer.encode(sequence_a)
assert len(encoded_sequence_a) == 8

encoded_sequence_b = tokenizer.encode(sequence_b)
assert len(encoded_sequence_b) == 19

In [5]:
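# Pad sequence A to the length of sequence B (19 IDs).
# Recent transformers releases deprecate pad_to_max_length=True in favor of padding="max_length".
padded_sequence_a = tokenizer.encode(sequence_a, max_length=19, padding="max_length")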

assert padded_sequence_a == [101, 1188, 1110, 170, 1603, 4954,  119, 102,    0,    0,    0,    0,    0,    0,    0,    0,   0,   0,   0]
assert encoded_sequence_b == [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

In [6]:
# The attention mask is a binary tensor marking the positions of the padded indices so that the model does not attend to them.
# For the BertTokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value.
# Recent transformers releases deprecate pad_to_max_length=True in favor of padding="max_length".
sequence_a_dict = tokenizer.encode_plus(sequence_a, max_length=19, padding="max_length")

assert sequence_a_dict['input_ids'] == [101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
assert sequence_a_dict['attention_mask'] == [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
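
The relationship between padding and the mask can also be sketched without the library. The helper below, `pad_batch`, is a hypothetical name and not part of `transformers`; it assumes right-padding with pad ID 0, which matches the output shown above for `bert-base-cased`.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every ID list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# The two encoded sequences from the cells above
ids_a = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
ids_b = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
         1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

input_ids, attention_mask = pad_batch([ids_a, ids_b])
print(input_ids[0])
print(attention_mask[0])
```

The longest sequence in the batch needs no padding, so its mask is all ones.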

## Token Type IDs: distinguishing the two sentences

In BERT these correspond to the segment IDs: 0 marks the first sentence and 1 marks the second.

In [7]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# [CLS] SEQ_A [SEP] SEQ_B [SEP]

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_sequence = tokenizer.encode(sequence_a, sequence_b)
assert tokenizer.decode(encoded_sequence) == "[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]"

In [8]:
encoded_dict = tokenizer.encode_plus(sequence_a, sequence_b)

assert encoded_dict['input_ids'] == [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102]
assert encoded_dict['token_type_ids'] == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
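
As a rough illustration of how the segment layout arises, the sketch below rebuilds the pair encoding by hand. `build_pair` is a hypothetical helper, and the `[CLS]`/`[SEP]` IDs (101 and 102) are taken from the outputs above rather than looked up in the vocabulary.

```python
CLS_ID, SEP_ID = 101, 102  # assumed from the encoded outputs shown above

def build_pair(ids_a, ids_b):
    """Lay out [CLS] A [SEP] B [SEP] and the matching segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Sentence IDs without special tokens, copied from the cell above
ids_a = [20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520]
ids_b = [2777, 1110, 20164, 10932, 2271, 7954, 1359, 136]

pair_ids, pair_types = build_pair(ids_a, ids_b)
print(pair_ids)
print(pair_types)
```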

## Position IDs: positional encoding

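Note: **the transformer itself has no ability to capture position information**, so an additional positional encoding is required. The main approaches are:
* Sinusoidal encoding (the approach used in the original Transformer paper)
* Absolute position embeddings: the default in the Hugging Face implementation; absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1]
* Relative position embeddings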
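
For the first two options in the list above, a small sketch may help. It builds default position IDs as consecutive integers and computes the fixed sinusoidal table from the original Transformer paper in plain Python. `sinusoidal_encoding` is a hypothetical helper name, and the tiny sizes are chosen only for readability.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2k] = sin(pos / 10000^(2k/d_model)); PE[pos, 2k+1] = cos of the same angle."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):  # i plays the role of 2k
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:d_model])  # trim the extra column when d_model is odd
    return table

# Absolute position IDs are the integers 0 .. seq_len - 1
position_ids = list(range(5))
pe = sinusoidal_encoding(seq_len=len(position_ids), d_model=4)
print(position_ids)
print(pe[0])
```

With learned absolute embeddings, the same `position_ids` would instead index rows of a trainable embedding matrix with `config.max_position_embeddings` rows.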