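## General terms
autoencoding models: see MLM
autoregressive models: see CLM
CLM: causal language modeling, a pretraining task in which the model reads the text in order and has to predict the next word.
MLM: masked language modeling, a pretraining task in which the model sees a corrupted version of the text, produced by randomly masking some tokens, and has to predict the original text.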
multimodal: a task that combines text with another kind of input, such as images.
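NLG: natural language generation
NLP: natural language processing
NLU: natural language understanding
pretrained model: a model that has already been trained on a large corpus and can be reused or fine-tuned.
RNN: recurrent neural network
seq2seq: models that generate a new sequence from an input, such as translation or summarization models (for example BART or T5).
token: a part of a sentence, usually a word, but it can also be a subword or a punctuation symbol.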
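
The difference between the CLM and MLM objectives is easier to see on a concrete token list. The sketch below is only an illustration in plain Python: the helper names `causal_lm_examples` and `masked_lm_example` are hypothetical, and the masking is simplified (real MLM pretraining such as BERT's also sometimes swaps in random tokens or leaves the chosen token unchanged).

```python
import random

def causal_lm_examples(tokens):
    """CLM: every prefix of the sequence is used to predict the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """MLM: corrupt the sequence by masking random tokens.

    Returns (corrupted, labels). labels holds the original token at each
    masked position and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

tokens = ["A", "Titan", "RTX", "has", "24GB", "of", "VRAM"]

# CLM: the model only ever sees what came before the target.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# MLM: the model sees the whole (corrupted) sentence at once.
corrupted, labels = masked_lm_example(tokens, mask_prob=0.3)
print(corrupted)
print(labels)
```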

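## Model inputs
Every model is different, but they share similarities, and most models take the same kinds of inputs.
## Input IDs
The input IDs are often the only required parameter passed to the model. They are token indices: numerical representations of the tokens that make up the sequence the model will use as input.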

In [2]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"

In [3]:
tokenized_sequence = tokenizer.tokenize(sequence)

In [4]:
print(tokenized_sequence)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
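
The `##` prefix in the output above marks a piece that continues the previous one inside the same word. A rough sketch of how such a split can come about is greedy longest-match-first lookup against a vocabulary. The function `wordpiece` and the tiny `toy_vocab` below are hypothetical and exist only for this illustration; the real `bert-base-cased` vocabulary has tens of thousands of entries, and the real tokenizer also splits punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, just large enough for the example sentence.
toy_vocab = {"A", "Titan", "R", "##T", "##X", "has", "24", "##GB",
             "of", "V", "##RA", "##M"}

sentence = "A Titan RTX has 24GB of VRAM"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the split matches the tokenizer output shown above, because "RTX", "24GB" and "VRAM" are not in the vocabulary as whole words.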


These tokens can then be converted into IDs that the model understands. This is done by feeding the sentence directly to the tokenizer.


In [6]:
encoded_sequence = tokenizer(sequence)["input_ids"]
print(encoded_sequence)

[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]


In [8]:
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)

[CLS] A Titan RTX has 24GB of VRAM [SEP]


In [10]:
print(tokenizer.decode(tokenizer(decoded_sequence)["input_ids"]))
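# The tokenizer adds the special tokens ([CLS], [SEP]) automatically, so encoding an already decoded string duplicates them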

[CLS] [CLS] A Titan RTX has 24GB of VRAM [SEP] [SEP]


In [12]:
## Attention mask

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."


encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

In [13]:
len(encoded_sequence_a), len(encoded_sequence_b)

(8, 19)


In [14]:
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)  # pads every sequence to the length of the longest one; the order of the sentences in the list does not matter

In [15]:
padded_sequences["input_ids"]

[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

In [16]:
padded_sequences["attention_mask"]

[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
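
What `padding=True` produces can be reproduced by hand, which makes the role of the attention mask clearer: 1 marks a real token, 0 marks padding the model should ignore. The helper `pad_batch` below is a hypothetical sketch, not the library's implementation. It assumes right-padding and a pad ID of 0, which is what the output above shows for this tokenizer.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = attend to this position, 0 = padding, do not attend
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The two encoded sequences from the cells above.
short_ids = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
long_ids = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
            1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

padded = pad_batch([short_ids, long_ids])
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

Passing the sequences in the opposite order gives the same rows in swapped positions, since the target length is computed from the whole batch.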

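## Token type IDs
Some models work on pairs of sequences, for example for sequence classification or question answering. The two sequences are joined into a single input with special tokens, and the token type IDs mark which sequence each token belongs to: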

[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]


In [17]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])


In [19]:
encoded_dict['token_type_ids']

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
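
The pattern in the output above (zeros for `[CLS] SEQUENCE_A [SEP]`, ones for `SEQUENCE_B [SEP]`) can be rebuilt by hand. The helper `build_pair` below is a hypothetical sketch. It assumes the BERT IDs 101 for `[CLS]` and 102 for `[SEP]`, as seen in the encoded outputs earlier, and uses placeholder IDs for the two sequences rather than real tokenizer output.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two already-tokenized sequences and mark which segment each token is in."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Placeholder IDs (not real vocabulary entries): 8 tokens per sequence.
ids_a = list(range(1000, 1008))
ids_b = list(range(2000, 2008))

input_ids, token_type_ids = build_pair(ids_a, ids_b)
print(token_type_ids)
```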

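## Position IDs
Unlike RNNs, which process tokens one after another, transformers have no built-in notion of each token's position, so position IDs are created to tell the model where each token sits in the sequence.
They are an optional parameter. If no position IDs are passed to the model, they are created automatically as absolute positional embeddings,
selected in the range [0, config.max_position_embeddings - 1].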
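
As a rough picture of the default, absolute position IDs are just 0, 1, 2, ... for every sequence in the batch, and they must stay inside the range given above. The helper `default_position_ids` is hypothetical and only mirrors that idea. The limit of 512 is an assumed value for illustration (it is the usual setting for BERT base); the real limit comes from the model's config.

```python
def default_position_ids(batch_size, seq_len, max_position_embeddings=512):
    """Absolute position IDs 0..seq_len-1, repeated for every sequence in the batch."""
    if seq_len > max_position_embeddings:
        raise ValueError(
            f"sequence length {seq_len} exceeds the "
            f"{max_position_embeddings} positions the model has embeddings for"
        )
    return [list(range(seq_len)) for _ in range(batch_size)]

position_ids = default_position_ids(batch_size=2, seq_len=5)
print(position_ids)
```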

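## Feed Forward Chunking
For an input of shape [batch_size, sequence_length], the intermediate feed-forward activations have shape
[batch_size, sequence_length, config.intermediate_size] and can account for a large share of memory use.
The feed-forward computation is independent along the sequence_length dimension, so the outputs can be computed a few positions at a time and concatenated afterwards.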
[batch_size, 

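chunk_size: the number of output embeddings computed at a time. It trades lower memory use against longer computation time.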
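
The idea behind feed forward chunking can be checked on a toy example: a position-wise feed-forward layer gives the same result whether it processes the whole sequence at once or a few positions at a time. The sketch below is plain Python on a single sequence (no batch dimension), with hypothetical helper names and random weights. It demonstrates the equivalence only, not the library's implementation or its actual memory savings.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ffn_block(positions, w_in, w_out):
    """Two-layer feed-forward with ReLU over a block of positions.

    Materializes len(positions) x intermediate_size hidden activations,
    which is the memory cost that chunking keeps small.
    """
    hidden = [[max(0.0, dot(x, row)) for row in w_in] for x in positions]
    return [[dot(h, row) for row in w_out] for h in hidden]

def ffn_chunked(positions, w_in, w_out, chunk_size):
    """Same computation, chunk_size positions at a time (0 means no chunking)."""
    if chunk_size == 0:
        return ffn_block(positions, w_in, w_out)
    outputs = []
    for start in range(0, len(positions), chunk_size):
        chunk = positions[start:start + chunk_size]
        outputs.extend(ffn_block(chunk, w_in, w_out))  # concatenate along the sequence
    return outputs

rng = random.Random(0)
sequence_length, hidden_size, intermediate_size = 6, 4, 16
x = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(sequence_length)]
w_in = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(intermediate_size)]
w_out = [[rng.uniform(-1, 1) for _ in range(intermediate_size)] for _ in range(hidden_size)]

full = ffn_block(x, w_in, w_out)
chunked = ffn_chunked(x, w_in, w_out, chunk_size=2)
print(chunked == full)
```

A smaller chunk size means fewer hidden activations are alive at once, at the price of more separate passes through the layer.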