# Models and Tokenizers

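## Models
Besides using `AutoModel` to pick the model class automatically from a checkpoint, you can use the model-specific `Model` class directly. For BERT, that class is `BertModel`: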

In [7]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

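Note that in most cases you should load models with `AutoModel`, so that switching to a different checkpoint does not require changing the class name.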

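### Loading a model
Any model stored on the Hugging Face Hub can have its weights loaded with `AutoModel.from_pretrained()`. The argument can be a checkpoint name, as above, or the path of a local model directory, as in the next cell: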

In [5]:
from transformers import BertModel

model = BertModel.from_pretrained('../models/bert')

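A model's Hub page often lists many files. Usually you only need the model's config.json and pytorch_model.bin (or model.safetensors, depending on the checkpoint), plus the tokenizer's tokenizer.json, tokenizer_config.json, and vocab.txt.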


### Saving a model
A model is saved by calling `Model.save_pretrained()`. For example, to save a loaded BERT model:

In [10]:
from transformers import AutoModel

model = AutoModel.from_pretrained('bert-base-cased')
model.save_pretrained('../models/bert-base-cased')

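This creates two files in the target directory:
- config.json: the model configuration, which stores architecture parameters such as the number of Transformer layers and the hidden dimension
- pytorch_model.bin: the model weights, also known as the state dictionary (recent versions of `transformers` write the weights to model.safetensors by default instead)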

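In short, the configuration file records the model's architecture and the weights file records its parameters; both are required. A model you saved yourself is loaded the same way, with `Model.from_pretrained()`, by passing the path of the save directory.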



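## Tokenizers
Neural networks cannot process raw text directly, so text must first be converted to numbers. This process is called encoding, and it has two steps:
1. Use a tokenizer to split the text into tokens at the word, subword, or character level
2. Map every token to its token ID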

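### Tokenization strategies
Tokenization strategies differ in the granularity of the split:
- Word-based: Python's `split()` function is enough to split text on whitespace: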


In [7]:
tokenizer_text = "Jim Henson was a puppeteer".split()
print(tokenizer_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


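This approach produces a very large vocabulary, because every distinct word form needs its own entry.

> A vocabulary is a dictionary that maps each token to an ID (starting from 0). The model tells tokens apart by these token IDs.

When a word is not in the vocabulary, the tokenizer represents it with a dedicated `[UNK]` token to mark it as unknown.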


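- Character-based: split the text into individual characters
- Subword: keep high-frequency words whole and split low-frequency words into meaningful subwords. For example, "annoyingly" is a low-frequency word that can be split into "annoying" and "ly"; both subwords occur more often, and the meaning of the word is preserved.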
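
The vocabulary lookup and the `[UNK]` fallback described above can be sketched in a few lines of plain Python. The vocabulary here is a hand-built toy, not the vocabulary of any real tokenizer, and `tokens_to_ids` is a hypothetical helper introduced only for illustration.

```python
# Toy illustration only: a hand-built vocabulary, not a real tokenizer's.
sentence = "Jim Henson was a puppeteer"

# Word-level and character-level splits of the same sentence.
word_tokens = sentence.split()
char_tokens = list(sentence.replace(" ", ""))

# A vocabulary maps each known token to an integer ID, starting from 0.
vocab = {"[UNK]": 0, "Jim": 1, "was": 2, "a": 3}

def tokens_to_ids(tokens, vocab):
    """Map tokens to IDs, falling back to the [UNK] ID for unknown tokens."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]

print(word_tokens)
print(tokens_to_ids(word_tokens, vocab))   # unknown words share the [UNK] ID
print(len(char_tokens), "character tokens")
```

The word-level split needs one vocabulary entry per word, while the character-level split needs only a small alphabet but produces many more tokens per sentence. Subword tokenization sits between the two.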

### Loading and saving a tokenizer
Tokenizers are loaded and saved much like models, with `Tokenizer.from_pretrained()` and `Tokenizer.save_pretrained()`. For example, to load and save the BERT tokenizer:

In [11]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('../models/bert-base-cased')

('../models/bert-base-cased/tokenizer_config.json',
 '../models/bert-base-cased/special_tokens_map.json',
 '../models/bert-base-cased/vocab.txt',
 '../models/bert-base-cased/added_tokens.json')

As with models, in most cases you should load tokenizers with `AutoTokenizer`:

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('../models/bert-base-cased')

('../models/bert-base-cased/tokenizer_config.json',
 '../models/bert-base-cased/special_tokens_map.json',
 '../models/bert-base-cased/vocab.txt',
 '../models/bert-base-cased/added_tokens.json',
 '../models/bert-base-cased/tokenizer.json')

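Calling `Tokenizer.save_pretrained()` writes several files to the target directory, including:
- special_tokens_map.json: a mapping file for special tokens such as the unknown token
- tokenizer_config.json: the tokenizer configuration, which stores the parameters needed to build the tokenizer
- vocab.txt: the vocabulary, one token per line; the line number, counted from 0, is the token ID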


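### Encoding and decoding text
Encoding text involves two steps:
- Tokenization: split the text into tokens according to the tokenizer's strategy
- Mapping: convert the tokens to their token IDs

First, we tokenize a sentence with the BERT tokenizer: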

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
sequence = "Using a transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']


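As the output shows, the BERT tokenizer uses subword tokenization: it keeps splitting a word until every piece is a token in the vocabulary. For example, "transformer" is split into "transform" and "##er".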
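
BERT's tokenizer is based on WordPiece, which splits a word greedily by always taking the longest vocabulary entry that matches at the current position. The following is a minimal sketch of that idea under simplifying assumptions: `wordpiece_split` and `toy_vocab` are hypothetical names, the vocabulary is a tiny hand-picked set, and details of the real tokenizer such as normalization and punctuation handling are left out.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to mirror the example above.
toy_vocab = {"Using", "a", "transform", "##er", "network", "is", "simple"}

print(wordpiece_split("transformer", toy_vocab))
print(wordpiece_split("simple", toy_vocab))
print(wordpiece_split("puppeteer", toy_vocab))
```

With this toy vocabulary, "transformer" comes out as two pieces, "simple" stays whole, and "puppeteer" falls back to `[UNK]` because no prefix of it is in the vocabulary.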

Next, `convert_tokens_to_ids()` converts the tokens to their token IDs:

In [16]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 3014]


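The `encode()` function combines both steps and also adds the special tokens the model expects. For example, the BERT tokenizer adds [CLS] at the start of the sequence and [SEP] at the end: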

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
sequence = "Using a transformer network is simple"
sequence_ids = tokenizer.encode(sequence)

print(sequence_ids)

[101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102]


In [19]:
# In practice, the most common way to encode text is to call the tokenizer directly
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenizer_text = tokenizer('Using a transformer network is simple')
print(tokenizer_text)

{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


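Decoding is the reverse of encoding: it converts token IDs back into a string. Note that decoding does more than map token IDs back to tokens; it also merges the tokens that were split from a single word. Below, `decode()` decodes token IDs like those generated above: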

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

decode_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decode_string)

decoded_string = tokenizer.decode([101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102])
print(decoded_string)

Using a transformer network is simple
[CLS] Using a Transformer network is simple [SEP]
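
The merging step that decoding performs can be illustrated with a small sketch. This is not the library's `decode()` implementation; `merge_wordpieces` is a hypothetical helper that only handles the "##" continuation prefix used by WordPiece and ignores special tokens and punctuation spacing.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a string, gluing '##' pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' prefix and attach to the previous word
        else:
            words.append(token)
    return " ".join(words)

tokens = ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
print(merge_wordpieces(tokens))
```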


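## Handling multiple sequences

In practice we usually process several texts at once, and the model only accepts batched input. Even a single text must be wrapped into a batch that contains one sample: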

In [21]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for HuggingFace course my whole life"

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs: \n", input_ids)

output = model(input_ids)
print("Logits: \n", output.logits)

Input IDs: 
 tensor([[ 1045,  1005,  2310,  2042,  3403,  2005, 17662, 12172,  2607,  2026,
          2878,  2166]])
Logits: 
 tensor([[-1.9676,  2.1872]], grad_fn=<AddmmBackward0>)


In [23]:
# In practice, we call the tokenizer directly on the text
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for HuggingFace course my whole life"

tokenized_inputs = tokenizer(sequence, return_tensors='pt')
print("Inputs Keys: \n", tokenized_inputs.keys())
print("\nInput IDs: \n", tokenized_inputs['input_ids'])

output = model(**tokenized_inputs)
print("\nLogits: \n", output.logits)

Inputs Keys: 
 dict_keys(['input_ids', 'attention_mask'])

Input IDs: 
 tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005, 17662, 12172,  2607,
          2026,  2878,  2166,   102]])

Logits: 
 tensor([[-2.1236,  2.1398]], grad_fn=<AddmmBackward0>)


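### Padding

Batching several texts raises an immediate problem: the texts differ in length, but the input tensor must be a strict rectangle of shape (batch size, sequence length), so every text must have the same number of token IDs after encoding.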

In [24]:
# A list of IDs like this cannot be converted to a tensor
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [25]:
# Padding is needed to make all rows the same length
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
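
The manual padding above generalizes to a small helper. This is a plain-Python sketch; `pad_batch` is a hypothetical function name, and the padding ID 100 is the placeholder value from the cell above, not any model's actual padding ID.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batched_ids = [
    [200, 200, 200],
    [200, 200],
]
padded = pad_batch(batched_ids, pad_id=100)
print(padded)
```

After padding, every row has the same length, so the nested list can be converted to a rectangular tensor.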

In [26]:
# The model's padding ID is available as the tokenizer's `pad_token_id` attribute
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


### Attention Mask
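An attention mask is a tensor with exactly the same shape as the input IDs, containing only 0s and 1s. A 1 marks a real token, and a 0 marks a padding token at that position, which is excluded from the computation. Some special model architectures also use the attention mask to hide selected tokens.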

In [27]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
batched_attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
outputs = model(
    torch.tensor(batched_ids),
    attention_mask=torch.tensor(batched_attention_mask),
)
print(outputs.logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
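
To see why a masked position no longer influences the result, it helps to look at the attention weights. A common way to apply such a mask, sketched here in plain Python and not taken from the DistilBERT implementation, is to set the scores of masked positions to negative infinity before the softmax. `masked_softmax` is a hypothetical helper, the scores are made up, and the sketch assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    # Masked positions are set to -inf so that exp() turns them into 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]   # hypothetical raw attention scores for one query
mask = [1, 1, 0]           # the last position is padding

weights = masked_softmax(scores, mask)
print(weights)
```

The padded position receives a weight of exactly zero, and the remaining weights are the same as if the padding token were not there, which matches the identical logits in the output above.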


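Most Transformer models currently accept sequences of at most 512 or 1024 tokens, so there are three ways to handle longer sequences:
1. Use a Transformer model that supports long inputs, such as Longformer (up to 4096 tokens) or LED (up to 16384 tokens)
2. Set a maximum length `max_sequence_length` and truncate the input: `sequence = sequence[:max_sequence_length]`
3. Split the long text into shorter chunks and encode each chunk separately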
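
Options 2 and 3 can be sketched on a plain list of token IDs. `chunk_ids` and its `stride` parameter (the number of tokens shared by consecutive chunks) are hypothetical names for this illustration; overlapping chunks are one possible design choice, not something the text above prescribes.

```python
def chunk_ids(ids, chunk_size, stride=0):
    """Split token IDs into chunks of at most chunk_size tokens,
    with `stride` tokens of overlap between consecutive chunks."""
    if chunk_size <= stride:
        raise ValueError("chunk_size must be larger than stride")
    step = chunk_size - stride
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + chunk_size])
        if start + chunk_size >= len(ids):
            break  # the last chunk already reaches the end of the sequence
    return chunks

ids = list(range(10))                 # stand-in for a long sequence of token IDs
print(ids[:4])                        # option 2: plain truncation
print(chunk_ids(ids, 4))              # option 3: non-overlapping chunks
print(chunk_ids(ids, 4, stride=2))    # option 3 with two tokens of overlap
```

Truncation discards everything after the cutoff, while chunking keeps all tokens at the cost of several forward passes whose results must then be combined.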

### Using the tokenizer directly

In [28]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = [
    "I've been waiting for HuggingFace course my whole life",
    "So have I!"
]

model_inputs = tokenizer(sequence)
print(model_inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


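Padding is controlled by the `padding` argument:
- `padding="longest"`: pad each sequence to the length of the longest sequence in the current batch
- `padding="max_length"`: pad every sequence to `max_length` if that argument is given, otherwise to the maximum length the model accepts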

In [29]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = [
    "I've been waiting for HuggingFace course my whole life",
    "So have I!"
]

model_inputs = tokenizer(sequence, padding="longest")
print(model_inputs)

model_inputs = tokenizer(sequence, padding="max_length")
print(model_inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 17662, 12172, 2607, 2026, 2878, 2166, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

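**Truncation** is controlled by the `truncation` argument. With `truncation=True`, any sequence longer than `max_length` (or, if `max_length` is not set, the maximum length the model accepts) is truncated: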

In [30]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = [
    "I've been waiting for HuggingFace course my whole life",
    "So have I!"
]

model_inputs = tokenizer(sequence, max_length=8, truncation=True)
print(model_inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
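
Note that the truncated sequence above still ends with the ID 102: the length limit counts the special tokens, so only six ordinary tokens survive when `max_length=8`. The sketch below reproduces that arithmetic in plain Python. `truncate_with_special_tokens` is a hypothetical helper, and the IDs 101 and 102 for [CLS] and [SEP] are taken from the outputs shown above.

```python
def truncate_with_special_tokens(token_ids, max_length, cls_id=101, sep_id=102):
    """Truncate so that the result, including [CLS] and [SEP], fits max_length."""
    budget = max(max_length - 2, 0)  # two positions are reserved for [CLS] and [SEP]
    return [cls_id] + token_ids[:budget] + [sep_id]

long_ids = [1045, 1005, 2310, 2042, 3403, 2005, 17662, 12172, 2607, 2026, 2878, 2166]
short_ids = [2061, 2031, 1045, 999]

print(truncate_with_special_tokens(long_ids, max_length=8))
print(truncate_with_special_tokens(short_ids, max_length=8))
```

The first sequence is cut down to eight IDs in total, and the second is already short enough and stays unchanged.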


In [32]:
# The tokenizer can also return a specific tensor type: pt for PyTorch tensors, tf for TensorFlow tensors, np for NumPy arrays
model_inputs = tokenizer(sequence, padding=True, return_tensors='pt')
print(model_inputs)

model_inputs = tokenizer(sequence, padding=True, return_tensors='np')
print(model_inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005, 17662, 12172,  2607,
          2026,  2878,  2166,   102],
        [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
{'input_ids': array([[  101,  1045,  1005,  2310,  2042,  3403,  2005, 17662, 12172,
         2607,  2026,  2878,  2166,   102],
       [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,
            0,     0,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [34]:
# In practice we usually enable padding and truncation together and ask for PyTorch tensors, so the result can be fed straight into the model:

tokens = tokenizer(sequence, padding=True, truncation=True, return_tensors='pt')
print(tokens)
output = model(**tokens)
print(output.logits)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005, 17662, 12172,  2607,
          2026,  2878,  2166,   102],
        [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
tensor([[-2.1236,  2.1398],
        [-3.6183,  3.9138]], grad_fn=<AddmmBackward0>)


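### Encoding sentence pairs
Besides encoding single texts (a batch merely encodes several single texts in parallel), tokenizers for models such as BERT, whose pretraining includes a sentence-pair task, can also encode a pair of sentences: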

In [37]:
from transformers import AutoTokenizer

checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer('This is the first sentence', 'This is the second sentence')
print(inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)

{'input_ids': [101, 1188, 1110, 1103, 1148, 5650, 102, 1188, 1110, 1103, 1248, 5650, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'This', 'is', 'the', 'first', 'sentence', '[SEP]', 'This', 'is', 'the', 'second', 'sentence', '[SEP]']


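> With other models, the tokenizer output does not necessarily include `token_type_ids`. The tokenizer only needs to produce the input format the model was trained with.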
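
The layout of a BERT-style sentence pair can be reconstructed by hand, which makes the meaning of `token_type_ids` concrete. This is a plain-Python sketch rather than the tokenizer's own code: `build_pair_inputs` is a hypothetical helper, and the token IDs are copied from the output above.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment IDs."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

first = [1188, 1110, 1103, 1148, 5650]    # "This is the first sentence"
second = [1188, 1110, 1103, 1248, 5650]   # "This is the second sentence"

pair = build_pair_inputs(first, second)
print(pair)
```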

In [44]:
# In practice there is no need to check whether the encoding includes token_type_ids; the tokenizer adapts its output format to the checkpoint
from transformers import AutoTokenizer

checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentence1_list = ["First sentence.", "This is the second sentence.", "Third one."]
sentence2_list = ["First sentence is short.", "The second sentence is very very very long.", "ok."]

tokens = tokenizer(
    sentence1_list,
    sentence2_list,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

print(tokens)
print(tokens['input_ids'].shape)

{'input_ids': tensor([[  101,  1752,  5650,   119,   102,  1752,  5650,  1110,  1603,   119,
           102,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1188,  1110,  1103,  1248,  5650,   119,   102,  1109,  1248,
          5650,  1110,  1304,  1304,  1304,  1263,   119,   102],
        [  101,  4180,  1141,   119,   102, 21534,   119,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
torch.Size([3, 18])


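## Adding tokens
In practice, inputs often need to contain special markers, for example [ENT_START] and [ENT_END] to mark the entities in a text. These custom tokens are not in the pretrained model's original vocabulary, so the tokenizer splits them into pieces: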

In [45]:
sentence = 'Two [ENT_START] cars [ENT_END] collided in a [ENT_START] tunnel [ENT_END] this morning.'
print(tokenizer.tokenize(sentence))

['Two', '[', 'E', '##NT', '_', 'ST', '##AR', '##T', ']', 'cars', '[', 'E', '##NT', '_', 'E', '##ND', ']', 'collided', 'in', 'a', '[', 'E', '##NT', '_', 'ST', '##AR', '##T', ']', 'tunnel', '[', 'E', '##NT', '_', 'E', '##ND', ']', 'this', 'morning', '.']


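### Adding new tokens
The Transformers library provides two ways to add new tokens:
- `add_tokens()` adds ordinary tokens: the argument is a list of new tokens, and any token not already in the vocabulary is appended to the end of the vocabulary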

In [46]:
checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

num_added_toks = tokenizer.add_tokens(['new_token1', 'new_token2'])
print('I have added', num_added_toks, 'tokens')

I have added 2 tokens


In [47]:
# To skip tokens that are already in the vocabulary, filter the list of new tokens first
new_tokens = ['new_token1', 'new_token2']
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
tokenizer.add_tokens(list(new_tokens))

0

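- `add_special_tokens()` adds special tokens: the argument is a dictionary of special tokens whose keys must be chosen from `bos_token`, `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`, and `additional_special_tokens`. Any token not already in the vocabulary is appended to the end of the vocabulary. Afterwards, these tokens are available through dedicated attributes; for example, `tokenizer.cls_token` refers to the CLS token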

In [51]:
checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

special_tokens_dict = {'cls_token': '[MY_CLS]'}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')

assert tokenizer.cls_token == '[MY_CLS]'

We have added 1 tokens


In [53]:
# add_tokens() can also add special tokens; just pass special_tokens=True
checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

num_added_toks = tokenizer.add_tokens(['[NEW_tok1]', '[NEW_tok2]'])
num_added_toks = tokenizer.add_tokens(["[NEW_tok3]", "[NEW_tok4]"], special_tokens=True)

print("We have added", num_added_toks, 'tokens')
print(tokenizer.tokenize('[NEW_tok1] Hello [NEW_tok2] [NEW_tok3] World [NEW_tok4]!'))

We have added 2 tokens
['[NEW_tok1]', 'Hello', '[NEW_tok2]', '[NEW_tok3]', 'World', '[NEW_tok4]', '!']


> Special tokens are normalized slightly differently from ordinary tokens; for example, they are not lowercased.

In [54]:
from transformers import AutoTokenizer

checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

num_added_toks = tokenizer.add_tokens(['[ENT_START]', '[ENT_END]'], special_tokens=True)

print('I have added', num_added_toks, 'tokens')

sentence = 'Two [ENT_START] cars [ENT_END] collided in a [ENT_START] tunnel [ENT_END] this morning.'
print(tokenizer.tokenize(sentence))

I have added 2 tokens
['Two', '[ENT_START]', 'cars', '[ENT_END]', 'collided', 'in', 'a', '[ENT_START]', 'tunnel', '[ENT_END]', 'this', 'morning', '.']


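### Resizing the embedding matrix
> After adding new tokens to the vocabulary, you must resize the model's embedding matrix, that is, add embedding rows for the new tokens. Only then can the model map each token to its embedding and work correctly.

The embedding matrix is resized with `resize_token_embeddings()`: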

In [56]:
from transformers import AutoTokenizer, AutoModel

checkpoint = "../models/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

print('vocabulary size:', len(tokenizer))
num_added_toks = tokenizer.add_tokens(['[ENT_START]', '[ENT_END]'], special_tokens=True)
print("After we add", num_added_toks, "tokens")
print('vocabulary size:', len(tokenizer))

model.resize_token_embeddings(len(tokenizer))
print(model.embeddings.word_embeddings.weight.size())

# Embeddings of the two newly added tokens (the last two rows)
print(model.embeddings.word_embeddings.weight[-2:, :])

vocabulary size: 28996
After we add 2 tokens
vocabulary size: 28998
torch.Size([28998, 768])
tensor([[-0.0081, -0.0199, -0.0088,  ..., -0.0140, -0.0205, -0.0110],
        [-0.0081, -0.0199, -0.0088,  ..., -0.0140, -0.0205, -0.0110]],
       grad_fn=<SliceBackward0>)
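
Why the resize is necessary becomes clear once the vocabulary and the embedding matrix are viewed as two structures that must stay the same length. The toy sketch below uses plain lists; `toy_add_tokens` and `toy_resize_embeddings` are hypothetical stand-ins for illustration, not the library's `add_tokens()` or `resize_token_embeddings()`, and the zero initial value is an arbitrary choice.

```python
def toy_add_tokens(vocab, new_tokens):
    """Append unseen tokens to the end of the vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # the next free ID
            added += 1
    return added

def toy_resize_embeddings(embeddings, new_size, init_value=0.0):
    """Grow the embedding matrix in place until it has new_size rows."""
    dim = len(embeddings[0])
    while len(embeddings) < new_size:
        embeddings.append([init_value] * dim)
    return embeddings

vocab = {"[PAD]": 0, "hello": 1, "world": 2}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]   # one row per token ID

added = toy_add_tokens(vocab, ["[ENT_START]", "[ENT_END]"])
print("added", added, "tokens; vocabulary size:", len(vocab))

# Before resizing, ID 4 has no row: embeddings[4] would raise an IndexError.
toy_resize_embeddings(embeddings, len(vocab))
print(embeddings[vocab["[ENT_END]"]])
```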


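## Initializing token embeddings
With enough data to fine-tune or continue pretraining the model, initializing the new tokens to random vectors is fine. With little training data, and especially in few-shot learning settings with only a handful of examples, it becomes a problem. Research suggests that when training data is scarce, the embeddings of newly added tokens only move slightly around their initial values. In other words, even after training, their values are still effectively random.

### Direct assignment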


In [57]:
import torch

with torch.no_grad():
    model.embeddings.word_embeddings.weight[-2:, :] = torch.zeros([2, model.config.hidden_size], requires_grad=True)
print(model.embeddings.word_embeddings.weight[-2:, :])

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], grad_fn=<SliceBackward0>)


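Note that this initialization is an in-place write that should not be recorded by autograd, so it is wrapped in `torch.no_grad()` to suspend gradient tracking.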

In [58]:
# In practice, it is more common to initialize new tokens from the embedding of an existing token.
import torch

token_id = tokenizer.convert_tokens_to_ids('entity')
token_embedding = model.embeddings.word_embeddings.weight[token_id]
print(token_id)

with torch.no_grad():
    for i in range(1, num_added_toks + 1):
        model.embeddings.word_embeddings.weight[-i, :] = token_embedding.clone().detach().requires_grad_(True)
print(model.embeddings.word_embeddings.weight[-2:, :])

9127
tensor([[-0.0277, -0.0564,  0.0192,  ..., -0.0485, -0.0089, -0.0457],
        [-0.0277, -0.0564,  0.0192,  ..., -0.0485, -0.0089, -0.0457]],
       grad_fn=<SliceBackward0>)


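### Initializing from existing token embeddings
A more advanced approach initializes each new token according to its meaning, for example as the average of the embeddings of the tokens in a short description of that token. If the description of a new token $t_i$ is $w_{i,1},w_{i,2},...,w_{i,n}$, then the embedding of $t_i$ is initialized as $E(t_i)=\frac{1}{n} \sum_{j=1}^{n} E(w_{i, j})$

Here $E$ denotes the embedding matrix of the pretrained model.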

In [59]:
descriptions = ['start of entity', 'end of entity']

with torch.no_grad():
    for i, token in enumerate(reversed(descriptions), start=1):
        tokenized = tokenizer.tokenize(token)
        print(tokenized)
        tokenized_ids = tokenizer.convert_tokens_to_ids(tokenized)
        new_embedding = model.embeddings.word_embeddings.weight[tokenized_ids].mean(axis=0)
        model.embeddings.word_embeddings.weight[-i, :] = new_embedding.clone().detach().requires_grad_(True)
print(model.embeddings.word_embeddings.weight[-2:, :])

['end', 'of', 'entity']
['start', 'of', 'entity']
tensor([[-0.0060, -0.0084, -0.0003,  ..., -0.0499, -0.0044, -0.0052],
        [-0.0324, -0.0079, -0.0157,  ..., -0.0434, -0.0019,  0.0049]],
       grad_fn=<SliceBackward0>)
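
The averaging formula can be checked by hand on a tiny example. The sketch below uses a made-up embedding matrix with four tokens and three dimensions; `mean_embedding` is a hypothetical helper that computes the same per-dimension mean as `.mean(axis=0)` in the cell above, without PyTorch.

```python
def mean_embedding(token_ids, embedding_matrix):
    """Average the embedding rows of the given token IDs, dimension by dimension."""
    rows = [embedding_matrix[i] for i in token_ids]
    n = len(rows)
    return [sum(values) / n for values in zip(*rows)]

# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
E = [
    [0.0, 0.0, 0.0],   # id 0
    [1.0, 2.0, 3.0],   # id 1
    [3.0, 2.0, 1.0],   # id 2
    [2.0, 2.0, 2.0],   # id 3
]

description_ids = [1, 2, 3]        # IDs of the tokens in the new token's description
new_row = mean_embedding(description_ids, E)
E.append(new_row)                  # the new token takes the next free ID, here 4
print(new_row)
```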
