<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Extending the Tiktoken BPE Tokenizer with New Tokens

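- This notebook explains how to extend an existing BPE tokenizer; specifically, it focuses on how to do this for the popular [tiktoken](https://github.com/openai/tiktoken) implementation
- For a general introduction to tokenization, see [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb) and the BPE from Scratch [link] tutorial
- For example, suppose we have a GPT-2 tokenizer and want to encode the following text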

In [1]:
import tiktoken

base_tokenizer = tiktoken.get_encoding("gpt2")
sample_text = "Hello, MyNewToken_1 is a new token. <|endoftext|>"

token_ids = base_tokenizer.encode(sample_text, allowed_special={"<|endoftext|>"})
print(token_ids)

[15496, 11, 2011, 3791, 30642, 62, 16, 318, 257, 649, 11241, 13, 220, 50256]


- Iterating over each token ID gives a better picture of how the token IDs are decoded via the vocabulary

In [2]:
for token_id in token_ids:
    print(f"{token_id} -> {base_tokenizer.decode([token_id])}")

15496 -> Hello
11 -> ,
2011 ->  My
3791 -> New
30642 -> Token
62 -> _
16 -> 1
318 ->  is
257 ->  a
649 ->  new
11241 ->  token
13 -> .
220 ->  
50256 -> <|endoftext|>


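- As shown above, `"MyNewToken_1"` is broken down into 5 individual subword tokens, which is normal behavior for BPE when handling unknown words
- However, suppose it is a special token that we want to encode as a single token, similar to some other words or `"<|endoftext|>"`; this notebook explains how to do this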

&nbsp;
## 1. Adding special tokens

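- Note that the new tokens have to be added as special tokens; the reason is that no "merges" were created for the new tokens during the tokenizer training process, and even if we had them, it would be very challenging to incorporate them without breaking the existing tokenization scheme (see the BPE from Scratch notebook [link] to understand "merges")
- Suppose we want to add 2 new tokens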

In [3]:
# Define custom tokens and their token IDs
custom_tokens = ["MyNewToken_1", "MyNewToken_2"]
custom_token_ids = {
    token: base_tokenizer.n_vocab + i for i, token in enumerate(custom_tokens)
}
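
- The ID assignment above can be checked in isolation; the following is a minimal sketch that assumes a hypothetical helper, `assign_token_ids` (not part of tiktoken), and hardcodes the GPT-2 vocabulary size of 50,257 used in this notebook; it shows that the new IDs start right after the last existing ID and rejects duplicate or already-registered tokens

```python
def assign_token_ids(new_tokens, vocab_size, existing_special=None):
    """Map each new token to the next free ID after the existing vocabulary.

    Hypothetical helper for illustration; it is not part of tiktoken.
    """
    existing_special = existing_special or {}
    if len(set(new_tokens)) != len(new_tokens):
        raise ValueError("new_tokens contains duplicates")
    clashes = set(new_tokens) & set(existing_special)
    if clashes:
        raise ValueError(f"already registered as special tokens: {sorted(clashes)}")
    # Existing IDs run from 0 to vocab_size - 1, so vocab_size is the first free ID
    return {token: vocab_size + i for i, token in enumerate(new_tokens)}


# GPT-2 has 50,257 entries (IDs 0..50256), with "<|endoftext|>" at ID 50256
example_ids = assign_token_ids(
    ["MyNewToken_1", "MyNewToken_2"],
    vocab_size=50257,
    existing_special={"<|endoftext|>": 50256},
)
print(example_ids)
# {'MyNewToken_1': 50257, 'MyNewToken_2': 50258}
```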

- Next, we create a custom `Encoding` object that holds the special tokens as follows

In [4]:
# Create a new Encoding object with extended tokens
extended_tokenizer = tiktoken.Encoding(
    name="gpt2_custom",
    pat_str=base_tokenizer._pat_str,
    mergeable_ranks=base_tokenizer._mergeable_ranks,
    special_tokens={**base_tokenizer._special_tokens, **custom_token_ids},
)

- That's it; we can now check that it can encode the sample text

- As we can see, the new tokens `50257` and `50258` are now encoded in the output:

In [5]:
special_tokens_set = set(custom_tokens) | {"<|endoftext|>"}

token_ids = extended_tokenizer.encode(
    "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>",
    allowed_special=special_tokens_set
)
print(token_ids)

[36674, 2420, 351, 220, 50257, 290, 220, 50258, 13, 220, 50256]


- Again, we can also look at it on a per-token level

In [6]:
for token_id in token_ids:
    print(f"{token_id} -> {extended_tokenizer.decode([token_id])}")

36674 -> Sample
2420 ->  text
351 ->  with
220 ->  
50257 -> MyNewToken_1
290 ->  and
220 ->  
50258 -> MyNewToken_2
13 -> .
220 ->  
50256 -> <|endoftext|>


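- As shown above, we have successfully updated the tokenizer
- However, to use it with a pretrained LLM, we also have to update the embedding and output layers of the LLM, which is discussed in the next section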
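
- To build intuition for why a special token maps to a single ID, the following is a conceptual toy sketch, not tiktoken's actual implementation: it assumes a hypothetical function, `split_on_special`, that cuts the text at each special token so that only the remaining chunks would go through ordinary BPE

```python
import re


def split_on_special(text, special_tokens):
    """Split text into (chunk, is_special) pairs, keeping special tokens whole.

    Toy illustration only; tiktoken's internals are more involved.
    """
    # Try longer special tokens first so a longer token wins over a shorter prefix
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"
    # The capturing group makes re.split keep the matched special tokens
    pieces = re.split(pattern, text)
    return [(piece, piece in special_tokens) for piece in pieces if piece]


special = {"MyNewToken_1": 50257, "MyNewToken_2": 50258, "<|endoftext|>": 50256}
chunks = split_on_special(
    "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>", special
)
for chunk, is_special in chunks:
    label = special[chunk] if is_special else "ordinary BPE"
    print(f"{chunk!r} -> {label}")
```

- Each ordinary chunk that precedes a special token ends in a space, which is consistent with the standalone `220` tokens that appear before the special-token IDs in the output above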

&nbsp;
## 2. Updating a pretrained LLM

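- In this section, we look at how to update an existing pretrained LLM after updating the tokenizer
- For this, we use the original pretrained GPT-2 model that is used in the main book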

&nbsp;
### 2.1 Loading a pretrained GPT model

In [7]:
from llms_from_scratch.ch05 import download_and_load_gpt2
# For llms_from_scratch installation instructions, see:
# https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 34.4kiB/s]
encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 4.78MiB/s]
hparams.json: 100%|█████████████████████████| 90.0/90.0 [00:00<00:00, 24.7kiB/s]
model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [00:33<00:00, 14.7MiB/s]
model.ckpt.index: 100%|███████████████████| 5.21k/5.21k [00:00<00:00, 1.05MiB/s]
model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 2.33MiB/s]
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 2.45MiB/s]


In [8]:
from llms_from_scratch.ch04 import GPTModel
# For llms_from_scratch installation instructions, see:
# https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

# Define model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update with specific model settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval();

### 2.2 Using the pretrained GPT model

- Next, consider the sample text below, which we tokenize using both the original and the new tokenizer

In [9]:
sample_text = "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>"

original_token_ids = base_tokenizer.encode(
    sample_text, allowed_special={"<|endoftext|>"}
)

In [10]:
new_token_ids = extended_tokenizer.encode(
    "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>",
    allowed_special=special_tokens_set
)

- Now, let's feed the original token IDs to the GPT model

In [11]:
import torch

with torch.no_grad():
    out = gpt(torch.tensor([original_token_ids]))

print(out)

tensor([[[ 0.2204,  0.8901,  1.0138,  ...,  0.2585, -0.9192, -0.2298],
         [ 0.6745, -0.0726,  0.8218,  ..., -0.1768, -0.4217,  0.0703],
         [-0.2009,  0.0814,  0.2417,  ...,  0.3166,  0.3629,  1.3400],
         ...,
         [ 0.1137, -0.1258,  2.0193,  ..., -0.0314, -0.4288, -0.1487],
         [-1.1983, -0.2050, -0.1337,  ..., -0.0849, -0.4863, -0.1076],
         [-1.0675, -0.5905,  0.2873,  ..., -0.0979, -0.8713,  0.8415]]])


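- As shown above, this works without problems (note that the code shows the raw output without converting it back into text, to keep things simple; for more details on that, see the `generate` function in Chapter 5 [link], section 5.3.3)

- What happens if we now try the same on the token IDs generated by the updated tokenizer?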

```python
with torch.no_grad():
    out = gpt(torch.tensor([new_token_ids]))

print(out)

...
# IndexError: index out of range in self
```

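- As shown above, this results in an index error
- The reason is that the GPT model expects a fixed vocabulary size via its input embedding layer and output layer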
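
- One way to surface this problem before calling the model is a simple range check on the token IDs; the sketch below assumes a hypothetical helper, `find_out_of_range_ids`, and hardcodes the token IDs printed by the extended tokenizer earlier in this notebook

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, token_id) pairs that a table with vocab_size rows cannot look up."""
    return [
        (pos, tid) for pos, tid in enumerate(token_ids)
        if not 0 <= tid < vocab_size
    ]


# Token IDs produced by the extended tokenizer earlier in this notebook
example_ids = [36674, 2420, 351, 220, 50257, 290, 220, 50258, 13, 220, 50256]

# The original embedding layer has 50,257 rows (valid IDs 0..50256)
print(find_out_of_range_ids(example_ids, vocab_size=50257))
# [(4, 50257), (7, 50258)]

# With 2 additional rows, every ID is valid
print(find_out_of_range_ids(example_ids, vocab_size=50259))
# []
```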

&nbsp;
### 2.3 Updating the embedding layer

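- Let's start with updating the embedding layer
- First, notice that the embedding layer has 50,257 entries, which corresponds to the vocabulary size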

In [12]:
gpt.tok_emb

Embedding(50257, 768)

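- We want to extend this embedding layer by adding 2 more entries
- In short, we create a new embedding layer with a bigger size, and then we copy over the old embedding layer values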

In [13]:
num_tokens, emb_size = gpt.tok_emb.weight.shape
new_num_tokens = num_tokens + 2

# Create a new embedding layer
new_embedding = torch.nn.Embedding(new_num_tokens, emb_size)

# Copy weights from the old embedding layer
new_embedding.weight.data[:num_tokens] = gpt.tok_emb.weight.data

# Replace the old embedding layer with the new one in the model
gpt.tok_emb = new_embedding

print(gpt.tok_emb)

Embedding(50259, 768)


- As shown above, we now have an extended embedding layer
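
- The same steps can be wrapped in a reusable function; the sketch below assumes a hypothetical helper, `extend_embedding`, and demonstrates it on a small toy layer rather than the GPT-2 model
- As an optional variation that the code above does not use, `init="mean"` sets the new rows to the mean of the existing embeddings (a commonly used heuristic) instead of leaving them at PyTorch's default random initialization

```python
import torch


def extend_embedding(old_emb, num_new, init="mean"):
    """Return a larger nn.Embedding whose first rows are copied from old_emb.

    Hypothetical helper for illustration. With init="mean", the new rows are
    set to the mean of the old rows; any other value keeps the default
    random initialization for the new rows.
    """
    num_old, emb_dim = old_emb.weight.shape
    new_emb = torch.nn.Embedding(
        num_old + num_new, emb_dim,
        device=old_emb.weight.device, dtype=old_emb.weight.dtype,
    )
    with torch.no_grad():
        # Copy the pretrained rows into the first num_old positions
        new_emb.weight[:num_old] = old_emb.weight
        if init == "mean":
            # Broadcast the (1, emb_dim) mean vector into all new rows
            new_emb.weight[num_old:] = old_emb.weight.mean(dim=0, keepdim=True)
    return new_emb


torch.manual_seed(123)
small_emb = torch.nn.Embedding(10, 4)  # toy stand-in for gpt.tok_emb
bigger_emb = extend_embedding(small_emb, num_new=2)

print(bigger_emb)
# Embedding(12, 4)
print(torch.equal(bigger_emb.weight[:10], small_emb.weight))
# True
```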

&nbsp;
### 2.4 Updating the output layer

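- Next, we have to extend the output layer, which has 50,257 output features corresponding to the vocabulary size, similar to the embedding layer (by the way, you may find the bonus material that discusses the similarity between Linear and Embedding layers in PyTorch useful)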

In [14]:
gpt.out_head

Linear(in_features=768, out_features=50257, bias=False)

- The procedure for extending the output layer is similar to extending the embedding layer

In [15]:
original_out_features, original_in_features = gpt.out_head.weight.shape

# Define the new number of output features (e.g., adding 2 new tokens)
new_out_features = original_out_features + 2

# Create a new linear layer with the extended output size
new_linear = torch.nn.Linear(original_in_features, new_out_features)

# Copy the weights and biases from the original linear layer
with torch.no_grad():
    new_linear.weight[:original_out_features] = gpt.out_head.weight
    if gpt.out_head.bias is not None:
        new_linear.bias[:original_out_features] = gpt.out_head.bias

# Replace the original linear layer with the new one
gpt.out_head = new_linear

print(gpt.out_head)

Linear(in_features=768, out_features=50259, bias=True)
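
- Note that the printed layer now reports `bias=True`, whereas the original output layer had `bias=False`; this is because `torch.nn.Linear` creates a bias term by default
- If the goal is to keep the original layer's bias setting, a variant along the following lines could be used; this is a minimal sketch that assumes a hypothetical helper, `extend_linear_out`, and demonstrates it on a small toy layer

```python
import torch


def extend_linear_out(old_linear, num_new):
    """Return a Linear layer with num_new extra output features.

    Hypothetical helper for illustration; it keeps the bias setting of
    old_linear instead of relying on the nn.Linear default (bias=True).
    """
    has_bias = old_linear.bias is not None
    num_old = old_linear.out_features
    new_linear = torch.nn.Linear(
        old_linear.in_features, num_old + num_new, bias=has_bias
    )
    with torch.no_grad():
        # Copy the pretrained rows; the new rows keep their random initialization
        new_linear.weight[:num_old] = old_linear.weight
        if has_bias:
            new_linear.bias[:num_old] = old_linear.bias
    return new_linear


small_head = torch.nn.Linear(4, 10, bias=False)  # toy stand-in for gpt.out_head
bigger_head = extend_linear_out(small_head, num_new=2)

print(bigger_head)
# Linear(in_features=4, out_features=12, bias=False)
```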


- Let's try this updated model on the original token IDs first

In [16]:
with torch.no_grad():
    output = gpt(torch.tensor([original_token_ids]))
print(output)

tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],
         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],
         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],
         ...,
         [ 0.1200, -0.1027,  2.0549,  ..., -0.1519, -0.2096,  0.5651],
         [-1.1920, -0.1819, -0.0981,  ..., -0.1108,  0.8435, -0.3771],
         [-1.0612, -0.5674,  0.3229,  ...,  0.8383, -0.7121, -0.4850]]])


- Next, let's try it on the updated tokens

In [17]:
with torch.no_grad():
    output = gpt(torch.tensor([new_token_ids]))
print(output)

tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],
         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],
         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],
         ...,
         [-0.0656, -1.2451,  0.7957,  ..., -1.2124,  0.1044,  0.5088],
         [-1.1561, -0.7380, -0.0645,  ..., -0.4373,  1.1401, -0.3903],
         [-0.8961, -0.6437, -0.1667,  ...,  0.5663, -0.5862, -0.4020]]])


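- As we can see, the model works on the extended token set
- In practice, we now want to finetune (or continually pretrain) the model, specifically the new embedding and output layers, on data containing the new tokens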

**A note about weight tying**

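- If the model uses weight tying, which means that the embedding layer and output layer share the same weights, similar to Llama 3 [link], updating the output layer is much simpler
- In this case, we can simply assign the embedding layer's weights to the output layer, so that both layers use the same parameter: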

In [18]:
gpt.out_head.weight = gpt.tok_emb.weight

In [19]:
with torch.no_grad():
    output = gpt(torch.tensor([new_token_ids]))
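
- To see that the assignment above shares one parameter between both layers rather than copying values, consider the following minimal sketch on toy-sized layers; the names `toy_tok_emb` and `toy_out_head` are illustrative stand-ins for `gpt.tok_emb` and `gpt.out_head`

```python
import torch

vocab_size, emb_dim = 12, 4

toy_tok_emb = torch.nn.Embedding(vocab_size, emb_dim)            # weight: (vocab_size, emb_dim)
toy_out_head = torch.nn.Linear(emb_dim, vocab_size, bias=False)  # weight: (vocab_size, emb_dim)

# Both weight matrices have the same shape, so the Parameter can be shared
toy_out_head.weight = toy_tok_emb.weight

print(toy_out_head.weight is toy_tok_emb.weight)
# True

# Because the Parameter is shared, a change made through one layer
# is visible through the other
with torch.no_grad():
    toy_tok_emb.weight[0, 0] = 42.0
print(toy_out_head.weight[0, 0].item())
# 42.0
```

- With a shared parameter, only the embedding layer has to be resized when new tokens are added, and the assignment is then repeated with the resized layer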