<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

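# Implementing a BPE Tokenizer From Scratch
- This is a standalone notebook that implements the popular byte pair encoding (BPE) tokenization algorithm from scratch; BPE is used in models from GPT-2 to GPT-4, Llama 3, and others
- For more details about the purpose of tokenization, please refer to [Chapter 2](https://github.com/zzfive/LLMs-from-scratch-bias/blob/main/ch02/01_main-chapter-code/ch02.ipynb); the code here is bonus material that explains the BPE algorithm
- The original BPE tokenizer that OpenAI implemented for training the original GPT models can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- The BPE algorithm was originally described by Philip Gage in the 1994 article "[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)"
- Most projects, including Llama 3, nowadays use OpenAI's open-source [tiktoken library](https://github.com/openai/tiktoken) because of its computational performance; for example, it allows loading pretrained GPT-2 and GPT-4 tokenizers (the Llama 3 models were trained using the GPT-4 tokenizer as well)
- Unlike the implementations above, the implementation in this notebook also includes a function for training the tokenizer (for educational purposes)
- There is also an implementation called [minBPE](https://github.com/karpathy/minbpe) that supports training and may be more performant (the implementation here focuses on educational purposes); in contrast to `minbpe`, the implementation here additionally allows loading the original OpenAI tokenizer vocabulary and BPE "merges" (Hugging Face tokenizers are also capable of training and loading various tokenizers; for more information, see [this GitHub discussion](https://github.com/rasbt/LLMs-from-scratch/discussions/485) by a reader who trained a BPE tokenizer on the Nepali language)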

&nbsp;
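# 1 The main idea behind BPE
- The main idea of BPE is to convert text into an integer representation (token IDs) that can be used for LLM training, as shown in [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb)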

<p align="center">
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/bpe-overview.webp" width="600px">
</p>

&nbsp;
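## 1.1 Bits and bytes
- Before getting to the BPE algorithm, let's introduce the notion of bytes
- Consider converting text into a byte array (BPE stands for "byte" pair encoding, after all)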

In [1]:
text = "This is some text"
byte_array = bytearray(text, "utf-8")
print(byte_array)

bytearray(b'This is some text')


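- When we call `list()` on a `bytearray` object, each byte is treated as an individual element, and the result is a list of integers corresponding to the byte values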

In [3]:
ids = list(byte_array)  # Each element is one byte value (0-255); in ASCII text, one byte corresponds to one character
print(ids)

[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]


In [4]:
# For ASCII values, chr() shows the character that a byte value corresponds to
chr(105)

'i'

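- This would be a valid way to convert text into the token ID representation that we need for the embedding layer of an LLM
- However, the downside of this approach is that it creates one ID per character (which is a lot of IDs, even for a short text)
- In other words, for a 17-character input text, we have to use 17 token IDs as input to the LLM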

In [5]:
print("Number of characters:", len(text))
print("Number of token IDs:", len(ids))

Number of characters: 17
Number of token IDs: 17
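
- The one-ID-per-character correspondence above holds because the example text is pure ASCII. As a small illustrative sketch (the sample strings and the `byte_ids_by_sample` name are our own assumptions, not part of the original notebook), UTF-8 encodes non-ASCII characters with more than one byte, so the number of byte IDs can exceed the number of characters:

```python
# Sample strings chosen for illustration only (not from the original notebook)
samples = ["text", "café", "Ġ"]

# Map each sample to the list of its UTF-8 byte values
byte_ids_by_sample = {s: list(s.encode("utf-8")) for s in samples}

for s, byte_ids in byte_ids_by_sample.items():
    print(f"{s!r}: {len(s)} characters -> {len(byte_ids)} bytes: {byte_ids}")
```

- Here, "é" and "Ġ" each occupy two bytes, so a purely byte-based ID sequence is at least as long as the character sequence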


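- If you have worked with LLMs before, you may know that BPE tokenizers have a vocabulary that assigns token IDs to whole words or subwords rather than to each character
- For example, the GPT-2 tokenizer tokenizes the same text ("This is some text") into only 4 tokens instead of 17: 1212, 318, 617, 2420
- You can double-check this using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):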

<center>
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp" width="600px">
</center>

```python
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")
# prints [1212, 318, 617, 2420]
```


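- Since a byte consists of 8 bits, there are 2<sup>8</sup> = 256 possible values that a single byte can represent, ranging from 0 to 255
- You can confirm this by executing `bytearray(range(0, 257))`, which raises `ValueError: byte must be in range(0, 256)`
- A BPE tokenizer usually uses these 256 values as its first 256 **single-character tokens**; one can check this visually by running the following code: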

```python
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(300):
    decoded = gpt2_tokenizer.decode([i])
    print(f"{i}: {decoded}")
"""
prints:
0: !
1: "
2: #
...
255: �  # <---- single character tokens up to here
256:  t
257:  a
...
298: ent
299:  n
"""
```

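- Note in the output above that entries 256 and 257 are not single-character values but two-character values (a whitespace plus a letter), which is a small shortcoming of the original GPT-2 BPE tokenizer (this has been improved in the GPT-4 tokenizer)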

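## 1.2 Building the vocabulary
- The goal of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords, such as `298: ent` (which can be found in *entangle, entertain, enter, entrance, entity, ...*, for example), or even complete words such as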

```
318: is
617: some
1212: This
2420: text
```

- Before we get to the actual code implementation, the form of BPE that is used for LLM tokenizers today can be summarized as follows

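## 1.3 BPE algorithm outline

**1. Identify frequent pairs**
- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)

**2. Replace and record**
- Replace that pair with a new placeholder ID (one that is not already in use; e.g., if we start with 0...255, the first placeholder would be 256)
- Record this mapping in a lookup table
- The size of the lookup table is a hyperparameter, also called the "vocabulary size" (for GPT-2, that's 50,257)

**3. Repeat until no gains**
- Keep repeating steps 1 and 2, continually merging the most frequent pairs
- Stop when no further compression is possible

**Decompression (decoding)**
- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table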

&nbsp;
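## 1.4 BPE algorithm example

### 1.4.1 Concrete example of steps 1 and 2 above
- Suppose we have the text (training dataset) "the cat in the hat" from which we want to build the vocabulary for a BPE tokenizer

**Iteration 1**
1. Identify frequent pairs
- In this text, "th" appears twice (a few other pairs, such as "he" and "at", also appear twice; we pick "th" because it occurs first)
2. Replace and record
- Replace "th" with a new token ID that is not already in use, e.g., 256
- The new text is: "<256>e cat in <256>e hat"
- The new vocabulary is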

```
  0: ...
  ...
  256: "th"
```

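**Iteration 2**
1. Identify frequent pairs
- In the text "<256>e cat in <256>e hat", the pair "<256>e" appears twice
2. Replace and record
- Replace "<256>e" with a new token ID that is not already in use, e.g., 257
- The new text is: "<257> cat in <257> hat"
- The new vocabulary is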

```
  0: ...
  ...
  256: "th"
  257: "<256>e"
```

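**Iteration 3**
1. Identify frequent pairs
- In the text "<257> cat in <257> hat", the pair "<257> " (that is, `<257>` followed by a space) appears twice
2. Replace and record
- Replace "<257> " with a new token ID that is not already in use, e.g., 258
- The new text is: "<258>cat in <258>hat"
- The new vocabulary is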

```
  0: ...
  ...
  256: "th"
  257: "<256>e"
  258: "<257> "
```

**And so on**
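
- The three iterations above can be reproduced with a few lines of Python. This is a minimal sketch working directly on UTF-8 byte values; the helper names `most_frequent_pair` and `merge_pair` are our own, and ties between equally frequent pairs are resolved by taking the pair that occurs first (an assumption that happens to match the walkthrough):

```python
from collections import Counter


def most_frequent_pair(ids):
    # Count all adjacent pairs; Counter keeps first-occurrence order,
    # and max() returns the first pair among those with the highest count
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    return max(pairs.items(), key=lambda kv: kv[1])[0]


def merge_pair(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("the cat in the hat".encode("utf-8"))
merges = {}  # new_id -> merged pair

for new_id in range(256, 259):  # three iterations, as in the walkthrough
    pair = most_frequent_pair(ids)
    ids = merge_pair(ids, pair, new_id)
    merges[new_id] = pair

print(merges)  # {256: (116, 104), 257: (256, 101), 258: (257, 32)}
print(ids)     # [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]
```

- The byte values 116, 104, 101, and 32 correspond to "t", "h", "e", and the space character, so the recorded merges are "th", "<256>e", and "<257> "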

&nbsp;
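### 1.4.2 Concrete example of the decoding part (decompression)
- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair, in the reverse order in which the IDs were introduced
- Start with the final compressed text: "<258>cat in <258>hat"
- Substitute `<258>` → `<257> `: `<257> cat in <257> hat`
- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`
- Substitute `<256>` → "th": `the cat in the hat`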
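
- The substitution steps above can be sketched as follows. The `merges` table and the compressed `ids` are written out by hand from the example in Section 1.4.1, and the `expand` helper is our own illustrative name:

```python
# Lookup table from the walkthrough: new_id -> (first, second)
merges = {256: (116, 104), 257: (256, 101), 258: (257, 32)}

# "<258>cat in <258>hat" as token IDs (letters are plain byte values)
ids = [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]


def expand(ids, merges):
    # Undo the merges in the reverse order of their introduction
    for new_id in sorted(merges, reverse=True):
        pair = merges[new_id]
        out = []
        for token_id in ids:
            if token_id == new_id:
                out.extend(pair)
            else:
                out.append(token_id)
        ids = out
    return ids


decoded = bytes(expand(ids, merges)).decode("utf-8")
print(decoded)  # the cat in the hat
```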

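# 2 A simple BPE implementation
- Below is an implementation of the algorithm described above as a Python class that mimics the `tiktoken` Python user interface
- Note that the encoding part above describes the original training step via `train()`; however, the `encode()` method works similarly (although it looks a bit more complicated because of the special token handling):
1. Split the input text into individual bytes
2. Repeatedly find and replace (merge) adjacent tokens (pairs) when they match any pair in the learned BPE merges (from highest to lowest "rank", i.e., in the order they were learned)
3. Continue merging until no more merges can be applied
4. The final list of token IDs is the encoded output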
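
- The four steps above can be sketched in isolation as shown below. This is a simplified illustration and not the class's `encode()` method: it works on raw bytes, ignores the `Ġ` preprocessing and special tokens, and reuses the hand-written merges from the "the cat in the hat" example (`encode_with_merges` is a hypothetical helper name):

```python
# Learned merges in the order they were learned: (first, second) -> new_id
merges = {(116, 104): 256, (256, 101): 257, (257, 32): 258}


def encode_with_merges(text, merges):
    ids = list(text.encode("utf-8"))  # Step 1: split into bytes
    # Steps 2-3: apply each merge in learned order (dicts keep insertion order)
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids  # Step 4: the final token IDs


print(encode_with_merges("the hat", merges))  # [258, 104, 97, 116]
```

- The text "the hat" was not part of a separate training run here; it is encoded purely by replaying the recorded merges, which is what distinguishes encoding from training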

In [7]:
from collections import Counter, deque
from typing import Union, Tuple, List
from functools import lru_cache
import json


class BPETokenizerSimple:
    def __init__(self):
        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab = {}
        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab = {}
        # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}
        self.bpe_merges = {}  # Keys are (token_id1, token_id2) tuples; values are the merged token IDs

    def train(self, text: str, vocab_size: int, allowed_special: set = {"<|endoftext|>"}) -> None:
        """
        Train the BPE tokenizer from scratch.

        Args:
            text (str): The training text.
            vocab_size (int): The desired vocabulary size.
            allowed_special (set): A set of special tokens to include.
        """

        # Preprocess: Replace spaces with 'Ġ'
        # Note that Ġ is a particularity of the GPT-2 BPE implementation
        # E.g., "Hello world" might be tokenized as ["Hello", "Ġworld"]
        # (GPT-4 BPE would tokenize it as ["Hello", " world"])
        processed_text = []
        for i, char in enumerate(text):  # Replace spaces with 'Ġ' (a space at index 0 is dropped)
            if char == " " and i != 0:
                processed_text.append("Ġ")
            if char != " ":
                processed_text.append(char)
        processed_text = "".join(processed_text)

        # Initialize vocab with unique characters, including 'Ġ' if present
        # Start with the first 256 Unicode code points (chr(0) through chr(255))
        unique_chars = [chr(i) for i in range(256)]

        # Extend unique_chars with characters from processed_text that are not already included
        unique_chars.extend(
            char for char in sorted(set(processed_text))
            if char not in unique_chars
        )

        # Optionally, ensure 'Ġ' is included if it is relevant to your text processing
        if "Ġ" not in unique_chars:
            unique_chars.append("Ġ")

        # Now create the vocab and inverse vocab dictionaries
        self.vocab = {i: char for i, char in enumerate(unique_chars)}
        self.inverse_vocab = {char: i for i, char in self.vocab.items()}

        # Add allowed special tokens
        if allowed_special:  # Register each allowed special token under a new ID
            for token in allowed_special:
                if token not in self.inverse_vocab:
                    new_id = len(self.vocab)
                    self.vocab[new_id] = token
                    self.inverse_vocab[token] = new_id

        # Tokenize the processed_text into token IDs, one character at a time
        token_ids = [self.inverse_vocab[char] for char in processed_text]

        # BPE steps 1-3: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):  # New IDs run from len(self.vocab) up to vocab_size - 1
            pair_id = self.find_freq_pair(token_ids, mode="most")  # Most frequent adjacent pair in the current token_ids
            if pair_id is None:  # No more pairs to merge. Stopping training.
                break
            # new_id becomes the token ID of the merged pair; replace every occurrence of pair_id with it
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id  # Record the merge; it is reused to build the vocab and when encoding

        # Build the vocabulary with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id

    def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):
        """
        Load pre-trained vocabulary and BPE merges from OpenAI's GPT-2 files.

        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json').
            bpe_merges_path (str): Path to the bpe_merges file  (GPT-2 calls it 'vocab.bpe').
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Convert loaded vocabulary to correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # Use an existing token ID as a placeholder for '\n'
            # Preferentially use "<|endoftext|>" if available
            # Otherwise fall back to "Ġ", and then to the empty string ""
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                # If no fallback token is available, raise an error
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")

            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id] = "\n"

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
            # Skip header line if present
            if lines and lines[0].startswith("#"):
                lines = lines[1:]

            for line in lines:
                pair = tuple(line.strip().split())  # Split each line on whitespace into a tuple
                if len(pair) == 2:  # Only accept lines with exactly two tokens
                    token1, token2 = pair
                    if token1 in self.inverse_vocab and token2 in self.inverse_vocab:  # Both tokens must be in the vocabulary
                        token_id1 = self.inverse_vocab[token1]
                        token_id2 = self.inverse_vocab[token2]
                        merged_token = token1 + token2
                        if merged_token in self.inverse_vocab:  # The merged token must be in the vocabulary as well
                            merged_token_id = self.inverse_vocab[merged_token]
                            self.bpe_merges[(token_id1, token_id2)] = merged_token_id
                        # print(f"Loaded merge: '{token1}' + '{token2}' -> '{merged_token}' (ID: {merged_token_id})")
                        else:
                            print(f"Merged token '{merged_token}' not found in vocab. Skipping.")
                    else:
                        print(f"Skipping pair {pair} as one of the tokens is not in the vocabulary.")

    def encode(self, text: str, allowed_special: set = None) -> List[int]:
        """
        Encode the input text into a list of token IDs, with tiktoken-style handling of special tokens.

        Args:
            text (str): The text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.

        Returns:
            List[int]: The list of token IDs.
        """
        import re

        token_ids = []
    
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Build regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
    
            last_index = 0
            for match in re.finditer(special_pattern, text):
                prefix = text[last_index:match.start()]
                token_ids.extend(self.encode(prefix, allowed_special=None))  # Encode prefix without special handling
    
                special_token = match.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary.")
                last_index = match.end()
    
            text = text[last_index:]  # Remaining part to process normally
    
            # Check if any disallowed special tokens are in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
        
        # If no special tokens, or remaining text after special token split:
        tokens = []
        # First split on newlines to preserve them
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)

        for token in tokens:
            if token in self.inverse_vocab:
                # token is contained in the vocabulary as is
                token_ids.append(self.inverse_vocab[token])
            else:
                # Attempt to handle subword tokenization via BPE
                sub_token_ids = self.tokenize_with_bpe(token)
                token_ids.extend(sub_token_ids)

        return token_ids

    def tokenize_with_bpe(self, token: str) -> List[int]:
        """
        Tokenize a single token using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            List[int]: The list of token IDs after applying BPE.
        """
        # Tokenize the token into individual characters (as initial token IDs)
        token_ids = [self.inverse_vocab.get(char, None) for char in token]
        if None in token_ids:  # At least one character of the token is missing from the vocabulary
            missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")

        can_merge = True
        while can_merge and len(token_ids) > 1:  # Stop once a full pass merges nothing or only one token is left
            can_merge = False
            new_tokens = []
            i = 0
            while i < len(token_ids) - 1:
                pair = (token_ids[i], token_ids[i + 1])  # Current pair of adjacent token IDs
                if pair in self.bpe_merges:  # The pair is a learned BPE merge
                    merged_token_id = self.bpe_merges[pair]
                    new_tokens.append(merged_token_id)
                    # Uncomment for educational purposes:
                    # print(f"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")
                    i += 2  # Skip the next token as it's merged
                    can_merge = True
                else:  # Not a learned merge: keep the current token as is
                    new_tokens.append(token_ids[i])
                    i += 1
            if i < len(token_ids):
                new_tokens.append(token_ids[i])
            token_ids = new_tokens  # Use the merged list for the next pass

        return token_ids

    def decode(self, token_ids: List[int]) -> str:
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (List[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        decoded_string = ""
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                if decoded_string and not decoded_string.endswith(" "):
                    decoded_string += " "  # Add space if not present before a newline
                decoded_string += token
            elif token.startswith("Ġ"):
                decoded_string += " " + token[1:]
            else:
                decoded_string += token
        return decoded_string

    def save_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Save the vocabulary and BPE merges to JSON files.

        Args:
            vocab_path (str): Path to save the vocabulary.
            bpe_merges_path (str): Path to save the BPE merges.
        """
        # Save vocabulary
        with open(vocab_path, "w", encoding="utf-8") as file:
            json.dump(self.vocab, file, ensure_ascii=False, indent=2)

        # Save BPE merges as a list of dictionaries
        with open(bpe_merges_path, "w", encoding="utf-8") as file:
            merges_list = [{"pair": list(pair), "new_id": new_id}
                           for pair, new_id in self.bpe_merges.items()]
            json.dump(merges_list, file, ensure_ascii=False, indent=2)

    def load_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Load the vocabulary and BPE merges from JSON files.

        Args:
            vocab_path (str): Path to the vocabulary file.
            bpe_merges_path (str): Path to the BPE merges file.
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            self.vocab = {int(k): v for k, v in loaded_vocab.items()}
            self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            merges_list = json.load(file)
            for merge in merges_list:
                pair = tuple(merge["pair"])
                new_id = merge["new_id"]
                self.bpe_merges[pair] = new_id

    @lru_cache(maxsize=None)
    def get_special_token_id(self, token):
        return self.inverse_vocab.get(token, None)

    @staticmethod
    def find_freq_pair(token_ids: List[int], mode: str = "most") -> Union[Tuple[int, int], None]:  # mode selects the most or least frequent pair
        # zip(token_ids, token_ids[1:]) yields tuples of adjacent token IDs; e.g., token_ids = [1, 2, 3, 4] gives (1, 2), (2, 3), (3, 4)
        # Counter tallies how often each pair occurs (a dict-like mapping of pair -> count)
        pairs = Counter(zip(token_ids, token_ids[1:]))

        if not pairs:
            return None

        if mode == "most":
            return max(pairs.items(), key=lambda x: x[1])[0]  # Pair with the highest count
        elif mode == "least":
            return min(pairs.items(), key=lambda x: x[1])[0]  # Pair with the lowest count
        else:
            raise ValueError("Invalid mode. Choose 'most' or 'least'.")

    @staticmethod
    def replace_pair(token_ids: List[int], pair_id: Tuple[int, int], new_id: int) -> List[int]:
        dq = deque(token_ids)  # A deque allows efficient removal from the left
        replaced = []  # Token IDs after replacement

        while dq:
            current = dq.popleft()  # Take the first element from the left
            if dq and (current, dq[0]) == pair_id:  # current and the next element form the target pair
                replaced.append(new_id)
                # Remove the 2nd token of the pair, 1st was already removed
                dq.popleft()
            else:
                replaced.append(current)  # No match: keep the current token unchanged

        return replaced

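- There is a lot of code in the `BPETokenizerSimple` class above, and discussing it in detail is out of scope for this notebook, but the next section offers a short overview of the usage to understand the class methods a bit better

# 3 BPE implementation walkthrough
- In practice, I highly recommend using [tiktoken](https://github.com/openai/tiktoken), because the implementation above focuses on readability and educational purposes, not on performance
- However, the usage is more or less similar to tiktoken, except that tiktoken does not have a training method
- Let's see how the `BPETokenizerSimple` Python code above works by looking at some examples below (a detailed code discussion is out of scope for this notebook)

## 3.1 Training, encoding, and decoding
- First, let's use some text as a training dataset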

In [8]:
import os
import urllib.request

if not os.path.exists("../01_main-chapter-code/the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "../01_main-chapter-code/the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

with open("../01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f: # added ../01_main-chapter-code/
    text = f.read()

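- Next, let's initialize and train the BPE tokenizer with a vocabulary size of 1,000
- Note that the vocabulary size is already 256 by default because of the byte values discussed earlier, so we are only "learning" 744 vocabulary entries (742 to be exact, once the `<|endoftext|>` special token and the `Ġ` whitespace token are taken into account)
- For comparison, the GPT-2 vocabulary has 50,257 tokens, the GPT-4 vocabulary has 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they all have much bigger training sets than the simple example text above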

In [9]:
tokenizer = BPETokenizerSimple()
tokenizer.train(text, vocab_size=1000, allowed_special={"<|endoftext|>"})

- You may want to inspect the vocabulary contents (but note that this creates a long list)

In [10]:
# print(tokenizer.vocab)
print(len(tokenizer.vocab))

1000


- This vocabulary is created by merging 742 times (= 1000 - len(range(0, 256)) - len(special_tokens) - "Ġ" = 1000 - 256 - 1 - 1 = 742)

In [11]:
print(len(tokenizer.bpe_merges))

742


- This means that the first 256 entries are single-character tokens
- Next, let's use the trained tokenizer to encode some text

In [12]:
input_text = "Jack embraced beauty through art and life."
token_ids = tokenizer.encode(input_text)
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]


In [13]:
input_text = "Jack embraced beauty through art and life.<|endoftext|> "
token_ids = tokenizer.encode(input_text)
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]


In [16]:
input_text = "Jack embraced beauty through art and life.<|endoftext|> "
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]
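
- The last two outputs differ only in `allowed_special`: without it, `<|endoftext|>` is broken into ordinary sub-tokens, whereas with it the whole string maps to the single ID 257. Below is a minimal sketch of the splitting idea, building the same kind of regular expression as `encode()`; the sample text and the `re.split` shortcut are our own simplification (`encode()` itself iterates with `re.finditer`):

```python
import re

allowed_special = {"<|endoftext|>"}

# Longest tokens first, so that a longer special token wins over a shorter prefix
special_pattern = (
    "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
)

text = "Hello world.<|endoftext|> Next doc"

# The capturing group makes re.split keep the special tokens in the result
parts = [part for part in re.split(special_pattern, text) if part]
print(parts)  # ['Hello world.', '<|endoftext|>', ' Next doc']
```

- Each special-token part can then be looked up directly in the vocabulary, while the remaining parts go through the regular BPE encoding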


In [17]:
print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

Number of characters: 56
Number of token IDs: 21


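- From the lengths above, we can see that the 56-character input (the 42-character sentence plus the `<|endoftext|>` token and a trailing space) was encoded into 21 token IDs (20 for the sentence and 1 for the special token), so the sentence itself shrinks to roughly half its length compared to the character-byte-based encoding
- Note that the vocabulary itself is used in the `decode()` method, which allows us to map the token IDs back into text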

In [18]:
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]


In [19]:
print(tokenizer.decode(token_ids))

Jack embraced beauty through art and life.<|endoftext|>


- Iterating over each token ID gives a better understanding of how the token IDs are decoded via the vocabulary

In [20]:
for token_id in token_ids:
    print(f"{token_id} -> {tokenizer.decode([token_id])}")

424 -> Jack
256 ->  
654 -> em
531 -> br
302 -> ac
311 -> ed
256 ->  
296 -> be
97 -> a
465 -> ut
121 -> y
595 ->  through
841 ->  ar
116 -> t
287 ->  a
466 -> nd
256 ->  
326 -> li
972 -> fe
46 -> .
257 -> <|endoftext|>


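- As shown above, most token IDs represent 2-character subwords; that's because the training data text is very short with not that many repetitive words, and because we used a relatively small vocabulary size
- In summary, calling `decode(encode())` should be able to reproduce the input text for typical inputs such as the following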

In [21]:
tokenizer.decode(
    tokenizer.encode("This is some text.")
)

'This is some text.'

In [22]:
tokenizer.decode(
    tokenizer.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'

## 3.2 Saving and loading the tokenizer
- Next, let's look at how we can save the trained tokenizer for reuse later

In [27]:
# Save the trained tokenizer
tokenizer.save_vocab_and_merges(vocab_path="vocab.json", bpe_merges_path="bpe_merges.txt")

In [28]:
# Load the tokenizer
tokenizer2 = BPETokenizerSimple()
tokenizer2.load_vocab_and_merges(vocab_path="vocab.json", bpe_merges_path="bpe_merges.txt")

- The loaded tokenizer should be able to produce the same results as before

In [29]:
print(tokenizer2.decode(token_ids))

Jack embraced beauty through art and life.<|endoftext|>


In [30]:
tokenizer2.decode(
    tokenizer2.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'
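
- One detail worth noting about the save and load methods: JSON object keys are always strings, and tuple keys are not supported at all, which is why the vocabulary keys are converted back with `int(k)` and the merges are stored as a list of `{"pair": ..., "new_id": ...}` records. A minimal in-memory sketch with a toy vocabulary (the toy values are our own and not taken from the trained tokenizer):

```python
import json

# Toy data for illustration only
vocab = {0: "a", 1: "b", 256: "ab"}
bpe_merges = {(0, 1): 256}

# Serialize the same way as save_vocab_and_merges, but to strings instead of files
vocab_json = json.dumps(vocab)
merges_json = json.dumps(
    [{"pair": list(pair), "new_id": new_id} for pair, new_id in bpe_merges.items()]
)

raw_vocab = json.loads(vocab_json)
print(raw_vocab)  # {'0': 'a', '1': 'b', '256': 'ab'}  (integer keys became strings)

# Restore the original types, as load_vocab_and_merges does
loaded_vocab = {int(k): v for k, v in raw_vocab.items()}
loaded_merges = {tuple(m["pair"]): m["new_id"] for m in json.loads(merges_json)}

print(loaded_vocab == vocab, loaded_merges == bpe_merges)  # True True
```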

## 3.3 Loading the original GPT-2 BPE tokenizer from OpenAI
- Finally, let's load the vocabulary and merges files of the original GPT-2 tokenizer

In [31]:
import os
import urllib.request

def download_file_if_absent(url, filename, search_dirs):
    for directory in search_dirs:
        file_path = os.path.join(directory, filename)
        if os.path.exists(file_path):
            print(f"{filename} already exists in {file_path}")
            return file_path

    target_path = os.path.join(search_dirs[0], filename)
    try:
        with urllib.request.urlopen(url) as response, open(target_path, "wb") as out_file:
            out_file.write(response.read())
        print(f"Downloaded {filename} to {target_path}")
    except Exception as e:
        print(f"Failed to download {filename}. Error: {e}")
    return target_path

# Define the directories to search and the files to download
search_directories = [".", "../02_bonus_bytepair-encoder/gpt2_model/"]

files_to_download = {
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe": "vocab.bpe",
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json": "encoder.json"
}

# Ensure directories exist and download files if needed
paths = {}
for url, filename in files_to_download.items():
    paths[filename] = download_file_if_absent(url, filename, search_directories)

vocab.bpe already exists in ../02_bonus_bytepair-encoder/gpt2_model/vocab.bpe
encoder.json already exists in ../02_bonus_bytepair-encoder/gpt2_model/encoder.json


- Next, we load the GPT-2 tokenizer files via the `load_vocab_and_merges_from_openai` method

In [32]:
tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=paths["encoder.json"], bpe_merges_path=paths["vocab.bpe"]
)
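
- To picture what the loading step does, here is a toy sketch of the same logic: the vocabulary file maps token strings to IDs, the merges file lists one pair of token strings per line, and each pair is translated into a `(token_id1, token_id2) -> merged_token_id` entry. The file contents below are hypothetical stand-ins written for illustration, not excerpts from the real `encoder.json` or `vocab.bpe`:

```python
# Hypothetical miniature stand-ins for encoder.json and vocab.bpe
toy_vocab = {"t": 0, "h": 1, "e": 2, "th": 3, "the": 4}
toy_merges_text = "#version: toy\nt h\nth e\nx y\n"

bpe_merges = {}
lines = toy_merges_text.splitlines()
if lines and lines[0].startswith("#"):  # Skip the header line if present
    lines = lines[1:]

for line in lines:
    pair = tuple(line.split())
    if len(pair) != 2:
        continue
    token1, token2 = pair
    merged_token = token1 + token2
    # Keep the merge only if both parts and the merged token are in the vocabulary
    if token1 in toy_vocab and token2 in toy_vocab and merged_token in toy_vocab:
        bpe_merges[(toy_vocab[token1], toy_vocab[token2])] = toy_vocab[merged_token]

print(bpe_merges)  # {(0, 1): 3, (3, 2): 4}
```

- The line `x y` is skipped because neither token is in the toy vocabulary, mirroring the "Skipping pair" branch in the class method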

- The following command shows the vocabulary size

In [33]:
len(tokenizer_gpt2.vocab)

50257

- We can now use the GPT-2 tokenizer via our `BPETokenizerSimple` class

In [34]:
input_text = "This is some text"
token_ids = tokenizer_gpt2.encode(input_text)
print(token_ids)

[1212, 318, 617, 2420]


In [28]:
print(tokenizer_gpt2.decode(token_ids))

This is some text


- You can double-check that this produces the correct tokens using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):

<center>
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp" width="600px">
</center>

```python
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")
# prints [1212, 318, 617, 2420]
```

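# 4 Conclusion
- That's it! That's how BPE works in a nutshell, including a training method for creating new tokenizers and a method for loading the GPT-2 tokenizer vocabulary and merges from the original OpenAI GPT-2 model
- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)
- For a performance comparison with other tokenizer implementations, please see [this notebook](https://github.com/zzfive/LLMs-from-scratch-bias/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)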