<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Byte Pair Encoding (BPE) Tokenizer From Scratch

- This is a standalone notebook that implements the popular byte pair encoding (BPE) tokenization algorithm from scratch for educational purposes; BPE is used in models such as GPT-2 through GPT-4, Llama 3, etc.
- For more details about the purpose of tokenization, please refer to [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb); the code here is bonus material explaining the BPE algorithm
- The original BPE tokenizer that OpenAI implemented for training the original GPT models can be found [here](https://github.com/openai/GPT-2/blob/master/src/encoder.py)
- The BPE algorithm was originally described in 1994: "[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)" by Philip Gage
- Most projects, including Llama 3, nowadays use OpenAI's open-source [tiktoken library](https://github.com/openai/tiktoken) due to its computational performance; it allows loading pretrained GPT-2 and GPT-4 tokenizers, for example (the Llama 3 models were trained using the GPT-4 tokenizer as well)
- The difference between the implementations above and my implementation in this notebook is that mine also includes a function for training the tokenizer (for educational purposes)
- There's also an implementation called [minBPE](https://github.com/karpathy/minbpe) with training support, which is maybe more performant (my implementation here is focused on educational purposes); in contrast to `minbpe`, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and BPE "merges" (additionally, Hugging Face tokenizers are also capable of training and loading various tokenizers; see [this GitHub discussion](https://github.com/rasbt/LLMs-from-scratch/discussions/485) by a reader who trained a BPE tokenizer on the Nepali language for more info)

&nbsp;
# 1. The main idea behind byte pair encoding (BPE)

- The main idea in BPE is to convert text into an integer representation (token IDs) for LLM training (see [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb))

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/bpe-overview.webp" width="600px">

&nbsp;
## 1.1 Bits and bytes

- Before getting to the BPE algorithm, let's introduce the notion of bytes
- Consider converting text into a byte array (BPE stands for "byte" pair encoding after all):

In [1]:
text = "This is some text"
byte_ary = bytearray(text, "utf-8")
print(byte_ary)

bytearray(b'This is some text')


- When we call `list()` on a `bytearray` object, each byte is treated as an individual element, and the result is a list of integers corresponding to the byte values:

In [2]:
ids = list(byte_ary)
print(ids)

[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]


- This would be a valid way to convert text into the token ID representation that we need for the embedding layer of an LLM
- However, the downside of this approach is that it creates one ID for each character (that's a lot of IDs for a short text!)
- I.e., this means that for a 17-character input text, we have to use 17 token IDs as input to the LLM:

In [3]:
print("Number of characters:", len(text))
print("Number of token IDs:", len(ids))

Number of characters: 17
Number of token IDs: 17
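
- Note that the two counts above only coincide because the sample text is plain ASCII; as a small illustrative sketch (the sample string below is my own and not part of the example above), a non-ASCII character occupies more than one byte in UTF-8, so a byte-level encoding can produce more IDs than characters:

```python
# Hypothetical sample string; "\u00e9" is the single character "é"
sample = "caf\u00e9"
sample_ids = list(sample.encode("utf-8"))

print("Number of characters:", len(sample))    # 4
print("Number of byte IDs:", len(sample_ids))  # 5
print(sample_ids)  # [99, 97, 102, 195, 169]
```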


- If you have worked with LLMs before, you may know that BPE tokenizers have a vocabulary where we have a token ID for whole words or subwords instead of each character
- For example, the GPT-2 tokenizer tokenizes the same text ("This is some text") into only 4 instead of 17 tokens: `1212, 318, 617, 2420`
- You can double-check this using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp" width="600px">

```python
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")
# prints [1212, 318, 617, 2420]
```

- Since a byte consists of 8 bits, there are 2<sup>8</sup> = 256 possible values that a single byte can represent, ranging from 0 to 255
- You can confirm this by executing the code `bytearray(range(0, 257))`, which raises `ValueError: byte must be in range(0, 256)`
- A BPE tokenizer usually uses these 256 values as its first 256 single-character tokens; one could visually check this by running the following code:

```python
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(300):
    decoded = gpt2_tokenizer.decode([i])
    print(f"{i}: {decoded}")
"""
prints:
0: !
1: "
2: #
...
255: �  # <---- single character tokens up to here
256:  t
257:  a
...
298: ent
299:  n
"""
```

- Above, note that entries 256 and 257 are not single-character values but double-character values (a whitespace + a letter), which is a little shortcoming of the original GPT-2 BPE tokenizer (this has been improved in the GPT-4 tokenizer)
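
- The 0 to 255 value range of a single byte mentioned earlier in this section can also be checked directly; the following is a minimal sketch that catches the error instead of letting it interrupt the notebook:

```python
# Every value in 0..255 fits into a single byte
all_bytes = bytearray(range(0, 256))
print(len(all_bytes))  # 256

# The value 256 no longer fits into one byte and raises a ValueError
try:
    bytearray(range(0, 257))
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```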

&nbsp;
## 1.2 Building the vocabulary

- The goal of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords like `298: ent` (which can be found in *entangle, entertain, enter, entrance, entity, ...*, for example), or even complete words like

```
318: is
617: some
1212: This
2420: text
```

- The BPE algorithm was originally described in 1994: "[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)" by Philip Gage
- Before we get to the actual code implementation, the form of the algorithm that is used for LLM tokenizers today can be summarized as described in the following sections.

&nbsp;
## 1.3 BPE algorithm outline

**1. Identify frequent pairs**
- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)

**2. Replace and record**

- Replace that pair with a new placeholder ID (one not already in use; e.g., if we start with 0...255, the first placeholder would be 256)
- Record this mapping in a lookup table
- The size of the lookup table is a hyperparameter, also called "vocabulary size" (for GPT-2, that's 50,257)

**3. Repeat until no gains**

- Keep repeating steps 1 and 2, continually merging the most frequent pairs
- Stop when no further compression is possible (e.g., no pair occurs more than once)

**Decompression (decoding)**

- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table
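
- The three steps above can be sketched in a few lines of Python; this is a minimal illustration rather than the implementation used later in this notebook, the helper names (`replace_pair`, `train_bpe_sketch`) and the sample string are my own, and the loop stops when no pair occurs more than once (a practical tokenizer would instead stop at the desired vocabulary size):

```python
from collections import Counter


def replace_pair(ids, pair, new_id):
    # Step 2: replace each non-overlapping occurrence of `pair` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe_sketch(ids, first_new_id=256):
    merges = {}  # lookup table: (id_a, id_b) -> new_id
    new_id = first_new_id
    while True:
        # Step 1: count adjacent pairs and pick the most frequent one
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = max(pairs.items(), key=lambda item: item[1])
        # Step 3: stop when no pair occurs more than once
        if count < 2:
            break
        ids = replace_pair(ids, pair, new_id)
        merges[pair] = new_id
        new_id += 1
    return ids, merges


ids = list("aaabdaaabac".encode("utf-8"))
compressed, merges = train_bpe_sketch(ids)
print(compressed)  # [258, 100, 258, 97, 99]
print(merges)      # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

- When several pairs are equally frequent, `max` returns the first one encountered, so ties are resolved by position in the text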



&nbsp;
## 1.4 BPE algorithm example

### 1.4.1 Concrete example of the encoding part (steps 1 & 2 in section 1.3)

- Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer

**Iteration 1**

1. Identify frequent pairs
  - In this text, "th" appears twice (at the beginning and before the second "e"); other pairs such as "he" and "at" also appear twice, and we simply pick the first one encountered

2. Replace and record
  - Replace "th" with a new token ID that is not already in use, e.g., 256
  - The new text is: `<256>e cat in <256>e hat`
  - The new vocabulary is

```
  0: ...
  ...
  256: "th"
```

**Iteration 2**

1. **Identify frequent pairs**  
   - In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice

2. **Replace and record**  
   - Replace `<256>e` with a new token ID that is not already in use, for example, `257`.  
   - The new text is:
     ```
     <257> cat in <257> hat
     ```
   - The updated vocabulary is:
     ```
     0: ...
     ...
     256: "th"
     257: "<256>e"
     ```

**Iteration 3**

1. **Identify frequent pairs**  
   - In the text `<257> cat in <257> hat`, the pair `<257> ` appears twice (once at the beginning and once before "hat").

2. **Replace and record**  
   - Replace `<257> ` with a new token ID that is not already in use, for example, `258`.  
   - The new text is:
     ```
     <258>cat in <258>hat
     ```
   - The updated vocabulary is:
     ```
     0: ...
     ...
     256: "th"
     257: "<256>e"
     258: "<257> "
     ```
     
- And so forth

&nbsp;
### 1.4.2 Concrete example of the decoding part (decompression step in section 1.3)

- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair, in the reverse order in which they were introduced
- Start with the final compressed text: `<258>cat in <258>hat`
- Substitute `<258>` → `<257> `: `<257> cat in <257> hat`
- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`
- Substitute `<256>` → "th": `the cat in the hat`
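
- The substitutions above can be sketched in code as follows; this is a minimal illustration in which the three merges are written out by hand as byte values, and the `expand` helper is a hypothetical name that is not part of the implementation below:

```python
# Merges from the example above, written with byte values:
# "t" = 116, "h" = 104, "e" = 101, " " = 32
merges = {
    (116, 104): 256,  # "th"
    (256, 101): 257,  # "<256>e"
    (257, 32): 258,   # "<257> "
}


def expand(ids, merges):
    # Undo the merges in the reverse order in which they were introduced
    for (left, right), new_id in reversed(list(merges.items())):
        expanded = []
        for token_id in ids:
            if token_id == new_id:
                expanded.extend([left, right])
            else:
                expanded.append(token_id)
        ids = expanded
    return ids


# Compressed text `<258>cat in <258>hat`
compressed = [258] + list(b"cat in ") + [258] + list(b"hat")
restored = bytes(expand(compressed, merges)).decode("utf-8")
print(restored)  # the cat in the hat
```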

&nbsp;
## 2. A simple BPE implementation

- Below is an implementation of the algorithm described above as a Python class that mimics the `tiktoken` Python user interface
- Note that the encoding part above describes the original training step via `train()`; however, the `encode()` method works similarly (although it looks a bit more complicated because of the special token handling):

1. Split the input text into individual bytes
2. Repeatedly find & replace (merge) adjacent tokens (pairs) when they match any pair in the learned BPE merges (from highest to lowest "rank," i.e., in the order they were learned)
3. Continue merging until no more merges can be applied
4. The final list of token IDs is the encoded output
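
- Before looking at the full class, the four steps above can be sketched as a standalone function; this is a simplified illustration under my own assumptions (hand-written merges for "th", "the", and "the ", no `Ġ` or special token handling, and reliance on Python dictionaries preserving insertion order), not the `encode()` method shown below:

```python
def replace_pair(ids, pair, new_id):
    # Replace each non-overlapping occurrence of `pair` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def encode_sketch(text, merges):
    ids = list(text.encode("utf-8"))     # 1. split the text into bytes
    for pair, new_id in merges.items():  # 2.-3. apply merges in learned order
        ids = replace_pair(ids, pair, new_id)
    return ids                           # 4. the final token IDs


# Hypothetical merges: "th" -> 256, "<256>e" -> 257, "<257> " -> 258
merges = {(116, 104): 256, (256, 101): 257, (257, 32): 258}
print(encode_sketch("the hat", merges))  # [258, 104, 97, 116]
```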

In [4]:
from collections import Counter, deque
from functools import lru_cache
import json


class BPETokenizerSimple:
    def __init__(self):
        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab = {}
        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab = {}
        # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}
        self.bpe_merges = {}

        # For the official OpenAI GPT-2 merges, use a rank dict
        # of the form {(string_A, string_B): rank}, where lower rank = higher priority
        self.bpe_ranks = {}

    def train(self, text, vocab_size, allowed_special={"<|endoftext|>"}):
        """
        Train the BPE tokenizer from scratch.

        Args:
            text (str): The training text.
            vocab_size (int): The desired vocabulary size.
            allowed_special (set): A set of special tokens to include.
        """

        # Preprocess: Replace spaces with "Ġ"
        # Note that Ġ is a particularity of the GPT-2 BPE implementation
        # E.g., "Hello world" might be tokenized as ["Hello", "Ġworld"]
        # (GPT-4 BPE would tokenize it as ["Hello", " world"])
        processed_text = []
        for i, char in enumerate(text):
            if char == " " and i != 0:
                processed_text.append("Ġ")
            if char != " ":
                processed_text.append(char)
        processed_text = "".join(processed_text)

        # Initialize vocab with unique characters, including "Ġ" if present
        # Start with the first 256 Unicode code points (ASCII plus Latin-1)
        unique_chars = [chr(i) for i in range(256)]
        unique_chars.extend(
            char for char in sorted(set(processed_text))
            if char not in unique_chars
        )
        if "Ġ" not in unique_chars:
            unique_chars.append("Ġ")

        self.vocab = {i: char for i, char in enumerate(unique_chars)}
        self.inverse_vocab = {char: i for i, char in self.vocab.items()}

        # Add allowed special tokens
        if allowed_special:
            for token in allowed_special:
                if token not in self.inverse_vocab:
                    new_id = len(self.vocab)
                    self.vocab[new_id] = token
                    self.inverse_vocab[token] = new_id

        # Tokenize the processed_text into token IDs
        token_ids = [self.inverse_vocab[char] for char in processed_text]

        # BPE steps 1-3: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):
            pair_id = self.find_freq_pair(token_ids, mode="most")
            if pair_id is None:
                break
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id

        # Build the vocabulary with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id

    def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):
        """
        Load pre-trained vocabulary and BPE merges from OpenAI's GPT-2 files.

        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json').
            bpe_merges_path (str): Path to the bpe_merges file  (GPT-2 calls it 'vocab.bpe').
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Convert loaded vocabulary to the correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # Use an existing token ID as a placeholder for '\n'
            # Preferentially use "<|endoftext|>" if available
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                # If no fallback token is available, raise an error
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")

            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id] = "\n"

        # Load GPT-2 merges and store them with an assigned "rank"
        self.bpe_ranks = {}  # Reset ranks
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
            if lines and lines[0].startswith("#"):
                lines = lines[1:]

            rank = 0
            for line in lines:
                pair = tuple(line.strip().split())
                if len(pair) == 2:
                    token1, token2 = pair
                    # If token1 or token2 is not in the vocab, skip the pair
                    if token1 in self.inverse_vocab and token2 in self.inverse_vocab:
                        self.bpe_ranks[(token1, token2)] = rank
                        rank += 1
                    else:
                        print(f"Skipping pair {pair} as one token is not in the vocabulary.")

    def encode(self, text, allowed_special=None):
        """
        Encode the input text into a list of token IDs, with tiktoken-style handling of special tokens.
    
        Args:
            text (str): The input text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.
    
        Returns:
            List of token IDs.
        """
        import re
    
        token_ids = []
    
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Build regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
    
            last_index = 0
            for match in re.finditer(special_pattern, text):
                prefix = text[last_index:match.start()]
                token_ids.extend(self.encode(prefix, allowed_special=None))  # Encode prefix without special handling
    
                special_token = match.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary.")
                last_index = match.end()
    
            text = text[last_index:]  # Remaining part to process normally

            # Check if any disallowed special tokens are in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
    
        # If no special tokens, or remaining text after special token split:
        tokens = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)
    
        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
    
        return token_ids

    def tokenize_with_bpe(self, token):
        """
        Tokenize a single token using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            List[int]: The list of token IDs after applying BPE.
        """
        # Tokenize the token into individual characters (as initial token IDs)
        token_ids = [self.inverse_vocab.get(char, None) for char in token]
        if None in token_ids:
            missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")

        # If we haven't loaded OpenAI's GPT-2 merges, use my approach
        if not self.bpe_ranks:
            can_merge = True
            while can_merge and len(token_ids) > 1:
                can_merge = False
                new_tokens = []
                i = 0
                while i < len(token_ids) - 1:
                    pair = (token_ids[i], token_ids[i + 1])
                    if pair in self.bpe_merges:
                        merged_token_id = self.bpe_merges[pair]
                        new_tokens.append(merged_token_id)
                        # Uncomment for educational purposes:
                        # print(f"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")
                        i += 2  # Skip the next token as it's merged
                        can_merge = True
                    else:
                        new_tokens.append(token_ids[i])
                        i += 1
                if i < len(token_ids):
                    new_tokens.append(token_ids[i])
                token_ids = new_tokens
            return token_ids

        # Otherwise, do GPT-2-style merging with the ranks:
        # 1) Convert token_ids back to string "symbols" for each ID
        symbols = [self.vocab[id_num] for id_num in token_ids]

        # Repeatedly merge all occurrences of the lowest-rank pair
        while True:
            # Collect all adjacent pairs
            pairs = set(zip(symbols, symbols[1:]))
            if not pairs:
                break

            # Find the pair with the best (lowest) rank
            min_rank = float("inf")
            bigram = None
            for p in pairs:
                r = self.bpe_ranks.get(p, float("inf"))
                if r < min_rank:
                    min_rank = r
                    bigram = p

            # If no valid ranked pair is present, we're done
            if bigram is None or bigram not in self.bpe_ranks:
                break

            # Merge all occurrences of that pair
            first, second = bigram
            new_symbols = []
            i = 0
            while i < len(symbols):
                # If we see (first, second) at position i, merge them
                if i < len(symbols) - 1 and symbols[i] == first and symbols[i+1] == second:
                    new_symbols.append(first + second)  # merged symbol
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            symbols = new_symbols

            if len(symbols) == 1:
                break

        # Finally, convert merged symbols back to IDs
        merged_ids = [self.inverse_vocab[sym] for sym in symbols]
        return merged_ids

    def decode(self, token_ids):
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (List[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        decoded_string = ""
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                if decoded_string and not decoded_string.endswith(" "):
                    decoded_string += " "  # Add space if not present before a newline
                decoded_string += token
            elif token.startswith("Ġ"):
                decoded_string += " " + token[1:]
            else:
                decoded_string += token
        return decoded_string

    def save_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Save the vocabulary and BPE merges to JSON files.

        Args:
            vocab_path (str): Path to save the vocabulary.
            bpe_merges_path (str): Path to save the BPE merges.
        """
        # Save vocabulary
        with open(vocab_path, "w", encoding="utf-8") as file:
            json.dump(self.vocab, file, ensure_ascii=False, indent=2)

        # Save BPE merges as a list of dictionaries
        with open(bpe_merges_path, "w", encoding="utf-8") as file:
            merges_list = [{"pair": list(pair), "new_id": new_id}
                           for pair, new_id in self.bpe_merges.items()]
            json.dump(merges_list, file, ensure_ascii=False, indent=2)

    def load_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Load the vocabulary and BPE merges from JSON files.

        Args:
            vocab_path (str): Path to the vocabulary file.
            bpe_merges_path (str): Path to the BPE merges file.
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            self.vocab = {int(k): v for k, v in loaded_vocab.items()}
            self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            merges_list = json.load(file)
            for merge in merges_list:
                pair = tuple(merge["pair"])
                new_id = merge["new_id"]
                self.bpe_merges[pair] = new_id

    @lru_cache(maxsize=None)
    def get_special_token_id(self, token):
        return self.inverse_vocab.get(token, None)

    @staticmethod
    def find_freq_pair(token_ids, mode="most"):
        pairs = Counter(zip(token_ids, token_ids[1:]))

        if not pairs:
            return None

        if mode == "most":
            return max(pairs.items(), key=lambda x: x[1])[0]
        elif mode == "least":
            return min(pairs.items(), key=lambda x: x[1])[0]
        else:
            raise ValueError("Invalid mode. Choose 'most' or 'least'.")

    @staticmethod
    def replace_pair(token_ids, pair_id, new_id):
        dq = deque(token_ids)
        replaced = []

        while dq:
            current = dq.popleft()
            if dq and (current, dq[0]) == pair_id:
                replaced.append(new_id)
                # Remove the 2nd token of the pair, 1st was already removed
                dq.popleft()
            else:
                replaced.append(current)

        return replaced

- There is a lot of code in the `BPETokenizerSimple` class above, and discussing it in detail is out of scope for this notebook, but the next section offers a short overview of the usage to understand the class methods a bit better
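
- One part that may be worth isolating is the rank-based merging used when the OpenAI merges are loaded; the following is a standalone sketch of that idea, where `merge_by_rank` and `toy_ranks` are hypothetical names and values for illustration (they are not GPT-2's actual ranks):

```python
def merge_by_rank(symbols, ranks):
    symbols = list(symbols)
    while len(symbols) > 1:
        # Among all adjacent pairs, pick the one with the lowest rank
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # none of the remaining pairs is a known merge
        first, second = best
        merged, i = [], 0
        while i < len(symbols):
            if (i < len(symbols) - 1
                    and symbols[i] == first and symbols[i + 1] == second):
                merged.append(first + second)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical ranks; a lower rank means the merge is applied earlier
toy_ranks = {("t", "h"): 0, ("th", "e"): 1, ("a", "t"): 2}
print(merge_by_rank("thehat", toy_ranks))  # ['the', 'h', 'at']
```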

## 3. BPE implementation walkthrough

- In practice, I highly recommend using [tiktoken](https://github.com/openai/tiktoken) as my implementation above focuses on readability and educational purposes, not on performance
- However, the usage is more or less similar to tiktoken, except that tiktoken does not have a training method
- Let's see how my `BPETokenizerSimple` Python code above works by looking at some examples below (a detailed code discussion is out of scope for this notebook)

### 3.1 Training, encoding, and decoding

- First, let's consider some sample text as our training dataset:

In [5]:
import os
import urllib.request

def download_file_if_absent(url, filename, search_dirs):
    for directory in search_dirs:
        file_path = os.path.join(directory, filename)
        if os.path.exists(file_path):
            print(f"{filename} already exists in {file_path}")
            return file_path

    target_path = os.path.join(search_dirs[0], filename)
    try:
        with urllib.request.urlopen(url) as response, open(target_path, "wb") as out_file:
            out_file.write(response.read())
        print(f"Downloaded {filename} to {target_path}")
    except Exception as e:
        print(f"Failed to download {filename}. Error: {e}")
    return target_path

verdict_path = download_file_if_absent(
    url=(
         "https://raw.githubusercontent.com/rasbt/"
         "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
         "the-verdict.txt"
    ),
    filename="the-verdict.txt",
    search_dirs=["."]
)

with open(verdict_path, "r", encoding="utf-8") as f:
    text = f.read()

the-verdict.txt already exists in ./the-verdict.txt


- Next, let's initialize and train the BPE tokenizer with a vocabulary size of 1,000
- Note that the vocabulary size is already 256 by default due to the byte values discussed earlier, so we are only "learning" 744 vocabulary entries (742 to be precise, once we account for the `<|endoftext|>` special token and the `Ġ` whitespace token)
- For comparison, the GPT-2 vocabulary is 50,257 tokens, the GPT-4 vocabulary is 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they all have much bigger training sets compared to our simple example text above

In [6]:
tokenizer = BPETokenizerSimple()
tokenizer.train(text, vocab_size=1000, allowed_special={"<|endoftext|>"})

- You may want to inspect the vocabulary contents (but note that this will print a long list)

In [7]:
# print(tokenizer.vocab)
print(len(tokenizer.vocab))

1000


- The vocabulary is created by merging 742 times (`= 1000 - len(range(0, 256)) - len(special_tokens) - "Ġ" = 1000 - 256 - 1 - 1 = 742`)

In [8]:
print(len(tokenizer.bpe_merges))

742


- This means that the first 256 entries are single-character tokens

- Next, let's use the created merges via the `encode` method to encode some text:

In [9]:
input_text = "Jack embraced beauty through art and life."
token_ids = tokenizer.encode(input_text)
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]


In [10]:
input_text = "Jack embraced beauty through art and life.<|endoftext|> "
token_ids = tokenizer.encode(input_text)
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]


In [11]:
input_text = "Jack embraced beauty through art and life.<|endoftext|> "
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]
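
- The difference between the last two outputs comes from the `allowed_special` argument: without it, `<|endoftext|>` is broken up like ordinary text, whereas with it, the special token is matched first and mapped to a single ID (257); the splitting step can be sketched in isolation as follows, where `split_on_special` is a hypothetical helper that mirrors the regular expression built inside `encode()`:

```python
import re


def split_on_special(text, allowed_special):
    # Longest tokens first, so that overlapping special tokens match greedily
    pattern = "(" + "|".join(
        re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)
    ) + ")"
    # The capturing group makes re.split keep the special tokens in the result
    return [part for part in re.split(pattern, text) if part]


parts = split_on_special("life.<|endoftext|> ", {"<|endoftext|>"})
print(parts)  # ['life.', '<|endoftext|>', ' ']
```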


In [12]:
print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

Number of characters: 56
Number of token IDs: 21


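- From the lengths above, we can see that the 42-character sentence (56 characters including the `<|endoftext|>` token and the trailing space) was encoded into 20 token IDs (21 including the special token), effectively cutting the input length roughly in half compared to a character- or byte-based encoding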
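
- The ratio can be computed explicitly; in this small sketch, the token IDs are copied from the first encoding output above (they depend on the training run, so they are hardcoded here rather than recomputed):

```python
sentence = "Jack embraced beauty through art and life."
# Token IDs copied from the output above (without the special token)
sentence_ids = [424, 256, 654, 531, 302, 311, 256, 296, 97, 465,
                121, 595, 841, 116, 287, 466, 256, 326, 972, 46]

print("Number of characters:", len(sentence))      # 42
print("Number of token IDs:", len(sentence_ids))   # 20
print("Characters per token:", len(sentence) / len(sentence_ids))  # 2.1
```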

- Note that the vocabulary itself is used in the `decode()` method, which allows us to map the token IDs back into text:

In [13]:
print(token_ids)

[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]


In [14]:
print(tokenizer.decode(token_ids))

Jack embraced beauty through art and life.<|endoftext|>


- Iterating over each token ID can give us a better understanding of how the token IDs are decoded via the vocabulary:

In [15]:
for token_id in token_ids:
    print(f"{token_id} -> {tokenizer.decode([token_id])}")

424 -> Jack
256 ->  
654 -> em
531 -> br
302 -> ac
311 -> ed
256 ->  
296 -> be
97 -> a
465 -> ut
121 -> y
595 ->  through
841 ->  ar
116 -> t
287 ->  a
466 -> nd
256 ->  
326 -> li
972 -> fe
46 -> .
257 -> <|endoftext|>


- As we can see, most token IDs represent 2-character subwords; that's because the training text is very short with not that many repetitive words, and because we used a relatively small vocabulary size

- As a summary, calling `decode(encode())` should reproduce the input text, as in the examples below:

In [16]:
tokenizer.decode(
    tokenizer.encode("This is some text.")
)

'This is some text.'

In [17]:
tokenizer.decode(
    tokenizer.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'

### 3.2 Saving and loading the tokenizer

- Next, let's look at how we can save the trained tokenizer for reuse later:

In [18]:
# Save trained tokenizer
tokenizer.save_vocab_and_merges(vocab_path="vocab.json", bpe_merges_path="bpe_merges.txt")

In [19]:
# Load tokenizer
tokenizer2 = BPETokenizerSimple()
tokenizer2.load_vocab_and_merges(vocab_path="vocab.json", bpe_merges_path="bpe_merges.txt")

- The loaded tokenizer should be able to produce the same results as before:

In [20]:
print(tokenizer2.decode(token_ids))

Jack embraced beauty through art and life.<|endoftext|>


In [21]:
tokenizer2.decode(
    tokenizer2.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'
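
- The conversions inside `save_vocab_and_merges` and `load_vocab_and_merges` are needed because JSON cannot represent the Python data structures directly; the following minimal sketch uses hypothetical toy values and in-memory strings instead of files to show the two issues (integer keys become strings, and tuple keys have to be stored as lists):

```python
import json

# Hypothetical toy vocabulary and merges
vocab = {256: "th"}
merges = {(116, 104): 256, (256, 101): 257}

# Integer keys are written as strings, hence the int(k) conversion on loading
loaded_vocab = json.loads(json.dumps(vocab))
print(loaded_vocab)  # {'256': 'th'}

# Tuples cannot be JSON object keys, so each merge is stored as a small record
serialized = json.dumps(
    [{"pair": list(pair), "new_id": new_id} for pair, new_id in merges.items()]
)

# JSON arrays come back as lists, so each pair is converted back into a tuple
restored = {tuple(m["pair"]): m["new_id"] for m in json.loads(serialized)}
print(restored == merges)  # True
```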

&nbsp;
### 3.3 Loading the original GPT-2 BPE tokenizer from OpenAI

- Finally, let's load OpenAI's GPT-2 tokenizer files

In [22]:
# Download files if not already present in this directory

# Define the directories to search and the files to download
search_directories = [".", "../02_bonus_bytepair-encoder/gpt2_model/"]

files_to_download = {
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe": "vocab.bpe",
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json": "encoder.json"
}

# Download each file unless it already exists in one of the search directories
paths = {}
for url, filename in files_to_download.items():
    paths[filename] = download_file_if_absent(url, filename, search_directories)

vocab.bpe already exists in ../02_bonus_bytepair-encoder/gpt2_model/vocab.bpe
encoder.json already exists in ../02_bonus_bytepair-encoder/gpt2_model/encoder.json


- Next, we load the files via the `load_vocab_and_merges_from_openai` method:

In [23]:
tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=paths["encoder.json"], bpe_merges_path=paths["vocab.bpe"]
)
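
- To make the role of the merges file more concrete, the sketch below parses a few lines laid out like `vocab.bpe` (a `#version` header followed by one space-separated pair per line) into a rank dictionary; the lines are illustrative and defined in memory rather than read from the downloaded file, and the vocabulary check performed by the method above is omitted:

```python
# Illustrative content in the layout of `vocab.bpe` (not read from the real file)
sample_merges_file = """#version: 0.2
Ġ t
Ġ a
h e
i n
"""

lines = sample_merges_file.splitlines()
if lines and lines[0].startswith("#"):
    lines = lines[1:]  # skip the version header

bpe_ranks = {}
for rank, line in enumerate(lines):
    pair = tuple(line.split())
    if len(pair) == 2:
        bpe_ranks[pair] = rank  # earlier lines get a lower rank (higher priority)

print(bpe_ranks)
# {('Ġ', 't'): 0, ('Ġ', 'a'): 1, ('h', 'e'): 2, ('i', 'n'): 3}
```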

- The vocabulary size should be `50257` as we can confirm via the code below:

In [24]:
len(tokenizer_gpt2.vocab)

50257

- We can now use the GPT-2 tokenizer via our `BPETokenizerSimple` object:

In [25]:
input_text = "This is some text"
token_ids = tokenizer_gpt2.encode(input_text)
print(token_ids)

[1212, 318, 617, 2420]


In [26]:
print(tokenizer_gpt2.decode(token_ids))

This is some text


- You can double-check that this produces the correct tokens using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp" width="600px">

```python
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")
# prints [1212, 318, 617, 2420]
```


&nbsp;
# 4. Conclusion

- That's it! That's how BPE works in a nutshell, complete with a training method for creating new tokenizers and support for loading the GPT-2 tokenizer vocabulary and merges from the original OpenAI GPT-2 model
- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new Discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)
- For a performance comparison with other tokenizer implementations, please see [this notebook](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)