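# Using byte pair encoding from `tiktoken`
tiktoken is a fast BPE tokenizer for OpenAI models. A BPE tokenizer splits text into tokens using byte pair encoding (BPE). Byte pair encoding originated as a data compression technique, but in natural language processing it is also used to build vocabularies and to tokenize text.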
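
To make the idea concrete, the following is a minimal sketch of the core BPE training loop: start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token ID. The function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are illustrative and are not part of tiktoken. The sketch also leaves out steps that the GPT-2 tokenizer performs, such as regex pre-splitting of the text, so it should not be read as the actual implementation.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from raw bytes (IDs 0 to 255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # new tokens are numbered after the 256 byte values
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)     # the compressed token sequence
print(merges)  # learned merge rules: (left, right) -> new token ID
```

Each merge shortens the sequence, which is why BPE works both as a compression scheme and as a way to grow a subword vocabulary.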

In [1]:
# !pip install tiktoken

In [1]:
import importlib.metadata
# Print the version of the installed tiktoken package
print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.6.0


In [2]:
import tiktoken
# Load the BPE encoding used by GPT-2
tik_tokenizer = tiktoken.get_encoding("gpt2")
# Define a sample string; it is encoded with tik_tokenizer in the next cell
text = "Hello, world. Is this-- a test?"

In [3]:
# allowed_special lists the special tokens that may appear in the input text
# and be encoded as single special tokens
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [4]:
# Decode the token IDs back into text
strings = tik_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


In [5]:
# Vocabulary size of the encoding
print(tik_tokenizer.n_vocab)

50257
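
The vocabulary size printed above can be broken down into its parts. As a small worked example, assuming the usual description of the GPT-2 vocabulary (256 byte-level base tokens, 50,000 learned merges, and the single special token `<|endoftext|>`), the arithmetic is:

```python
# Components of the GPT-2 vocabulary (assumed breakdown, see the text above)
num_byte_tokens = 256      # one token per possible byte value
num_merges = 50_000        # merge rules learned during BPE training
num_special_tokens = 1     # <|endoftext|>

vocab_size = num_byte_tokens + num_merges + num_special_tokens
endoftext_id = vocab_size - 1  # the special token takes the last ID

print(vocab_size)
print(endoftext_id)
```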


# Using the original byte pair encoding implementation used in GPT-2

In [6]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [7]:
download_vocab()

Fetching encoder.json: 1.04Mit [00:02, 502kit/s]                                                    
Fetching vocab.bpe: 457kit [00:02, 212kit/s]                                                        


In [8]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [9]:
integers = orig_tokenizer.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [10]:
strings = orig_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


# Using the byte pair tokenizer from HuggingFace Transformers

In [11]:
# !pip install transformers

In [12]:
import transformers

transformers.__version__

'4.33.3'

In [13]:
from transformers import GPT2Tokenizer
# Load the pretrained GPT-2 tokenizer through the GPT2Tokenizer class from HuggingFace Transformers
hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [14]:
hf_tokenizer(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

# A quick benchmark

In [15]:
with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [16]:
# Time each tokenizer on the same text to compare their performance
%timeit orig_tokenizer.encode(raw_text)

14.6 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%timeit tik_tokenizer.encode(raw_text)

2.9 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%timeit hf_tokenizer(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


28.6 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [20]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

28.3 ms ± 601 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
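
The `%timeit` magic used above is only available in IPython and Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The `benchmark` helper and the stand-in workload below are hypothetical and chosen only so the snippet runs without extra dependencies; in practice the lambda would wrap a call such as `tik_tokenizer.encode(raw_text)`.

```python
import statistics
import timeit


def benchmark(func, repeat=7, number=100):
    """Return (mean, stdev) in seconds per call over `repeat` runs of `number` calls."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in workload (an assumption for illustration, not a tokenizer)
sample_text = "Hello, world. Is this-- a test? " * 1000
mean_s, std_s = benchmark(lambda: sample_text.encode("utf-8"))

print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of 7 runs, 100 loops each)")
```

The defaults of 7 runs and 100 loops mirror the settings reported in the `%timeit` output above; absolute numbers will differ from machine to machine.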
