# Learning Tokenizers
[Reference](https://zhuanlan.zhihu.com/p/657047389): these notes follow that article from Section 6 onward.

## How a Tokenizer Works
### Step 1: Normalization
This step cleans up the raw text by:
- removing needless whitespace
- lowercasing
- removing accents

For example, removing accents turns "café Gómez résumé" into "cafe Gomez resume".
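
The three normalization operations can be sketched with the standard library alone. This is a minimal illustration, not the normalizer that any particular tokenizer ships with; the function name `normalize` is made up for this example.

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (illustrative only)."""
    # NFD splits an accented letter into a base letter plus combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase, then collapse runs of whitespace and trim both ends.
    return " ".join(stripped.lower().split())

print(normalize("  Café   Gómez  RÉSUMÉ "))
```

Real normalizers are configurable: a cased model, for instance, would skip the lowercasing step.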

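### Step 2: Pre-tokenization
A tokenizer cannot be trained directly on raw text. The text first has to be split into smaller units, for example by cutting each sentence into individual words. This stage is called pre-tokenization.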
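
As a rough sketch of the idea, the splitting can be imitated with a regular expression that separates words from punctuation and records character offsets. The helper `pre_tokenize` is hypothetical and much simpler than the pre-tokenizers of real models.

```python
import re

def pre_tokenize(text: str) -> list[tuple[str, tuple[int, int]]]:
    """Split into words and punctuation, returning (piece, (start, end)) pairs."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

print(pre_tokenize("Hello, world!"))
```

Each offset pair is half-open: `end` points one past the last character of the piece.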

In [1]:
# Normalization demo
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, world! This is a test."
text1 = "Héllò hôw are ü?"
text2 = "Héllò Hôw ARE U?"
print(type(tokenizer.backend_tokenizer))
# Accent removal
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text1))
# Lowercasing
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text2))


# Pre-tokenization demo
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))  # BERT: splits on whitespace and punctuation; offsets are half-open [start, end)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))  # GPT-2: each space becomes Ġ and is attached to the following word
tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))  # T5: each space becomes ▁, and a ▁ is also added at the start of the sentence

<class 'tokenizers.Tokenizer'>
hello how are u?
hello how are u?
[('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13)), ('This', (14, 18)), ('is', (19, 21)), ('a', (22, 23)), ('test', (24, 28)), ('.', (28, 29))]
[('Hello', (0, 5)), (',', (5, 6)), ('Ġworld', (6, 12)), ('!', (12, 13)), ('ĠThis', (13, 18)), ('Ġis', (18, 21)), ('Ġa', (21, 23)), ('Ġtest', (23, 28)), ('.', (28, 29))]
[('▁Hello,', (0, 6)), ('▁world!', (7, 13)), ('▁This', (14, 18)), ('▁is', (19, 21)), ('▁a', (22, 23)), ('▁test.', (24, 29))]


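## Tokenizer in Practice: minimind
This section reproduces the BPE tokenizer used by minimind.

Tokenization is critical, and model quality depends heavily on its accuracy, so real projects usually rely on a mature BPE tokenizer. The training here is only meant to run the pipeline end to end and to understand its steps and details.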
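
To make the BPE idea concrete before handing the work to the library, here is a toy, character-level trainer in plain Python. It is a sketch of the general algorithm (repeatedly merge the most frequent adjacent pair), not the implementation inside `tokenizers`; the function `learn_bpe` and its lexicographic tie-breaking rule are assumptions made for this example.

```python
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word is a tuple of symbols (initially characters) with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # The most frequent pair wins; ties are broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the pair by one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3))
```

A byte-level trainer works the same way in principle, except that the initial symbols are bytes rather than characters, so any input text can be represented.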

In [2]:
from tokenizers import (
    Tokenizer,
    models,
    pre_tokenizers,
    normalizers,
    processors,
    trainers,
    decoders,
)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # no prefix space, to match GPT-style BPE


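### Defining the special tokens
Besides ordinary text, the vocabulary also needs marker tokens that structure the text, such as start-of-sequence and end-of-sequence tokens.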
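
What makes a token "special" is that it is matched as a whole and never broken into smaller pieces. The sketch below imitates that behaviour with a regular expression; `split_on_special_tokens` is a hypothetical helper, and the assumption that ids follow list order is taken from the assertions in the decoder cell of this notebook.

```python
import re

SPECIAL_TOKENS = ["<|sos|>", "<|eos|>", "<unk>", "<s>", "</s>", "<pad>", "<mask>"]

# Assumed id layout: special tokens receive ids 0..6 in list order.
SPECIAL_IDS = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}

def split_on_special_tokens(text: str) -> list[str]:
    """Split text so that each special token becomes its own, unsplittable piece."""
    # Longest tokens first, so a longer token is never shadowed by a shorter one.
    alternatives = sorted(SPECIAL_TOKENS, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"
    # A capturing group makes re.split keep the delimiters; drop empty strings.
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_special_tokens("<s>Hello<pad> world</s>"))
print(SPECIAL_IDS["<unk>"], SPECIAL_IDS["<s>"], SPECIAL_IDS["</s>"])
```

Only the pieces that are not special tokens would then go through normalization, pre-tokenization, and BPE.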


In [3]:
special_tokens=["<|sos|>", "<|eos|>","<unk>","<s>","</s>","<pad>","<mask>"]
trainer = trainers.BpeTrainer(
    vocab_size=256,
    special_tokens=special_tokens,
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

In [4]:
# Demonstrate the effect of show_progress and initial_alphabet
from tokenizers import Tokenizer, trainers, pre_tokenizers, models

# Prepare the training data
data = ["Hello, this is a test.", "Tokenization is fun!,[UNK],<cls>,[CLS],[cls]"]

# Configuration 1: show progress, use the ByteLevel initial alphabet
trainer1 = trainers.BpeTrainer(
    vocab_size=8,
    special_tokens=["[UNK]", "[cls]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=True,  # show a progress bar
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()  # the 256-symbol byte-level alphabet
)
tokenizer1 = Tokenizer(models.BPE())
tokenizer1.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer1.train_from_iterator(data, trainer1)

# Configuration 2: no progress bar, custom initial alphabet
trainer2 = trainers.BpeTrainer(
    vocab_size=8,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,  # no progress bar
    initial_alphabet=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
                      'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 
                      ' ', '!', ',', '.', ':', ';', '?']  # common ASCII characters only
)
tokenizer2 = Tokenizer(models.BPE())
tokenizer2.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer2.train_from_iterator(data, trainer2)

# Compare the tokenization results
print("\nTokens with the ByteLevel initial alphabet:")
# print(tokenizer1.encode("Hello, this is a test.你好").tokens)
print(tokenizer1.encode("Hello, this is a test.", "Tokenization is fun!,[UNK],<cls>,[CLS],[cls]").tokens)

print("\nTokens with the custom initial alphabet:")
# print(tokenizer2.encode("Hello, this is a test.你好").tokens)
print(tokenizer2.encode("Hello, this is a test.", "Tokenization is fun!,[UNK],<cls>,[CLS],[cls]").tokens)
# The custom initial alphabet cannot represent Chinese characters. Also, a special token is kept whole
# only when it matches an entry in special_tokens exactly, case included: [cls] vs. [CLS] in the two outputs.


Tokens with the ByteLevel initial alphabet:
['Ġ', 'H', 'e', 'l', 'l', 'o', ',', 'Ġ', 't', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 's', 't', '.', 'Ġ', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', 'Ġ', 'i', 's', 'Ġ', 'f', 'u', 'n', '!', ',', '[UNK]', 'Ġ', ',', '<', 'c', 'l', 's', '>', ',', '[', 'C', 'L', 'S', ']', ',', '[cls]']

Tokens with the custom initial alphabet:
['Ġ', 'H', 'e', 'l', 'l', 'o', ',', 'Ġ', 't', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 's', 't', '.', 'Ġ', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', 'Ġ', 'i', 's', 'Ġ', 'f', 'u', 'n', '!', ',', '[UNK]', 'Ġ', ',', '<', 'c', 'l', 's', '>', ',', '[CLS]', 'Ġ', ',', '[', 'c', 'l', 's', ']']


In [5]:
import json

def read_texts_from_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            yield data['text']

data_path = '../../toydata/tokenizer_data.jsonl'
data_iter = read_texts_from_jsonl(data_path)
print(f'Row 1: {next(data_iter)}')

Row 1: <s>近年来，人工智能技术迅速发展，深刻改变了各行各业的面貌。机器学习、自然语言处理、计算机视觉等领域的突破性进展，使得智能产品和服务越来越普及。从智能家居到自动驾驶，再到智能医疗，AI的应用场景正在快速拓展。随着技术的不断进步，未来的人工智能将更加智能、更加贴近人类生活。</s>


In [6]:
tokenizer.train_from_iterator(data_iter, trainer=trainer)






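### Setting the decoder
The decoder turns tokens back into text. Once the input_ids have been mapped to byte-level BPE tokens, the ByteLevel decoder reverses the byte-to-character mapping so that the result is readable again.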
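
The mapping that the byte-level decoder undoes can be sketched in a few lines. The code follows the byte-to-character scheme popularised by GPT-2; it is an illustration of the principle rather than the library's own code, and the function names are chosen for this example.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in keep}
    # The remaining bytes (space, control characters, ...) are shifted to 256 and up.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_encode(text: str) -> str:
    """Text -> UTF-8 bytes -> one printable character per byte."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def byte_level_decode(tokens: list[str]) -> str:
    """Token strings -> bytes -> UTF-8 text, reversing the mapping above."""
    raw = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return raw.decode("utf-8", errors="replace")

print(byte_level_decode(["Hello", ",", "Ġworld"]))          # Ġ stands for a space
print(byte_level_decode(list(byte_level_encode("你好"))))   # six one-byte tokens
```

Because a Chinese character occupies three UTF-8 bytes, it is spread over several byte-level symbols, and only the decoder can stitch them back into one character.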


In [7]:
tokenizer.decoder = decoders.ByteLevel()
assert tokenizer.token_to_id('<unk>') == 2
assert tokenizer.token_to_id('<s>') == 3
assert tokenizer.token_to_id('</s>') == 4

### Saving the trained tokenizer

In [8]:
import os
tokenizer_dir = "./model/toy_tokenizer"
os.makedirs(tokenizer_dir, exist_ok=True)
tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json")) # At this point, you will see a file named tokenizer.json under tokenizer_dir
tokenizer.model.save(tokenizer_dir) # generate vocab.json & merges.txt

['./model/toy_tokenizer/vocab.json', './model/toy_tokenizer/merges.txt']

### Creating the configuration file by hand
[About the configuration file](https://blog.csdn.net/xiezhipu/article/details/145585777)<br>
[Notes on configuring the normalizer](https://blog.csdn.net/weixin_49346755/article/details/126496833)<br>
[Notes on configuring the post-processor](https://blog.csdn.net/weixin_49346755/article/details/126499720)<br>
[Design rules for chat_template](https://www.guyuehome.com/detail?id=1888166611628642305)<br>

In [None]:
config = {
    "add_bos_token": False,
    "add_eos_token": False,
    "add_prefix_space": False,
    "added_tokens_decoder": {
        "0": {
            "content": "<|sos|>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "1": {
            "content": "<|eos|>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "2": {
            "content": "<unk>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "3": {
            "content": "<s>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "4": {
            "content": "</s>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "5": {
            "content": "<pad>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        },
        "6": {
            "content": "<mask>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True
        }
    },
    "additional_special_tokens": ["mask","<old>","<me>"],  # note: "mask" has no angle brackets, so it is a separate token from "<mask>"
    "bos_token": "<s>",
    "clean_up_tokenization_spaces": False,
    "eos_token": "</s>",
    "legacy": True,
    "model_max_length": 32768,
    "pad_token": "<unk>",
    "sp_model_kwargs": {},
    "spaces_between_special_tokens": False,
    "tokenizer_class": "PreTrainedTokenizerFast",
    "unk_token": "<unk>",
    # "chat_template": "{{ '<s>' + messages[0]['text'] + '</s>' }}"   # simple template: handles a single message only
    "chat_template": "{% for message in messages %} {{ '<s>' + message['text'] + '</s>'}} {% endfor %}"  # custom Jinja2 template: handles multi-message conversations
}

with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w", encoding="utf-8") as config_file:
    json.dump(config, config_file, ensure_ascii=False, indent=4)

print("Tokenizer training completed and saved.")

Tokenizer training completed and saved.
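
To see what the multi-message chat_template produces without involving Jinja2, the loop can be written out in plain Python. This is a hand-written equivalent under the assumption that the literal spaces around the `{{ ... }}` expression are kept as-is, which agrees with the spacing of the rendered text printed in the loading section; `render_chat` is a hypothetical name.

```python
def render_chat(messages: list[dict[str, str]]) -> str:
    """Plain-Python equivalent of the multi-message chat_template in the config."""
    # Each loop iteration of the template emits: space, <s>text</s>, space.
    return "".join(" <s>" + message["text"] + "</s> " for message in messages)

print(repr(render_chat([{"text": "hello"}, {"text": "world"}])))
```

Note that the template wraps every message in `<s>` and `</s>`, so messages that already contain these markers end up with them doubled.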


### Loading the tokenizer

In [10]:
from transformers import AutoTokenizer
tokenizer_dir1 = "./model/toy_tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir1)
msg=[{"text":"<|sos|><mask>this is a test <me><old><pad><|eos|>"},
     {"text":"<s> 这是我的测试文本。This is<pad> a test<mask>.</s>"}
]
new_msg1 = tokenizer.apply_chat_template(
    msg,
    tokenize=False,  # return the rendered string without tokenizing it
    add_special_tokens=True,  # has no visible effect on the output here
)
new_msg2 = tokenizer.apply_chat_template(
    msg,
    tokenize=True,  # return token ids instead of a string
    add_special_tokens=False,  # has no visible effect on the output here
)
print(f'Original messages: {msg}')
print(f'Rendered text: {new_msg1} (tokenize=False)')
print(f'Token ids: {new_msg2} (tokenize=True)')

Original messages: [{'text': '<|sos|><mask>this is a test <me><old><pad><|eos|>'}, {'text': '<s> 这是我的测试文本。This is<pad> a test<mask>.</s>'}]
Rendered text:  <s><|sos|><mask>this is a test <me><old><pad><|eos|></s>  <s><s> 这是我的测试文本。This is<pad> a test<mask>.</s></s>  (tokenize=False)
Token ids: [227, 3, 0, 6, 90, 78, 79, 89, 227, 79, 89, 227, 71, 227, 90, 75, 89, 90, 227, 265, 264, 5, 1, 4, 227, 227, 3, 3, 227, 171, 130, 254, 169, 253, 114, 169, 237, 246, 170, 255, 233, 169, 120, 240, 171, 114, 250, 169, 251, 236, 169, 257, 112, 166, 229, 231, 58, 78, 79, 89, 227, 79, 89, 5, 227, 71, 227, 90, 75, 89, 90, 6, 20, 4, 4, 227] (tokenize=True)


In [11]:
print(f'Tokenizer vocabulary size: {tokenizer.vocab_size}')

Tokenizer vocabulary size: 263
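
The size of 263 can be accounted for as 7 special tokens plus the 256 symbols of the byte-level alphabet, which suggests that the requested vocab_size of 256 left no room for learned merges. The sketch below rebuilds that layout under one assumption that is not taken from the library documentation: ids go to the special tokens first, then to the alphabet in code-point order. The assumption reproduces the ids 90 for `t` and 227 for `Ġ` that appear in the token ids printed in this section.

```python
SPECIAL_TOKENS = ["<|sos|>", "<|eos|>", "<unk>", "<s>", "</s>", "<pad>", "<mask>"]

def byte_level_alphabet() -> list[str]:
    """Return the 256 byte-level symbols (GPT-2 style), sorted by code point."""
    # Printable bytes stand for themselves.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    chars = [chr(b) for b in keep]
    # The other bytes are represented by the characters from code point 256 up.
    chars += [chr(256 + i) for i in range(256 - len(keep))]
    return sorted(chars)

# Assumed layout: special tokens first, then the alphabet in code-point order.
vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + byte_level_alphabet())}

print(len(vocab))                 # 7 special tokens + 256 byte symbols
print(vocab["t"], vocab["Ġ"])     # ids of the letter t and of the space symbol
```

The ids 264 and 265 seen for `<old>` and `<me>` lie above this range, which is consistent with the entries of additional_special_tokens being appended after the trained vocabulary and not being counted in vocab_size.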


In [12]:
model_inputs = tokenizer(new_msg1)
print(f'Tokenization result:\n{model_inputs}')

Tokenization result:
{'input_ids': [227, 3, 0, 6, 90, 78, 79, 89, 227, 79, 89, 227, 71, 227, 90, 75, 89, 90, 227, 265, 264, 5, 1, 4, 227, 227, 3, 3, 227, 171, 130, 254, 169, 253, 114, 169, 237, 246, 170, 255, 233, 169, 120, 240, 171, 114, 250, 169, 251, 236, 169, 257, 112, 166, 229, 231, 58, 78, 79, 89, 227, 79, 89, 5, 227, 71, 227, 90, 75, 89, 90, 6, 20, 4, 4, 227], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [13]:
response = tokenizer.decode(model_inputs['input_ids'], skip_special_tokens=True)
print(f'Decoded text: {response} (special tokens skipped)')

Decoded text:  this is a test    这是我的测试文本。This is a test.  (special tokens skipped)
