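# What is a tokenizer
[A complete guide to tokenizers](https://zhuanlan.zhihu.com/p/651430181)<br>
Text in NLP is usually split at one of three granularities: characters, words, or subwords.<br>
Four subword tokenization models are in common use: BPE, BBPE (byte-level BPE), WordPiece, and Unigram.<br>
A subword tokenizer runs in four stages: normalization -> pre-tokenization -> model tokenization -> post-processing.
![The tokenizer pipeline](../images/tokenizer.jpg)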
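
To make the four stages concrete, here is a minimal pure-Python sketch of the pipeline. It is only an illustration: the toy vocabulary, the greedy longest-match "model", and all function names are assumptions made for this example, not how any particular library implements the stages.

```python
import re
import unicodedata

# Toy vocabulary for the "model" stage (hypothetical, for illustration only).
TOY_VOCAB = {"token", "izer", "s", "are", "fun"}

def normalize(text):
    """Stage 1: Unicode normalization plus lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    """Stage 2: split into word-like chunks and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def model_tokenize(word):
    """Stage 3: greedy longest-match against the toy vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in TOY_VOCAB:
            end -= 1
        if end == start:
            # Nothing matched: emit an unknown marker and skip one character.
            pieces.append("[UNK]")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

def post_process(tokens):
    """Stage 4: wrap the sequence in special tokens."""
    return ["<s>"] + tokens + ["</s>"]

def tokenize(text):
    tokens = []
    for word in pre_tokenize(normalize(text)):
        tokens.extend(model_tokenize(word))
    return post_process(tokens)

print(tokenize("Tokenizers are FUN!"))
```

In a real tokenizer the third stage is a trained model such as BPE, WordPiece, or Unigram rather than a hand-written vocabulary.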


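[The essence of the three tokenization methods](https://zhuanlan.zhihu.com/p/664717335)
## BPE
BPE is simple in principle: split every word into characters, then repeatedly merge the most frequent adjacent pair of symbols into a new token until the vocabulary reaches the target size.
For details, see [Section 3, BPE](https://zhuanlan.zhihu.com/p/651430181)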
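
The merge loop can be sketched in a few lines of plain Python. This is a simplified illustration on a made-up word-frequency table; the function name `train_bpe` and the corpus are assumptions for the example, and a production trainer adds byte-level handling, special tokens, and many optimizations.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dictionary."""
    # Start with every word split into single characters.
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits

# Made-up corpus statistics for the example.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(corpus, num_merges=3)
print(merges)
print(splits)
```

The ordered list of merges is what gets replayed later to tokenize new text, which is why the order matters.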
## wordpiece
WordPiece works much like BPE, except that it chooses which pair to merge by a mutual-information-style score rather than by raw pair frequency.
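
A common way to write this score is `freq(ab) / (freq(a) * freq(b))`. The sketch below computes it on a made-up corpus and compares the winner with the pair that plain frequency would pick. The function name and corpus are assumptions for the example, and the `##` continuation prefix used by real WordPiece vocabularies is left out for brevity.

```python
from collections import Counter

def wordpiece_scores(word_freqs):
    """Return ({pair: score}, {pair: count}) for all adjacent character pairs."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # Score = pair frequency divided by the product of the parts' frequencies.
    scores = {
        (a, b): count / (symbol_counts[a] * symbol_counts[b])
        for (a, b), count in pair_counts.items()
    }
    return scores, pair_counts

# Made-up corpus statistics for the example.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_counts = wordpiece_scores(corpus)

best_by_score = max(scores, key=scores.get)
best_by_count = max(pair_counts, key=pair_counts.get)
print("WordPiece-style choice:", best_by_score)
print("BPE-style choice:", best_by_count)
```

The score favors pairs whose parts rarely occur apart, so a rare but tightly bound pair can beat a pair that is merely frequent.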
## unigram
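Unigram works top-down: it starts from a very large candidate vocabulary and repeatedly prunes the tokens that contribute least to the likelihood of the corpus, until the vocabulary shrinks to the target size.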
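
Pruning relies on being able to find the most probable segmentation of a string under the current token probabilities. The sketch below shows only that segmentation step, using a Viterbi-style dynamic program. The probabilities are made up and not normalized, and the function name is an assumption for the example; the pruning loop itself is not shown.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the most probable segmentation of `text`."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_start = [0] * (n + 1)          # where the last piece of text[:i] begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            candidate = best_score[start] + log_probs[piece]
            if candidate > best_score[end]:
                best_score[end] = candidate
                best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Hypothetical token probabilities, for illustration only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}
log_probs = {token: math.log(p) for token, p in probs.items()}

pieces, score = viterbi_segment("hugs", log_probs)
print(pieces, score)
```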


# Training your own tokenizer
The Hugging Face `tokenizers` library makes it easy to train our own BPE tokenizer.

In [1]:
from tokenizers import (
    Tokenizer,
    models,
    trainers,
    decoders,
    processors,
    pre_tokenizers,
    normalizers
)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

## Add special tokens that are always tokenized as single units


In [2]:
special_tokens=["<endoftext>","<|im_end|>","<|im_start|>","<pad>","<mask>","<s>","</s>","<unk>","<UNK>","<EOS>","<zzy>","<|s1|>","<|s2|>"]
trainer=trainers.BpeTrainer(
    vocab_size=6400,
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=True,
)
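
`initial_alphabet=pre_tokenizers.ByteLevel.alphabet()` seeds the vocabulary with one symbol per byte value, so any UTF-8 text can be represented without an unknown token. The sketch below rebuilds the GPT-2 style byte-to-character table in plain Python to show what those 256 symbols look like. It assumes `ByteLevel` follows the same mapping as GPT-2; the helper names are made up for this example.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7E + 1))    # '!' .. '~'
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, ...) are shifted
    # to code points starting at 256 so they stay visible.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_byte_level(text):
    """Encode text as UTF-8 and map every byte to its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(len(BYTE_TO_CHAR))        # one symbol per possible byte value
print(to_byte_level("hi"))      # ASCII letters map to themselves
print(len(to_byte_level("你")))  # one Chinese character becomes several byte symbols
```

This is also why a single Chinese character may end up as more than one token when the merges learned during training do not cover its full byte sequence.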

## Write a function to read the pretraining dataset

In [3]:
import json
# Training uses train_from_iterator,
# so define a generator function that streams the dataset.
def read_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line.strip())
            yield data['text']  # assumes every line has a 'text' field
# Quick test of the reader
data_path="../data/pretrain_hq.jsonl"
# Print the first record
data_iter = read_data(data_path)
for _ in range(1):
    print(next(data_iter))
del data_iter

<|im_start|>鉴别一组中文文章的风格和特点，例如官方、口语、文言等。需要提供样例文章才能准确鉴别不同的风格和特点。<|im_end|> <|im_start|>好的，现在帮我查一下今天的天气怎么样?今天的天气依据地区而异。请问你需要我帮你查询哪个地区的天气呢？<|im_end|> <|im_start|>打开闹钟功能，定一个明天早上七点的闹钟。好的，我已经帮您打开闹钟功能，闹钟将在明天早上七点准时响起。<|im_end|> <|im_start|>为以下场景写一句话描述：一个孤独的老人坐在公园长椅上看着远处。一位孤独的老人坐在公园长椅上凝视远方。<|im_end|> <|im_start|>非常感谢你的回答。请告诉我，这些数据是关于什么主题的？这些数据是关于不同年龄段的男女人口比例分布的。<|im_end|> <|im_start|>帮我想一个有趣的标题。这个挺有趣的："如何成为一名成功的魔术师" 调皮的标题往往会吸引读者的注意力。<|im_end|> <|im_start|>回答一个问题，地球的半径是多少？地球的平均半径约为6371公里，这是地球自赤道到两极的距离的平均值。<|im_end|> <|im_start|>识别文本中的语气，并将其分类为喜悦、悲伤、惊异等。
文本：“今天是我的生日！”这个文本的语气是喜悦。<|im_end|>
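
The reader above assumes that every line is a valid JSON object. A slightly more defensive variant that skips blank lines is sketched below and exercised on a throwaway file. The name `read_texts`, the `field` parameter, and the sample records are assumptions made for this example.

```python
import json
import os
import tempfile

def read_texts(file_path, field="text"):
    """Yield the `field` value of every JSON line, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)[field]

# Build a tiny temporary JSONL file to exercise the generator.
samples = [{"text": "first document"}, {"text": "second document"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for sample in samples:
        tmp.write(json.dumps(sample, ensure_ascii=False) + "\n")
    tmp.write("\n")  # a trailing blank line that should be ignored
    tmp_path = tmp.name

texts = list(read_texts(tmp_path))
os.remove(tmp_path)
print(texts)
```

Because the function is a generator, only one line is held in memory at a time, which is what makes it suitable for `train_from_iterator` on a large corpus.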


## Train with the iterator-based trainer

In [4]:
import os
data_path="../data/pretrain_hq.jsonl"
# Start training
data_iter = read_data(data_path)
tokenizer.train_from_iterator(data_iter, trainer=trainer)
# Set the decoder
tokenizer.decoder= decoders.ByteLevel()
# Verify that the special tokens received ids 0..12, in the order they were listed
for expected_id, token in enumerate(special_tokens):
    assert tokenizer.token_to_id(token) == expected_id
# Save the tokenizer
save_path="../model"
os.makedirs(save_path, exist_ok=True)
tokenizer.save(os.path.join(save_path, "tokenizer.json"))
tokenizer.model.save(save_path)






['../model/vocab.json', '../model/merges.txt']

Training produces the tokenizer.json file together with the matching vocab.json and merges.txt.
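
vocab.json maps each token to its id, and merges.txt lists the learned merges in priority order. A rough sketch of how the two are used together to encode a word is shown below. It works on small in-memory stand-ins rather than the real files, and the function name `bpe_encode` is an assumption for this example.

```python
def bpe_encode(word, merges, vocab):
    """Encode one pre-tokenized word with ranked merges and a token->id vocab."""
    # Earlier merges have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, [vocab[s] for s in symbols]

# In-memory stand-ins for merges.txt and vocab.json (made-up values).
merges = [("u", "g"), ("u", "n"), ("h", "ug")]
vocab = {tok: i for i, tok in enumerate(
    ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
)}

print(bpe_encode("hugs", merges, vocab))
print(bpe_encode("bun", merges, vocab))
```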

# Manually create the tokenizer_config.json configuration file
[About the configuration file](https://blog.csdn.net/xiezhipu/article/details/145585777)<br>
[Configuring the normalizer](https://blog.csdn.net/weixin_49346755/article/details/126496833)<br>
[Configuring post-processing](https://blog.csdn.net/weixin_49346755/article/details/126499720)<br>
[Design rules for chat_template](https://www.guyuehome.com/detail?id=1888166611628642305)<br>
The configuration file records the tokenizer's key parameters, such as normalization, pre-processing, post-processing, the chat template, and the special tokens.

In [5]:
# Build the configuration by hand
config = {
        "add_bos_token": False,
        "add_eos_token": False,
        "add_prefix_space": False,
        "added_tokens_decoder": {
            "0": {
                "content": "<|endoftext|>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },
            "1": {
                "content": "<|im_end|>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },
            "2": {
                "content": "<|im_start|>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },
            "3": {
                "content": "<pad>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },
            "4": {
                "content": "<mask>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"5": {
                "content": "<s>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"6": {
                "content": "</s>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"7": {
                "content": "<unk>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"8": {
                "content": "<UNK>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"9": {
                "content": "<EOS>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"10": {
                "content": "<zzy>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"11": {
                "content": "<|s1|>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },"12": {
                "content": "<|s2|>",
                "lstrip": False,
                "normalized": False,
                "rstrip": False,
                "single_word": False,
                "special": True
            },
        },
        "additional_special_tokens": [ "<pad>", "<mask>","<s>","</s>","<unk>","<UNK>","<EOS>","<zzy>","<|s1|>","<|s2|>"],
        "bos_token": "<|im_start|>",
        "clean_up_tokenization_spaces": False,
        "eos_token": "<|im_end|>",
        "legacy": True,
        "model_max_length": 32768,
        "pad_token": "<pad>",
        "sp_model_kwargs": {},
        "spaces_between_special_tokens": False,
        "tokenizer_class": "PreTrainedTokenizerFast",
        "unk_token": "<|endoftext|>",
        "chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{{ '<|im_start|>system\\n' + system_message + '<|im_end|>\\n' }}{% else %}{{ '<|im_start|>system\\nYou are a helpful assistant<|im_end|>\\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}"
    }

save_path = "../model"
os.makedirs(save_path, exist_ok=True)
# Save the configuration file
with open(os.path.join(save_path, "tokenizer_config.json"), "w", encoding="utf-8") as config_file:
    json.dump(config, config_file, ensure_ascii=False, indent=4)

print("Tokenizer training completed and saved.")

Tokenizer training completed and saved.


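### Common pitfalls
In the config's `added_tokens_decoder`,<br>
my "<|多余的flag|>" and "<|zzy|>" entries had no matching token in the trained vocabulary (the encoder side), so ids and tokens did not line up; setting them this way is a mistake.<br>
The config above has the same kind of mismatch: it declares `<|endoftext|>` at id 0, while the tokenizer was trained with `<endoftext>`. This most likely explains why `len(tokenizer)` reports 6401 instead of 6400 in the evaluation output.<br>
For the proper way to add special tokens, refer to the model's official documentation.
[Reference 1: how to extend the vocabulary](https://zhuanlan.zhihu.com/p/704346193#:~:text=%E7%AE%80%E5%8D%95%E6%9D%A5%E8%AF%B4%EF%BC%8C%E8%AF%BB%E5%85%A5%20tokenizer%20model%E4%B9%8B%E5%90%8E%EF%BC%8C%E8%B0%83%E7%94%A8%20tokenizer%20%E7%9A%84%20add_special_tokens%20%E6%96%B9%E6%B3%95%E7%BB%99%20tokenizer,model%20%E7%9A%84%20embedding%20size%EF%BC%8C%E9%80%9A%E8%BF%87%E8%B0%83%E7%94%A8%20model%20%E7%9A%84%20resize_token_embeddings%20%E6%96%B9%E6%B3%95%E6%9D%A5%E5%AE%9E%E7%8E%B0%E8%BF%99%E4%B8%80%E7%82%B9%E3%80%82)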
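
A quick consistency check can catch this kind of mismatch before the tokenizer is used. The sketch below compares `added_tokens_decoder` entries against a token-to-id vocabulary. The function name is an assumption, and the two dictionaries are small stand-ins for what would normally be loaded from tokenizer_config.json and vocab.json.

```python
def find_mismatched_tokens(added_tokens_decoder, vocab):
    """Return (id, content, problem) for config entries that disagree with the vocab."""
    problems = []
    for token_id, entry in added_tokens_decoder.items():
        content = entry["content"]
        actual_id = vocab.get(content)
        if actual_id is None:
            problems.append((int(token_id), content, "not in vocab"))
        elif actual_id != int(token_id):
            problems.append((int(token_id), content, f"vocab id is {actual_id}"))
    return problems

# Stand-in for vocab.json: the tokens as they were actually trained.
vocab = {"<endoftext>": 0, "<|im_end|>": 1, "<|im_start|>": 2}
# Stand-in for the "added_tokens_decoder" section of tokenizer_config.json.
added_tokens_decoder = {
    "0": {"content": "<|endoftext|>"},
    "1": {"content": "<|im_end|>"},
    "2": {"content": "<|im_start|>"},
}

for problem in find_mismatched_tokens(added_tokens_decoder, vocab):
    print(problem)
```

An empty result means every declared special token exists in the vocabulary under the id the config claims.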

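#### About the template
The chat template is written for the Jinja template engine, which describes how a conversation is laid out as text.
Structure of the chat_template (shown with line breaks and indentation for readability; the config stores it as a single line):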
```
{% if messages[0]['role'] == 'system' %}
  {% set system_message = messages[0]['content'] %}
  {{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}
{% else %}
  {{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}
{% endif %}

{% for message in messages %}
  {% set content = message['content'] %}
  {% if message['role'] == 'user' %}
    {{ '<|im_start|>user\n' + content + '<|im_end|>\n<|im_start|>assistant\n' }}
  {% elif message['role'] == 'assistant' %}
    {{ content + '<|im_end|>' + '\n' }}
  {% endif %}
{% endfor %}
```

Later, when we load the multi-turn SFT dialogue dataset, chat_template is used to convert each conversation into this format.
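
To make the template logic easier to follow, here is a plain-Python function that mirrors its branches. It is a reading aid under the assumption that it matches the template's behavior for non-empty message lists, not a replacement for `apply_chat_template`; the names `render_chat` and `DEFAULT_SYSTEM` are made up for this example.

```python
DEFAULT_SYSTEM = "You are a helpful assistant"

def render_chat(messages):
    """Mirror the chat_template: a system header, then user/assistant turns."""
    parts = []
    # Use the first message as the system prompt if it is one, else a default.
    if messages and messages[0]["role"] == "system":
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
    for message in messages:
        content = message["content"]
        if message["role"] == "user":
            # A user turn also opens the assistant turn that follows it.
            parts.append(
                "<|im_start|>user\n" + content + "<|im_end|>\n<|im_start|>assistant\n"
            )
        elif message["role"] == "assistant":
            parts.append(content + "<|im_end|>" + "\n")
        # Any other role (including "system") is skipped inside the loop.
    return "".join(parts)

example = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(render_chat(example))
```

Note that the user branch emits the `<|im_start|>assistant` header itself, so a conversation that ends on a user message is already a ready-to-complete prompt.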

#### Evaluating the tokenizer
In short, we use AutoTokenizer to load the locally trained tokenizer files.
![Loading process](../images/AutoTokenizer.jpg)

In [6]:
from transformers import AutoTokenizer
# Evaluate the tokenizer
tokenizer_path="../model/"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=True)

# Format of the SFT data used later
messages = [
        {"role": "system", "content": "你是一个优秀的聊天机器人，总是给我正确的回应！"},
        {"role": "user", "content": '你来自哪里？'},
        {"role": "assistant", "content": '我来自地球'}
    ]
new_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False
    ) # converts the list of message dicts into a single string in the template format
print(new_prompt)

# Base vocabulary size (without tokens added on top of the trained vocabulary)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
# Actual size, including added tokens
len_tokenizer = len(tokenizer)
print("Tokenizer actual size:", len_tokenizer)

# Encoding test
model_inputs=tokenizer(new_prompt)
print("length of model inputs:", len(model_inputs['input_ids']))
input_ids= model_inputs['input_ids']
print("Input IDs:", input_ids)
response= tokenizer.decode(input_ids, skip_special_tokens=False)
# Check that decoding reproduces new_prompt exactly
print("Response matches new prompt:", response == new_prompt)

# Print the tokenizer's special tokens
print("Tokenizer special tokens:")
print(tokenizer.special_tokens_map)

<|im_start|>system
你是一个优秀的聊天机器人，总是给我正确的回应！<|im_end|>
<|im_start|>user
你来自哪里？<|im_end|>
<|im_start|>assistant
我来自地球<|im_end|>

Tokenizer vocabulary size: 6400
Tokenizer actual size: 6401
length of model inputs: 46
Input IDs: [2, 95, 101, 315, 81, 89, 211, 405, 932, 5243, 3325, 2125, 273, 2611, 1140, 2607, 711, 480, 1005, 1, 211, 2, 97, 95, 3717, 211, 405, 2730, 3024, 433, 1, 211, 2, 77, 95, 95, 85, 315, 3932, 96, 211, 309, 2730, 1292, 1, 211]
Response matches new prompt: True
Tokenizer special tokens:
{'bos_token': '<|im_start|>', 'eos_token': '<|im_end|>', 'unk_token': '<|endoftext|>', 'pad_token': '<pad>', 'additional_special_tokens': ['<pad>', '<mask>', '<s>', '</s>', '<unk>', '<UNK>', '<EOS>', '<zzy>', '<|s1|>', '<|s2|>']}


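### Conclusion
With these steps we have:<br>
- trained our own BPE tokenizer using the BPE model and the pretraining dataset
- built a tokenizer that transformers can load from tokenizer.json and tokenizer_config.json
- used the trained tokenizer to convert text into token id sequences (and decode them back)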