## A Brief Introduction to Using Tokenizers
This document briefly introduces the main ways to use a tokenizer and the parts that can be customized.

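### Documentation
A tokenizer takes strings as input and returns a `BatchEncoding`. By default its fields are Python lists; with `return_tensors="pt"` they are batched PyTorch tensors, which are usually passed directly to `model.forward`.

The tokenizer handles: splitting text into subword tokens (WordPiece for BERT, byte-level BPE for GPT-2), converting tokens to ids, padding, and building the attention mask.

See the documentation at https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast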
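To make the padding and attention-mask steps concrete, here is a minimal pure-Python sketch. It is an illustration, not the library's implementation: the helper name `pad_batch` is hypothetical, and it assumes right-side padding with pad id 0 (the id of `[PAD]` in `bert-base-uncased`). The ids are taken from the BERT outputs shown later in this document.

```python
# Toy illustration of padding and attention-mask construction.
# Assumptions: right-side padding, pad id 0; `pad_batch` is a hypothetical helper.

def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the length of the longest one and build the mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102],
    [101, 1996, 3437, 2003, 4413, 102],
])
print(batch["input_ids"][1])       # [101, 1996, 3437, 2003, 4413, 102, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

In practice `padding=True` does this for you, as the batched example below shows.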

### Examples

In [10]:
from transformers import AutoTokenizer
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
text = "the answer to life the universe and everything"
text2 = "the answer is 42"

In [84]:
# basic information
print(gpt_tok)
print("vocabs in bert:", len(bert_tok))
bert_tok.get_vocab()

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)
vocabs in bert: 30522


{'##ン': 30263,
 '##幸': 30367,
 'dependent': 7790,
 'symphonies': 29355,
 'lining': 14834,
 '[unused983]': 988,
 '275': 17528,
 '##ne': 2638,
 'belts': 18000,
 'tx': 19067,
 'storylines': 22628,
 '##™': 30108,
 'incumbent': 7703,
 'deer': 8448,
 'torres': 13101,
 '##ald': 19058,
 'delivery': 6959,
 'schwarz': 29058,
 'heath': 9895,
 'democratic': 3537,
 'mausoleum': 19049,
 'capitol': 9424,
 'myself': 2870,
 'genealogy': 26684,
 'passes': 5235,
 'licked': 11181,
 '##ridge': 9438,
 '[unused905]': 910,
 'toni': 16525,
 'organized': 4114,
 'inverse': 19262,
 '##zog': 28505,
 'meets': 6010,
 '##®': 29656,
 'calming': 23674,
 'erased': 23516,
 'thighs': 9222,
 '°c': 6362,
 'courier': 18092,
 'tandem': 18231,
 'winger': 16072,
 'white': 2317,
 'mold': 18282,
 '[unused652]': 657,
 '1786': 17436,
 'frankish': 26165,
 'customized': 28749,
 '##ness': 2791,
 'secured': 7119,
 'hugged': 10308,
 'josh': 6498,
 'priests': 8656,
 'hector': 10590,
 'algae': 18670,
 '1976': 3299,
 'pirate': 11304,
 'ave

In [64]:
def print_words(encoded):
    print(bert_tok.convert_ids_to_tokens(encoded["input_ids"]))
def print_batch(encoded):
    for ids in encoded["input_ids"]:
        print(bert_tok.convert_ids_to_tokens(ids))

In [79]:
print("direct call")
encoded = bert_tok(text)
print(encoded)
print_words(encoded)
print()

print("return tensor")
encoded = bert_tok(text, return_tensors="pt")
print(encoded)
print_batch(encoded)
print()

print("no special tokens")
encoded = bert_tok(text, add_special_tokens=False)
print(encoded)
print_words(encoded)
print()

print("encode two sentences (for bert input)")
encoded = bert_tok(text, text2)
print(encoded)
print_words(encoded)
print()

print("only return input_ids")
encoded = bert_tok.encode(text)
print(encoded)
print(bert_tok.convert_ids_to_tokens(encoded))
print()

print("batched encode")
encoded = bert_tok([text, text2])
print(encoded)
print_batch(encoded)
print()

print("batched encode with padding")
encoded = bert_tok([text, text2], padding=True)
print(encoded)
print_batch(encoded)
print()

print("encode with stride")
encoded = bert_tok(text, max_length=4, truncation=True, stride=1, return_overflowing_tokens=True)
print(encoded)
print_batch(encoded)
print()

direct call
{'input_ids': [101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'the', 'answer', 'to', 'life', 'the', 'universe', 'and', 'everything', '[SEP]']

return tensor
{'input_ids': tensor([[ 101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'the', 'answer', 'to', 'life', 'the', 'universe', 'and', 'everything', '[SEP]']

no special tokens
{'input_ids': [1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
['the', 'answer', 'to', 'life', 'the', 'universe', 'and', 'everything']

encode two sentences (for bert input)
{'input_ids': [101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102, 1996, 3437, 2003, 4413, 102], 'token_type_ids': [0, 0, 0
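The output of the "encode with stride" call is cut off above, so the following pure-Python sketch shows the idea behind `max_length`, `stride` and `return_overflowing_tokens`. It is a simplified model under stated assumptions, not the backend's actual code: the function name `overflow_windows` is hypothetical, each window is assumed to hold `max_length - 2` real tokens (leaving room for `[CLS]` and `[SEP]`), and consecutive windows are assumed to overlap by `stride` tokens.

```python
# Simplified model of truncation with overlapping windows.
# `overflow_windows` is a hypothetical helper, not part of transformers.

def overflow_windows(ids, max_length, stride, cls_id=101, sep_id=102):
    """Split `ids` into overlapping chunks, each wrapped in [CLS] ... [SEP]."""
    body_len = max_length - 2   # room left once the two special tokens are added
    step = body_len - stride    # how far the window advances each time
    if body_len <= 0 or step <= 0:
        raise ValueError("max_length must leave room for more tokens than the stride")
    windows, start = [], 0
    while True:
        chunk = ids[start:start + body_len]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body_len >= len(ids):
            break
        start += step
    return windows


# ids of "the answer to life the universe and everything" without special tokens
ids = [1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673]
windows = overflow_windows(ids, max_length=4, stride=1)
for window in windows:
    print(window)
```

With these settings every window shares one token with the previous one, which is why a short sentence expands into several rows in `input_ids`.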

In [50]:
print("decode")
encoded = bert_tok(text)
print(encoded)
decoded = bert_tok.decode(encoded["input_ids"])
print(decoded)
print()

print("batched decode")
encoded = bert_tok([text, text2])
print(encoded)
decoded = bert_tok.batch_decode(encoded["input_ids"])
print(decoded)
print()

print("decode without special tokens")
encoded = bert_tok(text)
print(encoded)
decoded = bert_tok.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)
print()

decode
{'input_ids': [101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] the answer to life the universe and everything [SEP]

batched decode
{'input_ids': [[101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102], [101, 1996, 3437, 2003, 4413, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
['[CLS] the answer to life the universe and everything [SEP]', '[CLS] the answer is 42 [SEP]']

decode without special tokens
{'input_ids': [101, 1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
the answer to life the universe and everything



In [57]:
print("long sequence")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz")
print(encoded)
print_words(encoded)
print(bert_tok.decode(encoded["input_ids"]))
print()

print("long sequence with is_split_into_words ")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz".split(), is_split_into_words=True)
print(encoded)
print_words(encoded)
print(bert_tok.decode(encoded["input_ids"]))

long sequence
{'input_ids': [101, 5925, 3207, 2546, 28891, 15992, 13728, 2078, 6728, 4160, 12096, 2226, 2615, 2860, 18037, 2480, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'abc', '##de', '##f', '##ghi', '##jk', '##lm', '##n', 'op', '##q', '##rst', '##u', '##v', '##w', '##xy', '##z', '[SEP]']
[CLS] abcdefghijklmn opqrstuvwxyz [SEP]

long sequence with is_split_into_words 
{'input_ids': [101, 5925, 3207, 2546, 28891, 15992, 13728, 207
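The `##` pieces above come from WordPiece's greedy longest-match-first rule: at each position the longest vocabulary entry that fits is taken, and pieces that do not start a word carry a `##` prefix. The sketch below reproduces that rule in pure Python. It is an illustration only: `wordpiece` and `toy_vocab` are hypothetical, and the toy vocabulary contains just the entries needed for this example rather than BERT's real 30522-entry vocabulary.

```python
# Greedy longest-match-first subword splitting (WordPiece-style), toy version.
# `wordpiece` and `toy_vocab` are hypothetical names used for illustration.

def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {
    "a", "abc", "##d", "##de", "##f", "##g", "##ghi", "##jk", "##lm", "##n",
    "op", "##q", "##rst", "##u", "##v", "##w", "##xy", "##z",
}
print(wordpiece("abcdefghijklmn", toy_vocab))
# ['abc', '##de', '##f', '##ghi', '##jk', '##lm', '##n']
print(wordpiece("opqrstuvwxyz", toy_vocab))
# ['op', '##q', '##rst', '##u', '##v', '##w', '##xy', '##z']
```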

In [75]:
print("word to char")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz")
print(encoded)
print_words(encoded)
print(0, encoded.word_to_chars(0))
print(1, encoded.word_to_chars(1))
print()

print("word to token")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz")
print(encoded)
print_words(encoded)
print(0, encoded.word_to_tokens(0))
print(1, encoded.word_to_tokens(1))
print()

print("char to token")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz")
print(encoded)
print_words(encoded)
print(0, encoded.char_to_token(0))
print(5, encoded.char_to_token(5))
print()

print("char to word")
encoded = bert_tok("abcdefghijklmn opqrstuvwxyz")
print(encoded)
print_words(encoded)
print(0, encoded.char_to_word(0))
print(15, encoded.char_to_word(15))
print()

word to char
{'input_ids': [101, 5925, 3207, 2546, 28891, 15992, 13728, 2078, 6728, 4160, 12096, 2226, 2615, 2860, 18037, 2480, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'abc', '##de', '##f', '##ghi', '##jk', '##lm', '##n', 'op', '##q', '##rst', '##u', '##v', '##w', '##xy', '##z', '[SEP]']
0 CharSpan(start=0, end=14)
1 CharSpan(start=15, end=27)

word to token
{'input_ids': [101, 5925, 3207, 2546, 28891, 15992, 13728, 2078, 6728, 4160, 12096, 2226, 2615, 2860, 18037, 2480, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'abc', '##de', '##f', '##ghi', '##jk', '##lm', '##n', 'op', '##q', '##rst', '##u', '##v', '##w', '##xy', '##z', '[SEP]']
0 TokenSpan(start=1, end=8)
1 TokenSpan(start=8, end=16)

char to token
{'input_ids': [101, 5925, 3207, 2546, 28891, 15992, 137
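The alignment methods used in this cell (`word_to_chars`, `word_to_tokens`, `char_to_token`) all answer the same question: which characters, words and tokens line up with each other. The following sketch rebuilds those mappings by hand from the subword pieces printed above. It is a hedged illustration: `align` is a hypothetical helper, it assumes whitespace-separated words, `##`-prefixed continuation pieces, and a `[CLS]` token at index 0; the library itself gets this information from the offsets tracked by the fast tokenizer.

```python
# Rebuild word/char/token alignments from known subword pieces (toy version).
# `align` is a hypothetical helper, not a transformers API.

def align(text, pieces_per_word):
    """Return (word -> char span, word -> token span, char -> token index)."""
    word_char_spans, word_token_spans, char_to_token = [], [], {}
    token_index = 1  # index 0 is assumed to be [CLS]
    cursor = 0
    for word, pieces in zip(text.split(), pieces_per_word):
        start = text.index(word, cursor)
        word_char_spans.append((start, start + len(word)))
        word_token_spans.append((token_index, token_index + len(pieces)))
        pos = start
        for piece in pieces:
            surface = piece[2:] if piece.startswith("##") else piece
            for char_index in range(pos, pos + len(surface)):
                char_to_token[char_index] = token_index
            pos += len(surface)
            token_index += 1
        cursor = start + len(word)
    return word_char_spans, word_token_spans, char_to_token


text_example = "abcdefghijklmn opqrstuvwxyz"
pieces = [
    ["abc", "##de", "##f", "##ghi", "##jk", "##lm", "##n"],
    ["op", "##q", "##rst", "##u", "##v", "##w", "##xy", "##z"],
]
char_spans, token_spans, char_to_token = align(text_example, pieces)
print(char_spans)               # [(0, 14), (15, 27)]
print(token_spans)              # [(1, 8), (8, 16)]
print(char_to_token[5])         # 3, the index of '##f' in this sketch
print(char_to_token.get(14))    # None: the space belongs to no token
```

The first two results agree with the `CharSpan` and `TokenSpan` values printed above.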

In [59]:
# Redefine the helpers so that they use the GPT-2 tokenizer instead of BERT's
def print_words(encoded):
    print(gpt_tok.convert_ids_to_tokens(encoded["input_ids"]))
def print_batch(encoded):
    for ids in encoded["input_ids"]:
        print(gpt_tok.convert_ids_to_tokens(ids))

In [62]:
print("chinese tokens")
encoded = gpt_tok("42是最好的随机种子, 42 is the best random seed")
print(encoded)
print_words(encoded)
print(gpt_tok.decode(encoded["input_ids"]))

chinese tokens
{'input_ids': [3682, 42468, 17312, 222, 25001, 121, 21410, 49694, 237, 17312, 118, 163, 100, 235, 36310, 11, 5433, 318, 262, 1266, 4738, 9403], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['42', 'æĺ¯', 'æľ', 'Ģ', 'å¥', '½', 'çļĦ', 'éļ', 'ı', 'æľ', 'º', 'ç', '§', 'į', 'åŃĲ', ',', 'Ġ42', 'Ġis', 'Ġthe', 'Ġbest', 'Ġrandom', 'Ġseed']
42是最好的随机种子, 42 is the best random seed
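The odd-looking tokens such as `'æĺ¯'` and `'Ġ42'` are not encoding errors. GPT-2 uses byte-level BPE: the text is first turned into UTF-8 bytes, and each byte is displayed as a printable Unicode character. The sketch below follows the well-known byte-to-character table from the GPT-2 reference code; the helper name `to_display` is hypothetical and added only for illustration.

```python
# Byte-to-printable-character table in the style of GPT-2's byte-level BPE.
# `to_display` is a hypothetical helper for this illustration.

def bytes_to_unicode():
    # Bytes that are already printable, non-space characters keep their own symbol
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping, shifted = {}, 0
    for byte in range(256):
        if byte in keep:
            mapping[byte] = chr(byte)
        else:
            # Control bytes, space, etc. are moved to code points from 256 upward
            mapping[byte] = chr(256 + shifted)
            shifted += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()

def to_display(text):
    """Show text the way byte-level BPE tokens are printed."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_display("是"))    # æĺ¯  (three UTF-8 bytes, one character each)
print(to_display(" 42"))   # Ġ42  (the leading space becomes Ġ)
```

This also explains why one Chinese character can be split across several tokens: the merges operate on bytes, and rare byte sequences are not merged into a single token.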


### Training a Tokenizer

In [None]:
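from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Start from an empty BPE model; anything it cannot encode maps to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Special tokens are added to the vocabulary first, in the order listed here
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Split the raw text into words and punctuation before learning merges
tokenizer.pre_tokenizer = Whitespace()
files = [...]  # paths to the plain-text training files (elided in the original)
tokenizer.train(files, trainer)

# Wrap the trained tokenizer so it can be used through the transformers API
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)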
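For intuition about what a BPE trainer learns, here is a toy pure-Python version of the merge loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a sketch under stated assumptions, not the implementation inside `BpeTrainer`: the function names are hypothetical, words start as sequences of single characters, and ties are broken by picking the lexicographically smallest pair (the real trainer may break ties differently).

```python
# Toy BPE merge learning; function names are hypothetical.
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the list of merges learned from a word-frequency table."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Highest count first; ties go to the lexicographically smallest pair
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The learned merge list, applied in order, is what turns new text into subword tokens at encoding time.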

### Saving and Loading

In [None]:
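# Save the trained `tokenizers.Tokenizer` from the previous cell as a single JSON file
tokenizer.save("tokenizer.json")

from transformers import PreTrainedTokenizerFast

# Load the JSON file back as a transformers fast tokenizer
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")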

### Summary


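The tokenizer classes in transformers are essentially a wrapper around the Hugging Face Tokenizers library. For ordinary use, the functionality they provide is already rich and practical enough. If you want to further customize the input/output format or the tokenization process, see the following:
- Overview of tokenizers: https://huggingface.co/docs/transformers/tokenizer_summary
- Documentation of the Tokenizers library: https://huggingface.co/docs/tokenizers/python/latest/index.html
- The tokenizer files of individual models, and especially the implementations of the two classes `PreTrainedTokenizerFast` and `PreTrainedTokenizerBase`.
  - Subclasses of `PreTrainedTokenizerFast` typically override `build_inputs_with_special_tokens`, `create_token_type_ids_from_sequences`, `save_vocabulary`, and `__init__`.
  - The core functions are `_encode_plus` and `_batch_encode_plus`, defined in `PreTrainedTokenizerFast`. They call the backend tokenizer and handle padding, adding special tokens, truncation, and so on.
  - Decoding goes through `_decode`, which is called by the public `decode` and `batch_decode` methods.
  - The most basic call flow is defined in `PreTrainedTokenizerBase`.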
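
As a rough picture of what the two most commonly overridden hooks do for a BERT-style model, here is a standalone sketch. The functions below only borrow the method names; they are not the transformers implementations, which are methods on the tokenizer class and read the special token ids from the tokenizer itself. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the BERT outputs earlier in this document.

```python
# Standalone sketch of BERT-style special-token handling.
# These free functions mirror the method names only; they are not library code.

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids from the BERT outputs above


def build_inputs_with_special_tokens(ids_a, ids_b=None):
    """Single sequence: [CLS] A [SEP]; pair: [CLS] A [SEP] B [SEP]."""
    if ids_b is None:
        return [CLS_ID] + ids_a + [SEP_ID]
    return [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]


def create_token_type_ids_from_sequences(ids_a, ids_b=None):
    """Segment ids: 0 for [CLS] A [SEP], 1 for B [SEP]."""
    first = [0] * (len(ids_a) + 2)
    if ids_b is None:
        return first
    return first + [1] * (len(ids_b) + 1)


ids_a = [1996, 3437, 2000, 2166, 1996, 5304, 1998, 2673]  # first sentence
ids_b = [1996, 3437, 2003, 4413]                          # "the answer is 42"
print(build_inputs_with_special_tokens(ids_a, ids_b))
print(create_token_type_ids_from_sequences(ids_a, ids_b))
```

The first printed list matches the `input_ids` of the two-sentence BERT example earlier in the document. A model with a different input layout would change exactly these two functions.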