# 1. Getting to Know the Tokenizer

In [1]:
from transformers import AutoTokenizer

In [2]:
# We can load a pretrained tokenizer directly from the Hugging Face Hub
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [3]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

The loaded tokenizer is a `BertTokenizerFast` object. It carries information such as `vocab_size`, `special_tokens`, and the padding settings.

`special_tokens_map` records the special tokens used by the model.

In [4]:
print(tokenizer.special_tokens_map)
print(tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map.values()))

{'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
[100, 102, 0, 101, 103]


# 2. Basic Usage

## 2.1. Calling the Tokenizer Directly

Calling the tokenizer object itself (its `__call__` method) tokenizes sentences in a single step.

In [5]:
test_examples = ["today is not so bad", "It is so bad", "It's good"]

In [6]:
in_tensors = tokenizer(
    test_examples,
    padding="longest",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(in_tensors.keys())
print(in_tensors["input_ids"])
print(in_tensors["attention_mask"])
print(in_tensors["token_type_ids"])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tensor([[ 101, 2651, 2003, 2025, 2061, 2919,  102],
        [ 101, 2009, 2003, 2061, 2919,  102,    0],
        [ 101, 2009, 1005, 1055, 2204,  102,    0]])
tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 0]])
tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])


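`__call__` returns a `BatchEncoding` object. It is a dict-like class (it derives from `collections.UserDict` rather than from `dict` itself), so we can index it with `[key]`.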

https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/tokenizer#transformers.BatchEncoding

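## 2.1.1. Padding and Truncation

When tokenizing above, we set `padding="longest"`, which means all sentences are aligned to the length of the longest sentence in the batch; shorter sentences are filled with the padding token. `padding` accepts the following options:

* `True` or `longest`: pad to the longest sequence in the batch (no padding is applied if only a single sequence is provided).
* `max_length`: pad to the length given by the `max_length` argument; if `max_length` is not provided, pad to the maximum length the model accepts (`model_max_length`). Padding is still applied if you provide only a single sequence.
* `False` or `do_not_pad`: apply no padding. This is the default behavior.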
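
The padding options above can be illustrated with a small pure-Python sketch. It is a simplified illustration, not the transformers implementation: `pad_batch` is a hypothetical helper, it only pads on the right, and unlike the library it requires an explicit `max_length` instead of falling back to `model_max_length`. The ids are copied from the output of the call above.

```python
# Simplified sketch of the padding strategies; NOT the transformers implementation.
PAD_ID = 0  # id of [PAD] in bert-base-uncased, as printed in section 1


def pad_batch(batch, strategy="longest", max_length=None):
    """Right-pad a list of id lists and build the matching attention masks."""
    if strategy == "longest":
        target = max(len(ids) for ids in batch)
    elif strategy == "max_length":
        if max_length is None:
            # Assumption of this sketch: the real library would fall back
            # to model_max_length here.
            raise ValueError("max_length is required in this sketch")
        target = max_length
    elif strategy == "do_not_pad":
        target = None
    else:
        raise ValueError(f"unknown padding strategy: {strategy}")

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = 0 if target is None else max(target - len(ids), 0)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 2651, 2003, 2025, 2061, 2919, 102],
    [101, 2009, 2003, 2061, 2919, 102],
    [101, 2009, 1005, 1055, 2204, 102],
]
padded = pad_batch(batch, strategy="longest")
print(padded["input_ids"])
print(padded["attention_mask"])
```

With `strategy="longest"` the two shorter sequences each receive one trailing `0`, which matches the `input_ids` and `attention_mask` tensors printed earlier.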

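We also set the truncation option `truncation=True` above. The possible values of this argument are:

* `True` or `longest_first`: truncate to the maximum length given by `max_length`; if `max_length` is not provided, truncate to the maximum length the model accepts (`model_max_length`). For a sentence pair, tokens are removed one at a time: at each step the tokenizer compares the remaining lengths of the two sentences and removes a token from the longer one, until the total length fits.
* `only_second`: for a single sentence, behaves the same as `True`. For a pair of sequences (or a batch of pairs), only the second sentence of the pair is truncated; if the second sentence is too short to absorb the truncation, an error is reported.
* `only_first`: for a single sentence, behaves the same as `True`. For a pair of sequences (or a batch of pairs), only the first sentence of the pair is truncated; if the first sentence is too short to absorb the truncation, an error is reported.
* `False` or `do_not_truncate`: no truncation is performed.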
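
The pair-truncation strategies above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the library code: `truncate_pair` is a hypothetical helper, `max_tokens` counts only the sentence tokens (special tokens are ignored), tokens are dropped from the right, and a tie in `longest_first` is resolved by trimming the second sentence.

```python
# Simplified sketch of pair truncation; NOT the transformers implementation.
def truncate_pair(first, second, max_tokens, strategy="longest_first"):
    """Trim a pair so that len(first) + len(second) <= max_tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    first, second = list(first), list(second)
    excess = len(first) + len(second) - max_tokens
    if excess <= 0:
        return first, second  # already short enough

    if strategy == "longest_first":
        # Remove one token at a time from whichever sentence is currently longer.
        for _ in range(excess):
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
    elif strategy == "only_first":
        if len(first) <= excess:
            raise ValueError("first sequence is too short to absorb the truncation")
        first = first[:-excess]
    elif strategy == "only_second":
        if len(second) <= excess:
            raise ValueError("second sequence is too short to absorb the truncation")
        second = second[:-excess]
    else:
        raise ValueError(f"unknown truncation strategy: {strategy}")
    return first, second


a = ["today", "is", "not", "so", "bad"]
b = ["it", "is", "so", "bad"]
print(truncate_pair(a, b, 6, "longest_first"))
print(truncate_pair(a, b, 6, "only_first"))
print(truncate_pair(a, b, 6, "only_second"))
```

`longest_first` spreads the cut over both sentences, while `only_first` and `only_second` concentrate it on one side and fail when that side cannot absorb it.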

## 2.2. Tokenization with `tokenize`: Splitting a Sentence into Tokens (WordPieces)

In [7]:
# tokenize() does not support batched input
tokenizer.tokenize(test_examples[0])

['today', 'is', 'not', 'so', 'bad']

In [8]:
# The BERT model used here has limited support for Chinese; Chinese characters missing from the vocab become [UNK]
print(tokenizer.tokenize("你好，中国，Bert对中文的支持很有限"))

['[UNK]', '[UNK]', '，', '中', '国', '，', 'bert', '[UNK]', '中', '文', '的', '[UNK]', '[UNK]', '[UNK]', '有', '[UNK]']


In [9]:
# Rare or misspelled words are split into WordPieces
print(tokenizer.tokenize("hello-cat!, huggingFace, 123456"))

['hello', '-', 'cat', '!', ',', 'hugging', '##face', ',', '123', '##45', '##6']
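
The `##` pieces in the output above come from WordPiece's greedy longest-match-first lookup. A minimal sketch of that lookup for a single, already lowercased word is shown below. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical; the real tokenizer also normalizes the text and splits on whitespace and punctuation before this step.

```python
# Minimal sketch of greedy longest-match-first WordPiece splitting.
# `toy_vocab` is a made-up vocabulary, not the real bert-base-uncased vocab.
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes [UNK].
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"hello", "hugging", "##face", "123", "##45", "##6"}
print(wordpiece("huggingface", toy_vocab))
print(wordpiece("123456", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary the first two calls reproduce the splits shown above (`hugging` + `##face`, and `123` + `##45` + `##6`), and a word with no matching piece collapses to `[UNK]`.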


## 2.3. Converting Between Tokens and IDs

In [11]:
tokens = tokenizer.tokenize(test_examples[0])
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))

[2651, 2003, 2025, 2061, 2919]
['today', 'is', 'not', 'so', 'bad']
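
Conceptually, these two conversions are dictionary lookups in the vocabulary, with `[UNK]` as the fallback for unknown tokens. The sketch below uses a tiny hand-written vocabulary whose ids are copied from the outputs in this notebook; `tokens_to_ids` and `ids_to_tokens` are hypothetical helpers, not the library methods.

```python
# Toy illustration of token <-> id lookup; ids copied from the outputs above.
toy_vocab = {
    "[PAD]": 0,
    "[UNK]": 100,
    "today": 2651,
    "is": 2003,
    "not": 2025,
    "so": 2061,
    "bad": 2919,
}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def tokens_to_ids(tokens):
    """Map each token to its id, falling back to the [UNK] id."""
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]


def ids_to_tokens(ids):
    """Map each id back to its token."""
    return [id_to_token[i] for i in ids]


ids = tokens_to_ids(["today", "is", "not", "so", "bad"])
print(ids)
print(ids_to_tokens(ids))
print(tokens_to_ids(["today", "zzz"]))  # unknown token falls back to [UNK]
```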


## 2.4. `encode`

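`encode` returns only the token ids and does not support batched input. For a single sentence, `encode(text, add_special_tokens=False)` is equivalent to `convert_tokens_to_ids(tokenize(text))`; by default `encode` additionally inserts the special tokens (`[CLS]` and `[SEP]` for this model).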

In [12]:
tokenizer.encode(
    test_examples[0],
    test_examples[1],
)

[101, 2651, 2003, 2025, 2061, 2919, 102, 2009, 2003, 2061, 2919, 102]

In [13]:
print(
    tokenizer.convert_ids_to_tokens(
        tokenizer.encode(test_examples[0], test_examples[1])
    )
)

['[CLS]', 'today', 'is', 'not', 'so', 'bad', '[SEP]', 'it', 'is', 'so', 'bad', '[SEP]']
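
The output above shows the layout BERT uses for a sentence pair: `[CLS] A [SEP] B [SEP]`. A small pure-Python sketch of that layout follows. `build_pair_inputs` is a hypothetical helper, not a transformers function; the ids are copied from the outputs above, and the segment ids follow BERT's convention of 0 for the first sentence and 1 for the second.

```python
# Sketch of how special tokens frame a single sentence or a sentence pair.
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], as printed in section 1


def build_pair_inputs(ids_a, ids_b=None):
    """Return (input_ids, token_type_ids) for one sentence or a pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first segment, incl. [CLS] and [SEP]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # second segment, incl. final [SEP]
    return input_ids, token_type_ids


ids_a = [2651, 2003, 2025, 2061, 2919]  # today is not so bad
ids_b = [2009, 2003, 2061, 2919]        # it is so bad
input_ids, token_type_ids = build_pair_inputs(ids_a, ids_b)
print(input_ids)
print(token_type_ids)
```

The resulting `input_ids` match the list returned by `tokenizer.encode(test_examples[0], test_examples[1])` above.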


## 2.5. `encode_plus`

This interface is deprecated and has been fully replaced by `__call__`.

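## 2.6. `decode`

`decode` is roughly the inverse of `encode`: it converts a list of ids back into a single string. Normalization applied during encoding, such as lowercasing, is not undone.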

In [14]:
token_ids = tokenizer.encode(test_examples[0], test_examples[1])
tokenizer.decode(token_ids)

'[CLS] today is not so bad [SEP] it is so bad [SEP]'
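
For WordPiece tokens, turning tokens back into text mostly means joining them with spaces and gluing `##` continuation pieces onto the preceding piece. The sketch below shows only that step. `wordpieces_to_text` is a hypothetical helper, and the real `decode` performs additional cleanup (for example around punctuation) that is omitted here.

```python
# Sketch of joining WordPiece tokens back into a string; cleanup steps omitted.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]")


def wordpieces_to_text(tokens, skip_special_tokens=False):
    """Join tokens with spaces and merge '##' continuation pieces."""
    if skip_special_tokens:
        tokens = [tok for tok in tokens if tok not in SPECIAL_TOKENS]
    return " ".join(tokens).replace(" ##", "").strip()


pair_tokens = ["[CLS]", "today", "is", "not", "so", "bad", "[SEP]",
               "it", "is", "so", "bad", "[SEP]"]
print(wordpieces_to_text(pair_tokens))
print(wordpieces_to_text(pair_tokens, skip_special_tokens=True))
print(wordpieces_to_text(["hugging", "##face"]))
```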

# 3. Fast Tokenizer / Slow Tokenizer

Fast tokenizers are implemented in Rust and are fast; slow tokenizers are implemented in pure Python and are slower.

In [15]:
fast_tokenizer = AutoTokenizer.from_pretrained(model_name)
slow_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

In [16]:
batch_long_texts = [
    "A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers. The “Fast” implementations allows:",
    "The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and “Fast” tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository). They both rely on PreTrainedTokenizerBase that contains the common methods, and SpecialTokensMixin.",
    "BatchEncoding holds the output of the PreTrainedTokenizerBase’s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…). When the tokenizer is a “Fast” tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).",
    "Handle all the shared methods for tokenization and special tokens as well as methods downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.",
]
# batch size = 4 * 8 = 32
batch_long_texts = batch_long_texts * 8

In [17]:
%%timeit
fast_tokenizer(batch_long_texts)

1.94 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [18]:
%%timeit
slow_tokenizer(batch_long_texts)

42 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


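The comparison above shows that in batch mode the fast tokenizer is roughly 20 times faster than the slow tokenizer (about 1.94 ms versus 42 ms per call).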

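# 4. `offset_mapping` and `word_ids`
Fast tokenizers can return some additional information:

* `offset_mapping`: the character-level start and end index of each token in the original input string
* `word_ids`: the index of the word in the original input that each token belongs to

This is especially useful for tasks such as NER or QA.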

<div align="left">
  <img src="./assets/tokenizer.drawio.svg" width="660"/> </div>

In [19]:
inputs = fast_tokenizer(
    "In the big big world, I have a big dreamming", return_offsets_mapping=True
)
inputs

{'input_ids': [101, 1999, 1996, 2502, 2502, 2088, 1010, 1045, 2031, 1037, 2502, 3959, 6562, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 2), (3, 6), (7, 10), (11, 14), (15, 20), (20, 21), (22, 23), (24, 28), (29, 30), (31, 34), (35, 40), (40, 44), (0, 0)]}

In [20]:
print(inputs.offset_mapping)

[(0, 0), (0, 2), (3, 6), (7, 10), (11, 14), (15, 20), (20, 21), (22, 23), (24, 28), (29, 30), (31, 34), (35, 40), (40, 44), (0, 0)]


In [21]:
print(inputs.word_ids())

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, None]
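
As a small illustration of why these two outputs matter for NER or QA, the sketch below merges the sub-token offsets of each word into one character span and slices the original string with it. The `offset_mapping` and `word_ids` values are copied from the outputs above; `word_spans` is a hypothetical helper written for this example.

```python
# Recover word-level character spans from token offsets and word ids.
text = "In the big big world, I have a big dreamming"
offset_mapping = [(0, 0), (0, 2), (3, 6), (7, 10), (11, 14), (15, 20), (20, 21),
                  (22, 23), (24, 28), (29, 30), (31, 34), (35, 40), (40, 44), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, None]


def word_spans(offsets, word_ids):
    """Merge the offsets of all tokens that share a word id."""
    spans = {}
    for (start, end), wid in zip(offsets, word_ids):
        if wid is None:  # special tokens such as [CLS] and [SEP]
            continue
        if wid in spans:
            old_start, old_end = spans[wid]
            spans[wid] = (min(old_start, start), max(old_end, end))
        else:
            spans[wid] = (start, end)
    return spans


spans = word_spans(offset_mapping, word_ids)
for wid, (start, end) in spans.items():
    print(wid, (start, end), repr(text[start:end]))
```

Word 10 is split into two tokens with offsets `(35, 40)` and `(40, 44)`; merging them yields the span `(35, 44)`, which covers the whole word `dreamming` in the input.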
