# How the tokenizer handles an enormous number of words

In [15]:
s1 = 'albums sold 124443286539 copies'
s2 = 'technically perfect, melodically correct'
s3 = 'featuring a previously unheard track'
s4 = 'bestselling music artist'
s5 = 's1 d1 o1 and o2'
s6 = 'asbofwheohwbeif'

## Instantiating the tokenizer

In [16]:
from transformers import BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

## vocab

- tokenizer.vocab (token => id), tokenizer.ids_to_tokens (id => token)
- len(tokenizer.vocab) == 30522

In [17]:
len(tokenizer.vocab)

30522

In [18]:
tokenizer.vocab['[UNK]']

100
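
Since the vocab is an ordinary token-to-id dictionary, a reverse lookup is just the inverted dictionary. The sketch below is a minimal illustration, not the library's implementation: `toy_vocab` and `toy_ids_to_tokens` are hypothetical names, and the entries are a handful of token/id pairs taken from the outputs in this notebook.

```python
# A small slice of the vocab, using token/id pairs that appear in this notebook.
toy_vocab = {'[UNK]': 100, 'as': 2004, '##bo': 5092, '##f': 2546, '##w': 2860}

# Invert token => id into id => token.
toy_ids_to_tokens = {token_id: token for token, token_id in toy_vocab.items()}

print(toy_ids_to_tokens[2004])

# A token missing from the vocab falls back to the [UNK] id.
print(toy_vocab.get('zzz', toy_vocab['[UNK]']))
```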

In [19]:
tokenizer.tokenize('bestselling')

['best', '##sell', '##ing']

In [20]:
# Count the continuation pieces, i.e. vocab entries that start with '##'.
cnt_sharp = sum(token.startswith('##') for token in tokenizer.vocab)
print(cnt_sharp)

5828


## Token test

In [22]:
inputs = tokenizer(s6)
inputs

{'input_ids': [101, 2004, 5092, 2546, 2860, 5369, 11631, 2860, 19205, 2546, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [27]:
tokens = tokenizer.tokenize(s6)
tokens

['as', '##bo', '##f', '##w', '##he', '##oh', '##w', '##bei', '##f']

In [29]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[2004, 5092, 2546, 2860, 5369, 11631, 2860, 19205, 2546]
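
This list differs from the `input_ids` returned by `tokenizer(s6)` only by a leading 101 and a trailing 102, which are the ids of [CLS] and [SEP] in bert-base-uncased. The sketch below rebuilds that dictionary by hand for the single-sentence case. `wrap_single_sentence` is a hypothetical helper written for illustration, not a function from transformers.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in bert-base-uncased


def wrap_single_sentence(piece_ids):
    """Hypothetical helper: add [CLS]/[SEP] and build the two companion lists."""
    input_ids = [CLS_ID] + list(piece_ids) + [SEP_ID]
    return {
        'input_ids': input_ids,
        # One sentence only, so every position belongs to segment 0.
        'token_type_ids': [0] * len(input_ids),
        # No padding, so every position is attended to.
        'attention_mask': [1] * len(input_ids),
    }


piece_ids = [2004, 5092, 2546, 2860, 5369, 11631, 2860, 19205, 2546]
inputs_sketch = wrap_single_sentence(piece_ids)
print(inputs_sketch)
```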

In [32]:
word = tokenizer.decode(ids)
word

'asbofwheohwbeif'

## Summary

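- The tokenizer rarely maps a word to [UNK] (id 100); even a made-up string such as s6 is split into known word pieces
- tokenize, encode, and decode all work off the same vocabulary
  - tokenize: word => token(s), mapping the word onto keys of vocab as far as possible
  - encode: token => id
  - decode: id => token => word
    - Encoding is not the end of the pipeline: decode must also restore the ids faithfully, matching the input word as closely as possible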
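
The behaviour summarized above can be reproduced with a greedy longest-match-first split, which is the general idea behind WordPiece tokenization. The code below is a simplified sketch under stated assumptions, not the transformers implementation: `wordpiece_tokenize`, `decode_pieces`, and `toy_vocab` are hypothetical names, and the toy vocab and its ids are made up for illustration. It also suggests why [UNK] is rare: as long as the vocab holds short pieces that cover a word's characters, some split can be found.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matches at this position: the whole word becomes [UNK].
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


def decode_pieces(tokens):
    """Glue ## pieces back onto the piece before them."""
    return ' '.join(tokens).replace(' ##', '')


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocab or ids.
toy_vocab = {'[UNK]': 0, 'best': 1, '##sell': 2, '##ing': 3,
             'as': 4, '##b': 5, '##o': 6, '##f': 7}

pieces = wordpiece_tokenize('bestselling', toy_vocab)   # tokenize: word => tokens
piece_ids = [toy_vocab[p] for p in pieces]              # encode: token => id
print(pieces, piece_ids, decode_pieces(pieces))         # decode: tokens => word

print(wordpiece_tokenize('asbof', toy_vocab))  # falls back to short pieces
print(wordpiece_tokenize('xyz', toy_vocab))    # no piece covers 'x', so [UNK]
```

In this toy vocab only a string with a character that no piece covers ends up as [UNK]; everything else is broken into progressively shorter pieces, and joining the pieces recovers the original word.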