# Tokenizer

The `transformers` library provides a family of tokenizer objects. This page reviews their main properties.

In [4]:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")

## Vocabulary

To get the vocabulary of the tokenizer, use the `get_vocab` method: it returns a dict that maps each available token to its id.

---

The following cell shows a small subset of the vocabulary of the tokenizer loaded above.

In [20]:
vocab: dict[str, int] = tokenizer.get_vocab()
dict(list(vocab.items())[:10])

{'##úl': 26994,
 'Michaels': 19108,
 'Sculpture': 19477,
 'notoriety': 26002,
 '##kov': 7498,
 '##grating': 21889,
 '##¹': 28173,
 'Manny': 17381,
 'towers': 8873,
 '##gles': 15657}
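
Since `get_vocab` returns a plain dict, the usual dict operations apply to it. The sketch below inverts a few entries copied from the output above to look up a token by its id, and collects the tokens that start with `##`. It assumes the usual BERT WordPiece convention that `##` marks a piece continuing a word; the names `vocab_sample`, `id_to_token` and `continuation_tokens` are illustrative only.

```python
# A few token-to-id pairs copied from the output above.
vocab_sample: dict[str, int] = {
    "Michaels": 19108,
    "Sculpture": 19477,
    "notoriety": 26002,
    "##kov": 7498,
    "towers": 8873,
}

# Invert the mapping so that a token can be looked up by its id.
id_to_token: dict[int, str] = {idx: tok for tok, idx in vocab_sample.items()}

# Assumption: tokens starting with "##" continue a word instead of starting one.
continuation_tokens = [tok for tok in vocab_sample if tok.startswith("##")]

print(id_to_token[8873])
print(continuation_tokens)
```

The same dict comprehension should work on the full result of `tokenizer.get_vocab()`.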

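## Transformations

The cells below show the basic conversions: `tokenize` splits text into tokens, `convert_tokens_to_ids` maps tokens to their ids, and `decode` turns a list of ids back into text.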

In [10]:
tokens_list = tokenizer.tokenize("typical tokinezation example")
tokens_list

['typical', 'to', '##kin', '##ez', '##ation', 'example']

In [13]:
ids_list = tokenizer.convert_tokens_to_ids(tokens_list)
ids_list

[4701, 1106, 4314, 6409, 1891, 1859]

In [16]:
tokenizer.decode([3743, 1834, 7321])

'Instead needed thereafter'
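
To make the subword split above less of a black box, here is a simplified sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenization as used by BERT models. It is not the library's implementation: the real tokenizer also handles punctuation and text normalization. The toy vocabulary contains only the six tokens and ids from the outputs above, and `split_word` and `toy_tokenize` are hypothetical helper names.

```python
# Toy vocabulary: only the six tokens and ids shown in the outputs above.
toy_vocab: dict[str, int] = {
    "typical": 4701, "to": 1106, "##kin": 4314,
    "##ez": 6409, "##ation": 1891, "example": 1859,
}


def split_word(word: str, vocab: dict[str, int]) -> list[str]:
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        candidate = ""
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(candidate)
        start = end
    return pieces


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[str]:
    """Split on whitespace, then split every word into subword pieces."""
    return [piece for word in text.split() for piece in split_word(word, vocab)]


tokens = toy_tokenize("typical tokinezation example", toy_vocab)
ids = [toy_vocab[token] for token in tokens]
# Glue "##" pieces back onto the preceding piece to restore the words.
restored = " ".join(tokens).replace(" ##", "")

print(tokens)
print(ids)
print(restored)
```

With this toy vocabulary the sketch reproduces the tokens and ids shown above, and joining the pieces restores the original string, misspelling included.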

## Special tokens

Typically, tokenizers expose `pad_token`, `eos_token`, `bos_token` and `unk_token` attributes that show which special tokens the tokenizer uses for padding, end of sequence, beginning of sequence and unknown input.

---

The next cell shows the special tokens of the tokenizer we loaded earlier; `None` means that the tokenizer does not define that token:

In [5]:
tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token, tokenizer.unk_token

('[PAD]', None, None, '[UNK]')
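
To illustrate what the padding and unknown tokens are for, here is a minimal sketch that maps out-of-vocabulary tokens to the unknown id and pads a batch of id sequences to a common length. The ids `0` and `100` for `[PAD]` and `[UNK]` are assumptions made for the example (the output above shows only the token strings), and `to_ids` and `pad_batch` are hypothetical helpers, not library functions.

```python
# Assumed ids for illustration; the cell above shows only the token strings.
PAD_ID = 0    # hypothetical id of "[PAD]"
UNK_ID = 100  # hypothetical id of "[UNK]"

toy_vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "typical": 4701, "example": 1859}


def to_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def pad_batch(batch: list[list[int]], pad_id: int) -> list[list[int]]:
    """Right-pad every id sequence to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


batch = [to_ids(["typical", "example"], toy_vocab), to_ids(["qwerty"], toy_vocab)]
padded = pad_batch(batch, PAD_ID)
print(padded)
```

Here `"qwerty"` is not in the toy vocabulary, so it falls back to the unknown id, and the shorter sequence is filled with the padding id.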