# The Tokenizer Class

In this notebook we take a closer look at the `Tokenizer` class. We will not cover every method and attribute, only those likely to be useful with a tokenizer loaded through the `from_pretrained` method.

All pre-trained tokenizers are instances of the `PreTrainedTokenizer` base class.

In [1]:
from transformers import BertTokenizer, PreTrainedTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(isinstance(tokenizer, PreTrainedTokenizer))

True


Tokenizers allow:

1. Tokenizing, i.e., converting a string into the individual text tokens.
2. Numericalizing, i.e., converting the individual text tokens into integer IDs.
3. Performing the opposite conversion, i.e., from ID to text.
4. Adding new tokens to the vocabulary of the tokenizer.
5. Managing special tokens.

## Vocabulary

The BERT tokenizer ships with a vocabulary of ~30k tokens.

In [2]:
tokenizer.vocab_size

30522

When a word is not present in the vocabulary, it is split into sub-word units. The `##` prefix marks a piece that continues the previous one.

In [3]:
[tokenizer.tokenize(s) for s in ['Morphogranitic', 'Spliceosomic', 'Zlyxolotl']]

[['mor', '##ph', '##og', '##rani', '##tic'],
 ['sp', '##lice', '##oso', '##mic'],
 ['z', '##ly', '##x', '##olo', '##tl']]
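The splitting behaviour above can be illustrated with a minimal sketch of greedy longest-match-first sub-word splitting, which is the general idea behind WordPiece. This is not the library's implementation: the function name `wordpiece_split` and the tiny `toy_vocab` are hypothetical and were chosen only to reproduce one of the splits shown above.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one has ~30k entries).
toy_vocab = {"sp", "##lice", "##oso", "##mic", "token", "##ize"}

print(wordpiece_split("spliceosomic", toy_vocab))
print(wordpiece_split("tokenize", toy_vocab))
print(wordpiece_split("zebra", toy_vocab))
```

With this toy vocabulary the first call yields `['sp', '##lice', '##oso', '##mic']`, while a word with no matching pieces falls back to the unknown token.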

## Special Tokens

The `add_special_tokens` method is used when the special tokens are not already in the vocabulary. For example, the GPT-2 tokenizer does not define a `cls_token`. The method's argument is a dictionary whose keys must be in `['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']`.

In almost all practical cases, we can ignore this method.

## tokenize vs. encode vs. encode_plus vs. batch_encode_plus vs. prepare_for_model

### tokenize

`tokenizer.tokenize` converts a string into a sequence of string tokens. It splits the sentence into words or sub-word tokens.

In [4]:
tokenizer.tokenize('What does tokenize do?')

['what', 'does', 'token', '##ize', 'do', '?']

### encode

`encode` converts a string into a sequence of integer IDs. It is similar, but not identical, to `self.convert_tokens_to_ids(self.tokenize(text))`.

In [5]:
my_string = 'What does encode do?'
tokenizer.encode(my_string)

[101, 2054, 2515, 4372, 16044, 2079, 1029, 102]

In [6]:
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(my_string)))
print(tokenizer.encode(my_string, add_special_tokens=False))

[2054, 2515, 4372, 16044, 2079, 1029]
[2054, 2515, 4372, 16044, 2079, 1029]


We can see that `encode` by default adds the special tokens at the beginning and at the end of the sequence. It can do more than this: it can truncate the sequence, pad it to the maximum length, and return tensors instead of lists (`'pt'` for PyTorch and `'tf'` for TensorFlow).

In [7]:
print(tokenizer.encode(my_string, max_length=4))
print(tokenizer.encode(my_string, max_length=16, pad_to_max_length=True))
print(type(tokenizer.encode(my_string, return_tensors='pt')))

[101, 2054, 2515, 102]
[101, 2054, 2515, 4372, 16044, 2079, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0]
<class 'torch.Tensor'>
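To make the three steps concrete (wrap in special tokens, truncate, pad), here is a plain-Python sketch that reproduces the lists printed above. The function `encode_sketch` and its `pad` argument are hypothetical names, not part of the library; the IDs 101, 102 and 0 are taken from the outputs in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs seen in the outputs above


def encode_sketch(token_ids, max_length=None, pad=False):
    """Wrap token IDs in [CLS]/[SEP], truncate, and optionally pad."""
    ids = list(token_ids)
    if max_length is not None:
        # Reserve two slots for the special tokens before truncating.
        ids = ids[: max(max_length - 2, 0)]
    ids = [CLS_ID] + ids + [SEP_ID]
    if pad and max_length is not None:
        ids += [PAD_ID] * (max_length - len(ids))
    return ids


raw_ids = [2054, 2515, 4372, 16044, 2079, 1029]  # 'What does encode do?'
print(encode_sketch(raw_ids))
print(encode_sketch(raw_ids, max_length=4))
print(encode_sketch(raw_ids, max_length=16, pad=True))
```

The point of the sketch is that truncation has to account for the two special tokens, which is why `max_length=4` keeps only two of the original IDs.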


### encode_plus

`encode_plus` does even more. It returns a dictionary structured as follows:

```
{
  input_ids: list[int],
  token_type_ids: list[int] if return_token_type_ids is True (default)
  attention_mask: list[int] if return_attention_mask is True (default)
  overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True
  num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True
  special_tokens_mask: list[int] if ``add_special_tokens`` is set to ``True`` and return_special_tokens_mask is True
}
```

In [8]:
tokenizer.encode_plus(my_string)

{'input_ids': [101, 2054, 2515, 4372, 16044, 2079, 1029, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
tokenizer.encode_plus(my_string, max_length=16, pad_to_max_length=True)

{'input_ids': [101,
  2054,
  2515,
  4372,
  16044,
  2079,
  1029,
  102,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]}

In [10]:
tokenizer.encode_plus(my_string, max_length=4, return_overflowing_tokens=True)

{'overflowing_tokens': [4372, 16044, 2079, 1029],
 'num_truncated_tokens': 4,
 'input_ids': [101, 2054, 2515, 102],
 'token_type_ids': [0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1]}
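The bookkeeping behind `overflowing_tokens` and `num_truncated_tokens` can be sketched in a few lines. This is only an illustration that matches the output above; `truncate_with_overflow` and its `num_special` parameter are hypothetical names, and the sketch assumes a single sequence wrapped in `[CLS]` (101) and `[SEP]` (102).

```python
def truncate_with_overflow(token_ids, max_length, num_special=2):
    """Split token IDs into the part that fits and the part that overflows."""
    budget = max(max_length - num_special, 0)  # room left after [CLS]/[SEP]
    kept, overflow = token_ids[:budget], token_ids[budget:]
    return {
        "input_ids": [101] + kept + [102],
        "overflowing_tokens": overflow,
        "num_truncated_tokens": len(overflow),
    }


raw_ids = [2054, 2515, 4372, 16044, 2079, 1029]  # 'What does encode do?'
result = truncate_with_overflow(raw_ids, max_length=4)
print(result)
```

Keeping the overflow around is useful when a long text has to be processed in several chunks rather than simply cut off.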

### batch_encode_plus

`tokenizer.batch_encode_plus` operates on a list of strings (a batch of inputs) and performs many, but not all, of the operations described above. Importantly, it can return tensors and attention masks and perform truncation, but it does **not** perform padding. An obvious question is: how can it build attention masks if there is no padding? Simple: it just returns a list of 1s for each sequence. Note also that the `[CLS]` and `[SEP]` IDs (101 and 102) are absent from the output below.

In [11]:
my_batch = ['One sentence to rule them all!',
            'My socks can sing like Abba.',
            'Why am I always hungry?'
           ]
tokenizer.batch_encode_plus(my_batch, max_length=100)

{'input_ids': [[2028, 6251, 2000, 3627, 2068, 2035, 999],
  [2026, 14829, 2064, 6170, 2066, 11113, 3676, 1012],
  [2339, 2572, 1045, 2467, 7501, 1029]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1]]}

## Shall we encode by hand?

The main problem with `batch_encode_plus` is that it cannot pad the sequences, and without padding the attention mask is useless. Instead, we can process each sentence with `encode_plus`. There is probably no point in returning tensors at this stage, since we still have to extract the input IDs and the attention masks from each dictionary.

In [12]:
encoded_batch = [tokenizer.encode_plus(s, max_length=16, 
                       pad_to_max_length=True) for s in my_batch]
print(encoded_batch)

[{'input_ids': [101, 2028, 6251, 2000, 3627, 2068, 2035, 999, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}, {'input_ids': [101, 2026, 14829, 2064, 6170, 2066, 11113, 3676, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}, {'input_ids': [101, 2339, 2572, 1045, 2467, 7501, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]}]


In [13]:
input_ids = [x['input_ids'] for x in encoded_batch]
attn_mask = [x['attention_mask'] for x in encoded_batch]
print(input_ids)
print(attn_mask)

[[101, 2028, 6251, 2000, 3627, 2068, 2035, 999, 102, 0, 0, 0, 0, 0, 0, 0], [101, 2026, 14829, 2064, 6170, 2066, 11113, 3676, 1012, 102, 0, 0, 0, 0, 0, 0], [101, 2339, 2572, 1045, 2467, 7501, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0]]
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]


With this approach we can easily build batches of input IDs and attention masks, which can then be converted into PyTorch tensors.

In [14]:
print(torch.tensor(input_ids))
print(torch.tensor(attn_mask))

tensor([[  101,  2028,  6251,  2000,  3627,  2068,  2035,   999,   102,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2026, 14829,  2064,  6170,  2066, 11113,  3676,  1012,   102,
             0,     0,     0,     0,     0,     0],
        [  101,  2339,  2572,  1045,  2467,  7501,  1029,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
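Padding every sentence to a fixed `max_length=16` spends positions on `[PAD]` tokens that no sentence in the batch needs. One possible alternative, sketched below in plain Python, pads only up to the longest sequence in the batch and builds the attention mask at the same time. The helper `pad_batch` is a hypothetical name, not a library function, and the pad ID of 0 is taken from the outputs above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded IDs (with [CLS]=101 and [SEP]=102) for the three sentences above.
batch = [
    [101, 2028, 6251, 2000, 3627, 2068, 2035, 999, 102],
    [101, 2026, 14829, 2064, 6170, 2066, 11113, 3676, 1012, 102],
    [101, 2339, 2572, 1045, 2467, 7501, 1029, 102],
]
ids, mask = pad_batch(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Because all rows end up with the same length (10 here instead of 16), the two lists can be passed to `torch.tensor` exactly as in the cell above.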


## From IDs to strings

The reverse mapping produces a string starting from a list of integer IDs. Let's see how.

The `convert_ids_to_tokens` method returns the individual (sub)word tokens associated with the indices.

In [15]:
tokenizer.convert_ids_to_tokens(input_ids[1])

['[CLS]',
 'my',
 'socks',
 'can',
 'sing',
 'like',
 'ab',
 '##ba',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

We can skip the special tokens during this conversion, but we still have sub-word tokens.

In [16]:
tokenizer.convert_ids_to_tokens(input_ids[1], skip_special_tokens=True)

['my', 'socks', 'can', 'sing', 'like', 'ab', '##ba', '.']

The `convert_tokens_to_string` method does what the name suggests, and puts together the sub-word tokens.

In [17]:
tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[1], skip_special_tokens=True))

'my socks can sing like abba .'

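Almost the same result can be obtained with the `decode` method, which, by default, also returns the special tokens unless `skip_special_tokens=True` is passed. Note that `decode` additionally removes the space before the final period.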

In [18]:
tokenizer.decode(input_ids[1], skip_special_tokens=True)

'my socks can sing like abba.'
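The difference between the last two outputs (`'abba .'` versus `'abba.'`) comes down to two string operations: merging the `##` pieces and tidying the spaces around punctuation. The sketch below mimics both in plain Python. The function names are hypothetical, and the list of punctuation marks is an assumption that covers only a few common cases, not the library's full clean-up rules.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding token."""
    return " ".join(tokens).replace(" ##", "").strip()


def clean_up_spaces(text):
    """Remove the space that joining leaves before common punctuation."""
    for mark in (".", "?", "!", ","):
        text = text.replace(" " + mark, mark)
    return text


tokens = ["my", "socks", "can", "sing", "like", "ab", "##ba", "."]
joined = join_wordpieces(tokens)
print(joined)
print(clean_up_spaces(joined))
```

The first print corresponds to the `convert_tokens_to_string` output and the second to the `decode` output shown above.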

## prepare_for_model

`prepare_for_model` is a strange function. It takes a sequence of input IDs (or a pair of sequences), and it can add special tokens, truncate, pad, and return an attention mask and tensors. However, the input has to be already numericalized.

In [24]:
encoded_string = tokenizer.encode(my_string, add_special_tokens=True)
print(encoded_string)
tokenizer.prepare_for_model(encoded_string, add_special_tokens=False, max_length=16, 
                            pad_to_max_length=True, return_attention_mask=True,
                            return_tensors='pt')

[101, 2054, 2515, 4372, 16044, 2079, 1029, 102]


{'input_ids': tensor([[  101,  2054,  2515,  4372, 16044,  2079,  1029,   102,     0,     0,
              0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

However, it does not work with batches, so there is no clear advantage, at least at first glance, over the previous approach.