## Tokenization


For transformers, a fundamental step is converting the input text into a sequence of tokens. This is the job of a tokenizer. Several tokenization techniques exist (e.g., Byte-Pair Encoding).

Tokenizers need to be trained on a corpus (e.g., to learn which words and word pieces occur most often). However, the Hugging Face `transformers` library provides pre-trained tokenizers that can be used out of the box.
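To make "training a tokenizer" more concrete, here is a minimal sketch of how Byte-Pair Encoding could learn its merge rules: repeatedly find the most frequent pair of adjacent symbols and fuse it into a new symbol. The function name `learn_bpe_merges` and the tiny word list are made up for illustration. Real implementations handle word boundaries, byte-level fallbacks and much larger corpora, and T5's own tokenizer is based on SentencePiece rather than on this exact procedure.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE training: return the list of learned merges (pairs of symbols)."""
    # Each distinct word is a tuple of symbols (initially characters) with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent symbols occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties are broken by first occurrence.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

toy_words = ["low", "low", "lower", "lowest", "newest", "widest"]
merges = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
```

Each learned merge becomes a new vocabulary entry, so frequent character sequences end up as single tokens while rare words are split into several pieces.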

Generally, each model has its own tokenizer. For example, the `BertTokenizer` is used for BERT models, and the `GPT2Tokenizer` is used for GPT-2 models. 


Since we will be using T5 for this exercise, we could use the `T5Tokenizer` class directly. However, Hugging Face provides a generic `AutoTokenizer` class that loads the appropriate tokenizer for a given model name. Note that the returned object is an instance of the model-specific tokenizer class, not of `AutoTokenizer` itself!

In [76]:
from transformers import AutoModel, AutoTokenizer

model_name = "google-t5/t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
print(type(tokenizer))

<class 'transformers.models.t5.tokenization_t5_fast.T5TokenizerFast'>




### Encoding/decoding

To tokenize a string, we pass it to the tokenizer. The tokenizer implements the `__call__` method, so we can call it directly, like a function.

Note that the output is a dictionary-like object, which generally has the following keys:

- `input_ids`: the token IDs of the input text (a plain Python list by default).
- `attention_mask`: a mask that marks which positions hold real tokens (1) and which hold padding (0). For now, we can ignore it (there is no padding). It becomes useful when we encode batches of sentences of different lengths at the same time.

In [77]:
sentence = "hello, this is a sentence!"
tokens = tokenizer(sentence)
print(tokens)

{'input_ids': [21820, 6, 48, 19, 3, 9, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


We can reverse the encoding operation (i.e., going from token IDs to strings) by using the `decode` method of the tokenizer.

In [78]:
tokenizer.decode(tokens["input_ids"])

'hello, this is a sentence!</s>'

Note the extra `</s>` at the end of the string. This is a special token that marks the end of the input text (EOS), and the tokenizer adds it automatically when encoding.

To inspect the mapping between tokens and token IDs, we can retrieve the tokenizer's vocabulary with `.get_vocab()`, which returns a dictionary from tokens to their IDs.

For convenience, we also build a reverse vocabulary (i.e., from IDs to tokens).

In [79]:
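import random

# Mapping from tokens to IDs, and the reverse mapping from IDs to tokens
vocabulary = tokenizer.get_vocab()
reverse_vocab = {v: k for k, v in vocabulary.items()}

# Shuffle the tokens so that we can sample a few at random
vocab_keys = list(vocabulary.keys())
random.shuffle(vocab_keys)

# Show 10 random tokens from the vocabulary
{k: vocabulary[k] for k in vocab_keys[:10]}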

{'▁affair': 18431,
 '▁detection': 10664,
 '▁Krankheit': 19932,
 '▁stands': 5024,
 '▁1976': 16164,
 '▁Atmosphäre': 24071,
 '▁attitudes': 18537,
 '▁Expedition': 31578,
 '▁Excellence': 17929,
 '▁restroom': 27381}

Let's see what the token ID for the special token `</s>` is.

In [80]:
vocabulary["</s>"]

1

And indeed, note that our `tokens` has a 1 showing up at the end!

In [81]:
tokens["input_ids"]

[21820, 6, 48, 19, 3, 9, 7142, 55, 1]

We can include special tokens inside of the strings themselves. For instance:

In [82]:
tokenizer("hello!</s></s>")

{'input_ids': [21820, 55, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1]}

Here, we have 2 `</s>` tokens (the ones we specified), plus an additional one that was added by the tokenizer.

Instead of getting token IDs, we can also look at the tokens themselves (as strings). For this, we use the `tokenize()` method.

In [83]:
tokenizer.tokenize(sentence)

['▁hello', ',', '▁this', '▁is', '▁', 'a', '▁sentence', '!']

What's up with those `▁`? Note that this is a dedicated character (U+2581), not an ordinary underscore. It marks a token that comes right after a space, i.e., one that starts a new word. This lets us tell whether a token begins a word or continues one in the middle.

In [84]:
print(tokenizer.tokenize("hello    ,world"))
print(tokenizer.tokenize("hello    , world"))

['▁hello', '▁', ',', 'world']
['▁hello', '▁', ',', '▁world']


In the above case, `▁hello` is the token for the word "hello" at the start of a word (here, also the start of the sentence). The word "world", instead, is mapped to two different tokens (`world` and `▁world`), depending on whether there is a space before it or not.

Notice also how multiple spaces are compacted into a single one!
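The behavior seen so far can be approximated in a few lines of plain Python. This is only a rough sketch, assuming that the tokenizer collapses runs of whitespace and marks every word start with `▁`; the real normalization has more rules. The helper names `to_marked_form` and `from_pieces` are made up for illustration.

```python
WORD_START = "\u2581"  # the "▁" character (U+2581), not an ASCII underscore

def to_marked_form(text):
    """Collapse whitespace runs and mark each word start with the marker."""
    normalized = " ".join(text.split())
    return WORD_START + normalized.replace(" ", WORD_START)

def from_pieces(pieces):
    """Concatenate token strings and turn the markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()

print(to_marked_form("hello    ,world"))
print(to_marked_form("hello    , world"))
print(from_pieces(["▁hello", ",", "▁this", "▁is", "▁", "a", "▁sentence", "!"]))
```

Concatenating the token lists printed by `tokenize()` above gives the same marked strings, which is why the original spacing (up to collapsed repetitions) can be recovered from the tokens alone.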

These are all tokenizer-specific details. The tokenizer is responsible for deciding how to tokenize the input text. You may observe different behaviors for different tokenizers.
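To tie tokens and token IDs together, here is a small sketch of the lookups involved, using a handful of vocabulary entries hand-copied from the outputs in this section. The names `toy_vocab`, `pieces_to_ids` and `ids_to_pieces` are hypothetical, and the real tokenizer does considerably more than a dictionary lookup (e.g., it also decides how to split the text into pieces).

```python
# A few T5 vocabulary entries, hand-copied from the outputs shown in this section.
toy_vocab = {
    "▁hello": 21820, ",": 6, "▁this": 48, "▁is": 19,
    "▁": 3, "a": 9, "▁sentence": 7142, "!": 55, "</s>": 1,
}
# Reverse mapping, from IDs back to token strings.
toy_reverse_vocab = {idx: piece for piece, idx in toy_vocab.items()}

def pieces_to_ids(pieces, eos_piece="</s>"):
    """Look up each piece and append the end-of-sequence ID, as seen for T5."""
    return [toy_vocab[p] for p in pieces] + [toy_vocab[eos_piece]]

def ids_to_pieces(ids):
    """Map token IDs back to their token strings."""
    return [toy_reverse_vocab[i] for i in ids]

pieces = ["▁hello", ",", "▁this", "▁is", "▁", "a", "▁sentence", "!"]
ids = pieces_to_ids(pieces)
print(ids)
print(ids_to_pieces(ids))
```

With the actual tokenizer, the `convert_tokens_to_ids()` and `convert_ids_to_tokens()` methods perform these lookups against the full vocabulary.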

### Special tokens

Each model typically has its own special tokens. Some are necessary for the training process, while others can be beneficial at inference time.

The tokenizer exposes dedicated attributes for accessing these special tokens. Some examples are:

- `pad_token` is the token used for padding (as discussed later),
- `bos_token` and `eos_token` mark the beginning and the end of the input text, respectively,
- `mask_token` is used for masking tokens during training (e.g., for the masked LM task, with BERT),
- `sep_token` is used to separate sentences in the input text (e.g., next sentence prediction, with BERT),
- `cls_token` is used to indicate the beginning of the input text (e.g., for classification tasks, with BERT),
- `unk_token` is used to indicate unknown tokens (i.e., tokens that are not in the vocabulary).

Of course, not every tokenizer uses every special token. The attributes for unused tokens are set to `None`.

For instance, T5 has EOS and PAD tokens, but no BOS token.

In [99]:
tokenizer.eos_token, tokenizer.pad_token, tokenizer.bos_token

('</s>', '<pad>', None)

Appending the `_id` suffix to an attribute name gives the corresponding token ID (`None` if the token is not defined).

In [101]:
tokenizer.eos_token_id, tokenizer.pad_token_id, tokenizer.bos_token_id

(1, 0, None)

### Batch encoding/decoding

In general (especially at training time) we will want to encode multiple sentences at once (e.g., an entire batch of sentences).

We can pass a list of sentences to be encoded to the tokenizer. 

In [106]:
sentences = [
    "this is the first sentence",
    "instead, this is the second sequence!"
]
tokens = tokenizer(sentences)

for tok in tokens["input_ids"]:
    print(tok)

[48, 19, 8, 166, 7142, 1]
[1446, 6, 48, 19, 8, 511, 5932, 55, 1]


Sentences of different lengths generally produce different numbers of tokens. However, the tensors used by the model must be rectangular: every sequence in a batch needs to have the same length.

To achieve this, we use padding: every sentence is padded to the length of the longest one in the batch by appending pad tokens (`<pad>`, for T5).

However, since the pad tokens are not part of the input text, we need to let the model know that it should not pay attention to them. That's what the `attention_mask` is for! 

In [107]:
tokens = tokenizer(sentences, padding=True)

for tok, att in zip(tokens["input_ids"], tokens["attention_mask"]):
    print(tok, att)

[48, 19, 8, 166, 7142, 1, 0, 0, 0] [1, 1, 1, 1, 1, 1, 0, 0, 0]
[1446, 6, 48, 19, 8, 511, 5932, 55, 1] [1, 1, 1, 1, 1, 1, 1, 1, 1]


The first sentence is padded with 0's to match the length of the second sentence (remember, 0 is the ID of `<pad>`!).

The attention mask for the first sentence also contains 0's for the padding tokens: the model will ignore them when processing the input text.
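What `padding=True` does can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `pad_batch` is made up, and it assumes padding is appended on the right with pad ID 0, as observed for T5 above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every ID list to the longest length and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([
    [48, 19, 8, 166, 7142, 1],
    [1446, 6, 48, 19, 8, 511, 5932, 55, 1],
])
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Feeding it the two unpadded ID lists from the earlier cell reproduces the padded IDs and masks printed by the tokenizer.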

Now that all sentences have the same length, we can stack them into a single tensor. Luckily, the tokenizer can do this for us: we just need to ask.


In [111]:
# note: we pass return_tensors="pt" to get PyTorch tensors
# (the library also supports TensorFlow tensors, but we
# don't care about them!)
tokens = tokenizer(sentences, padding=True, return_tensors="pt")
print(tokens["input_ids"])
print(tokens["attention_mask"])

tensor([[  48,   19,    8,  166, 7142,    1,    0,    0,    0],
        [1446,    6,   48,   19,    8,  511, 5932,   55,    1]])
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])


## Model analysis

In [2]:
# Load the pre-trained T5 model (encoder and decoder)
model = AutoModel.from_pretrained(model_name)

In [21]:
# Check whether the encoder and the decoder use the same token embeddings
(model.decoder.embed_tokens.weight == model.encoder.embed_tokens.weight).all()

tensor(True)

In [25]:
import matplotlib.pyplot as plt

# Token embedding matrix as a NumPy array (one row per token ID)
vectors = model.decoder.embed_tokens.weight.detach().cpu().numpy()

In [35]:
# Mean and standard deviation over all entries of the embedding matrix
vectors.mean(), vectors.std()

(0.098358884, 18.803375)

In [38]:
# Average Euclidean (L2) norm of the embedding vectors
(((vectors**2).sum(axis=1))**0.5).mean()

515.54474

In [40]:
# Note: this is the token "dog" without the leading "▁" (i.e., not at a word start)
tokenizer.get_vocab()["dog"]

10169

In [41]:
tokenizer.get_vocab()["cat"]

2138

In [44]:
tokenizer.get_vocab()["pen"]

3208

In [46]:
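# Embedding vectors for the three tokens looked up above
v_dog = vectors[10169]
v_cat = vectors[2138]
v_pen = vectors[3208]

# Cosine similarity between the "dog" and "cat" embeddings
(v_dog * v_cat).sum() / (((v_dog**2).sum() * (v_cat**2).sum())**0.5)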

0.21410335918305048

In [47]:
# Cosine similarity between the "dog" and "pen" embeddings
(v_dog * v_pen).sum() / (((v_dog**2).sum() * (v_pen**2).sum())**0.5)

0.09220335227181164
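The last two expressions compute the cosine similarity between two embedding vectors, i.e., their dot product divided by the product of their Euclidean norms. As a minimal sketch, the same computation can be wrapped in a reusable helper. The name `cosine_similarity` is our own choice, and the version below uses plain Python sequences rather than NumPy arrays, so it runs without any dependencies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

The value ranges from -1 (opposite directions) to 1 (same direction), with 0 for orthogonal vectors. In the cells above, the "dog" embedding scores higher against "cat" than against "pen".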