# Tokenizers
This notebook goes over the most common tokenizer concepts that I am likely to need in my day-to-day work. There is a really good [comprehensive tokenizer tutorial](https://huggingface.co/learn/nlp-course/chapter6/1?fw=pt) that goes over the different tokenization schemes and how to build a new tokenizer from the ground up.

### Summary
The most common way for me to use tokenizers will be -

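```python
from transformers import AutoTokenizer

model_name = "bert-base-cased"  # the checkpoint used in this notebook
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentences: list[str] = ["Using Transformers is easy!", "You shall not pass!"]
# Pad/truncate to a common length so the batch can be returned as tensors
batch = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")
```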

Most of the models use subword tokenizers. Each model comes with its own corresponding tokenizer. This is because different models (even variants of the same core architecture) have different input schemes. The tokenizer abstracts this away. The simplest way of using a tokenizer is to call the object directly (i.e., invoke its `__call__`). This will do the following three things -
  1. Break the sentence up into multiple tokens
  2. Look up the index value of each resulting token from a central vocab
  3. Prepare ancillary data like attention mask, token type indicators, etc. needed by the model
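
To make the three steps concrete for myself, here is a toy sketch in plain Python. The vocab, the ids, and the `wordpiece`/`encode` helpers are all made up for illustration; the real BERT tokenizer has a much larger vocab and more careful pre-tokenization, so treat this as a rough picture of the idea rather than what the library does.

```python
# Toy vocab: an assumption for illustration, NOT the real bert-base-cased vocab.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "At": 4, "##ten": 5, "##tion": 6, "is": 7, "all": 8, "you": 9, "need": 10}


def wordpiece(word: str, vocab: dict[str, int]) -> list[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence: str, vocab: dict[str, int]) -> dict[str, list[int]]:
    # Step 1: break the sentence into tokens (whitespace split, then subwords).
    tokens = ["[CLS]"]
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    # Step 2: look up the index of each token in the vocab.
    input_ids = [vocab[tok] for tok in tokens]
    # Step 3: prepare the ancillary data for a single, unpadded sentence.
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),
        "attention_mask": [1] * len(input_ids),
    }


encoded = encode("Attention is all you need", VOCAB)
print(encoded)
```

The real tokenizer returns a dict with the same three keys, which is what the cells that follow show.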

In [2]:
from transformers import BertTokenizer

In [3]:
# Load the tokenizer corresponding to this particular checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
input = tokenizer("Attention is indeed all you need :-)")
input

{'input_ids': [101, 1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

  * `input_ids` are the vocab indexes of the tokens. Because this is a subword tokenizer, a single word can be split into several tokens, so there are usually more tokens than words.
  * `token_type_ids` indicate which sentence each token belongs to, e.g., if two sentences go into the model, then tokens from the first sentence get a type of `0` and tokens from the second get a type of `1`.
  * `attention_mask` tells the model which tokens to attend to. Positions marked `0` (typically padding) are ignored; positions marked `1` are real tokens.

In [4]:
# Two sentences are part of the same instance
sentence_1 = "Using a Transformer network is simple."
sentence_2 = "Attention is indeed all you need :-)"
tokenizer(sentence_1, sentence_2)

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 119, 102, 1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
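
The layout of the three fields for a sentence pair can be rebuilt by hand, which helped me see where the `0`s and `1`s come from. This is a sketch under assumptions: `build_pair` is a hypothetical helper, the ids are copied from the output above, and using `0` as the padding id is my assumption rather than something shown in this notebook.

```python
# 101 and 102 are the first/last ids in the outputs above; PAD_ID = 0 is an assumption.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_pair(ids_a: list[int], ids_b: list[int], pad_to: int = 0) -> dict[str, list[int]]:
    """Lay out [CLS] A [SEP] B [SEP] and the matching type ids and mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # First segment (including [CLS] and its [SEP]) is type 0, second segment is type 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    # Optional right-padding: padded positions get mask 0 so the model ignores them.
    n_pad = max(0, pad_to - len(input_ids))
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


ids_a = [7993, 170, 13809, 23763, 2443, 1110, 3014, 119]
ids_b = [1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114]
pair = build_pair(ids_a, ids_b, pad_to=24)
print(pair["token_type_ids"])
print(pair["attention_mask"])
```

With `pad_to=24` the last two positions are padding, so they are the only entries with an `attention_mask` of `0`.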

I can save the tokenizer locally by calling `tokenizer.save_pretrained("/local/path")` and load it back later by passing the same path to `from_pretrained`.

To see the individual steps in action, I can use the various `tokenizer` methods.

In [5]:
toks = tokenizer.tokenize("Attention is indeed all you need :-)")
toks

['At', '##ten', '##tion', 'is', 'indeed', 'all', 'you', 'need', ':', '-', ')']

In [6]:
ids = tokenizer.convert_tokens_to_ids(toks)
ids

[1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114]

Going back from token IDs to the full sentence is also pretty easy -

In [7]:
tokenizer.decode(input["input_ids"])

'[CLS] Attention is indeed all you need : - ) [SEP]'

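`[CLS]` and `[SEP]` are two special tokens that this particular BERT model was trained with. `[CLS]` is prepended to every input, and the model's final hidden state at that position is typically used as the aggregate representation for classification tasks. `[SEP]` marks the end of each sentence, so it separates the two sentences in two-sentence inputs.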

In [8]:
tokenizer.decode(tokenizer(sentence_1, sentence_2)["input_ids"])

'[CLS] Using a Transformer network is simple. [SEP] Attention is indeed all you need : - ) [SEP]'

#### Automatic Selection of Tokenizer
Instead of using the concrete tokenizer (`BertTokenizer` in this case), I can use a convenience class called `AutoTokenizer` that will get the right tokenizer based on the model name. This is a pretty standard pattern in HF APIs, where there is a concrete class for a particular model family, and then there are various auto-classes that will automatically choose the correct concrete implementation without the user having to worry about it. In most apps, I'll probably end up using the auto- family of classes.
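
The auto-class idea is basically a lookup from a checkpoint to a concrete class. Below is a toy sketch of that dispatch pattern only. Every name in it (`ToyAutoTokenizer`, `TOY_CONFIGS`, `TOY_REGISTRY`, the toy tokenizer classes) is hypothetical, and HF's real lookup is more involved than this, so this is not how the library is implemented.

```python
class ToyBertTokenizer:
    def __init__(self, name: str) -> None:
        self.name = name


class ToyGPT2Tokenizer:
    def __init__(self, name: str) -> None:
        self.name = name


# Stand-in for the metadata stored with each checkpoint (hypothetical).
TOY_CONFIGS = {
    "bert-base-cased": {"model_type": "bert"},
    "gpt2": {"model_type": "gpt2"},
}

# Maps a model family to the concrete class that handles it.
TOY_REGISTRY = {"bert": ToyBertTokenizer, "gpt2": ToyGPT2Tokenizer}


class ToyAutoTokenizer:
    @classmethod
    def from_pretrained(cls, name: str):
        # Look up the model family, then hand off to the matching concrete class.
        model_type = TOY_CONFIGS[name]["model_type"]
        return TOY_REGISTRY[model_type](name)


tok = ToyAutoTokenizer.from_pretrained("bert-base-cased")
print(type(tok).__name__)
```

The caller only ever names the checkpoint; which concrete class comes back is decided by the registry.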

In [9]:
from transformers import AutoTokenizer

In [10]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(type(tokenizer))
tokenizer("Using a Transformer network is simple")

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>


{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

#### Batching
`AutoTokenizer` will usually select the fast variant of the tokenizer (`BertTokenizerFast` above). Fast tokenizers are especially efficient when operating on a batch instead of individual sentences.

In [11]:
sentences = [
    "Using Transformers is easy!",
    "Attention is indeed all you need :-)"
]
batch = tokenizer(sentences)
batch

{'input_ids': [[101, 7993, 25267, 1110, 3123, 106, 102], [101, 1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [12]:
print(f"Number of instances in the batch - {len(batch['input_ids'])}")
print(f"Number of tokens in the first instance - {len(batch['input_ids'][0])}")
print(f"Number of tokens in the second instance - {len(batch['input_ids'][1])}")

Number of instances in the batch - 2
Number of tokens in the first instance - 7
Number of tokens in the second instance - 13


In [13]:
print(tokenizer.decode(batch["input_ids"][0]))
print(tokenizer.decode(batch["input_ids"][1]))

[CLS] Using Transformers is easy! [SEP]
[CLS] Attention is indeed all you need : - ) [SEP]


In the case where we have two sentences in the same instance, we can batch them by passing two lists to the tokenizer. Each instance in the resulting batch consists of one sentence from each of the input lists, i.e., the inputs are zipped and then tokenized.
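
The "zipped" part is easy to picture in plain Python. This is only a sketch of the pairing step, with a hypothetical `pair_up` helper; it says nothing about how the tokenizer does it internally.

```python
def pair_up(first: list[str], second: list[str]) -> list[tuple[str, str]]:
    """Pair the i-th sentence of one list with the i-th sentence of the other."""
    if len(first) != len(second):
        raise ValueError("both lists must have the same number of sentences")
    return list(zip(first, second))


first = ["Using Transformers is easy!", "Attention is indeed all you need :-)"]
second = ["Its not that hard to use Transformers!", "You don't need anything more than attention :-)"]

for a, b in pair_up(first, second):
    print(f"{a!r} + {b!r}")
```

Each printed pair corresponds to one instance in the batch.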

In [14]:
sentences_1 = [
    "Using Transformers is easy!",
    "Attention is indeed all you need :-)"    
]

sentences_2 = [
    "Its not that hard to use Transformers!",
    "You don't need anything more than attention :-)"
]
batch = tokenizer(sentences_1, sentences_2)
batch

{'input_ids': [[101, 7993, 25267, 1110, 3123, 106, 102, 2098, 1136, 1115, 1662, 1106, 1329, 25267, 106, 102], [101, 1335, 5208, 2116, 1110, 5750, 1155, 1128, 1444, 131, 118, 114, 102, 1192, 1274, 112, 189, 1444, 1625, 1167, 1190, 2209, 131, 118, 114, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [15]:
print(tokenizer.decode(batch["input_ids"][0]))
print(tokenizer.decode(batch["input_ids"][1]))

[CLS] Using Transformers is easy! [SEP] Its not that hard to use Transformers! [SEP]
[CLS] Attention is indeed all you need : - ) [SEP] You don't need anything more than attention : - ) [SEP]


#### Tokens as Tensors
So far we have seen that the `input_ids`, `attention_mask`, etc. are being returned as lists of integers. Before we pass these to a model we'll need to convert them to tensors. We could do this by hand, e.g., `torch.tensor(batch["input_ids"])`, but recreating the dict this way gets tedious. HF has a convenience param called `return_tensors` where we can specify whether we want PyTorch, TensorFlow, or NumPy tensors.

In [16]:
toks = tokenizer("Using a Transformers is easy!", return_tensors="pt")
toks

{'input_ids': tensor([[  101,  7993,   170, 25267,  1110,  3123,   106,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [17]:
print(toks["input_ids"].shape)
print(toks["token_type_ids"].shape)
print(toks["attention_mask"].shape)

torch.Size([1, 8])
torch.Size([1, 8])
torch.Size([1, 8])


#### Padding and Truncating

A lot of models have a max number of tokens that they can take. It is possible to tell the tokenizer to truncate or pad the tokens to conform to this max size. I can also specify my own max size with `max_length`. With `padding=True` each batch is padded to its own longest sequence, which is useful when I want every row in a batch to have the same size but am ok with different batches having different sizes. Padding/truncating is a requirement when I want the tokenizer to return tensors for sentences of unequal length, because tensors cannot have different sized rows.

In [18]:
sentences = [
    "One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them.",
    "You shall not pass!"    
]
batch = tokenizer(sentences)
print(len(batch["input_ids"][0]), len(batch["input_ids"][1]))

28 7


In [19]:
try:
    tokenizer(sentences, return_tensors="pt")
except ValueError as ve:
    print("ERROR MESSAGE: ", ve)


ERROR MESSAGE:  Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).


In [20]:
batch = tokenizer(sentences, padding=True, truncation=True)
print(len(batch["input_ids"][0]), len(batch["input_ids"][1]))

28 28


In [21]:
batch = tokenizer(sentences, padding=True, truncation=True, max_length=12)
print(len(batch["input_ids"][0]), len(batch["input_ids"][1]))

12 12


In [22]:
tokenizer(sentences, padding=True, truncation=True, max_length=5, return_tensors="pt")

{'input_ids': tensor([[ 101, 1448, 3170, 1106,  102],
        [ 101, 1192, 4103, 1136,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}

#### Fast Tokenizers
HF has a bunch of tokenizers that are backed by its Rust-based `tokenizers` library. This makes them really fast when processing batches. The other tokenizers are pure Python and are noticeably slower. There are two ways of figuring out whether a tokenizer is fast or not, via `is_fast` on the tokenizer itself or on the encoding it returns -

In [25]:
print(tokenizer.is_fast)
print(batch.is_fast)

True
True


In [42]:
input = "My name is Avilay and I am here to program!"
encoded = tokenizer(input)
encoded

{'input_ids': [101, 1422, 1271, 1110, 138, 23909, 1183, 1105, 146, 1821, 1303, 1106, 1788, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

There are two ways to see what tokens the ids map to. 

The first one is to `decode` the ids. The output in this case is the original text along with any special tokens that were inserted. This means that characters like the exclamation point, which is actually a separate token, are stuck back onto the last word as they appeared in the original sentence. Similarly, the word "Avilay" is actually three tokens, but they are all merged back together as in the original sentence.

The second way is to call the `.tokens()` method on the output (available only with fast tokenizers). This will give me a list of actual tokens that the tokenizer broke this sentence into.

In [43]:
tokenizer.decode(encoded["input_ids"])

'[CLS] My name is Avilay and I am here to program! [SEP]'

In [44]:
encoded.tokens()

['[CLS]',
 'My',
 'name',
 'is',
 'A',
 '##vila',
 '##y',
 'and',
 'I',
 'am',
 'here',
 'to',
 'program',
 '!',
 '[SEP]']
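
To connect the two views, here is a small sketch of how the token list can be stitched back into the decoded string. `stitch` is a hypothetical helper that only handles `##` continuation pieces and a few punctuation marks; the real `decode` does more cleanup than this, so it is a simplification.

```python
def stitch(tokens: list[str]) -> str:
    """Join tokens, glue ## continuation pieces to the previous token, tidy punctuation."""
    text = " ".join(tokens).replace(" ##", "")
    # Re-attach a few punctuation marks to the preceding word.
    for punct in (".", ",", "!", "?"):
        text = text.replace(" " + punct, punct)
    return text


tokens = ["[CLS]", "My", "name", "is", "A", "##vila", "##y", "and", "I", "am",
          "here", "to", "program", "!", "[SEP]"]
print(stitch(tokens))
```

Note that `:`, `-`, and `)` are not in the cleanup list here, which is consistent with the smiley coming back as `: - )` in the earlier decoded outputs.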

There is a convenience method that tells me which original word each token maps to. I can see that "Avilay" is broken down into three tokens "A", "##vila", and "##y". So these three tokens will map to the same word.

In [45]:
encoded.word_ids()

[None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, None]

There is another convenience API, `word_to_chars`, that maps a word idx back to the original text. Given word idx 3, it gives me the [start, end) character indexes of that word in the original sentence.

In [47]:
start, end = encoded.word_to_chars(3)
input[start:end]

'Avilay'
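
The [start, end) spans can be approximated with a regex that splits on whitespace and treats each punctuation mark as its own word. This is a sketch under that assumption; `word_spans` is a hypothetical helper and the tokenizer's real pre-tokenization rules may differ for other inputs.

```python
import re


def word_spans(text: str) -> list[tuple[int, int]]:
    """Return the [start, end) character span of every word or punctuation mark."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


text = "My name is Avilay and I am here to program!"
spans = word_spans(text)
start, end = spans[3]
print(start, end, text[start:end])
```

Indexing `spans` by a word idx plays the same role as `word_to_chars` does above.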