# Tokenizer

In [19]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


## Tokenize the text

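Most tokenizers use a subword tokenization algorithm: frequent words are kept whole, while rarer words are split into smaller pieces. Continuation pieces are marked with a `##` prefix.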

In [20]:
raw_text = "I'm right at the point where I'm lossing everything"
tokens = tokenizer.tokenize(raw_text)
print(tokens)

['i', "'", 'm', 'right', 'at', 'the', 'point', 'where', 'i', "'", 'm', 'loss', '##ing', 'everything']
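
To make the `loss` + `##ing` split less mysterious, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece-style tokenizers. The function `wordpiece_tokenize` and the vocabulary `toy_subword_vocab` are hypothetical illustrations, not the library's implementation or the real `bert-base-uncased` vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word is treated as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this sketch).
toy_subword_vocab = {"loss", "##ing", "everything"}

print(wordpiece_tokenize("lossing", toy_subword_vocab))     # ['loss', '##ing']
print(wordpiece_tokenize("everything", toy_subword_vocab))  # ['everything']
print(wordpiece_tokenize("xyz", toy_subword_vocab))         # ['[UNK]']
```

Because `lossing` is not in the vocabulary, the longest known prefix `loss` is taken first and the remainder is matched as the continuation piece `##ing`.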


## Map the tokens to their respective IDs

In [21]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[1045, 1005, 1049, 2157, 2012, 1996, 2391, 2073, 1045, 1005, 1049, 3279, 2075, 2673]
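
Conceptually, this step is a dictionary lookup from token to integer. The sketch below uses a small excerpt of the vocabulary, with IDs copied from the output above. The helper `tokens_to_ids` is hypothetical, and the value `100` for the unknown token is an assumption about `bert-base-uncased`.

```python
# Excerpt of the vocabulary; these IDs are copied from the cell output above.
toy_id_vocab = {"i": 1045, "'": 1005, "m": 1049, "loss": 3279, "##ing": 2075}

UNK_ID = 100  # assumed ID of [UNK]; check tokenizer.unk_token_id to confirm


def tokens_to_ids(tokens, vocab, unk_id=UNK_ID):
    """Look each token up in the vocabulary, falling back to the unknown ID."""
    return [vocab.get(token, unk_id) for token in tokens]


print(tokens_to_ids(["i", "'", "m", "loss", "##ing"], toy_id_vocab))
# [1045, 1005, 1049, 3279, 2075]
```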


## Add the special tokens
Before the token IDs can be fed to the model, the special tokens must be added.

101 -> [CLS], beginning of the text

102 -> [SEP], end of the text

In [22]:
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs['input_ids'])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[101, 1045, 1005, 1049, 2157, 2012, 1996, 2391, 2073, 1045, 1005, 1049, 3279, 2075, 2673, 102]
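
For a single sentence, the result above amounts to wrapping the ID list with the two special IDs. A minimal sketch, assuming the IDs 101 and 102 listed earlier (the helper `add_special_tokens` is hypothetical and ignores sentence pairs, truncation and padding, which `prepare_for_model` can also handle):

```python
CLS_ID = 101  # [CLS], beginning of the text
SEP_ID = 102  # [SEP], end of the text


def add_special_tokens(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sentence's IDs with the begin and end markers."""
    return [cls_id] + list(token_ids) + [sep_id]


ids_without_specials = [1045, 1005, 1049, 2157, 2012, 1996, 2391,
                        2073, 1045, 1005, 1049, 3279, 2075, 2673]
print(add_special_tokens(ids_without_specials))
```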


Let's decode the IDs to see the special tokens:

In [23]:
inputs = tokenizer(raw_text)
print(tokenizer.decode(inputs['input_ids']))

[CLS] i'm right at the point where i'm lossing everything [SEP]


## The important part
We've seen how the tokenizer works step by step, but in practice a single call to the tokenizer does all of it:

In [24]:
inputs = tokenizer(raw_text)
print(inputs)

{'input_ids': [101, 1045, 1005, 1049, 2157, 2012, 1996, 2391, 2073, 1045, 1005, 1049, 3279, 2075, 2673, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# Batch Inputs
In general, the sentences we want to pass through our model won't all have the same length.

Let's see an example with a sentiment analysis checkpoint.

In [94]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "This is what I'been expecting my hole life",
    "I rather have a smaller sallary but keep learning a lot",
]
# Tokenize each sentence, then map its tokens to IDs
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(sentence_tokens) for sentence_tokens in tokens]
for sentence_ids in ids:
    print(sentence_ids)

[2023, 2003, 2054, 1045, 1005, 2042, 8074, 2026, 4920, 2166]
[1045, 2738, 2031, 1037, 3760, 16183, 28221, 2021, 2562, 4083, 1037, 2843]


We get two lists of different lengths: 10 IDs for the first sentence and 12 for the second.

Trying to create a tensor or a NumPy array from those two lists will result in an error, because all arrays and tensors must be rectangular. One way to overcome this limitation is to make the shorter (first) sentence the same length as the longer (second) one, by adding a special padding token as many times as necessary.

The other option would be to truncate the longer sentence to be as short as the shorter one, but by doing so we may lose information that is important for classifying the sentence.

In general we only truncate sentences when they are longer than the maximum length the model can handle. 
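
As a rough illustration of that rule, the sketch below only cuts sequences that exceed a limit and leaves shorter ones alone. The helper `truncate_batch` and the limit of 10 are hypothetical; a real model's limit is much larger, and the library's own truncation also has to account for special tokens.

```python
def truncate_batch(batch, max_length):
    """Cut only the sequences that exceed max_length; shorter ones are untouched."""
    return [seq[:max_length] for seq in batch]


batch = [
    [2023, 2003, 2054, 1045, 1005, 2042, 8074, 2026, 4920, 2166],
    [1045, 2738, 2031, 1037, 3760, 16183, 28221, 2021, 2562, 4083, 1037, 2843],
]

# Hypothetical limit, chosen small so the effect is visible.
truncated = truncate_batch(batch, max_length=10)
print([len(seq) for seq in truncated])  # [10, 10]
```

With the library itself, the same effect is normally requested by passing `truncation=True` (optionally with `max_length`) when calling the tokenizer.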


The padding value should not be picked arbitrarily. The model was pretrained with a specific padding ID, available as `tokenizer.pad_token_id`.

In [95]:
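# Pad every sentence to the length of the longest one
import copy

# Work on a deep copy so the original `ids` lists stay unpadded
ids_same_size = copy.deepcopy(ids)

max_len = max(len(sentence_ids) for sentence_ids in ids_same_size)

for sentence_ids in ids_same_size:
    # Append the tokenizer's padding ID until the sentence reaches max_len
    sentence_ids.extend([tokenizer.pad_token_id] * (max_len - len(sentence_ids)))

ids[0]  # unchanged, because only the copy was padded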

[2023, 2003, 2054, 1045, 1005, 2042, 8074, 2026, 4920, 2166]

Now that the sentences are padded, we can build a batch from them.

In [96]:
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

# Each sentence on its own (no padding needed)
ids1 = tf.constant([ids[0]])
ids2 = tf.constant([ids[1]])

# Both sentences in a single padded batch
all_ids = tf.constant([ids_same_size[0], ids_same_size[1]])
ids[0]

[2023, 2003, 2054, 1045, 1005, 2042, 8074, 2026, 4920, 2166]

In [97]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


tf.Tensor([[ 2.6742957 -2.1284556]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 1.4748219 -1.3111227]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 2.8111033 -2.2949333]
 [ 1.474822  -1.3111233]], shape=(2, 2), dtype=float32)


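We get different logits for the first sentence (`ids[0]`) when it is passed as part of the padded batch. This is because Transformer models make heavy use of attention layers, which look at every token in a sequence. For `ids_same_size[0]`, the attention layers are also taking the [PAD] tokens into account.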

To get the same results with or without padding, we need to indicate to the attention layers that they should ignore those padding tokens.

This is done by creating an attention mask: a tensor with the same shape as the input IDs tensor, with values 1 (pay attention to this token) and 0 (ignore this token).
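
To see why a mask of zeros and ones can make padding invisible, here is a minimal sketch of one common way attention implementations apply it: masked positions get a score of minus infinity before the softmax, so their attention weight becomes zero. The function `masked_softmax` and the example scores are hypothetical; this illustrates the idea and is not DistilBERT's actual code.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 receive zero weight.

    Assumes at least one position has mask == 1.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    highest = max(masked)
    # exp(-inf) is 0.0, so masked positions contribute nothing to the sum.
    exps = [math.exp(s - highest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by one padding token (made-up scores).
with_padding = masked_softmax([1.0, 2.0, 0.5], mask=[1, 1, 0])
without_padding = masked_softmax([1.0, 2.0], mask=[1, 1])

print(with_padding)
print(without_padding)
```

The weights on the real tokens are the same in both cases, which is the property that lets a padded batch reproduce the unpadded result.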


In [111]:
# 1 for real tokens, 0 for padding tokens
attention_mask = [
    [0 if token_id == tokenizer.pad_token_id else 1 for token_id in sentence_ids]
    for sentence_ids in ids_same_size
]


In [115]:
attention_mask = tf.constant(attention_mask)
type(attention_mask)

tensorflow.python.framework.ops.EagerTensor

In [119]:
output1=model(ids1)
print(output1.logits)
output2 = model(ids2)
print(output2.logits)

tf.Tensor([[ 2.6742957 -2.1284556]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 1.4748219 -1.3111227]], shape=(1, 2), dtype=float32)


In [120]:
output_all = model(all_ids,attention_mask=attention_mask)
print(output_all.logits)

tf.Tensor(
[[ 2.6742976 -2.1284575]
 [ 1.474822  -1.3111233]], shape=(2, 2), dtype=float32)


Same result, up to floating-point rounding!

All of this is done by the tokenizer when you apply it to several sentences with the flag `padding=True`.

It pads the shorter sentences with the proper value and creates the appropriate attention mask.
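
The two manual steps from this section (padding, then building the mask) can be combined into one helper, sketched below in plain Python. The function `pad_and_mask` is hypothetical, and the padding ID of 0 is an assumption; in the notebook the real value comes from `tokenizer.pad_token_id`.

```python
PAD_ID = 0  # assumed padding ID; use tokenizer.pad_token_id in practice


def pad_and_mask(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [2023, 2003, 2054, 1045, 1005, 2042, 8074, 2026, 4920, 2166],
    [1045, 2738, 2031, 1037, 3760, 16183, 28221, 2021, 2562, 4083, 1037, 2843],
]

encoded = pad_and_mask(batch)
print(encoded["input_ids"][0])
print(encoded["attention_mask"][0])
```

With the library, the equivalent call would be along the lines of `tokenizer(sentences, padding=True, return_tensors="tf")`. Its IDs differ from this sketch because the tokenizer also adds the [CLS] and [SEP] tokens.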