When using a pretrained model, use its associated pretrained tokenizer: it splits the text you give it into tokens the same way the pretraining corpus was split, and it uses the same token-to-index mapping (usually called the vocab) as during pretraining.
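
To see why the mapping matters, here is a minimal sketch in plain Python. It does not use the transformers API; both vocabularies and the `to_ids` helper are hypothetical and exist only for illustration. The same tokens get different indices under a different vocab, so a model trained with one mapping would look up the wrong embeddings with the other.

```python
# Toy illustration (not the transformers API): two hypothetical vocabularies
# that assign different indices to the same tokens.
pretraining_vocab = {"[UNK]": 0, "Hello": 1, "world": 2}
other_vocab = {"[UNK]": 0, "world": 1, "Hello": 2}


def to_ids(tokens, vocab):
    """Map each token to its index, falling back to the unknown token."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = ["Hello", "world"]
ids_expected = to_ids(tokens, pretraining_vocab)
ids_mismatched = to_ids(tokens, other_vocab)

print(ids_expected)    # [1, 2]
print(ids_mismatched)  # [2, 1]
```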

### To automatically download the vocab used during pretraining or fine-tuning of a given model, use the from_pretrained() method

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


### The tokenizer is a pipeline that does
### Raw text ---> Tokens -----> Input IDs -----> Special tokens added (final inputs for the model)

In [7]:
## Rawtext to Tokens
tokens = tokenizer.tokenize("Hello, I'm a single sentence!")
print(tokens)

['Hello', ',', 'I', "'", 'm', 'a', 'single', 'sentence', '!']


In [9]:
## Tokens to Inputids
inputids = tokenizer.convert_tokens_to_ids(tokens)
print(inputids)

[8667, 117, 146, 112, 182, 170, 1423, 5650, 106]


In [10]:
## Inputids to final inputs, with Special Tokens added
fnalinputids = tokenizer.prepare_for_model(inputids)
print(fnalinputids)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [11]:
print(tokenizer.decode(fnalinputids['input_ids']))

"[CLS] Hello, I'm a single sentence! [SEP]"
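
The three steps above can be imitated in plain Python to make the data flow explicit. This is only a sketch: the vocabulary entries are copied from the outputs printed in this notebook, the `toy_*` helpers are hypothetical names, and the regex splitter is a simplification that happens to work for this sentence. It is not BERT's WordPiece algorithm.

```python
import re

# Token-to-index entries copied from the outputs shown above.
vocab = {
    "[CLS]": 101, "[SEP]": 102, "Hello": 8667, ",": 117, "I": 146,
    "'": 112, "m": 182, "a": 170, "single": 1423, "sentence": 5650, "!": 106,
}
id_to_token = {index: token for token, index in vocab.items()}


def toy_tokenize(text):
    """Split into word runs and single punctuation marks (simplified)."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_convert_tokens_to_ids(tokens):
    """Look up each token in the vocabulary."""
    return [vocab[token] for token in tokens]


def toy_add_special_tokens(ids):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


tokens = toy_tokenize("Hello, I'm a single sentence!")
ids = toy_convert_tokens_to_ids(tokens)
final_ids = toy_add_special_tokens(ids)

print(tokens)
print(ids)
print(final_ids)
print([id_to_token[i] for i in final_ids])
```

For this sentence the printed lists should line up with the tokenizer outputs above; for text with rare or unseen words the real tokenizer would split into subword pieces, which this sketch does not attempt.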

### With multiple sentences: pass a list to process them as a batch.
The tokenizer returns a dictionary whose values are lists of lists, one inner list per sentence.

In [13]:
batch_sentences = ["Hello I'm a single sentence",
                       "And another sentence",
                       "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


### Padding and truncation to build a batch for the model, returning tensors instead of lists

In [18]:
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf", max_length=8)
print(batch)

{'input_ids': <tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[ 101, 8667,  146,  112,  182,  170, 1423,  102],
       [ 101, 1262, 1330, 5650,  102,    0,    0,    0],
       [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102]])>, 'token_type_ids': <tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1]])>}


### Check: the 1st sentence was truncated to max_length=8, and the 2nd sentence was padded
Also, attention_mask indicates which tokens the model should attend to (1) and which it should ignore (0, because they are padding in this case).
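
The truncation and padding seen in the batch output above can be reproduced with a few lines of plain Python. This is a rough sketch of the idea, not the library's implementation: `toy_encode_batch` is a hypothetical helper, the special-token and padding IDs (101, 102, 0) are read off the printed outputs, and the sketch assumes truncation drops tokens from the end and padding extends each row to the longest row in the batch.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # values seen in the outputs above


def toy_encode_batch(batch_token_ids, max_length):
    """Truncate, add special tokens, pad to the longest row, build the mask."""
    encoded = []
    for ids in batch_token_ids:
        # Leave room for [CLS] and [SEP] within max_length.
        kept = ids[: max_length - 2]
        encoded.append([CLS_ID] + kept + [SEP_ID])

    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of the three batch sentences, without special tokens.
batch_token_ids = [
    [8667, 146, 112, 182, 170, 1423, 5650],
    [1262, 1330, 5650],
    [1262, 1103, 1304, 1304, 1314, 1141],
]
toy_batch = toy_encode_batch(batch_token_ids, max_length=8)
for row, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(row, mask)
```

The first row loses its last token (5650) to fit in 8 positions, the second row gains three padding zeros, and the third row already has exactly 8 positions.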

In [22]:
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [25]:
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
print(encoded_inputs)

{'input_ids': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
array([[  101,  8667,   146,   112,   182,   170,  1423,  5650,   102,
          146,   112,   182,   170,  5650,  1115,  2947,  1114,  1103,
         1148,  5650,   102],
       [  101,  1262,  1330,  5650,   102,  1262,   146,  1431,  1129,
        12544,  1114,  1103,  1248,  5650,   102,     0,     0,     0,
            0,     0,     0],
       [  101,  1262,  1103,  1304,  1304,  1314,  1141,   102,  1262,
          146,  1301,  1114,  1103,  1304,  1314,  1141,   102,     0,
            0,     0,     0]])>, 'token_type_ids': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
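array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])>}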

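### Here you can see token_type_ids is 0 for 1st sentence tokens and 1 for 2nd sentence tokens (padding positions are also 0).
Not every model requires token_type_ids. You can force them to be returned (or omitted) with the return_token_type_ids argument; return_attention_mask does the same for the attention mask.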
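
As a rough sketch of how the segment IDs are laid out for a BERT-style sentence pair, the plain-Python helper below rebuilds the "How old are you?" / "I'm 6 years old" encoding printed earlier. `toy_encode_pair` is a hypothetical name, the token IDs are copied from that output, and the layout ([CLS] first [SEP] second [SEP]) is inferred from it rather than taken from the library source.

```python
CLS_ID, SEP_ID = 101, 102  # values seen in the outputs above


def toy_encode_pair(first_ids, second_ids):
    """Build [CLS] first [SEP] second [SEP] with matching segment IDs."""
    first_segment = [CLS_ID] + first_ids + [SEP_ID]
    second_segment = second_ids + [SEP_ID]
    return {
        "input_ids": first_segment + second_segment,
        # Segment 0 covers [CLS], the first sentence and its [SEP];
        # segment 1 covers the second sentence and the final [SEP].
        "token_type_ids": [0] * len(first_segment) + [1] * len(second_segment),
    }


question_ids = [1731, 1385, 1132, 1128, 136]    # "How old are you?"
answer_ids = [146, 112, 182, 127, 1201, 1385]   # "I'm 6 years old"
pair = toy_encode_pair(question_ids, answer_ids)
print(pair["input_ids"])
print(pair["token_type_ids"])
```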