# Tokenizer class

https://huggingface.co/docs/transformers/main_classes/tokenizer

Try out various methods:

https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer

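* `__call__("text")`, i.e. `tokenizer("text")`: returns `input_ids`, `token_type_ids` and `attention_mask`, ready to pass to the model
* `tokenize("text")`: splits the text into WordPiece tokens
* `encode("text")`: converts the text to token IDs, adding the special tokens `[CLS]` and `[SEP]`
* `decode(encoded_text)`: converts token IDs back to a string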

In [24]:
from transformers import BertTokenizer

## 1. Create an instance

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## 2. Tokenize the text

In [25]:
text = 'he is a storyteller'

tokenizer.tokenize(text)

['he', 'is', 'a', 'story', '##tell', '##er']
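The split of `storyteller` into `story`, `##tell`, `##er` comes from WordPiece, which matches the longest vocabulary entry first and marks continuation pieces with `##`. The following is a minimal sketch of that idea in plain Python, not the library's implementation; the function `wordpiece` and the vocabulary `toy_vocab` are hypothetical names made up for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger.
toy_vocab = {"he", "is", "a", "st", "story", "##tell", "##er"}

tokens = [p for w in "he is a storyteller".split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```

Because the longest match wins, `story` is chosen over the shorter entry `st`, and a word with no matching pieces falls back to `[UNK]`.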

## 3. Encode

In [26]:
encoded_text = tokenizer.encode(text)

encoded_text

[101, 2002, 2003, 1037, 2466, 23567, 2121, 102]
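The encoded list has two more entries than the token list because `encode` wraps the sequence in `[CLS]` (101) and `[SEP]` (102). The sketch below mimics that with a toy lookup table. The token-to-ID pairs are read off the outputs above, assuming tokens and IDs line up one to one; `toy_ids` and `toy_encode` are hypothetical names, not part of `transformers`.

```python
# Toy subset of the vocabulary, inferred from the outputs above (an assumption).
toy_ids = {
    "[CLS]": 101, "[SEP]": 102,
    "he": 2002, "is": 2003, "a": 1037,
    "story": 2466, "##tell": 23567, "##er": 2121,
}


def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to IDs and optionally wrap them in [CLS] ... [SEP]."""
    ids = [toy_ids[t] for t in tokens]
    if add_special_tokens:
        ids = [toy_ids["[CLS]"]] + ids + [toy_ids["[SEP]"]]
    return ids


tokens = ["he", "is", "a", "story", "##tell", "##er"]
print(toy_encode(tokens))
print(toy_encode(tokens, add_special_tokens=False))
```

With the real tokenizer, `tokenizer.convert_tokens_to_ids(tokens)` performs the lookup step, and `tokenizer.encode(text, add_special_tokens=False)` should return the IDs without `[CLS]` and `[SEP]`.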

## 4. Decode

In [9]:
decoded_text = tokenizer.decode(encoded_text)

decoded_text

'[CLS] he is a storyteller [SEP]'
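Decoding reverses the two steps: IDs are mapped back to tokens, and `##` pieces are glued onto the preceding token. Here is a rough sketch of that behaviour with a toy table. `toy_tokens` and `toy_decode` are hypothetical names, and the ID-to-token pairs are inferred from the outputs above.

```python
# Toy reverse lookup, inferred from the outputs above (an assumption).
toy_tokens = {
    101: "[CLS]", 102: "[SEP]",
    2002: "he", 2003: "is", 1037: "a",
    2466: "story", 23567: "##tell", 2121: "##er",
}


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs to tokens and merge ## continuation pieces into whole words."""
    special = {"[CLS]", "[SEP]"}
    tokens = [toy_tokens[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special]
    text = ""
    for tok in tokens:
        if tok.startswith("##"):
            text += tok[2:]  # continuation: append without a space
        else:
            text += (" " if text else "") + tok
    return text


ids = [101, 2002, 2003, 1037, 2466, 23567, 2121, 102]
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

The real `decode` method accepts the same `skip_special_tokens=True` argument, which is the usual way to get the plain sentence back without `[CLS]` and `[SEP]`.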

## 5. Create tensors for input to the model

In [27]:
# Without return_tensors, the values are plain Python lists
inputs = tokenizer(text)

inputs

{'input_ids': [101, 2002, 2003, 1037, 2466, 23567, 2121, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [28]:
# Returns tensors in the desired format (PyTorch='pt', TensorFlow='tf', Flax='jax', NumPy='np')
return_tensors = 'pt'

inputs = tokenizer(text, return_tensors=return_tensors)

inputs

{'input_ids': tensor([[  101,  2002,  2003,  1037,  2466, 23567,  2121,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [29]:
type(inputs['input_ids'])

torch.Tensor
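For a single sentence the `attention_mask` is all ones. It starts to matter when several sentences of different lengths are batched, because a tensor must be rectangular and shorter sequences have to be padded. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption based on `[PAD]` being ID 0 in `bert-base-uncased`.

```python
def pad_batch(batch, pad_id=0):
    """Pad every ID list to the longest one and mark real tokens with 1."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2002, 2003, 1037, 2466, 23567, 2121, 102],  # '[CLS] he is a storyteller [SEP]'
    [101, 2002, 102],                                 # '[CLS] he [SEP]'
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

With the real tokenizer, passing a list of sentences together with `padding=True` and `return_tensors='pt'` should produce the padded tensors and the matching mask in one call.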