<a href="https://colab.research.google.com/github/elise-chin-adway/transformers-course/blob/main/2-using-transformers/03_tokenizers_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [2]:
%cd /content/gdrive/MyDrive/Formations/transformers-course

/content/gdrive/MyDrive/Formations/transformers-course


In [None]:
!pip install -r requirements.txt
#!pip install datasets evaluate transformers[sentencepiece]

## Types of tokenizers
---

| Tokenization | Description | Pros | Cons |
| ------------ | ----------- | ---- | ---- |
| **Word-based** | Text --> words | | <ul><li>Large vocabulary</li><li>Singular and plural nouns, or different forms of the same verb, get different IDs</li><li>Many unknown tokens</li></ul> |
| **Character-based** | Text --> characters | <ul><li>Smaller vocabulary size</li><li>Fewer unknown tokens</li></ul> | <ul><li>Unclear how to handle spaces and punctuation</li><li>Individual characters carry less meaning in Latin-script languages</li><li>Large number of tokens to be processed by the model</li></ul> |
| **Subword-based** | <ul><li>Frequently used words not split into smaller subwords</li><li>Rare words --> subwords</li></ul> | | |
| **Byte-level BPE** (GPT-2) | | | |
| **WordPiece** (BERT) | | | |
| **SentencePiece or Unigram** | | | |

In [4]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
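
To make the trade-offs in the table more concrete, here is a minimal sketch in plain Python that contrasts word-based and character-based tokenization on the same sentence. The tiny vocabularies, the `lookup` helper, and the `[UNK]` placeholder are assumptions made for illustration; they are not taken from any real tokenizer.

```python
# Toy comparison of word-based and character-based tokenization.
text = "Jim Henson was a puppeteer"

word_tokens = text.split()   # split on whitespace
char_tokens = list(text)     # one token per character, spaces included

# Character-based tokenization yields far more tokens for the same text.
print(len(word_tokens), len(char_tokens))  # 5 26

# Hypothetical vocabularies built only from the sentence above.
word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

def lookup(tokens, vocab, unk="[UNK]"):
    """Replace tokens that are missing from the vocabulary with the unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]

# "puppeteers" was never seen as a word, so the word-based vocabulary cannot represent it.
unseen_word_tokens = lookup("Jim Henson was a puppeteers".split(), word_vocab)
print(unseen_word_tokens)  # ['Jim', 'Henson', 'was', 'a', '[UNK]']

# Every character of "puppeteers" already appears in the sentence, so no unknown token is needed.
unseen_char_tokens = lookup(list("puppeteers"), char_vocab)
print("[UNK]" in unseen_char_tokens)  # False
```

The word-level vocabulary fails on a simple plural, while the character-level one covers it at the cost of a much longer sequence. Subword tokenization aims for a middle ground between the two.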


## Loading and saving
---

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer as shown in the previous section:

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model:

In [7]:
tokenizer.save_pretrained("tokenizers")

('tokenizers/tokenizer_config.json',
 'tokenizers/special_tokens_map.json',
 'tokenizers/vocab.txt',
 'tokenizers/added_tokens.json',
 'tokenizers/tokenizer.json')

## Encoding
---

Translating text to numbers is known as encoding. Encoding is a two-step process: tokenization, followed by conversion to input IDs.

To get a better understanding of the two steps, we’ll explore them separately. Note that we will use methods that perform individual parts of the tokenization pipeline to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).

### Tokenization

The tokenization process is done by the `tokenize()` method of the tokenizer:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This tokenizer is a subword tokenizer: it splits words until it obtains tokens that are in its vocabulary. That’s the case here with `Transformer`, which is split into two tokens: `Trans` and `##former`. The `##` prefix marks a token that continues the preceding word rather than starting a new one.
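
The splitting behavior described above can be sketched as a greedy longest-match-first search, the strategy usually associated with WordPiece tokenization. This is a simplified illustration, not the library's implementation: the `wordpiece_split` function and the seven-entry `toy_vocab` are hypothetical, and the real tokenizer also normalizes text and splits on whitespace and punctuation before this step.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary: it contains "Trans" and "##former" but not "Transformer".
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

print(wordpiece_split("Transformer", toy_vocab))  # ['Trans', '##former']
print(wordpiece_split("network", toy_vocab))      # ['network']
print(wordpiece_split("puppeteer", toy_vocab))    # ['[UNK]']
```

Words that are in the vocabulary stay whole, a word like `Transformer` falls back to the longest pieces available, and a word with no matching pieces becomes a single unknown token.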

### From tokens to input IDs

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [9]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
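
Conceptually, this step is a dictionary lookup from token string to vocabulary index. The sketch below rebuilds it with a toy vocabulary whose token/ID pairs are copied from the two outputs above. The `tokens_to_ids` helper is hypothetical, and mapping missing tokens to ID 100 for `[UNK]` is an assumption about the `bert-base-cased` vocabulary.

```python
# Token-to-ID pairs copied from the notebook outputs above.
# Assumption: 100 is the ID of the [UNK] token in this vocabulary.
toy_vocab = {
    "[UNK]": 100,
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}

def tokens_to_ids(tokens, vocab, unk="[UNK]"):
    """Look up each token; fall back to the unknown-token ID when it is missing."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

toy_tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
toy_ids = tokens_to_ids(toy_tokens, toy_vocab)
print(toy_ids)  # [7993, 170, 13809, 23763, 2443, 1110, 3014]

# A token outside the vocabulary maps to the unknown-token ID.
print(tokens_to_ids(["puppeteer"], toy_vocab))  # [100]
```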


## Decoding
---

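Decoding goes the other way around: starting from vocabulary indices, we want to get a string back. This can be done with the `decode()` method as follows. Note that the IDs used here differ from the output of `convert_tokens_to_ids()` above (`11303, 1200` in place of `13809, 23763`), which is why the decoded sentence contains a lowercase `transformer`: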

In [10]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


Note that the `decode()` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same word to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
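
The grouping step can be sketched in a few lines: map each ID back to its token, then attach every `##` token to the word before it. The `toy_decode` function and the `id_to_token` table are hypothetical and cover only the IDs printed by `convert_tokens_to_ids()` earlier. The real `decode()` method does more, such as handling special tokens and cleaning up spaces around punctuation.

```python
# Reverse mapping for the IDs produced by convert_tokens_to_ids() earlier.
id_to_token = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}

def toy_decode(ids, id_to_token, prefix="##"):
    """Map IDs to tokens and merge continuation tokens into the previous word."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # glue the piece onto the previous word
        else:
            words.append(token)               # start a new word
    return " ".join(words)

decoded = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014], id_to_token)
print(decoded)  # Using a Transformer network is simple
```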