<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#GPT2-own-tokenizer-(use-Byte-level-BPE)" data-toc-modified-id="GPT2-own-tokenizer-(use-Byte-level-BPE)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>GPT2 own tokenizer (use Byte level BPE)</a></span></li><li><span><a href="#HuggingFace-preprocessing-(tokenizer)" data-toc-modified-id="HuggingFace-preprocessing-(tokenizer)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>HuggingFace preprocessing (tokenizer)</a></span></li><li><span><a href="#Tokenizers-in-details-(Subword-tokenization)" data-toc-modified-id="Tokenizers-in-details-(Subword-tokenization)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tokenizers in details (Subword tokenization)</a></span><ul class="toc-item"><li><span><a href="#Pre-tokenizer" data-toc-modified-id="Pre-tokenizer-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Pre-tokenizer</a></span></li><li><span><a href="#Byte-pair-encoding-(BPE)" data-toc-modified-id="Byte-pair-encoding-(BPE)-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Byte-pair encoding (BPE)</a></span><ul class="toc-item"><li><span><a href="#Byte-level-BPE" data-toc-modified-id="Byte-level-BPE-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Byte-level BPE</a></span></li></ul></li><li><span><a href="#WordPiece" data-toc-modified-id="WordPiece-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>WordPiece</a></span></li><li><span><a href="#Unigram" data-toc-modified-id="Unigram-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Unigram</a></span></li><li><span><a href="#SentencePiece" data-toc-modified-id="SentencePiece-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>SentencePiece</a></span></li></ul></li></ul></div>

In [1]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [2]:
# 12-layer, 768-hidden, 12-heads, 117M parameters.
# OpenAI GPT-2 English model
pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

# GPT2 own tokenizer (use Byte level BPE)

In [3]:
type(tokenizer)

transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast

In [4]:
ids = tokenizer.encode('This is an example of text, this is another example of text. :), :/')
print(ids)

[1212, 318, 281, 1672, 286, 2420, 11, 428, 318, 1194, 1672, 286, 2420, 13, 1058, 828, 1058, 14]


In [5]:
tokenizer.decode(ids)

'This is an example of text, this is another example of text. :), :/'

In [6]:
print([tokenizer.decode([i]) for i in ids])
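# The ',' after 'text' (id 11) and the ',' inside '),' (id 828) are different tokens.
# The emoticons ':)' and ':/' are not single tokens; they are split into pieces such as ' :' and '),'.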

['This', ' is', ' an', ' example', ' of', ' text', ',', ' this', ' is', ' another', ' example', ' of', ' text', '.', ' :', '),', ' :', '/']


In [12]:
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

['Don',
 "'t",
 'Ġyou',
 'Ġlove',
 'ĠðŁ',
 '¤',
 'Ĺ',
 'ĠTransformers',
 '?',
 'ĠWe',
 'Ġsure',
 'Ġdo',
 '.']

In [8]:
tokenizer.tokenize("discover discovering discovered disco disc disk discord disconnect disconnected disconnecting redis radish")


['d',
 'iscover',
 'Ġdiscovering',
 'Ġdiscovered',
 'Ġdisco',
 'Ġdisc',
 'Ġdisk',
 'Ġdiscord',
 'Ġdisconnect',
 'Ġdisconnected',
 'Ġdisconnect',
 'ing',
 'Ġred',
 'is',
 'Ġrad',
 'ish']

# HuggingFace preprocessing (tokenizer)

https://huggingface.co/transformers/preprocessing.html

In [8]:
text = ["Hello I'm a single sentence",
        "And another sentence",
        "And the very very last one"]

In [9]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
batch = tmp_token(text)
print(batch)

{'input_ids': [[15496, 314, 1101, 257, 2060, 6827], [1870, 1194, 6827], [1870, 262, 845, 845, 938, 530]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]]}


In [10]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(text,padding=True,truncation=True,max_length=100,return_tensors="pt")
# with padding. Default GPT2 padding is to the right
print(batch)

{'input_ids': tensor([[15496,   314,  1101,   257,  2060,  6827],
        [ 1870,  1194,  6827, 50256, 50256, 50256],
        [ 1870,   262,   845,   845,   938,   530]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])}


In [11]:
for i in batch['input_ids']:
    print(tmp_token.decode(i))

Hello I'm a single sentence
And another sentence<|endoftext|><|endoftext|><|endoftext|>
And the very very last one


In [11]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(text,padding=True,truncation=True,max_length=4,return_tensors="pt")
print(batch)
# truncation is also to the right
for i in batch['input_ids']:
    print(tmp_token.decode(i))

{'input_ids': tensor([[15496,   314,  1101,   257],
        [ 1870,   281,   313,    63],
        [ 1870,   262,   845,   845]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]])}
Hello I'm a
And anot`
And the very very


The tokenizer also accepts pairs of sentences (useful for BERT-style models). Here we use pairs to experiment with truncation and max_length.

In [12]:
batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]

In [13]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(batch_sentences,batch_of_second_sentences,padding=True,return_tensors="pt")
print(batch)
print(batch['input_ids'].shape)
for i in batch['input_ids']:
    print(tmp_token.decode(i))

{'input_ids': tensor([[15496,   314,  1101,   257,  2060,  6827,    40,  1101,   257,  6827,
           326,  2925,   351,   262,   717,  6827],
        [ 1870,  1194,  6827,  1870,   314,   815,   307, 30240,   351,   262,
          1218,  6827, 50256, 50256, 50256, 50256],
        [ 1870,   262,   845,   845,   938,   530,  1870,   314,   467,   351,
           262,   845,   938,   530, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
torch.Size([3, 16])
Hello I'm a single sentenceI'm a sentence that goes with the first sentence
And another sentenceAnd I should be encoded with the second sentence<|endoftext|><|endoftext|><|endoftext|><|endoftext|>
And the very very last oneAnd I go with the very last one<|endoftext|><|endoftext|>


In [14]:
batch['input_ids'].shape

torch.Size([3, 16])

In [15]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(batch_sentences,batch_of_second_sentences,padding=True,truncation='only_first',max_length=12,return_tensors="pt")
print(batch)
print(batch['input_ids'].shape)
# truncate only the first sentence. Still truncate from the right
for i in batch['input_ids']:
    print(tmp_token.decode(i))

print(batch['input_ids'].shape)

{'input_ids': tensor([[15496,   314,    40,  1101,   257,  6827,   326,  2925,   351,   262,
           717,  6827],
        [ 1870,  1194,  6827,  1870,   314,   815,   307, 30240,   351,   262,
          1218,  6827],
        [ 1870,   262,   845,   845,  1870,   314,   467,   351,   262,   845,
           938,   530]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([3, 12])
Hello II'm a sentence that goes with the first sentence
And another sentenceAnd I should be encoded with the second sentence
And the very veryAnd I go with the very last one
torch.Size([3, 12])


In [16]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(batch_sentences,batch_of_second_sentences,padding=True,truncation='only_second',max_length=12,return_tensors="pt")
print(batch)
print(batch['input_ids'].shape)
# truncate only the second sentence. Still truncate from the right

for i in batch['input_ids']:
    print(tmp_token.decode(i))

{'input_ids': tensor([[15496,   314,  1101,   257,  2060,  6827,    40,  1101,   257,  6827,
           326,  2925],
        [ 1870,  1194,  6827,  1870,   314,   815,   307, 30240,   351,   262,
          1218,  6827],
        [ 1870,   262,   845,   845,   938,   530,  1870,   314,   467,   351,
           262,   845]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([3, 12])
Hello I'm a single sentenceI'm a sentence that goes
And another sentenceAnd I should be encoded with the second sentence
And the very very last oneAnd I go with the very


In [17]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tmp_token.pad_token = tmp_token.eos_token
batch = tmp_token(batch_sentences,batch_of_second_sentences,padding=True,truncation='longest_first',max_length=12,return_tensors="pt")
print(batch)
print(batch['input_ids'].shape)
# 'longest_first' removes tokens one at a time from whichever sentence of the pair is currently longer.
# Truncation is still from the right.

for i in batch['input_ids']:
    print(tmp_token.decode(i))

{'input_ids': tensor([[15496,   314,  1101,   257,  2060,  6827,    40,  1101,   257,  6827,
           326,  2925],
        [ 1870,  1194,  6827,  1870,   314,   815,   307, 30240,   351,   262,
          1218,  6827],
        [ 1870,   262,   845,   845,   938,   530,  1870,   314,   467,   351,
           262,   845]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([3, 12])
Hello I'm a single sentenceI'm a sentence that goes
And another sentenceAnd I should be encoded with the second sentence
And the very very last oneAnd I go with the very


The tokenizer can also work with pre-tokenized inputs (where each sentence has already been split into words), which is useful for NER or POS tagging.

In [18]:
tmp_token = GPT2TokenizerFast.from_pretrained(pretrained_weights,add_prefix_space=True)

batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
batch = tmp_token(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
print(batch)
# print(batch['input_ids'].shape)

for i in batch['input_ids']:
    print(tmp_token.decode(i))

{'input_ids': [[18435, 314, 1101, 257, 2060, 6827, 314, 1101, 257, 6827, 326, 2925, 351, 262, 717, 6827], [843, 1194, 6827, 843, 314, 815, 307, 30240, 351, 262, 1218, 6827], [843, 262, 845, 845, 938, 530, 843, 314, 467, 351, 262, 845, 938, 530]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 Hello I'm a single sentence I'm a sentence that goes with the first sentence
 And another sentence And I should be encoded with the second sentence
 And the very very last one And I go with the very last one


# Tokenizers in details (Subword tokenization)

https://huggingface.co/transformers/tokenizer_summary.html

Subword tokenization algorithms rely on the principle that 
- frequently used words should not be split into smaller subwords
- rare words should be decomposed into meaningful subwords. 

For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly".

In [25]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


In [26]:
tokenizer.tokenize('annoyingly')

['annoying', '##ly']

In [27]:
tokenizer.tokenize("I have a new GPU!")
# "##" means that the rest of the token should be attached to the previous one, 
# without space (for decoding or reversal of the tokenization)

['i', 'have', 'a', 'new', 'gp', '##u', '!']

In [28]:
tokenizer.tokenize("discover discovering discovered disco disc disk discord disconnect disconnected disconnecting redis radish")
# common words are kept, rare words are broken down

['discover',
 'discovering',
 'discovered',
 'disco',
 'disc',
 'disk',
 'disco',
 '##rd',
 'disco',
 '##nne',
 '##ct',
 'disconnected',
 'disco',
 '##nne',
 '##ting',
 'red',
 '##is',
 'ra',
 '##dis',
 '##h']

In [23]:
from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")


['▁Don',
 "'",
 't',
 '▁you',
 '▁love',
 '▁',
 '🤗',
 '▁',
 'Transform',
 'ers',
 '?',
 '▁We',
 '▁sure',
 '▁do',
 '.']

In [24]:
tokenizer.tokenize("discover discovering discovered disco disc disk discord disconnect disconnected disconnecting redis radish")


['▁discover',
 '▁discovering',
 '▁discovered',
 '▁disco',
 '▁disc',
 '▁disk',
 '▁discord',
 '▁disconnect',
 '▁disconnected',
 '▁disc',
 'onne',
 'ting',
 '▁red',
 'is',
 '▁',
 'rad',
 'ish']

## Pre-tokenizer

Pre-tokenizer: splits the training data into words. 

- Pre-tokenization can be as simple as space tokenization, e.g. GPT-2, RoBERTa.
- More advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most languages, or GPT, which uses spaCy and ftfy. The resulting words are then used to count the frequency of each word in the training corpus.
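
As a rough illustration of the simplest case, the sketch below splits a made-up corpus on whitespace and counts word frequencies. The function name `pre_tokenize` and the corpus string are hypothetical; real pre-tokenizers apply more rules than a plain `str.split()`.

```python
from collections import Counter

def pre_tokenize(text):
    # Simplest possible pre-tokenizer (an assumption for illustration):
    # split on whitespace only, with no punctuation handling.
    return text.split()

# Hypothetical toy corpus, not taken from any real training set.
corpus = "hug pug pun bun hugs hug pun"

# Word frequencies are what BPE, WordPiece and Unigram training start from.
word_freqs = Counter(pre_tokenize(corpus))
print(word_freqs)
```

The resulting `word_freqs` mapping plays the role of the "set of unique words with their frequencies" that the subword algorithms below take as input.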



## Byte-pair encoding (BPE)

- The training data is first pre-tokenized. After this, **a set of unique words has been created** and the **frequency with which each word occurs in the training data has been determined.**
- **BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words.**
- It learns **merge rules** that form a new symbol from two symbols of the current vocabulary: the most frequent adjacent symbol pair ↦ a new symbol.
- It does so **until the vocabulary has attained the desired vocabulary size.**

E.g.:

- ```("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)```

- the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]
    - ```("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)```
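- BPE then counts the frequency of each possible symbol pair and picks the pair that occurs most frequently: "u" followed by "g" (10 + 5 + 5 = 20 occurrences), which is merged into "ug".
    - ```("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)```
- BPE then repeats this step twice: the next most common pair is "u" followed by "n" (12 + 4 = 16 occurrences), merged into "un", and after that "h" followed by "ug" (10 + 5 = 15 occurrences), merged into "hug".
    - ```("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)```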
    
    
At this stage, the vocabulary is ```["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]``` (3 merges)
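
The merge loop above can be sketched in a few lines of plain Python. This is a minimal sketch rather than the implementation used by any library: the names `train_bpe` and `merge_pair` are hypothetical, and ties between equally frequent pairs are broken by first occurrence, which is an assumption.

```python
from collections import Counter

def merge_pair(symbols, a, b):
    # Replace every adjacent occurrence of (a, b) with the merged symbol a+b.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    # Each word starts as a sequence of single-character symbols.
    splits = {word: tuple(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += word_freqs[word]
        if not pair_counts:
            break
        # Pick the most frequent pair (first one wins on ties).
        (a, b), _count = max(pair_counts.items(), key=lambda kv: kv[1])
        merges.append((a, b))
        vocab.append(a + b)
        splits = {w: merge_pair(s, a, b) for w, s in splits.items()}
    return vocab, merges, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
print(splits)
```

With three merges on the toy corpus, the learned rules are `("u", "g")`, `("u", "n")` and `("h", "ug")`, matching the walkthrough above.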

**Unknown word**

For instance, the word "bug" would be tokenized to ```["b", "ug"]``` but "mug" would be tokenized as ```["<unk>", "ug"]``` since the symbol ```"m"``` is not in the base vocabulary. 

(In general, single letters such as ```"m"``` are not replaced by the ```"<unk>"``` symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.)
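
To make the unknown-symbol behaviour concrete, here is a small sketch that replays the three merges from the toy example on new words. The function name `bpe_tokenize` is hypothetical, and mapping every out-of-vocabulary character to `"<unk>"` is the simplified behaviour described above, not a claim about any specific library.

```python
def bpe_tokenize(word, base_vocab, merges, unk="<unk>"):
    # Characters outside the base vocabulary become the unknown symbol.
    symbols = [ch if ch in base_vocab else unk for ch in word]
    # Replay the learned merges in the order they were learned.
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Base vocabulary and merges from the toy example above.
base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

print(bpe_tokenize("bug", base_vocab, merges))
print(bpe_tokenize("mug", base_vocab, merges))
```

Here "bug" comes out as `["b", "ug"]`, while "mug" starts with `"<unk>"` because "m" never appeared in the training words.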

**Size of vocabulary**

As mentioned earlier, the vocabulary size, i.e. **the base vocabulary size + the number of merges**, is a **hyperparameter to choose**. For instance GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

### Byte-level BPE

A base vocabulary that includes **all possible base characters can be quite large** if e.g. **all unicode characters are considered as base characters**. 

To have a better base vocabulary, **GPT-2 uses bytes as the base vocabulary**, which is a clever trick to **force the base vocabulary to be of size 256** while ensuring that every base character is included in the vocabulary.

With some additional rules to deal with punctuation, **GPT-2's tokenizer can tokenize any text without the need for the ```<unk>``` symbol**. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 byte-level base tokens, one special end-of-text token, and the symbols learned with 50,000 merges.
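
A quick way to see why bytes make a complete base vocabulary is to look at the UTF-8 encoding of a character that a character-level vocabulary might miss. The sketch below uses only the standard library; the variable names are illustrative.

```python
# Any string can be written as a sequence of UTF-8 bytes, and every byte
# is a value in range(256), so 256 base tokens cover all possible text.
emoji = "🤗"
byte_values = list(emoji.encode("utf-8"))
print(byte_values)  # four bytes for this emoji

assert all(0 <= b < 256 for b in byte_values)

# Vocabulary size arithmetic from the text above:
# 256 byte tokens + 1 end-of-text token + 50,000 merges.
gpt2_vocab_size = 256 + 1 + 50_000
print(gpt2_vocab_size)
```

The emoji occupies four bytes, which fits with the GPT-2 output earlier in this notebook, where 🤗 is split into several byte-level pieces instead of being mapped to an unknown token.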

## WordPiece

- WordPiece is the subword tokenization algorithm used for **BERT, DistilBERT, and Electra.**
- It is very similar to BPE: WordPiece first initializes the vocabulary to include every character present in the training data and then progressively learns a given number of merge rules.
- **In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that MAXIMIZES THE LIKELIHOOD OF THE TRAINING DATA once added to the vocabulary.**
    - Maximizing the likelihood of the training data is equivalent to **finding the symbol pair whose probability, divided by the product of the probabilities of its first and second symbols, is the greatest among all symbol pairs.** E.g. ```"u"``` followed by ```"g"``` would only be merged if the probability of ```"ug"``` divided by the probabilities of ```"u"``` and ```"g"``` were greater than for any other symbol pair.
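
The difference from BPE can be sketched on the same toy corpus. The code below is a simplified illustration, not BERT's actual training code: it uses raw counts in place of probabilities (the normalizing constants are the same for every pair, so the ranking is unchanged), and the names `wordpiece_scores` and `pair_counts` are hypothetical.

```python
from collections import Counter

def wordpiece_scores(word_freqs):
    # Count single symbols and adjacent pairs, weighted by word frequency.
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # Score = count(pair) / (count(first) * count(second)).
    scores = {
        (a, b): count / (symbol_counts[a] * symbol_counts[b])
        for (a, b), count in pair_counts.items()
    }
    return scores, pair_counts

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_counts = wordpiece_scores(word_freqs)

most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would merge
best_scored = max(scores, key=scores.get)              # what this score prefers
print(most_frequent, best_scored)
```

On this corpus the most frequent pair is `("u", "g")`, but the score favours `("g", "s")`: "u" is so common on its own that pairs containing it are penalized, whereas "s" only ever appears after "g".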

## Unigram

- Unigram initializes its base vocabulary to a large number of symbols (such as all pre-tokenized words and the most common substrings) and progressively trims the vocabulary down to a smaller size.
- It is not used directly for any of the models in the transformers library, **but it’s used in conjunction with SentencePiece.**

**Steps**

- At each training step, the Unigram algorithm defines a **loss (often defined as the log-likelihood)** over the training data given the current vocabulary and **a unigram language model**. 
- Then, for each symbol in the vocabulary, the algorithm computes **how much the overall loss would increase if the symbol were removed from the vocabulary.** Unigram then **removes the p percent of symbols (with p usually being 10 or 20) whose loss increase is the lowest**, i.e. those symbols that least affect the overall loss over the training data.
- This process is repeated **until the vocabulary has reached the desired size**. 
- The Unigram algorithm **always keeps the base CHARACTERS** so that any word can be tokenized.


**How to tokenize words after training**

- The algorithm has several ways of tokenizing new text after training. As an example, if a trained Unigram tokenizer has the vocabulary ```["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]```, there are several ways to tokenize the word ```hugs```: ```["hug", "s"]```, ```["h", "ug", "s"]``` or ```["h", "u", "g", "s"]```.
- Which one to choose? Unigram **saves the probability of each token in the training corpus on top of saving the vocabulary** so that the **probability of each possible tokenization can be computed after training**. In practice the algorithm simply **picks the most likely tokenization**, but it also offers the possibility of sampling a tokenization according to the probabilities.
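
A small sketch of this choice: enumerate every way to split a word into vocabulary tokens, then score each split as the product of its token probabilities. The probabilities below are invented for illustration only, and the function name `segmentations` is hypothetical.

```python
import math

def segmentations(word, vocab):
    # Return every way to split `word` into tokens that are all in `vocab`.
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results

# Hypothetical token probabilities (made up, chosen to sum to 1).
probs = {"b": 0.04, "g": 0.08, "h": 0.05, "n": 0.10, "p": 0.12,
         "s": 0.02, "u": 0.15, "ug": 0.09, "un": 0.28, "hug": 0.07}

candidates = segmentations("hugs", probs)
for tokens in candidates:
    print(tokens, math.prod(probs[t] for t in tokens))

# The most likely tokenization is the one with the highest product.
best = max(candidates, key=lambda toks: math.prod(probs[t] for t in toks))
print(best)
```

Exhaustive enumeration is only practical for short words; it is used here to keep the idea visible.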

Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of the words $x_1, \dots, x_N$ and that the set of all possible tokenizations for a word $x_i$ is defined as $S(x_i)$, then the overall loss is defined as

$$
\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
$$
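
The inner sum over all tokenizations does not need explicit enumeration. The sketch below computes it with a forward-style dynamic program and then evaluates the loss for a tiny word list. The probabilities are invented for illustration, and `word_marginal` and `unigram_loss` are hypothetical names.

```python
import math

def word_marginal(word, probs):
    # forward[i] is the total probability of all tokenizations of word[:i].
    forward = [0.0] * (len(word) + 1)
    forward[0] = 1.0
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                forward[end] += forward[start] * probs[piece]
    return forward[-1]

def unigram_loss(words, probs):
    # Negative log of the summed tokenization probabilities, over all words.
    # Assumes every word can be tokenized (the marginal is non-zero).
    return -sum(math.log(word_marginal(w, probs)) for w in words)

# Hypothetical token probabilities (made up, chosen to sum to 1).
probs = {"b": 0.04, "g": 0.08, "h": 0.05, "n": 0.10, "p": 0.12,
         "s": 0.02, "u": 0.15, "ug": 0.09, "un": 0.28, "hug": 0.07}

print(word_marginal("hugs", probs))
print(unigram_loss(["hug", "hugs"], probs))
```

Removing a token from `probs` and recomputing `unigram_loss` gives the loss increase that the pruning step ranks symbols by.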

## SentencePiece

**All tokenization algorithms described so far** have the same problem: It is assumed that the input text **uses spaces to separate words.**

- To solve this problem more generally, use SentencePiece (**language independent subword tokenizer**)
- SentencePiece **treats the input as a raw input stream**, thus **including the space** in the set of characters to use. 
- It then uses the **BPE or unigram algorithm** to construct the appropriate vocabulary.

In the example below the "▁" character (for space) was included in the vocabulary

**All transformers models in the library that use SentencePiece** use it in combination with **unigram**. 
- Examples of models using SentencePiece are **ALBERT, XLNet, Marian, and T5.**
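
The whitespace handling can be imitated without the library. The sketch below is not the SentencePiece API: `escape_whitespace` and `detokenize` are hypothetical helpers that only show how encoding the space as "▁" makes the token sequence reversible. The leading "▁" imitates the prefix marker visible in the XLNet output.

```python
MARKER = "▁"  # U+2581, the symbol used to represent a space

def escape_whitespace(text):
    # Treat the text as a raw stream: spaces become an ordinary symbol,
    # and a marker is prefixed so the first word looks like the others.
    return MARKER + text.replace(" ", MARKER)

def detokenize(pieces):
    # Concatenate the pieces and turn the markers back into spaces.
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text

print(escape_whitespace("Hello world"))

# Pieces copied from the XLNet output shown earlier in this notebook.
pieces = ["▁Don", "'", "t", "▁you", "▁love"]
print(detokenize(pieces))
```

Because the space is part of the symbol stream, decoding is a plain string join followed by a replacement, with no language-specific rules about where spaces belong.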

In [15]:
from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

tokenizer.tokenize("discover discovering discovered disco disc disk discord disconnect disconnected disconnecting redis radish")


['▁discover',
 '▁discovering',
 '▁discovered',
 '▁disco',
 '▁disc',
 '▁disk',
 '▁discord',
 '▁disconnect',
 '▁disconnected',
 '▁disconnect',
 'ing',
 '▁red',
 'is',
 '▁',
 'rad',
 'ish']