## Comparison of tokenizers

- Subword tokenization breaks words into smaller units to better handle out-of-vocabulary (OOV) words and to reuse repeatable chunks across the vocabulary.

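#### WordPiece
- Splits each word greedily into the longest matching subwords from a pretrained vocabulary; continuation pieces are prefixed with `##`.
- During vocabulary creation, the pair to merge is chosen to maximize the likelihood of the training data.
- Needs pretokenization (splitting on whitespace).
- Replaces a word with `[UNK]` when it cannot be built from vocabulary pieces.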
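
The longest-match behaviour described above can be illustrated with a minimal sketch. This is not the Hugging Face implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary is an assumption chosen to mirror the `'313', '##60'` split that appears in the BERT output later in these notes.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single pre-tokenized word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes UNK.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary (assumption, not BERT's real vocabulary).
toy_vocab = {"313", "31", "3", "##60", "##6", "##0", "##1"}

print(wordpiece_tokenize("31360", toy_vocab))  # ['313', '##60']
print(wordpiece_tokenize("31x", toy_vocab))    # ['[UNK]']
```

Note that a single unmatched position turns the entire word into `[UNK]`, which is why WordPiece loses information on unseen characters.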

#### BPE
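- Iteratively merges the most frequent pair of adjacent symbols, starting from single characters, to build subwords.
- Merges are chosen purely by frequency, unlike WordPiece's likelihood criterion.
- Needs pretokenization (splitting on whitespace).
- Falls back to individual characters (or bytes, in byte-level BPE such as GPT-2) when a word is unknown.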
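
The merge loop described above can be sketched on a toy corpus. This is a minimal illustration, not the GPT-2 training code; the function names and the word frequencies in `toy_freqs` are assumptions made up for the example.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges; each word starts as a tuple of characters."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus


# Hypothetical pre-tokenized word counts (assumption for illustration).
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = train_bpe(toy_freqs, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

At tokenization time the learned merges are replayed in the same order on each new word, so any word can at worst be represented by its individual characters.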

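#### SentencePiece
- Treats text as a raw stream of Unicode characters (with optional byte fallback). Language agnostic, so well suited to scripts without spaces.
- Does not need pretokenization; the space is treated as an ordinary character (shown as `▁`).
- With the unigram model, determines which vocabulary and tokenization maximize the likelihood of the dataset, using EM.
    - Repeat until convergence
        - E step: estimate the expected occurrence count of each token over the possible segmentations, given the current token probabilities.
        - M step: re-estimate the token probabilities to maximize the likelihood under those counts.
    - The best tokenization of each sentence is then found with Viterbi decoding.
- Rarely uses UNK.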
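
The Viterbi decoding mentioned above can be sketched as a small dynamic program. This is a minimal illustration under a unigram assumption (token probabilities are independent), not the SentencePiece library's code; `viterbi_segment` and the probabilities in `toy_probs` are hypothetical.

```python
import math


def viterbi_segment(text, token_probs):
    """Return the most probable segmentation of `text` and its log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any split of text[:i]
    best_start = [0] * (n + 1)          # where the last token of that split begins
    best_score[0] = 0.0
    max_len = max(len(token) for token in token_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in token_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(token_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None, -math.inf  # text cannot be built from the vocabulary
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best_score[n]


# Hypothetical token probabilities (assumption for illustration).
toy_probs = {
    "▁the": 0.30, "re": 0.10, "▁there": 0.02, "▁t": 0.05, "here": 0.05,
    "▁": 0.05, "t": 0.05, "h": 0.05, "e": 0.05, "r": 0.05,
}
tokens, log_prob = viterbi_segment("▁there", toy_probs)
print(tokens)  # ['▁the', 're']
```

Here `▁the` + `re` wins with probability 0.30 × 0.10 = 0.03, ahead of the single token `▁there` at 0.02. Because single characters stay in the vocabulary, almost any input has at least one valid segmentation, which is one reason UNK is rare.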


In [1]:
import transformers
transformers.__version__

'4.51.0'

#### WordPiece

In [8]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')

example_sentence = "Hello, I think aliens are real, the government is hiding them. !! I want to know the truth."
math_example_sentence = "34+1=35. 128*245=31360. 2^10=1024."
python_example_sentence = """
"
    def add(a, b): 
        return a + b
"""

print(bert_tokenizer.tokenize(example_sentence))
print(bert_tokenizer.tokenize(math_example_sentence))
print(bert_tokenizer.tokenize(python_example_sentence))

['hello', ',', 'i', 'think', 'aliens', 'are', 'real', ',', 'the', 'government', 'is', 'hiding', 'them', '.', '!', '!', 'i', 'want', 'to', 'know', 'the', 'truth', '.']
['34', '+', '1', '=', '35', '.', '128', '*', '245', '=', '313', '##60', '.', '2', '^', '10', '=', '102', '##4', '.']
['"', 'def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']


#### BPE

- The GPT-2 tokenizer encodes each indentation space in Python code as a separate token, which wastes tokens.

In [9]:
from transformers import GPT2Tokenizer

bpe_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(bpe_tokenizer.tokenize(example_sentence))
print(bpe_tokenizer.tokenize(math_example_sentence))
print(bpe_tokenizer.tokenize(python_example_sentence))

['Hello', ',', 'ĠI', 'Ġthink', 'Ġaliens', 'Ġare', 'Ġreal', ',', 'Ġthe', 'Ġgovernment', 'Ġis', 'Ġhiding', 'Ġthem', '.', 'Ġ!!', 'ĠI', 'Ġwant', 'Ġto', 'Ġknow', 'Ġthe', 'Ġtruth', '.']
['34', '+', '1', '=', '35', '.', 'Ġ128', '*', '245', '=', '313', '60', '.', 'Ġ2', '^', '10', '=', '1024', '.']
['Ċ', '"', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġdef', 'Ġadd', '(', 'a', ',', 'Ġb', '):', 'Ġ', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb', 'Ċ']


In [16]:

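# Load the GPT-4 vocabulary and merges published as Xenova/gpt-4 on the Hugging Face Hub.
# Note: GPT2Tokenizer applies GPT-2's pre-tokenization regex, so splits may differ from tiktoken's cl100k_base.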
gpt4_tokenizer = GPT2Tokenizer.from_pretrained('Xenova/gpt-4')
print(gpt4_tokenizer.tokenize(example_sentence))
print(gpt4_tokenizer.tokenize(math_example_sentence))
print(gpt4_tokenizer.tokenize(python_example_sentence))

['Hello', ',', 'ĠI', 'Ġthink', 'Ġaliens', 'Ġare', 'Ġreal', ',', 'Ġthe', 'Ġgovernment', 'Ġis', 'Ġhiding', 'Ġthem', '.', 'Ġ!!', 'ĠI', 'Ġwant', 'Ġto', 'Ġknow', 'Ġthe', 'Ġtruth', '.']
['34', '+', '1', '=', '35', '.', 'Ġ', '128', '*', '245', '=', '313', '60', '.', 'Ġ', '2', '^', '10', '=', '10', '24', '.']
['Ċ', '"', 'Ċ', 'ĠĠĠ', 'Ġdef', 'Ġadd', '(', 'a', ',', 'Ġb', '):', 'ĠĊ', 'ĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb', 'Ċ']
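
To put a number on the whitespace overhead, the two recorded outputs for `python_example_sentence` can be compared directly. This is a small sketch that copies the token lists printed above rather than re-running the tokenizers; `whitespace_only` is a hypothetical helper.

```python
# Token lists copied from the recorded outputs above.
gpt2_tokens = ['Ċ', '"', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġdef', 'Ġadd', '(', 'a', ',', 'Ġb', '):',
               'Ġ', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb', 'Ċ']
gpt4_tokens = ['Ċ', '"', 'Ċ', 'ĠĠĠ', 'Ġdef', 'Ġadd', '(', 'a', ',', 'Ġb', '):',
               'ĠĊ', 'ĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb', 'Ċ']


def whitespace_only(tokens):
    """Keep tokens made up solely of Ġ (space) and Ċ (newline) markers."""
    return [t for t in tokens if set(t) <= {"Ġ", "Ċ"}]


for name, tokens in [("gpt2", gpt2_tokens), ("gpt-4 vocab", gpt4_tokens)]:
    print(name, len(tokens), len(whitespace_only(tokens)))
# gpt2 27 15
# gpt-4 vocab 18 6
```

With the GPT-2 vocabulary, 15 of the 27 tokens carry nothing but whitespace; the larger vocabulary has merged runs of spaces and needs 6 of 18.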


#### SentencePiece
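- In the output below, the XLNet tokenizer emits no tokens for the indentation or newlines (whitespace is normalized before tokenization), so the result is compact but the code layout is lost.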
- Splitting numbers consistently helps performance on math; here digits are still chunked unevenly (e.g. `24`, `5` and `10`, `24`).

In [10]:
from transformers import XLNetTokenizer

sentence_pieces_tokenizer = XLNetTokenizer.from_pretrained('xlnet/xlnet-base-cased')

print(sentence_pieces_tokenizer.tokenize(example_sentence))
print(sentence_pieces_tokenizer.tokenize(math_example_sentence))
print(sentence_pieces_tokenizer.tokenize(python_example_sentence))

['▁', 'Hello', ',', '▁I', '▁think', '▁aliens', '▁are', '▁real', ',', '▁the', '▁government', '▁is', '▁hiding', '▁them', '.', '▁', '!!', '▁I', '▁want', '▁to', '▁know', '▁the', '▁truth', '.']
['▁34', '+', '1', '=', '35', '.', '▁128', '*', '24', '5', '=', '313', '60', '.', '▁2', '^', '10', '=', '10', '24', '.']
['▁', '"', '▁def', '▁add', '(', 'a', ',', '▁', 'b', ')', ':', '▁return', '▁a', '▁+', '▁', 'b']
