In this chapter we build a GPT tokenizer.

Byte-pair encoding is pretty simple:
* we find the most frequent pair of adjacent bytes
* we invent a new byte (a token id above 255) and replace every occurrence of the pair with it
* repeat

In [54]:
import collections
import itertools

Pair = tuple[int, int]
Merges = dict[Pair, int]


def build_merges(text: str, num_iterations: int) -> Merges:
    merges = {}
    encoded = list(text.encode('utf-8'))
    new_bytes = itertools.count(256)
    for _ in range(num_iterations):
        if len(encoded) < 2:
            break
        pair = find_most_popular_pair(encoded)
        replacement = next(new_bytes)
        encoded = replace(encoded, pair, replacement)
        merges[pair] = replacement
    return merges


def find_most_popular_pair(encoded: list[int]) -> tuple[int, int]:
    counts = collections.Counter()
    prev = None
    for cur in encoded:
        if prev is not None:
            pair = (prev, cur)
            counts[pair] += 1
        prev = cur
    return counts.most_common(1)[0][0]


def replace(encoded: list[int], what: Pair, replacement: int) -> list[int]:
    """Replace every non-overlapping occurrence of `what`, scanning left to right."""
    result = []
    state = []
    for cur in encoded:
        # invariant: len(state) < 2
        state.append(cur)
        if tuple(state) == what:
            result.append(replacement)
            state = []
        elif len(state) == 2:
            # not a match, we can add the first element, since it's not part of the `what` pair
            result.append(state[0])
            state.pop(0)

    # invariant: len(state) < 2
    result.extend(state)
    return result

In [59]:

ascii_text = "aaabdaaabac"
ascii_merges = build_merges(ascii_text, num_iterations=3)
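
To make the three iterations on `aaabdaaabac` concrete, here is a small self-contained trace. It is a compact re-implementation of the same loop rather than the functions above, and the name `trace_bpe` is made up for this illustration.

```python
from collections import Counter


def trace_bpe(text: str, num_iterations: int) -> tuple[list[int], dict[tuple[int, int], int]]:
    """Run a few BPE training steps and print the sequence after each one."""
    ids = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    for new_id in range(256, 256 + num_iterations):
        if len(ids) < 2:
            break
        # Count adjacent pairs (overlapping occurrences are counted too).
        counts = Counter(zip(ids, ids[1:]))
        pair = counts.most_common(1)[0][0]
        # Replace non-overlapping occurrences of the pair, left to right.
        merged: list[int] = []
        i = 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges[pair] = new_id
        print(f"{pair} -> {new_id}: {ids}")
    return ids, merges


final_ids, learned = trace_bpe("aaabdaaabac", num_iterations=3)
```

The first merge is `(97, 97)` (the bytes of `aa`), and after three merges the 11 input bytes are down to 5 tokens. In the second step two pairs tie with a count of 2, so which one gets merged depends on how ties are broken.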

In [58]:
russian_text = "приветики вам, хочу проверить byte-pair encoding"
russian_merges = build_merges(russian_text, num_iterations=20)

In [67]:
def byte_pair_encode(text: str, merges: Merges) -> list[int]:
    """
    Encode text by applying every merge in the order it was learned.

    Merges are in topological order (dicts preserve insertion order),
    so one pass over the table is enough: no merge has to be applied twice.
    This is probably not the most efficient way to do it;
    Karpathy's implementation is faster.
    """
    encoded = list(text.encode("utf-8"))
    for (a, b), m in merges.items():
        encoded = replace(encoded, (a, b), m)
    return encoded


def byte_pair_decode(encoded: list[int], merges: Merges) -> str:
    # Base tokens 0..255 are plain bytes and don't depend on anything.
    # Composite tokens are in topological order: the dictionary is ordered,
    # and each new token is built only from tokens that are already defined.
    # A new token id never reuses an already defined one.
    token_values = {i: [i] for i in range(256)}
    for (a, b), m in merges.items():
        token_values[m] = token_values[a] + token_values[b]
    decoded = []
    for token in encoded:
        decoded.extend(token_values[token])
    return bytes(decoded).decode('utf-8')
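
The docstring of `byte_pair_encode` notes that walking the whole merge table is probably not the most efficient approach. One common alternative, sketched below, only looks at pairs that actually occur in the text and repeatedly applies the one that was learned earliest. Whether this matches the faster implementation the docstring has in mind is an assumption; the names `merge_pair` and `encode_by_rank` are made up for this sketch, and `example_merges` is a hand-written merge table.

```python
Pair = tuple[int, int]


def merge_pair(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace non-overlapping occurrences of `pair` with `new_id`, left to right."""
    out: list[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def encode_by_rank(text: str, merges: dict[Pair, int]) -> list[int]:
    """Apply the earliest-learned merge that is present until none applies."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        present = set(zip(ids, ids[1:]))
        # Token ids grow with training order, so the smallest id is the earliest merge.
        best = min(present, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # nothing left to merge
        ids = merge_pair(ids, best, merges[best])
    return ids


# Hand-written merge table, used only for this example.
example_merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(encode_by_rank("aaabdaaabac", example_merges))  # [258, 100, 258, 97, 99]
```

The loop runs once per merge that applies to the text, not once per entry in the table, which should matter when the table has tens of thousands of entries and the text is short.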

In [68]:
ascii_encoded = byte_pair_encode(ascii_text, merges=ascii_merges)
print(byte_pair_decode(ascii_encoded, ascii_merges))

aaabdaaabac


In [69]:
russian_encoded = byte_pair_encode(russian_text, merges=russian_merges)
print(byte_pair_decode(russian_encoded, russian_merges))

приветики вам, хочу проверить byte-pair encoding


Real implementations of byte-pair encoding in OpenAI's GPT models do extra processing: they first split the text into chunks by character category (letters, numbers, punctuation), and BPE merges can't cross these chunk boundaries.

The reason is that mixing punctuation and letters is wasteful:
say you had separate tokens for "dog", "dog." and "dog?". It's the same concept, but it would be represented by different tokens, which doesn't help the model.
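
A rough sketch of such a pre-split, using only the standard `re` module. The pattern is a simplified stand-in written for this example, not the pattern OpenAI uses (the real ones rely on Unicode property classes and also handle contractions); `SPLIT_PATTERN` and `pre_split` are hypothetical names.

```python
import re

# Simplified categories: letters, digits, punctuation (incl. underscore), whitespace.
# A single leading space is kept attached to the chunk that follows it.
SPLIT_PATTERN = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")


def pre_split(text: str) -> list[str]:
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return SPLIT_PATTERN.findall(text)


chunks = pre_split("dog. dog? dog 42!")
print(chunks)  # ['dog', '.', ' dog', '?', ' dog', ' 42', '!']
```

Because "dog" and the punctuation after it end up in different chunks, no merge can ever produce a "dog." or "dog?" token.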

## Limitations
Some of the limitations of LLMs are due to tokenization. A token is not a character; it's a sequence of characters. That's why it's hard for LLMs to do character-level manipulations (counting the characters in a string, reversing a string, etc.). The same goes for arithmetic: e.g. a 4-digit number can be split into tokens in many ways depending on the vocabulary (one token of length 4; or one token of length 3 and one of length 1; etc.), so it's a miracle LLMs can do arithmetic at all.
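
To get a feel for the arithmetic point, the snippet below enumerates every way a 4-digit string could be cut into contiguous pieces. It only lists the possibilities; which split a real tokenizer produces depends on its vocabulary. The helper `all_splits` is hypothetical.

```python
from itertools import combinations


def all_splits(s: str) -> list[list[str]]:
    """Return every way to cut `s` into contiguous, non-empty pieces."""
    n = len(s)
    splits = []
    for num_cuts in range(n):
        # Choose cut positions between characters (1 .. n-1).
        for cuts in combinations(range(1, n), num_cuts):
            bounds = [0, *cuts, n]
            splits.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits


for pieces in all_splits("1234"):
    print(pieces)
print(len(all_splits("1234")))  # 8
```

A string of n digits has 2^(n-1) such splits, so the same number can reach the model in many different shapes.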

## Tiktoken
Tiktoken is OpenAI's tokenization library. It converts text to tokens and back (encode/decode).

In [71]:
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")



In [72]:
encoding.encode("some text SolidGoldMagikarp")

[25231, 2201, 35764, 30717, 20101, 507, 11784]

As you can see, the text is fairly long (27 characters), but it was tokenized into just 7 tokens.
We can decode it back.

In [73]:
encoding.decode(encoding.encode("some text SolidGoldMagikarp"))

'some text SolidGoldMagikarp'

tiktoken provides only inference code: you can't train a tokenizer with it. It just ships the tokenizers used by OpenAI models.

## Sentencepiece

The sentencepiece library provides a tokenizer too, and it also supports training.

In [74]:
import sentencepiece as spm

In [79]:
spm.SentencePieceTrainer.train(
    sentence_iterator=iter([
        "Call me Ishmael",
        "Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world",
    ]),
    model_prefix="m",
    vocab_size=30,
)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 30
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0


In [81]:
sp = spm.SentencePieceProcessor(model_file="m.model")

In [84]:
sp.encode("howdy")

[3, 10, 8, 22, 12, 14]

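That is a weird result: the tokenization (6 tokens) is longer than the input (5 characters). Most likely that's because the vocabulary is tiny (`vocab_size=30`, trained on two sentences), so nearly every piece is a single character, and sentencepiece also prepends a whitespace marker piece (`▁`) to the input. Calling `sp.encode("howdy", out_type=str)` prints the pieces themselves. Still, sentencepiece looks pretty messy.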