Heavily inspired by Karpathy's minbpe, here's my late-night take on a simple, clean, and fast BPE implementation.

I'm borrowing a lot from Karpathy's code, but we'll use more efficient data structures:
1. We'll transform our sequence into a sequence of pairs, which we will hold in a [*Leap*](datastructures/leap.py). This will allow efficient merging.
2. We'll calculate the pair counts only once, and hold them in a [*Multiset*](datastructures/multiset.py). This will allow efficiently finding the next pair to merge.
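
Counting only once works because a merge only disturbs the pairs that touch the merge site. Here's a minimal sketch of that bookkeeping, using a plain list and `collections.Counter` instead of the Leap and Multiset (so it still scans the whole list; it only illustrates the count updates). `merge_and_update` is a made-up name, not a function from this repo:

```python
from collections import Counter


def merge_and_update(ids, pair, new_id, counts):
    """Merge `pair` left to right, adjusting `counts` only around each merge site."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            left = out[-1] if out else None                  # neighbor in the updated prefix
            right = ids[i + 2] if i + 2 < len(ids) else None  # neighbor in the untouched suffix
            counts[pair] -= 1
            if left is not None:
                counts[(left, pair[0])] -= 1   # (left, a) disappears...
                counts[(left, new_id)] += 1    # ...and (left, new) appears
            if right is not None:
                counts[(pair[1], right)] -= 1  # (b, right) disappears...
                counts[(new_id, right)] += 1   # ...and (new, right) appears
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list(b"aaabdaaabac")
counts = Counter(zip(ids, ids[1:]))
merged = merge_and_update(ids, (97, 97), 256, counts)
# The incrementally updated counts match a full recount of the merged sequence.
assert +counts == Counter(zip(merged, merged[1:]))
```

The unary `+counts` just drops the entries that fell to zero before comparing.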

The [*Leap*](datastructures/leap.py) is a data structure that I came up with for this (please lmk if it already has a name!). It represents an ordered list, and allows efficient iteration over items both by position and by value. It also allows for constant time insertion/appending.
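
To make that description concrete, here is one way such a structure could look. This is my guess at the general idea, not the code in `datastructures/leap.py`: a doubly linked list plus a dict from each value to the nodes holding it. `Node`, `LeapSketch` and `nodes_with` are hypothetical names.

```python
class Node:
    __slots__ = ("val", "prev", "next")

    def __init__(self, val):
        self.val = val
        self.prev = None
        self.next = None


class LeapSketch:
    """Ordered list with O(1) append/remove and direct access to all nodes of a value."""

    def __init__(self, items=()):
        self.head = None
        self.tail = None
        self.by_value = {}  # value -> {node: None}; a dict used as an ordered set
        for item in items:
            self.append(item)

    def append(self, val):
        node = Node(val)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        self.by_value.setdefault(val, {})[node] = None
        return node

    def remove(self, node):
        # Unlink from the positional chain.
        if node.prev is None:
            self.head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
        # Unlink from the per-value index.
        bucket = self.by_value[node.val]
        del bucket[node]
        if not bucket:
            del self.by_value[node.val]

    def __iter__(self):
        # Iteration by position.
        node = self.head
        while node is not None:
            yield node.val
            node = node.next

    def nodes_with(self, val):
        # Iteration by value: a snapshot, so callers can remove nodes while looping.
        return list(self.by_value.get(val, {}))


leap = LeapSketch("abcab")
first_a = leap.nodes_with("a")[0]
leap.remove(first_a)
print("".join(leap))  # bcab
```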

The [*Multiset*](datastructures/multiset.py) is functionally equivalent to the standard library's `collections.Counter`. Initialization and updates cost a bit more for the Multiset, but finding the top element takes constant time, which makes it drastically faster for our purposes (see [multiset_tests.ipynb](datastructures/multiset_tests.ipynb)).
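
For intuition, here's one possible way to get a constant-time top element when counts only ever change by one at a time. It is a sketch under that assumption and not necessarily how `datastructures/multiset.py` does it: items are grouped into buckets by count, and the largest count is tracked as it moves. `MultisetSketch` and its method names are made up for illustration.

```python
class MultisetSketch:
    """Counter-like structure with O(1) add, remove, and most_common."""

    def __init__(self, items=()):
        self.count_of = {}   # item -> count
        self.items_at = {}   # count -> set of items with exactly that count
        self.max_count = 0
        for item in items:
            self.add(item)

    def _leave_bucket(self, item, count):
        bucket = self.items_at[count]
        bucket.discard(item)
        if not bucket:
            del self.items_at[count]

    def add(self, item):
        old = self.count_of.get(item, 0)
        if old:
            self._leave_bucket(item, old)
        new = old + 1
        self.count_of[item] = new
        self.items_at.setdefault(new, set()).add(item)
        if new > self.max_count:
            self.max_count = new

    def remove(self, item):
        old = self.count_of[item]  # raises KeyError if the item is absent
        self._leave_bucket(item, old)
        new = old - 1
        if new:
            self.count_of[item] = new
            self.items_at.setdefault(new, set()).add(item)
        else:
            del self.count_of[item]
        # If the top bucket just emptied, the item we moved now defines the max.
        if old == self.max_count and old not in self.items_at:
            self.max_count = new

    def most_common(self):
        if not self.max_count:
            raise ValueError("multiset is empty")
        # Ties are broken arbitrarily here.
        return next(iter(self.items_at[self.max_count])), self.max_count


ms = MultisetSketch("abracadabra")
print(ms.most_common())  # ('a', 5)
```

Note the tie-breaking caveat: `max(stats, key=stats.get)` returns the first-inserted pair among ties, whereas this sketch returns an arbitrary one.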

# Why is minbpe slow?

If we are to perform N merges, and the length of the training sequence is L, Karpathy's original impl does (I think):
```python
for i in range(N):
    calc_stats()        # O(L)
    find_max()          # O(L)
    do_merges()         # O(L)
```
For a total complexity of O(N*L) (maybe I'm neglecting some factors).
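
Fleshed out into something runnable, that loop looks roughly like this. It's a simplified sketch in the spirit of minbpe's `BasicTokenizer`, not a copy of it; `count_pairs`, `merge_pair` and `naive_train` are illustrative names:

```python
def count_pairs(ids):
    """Count every adjacent pair in the sequence: O(L)."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge_pair(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`: O(L)."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def naive_train(text, vocab_size):
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = count_pairs(ids)             # O(L), recomputed from scratch
        if not stats:
            break                            # a single token is left, nothing to merge
        pair = max(stats, key=stats.get)     # O(L)
        ids = merge_pair(ids, pair, new_id)  # O(L)
        merges[pair] = new_id
    return merges, ids


merges, ids = naive_train("aaabdaaabac", 259)
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(ids)     # [258, 100, 258, 97, 99]
```

Every iteration pays for three full passes over the sequence, no matter how few merges it actually performs.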

# Why is fast_minbpe fast?

Using our `Leap` and `Multiset`, we instead get:
```python
stats = calc_stats()              # O(L)
for i in range(N):
    find_max()                    # O(1)
    do_merges_and_update_stats()  # O(M_i + log(L))
```
Where M_i denotes the actual number of merges we perform at the ith iteration. Note that M_1+M_2+...+M_N <= L - 1 (each merge shortens the sequence by one), so the overall complexity of everything (again neglecting logarithmic factors) is O(L)!
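
Spelling out the sum behind that claim (a back-of-the-envelope derivation using the per-step costs in the snippet above, and assuming every iteration performs at least one merge, so that N <= L - 1):

$$
\underbrace{O(L)}_{\text{initial stats}} + \sum_{i=1}^{N} O\big(M_i + \log L\big)
= O\Big(L + \sum_{i=1}^{N} M_i + N \log L\Big)
= O(L + N \log L) \subseteq O(L \log L)
$$

The middle step uses the bound on the sum of the M_i, and the last one uses N <= L - 1. Dropping the logarithmic factor gives the O(L) above.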

We'll unfortunately have to give up some lovely code from Karpathy's implementation, such as:
```python
pair = max(stats, key=stats.get)
```
That said, armed with the implementations of the `Leap` and the `Multiset`, the rest of our code remains concise and clean I think.

Note that I only implement the functionality of Karpathy's `BasicTokenizer`.

You can find some more details in [this post](https://yanivle.github.io/ai/2024/02/23/fast_minbpe.html).

In [1]:
from util.mytimeit import timeit
from bpe import train, tokenize, detokenize

In [2]:
text = open("data/taylorswift.txt", "r", encoding="utf-8").read()
print(f'Source text is of length: {len(text):,}')
print(f'First 100 chars: {repr(text[:100])}')

Source text is of length: 185,561
First 100 chars: 'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaTh'


In [3]:
merge_tree, vocab = train(text, 512, verbose=True)

Training tokenizer on text of length 185,561 with vocab of size 512.
build_pairs_leap took 0.29 seconds.
init_pairs_stats took 0.02 seconds.
Merge 1/256: (101, 32) -> 256 (b'e ') had 2981 occurrences
Merge 2/256: (44, 32) -> 257 (b', ') had 2961 occurrences
Merge 3/256: (100, 32) -> 258 (b'd ') had 2617 occurrences
Merge 4/256: (46, 32) -> 259 (b'. ') had 2560 occurrences
Merge 5/256: (114, 32) -> 260 (b'r ') had 2428 occurrences
Merge 6/256: (50, 48) -> 261 (b'20') had 2365 occurrences
Merge 7/256: (115, 32) -> 262 (b's ') had 2053 occurrences
Merge 8/256: (105, 110) -> 263 (b'in') had 2006 occurrences
Merge 9/256: (111, 110) -> 264 (b'on') had 1815 occurrences
Merge 10/256: (114, 105) -> 265 (b'ri') had 1805 occurrences
Merge 11/256: (116, 32) -> 266 (b't ') had 1802 occurrences
Merge 12/256: (116, 104) -> 267 (b'th') had 1737 occurrences
Merge 13/256: (101, 258) -> 268 (b'ed ') had 1736 occurrences
Merge 14/256: (257, 261) -> 269 (b', 20') had 1705 occurrences
Merge 15/256: (97, 110

Timing time!
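
The `timeit` helper imported from `util.mytimeit` isn't shown in this notebook. Judging only from how it's called and what it prints, it behaves roughly like the sketch below; `timeit_sketch` is a made-up name and the real helper may well differ:

```python
import time


def timeit_sketch(fn, label):
    """Call fn(), print how long it took, and pass its return value through."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.2f} seconds.")
    return result


total = timeit_sketch(lambda: sum(range(1_000_000)), "Summing")
```

Wrapping the call in a `lambda` is what lets the helper time it and still hand back the result, as in the cell below.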

In [4]:
for vocab_size in [300, 1000, 10_000]:
    merge_tree, vocab = timeit(lambda: train(text, vocab_size), 'Training')
    tokenized_text = timeit(lambda: tokenize(text, merge_tree), 'Tokenization')
    print(f'Tokenized text has {len(tokenized_text)} tokens.')
    detokenized_text = timeit(lambda: detokenize(tokenized_text, vocab), 'Detokenize')
    assert detokenized_text == text
    print()

Training tokenizer on text of length 185,561 with vocab of size 300.
build_pairs_leap took 0.39 seconds.
init_pairs_stats took 0.03 seconds.
Training took 0.64 seconds.
Tokenization took 0.60 seconds.
Tokenized text has 128451 tokens.
Detokenize took 0.00 seconds.

Training tokenizer on text of length 185,561 with vocab of size 1,000.
build_pairs_leap took 0.49 seconds.
init_pairs_stats took 0.01 seconds.
Training took 1.02 seconds.
Tokenization took 0.73 seconds.
Tokenized text has 58337 tokens.
Detokenize took 0.00 seconds.

Training tokenizer on text of length 185,561 with vocab of size 10,000.
build_pairs_leap took 0.40 seconds.
init_pairs_stats took 0.02 seconds.
Training took 1.32 seconds.
Tokenization took 0.78 seconds.
Tokenized text has 24302 tokens.
Detokenize took 0.00 seconds.



In [5]:
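# Let's inspect our tokenized text:
def debug(tokenized_text, vocab):
    # errors='replace' guards against tokens that split a multi-byte UTF-8 character.
    print('🍔'.join([vocab[t].decode('utf-8', errors='replace') for t in tokenized_text]))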

debug(tokenized_text[:100], vocab)

C🍔op🍔y 🍔p🍔ast🍔e 🍔of the 🍔Wikipe🍔dia 🍔article 🍔on 🍔Taylor Swift, 🍔as of Feb🍔 🍔16🍔, 2024🍔.
🍔--🍔-🍔

🍔Main 🍔m🍔enu🍔

🍔Wikipedia🍔The F🍔ree 🍔Enc🍔yclopedia
🍔
🍔Search
🍔C🍔re🍔ate 🍔account🍔
🍔L🍔og🍔 🍔in🍔

🍔Personal tool🍔s
🍔Cont🍔ents 🍔 🍔h🍔ide🍔
🍔(🍔Top🍔)
🍔Life and career🍔
Toggle 🍔Life and 🍔career 🍔subsection
🍔Artistry🍔
Toggle 🍔Artist🍔ry 🍔subsection
🍔Accolades and achievements
🍔Cultural status🍔
Toggle 🍔Cultural 🍔status 🍔subsection
🍔Wealth🍔
Toggle 🍔W🍔ealth subsection
🍔Discography
Filmography
🍔Tours
🍔See also🍔
F🍔ootnotes
References
🍔Toggle 🍔Referenc🍔es 🍔subsection
🍔External links
Taylor Swift
🍔
🍔13🍔6 🍔l🍔ang🍔u🍔ag🍔es
🍔Ar🍔tic🍔le
🍔Tal🍔k🍔
🍔Read🍔
View 🍔source🍔
View 


In [6]:
# What about a GPT-4-like vocabulary with 100K tokens?
merge_tree, vocab = timeit(lambda: train(text, 100_000), 'Training')
tokenized_text = timeit(lambda: tokenize(text, merge_tree), 'Tokenization')
print(f'Tokenized text has {len(tokenized_text)} tokens.')
detokenized_text = timeit(lambda: detokenize(tokenized_text, vocab), 'Detokenize')
assert detokenized_text == text

Training tokenizer on text of length 185,561 with vocab of size 100,000.
build_pairs_leap took 0.36 seconds.
init_pairs_stats took 0.03 seconds.
Training took 1.99 seconds.
Tokenization took 1.10 seconds.
Tokenized text has 1 tokens.
Detokenize took 0.00 seconds.


1.99 seconds, but with a vocabulary of 100K, only 1 token remains... Let's try something longer...

[Guinness World Records recognizes](https://www.guinnessworldrecords.com/world-records/longest-novel) Marcel Proust's "A la recherche du temps perdu" as the world's longest novel. It turns out that it consists of several volumes. I found a translated version of the first volume - "Swann's Way" - on [the website](https://gutenberg.net.au/plusfifty-n-z.html#proust) of Project Gutenberg Australia. Specifically [this file](https://gutenberg.net.au/ebooks03/0300511.txt). It's just over 1MB - perfect!
Let's try training a GPT-4-sized tokenizer with a vocabulary of 100K tokens on that:

In [7]:
text = open("data/0300511.txt", "r", encoding="utf-8").read()
print(f'Source text is of length: {len(text):,}')
print(f'First 100 chars: {repr(text[:100])}')

Source text is of length: 1,088,320
First 100 chars: "\nProject Gutenberg Australia\n\nTitle:      Swann's Way\n            (Du côté de chez Swann)\n          "


In [8]:
merge_tree, vocab = timeit(lambda: train(text, 100_000), 'Training')
tokenized_text = timeit(lambda: tokenize(text, merge_tree), 'Tokenization')
print(f'Tokenized text has {len(tokenized_text)} tokens.')
detokenized_text = timeit(lambda: detokenize(tokenized_text, vocab), 'Detokenize')
assert detokenized_text == text

Training tokenizer on text of length 1,088,320 with vocab of size 100,000.
build_pairs_leap took 1.84 seconds.
init_pairs_stats took 0.12 seconds.
Training took 9.72 seconds.
Tokenization took 6.27 seconds.
Tokenized text has 86693 tokens.
Detokenize took 0.03 seconds.


In [9]:
# Let's inspect our tokenized text:
debug(tokenized_text[:100], vocab)


🍔Project Gutenberg Australia🍔

Title:      Swann's Way
            (Du côté de chez Swann)
            [Vol. 1 of Remembrance of Things Past—
            (À la Recherche du temps perdu)]
Author:     Marcel Proust
            Translated from the French by C. K. Scott Moncrieff🍔
🍔* A Project Gutenberg of Australia 🍔eBook *🍔
🍔eBook 🍔No🍔.🍔:  03005🍔1🍔1🍔.🍔t🍔x🍔t
🍔L🍔angu🍔age🍔: 🍔  English🍔
Date 🍔first 🍔posted:          Mar🍔ch 🍔20🍔03🍔
Date 🍔most recently 🍔upd🍔ated🍔: 🍔S🍔ept 2022🍔

🍔Produ🍔ction not🍔es: 🍔W🍔or🍔ds 🍔in ital🍔ics in the 🍔book
🍔        🍔        🍔  🍔are 🍔enclos🍔ed by 🍔underscores (🍔_) in this eBook

🍔Project Gutenberg of Australia 🍔eBook🍔s are created 🍔from printed edition🍔s
which are in the 🍔public domain in 🍔Australi🍔a, unless a 🍔copy🍔right 🍔notice
🍔is 🍔included. We do 🍔NOT 🍔keep 🍔any 🍔eBook🍔s in 🍔compliance 🍔with a 🍔particular
🍔paper 🍔edition.

Copyright laws are 🍔changing 🍔all over the 🍔world. Be sure to 🍔check the
copyright laws for your country before down🍔loading or redist🍔ribut🍔i