Heavily inspired by Karpathy's minbpe, here's my late-night take on a simple, clean, and fast BPE implementation.

I'm borrowing a lot from Karpathy's code, but we'll use more efficient data structures:
1. We'll hold our sequence in an [*IndexedList*](datastructures/indexedlist.py), which is a simple linked list (so it allows efficiently deleting elements) that also maintains an index from a pair to all of its occurrences in the list (so it allows fast iteration over them).
2. We'll calculate the pair counts only once, and maintain them in a [*Multiset*](datastructures/multiset.py). This will allow efficiently finding the next pair to merge.

Note that the [*IndexedList*](datastructures/leap.py) maintains a _possibly stale_ index. That means that when iterating over the occurrences of a pair, we have to check that each accessed position indeed still holds the desired pair. "Lazy" data structures like this are often more efficient, since they skip the work of eagerly cleaning up outdated entries.
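
To make the stale-index idea concrete, here's a minimal sketch. It is not the actual `IndexedList` code: `TinyIndexedList`, `Node`, and the method names are made up for illustration, and the real class may be organized quite differently.

```python
from collections import defaultdict


class Node:
    """A doubly linked list node holding one token."""

    def __init__(self, val):
        self.val = val
        self.prev = None
        self.next = None


class TinyIndexedList:
    """Hypothetical, simplified stand-in for the IndexedList described above."""

    def __init__(self, tokens):
        self.index = defaultdict(list)  # pair -> nodes where the pair started
        self.head = None
        prev = None
        for tok in tokens:
            node = Node(tok)
            if prev is None:
                self.head = node
            else:
                prev.next, node.prev = node, prev
                self.index[(prev.val, node.val)].append(prev)
            prev = node

    def occurrences(self, pair):
        """Yield nodes that still start `pair`, skipping stale index entries."""
        for node in list(self.index[pair]):
            if node.next is not None and (node.val, node.next.val) == pair:
                yield node

    def merge(self, node, new_val):
        """Replace `node` and its successor by one node holding `new_val`."""
        gone = node.next
        node.val = new_val
        node.next = gone.next
        if gone.next is not None:
            gone.next.prev = node
        gone.prev = gone.next = None  # unlink; old index entries go stale
        # Index the newly created pairs; outdated entries are left in place.
        if node.prev is not None:
            self.index[(node.prev.val, node.val)].append(node.prev)
        if node.next is not None:
            self.index[(node.val, node.next.val)].append(node)

    def to_list(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.val)
            node = node.next
        return out
```

Merging never touches the old index entries. They simply fail the check in `occurrences` the next time someone looks at them.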

The [*Multiset*](datastructures/multiset.py) is functionally equivalent to the built-in `collections.Counter`, but finding the most common element is drastically faster (see [multiset_tests.ipynb](datastructures/multiset_tests.ipynb)).
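
For intuition on why finding the most common element can be much cheaper than scanning a `Counter`, here's one possible approach: a counter paired with a lazily cleaned max-heap. This is only an illustration under my own assumptions (`LazyMaxCounter` is a made-up name); the actual `Multiset` may well use a different technique.

```python
import heapq
from collections import Counter


class LazyMaxCounter:
    """Counter plus a lazily cleaned max-heap for finding the top key."""

    def __init__(self):
        self.counts = Counter()
        self.heap = []  # entries are (-count, key); some may be out of date

    def add(self, key, delta=1):
        """Change the count of `key` by `delta` (which may be negative)."""
        self.counts[key] += delta
        if self.counts[key] > 0:
            heapq.heappush(self.heap, (-self.counts[key], key))
        else:
            del self.counts[key]

    def most_common(self):
        """Return (key, count) with the highest count, or None if empty."""
        while self.heap:
            neg_count, key = self.heap[0]
            if self.counts.get(key) == -neg_count:
                return key, -neg_count
            heapq.heappop(self.heap)  # stale entry: the count changed since
        return None
```

Each update costs one heap push, and stale heap entries are discarded only when they reach the top. Ties are broken by the smallest key here, which is not necessarily the order `max(stats, key=stats.get)` would pick.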

# Why is minbpe slow?

If we are to perform N merges, and the length of the training sequence is L, Karpathy's original impl does (I think):
```python
for i in range(N):
    calc_stats()        # O(L)
    find_max()          # O(L)
    do_merges()         # O(L)
```
For a total complexity of O(N*L) (maybe I'm neglecting some factors).
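
Spelled out as runnable Python, that loop looks roughly like this. It's my own paraphrase for illustration, not Karpathy's code, and `naive_train` is a made-up name:

```python
def naive_train(ids, num_merges, first_new_id=256):
    """Recount all pairs and rebuild the whole sequence on every merge."""
    ids = list(ids)
    merges = {}
    for i in range(num_merges):
        # calc_stats: one full pass over the sequence.
        stats = {}
        for pair in zip(ids, ids[1:]):
            stats[pair] = stats.get(pair, 0) + 1
        if not stats:
            break
        # find_max: a scan over all distinct pairs.
        pair = max(stats, key=stats.get)
        new_id = first_new_id + i
        # do_merges: another full pass that builds a new list.
        out, j = [], 0
        while j < len(ids):
            if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
        merges[pair] = new_id
    return ids, merges
```

For example, `naive_train(list(b"aaabdaaabac"), 3)` shrinks the 11 input bytes to 5 tokens, and every one of the 3 iterations walks the whole sequence twice.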

# Why is fast_minbpe fast?

Using our `IndexedList` and `Multiset`, we instead get:
```python
calc_stats()                      # O(L)
for i in range(N):
    find_max()                    # O(log(L))
    do_merges_and_update_stats()  # O(M_i)
```
Where M_i denotes the actual number of merges we perform at the ith iteration. Each merge removes one element from the list, so M_1+M_2+...+M_N <= L - 1, and the overall complexity of everything (again neglecting logarithmic factors) is O(L)!
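
Here's a compact sketch of the `do_merges_and_update_stats` idea, under a few loud assumptions: it uses plain index arrays instead of the real `IndexedList`, and a `Counter` with `max(...)` instead of the `Multiset`, so `find_max` is *not* fast in this version. All names (`fast_train_sketch`, `_dec`, `where`) are made up for illustration. The point is only that each merge does a constant amount of work on its immediate neighbours.

```python
from collections import Counter, defaultdict


def _dec(stats, pair):
    """Decrement a pair count, dropping the entry when it reaches zero."""
    stats[pair] -= 1
    if stats[pair] == 0:
        del stats[pair]


def fast_train_sketch(ids, num_merges, first_new_id=256):
    vals = list(ids)
    n = len(vals)
    nxt = list(range(1, n)) + [-1]   # next live position, -1 at the end
    prv = [-1] + list(range(n - 1))  # previous live position, -1 at the start
    alive = [True] * n
    stats = Counter(zip(vals, vals[1:]))  # pair counts, computed only once
    where = defaultdict(list)             # pair -> positions, possibly stale
    for i in range(n - 1):
        where[(vals[i], vals[i + 1])].append(i)
    merges = {}
    for k in range(num_merges):
        if not stats:
            break
        pair = max(stats, key=stats.get)  # slow stand-in for the Multiset
        new_id = first_new_id + k
        merges[pair] = new_id
        for i in where.pop(pair, []):
            # Skip stale entries: removed positions or changed values.
            if not alive[i] or nxt[i] == -1:
                continue
            j = nxt[i]
            if (vals[i], vals[j]) != pair:
                continue
            p, q = prv[i], nxt[j]
            # Remove the counts of the pairs that disappear.
            _dec(stats, pair)
            if p != -1:
                _dec(stats, (vals[p], vals[i]))
            if q != -1:
                _dec(stats, (vals[j], vals[q]))
            # Splice position j out and store the new token at position i.
            vals[i] = new_id
            alive[j] = False
            nxt[i] = q
            if q != -1:
                prv[q] = i
            # Count and index the pairs that appear.
            if p != -1:
                stats[(vals[p], new_id)] += 1
                where[(vals[p], new_id)].append(p)
            if q != -1:
                stats[(new_id, vals[q])] += 1
                where[(new_id, vals[q])].append(i)
    out, i = [], (0 if n else -1)
    while i != -1:
        out.append(vals[i])
        i = nxt[i]
    return out, merges
```

Because ties between equally frequent pairs may be broken differently, the learned merges don't have to match those of a naive recount-everything loop exactly.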

We'll unfortunately have to give up some lovely code from Karpathy's implementation, such as:
```python
pair = max(stats, key=stats.get)
```
That said, armed with the `IndexedList` and the `Multiset`, our code remains concise and clean I think.

Note that I only implement the functionality of Karpathy's `BasicTokenizer`.

You can find some more details in [this post](https://yanivle.github.io/ai/2024/02/23/fast_minbpe.html).

In [4]:
from util.mytimeit import timeit
from bpe import train, tokenize, detokenize

In [7]:
def train_and_test(filename, vocab_size):
    # Use a context manager so the file is closed after reading.
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()
    print(f'Source text is of length: {len(text):,}')
    print(f'First 100 chars: {repr(text[:100])}')
    merge_tree, vocab = timeit(lambda: train(text, vocab_size), 'Training')
    tokenized_text = timeit(lambda: tokenize(text, merge_tree), 'Tokenization')
    print(f'Tokenized text has {len(tokenized_text)} tokens.')
    detokenized_text = timeit(lambda: detokenize(tokenized_text, vocab), 'Detokenize')
    assert detokenized_text == text
    return tokenized_text, vocab
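
`timeit` here is a small helper from `util/mytimeit.py` that isn't shown in this notebook. Judging only by how it's called above and by the "... took ... seconds." lines in the output, it behaves roughly like this sketch (`timeit_sketch` is a hypothetical name, and the real helper may differ):

```python
import time


def timeit_sketch(fn, name):
    """Call fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f'{name} took {elapsed:.2f} seconds.')
    return result
```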

Timing time!

In [9]:
for vocab_size in [300, 1000, 10_000]:
    tokenized_text, vocab = train_and_test(r'data\taylorswift.txt', vocab_size)
    print()

Source text is of length: 185,561
First 100 chars: 'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaTh'
Training tokenizer on text of length 185,561 with vocab of size 300.
build_indexed_list took 0.21 seconds.
init_pairs_stats took 0.01 seconds.
Training took 0.36 seconds.
Tokenization took 0.33 seconds.
Tokenized text has 128451 tokens.
Detokenize took 0.01 seconds.

Source text is of length: 185,561
First 100 chars: 'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaTh'
Training tokenizer on text of length 185,561 with vocab of size 1,000.
build_indexed_list took 0.32 seconds.
init_pairs_stats took 0.02 seconds.
Training took 0.82 seconds.
Tokenization took 0.38 seconds.
Tokenized text has 58337 tokens.
Detokenize took 0.00 seconds.

Source text is of length: 185,561
First 100 chars: 'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\

In [10]:
# Let's inspect our tokenized text:
def debug(tokenized_text, vocab):
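    # A single token may hold only part of a multi-byte UTF-8 character,
    # so decode leniently instead of raising UnicodeDecodeError.
    print('🍔'.join([vocab[t].decode('utf-8', errors='replace') for t in tokenized_text]))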

debug(tokenized_text[:100], vocab)

C🍔op🍔y 🍔p🍔ast🍔e 🍔of the 🍔Wikipe🍔dia 🍔article 🍔on 🍔Taylor Swift, 🍔as of 🍔F🍔eb🍔 🍔16🍔, 2024🍔.
🍔--🍔-🍔

🍔Main 🍔m🍔enu🍔

🍔Wikipedia🍔The F🍔ree 🍔Enc🍔yclopedia
🍔
🍔Search🍔
🍔C🍔re🍔ate 🍔account🍔
🍔L🍔og🍔 🍔in🍔

🍔Personal 🍔tool🍔s
🍔Cont🍔ents 🍔 🍔h🍔ide🍔
🍔(🍔Top🍔)
🍔Life and career🍔
Toggle 🍔Life and 🍔career 🍔subsection
🍔Artistry🍔
Toggle 🍔Artist🍔ry 🍔subsection
🍔Accolades and achievements
🍔Cultural status🍔
Toggle 🍔Cultural 🍔status 🍔subsection
🍔Wealth🍔
Toggle 🍔Weal🍔th 🍔subsection
🍔Discography
🍔Filmography
🍔Tours
🍔See also🍔
F🍔ootnotes
🍔References
🍔Toggle 🍔Referenc🍔es 🍔subsection
🍔External links
Taylor Swift
🍔
🍔13🍔6 🍔l🍔ang🍔u🍔ag🍔es
🍔Ar🍔tic🍔le



In [11]:
# What about a GPT-4-like vocabulary with 100K tokens?
tokenized_text, vocab = train_and_test(r'data\taylorswift.txt', 100_000)

Source text is of length: 185,561
First 100 chars: 'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaTh'
Training tokenizer on text of length 185,561 with vocab of size 100,000.
build_indexed_list took 0.22 seconds.
init_pairs_stats took 0.02 seconds.
Training took 1.61 seconds.
Tokenization took 0.60 seconds.
Tokenized text has 1 tokens.
Detokenize took 0.00 seconds.


1.61 seconds, but with a vocabulary of 100K, only 1 token remains... Let's try something longer...

[The Guinness Book of World Records recognizes](https://www.guinnessworldrecords.com/world-records/longest-novel) Marcel Proust's "A la recherche du temps perdu" as the world's longest novel. It turns out it consists of several volumes. I found a translated version of the first volume, "Swann's Way", on [the website](https://gutenberg.net.au/plusfifty-n-z.html#proust) for Project Gutenberg. Specifically, [this file](https://gutenberg.net.au/ebooks03/0300511.txt). It's just over 1MB, which is perfect!
Let's try training a GPT-4 sized tokenizer with a vocabulary of 100K tokens on that:

In [12]:
tokenized_text, vocab = train_and_test(r'data\0300511.txt', 100_000)

Source text is of length: 1,088,320
First 100 chars: "\nProject Gutenberg Australia\n\nTitle:      Swann's Way\n            (Du côté de chez Swann)\n          "
Training tokenizer on text of length 1,088,320 with vocab of size 100,000.
build_indexed_list took 1.24 seconds.
init_pairs_stats took 0.12 seconds.
Training took 8.15 seconds.
Tokenization took 3.90 seconds.
Tokenized text has 86707 tokens.
Detokenize took 0.02 seconds.


In [13]:
# Let's inspect our tokenized text:
debug(tokenized_text[:100], vocab)


🍔Project Gutenberg Australia🍔

🍔Title:      Swann's Way
            (Du côté de chez Swann)
            [Vol. 1 of Remembrance of Things Past—
            (À la Recherche du temps perdu)]
Author:     Marcel Proust
            Translated from the French by C. K. Scott Moncrieff🍔
🍔* 🍔A Project Gutenberg of Australia 🍔eBook 🍔*🍔
🍔eBook 🍔No🍔.🍔: 🍔 🍔03🍔0🍔0🍔5🍔1🍔1🍔.🍔t🍔x🍔t
🍔L🍔angu🍔age🍔: 🍔  🍔English🍔
Date 🍔first 🍔posted🍔: 🍔        🍔 🍔Mar🍔ch 🍔20🍔03🍔
Date 🍔most 🍔recently 🍔up🍔d🍔ated🍔: 🍔S🍔ept 🍔20🍔2🍔2🍔

🍔P🍔rodu🍔ction 🍔not🍔es: 🍔W🍔or🍔ds 🍔in it🍔al🍔ic🍔s in the 🍔book
🍔        🍔        🍔  🍔are 🍔enclos🍔ed by 🍔under🍔sco🍔re🍔s (🍔_) 🍔in this 🍔eBook

🍔Project Gutenberg of Australia 🍔eBook🍔s are 🍔created 🍔from 🍔printed 🍔edition🍔s
🍔which are 🍔in the 🍔public 🍔domain 🍔in 🍔Australi🍔a, 🍔unless 🍔a 🍔copy🍔right 🍔notic


What about the [Bible](data/bible.txt) (~4.3MB)?

In [14]:
tokenized_text, vocab = train_and_test(r'data\bible.txt', 100_000)

Source text is of length: 4,351,186
First 100 chars: '1:1 In the beginning God created the heaven and the earth.\n\n1:2 And the earth was without form, and '
Training tokenizer on text of length 4,351,186 with vocab of size 100,000.
build_indexed_list took 5.62 seconds.
init_pairs_stats took 0.47 seconds.
Training took 24.49 seconds.
Tokenization took 14.17 seconds.
Tokenized text has 454086 tokens.
Detokenize took 0.15 seconds.


And finally, let's try a JSON file containing a large corpus of jokes from Reddit:

In [15]:
tokenized_text, vocab = train_and_test(r'data\reddit_jokes.json', 100_000)

Source text is of length: 68,662,116
First 100 chars: '[\n    {\n        "body": "Now I have to say \\"Leroy can you please paint the fence?\\"",\n        "id":'
Training tokenizer on text of length 68,662,116 with vocab of size 100,000.
build_indexed_list took 76.74 seconds.
init_pairs_stats took 7.86 seconds.
Training took 293.86 seconds.
Tokenization took 216.13 seconds.
Tokenized text has 8064036 tokens.
Detokenize took 2.82 seconds.
