Polish benchmark and include some figures in README#20
hendrikvanantwerpen merged 15 commits into main from
Conversation
force-pushed from 1281e4c to eaf4f7f
force-pushed from f21653f to 32d4c76
force-pushed from 2f6e8a3 to 6ea06fa
force-pushed from 6ea06fa to 10d1784
### Encoding results
Maybe add a worst-case example for tiktoken? (Some string without whitespace.)
I tried that. It was the same as the encoding benchmark, but all inputs were taken from a random ASCII string without whitespace. The factor increased a bit (close to 6x), but the curves seem fairly similar to the encoding results.
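The whitespace-free input described above can be generated with a short sketch like this (the alphabet choice and length are assumptions for illustration, not the benchmark's actual parameters):

```python
import random
import string

def no_whitespace_input(n: int, seed: int = 0) -> str:
    # Sample printable ASCII minus whitespace, so the pre-tokenizer regex
    # cannot split on word boundaries and hands the encoder larger chunks.
    rng = random.Random(seed)
    alphabet = [c for c in string.printable if not c.isspace()]
    return "".join(rng.choice(alphabet) for _ in range(n))
```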
Mmm. What I tested some time ago was a string that was the concatenation of all Unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned a super large sub-chunk on which the quadratic encoder was then running. That obviously takes forever...
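For reference, a "concatenation of all Unicode characters" input can be built like this sketch (it skips the surrogate range, since lone surrogates cannot be UTF-8 encoded; it may not match the exact string used back then):

```python
import sys

def all_unicode_concat() -> str:
    # Every Unicode code point in order, skipping surrogates
    # (U+D800..U+DFFF), which cannot be encoded as UTF-8.
    return "".join(
        chr(cp) for cp in range(sys.maxunicode + 1)
        if not 0xD800 <= cp <= 0xDFFF
    )
```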
Maybe the ASCII input is too simple then. I found a way to sample Unicode. I'll try that and see if it makes a difference.
I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight this in the blog post.
well, I didn't use random unicode strings... 🤷
### Incremental encoding results
Nit: maybe merge with the previous section?
The important point here is not so much the incremental encoding but, I think, the reverse encoding aspect.
I guess it requires a bit of explanation...
Oh... did I check that it returns the same result for "random" input?
E.g. when the input is all whitespace, the reverse encoder must correctly move the longest merged token to the front.
Not sure what you mean? The appending encoder doesn't do reverse encoding afaik. We also have the prepending one, although that one's not included in the benchmark now.
aneubeck left a comment
Depending on what we will focus on in the blog post, we might need to add more numbers (like worst-case inputs for tiktoken).
Co-authored-by: Alexander Neubeck <aneubeck@github.com>
Polishes the benchmark and adds some of the result figures to the README.
I considered adding the HuggingFace tokenizers to the benchmark as well, but they don't have cl100k and/or o200k readily available. I could figure out how to build a tokenizer from the tiktoken tokens; that would require computing the merge lists, if I understand it correctly. But I'm not sure it's worth the effort. E.g. this suggests their tokenizer is quite a bit slower than tiktoken anyway.
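For what it's worth, recovering a merge list from a tiktoken-style rank table could look roughly like this naive sketch. It takes the first valid split per token; a faithful reconstruction would have to simulate the actual merge order, so treat this purely as an illustration of the idea:

```python
def recover_merges(ranks: dict[bytes, int]) -> list[tuple[bytes, bytes]]:
    # ranks: token bytes -> rank, shaped like tiktoken's mergeable_ranks.
    # Any token longer than one byte must have been formed by merging two
    # earlier (lower-ranked) tokens; find one such split per token.
    merges = []
    for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
        if len(token) < 2:
            continue
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if ranks.get(left, rank) < rank and ranks.get(right, rank) < rank:
                merges.append((left, right))
                break
    return merges
```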
Rendered README.