Polish benchmark and include some figures in README#20
hendrikvanantwerpen merged 15 commits into main from
Conversation
force-pushed from 1281e4c to eaf4f7f
force-pushed from f21653f to 32d4c76
force-pushed from 2f6e8a3 to 6ea06fa
force-pushed from 6ea06fa to 10d1784
### Encoding results
Maybe add a worst-case example for tiktoken? (Some string without whitespace.)
I tried that. It was the same as the encoding benchmark, but all inputs were taken from a random ASCII string without whitespace. The factor increased a bit (close to 6x), but the curves seem fairly similar to the encoding results.
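The whitespace-free input described above can be generated with a short sketch like this (the alphabet choice and length are assumptions for illustration, not the benchmark's actual parameters):

```python
import random
import string

def no_whitespace_input(n: int, seed: int = 0) -> str:
    # Sample printable ASCII minus whitespace, so the pre-tokenizer regex
    # cannot split on word boundaries and hands the encoder larger chunks.
    rng = random.Random(seed)
    alphabet = [c for c in string.printable if not c.isspace()]
    return "".join(rng.choice(alphabet) for _ in range(n))
```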
Mmm. What I tested some time ago was a string that was the concatenation of all Unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned a super large sub-chunk on which the quadratic encoder was then running. That obviously takes forever...
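For reference, a "concatenation of all Unicode characters" input can be built like this sketch (it skips the surrogate range, since lone surrogates cannot be UTF-8 encoded; it may not match the exact string used back then):

```python
import sys

def all_unicode_concat() -> str:
    # Every Unicode code point in order, skipping surrogates
    # (U+D800..U+DFFF), which cannot be encoded as UTF-8.
    return "".join(
        chr(cp) for cp in range(sys.maxunicode + 1)
        if not 0xD800 <= cp <= 0xDFFF
    )
```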
Maybe the ASCII input is too simple then. I found a way to sample Unicode. I'll try that and see if it makes a difference.
I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight this in the blog post.
well, I didn't use random unicode strings... 🤷
### Incremental encoding results
Nit: maybe merge with the previous section?
The important point here is not so much the incremental encoding but, I think, the reverse encoding aspect.
I guess it requires a bit of explanation...
Oh... did I check that it returns the same result for "random" input?
E.g. when the input is all whitespace, the reverse encoder must correctly move the longest merged token to the front.
Not sure what you mean? The appending encoder doesn't do reverse encoding afaik. We also have the prepending one, although that one's not included in the benchmark now.
aneubeck left a comment
Depending on what we will focus on in the blog post, we might need to add more numbers (like worst-case inputs for tiktoken).
Co-authored-by: Alexander Neubeck <aneubeck@github.com>
Polishes the benchmark and adds some of the result figures to the README.
I considered adding the HuggingFace tokenizers to the benchmark as well, but they don't have cl100k and/or o200k readily available. I could figure out how to build a tokenizer from the tiktoken tokens; that would require computing the merge lists, if I understand it correctly. But I'm not sure it's worth the effort. E.g. this suggests their tokenizer is quite a bit slower than tiktoken anyway.
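For what it's worth, recovering a merge list from a tiktoken-style rank table could look roughly like this naive sketch. It takes the first valid split per token; a faithful reconstruction would have to simulate the actual merge order, so treat this purely as an illustration of the idea:

```python
def recover_merges(ranks: dict[bytes, int]) -> list[tuple[bytes, bytes]]:
    # ranks: token bytes -> rank, shaped like tiktoken's mergeable_ranks.
    # Any token longer than one byte must have been formed by merging two
    # earlier (lower-ranked) tokens; find one such split per token.
    merges = []
    for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
        if len(token) < 2:
            continue
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if ranks.get(left, rank) < rank and ranks.get(right, rank) < rank:
                merges.append((left, right))
                break
    return merges
```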
Rendered README.