bpe-openai: use Regex::find instead of captures for pretokenization #121
Merged
The Splits iterator only needs each match's extent and pattern id; it never reads a
capture group (the patterns contain none beyond the implicit whole match). It still
called captures(), which clears and fills a Captures for every piece and forces a
capture-reporting search. find() returns the same overall match and can use a faster
DFA-based search that only reports the match start and end.
Output is identical, since find and captures agree on the overall match; this is verified by
the tiktoken-equivalence tests (cl100k + o200k).
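To make the reasoning concrete, here is a minimal std-only sketch of a splitter that consumes nothing beyond each match's extent and pattern id. It is not the bpe-openai code: the matcher is a hand-rolled stand-in for the regex engine, and the names Match, Captures, find, captures, split_with_find and split_with_captures, as well as the three toy patterns, are assumptions made up for illustration.

```rust
// Toy stand-in for a multi-pattern regex engine, so the sketch needs no
// external crate. All names and patterns here are hypothetical.

/// What a find-style search reports: the extent plus which pattern matched.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Match {
    pub start: usize,
    pub end: usize,
    pub pattern: usize,
}

/// What a captures-style search fills in: the same data, stored in group slots.
#[derive(Debug, Default)]
pub struct Captures {
    pub pattern: Option<usize>,
    pub slots: Vec<Option<usize>>, // slots[0], slots[1] = implicit whole match
}

/// Toy patterns: 0 = ASCII letters, 1 = ASCII digits, 2 = any other single char.
fn class(c: char) -> usize {
    if c.is_ascii_alphabetic() {
        0
    } else if c.is_ascii_digit() {
        1
    } else {
        2
    }
}

/// Find-style search anchored at `at`: returns only start, end and pattern id.
pub fn find(text: &str, at: usize) -> Option<Match> {
    let mut chars = text[at..].char_indices();
    let (_, first) = chars.next()?;
    let pattern = class(first);
    let mut end = at + first.len_utf8();
    if pattern != 2 {
        // Extend the match while the following chars belong to the same class.
        for (i, c) in chars {
            if class(c) != pattern {
                break;
            }
            end = at + i + c.len_utf8();
        }
    }
    Some(Match { start: at, end, pattern })
}

/// Captures-style search: clears the caller's buffer, then fills every slot.
pub fn captures(text: &str, at: usize, caps: &mut Captures) {
    caps.pattern = None;
    caps.slots.clear();
    if let Some(m) = find(text, at) {
        caps.pattern = Some(m.pattern);
        caps.slots.extend([Some(m.start), Some(m.end)]);
    }
}

/// Splitter built on `find`: each piece is (text slice, pattern id).
pub fn split_with_find(text: &str) -> Vec<(&str, usize)> {
    let mut pos = 0;
    let mut out = Vec::new();
    while let Some(m) = find(text, pos) {
        out.push((&text[m.start..m.end], m.pattern));
        pos = m.end;
    }
    out
}

/// Same splitter built on `captures`; it only ever reads the whole-match slots.
pub fn split_with_captures(text: &str) -> Vec<(&str, usize)> {
    let mut caps = Captures::default();
    let mut pos = 0;
    let mut out = Vec::new();
    loop {
        captures(text, pos, &mut caps);
        match (caps.pattern, caps.slots.as_slice()) {
            (Some(p), [Some(s), Some(e)]) => {
                out.push((&text[*s..*e], p));
                pos = *e;
            }
            _ => break,
        }
    }
    out
}
```

Both splitters yield the same pieces because the captures variant never reads anything except the overall match. The per-piece clear-and-fill in captures is the kind of overhead the change avoids. In a real engine the saving is larger than this toy suggests, since a find-style search may also choose a cheaper search strategy.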
Adds a pretokenization-cl100k benchmark that isolates Tokenizer::split (the existing
comparison benchmark only measures full encode, where pretokenization is a minority of
the time on the random-token corpus). Measured with it (cargo bench, M1), split
throughput improves at every input size, most on short inputs where the per-piece
engine overhead dominates:
split/10 +57%
split/100 +16%
split/1000 +11%
split/10000 +9%
Full Tokenizer::encode (comparison-cl100k) is unchanged within noise, since
pretokenization is a minority of encode time on that corpus.
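A rough way to see why encode barely moves is Amdahl's law: if pretokenization takes a fraction f of encode time and split becomes s times faster, the whole call speeds up by 1 / ((1 - f) + f / s). The sketch below is only an illustration. The function names are made up, and any value of f plugged in is an assumption, since the PR does not report the actual share of encode time.

```rust
/// Overall speedup when a fraction `f` of the runtime becomes `s` times faster
/// (Amdahl's law). `f` is an assumed share; the PR does not measure it.
pub fn overall_speedup(f: f64, s: f64) -> f64 {
    1.0 / ((1.0 - f) + f / s)
}

/// Relative time per call after a throughput gain of `pct` percent,
/// e.g. the +57% reported for split/10.
pub fn relative_time(pct: f64) -> f64 {
    1.0 / (1.0 + pct / 100.0)
}
```

For example, with a hypothetical f = 0.10 and the +9% measured for split/10000 (s = 1.09), overall_speedup(0.10, 1.09) comes out below 1.01, i.e. under a 1% gain for the full call, which is easily lost in benchmark noise.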
aneubeck
approved these changes
Jun 5, 2026
Collaborator
aneubeck
good catch and thanks a lot for the contribution!
aneubeck
reviewed
Jun 5, 2026
trying to trigger codeql which is blocking the merge :(
aneubeck
reviewed
Jun 5, 2026