bpe-openai: use Regex::find instead of captures for pretokenization #121

Merged: aneubeck merged 3 commits into github:main from dmatth1:pretok-find-fast-split on Jun 5, 2026
@dmatth1 dmatth1 commented Jun 4, 2026

Replace Regex captures with find for pretokenization.

Output is identical; I verified this with the tiktoken-equivalence tests (cl100k + o200k) on my M1 MacBook. I also added a pretokenization-cl100k benchmark that isolates Tokenizer::split. Results:

    split/ (input size)   throughput delta
    10 B                  +57%
    100 B                 +16%
    1 000 B               +11%
    10 000 B              +9%

Full Tokenizer::encode (comparison-cl100k) is unchanged within noise since pretokenization is a minority of encode time on that corpus.

Reasoning for the change:
The Splits iterator only needs each match's extent and pattern id; it never reads a capture group (the patterns contain none beyond the implicit whole match). Before this change it still called captures(), which clears and fills a Captures for every piece and forces a capture-reporting search. find() returns the same overall match and can use a faster DFA-based search that reports only the match start and end.
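
To make the shape of the change concrete, here is a minimal, stdlib-only sketch. Everything in it is hypothetical: `ToyMatcher`, `Match`, `split_with_find`, and `split_with_captures` are illustration names, not the regex crate's API or the code in this PR. It only shows why a split iterator that reads just the pattern id and the match extent can use a `find`-style call and still produce the same pieces as a `captures`-style call that fills a slot buffer. It does not model the engine cost difference.

```rust
/// Extent plus pattern id: all a split iterator needs from each match.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Match {
    pattern: usize,
    start: usize,
    end: usize,
}

/// Toy stand-in for a compiled multi-pattern matcher (hypothetical, not a regex API).
/// Pattern 0: run of ASCII letters. Pattern 1: run of ASCII digits.
/// Pattern 2: any other single character.
struct ToyMatcher;

impl ToyMatcher {
    /// find-style call: reports only the pattern id and the match extent.
    fn find(&self, text: &str, at: usize) -> Option<Match> {
        let rest = &text[at..];
        let first = rest.chars().next()?; // None at end of input
        let (pattern, len) = if first.is_ascii_alphabetic() {
            (0, rest.find(|c: char| !c.is_ascii_alphabetic()).unwrap_or(rest.len()))
        } else if first.is_ascii_digit() {
            (1, rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len()))
        } else {
            (2, first.len_utf8())
        };
        Some(Match { pattern, start: at, end: at + len })
    }

    /// captures-style call: clears and refills a slot buffer for every piece.
    /// Slots 0 and 1 hold the implicit whole match; there are no other groups.
    fn captures(&self, text: &str, at: usize, slots: &mut Vec<Option<usize>>) -> Option<usize> {
        slots.clear();
        let m = self.find(text, at)?;
        slots.push(Some(m.start));
        slots.push(Some(m.end));
        Some(m.pattern)
    }
}

/// Iterator over (pattern id, piece) pairs, driven by the find-style call.
struct Splits<'a> {
    matcher: &'a ToyMatcher,
    text: &'a str,
    pos: usize,
}

impl<'a> Iterator for Splits<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let m = self.matcher.find(self.text, self.pos)?;
        self.pos = m.end;
        Some((m.pattern, &self.text[m.start..m.end]))
    }
}

fn split_with_find<'a>(matcher: &'a ToyMatcher, text: &'a str) -> Vec<(usize, &'a str)> {
    Splits { matcher, text, pos: 0 }.collect()
}

/// Same pieces, but obtained by reading the whole-match slots after each call.
fn split_with_captures<'a>(matcher: &ToyMatcher, text: &'a str) -> Vec<(usize, &'a str)> {
    let mut slots = Vec::new();
    let mut pos = 0;
    let mut out = Vec::new();
    while let Some(pattern) = matcher.captures(text, pos, &mut slots) {
        let (start, end) = (slots[0].unwrap(), slots[1].unwrap());
        out.push((pattern, &text[start..end]));
        pos = end;
    }
    out
}

fn main() {
    let matcher = ToyMatcher;
    let text = "abc123 de";
    let pieces = split_with_find(&matcher, text);
    // Both paths agree because only the whole match is ever read.
    assert_eq!(pieces, split_with_captures(&matcher, text));
    println!("{:?}", pieces);
}
```

In this toy the slot buffer is pure overhead because nothing beyond slots 0 and 1 is ever read, which mirrors the argument above for dropping the capture-reporting search.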

Output is identical: find and captures agree on the overall match, as verified by the
tiktoken-equivalence tests (cl100k + o200k).

Adds a pretokenization-cl100k benchmark that isolates Tokenizer::split (the existing
comparison benchmark only measures full encode, where pretokenization is a minority of
the time on the random-token corpus). Measured with it (cargo bench, M1), split
throughput improves at every input size, most on short inputs where the per-piece
engine overhead dominates:

    split/10     +57%
    split/100    +16%
    split/1000   +11%
    split/10000   +9%
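
For readers who want to see what "isolating split" means in a measurement, the following is a rough stdlib-only sketch. It is not the benchmark added in this PR: `toy_split_count` is a hypothetical stand-in for the split step, the corpus is a repeated filler string rather than the benchmark's input, and `std::time::Instant` replaces a proper benchmark harness. It only illustrates timing the split step on its own at several input sizes and expressing a change as a relative throughput delta.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical stand-in for the split step: counts maximal runs of
/// whitespace and non-whitespace characters. Not the crate's Tokenizer::split.
fn toy_split_count(text: &str) -> usize {
    let mut count = 0;
    let mut prev: Option<bool> = None;
    for c in text.chars() {
        let is_ws = c.is_whitespace();
        if prev != Some(is_ws) {
            count += 1; // a new run starts here
        }
        prev = Some(is_ws);
    }
    count
}

/// Throughput in bytes per second for `iters` passes over `text`,
/// timing only the split step.
fn split_throughput(text: &str, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        black_box(toy_split_count(black_box(text)));
    }
    // Clamp so a very fast run cannot divide by zero.
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    (text.len() as f64 * iters as f64) / secs
}

/// Relative throughput change; 0.57 would be reported as "+57%".
fn relative_delta(before: f64, after: f64) -> f64 {
    after / before - 1.0
}

fn main() {
    for size in [10usize, 100, 1_000, 10_000] {
        // Filler corpus (assumption): a short ASCII pattern repeated to `size` bytes.
        let text: String = "ab cd ".chars().cycle().take(size).collect();
        let bytes_per_sec = split_throughput(&text, 1_000);
        println!("split/{size}: {:.1} MB/s", bytes_per_sec / 1e6);
    }
}
```

Running the same loop before and after a change and feeding the two throughputs to `relative_delta` gives numbers in the same form as the table above, though a dedicated harness will control noise far better than this sketch.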

@dmatth1 dmatth1 requested a review from a team as a code owner June 4, 2026 22:19
@aneubeck aneubeck left a comment

good catch and thanks a lot for the contribution!

@aneubeck aneubeck enabled auto-merge June 5, 2026 07:02
Comment thread crates/bpe/benchmarks/performance.rs
trying to trigger codeql which is blocking the merge :(
Comment thread crates/bpe/benchmarks/performance.rs Outdated
@aneubeck aneubeck disabled auto-merge June 5, 2026 13:47
@aneubeck aneubeck merged commit 5bf020e into github:main Jun 5, 2026
3 checks passed