bpe-openai: use Regex::find instead of captures for pretokenization by dmatth1 · Pull Request #121 · github/rust-gems

dmatth1 · 2026-06-04T22:19:49Z

Replace Regex captures with find for pretokenization.

Output is identical: I ran the tiktoken-equivalence tests (cl100k + o200k) on my M1 MacBook. I also added a pretokenization-cl100k benchmark that isolates Tokenizer::split. Results:

split/ (input size)	throughput delta
10 B	+57%
100 B	+16%
1 000 B	+11%
10 000 B	+9%

Full Tokenizer::encode (comparison-cl100k) is unchanged within noise since pretokenization is a minority of encode time on that corpus.
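
A throughput delta is not the same number as the time saved. As a rough sanity check on the split table above, the sketch below converts one into the other, assuming throughput is bytes per unit time on a fixed input. The function name `time_ratio` is made up for illustration and is not part of the benchmark code.

```rust
/// Fraction of the old running time that remains after a throughput gain,
/// assuming the same input size before and after.
/// `throughput_delta_pct` is the gain in percent, e.g. 57.0 for +57%.
fn time_ratio(throughput_delta_pct: f64) -> f64 {
    // throughput = bytes / time, so new_time / old_time = old_tp / new_tp.
    1.0 / (1.0 + throughput_delta_pct / 100.0)
}
```

Under that assumption, `time_ratio(57.0)` is about 0.64, so split/10 spends roughly 36% less time per call, while `time_ratio(9.0)` is about 0.92, roughly 8% less for split/10000.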

Reasoning for the change:
The Splits iterator only needs each match's extent and pattern id; it never reads a capture group (the patterns contain none beyond the implicit whole match). It still called captures(), which clears and fills a Captures for every piece and forces a capture-reporting search. find() returns the same overall match and can use a faster DFA-based search that only reports the match start and end.
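
To make the "extent and pattern id" point concrete, here is a std-only toy model of a splitting iterator. It is not the crate's code: `Found`, `run_len`, `toy_find`, and this `Splits` are hypothetical stand-ins, and the three hard-coded "patterns" replace the real regex engine. It only shows that such an iterator can be driven entirely by a find-style result, with no capture buffer.

```rust
/// Extent and pattern id of one match: everything the splitter reads.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Found {
    start: usize,
    end: usize,
    pattern: usize,
}

/// Byte length of the leading run of characters in `rest` that satisfy `pred`.
fn run_len(rest: &str, pred: impl Fn(char) -> bool) -> usize {
    rest.chars().take_while(|c| pred(*c)).map(char::len_utf8).sum()
}

/// Hypothetical stand-in for a multi-pattern `find` anchored at `pos`.
/// Pattern 0: run of ASCII letters. Pattern 1: run of ASCII digits.
/// Pattern 2: any other single character.
fn toy_find(text: &str, pos: usize) -> Option<Found> {
    let rest = &text[pos..];
    let first = rest.chars().next()?; // None once the input is exhausted
    let (len, pattern) = if first.is_ascii_alphabetic() {
        (run_len(rest, |c| c.is_ascii_alphabetic()), 0)
    } else if first.is_ascii_digit() {
        (run_len(rest, |c| c.is_ascii_digit()), 1)
    } else {
        (first.len_utf8(), 2)
    };
    Some(Found { start: pos, end: pos + len, pattern })
}

/// Toy splitter: yields each piece together with the id of the pattern
/// that matched it. It keeps no capture state between calls.
struct Splits<'a> {
    text: &'a str,
    pos: usize,
}

impl<'a> Splits<'a> {
    fn new(text: &'a str) -> Self {
        Splits { text, pos: 0 }
    }
}

impl<'a> Iterator for Splits<'a> {
    type Item = (&'a str, usize);

    fn next(&mut self) -> Option<Self::Item> {
        let m = toy_find(self.text, self.pos)?;
        self.pos = m.end; // continue right after this match
        Some((&self.text[m.start..m.end], m.pattern))
    }
}
```

For example, `Splits::new("ab12!").collect::<Vec<_>>()` yields `[("ab", 0), ("12", 1), ("!", 2)]`. The real iterator gets the same kind of information from the regex engine's match result, which is why a capture-reporting search is not required.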

The new benchmark is needed because the existing comparison benchmark only measures full encode, where pretokenization is a minority of the time on the random-token corpus. Measured with cargo bench on the M1, split throughput improves at every input size, most on short inputs where the per-piece engine overhead dominates.

aneubeck

good catch and thanks a lot for the contribution!

trying to trigger codeql which is blocking the merge :(

dmatth1 requested a review from a team as a code owner June 4, 2026 22:19

aneubeck approved these changes Jun 5, 2026

aneubeck enabled auto-merge June 5, 2026 07:02

aneubeck reviewed Jun 5, 2026

Comment thread crates/bpe/benchmarks/performance.rs

Apply suggestion from @aneubeck

4b3cda5

aneubeck reviewed Jun 5, 2026

Comment thread crates/bpe/benchmarks/performance.rs Outdated

Apply suggestion from @aneubeck

9c65f5a

aneubeck disabled auto-merge June 5, 2026 13:47

aneubeck merged commit 5bf020e into github:main Jun 5, 2026
3 checks passed

aneubeck merged 3 commits into github:main from dmatth1:pretok-find-fast-split
