Conversation

@hendrikvanantwerpen
Contributor

Changes:

  • Moved the tiktoken / bpe-openai equivalence tests into the bpe-openai crate instead of the benchmarks crate. (A sketch of such a test follows after this list.)
    • The test fails for p50k. I only noticed this now, since we were previously only testing cl100k & o200k.
  • Added the create_test_string function to bpe so it's easier to reuse, and updated it so it can also generate strings with multi-byte characters.
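
A minimal sketch of what one of the moved equivalence tests could look like; the bpe-openai accessor names (cl100k_base, .bpe, encode) and the create_test_string path and signature are assumptions for illustration, not the PR's actual test code:

use bpe::byte_pair_encoding::create_test_string;

// Hypothetical sketch of an equivalence test between tiktoken-rs and
// bpe-openai; the bpe-openai names used here are assumptions.
#[test]
fn cl100k_matches_tiktoken() {
    let tokenizer = bpe_openai::cl100k_base();
    // generate a random test string from the tokenizer's own vocabulary
    let text = create_test_string(&tokenizer.bpe, 20_000);
    let tik = tiktoken_rs::cl100k_base().expect("load cl100k");
    // compare token ids; cast to usize to stay agnostic about the id types
    let theirs: Vec<usize> = tik.encode_ordinary(&text).iter().map(|t| *t as usize).collect();
    let ours: Vec<usize> = tokenizer.encode(&text).iter().map(|t| *t as usize).collect();
    assert_eq!(ours, theirs);
}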

@hendrikvanantwerpen hendrikvanantwerpen self-assigned this Oct 18, 2024
@hendrikvanantwerpen hendrikvanantwerpen marked this pull request as ready for review October 18, 2024 16:49
Comment on lines 578 to 597
    for _ in 0..8 {
        // pick a random token and provisionally add it
        let i = thread_rng().gen_range(0..bpe.num_tokens());
        bytes.extend(bpe.token_bytes(i as u32));
        // test if the additional bytes are valid utf-8
        // the last character is not included, because it may be incomplete
        let last = bytes
            .iter()
            .rev()
            .find_position(|b| is_char_boundary(**b))
            .map_or(0, |(offset, _)| bytes.len() - (offset + 1));
        assert!(last >= valid_bytes);
        if std::str::from_utf8(&bytes[valid_bytes..last]).is_ok() {
            tokens.push(i);
            valid_bytes = last;
            continue 'keep;
        } else {
            bytes.truncate(bytes.len() - bpe.token_len(i as u32));
        }
    }
Collaborator

First time actually looking at this code :)
I'm not so sure this procedure has any reasonable chance of constructing valid utf-8 from "broken" tokens. There are VERY few of those tokens, and the chance of randomly picking one that completes a broken sequence is extremely low.
The original idea of picking tokens was that it makes longer, interesting token sequences more likely. But this doesn't really work with broken utf-8 sequences: you would have to store the valid token pairs, and with vocabularies of 100k-200k tokens that is on the order of 10^10 pairs, too many for this to be computationally reasonable.

Contributor Author

I see. I did a quick test, and for both cl100k and o200k around 99% of the tokens are valid utf-8; o200k has more tokens containing multi-byte characters. For cl100k, almost 95% of tokens match the old condition token.iter().all(|b| is_char_boundary(*b)), but only 64% of o200k tokens do. (A sketch of such a count follows below.)
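
A minimal sketch of how those percentages could be counted, assuming the num_tokens / token_bytes accessors from the diff above; the boundary check is the usual "not a continuation byte" test, and all names here are illustrative:

use bpe::byte_pair_encoding::BytePairEncoding;

// true for the first byte of a utf-8 character, false for continuation bytes
fn is_char_boundary(b: u8) -> bool {
    b as i8 >= -0x40
}

// Sketch: count tokens that are valid utf-8 on their own, and tokens whose
// bytes all sit on character boundaries (the old condition).
fn count_token_kinds(bpe: &BytePairEncoding) -> (usize, usize, usize) {
    let mut valid_utf8 = 0;
    let mut all_boundaries = 0;
    for i in 0..bpe.num_tokens() {
        let token = bpe.token_bytes(i as u32);
        if std::str::from_utf8(token).is_ok() {
            valid_utf8 += 1;
        }
        if token.iter().all(|b| is_char_boundary(*b)) {
            all_boundaries += 1;
        }
    }
    (bpe.num_tokens(), valid_utf8, all_boundaries)
}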

I guess I can simplify this function a bit by only accepting tokens that are themselves valid utf-8. (A sketch of that simplification follows.)
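
A sketch of what that simplification could look like; the signature and accessors are assumptions for illustration, not the final code:

use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

// Sketch: build the test string only from tokens that are valid utf-8 on
// their own, so the result is a valid string by construction.
pub fn create_test_string(bpe: &BytePairEncoding, tokens: usize) -> String {
    let mut text = String::new();
    for _ in 0..tokens {
        // rejection-sample until we hit a valid-utf-8 token (~99% of tokens)
        loop {
            let i = thread_rng().gen_range(0..bpe.num_tokens());
            if let Ok(piece) = std::str::from_utf8(bpe.token_bytes(i as u32)) {
                text.push_str(piece);
                break;
            }
        }
    }
    text
}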

use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

pub fn create_test_bytes(bpe: &BytePairEncoding, tokens: usize) -> Vec<u8> {
Collaborator

I "think" this function was testing more corner cases than the create_test_string variation 🤷

Contributor Author

Probably true, yes. I'll see where I can reinstate that.

aneubeck and others added 3 commits October 21, 2024 07:36
Replace look-ahead with multiple patterns ==> 3x speedup
Co-authored-by: Alexander Neubeck <aneubeck@github.com>
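
The diff for that commit isn't shown here, but the idea in its title can be sketched: a pattern that needs look-ahead (and therefore a backtracking engine like fancy-regex) can sometimes be split into several plain, anchored patterns, taking the longest match at each position. The patterns below are illustrative placeholders, not the crate's actual pre-tokenization regexes:

use regex::Regex;

// Illustrative sketch only (not the actual commit): instead of one pattern
// relying on look-ahead, try several plain `^`-anchored patterns and keep
// the longest match at the current position.
fn longest_match_len(patterns: &[Regex], haystack: &str) -> Option<usize> {
    patterns
        .iter()
        .filter_map(|p| p.find(haystack).map(|m| m.end()))
        .max()
}

fn main() {
    let patterns = [
        Regex::new(r"^\p{L}+").unwrap(), // a run of letters
        Regex::new(r"^\p{N}+").unwrap(), // a run of digits
        Regex::new(r"^\s+").unwrap(),    // a run of whitespace
    ];
    assert_eq!(longest_match_len(&patterns, "hello world"), Some(5));
}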
@hendrikvanantwerpen hendrikvanantwerpen merged commit e20fc1a into main Oct 21, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the move-equivalence-tests branch October 21, 2024 11:39