Conversation

@hendrikvanantwerpen
Contributor

Changes:

  • Moved the tiktoken / bpe-openai equivalence tests into the bpe-openai crate instead of the benchmarks crate. (A sketch of such a test follows after this list.)
    • The test fails for p50k. I only noticed this now, since we were previously only testing cl100k & o200k.
  • Added the create_test_string function to bpe so it's easier to reuse, and updated it so it can also generate strings with multi-byte characters.
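
A minimal sketch of what one of the moved equivalence tests could look like; the bpe-openai accessor names (cl100k_base, .bpe, encode) and the create_test_string path and signature are assumptions for illustration, not the PR's actual test code:

use bpe::byte_pair_encoding::create_test_string;

// Hypothetical sketch of an equivalence test between tiktoken-rs and
// bpe-openai; the bpe-openai names used here are assumptions.
#[test]
fn cl100k_matches_tiktoken() {
    let tokenizer = bpe_openai::cl100k_base();
    // generate a random test string from the tokenizer's own vocabulary
    let text = create_test_string(&tokenizer.bpe, 20_000);
    let tik = tiktoken_rs::cl100k_base().expect("load cl100k");
    // compare token ids; cast to usize to stay agnostic about the id types
    let theirs: Vec<usize> = tik.encode_ordinary(&text).iter().map(|t| *t as usize).collect();
    let ours: Vec<usize> = tokenizer.encode(&text).iter().map(|t| *t as usize).collect();
    assert_eq!(ours, theirs);
}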

@hendrikvanantwerpen hendrikvanantwerpen self-assigned this Oct 18, 2024
@hendrikvanantwerpen hendrikvanantwerpen marked this pull request as ready for review October 18, 2024 16:49
Comment on lines 578 to 597
    for _ in 0..8 {
        // pick a random token and provisionally add it
        let i = thread_rng().gen_range(0..bpe.num_tokens());
        bytes.extend(bpe.token_bytes(i as u32));
        // test if the additional bytes are valid utf-8
        // the last character is not included, because it may be incomplete
        let last = bytes
            .iter()
            .rev()
            .find_position(|b| is_char_boundary(**b))
            .map_or(0, |(offset, _)| bytes.len() - (offset + 1));
        assert!(last >= valid_bytes);
        if std::str::from_utf8(&bytes[valid_bytes..last]).is_ok() {
            tokens.push(i);
            valid_bytes = last;
            continue 'keep;
        } else {
            bytes.truncate(bytes.len() - bpe.token_len(i as u32));
        }
    }
Collaborator

First time actually looking at this code :)
I'm not so sure this procedure has any reasonable chance of constructing valid utf-8 from "broken" tokens. There are VERY few of those tokens, and the chance of randomly picking one that completes a broken sequence is extremely low.
The original idea of picking tokens was that it makes longer, interesting token sequences more likely. But this doesn't really work with broken utf-8 sequences: you would have to store the valid token pairs, and with vocabularies of 100k-200k tokens that is on the order of 10^10 pairs, too many for this to be computationally reasonable.

Contributor Author

I see. I did a quick test, and for both cl100k and o200k around 99% of the tokens are valid utf-8; o200k has more tokens containing multi-byte characters. For cl100k, almost 95% of tokens match the old condition token.iter().all(|b| is_char_boundary(*b)), but only 64% of o200k tokens do. (A sketch of such a count follows below.)
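
A minimal sketch of how those percentages could be counted, assuming the num_tokens / token_bytes accessors from the diff above; the boundary check is the usual "not a continuation byte" test, and all names here are illustrative:

use bpe::byte_pair_encoding::BytePairEncoding;

// true for the first byte of a utf-8 character, false for continuation bytes
fn is_char_boundary(b: u8) -> bool {
    b as i8 >= -0x40
}

// Sketch: count tokens that are valid utf-8 on their own, and tokens whose
// bytes all sit on character boundaries (the old condition).
fn count_token_kinds(bpe: &BytePairEncoding) -> (usize, usize, usize) {
    let mut valid_utf8 = 0;
    let mut all_boundaries = 0;
    for i in 0..bpe.num_tokens() {
        let token = bpe.token_bytes(i as u32);
        if std::str::from_utf8(token).is_ok() {
            valid_utf8 += 1;
        }
        if token.iter().all(|b| is_char_boundary(*b)) {
            all_boundaries += 1;
        }
    }
    (bpe.num_tokens(), valid_utf8, all_boundaries)
}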

I guess I can simplify this function a bit by only accepting tokens that are themselves valid utf-8. (A sketch of that simplification follows.)
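
A sketch of what that simplification could look like; the signature and accessors are assumptions for illustration, not the final code:

use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

// Sketch: build the test string only from tokens that are valid utf-8 on
// their own, so the result is a valid string by construction.
pub fn create_test_string(bpe: &BytePairEncoding, tokens: usize) -> String {
    let mut text = String::new();
    for _ in 0..tokens {
        // rejection-sample until we hit a valid-utf-8 token (~99% of tokens)
        loop {
            let i = thread_rng().gen_range(0..bpe.num_tokens());
            if let Ok(piece) = std::str::from_utf8(bpe.token_bytes(i as u32)) {
                text.push_str(piece);
                break;
            }
        }
    }
    text
}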

use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

pub fn create_test_bytes(bpe: &BytePairEncoding, tokens: usize) -> Vec<u8> {
Collaborator

I "think" this function was testing more corner cases than the create_test_string variation 🤷

Contributor Author

Probably true, yes. I'll see where I can reinstate that.

aneubeck and others added 3 commits October 21, 2024 07:36
Replace look-ahead with multiple patterns ==> 3x speedup
Co-authored-by: Alexander Neubeck <aneubeck@github.com>
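
The diff for that commit isn't shown here, but the idea in its title can be sketched: a pattern that needs look-ahead (and therefore a backtracking engine like fancy-regex) can sometimes be split into several plain, anchored patterns, taking the longest match at each position. The patterns below are illustrative placeholders, not the crate's actual pre-tokenization regexes:

use regex::Regex;

// Illustrative sketch only (not the actual commit): instead of one pattern
// relying on look-ahead, try several plain `^`-anchored patterns and keep
// the longest match at the current position.
fn longest_match_len(patterns: &[Regex], haystack: &str) -> Option<usize> {
    patterns
        .iter()
        .filter_map(|p| p.find(haystack).map(|m| m.end()))
        .max()
}

fn main() {
    let patterns = [
        Regex::new(r"^\p{L}+").unwrap(), // a run of letters
        Regex::new(r"^\p{N}+").unwrap(), // a run of digits
        Regex::new(r"^\s+").unwrap(),    // a run of whitespace
    ];
    assert_eq!(longest_match_len(&patterns, "hello world"), Some(5));
}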
@hendrikvanantwerpen hendrikvanantwerpen merged commit e20fc1a into main Oct 21, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the move-equivalence-tests branch October 21, 2024 11:39