FIX: Handle unicode on tokenizer #515

Merged
merged 2 commits on Mar 14, 2024
3 changes: 3 additions & 0 deletions config/settings.yml
@@ -171,6 +171,9 @@ discourse_ai:
default: ""
hidden: true
ai_llava_api_key: ""
ai_strict_token_counting:
default: false
hidden: true

composer_ai_helper_enabled:
default: false
11 changes: 8 additions & 3 deletions lib/tokenizer/basic_tokenizer.rb
@@ -17,14 +17,19 @@ def size(text)
end

def truncate(text, max_length)
# Fast track the common case where the text is already short enough.
return text if text.size < max_length
Member:

maybe we can still fast track for text.size < max_length / 2, I think it looks pretty safe, looking at the examples?

Member:

some chars could be 3 tokens long, so maybe we can shortcut on / 3 extremely safely for now. I worry about all the extra work for ultra short strings.

Contributor:

> some chars could be 3 tokens long, so maybe we can shortcut on / 3 extremely safely for now. I worry about all the extra work for ultra short strings.

I'm concerned that there might be some special cases causing exceptions, considering that the token count calculation is not simply based on character length. Perhaps we could use environment variables to configure whether this acceleration is necessary?

Member:

I follow, maybe we just merge this for now and see

Member Author:

@wozulong `DISCOURSE_AI_STRICT_TOKEN_COUNTING: true` will be the ENV var to disable the acceleration.
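
For reference, a minimal sketch of the guard that was merged (the `fast_track?` helper name is hypothetical; the real methods inline the check):

```ruby
# Hypothetical helper showing the shape of the merged guard: the fast path
# only fires when strict counting is off AND the text is short enough that
# even two tokens per char still fits the budget. As the thread above notes,
# some chars can cost 3 tokens, so /2 is a heuristic, not a hard guarantee.
def fast_track?(text, max_length)
  !SiteSetting.ai_strict_token_counting && text.size < max_length / 2
end
```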

# fast track common case, /2 to handle unicode chars
# that can take more than 1 token per char
return text if !SiteSetting.ai_strict_token_counting && text.size < max_length / 2

tokenizer.decode(tokenizer.encode(text).ids.take(max_length))
end

def can_expand_tokens?(text, addition, max_length)
return true if text.size + addition.size < max_length
# fast track common case, /2 to handle unicode chars
# that can take more than 1 token per char
if !SiteSetting.ai_strict_token_counting && text.size + addition.size < max_length / 2
return true
end

tokenizer.encode(text).ids.length + tokenizer.encode(addition).ids.length < max_length
end
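
To see why the old `text.size < max_length` check was unsafe, a standalone sketch (assuming ankane's `tokenizers` gem, which this class wraps, and an arbitrary pretrained vocabulary):

```ruby
require "tokenizers"

# A single CJK char or emoji can encode to several token ids, so a string
# can be under max_length in characters yet over it in tokens. Exact counts
# are vocabulary-dependent; BPE vocabs often spend 2-3 tokens on such chars.
tok = Tokenizers.from_pretrained("bert-base-uncased")
enc = tok.encode("我喜欢吃比萨") # 6 characters
enc.ids.length                   # may well exceed 6
```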
11 changes: 8 additions & 3 deletions lib/tokenizer/open_ai_tokenizer.rb
@@ -13,8 +13,9 @@ def tokenize(text)
end

def truncate(text, max_length)
# Fast track the common case where the text is already short enough.
return text if text.size < max_length
# fast track common case, /2 to handle unicode chars
# that can take more than 1 token per char
return text if !SiteSetting.ai_strict_token_counting && text.size < max_length / 2

tokenizer.decode(tokenize(text).take(max_length))
rescue Tiktoken::UnicodeError
@@ -23,7 +24,11 @@ def truncate(text, max_length)
end

def can_expand_tokens?(text, addition, max_length)
return true if text.size + addition.size < max_length
# fast track common case, /2 to handle unicode chars
# that can take more than 1 token per char
if !SiteSetting.ai_strict_token_counting && text.size + addition.size < max_length / 2
return true
end

tokenizer.encode(text).length + tokenizer.encode(addition).length < max_length
end
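
The `rescue Tiktoken::UnicodeError` above exists because taking a prefix of the token ids can split a multibyte character mid-byte-sequence; a usage sketch (assuming the tiktoken_ruby gem's documented API):

```ruby
require "tiktoken_ruby"

enc = Tiktoken.get_encoding("cl100k_base")
ids = enc.encode("foo bar 👨🏿")
# Dropping trailing ids can cut the emoji's UTF-8 bytes in half; decode may
# then raise Tiktoken::UnicodeError, which truncate rescues and retries
# with a shorter prefix.
enc.decode(ids.take(ids.length - 1))
```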
25 changes: 25 additions & 0 deletions spec/shared/tokenizer_spec.rb
@@ -81,6 +81,31 @@
sentence = "foo bar 👨🏿‍👩🏿‍👧🏿‍👧🏿 baz qux quux corge grault garply waldo fred plugh xyzzy thud"
expect(described_class.truncate(sentence, 7)).to eq("foo bar 👨🏿")
end

it "truncates unicode characters properly when they use more than one token per char" do
sentence = "我喜欢吃比萨"
original_size = described_class.size(sentence)
expect(described_class.size(described_class.truncate(sentence, original_size - 1))).to be <
original_size
end
end

describe "#can_expand_tokens?" do
it "returns true when the tokens can be expanded" do
expect(described_class.can_expand_tokens?("foo bar", "baz qux", 6)).to eq(true)
end

it "returns false when the tokens cannot be expanded" do
expect(described_class.can_expand_tokens?("foo bar", "baz qux", 3)).to eq(false)
end

it "returns false when the tokens cannot be expanded due to multibyte unicode characters" do
expect(described_class.can_expand_tokens?("foo bar 👨🏿", "baz qux", 6)).to eq(false)
end

it "handles unicode characters properly when they use more than one token per char" do
expect(described_class.can_expand_tokens?("我喜欢吃比萨", "萨", 10)).to eq(false)
end
end
end
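
A worked reading of that last expectation, with illustrative token counts:

```ruby
text, addition, max_length = "我喜欢吃比萨", "萨", 10
text.size + addition.size # => 7, which is NOT below 10 / 2 = 5
# The fast path is therefore skipped and the real token counts decide.
# If each CJK char costs ~2 tokens in the vocabulary under test,
# 12 + 2 < 10 is false, so can_expand_tokens? correctly returns false.
```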
