LMQL not supporting multibyte characters, causing garbled output for Japanese queries #38
Comments
The garbled output is due to Unicode replacement characters. This issue occurs during the decoding process of the tokenizer and can be reproduced with the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokens = tokenizer.encode('私')
# tokens = [163, 100, 223]
print(tokenizer.decode(tokens[0:1]))  # �
print(tokenizer.decode(tokens[0:2]))  # �
print(tokenizer.decode(tokens[0:3]))  # 私

In my environment, the fix in #39 prevents replacement characters from appearing. However, I am unsure whether this modification is the most suitable solution for the tokenizer.
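One way to avoid emitting partial characters, sketched here under assumptions, is to hold tokens back until the decoded text no longer ends in U+FFFD. This is not the change made in #39. `toy_decode` and `stream_text` are hypothetical names, and the toy decoder treats each token id as one raw byte rather than using a real tokenizer.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character

def toy_decode(token_ids):
    # Hypothetical stand-in for tokenizer.decode: each token id is one raw byte.
    return bytes(token_ids).decode("utf-8", errors="replace")

def stream_text(token_ids, decode=toy_decode):
    """Yield completed text as tokens arrive, holding back partial characters."""
    pending = []
    for token_id in token_ids:
        pending.append(token_id)
        text = decode(pending)
        if text.endswith(REPLACEMENT):
            continue  # possibly an incomplete character; wait for more tokens
        yield text
        pending = []
    if pending:
        yield decode(pending)  # flush whatever is left at the end of the stream

tokens = list("私は".encode("utf-8"))
out = list(stream_text(tokens))
print(out)
```

A limitation of this heuristic is that a genuine U+FFFD in the model output would be held back until the next complete character or the end of the stream.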
This should be fixed after merging #39. However, I found problems with other multibyte characters.
Another instance of this issue occurs with the Ø character.
On the generation side, LMQL now implements full multibyte character support, including the original example with Japanese characters, emojis and special math symbols. What remains is a UI-only issue in the playground, with rendering emojis in particular. However, the application side of things (e.g. use in Python) should not be affected by this. I created a new issue to track this UI issue here: #119.
I have encountered an issue with LMQL not supporting multibyte characters. While it works fine for English queries, I am facing garbled output when trying to run Japanese queries.
It appears that the problem stems from the code not being able to handle multibyte characters when separating the strings returned from LLMs character by character.
Is there any plan to support multibyte characters in the future?
I am more than willing to help and cooperate in any way necessary to resolve this issue.
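To illustrate the failure mode described above, the following sketch decodes a UTF-8 byte stream one chunk at a time. It uses only Python's standard `codecs` module and does not reflect LMQL's actual code. `decode_stream` is a hypothetical helper, and splitting the input into single bytes is an assumption made for demonstration.

```python
import codecs

def decode_stream(chunks):
    """Yield text per byte chunk, buffering incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        # Returns "" while a multibyte character is still incomplete.
        yield decoder.decode(chunk)
    # Flush: anything still buffered at the end is genuinely malformed.
    yield decoder.decode(b"", final=True)

raw = "私".encode("utf-8")                        # three bytes: E7 A7 81
chunks = [raw[i:i + 1] for i in range(len(raw))]  # worst case: one byte per chunk

# Decoding each chunk independently garbles the character.
naive = [c.decode("utf-8", errors="replace") for c in chunks]
# Incremental decoding keeps state across chunks and recovers it.
pieces = list(decode_stream(chunks))

print(naive)
print(pieces)
```

The incremental decoder keeps the partial byte sequence in its internal buffer, so no replacement character is produced unless the stream ends in the middle of a character.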