
LMQL not supporting multibyte characters, causing garbled output for Japanese queries #38

Closed
mahm opened this issue Apr 22, 2023 · 4 comments
Labels
bug Something isn't working

Comments


mahm commented Apr 22, 2023

I have encountered an issue with LMQL not supporting multibyte characters. While it works fine for English queries, I am facing garbled output when trying to run Japanese queries.

@lmql.query
async def en_query():
    '''
    argmax
        "Q: Who are you?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await en_query()
print(response[0].prompt)

[Screenshot 2023-04-22 13:12:47]

@lmql.query
async def ja_query():
    '''
    argmax
        "Q: あなたは誰?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await ja_query()
print(response[0].prompt)

[Screenshot 2023-04-22 13:13:04]

It appears that the problem stems from the code splitting the strings returned from the LLM character by character without handling multibyte characters.

Is there any plan to support multibyte characters in the future?
I am more than willing to help and cooperate in any way necessary to resolve this issue.

laiso added a commit to laiso/lmql that referenced this issue Apr 22, 2023
@laiso
Contributor

laiso commented Apr 22, 2023

The garbled output is due to Unicode replacement characters. This issue occurs during the decoding process of the tokenizer and can be reproduced with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokens = tokenizer.encode('私')
# tokens = [163, 100, 223]
print(tokenizer.decode(tokens[0:1]))  # �
print(tokenizer.decode(tokens[0:2]))  # �
print(tokenizer.decode(tokens[0:3]))  # 私

In #39, I implemented a fix that, in my environment, prevents the appearance of replacement characters. However, I am unsure whether this modification is the most suitable place to address it in the tokenizer.
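For illustration, one common approach to this class of problem (a hedged sketch, not necessarily what #39 does) is to run token bytes through an incremental UTF-8 decoder, which buffers incomplete byte sequences internally instead of emitting replacement characters:

```python
# Hypothetical sketch: buffer partial UTF-8 byte sequences across token
# boundaries instead of decoding each token's bytes independently.
import codecs

def stream_decode(byte_chunks):
    """Yield text incrementally, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # Returns "" while a multibyte character is still incomplete.
        text = decoder.decode(chunk)
        if text:
            yield text

# The three GPT-2 tokens for '私' each map to a single raw byte:
chunks = [b"\xe7", b"\xa7", b"\x81"]
print("".join(stream_decode(chunks)))  # 私
```

With this buffering, no chunk is ever rendered as U+FFFD; the character only appears once its final byte arrives.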

lbeurerkellner added a commit that referenced this issue May 3, 2023
multibyte characters issue #38
@lbeurerkellner
Collaborator

This should be fixed after merging #39. However, I found problems with other multibyte characters, e.g.

argmax
    "A circle has a radius of 3cm. What is the Area? The area of a circle is computed with the following formula:[formula]"
from
    "openai/text-davinci-003"

Gives raw byte sequences for what probably are math Unicode symbols.
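This is consistent with the same root cause: characters like π and ² are multibyte in UTF-8, so decoding a byte sequence cut mid-character yields replacement characters (a minimal illustration, not LMQL internals):

```python
# Assumed example formula; 'π' occupies two bytes in UTF-8.
formula = "A = πr²"
encoded = formula.encode("utf-8")

# Truncate right after the first byte of 'π' to simulate a partial decode:
partial = encoded[: encoded.index(b"\xcf") + 1]
print(partial.decode("utf-8", errors="replace"))  # 'A = ' plus U+FFFD
```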

@lbeurerkellner
Collaborator

Another instance of this issue occurs with the Ø character:

argmax
   """
   French: Sonde V18, Ø18 x 85mm, 33kHz, 8m, raccord ressort & taraudé M12, livré avec 2 piles 3.6V LS14250
   French:[REPEAT] Ø [REPEAT]
   """
from
   'openai/text-davinci-003'
where
   len(TOKENS(REPEAT)) < 10 

@lbeurerkellner lbeurerkellner added the bug Something isn't working label May 29, 2023
@lbeurerkellner
Collaborator

lbeurerkellner commented Jul 17, 2023

On the generation side, LMQL now implements full multibyte character support, including the original Japanese example, emojis, and special math symbols.

What remains is a UI-only issue in the playground with rendering emojis in particular. However, the application side (e.g. use from Python) should not be affected by this.

I created a new issue to track this UI issue here: #119.
