
LMQL not supporting multibyte characters, causing garbled output for Japanese queries #38

Closed
mahm opened this issue Apr 22, 2023 · 4 comments
Labels
bug Something isn't working

Comments


mahm commented Apr 22, 2023

I have encountered an issue with LMQL not supporting multibyte characters. While it works fine for English queries, I am facing garbled output when trying to run Japanese queries.

@lmql.query
async def en_query():
    '''
    argmax
        "Q: Who are you?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await en_query()
print(response[0].prompt)

[Screenshot 2023-04-22 13:12:47]

@lmql.query
async def ja_query():
    '''
    argmax
        "Q: あなたは誰?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await ja_query()
print(response[0].prompt)

[Screenshot 2023-04-22 13:13:04]

It appears that the problem stems from the code splitting the strings returned from the LLM character by character without handling multibyte characters.

Is there any plan to support multibyte characters in the future?
I am more than willing to help and cooperate in any way necessary to resolve this issue.

laiso added a commit to laiso/lmql that referenced this issue Apr 22, 2023
@laiso
Contributor

laiso commented Apr 22, 2023

The garbled output is due to Unicode replacement characters. This issue occurs during the decoding process of the tokenizer and can be reproduced with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokens = tokenizer.encode('私')
# tokens = [163, 100, 223]
print(tokenizer.decode(tokens[0:1]))  # �
print(tokenizer.decode(tokens[0:2]))  # �
print(tokenizer.decode(tokens[0:3]))  # 私

In #39, I implemented a fix that, in my environment, prevents the appearance of replacement characters. However, I am unsure whether this modification is the most suitable place to address it in the tokenizer.
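For illustration, one common approach to this class of problem (a hedged sketch, not necessarily what #39 does) is to run token bytes through an incremental UTF-8 decoder, which buffers incomplete byte sequences internally instead of emitting replacement characters:

```python
# Hypothetical sketch: buffer partial UTF-8 byte sequences across token
# boundaries instead of decoding each token's bytes independently.
import codecs

def stream_decode(byte_chunks):
    """Yield text incrementally, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # Returns "" while a multibyte character is still incomplete.
        text = decoder.decode(chunk)
        if text:
            yield text

# The three GPT-2 tokens for '私' each map to a single raw byte:
chunks = [b"\xe7", b"\xa7", b"\x81"]
print("".join(stream_decode(chunks)))  # 私
```

With this buffering, no chunk is ever rendered as U+FFFD; the character only appears once its final byte arrives.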

lbeurerkellner added a commit that referenced this issue May 3, 2023
multibyte characters issue #38
@lbeurerkellner
Collaborator

This should be fixed after merging #39. However, I found problems with other multibyte characters, e.g.

argmax
    "A circle has a radius of 3cm. What is the Area? The area of a circle is computed with the following formula:[formula]"
from
    "openai/text-davinci-003"

Gives raw byte sequences for what probably are math Unicode symbols.
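This is consistent with the same root cause: characters like π and ² are multibyte in UTF-8, so decoding a byte sequence cut mid-character yields replacement characters (a minimal illustration, not LMQL internals):

```python
# Assumed example formula; 'π' occupies two bytes in UTF-8.
formula = "A = πr²"
encoded = formula.encode("utf-8")

# Truncate right after the first byte of 'π' to simulate a partial decode:
partial = encoded[: encoded.index(b"\xcf") + 1]
print(partial.decode("utf-8", errors="replace"))  # 'A = ' plus U+FFFD
```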

@lbeurerkellner
Collaborator

Another instance of this issue occurs with the Ø character:

argmax
   """
   French: Sonde V18, Ø18 x 85mm, 33kHz, 8m, raccord ressort & taraudé M12, livré avec 2 piles 3.6V LS14250
   French:[REPEAT] Ø [REPEAT]
   """
from
   'openai/text-davinci-003'
where
   len(TOKENS(REPEAT)) < 10 

@lbeurerkellner lbeurerkellner added the bug Something isn't working label May 29, 2023
@lbeurerkellner
Collaborator

lbeurerkellner commented Jul 17, 2023

On the generation side, LMQL now implements full multibyte character support, including the original Japanese example, emojis, and special math symbols.

What remains is a UI-only issue in the playground with rendering emojis in particular. However, the application side (e.g. use from Python) should not be affected by this.

I created a new issue to track this UI issue here: #119.
