CJK character (3 Byte) is split into two tokens in json output. #1798

Open
HaujetZhao opened this issue Jan 22, 2024 · 1 comment · May be fixed by #1768
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

@HaujetZhao
Detail

I noticed that whisper.cpp can now output full JSON, and that its default output encoding is UTF-8.

When the output consists entirely of ASCII characters, it works perfectly.

But when the output contains CJK characters, a small issue arises.

A UTF-8 basic: UTF-8 is a variable-length character encoding in which common ASCII characters are represented using one byte, while a broader range of characters need more; most CJK characters are encoded using 3 bytes.

The encoding rules for UTF-8 are as follows:

  • Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
  • Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
  • Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
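
As a rough illustration of the ranges above, the lead byte alone tells how many bytes a character occupies. The sketch below is not taken from whisper.cpp; the function name `utf8_sequence_length` is made up for this example, and the four-byte branch (lead bytes 0xF0 to 0xF4) is an addition that the list above does not cover.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of the UTF-8 sequence that starts with lead_byte.

    Returns 0 when the byte cannot start a sequence (a continuation byte
    such as 0x80..0xBF, or an invalid lead byte such as 0xC0, 0xC1, 0xF5+).
    """
    if lead_byte <= 0x7F:            # single-byte (ASCII)
        return 1
    if 0xC2 <= lead_byte <= 0xDF:    # double-byte
        return 2
    if 0xE0 <= lead_byte <= 0xEF:    # triple-byte (most CJK characters)
        return 3
    if 0xF0 <= lead_byte <= 0xF4:    # four-byte (not listed above)
        return 4
    return 0


encoded = "中".encode("utf-8")
print(encoded.hex(" "), utf8_sequence_length(encoded[0]))  # e4 b8 ad 3
```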

In the full JSON output, many CJK characters are split into two tokens: the first token holds two bytes and the second holds one byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read as UTF-8.
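
The effect can be reproduced in isolation. The snippet below is only a sketch: it splits one character by hand into a 2-byte piece and a 1-byte piece to mimic the token boundary described here, and does not read any actual whisper.cpp output. The helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "中".encode("utf-8")          # b'\xe4\xb8\xad', three bytes
first, second = raw[:2], raw[2:]    # assumed 2-byte + 1-byte token split

print(is_valid_utf8(first))           # False: sequence is truncated
print(is_valid_utf8(second))          # False: lone continuation byte
print(is_valid_utf8(first + second))  # True: the pieces rejoin into one character
```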


Reproduce

I used the v1.5.4 Windows binary Release.

Here is the zipped wav sound file:

test-zh.wav.zip

Using the command:

main.exe --model ../model/medium.bin --language zh -otxt -ojf  test-zh.wav

A txt result and a full JSON result are produced.

Possible solution

A possible solution is to check whether each token is valid UTF-8 and to concatenate the broken tokens before writing them to the JSON file.
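
A minimal sketch of that idea, written in Python for brevity rather than in whisper.cpp's own C++: it assumes the raw bytes of each token are available as a list, and the function name `merge_split_tokens` is hypothetical. Pieces are buffered until the buffer decodes as UTF-8. Bytes that never become valid are replaced with U+FFFD at the end, which is one possible policy among several.

```python
def merge_split_tokens(token_bytes: list[bytes]) -> list[str]:
    """Join adjacent byte pieces until each joined piece is valid UTF-8."""
    merged: list[str] = []
    pending = b""
    for piece in token_bytes:
        pending += piece
        try:
            merged.append(pending.decode("utf-8"))
            pending = b""          # buffer decoded cleanly, start a new one
        except UnicodeDecodeError:
            continue               # incomplete so far, wait for the next piece
    if pending:
        # Leftover bytes that never formed a valid sequence (assumed policy).
        merged.append(pending.decode("utf-8", errors="replace"))
    return merged


ni = "你".encode("utf-8")
tokens = [ni[:2], ni[2:], b"hello", "好".encode("utf-8")]
print(merge_split_tokens(tokens))  # ['你', 'hello', '好']
```

Any per-token fields that accompany the text in the JSON output would also have to be combined when two tokens are merged; how to do that is left open here.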

@bobqianic added the bug and enhancement labels on Jan 22, 2024
@bobqianic linked a pull request on Jan 22, 2024 that will close this issue
@bobqianic
Collaborator

This is a known issue: the output contains tokens directly instead of words. Even with English, it will sometimes give you broken words.
