CJK character (3 Byte) is split into two tokens in json output. #1798

HaujetZhao · 2024-01-22T16:48:59Z

Detail

I noticed that whisper.cpp can now output full JSON, and its default output encoding is UTF-8.

When the output consists entirely of ASCII characters, it works perfectly.

But when the output contains CJK characters, a small issue arises.

Some UTF-8 basics: UTF-8 is a variable-length character encoding in which common ASCII characters are represented using one byte, while a broader range of characters is encoded using 2 or 3 bytes. Most CJK characters take 3 bytes.

The encoding rules for UTF-8 are as follows:

Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
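The ranges above mean the first byte of a sequence tells you how many bytes the character occupies. A minimal sketch of that lookup follows; the function name `utf8_sequence_length` is made up for illustration, and the 4-byte branch (lead bytes 0xF0 to 0xF4) is an addition not covered by the list above.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of the UTF-8 sequence that starts with lead_byte.

    Returns 0 if the byte cannot start a sequence (for example a
    continuation byte in the range 0x80-0xBF).
    """
    if lead_byte <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # most CJK characters land here
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # not in the list above; added for completeness
    return 0


if __name__ == "__main__":
    for ch in ("A", "é", "中"):
        encoded = ch.encode("utf-8")
        print(ch, encoded.hex(" "), utf8_sequence_length(encoded[0]))
```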

In the full JSON output, many CJK characters are split into two tokens: the first token holds two bytes and the second holds one byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read as UTF-8.
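To illustrate why such a split breaks decoding, here is a small sketch that cuts one 3-byte character into a 2-byte piece and a 1-byte piece, mirroring the split described above. The character 中 and the helper `is_valid_utf8` are chosen for illustration only; they are not taken from the actual output file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


encoded = "中".encode("utf-8")              # three bytes: e4 b8 ad
first, second = encoded[:2], encoded[2:]    # 2-byte piece and 1-byte piece

if __name__ == "__main__":
    print(is_valid_utf8(first))             # truncated sequence
    print(is_valid_utf8(second))            # lone continuation byte
    print(is_valid_utf8(first + second))    # the rejoined character
```

Each piece fails to decode on its own, while the concatenation decodes without error.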

Reproduce

I used the v1.5.4 Windows binary Release.

Here is the zipped wav sound file:

test-zh.wav.zip

Using the command:

main.exe --model ../model/medium.bin --language zh -otxt -ojf  test-zh.wav

A txt and a full JSON result are produced:

Possible solution

A possible solution is to check if the token is a valid utf-8 character and concat the broken tokens before output to json file.
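One way the idea could look, sketched in Python rather than in whisper.cpp's own C++: feed the raw token bytes through an incremental UTF-8 decoder, which holds back an incomplete sequence until the following token completes it. The function name `merge_broken_tokens` and the representation of tokens as a plain list of `bytes` are assumptions made for this sketch. The real full JSON output also carries per-token fields such as timestamps, which are not modelled here.

```python
import codecs
from typing import Iterable, List


def merge_broken_tokens(tokens: Iterable[bytes]) -> List[str]:
    """Join tokens that split a multi-byte character (hypothetical helper).

    A token ending in the middle of a character produces no text by itself;
    its bytes are buffered and emitted together with the next token.
    Bytes that can never form valid UTF-8 become U+FFFD.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    merged: List[str] = []
    for raw in tokens:
        text = decoder.decode(raw)   # returns only the complete characters
        if text:
            merged.append(text)
    tail = decoder.decode(b"", final=True)  # flush a dangling partial sequence
    if tail:
        merged.append(tail)
    return merged


if __name__ == "__main__":
    # "中" split as 2 bytes + 1 byte, followed by an intact "文"
    print(merge_broken_tokens([b"\xe4\xb8", b"\xad", b"\xe6\x96\x87"]))
```

Because the merged entry replaces two original tokens, any per-token metadata would need its own merging rule (for example, keeping the start time of the first piece and the end time of the last), which this sketch leaves open.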


bobqianic · 2024-01-22T17:48:31Z

This is a known issue where it outputs tokens directly instead of words. I mean, even when using English, sometimes it will give you broken words.

bobqianic added the bug and enhancement labels Jan 22, 2024

bobqianic linked a pull request Jan 22, 2024 that will close this issue

Fix the decoding issues #1768

Open

11 tasks

raivisdejus mentioned this issue May 10, 2024

Adding fix for multi-byte segments in whisper.cpp chidiwilliams/buzz#734

Merged

raivisdejus mentioned this issue Jul 16, 2024

Malformed multi-byte UTF8 characters abdeladim-s/pywhispercpp#39

Closed
