I noticed that whisper.cpp can now output full JSON, and that its default output encoding is UTF-8.
When the output consists entirely of ASCII characters, it works perfectly.
But when the output contains CJK characters, a small issue arises.
A UTF-8 basic: UTF-8 is a variable-length character encoding in which common ASCII characters are represented using one byte, while a broader range of characters is encoded using multiple bytes. Most CJK characters take 3 bytes.
The encoding rules for UTF-8 are as follows:
Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
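To make the ranges above concrete, here is a minimal Python sketch that maps the first byte of a sequence to its expected length. The function name `utf8_sequence_length` is made up for illustration and is not part of whisper.cpp. The four-byte case (lead bytes 0xF0 to 0xF4) is not in the list above and is included only for completeness.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the expected length of a UTF-8 sequence given its first byte.

    Returns 0 if the byte cannot start a sequence.
    """
    if lead <= 0x7F:
        return 1  # single-byte (ASCII)
    if 0xC2 <= lead <= 0xDF:
        return 2  # double-byte
    if 0xE0 <= lead <= 0xEF:
        return 3  # triple-byte (most CJK characters)
    if 0xF0 <= lead <= 0xF4:
        return 4  # four-byte (not covered by the list above)
    return 0  # continuation byte (0x80-0xBF) or invalid lead byte


for ch in ("A", "é", "中"):
    encoded = ch.encode("utf-8")
    # Print the character, its bytes in hex, and the length implied by the lead byte.
    print(ch, encoded.hex(" "), utf8_sequence_length(encoded[0]))
```

A token that ends before the length implied by its lead byte is reached is an incomplete sequence.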
In the full JSON output, many CJK characters are split across two tokens: the first token holds two bytes and the second holds one byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read with UTF-8 encoding.
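The effect of such a split can be reproduced without whisper.cpp. The sketch below takes one three-byte character, cuts it into a two-byte piece and a one-byte piece (the character and the helper `is_valid_utf8` are chosen only for illustration), and checks each piece separately.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


char_bytes = "你".encode("utf-8")  # three bytes: e4 bd a0
first, second = char_bytes[:2], char_bytes[2:]

print(is_valid_utf8(first))           # False: truncated three-byte sequence
print(is_valid_utf8(second))          # False: lone continuation byte
print(is_valid_utf8(first + second))  # True: the pieces are only valid together
```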
Reproduce
I used the v1.5.4 Windows binary release.
Here is the zipped wav sound file:
test-zh.wav.zip
Using the command:
A txt result and a json-full result are produced:
Possible solution
A possible solution is to check whether each token is a valid UTF-8 sequence and to concatenate the broken tokens before writing them to the JSON file.
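As a rough illustration of that idea, here is a minimal Python sketch. It assumes the tokens are available as raw byte strings. The function `merge_broken_tokens` and the token list are hypothetical, and this is not how whisper.cpp stores tokens internally. It buffers bytes until they decode as UTF-8, then emits the merged text.

```python
from typing import List


def merge_broken_tokens(tokens: List[bytes]) -> List[str]:
    """Concatenate consecutive tokens until the buffered bytes are valid UTF-8."""
    merged = []
    pending = b""
    for tok in tokens:
        pending += tok
        try:
            merged.append(pending.decode("utf-8"))
            pending = b""  # buffer decoded cleanly, start a new one
        except UnicodeDecodeError:
            continue  # still incomplete, wait for the next token
    if pending:
        # Leftover bytes never became valid; keep them visible as U+FFFD.
        merged.append(pending.decode("utf-8", errors="replace"))
    return merged


# Hypothetical token stream: "你" split 2 + 1, followed by an intact "好".
ni = "你".encode("utf-8")
tokens = [ni[:2], ni[2:], "好".encode("utf-8")]
print(merge_broken_tokens(tokens))
```

This sketch only merges text. A real fix would also need to decide how to combine the per-token fields (timestamps, probabilities) of the merged tokens, and a genuinely invalid byte would cause everything after it to be buffered until the end.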