Better handling for invalid utf-8 #21

rroohhh · 2023-03-10T10:26:15Z

Currently we store all text of an Atom as utf-8. This interacts badly with whisper, as not every token generated by it is a valid utf-8 sequence.
There are two cases:

1. Back-to-back tokens generated by whisper form valid utf-8 when combined, but are not valid utf-8 on their own.
2. The tokens are just completely invalid.

The first case is currently handled by combining the tokens generated by whisper into an Atom until the combined text is valid utf-8. This does not solve the second case, however: the combined text never becomes valid, so no Atom is ever emitted for the segment and the whole Paragraph ends up empty.

Furthermore, the handling for the first case assumes these issues are always contained in a single Segment. This might not always be true.
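To make the two cases concrete, here is a minimal Python sketch using the standard library's incremental decoder. It is not the worker's current code: `decode_tokens` is a hypothetical helper, and it assumes the raw bytes of each whisper token are available. Incomplete sequences are held back until the next token arrives (case 1), while bytes that can never be valid are replaced with U+FFFD instead of blocking output forever (case 2). Because the decoder keeps its state between calls, one instance could also be kept alive across Segment boundaries.

```python
import codecs


def decode_tokens(token_bytes):
    """Yield decoded text for a stream of per-token byte strings.

    Hypothetical helper for illustration only.
    """
    # errors="replace" turns undecodable bytes into U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for raw in token_bytes:
        # An incomplete multi-byte sequence is buffered inside the decoder
        # and yields "" until the remaining bytes arrive.
        text = decoder.decode(raw)
        if text:
            yield text
    # Flush: a sequence that is still incomplete at the end is invalid.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "€" (E2 82 AC) split across two tokens, then a byte that is never valid.
tokens = [b"\xe2\x82", b"\xac", b"\xff", b"ok"]
chunks = list(decode_tokens(tokens))
```

With this input, the first two tokens are merged into a single chunk, and the invalid byte becomes a replacement character rather than swallowing the rest of the segment.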

I see two ways forward:

1. Save all text as bytes and only decode as utf-8 string whenever necessary.
2. Add more sophisticated handling for the cases where the generated byte stream is not valid utf-8.

The first option would allow "lossless" storage of everything generated by whisper, but it is unclear how the invalid utf-8 sequences should be interpreted when the text is eventually decoded.
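As a rough sketch of what storing bytes could look like, assuming a simplified, hypothetical `RawAtom` type (not the project's actual Atom), the raw bytes are kept untouched and the interpretation is chosen only at decode time. Python's built-in `replace` and `surrogateescape` error handlers are two possible interpretations: the former is suited to display, the latter allows a byte-exact round trip.

```python
from dataclasses import dataclass


@dataclass
class RawAtom:
    """Hypothetical stand-in for an Atom that stores undecoded bytes."""
    raw: bytes


def display_text(atoms):
    # Join bytes first so sequences split across atoms decode correctly,
    # then replace anything still invalid with U+FFFD.
    return b"".join(a.raw for a in atoms).decode("utf-8", errors="replace")


def roundtrip_text(atoms):
    # surrogateescape maps each invalid byte to a lone surrogate, so the
    # original bytes can be recovered by encoding with the same handler.
    return b"".join(a.raw for a in atoms).decode("utf-8", errors="surrogateescape")


atoms = [RawAtom(b"\xe2\x82"), RawAtom(b"\xac"), RawAtom(b"\xff")]
shown = display_text(atoms)
restored = roundtrip_text(atoms).encode("utf-8", errors="surrogateescape")
```

The trade-off is that a string decoded with `surrogateescape` contains lone surrogates, so it cannot be encoded as strict utf-8 (for example when serializing to JSON) without applying the same handler again.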


rroohhh added worker discuss labels Mar 10, 2023
