Currently we store all text of an `Atom` as utf-8. This interacts badly with whisper, as not every token generated by it is a valid utf-8 sequence.
There are two cases:
1. Back-to-back tokens generated by whisper form valid utf-8 when combined, but are not valid utf-8 on their own.
2. The tokens are just completely invalid.
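To make the two cases concrete, here is a minimal sketch. The byte strings are hand-made stand-ins for token bytes, not actual whisper output, and `is_valid_utf8` is a hypothetical helper introduced only for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as utf-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: one multi-byte character split across two (assumed) tokens.
snowman = "☃".encode("utf-8")  # three bytes: e2 98 83
first, second = snowman[:2], snowman[2:]

case1_alone = (is_valid_utf8(first), is_valid_utf8(second))
case1_joined = is_valid_utf8(first + second)

# Case 2: a byte that can never appear in valid utf-8, no matter what follows.
case2 = is_valid_utf8(b"\xff")
case2_with_more = is_valid_utf8(b"\xff" + b"more text")
```

In the first case each half fails to decode while the concatenation succeeds. In the second case no amount of further input makes the sequence valid.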
The first case is currently handled by combining the tokens generated by whisper into an `Atom` until the combined text is valid utf-8. However, this does not solve the second case and will just cause the whole `Paragraph` to be empty, as no `Atom` will ever be emitted for the segment.
Furthermore, the handling for the first case assumes these issues are always contained in a single `Segment`. This might not always be true.
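The failure mode can be reproduced with a small sketch of the "combine until valid" strategy. This is an assumption about the general shape of the current handling, not the project's actual code; `combine_until_valid` and its inputs are hypothetical.

```python
from typing import Iterable, List


def combine_until_valid(token_bytes: Iterable[bytes]) -> List[str]:
    """Accumulate token bytes and emit text only once the buffer is valid utf-8."""
    atoms: List[str] = []
    pending = b""
    for tok in token_bytes:
        pending += tok
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            # Buffer is not (yet) valid: keep accumulating.
            continue
        atoms.append(text)
        pending = b""
    return atoms


# Case 1: the split character is reassembled as expected.
split_ok = combine_until_valid([b"Hi", b"\xe2\x98", b"\x83"])

# Case 2: a single invalid byte poisons the buffer, so nothing is ever emitted.
poisoned = combine_until_valid([b"\xff", b"Hi", b"there"])
```

With the invalid byte at the start, the buffer never becomes valid and the result is an empty list, which matches the empty `Paragraph` described above. If the accumulation were reset at each segment boundary, a character split across two segments would be lost in the same way.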
I see two ways forward:
1. Save all text as bytes and only decode as a utf-8 string whenever necessary.
2. Add more sophisticated handling for the cases where the generated byte stream is not valid utf-8.
The first option would allow "lossless" storage of everything generated by whisper. However, it is unclear how to interpret the invalid utf-8 sequences.
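As a rough sketch of what each option could look like using only the Python standard library (the byte strings are made-up examples, and neither approach is a decided design): option 1 keeps the raw bytes and picks an error handler at decode time, while option 2 could use an incremental decoder that buffers incomplete sequences across chunk boundaries and substitutes U+FFFD for bytes that can never be valid.

```python
import codecs

# Option 1: store raw bytes, decode only when a string is needed.
raw = b"ok \xe2\x98\x83 \xff"

# "replace" is lossy: the invalid byte becomes U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# "surrogateescape" maps invalid bytes to lone surrogates, so the
# original bytes can be recovered exactly.
escaped = raw.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Option 2: an incremental decoder keeps state between chunks, so a
# character split across tokens (or segments) is still reassembled.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
chunks = [b"Hi ", b"\xe2\x98", b"\x83", b"\xff", b"!"]
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush any leftover bytes
streamed = "".join(pieces)
```

Note that strings produced with `surrogateescape` contain lone surrogates and cannot be encoded as strict utf-8, so that handler only helps if the bytes remain the stored representation. With the incremental decoder, a chunk that ends mid-character yields an empty string until the remaining bytes arrive.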