Better handling for invalid utf-8 #21

Open
rroohhh opened this issue Mar 10, 2023 · 0 comments

Currently we store all text of an Atom as utf-8. This interacts badly with whisper, because not every token it generates is a valid utf-8 sequence.
There are two cases:

  1. Back-to-back tokens generated by whisper form valid utf-8 when concatenated, but are not valid utf-8 on their own.
  2. The tokens are completely invalid and never decode as utf-8, even when combined.

The first case is currently handled by combining the tokens generated by whisper into an Atom until the combined text is valid utf-8. However, this does not solve the second case and instead leaves the whole Paragraph empty, as no Atom is ever emitted for the segment.
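The combining strategy described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `combine_tokens` and the plain `bytes` tokens are hypothetical stand-ins for the real Atom handling.

```python
from typing import Iterable, List


def combine_tokens(token_bytes: Iterable[bytes]) -> List[str]:
    """Merge consecutive token byte strings until the buffer decodes as utf-8.

    Hypothetical helper: each returned string stands in for one Atom's text.
    """
    atoms: List[str] = []
    pending = b""
    for raw in token_bytes:
        pending += raw
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            # Incomplete or invalid so far: keep accumulating.
            continue
        atoms.append(text)
        pending = b""
    return atoms


# Case 1: "é" (0xC3 0xA9) split across two tokens is recovered.
assert combine_tokens([b"caf", b"\xc3", b"\xa9", b" ok"]) == ["caf", "é", " ok"]

# Case 2: 0xFF never occurs in valid utf-8, so the buffer never decodes
# again and no Atom is emitted for the rest of the segment.
assert combine_tokens([b"\xff", b"hello", b" world"]) == []
```

The second assertion reproduces the empty-Paragraph symptom: a single invalid byte poisons the buffer for every token that follows it.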

Furthermore, the handling for the first case assumes that such a split sequence is always contained in a single Segment. This might not always be true.
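One possible way to drop the single-Segment assumption is to keep decoder state alive across segment boundaries. The sketch below uses the standard library's incremental utf-8 decoder. `decode_segments` and the list-of-lists input are assumptions made for illustration, and `errors="replace"` is just one of several possible policies for invalid bytes.

```python
import codecs
from typing import Iterable, List


def decode_segments(segments: Iterable[Iterable[bytes]]) -> List[List[str]]:
    """Decode token bytes segment by segment with one shared decoder.

    Hypothetical helper: bytes of an incomplete sequence are buffered inside
    the decoder and carried over into the next segment.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    result: List[List[str]] = []
    for segment in segments:
        texts: List[str] = []
        for raw in segment:
            text = decoder.decode(raw)
            if text:  # empty string means the bytes are still buffered
                texts.append(text)
        result.append(texts)
    # Flush whatever is left; a truncated sequence becomes U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail and result:
        result[-1].append(tail)
    return result


# "€" (0xE2 0x82 0xAC) split across two segments is still recovered.
assert decode_segments([[b"10 ", b"\xe2\x82"], [b"\xac", b" each"]]) == [
    ["10 "],
    ["€", " each"],
]

# A byte that can never be valid is replaced instead of blocking the output.
assert decode_segments([[b"\xff", b"hi"]]) == [["\ufffd", "hi"]]
```

With this approach a character split across segments is attributed to the segment in which it completes, which may or may not be acceptable for timing information.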

I see two ways forward:

  1. Save all text as bytes and decode it as a utf-8 string only when necessary.
  2. Add more sophisticated handling for the cases where the generated byte stream is not valid utf-8.

The first option would allow "lossless" storage of everything generated by whisper. However, it is unclear how to interpret the invalid utf-8 sequences.
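To make the first option more concrete, here is a minimal sketch of byte storage with decoding on demand. `RawAtom` is a hypothetical type, not something that exists in the codebase. The error handlers shown are standard Python codec options, and each one is a different answer to the question of how to interpret invalid sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawAtom:
    """Hypothetical Atom that keeps whisper's output as raw bytes."""

    data: bytes

    def text(self, errors: str = "replace") -> str:
        # Decoding happens only when a string is actually needed.
        return self.data.decode("utf-8", errors=errors)


atom = RawAtom(b"ok \xff!")

# "replace" is lossy for display: the invalid byte becomes U+FFFD.
assert atom.text() == "ok \ufffd!"

# "surrogateescape" maps invalid bytes to lone surrogates, so the original
# bytes can be restored exactly when encoding with the same handler.
escaped = atom.text("surrogateescape")
assert escaped.encode("utf-8", errors="surrogateescape") == atom.data

# "strict" raises UnicodeDecodeError, which is roughly today's failure mode.
```

Strings produced with `surrogateescape` are not valid utf-8 when encoded normally, so they would need care before being serialized or sent to a frontend.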
