whisper -> wav2vec2 conversion #30

Open
rroohhh opened this issue Mar 10, 2023 · 0 comments
rroohhh commented Mar 10, 2023

For alignment we use wav2vec2 with a character-level alphabet. The transcript generated by whisper has to be converted into one that uses only the characters known to the wav2vec2 model before alignment can be performed.

The conversion currently performed by the worker is very unsophisticated:

  1. First, every space is replaced with |, which is the token used by wav2vec2 as a word separator.
  2. Then, any character not present in the wav2vec2 alphabet is simply dropped.
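
The two steps above can be sketched roughly as follows. This is not the worker's actual code: the function name `naive_convert` and the lowercase `ALPHABET` are hypothetical stand-ins, and a real wav2vec2 model brings its own vocabulary.

```python
# Hypothetical alphabet for illustration only; a real model defines its own.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz'|")


def naive_convert(transcript: str, alphabet: set) -> str:
    """Sketch of the current behaviour: map spaces to '|', drop unknown characters."""
    # Step 1: every space becomes the word separator token.
    replaced = transcript.replace(" ", "|")
    # Step 2: any character outside the alphabet is silently dropped.
    return "".join(ch for ch in replaced if ch in alphabet)


print(naive_convert("caf\u00e9 au lait, s'il vous pla\u00eet", ALPHABET))
# prints: caf|au|lait|s'il|vous|plat
```

Note how "café" loses its last letter entirely and "plaît" turns into "plat", which is the kind of loss the improvements below are meant to reduce.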

Several improvements are possible:

  1. Use something like Unicode normalization (for example NFKD) to replace characters in the whisper transcript that are not present in the wav2vec2 alphabet with ones that could be.
  2. Handle languages with no word separators (Chinese, Japanese, etc.).
  3. Add handling for punctuation (. is also a word separator. What about words joined using a -?)
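
A minimal sketch of what items 1 and 3 might look like, using Python's standard `unicodedata` module. The function name `normalize_convert` and the `ALPHABET` set are hypothetical, and treating every punctuation character (including -) as a word separator is only one possible answer to the question raised in item 3. Languages without word separators (item 2) are not addressed here.

```python
import unicodedata

# Hypothetical alphabet for illustration only; a real model defines its own.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz'|")


def normalize_convert(transcript: str, alphabet: set, separator: str = "|") -> str:
    """Decompose with NFKD, keep known characters, end words at spaces/punctuation."""
    out = []
    for ch in unicodedata.normalize("NFKD", transcript):
        if ch in alphabet and ch != separator:
            out.append(ch)
        elif ch.isspace() or unicodedata.category(ch).startswith("P"):
            # Assumption: whitespace and any punctuation end the current word.
            # Skip the separator if no word has been started, to avoid empty words.
            if out and out[-1] != separator:
                out.append(separator)
        # Anything else (e.g. combining accents left over from NFKD) is dropped.
    return "".join(out).strip(separator)


print(normalize_convert("caf\u00e9 au lait. well-known", ALPHABET))
# prints: cafe|au|lait|well|known
```

With NFKD, "é" decomposes into "e" plus a combining accent, so only the accent is dropped and the base letter survives. Characters in the alphabet, such as the apostrophe here, take precedence over the punctuation rule.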
@rroohhh rroohhh added the worker label Mar 10, 2023