whisper -> wav2vec2 conversion #30

rroohhh · 2023-03-10T10:55:05Z

For alignment we use wav2vec2 with a character-level alphabet. The transcript generated by whisper has to be converted into one that uses only the characters known to the wav2vec2 model before alignment can be performed.

The conversion currently performed by the worker for this is very unsophisticated.

First, every space is replaced with |, which is the token used by wav2vec2 as a word separator.
Then any character not present in the wav2vec2 alphabet is simply dropped.
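
For illustration, the two steps above could be sketched roughly as follows. This is not the worker's actual code: the function name `naive_convert` and the abbreviated `ALPHABET` are assumptions made up for this example, and a real wav2vec2 vocabulary depends on the model in use.

```python
# Hypothetical, abbreviated alphabet; the real one comes from the wav2vec2 model.
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz'|")


def naive_convert(transcript, alphabet=ALPHABET):
    """Replace spaces with '|' and drop every character outside the alphabet."""
    replaced = transcript.replace(" ", "|")
    return "".join(ch for ch in replaced if ch in alphabet)


# "\u00e9" is a precomposed e with acute accent; it is not in the alphabet,
# so it is dropped entirely and the word loses a letter.
print(naive_convert("caf\u00e9 au lait"))
```

With this alphabet the accented letter disappears entirely, so the word that reaches the aligner is shorter than what was actually spoken.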

There are several possible improvements to this:

Use something like Unicode normalization (for example NFKD) to try to replace characters in the whisper transcript that are not present in the wav2vec2 alphabet with ones that could be present.
Handle languages with no word separators (Chinese, Japanese, etc.).
Add handling for punctuation (. is also a word separator; what about words joined using a -?).
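
A minimal sketch of how the first and third ideas might look, using Python's standard `unicodedata.normalize`. Everything else here is an assumption for illustration: the name `convert`, the abbreviated `ALPHABET`, and the choice of `SEPARATORS` (in particular, treating - as a word separator is just one possible answer to the open question above). It does not attempt to handle languages without word separators.

```python
import unicodedata

# Hypothetical, abbreviated alphabet; the real one comes from the wav2vec2 model.
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz'|")
# Assumption: punctuation that ends a word, including the hyphen.
SEPARATORS = frozenset(".,;:!?-")


def convert(transcript, alphabet=ALPHABET, separators=SEPARATORS):
    """Map a transcript onto the alphabet, using NFKD as a fallback."""
    out = []
    for ch in transcript:
        if ch.isspace() or ch in separators:
            candidates = "|"
        elif ch in alphabet:
            candidates = ch
        else:
            # Decompose (e.g. an accented letter into base letter plus
            # combining mark) and keep only the parts the alphabet knows.
            decomposed = unicodedata.normalize("NFKD", ch)
            candidates = "".join(c for c in decomposed if c in alphabet)
        for c in candidates:
            # Collapse runs of separators and skip a leading one.
            if c == "|" and (not out or out[-1] == "|"):
                continue
            out.append(c)
    return "".join(out).rstrip("|")


print(convert("caf\u00e9-cr\u00e8me. tr\u00e8s bien!"))
```

Characters with no useful decomposition are still dropped, so this only narrows the gap rather than closing it.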

rroohhh added the worker label Mar 10, 2023
