For alignment we use wav2vec2 with a character-level alphabet. The transcript generated by whisper has to be converted into one that uses only the characters known to the wav2vec2 model before alignment can be performed.
The conversion currently performed by the worker for this is very unsophisticated.
First, every space is replaced with |, which is the token wav2vec2 uses as a word separator.
Any character not present in the wav2vec2 alphabet is simply dropped.
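As a rough illustration, the two steps above could look like the sketch below. This is not the worker's actual code: the function name `to_alignment_text` and the lowercase `ALPHABET` are hypothetical, and a real wav2vec2 model's vocabulary may differ (for example, it may be uppercase).

```python
# Hypothetical alphabet for illustration only; a real wav2vec2 model
# exposes its own vocabulary, which may differ from this one.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz'|")


def to_alignment_text(transcript: str, alphabet: set[str]) -> str:
    """Naive conversion: spaces become '|', unknown characters are dropped."""
    # Step 1: replace every space with the word-separator token.
    replaced = transcript.replace(" ", "|")
    # Step 2: drop every character that is not in the alphabet.
    return "".join(ch for ch in replaced if ch in alphabet)


print(to_alignment_text("café au lait", ALPHABET))
```

With this assumed alphabet, the é in "café" is silently lost, which is the kind of case the improvements are meant to address.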
There are several improvements to this:
Use something like Unicode normalization (for example NFKD) to replace characters in the whisper transcript that are not present in the wav2vec2 alphabet with ones that could be present.
Handle languages with no word separators (Chinese, Japanese, etc.).
Add handling for punctuation (. is also a word separator. What about words joined using a -?)
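A minimal sketch of the first and third ideas, assuming a lowercase alphabet and a small set of sentence punctuation treated as separators. The function name `normalize_for_alignment`, the `ALPHABET` set and the `SEPARATOR_PUNCTUATION` string are all hypothetical. Hyphens are simply dropped here, which leaves the question about hyphenated words open, and languages without word separators are not addressed at all.

```python
import unicodedata

# Hypothetical alphabet and punctuation set, for illustration only.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz'|")
SEPARATOR_PUNCTUATION = ".,;:!?"


def normalize_for_alignment(transcript: str, alphabet: set[str], sep: str = "|") -> str:
    """Map a transcript onto the alphabet using NFKD and punctuation handling."""
    out: list[str] = []
    for ch in transcript:
        if ch.isspace() or ch in SEPARATOR_PUNCTUATION:
            # Whitespace and sentence punctuation both act as word separators.
            candidates = sep
        else:
            # NFKD splits e.g. 'é' into 'e' plus a combining accent, so the
            # base letter can survive even when the accented form is unknown.
            candidates = unicodedata.normalize("NFKD", ch)
        for c in candidates:
            if c == sep:
                # Collapse runs of separators and avoid a leading one.
                if out and out[-1] != sep:
                    out.append(sep)
            elif c in alphabet:
                out.append(c)
            # Anything else (combining marks, hyphens, ...) is still dropped.
    return "".join(out).strip(sep)


print(normalize_for_alignment("café au lait. déjà vu!", ALPHABET))
```

Compared with dropping unknown characters outright, this keeps the base letters of accented characters and expands compatibility forms such as the ﬁ ligature, at the cost of aligning against text that no longer matches the whisper transcript character for character.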