Token decoding issue - some characters are missing #25
Can you provide a short audio sample that fails?
If it helps, I can give an example for Croatian. It also happens for Croatian sometimes.
I do not know what the missing letter is / what the word means. In another audio file, which is not on YouTube anymore, I got dovi��enja instead of doviđenja. I produced the input file with:
On a side note: I am very impressed. With the normal whisper code on my CPU, 1 minute of audio took about 1 hour of runtime with the large model. With your C++ project it is much less, maybe a few minutes per audio minute.
@ggerganov this is the sample audio, hope it can help. PS: it's a zip file, unzip it first.
So I found the reason why it fails to transcribe these characters, but I don't know how to fix it. The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain text and you simply have to convert each token separately and join the texts. However, it turns out that certain tokens can be "chained" to produce a text.

I tried to understand the decoding algorithm of the tokenizer using the original Python implementation, but I get lost in the code and cannot figure it out. What I need to understand is how the following example works: https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py

Notice that the 2 tokens in that example combine into a single character. Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
This is the Python code to decode it, from https://github.com/openai/gpt-2/blob/master/src/encoder.py. You need the vocab.json file:

```python
import sys
import json

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    bs = list(range(ord("!"), ord("~")+1)) + list(range(ord("¡"), ord("¬")+1)) + list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())

rev = {v: k for k, v in vocab.items()}  # token id -> vocab string
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def decode(tokens):
    # join the vocab strings of all tokens first, then map each stand-in
    # character back to its raw byte and decode the whole thing as UTF-8
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229]))  # '宇'
```
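For the C++ side, below is a rough sketch of the same byte-level decoding. This is only an illustration and not the actual whisper.cpp fix; the helper names (cp_to_utf8, build_byte_decoder, decode_tokens) and the id-to-string vocab map are assumptions made for the example.

```cpp
// Sketch only: byte-level BPE decoding in C++, mirroring the Python above.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Encode a single Unicode code point as UTF-8 (the code points used by the
// byte encoder stay below 0x800, so one or two bytes are enough).
static std::string cp_to_utf8(int cp) {
    std::string s;
    if (cp < 0x80) {
        s += (char) cp;
    } else {
        s += (char) (0xC0 | (cp >> 6));
        s += (char) (0x80 | (cp & 0x3F));
    }
    return s;
}

// Inverse of bytes_to_unicode(): map each printable stand-in character
// (stored as a UTF-8 string) back to the raw byte 0..255 it represents.
static std::map<std::string, uint8_t> build_byte_decoder() {
    std::vector<int> bs;
    for (int b = '!';  b <= '~';   ++b) bs.push_back(b);
    for (int b = 0xA1; b <= 0xAC;  ++b) bs.push_back(b);
    for (int b = 0xAE; b <= 0xFF;  ++b) bs.push_back(b);

    std::vector<int> cs = bs;
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
            bs.push_back(b);
            cs.push_back(256 + n++);
        }
    }

    std::map<std::string, uint8_t> dec;
    for (size_t i = 0; i < bs.size(); ++i) {
        dec[cp_to_utf8(cs[i])] = (uint8_t) bs[i];
    }
    return dec;
}

// Decode a token sequence: join the vocab strings of *all* tokens first,
// then map the stand-in characters back to raw bytes. Only the final byte
// string is guaranteed to be valid UTF-8.
static std::string decode_tokens(const std::vector<int> & tokens,
                                 const std::map<int, std::string> & vocab) { // id -> vocab string
    static const std::map<std::string, uint8_t> byte_decoder = build_byte_decoder();

    std::string joined;
    for (int t : tokens) {
        joined += vocab.at(t);
    }

    std::string bytes;
    for (size_t i = 0; i < joined.size(); ) {
        // every stand-in character is 1 byte (ASCII) or 2 bytes in UTF-8
        const size_t len = ((uint8_t) joined[i] < 0x80) ? 1 : 2;
        bytes += (char) byte_decoder.at(joined.substr(i, len));
        i += len;
    }
    return bytes; // e.g. tokens {2415, 229} -> "宇"
}
```

The important detail is the order of operations: the per-token vocab strings are concatenated first and only then mapped back to raw bytes, because a single multi-byte UTF-8 character (like the đ in doviđenja) can be split across two tokens; decoding each token on its own is what produces the broken characters.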
Thank you for this!
For my part, I can confirm this fixes the issue for me. Before fix:
After fix:
It's fixed in some cases.
Output:
With OpenAI whisper CLI: