
Token decoding issue - some characters are missing #25

Closed
yujinqiu opened this issue Oct 6, 2022 · 9 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@yujinqiu commented Oct 6, 2022

./main -m models/ggml-medium.bin -l zh -f ~/Movies/samplecn16k.wav

Output:

[00:00.000 --> 00:16.000]  元����,其实就是����世界,而且要用����世界这个��来定��元����的话,要比元����本身更加����。到这里就出现问题了。那它为什么不叫����世界呢?最��单的原因就是,����世界这个说法大家已经听��了,而元������得更为新��,又包��成为了一个新的概念。
[00:16.000 --> 00:44.000]  现在的元����技��,����没有我们想象中那么先进。按照目前世界第一元����公司,Roblox公司对于元����的定��来看,它起��要具��8个要素,分别是身份、社交、成进、����、多元、��地、经��、文明。身份就是一个����身份,��现实中的角色无关,这个比��好理解。社交也就是社交系��。成进就是感知����的升��,要做到和现实世界的体��完全相同。����就��������,不会有卡��,多元就多元化,

With the OpenAI whisper CLI:

whisper --language zh ~/Movies/samplecn16k.wav
[00:00.000 --> 00:01.760] 元宇宙其实就虚拟世界
[00:01.760 --> 00:04.400] 而且要用虚拟世界这个词来定义元宇宙的话
[00:04.400 --> 00:06.400] 要比元宇宙本身更加准确
[00:06.400 --> 00:07.680] 但这里就出现问题了
[00:07.680 --> 00:09.360] 那它为什么不叫虚拟世界呢?
[00:09.360 --> 00:10.720] 最简单的原因就是
[00:10.720 --> 00:12.880] 虚拟世界这个说法大家已经听腻了
[00:12.880 --> 00:14.320] 而元宇宙显得更为吸引
[00:14.320 --> 00:16.200] 又包装成为了一个新的概念
[00:16.200 --> 00:17.440] 现在的元宇宙技术
[00:17.440 --> 00:19.160] 原有没有我们想象中那么先进
[00:19.160 --> 00:21.320] 按照目前世界第一元宇宙公司
[00:21.320 --> 00:23.480] 罗布洛克斯公司对于元宇宙的定义来看
[00:23.480 --> 00:25.080] 它起码要具备8个要素
[00:25.080 --> 00:30.680] 分别是身份、社交、成敬、延迟、多元、随地、经济、文明
[00:30.680 --> 00:32.280] 身份就是一个虚拟身份
[00:32.280 --> 00:33.640] 与现实中的角色无关
[00:33.640 --> 00:34.640] 这个比较好理解
[00:34.640 --> 00:36.200] 社交也就是社交系统
[00:36.200 --> 00:38.320] 成敬就是感知设备的升级
[00:38.320 --> 00:40.800] 要做到和现实世界的体验完全相同
[00:40.800 --> 00:42.080] 延迟就网络延迟
[00:42.080 --> 00:43.080] 不会有卡顿
[00:43.080 --> 00:44.200] 多元就多元化
[00:44.200 --> 00:45.600] 比如可以在里面玩游戏
@ggerganov ggerganov added the bug Something isn't working label Oct 6, 2022
@ggerganov (Owner)

Can you provide a short audio sample that fails?

@ghost commented Oct 8, 2022

If it helps, I can give an example for Croatian; it also happens there sometimes.

Input Video

[01:11.000 --> 01:15.000]   Sanadar u završnjoj riječi i nasu��enju za HIPO rekao da nije kriv presuda sljedećeg tjedna.

I do not know what the missing letter is or what the word means. In another audio file, which is not on YouTube anymore, I got dovi��enja instead of doviđenja.

I produced the input file with:

youtube-dl -x --audio-format=mp3 $video_url
ffmpeg -i $mp3_file -ar 16000 -ac 1 -c:a pcm_s16le whisper_input.wav
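These flags convert the audio to 16 kHz mono 16-bit PCM, which is the input format whisper.cpp expects.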

On a side note: I am very impressed. With the original whisper code on my CPU, one minute of audio took about an hour to transcribe with the large model. With your C++ project it takes much less, maybe a few minutes per minute of audio.

@yujinqiu (Author)

samplecn16k.wav.zip

@ggerganov this is the sample audio; hope it helps. PS: it's a zip file, unzip it first.

@ggerganov ggerganov added the help wanted Extra attention is needed label Oct 10, 2022
@ggerganov (Owner)

So I found the reason why it fails to transcribe these characters, but I don't know how to fix it.

The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain text and you simply have to convert each token separately and join the texts. However, it turns out that there are certain tokens that can be "chained" to produce a text.

I tried to understand the decoding algorithm of the tokenizer, using the original Python implementation, but I get lost in the code and cannot figure it out.

What I need to understand is how the following example works:

https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py

Notice that the two tokens 2415 and 229 individually decode to garbage, while together they decode to 宇.
I think the tokenizer somehow uses the merges.txt data, which I currently completely ignore.

Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
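
To illustrate the failure mode, here is a minimal sketch in plain Python (independent of the whisper.cpp code): the character 宇 spans three UTF-8 bytes, and a token boundary can fall between them, so each token's bytes alone are not valid UTF-8.

# '宇' (U+5B87) occupies three bytes in UTF-8.
raw = "宇".encode("utf-8")
print(raw.hex())  # e5ae87

# A token boundary can fall inside those bytes; each fragment on its
# own is not valid UTF-8 and decodes to replacement garbage ...
print(raw[:2].decode("utf-8", errors="replace"))  # �
print(raw[2:].decode("utf-8", errors="replace"))  # �

# ... but concatenating the fragments before decoding recovers the text.
print((raw[:2] + raw[2:]).decode("utf-8"))  # 宇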

@ggerganov ggerganov changed the title Some Chinese character is not show up Token decoding issue - some characters are missing Oct 10, 2022
@wcchoi commented Oct 12, 2022

This is the Python code to decode it, from https://github.com/openai/gpt-2/blob/master/src/encoder.py.

You need the vocab.json file:

import json
import sys

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    This also avoids mapping to whitespace/control characters that the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())
rev = {v: k for k, v in vocab.items()}  # token id -> byte-alphabet string

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}  # printable char -> raw byte

def decode(tokens):
    # Join the tokens' byte-alphabet strings first; only the concatenated
    # byte sequence is valid UTF-8, so decode once at the end.
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229])) # '宇'
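
In other words, vocab.json stores each token not as text but as a string over a reversible byte-to-printable-character alphabet built by bytes_to_unicode, so a single token may hold only part of a multi-byte UTF-8 character. The decoder has to map every character of every token back to its raw byte and run the UTF-8 decode once over the concatenated bytes, never per token.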

@ggerganov (Owner)

Thank you for this!
It is clear now how it works and I will be able to fix it.

@ggerganov (Owner)

@yujinqiu @aufziehvogel
Thanks to @r0y6a3n0 this should be resolved now.
Download the model files again and give it a try.

@ChristopherFritz

I can confirm this fixes the issue for me.

Before fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん運動神��もすっごくいいし、バトンもうまいんだけど。

After fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん運動神経もすっごくいいし、バトンもうまいんだけど。

@yujinqiu (Author)

It's fixed in some cases.
