
Token decoding issue - some characters are missing #25

Closed
yujinqiu opened this issue Oct 6, 2022 · 9 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@yujinqiu commented Oct 6, 2022

./main -m models/ggml-medium.bin -l zh -f ~/Movies/samplecn16k.wav

Output:

[00:00.000 --> 00:16.000]  元����,其实就是����世界,而且要用����世界这个��来定��元����的话,要比元����本身更加����。到这里就出现问题了。那它为什么不叫����世界呢?最��单的原因就是,����世界这个说法大家已经听��了,而元������得更为新��,又包��成为了一个新的概念。
[00:16.000 --> 00:44.000]  现在的元����技��,����没有我们想象中那么先进。按照目前世界第一元����公司,Roblox公司对于元����的定��来看,它起��要具��8个要素,分别是身份、社交、成进、����、多元、��地、经��、文明。身份就是一个����身份,��现实中的角色无关,这个比��好理解。社交也就是社交系��。成进就是感知����的升��,要做到和现实世界的体��完全相同。����就��������,不会有卡��,多元就多元化,

With the OpenAI whisper CLI:

whisper --language zh ~/Movies/samplecn16k.wav
[00:00.000 --> 00:01.760] 元宇宙其实就虚拟世界
[00:01.760 --> 00:04.400] 而且要用虚拟世界这个词来定义元宇宙的话
[00:04.400 --> 00:06.400] 要比元宇宙本身更加准确
[00:06.400 --> 00:07.680] 但这里就出现问题了
[00:07.680 --> 00:09.360] 那它为什么不叫虚拟世界呢?
[00:09.360 --> 00:10.720] 最简单的原因就是
[00:10.720 --> 00:12.880] 虚拟世界这个说法大家已经听腻了
[00:12.880 --> 00:14.320] 而元宇宙显得更为吸引
[00:14.320 --> 00:16.200] 又包装成为了一个新的概念
[00:16.200 --> 00:17.440] 现在的元宇宙技术
[00:17.440 --> 00:19.160] 原有没有我们想象中那么先进
[00:19.160 --> 00:21.320] 按照目前世界第一元宇宙公司
[00:21.320 --> 00:23.480] 罗布洛克斯公司对于元宇宙的定义来看
[00:23.480 --> 00:25.080] 它起码要具备8个要素
[00:25.080 --> 00:30.680] 分别是身份、社交、成敬、延迟、多元、随地、经济、文明
[00:30.680 --> 00:32.280] 身份就是一个虚拟身份
[00:32.280 --> 00:33.640] 与现实中的角色无关
[00:33.640 --> 00:34.640] 这个比较好理解
[00:34.640 --> 00:36.200] 社交也就是社交系统
[00:36.200 --> 00:38.320] 成敬就是感知设备的升级
[00:38.320 --> 00:40.800] 要做到和现实世界的体验完全相同
[00:40.800 --> 00:42.080] 延迟就网络延迟
[00:42.080 --> 00:43.080] 不会有卡顿
[00:43.080 --> 00:44.200] 多元就多元化
[00:44.200 --> 00:45.600] 比如可以在里面玩游戏
@ggerganov ggerganov added the bug Something isn't working label Oct 6, 2022
@ggerganov (Owner)

Can you provide a short audio sample that fails?

@ghost commented Oct 8, 2022

If it helps, I can give an example for Croatian; it also happens there sometimes.

Input Video

[01:11.000 --> 01:15.000]   Sanadar u završnjoj riječi i nasu��enju za HIPO rekao da nije kriv presuda sljedećeg tjedna.

I do not know what the missing letter is or what the word means. In another audio file, which is not on YouTube anymore, I got dovi��enja instead of doviđenja.

I produced the input file with:

youtube-dl -x --audio-format=mp3 $video_url
ffmpeg -i $mp3_file -ar 16000 -ac 1 -c:a pcm_s16le whisper_input.wav
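These flags convert the audio to 16 kHz mono 16-bit PCM, which is the input format whisper.cpp expects.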

On a side note: I am very impressed. With the original whisper code on my CPU, one minute of audio took about an hour to transcribe with the large model. With your C++ project it takes much less, maybe a few minutes per minute of audio.

@yujinqiu (Author)

samplecn16k.wav.zip

@ggerganov this is the sample audio; hope it helps. PS: it's a zip file, unzip it first.

@ggerganov ggerganov added the help wanted Extra attention is needed label Oct 10, 2022
@ggerganov (Owner)

So I found the reason why it fails to transcribe these characters, but I don't know how to fix it.

The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain text and you simply have to convert each token separately and join the texts. However, it turns out that there are certain tokens that can be "chained" to produce a text.

I tried to understand the decoding algorithm of the tokenizer, using the original Python implementation, but I get lost in the code and cannot figure it out.

What I need to understand is how the following example works:

https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py

Notice that the two tokens 2415 and 229 individually decode to garbage, while together they decode to 宇.
I think the tokenizer somehow uses the merges.txt data, which I currently completely ignore.

Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
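
To illustrate the failure mode, here is a minimal sketch in plain Python (independent of the whisper.cpp code): the character 宇 spans three UTF-8 bytes, and a token boundary can fall between them, so each token's bytes alone are not valid UTF-8.

# '宇' (U+5B87) occupies three bytes in UTF-8.
raw = "宇".encode("utf-8")
print(raw.hex())  # e5ae87

# A token boundary can fall inside those bytes; each fragment on its
# own is not valid UTF-8 and decodes to replacement garbage ...
print(raw[:2].decode("utf-8", errors="replace"))  # �
print(raw[2:].decode("utf-8", errors="replace"))  # �

# ... but concatenating the fragments before decoding recovers the text.
print((raw[:2] + raw[2:]).decode("utf-8"))  # 宇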

@ggerganov ggerganov changed the title Some Chinese character is not show up Token decoding issue - some characters are missing Oct 10, 2022
@wcchoi commented Oct 12, 2022

This is the Python code to decode it, from https://github.com/openai/gpt-2/blob/master/src/encoder.py.

You need the vocab.json file:

import json
import sys

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    This also avoids mapping to whitespace/control characters that the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())
rev = {v: k for k, v in vocab.items()}  # token id -> byte-alphabet string

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}  # printable char -> raw byte

def decode(tokens):
    # Join the tokens' byte-alphabet strings first; only the concatenated
    # byte sequence is valid UTF-8, so decode once at the end.
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229])) # '宇'
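
In other words, vocab.json stores each token not as text but as a string over a reversible byte-to-printable-character alphabet built by bytes_to_unicode, so a single token may hold only part of a multi-byte UTF-8 character. The decoder has to map every character of every token back to its raw byte and run the UTF-8 decode once over the concatenated bytes, never per token.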

@ggerganov (Owner)

Thank you for this!
It is clear now how it works and I will be able to fix it.

@ggerganov (Owner)

@yujinqiu @aufziehvogel
Thanks to @r0y6a3n0 this should be resolved now.
Download the model files again and give it a try.

@ChristopherFritz

I can confirm this fixes the issue for me.

Before fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん運動神��もすっごくいいし、バトンもうまいんだけど。

After fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん運動神経もすっごくいいし、バトンもうまいんだけど。

@yujinqiu (Author)

It's fixed in some cases.
