
When using -pc output in the terminal, some Chinese characters cannot be displayed normally #399

Closed
chenqianhe opened this issue Jan 11, 2023 · 11 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@chenqianhe
Contributor

chenqianhe commented Jan 11, 2023

Just like #25, when transcribing in zh (Chinese) there are still some characters missing, and the model is ggml-large.bin from Hugging Face (https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main).

Maybe the large and large-v1 models still need to be fixed?

error example:

[00:00:13.000 --> 00:00:15.000] 各种AI的应用��出不��

ground truth:

[00:00:13.000 --> 00:00:15.000] 各种AI的应用层出不穷

@ggerganov added the bug (Something isn't working) label Jan 15, 2023
@ggerganov
Owner

It's already converted using the solution in #25. Maybe there is some other problem with the encoding - not sure

@chenqianhe
Contributor Author

I'm very sorry, I may have caused some misunderstanding. Through experiments I found that using -pc output in the terminal leads to garbled characters, and I initially determined that this is related to the token text returned as a char string:

const char * text = whisper_full_get_token_text(ctx, i, j);

Like #25, a Chinese character is sometimes split into 2 tokens, and I guess this is the problem.
I modified the output statement to add spaces between tokens, and the result below may confirm my guess.

printf("%s%s %s%s", speaker.c_str(), k_colors[3].c_str(), text, "\033[0m");

[00:00:13.000 --> 00:00:15.000] 各 种 AI 的 应 用 � � 出 不 � �
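
For illustration, here is a minimal sketch of the diagnosis above (my example, not part of the original comment): the character 穷 from the ground truth is three bytes in UTF-8, and assuming those bytes are split across two tokens, the color escape codes that -pc prints between tokens land in the middle of the character.

// minimal sketch: 穷 (U+7A77) is 0xE7 0xA9 0xB7 in UTF-8
#include <cstdio>

int main() {
    const char part1[] = "\xE7\xA9"; // hypothetical first token: incomplete UTF-8 sequence
    const char part2[] = "\xB7";     // hypothetical second token: the remaining byte

    // printed back-to-back the bytes stay contiguous, so the terminal renders 穷
    printf("plain:   %s%s\n", part1, part2);

    // with -pc, escape sequences are emitted between the tokens, splitting the
    // character's bytes - most terminals then show replacement characters (�)
    printf("colored: %s\033[0m\033[38;5;196m%s\033[0m\n", part1, part2);

    return 0;
}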

Finally, I would like to apologize again for any trouble my inaccurate description may have caused.

@chenqianhe changed the title from "Token decoding issue - some characters are still missing when using large model" to "When using -pc output in the terminal, some Chinese characters cannot be displayed normally" Jan 15, 2023
@ggerganov
Owner

No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

@chenqianhe
Contributor Author

> No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

Yes, it outputs normally.

@chenqianhe
Contributor Author

> No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

Also, when I write the output to a file it is normal.
Calling whisper_full_get_segment_text(ctx, i) gives the correct output - see the sketch below.
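
For reference, here is a minimal sketch (mine, not from the thread) contrasting the two retrieval paths; the whisper.h functions are the real API, the helper function itself is illustrative only:

#include <cstdio>
#include "whisper.h"

// Print one segment both ways. The segment-level string is a complete UTF-8
// string, while each token-level piece may end in the middle of a character.
static void print_segment_both_ways(struct whisper_context * ctx, int i) {
    // segment-level: safe to print to the terminal or write to a file
    printf("%s\n", whisper_full_get_segment_text(ctx, i));

    // token-level: this is what -pc iterates over to color tokens by probability
    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        printf("%s", whisper_full_get_token_text(ctx, i, j)); // may be a partial UTF-8 sequence
    }
    printf("\n");
}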

@ggerganov added the enhancement (New feature or request) and good first issue (Good for newcomers) labels and removed the bug (Something isn't working) label Jan 15, 2023
@giannhskp

Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.

@chenqianhe
Contributor Author

> Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.

I believe this is caused by the fact that there is no fixed correspondence between words and tokens: a word can map to a varying number of tokens. I think this will also lead to errors when obtaining per-word timestamps. Unfortunately, I haven't found a good way to fix this problem yet.

@giannhskp

I do indeed intend to get word-level timestamps. The error does not occur during normal transcription (without the -ml argument).

@ggerganov Any idea what is causing this bug, or how it might be fixed?

@ggerganov
Owner

If you add the --split-on-word argument, does it fix the issue?

@BancoLin

const char * text = whisper_full_get_token_text(ctx, i, j);
...
printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), text, "\033[0m");

The issue stems from the fact that the token text may not be a valid UTF-8 string. With OpenAI's tiktoken tokenizer, a Chinese character's UTF-8 bytes can be split across multiple tokens, which leads to the problem: in that case printf("%s", text) outputs a scrambled or unintelligible string.

To resolve the issue I use the ICU library to check whether the token text is a valid UTF-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. The buffer is not printed until the bytes in it form a valid UTF-8 string.

So the quick fix steps are (Ubuntu):

  1. install the ICU library
    • sudo apt-get install libicu-dev
  2. add -licuuc to LDFLAGS in ./Makefile
    • LDFLAGS += -licuuc
  3. modify examples/main/main.cpp
// include `icu` header
#include <unicode/ustring.h>

...

// add `is_valid_utf8()` function to check whether the string is a valid UTF-8 string
int is_valid_utf8(const char *str) {
    UErrorCode error = U_ZERO_ERROR;
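    // preflight call: a NULL destination with capacity 0 only validates the input string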
    u_strFromUTF8(NULL, 0, NULL, str, -1, &error);
    return error != U_INVALID_CHAR_FOUND;
}

...

// modify `if(params.print_colors)` loop
        if (params.print_colors) {
            // temp char buffer
            char tmp[1024]; 
            tmp[0] = '\0';

            for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
                if (params.print_special == false) {
                    const whisper_token id = whisper_full_get_token_id(ctx, i, j);
                    if (id >= whisper_token_eot(ctx)) {
                        continue;
                    }
                }

                const char * text = whisper_full_get_token_text(ctx, i, j);
                const float  p    = whisper_full_get_token_p   (ctx, i, j);

                const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));

                // push to the temp char buffer
                strcat(tmp, text); 

                // check the buffer
                if( is_valid_utf8(tmp) ){
                    printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), tmp, "\033[0m");
                    tmp[0]='\0';
                }
            }
        } else {
            const char * text = whisper_full_get_segment_text(ctx, i);

            printf("%s%s", speaker.c_str(), text);
        }

before:
(screenshot, 2023-04-14 17-16-29)

after:
(screenshot, 2023-04-14 17-14-42)

One issue with the current implementation is that the color used for the temp char buffer is the same as that of the last added token.

Note: I only ran some tests on Chinese; I am not sure whether the fix is applicable to other languages as well.
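
As a possible variant (my sketch, not part of the fix above), the same buffering approach can also work without the ICU dependency by using a small hand-rolled check for complete UTF-8 sequences. It does not reject overlong encodings, but it is enough to decide when the temp buffer can be flushed:

#include <cstdint>

// Returns true when `s` consists only of complete UTF-8 characters,
// false when it ends mid-character or contains an invalid byte.
static bool is_complete_utf8(const char * s) {
    const uint8_t * p = (const uint8_t *) s;
    while (*p) {
        int n;
        if      (*p < 0x80)           n = 1; // ASCII
        else if ((*p & 0xE0) == 0xC0) n = 2; // 2-byte sequence
        else if ((*p & 0xF0) == 0xE0) n = 3; // 3-byte sequence (most CJK)
        else if ((*p & 0xF8) == 0xF0) n = 4; // 4-byte sequence
        else return false;                   // invalid lead byte
        for (int k = 1; k < n; ++k) {
            // a missing or malformed continuation byte means the buffer is not complete yet
            if ((p[k] & 0xC0) != 0x80) return false;
        }
        p += n;
    }
    return true;
}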

Ovler-Young added a commit to Ovler-Young/Whisper that referenced this issue Jun 12, 2023
bobqianic added a commit to bobqianic/whisper.cpp that referenced this issue Sep 28, 2023
bobqianic added a commit to bobqianic/whisper.cpp that referenced this issue Sep 28, 2023
@bobqianic
Collaborator

Fixed in #1313
