
When using -pc output in the terminal, some Chinese characters cannot be displayed normally #399

Closed
chenqianhe opened this issue Jan 11, 2023 · 11 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@chenqianhe
Contributor

chenqianhe commented Jan 11, 2023

Just like #25, when transcribing in zh (Chinese) there are still some characters missing, and the model is ggml-large.bin from Hugging Face (https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main).

Maybe the large and large-v1 models still need to be fixed?

error example:

[00:00:13.000 --> 00:00:15.000] 各种AI的应用��出不��

ground truth:

[00:00:13.000 --> 00:00:15.000] 各种AI的应用层出不穷

@ggerganov added the bug (Something isn't working) label Jan 15, 2023
@ggerganov
Owner

It's already converted using the solution in #25. Maybe there is some other problem with the encoding - not sure

@chenqianhe
Contributor Author

I'm very sorry, I may have caused some misunderstanding. Through experiments I found that using -pc output in the terminal leads to garbled characters, and I initially determined that this is related to the token text returned as a char string:

const char * text = whisper_full_get_token_text(ctx, i, j);

Like #25, a Chinese character is sometimes split into 2 tokens, and I guess this is the problem.
I modified the output statement to add spaces between tokens, and the result below may confirm my guess.

printf("%s%s %s%s", speaker.c_str(), k_colors[3].c_str(), text, "\033[0m");

[00:00:13.000 --> 00:00:15.000] 各 种 AI 的 应 用 � � 出 不 � �
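
For illustration, here is a minimal sketch of the diagnosis above (my example, not part of the original comment): the character 穷 from the ground truth is three bytes in UTF-8, and assuming those bytes are split across two tokens, the color escape codes that -pc prints between tokens land in the middle of the character.

// minimal sketch: 穷 (U+7A77) is 0xE7 0xA9 0xB7 in UTF-8
#include <cstdio>

int main() {
    const char part1[] = "\xE7\xA9"; // hypothetical first token: incomplete UTF-8 sequence
    const char part2[] = "\xB7";     // hypothetical second token: the remaining byte

    // printed back-to-back the bytes stay contiguous, so the terminal renders 穷
    printf("plain:   %s%s\n", part1, part2);

    // with -pc, escape sequences are emitted between the tokens, splitting the
    // character's bytes - most terminals then show replacement characters (�)
    printf("colored: %s\033[0m\033[38;5;196m%s\033[0m\n", part1, part2);

    return 0;
}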

Finally, I would like to apologize again for any trouble my inaccurate description may have caused.

@chenqianhe changed the title from "Token decoding issue - some characters are still missing when using large model" to "When using -pc output in the terminal, some Chinese characters cannot be displayed normally" Jan 15, 2023
@ggerganov
Owner

No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

@chenqianhe
Contributor Author

> No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

Yes, it outputs normally.

@chenqianhe
Contributor Author

> No problem. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?

Also, when I write the output to a file it is normal.
Calling whisper_full_get_segment_text(ctx, i) gives the correct output - see the sketch below.
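
For reference, here is a minimal sketch (mine, not from the thread) contrasting the two retrieval paths; the whisper.h functions are the real API, the helper function itself is illustrative only:

#include <cstdio>
#include "whisper.h"

// Print one segment both ways. The segment-level string is a complete UTF-8
// string, while each token-level piece may end in the middle of a character.
static void print_segment_both_ways(struct whisper_context * ctx, int i) {
    // segment-level: safe to print to the terminal or write to a file
    printf("%s\n", whisper_full_get_segment_text(ctx, i));

    // token-level: this is what -pc iterates over to color tokens by probability
    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        printf("%s", whisper_full_get_token_text(ctx, i, j)); // may be a partial UTF-8 sequence
    }
    printf("\n");
}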

@ggerganov added the enhancement (New feature or request) and good first issue (Good for newcomers) labels and removed the bug (Something isn't working) label Jan 15, 2023
@giannhskp

Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.

@chenqianhe
Contributor Author

> Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.

I believe this is caused by the fact that there is no fixed correspondence between words and tokens: a word can map to a varying number of tokens. I think this will also lead to errors when obtaining per-word timestamps. Unfortunately, I haven't found a good way to fix this problem yet.

@giannhskp

I do indeed intend to get word-level timestamps. The error does not occur during normal transcription (without the -ml argument).

@ggerganov Any idea what is causing this bug, or how it might be fixed?

@ggerganov
Owner

If you add the --split-on-word argument, does it fix the issue?

@BancoLin

const char * text = whisper_full_get_token_text(ctx, i, j);
...
printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), text, "\033[0m");

The issue stems from the fact that the token text may not be a valid UTF-8 string. With OpenAI's tiktoken tokenizer, a Chinese character's UTF-8 bytes can be split across multiple tokens, which leads to the problem: in that case printf("%s", text) outputs a scrambled or unintelligible string.

To resolve the issue I use the ICU library to check whether the token text is a valid UTF-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. The buffer is not printed until the bytes in it form a valid UTF-8 string.

So the quick fix steps are (Ubuntu):

  1. install the ICU library
    • sudo apt-get install libicu-dev
  2. add -licuuc to LDFLAGS in ./Makefile
    • LDFLAGS += -licuuc
  3. modify examples/main/main.cpp
// include `icu` header
#include <unicode/ustring.h>

...

// add `is_valid_utf8()` function to check whether the string is a valid UTF-8 string
int is_valid_utf8(const char *str) {
    UErrorCode error = U_ZERO_ERROR;
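    // preflight call: a NULL destination with capacity 0 only validates the input string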
    u_strFromUTF8(NULL, 0, NULL, str, -1, &error);
    return error != U_INVALID_CHAR_FOUND;
}

...

// modify `if(params.print_colors)` loop
        if (params.print_colors) {
            // temp char buffer
            char tmp[1024]; 
            tmp[0] = '\0';

            for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
                if (params.print_special == false) {
                    const whisper_token id = whisper_full_get_token_id(ctx, i, j);
                    if (id >= whisper_token_eot(ctx)) {
                        continue;
                    }
                }

                const char * text = whisper_full_get_token_text(ctx, i, j);
                const float  p    = whisper_full_get_token_p   (ctx, i, j);

                const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));

                // push to the temp char buffer
                strcat(tmp, text); 

                // check the buffer
                if( is_valid_utf8(tmp) ){
                    printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), tmp, "\033[0m");
                    tmp[0]='\0';
                }
            }
        } else {
            const char * text = whisper_full_get_segment_text(ctx, i);

            printf("%s%s", speaker.c_str(), text);
        }

before:
(screenshot, 2023-04-14 17-16-29)

after:
(screenshot, 2023-04-14 17-14-42)

One issue with the current implementation is that the color used for the temp char buffer is the same as that of the last added token.

Note: I only ran some tests on Chinese; I am not sure whether the fix is applicable to other languages as well.
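
As a possible variant (my sketch, not part of the fix above), the same buffering approach can also work without the ICU dependency by using a small hand-rolled check for complete UTF-8 sequences. It does not reject overlong encodings, but it is enough to decide when the temp buffer can be flushed:

#include <cstdint>

// Returns true when `s` consists only of complete UTF-8 characters,
// false when it ends mid-character or contains an invalid byte.
static bool is_complete_utf8(const char * s) {
    const uint8_t * p = (const uint8_t *) s;
    while (*p) {
        int n;
        if      (*p < 0x80)           n = 1; // ASCII
        else if ((*p & 0xE0) == 0xC0) n = 2; // 2-byte sequence
        else if ((*p & 0xF0) == 0xE0) n = 3; // 3-byte sequence (most CJK)
        else if ((*p & 0xF8) == 0xF0) n = 4; // 4-byte sequence
        else return false;                   // invalid lead byte
        for (int k = 1; k < n; ++k) {
            // a missing or malformed continuation byte means the buffer is not complete yet
            if ((p[k] & 0xC0) != 0x80) return false;
        }
        p += n;
    }
    return true;
}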

Ovler-Young added a commit to Ovler-Young/Whisper that referenced this issue Jun 12, 2023
bobqianic added a commit to bobqianic/whisper.cpp that referenced this issue Sep 28, 2023
bobqianic added a commit to bobqianic/whisper.cpp that referenced this issue Sep 28, 2023
@bobqianic
Collaborator

Fixed in #1313
