When using -pc output in the terminal, some Chinese characters cannot be displayed normally #399
Comments
It's already converted using the solution in #25. Maybe there is some other problem with the encoding - not sure.
I'm very sorry if I caused a misunderstanding. Through experiments, I found that using `-pc` output in the terminal leads to garbled characters. I initially determined that this is caused by printing each token's text on its own:

`const char * text = whisper_full_get_token_text(ctx, i, j);`

Like #25, a Chinese character is sometimes encoded as more than one byte, so I guess the problem is here:

`printf("%s%s %s%s", speaker.c_str(), k_colors[3].c_str(), text, "\033[0m");`

Finally, I would like to apologize for any trouble my wrong description may have caused.
No problem. Just to make sure I understand - when you remove the …
Yes, it outputs normally.
At the same time, when I output to a file, the text is normal.
Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using …
I believe this is caused by the fact that the correspondence between words and tokens is not one-to-one. I think this will also lead to errors when obtaining the timestamp of each word. I'm sorry, but I haven't found a good way to fix this problem.
I do indeed intend to get word-level timestamps. This error does not occur during normal transcription (without …). @ggerganov Any idea what is causing this bug, or how it may be fixed?
If you add the …
The issue stems from the possibility that the token text may end in an incomplete multi-byte UTF-8 sequence, since a multi-byte character can be split across tokens. To resolve the issue I use ICU to check whether the accumulated bytes form a valid UTF-8 string before printing them. So the quick fix steps are (Ubuntu, with the ICU development headers installed, e.g. via `libicu-dev`):
```cpp
// include the ICU header
#include <unicode/ustring.h>
...
// add an `is_valid_utf8()` function to check whether the string is valid UTF-8
int is_valid_utf8(const char *str) {
    UErrorCode error = U_ZERO_ERROR;
    u_strFromUTF8(NULL, 0, NULL, str, -1, &error);
    return error != U_INVALID_CHAR_FOUND;
}
...
// modify the `if (params.print_colors)` branch
if (params.print_colors) {
    // temporary char buffer that accumulates token bytes
    char tmp[1024];
    tmp[0] = '\0';
    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        if (params.print_special == false) {
            const whisper_token id = whisper_full_get_token_id(ctx, i, j);
            if (id >= whisper_token_eot(ctx)) {
                continue;
            }
        }
        const char * text = whisper_full_get_token_text(ctx, i, j);
        const float p    = whisper_full_get_token_p   (ctx, i, j);
        const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));
        // append the token text to the temp buffer
        strcat(tmp, text);
        // only print once the buffer holds valid UTF-8, then reset it
        if (is_valid_utf8(tmp)) {
            printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), tmp, "\033[0m");
            tmp[0] = '\0';
        }
    }
} else {
    const char * text = whisper_full_get_segment_text(ctx, i);
    printf("%s%s", speaker.c_str(), text);
}
```

One issue with the current implementation is that the whole temp char buffer is printed in the color of the last token added to it. Note: I only ran some tests on Chinese; I am not sure the fix is applicable to other languages as well.
Fixed in #1313 |
Just like #25, when transcribing in zh (Chinese), there are still some characters missing. The model is ggml-large.bin from Hugging Face (https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main).
Maybe the large and large-v1 models still need to be fixed?
Error example:
Ground truth: