
Timestamps for words instead of sentence possible? #49

Closed
moebiussurfing opened this issue Oct 13, 2022 · 16 comments
Labels: enhancement (New feature or request), question (Further information is requested)


@moebiussurfing

Do you think this could be possible in some way?

I would like to get the timestamp of each word instead of the whole sentence (bundle of words). That could be useful for some kind of karaoke lyrics generator, or for "lip syncing" text in a video clip or syncing a 3D character.

Cheers

@ArtyomZemlyak

Currently there is only one good solution for Whisper timestamps:
https://github.com/jianfch/stable-ts

But I don't know how portable it is to the C++ version.

@ggerganov ggerganov added the question Further information is requested label Oct 13, 2022
@ggerganov
Owner

The current whisper.cpp interface provides the following function:

WHISPER_API whisper_token whisper_sample_timestamp(struct whisper_context * ctx);

It picks the timestamp token with the highest probability, and based on this token you can compute a time offset for the current text token. I tried using this and observed that near the end of a transcription segment the timestamp tokens are usually accurate. However, at the start of a segment they can be very wrong.

For example, one possible strategy that I have in mind is to sample timestamp tokens in parallel with text tokens, and then for each segment, perform some fit that takes into account the start and end time of the segment together with the sampled timestamp tokens. You have to take into account that there could be outliers due to the observation above. Probably, the text length could also be added as a parameter to the fit.

Overall, I think it is not obvious how to make a robust algorithm for word-level timestamps.
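As a rough illustration of the token-to-time conversion mentioned above, assuming Whisper's conventional 20 ms granularity per timestamp token (the helper below is hypothetical, not part of the whisper.cpp API):

```cpp
#include <cassert>

// Hypothetical helper (not part of the whisper.cpp API): convert a sampled
// timestamp token to seconds, assuming Whisper's conventional 20 ms step
// per timestamp token, where `token_beg` is the id of the first timestamp
// token (<|0.00|>).
double timestamp_token_to_seconds(int token, int token_beg) {
    assert(token >= token_beg); // text tokens carry no time information
    return 0.02 * (token - token_beg);
}
```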

@ArtyomZemlyak How is the performance of stable-ts? Does it work really well, or does it still have room for improvement?

@ArtyomZemlyak

I haven't tested the most recent updates, but even the older version had some good timestamps. These timestamps can be used to find roughly where a particular word is located.
But it is worth keeping in mind that the model itself only predicts the start time for each token. So getting both the start and end time of a word is a rather tricky task. Perhaps the author of the repository has solved this in the latest updates, but that needs to be tested.
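One common workaround for the start-time-only issue described above is to take each word's end to be the next word's start, closing the last word with the segment's end time. A minimal sketch with hypothetical types (not code from stable-ts or whisper.cpp):

```cpp
#include <string>
#include <utility>
#include <vector>

struct Word { std::string text; double t0, t1; };

// Assign end times to words that only carry a predicted start time:
// each word ends where the next one starts; the final word ends at the
// segment boundary. Hypothetical post-processing, for illustration only.
std::vector<Word> assign_end_times(
        const std::vector<std::pair<std::string, double>> & starts,
        double segment_end) {
    std::vector<Word> out;
    for (size_t i = 0; i < starts.size(); ++i) {
        const double t1 = (i + 1 < starts.size()) ? starts[i + 1].second
                                                  : segment_end;
        out.push_back({starts[i].first, starts[i].second, t1});
    }
    return out;
}
```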

@ggerganov ggerganov added the enhancement New feature or request label Oct 30, 2022
ggerganov added a commit that referenced this issue Nov 2, 2022
This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
@ggerganov
Owner

ggerganov commented Nov 2, 2022

This is now possible. Simply use the --max-len 1 parameter to set the maximum line length to 1 character:

$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1

whisper_model_load: loading model from '../models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:00.320]  
[00:00:00.320 --> 00:00:00.370]   And
[00:00:00.370 --> 00:00:00.690]   so
[00:00:00.690 --> 00:00:00.850]   my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:02.850]   Americans
[00:00:02.850 --> 00:00:03.300]  ,
[00:00:03.300 --> 00:00:04.140]   ask
[00:00:04.140 --> 00:00:04.990]   not
[00:00:04.990 --> 00:00:05.410]   what
[00:00:05.410 --> 00:00:05.660]   your
[00:00:05.660 --> 00:00:06.260]   country
[00:00:06.260 --> 00:00:06.600]   can
[00:00:06.600 --> 00:00:06.840]   do
[00:00:06.840 --> 00:00:07.010]   for
[00:00:07.010 --> 00:00:08.170]   you
[00:00:08.170 --> 00:00:08.190]  ,
[00:00:08.190 --> 00:00:08.430]   ask
[00:00:08.430 --> 00:00:08.910]   what
[00:00:08.910 --> 00:00:09.040]   you
[00:00:09.040 --> 00:00:09.320]   can
[00:00:09.320 --> 00:00:09.440]   do
[00:00:09.440 --> 00:00:09.760]   for
[00:00:09.760 --> 00:00:10.020]   your
[00:00:10.020 --> 00:00:10.510]   country
[00:00:10.510 --> 00:00:11.000]  .


whisper_print_timings:     load time =   142.70 ms
whisper_print_timings:      mel time =    24.66 ms
whisper_print_timings:   sample time =     3.81 ms
whisper_print_timings:   encode time =   330.40 ms / 55.07 ms per layer
whisper_print_timings:   decode time =    84.88 ms / 14.15 ms per layer
whisper_print_timings:    total time =   595.52 ms

This also works with subtitle outputs - SRT, VTT, etc.
You can change the line length with --max-len N, where N is the maximum length in number of characters.
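For scripting on top of this, each line of the word-level output shown above can be parsed back into times and text. A minimal sketch, assuming the exact bracketed format printed by main (`parse_word_line` is a hypothetical helper, not part of whisper.cpp):

```cpp
#include <cstdio>
#include <string>

// Parse one line of the word-level output, e.g.
// "[00:00:00.320 --> 00:00:00.370]   And".
// On success, fills the start/end times in milliseconds and the word.
bool parse_word_line(const std::string & line,
                     int & t0_ms, int & t1_ms, std::string & word) {
    int h0, m0, s0, ms0, h1, m1, s1, ms1, n = 0;
    if (sscanf(line.c_str(), "[%d:%d:%d.%d --> %d:%d:%d.%d]%n",
               &h0, &m0, &s0, &ms0, &h1, &m1, &s1, &ms1, &n) != 8) {
        return false;
    }
    t0_ms = ((h0*60 + m0)*60 + s0)*1000 + ms0;
    t1_ms = ((h1*60 + m1)*60 + s1)*1000 + ms1;
    word  = line.substr(n);
    // trim the leading spaces left over from the column alignment
    word.erase(0, word.find_first_not_of(' '));
    return true;
}
```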

@moebiussurfing
Author

Cool, thanks!

@moebiussurfing
Author

moebiussurfing commented Nov 3, 2022

Hey,
I am getting an assertion error:

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1
 ...

Assertion failed: rc == 0, file ggml.c, line 6840

EDIT:
It happens for the simpler command too:

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -f HUXLEY2_16K.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'HUXLEY2_16K.wav' (27223040 samples, 1701.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps =
 1 ...

Assertion failed: rc == 0, file ggml.c, line 6840

@ggerganov
Owner

This error means that the program wasn't able to create a thread, which makes me think something is wrong in your environment. Try rebuilding (i.e. make clean + make) or running on another machine. Another explanation is that the Windows build is broken again (I see you are using MinGW) - I am not able to test this platform, so it is possible you are observing a regression on Windows.

@moebiussurfing
Author

This error means that the program wasn't able to create a thread, which makes me think something is wrong in your environment. Try rebuilding (i.e. make clean + make) or running on another machine. Another explanation is that the Windows build is broken again (I see you are using MinGW) - I am not able to test this platform, so it is possible you are observing a regression on Windows.

OK, I'll try again.
I already did make clean + make.
It's the same environment, I think, but maybe I opened the wrong console.

@moebiussurfing
Author

moebiussurfing commented Nov 5, 2022

Hey,
not related to this topic anymore, as the error comes up even on a simple ./main.exe call ...

But I re-checked whether I was using the same command prompt that worked before (regular command prompt, PowerShell, MSYS 32/64-bit, etc.), and (I think) I found that something has changed/broken. Some days ago it compiled fine. I think I was using MSYS2 MinGW x64.

Which environment should I try, or which do you recommend on Windows?

Thanks!

@ggerganov
Owner

Do you have a linux environment? Alternatively, try running with -t 1 - this will use only one thread and hopefully it won't hit that error. It will be slow, but at least you will be able to give it a try.

I will close this issue - if the Windows problems persist, feel free to open another issue and hopefully somebody will be able to help.

@Jeronymous

The following project can recover word timestamps from a Whisper transcription, using Python code:
https://github.com/Jeronymous/whisper-timestamped

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023

This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023

This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
@rben01

rben01 commented Nov 14, 2023

Is it possible to join subsequent punctuation with the preceding word? Here is how openai/whisper's version of this works:

  --word_timestamps WORD_TIMESTAMPS
                        (experimental) extract word-level timestamps and refine the results based on them
                        (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append_punctuations APPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the previous word
                        (default: "'.。,,!!??::”)]}、)
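The append-punctuation merging described in that help text could be sketched roughly like this (a hypothetical C++ helper over ASCII punctuation only; openai/whisper's actual implementation is in Python and handles Unicode):

```cpp
#include <string>
#include <vector>

struct Seg { std::string text; int t0, t1; };

// Merge punctuation-only segments into the previous word, extending that
// word's end time - the idea behind --append_punctuations, simplified to
// a single-byte character set for illustration.
std::vector<Seg> merge_append_punct(const std::vector<Seg> & in,
                                    const std::string & append = ".,!?;:") {
    std::vector<Seg> out;
    for (const auto & s : in) {
        const bool punct_only = !s.text.empty() &&
            s.text.find_first_not_of(append) == std::string::npos;
        if (punct_only && !out.empty()) {
            out.back().text += s.text; // attach "," to "now", etc.
            out.back().t1    = s.t1;
        } else {
            out.push_back(s);
        }
    }
    return out;
}
```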

@canufarm

--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on them
(default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word
(default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous word
(default: "'.。,,!!??::”)]}、)

Short question: how do these parameters work? If I use
... --word_timestamps True --append_punctuations "'.,;:- --...
in my command line, I get an error saying that I may only use one character. Thanks.

@rben01

rben01 commented Feb 17, 2024

You'll need to escape most of that punctuation for your shell, or surround it in quotes (and then escape the quotes inside).

@canufarm

Thank you, @rben01. It seems to work with
--append_punctuations "-\"\'.,;:"

I also tried
--append_punctuations "\-\"\'.,;:" (with "\" before "-")

But I still can't get it to work with "-". If there is a word like Bla-Blubb with a dash, then with a line break the result should be

Bla Bla Bla Bla-
Blubb Blubb

But it is always this way:

Bla Bla Bla Bla
-Blubb Blubb

How can I get the dash (in German text) at the end of the word instead of at the start of the new line?

@ballerburg9005

whisper.cpp -f in.wav -osrt

[00:00:00.000 --> 00:00:10.480]   Okay, got power now, you're fucked. Nice, fuck. *laughs* *laughs* *groans* man.

whisper.cpp -f in.wav -osrt --max-len 1

[00:00:00.000 --> 00:00:00.010]  
[00:00:00.010 --> 00:00:00.530]   Okay
[00:00:00.530 --> 00:00:00.810]  ,
[00:00:00.810 --> 00:00:01.370]   got
[00:00:01.370 --> 00:00:01.950]   power
[00:00:01.950 --> 00:00:02.300]   now
[00:00:02.300 --> 00:00:02.560]  ,
[00:00:02.560 --> 00:00:02.990]   you
[00:00:02.990 --> 00:00:03.360]  're
[00:00:03.360 --> 00:00:04.380]   fucked
[00:00:04.380 --> 00:00:04.570]  .
[00:00:04.570 --> 00:00:05.110]   Nice
[00:00:05.110 --> 00:00:05.430]  ,
[00:00:05.430 --> 00:00:05.920]   fuck
[00:00:05.920 --> 00:00:06.320]  .
[00:00:06.320 --> 00:00:06.450]   *
[00:00:06.450 --> 00:00:07.310]  laughs
[00:00:07.310 --> 00:00:07.390]  *
[00:00:07.390 --> 00:00:07.520]   *
[00:00:07.520 --> 00:00:08.250]  laughs
[00:00:08.250 --> 00:00:08.460]  *
[00:00:08.460 --> 00:00:08.690]   *
[00:00:08.690 --> 00:00:09.390]  gro
[00:00:09.390 --> 00:00:09.420]  ans
[00:00:09.420 --> 00:00:09.520]  *
[00:00:09.520 --> 00:00:09.910]   man
[00:00:09.910 --> 00:00:10.480]  .

whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word
whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word --dtw base.en

(both give the same result)

[00:00:00.000 --> 00:00:00.010]  
[00:00:00.010 --> 00:00:00.810]   Okay,
[00:00:00.810 --> 00:00:01.370]   got
[00:00:01.370 --> 00:00:01.950]   power
[00:00:01.950 --> 00:00:02.560]   now,
[00:00:02.560 --> 00:00:03.360]   you're
[00:00:03.360 --> 00:00:04.570]   fucked.
[00:00:04.570 --> 00:00:05.430]   Nice,
[00:00:05.430 --> 00:00:06.320]   fuck.
[00:00:06.320 --> 00:00:07.390]   *laughs*
[00:00:07.390 --> 00:00:08.460]   *laughs*
[00:00:08.460 --> 00:00:09.520]   *groans*
[00:00:09.520 --> 00:00:10.480]   man.

whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word true --dtw large.v3 --model models/ggml-large-v3.bin
whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word true --model models/ggml-large-v3.bin

(again identical with the large v3 model)

[00:00:00.000 --> 00:00:00.030]  
[00:00:00.030 --> 00:00:00.670]   Okay,
[00:00:00.670 --> 00:00:00.780]   I
[00:00:00.780 --> 00:00:01.150]   got
[00:00:01.150 --> 00:00:01.700]   power
[00:00:01.700 --> 00:00:02.240]   now,
[00:00:02.240 --> 00:00:02.930]   you're
[00:00:02.930 --> 00:00:04.410]   fucked.
[00:00:04.410 --> 00:00:07.140]   Nice.
[00:00:07.140 --> 00:00:07.920]   Oh
[00:00:07.920 --> 00:00:10.560]   man.
