
Timestamps for words instead of sentence possible? #49

Closed
moebiussurfing opened this issue Oct 13, 2022 · 16 comments
Labels: enhancement (New feature or request), question (Further information is requested)


@moebiussurfing

Do you think this could be possible in some way?

I would like to get the timestamp of each word instead of the whole sentence (bundle of words). That could be useful for some kind of karaoke lyrics generator, or for "lip syncing" text in a video clip or syncing a 3D character.

Cheers

@ArtyomZemlyak

Currently there is only one good solution for Whisper timestamps:
https://github.com/jianfch/stable-ts

But I don't know how portable it is to the C++ version.

@ggerganov ggerganov added the question Further information is requested label Oct 13, 2022
@ggerganov
Owner

The current whisper.cpp interface provides the following function:

WHISPER_API whisper_token whisper_sample_timestamp(struct whisper_context * ctx);

It picks the timestamp token with the highest probability, and based on this token you can compute a time offset for the current text token. I tried using this and observed that near the end of a transcription segment the timestamp tokens are usually accurate. However, at the start of a segment they can be very wrong.

For example, one possible strategy that I have in mind is to sample timestamp tokens in parallel with text tokens, and then for each segment, perform some fit that takes into account the start and end time of the segment together with the sampled timestamp tokens. You have to take into account that there could be outliers due to the observation above. Probably, the text length could also be added as a parameter to the fit.

Overall, I think it is not obvious how to make a robust algorithm for word-level timestamps.
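As a rough illustration of the token-to-time conversion mentioned above, assuming Whisper's conventional 20 ms granularity per timestamp token (the helper below is hypothetical, not part of the whisper.cpp API):

```cpp
#include <cassert>

// Hypothetical helper (not part of the whisper.cpp API): convert a sampled
// timestamp token to seconds, assuming Whisper's conventional 20 ms step
// per timestamp token, where `token_beg` is the id of the first timestamp
// token (<|0.00|>).
double timestamp_token_to_seconds(int token, int token_beg) {
    assert(token >= token_beg); // text tokens carry no time information
    return 0.02 * (token - token_beg);
}
```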

@ArtyomZemlyak How is the performance of stable-ts? Does it work really well, or does it still have room for improvement?

@ArtyomZemlyak

I haven't tested the most recent updates, but even the older version had some good timestamps. These timestamps can be used to find roughly where a particular word is located.
But it is worth keeping in mind that the model itself only predicts the start time for each token. So getting both the start and end time of a word is a rather tricky task. Perhaps the author of the repository has solved this in the latest updates, but that needs to be tested.
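One common workaround for the start-time-only issue described above is to take each word's end to be the next word's start, closing the last word with the segment's end time. A minimal sketch with hypothetical types (not code from stable-ts or whisper.cpp):

```cpp
#include <string>
#include <utility>
#include <vector>

struct Word { std::string text; double t0, t1; };

// Assign end times to words that only carry a predicted start time:
// each word ends where the next one starts; the final word ends at the
// segment boundary. Hypothetical post-processing, for illustration only.
std::vector<Word> assign_end_times(
        const std::vector<std::pair<std::string, double>> & starts,
        double segment_end) {
    std::vector<Word> out;
    for (size_t i = 0; i < starts.size(); ++i) {
        const double t1 = (i + 1 < starts.size()) ? starts[i + 1].second
                                                  : segment_end;
        out.push_back({starts[i].first, starts[i].second, t1});
    }
    return out;
}
```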

@ggerganov ggerganov added the enhancement New feature or request label Oct 30, 2022
ggerganov added a commit that referenced this issue Nov 2, 2022
This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
@ggerganov
Owner

ggerganov commented Nov 2, 2022

This is now possible. Simply use the --max-len 1 parameter to set the maximum line length to 1 character:

$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1

whisper_model_load: loading model from '../models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:00.320]  
[00:00:00.320 --> 00:00:00.370]   And
[00:00:00.370 --> 00:00:00.690]   so
[00:00:00.690 --> 00:00:00.850]   my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:02.850]   Americans
[00:00:02.850 --> 00:00:03.300]  ,
[00:00:03.300 --> 00:00:04.140]   ask
[00:00:04.140 --> 00:00:04.990]   not
[00:00:04.990 --> 00:00:05.410]   what
[00:00:05.410 --> 00:00:05.660]   your
[00:00:05.660 --> 00:00:06.260]   country
[00:00:06.260 --> 00:00:06.600]   can
[00:00:06.600 --> 00:00:06.840]   do
[00:00:06.840 --> 00:00:07.010]   for
[00:00:07.010 --> 00:00:08.170]   you
[00:00:08.170 --> 00:00:08.190]  ,
[00:00:08.190 --> 00:00:08.430]   ask
[00:00:08.430 --> 00:00:08.910]   what
[00:00:08.910 --> 00:00:09.040]   you
[00:00:09.040 --> 00:00:09.320]   can
[00:00:09.320 --> 00:00:09.440]   do
[00:00:09.440 --> 00:00:09.760]   for
[00:00:09.760 --> 00:00:10.020]   your
[00:00:10.020 --> 00:00:10.510]   country
[00:00:10.510 --> 00:00:11.000]  .


whisper_print_timings:     load time =   142.70 ms
whisper_print_timings:      mel time =    24.66 ms
whisper_print_timings:   sample time =     3.81 ms
whisper_print_timings:   encode time =   330.40 ms / 55.07 ms per layer
whisper_print_timings:   decode time =    84.88 ms / 14.15 ms per layer
whisper_print_timings:    total time =   595.52 ms

This also works with subtitle outputs - SRT, VTT, etc.
You can change the line length with --max-len N, where N is the maximum length in number of characters.
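For scripting on top of this, each line of the word-level output shown above can be parsed back into times and text. A minimal sketch, assuming the exact bracketed format printed by main (`parse_word_line` is a hypothetical helper, not part of whisper.cpp):

```cpp
#include <cstdio>
#include <string>

// Parse one line of the word-level output, e.g.
// "[00:00:00.320 --> 00:00:00.370]   And".
// On success, fills the start/end times in milliseconds and the word.
bool parse_word_line(const std::string & line,
                     int & t0_ms, int & t1_ms, std::string & word) {
    int h0, m0, s0, ms0, h1, m1, s1, ms1, n = 0;
    if (sscanf(line.c_str(), "[%d:%d:%d.%d --> %d:%d:%d.%d]%n",
               &h0, &m0, &s0, &ms0, &h1, &m1, &s1, &ms1, &n) != 8) {
        return false;
    }
    t0_ms = ((h0*60 + m0)*60 + s0)*1000 + ms0;
    t1_ms = ((h1*60 + m1)*60 + s1)*1000 + ms1;
    word  = line.substr(n);
    // trim the leading spaces left over from the column alignment
    word.erase(0, word.find_first_not_of(' '));
    return true;
}
```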

@moebiussurfing
Author

Cool, thanks!

@moebiussurfing
Author

moebiussurfing commented Nov 3, 2022

Hey,
I am getting an assertion error:

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1
 ...

Assertion failed: rc == 0, file ggml.c, line 6840

EDIT:
It happens for the simpler command too:

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -f HUXLEY2_16K.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'HUXLEY2_16K.wav' (27223040 samples, 1701.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps =
 1 ...

Assertion failed: rc == 0, file ggml.c, line 6840

@ggerganov
Owner

This error means that the program wasn't able to create a thread, which makes me think something is wrong in your environment. Try rebuilding (i.e. make clean + make) or running on another machine. Another explanation is that the Windows build is broken again (I see you are using MinGW) - I am not able to test this platform, so it is possible you are observing a regression on Windows.

@moebiussurfing
Author

This error means that the program wasn't able to create a thread, which makes me think something is wrong in your environment. Try rebuilding (i.e. make clean + make) or running on another machine. Another explanation is that the Windows build is broken again (I see you are using MinGW) - I am not able to test this platform, so it is possible you are observing a regression on Windows.

OK, I'll try again.
I already did make clean + make.
It's the same environment, I think, but maybe I opened the wrong console.

@moebiussurfing
Author

moebiussurfing commented Nov 5, 2022

Hey,
not related to this topic anymore, as the error comes up even on a simple ./main.exe call ...

But I re-checked whether I was using the same command prompt that worked before (regular command prompt, PowerShell, MSYS 32/64-bit, etc.), and (I think) I found that something has changed/broken. Some days ago it compiled fine. I think I was using MSYS2 MinGW x64.

Which environment should I try, or which do you recommend on Windows?

Thanks!

@ggerganov
Owner

Do you have a linux environment? Alternatively, try running with -t 1 - this will use only one thread and hopefully it won't hit that error. It will be slow, but at least you will be able to give it a try.

I will close this issue - if the Windows problems persist, feel free to open another issue and hopefully somebody will be able to help.

@Jeronymous

The following project can recover word timestamps from a Whisper transcription, using Python code:
https://github.com/Jeronymous/whisper-timestamped

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023

This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023

This turned out pretty good overall. The algorithm has been moved from
main.cpp to whisper.cpp and can be reused for all subtitles types. This
means that now you can specify the maximum length of the generated
lines. Simply provide the "-ml" argument specifying the max length in
number of characters
@rben01

rben01 commented Nov 14, 2023

Is it possible to join subsequent punctuation with the preceding word? Here is how openai/whisper's version of this works:

  --word_timestamps WORD_TIMESTAMPS
                        (experimental) extract word-level timestamps and refine the results based on them
                        (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append_punctuations APPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the previous word
                        (default: "'.。,,!!??::”)]}、)
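The append-punctuation merging described in that help text could be sketched roughly like this (a hypothetical C++ helper over ASCII punctuation only; openai/whisper's actual implementation is in Python and handles Unicode):

```cpp
#include <string>
#include <vector>

struct Seg { std::string text; int t0, t1; };

// Merge punctuation-only segments into the previous word, extending that
// word's end time - the idea behind --append_punctuations, simplified to
// a single-byte character set for illustration.
std::vector<Seg> merge_append_punct(const std::vector<Seg> & in,
                                    const std::string & append = ".,!?;:") {
    std::vector<Seg> out;
    for (const auto & s : in) {
        const bool punct_only = !s.text.empty() &&
            s.text.find_first_not_of(append) == std::string::npos;
        if (punct_only && !out.empty()) {
            out.back().text += s.text; // attach "," to "now", etc.
            out.back().t1    = s.t1;
        } else {
            out.push_back(s);
        }
    }
    return out;
}
```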

@canufarm

--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on them
(default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word
(default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous word
(default: "'.。,,!!??::”)]}、)

Short question: how do these parameters work? If I use
... --word_timestamps True --append_punctuations "'.,;:- --...
in my command line, I get an error saying that I may only use one character. Thanks.

@rben01

rben01 commented Feb 17, 2024

You'll need to escape most of that punctuation for your shell, or surround it in quotes (and then escape the quotes inside).

@canufarm

Thank you, @rben01. It seems to work with
--append_punctuations "-\"\'.,;:"

I also tried
--append_punctuations "\-\"\'.,;:" (with "\" before "-")

But I still can't get it to work with "-". If there is a word like Bla-Blubb with a dash, then with a line break the result should be

Bla Bla Bla Bla-
Blubb Blubb

But it is always this way:

Bla Bla Bla Bla
-Blubb Blubb

How can I get the dash (in German text) at the end of the word instead of at the start of the new line?

@ballerburg9005

whisper.cpp -f in.wav -osrt

[00:00:00.000 --> 00:00:10.480]   Okay, got power now, you're fucked. Nice, fuck. *laughs* *laughs* *groans* man.

whisper.cpp -f in.wav -osrt --max-len 1

[00:00:00.000 --> 00:00:00.010]  
[00:00:00.010 --> 00:00:00.530]   Okay
[00:00:00.530 --> 00:00:00.810]  ,
[00:00:00.810 --> 00:00:01.370]   got
[00:00:01.370 --> 00:00:01.950]   power
[00:00:01.950 --> 00:00:02.300]   now
[00:00:02.300 --> 00:00:02.560]  ,
[00:00:02.560 --> 00:00:02.990]   you
[00:00:02.990 --> 00:00:03.360]  're
[00:00:03.360 --> 00:00:04.380]   fucked
[00:00:04.380 --> 00:00:04.570]  .
[00:00:04.570 --> 00:00:05.110]   Nice
[00:00:05.110 --> 00:00:05.430]  ,
[00:00:05.430 --> 00:00:05.920]   fuck
[00:00:05.920 --> 00:00:06.320]  .
[00:00:06.320 --> 00:00:06.450]   *
[00:00:06.450 --> 00:00:07.310]  laughs
[00:00:07.310 --> 00:00:07.390]  *
[00:00:07.390 --> 00:00:07.520]   *
[00:00:07.520 --> 00:00:08.250]  laughs
[00:00:08.250 --> 00:00:08.460]  *
[00:00:08.460 --> 00:00:08.690]   *
[00:00:08.690 --> 00:00:09.390]  gro
[00:00:09.390 --> 00:00:09.420]  ans
[00:00:09.420 --> 00:00:09.520]  *
[00:00:09.520 --> 00:00:09.910]   man
[00:00:09.910 --> 00:00:10.480]  .

whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word
whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word --dtw base.en

(both give the same result)

[00:00:00.000 --> 00:00:00.010]  
[00:00:00.010 --> 00:00:00.810]   Okay,
[00:00:00.810 --> 00:00:01.370]   got
[00:00:01.370 --> 00:00:01.950]   power
[00:00:01.950 --> 00:00:02.560]   now,
[00:00:02.560 --> 00:00:03.360]   you're
[00:00:03.360 --> 00:00:04.570]   fucked.
[00:00:04.570 --> 00:00:05.430]   Nice,
[00:00:05.430 --> 00:00:06.320]   fuck.
[00:00:06.320 --> 00:00:07.390]   *laughs*
[00:00:07.390 --> 00:00:08.460]   *laughs*
[00:00:08.460 --> 00:00:09.520]   *groans*
[00:00:09.520 --> 00:00:10.480]   man.

whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word true --dtw large.v3 --model models/ggml-large-v3.bin
whisper.cpp -f in.wav -osrt --max-len 1 --split-on-word true --model models/ggml-large-v3.bin

(again identical with the large v3 model)

[00:00:00.000 --> 00:00:00.030]  
[00:00:00.030 --> 00:00:00.670]   Okay,
[00:00:00.670 --> 00:00:00.780]   I
[00:00:00.780 --> 00:00:01.150]   got
[00:00:01.150 --> 00:00:01.700]   power
[00:00:01.700 --> 00:00:02.240]   now,
[00:00:02.240 --> 00:00:02.930]   you're
[00:00:02.930 --> 00:00:04.410]   fucked.
[00:00:04.410 --> 00:00:07.140]   Nice.
[00:00:07.140 --> 00:00:07.920]   Oh
[00:00:07.920 --> 00:00:10.560]   man.
