Timestamps for words instead of sentence possible? #49
Currently there exists only one good solution for word timestamps with Whisper: But I don't know how portable it is to the C++ version.
The relevant code is at line 123 in commit b2f1600.
It picks the timestamp token with the highest probability, and based on this token you can compute a time offset for the current text token. I tried using this and observed that near the end of a transcription segment the timestamp tokens are usually accurate. However, at the start of a segment they can be very wrong.

For example, one possible strategy that I have in mind is to sample timestamp tokens in parallel with text tokens, and then, for each segment, perform some fit that takes into account the start and end time of the segment together with the sampled timestamp tokens. You have to account for possible outliers due to the observation above. The text length could probably also be added as a parameter to the fit. Overall, I think it is not obvious how to make a robust algorithm for word-level timestamps.

@ArtyomZemlyak How is the performance of
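The fitting idea above can be sketched as follows (my own illustration in Python, not code from this repository): given per-token sampled timestamps that may contain outliers, do a least-squares fit of time against token index, drop the points with the largest residuals, refit, and clamp the predicted times to the segment boundaries.

```python
def fit_token_times(sampled, seg_start, seg_end, drop_frac=0.25):
    """Assign a time to each token from noisy sampled timestamps.

    sampled   : list of (token_index, sampled_time) pairs (may contain outliers)
    seg_start : known segment start time in seconds
    seg_end   : known segment end time in seconds
    """
    n = max(i for i, _ in sampled) + 1

    def linfit(points):
        # ordinary least squares: time = a * index + b
        m = len(points)
        sx = sum(i for i, _ in points)
        sy = sum(t for _, t in points)
        sxx = sum(i * i for i, _ in points)
        sxy = sum(i * t for i, t in points)
        denom = m * sxx - sx * sx
        a = (m * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / m
        return a, b

    a, b = linfit(sampled)
    # drop the points with the largest residuals (likely outliers), then refit
    by_resid = sorted(sampled, key=lambda p: abs(p[1] - (a * p[0] + b)))
    keep = by_resid[: max(2, int(len(by_resid) * (1 - drop_frac)))]
    a, b = linfit(keep)

    # predict a time for every token index and clamp it into the segment
    return [min(seg_end, max(seg_start, a * i + b)) for i in range(n)]
```

A real implementation would also enforce monotonicity of the resulting times; this sketch only gets it for free when the fitted slope is non-negative.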
I haven't tested the most recent updates, but even the older version had some good timestamps. These timestamps can be used to find roughly where a particular word is located.
This turned out pretty good overall. The algorithm has been moved from main.cpp to whisper.cpp and can be reused for all subtitle types. This means that you can now specify the maximum length of the generated lines. Simply provide the "-ml" argument specifying the max length in number of characters.
This is now possible. Simply use the
This also works with subtitles - SRT, VTT, etc.
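As a rough illustration of the max-length splitting idea (my own Python sketch, not the actual whisper.cpp implementation), greedily packing words into lines of at most max_len characters could look like:

```python
def split_lines(tokens, max_len):
    """Greedily pack word tokens into lines of at most max_len characters.

    tokens  : list of word strings (as produced token-by-token by the decoder)
    max_len : maximum number of characters per subtitle line
    Note: a single word longer than max_len still gets its own (overlong) line.
    """
    lines, current = [], ""
    for tok in tokens:
        candidate = (current + " " + tok) if current else tok
        if current and len(candidate) > max_len:
            # current line is full; start a new one with this word
            lines.append(current)
            current = tok
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

For example, `split_lines("ask not what your country can do for you".split(), 10)` yields lines of at most 10 characters each.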
Cool, thanks!
Hey,

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1
...
Assertion failed: rc == 0, file ggml.c, line 6840
EDIT:

moebiussurfing@surfingMachine MINGW64 /d/_CODE/_AI/whisper.cpp
$ ./main -f HUXLEY2_16K.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 24 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'HUXLEY2_16K.wav' (27223040 samples, 1701.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1
...
Assertion failed: rc == 0, file ggml.c, line 6840
This error means that the program wasn't able to create a thread, which makes me think there is something wrong in your environment. Try to rebuild (i.e.
OK, I'll try again.
Hey, I re-checked and I was using the same command prompt that worked before. What environment should I try? Thanks!
Do you have a Linux environment? Alternatively, try running with

I will close this issue. If the Windows problems persist, feel free to open another issue and hopefully somebody will be able to help.
The following code makes it possible to recover word timestamps from a Whisper transcription, using Python:
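(The original snippet was not preserved in this thread. Purely as an illustration of one common approach — my own sketch, with assumed constants following Whisper's convention that token ids at or above a `timestamp_begin` id encode time in 0.02 s steps — recovering chunk-level times from a decoded token stream might look like the following. True per-word times would still need a fit like the one discussed earlier in this thread.)

```python
TIMESTAMP_BEGIN = 50364  # assumed: first timestamp token id in the vocabulary
TIME_PRECISION = 0.02    # assumed: each timestamp step is 20 ms

def chunk_times(tokens, decode):
    """Pair decoded text chunks with the time of the preceding timestamp token.

    tokens : list of token ids (text tokens interleaved with timestamp tokens)
    decode : function mapping a list of text token ids to a string
    Returns a list of (text, start_time_seconds) tuples.
    """
    results, current_time, text_ids = [], 0.0, []

    def flush():
        # emit the accumulated text tokens with the last seen timestamp
        if text_ids:
            results.append((decode(text_ids).strip(), current_time))
            text_ids.clear()

    for tok in tokens:
        if tok >= TIMESTAMP_BEGIN:
            flush()
            current_time = (tok - TIMESTAMP_BEGIN) * TIME_PRECISION
        else:
            text_ids.append(tok)
    flush()
    return results
```

Here `decode` stands in for whatever tokenizer the caller uses; the token ids in any usage example below are hypothetical.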
Is it possible to join subsequent punctuation with the preceding word? Here is how openai/whisper’s version of this works:
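(For illustration only — my own sketch, not openai/whisper's actual implementation — merging punctuation-only entries into the preceding word could look like this. The default punctuation set is an assumption.)

```python
def merge_punctuation(words, punctuation="\"'.。,，!！?？:：”)]}、"):
    """Append punctuation-only entries to the preceding word.

    words : list of dicts like {"word": str, "start": float, "end": float}
    Returns a new list where entries consisting solely of punctuation are
    merged into the previous entry, extending its end time.
    """
    merged = []
    for w in words:
        text = w["word"].strip()
        if merged and text and all(c in punctuation for c in text):
            prev = merged[-1]
            merged[-1] = {
                "word": prev["word"] + text,
                "start": prev["start"],
                "end": w["end"],
            }
        else:
            merged.append(dict(w))
    return merged
```

A leading punctuation-only entry (with nothing before it to attach to) is kept as its own entry.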
Short question: how do these parameters work? If I use
You'll need to escape most of that punctuation for your shell, or surround it in quotes (and then escape any quotes inside).
Thank you, @rben01. It seems to work with

I also tried

But I still can't get it to work with "-". If there is a word like
But it is always this way:
How can I get this dash (in German) to stay at the end of the word instead of starting the new line?
(both same result)
(again identical with the large v3 model)
Do you think that could be possible in some way?
I would like to get the timestamp of each word instead of the sentence (bundle of words).
That could be useful for some kind of karaoke lyrics generator,
or just for "lip-syncing" text in a kind of video clip or 3D character synchronization.
Cheers