
Generated subtitles are too long #52

Closed
R4ZZ3 opened this issue Oct 15, 2022 · 8 comments
Labels
question Further information is requested

Comments

@R4ZZ3

R4ZZ3 commented Oct 15, 2022

Hi and first of all thanks for creating this implementation.

I am trying to create a new version of this Space: https://huggingface.co/spaces/Finnish-NLP/Whisper-ASR-youtube-subtitles

I now have this working locally, but I have noticed that the generated subtitles are way too long.
The cpp implementation is on the right; the original PyTorch implementation (small model) is on the left.
Is this a "feature" of this implementation, or what is going on here?
(screenshot: side-by-side comparison of generated subtitles)

@ggerganov added the "question" (Further information is requested) label Oct 16, 2022
@ggerganov
Owner

There isn't any logic that I added to determine the length of the segments. They are whatever the model produces.

I have a few guesses about what could be causing the difference you observe:

  • Currently whisper.cpp uses the greedy strategy for sampling tokens. This is the simplest strategy: always pick the most likely token after inference. You can try disabling the beam search strategy in the PyTorch implementation and see if that also produces longer segments. If it does, it means we need to implement beam search in whisper.cpp to resolve this.

  • Another possible explanation is that my computation of the Mel spectrogram is slightly different from the PyTorch implementation's. I've tried running both on the same audio sample and the results are not identical. It's possible that this difference propagates through the network and leads to the differences you observe. This could be resolved by matching the two implementations.

  • The third explanation is that whisper.cpp simply does not produce results identical to the original implementation. It is just not possible (or very hard) to achieve exactly the same results when working with floating-point numbers, so the rounding errors accumulated during the computation can eventually lead to a different result.
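To illustrate the first bullet: greedy sampling just takes the argmax over the logits at each step, with no look-ahead. A minimal sketch (the function name and values are hypothetical, not taken from whisper.cpp):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: greedy decoding picks the single most likely token
// at every step. Beam search would instead keep the k best partial
// sequences and choose the highest-scoring full sequence at the end.
int greedy_pick(const std::vector<float>& logits) {
    return static_cast<int>(std::distance(
        logits.begin(),
        std::max_element(logits.begin(), logits.end())));
}
```

Because each step commits to exactly one token, an early choice that postpones a timestamp token can never be revisited, which is one way greedy decoding can end up with longer segments than beam search.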
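The third bullet can be demonstrated directly: floating-point addition is not associative, so two implementations that merely accumulate values in a different order already diverge. A small self-contained check (the function is illustrative only):

```cpp
#include <cassert>

// Floating-point addition is not associative: summing the same three
// values in a different order can change the last bits of the result.
// A network evaluated with different accumulation orders (or different
// kernels) can therefore drift, even if both implementations are correct.
bool adds_associatively(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```

For example, `adds_associatively(0.1, 0.2, 0.3)` is false in IEEE 754 double precision, while it holds for exactly representable values such as 1.0, 2.0, 3.0.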

I think I've seen reports from some people that the original implementation also sometimes produces very long segments. So it could just be a matter of chance, in which case there is not much we can do here.

Additionally, I have tried adding logic that keeps segments from growing very long by forcing a timestamp token to be sampled once a segment exceeds a certain length. In general it helps, but I sometimes observed errors, so I decided not to include it.

Hope this helps.

@Topping1
Contributor

@ggerganov, @R4ZZ3 I'm having problems with long subtitles too. I am transcribing French and German and it happens in those languages as well, especially if the person is speaking very fast.
I did a little digging and found that modifying ffmpeg's resampling parameters helped mitigate this issue with fairly consistent results.
Specifically, I changed the parameters that control the resampling quality and the dithering (see below).

ffmpeg -i input.mp3 -af aresample=16000:resampler=soxr:precision=33:dither_method=improved_e_weighted -ac 1 -c:a pcm_s16le output.wav

I will make more tests with different dithering parameters and report back.

@pateljoel

I have this issue too; the subtitles are far too long with the C++ version of Whisper. @Topping1, I played with this as well. Is there a way to make the generated text shorter?

@ggerganov
Owner

Does it help if you change the following line:

int n_take = std::min(whisper_n_text_ctx(ctx)/2, int(prompt_past.size()));

to:

int n_take = std::min(16, int(prompt_past.size()));

Make sure to try the latest master; I've made a small change that might help with long subtitles (not sure).

@pateljoel

@ggerganov I've tried the above and it doesn't make a difference, the text is still the same length.

ggerganov added a commit that referenced this issue Oct 18, 2022
Force timestamp token to be sampled if the probability sum over all
timestamp tokens is above the probability of any other token
@ggerganov
Owner

I just implemented the timestamp probability rule used in the original whisper implementation:

https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/decoding.py#L430-L438

For my audio samples, the segments have become significantly shorter.
Give it a try and let me know if this helps.
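The rule linked above can be sketched as follows. This is an illustrative reimplementation, not the actual whisper.cpp code; it assumes `probs` is a softmax over the vocabulary and that all token ids at or above `timestamp_begin` are timestamp tokens, as in Whisper's vocabulary layout:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of the timestamp probability rule: if the total probability mass
// on the timestamp tokens exceeds the probability of the single best text
// token, zero out the text tokens so the sampler must emit a timestamp.
void force_timestamp_if_likely(std::vector<float>& probs, int timestamp_begin) {
    const float ts_sum = std::accumulate(
        probs.begin() + timestamp_begin, probs.end(), 0.0f);
    const float max_text = *std::max_element(
        probs.begin(), probs.begin() + timestamp_begin);
    if (ts_sum > max_text) {
        std::fill(probs.begin(), probs.begin() + timestamp_begin, 0.0f);
    }
}
```

When the mask triggers, the decoder is forced to close the current segment with a timestamp, which is what caps runaway segment lengths.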

@R4ZZ3
Author

R4ZZ3 commented Oct 18, 2022

Just fetched the updates and recompiled.
Seems promising! There's a small hiccup at the end, but it's perfectly OK!
Great job @ggerganov!

(screenshot: subtitles after the update)

@pateljoel

This one works better and doesn't generate long subtitles, (at least for my video) great job @ggerganov!
