
Generated subtitles are too long #52

Closed
R4ZZ3 opened this issue Oct 15, 2022 · 8 comments
Labels
question Further information is requested

Comments

@R4ZZ3

R4ZZ3 commented Oct 15, 2022

Hi and first of all thanks for creating this implementation.

I am trying to create a new version of this Space: https://huggingface.co/spaces/Finnish-NLP/Whisper-ASR-youtube-subtitles

I now have this working locally, but I have noticed that the generated subtitles are way too long.
The cpp implementation is on the right; the original PyTorch implementation (small model) is on the left.
Is this a "feature" of this implementation, or what is going on here?
(screenshot: side-by-side comparison of generated subtitles)

@ggerganov added the "question" (Further information is requested) label Oct 16, 2022
@ggerganov
Owner

There isn't any logic that I added to determine the length of the segments. They are whatever the model produces.

I have a few guesses about what could be causing the difference you observe:

  • Currently whisper.cpp uses the greedy strategy for sampling tokens. This is the simplest strategy: always pick the most likely token after inference. You can try disabling the beam search strategy in the PyTorch implementation and see if that also produces longer segments. If it does, it means we need to implement beam search in whisper.cpp to resolve this.

  • Another possible explanation is that my computation of the Mel spectrogram is slightly different from the PyTorch implementation's. I've tried running both on the same audio sample and the results are not identical. It's possible that this difference propagates through the network and leads to the differences you observe. This could be resolved by matching the two implementations.

  • The third explanation is that whisper.cpp simply does not produce results identical to the original implementation. It is just not possible (or very hard) to achieve exactly the same results when working with floating-point numbers, so the rounding errors accumulated during the computation can eventually lead to a different result.
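To illustrate the first bullet: greedy sampling just takes the argmax over the logits at each step, with no look-ahead. A minimal sketch (the function name and values are hypothetical, not taken from whisper.cpp):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: greedy decoding picks the single most likely token
// at every step. Beam search would instead keep the k best partial
// sequences and choose the highest-scoring full sequence at the end.
int greedy_pick(const std::vector<float>& logits) {
    return static_cast<int>(std::distance(
        logits.begin(),
        std::max_element(logits.begin(), logits.end())));
}
```

Because each step commits to exactly one token, an early choice that postpones a timestamp token can never be revisited, which is one way greedy decoding can end up with longer segments than beam search.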
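The third bullet can be demonstrated directly: floating-point addition is not associative, so two implementations that merely accumulate values in a different order already diverge. A small self-contained check (the function is illustrative only):

```cpp
#include <cassert>

// Floating-point addition is not associative: summing the same three
// values in a different order can change the last bits of the result.
// A network evaluated with different accumulation orders (or different
// kernels) can therefore drift, even if both implementations are correct.
bool adds_associatively(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```

For example, `adds_associatively(0.1, 0.2, 0.3)` is false in IEEE 754 double precision, while it holds for exactly representable values such as 1.0, 2.0, 3.0.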

I think I've seen reports from some people that the original implementation also sometimes produces very long segments. So it could just be a matter of chance, in which case there is not much we can do here.

Additionally, I have tried adding logic that keeps segments from growing very long by forcing a timestamp token to be sampled once a segment exceeds a certain length. In general it helps, but I sometimes observed errors, so I decided not to include it.

Hope this helps.

@Topping1
Contributor

@ggerganov, @R4ZZ3 I'm having problems with long subtitles too. I am transcribing French and German and it happens in those languages as well, especially if the person is speaking very fast.
I did a little digging and found that modifying ffmpeg's resampling parameters helped mitigate this issue with fairly consistent results.
Specifically, I changed the parameters that control the resampling quality and the dithering (see below).

ffmpeg -i input.mp3 -af aresample=16000:resampler=soxr:precision=33:dither_method=improved_e_weighted -ac 1 -c:a pcm_s16le output.wav

I will make more tests with different dithering parameters and report back.

@pateljoel

I have this issue too; the subtitles are far too long with the C++ version of Whisper. @Topping1, I played with this as well. Is there a way to make the generated text shorter?

@ggerganov
Owner

Does it help if you change the following line:

int n_take = std::min(whisper_n_text_ctx(ctx)/2, int(prompt_past.size()));

to:

int n_take = std::min(16, int(prompt_past.size()));

Make sure to try the latest master; I've made a small change that might help with long subtitles (not sure).

@pateljoel

@ggerganov I've tried the above and it doesn't make a difference, the text is still the same length.

ggerganov added a commit that referenced this issue Oct 18, 2022
Force timestamp token to be sampled if the probability sum over all
timestamp tokens is above the probability of any other token
@ggerganov
Owner

I just implemented the timestamp probability rule used in the original whisper implementation:

https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/decoding.py#L430-L438

For my audio samples, the segments have become significantly shorter.
Give it a try and let me know if this helps.
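The rule linked above can be sketched as follows. This is an illustrative reimplementation, not the actual whisper.cpp code; it assumes `probs` is a softmax over the vocabulary and that all token ids at or above `timestamp_begin` are timestamp tokens, as in Whisper's vocabulary layout:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of the timestamp probability rule: if the total probability mass
// on the timestamp tokens exceeds the probability of the single best text
// token, zero out the text tokens so the sampler must emit a timestamp.
void force_timestamp_if_likely(std::vector<float>& probs, int timestamp_begin) {
    const float ts_sum = std::accumulate(
        probs.begin() + timestamp_begin, probs.end(), 0.0f);
    const float max_text = *std::max_element(
        probs.begin(), probs.begin() + timestamp_begin);
    if (ts_sum > max_text) {
        std::fill(probs.begin(), probs.begin() + timestamp_begin, 0.0f);
    }
}
```

When the mask triggers, the decoder is forced to close the current segment with a timestamp, which is what caps runaway segment lengths.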

@R4ZZ3
Author

R4ZZ3 commented Oct 18, 2022

Just fetched the updates and recompiled.
Seems promising! There's a small hiccup at the end, but it's perfectly OK!
Great job @ggerganov!

(screenshot: subtitles after the update)

@pateljoel

This one works better and doesn't generate long subtitles, (at least for my video) great job @ggerganov!
