
Real-time streaming #141

Merged: 5 commits merged into master on Nov 20, 2022

Conversation

@ggerganov (Owner) commented Nov 11, 2022

[WIP]

With the idea in #137 it is possible to reduce the encoder time several-fold.
This is beneficial for the stream example, because it already processes the audio in short chunks.
The decoding quality seems to drop, but I think not significantly.

With the current parameters, I am able to run the following commands in real-time on a MacBook M1 Pro:

# real-time transcription with a step of 1.5 seconds using "medium.en"
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500 -ac 512

# real-time translation with a step of 2.5 seconds using "large"
./stream -m ./models/ggml-large.bin -t 8 --step 2500 --length 7500 --language bg --translate  -ac 512

This was not possible before.

Next thing to try is to run the tiny model in streaming mode in the browser using WASM with a step of 1 or 2 seconds.
I think there is some chance it could actually work.
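
For readers new to the stream example, here is a rough, illustrative sketch of how the --step and --length parameters interact. This is not the actual stream.cpp code - the capture and transcription helpers below are made-up stand-ins, and the real example additionally keeps a small overlap between windows:

// sketch of the sliding-window loop behind ./stream --step ... --length ...
#include <vector>

std::vector<float> capture_audio_ms(int ms);             // hypothetical: returns the next `ms` of 16 kHz samples
void transcribe_window(const std::vector<float> & pcm);  // hypothetical: wraps a whisper_full() call

void stream_loop(int step_ms, int length_ms) {
    const int sample_rate = 16000;
    const size_t n_keep = (size_t) sample_rate * length_ms / 1000;

    std::vector<float> window;
    while (true) {
        // every --step ms, append the newly captured audio to the rolling window
        const std::vector<float> chunk = capture_audio_ms(step_ms);
        window.insert(window.end(), chunk.begin(), chunk.end());

        // keep only the most recent --length ms of samples
        if (window.size() > n_keep) {
            window.erase(window.begin(), window.end() - n_keep);
        }

        // run inference on the whole window; with --step < --length the same audio
        // is re-transcribed ("refined") on every pass
        transcribe_window(window);
    }
}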

@meakbiyik (Contributor) commented Nov 17, 2022

The performance gain here is absurd. Is there anything I can pitch in to help finalize this PR, @ggerganov? I could not exactly follow the last commit "...stitch encoder outputs together" 😅

@ggerganov (Owner, Author)

The "stitching" is basically instead of running 10 seconds of audio through the encoder at one pass, run for example 5 x 2 second chunks and combine the results in the cross-attention layer to get effectively what we would have gotten with 10 seconds directly. This would allow to process audio more often and be more real-time.

The PR is missing an option to enable/disable the encoder truncation - I currently hardcoded the values. It's not difficult to finalise, but I want to see how I will use it in the streaming examples first - that will probably give me a better idea for the API.
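
Purely as a conceptual illustration of the stitching idea described above (not the actual whisper.cpp implementation - the names and types here are made up):

#include <vector>

using EncoderOutput = std::vector<float>;  // stand-in for the encoder activations of one chunk

EncoderOutput encode_chunk(const std::vector<float> & pcm_chunk);  // hypothetical short-chunk encoder pass

// Encode e.g. 5 x 2-second chunks separately and concatenate the activations, so that
// the decoder's cross-attention sees roughly what a single 10-second encoder pass
// would have produced, while letting us start processing audio much sooner.
EncoderOutput encode_stitched(const std::vector<std::vector<float>> & chunks) {
    EncoderOutput stitched;
    for (const std::vector<float> & chunk : chunks) {
        const EncoderOutput part = encode_chunk(chunk);
        stitched.insert(stitched.end(), part.begin(), part.end());
    }
    return stitched;  // used as the cross-attention memory instead of a full-pass encoding
}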

- Force the entire audio chunk to be transcribed into a single segment.
- Used to limit the number of tokens in a segment. Useful to battle word repetition when using a partial encoder context. Controls the max tokens per segment for the stream example.
- Used to overwrite the audio context size of the Encoder. For example, setting "audio_ctx = 512" will make it run about 3 times faster, processing about 10s of audio instead of 30s. The transcription quality drops, but this can be used for real-time streaming purposes where performance is important.
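
A minimal sketch of how these options map onto the C API - the field names single_segment, max_tokens and audio_ctx are taken from whisper.h as introduced around this PR; verify them against your version of the header:

#include "whisper.h"

struct whisper_full_params make_streaming_params() {
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    wparams.single_segment = true; // force the whole audio chunk into a single segment
    wparams.max_tokens     = 32;   // limit tokens per segment (illustrative value) - helps
                                   // against word repetition with a partial encoder context
    wparams.audio_ctx      = 512;  // overwrite the encoder audio context: ~3x faster,
                                   // covering ~10s of audio instead of 30s

    return wparams;
}
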
ggerganov merged commit f2df9bd into master on Nov 20, 2022
@ggerganov (Owner, Author)

@meakbiyik
This is now on master.
Simply add -ac 512 to the ./stream arguments and you will enable the 3x faster Encoder.

@meakbiyik (Contributor)

Wow, this is great - thanks a lot @ggerganov!

A quick follow-up question: would you recommend the 2x speed-up or reducing the audio context size? Or can I mix them - what was your experience? I do not quite understand why reducing the audio context should also reduce transcription accuracy, so I cannot be sure 😅

Also, interestingly, I have noted that lowering the step size improves the transcription quite a bit - so much so that a low step size + the base model is better than 2x the step size + the small model. Is there anything going on behind the scenes that can explain this phenomenon? Does the option -kc / keep context play any role here?

@ggerganov (Owner, Author)

The 2x speed-up does not seem very useful yet in my experience, so I don't recommend using it.
The smaller audio context is intuitively worse compared to the full context because you are analysing less data - less data means worse results.

The step size observation is strange - as long as your hardware is capable of processing the data in real time, the bigger model should always be better, regardless of step size. Regarding the -kc flag - I don't use it for stream because errors occur more often when doing real-time streaming, and the -kc flag can actually help propagate those errors into the future transcription.

@meakbiyik (Contributor)

Interesting, but why is there less data, particularly if the --length parameter is set to less than the context? What I assumed was that --length amount of data is used (if available) and the rest is padded with zeros; therefore, if we reduce the audio_ctx so that --length fits there snugly, there should be no issues. I feel like I totally misunderstood some of these parameters 😅

On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

The -kc part makes sense. I actually plan to create a PR to inject arbitrary contexts, as you recommended in a previous PR, but let's see what happens 😄

@ggerganov (Owner, Author) commented Nov 20, 2022

Yeah, actually you have a good point - for a fixed --length, if the context is bigger, then it shouldn't affect the quality. For example -ac 512 corresponds to a little more than 10s context, so for --length 10000 or less you should be getting the same quality. Your understanding is correct.
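
(A quick back-of-the-envelope check of that figure - the 1500-position / 30 s full encoder context used below is Whisper's default and is not stated explicitly in this thread:)

#include <cstdio>

int main() {
    const int full_audio_ctx    = 1500; // default Whisper encoder context, covering 30 s
    const int reduced_audio_ctx = 512;  // the value passed via -ac 512
    const double seconds = 30.0 * reduced_audio_ctx / full_audio_ctx;
    std::printf("audio covered per encoder pass: %.2f s\n", seconds); // prints 10.24 s
    return 0;
}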

> On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

Yes, correct. For example, if you have a --length of 10s, then regardless of whether the step is 1s, 2s, 3s, etc., the final pass that processes the full 10s chunk will give the same result. Actually, I now realise that you must not use -kc if the --step is smaller than --length, because it will use the "partial" transcription as text context for the next step and it will definitely get messy.

The -kc option has to be reworked as you suggest to be able to provide the context from the previous --length pass for each step of the current --length pass. Feel free to give it a shot and don't hesitate to ask if you have any questions.
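
(For anyone picking this up, a rough sketch of the kind of rework being described, assuming the prompt_tokens / prompt_n_tokens fields of whisper_full_params - the exact design was left open here:)

#include "whisper.h"
#include <vector>

// Feed the tokens from the last *completed* --length window as the text prompt for
// every step of the current window, instead of the partial transcription from the
// previous step.
void run_step(struct whisper_context * ctx,
              struct whisper_full_params wparams,
              const std::vector<float> & window,
              const std::vector<whisper_token> & prev_window_tokens) {
    wparams.prompt_tokens   = prev_window_tokens.empty() ? nullptr : prev_window_tokens.data();
    wparams.prompt_n_tokens = (int) prev_window_tokens.size();

    whisper_full(ctx, wparams, window.data(), (int) window.size());
}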

@meakbiyik (Contributor) commented Nov 20, 2022

Perfect, thanks a lot, all of this makes full sense! Will try to do that -kc thing quite soon.

Buuut I have one final follow-up just to understand it better: what happens if length > audio_ctx? Does the model trim it from the end? Or is there some downsampling going on?

@ggerganov (Owner, Author)

Currently, it will trim from the end:

whisper.cpp/whisper.cpp

Lines 1103 to 1104 in f2df9bd

const int i0 = std::min(mel_offset, mel_inp.n_len);
const int i1 = std::min(mel_offset + 2*n_ctx, mel_inp.n_len);
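// i0/i1 bound the range of mel-spectrogram frames passed to the encoder: each encoder
// position covers 2 mel frames, so anything beyond mel_offset + 2*n_ctx frames is
// simply dropped, i.e. the audio is trimmed from the end.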

@meakbiyik (Contributor)

A-ha, lovely. Thanks a lot again!

@xyx361100238

According to #137, I set -ac 750, but the result has lots of noise words like "[buzzer]", "[static]", "[AUDIO OUT]" - how can I remove them?
BTW, it works well with the source default of audio_ctx = 0.

@ggerganov (Owner, Author)

Currently, the only way is to manually replace these strings yourself (for example, using regex).
Btw, -ac 768 is better than -ac 750 - you want the number to be a multiple of 64 for better performance.
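
(An illustrative post-processing snippet, not part of whisper.cpp: stripping bracketed non-speech annotations such as [buzzer], [static] or [AUDIO OUT] from the output text with a regex.)

#include <regex>
#include <string>

std::string strip_annotations(const std::string & text) {
    // remove any "[...]" annotation emitted by the model
    static const std::regex re("\\[[^\\]]*\\]");
    return std::regex_replace(text, re, "");
}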

@xyx361100238

Yes yes! Much better with -ac 768. And I will replace the strings too. Thanks again!

mattsta pushed a commit to mattsta/whisper.cpp that referenced this pull request Apr 1, 2023