
Real-time streaming #141

Merged: 5 commits merged into master on Nov 20, 2022

Conversation

@ggerganov (Owner) commented Nov 11, 2022

[WIP]

With the idea in #137 it is possible to reduce the encoder time several-fold.
This is beneficial for the stream example, because it already processes the audio in short chunks.
The decoding quality seems to drop, but I think not significantly.

With the current parameters, I am able to run the following commands in real-time on a MacBook M1 Pro:

# real-time transcription with a step of 1.5 seconds using "medium.en"
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500 -ac 512

# real-time translation with a step of 2.5 seconds using "large"
./stream -m ./models/ggml-large.bin -t 8 --step 2500 --length 7500 --language bg --translate  -ac 512

This was not possible before.

Next thing to try is to run the tiny model in streaming mode in the browser using WASM with a step of 1 or 2 seconds.
I think there is some chance it could actually work.
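
For readers new to the stream example, here is a rough, illustrative sketch of how the --step and --length parameters interact. This is not the actual stream.cpp code - the capture and transcription helpers below are made-up stand-ins, and the real example additionally keeps a small overlap between windows:

// sketch of the sliding-window loop behind ./stream --step ... --length ...
#include <vector>

std::vector<float> capture_audio_ms(int ms);             // hypothetical: returns the next `ms` of 16 kHz samples
void transcribe_window(const std::vector<float> & pcm);  // hypothetical: wraps a whisper_full() call

void stream_loop(int step_ms, int length_ms) {
    const int sample_rate = 16000;
    const size_t n_keep = (size_t) sample_rate * length_ms / 1000;

    std::vector<float> window;
    while (true) {
        // every --step ms, append the newly captured audio to the rolling window
        const std::vector<float> chunk = capture_audio_ms(step_ms);
        window.insert(window.end(), chunk.begin(), chunk.end());

        // keep only the most recent --length ms of samples
        if (window.size() > n_keep) {
            window.erase(window.begin(), window.end() - n_keep);
        }

        // run inference on the whole window; with --step < --length the same audio
        // is re-transcribed ("refined") on every pass
        transcribe_window(window);
    }
}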

@meakbiyik (Contributor) commented Nov 17, 2022

The performance gain here is absurd. Is there anything I can pitch in to help finalize this PR, @ggerganov? I could not exactly follow the last commit "...stitch encoder outputs together" 😅

@ggerganov (Owner, Author)

The "stitching" is basically instead of running 10 seconds of audio through the encoder at one pass, run for example 5 x 2 second chunks and combine the results in the cross-attention layer to get effectively what we would have gotten with 10 seconds directly. This would allow to process audio more often and be more real-time.

The PR is missing an option to enable/disable the encoder truncation - I currently hardcoded the values. It's not difficult to finalise, but I want to see how I will use it in the streaming examples first - that will probably give me a better idea for the API.
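
Purely as a conceptual illustration of the stitching idea described above (not the actual whisper.cpp implementation - the names and types here are made up):

#include <vector>

using EncoderOutput = std::vector<float>;  // stand-in for the encoder activations of one chunk

EncoderOutput encode_chunk(const std::vector<float> & pcm_chunk);  // hypothetical short-chunk encoder pass

// Encode e.g. 5 x 2-second chunks separately and concatenate the activations, so that
// the decoder's cross-attention sees roughly what a single 10-second encoder pass
// would have produced, while letting us start processing audio much sooner.
EncoderOutput encode_stitched(const std::vector<std::vector<float>> & chunks) {
    EncoderOutput stitched;
    for (const std::vector<float> & chunk : chunks) {
        const EncoderOutput part = encode_chunk(chunk);
        stitched.insert(stitched.end(), part.begin(), part.end());
    }
    return stitched;  // used as the cross-attention memory instead of a full-pass encoding
}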

- Force the entire audio chunk to be transcribed into a single segment.
- Used to limit the number of tokens in a segment. Useful to battle word repetition when using a partial encoder context. Controls the max tokens per segment for the stream example.
- Used to overwrite the audio context size of the Encoder. For example, setting "audio_ctx = 512" will make it run about 3 times faster, processing about 10s of audio instead of 30s. The transcription quality drops, but this can be used for real-time streaming purposes where performance is important.
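
A minimal sketch of how these options map onto the C API - the field names single_segment, max_tokens and audio_ctx are taken from whisper.h as introduced around this PR; verify them against your version of the header:

#include "whisper.h"

struct whisper_full_params make_streaming_params() {
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    wparams.single_segment = true; // force the whole audio chunk into a single segment
    wparams.max_tokens     = 32;   // limit tokens per segment (illustrative value) - helps
                                   // against word repetition with a partial encoder context
    wparams.audio_ctx      = 512;  // overwrite the encoder audio context: ~3x faster,
                                   // covering ~10s of audio instead of 30s

    return wparams;
}
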
ggerganov merged commit f2df9bd into master on Nov 20, 2022
@ggerganov (Owner, Author)

@meakbiyik
This is now on master.
Simply add -ac 512 to the ./stream arguments and you will enable the 3x faster Encoder.

@meakbiyik (Contributor)

Wow, this is great - thanks a lot @ggerganov!

A quick follow-up question: would you recommend the 2x speed-up or reducing the audio context size? Or can I mix them - what was your experience? I do not quite understand why reducing the audio context should also reduce transcription accuracy, so I cannot be sure 😅

Also, interestingly, I have noted that lowering the step size improves the transcription quite a bit - so much so that a low step size + the base model is better than 2x the step size + the small model. Is there anything going on behind the scenes that can explain this phenomenon? Does the option -kc / keep context play any role here?

@ggerganov (Owner, Author)

The 2x speed-up does not seem very useful yet in my experience, so I don't recommend using it.
The smaller audio context is intuitively worse compared to the full context because you are analysing less data - less data means worse results.

The step size observation is strange - as long as your hardware is capable of processing the data in real time, the bigger model should always be better, regardless of step size. Regarding the -kc flag - I don't use it for stream because errors occur more often when doing real-time streaming, and the -kc flag can actually help propagate those errors into the future transcription.

@meakbiyik (Contributor)

Interesting, but why is there less data, particularly if the --length parameter is set to less than the context? What I assumed was that --length amount of data is used (if available) and the rest is padded with zeros; therefore, if we reduce the audio_ctx so that --length fits there snugly, there should be no issues. I feel like I totally misunderstood some of these parameters 😅

On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

The -kc part makes sense. I actually plan to create a PR to inject arbitrary contexts, as you recommended in a previous PR, but let's see what happens 😄

@ggerganov (Owner, Author) commented Nov 20, 2022

Yeah, actually you have a good point - for a fixed --length, if the context is bigger, then it shouldn't affect the quality. For example -ac 512 corresponds to a little more than 10s context, so for --length 10000 or less you should be getting the same quality. Your understanding is correct.
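
(A quick back-of-the-envelope check of that figure - the 1500-position / 30 s full encoder context used below is Whisper's default and is not stated explicitly in this thread:)

#include <cstdio>

int main() {
    const int full_audio_ctx    = 1500; // default Whisper encoder context, covering 30 s
    const int reduced_audio_ctx = 512;  // the value passed via -ac 512
    const double seconds = 30.0 * reduced_audio_ctx / full_audio_ctx;
    std::printf("audio covered per encoder pass: %.2f s\n", seconds); // prints 10.24 s
    return 0;
}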

> On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

Yes, correct. For example, if you have a --length of 10s, then regardless of whether the step is 1s, 2s, 3s, etc., the final pass that processes the full 10s chunk will give the same result. Actually, I now realise that you must not use -kc if the --step is smaller than --length, because it will use the "partial" transcription as text context for the next step and it will definitely get messy.

The -kc option has to be reworked as you suggest to be able to provide the context from the previous --length pass for each step of the current --length pass. Feel free to give it a shot and don't hesitate to ask if you have any questions.
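
(For anyone picking this up, a rough sketch of the kind of rework being described, assuming the prompt_tokens / prompt_n_tokens fields of whisper_full_params - the exact design was left open here:)

#include "whisper.h"
#include <vector>

// Feed the tokens from the last *completed* --length window as the text prompt for
// every step of the current window, instead of the partial transcription from the
// previous step.
void run_step(struct whisper_context * ctx,
              struct whisper_full_params wparams,
              const std::vector<float> & window,
              const std::vector<whisper_token> & prev_window_tokens) {
    wparams.prompt_tokens   = prev_window_tokens.empty() ? nullptr : prev_window_tokens.data();
    wparams.prompt_n_tokens = (int) prev_window_tokens.size();

    whisper_full(ctx, wparams, window.data(), (int) window.size());
}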

@meakbiyik (Contributor) commented Nov 20, 2022

Perfect, thanks a lot, all of this makes full sense! Will try to do that -kc thing quite soon.

Buuut I have one final follow-up just to understand it better: what happens if length > audio_ctx? Does the model trim it from the end? Or is there some downsampling going on?

@ggerganov (Owner, Author)

Currently, it will trim from the end:

whisper.cpp/whisper.cpp

Lines 1103 to 1104 in f2df9bd

const int i0 = std::min(mel_offset, mel_inp.n_len);
const int i1 = std::min(mel_offset + 2*n_ctx, mel_inp.n_len);
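// i0/i1 bound the range of mel-spectrogram frames passed to the encoder: each encoder
// position covers 2 mel frames, so anything beyond mel_offset + 2*n_ctx frames is
// simply dropped, i.e. the audio is trimmed from the end.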

@meakbiyik (Contributor)

A-ha, lovely. Thanks a lot again!

@xyx361100238

According to #137, I set -ac 750, but the result has lots of noise words like "[buzzer]", "[static]", "[AUDIO OUT]" - how can I remove them?
BTW, it works well with the source default of audio_ctx = 0.

@ggerganov (Owner, Author)

Currently, the only way is to manually replace these strings yourself (for example, using regex).
Btw, -ac 768 is better than -ac 750 - you want the number to be a multiple of 64 for better performance.
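
(An illustrative post-processing snippet, not part of whisper.cpp: stripping bracketed non-speech annotations such as [buzzer], [static] or [AUDIO OUT] from the output text with a regex.)

#include <regex>
#include <string>

std::string strip_annotations(const std::string & text) {
    // remove any "[...]" annotation emitted by the model
    static const std::regex re("\\[[^\\]]*\\]");
    return std::regex_replace(text, re, "");
}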

@xyx361100238

Yes yes! Much better with -ac 768. And I will replace the strings too. Thanks again!

mattsta pushed a commit to mattsta/whisper.cpp that referenced this pull request Apr 1, 2023