Add support for streaming speech-to-text results #242

jonatanklosko · 2023-09-12T11:34:22Z

Large audio is split into multiple chunks. This PR adds support for stream: true, so that output chunks are emitted as soon as they are available.
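To illustrate what "emitted as soon as they are available" means in practice, here is a minimal, dependency-free sketch of lazy chunk processing with Elixir streams. It is not Bumblebee's implementation: the module `StreamingSketch`, the function `transcribe_chunk/2`, and the chunk size are all hypothetical stand-ins for the real model and chunking logic.

```elixir
defmodule StreamingSketch do
  # Hypothetical stand-in for the model: "transcribes" one chunk of samples.
  def transcribe_chunk(samples, index) do
    %{text: "chunk #{index} (#{length(samples)} samples)"}
  end

  # Lazily splits the input into fixed-size chunks and transcribes each
  # chunk only when the consumer asks for the next element.
  def stream(samples, chunk_size) do
    samples
    |> Stream.chunk_every(chunk_size)
    |> Stream.with_index()
    |> Stream.map(fn {chunk, index} -> transcribe_chunk(chunk, index) end)
  end
end

samples = Enum.to_list(1..10)

# Taking one element only forces the first chunk to be processed;
# the remaining chunks are never transcribed.
[first] = StreamingSketch.stream(samples, 4) |> Enum.take(1)
IO.puts(first.text)
# prints "chunk 0 (4 samples)"
```

A consumer can start acting on the first chunk without waiting for the whole input to be processed, which is the behavior this PR is after for long audio.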

jonatanklosko · 2023-09-12T11:34:59Z

lib/bumblebee/audio.ex

+    serving = speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config, opts)
+    client_postprocessing = serving.client_postprocessing
+
+    Nx.Serving.client_postprocessing(serving, fn output_pair, info ->


The output format changed, so I'm wrapping the client postprocessing here to avoid a breaking change.
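The wrapping idea can be sketched in plain Elixir, independent of Nx.Serving. The output shapes below are assumptions for illustration only (a new-style `%{chunks: [...]}` map and an old-style `%{results: [...]}` map), and `WrapSketch`, `new_postprocessing/2` and `wrap_legacy/1` are hypothetical names, not the functions in this diff.

```elixir
defmodule WrapSketch do
  # Hypothetical new-style postprocessing: returns a flat map with chunks.
  def new_postprocessing({chunks, _metadata}, _info) do
    %{chunks: chunks}
  end

  # Wraps a postprocessing function so that existing callers keep
  # receiving the old %{results: [...]} shape (assumed for this sketch).
  def wrap_legacy(postprocessing) do
    fn output_pair, info ->
      %{chunks: chunks} = postprocessing.(output_pair, info)
      text = Enum.map_join(chunks, & &1.text)
      %{results: [%{text: text, chunks: chunks}]}
    end
  end
end

legacy = WrapSketch.wrap_legacy(&WrapSketch.new_postprocessing/2)
chunks = [%{text: "Hello"}, %{text: " world"}]

IO.inspect(legacy.({chunks, %{}}, %{}))
```

The inner function is left untouched, so the new function can return the new format while the older entry point adapts it for existing callers.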

jonatanklosko · 2023-09-12T11:35:51Z

lib/bumblebee/audio.ex

@@ -21,15 +41,13 @@ defmodule Bumblebee.Audio do

  """
  @type speech_to_text_whisper_input :: Nx.t() | {:file, String.t()}
-  @type speech_to_text_whisper_output :: %{results: list(speech_to_text_whisper_result())}


I removed results: [...]. It could make sense for non-deterministic text generation with sampling, but for transcription we generally expect a specific output, so I don't think it will ever be useful.

lib/bumblebee/audio/speech_to_text_whisper.ex

jonatanklosko · 2023-09-12T11:43:55Z

@josevalim Nx should be good to go with respect to streaming, awesome job!

lib/bumblebee/audio/speech_to_text_whisper.ex

josevalim · 2023-09-12T13:03:18Z

lib/bumblebee/audio.ex

-              end_timestamp_seconds: number() | nil
-            })
+          start_timestamp_seconds: number() | nil,
+          end_timestamp_seconds: number() | nil


Just to wrap up our conversation: the issue with this is that, depending on an option, the timestamp is either nil or a number. However, once we have the type system, callers that only ever call this with timestamps: true will have to handle nil in their code, even though it can never be nil! That's because it is hard in a type system to have different result types based on an option.

There are a couple of fixes. One is to say this is dynamic and (number or nil) (in the future, not now). Another is to make it always a number and return -1 if timestamps is false. A third is to have different functions: one with timestamps and one without. But this is what I meant, in a nutshell.

jonatanklosko · 2023-09-12

Ah, you are right, it still depends on the option, just in a different way. I'm not really a fan of -1, though it's the most straightforward option. Typing the serving may be tricky on its own, because what we call is Nx.Serving.batched_run and the return value depends on the serving name we pass?

josevalim · 2023-09-12

Yeah, there is still a lot to do before that happens, but I thought I would start the discussion. If we keep this as is, then -1 is pretty much the only viable option in the future if we want to keep a single type.

Another option is to have a speech_to_text_whisper function that returns a string (as it did before) and a transcribe_whisper function (which has the return type above, without nils).

jonatanklosko · 2023-09-12

Let's keep nil, and if we get to type serving outputs we can switch to -1. The issue with separate functions is that if we have another option like that, then we have 2x2 functions :D
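For reference, the two single-type options discussed in this thread can be written down as typespecs. This is only a sketch: the module `TimestampTypesSketch`, the type names, and `to_sentinel/1` are hypothetical and not part of the PR; the field names follow the diff above.

```elixir
defmodule TimestampTypesSketch do
  # Shape kept in this PR: timestamps are nil when timestamps are disabled.
  @type chunk :: %{
          text: String.t(),
          start_timestamp_seconds: number() | nil,
          end_timestamp_seconds: number() | nil
        }

  # Alternative discussed: always a number, with -1 as a sentinel.
  @type chunk_with_sentinel :: %{
          text: String.t(),
          start_timestamp_seconds: number(),
          end_timestamp_seconds: number()
        }

  # Converts the nil-based shape into the sentinel-based one.
  # Only nil is replaced; 0 and 0.0 are truthy in Elixir and stay as they are.
  @spec to_sentinel(chunk()) :: chunk_with_sentinel()
  def to_sentinel(chunk) do
    %{
      chunk
      | start_timestamp_seconds: chunk.start_timestamp_seconds || -1,
        end_timestamp_seconds: chunk.end_timestamp_seconds || -1
    }
  end
end

without_timestamps = %{text: "hi", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
IO.inspect(TimestampTypesSketch.to_sentinel(without_timestamps))
```

With the sentinel variant, callers never have to match on nil, at the cost of a magic value that has to be documented.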

Add support for streaming speech-to-text results

2c04d08

jonatanklosko added 2 commits September 12, 2023 13:36

Update lib/bumblebee/audio/speech_to_text_whisper.ex

ef277a9

Add tests for streaming

6732277

josevalim approved these changes Sep 12, 2023

Up

cf2a438

jonatanklosko merged commit 391fcd0 into main Sep 12, 2023
2 checks passed

jonatanklosko deleted the jk-audio-stream branch September 12, 2023 15:14
