Add Whisper model and speech-to-text serving #107

Merged: 29 commits merged into main from sm-whisper on Jan 26, 2023

Conversation

seanmor5
Contributor

No description provided.

```elixir
for sample <- padded_samples do
  sample
  |> Nx.transpose()
  |> Nx.to_batched(1)
```
Member

@polvalente btw is there a reasonable way to have a batched version? Is it something we could improve with vmap?

Contributor

FFT is already batched, so it's more about as_windowed becoming batched so that we could have a batched STFT.

With a batched STFT, it would be reasonably easy to have a batched stft_to_mel, because it's basically a clever matrix product.
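
For illustration, here is a minimal sketch of that matrix-product step, assuming a batched magnitude spectrogram and a precomputed mel filter bank (MelSketch, batched_stft_to_mel, and the argument names are hypothetical, not part of NxSignal's API):

```elixir
defmodule MelSketch do
  import Nx.Defn

  # Sketch only: project a batched magnitude spectrogram onto mel bins
  # in a single matrix product.
  #
  # spectrogram: {batch_size, num_frames, fft_bins}
  # filter_bank: {fft_bins, num_mel_bins}, precomputed mel filters
  defn batched_stft_to_mel(spectrogram, filter_bank) do
    # Contract the fft_bins axis against the filter bank; the batch and
    # frame axes pass through, so no per-sample loop is required.
    Nx.dot(spectrogram, [2], filter_bank, [0])
  end
end
```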

@jonatanklosko
Member

[image]

```elixir
  end)
end

defp ffmpeg_read_as_pcm(path, sampling_rate) do
```
Contributor Author

Does it make sense to have Bumblebee depend on a 3rd-party binary like this? I guess in one sense it makes things much easier to work with out of the box, but on the other hand it's a tight assumption. Though I guess we don't explicitly require it, so it's not a big deal.

Contributor

I don't think it does. I was talking with @jonatanklosko about a lib I'm planning with someone on the ML channel to wrap ffmpeg and load audio data into Nx tensors from either a binary or a file name.

It would be an optional sister lib to NxSignal

Contributor

For the serving, I think we could either call said library if available or just receive tensors directly otherwise.
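
A rough sketch of that dispatch, assuming the serving accepts either an Nx.Tensor or a file path (maybe_read_audio is an illustrative name, and ffmpeg_read_as_pcm refers to the helper reviewed above):

```elixir
# Sketch only: pass tensors straight through, fall back to the optional
# ffmpeg-based reader when given a file path.
defp maybe_read_audio(%Nx.Tensor{} = waveform, _sampling_rate), do: waveform

defp maybe_read_audio(path, sampling_rate) when is_binary(path) do
  ffmpeg_read_as_pcm(path, sampling_rate)
end
```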

Member

@seanmor5 I think we definitely should have an easy option to work with a file in this case, and as long as it relies on optional dependencies we should be good.

FWIW hf/transformers also uses ffmpeg for files.

Contributor

Yeah, I am fine with this as long as:

  1. Avoiding it is easy (i.e. just pass a tensor)
  2. We explicitly document it
  3. We raise a nice error message if not available
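
As a rough illustration of points 1 and 3, the file branch could look something like the sketch below, assuming ffmpeg is invoked via System.cmd/3 (the exact flags and error wording here are illustrative, not necessarily what the PR ships):

```elixir
defp ffmpeg_read_as_pcm(path, sampling_rate) do
  # Point 3: raise a clear error when the optional ffmpeg binary is missing.
  unless System.find_executable("ffmpeg") do
    raise "ffmpeg not found in PATH; install it or pass an audio tensor directly"
  end

  # Decode to mono, 32-bit float PCM at the featurizer's sampling rate.
  args = [
    "-i", path,
    "-ac", "1",
    "-ar", Integer.to_string(sampling_rate),
    "-f", "f32le",
    "-hide_banner",
    "-loglevel", "quiet",
    "pipe:1"
  ]

  case System.cmd("ffmpeg", args) do
    {pcm, 0} -> Nx.from_binary(pcm, :f32)
    {_out, status} -> raise "ffmpeg failed to decode #{path} (exit status #{status})"
  end
end
```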

@jonatanklosko
Member

@seanmor5 @polvalente I changed the featurizer to return the input as channels-last ({batch_size, input_length, num_mel_bins}), this way we avoid transposing back and forth, plus we already use channels-last everywhere.

@jonatanklosko jonatanklosko changed the title Add whisper Add Whisper model and speech-to-text serving Jan 26, 2023
@jonatanklosko jonatanklosko merged commit 1ca6418 into main Jan 26, 2023
@jonatanklosko jonatanklosko deleted the sm-whisper branch January 26, 2023 13:50
@developertrinidad08


I wanted to try this and I got the following error:

```
** (RuntimeError) could not match the class name "WhisperForConditionalGeneration" to any of the supported models, please specify the :module and :architecture options
    (bumblebee 0.1.2) lib/bumblebee.ex:262: Bumblebee.load_spec/2
    (bumblebee 0.1.2) lib/bumblebee.ex:372: Bumblebee.load_model/2
    (stdlib 3.17.2) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17.2) erl_eval.erl:446: :erl_eval.expr/5
    (stdlib 3.17.2) erl_eval.erl:123: :erl_eval.exprs/5
    (elixir 1.14.2) lib/module/parallel_checker.ex:107: Module.ParallelChecker.verify/1
```

Could you help me? I am new to Elixir.

@jonatanklosko
Member

jonatanklosko commented Feb 8, 2023

Hey @developertrinidad08, the feature is only available on main currently. Here's a notebook you can import to try it out:

# Whisper

```elixir
Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.8.1"}
])

Nx.global_default_backend(EXLA.Backend)
```

## Example

```elixir
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

audio_input = Kino.Input.audio("Audio", sampling_rate: featurizer.sampling_rate)
```

```elixir
audio = Kino.Input.read(audio_input)

audio =
  audio.data
  |> Nx.from_binary(:f32)
  |> Nx.reshape({:auto, audio.num_channels})
  |> Nx.mean(axes: [1])

Nx.Serving.run(serving, audio)
```

@developertrinidad08

It works perfectly, thank you very much @jonatanklosko

@developertrinidad08

I have another query: is there something like a label or option I can add to change the language to Spanish or a different language?

@jonatanklosko
Member

@developertrinidad08 you can try this:

```elixir
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA],
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|es|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )
```

We will likely have a higher level API to set the language, but it's model/tokenizer specific, so I deferred that for now.

@dcaixinha

@jonatanklosko do you happen to know if providing a language to whisper has any effect on the speech detection results? Or is it just to perform translation on the output? Thanks in advance 🙏

@jonatanklosko
Member

@dcaixinha from what I saw, it does. As far as I understand, the language token always indicates what language the speech uses. Then <|transcribe|> transcribes in that very language, while <|translate|> translates it into English.

There's also a "glitch": when the speech is English and we set a different language token + <|transcribe|>, it transcribes the English speech translated into that language (ref).

@wadestuart

wadestuart commented Mar 11, 2023

Hello, taking a look at this I am trying to wrap my head around how to add initial_prompt-type functionality (from the whisper.py example app) -- basically it allows you to inject a string that gets tokenized to extend the initial window tokens in the model, giving hints about things that may exist in the input audio. A primary use, for instance, is to inject proper names that may be in the input audio (so that the model is more likely to output the right proper name rather than a sound-alike, e.g. "Jonatan Kłosko" vs "Jonathan Costco"). I am just not seeing a good way to duplicate this functionality by extending this.

https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L194

@jonatanklosko
Member

@wadestuart the serving currently works on a single window, we will extend it to chunk longer inputs in the future. I'm not sure if there's an easy way to inject the prompt (other than :forced_token_ids, though it has a different purpose). I don't think hf/transformers have this option either, but we can think about it once we support multiple windows. That said, there's always an option to write the serving yourself for full control (just like the app is implemented), though we may still need changes to the generation API to handle prompt injection.

@wadestuart

@jonatanklosko Thank you! I will probably hold out to see how the multiple-windows implementation nets out, use a port to the Python implementation for the time being, and revisit then.

@dcaixinha

Hi @jonatanklosko, sorry for necro-bumping this thread, but I was wondering if there's any option for passing the forced_token_ids you suggested above at run-time. Your suggestion works fine if the serving will always serve the same language (which is set when calling Bumblebee.Audio.speech_to_text), but for dynamic languages it would be great to be able to pass the language when calling Nx.Serving.run. I was reading the docs for Nx.Serving but didn't find anything useful. Do you know if it's possible? 🙏 Thank you very much!

@jonatanklosko
Member

Hey @dcaixinha! Currently it's not really feasible since we use forced_token_ids to generate the computation graph that is compiled. While technically it should be possible to pass the language token as an input, it's too model-specific to handle reasonably I think.

That said, you can configure Whisper with no specific language and it may still return the expected transcription, which depending on what you're doing may do the job. I mean this:

```elixir
forced_token_ids: [
  {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
  {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
]
```

@dcaixinha

Gotcha, thank you very much @jonatanklosko 🙌 Since the Python library does it at run-time, I was wondering if the same would be possible in Elixir 💭 Thanks again for your help 🙇

@jonatanklosko
Member

@dcaixinha yeah, the difference is that in PyTorch the computation is eager and everything can be dynamic, while we rely on defn to build a computation graph that we compile together. As said, having language configured at runtime is doable, I'm just not sure how it fits into the API yet. I added a note in #187 to consider that.
