
need a simple python script to run WhisperSpeech locally to compare to bark #67

Closed

BBC-Esq opened this issue Jan 30, 2024 · 17 comments
BBC-Esq commented Jan 30, 2024

I really want to test this out and compare it to Bark, but its implementation seems convoluted to someone like me who's never used Google Colab notebooks. I didn't see any plain Python scripts in the repository that I could tweak to get it working... I'm running a 4090 and am very familiar with OpenAI's Whisper, ctranslate2's implementation (of which WhisperX is an offshoot), the Transformers library, etc. I DO NOT use llama.cpp but suppose I could if need be.

Basically, are there, or can you provide, plain Python scripts?

zoq commented Jan 30, 2024

You can pretty much take the necessary lines from the notebook itself:

from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model', torch_compile=True)
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech.")

You need to install WhisperSpeech first: pip install -Uqq WhisperSpeech.

BBC-Esq commented Jan 30, 2024

Thanks for the quick response. Here's my entire script so far:

import torch
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Generating speech and saving it to a file
pipe.generate_to_file('output_audio.wav', """
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

print("Audio file generated: output_audio.wav")

However, I keep getting an error saying that torchaudio.save can't find an appropriate "backend":

(screenshot of the error traceback)

BTW. I'm using PyTorch 2.1.2 and CUDA 11.8 if that matters.

Lastly, I'm running Windows 10, not Linux...

BBC-Esq commented Jan 30, 2024

The traceback indicates that the error originates from the line torchaudio.save(fname, audio.cpu(), 24000), which is part of the decode_to_file method in a2wav.py:

    def decode_to_file(self, fname, atoks):
        audio = self.decode(atoks)
        torchaudio.save(fname, audio.cpu(), 24000)
        if self.is_notebook():
            from IPython.display import display, HTML, Audio
            display(HTML(f'<a href="{fname}" target="_blank">Listen to {fname}</a>'))

The next step in the traceback refers to utils.py within the torchaudio library, line 311:

        backend = dispatcher(uri, format, backend)
        return backend.save(
            uri, src, sample_rate, channels_first, format, encoding, bits_per_sample, buffer_size, compression
        )

The "dispatcher" method it calls is as follows:

    def dispatcher(
        uri: Union[BinaryIO, str, os.PathLike], format: Optional[str], backend_name: Optional[str]
    ) -> Backend:
        if backend_name is not None:
            return get_backend(backend_name, backends)

        for backend in backends.values():
            if backend.can_decode(uri, format):
                return backend
        raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")

The runtime error it creates matches what I'm getting:

  File "C:\PATH\Scripts\test\Lib\site-packages\torchaudio\_backend\utils.py", line 221, in dispatcher
    raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")
RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None.

That's as far as I got without reverse engineering the code any further... Any ideas?
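One thing I might try, given the dispatcher above (a sketch, assuming torchaudio 2.1+, where save() accepts an explicit backend argument, and that the soundfile package is installed), is bypassing the dispatcher and forcing a backend:

import torch
import torchaudio

# Placeholder one-second mono signal, just to exercise the save path.
audio = torch.zeros(1, 24000)

# Force the soundfile backend instead of letting the dispatcher pick one;
# requires `pip install soundfile`.
torchaudio.save("output.wav", audio, 24000, backend="soundfile")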

jpc commented Jan 30, 2024

That's quite strange. It seems torchaudio does not know how to write a WAV file?

There seems to be a pretty long discussion (3+ years) about Windows torchaudio support here: pytorch/audio#425

The docs say that soundfile should work on Windows and support WAV files. I tried forcing the soundfile backend on Linux but I cannot reproduce the issue.

Maybe you could add print(torchaudio.__version__) to your script to rule out version-related differences?
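For example (a sketch, assuming torchaudio 2.x, where list_audio_backends() reports which backends are usable):

import torchaudio

print(torchaudio.__version__)
# Lists the usable backends, e.g. ['soundfile'] or ['ffmpeg', 'sox', 'soundfile'];
# an empty list would explain the "Couldn't find appropriate backend" error.
print(torchaudio.list_audio_backends())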

BBC-Esq commented Jan 30, 2024

This script worked, although it's somewhat different functionally: the script itself saves the .wav file:

from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

audio_tensor = pipe.generate_to_file("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

audio_np = (audio_tensor.numpy() * 32767).astype(np.int16)

# pydub needs int16 PCM bytes; give mono audio an explicit channel axis
if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    # Convert the NumPy array to audio segment
    audio_segment = AudioSegment(
        audio_np.tobytes(), 
        frame_rate=24000, 
        sample_width=2, 
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")

HOWEVER, please note that I altered the pipeline.py script, specifically the generate_to_file method, to read as follows instead:

    def generate_to_file(self, text, speaker=None, lang='en', cps=15, step_callback=None):
        atoks = self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=step_callback)
        audio = self.vocoder.decode(atoks)
        # return the raw audio tensor instead of saving it with torchaudio
        return audio.cpu()

As I understand it, instead of relying on WhisperSpeech's default torchaudio path to write the .wav file, I'm returning the audio tensor and having PyDub and my script create the .wav file... Thoughts?

jpc commented Jan 30, 2024

Wow, great that you got it to work. We'd love to integrate this solution into WhisperSpeech itself so it works out of the box on Windows.

Before we do that, you could use pipe.generate to avoid having to modify pipeline.py. It returns the audio samples without trying to save a file.
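For example (a minimal sketch; generate returns the raw audio tensor, which lives on the GPU when CUDA is available):

audio = pipe.generate("Hello from WhisperSpeech.")
print(audio.shape)  # raw samples at 24 kHz; move to CPU with audio.cpu() before converting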

BBC-Esq commented Jan 30, 2024

Withdrawn message...I found the model files I was asking for...

BBC-Esq commented Jan 30, 2024

Here's the completed script. It was necessary to move the tensor to the CPU, as explained in the comments:

from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')

audio_tensor = pipe.generate("""
I want to test this and compare it to Bark models and see how they compare in terms of quality of audio as well as processing speed!
""")

# generate uses CUDA if available; therefore, it's necessary to move to CPU before converting to NumPy array
audio_np = (audio_tensor.cpu().numpy() * 32767).astype(np.int16)

if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    audio_segment = AudioSegment(
        audio_np.tobytes(), 
        frame_rate=24000, 
        sample_width=2, 
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")

jpc commented Jan 30, 2024

> Sure I can do that. BTW, what are all the different models I can try? For example,
>
> collabora/whisperspeech:s2a-q4-tiny-en+pl.model
>
> What other English ones are there? I'm unfamiliar with your naming conventions...

These names come from the file names in this repo: https://huggingface.co/collabora/whisperspeech/tree/main

You can try out the other en+pl models listed there. The sizes of T2S and S2A can be different (we have t2s_ref and s2a_ref arguments).
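For example, a sketch mixing model sizes (assuming the t2s-small-en+pl.model file listed in that repo):

from whisperspeech.pipeline import Pipeline

# Pair a small text-to-semantic model with the tiny quantized semantic-to-acoustic model.
pipe = Pipeline(
    t2s_ref='collabora/whisperspeech:t2s-small-en+pl.model',
    s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model',
)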

@Joosheen

Does anyone know the script for one-shot voice cloning? This one from the Colab also isn't working:

pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

zoq commented Jan 31, 2024

What is the error you see?

jpc commented Jan 31, 2024

@Joosheen I've seen that loading the sample directly from a URL does not seem to work on Windows.

Could you try downloading the file, putting it in the same folder as the script, and modifying the command to use speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg"?

Joosheen commented Feb 2, 2024

> What is the error you see?

OSError: [WinError 1314] Klient nie ma wymaganych uprawnień ("A required privilege is not held by the client"): 'C:\Users\AJusi\.cache\huggingface\hub\models--speechbrain--spkrec-ecapa-voxceleb\snapshots\5c0be3875fda05e81f3c004ed8c7c06be308de1e\hyperparams.yaml' -> '~\.cache\speechbrain\hyperparams.yaml'

Joosheen commented Feb 2, 2024

> @Joosheen I've seen that loading the sample directly from a URL does not seem to work on Windows.
>
> Could you try downloading the file, putting it in the same folder as the script, and modifying the command to use speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg"?

OK, I saved the file in the same folder as the script and ran:

from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', cps=10, speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg")

and got the same error:

OSError: [WinError 1314] Klient nie ma wymaganych uprawnień ("A required privilege is not held by the client"): 'C:\Users\AJusi\.cache\huggingface\hub\models--speechbrain--spkrec-ecapa-voxceleb\snapshots\5c0be3875fda05e81f3c004ed8c7c06be308de1e\hyperparams.yaml' -> '~\.cache\speechbrain\hyperparams.yaml'

I'm a newbie, so I don't know if I'm doing something really wrong...

BBC-Esq commented Feb 2, 2024

I couldn't get it to work either. I ran into some kind of error about not being able to load a .ogg file, though I don't have the error text anymore. I gave up after I created my basic script to test WhisperSpeech. But for what it's worth, I can confirm that the code snippet saying to just add the "speaker" parameter didn't work for some reason.

@BBC-Esq BBC-Esq changed the title "Can we please get a simple python script that runs it locally...no more Google Colab notebooks?" to "need a simple python script to run WhisperSpeech locally to compare to bark" Feb 2, 2024
BBC-Esq commented Feb 3, 2024

> OK, I saved the file in the same folder as the script and ran: […] and got the same error: OSError: [WinError 1314] […]
>
> I'm a newbie, so I don't know if I'm doing something really wrong...

There's some more discussion here: #72. This issue was more geared toward getting a simple working script...

I'd recommend that the devs close this issue and we continue the discussion in the other issue for ease of reference.

BBC-Esq commented Mar 4, 2024

Closing since it's been addressed. Thanks!

@BBC-Esq BBC-Esq closed this as completed Mar 4, 2024