
need a simple python script to run WhisperSpeech locally to compare to bark #67

Closed

BBC-Esq opened this issue Jan 30, 2024 · 17 comments
BBC-Esq commented Jan 30, 2024

I really want to test this out and compare it to Bark, but its implementation seems convoluted to someone like me who's never used Google Colab notebooks. I didn't see any plain Python scripts in the repository that I could tweak to get it working... I'm running a 4090 and am very familiar with OpenAI's Whisper, ctranslate2's implementation (of which WhisperX is an offshoot), the Transformers library, etc. I DO NOT use llama.cpp but suppose I could if need be.

Basically, are there, or can you provide, plain Python scripts?

zoq commented Jan 30, 2024

You can pretty much take the necessary lines from the notebook itself:

from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model', torch_compile=True)
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech.")

You need to install WhisperSpeech first: pip install -Uqq WhisperSpeech.

BBC-Esq commented Jan 30, 2024

Thanks for the quick response. Here's my entire script so far:

import torch
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Generating speech and saving it to a file
pipe.generate_to_file('output_audio.wav', """
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

print("Audio file generated: output_audio.wav")

However, I keep getting an error saying that torchaudio.save can't find an appropriate "backend":

(screenshot of the error traceback)

BTW. I'm using PyTorch 2.1.2 and CUDA 11.8 if that matters.

Lastly, I'm running Windows 10, not Linux...

BBC-Esq commented Jan 30, 2024

The traceback indicates that the error originates from the line torchaudio.save(fname, audio.cpu(), 24000), which is part of the decode_to_file method in a2wav.py:

    def decode_to_file(self, fname, atoks):
        audio = self.decode(atoks)
        torchaudio.save(fname, audio.cpu(), 24000)
        if self.is_notebook():
            from IPython.display import display, HTML, Audio
            display(HTML(f'<a href="{fname}" target="_blank">Listen to {fname}</a>'))

The next step in the traceback refers to utils.py within the torchaudio library, line 311:

        backend = dispatcher(uri, format, backend)
        return backend.save(
            uri, src, sample_rate, channels_first, format, encoding, bits_per_sample, buffer_size, compression
        )

The "dispatcher" method it calls is as follows:

    def dispatcher(
        uri: Union[BinaryIO, str, os.PathLike], format: Optional[str], backend_name: Optional[str]
    ) -> Backend:
        if backend_name is not None:
            return get_backend(backend_name, backends)

        for backend in backends.values():
            if backend.can_decode(uri, format):
                return backend
        raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")

The runtime error it creates matches what I'm getting:

  File "C:\PATH\Scripts\test\Lib\site-packages\torchaudio\_backend\utils.py", line 221, in dispatcher
    raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")
RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None.

That's as far as I got without reverse engineering the code any further... Any ideas?
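One thing I might try, given the dispatcher above (a sketch, assuming torchaudio 2.1+, where save() accepts an explicit backend argument, and that the soundfile package is installed), is bypassing the dispatcher and forcing a backend:

import torch
import torchaudio

# Placeholder one-second mono signal, just to exercise the save path.
audio = torch.zeros(1, 24000)

# Force the soundfile backend instead of letting the dispatcher pick one;
# requires `pip install soundfile`.
torchaudio.save("output.wav", audio, 24000, backend="soundfile")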

jpc commented Jan 30, 2024

That's quite strange. It seems torchaudio does not know how to write a WAV file?

There seems to be a pretty long discussion (3+ years) about Windows torchaudio support here: pytorch/audio#425

The docs say that soundfile should work on Windows and support WAV files. I tried forcing the soundfile backend on Linux but I cannot reproduce the issue.

Maybe you could add print(torchaudio.__version__) to your script to rule out version-related differences?
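For example (a sketch, assuming torchaudio 2.x, where list_audio_backends() reports which backends are usable):

import torchaudio

print(torchaudio.__version__)
# Lists the usable backends, e.g. ['soundfile'] or ['ffmpeg', 'sox', 'soundfile'];
# an empty list would explain the "Couldn't find appropriate backend" error.
print(torchaudio.list_audio_backends())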

BBC-Esq commented Jan 30, 2024

This script worked, although it's somewhat different functionally: the script itself saves the .wav file:

from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

audio_tensor = pipe.generate_to_file("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

audio_np = (audio_tensor.numpy() * 32767).astype(np.int16)

# pydub needs int16 PCM bytes; give mono audio an explicit channel axis
if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    # Convert the NumPy array to audio segment
    audio_segment = AudioSegment(
        audio_np.tobytes(), 
        frame_rate=24000, 
        sample_width=2, 
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")

HOWEVER, please note that I altered the pipeline.py script, specifically the generate_to_file method, to read as follows instead:

    def generate_to_file(self, text, speaker=None, lang='en', cps=15, step_callback=None):
        atoks = self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=step_callback)
        audio = self.vocoder.decode(atoks)
        # return the raw audio tensor instead of saving it with torchaudio
        return audio.cpu()

As I understand it, instead of relying on WhisperSpeech's default torchaudio path to write the .wav file, I'm returning the audio tensor and having PyDub and my script create the .wav file... Thoughts?

jpc commented Jan 30, 2024

Wow, great that you got it to work. We'd love to integrate this solution into WhisperSpeech itself so it works out of the box on Windows.

Before we do that, you could use pipe.generate to avoid having to modify pipeline.py. It returns the audio samples without trying to save a file.
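For example (a minimal sketch; generate returns the raw audio tensor, which lives on the GPU when CUDA is available):

audio = pipe.generate("Hello from WhisperSpeech.")
print(audio.shape)  # raw samples at 24 kHz; move to CPU with audio.cpu() before converting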

BBC-Esq commented Jan 30, 2024

Withdrawn message...I found the model files I was asking for...

BBC-Esq commented Jan 30, 2024

Here's the completed script. It was necessary to move the tensor to the CPU, as explained in the comments:

from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')

audio_tensor = pipe.generate("""
I want to test this and compare it to Bark models and see how they compare in terms of quality of audio as well as processing speed!
""")

# generate uses CUDA if available; therefore, it's necessary to move to CPU before converting to NumPy array
audio_np = (audio_tensor.cpu().numpy() * 32767).astype(np.int16)

if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    audio_segment = AudioSegment(
        audio_np.tobytes(), 
        frame_rate=24000, 
        sample_width=2, 
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")

jpc commented Jan 30, 2024

> Sure I can do that. BTW, what are all the different models I can try? For example,
>
> collabora/whisperspeech:s2a-q4-tiny-en+pl.model
>
> What other English ones are there? I'm unfamiliar with your naming conventions...

These names come from the file names in this repo: https://huggingface.co/collabora/whisperspeech/tree/main

You can try out the other en+pl models listed there. The sizes of T2S and S2A can be different (we have t2s_ref and s2a_ref arguments).
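For example, a sketch mixing model sizes (assuming the t2s-small-en+pl.model file listed in that repo):

from whisperspeech.pipeline import Pipeline

# Pair a small text-to-semantic model with the tiny quantized semantic-to-acoustic model.
pipe = Pipeline(
    t2s_ref='collabora/whisperspeech:t2s-small-en+pl.model',
    s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model',
)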

@Joosheen

Does anyone know the script for one-shot voice cloning? This one from the Colab also isn't working:

pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

zoq commented Jan 31, 2024

What is the error you see?

jpc commented Jan 31, 2024

@Joosheen I've seen that loading the sample directly from a URL does not seem to work on Windows.

Could you try downloading the file, putting it in the same folder as the script, and modifying the command to use speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg"?

Joosheen commented Feb 2, 2024

> What is the error you see?

OSError: [WinError 1314] Klient nie ma wymaganych uprawnień ("A required privilege is not held by the client"): 'C:\Users\AJusi\.cache\huggingface\hub\models--speechbrain--spkrec-ecapa-voxceleb\snapshots\5c0be3875fda05e81f3c004ed8c7c06be308de1e\hyperparams.yaml' -> '~\.cache\speechbrain\hyperparams.yaml'

Joosheen commented Feb 2, 2024

> @Joosheen I've seen that loading the sample directly from a URL does not seem to work on Windows.
>
> Could you try downloading the file, putting it in the same folder as the script, and modifying the command to use speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg"?

OK, I saved the file in the same folder as the script and ran:

from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', cps=10, speaker="Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg")

and got the same error:

OSError: [WinError 1314] Klient nie ma wymaganych uprawnień ("A required privilege is not held by the client"): 'C:\Users\AJusi\.cache\huggingface\hub\models--speechbrain--spkrec-ecapa-voxceleb\snapshots\5c0be3875fda05e81f3c004ed8c7c06be308de1e\hyperparams.yaml' -> '~\.cache\speechbrain\hyperparams.yaml'

I'm a newbie, so I don't know if I'm doing something really wrong...

BBC-Esq commented Feb 2, 2024

I couldn't get it to work either. I ran into some kind of error about not being able to load a .ogg file, though I don't have the error text anymore. I gave up after I created my basic script to test WhisperSpeech. But for what it's worth, I can confirm that the code snippet saying to just add the "speaker" parameter didn't work for some reason.

@BBC-Esq BBC-Esq changed the title "Can we please get a simple python script that runs it locally...no more Google Colab notebooks?" to "need a simple python script to run WhisperSpeech locally to compare to bark" Feb 2, 2024
BBC-Esq commented Feb 3, 2024

> OK, I saved the file in the same folder as the script and ran: […] and got the same error: OSError: [WinError 1314] […]
>
> I'm a newbie, so I don't know if I'm doing something really wrong...

There's some more discussion here: #72. This issue was more geared toward getting a simple working script...

I'd recommend that the devs close this issue and we continue the discussion in the other issue for ease of reference.

BBC-Esq commented Mar 4, 2024

Closing since it's been addressed. Thanks!

@BBC-Esq BBC-Esq closed this as completed Mar 4, 2024