<a href="https://colab.research.google.com/github/dat0106/notebook/blob/main/Inference_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inference examples

In [None]:
# Do not forget to install all dependencies first:
!pip install -Uqq WhisperSpeech==0.7.2
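
If a later cell fails on import, it can help to confirm which version was actually installed. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name and not part of WhisperSpeech.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The install cell above pins WhisperSpeech to 0.7.2.
print("WhisperSpeech:", installed_version("WhisperSpeech"))
```

A result of `None` suggests the install step did not run in the current environment.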


In [None]:
import torch

def is_colab():
    # google.colab is only importable inside a Google Colab runtime
    try:
        import google.colab
        return True
    except ImportError:
        return False

# The examples below need a CUDA GPU, so fail early with a helpful message
if not torch.cuda.is_available():
    if is_colab():
        raise RuntimeError("Please change the runtime type to GPU. In the menu: Runtime -> Change runtime type (the free T4 instance is enough)")
    else:
        raise RuntimeError("Currently the example notebook requires CUDA, make sure you are running this on a machine with a GPU.")
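
The Colab check above imports `google.colab` just to see whether it exists. As an alternative sketch (not the notebook's own method), the import system can be asked whether a module can be found without importing the module itself; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. no `google` package at all)
        # raises ModuleNotFoundError, which is a subclass of ImportError.
        return False


print("Running in Colab:", module_available("google.colab"))
```

Note that for a dotted name such as `google.colab`, `find_spec` still imports the parent package in order to look inside it.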

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import torch
import torch.nn.functional as F

from IPython.display import Markdown, HTML

## The whole pipeline

In [5]:
# check "7. Pipeline.ipynb"
from whisperspeech.pipeline import Pipeline

In [6]:
# let's start with the fast SD S2A model
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




In [7]:
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")



In [8]:
# The model knows how to speak in Polish
# (the text means: "This is the first test of our model. Warm regards.")
pipe.generate_to_notebook("""
To jest pierwszy test naszego modelu. Pozdrawiamy serdecznie.
""", lang='pl')

In [9]:
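# We can also mix different languages (e.g. for borrowed words) in a single sentence;
# each text segment is paired with the language code at the same position in `lang`
# (the Polish parts mean: "This is the first test of the multilingual ... text-to-speech model that Collabora and Laion trained on the ... supercomputer")
texts = ["To jest pierwszy test wielojęzycznego ", " Whisper Speech ",
         ", modelu zamieniającego tekst na mowę, który Collabora i Laion nauczyli na superkomputerze", " Jewels."]
stoks = pipe.t2s.generate(texts, lang=['pl', 'en', 'pl', 'en'])
stoks = stoks[stoks!=512]  # filter out token id 512 (presumably padding) before decoding
pipe.vocoder.decode_to_notebook(pipe.s2a.generate(stoks, pipe.default_speaker.unsqueeze(0)))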

In [None]:
# Another mixed-language example, this time with an explicit `cps` value
stoks = pipe.t2s.generate(["I love to eat Eastern European food! Especially ", "pierogi i bigos."], lang=['en', 'pl'], cps=11)
stoks = stoks[stoks!=512]
pipe.vocoder.decode_to_notebook(pipe.s2a.generate(stoks, pipe.default_speaker.unsqueeze(0)))
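
The mixed-language calls above take a list of text segments and a list of language codes of the same length. Keeping the two lists aligned by hand is error-prone, so here is a small sketch that builds both from a single list of `(lang, text)` pairs. `split_segments` is a hypothetical helper, not part of WhisperSpeech.

```python
def split_segments(segments):
    """Turn [(lang, text), ...] into parallel (texts, langs) lists."""
    texts, langs = [], []
    for lang, text in segments:
        if not text:
            raise ValueError("empty text segment")
        texts.append(text)
        langs.append(lang)
    return texts, langs


segments = [
    ("en", "I love to eat Eastern European food! Especially "),
    ("pl", "pierogi i bigos."),
]
texts, langs = split_segments(segments)
print(texts, langs)

# With the pipeline loaded as in the cells above, these could be passed on as:
# stoks = pipe.t2s.generate(texts, lang=langs, cps=11)
```

Writing each segment next to its language code makes a mismatch between the two lists impossible by construction.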

## Voice cloning

In [10]:
# we can give it an audio file reference to get zero-shot voice cloning
#
# you can provide a URL or upload your own audio files
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', cps=10.5, speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')
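
The `cps` argument appears to control the speaking rate. Assuming it stands for characters per second (an assumption; the notebook does not define it), the duration of the generated audio can be roughly estimated from the text length. `estimated_seconds` is a hypothetical helper used only for this back-of-the-envelope calculation.

```python
def estimated_seconds(text: str, cps: float) -> float:
    """Rough audio duration, ASSUMING cps means characters per second."""
    if cps <= 0:
        raise ValueError("cps must be positive")
    return len(text.strip()) / cps


text = """
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
"""

# Lower cps values give slower speech and therefore longer audio.
for cps in (10, 10.5, 15):
    print(f"cps={cps}: about {estimated_seconds(text, cps):.1f} s")
```

This is an estimate only; the actual length depends on what the model generates.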




In [11]:
# it works even better (but slower) with the HQ model
pipe = Pipeline()




In [12]:
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', cps=10, speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')