Welcome to Tortoise! 🐢🐢🐢🐢

Before you begin, I **strongly** recommend you turn on a GPU runtime.

There's a reason this is called "Tortoise" - this model takes up to a minute to perform inference for a single sentence on a GPU. Expect waits on the order of hours on a CPU.
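A quick way to confirm which runtime you are on is sketched below. The `pick_device` helper is a hypothetical name used only for this illustration; it assumes PyTorch is the backend and falls back to "cpu" if `torch` is not installed yet.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    # torch may not be installed yet (the install cell comes next), so probe first.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If this prints "cpu", switch to a GPU runtime before running the generation cells.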


In [None]:
%pip install -r requirements.txt
# !python setup.py install

In [13]:
# Imports used through the rest of the notebook.
from sys import platform

import IPython
import torchaudio

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice, load_voices

model_iter = "autoregressive"
model_path = f"models/{model_iter}_gpt.pth"
autoregressive_batch_size = 8 if platform == "darwin" else None

# This will download all the models used by Tortoise from the HF hub.
# To use DeepSpeed, pass use_deepspeed=True; it is nearly 2x faster than normal.

tts = TextToSpeech(
    model_path=None,
    use_deepspeed=True,
    kv_cache=True,
    autoregressive_batch_size=autoregressive_batch_size,
)

Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.o

In [2]:
# This is the text that will be spoken.
# text = "Hey, it's John Roderick. You might know me from all the great shows!"

text = "You know, liberalism used to be something my dad would get into fights over in bars."

# Here's a longer passage to try (uncomment the block below to set text=).
# text = """
# You know, liberalism used to be something my dad would get into fights over in bars. It was a muscular thing, a real force. Now, it's like... it's lost its way. It's become this brand, this thing that everyone's gotta conform to. Every new fad, every new buzzword, the liberals are all over it, trying to canonize it. It's become so bloated, so full of contradictions, that it's lost its meaning.
# """

# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "fast"
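A typo in the preset name is easy to make, so a small guard can catch it before a long generation run starts. This is only a sketch: `KNOWN_PRESETS` and `check_preset` are hypothetical names, and the list simply mirrors the four options named in the comment above rather than anything read from api.py.

```python
# The four preset names listed in the comment above.
KNOWN_PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def check_preset(name: str) -> str:
    """Return name unchanged if it is a known preset, else raise ValueError."""
    if name not in KNOWN_PRESETS:
        raise ValueError(f"Unknown preset {name!r}; expected one of {KNOWN_PRESETS}")
    return name


preset = check_preset("fast")
```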

In [3]:
# Tortoise will attempt to mimic voices you provide. It comes pre-packaged
# with some voices you might recognize.

# Let's list all the voices available. These are just some random clips I've gathered
# from the internet as well as a few voices from the training dataset.
# Feel free to add your own clips to the voices/ folder.
%ls tortoise/voices

IPython.display.Audio('tortoise/voices/john/1.wav')

angie/               lj/                  train_dotrice/
applejack/           mol/                 train_dreams/
cond_latent_example/ myself/              train_empire/
daniel/              pat/                 train_grace/
deniro/              pat2/                train_kennard/
emma/                rainbow/             train_lescault/
freeman/             snakes/              train_mouse/
geralt/              tim_reynolds/        weaver/
halle/               tom/                 william/
jlaw/                train_atkins/
john/                train_daws/
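If `%ls` is not available (for example on Windows), the same listing can be produced in plain Python. This sketch assumes each voice is a subfolder of `tortoise/voices`, as in the output above; `list_voices` is a hypothetical helper, not part of Tortoise.

```python
from pathlib import Path


def list_voices(voices_dir: str = "tortoise/voices") -> list[str]:
    """Return the sorted names of the voice folders under voices_dir."""
    root = Path(voices_dir)
    if not root.is_dir():
        # Wrong working directory or missing checkout: report no voices.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


print(list_voices())
```

A new voice you add to the voices/ folder should show up in this list the next time the cell runs.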


In [14]:
# Pick one of the voices from the output above
voice = "john"

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(
    text,
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset=preset,
)
torchaudio.save(f"{model_iter}.wav", gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(f"{model_iter}.wav")

Generating autoregressive samples..


100%|██████████| 12/12 [02:29<00:00, 12.47s/it]


Computing best candidates using CLVP


100%|██████████| 12/12 [00:13<00:00,  1.12s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:50<00:00,  1.59it/s]
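To get a feel for the cost of one sentence, the elapsed times from the three progress bars above can be added up. This is plain arithmetic on the numbers shown in that output; `to_seconds` is a hypothetical helper and only handles the "MM:SS" form seen here.

```python
def to_seconds(elapsed: str) -> int:
    """Convert an "MM:SS" elapsed time from a progress bar to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + int(seconds)


# Elapsed times copied from the three progress bars above.
stages = {
    "Generating autoregressive samples": "02:29",
    "Computing best candidates using CLVP": "00:13",
    "Transforming autoregressive outputs into audio": "00:50",
}

total = sum(to_seconds(t) for t in stages.values())
print(f"Total: {total // 60}:{total % 60:02d}")  # Total: 3:32
```

Most of the time in this run went to the autoregressive sampling stage.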


In [None]:
# Tortoise can also generate speech using a random voice. The voice changes each time you execute this!
# (Note: random voices can be prone to strange utterances)
gen = tts.tts_with_preset(
    text, voice_samples=None, conditioning_latents=None, preset=preset
)
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)
IPython.display.Audio("generated.wav")

In [None]:
# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Let's see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
voice_samples, conditioning_latents = load_voices(["pat", "william"])

gen = tts.tts_with_preset(
    "They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset=preset,
)
torchaudio.save("captain_kirkard.wav", gen.squeeze(0).cpu(), 24000)
IPython.display.Audio("captain_kirkard.wav")
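One way to picture voice combination is as pooling the reference clips of each parent voice into a single conditioning set. The sketch below is a conceptual illustration only, under that assumption: `pool_voice_clips` is a hypothetical function, the strings stand in for audio tensors, and the real `load_voices` may behave differently (for instance when a voice ships precomputed latents instead of clips).

```python
def pool_voice_clips(clips_by_voice: dict[str, list], voices: list[str]) -> list:
    """Concatenate the reference clips of the selected voices, in order."""
    pooled = []
    for name in voices:
        if name not in clips_by_voice:
            raise KeyError(f"Unknown voice: {name}")
        pooled.extend(clips_by_voice[name])
    return pooled


# Placeholder strings stand in for audio tensors.
clips = {"pat": ["pat_clip_1", "pat_clip_2"], "william": ["william_clip_1"]}
print(pool_voice_clips(clips, ["pat", "william"]))
```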

In [None]:
del tts  # Will break other cells, but necessary to conserve RAM if you want to run this cell.

# Tortoise comes with some scripts that do a lot of the heavy lifting for you. For example,
# read.py will read a text file aloud. This will take a while...
!python tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=.

IPython.display.Audio('train_atkins/combined.wav')
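Reading a whole file generally means breaking the text into pieces short enough to synthesize one at a time, then joining the audio. The sketch below shows one simple way to do the splitting; it is not the code read.py uses, and `split_into_chunks` and the 200-character limit are assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two! Three?", max_chars=10))
```

Each chunk could then be passed to `tts.tts_with_preset` in turn, with the resulting clips concatenated before saving.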