# Installation

In [1]:
! python --version

Python 3.10.6


In [None]:
! git clone --quiet https://github.com/camille-vanhoffelen/tortoise-tts.git

In [None]:
! pip -q install voicefixer==0.1.2


In [None]:
! cd tortoise-tts && pip -q install -e .

  Preparing metadata (setup.py) ... done


# Benchmark

In [None]:
import tortoise

In [None]:
import time
import uuid
from pathlib import Path

import torchaudio
from tortoise.api import MODELS_DIR, TextToSpeech

SEED = 4242
SAMPLE_RATE = 24000
OUTPUT_DIR = Path("results")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

tts = TextToSpeech(models_dir=MODELS_DIR, use_deepspeed=False, kv_cache=True, half=True)


def benchmark(text: str):
    start = time.perf_counter()
    gen, dbg_state = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=None,
        preset="ultra_fast",
        use_deterministic_seed=SEED,
        return_deterministic_state=True,
        cvvp_amount=0.0,
    )
    audio_array = gen.squeeze(0).cpu()
    end = time.perf_counter()

    log_benchmark(start=start, end=end, audio_array=audio_array)
    return audio_array

def log_benchmark(start, end, audio_array):
    run_time_in_s = end - start
    duration_in_s = audio_array.shape[1] / SAMPLE_RATE
    # Real-time factor: values above 1 mean generation is slower than playback.
    speed_ratio = run_time_in_s / duration_in_s
    print(
        f"Benchmark results: run_time_in_s={round(run_time_in_s, 3)}, duration_in_s={round(duration_in_s, 3)}, speed_ratio={round(speed_ratio, 3)}"
    )

def save_audio(audio_array):
    uuid_str = str(uuid.uuid4())[:4]
    path = OUTPUT_DIR / f"tortoise-benchmark-{uuid_str}.wav"
    torchaudio.save(
        path,
        audio_array,
        SAMPLE_RATE,
    )
    return path


In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
audio_array = benchmark(text=text)
Audio(audio_array, rate=SAMPLE_RATE)

Downloading rlg_auto.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth...
Done.

Downloading rlg_diffuser.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth...
Done.
Generating autoregressive samples..


100%|██████████| 1/1 [00:17<00:00, 17.02s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  4.58it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:07<00:00,  4.16it/s]


Benchmark results: run_time_in_s=41.482, duration_in_s=12.395, speed_ratio=3.347


# Sentence Batching

According to its documentation, `split_and_recombine_text`'s goal is to "Split text it into chunks of a desired length trying to keep sentences intact".

It does so by trying to make chunks 200 characters long, with a hard cutoff at 300 characters.
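
To make the 200/300 behaviour concrete, here is a minimal sketch of a greedy sentence packer. It is a simplified stand-in written for illustration, not Tortoise's implementation: `chunk_sentences` and its regex-based sentence split are assumptions, and the real function will produce different chunk boundaries.

```python
import re


def chunk_sentences(text: str, desired_length: int = 200, max_length: int = 300) -> list[str]:
    """Greedy sentence packer (hypothetical helper, not Tortoise's own code)."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard cutoff: slice any single sentence that is longer than max_length.
        pieces = [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if current and len(candidate) > max_length:
                # Adding this piece would break the hard limit: close the chunk first.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            if len(current) >= desired_length:
                # Target length reached: close the chunk at this sentence boundary.
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

Under these assumptions every chunk stays at or below `max_length`, and a chunk only ends mid-sentence when a single sentence is longer than the hard limit.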

In [None]:
from tortoise.utils.text import split_and_recombine_text

text = """Welcome, my friends, to the realm of Modern Mindfulness. In this guided meditation, we shall embark on a journey of facing the unlucky events and stressful circumstances that life throws our way. Together, we shall learn to embrace the chaos with serene grace and a touch of sarcastic humor. Let us begin, shall we?
Find a comfortable position, preferably on a rooftop overlooking the city. Close your eyes, take a deep breath in, and exhale slowly. [breathes] Feel the weight of the world lift off your shoulders as you release all tension. As we dive into this meditation, let us reflect on the circumstances and events that will test your patience and resilience.
Imagine a candlelit dinner before you, the soft glow casting mesmerizing shadows upon your face. A gentle breeze caresses your skin, cooling you in the warm embrace of the evening. [breathes] But alas, as luck would have it, your wrist throbs with pain from an overzealous air guitar session. Feel the discomfort, acknowledge it, and let it go. [breathes]
As the wind picks up, you feel a draft sweep through, swirling around the flickering candles on your table. [breathes] Watch as their flames dance erratically, finally succumbing to the invisible force. The darkness envelopes you, stirring a slight frustration within. [breathes] Embrace the darkness, for it is in the absence of light that we truly learn to appreciate its presence.
Oh, dear listener, as your heart skips a beat, you reach for the matches, only to realize that you have forgotten them. The realization sinks in, a mix of annoyance and disbelief intertwining. [breathes] Accept this moment of forgetfulness and find solace in the fact that you are not alone in your absent-mindedness.
As you ponder the lack of matches, seagulls swoop down from above, attacking your dinner with unrelenting determination. Their shrill cries echo through the night as they snatch morsels from your plate. Feel the frustration and annoyance rise, but remember, these birds of the sky symbolize freedom. [breathes] Let their wild spirit remind you to find freedom amidst chaos.
Oh, the wineglass, it was once brimming with your favorite red elixir. As you reach for it, your arm brushes against the table's edge, sending the glass crashing to the ground. The crimson liquid spills like a river, staining the pavement beneath. Feel the tension rise within you, but remember, it is just a glass, and wine can always be refilled. [breathes]
And there it is, my friends, an unexpected drip of candle wax lands upon your favorite shirt. The fabric, once pristine, bears witness to the mishap. Feel the frustration well up within you, but remember, material possessions do not define our worth. [breathes] Embrace imperfections, for they make life interesting and remind us to laugh at ourselves.
As we near the end of this journey, remember the mantra we have shared: "Life sucks, but I breathe" [breathes]. In the face of unfortunate events and stressful circumstances, breathe in the acceptance of what is, and exhale the need for control. Embrace the chaos, the imperfections, and the absurdity of life. For in these moments, true liberation is found.
Slowly bring your awareness back to the present moment. Wiggle your fingers and toes, feeling the energy flow through your body. [breathes] When you're ready, open your eyes, and face the world with newfound resilience and a touch of satirical grace. May this Modern Mindfulness meditation guide you through the chaos of life, reminding you to embrace the unpredictable with a calm heart. [breathes]"""
splits = split_and_recombine_text(text)

In [None]:
[len(s) for s in splits]

[291, 249, 221, 247, 213, 180, 192, 245, 252, 247, 205, 258, 245, 241, 259, 10]

In [None]:
from IPython.display import Audio

audio_array = benchmark(splits[0])
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:19<00:00, 19.28s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  1.81it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:12<00:00,  2.42it/s]


Benchmark results: run_time_in_s=39.953, duration_in_s=17.269, speed_ratio=2.314


# Voice Conditioning

Are conditioning latents faster than voice samples? And how do we obtain them: directly from a generation, or by re-encoding the produced audio?

Inside `tts_with_preset`, conditioning latents are computed from the voice samples on every call, so passing precomputed latents directly should be faster.
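
A minimal sketch of that idea: compute (or obtain) one pair of latents once and reuse it for every text. `synthesize_all` is a hypothetical helper; it only assumes that `tts` behaves like `tortoise.api.TextToSpeech` and uses the same `tts_with_preset` arguments as the cells in this notebook.

```python
from typing import Any, Sequence


def synthesize_all(tts: Any, texts: Sequence[str], latents: tuple, seed: int) -> list:
    """Generate one clip per text, reusing the same conditioning latents.

    `latents` is an (auto_conditioning, diffusion_conditioning) pair.
    """
    clips = []
    for text in texts:
        gen = tts.tts_with_preset(
            text=text,
            k=1,
            conditioning_latents=latents,  # reused, so no voice samples are needed
            preset="ultra_fast",
            use_deterministic_seed=seed,
            cvvp_amount=0.0,
        )
        clips.append(gen)
    return clips
```

Whether this is measurably faster depends on how expensive the latent computation is relative to generation, which would need to be benchmarked.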

In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
gen, dbg_state = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=None,
        preset="ultra_fast",
        use_deterministic_seed=SEED,
        return_deterministic_state=True,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:19<00:00, 19.10s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.35it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:07<00:00,  4.25it/s]


In [None]:
dbg_state

(4242,
 "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!",
 None,
 (tensor([[-1.3977,  0.6395, -2.8321,  ...,  2.8888, -1.6749,  0.1308]]),
  tensor([[-1.1527, -0.8785, -0.9940,  ..., -0.3133,  0.0126,  0.3852]])))

In [None]:
auto_conditioning, diffusion_conditioning = dbg_state[3]

In [None]:
from IPython.display import Audio

text = "Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors."
gen = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=None,
        preset="ultra_fast",
        use_deterministic_seed=SEED,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:22<00:00, 22.51s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  1.34it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:07<00:00,  3.76it/s]


In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
gen = tts.tts_with_preset(
        text=text,
        k=1,
        conditioning_latents=(auto_conditioning, diffusion_conditioning),
        preset="ultra_fast",
        use_deterministic_seed=1235,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:09<00:00,  9.60s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.50it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:06<00:00,  4.37it/s]


In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
gen = tts.tts_with_preset(
        text=text,
        k=1,
        conditioning_latents=(auto_conditioning, diffusion_conditioning),
        preset="ultra_fast",
        use_deterministic_seed=1236,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:10<00:00, 10.04s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  1.46it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:05<00:00,  5.21it/s]


Not so successful... Is it because there are too few latents? How do the voice sample banks get converted into latents?

> Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.

> Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
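
The averaging step described in the quote can be sketched in plain Python. This is only an illustration of "taking the mean of all of the produced latents": `mean_latent` is a hypothetical helper, and real latents are tensors produced by the submodel rather than lists of floats.

```python
def mean_latent(point_latents: list[list[float]]) -> list[float]:
    """Average equally sized latent vectors element-wise."""
    if not point_latents:
        raise ValueError("need at least one latent")
    count = len(point_latents)
    # zip(*...) groups the i-th component of every latent together.
    return [sum(components) / count for components in zip(*point_latents)]


# Two toy "clips", each reduced to a 2-dimensional point latent.
print(mean_latent([[1.0, 2.0], [3.0, 4.0]]))  # prints [2.0, 3.0]
```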

What if we draw random conditioning latents directly, instead of deriving them from a voice, to get consistency?

In [None]:
auto_conditioning, diffusion_conditioning = tts.get_random_conditioning_latents()



Downloading rlg_auto.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth...
Done.

Downloading rlg_diffuser.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth...
Done.


In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
gen = tts.tts_with_preset(
        text=text,
        k=1,
        conditioning_latents=(auto_conditioning, diffusion_conditioning),
        preset="ultra_fast",
        use_deterministic_seed=1,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:16<00:00, 16.85s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  4.64it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:04<00:00,  6.73it/s]


In [None]:
from IPython.display import Audio

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
gen = tts.tts_with_preset(
        text=text,
        k=1,
        conditioning_latents=(auto_conditioning, diffusion_conditioning),
        preset="ultra_fast",
        use_deterministic_seed=2,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:11<00:00, 11.59s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.08it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:05<00:00,  5.95it/s]


In [None]:
from IPython.display import Audio
from tortoise.utils.audio import load_voice

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
voice_samples, conditioning_latents = load_voice("tom")
gen = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=voice_samples,
        preset="ultra_fast",
        use_deterministic_seed=2,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:11<00:00, 11.84s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.80it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:06<00:00,  4.85it/s]


In [None]:
from IPython.display import Audio
from tortoise.utils.audio import load_voice

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
voice_samples, conditioning_latents = load_voice("tom")
gen = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=voice_samples,
        preset="ultra_fast",
        use_deterministic_seed=3,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:10<00:00, 10.25s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.30it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:07<00:00,  4.20it/s]


Does the seed help here? And is it enough to avoid variation across texts?

In [6]:
from IPython.display import Audio
from tortoise.utils.audio import load_voice

text = "I never, ever said that! You're building a conspiracy against me, because you are afraid that I eat your vegetables again... You are a coward!"
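# Note: the loaded samples are unused below; voice_samples=None makes Tortoise pick a random voice.
voice_samples, conditioning_latents = load_voice("tom")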
gen = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=None,
        preset="ultra_fast",
        use_deterministic_seed=4,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:09<00:00,  9.39s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  1.46it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:04<00:00,  7.01it/s]


In [None]:
from IPython.display import Audio
from tortoise.utils.audio import load_voice

text = "Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors."
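# Note: the loaded samples are unused below; voice_samples=None makes Tortoise pick a random voice.
voice_samples, conditioning_latents = load_voice("tom")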
gen = tts.tts_with_preset(
        text=text,
        k=1,
        voice_samples=None,
        preset="ultra_fast",
        use_deterministic_seed=4,
        return_deterministic_state=False,
        cvvp_amount=0.0,
    )
audio_array = gen.squeeze(0).cpu()
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:10<00:00, 10.72s/it]


It looks like a fixed seed combined with random conditioning latents does offer some consistency across generations, but the result depends heavily on the text: if the register or topic changes, the generated voice follows.

# Conditioning-Free Diffusion

> Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and dramatically improves realism.
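
The quote does not spell out how the two outputs are blended. A common form for this kind of guidance is to extrapolate away from the unconditioned output; the sketch below assumes that form, `(1 + k) * conditioned - k * unconditioned`, and is not confirmed by the quoted documentation. `blend_cond_free` is a hypothetical helper working on plain lists instead of tensors.

```python
def blend_cond_free(conditioned: list[float], unconditioned: list[float], cond_free_k: float) -> list[float]:
    """Blend two diffusion outputs element-wise.

    Assumed rule (not taken from the quoted docs):
        (1 + k) * conditioned - k * unconditioned
    """
    return [
        (1 + cond_free_k) * c - cond_free_k * u
        for c, u in zip(conditioned, unconditioned)
    ]


# With k = 0 the conditioned output is returned unchanged;
# larger k pushes the result further away from the unconditioned output.
print(blend_cond_free([1.0, 2.0], [0.0, 1.0], 2.0))  # prints [3.0, 4.0]
```

Under this assumption, the cost of conditioning-free diffusion is the second forward pass per step, which is the reason to benchmark it.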

To do: test generation speed with conditioning-free diffusion.

# Redaction

> Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "[I am really sad,] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
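
To preview which words would remain audible for a given prompt, the bracketed spans can be stripped with a regular expression. This is only a text-level sketch: `spoken_text` is a hypothetical helper, and it does not reproduce how Tortoise performs the redaction on the generated audio.

```python
import re


def spoken_text(prompt: str) -> str:
    """Return the prompt with every [bracketed] span removed."""
    without_brackets = re.sub(r"\[[^\]]*\]", "", prompt)
    # Collapse the whitespace left behind by the removed spans.
    return re.sub(r"\s+", " ", without_brackets).strip()


print(spoken_text("[I am really sad,] Please feed me."))  # prints Please feed me.
```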

# Voice Fixing

In [None]:
! voicefixer --infile results/tortoise-benchmark-b70c.wav --outfile results/tortoise-voicefixer-b70c.wav

Initializing VoiceFixer
Start processing the input file results/tortoise-benchmark-b70c.wav.
Processing results/tortoise-benchmark-b70c.wav, mode=0
Done


# Annotations

In [None]:
from IPython.display import Audio

text = "Oh dear, not again... [sighs] I hate to do this, you know? How MANY TIMES do I have to tell you? Are you fucking BLIND?! [screams] ARGH!"
audio_array = benchmark(text=text)
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:16<00:00, 16.36s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.94it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:11<00:00,  2.59it/s]


Benchmark results: run_time_in_s=45.102, duration_in_s=15.327, speed_ratio=2.943


It seems that annotations (e.g. [sighs]) have no effect here, and neither do onomatopoeias like "Argh".

In [None]:
from IPython.display import Audio

text = "Do you know why the chicken crossed the road? Because it wanted to! Hahahahaha"
audio_array = benchmark(text=text)
Audio(audio_array, rate=SAMPLE_RATE)

Generating autoregressive samples..


100%|██████████| 1/1 [00:07<00:00,  7.34s/it]


Computing best candidates using CLVP


100%|██████████| 1/1 [00:00<00:00,  5.48it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 30/30 [00:01<00:00, 15.22it/s]


Benchmark results: run_time_in_s=15.234, duration_in_s=5.099, speed_ratio=2.988


Laughter seems to work better, but sounds somewhat creepy. Perhaps avoid it?

# Speaking Rate