Welcome to Tortoise! 🐢🐢🐢🐢

If you hit bugs, compare against the original notebook [here](https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing) and the GitHub repository [here](https://github.com/neonbjb/tortoise-tts). Quality-of-life improvements and additional voices added by [Downy](http://www.twitter.com/tooltrackers).

When making a new voice, film voiceover, especially documentary narration, tends to work best. Audiobooks work well but lack variety. Alternatively, use any YouTube video with clear audio; just [download as mp3](https://x2download.com/en192/download-youtube-to-mp3). If you can't get a voice to work, try cleaning the audio with [Demucs](https://colab.research.google.com/drive/1qlpoIAb-nD-L29kFP976syIN4e6QiP4i?usp=sharing) and [VoiceFixer](https://colab.research.google.com/drive/1rypU23DARH3VsoJTKgDlviPDClOsXtXa?usp=sharing). Tortoise appears to work best with Standard American English "news anchor" voices; it has trouble with cartoonish or noise-heavy (for example, gravelly) ones.


# Diagnostics

In [None]:
#@title Check GPU
#@markdown - Tier List: (K80 < T4 < P100 < V100 < A100)
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-47962951-170d-fea9-98a1-7bc2c92d4433)
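
The tier list above can be turned into a quick check of which GPU the runtime was assigned. This is a minimal sketch that assumes the `nvidia-smi -L` line format shown in the output above; `GPU_TIERS` and `gpu_tier` are hypothetical names introduced here, not part of Colab or Tortoise.

```python
import re

# Tier order copied from the list above, slowest to fastest.
GPU_TIERS = ["K80", "T4", "P100", "V100", "A100"]

def gpu_tier(smi_line):
    """Return (model, rank) for one line of `nvidia-smi -L` output, or None if unrecognized."""
    for rank, model in enumerate(GPU_TIERS):
        # Match the model name as a whole token, e.g. "P100" in "Tesla P100-PCIE-16GB".
        if re.search(r"\b" + re.escape(model) + r"\b", smi_line):
            return model, rank
    return None

line = "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-47962951-170d-fea9-98a1-7bc2c92d4433)"
print(gpu_tier(line))  # ('P100', 2)
```

A higher rank means a faster card; `None` means the model is not in the tier list.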


In [None]:
# @title Check RAM

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
#@title Check memory footprint
# If GPU utilization is not 0%, restart the runtime.

skip_footprint = True #@param{type:'boolean'}
if not skip_footprint:

  !ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
  !pip install gputil
  !pip install psutil
  !pip install humanize
  import psutil
  import humanize
  import os
  import GPUtil as GPU

  GPUs = GPU.getGPUs()

  # Colab provides at most one GPU, and even that one isn't guaranteed.
  gpu = GPUs[0]

  def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

  printm()

# Setup

In [None]:
#@title Install libraries

# The SciPy version packaged with Colab does not tolerate malformed WAV files.
# Install the latest version.

!pip3 install -U scipy

!git clone https://github.com/jnordberg/tortoise-tts.git
%cd tortoise-tts
!pip3 install -r requirements.txt
!python3 setup.py install

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Cloning into 'tortoise-tts'...
remote: Enumerating objects: 1481, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 1481 (delta 163), reused 143 (delta 143), pack-reused 1297
Receiving objects: 100% (1481/1481), 53.55 MiB | 21.78 MiB/s, done.
Resolving deltas: 100% (608/608), done.
/content/tortoise-tts
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rotary_embedding_torch
  Downloading rotary_embedding_torch-0.1.5-py3-none-any.whl (4.1 kB)
Collecting transformers
  Downloading transformers-4.22.0-py3-none-any.whl (4.9 MB)
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)

In [None]:
#@title Mount Google Drive

#@markdown This will also transfer saved voices and three large files.

from google.colab import drive
drive.mount('/content/drive')

!gdown https://drive.google.com/uc?id=1SxZ3Qz9xIgCBxY7gxypg9o8E6sOORK49 #autoregressive.pth
!gdown https://drive.google.com/uc?id=1Q-uShpp_81PNV1o8LZ2bKDhJ4szGmaaa #clvp2.pth
!gdown https://drive.google.com/uc?id=1SxQNjL3VS5E1b5SMAKP69qLOEpsX7hRV #diffusion_decoder.pth

Mounted at /content/drive
Downloading...
From: https://drive.google.com/uc?id=1SxZ3Qz9xIgCBxY7gxypg9o8E6sOORK49
To: /content/tortoise-tts/autoregressive.pth
100% 1.72G/1.72G [00:26<00:00, 64.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Q-uShpp_81PNV1o8LZ2bKDhJ4szGmaaa
To: /content/tortoise-tts/clvp2.pth
100% 976M/976M [00:10<00:00, 92.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1SxQNjL3VS5E1b5SMAKP69qLOEpsX7hRV
To: /content/tortoise-tts/diffusion_decoder.pth
100% 1.17G/1.17G [00:13<00:00, 86.3MB/s]
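
Before moving on, it can help to confirm that all three checkpoint files actually arrived, since an interrupted `gdown` run can leave a file missing or empty. The following is a minimal sketch; `CHECKPOINTS` and `missing_checkpoints` are hypothetical helpers, and the file names are taken from the download output above.

```python
from pathlib import Path

# File names as they appear in the gdown output above.
CHECKPOINTS = ["autoregressive.pth", "clvp2.pth", "diffusion_decoder.pth"]

def missing_checkpoints(folder, names=CHECKPOINTS, min_bytes=1):
    """Return the names that are absent from `folder` or smaller than `min_bytes`."""
    folder = Path(folder)
    missing = []
    for name in names:
        path = folder / name
        if not path.is_file() or path.stat().st_size < min_bytes:
            missing.append(name)
    return missing

# An empty list means every file is present and non-empty.
print(missing_checkpoints("/content/tortoise-tts"))
```

Raising `min_bytes` (for example to a few hundred megabytes) would also catch truncated downloads; the right threshold is a judgment call.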


In [None]:
#@title Import functions

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

import IPython

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()

!cp -r "/content/tortoise-tts/autoregressive.pth" "/content/tortoise-tts/build/lib/tortoise/models"
!cp -r "/content/tortoise-tts/clvp2.pth" "/content/tortoise-tts/build/lib/tortoise/models"
!cp -r "/content/tortoise-tts/diffusion_decoder.pth" "/content/tortoise-tts/build/lib/tortoise/models"

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

Moving 0 files to the new cache system

Downloading autoregressive.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth...
Done.
Downloading classifier.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth...
Done.
Downloading clvp2.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth...
Done.
Downloading cvvp.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth...
Done.
Downloading diffusion_decoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth...
Done.
Downloading vocoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth...
Done.
Downloading rlg_auto.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth...
Done.
Downloading rlg_diffuser.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth...
Done.


In [None]:
#@title Download and upload voices

upload_voice = False #@param{type:'boolean'}

#Download voices from Google Drive
!gdown https://drive.google.com/uc?id=1T9AOI4lTjF3gGZr2gxU66Qj3ygfvK6jx #voices.zip
!unzip /content/tortoise-tts/voices.zip -d /content/tortoise-tts/tortoise/voices


#Upload a new voice
if upload_voice:

  from google.colab import files

  new_voice_name = "" #@param {type: 'string'}

  # Create the voice folder (no error if it already exists) and upload into it.
  %cd /content/tortoise-tts/tortoise/voices
  !mkdir -p "$new_voice_name"
  %cd $new_voice_name

  uploaded = files.upload()

  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

  # Return to the repository root once, even if nothing was uploaded.
  %cd /content/tortoise-tts

Downloading...
From: https://drive.google.com/uc?id=1T9AOI4lTjF3gGZr2gxU66Qj3ygfvK6jx
To: /content/tortoise-tts/voices.zip
100% 24.8M/24.8M [00:00<00:00, 67.8MB/s]
Archive:  /content/tortoise-tts/voices.zip
  inflating: /content/tortoise-tts/tortoise/voices/bella/1.wav  
  inflating: /content/tortoise-tts/tortoise/voices/bella/2.wav  
  inflating: /content/tortoise-tts/tortoise/voices/bella/3.wav  
   creating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/1.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/2.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/3.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/4.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/5.wav  
   creating: /content/tortoise-tts/tortoise/voices/space_freckle/
  inflating: /content/tortoise-tts/tortoise/voices/space_freckle/1.wav  
  inflating: /content/tortoi
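
To confirm that the archive unpacked as expected, the voice folders and their clip counts can be listed from Python. This is a small sketch using only the standard library; `list_voices` is a hypothetical helper, and it assumes each voice is a folder of `.wav` files as in the unzip output above.

```python
from pathlib import Path

def list_voices(voices_dir):
    """Map each voice folder name to the number of .wav clips it holds."""
    root = Path(voices_dir)
    if not root.is_dir():
        return {}
    return {
        folder.name: len(list(folder.glob("*.wav")))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }

for name, clips in list_voices("tortoise/voices").items():
    print(f"{name}: {clips} clip(s)")
```

A voice that reports zero clips is a sign that the upload or unzip step did not land where expected.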

# Execute

In [None]:
#@title Preview voices

#@markdown Tortoise will attempt to mimic voices you provide. It comes pre-packaged with some voices you might recognize. Let's list all the voices available. These are just some random clips I've gathered from the internet as well as a few voices from the training dataset.  Feel free to add your own clips to the voices/ folder.
preview_voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"] {allow-input: true}

%ls tortoise/voices

IPython.display.Audio('tortoise/voices/' + preview_voice + '/1.wav')

angie/      geralt/         pat/            tom/            train_kennard/
applejack/  halle/          pat2/           train_atkins/   train_lescault/
bella/      jlaw/           rainbow/        train_daws/     train_mouse/
daniel/     lj/             snakes/         train_dotrice/  unicole/
deniro/     michelle_yeoh/  space_freckle/  train_dreams/   weaver/
emma/       mol/            spartan/        train_empire/   william/
freeman/    myself/         tim_reynolds/   train_grace/


In [None]:
#@title Text to speak

text = "The story is that in the region of Naucratis in Egypt there dwelt one of the old gods of the country, the god to whom the bird called Ibis is sacred, his own name being Thoth." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "high_quality" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take1" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

Generating autoregressive samples..


100%|██████████| 16/16 [04:28<00:00, 16.80s/it]


Computing best candidates using CLVP and CVVP


 94%|█████████▍| 15/16 [00:14<00:00,  1.00it/s]

No stop tokens found in one of the generated voice clips. This typically means the spoken audio is too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, try breaking up your input text.


100%|██████████| 16/16 [00:15<00:00,  1.01it/s]


Transforming autoregressive outputs into audio..


100%|██████████| 400/400 [01:39<00:00,  4.01it/s]
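
The "No stop tokens found" warning above suggests breaking up the input text. One way to do that is to pack whole sentences into shorter chunks and synthesize each chunk separately. The sketch below is an illustration only: `split_text` is a hypothetical helper, and the 200-character default is an arbitrary assumption, not a limit documented by Tortoise.

```python
import re

def split_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(split_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be passed to `tts.tts_with_preset` exactly as `text` is in the cell above and saved to its own file.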


In [None]:
#@title Text to speak

text = "He it was that invented number and calculation, geometry and astronomy, not to speak of Nintendo and Walkman, and above all the Internet. Now the king of the whole country at that time was Thamus, who dwelt in the great city of Upper Egypt which the Greeks call Egyptian Thebes, while Thamus they call Ammon." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take2" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "To him came Thoth, and revealed his arts, saying that they ought to be passed on to the Egyptians in general.  Thamus asked what was the use of them all, and when Thoth explained, he condemned what he thought the bad points and praised what he thought the good. " #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take3" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "On each art, we are told, Thamus had plenty of views both for and against; it would take too long to give them in detail. But when it came to the Internet Thoth said, 'Here, O king, is a branch of learning that will make the people of Egypt wiser and improve their memories; my discovery provides a recipe for memory and wisdom.'" #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take4" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "But the king answered and said, 'O man full of arts, by reason of your tender regard for the Internet that is your offspring, you have declared the very opposite of its true effect.  If men use this, it will implant forgetfulness in their souls." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take5" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "But the king answered and said, 'O man full of arts, by reason of your tender regard for the Internet that is your offspring, you have declared the very opposite of its true effect.  If men use this, it will implant forgetfulness in their souls." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take6" #@param ["take1", "take2", "take3", "take4", "take5", "take6"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "They will cease to exercise memory because they rely on that which is online, calling things to remembrance no longer from within themselves, but by means of an external source.  What you have discovered is a recipe not for memory, but for reminder." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take7" #@param ["take1", "take2", "take3", "take4", "take5", "take6", "take7"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]


In [None]:
#@title Text to speak

text = "And it is no true wisdom that you offer your disciples, but only its semblance, for by telling them of many things without teaching them you will make them seem to know much, while for the most part they know nothing, and as men filled, not with wisdom but with the conceit of wisdom, they will be a burden to their fellows.\u2019" #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "fast" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "spartan" #@param ["bella", "michelle_yeoh", "space_freckle", "spartan", "unicole"]

take = "take8" #@param ["take1", "take2", "take3", "take4", "take5", "take6", "take7", "take8"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

  sampling_rate, data = read(full_path)


Generating autoregressive samples..


100%|██████████| 6/6 [00:09<00:00,  1.62s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 6/6 [00:06<00:00,  1.03s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [00:04<00:00, 17.67it/s]
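
The "Text to speak" cells above repeat the same code with a different passage and take name each time. A loop can do the same bookkeeping and keeps the file names from colliding. This is a sketch under assumptions: `plan_takes` and `render_takes` are hypothetical helpers, and the synthesis and saving steps are passed in as callables so the loop itself does not depend on Tortoise.

```python
def plan_takes(passages, prefix="take"):
    """Pair each passage with an output file name: take1.wav, take2.wav, ..."""
    return [(f"{prefix}{i}.wav", text) for i, text in enumerate(passages, start=1)]

def render_takes(passages, synthesize, save):
    """Call `synthesize(text)` for every passage and pass the result to `save(filename, audio)`."""
    written = []
    for filename, text in plan_takes(passages):
        save(filename, synthesize(text))
        written.append(filename)
    return written

# In this notebook the two callables could wrap the calls used in the cells above:
#   synthesize = lambda t: tts.tts_with_preset(
#       t, voice_samples=voice_samples,
#       conditioning_latents=conditioning_latents, preset=preset)
#   save = lambda name, gen: torchaudio.save(name, gen.squeeze(0).cpu(), 24000)
print(plan_takes(["first passage", "second passage"]))
```

Because the take number comes from the position in the list, each passage always gets its own file.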


In [None]:
#Use these extra cells for batch processing longer texts. This one will use the same voice as selected above.

#text = "Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?"

# Here's something for the poetically inclined.. (set text=)
text = """
What the fuck did you just fucking say about me, you little bitch? 
I'll have you know I graduated top of my class in the Navy Seals, 
and I've been involved in numerous secret raids on Al kayda, and I 
have over 300 confirmed kills. I am trained in gorilla warfare and 
I'm the top sniper in the entire U S armed forces. You are nothing 
to me but just another target. I will wipe you the fuck out with 
precision the likes of which has never been seen before on this Earth, 
mark my fucking words. You think you can get away with saying that shit 
to me over the Internet? Think again, fucker. As we speak I am contacting 
my secret network of spies across the U S A and your IP is being traced 
right now so you better prepare for the storm, maggot. The storm that wipes 
out the pathetic little thing you call your life. You're fucking dead, kid. 
I can be anywhere, anytime, and I can kill you in over seven hundred ways, 
and that's just with my bare hands. Not only am I extensively trained in 
unarmed combat, but I have access to the entire arsenal of the United States 
Marine Corps and I will use it to its full extent to wipe your miserable ass 
off the face of the continent, you little shit. If only you could have known 
what unholy retribution your little "clever" comment was about to bring down 
upon you, maybe you would have held your fucking tongue. But you couldn't, 
you didn't, and now you're paying the price, you goddamn idiot. I will shit 
fury all over you and you will drown in it. You're fucking dead, kiddo."""

# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "fast"

In [None]:
# Pick one of the voices from the output above
voice = 'unicole'

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save('take1.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('take1.wav')

In [None]:
# Tortoise can also generate speech using a random voice. The voice changes each time you execute this!
# (Note: random voices can be prone to strange utterances)
gen = tts.tts_with_preset(text, voice_samples=None, conditioning_latents=None, preset=preset)
torchaudio.save('random_take1.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('random_take1.wav')

In [None]:
# Optionally, use your own voice by running the next two cells. I recommend
# you upload at least 2 audio clips. Each must be a WAV file, 6-10 seconds long.
CUSTOM_VOICE_NAME = "custom"

import os
from google.colab import files

custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
os.makedirs(custom_voice_folder, exist_ok=True)  # safe to re-run this cell
for i, file_data in enumerate(files.upload().values()):
  with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
    f.write(file_data)
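
Since the clips are expected to be 6-10 seconds long, it may be worth checking their durations before generating speech. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files only; other encodings raise `wave.Error`. `clip_seconds` and `clip_ok` are hypothetical helper names.

```python
import wave

def clip_seconds(path):
    """Return the duration in seconds of an uncompressed PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def clip_ok(path, min_s=6.0, max_s=10.0):
    """True if the clip length falls inside the recommended 6-10 second window."""
    return min_s <= clip_seconds(path) <= max_s
```

For example, `clip_ok("tortoise/voices/custom/0.wav")` would report whether the first uploaded clip is within the recommended range.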

In [None]:
# Generate speech with the custom voice.
voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(f'generated-{CUSTOM_VOICE_NAME}.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(f'generated-{CUSTOM_VOICE_NAME}.wav')

In [None]:
# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Let's see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
voice_samples, conditioning_latents = load_voices(['pat', 'william'])

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.", 
                          voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('captain_kirkard.wav')

In [None]:
#@title read.py

#@markdown Parses large text files. Caution: this seems to run out of CUDA memory every time.

!python /content/tortoise-tts/tortoise/read.py --textfile /content/tortoise-tts/tortoise/data/seal_copypasta.txt --voice tom --preset fast

Generating autoregressive samples..
  0% 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/tortoise-tts/tortoise/read.py", line 60, in <module>
    preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
  File "/content/tortoise-tts/tortoise/api.py", line 289, in tts_with_preset
    return self.tts(text, **kwargs)
  File "/content/tortoise-tts/tortoise/api.py", line 379, in tts
    **hf_generate_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/TorToiSe-2.3.0-py3.7.egg/tortoise/models/autoregressive.py", line 500, in inference_speech
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py", line 1361, in generate
    **model_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py", line 1971, in sample
    output_hidden_states=output_hidden