Welcome to Tortoise! 🐢🐢🐢🐢

In case of bugs, compare against the original notebook [here](https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing) and the GitHub repository [here](https://github.com/neonbjb/tortoise-tts). Quality-of-life improvements and additional voices added by [Downy](http://www.twitter.com/tooltrackers).

When making a new voice, film voice-over (especially documentary narration) tends to work best. Audiobooks also work well but lack variety. Alternatively, use any YouTube video with clear audio; just [download as mp3](https://x2download.com/en192/download-youtube-to-mp3). If you can't get a voice to work, try processing the audio with [Demucs](https://colab.research.google.com/drive/1qlpoIAb-nD-L29kFP976syIN4e6QiP4i?usp=sharing) and [VoiceFixer](https://colab.research.google.com/drive/1rypU23DARH3VsoJTKgDlviPDClOsXtXa?usp=sharing). Tortoise appears to work best with Standard American English "news anchor" voices; it has trouble with cartoonish or noise-heavy (for example, gravelly) ones.


# Diagnostics

In [1]:
#@title Check GPU
#@markdown - Tier List: (K80 < T4 < P100 < V100 < A100)
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-1ddd78a8-d579-c2b6-391a-0b47ca2e3c6d)
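
The tier list in the cell above can be turned into a quick check. The following is only a sketch: `GPU_TIERS` and `gpu_tier` are hypothetical names, not part of Tortoise or Colab, and the ordering is copied from the cell's own comment.

```python
import re

# Tier order copied from the "Check GPU" cell: K80 < T4 < P100 < V100 < A100.
GPU_TIERS = ["K80", "T4", "P100", "V100", "A100"]

def gpu_tier(nvidia_smi_line):
    """Return (model, rank) for one line of `nvidia-smi -L` output.

    rank is the 0-based position in GPU_TIERS; (None, None) means the
    line names no GPU from the list.
    """
    for rank, model in enumerate(GPU_TIERS):
        # Match the model name as a whole word to avoid partial hits.
        if re.search(r"\b" + re.escape(model) + r"\b", nvidia_smi_line):
            return model, rank
    return None, None

line = "GPU 0: Tesla T4 (UUID: GPU-1ddd78a8-d579-c2b6-391a-0b47ca2e3c6d)"
print(gpu_tier(line))
```

A low rank does not prevent generation; it only suggests that the run will take longer.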


In [None]:
# @title Check RAM

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
#@title Check memory footprint
# If GPU utilization is not 0%, restart the runtime

skip_footprint = True #@param{type:'boolean'}
if(not skip_footprint):

  !ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
  !pip install gputil
  !pip install psutil
  !pip install humanize
  import psutil
  import humanize
  import os
  import GPUtil as GPU

  GPUs = GPU.getGPUs()

  # XXX: Colab provides at most one GPU, and even that isn't guaranteed
  gpu = GPUs[0]

  def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

  printm()

# Setup

In [2]:
#@title Install libraries

# The scipy version packaged with Colab does not tolerate malformed WAV files,
# so install the latest version.

!pip3 install -U scipy

!git clone https://github.com/jnordberg/tortoise-tts.git
%cd tortoise-tts
!pip3 install -r requirements.txt
!python3 setup.py install

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Cloning into 'tortoise-tts'...
remote: Enumerating objects: 1481, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 1481 (delta 163), reused 143 (delta 143), pack-reused 1297
Receiving objects: 100% (1481/1481), 53.55 MiB | 19.67 MiB/s, done.
Resolving deltas: 100% (608/608), done.
/content/tortoise-tts
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rotary_embedding_torch
  Downloading rotary_embedding_torch-0.1.5-py3-none-any.whl (4.1 kB)
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
     |████████████████████████████████| 4.9 MB 6.8 MB/s 
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6

In [3]:
#@title Mount Google Drive

#@markdown This will also transfer saved voices and three large files.

from google.colab import drive
drive.mount('/content/drive')

!gdown https://drive.google.com/uc?id=1SxZ3Qz9xIgCBxY7gxypg9o8E6sOORK49 #autoregressive.pth
!gdown https://drive.google.com/uc?id=1Q-uShpp_81PNV1o8LZ2bKDhJ4szGmaaa #clvp2.pth
!gdown https://drive.google.com/uc?id=1SxQNjL3VS5E1b5SMAKP69qLOEpsX7hRV #diffusion_decoder.pth

Mounted at /content/drive
Downloading...
From: https://drive.google.com/uc?id=1SxZ3Qz9xIgCBxY7gxypg9o8E6sOORK49
To: /content/tortoise-tts/autoregressive.pth
100% 1.72G/1.72G [00:19<00:00, 86.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Q-uShpp_81PNV1o8LZ2bKDhJ4szGmaaa
To: /content/tortoise-tts/clvp2.pth
100% 976M/976M [00:09<00:00, 102MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1SxQNjL3VS5E1b5SMAKP69qLOEpsX7hRV
To: /content/tortoise-tts/diffusion_decoder.pth
100% 1.17G/1.17G [00:06<00:00, 176MB/s]
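
If a download is interrupted, later cells may fail in confusing ways, so it can help to confirm that the three files actually arrived. This is a minimal sketch using only the standard library; `missing_models` and `EXPECTED_MODELS` are hypothetical helper names, and the file names are taken from the comments in the cell above.

```python
import os

# File names taken from the gdown comments in the "Mount Google Drive" cell.
EXPECTED_MODELS = ["autoregressive.pth", "clvp2.pth", "diffusion_decoder.pth"]

def missing_models(directory, names=EXPECTED_MODELS, min_bytes=1):
    """Return the names that are absent from `directory` or smaller than `min_bytes`."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            missing.append(name)
    return missing

# /content/tortoise-tts is the download directory shown in the output above.
print(missing_models("/content/tortoise-tts"))
```

An empty list means all three files are present. Raising `min_bytes` (an assumed parameter) would also flag files that exist but are suspiciously small.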


In [4]:
#@title Import functions

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

import IPython

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()

!cp -r "/content/tortoise-tts/autoregressive.pth" "/content/tortoise-tts/build/lib/tortoise/models"
!cp -r "/content/tortoise-tts/clvp2.pth" "/content/tortoise-tts/build/lib/tortoise/models"
!cp -r "/content/tortoise-tts/diffusion_decoder.pth" "/content/tortoise-tts/build/lib/tortoise/models"

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

Downloading:   0%|          | 0.00/2.11k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/1.26G [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/159 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/1.61k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/181 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading autoregressive.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth...
Done.
Downloading classifier.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth...
Done.
Downloading clvp2.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth...
Done.
Downloading cvvp.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth...
Done.
Downloading diffusion_decoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth...
Done.
Downloading vocoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth...
Done.
Downloading rlg_auto.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth...
Done.
Downloading rlg_diffuser.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth...
Done.


In [5]:
#@title Download and upload voices

upload_voice = True #@param{type:'boolean'}

#Download voices from Google Drive
!gdown https://drive.google.com/uc?id=1T9AOI4lTjF3gGZr2gxU66Qj3ygfvK6jx #voices.zip
!unzip /content/tortoise-tts/voices.zip -d /content/tortoise-tts/tortoise/voices


#Upload a new voice
if (upload_voice):

  from google.colab import files

  %cd /content/tortoise-tts/tortoise/voices

  new_voice_name = "unicole-youtube" #@param {type: 'string'}

  !mkdir $new_voice_name

  %cd $new_voice_name

  uploaded = files.upload()

  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    
    %cd /content/tortoise-tts

Downloading...
From: https://drive.google.com/uc?id=1T9AOI4lTjF3gGZr2gxU66Qj3ygfvK6jx
To: /content/tortoise-tts/voices.zip
100% 19.2M/19.2M [00:00<00:00, 73.0MB/s]
Archive:  /content/tortoise-tts/voices.zip
  inflating: /content/tortoise-tts/tortoise/voices/bella/1.wav  
  inflating: /content/tortoise-tts/tortoise/voices/bella/2.wav  
  inflating: /content/tortoise-tts/tortoise/voices/bella/3.wav  
   creating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/1.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/2.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/3.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/4.wav  
  inflating: /content/tortoise-tts/tortoise/voices/michelle_yeoh/5.wav  
   creating: /content/tortoise-tts/tortoise/voices/space_freckle/
  inflating: /content/tortoise-tts/tortoise/voices/space_freckle/1.wav  
  inflating: /content/tortoi

Saving 1.wav to 1.wav
Saving 2.wav to 2.wav
Saving 3.wav to 3.wav
Saving 4.wav to 4.wav
Saving 5.wav to 5.wav
User uploaded file "1.wav" with length 1467712 bytes
/content/tortoise-tts
User uploaded file "2.wav" with length 1107632 bytes
/content/tortoise-tts
User uploaded file "3.wav" with length 1154680 bytes
/content/tortoise-tts
User uploaded file "4.wav" with length 1797036 bytes
/content/tortoise-tts
User uploaded file "5.wav" with length 1056968 bytes
/content/tortoise-tts
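
A later cell in this notebook recommends WAV clips of 6-10 seconds. The sketch below checks uploaded clips against that range with the standard-library `wave` module, which reads PCM WAV files only, so other encodings would raise `wave.Error`. The helper names are hypothetical, and the folder is the one used in the upload cell above.

```python
import glob
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def clips_outside_range(paths, min_s=6.0, max_s=10.0):
    """Return {path: duration} for clips shorter than min_s or longer than max_s."""
    flagged = {}
    for path in paths:
        duration = wav_duration_seconds(path)
        if not (min_s <= duration <= max_s):
            flagged[path] = round(duration, 2)
    return flagged

# Voice folder name taken from the upload cell above.
clips = sorted(glob.glob("tortoise/voices/unicole-youtube/*.wav"))
print(clips_outside_range(clips))
```

An empty dictionary means every clip found falls inside the recommended range.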


# Execute

In [6]:
#@title Preview voices

#@markdown Tortoise will attempt to mimic voices you provide. It comes pre-packaged with some voices you might recognize. Let's list all the voices available. These are just some random clips I've gathered from the internet as well as a few voices from the training dataset.  Feel free to add your own clips to the voices/ folder.
preview_voice = "unicole-youtube" #@param ["bella", "michelle_yeoh", "space_freckle", "unicole-audiobook", "unicole-youtube"] {allow-input: true}

%ls tortoise/voices

IPython.display.Audio('tortoise/voices/' + preview_voice + '/1.wav')

angie/      geralt/         pat/            train_atkins/   train_lescault/
applejack/  halle/          pat2/           train_daws/     train_mouse/
bella/      jlaw/           rainbow/        train_dotrice/  unicole/
daniel/     lj/             snakes/         train_dreams/   unicole-youtube/
deniro/     michelle_yeoh/  space_freckle/  train_empire/   weaver/
emma/       mol/            tim_reynolds/   train_grace/    william/
freeman/    myself/         tom/            train_kennard/


In [10]:
#@title Text to speak

text = "The ability to be a tourist at all times is probably as useful as being a child on demand. If you can see things fresh, as if for the first time, you can adapt and create whenever you want. This is the secret of the internal alchemists who forged the philosopher's stone within." #@param {type:"string"}

# Enter long text strings between triple-quotes here.
#text = """
#Space-Heaven 
#"""

preset = "standard" #@param ["ultra_fast", "fast", "standard", "high_quality"]

# Pick one of the voices from the output above
voice = "unicole-youtube" #@param ["bella", "michelle_yeoh", "space_freckle", "unicole-audiobook", "unicole-youtube"]

take = "take1" #@param ["take1", "take2", "take3", "take4", "take5"]

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(take + '.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(take + '.wav')

Generating autoregressive samples..


100%|██████████| 16/16 [05:07<00:00, 19.24s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 16/16 [00:37<00:00,  2.37s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 400/400 [06:36<00:00,  1.01it/s]
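
The cell above writes to `take + '.wav'`, so rerunning it with the same take overwrites the previous result. One possible way to avoid that is to pick the first unused take number automatically. `next_take_path` below is a hypothetical helper, not part of Tortoise.

```python
import os

def next_take_path(directory=".", prefix="take", ext=".wav"):
    """Return the first '<prefix><n><ext>' path in `directory` that does not exist yet."""
    n = 1
    while os.path.exists(os.path.join(directory, f"{prefix}{n}{ext}")):
        n += 1
    return os.path.join(directory, f"{prefix}{n}{ext}")

print(next_take_path())
```

The returned path could then replace `take + '.wav'` in the `torchaudio.save` and `IPython.display.Audio` calls.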


In [None]:
#Use these extra cells for batch processing longer texts. This one will use the same voice as selected above.

#text = "Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?"

# Here's something for the poetically inclined.. (set text=)
text = """
If only you could have known 
what unholy retribution your little "clever" comment was about to bring down 
upon you, maybe you would have held your fucking tongue. But you couldn't, 
you didn't, and now you're paying the price, you goddamn idiot. I will shit 
fury all over you and you will drown in it. You're fucking dead, kiddo."""

# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "fast"
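
For longer texts, it may help to split the input into sentence-sized chunks and generate each one separately. The sketch below uses a simple punctuation-based split; `split_text` is a hypothetical helper, and the 200-character default is an arbitrary assumption rather than a Tortoise limit.

```python
import re

def split_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Collapse all whitespace, then split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

for chunk in split_text("One short sentence. Another one! And a third?", max_chars=30):
    print(chunk)
```

Each chunk could then be passed to `tts.tts_with_preset` in turn and saved as its own take.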

In [None]:
# Pick one of the voices from the output above
voice = 'unicole'

# Load it and send it through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save('take1.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('take1.wav')

In [None]:
# Tortoise can also generate speech using a random voice. The voice changes each time you execute this!
# (Note: random voices can be prone to strange utterances)
gen = tts.tts_with_preset(text, voice_samples=None, conditioning_latents=None, preset=preset)
torchaudio.save('random_take1.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('random_take1.wav')

In [None]:
# Optionally, use your own voice by running the next two cells. I recommend
# uploading at least 2 audio clips. Each must be a WAV file, 6-10 seconds long.
CUSTOM_VOICE_NAME = "custom"

import os
from google.colab import files

custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
os.makedirs(custom_voice_folder, exist_ok=True)  # don't fail if the cell is re-run
for i, file_data in enumerate(files.upload().values()):
  with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
    f.write(file_data)

In [None]:
# Generate speech with the custom voice.
voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save(f'generated-{CUSTOM_VOICE_NAME}.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(f'generated-{CUSTOM_VOICE_NAME}.wav')

In [11]:
# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# The original example imagined Picard and Kirk having a kid with a penchant for
# philosophy; here the 'unicole' and 'unicole-youtube' voices are blended instead:
voice_samples, conditioning_latents = load_voices(['unicole', 'unicole-youtube'])

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.", 
                          voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)
torchaudio.save('take5.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('take5.wav')

Generating autoregressive samples..


100%|██████████| 16/16 [01:39<00:00,  6.19s/it]


Computing best candidates using CLVP and CVVP


100%|██████████| 16/16 [00:44<00:00,  2.80s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 400/400 [01:09<00:00,  5.79it/s]


In [None]:
#@title read.py

#@markdown Parses large text files. In testing so far, it seems to run out of CUDA memory every time.

!python /content/tortoise-tts/tortoise/read.py --textfile /content/tortoise-tts/tortoise/data/seal_copypasta.txt --voice tom --preset fast

Generating autoregressive samples..
  0% 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/tortoise-tts/tortoise/read.py", line 60, in <module>
    preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
  File "/content/tortoise-tts/tortoise/api.py", line 289, in tts_with_preset
    return self.tts(text, **kwargs)
  File "/content/tortoise-tts/tortoise/api.py", line 379, in tts
    **hf_generate_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/TorToiSe-2.3.0-py3.7.egg/tortoise/models/autoregressive.py", line 500, in inference_speech
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py", line 1361, in generate
    **model_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py", line 1971, in sample
    output_hidden_states=output_hidden