# How to use

- Run the first 3 cells to initialize the notebook
- Duplicate or modify any of the following cells, setting the text you want your character to speak and the name of your character
- Run the cell. You will be asked to upload a WAV file containing your character's speech
- If you have already uploaded a WAV file and only want to change the text, set the voice to the name of your character followed by the '.wav' extension. Example: 'tony_stark.wav'
- For better results, make sure the speech is at least 10 seconds long, has no background noise, and is clearly audible
- Inference should take < 3s on CPU and < 0.5s on GPU
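
The naming convention in the steps above can be sketched as follows. This is a minimal illustration based on how `clone_voice` in this notebook handles names; the helper `voice_paths` is a hypothetical name introduced here and is not part of the notebook or of the TTS library.

```python
def voice_paths(voice):
    """Map a character name to (reference file, output file).

    Hypothetical helper: mirrors the naming used by clone_voice in this
    notebook, where the name may be given with or without '.wav'.
    """
    stem = voice[:-len('.wav')] if voice.endswith('.wav') else voice
    return f'{stem}.wav', f'output_{stem}.wav'


reference, output = voice_paths('tony_stark.wav')
print(reference, output)
```

If the reference file already exists in the working directory, no upload is requested and only the output file is regenerated.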

## Custom wav file requirements

- Avoid long pauses
- Avoid stuttering or stammering
- The voice to clone should have an American English accent
- WAV files should have a sample rate of 16000 Hz
- A good online converter: https://audio.online-convert.com/convert-to-wav
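
Before uploading, the sample rate and length recommendations can be checked locally. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; the function name `check_reference_wav` and its default thresholds are illustrative choices taken from the recommendations in this notebook, not part of the TTS library.

```python
import wave


def check_reference_wav(path, expected_rate=16000, min_seconds=10.0):
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if duration < min_seconds:
        problems.append(
            f"duration is {duration:.1f} s, expected at least {min_seconds:.0f} s"
        )
    return problems
```

Calling `check_reference_wav('tony_stark.wav')` returns an empty list when both checks pass. Pauses, stuttering, and background noise are not detected by this check and still need to be judged by ear.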

In [1]:
#!sudo apt-get install espeak
#!pip install py-espeak-ng
!sudo apt-get install espeak-ng
!git clone https://github.com/coqui-ai/TTS
%cd TTS
!pip install -e .[all,dev,notebooks]  # Select the relevant extras

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0
The following NEW packages will be installed:
  espeak-ng espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 24 not upgraded.
Need to get 4,215 kB of archives.
After this operation, 12.0 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libpcaudio0 amd64 1.1-4 [7,908 B]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 libsonic0 amd64 0.2.0-8 [13.1 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/main amd64 espeak-ng-data amd64 1.50+dfsg-6 [3,682 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal/main amd64 libespeak-ng1 amd64 1.50+dfsg-6 [189 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal/universe amd64 espeak-ng amd64 1.50+dfsg-6 [322 kB]
Fetched 4,215 kB in 0s (8,973 kB/s)
debconf: una

In [2]:
import torch

tts = None
use_gpu = torch.cuda.is_available()
use_gpu

True

In [16]:
import os
from google.colab import files
import IPython
from TTS.api import TTS
import librosa
import soundfile as sf
import shutil

if tts is None:
  tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=use_gpu)

def preprocess_audio(wav_file, freq=16000):
  # Resample wav_file to `freq` Hz and save the result under a new name
  new_file = 'f{}_{}'.format(freq, wav_file)
  # librosa resamples to `sr` (and downmixes to mono) while loading
  y, sr = librosa.load(wav_file, sr=freq)
  sf.write(new_file, y, sr)
  return new_file

def clone_voice(tts, text: str, voice: str, resample: bool = False, freq: int = 22050, language: str = 'en'):
  # Accept the voice name with or without the '.wav' extension
  if voice.endswith('.wav'):
    voice = voice[:-len('.wav')]
  wav_file = f'{voice}.wav'
  if not os.path.exists(wav_file):
    # Upload exactly one file; it is saved under the voice name
    for file_data in files.upload().values():
      with open(wav_file, 'wb') as f:
        f.write(file_data)

  # Resample the reference audio to `freq` only when requested
  if resample:
    wav_file = preprocess_audio(wav_file, freq)

  output_file = f'output_{voice}.wav'
  tts.tts_to_file(text, speaker_wav=wav_file, language=language, file_path=output_file)
  return output_file

In [20]:
# Tony Stark - 16k

text = 'We must destroy all of them and revenge our planet'
voice = 'weapon.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12303805351257324
 > Real-time factor: 0.03137925363748361
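
The two logged values can be related with a little arithmetic. Assuming the real-time factor is the processing time divided by the duration of the generated audio (the usual definition, and to my knowledge how the TTS library computes it), the audio length can be recovered from the log above. The function name `audio_seconds` is a hypothetical helper introduced for this sketch.

```python
def audio_seconds(processing_time, real_time_factor):
    """Audio duration implied by the logged values (assumes RTF = time / duration)."""
    return processing_time / real_time_factor


# Values copied from the output of the cell above
duration = audio_seconds(0.12303805351257324, 0.03137925363748361)
print(f"{duration:.2f} s of audio")
```

A real-time factor below 1 means the audio is generated faster than it takes to play back.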


In [18]:
# Tony Stark - 22k

text = 'We must destroy all of them and revenge our planet'
voice = 'weapon.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=22000)
IPython.display.Audio(voice)

 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.1289057731628418
 > Real-time factor: 0.03260960616312719


In [21]:
# Tony Stark - 44k

text = 'We must destroy all of them and revenge our planet'
voice = 'weapon.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=44000)
IPython.display.Audio(voice)

 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.16820836067199707
 > Real-time factor: 0.03794458846650058


In [22]:
# Tony Stark - original ~ 22k

text = 'We must destroy all of them and revenge our planet'
voice = 'weapon.wav'
voice = clone_voice(tts, text, voice, resample=False)
IPython.display.Audio(voice)

 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12191224098205566
 > Real-time factor: 0.03010922227267366


In [23]:
# Darth Vader 1

text = 'We must destroy all of them and revenge our planet'
voice = 'darkside.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving darkside.wav to darkside.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12881779670715332
 > Real-time factor: 0.028342749550528783


In [26]:
# Darth Vader 2

text = 'We must destroy all of them and revenge our planet'
voice = 'darth_vader_1.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.11947751045227051
 > Real-time factor: 0.027856728946670673


In [27]:
# Loki

text = 'We must destroy all of them and revenge our planet'
voice = 'kneel.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving kneel.wav to kneel.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.14972448348999023
 > Real-time factor: 0.023274441705268184


In [28]:
# Naruto
text = 'We must destroy all of them and revenge our planet'
voice = 'naruto.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving naruto.wav to naruto.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12209725379943848
 > Real-time factor: 0.03598504385483009


In [29]:
# Joker

text = 'We must destroy all of them and revenge our planet'
voice = 'joker.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving joker.wav to joker.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12605905532836914
 > Real-time factor: 0.028539518978575762


In [30]:
# Lara Croft

text = 'We must destroy all of them and revenge our planet'
voice = 'lara_croft.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving lara_croft.wav to lara_croft.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.1508934497833252
 > Real-time factor: 0.04859692424583742


In [31]:
# Trump

text = 'We must destroy all of them and revenge our planet'
voice = 'trump.wav'
voice = clone_voice(tts, text, voice, resample=True, freq=16000)
IPython.display.Audio(voice)

Saving trump.wav to trump.wav
 > Text splitted to sentences.
['We must destroy all of them and revenge our planet']
 > Processing time: 0.12595510482788086
 > Real-time factor: 0.03334792290915564
