# Mozilla TTS on CPU: Real-Time Speech Synthesis

We use Tacotron2 and MultiBand-Melgan models with the LJSpeech dataset.

Tacotron2 is trained with [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.

MultiBand-Melgan is trained for 1.45M steps on real spectrograms.

Note that the performance of both models can be improved with more training.

### Download Models

In [4]:

!gdown --id 1dntzjWFg7ufWaTaFy80nRz-Tu02xWZos -O tts_model.pth.tar
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

Downloading...
From: https://drive.google.com/uc?id=1dntzjWFg7ufWaTaFy80nRz-Tu02xWZos
To: /Users/r2q2/Projects/Saati/notebooks/tts_model.pth.tar
347MB [00:44, 7.80MB/s] 
Downloading...
From: https://drive.google.com/uc?id=18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc
To: /Users/r2q2/Projects/Saati/notebooks/config.json
100%|██████████████████████████████████████| 9.53k/9.53k [00:00<00:00, 12.5MB/s]


In [5]:
!gdown --id 1Ty5DZdOc0F7OTGj9oJThYbL5iVu_2G0K -O vocoder_model.pth.tar
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy

Downloading...
From: https://drive.google.com/uc?id=1Ty5DZdOc0F7OTGj9oJThYbL5iVu_2G0K
To: /Users/r2q2/Projects/Saati/notebooks/vocoder_model.pth.tar
82.8MB [00:06, 12.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu
To: /Users/r2q2/Projects/Saati/notebooks/config_vocoder.json
100%|██████████████████████████████████████| 6.76k/6.76k [00:00<00:00, 9.11MB/s]
Downloading...
From: https://drive.google.com/uc?id=11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU
To: /Users/r2q2/Projects/Saati/notebooks/scale_stats.npy
100%|██████████████████████████████████████| 10.5k/10.5k [00:00<00:00, 13.7MB/s]
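
Before moving on, it can help to confirm that all five downloads actually landed in the working directory. The following is a minimal sketch, not part of the original notebook: the file names come from the `-O` arguments above, while the helper `missing_files` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

# File names taken from the -O arguments of the gdown commands above.
EXPECTED_FILES = [
    "tts_model.pth.tar",
    "config.json",
    "vocoder_model.pth.tar",
    "config_vocoder.json",
    "scale_stats.npy",
]


def missing_files(directory=".", names=EXPECTED_FILES):
    """Return the names that are absent or empty in `directory` (hypothetical helper)."""
    root = Path(directory)
    return [
        name
        for name in names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_files()
if missing:
    print("Missing or empty files:", missing)
else:
    print("All model files are present.")
```

An empty file is treated as missing because an interrupted download can leave a zero-byte file behind.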


### Setup Libraries

In [None]:
# On Linux, use instead: sudo apt-get install espeak
!brew install espeak

Reading package lists... Done
Building dependency tree       
Reading state information... Done
espeak is already the newest version (1.48.04+dfsg-5).
0 upgraded, 0 newly installed, 0 to remove and 21 not upgraded.


In [6]:
!git clone https://github.com/mozilla/TTS

Cloning into 'TTS'...
remote: Enumerating objects: 347, done.
remote: Counting objects: 100% (347/347), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 9552 (delta 233), reused 270 (delta 189), pack-reused 9205
Receiving objects: 100% (9552/9552), 119.90 MiB | 8.13 MiB/s, done.
Resolving deltas: 100% (6569/6569), done.


In [7]:
%cd TTS
!git checkout b1935c97
!pip install -r requirements.txt
!python setup.py install
%cd ..

/Users/r2q2/Projects/Saati/notebooks/TTS
Note: switching to 'b1935c97'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at b1935c9 update server to enable native vocoder inference and remove pwgan support
Collecting librosa>=0.5.1
  Downloading librosa-0.8.0.tar.gz (183 kB)
     |████████████████████████████████| 183 kB 2.9 MB/s eta 0:00:01
Collecting tensorboardX
  Using cached tensorboardX-2.1-py2.py3-none-any.whl (308 kB)
Collecting soundfile
  Using cached SoundFile-0.10.3.post1-py2.py3.cp26.cp27.cp32.cp33.

### Define TTS function

In [8]:
def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    # Uses the globals `speaker_id` and `vocoder_model` set in the "Load Models" cells.
    # `figures` is accepted but not used in this function.
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
                                                                             truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
    # mel_postnet_spec = ap._denormalize(mel_postnet_spec.T)
    if not use_gl:
        # run the vocoder on the predicted mel spectrogram
        waveform = vocoder_model.inference(torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0))
        waveform = waveform.flatten()
        # the vocoder output is a tensor; move it to the CPU and convert it to NumPy
        if use_cuda:
            waveform = waveform.cpu()
        waveform = waveform.numpy()
    # read the clock once so that every reported number uses the same run time
    run_time = time.time() - t_1
    rtf = run_time / (len(waveform) / ap.sample_rate)  # compute seconds per second of audio
    tps = run_time / len(waveform)  # compute seconds per output sample
    print(waveform.shape)
    print(" > Run-time: {}".format(run_time))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform

In [10]:
!pip install inflect

Collecting inflect
  Using cached inflect-4.1.0-py3-none-any.whl (31 kB)
Installing collected packages: inflect
Successfully installed inflect-4.1.0


### Load Models

In [11]:
import os
import torch
import time
import IPython

from TTS.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.utils.synthesis import synthesis

In [12]:
# runtime settings
use_cuda = False

In [13]:
# model paths
TTS_MODEL = "tts_model.pth.tar"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pth.tar"
VOCODER_CONFIG = "config_vocoder.json"

In [14]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)

In [15]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)         

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
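
The printout reports `frame_shift_ms` and `frame_length_ms` as `None` because the frame geometry is given in samples (`hop_length` and `win_length`) instead. As a small worked example, the equivalent durations can be derived from the printed values; the variable names below are chosen for illustration only.

```python
# Values copied from the AudioProcessor printout above.
sample_rate = 22050  # samples per second
hop_length = 256     # samples between consecutive frames
win_length = 1024    # samples per analysis window

# Convert the sample counts into milliseconds.
frame_shift_ms = 1000 * hop_length / sample_rate
frame_length_ms = 1000 * win_length / sample_rate

# Number of spectrogram frames produced per second of audio.
frames_per_second = sample_rate / hop_length

print(f"frame shift:  {frame_shift_ms:.2f} ms")
print(f"frame length: {frame_length_ms:.2f} ms")
print(f"frames per second: {frames_per_second:.2f}")
```

Each mel frame therefore advances by roughly 11.6 ms of audio, and each analysis window spans roughly 46.4 ms.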


In [16]:
# LOAD TTS MODEL
# multi speaker 
speaker_id = None
speakers = []

# build the model
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)

# load the checkpoint on the CPU
cp = torch.load(TTS_MODEL, map_location=torch.device('cpu'))

# restore the weights and switch to inference mode
model.load_state_dict(cp['model'])
if use_cuda:
    model.cuda()
model.eval()

# set the decoder reduction factor r stored in the checkpoint
if 'r' in cp:
    model.decoder.set_r(cp['r'])

 > Using model: Tacotron2


In [17]:
from TTS.vocoder.utils.generic_utils import setup_generator

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0

ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])    
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval()

 > Generator Model: multiband_melgan_generator
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024


MultibandMelganGenerator(
  (layers): Sequential(
    (0): ReflectionPad1d((3, 3))
    (1): Conv1d(80, 384, kernel_size=(7,), stride=(1,))
    (2): LeakyReLU(negative_slope=0.2)
    (3): ConvTranspose1d(384, 192, kernel_size=(16,), stride=(8,), padding=(4,))
    (4): ResidualStack(
      (blocks): ModuleList(
        (0): Sequential(
          (0): LeakyReLU(negative_slope=0.2)
          (1): ReflectionPad1d((1, 1))
          (2): Conv1d(192, 192, kernel_size=(3,), stride=(1,))
          (3): LeakyReLU(negative_slope=0.2)
          (4): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
        )
        (1): Sequential(
          (0): LeakyReLU(negative_slope=0.2)
          (1): ReflectionPad1d((3, 3))
          (2): Conv1d(192, 192, kernel_size=(3,), stride=(1,), dilation=(3,))
          (3): LeakyReLU(negative_slope=0.2)
          (4): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
        )
        (2): Sequential(
          (0): LeakyReLU(negative_slope=0.2)
          (1): Reflectio
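
Because the vocoder consumes the mel spectrograms that Tacotron2 produces, the two audio configurations printed above should agree on the settings that shape those spectrograms. The sketch below compares a subset of the printed values. The helper `audio_mismatches` and the choice of `FEATURE_KEYS` are assumptions made for illustration and are not part of the TTS library.

```python
# Subsets of the two AudioProcessor printouts above (TTS model and vocoder).
tts_audio = {
    "sample_rate": 22050, "num_mels": 80, "fft_size": 1024,
    "hop_length": 256, "win_length": 1024,
    "mel_fmin": 50.0, "mel_fmax": 7600.0,
    "power": 1.5, "griffin_lim_iters": 60,
}
vocoder_audio = {
    "sample_rate": 22050, "num_mels": 80, "fft_size": 1024,
    "hop_length": 256, "win_length": 1024,
    "mel_fmin": 50.0, "mel_fmax": 7600.0,
    "power": None, "griffin_lim_iters": None,
}

# Assumption: these keys determine the mel-spectrogram layout. The Griffin-Lim
# settings (power, griffin_lim_iters) are left out because they differ in the
# two printouts and the neural vocoder replaces Griffin-Lim here.
FEATURE_KEYS = (
    "sample_rate", "num_mels", "fft_size", "hop_length",
    "win_length", "mel_fmin", "mel_fmax",
)


def audio_mismatches(a, b, keys=FEATURE_KEYS):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


mismatches = audio_mismatches(tts_audio, vocoder_audio)
print(mismatches)  # {} for the values printed above
```

In the notebook itself, the same check could be run on `TTS_CONFIG.audio` and `VOCODER_CONFIG['audio']`, assuming both behave like dictionaries.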

### Run Inference

In [18]:
sentence = "William got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

(188160,)
 > Run-time: 1.8140780925750732
 > Real-time factor: 0.21256444975733757
 > Time per step: 9.640136442216885e-06
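
To make these numbers concrete, the arithmetic can be reproduced from the printed waveform length and run time. This is a worked example using the values above together with the sample rate and hop length from the audio processor printout; the variable names are chosen for illustration.

```python
# Values copied from the output above and from the audio processor settings.
num_samples = 188160            # waveform.shape[0]
sample_rate = 22050             # samples per second
hop_length = 256                # samples per mel frame
run_time = 1.8140780925750732   # seconds, as printed

# Duration of the synthesized audio in seconds (about 8.53 s).
audio_seconds = num_samples / sample_rate

# Real-time factor: compute time divided by audio duration.
# Values below 1 mean synthesis runs faster than real time.
rtf = run_time / audio_seconds

# Compute time spent per output sample.
time_per_sample = run_time / num_samples

# Number of mel frames the vocoder expanded into the waveform.
num_frames = num_samples // hop_length

print(f"audio duration: {audio_seconds:.2f} s")
print(f"real-time factor: {rtf:.3f}")
print(f"speed-up over real time: {1 / rtf:.1f}x")
print(f"time per sample: {time_per_sample:.2e} s")
print(f"mel frames: {num_frames}")
```

A real-time factor of about 0.21 means that each second of speech takes about 0.21 seconds to synthesize on this CPU.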
