<a href="https://colab.research.google.com/github/bfidelin/machine-learning-roadmap/blob/master/DDC_TTS_Universal_Fullband_MelGAN_MAI_karen_savage_ES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mozilla TTS: Real-Time Speech Synthesis on CPU

This notebook uses a Tacotron2 model and a Universal-FullBand-MelGAN vocoder, together with a subset of the [M-AILABS](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) dataset (`es_ES/by_book/female/karen_savage/angelina`).

Tacotron2 is trained with [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 80K steps on a single GPU, fine-tuning an LJSpeech model that was trained for 300K steps.

Universal-FullBand-MelGAN is trained for 1.575K steps with real spectrograms on the LibriTTS dataset.

Note that the dataset sampling rate is 16 kHz, while the vocoder was trained at 24 kHz. We therefore interpolate the spectrograms along the time axis to bridge the difference in sampling rates. Training a dedicated vocoder for this model could improve quality.
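
As a rough sketch of the arithmetic behind this interpolation: the time axis of the spectrogram is stretched by the ratio of the two sampling rates. The helper name below is hypothetical. The flooring rule mirrors how `torch.nn.functional.interpolate` sizes its output when given a `scale_factor`. The hop length of 256 is the value printed for the TTS audio processor later in this notebook, and the sketch assumes the vocoder emits the same number of samples per frame.

```python
import math


def interpolated_frame_count(n_frames, tts_sr, vocoder_sr):
    """Frames left after stretching the time axis by vocoder_sr / tts_sr.

    Hypothetical helper, not part of the TTS library.
    """
    scale = vocoder_sr / tts_sr
    return math.floor(n_frames * scale)


tts_sr, vocoder_sr = 16000, 24000
hop_length = 256  # assumption: the vocoder produces hop_length samples per frame

frames = interpolated_frame_count(588, tts_sr, vocoder_sr)
samples = frames * hop_length
print(frames, samples)
```

For the 588-frame spectrogram in the inference run at the end of this notebook, this gives 882 frames and 225792 samples, which agrees with the logged shapes.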

### Download Models

In [1]:
!gdown --id 1jZ4HvYcAXI5ZClke2iGA7qFQQJBXIovw -O tts_model.pth.tar 
!gdown --id 1s7g4n-B73ChCB48AQ88_DV_8oyLth8r0 -O config.json
!gdown --id 13st0CZ743v6Br5R5Qw_lH1OPQOr3M-Jv -O scale_stats.npy

Downloading...
From: https://drive.google.com/uc?id=1jZ4HvYcAXI5ZClke2iGA7qFQQJBXIovw
To: /content/tts_model.pth.tar
575MB [00:08, 66.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1s7g4n-B73ChCB48AQ88_DV_8oyLth8r0
To: /content/config.json
100% 11.4k/11.4k [00:00<00:00, 10.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=13st0CZ743v6Br5R5Qw_lH1OPQOr3M-Jv
To: /content/scale_stats.npy
100% 10.5k/10.5k [00:00<00:00, 9.10MB/s]


In [2]:
!gdown --id 14MlX3EY5KuXA5GVglyd2i1EXu8ApHNsS -O vocoder_model.pth.tar
!gdown --id 1uBRVNxsoCYJxNCqPoQedASm6EtSW3w04 -O config_vocoder.json
!gdown --id 1O8ziB27XqzIpkb-6_QI0fpDouF4-v7_1 -O scale_stats_vocoder.npy

Downloading...
From: https://drive.google.com/uc?id=14MlX3EY5KuXA5GVglyd2i1EXu8ApHNsS
To: /content/vocoder_model.pth.tar
109MB [00:01, 63.2MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1uBRVNxsoCYJxNCqPoQedASm6EtSW3w04
To: /content/config_vocoder.json
100% 7.51k/7.51k [00:00<00:00, 14.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1O8ziB27XqzIpkb-6_QI0fpDouF4-v7_1
To: /content/scale_stats_vocoder.npy
100% 10.5k/10.5k [00:00<00:00, 9.84MB/s]
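
Downloads from Google Drive can fail silently, so it may help to confirm that all six files are on disk before loading anything. The snippet below is a minimal sketch using only the standard library. `missing_files` is a hypothetical helper, and the file names are the ones passed to `-O` in the two cells above.

```python
import os

EXPECTED_FILES = [
    "tts_model.pth.tar",
    "config.json",
    "scale_stats.npy",
    "vocoder_model.pth.tar",
    "config_vocoder.json",
    "scale_stats_vocoder.npy",
]


def missing_files(paths, root="."):
    """Return the entries of `paths` that are absent or empty under `root`."""
    missing = []
    for name in paths:
        full = os.path.join(root, name)
        if not os.path.isfile(full) or os.path.getsize(full) == 0:
            missing.append(name)
    return missing


missing = missing_files(EXPECTED_FILES)
if missing:
    print("Missing or empty downloads:", missing)
else:
    print("All model files are present.")
```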


### Setup Libraries

In [3]:
!sudo apt-get install espeak

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 15 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak amd64 1.48.04+dfsg-5 [61.6 kB]
Fetched 1,219 kB in 1s (818 kB/s)
deb

In [4]:
!git clone https://github.com/mozilla/TTS TTS_repo
%cd TTS_repo
!git checkout 4873601
!pip install -r requirements.txt
!python setup.py develop
%cd ..

Cloning into 'TTS_repo'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 11523 (delta 2), reused 3 (delta 1), pack-reused 11511
Receiving objects: 100% (11523/11523), 122.57 MiB | 31.04 MiB/s, done.
Resolving deltas: 100% (8083/8083), done.
/content/TTS_repo
Note: checking out '4873601'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 4873601 Merge pull request #531 from WeberJulian/french-cleaners
Collecting tensorflow==2.3.0
  Downloading https://files.pythonhosted.org/packages/97/ae/0b08f53498417914f2

### Define TTS function

In [5]:
def interpolate_vocoder_input(scale_factor, spec):
    """Interpolate the spectrogram to tolerate the sampling rate difference
    between the TTS model and the vocoder."""
    print(" > before interpolation :", spec.shape)
    spec = torch.tensor(spec).unsqueeze(0).unsqueeze(0)
    spec = torch.nn.functional.interpolate(spec, scale_factor=scale_factor, mode='bilinear').squeeze(0)
    print(" > after interpolation :", spec.shape)
    return spec


def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
                                                                             truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
    print(mel_postnet_spec.shape)
    mel_postnet_spec = ap._denormalize(mel_postnet_spec.T).T
    if not use_gl:
        target_sr = VOCODER_CONFIG.audio['sample_rate']
        vocoder_input = ap_vocoder._normalize(mel_postnet_spec.T)
        if scale_factor[1] != 1:
            vocoder_input = interpolate_vocoder_input(scale_factor, vocoder_input)
        else:
            vocoder_input = torch.tensor(vocoder_input).unsqueeze(0)
        waveform = vocoder_model.inference(vocoder_input)
    if use_cuda and not use_gl:
        waveform = waveform.cpu()
    if not use_gl:
        waveform = waveform.numpy()
    waveform = waveform.squeeze()
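    # NOTE: with a vocoder, the waveform is sampled at the vocoder rate (target_sr),
    # but the real-time factor below is computed against ap.sample_rate.
    rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)
    tps = (time.time() - t_1) / len(waveform)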
    print(waveform.shape)
    print(" > Run-time: {}".format(time.time() - t_1))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=VOCODER_CONFIG.audio['sample_rate']))  
    return alignment, mel_postnet_spec, stop_tokens, waveform

### Load Models

In [6]:
%load_ext autoreload
%autoreload 2

import os
import sys
import torch
import time
import IPython

# for some reason TTS installation does not work on Colab
sys.path.append('TTS_repo')

from TTS.tts.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.tts.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis

FileNotFoundError: ignored

In [None]:
# runtime settings
use_cuda = False

In [None]:
# model paths
TTS_MODEL = "tts_model.pth.tar"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pth.tar"
VOCODER_CONFIG = "config_vocoder.json"

In [None]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)
VOCODER_CONFIG.audio['stats_path'] = 'scale_stats_vocoder.npy'

In [None]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)    
ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])   

# scale factor for sampling rate difference
scale_factor = [1,  VOCODER_CONFIG['audio']['sample_rate'] / ap.sample_rate]
print(f"scale_factor: {scale_factor}")

 > Setting up Audio Processor...
 | > sample_rate:16000
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
 > Setting up Audio Processor...
 | > sample_rate:24000
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_sil
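
Interpolating the spectrogram only compensates for the sampling rate. The mel settings of the two audio processors still need to agree for the vocoder to interpret the TTS output correctly. The check below is a small sketch with a hypothetical helper, `mel_mismatches`. The choice of keys to compare is an assumption, and the dictionaries are filled in by hand from the values printed above rather than read from the config objects.

```python
def mel_mismatches(tts_audio, vocoder_audio,
                   keys=("num_mels", "fft_size", "mel_fmin", "mel_fmax")):
    """Return {key: (tts_value, vocoder_value)} for every key that differs."""
    return {
        k: (tts_audio.get(k), vocoder_audio.get(k))
        for k in keys
        if tts_audio.get(k) != vocoder_audio.get(k)
    }


# Values copied from the audio processor logs above.
tts_audio = {"sample_rate": 16000, "num_mels": 80, "fft_size": 1024,
             "mel_fmin": 50.0, "mel_fmax": 7600.0}
vocoder_audio = {"sample_rate": 24000, "num_mels": 80, "fft_size": 1024,
                 "mel_fmin": 50.0, "mel_fmax": 7600.0}

mismatches = mel_mismatches(tts_audio, vocoder_audio)
print(mismatches or "mel settings match")
```

The sampling rate is left out of the comparison on purpose, since that difference is what `scale_factor` handles. Differences in normalization are handled in `tts` by calling `ap._denormalize` followed by `ap_vocoder._normalize`.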

In [None]:
# LOAD TTS MODEL
# multi speaker 
speaker_id = None
speakers = []

# build the model
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)

# load model state
cp =  torch.load(TTS_MODEL, map_location=torch.device('cpu'))

# load the weights and switch to inference mode
model.load_state_dict(cp['model'])
if use_cuda:
    model.cuda()
model.eval()

# set the decoder reduction factor r stored in the checkpoint
if 'r' in cp:
    model.decoder.set_r(cp['r'])

 > Using model: Tacotron2


In [None]:
from TTS.vocoder.utils.generic_utils import setup_generator

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0
 
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval();

 > Generator Model: fullband_melgan_generator


### Run Inference

In [None]:
sentence =  "En sendas sentencias de 2013, esa Sala del Tribunal Supremo anuló dos indultos porque no estaban suficientemente motivados."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

(588, 80)
 > before interpolation : (80, 588)
 > after interpolation : torch.Size([1, 80, 882])


  "See the documentation of nn.Upsample for details.".format(mode))


(225792,)
 > Run-time: 8.28113842010498
 > Real-time factor: 0.5868061903923277
 > Time per step: 3.6675418577159076e-05

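The real-time factor is the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The `tts` function divides by `ap.sample_rate` (16 kHz), while the audio is played back at the vocoder's 24 kHz. The sketch below recomputes the factor from the logged numbers under both rates. `real_time_factor` is a hypothetical helper, and the results differ slightly from the log because the log takes its timestamps at different moments.

```python
def real_time_factor(run_time_s, n_samples, sample_rate):
    """Synthesis time divided by audio duration in seconds."""
    return run_time_s / (n_samples / sample_rate)


run_time_s = 8.28113842010498  # "Run-time" from the log above
n_samples = 225792             # waveform length from the log above

rtf_tts_rate = real_time_factor(run_time_s, n_samples, 16000)
rtf_vocoder_rate = real_time_factor(run_time_s, n_samples, 24000)
print(round(rtf_tts_rate, 2), round(rtf_vocoder_rate, 2))
```

Measured against the 24 kHz playback rate, the factor is roughly 0.88 rather than 0.59, which is still faster than real time on CPU.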