# Mozilla TTS on CPU: Real-Time Speech Synthesis

We use Tacotron2 and MultiBand-Melgan models trained on the LJSpeech dataset.

Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.

MultiBand-Melgan is trained for 1.45M steps on real spectrograms.

Note that the performance of both models can be improved with more training.

### Download Models

In [None]:
!mkdir -p data
!gdown --id 1X09hHAyAJOnrplCUMAdW_t341Kor4YR4 -O data/vocoder_model.pth.tar
!gdown --id 1qN7vQRIYkzvOX_DtiZtTajzoZ1eW1-Eg -O data/config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O data/scale_stats.npy
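
The cells that follow assume all three downloads succeeded. A minimal sketch of a sanity check, assuming the same `data/` paths used above; `find_missing_files` is a hypothetical helper, not part of Mozilla TTS:

```python
import os

# Hypothetical helper (not part of Mozilla TTS): report which files are absent.
def find_missing_files(paths):
    """Return the subset of `paths` that does not exist as a regular file."""
    return [p for p in paths if not os.path.isfile(p)]

# Paths taken from the download commands above.
required = [
    "data/vocoder_model.pth.tar",
    "data/config_vocoder.json",
    "data/scale_stats.npy",
]

missing = find_missing_files(required)
if missing:
    print("Missing files, re-run the download cell:", missing)
else:
    print("All model files are in place.")
```

Running this right after the download cell should surface a failed or partial download before the model-loading code raises a less obvious error.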

### Load Models

In [None]:
import os
import torch
import time
import IPython

from mozilla_voice_tts.tts.utils.generic_utils import setup_model
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.tts.utils.text.symbols import symbols, phonemes
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.tts.utils.synthesis import synthesis

In [None]:
# runtime settings: this notebook targets CPU inference
use_cuda = False

In [None]:
# model paths
VOCODER_MODEL = "data/vocoder_model.pth.tar"
VOCODER_CONFIG_PATH = "data/config_vocoder.json"
# parsed config object, used by the cells below
VOCODER_CONFIG = load_config(VOCODER_CONFIG_PATH)

In [None]:
# load the audio processor
VOCODER_CONFIG.audio['stats_path'] = 'data/scale_stats.npy'
ap = AudioProcessor(**VOCODER_CONFIG.audio)

In [None]:
from mozilla_voice_tts.vocoder.utils.generic_utils import setup_generator

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0

ap_vocoder = AudioProcessor(**VOCODER_CONFIG.audio)
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval()

In [None]:
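def vocode(mel):
    """Run the vocoder on a mel spectrogram and play the resulting audio."""
    # transpose and add a batch dimension before passing the input to the vocoder
    mel_tensor = torch.FloatTensor(mel.T).unsqueeze(0)
    if use_cuda:
        # keep the input on the same device as the model
        mel_tensor = mel_tensor.cuda()
    with torch.no_grad():
        waveform = vocoder_model.inference(mel_tensor)
    waveform = waveform.flatten().cpu().numpy()
    IPython.display.display(IPython.display.Audio(waveform, rate=VOCODER_CONFIG.audio['sample_rate']))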
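
Since the goal is real-time synthesis, one common way to quantify it is the real-time factor (RTF): processing time divided by the duration of the generated audio. The sketch below is only an illustration. `real_time_factor` is a hypothetical helper, the 22050 Hz sample rate is an assumption (in the notebook it would come from `VOCODER_CONFIG.audio['sample_rate']`), and the timed workload is a stand-in for the actual vocoder call.

```python
import time

# Hypothetical helper (not part of Mozilla TTS).
def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """Ratio of processing time to the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds

SAMPLE_RATE = 22050  # assumed value; read it from the vocoder config in practice

# Stand-in workload so the sketch runs on its own. In the notebook, the timed
# region would wrap the vocoder inference call instead.
start = time.perf_counter()
fake_waveform = [0.0] * SAMPLE_RATE  # one second of silence
elapsed = time.perf_counter() - start

rtf = real_time_factor(elapsed, len(fake_waveform), SAMPLE_RATE)
print(f"RTF: {rtf:.4f} (below 1.0 means faster than real time)")
```

An RTF below 1.0 means the audio is produced faster than it takes to play back.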