# Mozilla TTS on CPU: Real-Time Speech Synthesis

We use the Tacotron2 and MultiBand-Melgan models with the Baker dataset (Mandarin Chinese).

Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 126K steps (3 days) on a single GPU.

MultiBand-Melgan is trained for 1.45M steps with real spectrograms.

Note that the performance of both models can be improved with more training.

### Download Models

### Define TTS function

### Load Models

In [3]:
import os
import torch
import IPython

from TTS.utils.synthesizer import Synthesizer
from TTS.utils.manage import ModelManager


In [4]:
# runtime settings
use_cuda = False

In [6]:
# tts and vocoder name
TTS_NAME = "tts_models/zh-CN/baker/tacotron2-DDC-GST"
VOCODER_NAME = "vocoder_models/en/ljspeech/multiband-melgan"


In [7]:
manager = ModelManager("../TTS/.models.json")

In [9]:
tts_checkpoint_file, tts_config_file, tts_json_dict = manager.download_model(TTS_NAME)
vocoder_checkpoint_file, vocoder_config_file, vocoder_json_dict = manager.download_model(VOCODER_NAME)

 > tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.
 > vocoder_models/en/ljspeech/multiband-melgan is already downloaded.


In [11]:
synthesizer = Synthesizer(tts_checkpoint_file, tts_config_file, vocoder_checkpoint_file, vocoder_config_file, use_cuda)
sample_rate = synthesizer.tts_config.audio["sample_rate"]

 > Using model: tacotron2
 > Generator Model: multiband_melgan_generator


## Run Inference

In [14]:
# Here are some test sentences for you to play with:
sentences = [
    "我从来不会说很标准的中文。",
    "我喜欢听人工智能的博客。",
    "我来自一个法国郊区的地方。",
    "不比不知道，一比吓一跳！",
    "台湾是一个真的很好玩的地方！",
    "干一行，行一行，行行都行。",
    "我要盖被子，好尴尬！",
]

In [16]:
for sentence in sentences:
    wav = synthesizer.tts(sentence)
    IPython.display.display(IPython.display.Audio(wav, rate=sample_rate))  
    

 > Text splitted to sentences.
['我从来不会说很标准的中文。']
 > Processing time: 1.6665124893188477
 > Real-time factor: 0.5583910829911347


 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 1.4052538871765137
 > Real-time factor: 0.5193391025114328


 > Text splitted to sentences.
['我来自一个法国郊区的地方。']
 > Processing time: 1.605910062789917
 > Real-time factor: 0.5785999490934259


 > Text splitted to sentences.
['不比不知道，一比吓一跳！']
 > Processing time: 1.9105627536773682
 > Real-time factor: 0.6607262973429417


 > Text splitted to sentences.
['台湾是一个真的很好玩的地方！']
 > Processing time: 1.3081049919128418
 > Real-time factor: 0.4218891158389621


 > Text splitted to sentences.
['干一行，行一行，行行都行。']
 > Processing time: 2.0958540439605713
 > Real-time factor: 0.6709288860239634


 > Text splitted to sentences.
['我要盖被子，好尴尬！']
 > Processing time: 1.5188167095184326
 > Real-time factor: 0.6257456734843319
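
To put the logged numbers in context, here is a minimal sketch that summarizes the seven real-time factors printed above. It assumes the real-time factor (RTF) is the processing time divided by the duration of the generated audio, so a value below 1.0 means synthesis runs faster than playback. The variable names are illustrative and not part of the TTS library.

```python
from statistics import mean

# Real-time factors copied from the log output above, one per test sentence.
rtfs = [
    0.5583910829911347,
    0.5193391025114328,
    0.5785999490934259,
    0.6607262973429417,
    0.4218891158389621,
    0.6709288860239634,
    0.6257456734843319,
]

# Assumption: RTF = processing time / duration of the generated audio,
# so any value below 1.0 is faster than real time.
mean_rtf = mean(rtfs)
worst_rtf = max(rtfs)
faster_than_real_time = worst_rtf < 1.0

print(f"mean RTF: {mean_rtf:.3f}, worst RTF: {worst_rtf:.3f}")
print(f"faster than real time on every sentence: {faster_than_real_time}")
```

Checking the worst case rather than only the mean matters here: a single sentence with an RTF above 1.0 would be enough to cause audible stalls in a streaming setup.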


In [17]:
# You can also play with Global Style Tokens (GST) by passing a
# style_wav parameter to the tts method

style_wav = {"2": 0.2}

wav = synthesizer.tts(sentences[1], style_wav=style_wav)
IPython.display.display(IPython.display.Audio(wav, rate=sample_rate))  

 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 2.114016056060791
 > Real-time factor: 0.643271887228699


In [18]:
# For this model specifically, we can observe that GST token "2" controls speech speed.
# Listen to these 5 samples: the speech gets slower as the value increases.
for value in [-0.2, -0.1, 0, 0.1, 0.2]:
    style_wav = {"2": value}
    wav = synthesizer.tts(sentences[1], style_wav=style_wav)
    IPython.display.display(IPython.display.Audio(wav, rate=sample_rate))  

 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 1.5687272548675537
 > Real-time factor: 0.6401842606201799


 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 2.070594072341919
 > Real-time factor: 0.8067677285683367


 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 1.3769311904907227
 > Real-time factor: 0.5088718951180015


 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 2.024374485015869
 > Real-time factor: 0.6782983435843654


 > Text splitted to sentences.
['我喜欢听人工智能的博客。']
 > Processing time: 2.4434399604797363
 > Real-time factor: 0.7435119663360867
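
The speed effect can also be checked from the log output alone, without listening. The sketch below assumes, as before, that the real-time factor is the processing time divided by the audio duration, which lets us recover each clip's length as processing time divided by RTF. The names `runs`, `durations`, and `gets_slower` are illustrative only.

```python
# (GST token "2" value, processing time in seconds, real-time factor),
# copied from the log output above in the order the loop ran.
runs = [
    (-0.2, 1.5687272548675537, 0.6401842606201799),
    (-0.1, 2.070594072341919, 0.8067677285683367),
    (0.0, 1.3769311904907227, 0.5088718951180015),
    (0.1, 2.024374485015869, 0.6782983435843654),
    (0.2, 2.4434399604797363, 0.7435119663360867),
]

# Assumption: RTF = processing time / audio duration, hence
# audio duration = processing time / RTF.
durations = [(value, seconds / rtf) for value, seconds, rtf in runs]

for value, duration in durations:
    print(f"token value {value:+.1f}: about {duration:.2f} s of audio")

# True when every larger token value produced a longer clip.
gets_slower = all(a[1] < b[1] for a, b in zip(durations, durations[1:]))
print(f"duration grows with the token value: {gets_slower}")
```

Under that assumption, the same sentence grows from roughly 2.5 s to roughly 3.3 s of audio as the token value goes from -0.2 to 0.2, which is consistent with slower speech at higher values.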
