<center> 
    <h1> Transformer TTS: A Text-to-Speech Transformer in TensorFlow 2 </h1>
    <h2> Audio synthesis with Autoregressive Transformer TTS and MelGAN Vocoder</h2>
</center>

## Autoregressive Model

In [0]:
# Clone the Transformer TTS and MelGAN repos
!git clone https://github.com/as-ideas/TransformerTTS.git
!git clone https://github.com/seungwonpark/melgan.git

Cloning into 'TransformerTTS'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 2474 (delta 33), reused 44 (delta 20), pack-reused 2402
Receiving objects: 100% (2474/2474), 4.24 MiB | 18.73 MiB/s, done.
Resolving deltas: 100% (1670/1670), done.
Cloning into 'melgan'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 396 (delta 6), reused 28 (delta 2), pack-reused 356
Receiving objects: 100% (396/396), 18.04 MiB | 26.73 MiB/s, done.
Resolving deltas: 100% (184/184), done.


In [0]:
# Install requirements
!apt-get install -y espeak
!pip install -r TransformerTTS/requirements.txt

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 32 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak amd64 1.48.04+dfsg-5 [61.6 kB]
Fetched 1,219 kB in 1s (926 kB/s)
Sel

In [0]:
# Download the pre-trained weights
! wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_melgan_autoregressive_transformer.zip
! unzip ljspeech_melgan_autoregressive_transformer.zip

--2020-06-04 12:55:50--  https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_melgan_autoregressive_transformer.zip
Resolving public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)... 52.219.47.96
Connecting to public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)|52.219.47.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177804789 (170M) [application/zip]
Saving to: ‘ljspeech_melgan_autoregressive_transformer.zip’


2020-06-04 12:55:58 (24.3 MB/s) - ‘ljspeech_melgan_autoregressive_transformer.zip’ saved [177804789/177804789]

Archive:  ljspeech_melgan_autoregressive_transformer.zip
   creating: ljspeech_melgan_autoregressive_transformer/
  inflating: ljspeech_melgan_autoregressive_transformer/.DS_Store  
  inflating: __MACOSX/ljspeech_melgan_autoregressive_transformer/._.DS_Store  
   creating: ljspeech_melgan_autoregressive_

In [0]:
# Set up the paths
from pathlib import Path
MelGAN_path = 'melgan/'
TTS_path = 'TransformerTTS/'
config_path = Path('ljspeech_melgan_autoregressive_transformer/melgan')

import sys
sys.path.append(TTS_path)

In [0]:
# Load pretrained models
from utils.config_manager import ConfigManager
from utils.audio import Audio

import IPython.display as ipd

config_loader = ConfigManager(str(config_path), model_kind='autoregressive')
audio = Audio(config_loader.config)
model = config_loader.load_model(str(config_path / 'autoregressive_weights/ckpt-90'))

restored weights from ljspeech_melgan_autoregressive_transformer/melgan/autoregressive_weights/ckpt-90 at step 900000


In [0]:
# Synthesize text
sentence = 'Scientists at the CERN laboratory, say that they have discovered a new particle.'
out = model.predict(sentence)

pred text mel: 394 stop out: -1.5218873023986816
Stopping
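
The log above reports 394 predicted mel frames. As a rough sanity check, the frame count can be converted to an approximate audio duration. The sketch below assumes a hop length of 256 samples and a sampling rate of 22050 Hz; both defaults are assumptions for illustration, and the actual values should be read from the loaded configuration. The helper name `mel_frames_to_seconds` is hypothetical and not part of TransformerTTS.

```python
def mel_frames_to_seconds(n_frames, hop_length=256, sampling_rate=22050):
    """Approximate duration in seconds covered by n_frames spectrogram frames.

    hop_length and sampling_rate are assumed defaults, not values read from
    the model configuration.
    """
    return n_frames * hop_length / sampling_rate


# 394 is the frame count printed by model.predict above
duration = mel_frames_to_seconds(394)
print(f"{duration:.2f} s")
```

Under these assumptions the sentence corresponds to roughly four and a half seconds of audio.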


In [0]:
# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config_loader.config['sampling_rate']))

### MelGAN

In [0]:
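# Clean up the import state: drop TransformerTTS from sys.path and remove the
# cached 'model' module, so the name 'model' is free for the MelGAN code loaded next
sys.path.remove(TTS_path)
sys.modules.pop('model')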

<module 'model' from 'TransformerTTS/model/__init__.py'>
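
The cell above removes only the top-level `model` entry from `sys.modules`. If submodules of that package were also imported, they would stay cached under names such as `model.<submodule>`. A minimal sketch of a more thorough cleanup follows; `purge_modules` is a hypothetical helper, and the `demo_pkg` modules are throwaway stand-ins created only for the demonstration.

```python
import sys
import types


def purge_modules(package):
    """Remove `package` and all of its submodules from the import cache.

    Returns the sorted list of module names that were removed.
    """
    doomed = [name for name in sys.modules
              if name == package or name.startswith(package + ".")]
    for name in doomed:
        del sys.modules[name]
    return sorted(doomed)


# Demonstration with stand-in modules instead of the real 'model' package
sys.modules["demo_pkg"] = types.ModuleType("demo_pkg")
sys.modules["demo_pkg.layers"] = types.ModuleType("demo_pkg.layers")
removed = purge_modules("demo_pkg")
print(removed)
```

In this notebook the equivalent call would be `purge_modules('model')`. Objects that were already created from those modules, such as the loaded TTS model, keep working because they hold their own references.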

In [0]:
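# Make the MelGAN repo importable and load the pretrained vocoder from torch.hub
sys.path.append(MelGAN_path)
import torch
import numpy as np

vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()

# Transpose the predicted mel and add a leading batch axis for the vocoder
mel = torch.tensor(out['mel'].numpy().T[np.newaxis,:,:])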

Downloading: "https://github.com/seungwonpark/melgan/archive/master.zip" to /root/.cache/torch/hub/master.zip
Downloading: "https://github.com/seungwonpark/melgan/releases/download/v0.3-alpha/nvidia_tacotron2_LJ11_epoch6400.pt" to /root/.cache/torch/hub/checkpoints/nvidia_tacotron2_LJ11_epoch6400.pt


In [0]:
if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

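# Run the vocoder without tracking gradients.
# Note: this rebinds `audio`, which until now held the TransformerTTS Audio helper.
with torch.no_grad():
    audio = vocoder.inference(mel)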

In [0]:
# Display audio
ipd.display(ipd.Audio(audio.cpu().numpy(), rate=22050))
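
To keep the result outside the notebook, the samples can be written to a WAV file. The following is a minimal sketch using only the Python standard library. It assumes the samples are mono 16-bit integers at 22050 Hz, matching the rate used in the display call above; `write_wav_int16` is a hypothetical helper, and a synthetic tone stands in for the vocoder output.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_wav_int16(path, samples, sampling_rate=22050):
    """Write mono 16-bit PCM samples (ints in [-32768, 32767]) to a WAV file."""
    pcm = array.array("h", samples)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in signal: one second of a 440 Hz tone instead of real vocoder output
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / 22050)) for n in range(22050)]
demo_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_wav_int16(demo_path, tone)
```

If the vocoder output is indeed a one-dimensional tensor of 16-bit integers (an assumption worth checking via its `dtype` and `shape`), something like `write_wav_int16('out.wav', audio.cpu().numpy().ravel().tolist())` should work.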