<center> 
    <h1> Transformer TTS: A Text-to-Speech Transformer in TensorFlow 2 </h1>
    <h2> Audio synthesis with Forward Transformer TTS and HiFiGAN Vocoder</h2>
</center>

## Forward Model

In [1]:
# Clone the TransformerTTS repo
!git clone https://github.com/as-ideas/TransformerTTS.git

Cloning into 'TransformerTTS'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 3745 (delta 13), reused 14 (delta 6), pack-reused 3718
Receiving objects: 100% (3745/3745), 10.83 MiB | 24.86 MiB/s, done.
Resolving deltas: 100% (2562/2562), done.


In [2]:
# Install requirements
!apt-get install -y espeak
!pip install -r TransformerTTS/requirements.txt

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 29 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak amd64 1.48.04+dfsg-5 [61.6 kB]
Fetched 1,219 kB in 1s (1,106 kB/s)

In [3]:
# Pin the repo to a known commit, then install the vocoder requirements
!cd TransformerTTS/; git checkout 4eea35f4f9997302afde107503ea410d7869c355
!pip install -r TransformerTTS/vocoding/extra_requirements.txt

Note: checking out '4eea35f4f9997302afde107503ea410d7869c355'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 4eea35f Update README.md, refactor.


In [4]:
# Set up the paths
TTS_path = 'TransformerTTS/'

import sys
sys.path.append(TTS_path)

In [5]:
# Load pretrained model
from model.factory import tts_ljspeech
from data.audio import Audio
from vocoding.predictors import HiFiGANPredictor, MelGANPredictor
model, config = tts_ljspeech()
audio = Audio(config)

Downloading data from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/api_weights/ljspeech_tts_config_v1.yaml
Downloading data from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/api_weights/ljspeech_tts_weights_v1.hdf5


In [6]:
# Synthesize text
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out_normal = model.predict(sentence)
# Convert the mel spectrogram to a waveform (with Griffin-Lim)
wav = audio.reconstruct_waveform(out_normal['mel'].numpy().T)

In [7]:
import IPython.display as ipd
ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))

You can also vary the speech speed with the `speed_regulator` argument of `model.predict`.

In [8]:
# 20% faster
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out = model.predict(sentence, speed_regulator=1.20)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))

In [9]:
# 10% slower
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out = model.predict(sentence, speed_regulator=.9)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))
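
To check the effect of `speed_regulator` numerically rather than by ear, one option is to compare clip lengths. The sketch below is not part of TransformerTTS: `duration_seconds` and `speed_ratio` are hypothetical helpers, and the sample counts are placeholders rather than measurements from the model. It assumes `wav` is a one-dimensional sequence of samples, so that `len(wav)` is the sample count.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a mono clip in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def speed_ratio(reference_samples: int, regulated_samples: int) -> float:
    """How many times shorter the regulated clip is than the reference clip."""
    if regulated_samples <= 0:
        raise ValueError("regulated_samples must be positive")
    return reference_samples / regulated_samples


# Placeholder sample counts (assumed values, not model output).
reference_samples = 110250
faster_samples = 91875

print(duration_seconds(reference_samples, 22050))
print(speed_ratio(reference_samples, faster_samples))
```

In the notebook, the same helpers could be called as `duration_seconds(len(wav), config['sampling_rate'])` after each synthesis cell.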

## HiFiGAN

In [10]:
# The archive is hosted on ASAI's S3 bucket; the weights come from https://github.com/jik876/hifi-gan
!wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/hifigan.zip
!unzip -q hifigan.zip
!rsync -avq hifigan/ TransformerTTS/vocoding/hifigan/

--2021-03-16 16:08:41--  https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/hifigan.zip
Resolving public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)... 52.219.75.76
Connecting to public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)|52.219.75.76|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51874346 (49M) [application/zip]
Saving to: ‘hifigan.zip’


2021-03-16 16:08:44 (19.9 MB/s) - ‘hifigan.zip’ saved [51874346/51874346]
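
A quick sanity check after the download can catch a truncated archive. This is a minimal sketch, assuming the archive still has the size wget reported in the log above (51874346 bytes); the hosted file may change, in which case the constant needs updating. `has_expected_size` is a hypothetical helper, not part of TransformerTTS.

```python
import os

# "Length" reported by wget in the log above; an assumption if the archive is ever updated.
EXPECTED_BYTES = 51874346


def has_expected_size(path: str, expected_bytes: int = EXPECTED_BYTES) -> bool:
    """Return True if `path` is an existing file of exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


if not has_expected_size("hifigan.zip"):
    print("hifigan.zip is missing or has an unexpected size; consider downloading it again.")
```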



In [11]:
# Load the English HiFiGAN vocoder weights
vocoder = HiFiGANPredictor.from_folder('TransformerTTS/vocoding/hifigan/en')

In [12]:
# Vocode the normal-speed mel spectrogram from cell 6 with HiFiGAN
wav = vocoder([out_normal['mel'].numpy().T])[0]

In [13]:
# Display audio
ipd.display(ipd.Audio(wav, rate=22050))
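
To keep the result outside the notebook, the waveform can be written to disk. The sketch below uses only the Python standard library and assumes the waveform is a one-dimensional sequence of floats in the range [-1, 1]; values outside that range are clipped. `save_wav` is a hypothetical helper, not part of TransformerTTS, and the sine tone stands in for the synthesized audio.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sampling_rate=22050):
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file."""
    # Clip to [-1, 1] so the conversion to int16 cannot overflow.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = struct.pack("<%dh" % len(clipped),
                      *(int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm)


# Stand-in signal: 0.1 s of a 440 Hz tone at 22050 Hz.
rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
out_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
save_wav(tone, out_path, sampling_rate=rate)
print("wrote", out_path)
```

In the notebook, a call such as `save_wav(wav, 'out.wav', sampling_rate=22050)` would write the HiFiGAN output, provided `wav` matches the assumptions above.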