# Welcome to the DWT IT Lab voice cloning experiment

The demonstration, though basic, shows that a short sample of a speaker's recorded audio (roughly 5 seconds or more) is enough to produce a convincing synthesis of the speaker's voice within a few seconds.

This demonstration is based on the paper [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf), which describes a system built from three components:


1.   A speaker encoder network that embeds an untranscribed audio sample of the speaker. The network is trained on a dataset of 18 thousand speakers that includes background noise and reverberation.
2.   A sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding from (1).
3.   An auto-regressive WaveNet-based vocoder network that converts the mel spectrogram from (2) into time-domain waveform samples.

All three components take or produce mel spectrograms, an intermediate representation that yields better performance than working with raw waveforms directly.
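To give a feel for what the mel scale does, the sketch below converts between hertz and mels using the common HTK-style formula and lists the center frequencies of a few evenly spaced mel bands. It is an illustration only: the function names are made up for this example, and the repository's own spectrogram code may use a different variant of the scale.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_bands)]

# Eight bands between 0 Hz and 8 kHz (values chosen only for illustration)
centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=8000.0)
print([round(c) for c in centers])
```

The bands come out narrow at low frequencies and wide at high ones, so a few dozen mel bands can summarize a spectrum that would otherwise need hundreds of linear frequency bins.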





In [1]:
!git clone https://github.com/akiral/Real-Time-Voice-Cloning.git

Cloning into 'Real-Time-Voice-Cloning'...
remote: Enumerating objects: 2450, done.
remote: Total 2450 (delta 0), reused 0 (delta 0), pack-reused 2450
Receiving objects: 100% (2450/2450), 362.28 MiB | 36.42 MiB/s, done.
Resolving deltas: 100% (1354/1354), done.


In [2]:
# Changing the current directory to the repository's directory
%cd Real-Time-Voice-Cloning/

/content/Real-Time-Voice-Cloning


In [3]:
# Installing the dependencies
!pip install -q -r requirements.txt
!apt-get install -qq libportaudio2
!pip install webrtcvad

  Building wheel for visdom (setup.py) ... done
  Building wheel for torchfile (setup.py) ... done
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 144611 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1) ...
Setting up libportaudio2:amd64 (19.6.0-1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.2) ...

In [4]:
# Downloading pretrained data and unzipping it
!gdown https://drive.google.com/uc?id=1n1sPXvT34yXFLT47QZA6FIRGrwMeSsZc
!unzip pretrained.zip

Downloading...
From: https://drive.google.com/uc?id=1n1sPXvT34yXFLT47QZA6FIRGrwMeSsZc
To: /content/Real-Time-Voice-Cloning/pretrained.zip
384MB [00:03, 103MB/s]
Archive:  pretrained.zip
   creating: encoder/saved_models/
  inflating: encoder/saved_models/pretrained.pt  
   creating: synthesizer/saved_models/
   creating: synthesizer/saved_models/logs-pretrained/
   creating: synthesizer/saved_models/logs-pretrained/taco_pretrained/
 extracting: synthesizer/saved_models/logs-pretrained/taco_pretrained/checkpoint  
  inflating: synthesizer/saved_models/logs-pretrained/taco_pretrained/tacotron_model.ckpt-278000.data-00000-of-00001  
  inflating: synthesizer/saved_models/logs-pretrained/taco_pretrained/tacotron_model.ckpt-278000.index  
  inflating: synthesizer/saved_models/logs-pretrained/taco_pretrained/tacotron_model.ckpt-278000.meta  
   creating: vocoder/saved_models/
   creating: vocoder/saved_models/pretrained/
  inflating: vocoder/saved_models/pretrained/pretrained.pt  
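Before loading the models, it can help to confirm that the archive unpacked where the later cells expect it. The sketch below checks a few of the paths listed in the unzip log above, relative to the repository root. The helper name `missing_model_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the unzip log above, relative to the repository root.
EXPECTED_PATHS = [
    "encoder/saved_models/pretrained.pt",
    "synthesizer/saved_models/logs-pretrained/taco_pretrained/checkpoint",
    "vocoder/saved_models/pretrained/pretrained.pt",
]

def missing_model_files(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that are not regular files under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).is_file()]

# Run from the repository root (the notebook has already changed into it)
missing = missing_model_files()
print("All model files present" if not missing else f"Missing: {missing}")
```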


In [5]:
# Importing the libraries and loading the pretrained encoder, synthesizer and vocoder models
from IPython.display import Audio
from IPython.utils import io
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import librosa
encoder_weights = Path("encoder/saved_models/pretrained.pt")
vocoder_weights = Path("vocoder/saved_models/pretrained/pretrained.pt")
syn_dir = Path("synthesizer/saved_models/logs-pretrained/taco_pretrained")
encoder.load_model(encoder_weights)
synthesizer = Synthesizer(syn_dir)
vocoder.load_model(vocoder_weights)


Loaded encoder "pretrained.pt" trained to step 1564501
Found synthesizer "pretrained" trained to step 278000
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt


In [10]:
# Mounting a personal Google Drive to support quick voice sample uploads
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
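Since the introduction suggests a sample of roughly five seconds or more, a quick length check on an uploaded file may save a wasted synthesis run. The sketch below reads the duration from the WAV header with Python's standard `wave` module, which only handles uncompressed PCM WAV files. The helper names and the `MIN_SECONDS` constant are assumptions made for this example.

```python
import wave

MIN_SECONDS = 5.0  # rough lower bound suggested in the introduction

def wav_duration_seconds(path) -> float:
    """Length of a PCM WAV file in seconds, computed from its header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def is_long_enough(path, min_seconds: float = MIN_SECONDS) -> bool:
    """True if the recording lasts at least min_seconds."""
    return wav_duration_seconds(path) >= min_seconds

# Assumed usage with the sample path from this notebook (not run here):
# is_long_enough("/content/drive/My Drive/Colab Notebooks/voice_samples/ben-5.wav")
```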


In [15]:
text = "the corona virus has caused a worldwide economic impact which hasn't seen an end"

In [16]:
# Path to the reference voice sample on the mounted Google Drive
in_fpath = Path("/content/drive/My Drive/Colab Notebooks/voice_samples/ben-5.wav")
# Load the sample and preprocess it once for the encoder
original_wav, sampling_rate = librosa.load(in_fpath)
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
# Compute the speaker embedding from the preprocessed sample
embed = encoder.embed_utterance(preprocessed_wav)
# Generate the mel spectrogram for the text, capturing the synthesizer's console output
with io.capture_output() as captured:
  specs = synthesizer.synthesize_spectrograms([text], [embed])
# Convert the spectrogram to a waveform and append one second of silence
generated_wav = vocoder.infer_waveform(specs[0])
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
# Play the original sample, then the cloned voice
display(Audio(original_wav, rate=sampling_rate))
display(Audio(generated_wav, rate=synthesizer.sample_rate))

{| ████████████████ 76000/76800 | Batch Size: 8 | Gen Rate: 10.0kHz | }
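Listening is the main check in this notebook, but one rough numerical comparison would be the cosine similarity between the embedding of the original sample and an embedding of the generated audio. The sketch below is a plain-Python cosine similarity. The commented usage assumes the `encoder`, `synthesizer`, `embed` and `generated_wav` names from the cells above, and it is a suggestion rather than something the original notebook runs.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Assumed usage in this notebook (not run here):
# generated_embed = encoder.embed_utterance(
#     encoder.preprocess_wav(generated_wav, synthesizer.sample_rate))
# print(cosine_similarity(embed, generated_embed))
```

Values closer to 1 mean the encoder places the two recordings nearer to each other in its embedding space. No particular threshold is implied here.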