# [VQ-VAE](https://arxiv.org/abs/1711.00937) for audio in PyTorch

This notebook is based on 
https://github.com/zalandoresearch/pytorch-vq-vae

## Introduction

Variational autoencoders (VAEs) can be thought of as doing what all but the last layer of a neural network does, namely feature extraction, or separating out the data. Thus, given some data, we can use a neural network to generate a representation of it.

Recall that the goal of a generative model is to estimate the probability distribution of high dimensional data such as images, videos, audio or even text by learning the underlying structure in the data as well as the dependencies between the different elements of the data. This is very useful since we can then use this representation to generate new data with similar properties. This way we can also learn useful features from the data in an unsupervised fashion.

The VQ-VAE uses a discrete latent representation mostly because many important real-world objects are discrete. For example, in images we might have categories like "Cat", "Car", etc., and it might not make sense to interpolate between these categories. Discrete representations are also easier to model since each category has a single value, whereas with a continuous latent space we would need to normalize the density function and learn the dependencies between the different variables, which could be very complex.

### Code

I have followed the code from the TensorFlow implementation by the author which you can find here [vqvae.py](https://github.com/deepmind/sonnet/blob/master/sonnet/python/modules/nets/vqvae.py) and [vqvae_example.ipynb](https://github.com/deepmind/sonnet/blob/master/sonnet/examples/vqvae_example.ipynb). 

Another PyTorch implementation is found at [pytorch-vqvae](https://github.com/ritheshkumar95/pytorch-vqvae).


## Basic Idea

We start by defining a latent embedding space of shape `[K, D]`, where `K` is the number of embeddings and `D` is the dimensionality of each latent embedding vector $e_i$.

The model takes in batches of waveforms, each 16126 samples long in our example, and passes them through a ConvNet encoder, where we make sure the number of output channels equals the dimensionality of the latent embedding vectors. To calculate the discrete latent variable, we find the nearest embedding vector for each encoder output vector and output its index.
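As a rough illustration of the lookup described above, here is a minimal sketch of the nearest-embedding search on a toy codebook. The function name `nearest_code` and the tiny sizes (`K=3`, `D=2`) are made up for this example and are not taken from the notebook's model.

```python
import torch

def nearest_code(z_e, codebook):
    """Return, for each encoder vector, the index of the closest codebook row.

    z_e:      [B, D, L] encoder output (channels = embedding dimension)
    codebook: [K, D] embedding vectors e_i
    """
    B, D, L = z_e.shape
    flat = z_e.permute(0, 2, 1).reshape(-1, D)   # [B*L, D]
    distances = torch.cdist(flat, codebook)      # [B*L, K] Euclidean distances
    return distances.argmin(dim=1).view(B, L)    # [B, L] integer codes

# Toy codebook with K=3 vectors of dimension D=2 (illustrative values)
codebook = torch.tensor([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
# One sequence with 3 time steps; rows are the two channels
z_e = torch.tensor([[[0.9, -0.8, 0.1],
                     [1.2,  0.7, -0.1]]])
indices = nearest_code(z_e, codebook)
print(indices)  # tensor([[1, 2, 0]])
```

Each time step is replaced by a single integer, which is the discrete latent variable.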

The input to the decoder is the embedding vector corresponding to the index which is passed through the decoder to produce the reconstructed audio. 

Since the nearest neighbour lookup has no real gradient in the backward pass we simply pass the gradients from the decoder to the encoder  unaltered. The intuition is that since the output representation of the encoder and the input to the decoder share the same `D` channel dimensional space, the gradients contain useful information for how the encoder has to change its output to lower the reconstruction loss.
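The gradient copy described above is usually written with a `detach()` trick. The following is a small sketch on toy tensors (the values and the stand-in loss are assumptions chosen only for illustration): the forward value is the embedding vector, while the backward pass treats quantization as the identity.

```python
import torch

z_e = torch.tensor([0.2, 0.9], requires_grad=True)  # toy encoder output
e = torch.tensor([0.0, 1.0])                         # embedding picked by the lookup

# Forward value equals e; in the backward pass the quantization step is skipped.
z_q = z_e + (e - z_e).detach()

loss = (z_q ** 2).sum()   # stand-in for the reconstruction loss
loss.backward()

# The encoder receives d(loss)/d(z_q) unchanged, here 2 * z_q.
print(z_e.grad)
```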

## Loss

The total loss is composed of three components:

1. a reconstruction loss, which optimizes the decoder and encoder
1. a codebook loss: because gradients bypass the embedding, we use a dictionary learning algorithm which uses an $l_2$ error to move the embedding vectors $e_i$ towards the encoder output
1. a commitment loss: since the volume of the embedding space is dimensionless, it can grow arbitrarily if the embeddings $e_i$ do not train as fast as the encoder parameters, so we add a commitment loss to make sure that the encoder commits to an embedding
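The last two terms can be sketched as follows. This is a minimal illustration, not the notebook's quantizer: `vq_losses` and `beta` are hypothetical names, `beta=0.25` mirrors the commitment cost used later in the notebook, and the reconstruction term is left out.

```python
import torch
import torch.nn.functional as F

def vq_losses(z_e, z_q, beta=0.25):
    """Codebook loss plus weighted commitment loss (reconstruction loss omitted)."""
    codebook_loss = F.mse_loss(z_q, z_e.detach())    # moves e_i toward the encoder output
    commitment_loss = F.mse_loss(z_e, z_q.detach())  # keeps the encoder near its chosen e_i
    return codebook_loss + beta * commitment_loss

z_e = torch.tensor([[0.5, 0.5]], requires_grad=True)  # toy encoder output
z_q = torch.tensor([[1.0, 1.0]], requires_grad=True)  # toy selected embedding
loss = vq_losses(z_e, z_q)
loss.backward()
print(loss.item(), z_e.grad, z_q.grad)
```

The two `detach()` calls decide which side each term updates: the first term only produces gradients for the embedding, the second only for the encoder.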

In [1]:
import os
import subprocess

import math

import matplotlib.pyplot as plt
import numpy as np
import pickle

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

import random
import fastdtw
from wavenet_vocoder.wavenet import WaveNet
from wavenet_vocoder.wavenet import receptive_field_size
#from vq import VectorQuantizerEMA

In [2]:
import easydict
args = easydict.EasyDict({
    "batch": 1,
    "epochs": 500,
    "training_data": './2_speaker/vctk_train.txt',
    "test_data": './2_speaker/vctk_test.txt',
#    "training_data": './vctk_train.txt',
#    "test_data": './vctk_test.txt',
#    "out": "result",
#    "resume": False,
    "load": 0,
    "load_mid" : 0,
    "seed": 1 })

In [3]:
device = torch.device("cuda")
#torch.cuda.set_device(0)
device

torch.manual_seed(args.seed)
np.random.seed(args.seed)
random.seed(args.seed)

In [4]:
with open(args.training_data, 'r') as f:
    data = f.read()
file = data.splitlines()
speaker_dic = {}
number_of_speakers = 0
for i in range (0, len(file)):
    if (file[i].split('/')[0] in speaker_dic):
        continue
    else :
        speaker_dic[file[i].split('/')[0]] = number_of_speakers
        number_of_speakers+=1
        

In [5]:
#TO DO: check that weight gets updated
class VectorQuantizerEMA(nn.Module):
    """Vector quantizer intended to use exponential moving averages (EMA)
    to update the embedding vectors instead of an auxiliary loss.
    This has the advantage that the embedding updates are independent of the choice of optimizer 
    for the encoder, decoder and other parts of the architecture.
    For most experiments the EMA version trains faster than the non-EMA version.
    NOTE: in the active forward() below the EMA update is commented out, so the
    embeddings are currently trained through the q_latent_loss term instead."""
    def __init__(self, num_embeddings, embedding_dim, commitment_cost, decay, epsilon=1e-5):
        super(VectorQuantizerEMA, self).__init__()
        
        self._embedding_dim = embedding_dim
        self._num_embeddings = num_embeddings
        
        self._embedding = nn.Embedding(self._num_embeddings, self._embedding_dim)
        #self._embedding.weight.data.normal_()
        self._embedding.weight.data.uniform_(-1./512, 1./512)
#        self._embedding.weight.data = torch.Tensor([0])
        #self._embedding.weight.data = torch.Tensor(np.zeros(()))
        self._commitment_cost = commitment_cost
        
        self.register_buffer('_ema_cluster_size', torch.zeros(num_embeddings))
        self._ema_w = nn.Parameter(torch.Tensor(num_embeddings, self._embedding_dim))
        self._ema_w.data.normal_()
        
        self._decay = decay
        self._epsilon = epsilon
    '''
    def forward(self, inputs):
        # convert inputs from BCL -> BLC
        inputs = inputs.permute(0, 2, 1).contiguous()
        input_shape = inputs.shape
        # Flatten input
        flat_input = inputs.view(-1, self._embedding_dim)     #[BL, C]
        if (self._embedding.weight.data == 0).all():
            self._embedding.weight.data = flat_input[-self._num_embeddings:].detach()
        # Calculate distances

        distances = (torch.sum(flat_input**2, dim=1, keepdim=True) 
                    + torch.sum(self._embedding.weight**2, dim=1)
                    - 2 * torch.matmul(flat_input, self._embedding.weight.t())) #[BL, num_embeddings]
        # Encoding
        encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1) #[BL, 1]
        encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings).to(device)# [BL, num_embeddings]
        encodings.scatter_(1, encoding_indices, 1)
        #print(encodings.shape) [250, 512]
        # Use EMA to update the embedding vectors
        if self.training:
            self._ema_cluster_size = self._ema_cluster_size * self._decay + \
                                     (1 - self._decay) * torch.sum(encodings, 0)
            #print(self._ema_cluster_size.shape) [512]
            n = torch.sum(self._ema_cluster_size)
            self._ema_cluster_size = (
                (self._ema_cluster_size + self._epsilon)
                / (n + self._num_embeddings * self._epsilon) * n)
            
            dw = torch.matmul(encodings.t(), flat_input)
            self._ema_w = nn.Parameter(self._ema_w * self._decay + (1 - self._decay) * dw)
            
            self._embedding.weight = nn.Parameter(self._ema_w / self._ema_cluster_size.unsqueeze(1))
        
        # Quantize and unflatten
        #encodings.shape = [BL, num_embeddings] , weight.shape=[num_embeddings, C]
        quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)

        
        # Loss
        e_latent_loss = torch.mean((quantized.detach() - inputs)**2)
        q_latent_loss = torch.mean((quantized - inputs.detach())**2)
#        print(q_latent_loss.item(), 0.25 * e_latent_loss.item())
        loss = q_latent_loss + self._commitment_cost * e_latent_loss
        
        quantized = inputs + (quantized - inputs).detach()
        avg_probs = torch.mean(encodings, dim=0)
        perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))
        # convert quantized from BLC -> BCL
        return loss, quantized.permute(0, 2, 1).contiguous(), perplexity
    '''
    
    def forward(self, inputs):
        # convert inputs from BCL -> BLC
        inputs = inputs.permute(0, 2, 1).contiguous()
        input_shape = inputs.shape
        # Flatten input
        flat_input = inputs.view(-1, self._embedding_dim)     #[BL, C]
        # Calculate distances
        # Broadcasts [BL, 1, C] against [num_embeddings, C] -> [BL, num_embeddings]
        distances = torch.norm(flat_input.unsqueeze(1) - self._embedding.weight, dim=2, p=2)
        # Encoding
        encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1) #[BL, 1]
        
        # Allocate on the same device as the input instead of relying on the global `device`
        encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)# [BL, num_embeddings]
        encodings.scatter_(1, encoding_indices, 1)
        #print(encodings.shape) [250, 512]

#         # Use EMA to update the embedding vectors
#         if self.training:
#             self._ema_cluster_size = self._ema_cluster_size * self._decay + \
#                                      (1 - self._decay) * torch.sum(encodings, 0)
#             #print(self._ema_cluster_size.shape) [512]
#             n = torch.sum(self._ema_cluster_size)
#             self._ema_cluster_size = (
#                 (self._ema_cluster_size + self._epsilon)
#                 / (n + self._num_embeddings * self._epsilon) * n)
            
#             dw = torch.matmul(encodings.t(), flat_input)
#             self._ema_w = nn.Parameter(self._ema_w * self._decay + (1 - self._decay) * dw)
            
#             self._embedding.weight = nn.Parameter(self._ema_w / self._ema_cluster_size.unsqueeze(1))

        # Quantize and unflatten
        #encodings.shape = [BL, num_embeddings] , weight.shape=[num_embeddings, C]
        quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)
        # Loss
        e_latent_loss = torch.mean((quantized.detach() - inputs)**2)
        q_latent_loss = torch.mean((quantized - inputs.detach())**2)
#        print(q_latent_loss.item(), 0.25 * e_latent_loss.item())
        loss = q_latent_loss + self._commitment_cost * e_latent_loss
        
        quantized = inputs + (quantized - inputs).detach()
        avg_probs = torch.mean(encodings, dim=0)
        perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))
        # same as torch.exp( entropy loss )
        
        # convert quantized from BLC -> BCL
        return loss, quantized.permute(0, 2, 1).contiguous(), perplexity
#    '''
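For reference, the EMA update that is commented out in the class above can be sketched as a standalone function. This is only an illustration under assumptions: `ema_codebook_update` is a hypothetical helper, and it updates plain tensors in place rather than re-wrapping them in `nn.Parameter` on every step. The arithmetic follows the commented-out code.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, ema_cluster_size, ema_w, flat_input, encodings,
                        decay=0.99, epsilon=1e-5):
    """In-place EMA update of a [K, D] codebook.

    flat_input: [N, D] encoder outputs
    encodings:  [N, K] one-hot assignments of each output to a code
    """
    K = codebook.shape[0]
    # Running count of how many vectors each code received
    ema_cluster_size.mul_(decay).add_(encodings.sum(0), alpha=1 - decay)
    # Laplace smoothing so unused codes do not cause a division by zero
    n = ema_cluster_size.sum()
    ema_cluster_size.copy_((ema_cluster_size + epsilon) / (n + K * epsilon) * n)
    # Running sum of the encoder outputs assigned to each code
    ema_w.mul_(decay).add_(encodings.t() @ flat_input, alpha=1 - decay)
    codebook.copy_(ema_w / ema_cluster_size.unsqueeze(1))

# Toy check: two inputs (1.0 and 3.0) both assigned to code 0
codebook = torch.zeros(2, 1)
cluster_size = torch.zeros(2)
ema_w = torch.zeros(2, 1)
flat_input = torch.tensor([[1.0], [3.0]])
encodings = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
ema_codebook_update(codebook, cluster_size, ema_w, flat_input, encodings, decay=0.5)
print(codebook)  # code 0 moves to roughly the mean of its inputs (about 2.0)
```

Keeping the statistics as plain tensors (for example registered buffers) means no new parameter objects are created during training, which makes it easier to check that the weights really get updated.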

In [6]:
# embedding_dim=1
# num_embeddings=2
# ema = VectorQuantizerEMA(embedding_dim=embedding_dim,
#                         num_embeddings=num_embeddings,
#                         commitment_cost=0.5,
#                         decay=0.99,
#                         device=device)
  
# ema.eval()
# print("is training", ema.training)
# inputs_np = np.random.randn(20, embedding_dim).astype(np.float32)
# print("inputs", inputs_np)
# inputs = torch.Tensor(inputs_np.reshape(1,embedding_dim,20))

# loss, vq_output, perplexity = ema(inputs)
# print("loss", loss)
# print("output", vq_output)
# # Output shape is correct
# assert vq_output.shape == inputs.shape
    
# #assert ema._embedding.weight.detach().numpy().shape == [embedding_dim, num_embeddings]
# # Check that each input was assigned to the embedding it is closest to.
# embeddings_np = ema._embedding.weight.detach().numpy().T
# distances = ((inputs_np**2).sum(axis=1, keepdims=True) -
#              2 * np.dot(inputs_np, embeddings_np) +
#              (embeddings_np**2).sum(axis=0, keepdims=True))
# closest_index = np.argmax(-distances, axis=1)

# print(closest_index)

## Encoder & Decoder Architecture

In [7]:
class Encoder(nn.Module):
    """Audio encoder
    The vq-vae paper says that the encoder has 6 strided convolutions with stride 2 and window-size 4.
    The number of channels and the nonlinearity are not specified in the paper. 
    I tried using ReLU, but it didn't work.
    Now I use tanh, hoping that this will keep my encoded values within the neighborhood of 0,
    so they do not drift too far away from the embedding vectors.
    """
    def __init__(self, encoding_channels, in_channels=256):
        super(Encoder,self).__init__()
        self._num_layers = 2 * len(encoding_channels)
        self._layers = nn.ModuleList()
        for out_channels in encoding_channels:
            self._layers.append(nn.Conv1d(in_channels=in_channels,
                                    out_channels=out_channels,
                                    stride=2,
                                    kernel_size=4,
                                    padding=0, 
                                        ))
            self._layers.append(nn.Tanh())
            in_channels = out_channels
        
    def forward(self, x):
        for i in range(self._num_layers):
            x = self._layers[i](x)
        return x
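To see how much the encoder shortens the signal, the output length of each unpadded strided convolution can be computed by hand. The helper below is a small sketch (`conv1d_out_length` is a hypothetical name) using the standard `Conv1d` length formula with dilation 1.

```python
def conv1d_out_length(length, kernel_size=4, stride=2, padding=0):
    """Output length of one nn.Conv1d layer (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

length = 16126          # segment length used in this notebook
for _ in range(6):      # six encoder layers
    length = conv1d_out_length(length)
print(length)  # 250
```

So one 16126-sample segment is encoded as 250 latent vectors, which matches the `[250, 512]` shape noted in the quantizer comments.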

In [8]:
class Model(nn.Module):
    def __init__(self,
                 encoding_channels,
                 num_embeddings, 
                 embedding_dim,
                 commitment_cost, 
                 layers,
                 stacks,
                 kernel_size,
                 decay=0):
        super(Model, self).__init__()       
        self._encoder = Encoder(encoding_channels=encoding_channels)
        #I tried adding batch normalization here, because:
        #the distribution of encoded values needs to be similar to the distribution of embedding vectors
        #otherwise we'll see "posterior collapse": all values will be assigned to the same embedding vector,
        #and stay that way (because vectors which do not get assigned anything do not get updated).
        #Batch normalization is a way to fix that. But it didn't work: model
        #reproduced voice correctly, but the words were completely wrong.
        #self._batch_norm = nn.BatchNorm1d(1)
        if decay > 0.0:
#             self._vq_vae = EMVectorQuantizerEMA(num_embeddings, embedding_dim, 
#                                               commitment_cost, decay, 100)
            self._vq_vae = VectorQuantizerEMA(num_embeddings, embedding_dim, 
                                               commitment_cost, decay)

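        else:
            # NOTE: VectorQuantizer (the non-EMA variant) is not defined or imported in this
            # notebook, so this branch raises a NameError unless decay > 0.
            self._vq_vae = VectorQuantizer(num_embeddings, embedding_dim,
                                           commitment_cost)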
        self._decoder = WaveNet(device, out_channels=256, #dimension of ohe mu-quantized signal
                                layers=layers, #like in original WaveNet
                                stacks=stacks,
                                residual_channels=512,
                                gate_channels=512,
                                skip_out_channels=512,
                                kernel_size=kernel_size, 
                                dropout=1 - 0.95,
                                cin_channels=embedding_dim, #local conditioning channels - on encoder output
                                gin_channels=number_of_speakers, #global conditioning channels - on speaker_id
                                n_speakers=number_of_speakers,
                                weight_normalization=False, 
                                upsample_conditional_features=True, 
                                decoding_channels=encoding_channels[::-1],
                                use_speaker_embedding=False
                               )
        self.recon_loss = torch.nn.CrossEntropyLoss()
        self.receptive_field = receptive_field_size(total_layers=layers, num_cycles=stacks, kernel_size=kernel_size)
#        self.mean = None
#        self.std = None
    def forward(self, x):
        audio, target, speaker_id = x
        assert len(audio.shape) == 3 # B x C x L 
        assert audio.shape[1] == 256
        z = self._encoder(audio)
        #normalize output - subtract mean, divide by standard deviation
        #without this, perplexity goes to 1 almost instantly
#         if self.mean is None:
#             self.mean = z.mean().detach()
#         if self.std is None:
#              self.std = z.std().detach()
#        z = z - self.mean
#        z = z / self.std
        vq_loss, quantized, perplexity = self._vq_vae(z)
#        assert z.shape == quantized.shape
#        print("audio.shape", audio.shape)
#        print("quantized.shape", quantized.shape)
        x_recon = self._decoder(audio, quantized, speaker_id, softmax=False)
        x_recon = x_recon[:, :, self.receptive_field:-1]
        recon_loss_value = self.recon_loss(x_recon, target[:, 1:])
        loss = recon_loss_value + vq_loss
        
        return loss, recon_loss_value, x_recon, perplexity
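The perplexity returned by the model measures how many embeddings are effectively in use. As a small sketch (the helper `codebook_perplexity` and the toy index tensors are assumptions for illustration), it equals 1 when every vector maps to the same code and `K` when all `K` codes are used equally often:

```python
import torch

def codebook_perplexity(indices, num_embeddings):
    """exp(entropy) of the empirical code usage."""
    counts = torch.bincount(indices, minlength=num_embeddings).float()
    avg_probs = counts / counts.sum()
    return torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))

collapsed = codebook_perplexity(torch.zeros(8, dtype=torch.long), 4)
uniform = codebook_perplexity(torch.tensor([0, 1, 2, 3, 0, 1, 2, 3]), 4)
print(collapsed.item(), uniform.item())  # approximately 1.0 and 4.0
```

A perplexity that drops to 1 during training is therefore a sign that the quantizer has collapsed onto a single embedding.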

## Train

In [9]:
num_training_updates = 39818
#vector quantizer parameters:
embedding_dim = 64 #dimension of each vector
encoding_channels = [512,512,512,512,512,embedding_dim]
num_embeddings = 512 #number of vectors
commitment_cost = 0.25

#wavenet parameters:
kernel_size=2
total_layers=30
num_cycles=3


decay = 0.99
#decay = 0

learning_rate = 1e-3
batch_size=1

In [10]:
receptive_field = receptive_field_size(total_layers=total_layers, num_cycles=num_cycles, kernel_size=kernel_size)
print(receptive_field)

3070
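The value printed above can be reproduced by hand. The sketch below assumes the usual WaveNet layout, where dilations double within each cycle (1, 2, 4, ..., 512) and restart at 1 in the next cycle. `receptive_field` is a hypothetical re-derivation, not the library function itself.

```python
def receptive_field(total_layers, num_cycles, kernel_size):
    """Receptive field of a stack of dilated causal convolutions.

    Assumes dilations 1, 2, 4, ... that restart at 1 in every cycle.
    """
    layers_per_cycle = total_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(total_layers)]
    return (kernel_size - 1) * sum(dilations) + 1

print(receptive_field(total_layers=30, num_cycles=3, kernel_size=2))  # 3070
```

With a segment of 16126 samples, dropping the first 3070 leaves 16126 - 3070 = 13056 target samples per segment.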


## Load data

In [11]:
model = Model(num_embeddings=num_embeddings,
              encoding_channels=encoding_channels,
              embedding_dim=embedding_dim, 
              commitment_cost=commitment_cost, 
              layers=total_layers,
              stacks=num_cycles,
              kernel_size=kernel_size,
              decay=decay).to(device)

In [12]:
optimizer = optim.Adam(model.parameters(), lr=1, amsgrad=False)

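# Linear decay of the learning rate from 1e-3 to 1e-6 over the first 500 epochs, constant afterwards.
# The optimizer's base lr is 1, so the value returned by lr_lambda is the learning rate itself.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda epoch: 1e-3 - (1e-3 - 1e-6) * min(epoch, 500) / 500)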

In [13]:
    '''
# data = VCTK("/gpfs/gpfs0/a.phan/Vika_voice_conversion/VCTK", receptive_field=receptive_field)
data = VCTK("./VCTK", receptive_field=receptive_field)
print(len(data))
indices = np.arange(len(data))
test_size = len(data) // 10

train_indices = indices[:-test_size]
test_indices = indices[-test_size:]

training_loader = DataLoader(data, 
                           batch_size=1,
                           shuffle=False, 
                           num_workers=1,
                           sampler=SubsetSequentialSampler(train_indices))

validation_loader = DataLoader(data, 
                           batch_size=1,
                           shuffle=False, 
                           num_workers=1,
                           sampler=SubsetSequentialSampler(test_indices))
    '''


In [14]:
import librosa

In [15]:
class TrainingSet(Dataset):
    # VCTK-Corpus Training data set

    def __init__(self, num_speakers,
                 receptive_field,
                 segment_length=16126,
                 chunk_size=1000,
                 classes=256):
        
        self.x_list = self.read_files(args.training_data)
        self.classes = 256
        self.segment_length = segment_length
        self.chunk_size = chunk_size
        self.classes = classes
        self.receptive_field = receptive_field
        self.cached_pt = 0
        self.num_speakers = num_speakers

    def read_files(self, filename):
        print("training data from " + filename)
        with open(filename) as file:
            files = file.readlines()
        return [f.strip() for f in files]

    def __getitem__(self, index):
        audiofile = './VCTK/wav48/' + self.x_list[index]
        try:
            audio, sr = librosa.load(audiofile)
        except Exception as e:
            print(e, audiofile)
            raise
        if sr != 22050:
            raise ValueError("{} SR of {} not equal to 22050".format(sr, audiofile))

        # If the clip is shorter than one segment, fall through to the next file
        # (wrapping around at the end of the list)
        while audio.shape[0] < self.segment_length:
            index = (index + 1) % len(self.x_list)
            audiofile = './VCTK/wav48/' + self.x_list[index]
            audio, sr = librosa.load(audiofile)

        # Normalize and quantize after the final file is chosen, so replacement clips are processed too
        audio = librosa.util.normalize(audio) #divide max(abs(audio))
        audio = self.quantize_data(audio, self.classes)

        max_audio_start = audio.shape[0] - self.segment_length
        audio_start = random.randint(0, max_audio_start)
        audio = audio[audio_start:audio_start+self.segment_length]
        
        #divide into input and target
        audio = torch.from_numpy(audio)
        ohe_audio = torch.FloatTensor(self.classes, self.segment_length).zero_()
        ohe_audio.scatter_(0, audio.unsqueeze(0), 1.)
        target = audio[self.receptive_field:]
            
        # Speaker is taken from the file that was actually used
        speaker_index = speaker_dic[self.x_list[index].split('/')[0]]
        speaker_id = torch.from_numpy(np.array(speaker_index)).unsqueeze(0).unsqueeze(0)
        ohe_speaker = torch.FloatTensor(self.num_speakers, 1).zero_()
        ohe_speaker.scatter_(0, speaker_id, 1.)
        
        return ohe_audio, target, ohe_speaker
    
    def __len__(self):
        return len(self.x_list)
    
    def quantize_data(self, data, classes):
        mu_x = self.mu_law_encode(data, classes)
        bins = np.linspace(-1, 1, classes)
        quantized = np.digitize(mu_x, bins) - 1
        return quantized

    def mu_law_encode(self, data, mu):
        mu_x = np.sign(data) * np.log(1 + mu * np.abs(data)) / np.log(mu + 1)
        return mu_x

In [16]:
class TestSet(Dataset):
    # VCTK-Corpus Test data set


    def __init__(self, num_speakers,
                 receptive_field,
                 segment_length=16126,
                 chunk_size=1000,
                 classes=256):
        
        
        self.x_list = self.read_files(args.test_data)
        self.classes = 256
        self.segment_length = segment_length
        self.chunk_size = chunk_size
        self.classes = classes
        self.receptive_field = receptive_field
        self.cached_pt = 0
        self.num_speakers = num_speakers


    def read_files(self, filename):
        print("test data from " + filename)
        with open(filename) as file:
            files = file.readlines()
        return [f.strip() for f in files]

    def __getitem__(self, index):
        audiofile = './VCTK/wav48/' + self.x_list[index]
        try:
            audio, sr = librosa.load(audiofile)
        except Exception as e:
            print(e, audiofile)
            raise
        if sr != 22050:
            raise ValueError("{} SR of {} not equal to 22050".format(sr, audiofile))

        # If the clip is shorter than one segment, fall through to the next file
        # (wrapping around at the end of the list)
        while audio.shape[0] < self.segment_length:
            index = (index + 1) % len(self.x_list)
            audiofile = './VCTK/wav48/' + self.x_list[index]
            audio, sr = librosa.load(audiofile)

        # Normalize and quantize after the final file is chosen, so replacement clips are processed too
        audio = librosa.util.normalize(audio) #divide max(abs(audio))
        audio = self.quantize_data(audio, self.classes)

        max_audio_start = audio.shape[0] - self.segment_length
        audio_start = random.randint(0, max_audio_start)
        audio = audio[audio_start:audio_start+self.segment_length]
        
        #divide into input and target
        audio = torch.from_numpy(audio)
        ohe_audio = torch.FloatTensor(self.classes, self.segment_length).zero_()
        ohe_audio.scatter_(0, audio.unsqueeze(0), 1.)
        target = audio[self.receptive_field:]
            
        # Speaker is taken from the file that was actually used
        speaker_index = speaker_dic[self.x_list[index].split('/')[0]]
        speaker_id = torch.from_numpy(np.array(speaker_index)).unsqueeze(0).unsqueeze(0)
        ohe_speaker = torch.FloatTensor(self.num_speakers, 1).zero_()
        ohe_speaker.scatter_(0, speaker_id, 1.)
        
        return ohe_audio, target, ohe_speaker

    def __len__(self):
        return len(self.x_list)
        
    def quantize_data(self, data, classes):
        mu_x = self.mu_law_encode(data, classes)
        bins = np.linspace(-1, 1, classes)
        quantized = np.digitize(mu_x, bins) - 1
        return quantized

    def mu_law_encode(self, data, mu):
        mu_x = np.sign(data) * np.log(1 + mu * np.abs(data)) / np.log(mu + 1)
        return mu_x

In [17]:
trainset = TrainingSet(number_of_speakers, receptive_field=receptive_field)
testset = TestSet(number_of_speakers, receptive_field=receptive_field)


training_loader = DataLoader(dataset = trainset,
                           batch_size=batch_size,
                           shuffle=True, 
                           num_workers=1)


validation_loader = DataLoader(dataset = testset,
                           batch_size=batch_size,
                           shuffle=True, 
                           num_workers=1)

training data from ./small_data/vctk_train.txt
training data from ./small_data/vctk_test.txt


In [18]:
train_res_recon_error = []
train_res_perplexity = []

In [19]:
def train():
    model.train()
    global train_res_recon_error
    global train_res_perplexity
    train_total_loss = []
    train_recon_error = []
    train_perplexity = []
    # with open("errors", "rb") as file:
    #     train_res_recon_error, train_res_perplexity = pickle.load(file)
# num_epochs = 1
# for epoch in range(num_epochs):
    iterator = iter(training_loader)
#     datas0 = []
#     datas1 = []
#     datas2 = []
    for i, data_train in enumerate(iterator):
        data_train = [data_train[0].to(device),
                     data_train[1].to(device),
                     data_train[2].to(device)
                     ]

#         datas0.append(data_train[0])
#         datas1.append(data_train[1])
#         datas2.append(data_train[2])
#         if (i+1) % batch_size == 0:
#             data = [torch.cat(datas0).to(device),
#                    torch.cat(datas1).to(device),
#                    torch.cat(datas2).to(device)]
        optimizer.zero_grad()
        loss, recon_error, data_recon, perplexity = model(data_train)
        loss.backward()
        optimizer.step()
        train_total_loss.append(loss.item())
        train_recon_error.append(recon_error.item())
        train_perplexity.append(perplexity.item())

        if (i+1) % (10 * batch_size) == 0:
            print('%d iterations' % (i+1))
            print('recon_error: %.3f' % np.mean(train_recon_error[-100:]))
            print('perplexity: %.3f' % np.mean(train_perplexity[-100:]))
            print()
    train_res_recon_error.extend(train_recon_error)
    train_res_perplexity.extend(train_perplexity)
    # Report this epoch's mean reconstruction error (not the mean over all epochs so far)
    return np.mean(train_total_loss), np.mean(train_recon_error)

In [20]:
def validation():
    model.eval()
    with torch.no_grad():
        test_total_loss = []
        test_res_recon_error = []
        # with open("errors", "rb") as file:
        #     train_res_recon_error, train_res_perplexity = pickle.load(file)
    # num_epochs = 1
    # for epoch in range(num_epochs):
        iterator = iter(validation_loader)
    #     datas0 = []
    #     datas1 = []
    #     datas2 = []
        for i, data_test in enumerate(iterator):
            data_test = [data_test[0].to(device),
                         data_test[1].to(device),
                         data_test[2].to(device)]
            
            loss, recon_error, data_recon, perplexity = model(data_test)

            test_total_loss.append(loss.item())
            test_res_recon_error.append(recon_error.item())

            if (i+1) % (10 * batch_size) == 0:
                print('%d iterations' % (i+1))
                print('recon_error: %.3f' % np.mean(test_res_recon_error[-100:]))
                print()
    return np.mean(test_total_loss), np.mean(test_res_recon_error)

In [21]:
from fastdtw import fastdtw

def conversion(original_wav, speaker):
    model.eval()
    with torch.no_grad():
        generated_file = np.array([])
        wav, sr = librosa.load(original_wav)
        speaker_index = speaker_dic[speaker]
        
        normalized = librosa.util.normalize(wav)
        quantized = quantize_data(normalized, 256)

        for i in range(0, len(wav), 16126):
            sample = quantized[i:i+16126]
            if (len(sample)!= 16126):
                sample = np.append(sample, np.zeros(16126 - len(sample)).astype(int))
                
            sample = torch.from_numpy(sample)
            ohe_audio = torch.FloatTensor(256, 16126).zero_()
            ohe_audio.scatter_(0, sample.unsqueeze(0), 1.)
            
            speaker_id = torch.from_numpy(np.array(speaker_index)).unsqueeze(0).unsqueeze(0)
            ohe_speaker = torch.FloatTensor(number_of_speakers, 1).zero_()
            ohe_speaker.scatter_(0, speaker_id, 1.)

            ohe_audio = ohe_audio.unsqueeze(0).to(device)
            ohe_speaker = ohe_speaker.unsqueeze(0).to(device)
            encoded = model._encoder(ohe_audio)
            _, valid_quantize, _ = model._vq_vae(encoded)
            
            valid_reconstructions = model._decoder.incremental_forward(ohe_audio[:,:,0:1], 
                                                               valid_quantize, 
                                                               ohe_speaker, 
                                                               T=16126)
            recon = valid_reconstructions.squeeze().argmax(dim=0).detach().cpu().numpy()
            mu_encoded = (recon + 1) / 128 - 1  # class index 0..255 -> mu-law value in [-1, 1]
            mu_decoded = mu_law_decode(mu_encoded, mu=256)  # invert the companding
            generated_file = np.append(generated_file, mu_decoded)
        # librosa.output.write_wav("generated.wav",  generated_file, sr=sr)
        return wav, generated_file, sr

In [22]:
def getMFCC(y, sr, n_mfcc = 24):
    data = librosa.feature.mfcc(y=y, sr = sr, n_fft=4096, n_mfcc=n_mfcc)
    return data

In [23]:
def cal_mcd(C, C_hat):
    if C.ndim==2:
        K = 10 * np.sqrt(2) / np.log(10)
        return K * np.mean(np.sqrt(np.sum((C - C_hat) ** 2, axis = 1)))
    elif C.ndim==1:
        K = 10 * np.sqrt(2) / np.log(10)
        return K * np.mean(np.sqrt(np.sum((C - C_hat) ** 2)))
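For reference, the quantity computed by `cal_mcd` in the two-dimensional case can be written out as follows, where $c_{t,d}$ and $\hat{c}_{t,d}$ denote the $d$-th MFCC coefficient of frame $t$ in `C` and `C_hat`, $T$ is the number of frames and $D$ the number of coefficients (these symbols are introduced here only for the formula):

$$
\mathrm{MCD} = \frac{10\sqrt{2}}{\ln 10}\,\frac{1}{T}\sum_{t=1}^{T}\sqrt{\sum_{d=1}^{D}\left(c_{t,d}-\hat{c}_{t,d}\right)^{2}}
$$

In the one-dimensional case the same expression is applied to a single frame, that is $T = 1$.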

In [24]:
def calculate_mcd(C, C_hat, sr):
    # C and C_hat are [n_mfcc, frames]; fastdtw aligns them frame by frame
    _, path = fastdtw(C.T, C_hat.T, dist=cal_mcd)
    path = np.array(path)  # [path_length, 2] pairs of (frame in C, frame in C_hat)
    # Use every step of the warping path; truncating it to max(frames) steps
    # would drop the end of the alignment whenever the path is longer
    fdtw_C = C.T[path[:, 0]]
    fdtw_C_hat = C_hat.T[path[:, 1]]
    mcd = cal_mcd(fdtw_C, fdtw_C_hat)
    return mcd

In [25]:
def calc_mcd_msd(conversion_list, data_path = './VCTK'):
    # Each entry holds (source wav path, target speaker, reference wav path).
    mcd = []
    for j in range(0, len(conversion_list)):
        original, converted, sr = conversion(original_wav = data_path + conversion_list[j][0].item(), speaker = conversion_list[j][1].item())
        reference, _ = librosa.load(data_path + conversion_list[j][2].item())
        C = getMFCC(reference, sr)
        C_hat = getMFCC(converted, sr)
        mcd.append(calculate_mcd(C, C_hat,sr))

    return  np.mean(mcd).item()

In [26]:
def mu_law_encode(data, mu):
    mu_x = np.sign(data) * np.log(1 + mu * np.abs(data)) / np.log(mu + 1)
    return mu_x

def mu_law_decode(mu_x, mu):
    data = np.sign(mu_x) * (1 / mu) * ((1 + mu) ** np.abs(mu_x) - 1)
    return data

def quantize_data(data, classes):
    mu_x = mu_law_encode(data, classes)
    bins = np.linspace(-1, 1, classes)
    quantized = np.digitize(mu_x, bins) - 1
    return quantized
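
The round trip through these helpers can be checked on a toy signal. This is a minimal sketch: the sine wave stands in for a normalized waveform, the two functions are copied from the cell above so the block runs on its own, and the `(q + 1) / 128 - 1` rescaling is the one used elsewhere in this notebook to map class indices back to `[-1, 1]`.

```python
import numpy as np

def mu_law_encode(data, mu):
    return np.sign(data) * np.log(1 + mu * np.abs(data)) / np.log(mu + 1)

def mu_law_decode(mu_x, mu):
    return np.sign(mu_x) * (1 / mu) * ((1 + mu) ** np.abs(mu_x) - 1)

# Toy signal in [-1, 1]; stands in for a normalized waveform.
t = np.linspace(0, 1, 2000)
signal = 0.9 * np.sin(2 * np.pi * 5 * t)

# Compand, then bucket into 256 classes as quantize_data does.
mu_x = mu_law_encode(signal, 256)
bins = np.linspace(-1, 1, 256)
quantized = np.digitize(mu_x, bins) - 1

# Map class indices back to [-1, 1], then undo the companding.
mu_rec = (quantized + 1) / 128 - 1
decoded = mu_law_decode(mu_rec, 256)

print(quantized.min(), quantized.max())
print(np.max(np.abs(decoded - signal)))
```

The quantization error stays within one bin width in the companded domain, so the decoded signal tracks the original closely, with the largest errors at the loudest samples.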

In [27]:
epochs = args.epochs
training_total_loss_per_epochs = []
training_reconstruction_errors_per_epochs = []
validation_total_loss_per_epochs = []
validation_reconstruction_errors_per_epochs = []

training_mcd_per_epochs = []
validation_mcd_per_epochs = []

lrs = []

if (args.load != 0):
    model.load_state_dict(torch.load("model_epoch"+str(args.load)))
    optimizer.load_state_dict(torch.load("optim_epoch"+str(args.load)))
    training_total_loss_per_epochs = np.load('training_total_loss_per_epochs'+str(args.load)+'.npy').tolist()
    training_reconstruction_errors_per_epochs = np.load('training_reconstruction_errors_per_epochs'+str(args.load)+'.npy').tolist()
    validation_total_loss_per_epochs = np.load('validation_total_loss_per_epochs'+str(args.load)+'.npy').tolist()
    validation_reconstruction_errors_per_epochs = np.load('validation_reconstruction_errors_per_epochs'+str(args.load)+'.npy').tolist()
    lrs = np.load('lrs.npy').tolist()  # keep it a list so lrs.append works below
    
    
if (args.load_mid != 0 and args.load == 0):
    # Resume from a mid-run checkpoint; args.load is 0 in this branch.
    model.load_state_dict(torch.load("model_epoch"+str(args.load_mid)))
    optimizer.load_state_dict(torch.load("optim_epoch"+str(args.load_mid)))


for i in range(1, epochs+1):
    print(str(i)+" epochs ==> training")
    total_loss, reconstruction_loss = train()
    training_total_loss_per_epochs.append(total_loss)
    training_reconstruction_errors_per_epochs.append(reconstruction_loss)
    
    print(str(i)+" epochs ==> validation")
    total_loss, reconstruction_loss = validation()
    validation_total_loss_per_epochs.append(total_loss)
    validation_reconstruction_errors_per_epochs.append(reconstruction_loss)

    
    if (i % 5 == 0):
        torch.save(model.state_dict(), "model_epoch"+str(i+args.load))
        torch.save(optimizer.state_dict(), "optim_epoch"+str(i+args.load))
        
    for param_group in optimizer.param_groups:
        lr = param_group['lr']
    lrs.append(lr)
    np.save('lrs.npy', lrs)
    np.save('training_total_loss_per_epochs'+str(args.epochs + args.load), np.array(training_total_loss_per_epochs))
    np.save('training_reconstruction_errors_per_epochs'+str(args.epochs + args.load), np.array(training_reconstruction_errors_per_epochs))
    np.save('validation_total_loss_per_epochs'+str(args.epochs + args.load), np.array(validation_total_loss_per_epochs))
    np.save('validation_reconstruction_errors_per_epochs'+str(args.epochs + args.load), np.array(validation_reconstruction_errors_per_epochs))
    scheduler.step()

1 epochs ==> training
1 epochs ==> validation
epoch: 500, lr=0.0010000000
2 epochs ==> training
2 epochs ==> validation


KeyboardInterrupt: 

## Plot Training Curves

In [None]:
from scipy.signal import convolve
smooth_loss = convolve(train_res_recon_error, [0.001] * 1000, mode='valid')

In [None]:
f = plt.figure(figsize=(16,8))
ax = f.add_subplot(1,2,1)
ax.plot(smooth_loss)
#ax.set_yscale('log')
ax.set_title('NMSE.')
ax.set_xlabel('iteration')

ax = f.add_subplot(1,2,2)
ax.plot(train_res_perplexity)
ax.set_title('Average codebook usage (perplexity).')
ax.set_xlabel('iteration')
#I accidentally removed loss for first 20000 iterations or so.

In [None]:
start = 51
print(training_reconstruction_errors_per_epochs)
end = len(training_reconstruction_errors_per_epochs)
x_axis = range(start, start + end)
plt.title("reconstruction loss")
plt.xlabel("epochs")
plt.xticks(np.arange(start, start + end, step=1))
plt.plot(x_axis, training_reconstruction_errors_per_epochs, label='train')
plt.plot(x_axis, validation_reconstruction_errors_per_epochs, label='test')
plt.legend()

In [None]:
# x_axis = range(11, 31)
# plt.title("mcd")
# plt.xlabel("epochs")
# plt.xticks(np.arange(11, 31, step=1))
# plt.plot(x_axis, training_mcd_per_epochs, label='train')
# plt.plot(x_axis, validation_mcd_per_epochs, label='test')
# plt.legend()

## View Reconstructions

In [None]:
# with open("errors", "wb") as file:
#     pickle.dump([train_res_recon_error, train_res_perplexity], file)

In [None]:
#test encoding-decoding
# import librosa
# audio, sr = librosa.load("../VCTK/wav48/p225/p225_001.wav")
# normalized = librosa.util.normalize(audio) #divide max(abs(audio))
# mu_x = data.mu_law_encode(normalized, 256)
# bins = np.linspace(-1, 1, 256)
# quantized = np.digitize(mu_x, bins) - 1
# plt.plot(quantized[:100])
# plt.show()

# mu_rec = (quantized + 1) / 128 - 1
# plt.plot(mu_rec[:1000], color='red')
# plt.plot(mu_x[:1000], color='blue')
# plt.show()

# decoded = data.mu_law_decode(mu_rec, 256)
# plt.plot(normalized[:1000], color='blue')
# plt.plot(decoded[:1000], color='red')
# plt.show()

In [None]:
import librosa

def generate(original_wav, speaker, filename = "generated.wav"):
    """Convert original_wav to the voice of `speaker` and write the result to `filename`."""
    chunk = 16126  # number of samples processed per pass
    model.eval()
    with torch.no_grad():
        generated_file = np.array([])
        wav, sr = librosa.load(original_wav)
        print(len(wav))
        speaker_index = speaker_dic[speaker]

        normalized = librosa.util.normalize(wav)
        quantized = quantize_data(normalized, 256)

        # The target speaker is the same for every chunk, so one-hot encode it once.
        speaker_id = torch.from_numpy(np.array(speaker_index)).unsqueeze(0).unsqueeze(0)
        ohe_speaker = torch.FloatTensor(number_of_speakers, 1).zero_()
        ohe_speaker.scatter_(0, speaker_id, 1.)
        ohe_speaker = ohe_speaker.unsqueeze(0).to(device)

        for i in range(0, len(wav), chunk):
            sample = quantized[i:i+chunk]
            if (len(sample) != chunk):
                # Pad the last chunk with class 0 up to the fixed length.
                sample = np.append(sample, np.zeros(chunk - len(sample)).astype(int))

            # One-hot encode the quantized audio: shape (1, 256, chunk).
            sample = torch.from_numpy(sample)
            ohe_audio = torch.FloatTensor(256, chunk).zero_()
            ohe_audio.scatter_(0, sample.unsqueeze(0), 1.)
            ohe_audio = ohe_audio.unsqueeze(0).to(device)

            encoded = model._encoder(ohe_audio)
            _, valid_quantize, _ = model._vq_vae(encoded)

            # Autoregressive decoding, seeded with the first input sample.
            valid_reconstructions = model._decoder.incremental_forward(ohe_audio[:,:,0:1], 
                                                               valid_quantize, 
                                                               ohe_speaker, 
                                                               T=chunk)

            recon = valid_reconstructions.squeeze().argmax(dim=0).detach().cpu().numpy()
            mu_encoded = (recon + 1) / 128 - 1
            mu_decoded = mu_law_decode(mu_encoded, mu=256)
            generated_file = np.append(generated_file, mu_decoded)
        print(len(generated_file))
        # librosa.output was removed in librosa 0.8; on newer versions use
        # soundfile.write(filename, generated_file, sr) instead.
        librosa.output.write_wav(filename, generated_file, sr=sr)


In [None]:
generate(original_wav = './VCTK/wav48/p225/p225_001.wav', speaker = 'p226', filename = "generated.wav")

In [None]:
model.eval()

original_wav = './VCTK/wav48/p225/p225_001.wav'
speaker_index = speaker_dic['p226']

wav, sr = librosa.load(original_wav)
print(len(wav))
speaker = np.eye(number_of_speakers)[speaker_index]
speaker = torch.tensor(speaker)
speaker = speaker.unsqueeze(0)

normalized = librosa.util.normalize(wav)
quantized = quantize_data(normalized, 256)

max_audio_start = quantized.shape[0] - 16126
if (max_audio_start > 0):
    audio_start = random.randint(0, max_audio_start)
    sample = quantized[audio_start:audio_start+16126]
else :
    sample = np.append(quantized, np.zeros(16126 - len(quantized))).astype(int)

sample = torch.from_numpy(sample)
ohe_audio = torch.FloatTensor(256, 16126).zero_()
ohe_audio.scatter_(0, sample.unsqueeze(0), 1.)

speaker_id = torch.from_numpy(np.array(speaker_index)).unsqueeze(0).unsqueeze(0)
ohe_speaker = torch.FloatTensor(number_of_speakers, 1).zero_()
ohe_speaker.scatter_(0, speaker_id, 1.)


valid_originals = ohe_audio.to(device).unsqueeze(0)
speaker_id = ohe_speaker.to(device).unsqueeze(0)

with torch.no_grad():
    encoded = model._encoder(valid_originals)

    _, valid_quantize, _ = model._vq_vae(encoded)
    #valid_reconstructions = model._decoder(valid_originals, valid_quantize, speaker_id) - this one works fine
    valid_reconstructions = model._decoder.incremental_forward(valid_originals[:,:,0:1], 
                                                               valid_quantize, 
                                                               speaker_id, 
                                                               T=16126)

In [None]:
plt.plot(valid_quantize[:,0,:].detach().cpu().numpy().ravel())
plt.show()

In [None]:
recon = valid_reconstructions.squeeze().argmax(dim=0).detach().cpu().numpy()
plt.plot(recon)
plt.show()

In [None]:
orig = valid_originals.squeeze().argmax(dim=0).detach().cpu().numpy()
#plt.plot(valid_quantize.detach().numpy().ravel())
plt.plot(orig)
plt.show()

In [None]:
recon = valid_reconstructions.squeeze().argmax(dim=0).detach().cpu().numpy()
mu_encoded = (recon + 1) / 128 - 1
mu_decoded = mu_law_decode(mu_encoded, mu=256)
plt.plot(mu_decoded[2000:4000])
plt.show()

In [None]:
recon = valid_originals.squeeze().argmax(dim=0).detach().cpu().numpy()
mu_encoded_orig = (recon + 1) / 128 - 1
mu_decoded_orig = mu_law_decode(mu_encoded_orig, mu=256)
plt.plot(mu_decoded_orig[2000:4000])
plt.show()

In [None]:
from IPython.display import Audio, display

In [None]:
display(Audio(mu_decoded, rate=22050))

In [None]:
display(Audio(mu_decoded_orig, rate=22050))

## View Embedding

In [None]:
embeddings = model._vq_vae._embedding.weight.data.cpu()
print(embeddings)
print(embeddings.shape)
# plt.plot(sorted(embeddings))
# plt.show()

In [None]:
from sklearn.decomposition import PCA
# Project the codebook vectors onto their first two principal components.
pca = PCA(n_components=2)
y = pca.fit_transform(embeddings.numpy())
plt.scatter(y[:,0],y[:,1])
plt.show()

In [None]:
print("done!")