#### <p style="text-align:right";> *Techniques Avancées d'Apprentissage - ENSAE ParisTech - 2017/2018*</p>  <p style="text-align:right";>Charles Dognin - Samuel Ritchie</p>

# <p style="text-align:center";><span style="color: #fb4141">Music generation with Variational Auto-Encoders</span></p> 

This notebook attempts to generate music (and, as a first step, images) with a particular type of generative model: Variational Auto-Encoders (VAEs). The method was first described by Diederik P. Kingma and Max Welling in *Auto-Encoding Variational Bayes* (https://arxiv.org/pdf/1312.6114.pdf) and, along with Generative Adversarial Networks (GANs), is today one of the leading approaches to data generation (text, image, music).

## 0. Import useful packages

In [3]:
import numpy as np
import IPython
from IPython.display import Image
import cv2
import matplotlib.pyplot as plt
from pathlib import Path
from jyquickhelper import add_notebook_menu
add_notebook_menu()

We use Keras with the TensorFlow backend, together with the layers, callbacks and optimizers needed below.

In [2]:
from keras import metrics
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, History
from keras.models import Sequential, Model
# Input is imported from keras.layers, its public location
from keras.layers import LSTM, Dense, Embedding, Bidirectional, Dropout, Conv1D, \
    MaxPooling1D, Flatten, BatchNormalization, LeakyReLU, Lambda, Input
from keras.optimizers import Adam, RMSprop
from keras import backend as K

Using TensorFlow backend.


## 1. A few theoretical aspects

We start with a few insights on generative models and VAEs. Please refer to the full report in the /Deliverables folder for more details.

A generative model describes how data is generated in terms of a probabilistic model. Two of the most commonly used and effective approaches are VAEs and GANs. The two methods differ fundamentally in their approach to density estimation: GANs aim at reaching a Nash equilibrium between a Generator and a Discriminator, while VAEs build on auto-encoder theory.

Traditional Auto-Encoders are models whose goal is to learn a compressed representation of the data, similarly to Principal Component Analysis. The autoencoder has two parts: an encoder and a decoder. The encoder maps the input to a "code" (also called the latent space), generally of lower dimension than the input so that only the most important information is kept; the decoder then reconstructs the input from that code. While the primary purpose of such models was dimensionality reduction, the rise of deep learning frameworks and the interest in generating data have made the autoencoder concept widely used for learning generative models of data.
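
As a minimal illustration of the encoder/decoder idea, the sketch below builds a *linear* autoencoder from PCA with NumPy. The synthetic data, the dimensions, and the helper names `encode` and `decode` are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 5-D that lie close to a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
x = latent_true @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data gives a linear encoder/decoder pair.
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:2]  # shape (2, 5): the two leading principal directions


def encode(batch):
    """Project inputs onto the 2-D code (the latent space)."""
    return (batch - mean) @ components.T


def decode(code):
    """Map a 2-D code back to the 5-D input space."""
    return code @ components + mean


code = encode(x)
x_rec = decode(code)
reconstruction_mse = np.mean((x - x_rec) ** 2)
print(code.shape, reconstruction_mse)
```

A neural autoencoder replaces the two linear maps with trained networks, but the structure (compress, then reconstruct) is the same.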

<img src="images/autoencoder.jpg" >
<figcaption>*Image source : https://blog.keras.io/building-autoencoders-in-keras.html*</figcaption>

Autoencoders have many applications, including dimensionality reduction, as already mentioned, and, most importantly, denoising, where the model recovers the relevant features of a noisy input signal. For generating data from the learned representation, however, they are extremely limited. The latent variable does not follow a tractable distribution; in other words, the latent space may not allow easy interpolation. More precisely, autoencoders are fine for *replicating* data (thanks to clusters in the latent space) but poor at *generating* new data, because the latent space may contain discontinuities, as illustrated by the latent-space comparison image at the end of this section.

Let $x$ be the data we want to model and $z$ the latent variable (of lower dimension). In what follows, we write $p_{\theta}(x|z)$ for the distribution modeled by the decoder network and $q_{\Phi}(z|x)$ for the one modeled by the encoder. Our goal is to learn the model parameters that maximize the likelihood of the training data:

$$ p_{\theta} (x) = \int p_{\theta} (z) p_{\theta} (x|z) dz$$
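
To make this integral concrete, here is a small sketch on a toy model where it can be checked exactly. The model ($p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$), the observation `x_obs` and the helper `normal_pdf` are all assumptions chosen for illustration.

```python
import numpy as np


def normal_pdf(v, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


rng = np.random.default_rng(0)
x_obs = 1.5

# Toy model (assumed): p(z) = N(0, 1) and p(x|z) = N(z, 1), hence p(x) = N(0, 2).
z_samples = rng.normal(size=100_000)

# Monte Carlo estimate of the integral: average p(x|z) over samples z ~ p(z).
p_x_mc = np.mean(normal_pdf(x_obs, z_samples, 1.0))
p_x_exact = normal_pdf(x_obs, 0.0, 2.0)
print(p_x_mc, p_x_exact)
```

In this one-dimensional case the estimate agrees with the closed form. The difficulty arises with high-dimensional latent spaces and neural-network decoders, where most samples drawn from the prior contribute almost nothing to the average.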

This problem is intractable, since the integral over the whole latent space cannot be computed exactly. Using variational inference methods (see report for details), one can lower-bound the log-likelihood by a combination of two terms:

$$ 
\log p_{\theta}(x) \geq \mathcal{L}(x, \theta, \Phi) = \mathbb{E}_{z \sim q_{\Phi}(z|x)} \left[ \log p_{\theta}(x|z) \right] - D_{KL} \left( q_{\Phi}(z|x), p_{\theta}(z)\right)
 $$
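
As a brief sketch of where this bound comes from (the full argument is in the report), one can multiply and divide by $q_{\Phi}(z|x)$ inside the integral and apply Jensen's inequality to the concave logarithm:

$$
\begin{aligned}
\log p_{\theta}(x) &= \log \int q_{\Phi}(z|x) \, \frac{p_{\theta}(z) \, p_{\theta}(x|z)}{q_{\Phi}(z|x)} \, dz \\
&\geq \int q_{\Phi}(z|x) \, \log \frac{p_{\theta}(z) \, p_{\theta}(x|z)}{q_{\Phi}(z|x)} \, dz \\
&= \mathbb{E}_{z \sim q_{\Phi}(z|x)} \left[ \log p_{\theta}(x|z) \right] - D_{KL} \left( q_{\Phi}(z|x), p_{\theta}(z)\right)
\end{aligned}
$$

The last step splits the logarithm: the $\log p_{\theta}(x|z)$ part gives the expectation, and the remaining $\log \left( q_{\Phi}(z|x) / p_{\theta}(z) \right)$ part, averaged under $q_{\Phi}(z|x)$, is by definition the Kullback-Leibler divergence.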

The variational lower bound combines two terms (the plain auto-encoder only had a reconstruction loss):
> -  the *reconstruction term*: the expected log-likelihood of the observed $x$ given a latent $z$ sampled from the encoder. It measures how well the decoder network $p_{\theta}(x|z)$ reconstructs the input
> -  the *regularization term*: the Kullback-Leibler divergence between the encoder distribution $q_{\Phi}(z|x)$ and the prior $p_{\theta}(z)$. In practice, a standard normal prior is used, so this term keeps the encoder's output close to a Gaussian (or to any other chosen prior)
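
When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the divergence term has a closed form; the same expression appears in the commented-out `kl_loss` line of the training code in section 2.2. The NumPy sketch below is only an illustration: the function name `kl_to_standard_normal` and the test values are assumptions.

```python
import numpy as np


def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(z_mean, diag(exp(z_log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1)


# The divergence vanishes when the encoder output equals the prior.
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))

# Shifting the mean by 1 in each of the 3 dimensions costs 0.5 per dimension.
kl_shifted = kl_to_standard_normal(np.ones(3), np.zeros(3))
print(kl_zero, kl_shifted)
```

Working with the log-variance rather than the variance keeps the parameter unconstrained, which is convenient for the output of a `Dense` layer.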

<img src="images/latent_space_diff.png" >
<figcaption>*Image source : https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf*</figcaption>

The image above shows how VAEs let us sample new images from the latent space, rather than only replicate data as auto-encoders do.
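
The sketch below illustrates what "sampling from the latent space" means in practice: draw $z$ from the prior, or move along a line between two latent codes, and decode. The decoder here (`toy_decoder`, a fixed random linear map followed by a sigmoid) is a hypothetical stand-in for a trained network, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, output_dim = 2, 6

# Hypothetical stand-in for a trained decoder: fixed random linear map + sigmoid.
weights = rng.normal(size=(latent_dim, output_dim))


def toy_decoder(z):
    return 1.0 / (1.0 + np.exp(-(z @ weights)))


# 1) Generate new samples: draw z from the prior N(0, I) and decode.
z_new = rng.normal(size=(4, latent_dim))
generated = toy_decoder(z_new)

# 2) Interpolate: walk on a straight line between two latent codes, decoding each step.
z_start, z_end = z_new[0], z_new[1]
alphas = np.linspace(0.0, 1.0, 5)[:, None]
z_path = (1.0 - alphas) * z_start + alphas * z_end
interpolated = toy_decoder(z_path)
print(generated.shape, interpolated.shape)
```

With a trained VAE decoder, the regularization term is what makes intermediate points on such a path likely to decode into plausible samples.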

## 2. VAE Model

### 2.1. Signal Pre-Processing

#### 2.1.1 Image pre-processing

In [6]:
def load_images(folder, size=(20, 20)):
    """Load every .jpg in `folder` as a flattened float32 vector scaled to [0, 1]."""
    images = []
    for path in sorted(Path(folder).glob("*.jpg")):
        im = cv2.imread(str(path))
        if im is None:  # skip files that OpenCV cannot decode
            continue
        # OpenCV loads images as BGR; convert to RGB so that plt.imshow shows true colors
        im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
        im = cv2.resize(im.astype('float32'), size) / 255
        images.append(im.reshape(-1))  # 20 * 20 * 3 = 1200 values per image
    return np.array(images)

x_train = load_images("data/train")
x_val = load_images("data/val")


#### 2.1.2 Audio pre-processing

### 2.2 Training algorithm

In [None]:
# Try different activations, different depths, different layer types

class VAE:

    def __init__(self, batch_size, epochs, original_dim=1200, 
                 latent_dim=100, intermediate_dim=100):
        self.batch_size = batch_size
        self.epochs = epochs
        self.original_dim = original_dim
        self.latent_dim = latent_dim
        self.intermediate_dim = intermediate_dim
        self.encoder = self.make_encoder()
        self.sampling_layer = self.make_sampling_layer()
        self.decoder = self.make_decoder()
        self.vae_model = self.make_vae(self.sampling_layer, self.encoder, self.decoder)

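    def fit(self, x_train, x_val):
        """
        Train the VAE with the Adam optimizer on the negative variational lower bound
        (reconstruction loss + KL regularization).
        """
        
        early_stopping = EarlyStopping(monitor='val_loss',
                                       patience=3,
                                       mode='min')

        # min_lr must be below Adam's default learning rate (0.001),
        # otherwise the learning rate can never be reduced
        reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                                      patience=1, min_lr=1e-5)
        
        # Encoder outputs wired to the VAE input (same weights as in vae_model).
        # The second output is treated as a log-variance, as in the sampling layer.
        # Using these tensors inside the loss relies on graph-mode Keras.
        z_mean, z_log_var = self.encoder(self.vae_model.input)

        def vae_loss(x, x_decoded):
            # Reconstruction term: binary cross-entropy summed over all pixels
            xent_loss = self.original_dim * metrics.binary_crossentropy(x, x_decoded)
            # Regularization term: KL divergence between N(z_mean, exp(z_log_var)) and N(0, I)
            kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
            return K.mean(xent_loss + kl_loss)

        self.vae_model.compile(optimizer="Adam", loss=vae_loss)
        self.vae_model.fit(x_train, x_train,
                           validation_data=(x_val, x_val), 
                           epochs=self.epochs,
                           batch_size=self.batch_size,
                           callbacks=[reduce_lr, early_stopping])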
        
    def predict(self, z_test):
        """
        At test time, we evaluate the VAE's ability to generate a new sample. The encoder is not
        needed since there is no test image: we sample z from N(0, I) and pass it through the decoder.
        There are no good quantitative metrics, only visual assessment.
        """
        
        return self.decoder.predict(z_test)
     
    def make_encoder(self):
        """
        Map the input to the parameters of a Gaussian distribution: its mean and log-variance
        
        Returns:
        model -- the encoder model with the object as input and z_mean and z_sigma 
        as output (z_sigma holds the log-variance, see the sampling method)
        """
        
        enc_input = Input(shape=(self.original_dim,))
        # We can try without the intermediate dim, directly relating input to latent z
        x = Dense(self.intermediate_dim, activation='relu')(enc_input)
        self.z_mean = Dense(self.latent_dim)(x)
        self.z_sigma = Dense(self.latent_dim)(x)
        model = Model(enc_input, outputs = [self.z_mean, self.z_sigma])
        
        return model
    
    def make_decoder(self):
        """
        Decode the latent vector z back into the space of the original input
        
        Returns:
        model -- the decoder model with the latent variable as input and the original object as output
        """
        
        dec_input = Input(shape=(self.latent_dim,))
        x = Dense(self.intermediate_dim, activation='relu')(dec_input)
        x = Dense(self.original_dim, activation='sigmoid')(x)
        model = Model(dec_input, x)
        
        return model
    
    def sampling(self, params):
        """
        Draw the latent vector z from the mean and log-variance learned by the encoder.
        This is the re-parametrization trick. Instead of sampling z ~ N(z_mean, sigma^2) directly,
        we sample epsilon ~ N(0, I) and set z = z_mean + sigma * epsilon, so that z stays
        differentiable with respect to the encoder outputs.
        Arguments:
        z_mean -- the learned mean
        z_sigma -- the learned log-variance, so that sigma = exp(z_sigma / 2)
        """
        
        z_mean, z_sigma = params
        # Take the batch size from z_mean itself so that any batch size works
        epsilon = K.random_normal(shape=(K.shape(z_mean)[0], self.latent_dim),
                                  mean=0.0, stddev=1.0)
        z = z_mean + K.exp(z_sigma / 2) * epsilon

        return z
        
    def make_sampling_layer(self):
        sampling_layer = Lambda(self.sampling, output_shape=(self.latent_dim,))
        return sampling_layer
    
    def make_vae(self, sampling_layer, encoder, decoder):
        """
        Assemble the full variational auto-encoder: encoder -> sampling -> decoder
        """
        
        input_ = Input(shape=(self.original_dim,))
        z_mean, z_sigma = encoder(input_)
        z = sampling_layer([z_mean, z_sigma])
        self.x_decoded = decoder(z)
        model = Model(input_, self.x_decoded)
    
        return model  
    
vae = VAE(1, 50)
vae.fit(x_train, x_val)
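
The re-parametrization trick used in the `sampling` method can be checked in isolation. The NumPy sketch below mirrors the line `z = z_mean + K.exp(z_sigma / 2) * epsilon`; the function name `sample_latent` and the numerical values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + std * epsilon with epsilon ~ N(0, I)."""
    epsilon = rng.normal(size=z_mean.shape)  # noise shape follows the actual batch
    return z_mean + np.exp(0.5 * z_log_var) * epsilon


# A batch of 50,000 identical encoder outputs: mean 2, variance 0.25 (std 0.5).
z_mean = np.full((50_000, 1), 2.0)
z_log_var = np.full((50_000, 1), np.log(0.25))
z = sample_latent(z_mean, z_log_var, rng)
print(z.mean(), z.std())
```

The empirical mean and standard deviation of `z` should be close to 2 and 0.5. Because the randomness sits entirely in `epsilon`, gradients can flow through `z_mean` and `z_log_var` during training.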

In [None]:
z_test = np.random.randn(1,100)
pred = vae.predict(z_test)
pred = pred.reshape((20, 20, 3))

In [None]:
plt.imshow(x_train[0].reshape(20, 20, 3))
plt.show()

In [None]:
plt.imshow(pred)
plt.show()

## 3. Results