# Deep Learning
### Week 8: Normalising flows I: Autoregressive flows

## Contents

[1. Introduction](#introduction)

[2. Change of variables formula](#changeofvariables)

[3. Distributions (\*)](#distributions)

[4. Bijectors (\*)](#bijectors)

[5. Autoregressive flows](#arflows)

[6. Masked autoregressive flow (MAF) (\*)](#maf)

[References](#references)

<a class="anchor" id="introduction"></a>
## Introduction

So far in this module we have covered many of the fundamental building blocks of deep learning: from mathematical neurons and multilayer perceptrons, to optimisation and regularisation of deep learning models, and important architectures such as convolutional and recurrent neural networks.

In the remaining weeks of the module, we will use these building blocks to focus our attention on the probabilistic approach to deep learning. This is a branch of deep learning that aims to make use of tools from probability theory to account for noise and uncertainty in the data. Probabilistic deep learning models make direct use of probability distributions and latent random variables in the model architecture.

In this week of the course we will begin to familiarise ourselves with the [TensorFlow Probability](https://www.tensorflow.org/probability/) (TFP) library, which is built on TensorFlow to enable a closer integration between deep learning, probabilistic modelling and statistical analysis. In particular, we will learn about `Distribution` and `Bijector` objects in TFP.

These objects will provide the tools we need to develop normalising flow deep learning models. Normalising flows are a class of generative models that were first popularised in the context of variational inference by [Rezende & Mohamed 2015](#Rezende15), and in the context of density estimation by [Dinh et al 2015](#Dinh15). In this week, we will focus on using normalising flows to estimate continuous data distributions.

When trained as a density estimator, this type of model is able to produce new instances that could plausibly have come from the same dataset that it is trained on, as well as tell you whether a given example instance is likely. However, for complex datasets the data distribution can be very difficult to model, so this is a highly nontrivial task in general. This is where the power of deep learning models can be leveraged to learn highly multimodal and complicated data distributions, and this type of model has been successfully applied to domains such as image generation ([Ho et al 2019](#Ho19)), noise modelling ([Abdelhamed et al 2019](#Abdelhamed19)), audio synthesis ([Prenger et al 2019](#Prenger19)), and video generation ([Kumar et al 2019](#Kumar19)).

In this week, we will see how to construct autoregressive normalising flow architectures, including the masked autoregressive flow ([Papamakarios et al 2017](#Papamakarios17)).

<a class="anchor" id="changeofvariables"></a>
## Change of variables formula

The approach taken by normalising flows to solve the density estimation task is to take an initial, simple density, and transform it - possibly using a series of parameterised transformations - to produce a rich and complex distribution. 

If these transformations are smooth and invertible, then we are able to evaluate the density of the complex transformed distribution. This property is important, because it allows us to train such a model using maximum likelihood. This is the idea behind normalising flows. The invertible transformations themselves are what constitute the bijectors module in the TensorFlow Probability library.

We'll start this week by reviewing the change of variables formula, which forms the mathematical basis of normalising flows.

#### Statement of the formula
Let $Z := (z_1,\ldots,z_D)\in\mathbb{R}^D$ be a $D$-dimensional continuous random variable, and suppose that $f:\mathbb{R}^D\mapsto\mathbb{R}^D$ is a smooth, invertible transformation. Now consider the change of variables $X = f(Z)$, with $X=(x_1,\ldots,x_D)$, and denote the probability density functions of the random variables $Z$ and $X$ by $p_Z$ and $p_X$ respectively.

The change of variables formula states that

$$
p_X(\mathbf{x}) = p_Z(\mathbf{z})\cdot\left|\det J_f(\mathbf{z}) \right|^{-1},\label{cov_f}\tag{1}
$$

where $\mathbf{x}, \mathbf{z}\in\mathbb{R}^D$, and $J_f(\mathbf{z})\in\mathbb{R}^{D\times D}$ is the **Jacobian** of the transformation $f$, given by the matrix of partial derivatives

$$
J_f(\mathbf{z}) = \left[ 
\begin{array}{ccc}
\frac{\partial f_1}{\partial z_1}(\mathbf{z}) & \cdots & \frac{\partial f_1}{\partial z_D}(\mathbf{z})\\
\vdots & \ddots & \vdots\\
\frac{\partial f_D}{\partial z_1}(\mathbf{z}) & \cdots & \frac{\partial f_D}{\partial z_D}(\mathbf{z})\\
\end{array}
\right],
$$

and $\left|\det J_f(\mathbf{z}) \right|$ is the absolute value of the determinant of the Jacobian matrix. Note that (1) can also be written in the log-form

$$
\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \hspace{0.1ex}\left|\det J_f(\mathbf{z}) \right|. \label{logcov_f}\tag{2}
$$

Furthermore, we can equivalently consider the transformation $Z = f^{-1}(X)$. Then the change of variables formulae can be written as

$$
\begin{align}
p_Z(\mathbf{z}) &= p_X(\mathbf{x})\cdot\left|\det J_{f^{-1}}(\mathbf{x}) \right|^{-1},\label{cov_finv}\tag{3}\\
\log p_Z(\mathbf{z}) &= \log p_X(\mathbf{x}) - \log \hspace{0.1ex}\left|\det J_{f^{-1}}(\mathbf{x}) \right|.\label{logcov_finv}\tag{4}
\end{align}
$$

#### A simple example
We will demonstrate the change of variables formula with a simple example. Let $Z=(z_1, z_2)$ be a 2-dimensional random variable that is uniformly distributed on the unit square $[0, 1]^2 =: \Omega_Z$. We also define the transformation $f:\mathbb{R}^2 \mapsto \mathbb{R}^2$ as

$$
\begin{align}
f(z_1, z_2) = (\lambda z_1, \mu z_2)
\end{align}
$$

for some nonzero $\lambda, \mu\in\mathbb{R}$. The random variable $X=(x_1, x_2)$ is given by $X = f(Z)$. 

<img src="figures/change_of_variables.pdf" alt="Change of variables example in 2D" style="width: 750px;"/>
<center>Linearly transformed uniformly distributed random variable</center>

Since $\int_{\Omega_Z}p_Z(\mathbf{z})d\mathbf{z} = 1$ and $Z$ is uniformly distributed on $\Omega_Z$, we have that 

$$
p_Z(\mathbf{z}) = 1 \quad\text{for}\quad \mathbf{z}\in\Omega_Z.
$$

The random variable $X$ is uniformly distributed on the region $\Omega_X = f(\Omega_Z)$ as shown in the figure above (for the case $\lambda, \mu>0$). Since again $\int_{\Omega_X}p_X(\mathbf{x})d\mathbf{x} = 1$, the probability density function for $X$ must be given by 

$$
p_X(\mathbf{x}) = \frac{1}{|\Omega_X|} = \frac{1}{|\lambda\mu |}\quad\text{for}\quad \mathbf{x}\in\Omega_X.
$$

This result corresponds to the equations \eqref{cov_f}-\eqref{logcov_finv} above. In this simple example, the transformation $f$ is linear, and the Jacobian matrix is given by

$$
\begin{align}
J_f(\mathbf{z}) = \left[
\begin{array}{cc}
\lambda & 0\\
0 & \mu
\end{array}
\right].
\end{align}
$$

The absolute value of the determinant is $\left|\det J_f(\mathbf{z}) \right| = |\lambda\mu | \ne 0$. Equation \eqref{cov_f} then implies

$$
\begin{align}
p_X(\mathbf{x}) &= p_Z(\mathbf{z})\cdot\left|\det J_f(\mathbf{z}) \right|^{-1}\\
&= \frac{1}{|\lambda\mu|}.
\end{align}
$$

Writing in the log-form as in equation \eqref{logcov_f} gives

$$
\begin{align}
\log p_X(\mathbf{x}) &= \log p_Z(\mathbf{z}) - \log \hspace{0.1ex}\left|\det J_f(\mathbf{z}) \right|\\
&= \log (1) - \log |\lambda\mu|\\
&= - \log |\lambda\mu|.
\end{align}
$$
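
The example above can be checked numerically. The following NumPy sketch uses the arbitrarily chosen values $\lambda = 2$ and $\mu = -3$ (an assumption made purely for illustration). It evaluates the Jacobian determinant and compares the resulting density with a Monte Carlo estimate taken from a small box inside $\Omega_X$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 2.0, -3.0                      # illustrative values, not taken from the text

# The Jacobian of f(z1, z2) = (lam * z1, mu * z2) is constant and diagonal
J = np.array([[lam, 0.0],
              [0.0, mu]])
abs_det = abs(np.linalg.det(J))

# Change of variables: p_X(x) = p_Z(z) / |det J_f(z)|, with p_Z = 1 on the unit square
p_x = 1.0 / abs_det

# Monte Carlo check: the fraction of samples falling in a box B inside Omega_X
# should be approximately p_x * area(B)
z = rng.uniform(size=(200_000, 2))
x = z * np.array([lam, mu])
in_box = (x[:, 0] > 0.5) & (x[:, 0] < 1.5) & (x[:, 1] > -2.0) & (x[:, 1] < -1.0)
empirical_density = in_box.mean() / 1.0  # the box has unit area

print(p_x, empirical_density)
```

Because $\mu$ is negative here, the region $\Omega_X$ is reflected in the $x_2$ direction, which is why the absolute value of the determinant is needed.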

#### Sketch of proof in 1-D
We now provide a sketch of the proof of the change of variables formula in one dimension. Let $Z$ and $X$ be random variables such that $X = f(Z)$, where $f : \mathbb{R}\mapsto\mathbb{R}$ is a $C^k$ diffeomorphism with $k\ge 1$. The change of variables formula in one dimension can be written

$$
p_X(x) = p_Z(z)\cdot\left| \frac{d}{dz}f(z) \right|^{-1},\qquad\text{(cf. equation \eqref{cov_f})}
$$

or equivalently as

$$
p_X(x) = p_Z(z)\cdot\left| \frac{d}{dx}f^{-1}(x) \right|.\qquad\text{(cf. equation \eqref{cov_finv})}
$$

_Sketch of proof._ For the continuous function $f$ to be invertible, it must be strictly monotonic. That means that for all $z^{(1)}, z^{(2)}\in\mathbb{R}$ with $z^{(1)} < z^{(2)}$, we have $f(z^{(1)}) < f(z^{(2)})$ (strictly monotonically increasing) or $f(z^{(1)}) > f(z^{(2)})$ (strictly monotonically decreasing).

<img src="figures/change_of_variables_monotonic.pdf" alt="Monotonic functions" style="width: 600px;"/>
<center>Sketch of monotonic functions: (a) strictly increasing, (b) strictly decreasing</center>

Suppose first that $f$ is strictly increasing. Also let $F_X$ and $F_Z$ be the cumulative distribution functions of the random variables $X$ and $Z$ respectively. Then we have

$$
\begin{align}
F_X(x) &= P(X \le x)\\
&= P(f(Z) \le x)\\
&= P(Z \le f^{-1}(x))\qquad\text{(since $f$ is monotonically increasing)}\\
&= F_Z(f^{-1}(x))
\end{align}
$$

By differentiating on both sides with respect to $x$, we obtain the probability density function:

$$
\begin{align}
p_X(x) &= \frac{d}{dx}F_X(x)\\
&= \frac{d}{dx} F_Z(f^{-1}(x))\\
&= \frac{d}{dz}F_Z(z)\cdot\frac{d}{dx}f^{-1}(x)\\
&= p_Z(z)\frac{d}{dx}f^{-1}(x) \label{pdfx_inc}\tag{5}
\end{align}
$$

Now suppose instead that $f$ is strictly decreasing. Then

$$
\begin{align}
F_X(x) &= P(X \le x)\\
&= P(f(Z) \le x)\\
&= P(Z \ge f^{-1}(x))\qquad\text{(since $f$ is monotonically decreasing)}\\
&= 1 - F_Z(f^{-1}(x))
\end{align}
$$

Again differentiating on both sides with respect to $x$:

$$
\begin{align}
p_X(x) &= \frac{d}{dx}F_X(x)\\
&= -\frac{d}{dx} F_Z(f^{-1}(x))\\
&= -F_Z'(f^{-1}(x))\frac{d}{dx}f^{-1}(x)\\
&= -p_Z(z)\frac{d}{dx}f^{-1}(x) \label{pdfx_dec}\tag{6}
\end{align}
$$

Now note that the inverse of a strictly monotonically increasing (resp. decreasing) function is again strictly monotonically increasing (resp. decreasing). This implies that the quantity $\frac{d}{dx} f^{-1}(x)$ is positive in \eqref{pdfx_inc} and negative in \eqref{pdfx_dec}, and so these two equations can be combined into the single equation:

$$
p_X(x) = p_Z(z)\left|\frac{d}{dx}f^{-1}(x)\right|
$$

which completes the proof.
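
To see the role of the absolute value concretely, the following sketch applies the one-dimensional formula to a strictly decreasing map. The choice $f(z) = -2z + 1$ with $Z\sim\mathcal{N}(0, 1)$ is an assumption made for illustration; in this case $X\sim\mathcal{N}(1, 2^2)$, so the result can be compared with a known density.

```python
import numpy as np

def normal_pdf(t, mean=0.0, std=1.0):
    """Density of N(mean, std^2) evaluated at t."""
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

a, b = -2.0, 1.0                 # f(z) = a * z + b is strictly decreasing since a < 0

def f_inv(x):
    return (x - b) / a

x = np.linspace(-5.0, 7.0, 7)
dfinv_dx = 1.0 / a               # derivative of f^{-1}; negative for a decreasing f

# p_X(x) = p_Z(f^{-1}(x)) * |d f^{-1} / dx|
p_x = normal_pdf(f_inv(x)) * abs(dfinv_dx)

# X = a * Z + b with Z ~ N(0, 1) is distributed as N(b, a^2)
p_x_exact = normal_pdf(x, mean=b, std=abs(a))
print(np.allclose(p_x, p_x_exact))
```

Dropping the absolute value would give a negative "density" in this case, since `dfinv_dx` is negative.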

#### Application to normalising flows
Normalising flows are a class of models that exploit the change of variables formula to estimate an unknown target data density. 

Suppose we have data samples $\mathcal{D}:=\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}\}$, with each $\mathbf{x}^{(i)}\in\mathbb{R}^D$, and assume that these samples are generated i.i.d. from the underlying distribution $p_X$. 

A normalising flow models the distribution $p_X$ using a random variable $Z$ (also of dimension $D$) with a simple distribution $p_Z$ (e.g. an isotropic Gaussian), such that the random variable $X$ can be written as a change of variables $X = f_\theta(Z)$, where $\theta$ is a parameter vector that parameterises the smooth invertible function $f_\theta$. 

The function $f_\theta$ is modelled using a neural network with parameters $\theta$, which we want to learn from the data. An important point is that this neural network must be designed to be invertible, which is not the case in general with deep learning models. In practice, we often construct the neural network by composing multiple simpler blocks together. In TensorFlow Probability, these simpler blocks are the _bijectors_ that we will study in the first part of the week.

We use the principle of maximum likelihood to learn the optimal parameters $\theta$; that is:

$$
\begin{align}
\theta_{ML} &:= \arg \max_{\theta} P(\mathcal{D}; \theta)\\
&= \arg \max_{\theta} \log P(\mathcal{D}; \theta).
\end{align}
$$

In order to compute $\log P(\mathcal{D}; \theta)$ we can use the change of variables formula:

$$
\begin{align}
P(\mathcal{D}; \theta) &= \prod_{\mathbf{x}\in\mathcal{D}}  p_Z(f_\theta^{-1}(\mathbf{x})) \cdot\left|\hspace{0.1ex}\det J_{f_\theta^{-1}}(\mathbf{x}) \hspace{0.1ex}\right|\\
\log P(\mathcal{D}; \theta) &= \sum_{\mathbf{x}\in\mathcal{D}} \left[\log p_Z(f_\theta^{-1}(\mathbf{x})) + \log \hspace{0.1ex}\left|\hspace{0.1ex}\det J_{f_\theta^{-1}}(\mathbf{x}) \hspace{0.1ex}\right|\right]\label{logliknf}\tag{7}
\end{align}
$$

The term $p_Z(f_\theta^{-1}(\mathbf{x}))$ can be computed for a given data point $\mathbf{x}\in\mathcal{D}$ since the neural network $f_\theta$ is designed to be invertible, and the distribution $p_Z$ is known. The term $\det J_{f_\theta^{-1}}(\mathbf{x})$ is also computable, which highlights another important aspect of normalising flow models: they should be designed such that the determinant of the Jacobian can be computed efficiently.

The log-likelihood \eqref{logliknf} is usually optimised in minibatches, using gradient-based optimisation methods.
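
As a minimal sketch of this training procedure, consider a one-dimensional affine flow $x = e^{s} z + m$ with parameters $\theta = (m, s)$. This toy model, the synthetic data, the learning rate and the hand-derived gradients are all assumptions made for illustration; in practice the gradients would come from automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=5000)    # synthetic dataset

def log_likelihood(x, m, s):
    """Per-example log p_X(x) for the flow x = exp(s) * z + m with z ~ N(0, 1)."""
    z = (x - m) * np.exp(-s)                        # f_theta^{-1}(x)
    log_pz = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    log_det_inv = -s                                # log |d f_theta^{-1} / dx|
    return log_pz + log_det_inv

m, s, lr = 0.0, 0.0, 0.05
for step in range(2000):
    batch = rng.choice(data, size=64)               # draw a minibatch
    z = (batch - m) * np.exp(-s)
    # Gradients of the mean negative log-likelihood, derived by hand
    grad_m = np.mean(-z * np.exp(-s))
    grad_s = np.mean(1.0 - z**2)
    m -= lr * grad_m
    s -= lr * grad_s

print(m, np.exp(s), log_likelihood(data, m, s).mean())
```

The learned shift and scale should end up close to the mean and standard deviation of the data, which is the maximum likelihood solution for this simple flow.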

<a class="anchor" id="distributions"></a>
## Distributions

In this section we will look at `Distribution` objects in TensorFlow Probability, which are naturally one of the fundamental building blocks of the library. The main operations that we'll be using are sampling, and computing log-probabilities. 

A key point in understanding the interface and behaviour of these objects is that they are designed to perform vectorised computations for efficiency. This means that single objects are able to handle batches of distributions, samples, and log-probability computations. We'll see that this means we have to think quite carefully about the shapes of Tensors that we're using, and what these shapes mean. 

In [None]:
import tensorflow as tf

We will use the following namespace for the TensorFlow Probability library.

In [None]:
import tensorflow_probability as tfp
tfd = tfp.distributions

#### Univariate distributions
We will first create some univariate distributions. There is a wide range of distributions available in the [distributions module](https://www.tensorflow.org/probability/api_docs/python/tfp/distributions), of which we will only be using a few.

In [None]:
# Create a univariate Normal distribution

normal = tfd.Normal(loc=0., scale=1.)
normal

In [None]:
# Sample from the distribution

normal.sample()

In [None]:
# Draw multiple samples

normal.sample(5)

In [None]:
# We can pass a shape to the sample method

normal.sample((2, 3))

In [None]:
# Plot some samples

import matplotlib.pyplot as plt

z = normal.sample(10000).numpy()
plt.hist(z, bins=50, density=True)
plt.show()

In [None]:
# Compute prob / log-prob of test points

normal.log_prob(0)

In [None]:
# Compute prob / log-prob of a batch of test points

test_pts = tf.random.normal((3, 2))
normal.log_prob(test_pts)

A single `Distribution` object can represent a batch of distributions of the same type:

In [None]:
# Create an exponential distribution

exp = tfd.Exponential(rate=1)
exp

In [None]:
# Create a batched exponential distribution

batch_of_exps = tfd.Exponential(rate=[0.5, 1.0, 1.5])
batch_of_exps

In [None]:
# Sample from the distribution

batch_of_exps.sample()
batch_of_exps.sample(2)
batch_of_exps.sample((2, 1))

We can see a first use of broadcasting when computing log-probabilities with a batched distribution.

In [None]:
# Compute log-probs

test_pts = tf.random.uniform((4, 1))
batch_of_exps.log_prob(test_pts)

#### Multivariate distributions
In the distributions seen so far, the `event_shape` property has been empty, indicating that the distribution is univariate. Here, we look at multivariate distributions.

In [None]:
# Create a multivariate Gaussian distribution

mvn = tfd.MultivariateNormalDiag(loc=[0, 3], scale_diag=[1, 2])
mvn

In [None]:
# Sample from the distribution

mvn.sample(3)

In [None]:
# Plot samples from the multivariate Gaussian

samples = mvn.sample(1000)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5)
plt.axis('equal')
plt.show()

In [None]:
# Compute log-probs

test_pts = tf.random.normal((3, 2))
mvn.log_prob(test_pts)

We can also create a multivariate Gaussian using [`MultivariateNormalTriL`](https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalTriL), by passing in the lower triangular matrix $L$ such that $LL^T = \Sigma$, where $\Sigma$ is the covariance matrix. This is the Cholesky decomposition (see also [`tf.linalg.cholesky`](https://www.tensorflow.org/api_docs/python/tf/linalg/cholesky)).

In [None]:
# Construct a multivariate Gaussian with MultivariateNormalTriL

mu = [0., 0.]
scale_tril = [[1.,  0.],
              [0.6, 0.8]]
mvn2 = tfd.MultivariateNormalTriL(loc=mu, scale_tril=scale_tril)
mvn2

In [None]:
# Plot samples from the multivariate Gaussian

samples = mvn2.sample(1000)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5)
plt.axis('equal')
plt.show()

There are further ways of constructing a multivariate Gaussian: see the docs for [`MultivariateNormalDiagPlusLowRank`](https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalDiagPlusLowRank), [`MultivariateNormalFullCovariance`](https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalFullCovariance) and [`MultivariateNormalLinearOperator`](https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalLinearOperator).

Multivariate distributions can also be batched together, as in the following example.

In [None]:
# Create a batched multivariate Gaussian

batched_mvn = tfd.MultivariateNormalDiag(loc=[[0., -1., -0.5], [1., 0.5, 0.]], scale_diag=[0.5, 1.5, 1.])
batched_mvn

In [None]:
# The batch and event shape are properties of the distribution

print(batched_mvn.batch_shape)
print(batched_mvn.event_shape)

Of course we can also sample from a batched, multivariate distribution. The following shows the ordering of shapes that we should always keep in mind when working with `Distribution` objects:

`(sample_shape, batch_shape, event_shape)`

In [None]:
# Sample from the batched multivariate Gaussian

batched_mvn.sample(4)
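
The shape ordering can be made concrete with a plain NumPy stand-in for a diagonal Gaussian. The helper `mvn_diag_log_prob` below is hypothetical and is not TFP's implementation; it only illustrates that computing a log-probability reduces over the event axis and leaves the sample and batch axes intact.

```python
import numpy as np

def mvn_diag_log_prob(x, loc, scale_diag):
    """Log-density of a diagonal Gaussian; the last axis is the event axis."""
    z = (x - loc) / scale_diag                      # broadcasts over sample and batch axes
    elementwise = -0.5 * z**2 - np.log(scale_diag) - 0.5 * np.log(2.0 * np.pi)
    return elementwise.sum(axis=-1)                 # reduce over the event axis only

loc = np.array([[0., -1., -0.5], [1., 0.5, 0.]])    # batch_shape (2,), event_shape (3,)
scale_diag = np.array([0.5, 1.5, 1.])               # shared across the batch

rng = np.random.default_rng(0)
samples = loc + scale_diag * rng.standard_normal((4, 2, 3))   # sample_shape (4,)

print(samples.shape)                                         # (4, 2, 3)
print(mvn_diag_log_prob(samples, loc, scale_diag).shape)     # (4, 2)
```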

_Exercise._ Take a look at the following Distribution object and call to the `log_prob` method. Work out what the shape of the resulting Tensor will be before you run the cell.

In [None]:
mvn3 = tfd.MultivariateNormalDiag(loc=[[[2., 0., 0.5], [1., -0.5, 2.]]], scale_diag=[0.5, 1., 1.5])
test_pts = tf.random.normal((5, 1, 2, 1))
mvn3.log_prob(test_pts).shape

#### Independent distribution
The `Independent` distribution is often useful to manipulate batch and event shapes, and define multivariate distributions from univariate objects.

In [None]:
# Create a batched Bernoulli distribution

bernoulli = tfd.Bernoulli(probs=[[0.1, 0.3, 0.5], [0.4, 0.8, 0.7]])
bernoulli

In [None]:
# Transfer the second batch dimension into the event space

ind_bernoulli = tfd.Independent(bernoulli, reinterpreted_batch_ndims=1)
ind_bernoulli

In [None]:
# Compute log-probs on both distributions

import numpy as np

test_pts = np.random.choice([0, 1], (2, 3))
print(bernoulli.log_prob(test_pts))
print(ind_bernoulli.log_prob(test_pts))

In [None]:
# Transfer all batch dimensions into the event space

ind_bernoulli = tfd.Independent(bernoulli, reinterpreted_batch_ndims=2)
ind_bernoulli

In [None]:
# Compute log-probs with the new distribution

ind_bernoulli.log_prob(test_pts)
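
The effect of `reinterpreted_batch_ndims` can be mimicked in NumPy by summing elementwise log-probabilities over the axes that are moved into the event space. This is only a sketch of the arithmetic, with an arbitrarily chosen test point, and is not how TFP is implemented internally.

```python
import numpy as np

probs = np.array([[0.1, 0.3, 0.5], [0.4, 0.8, 0.7]])   # same values as the batched Bernoulli
x = np.array([[1, 0, 1], [0, 1, 1]])                    # an arbitrary test point

# Elementwise Bernoulli log-probabilities: shape (2, 3), one per batch member
elementwise = x * np.log(probs) + (1 - x) * np.log(1 - probs)

# reinterpreted_batch_ndims=1: the last batch axis becomes an event axis
ndims1 = elementwise.sum(axis=-1)                       # shape (2,)

# reinterpreted_batch_ndims=2: both batch axes become event axes
ndims2 = elementwise.sum(axis=(-2, -1))                 # scalar

print(elementwise.shape, ndims1.shape, ndims2.shape)
```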

_Exercise._ Construct a distribution object over a three-dimensional event space $(X_1, X_2, X_3)$, where the $X_i$ are independently distributed according to Bernoulli distributions in which the probability of a 0 event is equal to 0.9, 0.7 and 0.5 respectively. Use your distribution object to show that the log probability of the event $(X_1, X_2, X_3) = (1, 1, 1)$ is equal to -4.199705.

In [None]:
b = tfd.Independent(tfd.Bernoulli(probs=[0.1, 0.3, 0.5]), reinterpreted_batch_ndims=1)
b.log_prob(tf.ones(3))

<a class="anchor" id="bijectors"></a>
## Bijectors

In this section we will look at bijectors, which are another fundamental building block in TensorFlow Probability. Bijectors constitute the invertible and differentiable transformations that we will use to construct normalising flows. The [bijectors module](https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors) has a range of in-built bijector functions, which can be composed to make complex transformations.

In [None]:
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

Two simple bijectors are the `Scale` and `Shift` bijectors.

In [None]:
# Create Scale and Shift bijectors

scale = tfb.Scale(3.0)
shift = tfb.Shift(-5)

In [None]:
# Draw samples from a standard Normal distribution

normal = tfd.Normal(loc=0., scale=1.)
z = normal.sample(1000)

In [None]:
# Pass the samples through the forward method of each bijector

h = scale(z)
x = shift(h)
x.shape

In [None]:
# Plot the original and transformed samples

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(13, 4))

plt.subplot(1, 2, 1)
plt.hist(z.numpy(), bins=50)
plt.title("Original samples")

plt.subplot(1, 2, 2)
plt.hist(x.numpy(), bins=50)
plt.title("Transformed samples")
plt.show()

In [None]:
# Chain the bijectors together

scale_and_shift = tfb.Chain([shift, scale])
# Equivalent shorthand: calling a bijector on another bijector returns the chained bijector
scale_and_shift = shift(scale)

In [None]:
# Pass the transformed samples through the inverse method of the chained bijector

x_inv = scale_and_shift.inverse(x)
plt.hist(x_inv.numpy(), bins=50)
plt.show()
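
The order of the list passed to `Chain` is easy to get wrong: the last bijector in the list is applied first in the forward direction. The following sketch uses minimal stand-in classes (hypothetical, written in NumPy rather than TFP) to spell out this convention for the scale-then-shift example.

```python
import numpy as np

class Scale:
    def __init__(self, scale):
        self.scale = scale
    def forward(self, z):
        return self.scale * z
    def inverse(self, x):
        return x / self.scale

class Shift:
    def __init__(self, shift):
        self.shift = shift
    def forward(self, z):
        return z + self.shift
    def inverse(self, x):
        return x - self.shift

class Chain:
    """Composes bijectors; the LAST one in the list is applied first."""
    def __init__(self, bijectors):
        self.bijectors = bijectors
    def forward(self, z):
        for b in reversed(self.bijectors):
            z = b.forward(z)
        return z
    def inverse(self, x):
        for b in self.bijectors:
            x = b.inverse(x)
        return x

chain = Chain([Shift(-5.0), Scale(3.0)])     # forward computes x = 3 * z - 5
z = np.array([-1.0, 0.0, 2.0])
x = chain.forward(z)
print(x)                                      # values -8, -5 and 1
print(chain.inverse(x))                       # recovers z
```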

#### Batched bijectors

In [None]:
# Create a batched Softfloor bijector

softfloor = tfb.Softfloor(temperature=[0.01, 0.1, 1.])

In [None]:
# Pass some test points through the bijector

import numpy as np

test_pts = np.linspace(-2, 2, 1000)
transformed_pts = softfloor(test_pts[..., tf.newaxis])

In [None]:
# Plot the transformed samples

plt.figure(figsize=(12, 5))
plt.plot(test_pts, transformed_pts[:, 0], label='temperature=0.01')
plt.plot(test_pts, transformed_pts[:, 1], label='temperature=0.1')
plt.plot(test_pts, transformed_pts[:, 2], label='temperature=1.')
plt.legend()
plt.show()

#### Computing log-probs of transformed samples

In [None]:
# Create an Exp bijector

exp = tfb.Exp()

In [None]:
# Compute the exp of the standard Normal samples

x = exp.forward(z)

Recall the change of variables formula:

$$
\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \hspace{0.1ex}\left|\det J_f(\mathbf{z}) \right|
$$

In [None]:
# Use log_det_jacobian to compute log_probs of transformed samples

logprob_x = normal.log_prob(z) - exp.forward_log_det_jacobian(z, event_ndims=0)
logprob_x.shape

And using the inverse transformation:

$$
\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log \hspace{0.1ex}\left|\det J_{f^{-1}}(\mathbf{x}) \right|
$$

In [None]:
# Repeat the calculation with the inverse_log_det_jacobian

logprob_x2 = normal.log_prob(z) + exp.inverse_log_det_jacobian(x, event_ndims=0)
np.allclose(logprob_x, logprob_x2)
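
For the `Exp` bijector both log-det-Jacobian terms have simple closed forms, so the calculation can be reproduced by hand. The sketch below uses NumPy only; the closed-form log-normal density is included as an independent check.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)
x = np.exp(z)                                        # forward pass of f = exp

log_pz = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)    # standard Normal log-density

# Forward form: log|df/dz| = log(exp(z)) = z
logprob_fwd = log_pz - z
# Inverse form: log|d f^{-1}/dx| = log(1/x) = -log(x)
logprob_inv = log_pz + (-np.log(x))

# Closed-form log-normal log-density with loc=0 and scale=1
logprob_lognormal = -np.log(x) - 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(x)**2

print(np.allclose(logprob_fwd, logprob_inv),
      np.allclose(logprob_fwd, logprob_lognormal))
```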

#### The TransformedDistribution
The `TransformedDistribution` class provides a consistent API for distributions defined by bijectors and base distributions.

In [None]:
# Define the log-normal distribution with TransformedDistribution

transformed_normal = tfd.TransformedDistribution(normal, exp)

In [None]:
# Confirm the log-probs of the transformed samples are the same as under a log-normal distribution

np.allclose(transformed_normal.log_prob(x), tfd.LogNormal(loc=0., scale=1.).log_prob(x))

`TransformedDistribution` objects can also be defined by calling the bijector on the base distribution.

In [None]:
# Recreate the log-normal TransformedDistribution

lognormal = exp(normal)
lognormal

The `TransformedDistribution` infers the batch shape by broadcasting the batch shapes of the base distribution and the bijector.

In [None]:
# Create a TransformedDistribution from a batched bijector

softfloor = tfb.Softfloor(temperature=[0.01, 0.1, 1.])
normal = tfd.Normal(loc=0., scale=1.)
trans_dist = tfd.TransformedDistribution(normal, softfloor)
trans_dist

In [None]:
# Test the new TransformedDistribution

print(trans_dist.sample())
trans_dist.log_prob(tf.random.normal((3,)))

In [None]:
# Set a scaling lower triangular matrix

fill_tril = tfb.FillScaleTriL()
scale_tril = fill_tril(tf.random.normal((2, 6)))
scale_tril

In [None]:
# Define a bijector that operates on a rank >= 1 event space

scale_matvec_tril = tfb.ScaleMatvecTriL(scale_tril)

In [None]:
# Define TransformedDistribution with a batch and event shape

mv_normal = tfd.MultivariateNormalDiag(loc=[0., 0., 0.], scale_diag=[1., 1., 1.])
mvn_tril = tfd.TransformedDistribution(mv_normal, scale_matvec_tril)
mvn_tril

In [None]:
# Sample from the transformed distribution

mvn_tril.sample()
mvn_tril.log_prob(tf.random.normal((3,)))
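
The log-probability of a Gaussian transformed by a lower-triangular matrix can be verified by hand. The sketch below (NumPy only, with an arbitrarily chosen matrix $L$) compares the flow computation, which uses $\log|\det L| = \sum_i \log L_{ii}$, with the zero-mean Gaussian log-density with covariance $LL^T$ written out directly.

```python
import numpy as np

L = np.array([[1.0, 0.0],
              [0.6, 0.8]])                 # lower triangular with positive diagonal
cov = L @ L.T

def std_normal_log_prob(z):
    """Log-density of N(0, I); the last axis is the event axis."""
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2.0 * np.pi)

def mvn_log_prob(x, cov):
    """Zero-mean multivariate Gaussian log-density, written out directly."""
    d = x.shape[-1]
    quad = np.einsum('...i,ij,...j->...', x, np.linalg.inv(cov), x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))

# Flow view: x = L z, so z = L^{-1} x and log|det J_f| = sum(log(diag(L)))
z = np.linalg.solve(L, x.T).T
log_prob_flow = std_normal_log_prob(z) - np.sum(np.log(np.diag(L)))

print(np.allclose(log_prob_flow, mvn_log_prob(x, cov)))
```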

_Exercise._ Construct the distribution $\mathcal{N}(\mu, \Sigma)$, where $\mu = [0.5, -0.5]^T$ and $\Sigma = \left[\begin{array}{cc} 2 & 1\\ 1 & 2\end{array}\right]$, first using a `tfd.MultivariateNormalTriL` object, and then using a `tfd.TransformedDistribution` object with a zero-mean Gaussian with identity covariance matrix as a base distribution. Verify that the two representations are mathematically equivalent by computing log probs on a given sample.

<a class="anchor" id="arflows"></a>
## Autoregressive flows

Autoregressive networks are a class of density estimators that explicitly model the data distribution as the product of conditional distributions

$$
p(\mathbf{x}) = \prod_{i=1}^D p(x_i\mid x_{1:i-1})
$$

where $\mathbf{x}\in\mathbb{R}^D$ is a data example, and $x_{1:i-1}$ is the subvector of $\mathbf{x}$ consisting of all elements up to the $(i-1)$-th entry. Autoregressive models are a powerful class of generative model and several architectures have been developed, including the MADE ([Germain et al 2015](#Germain15)), PixelRNN ([van den Oord et al 2016a](#vandenOord16a)), PixelCNN ([van den Oord et al 2016b](#vandenOord16b)), NADE ([Larochelle & Murray 2011](#Larochelle11)), and WaveNet ([van den Oord et al 2016c](#vandenOord16c)).

An example is a model whose conditional density is a Gaussian:

$$
p(x_i \mid x_{1:i-1}) = \mathcal{N}(x_i \mid \mu_i, \sigma_i^2),\quad i=1,\ldots, D\label{gaussian_ar}\tag{8}
$$

where the means and standard deviations are provided by a neural network

$$
\left.
\begin{array}{rcl}
\mu_i \hspace{-1ex}&=& \hspace{-1ex}f_{\mu_i}(x_{1:i-1})\\
\sigma_i \hspace{-1ex}&=& \hspace{-1ex}f_{\sigma_i}(x_{1:i-1})
\end{array}
\quad
\right\}
\quad
i=1,\ldots, D.
$$

To sample from such a model, we sample a noise vector

$$
\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\qquad \mathbf{z}\in\mathbb{R}^D
$$

and recursively compute

$$
x_i = f_{\sigma_i}(x_{1:i-1})z_i + f_{\mu_i}(x_{1:i-1}),\qquad i=1,\ldots, D.\label{ar_forward}\tag{9}
$$

This procedure can be visualised in the following diagram.

<img src="figures/ar_flow_sampling.png" alt="Sampling in the autoregressive model" style="width: 500px;"/>
<center>Sampling $x_i$ sequentially with \eqref{ar_forward} in the autoregressive model \eqref{gaussian_ar}</center>

This recursive computation is a deterministic transformation of the underlying noise variable $\mathbf{z}$ and can be interpreted as a normalising flow:

$$
p(\mathbf{z}) \overset{f}{\Rightarrow} p(\mathbf{x})
$$

The forward transformation is given by \eqref{ar_forward}, where the $x_i$ are computed sequentially. Note that we need to enforce $f_{\sigma_i} > 0$ to ensure invertibility. In this case, the transformation can be inverted, and the inverse transformation is given by

$$
z_i = \frac{x_i - f_{\mu_i}(x_{1:i-1})}{f_{\sigma_i}(x_{1:i-1})},\qquad i=1,\ldots, D.\label{ar_inverse}\tag{10}
$$

There is an important practical difference between \eqref{ar_forward} and \eqref{ar_inverse}. In the forward pass \eqref{ar_forward}, each $x_i$ depends on all the elements $x_{1:i-1}$ and so we need to make $D$ passes through the network to sample a data point $\mathbf{x}\in\mathbb{R}^D$.

However, in the inverse pass \eqref{ar_inverse}, all the $x_i$ are known, so the calculations of the $z_i$ are independent of one another and the inverse transformation can be parallelised:

$$
\mathbf{z} = \frac{\mathbf{x} - f_{\mathbf{\mu}}(\mathbf{x})}{f_{\mathbf{\sigma}}(\mathbf{x})},\label{ar_inverse_parallel}\tag{11}
$$

where we denote $f_{\mathbf{\mu}}(\mathbf{x}):=\left(f_{\mu_i}(x_{1:i-1})\right)_{i\in\{1,\ldots,D\}}$ and $f_{\mathbf{\sigma}}(\mathbf{x}):=\left(f_{\sigma_i}(x_{1:i-1})\right)_{i\in\{1,\ldots,D\}}$.

This calculation can be visualised in the following diagram.

<img src="figures/ar_flow_inference.png" alt="Inference in the autoregressive model" style="width: 500px;"/>
<center>In the inverse transformation \eqref{ar_inverse} all $x_i$ are known and so each $z_i$ can be computed independently, enabling the parallelised computation \eqref{ar_inverse_parallel}</center>
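
The asymmetry between the two directions can be seen in a small NumPy sketch. The conditioner functions `f_mu` and `f_sigma` below are hypothetical stand-ins for the neural network; any functions of $x_{1:i-1}$ with a positive scale would serve equally well.

```python
import numpy as np

def f_mu(prefix):
    """Hypothetical conditioner for the mean; depends only on x_{1:i-1}."""
    return 0.5 * np.sum(prefix)

def f_sigma(prefix):
    """Hypothetical conditioner for the scale; always positive."""
    return np.exp(0.1 * np.sum(np.tanh(prefix)))

def forward(z):
    """Sequential sampling: x_i needs x_{1:i-1}, so the loop cannot be parallelised."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = f_sigma(x[:i]) * z[i] + f_mu(x[:i])
    return x

def inverse(x):
    """Each z_i uses only the known x, so the D terms can be computed independently."""
    return np.array([(x[i] - f_mu(x[:i])) / f_sigma(x[:i]) for i in range(len(x))])

rng = np.random.default_rng(0)
z = rng.standard_normal(4)
x = forward(z)
print(np.allclose(inverse(x), z))
```

In `forward`, iteration `i` cannot start until iteration `i - 1` has finished, whereas the list comprehension in `inverse` has no such dependency between its terms.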

#### The Jacobian determinant
Note that due to the autoregressive property, we have

$$
\frac{\partial f_{\mu_i}}{\partial x_j} = \frac{\partial f_{\sigma_i}}{\partial x_j} = 0,\qquad \text{for }j\ge i.
$$

This gives us

$$
\frac{\partial z_i}{\partial x_j} = 
\left\{
\begin{array}{ll}
\frac{1}{f_{\sigma_i}(x_{1:i-1})},&j=i\\
0,&j>i
\end{array}
\right.
$$

so the Jacobian matrix $\partial \mathbf{z} / \partial\mathbf{x} \in\mathbb{R}^{D\times D}$ is lower triangular, and the determinant can easily be calculated as the product of entries on the diagonal.

We can then apply the change of variables formula to derive the expression for the transformed log-density:

$$
\begin{align}
\log p(\mathbf{x}) &= \log p(\mathbf{z}) + \log\,\left|\,\det\frac{\partial\mathbf{z}}{\partial\mathbf{x}}\,\right|\\
&= \log p(\mathbf{z}) - \sum_{i=1}^D \log f_{\sigma_i}(x_{1:i-1})
\end{align}
$$
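
The triangular structure of the Jacobian can also be checked numerically. The sketch below uses the same kind of hypothetical conditioners as before and a central finite-difference Jacobian (an approximation chosen for illustration) to compare $\log|\det \partial\mathbf{z}/\partial\mathbf{x}|$ with $-\sum_i \log f_{\sigma_i}(x_{1:i-1})$.

```python
import numpy as np

def f_mu(prefix):
    return 0.5 * np.sum(prefix)                     # hypothetical conditioner

def f_sigma(prefix):
    return np.exp(0.1 * np.sum(np.tanh(prefix)))    # hypothetical, always positive

def inverse(x):
    return np.array([(x[i] - f_mu(x[:i])) / f_sigma(x[:i]) for i in range(len(x))])

def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian; column j holds d fn / d x_j."""
    D = len(x)
    J = np.zeros((D, D))
    for j in range(D):
        step = np.zeros(D)
        step[j] = eps
        J[:, j] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return J

x = np.array([0.3, -1.2, 0.8, 0.5])
J = numerical_jacobian(inverse, x)

log_det_numeric = np.log(abs(np.linalg.det(J)))
log_det_formula = -sum(np.log(f_sigma(x[:i])) for i in range(len(x)))

print(np.allclose(J, np.tril(J)), np.isclose(log_det_numeric, log_det_formula))
```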

#### Masked autoregressive flow (MAF)
An implementation of the conditional-Gaussian autoregressive model described above is the masked autoregressive flow (MAF) ([Papamakarios et al 2017](#Papamakarios17)), where the autoregressive functions $f_{\mu_i}(x_{1:i-1})$ and $f_{\sigma_i}(x_{1:i-1})$ are implemented using the masked autoencoder (MADE) architecture ([Germain et al 2015](#Germain15)).

The MADE is based on an autoencoder network, where the inputs and outputs are the same, and the network is trained to reconstruct the inputs. An example is the MLP depicted in the following diagram.

<img src="figures/unmasked_autoencoder.png" alt="Autoencoder model" style="width: 650px;"/>
<center>An autoencoder model, where the outputs are a reconstruction of the inputs</center>

The MADE turns the autoencoder into an autoregressive network by applying a masking pattern to the weights of the network. For example, in the autoencoder network shown above, the weight matrices $\mathbf{W}^{(0)}\in\mathbb{R}^{4\times3}$, $\mathbf{W}^{(1)}\in\mathbb{R}^{4\times4}$ and $\mathbf{W}^{(2)}\in\mathbb{R}^{3\times4}$ could be multiplied elementwise by the following binary masks respectively:

$$
\mathbf{M}^{(0)}=
\left(
\begin{array}{ccc}
0 & 1 & 1\\
0 & 1 & 0\\
0 & 1 & 1\\
0 & 1 & 1
\end{array}
\right),\quad
\mathbf{M}^{(1)}=
\left(
\begin{array}{cccc}
0 & 1 & 0 & 0\\
1 & 1 & 1 & 1\\
1 & 1 & 1 & 1\\
0 & 1 & 0 & 0
\end{array}
\right),\quad
\mathbf{M}^{(2)}=
\left(
\begin{array}{cccc}
1 & 1 & 1 & 1\\
0 & 0 & 0 & 0\\
1 & 0 & 0 & 1
\end{array}
\right)
$$

which would produce the following connectivity pattern:

<img src="figures/masked_autoencoder.png" alt="Masked autoencoder (MADE)" style="width: 650px;"/>
<center>A masked autoencoder (MADE) model, where selected weight connections are set to zero in order to produce the autoregressive property</center>

In the above figure, the red connections carry information that only depends on the input $x_2$, whereas the green connections carry information that depends on the inputs $x_2$ and $x_3$. Therefore the output $\hat{x}^{(3)}_2$ is independent of all the inputs (it only depends on the bias $b^{(3)}_2$), the output $\hat{x}^{(3)}_3$ depends only on the input $x_2$ (which will be known after a first pass through the network), and the output $\hat{x}^{(3)}_1$ depends on $x_2$ and $x_3$ (which will both be known after the second pass through the network).
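
The dependency structure described above can be read off directly from the masks. In the following NumPy sketch, the product of the three mask matrices counts the unmasked paths from each input to each output, so a zero entry means that the output cannot depend on that input.

```python
import numpy as np

M0 = np.array([[0, 1, 1],
               [0, 1, 0],
               [0, 1, 1],
               [0, 1, 1]])
M1 = np.array([[0, 1, 0, 0],
               [1, 1, 1, 1],
               [1, 1, 1, 1],
               [0, 1, 0, 0]])
M2 = np.array([[1, 1, 1, 1],
               [0, 0, 0, 0],
               [1, 0, 0, 1]])

# Entry (i, j) counts the unmasked paths from input x_j to output i
paths = M2 @ M1 @ M0
connectivity = (paths > 0).astype(int)
print(connectivity)

# Reordering inputs and outputs as (x_2, x_3, x_1) makes the pattern strictly
# lower triangular, which is the autoregressive property for that ordering
order = [1, 2, 0]
reordered = connectivity[np.ix_(order, order)]
print(np.array_equal(reordered, np.tril(reordered, k=-1)))
```

The first row of `connectivity` shows that the first output depends on $x_2$ and $x_3$, the second row that the second output depends on no inputs, and the third row that the third output depends only on $x_2$.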

In fact, the MADE outputs mean and standard deviation parameters for each output, and so the architecture is more precisely depicted in the following figure.

<img src="figures/made.png" alt="Masked autoencoder (MADE)" style="width: 750px;"/>
<center>The MADE network produces mean and standard deviation parameters for each output</center>

The forward pass of the masked autoregressive flow (MAF) is summarised in the following pseudocode. The forward input `z` is a length-$D$ vector sampled from a diagonal Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and `made` is a trained network that outputs two length-$D$ arrays `mu` and `sigma`.

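-------
> ```
> def forward(z):
>     D = z.shape[-1]          # dimensionality of the input
>     x = tf.zeros_like(z)     # initial guess; no element is correct yet
>     for _ in range(D):
>         mu, sigma = made(x)  # autoregressive in x
>         x = z * sigma + mu   # one more element of x becomes correct
>     return x
> ```
-------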

Notice that $D$ passes through the loop are required. At each pass, one more element of the array `x` is set to its correct value, and elements that are already correct are unaffected by subsequent passes.

Pseudocode for the inverse pass is given below. Here, the input `x` is a length-$D$ data example, and again the `made` network outputs two length-$D$ arrays `mu` and `sigma`.

-------
> ```
> def inverse(x):
>     mu, sigma = made(x)      # a single pass, since all of x is already known
>     return (x - mu) / sigma
> ```
-------

The inverse pass does not require any loops, and can be quickly computed in parallel.
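
To make the two passes concrete, here is a minimal NumPy sketch. The function `toy_made` is a hypothetical stand-in for a trained MADE (it is not part of TFP): it uses strictly lower-triangular weight matrices, so that `mu[i]` and `sigma[i]` depend only on `x[:i]`. The weights are random and untrained; the only aim is to check that the looped forward pass and the single-pass inverse undo each other.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Strictly lower-triangular weights give the autoregressive property:
# output i only sees inputs with index < i. (Hypothetical toy weights.)
W_mu = np.tril(rng.normal(size=(D, D)), k=-1)
W_s = np.tril(rng.normal(size=(D, D)), k=-1)

def toy_made(x):
    mu = W_mu @ x
    sigma = np.exp(np.tanh(W_s @ x))  # strictly positive scale
    return mu, sigma

def forward(z):
    x = np.zeros_like(z)
    for _ in range(D):            # D sequential passes
        mu, sigma = toy_made(x)
        x = z * sigma + mu
    return x

def inverse(x):
    mu, sigma = toy_made(x)       # one pass, all elements in parallel
    return (x - mu) / sigma

z = rng.normal(size=D)
x = forward(z)
print(np.allclose(inverse(x), z))
```

The check prints `True`: after $D$ passes every element of `x` is consistent with `z`, so a single call to `inverse` recovers the original sample.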

#### Inverse autoregressive flow (IAF)
The autoregressive flow presented above can trivially be inverted, so that the forward pass becomes

$$
\mathbf{x} = \frac{\mathbf{z} - f_{\mathbf{\mu}}(\mathbf{z})}{f_{\mathbf{\sigma}}(\mathbf{z})},
$$

and the inverse pass is given by the recursive relation

$$
z_i = f_{\sigma_i}(z_{1:i-1})x_i + f_{\mu_i}(z_{1:i-1}),\qquad i=1,\ldots, D.
$$

This inverse autoregressive flow (IAF) was proposed in [Kingma et al 2016](#Kingma16) to improve the posterior approximation in variational autoencoders. It was also used in [van den Oord et al 2018](#vandenOord18) as a proposed solution to the slow sequential sampling of audio in the [WaveNet](#vandenOord16c) architecture.
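
The two IAF equations above can be sketched in NumPy as follows. This assumes a hypothetical, untrained `toy_made` function standing in for $f_{\mathbf{\mu}}$ and $f_{\mathbf{\sigma}}$, built from strictly lower-triangular weights so that output $i$ depends only on $z_{1:i-1}$. Here the forward pass is a single evaluation, and it is the inverse pass that needs the $D$-step loop.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

# Hypothetical toy weights; strictly lower-triangular for autoregressivity
W_mu = np.tril(rng.normal(size=(D, D)), k=-1)
W_s = np.tril(rng.normal(size=(D, D)), k=-1)

def toy_made(z):
    return W_mu @ z, np.exp(np.tanh(W_s @ z))

def iaf_forward(z):
    mu, sigma = toy_made(z)       # one pass: z is fully known
    return (z - mu) / sigma

def iaf_inverse(x):
    z = np.zeros_like(x)
    for _ in range(D):            # z_i needs z_{1:i-1}, so D passes
        mu, sigma = toy_made(z)
        z = sigma * x + mu
    return z

z = rng.normal(size=D)
x = iaf_forward(z)
print(np.allclose(iaf_inverse(x), z))
```

Sampling therefore costs one network evaluation, whereas evaluating the density of a given data example requires the sequential inverse.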

<a class="anchor" id="maf"></a>
## Masked autoregressive flow (MAF)

The masked autoregressive flow is implemented in TensorFlow Probability with the `AutoregressiveNetwork` and `MaskedAutoregressiveFlow` bijectors.

In [None]:
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

The `AutoregressiveNetwork` implements the MADE network architecture. Despite being found in the `bijectors` module, this object does not represent a bijective transformation, or have the same methods as other bijectors.

In [None]:
# Create a MADE network with the AutoregressiveNetwork bijector

made = tfb.AutoregressiveNetwork(2, event_shape=[3], hidden_units=[16, 16], activation='sigmoid')

In [None]:
# Pass some dummy inputs through the MADE

inputs = tf.random.normal((1, 3))
made(inputs)

In [None]:
# Inspect the network variables

made.variables

The `AutoregressiveNetwork` instance can be used in the `MaskedAutoregressiveFlow` bijector.

In [None]:
# Create a MAF bijector

maf_bijector = tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made)

In [None]:
# The variables are automatically tracked by the MAF bijector

maf_bijector.variables

Note that the outputs of the MADE are used as shift and log-scale predictions in the MAF for improved numerical stability.
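
One way to see the benefit of this parametrisation is the following small NumPy sketch (made-up numbers, not the TFP internals). Treating the shift and log-scale as given, the scale $\exp(\text{log-scale})$ is positive for any real-valued network output, and the log-determinant of the Jacobian of the inverse pass is simply minus the sum of the log-scales.

```python
import numpy as np

# Hypothetical network outputs for a 3-dimensional event
shift = np.array([0.5, -1.0, 2.0])
log_scale = np.array([-2.0, 0.0, 3.0])   # any real value is admissible

z = np.array([0.1, -0.3, 1.2])

# Forward pass with the given shift and log-scale
x = z * np.exp(log_scale) + shift

# Inverse pass and its log-det-Jacobian
z_rec = (x - shift) * np.exp(-log_scale)
inverse_log_det_jacobian = -np.sum(log_scale)

print(np.allclose(z_rec, z), inverse_log_det_jacobian)
```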

In [None]:
# Test the bijector on a dummy input

maf_bijector.forward(tf.random.normal((3,)))

In [None]:
# An inverse autoregressive flow (IAF) can be created using the Invert bijector

iaf_bijector = tfb.Invert(maf_bijector)

In [None]:
# Define the transformed distribution

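# The base distribution needs event shape [3] to match the MADE network
normal = tfd.Sample(tfd.Normal(loc=0., scale=1.), sample_shape=[3])
maf = tfd.TransformedDistribution(normal, maf_bijector)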
maf

#### Toy dataset example
We will demonstrate the MAF by reproducing the first example from the paper ([Papamakarios et al 2017](#Papamakarios17)), where the target density is defined by

$$
p(x_1,x_2) = \mathcal{N}(x_2 \mid 0, 4)\,\mathcal{N}(x_1 \mid \frac{1}{4}x_2^2, 1).
$$

We will implement this target distribution using the `JointDistributionSequential` class.

In [None]:
# Define the target distribution

target = tfd.JointDistributionSequential([
    tfd.Normal(loc=0., scale=4.),
    lambda x2: tfd.Normal(loc=0.25*(x2**2), scale=1.)
])
target.sample(3)
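
As a cross-check, the same log-density can be written directly in NumPy. This is a sketch under the assumption, made in the code above, that the second argument of each $\mathcal{N}$ is a standard deviation (`scale`). The helper names `normal_logpdf` and `target_log_prob` are our own.

```python
import numpy as np

def normal_logpdf(x, loc, scale):
    """Log-density of a univariate Gaussian with standard deviation `scale`."""
    return -0.5 * np.log(2 * np.pi) - np.log(scale) - 0.5 * ((x - loc) / scale) ** 2

def target_log_prob(x1, x2):
    """log p(x1, x2) = log N(x2 | 0, 4) + log N(x1 | x2^2 / 4, 1)."""
    return normal_logpdf(x2, 0.0, 4.0) + normal_logpdf(x1, 0.25 * x2 ** 2, 1.0)

# At the origin both quadratic terms vanish, leaving only the normalising constants
print(np.isclose(target_log_prob(0.0, 0.0), -np.log(2 * np.pi) - np.log(4.0)))

# The density follows the curve x1 = x2^2 / 4, so (4, 4) is more likely than (4, 0)
print(target_log_prob(4.0, 4.0) > target_log_prob(4.0, 0.0))
```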

In [None]:
# Make a contour and scatter plot from the target density

import numpy as np
import matplotlib.pyplot as plt

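X, Y = np.meshgrid(np.linspace(-5, 20, 1000), np.linspace(-10, 10, 500))
X, Y = X.astype(np.float32), Y.astype(np.float32)  # match the float32 dtype of the target
Z = target.log_prob(Y, X)  # the target is ordered (x2, x1), so pass Y (x2) first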

fig = plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
levels = np.log(np.linspace(0.001, 0.05, 15))
plt.contour(X, Y, Z, levels, alpha=0.6, cmap='RdYlBu')
plt.title("Target log-density contour plot")
plt.colorbar()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

plt.subplot(1, 2, 2)
plt.title("Target density scatter plot")
samples = target.sample(500)
plt.scatter(samples[1], samples[0], alpha=0.5)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.show()

#### Create a training Dataset

In [None]:
# Define a generator for the dataset

batch_size = 128
def datagen():
    while True:
        batch = target.sample(batch_size)
        batch = tf.stack([batch[1], batch[0]], axis=-1)
        yield batch

In [None]:
# Use the generator to make a Dataset

dataset = tf.data.Dataset.from_generator(datagen, output_types=tf.float32)

In [None]:
# Inspect a dataset element

for elem in dataset.take(1):
    print(elem.shape)

#### Define and train the MAF

In [None]:
# Define the MAF

base = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 1.])
made = tfb.AutoregressiveNetwork(params=2, hidden_units=[10, 10], activation='relu')
maf_bijector = tfb.MaskedAutoregressiveFlow(made)
maf = tfd.TransformedDistribution(base, maf_bijector)

In [None]:
# Use a custom training loop to train the flow

epochs = 15
steps_per_epoch = 100
rmsprop = tf.keras.optimizers.RMSprop()

@tf.function
def train_step(inputs):
    with tf.GradientTape() as tape:
        nll = -tf.reduce_mean(maf.log_prob(inputs))
    grads = tape.gradient(nll, maf.trainable_variables)
    rmsprop.apply_gradients(zip(grads, maf.trainable_variables))
    return nll

for epoch in range(epochs):
    epoch_loss = tf.keras.metrics.Mean()
    for inputs in dataset.take(steps_per_epoch):
        batch_loss = train_step(inputs)
        epoch_loss.update_state(batch_loss)
    print("End of epoch {}, loss: {}".format(epoch + 1, epoch_loss.result()))
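
For reference, the quantity minimised by `train_step` can be written out using the change of variables formula. The notation here is our own shorthand: $f_\theta$ denotes the forward transformation $\mathbf{z}\mapsto\mathbf{x}$ of the MAF bijector with parameters $\theta$, $p_Z$ is the base density, and $\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(N)}$ is a batch of $N$ data examples.

$$
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\log p_X(\mathbf{x}^{(n)}) = -\frac{1}{N}\sum_{n=1}^{N}\left[\log p_Z\big(f_\theta^{-1}(\mathbf{x}^{(n)})\big) + \log\left|\det\frac{\partial f_\theta^{-1}}{\partial\mathbf{x}}(\mathbf{x}^{(n)})\right|\right]
$$

Only the inverse pass of the MAF appears in this expression, so training the MAF as a density estimator does not need the sequential forward loop.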

In [None]:
# Make a contour plot of the learned MAF density, and a scatter plot of the inverse-transformed target samples

X, Y = np.meshgrid(np.linspace(-5, 20, 1000), np.linspace(-10, 10, 500))
X, Y = X.astype(np.float32), Y.astype(np.float32)
Z = maf.log_prob(tf.stack([X, Y], axis=-1))

fig = plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
levels = np.log(np.linspace(0.001, 0.05, 15))
plt.contour(X, Y, Z, levels, alpha=0.6, cmap='RdYlBu')
plt.title("MAF log-density contour plot")
plt.colorbar()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

plt.subplot(1, 2, 2)
plt.title("MAF inverse transformation of target data")
target_samples = target.sample(500)
samples = maf_bijector.inverse(tf.stack([target_samples[1], target_samples[0]], axis=-1))
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5)
plt.xlabel('$z_1$')
plt.ylabel('$z_2$')
plt.show()

#### Multilayer MAF
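The above normalising flow is not yet powerful enough to fit the data. Each conditional distribution of a single MAF layer with a Gaussian base is unimodal by construction, whereas the target conditional $p(x_2 \mid x_1)$ is bimodal for larger values of $x_1$. However, we can make more flexible normalising flows by stacking multiple layers of MAF together.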

In [None]:
# Define the stack of bijectors
    
def get_maf_bijector():
    made = tfb.AutoregressiveNetwork(params=2, hidden_units=[10, 10], activation='relu')
    return tfb.MaskedAutoregressiveFlow(made)

num_layers = 3
maf_bijs = []
for _ in range(num_layers):
    maf_bijs.append(get_maf_bijector())
    maf_bijs.append(tfb.Permute([1, 0]))
    
maf_bijector = tfb.Chain(list(reversed(maf_bijs)))
maf = tfd.TransformedDistribution(base, maf_bijector)
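
The `reversed` call matters because `tfb.Chain` composes its bijectors like function composition: the last bijector in the list is applied first in the forward direction. The following pure-Python sketch mimics that ordering with plain functions (the helper `chain_forward` and the toy layers are hypothetical stand-ins, not TFP objects).

```python
def chain_forward(bijectors, z):
    """Apply a list of functions in the order used by tfb.Chain: last one first."""
    for bijector in reversed(bijectors):
        z = bijector(z)
    return z

applied = []  # records the order in which the toy layers run

def make_layer(name):
    def layer(z):
        applied.append(name)
        return z
    return layer

# Build the list in the same order as the loop above
layers = []
for i in range(2):
    layers.append(make_layer(f"maf_{i}"))
    layers.append(make_layer(f"permute_{i}"))

chain_forward(list(reversed(layers)), z=0.0)
print(applied)
```

With the reversed list, the forward pass visits the layers in the order in which they were appended, so a sample from the base distribution passes through the first MAF layer first.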

In [None]:
# Use a custom training loop to train the flow

epochs = 15
steps_per_epoch = 100
rmsprop = tf.keras.optimizers.RMSprop()

@tf.function
def train_step(inputs):
    with tf.GradientTape() as tape:
        nll = -tf.reduce_mean(maf.log_prob(inputs)) 
    grads = tape.gradient(nll, maf.trainable_variables)
    rmsprop.apply_gradients(zip(grads, maf.trainable_variables))
    return nll

for epoch in range(epochs):
    epoch_loss = tf.keras.metrics.Mean()
    for inputs in dataset.take(steps_per_epoch): 
        batch_loss = train_step(inputs)
        epoch_loss.update_state(batch_loss)
    print("End of epoch {}, loss: {}".format(epoch + 1, epoch_loss.result()))

In [None]:
# Make a contour plot of the learned MAF density, and a scatter plot of the inverse-transformed target samples

X, Y = np.meshgrid(np.linspace(-5, 20, 1000), np.linspace(-10, 10, 500))
X, Y = X.astype(np.float32), Y.astype(np.float32)
Z = maf.log_prob(tf.stack([X, Y], axis=-1))

fig = plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
levels = np.log(np.linspace(0.001, 0.05, 15))
plt.contour(X, Y, Z, levels, alpha=0.6, cmap='RdYlBu')
plt.title("MAF log-density contour plot")
plt.colorbar()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

plt.subplot(1, 2, 2)
plt.title("MAF inverse transformation of target data")
target_samples = target.sample(500)
samples = maf_bijector.inverse(tf.stack([target_samples[1], target_samples[0]], axis=-1))
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5)
plt.xlabel('$z_1$')
plt.ylabel('$z_2$')
plt.show()

_Exercise._ Try removing the `tfb.Permute` bijectors from the list of bijectors above and re-run the training. What happens to the learned distribution? Can you explain why?

<a class="anchor" id="references"></a>
## References

<a class="anchor" id="Abdelhamed19"></a>
* Abdelhamed, A., Brubaker, M. A. & Brown, M. S. (2019), "Noise flow: Noise modeling with conditional normalizing flows", in *Proceedings of the IEEE International Conference on Computer Vision*, 3165–3173.
<a class="anchor" id="Dinh15"></a>
* Dinh, L., Krueger, D. & Bengio, Y. (2015), "NICE: Non-linear Independent Components Estimation", in *3rd International Conference on Learning Representations (ICLR)*, San Diego, CA, USA, May 7-9, 2015.
<a class="anchor" id="Germain15"></a>
* Germain, M., Gregor, K., Murray, I. & Larochelle, H. (2015), "MADE: Masked Autoencoder for Distribution Estimation", *Proceedings of Machine Learning Research*, **37**, 881-889.
<a class="anchor" id="Ho19"></a>
* Ho, J., Chen, X., Srinivas, A., Duan, Y., & Abbeel, P. (2019), "Flow++: Improving flow-based generative models with variational dequantization and architecture design", in *Proceedings of the 36th International Conference on Machine Learning, ICML*.
<a class="anchor" id="Kingma16"></a>
* Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016), "Improved variational inference with inverse autoregressive flow", in *Advances in Neural Information Processing Systems*, **29**, 4743–4751.
<a class="anchor" id="Kumar19"></a>
* Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L. & Kingma, D. (2019), "VideoFlow: A Flow-Based Generative Model for Video", in *Workshop on Invertible Neural Nets and Normalizing Flows*, ICML, 2019.
<a class="anchor" id="Larochelle11"></a>
* Larochelle, H. & Murray, I. (2011), "The Neural Autoregressive Distribution Estimator", *Proceedings of Machine Learning Research*, **15**, 29-37.
<a class="anchor" id="Papamakarios17"></a>
* Papamakarios, G., Murray, I., & Pavlakou, T. (2017), "Masked autoregressive flow for density estimation", in Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (editors), *Advances in Neural Information Processing Systems*, **30**, 2335–2344.
<a class="anchor" id="Prenger19"></a>
* Prenger, R., Valle, R., & Catanzaro, B. (2019), "Waveglow: A flow-based generative network for speech synthesis", in *Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*, IEEE, 3617-3621.
<a class="anchor" id="Rezende15"></a>
* Rezende, D. & Mohamed, S. (2015), "Variational Inference with Normalizing Flows", in *Proceedings of Machine Learning Research*, **37**, 1530-1538.
<a class="anchor" id="vandenOord16a"></a>
* van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. (2016a), "Pixel Recurrent Neural Networks", *Proceedings of Machine Learning Research*, **48**, 1747-1756.
<a class="anchor" id="vandenOord16b"></a>
* van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O. & Graves, A. (2016b), "Conditional Image Generation with PixelCNN Decoders", *Advances in Neural Information Processing Systems*, **29**, 4790-4798.
<a class="anchor" id="vandenOord16c"></a>
* van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. & Kavukcuoglu, K. (2016c), "WaveNet: A Generative Model for Raw Audio", arXiv preprint, abs/1609.03499.
<a class="anchor" id="vandenOord18"></a>
* van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2018), "Parallel WaveNet: Fast high-fidelity speech synthesis", in *Proceedings of the 35th International Conference on Machine Learning*, **80**, 3918–3926.