# Inverse autoregressive flows

__Objective:__ implement an inverse autoregressive normalizing flow (NF) for 2-dimensional data, then see how moving along a trajectory in the "simple space" maps onto the "complicated space".

Source: [here](https://github.com/tensorchiefs/dl_book/blob/master/chapter_06/nb_ch06_04.ipynb) (section "Understanding the mixture")

**Idea:** define a NF parametrized by neural networks. For this to work, the flow (transformation) must be **invertible** and the determinant of its Jacobian must be cheap to compute. Here this is achieved by making the Jacobian **lower triangular**, with diagonal entries that can be computed quickly, so that the determinant is just the product of the diagonal entries.

As with any NF, we start from a probability density function $p_x$ in $d$ dimensions generating points $\mathbf{x}\in \mathbb{R}^d$, and we look for an invertible transformation $\mathbf{G}$ such that $\mathbf{x} = \mathbf{G}(\mathbf{z})$ (the "direct" direction), where $\mathbf{z}$ follows a distribution $p_z$ that has a simple form; the inverse $\mathbf{G}^{-1}$ then maps $p_x$ to $p_z$. The transformation law between the two densities is
$$
p_x(\mathbf{x}) = p_z(\mathbf{z})\, \left| \det\left( \left[ \frac{\partial G_i}{\partial z_j} \right] \right) \right|^{-1}, \qquad \mathbf{z} = \mathbf{G}^{-1}(\mathbf{x}),
$$
where the matrix $J\equiv \left[ \frac{\partial G_i}{\partial z_j} \right]$ is the Jacobian of the transformation.
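
As a quick numerical sanity check of the transformation law, the sketch below uses a one-dimensional affine map $x = G(z) = e^{a} z + b$ with a standard normal $p_z$, for which $p_x$ is known in closed form (a normal with mean $b$ and standard deviation $e^{a}$). The values of `a` and `b` and the helper names are illustrative assumptions, not part of the source notebook.

```python
import math


def normal_pdf(v, mean=0.0, std=1.0):
    """Density of a univariate normal distribution."""
    u = (v - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))


# Illustrative affine parameters (assumed values).
a, b = 0.7, -1.5


def G(z):
    """Forward map from the simple space to the complicated space."""
    return math.exp(a) * z + b


def G_inverse(x):
    """Inverse map, from the complicated space back to the simple one."""
    return (x - b) * math.exp(-a)


x = 0.3
z = G_inverse(x)

# Change of variables: p_x(x) = p_z(z) * |dG/dz|^{-1}, with dG/dz = exp(a).
p_x_flow = normal_pdf(z) / abs(math.exp(a))

# Closed form: x is normal with mean b and standard deviation exp(a).
p_x_exact = normal_pdf(x, mean=b, std=math.exp(a))

print(p_x_flow, p_x_exact)
```

The two printed values should agree up to floating-point rounding.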

Inverse autoregressive flows correspond to the following particular choice for the map $\mathbf{G}$:

$$
G_i = G_i(z_1, \ldots, z_i) = \exp\left( \alpha_i(z_1, \ldots, z_{i-1}) \right)\, z_i + b_i(z_1, \ldots, z_{i-1}),\quad i = 1, \ldots, d\,,
$$

with the "affine parameters" $\alpha_i$ and $b_i$ being given by neural networks and thus containing nonlinearities.
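
To make the structure of the map concrete, here is a minimal sketch for $d = 2$ in which the neural networks are replaced by simple hand-written functions (`alpha_2` and `b_2` are arbitrary assumptions chosen for illustration). It shows that the forward map computes every component directly from $\mathbf{z}$, while the inverse has to be computed one component at a time.

```python
import math


# Toy stand-ins for the neural networks (assumed, for illustration only).
# For i = 1 there is nothing to condition on, so the parameters are constants.
def alpha_1():
    return 0.2


def b_1():
    return -0.5


def alpha_2(z1):
    return math.tanh(z1)


def b_2(z1):
    return z1 ** 2


def forward(z):
    """x = G(z): every component can be computed independently."""
    z1, z2 = z
    x1 = math.exp(alpha_1()) * z1 + b_1()
    x2 = math.exp(alpha_2(z1)) * z2 + b_2(z1)
    return (x1, x2)


def inverse(x):
    """z = G^{-1}(x): z1 must be recovered before z2 can be."""
    x1, x2 = x
    z1 = (x1 - b_1()) * math.exp(-alpha_1())
    z2 = (x2 - b_2(z1)) * math.exp(-alpha_2(z1))
    return (z1, z2)


z = (0.8, -1.3)
x = forward(z)
z_back = inverse(x)
print(x, z_back)
```

Applying `inverse` to the output of `forward` should return the starting point up to rounding.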

Observations:
- $G_i$ doesn't depend on $z_j$ if $j>i$: this creates a lower triangular Jacobian.
- $G_i$ depends on $z_i$ only in an affine way: this gives diagonal entries of $J$ that are easy to compute.
- $G_i$ depends on the $z_j$ with $j<i$ only through $\alpha_i$ and $b_i$.
- The exponential is used so that the diagonal entries of $J$ are all positive, which gives an invertible transformation by construction.
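
The observations above can be checked numerically. The following sketch (with assumed toy choices for $\alpha_i$ and $b_i$) estimates the Jacobian of a 2-dimensional map of this form by finite differences, and compares the logarithm of its full determinant with the shortcut $\log|\det J| = \sum_i \alpha_i$ that follows from the triangular structure.

```python
import math


def G(z):
    """Toy 2-dimensional map of the form above (alpha_i, b_i are assumed)."""
    z1, z2 = z
    x1 = math.exp(0.2) * z1 - 0.5
    x2 = math.exp(math.tanh(z1)) * z2 + z1 ** 2
    return [x1, x2]


def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference estimate of J[i][j] = dG_i/dz_j."""
    d = len(z)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus, f_minus = f(z_plus), f(z_minus)
        for i in range(d):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return J


z = [0.8, -1.3]
J = numerical_jacobian(G, z)

# Determinant of a 2x2 matrix, computed in full.
det_full = J[0][0] * J[1][1] - J[0][1] * J[1][0]

# Shortcut for a triangular Jacobian: log|det J| = alpha_1 + alpha_2.
log_det_shortcut = 0.2 + math.tanh(z[0])

print(J)
print(math.log(abs(det_full)), log_det_shortcut)
```

The upper-right entry of `J` should vanish, and the two log-determinants should agree up to the finite-difference error.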

The above form for $\mathbf{G}$ is achieved by using "autoregressive networks", which are implemented in TensorFlow Probability.

TensorFlow Probability provides the following implementations:
- `tfp.bijectors.Bijector`: the base class for invertible transformations (the building block of a normalizing flow).
- `tfp.bijectors.MaskedAutoregressiveFlow`: the bijector implementing the full autoregressive flow ("masked" refers to the fact that the NNs producing the parameters achieve the autoregressive property via masking). Note that in its forward direction this bijector conditions the affine parameters of component $i$ on the previously generated outputs; wrapping it in `tfp.bijectors.Invert` gives the variant in which they are conditioned on the inputs, as in the formula above.
- `tfp.bijectors.AutoregressiveNetwork`: the masked NN used to parametrize the flow. Despite living in the `bijectors` module, it is a Keras layer rather than a bijector.
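
To illustrate what "masking" means, here is a minimal sketch of the MADE-style mask construction for a network with one hidden layer. It is an illustration under assumed degree assignments (the `hidden_degrees` below are arbitrary), not the actual code used inside `tfp.bijectors.AutoregressiveNetwork`. Multiplying the weight matrices elementwise by these masks guarantees that output $i$ can only depend on inputs $j < i$.

```python
def build_masks(d, hidden_degrees):
    """
    Build MADE-style binary masks for one hidden layer.

    Inputs and outputs carry degrees 1..d; each hidden unit carries a
    degree between 1 and d - 1.
    """
    degrees = list(range(1, d + 1))
    # A hidden unit with degree m may see inputs with degree <= m.
    mask_in = [[1 if m >= j else 0 for j in degrees] for m in hidden_degrees]
    # Output i may see hidden units with degree strictly smaller than i.
    mask_out = [[1 if i > m else 0 for m in hidden_degrees] for i in degrees]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Entry [i][j] is nonzero if output i can depend on input j."""
    d = len(mask_out)
    n_hidden = len(mask_in)
    return [
        [sum(mask_out[i][k] * mask_in[k][j] for k in range(n_hidden))
         for j in range(d)]
        for i in range(d)
    ]


mask_in, mask_out = build_masks(d=3, hidden_degrees=[1, 2, 1, 2])
C = connectivity(mask_in, mask_out)
for row in C:
    print(row)
```

The printed connectivity matrix is strictly lower triangular: the first output depends on no input, the second only on the first input, and so on.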

In [None]:
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import seaborn as sns

tfd = tfp.distributions
tfb = tfp.bijectors

sns.set_theme()

## Generate data

In [None]:
n_samples = 2500

x2_samples = tfd.Normal(loc=0., scale=4.).sample(n_samples)

x1_samples = tfd.Normal(loc=.25 * tf.square(x2_samples), scale=tf.ones(n_samples, dtype=tf.float32)).sample()

samples = tf.stack(
    [x1_samples, x2_samples],
    axis=1
) / 40.

samples
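
Since the data come from a known distribution, their exact density is available, and its differential entropy is the lowest expected negative log-likelihood any model can reach. The sketch below (pure Python, independent of the TensorFlow code; the seed and sample size are arbitrary assumptions) computes this reference value in closed form and cross-checks it with a Monte Carlo estimate. It can serve as a rough yardstick for the training loss of the flow.

```python
import math
import random


SCALE = 40.0  # the notebook divides both coordinates by 40


def true_log_prob(y1, y2):
    """Exact log-density of a rescaled sample (y1, y2) = (x1, x2) / 40."""
    x1, x2 = SCALE * y1, SCALE * y2
    # x2 ~ Normal(0, 4)
    log_p_x2 = -0.5 * (x2 / 4.0) ** 2 - math.log(4.0) - 0.5 * math.log(2.0 * math.pi)
    # x1 | x2 ~ Normal(0.25 * x2^2, 1)
    log_p_x1 = -0.5 * (x1 - 0.25 * x2 ** 2) ** 2 - 0.5 * math.log(2.0 * math.pi)
    # Rescaling both coordinates by 1/40 multiplies the density by 40^2.
    return log_p_x1 + log_p_x2 + 2.0 * math.log(SCALE)


# Entropy of the rescaled distribution = best achievable expected NLL.
entropy = (
    0.5 * math.log(2.0 * math.pi * math.e * 16.0)  # x2
    + 0.5 * math.log(2.0 * math.pi * math.e)       # x1 given x2
    - 2.0 * math.log(SCALE)                        # rescaling
)

# Monte Carlo estimate of the same quantity (seed chosen arbitrarily).
rng = random.Random(0)
n = 20000
total = 0.0
for _ in range(n):
    x2 = rng.gauss(0.0, 4.0)
    x1 = rng.gauss(0.25 * x2 ** 2, 1.0)
    total -= true_log_prob(x1 / SCALE, x2 / SCALE)
nll_estimate = total / n

print(entropy, nll_estimate)
```

The reference value is negative because dividing by 40 concentrates the density on a small region.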

In [None]:
fig = plt.figure(figsize=(14, 6))

sns.scatterplot(
    x=samples[:, 0],
    y=samples[:, 1],
)

plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Samples', fontsize=14)

## Build an inverse autoregressive flow

In [None]:
class MaskedARFlow(tf.keras.Model):
    """
    Subclass of Keras `Model` implementing an inverse autoregressive
    flow (using masked autoregressive networks).
    """
    def __init__(self, **kwargs):
        """
        Class constructor.
        """
        super().__init__(**kwargs)
        
        # List of the NNs giving the affine parameters.
        self.nets = []
        
        # List of all bijectors that will compose the
        # full one.
        bijectors = []
        
        # Number of layers, each one composed of a
        # `MaskedAutoregressiveFlow` and a `Permute`
        # bijector.
        n_layers = 5
        # Hidden width used by the original source code (see the
        # commented-out line below); currently unused.
        h = 32
        
        # Loop over the number of layers. Each layer is a block of bijectors
        # that gets added to the list.
        for i in range(n_layers):
            # Create the masked NN producing the affine parameters. It is a
            # Keras layer that can be passed to the `shift_and_log_scale_fn`
            # argument of the `MaskedAutoregressiveFlow` bijector.
            net = tfb.AutoregressiveNetwork(
                # Number of parameters generated for each dimension: here
                # the 2 affine parameters (the shift b and the log-scale
                # alpha).
                params=2,
                hidden_units=[4, 4]
            )
            # Original code in source: doesn't work as the trainable variables are not
            # tracked!
            # net = tfb.masked_autoregressive_default_template(hidden_layers=[h, h])
            
            # Create a `MaskedAutoregressiveFlow` bijector and append it to the
            # list of bijectors.
            bijectors.append(
                tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=net)
            )
            
            # Create a `Permute` bijector and append it to the list of bijectors.
            # The permutation passed to its constructor inverts the order of the
            # two dimensions so that no dimension is singled out in the procedure.
            # Note: this of course works only for 2 dimensions.
            bijectors.append(tfb.Permute([1, 0]))
            
            # Append the neural network to the list of the networks to keep track
            # of it.
            self.nets.append(net)
            
        # Create the full bijector defining the NF from the list of built bijectors
        # taken in reversed order. `tfb.Chain` applies the LAST bijector of its list
        # first in the forward direction, so after reversing, the forward pass runs
        # the blocks in the order they were built (flow, then permutation). The last
        # entry of the list is dropped first, so that the forward pass does not end
        # with a useless permutation.
        bijector = tfb.Chain(list(reversed(bijectors[:-1])))
        
        # Define the flow, i.e. the distribution object implementing the transformation
        # from a source simple distribution via the full bijector.
        self.flow = tfd.TransformedDistribution(
            distribution=tfd.MultivariateNormalDiag(loc=[0., 0.]),
            bijector=bijector
        )
        
    def call(self, *inputs):
        """
        Forward pass.
        """
        return self.flow.bijector.forward(*inputs)
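
The ordering logic used when building the chain can be checked without TensorFlow. In the sketch below, `chain_forward` and `make_step` are hypothetical stand-ins that only record the order in which the pieces are applied, following the `tfb.Chain` convention that the last bijector in the list acts first in the forward direction.

```python
def chain_forward(bijectors, x):
    """
    Apply a list of functions in the order used by `tfb.Chain`:
    the LAST element of the list acts first.
    """
    for f in reversed(bijectors):
        x = f(x)
    return x


# Hypothetical stand-ins that only record the order of application.
def make_step(name):
    return lambda trace: trace + [name]


n_layers = 3
built = []
for i in range(n_layers):
    built.append(make_step(f"maf_{i}"))
    built.append(make_step(f"permute_{i}"))

# Same recipe as in the class above: drop the last permutation, then reverse.
chain = list(reversed(built[:-1]))

order = chain_forward(chain, [])
print(order)
# → ['maf_0', 'permute_0', 'maf_1', 'permute_1', 'maf_2']
```

The forward pass therefore alternates flow and permutation in construction order and ends with a flow layer.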

In [None]:
# Sample the source distribution and transform the samples via an
# untrained flow. This should give nothing sensible!
# Note: every time the model is instantiated, the NN weights
#       in it are re-initialized and a different transformation
#       is obtained, so the model is instantiated only once here.
untrained_model = MaskedARFlow()
transformed_samples_untrained = untrained_model(
    untrained_model.flow.distribution.sample(n_samples)
)

fig = plt.figure(figsize=(14, 6))

sns.scatterplot(
    x=samples[:, 0].numpy(),
    y=samples[:, 1].numpy(),
    label='Original samples'
)

sns.scatterplot(
    x=transformed_samples_untrained[:, 0].numpy(),
    y=transformed_samples_untrained[:, 1].numpy(),
    label='Transformed samples (untrained flow)'
)

plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Samples', fontsize=14)

## Train the flow

In [None]:
def nll(samples, distr):
    """
    Negative log likelihood of `samples` according to the
    distribution `distr`.
    """
    return - tf.reduce_mean(distr.log_prob(samples))
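
For intuition on what `log_prob` has to compute for a flow, here is a pure-Python sketch for a toy 2-dimensional map of the affine autoregressive form, with hand-written (assumed) parameter functions in place of the neural networks. It evaluates $\log p_x(\mathbf{x}) = \log p_z(\mathbf{z}) - \log|\det J|$ and checks on a grid that the resulting density integrates to approximately one. The grid bounds and step are assumptions chosen to cover most of the mass of this particular toy density.

```python
import math


def inverse_and_log_det(x):
    """
    Inverse of a toy 2-dimensional affine autoregressive map (assumed
    parameters) together with log|det J| of the forward map.
    """
    x1, x2 = x
    alpha_1, b_1 = 0.2, -0.5
    z1 = (x1 - b_1) * math.exp(-alpha_1)
    alpha_2, b_2 = math.tanh(z1), z1 ** 2
    z2 = (x2 - b_2) * math.exp(-alpha_2)
    return (z1, z2), alpha_1 + alpha_2


def log_prob(x):
    """log p_x(x) = log p_z(z) - log|det J|, with a standard normal p_z."""
    (z1, z2), log_det = inverse_and_log_det(x)
    log_p_z = -0.5 * (z1 ** 2 + z2 ** 2) - math.log(2.0 * math.pi)
    return log_p_z - log_det


def nll(samples):
    """Average negative log-likelihood of a list of 2-dimensional points."""
    return -sum(log_prob(x) for x in samples) / len(samples)


# Riemann-sum check that the density integrates to (approximately) one.
step = 0.1
grid = [-15.0 + step * k for k in range(451)]   # from -15 to 30
mass = sum(math.exp(log_prob((u, v))) for u in grid for v in grid) * step ** 2

print(mass, nll([(0.0, 0.0), (1.0, 2.0)]))
```

The first printed value should be close to one, which confirms that subtracting the log-determinant keeps the transformed density normalized.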

In [None]:
model = MaskedARFlow()

loss_history = []

In [None]:
@tf.function
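def training_step(x):
    # One full-batch gradient step. `model` and `optimizer` are globals:
    # `optimizer` is defined in the next cell and must exist before the
    # first call.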
    with tf.GradientTape() as tape:
        loss = nll(x, model.flow)
        
    grad = tape.gradient(loss, model.trainable_variables)

    optimizer.apply_gradients(zip(grad, model.trainable_variables))
    
    return loss
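
The training step relies on automatic differentiation. For a one-dimensional affine flow $x = e^{a} z + b$ the gradients of the NLL can be written by hand, which makes the update rule explicit. The sketch below is such a hand-written analogue, under assumed synthetic data and an assumed learning rate; it is not the notebook's model.

```python
import math
import random

# Synthetic 1-dimensional data (assumed for illustration).
rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(2000)]


def nll_and_grads(a, b):
    """NLL of `data` under x = exp(a) * z + b, z ~ Normal(0, 1), and its gradients."""
    n = len(data)
    u2_mean = sum(((x - b) * math.exp(-a)) ** 2 for x in data) / n
    resid_mean = sum(x - b for x in data) / n
    loss = 0.5 * u2_mean + a + 0.5 * math.log(2.0 * math.pi)
    grad_a = 1.0 - u2_mean
    grad_b = -resid_mean * math.exp(-2.0 * a)
    return loss, grad_a, grad_b


a, b = 0.0, 0.0
learning_rate = 0.05
loss_history = []

for step in range(500):
    loss, grad_a, grad_b = nll_and_grads(a, b)
    loss_history.append(loss)
    # Plain gradient descent update, the rule applied by SGD without momentum.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(loss_history[0], loss_history[-1], math.exp(a), b)
```

After training, `exp(a)` and `b` should be close to the standard deviation and the mean of the data, which is the maximum-likelihood solution for this family.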

In [None]:
epochs = 50000

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5)

for i in range(epochs):
    # Each epoch is a single full-batch gradient step.
    loss = training_step(samples)

    loss_history.append(loss.numpy())

    if (i < 10) or (i % 100 == 0):
        print(f'Epoch: {i} | Loss: {loss_history[-1]}')

# Record the loss after the last update as well.
loss_history.append(nll(samples, model.flow).numpy())

In [None]:
fig = plt.figure(figsize=(14, 6))

sns.lineplot(
    x=range(len(loss_history)),
    y=loss_history
)

plt.title('Training loss', fontsize=14)
plt.xlabel('Epoch')
plt.ylabel('Loss value')


fig = plt.figure(figsize=(14, 6))

# Sample the source distribution and transform the samples via the
# trained flow. This should give something that looks much more
# like the original samples, as the flow should have understood
# how to map the simple space into the complicated one.
transformed_samples = model(model.flow.distribution.sample(2500))

sns.scatterplot(
    x=samples[:, 0].numpy(),
    y=samples[:, 1].numpy(),
    label='Original samples'
)

sns.scatterplot(
    x=transformed_samples[:, 0].numpy(),
    y=transformed_samples[:, 1].numpy(),
    label='Transformed samples'
)

plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Samples', fontsize=14)