## An introduction to autoregressive flows

Recently, I have been trying to apply normalizing flows to time-series problems, and I found that autoregressive flows are better suited to time-series modelling than coupling flows.

The joint distribution $p(\boldsymbol{x}_{1:T})$ of an observed time series can be factorized with the chain rule of probability:

$$p(\boldsymbol{x}_{1:T})=p(\boldsymbol{x}_{1})p(\boldsymbol{x}_{2}|\boldsymbol{x}_{1})p(\boldsymbol{x}_{3}|\boldsymbol{x}_{2},\boldsymbol{x}_{1})p(\boldsymbol{x}_{4}|\boldsymbol{x}_{3},\boldsymbol{x}_{2},\boldsymbol{x}_{1})\ldots=\prod_{t=1}^{T}p(\boldsymbol{x}_{t}|\boldsymbol{x}_{1:t-1})$$

This is an **autoregressive model**: each node depends only on its predecessors in the ordering. Specifically, $\boldsymbol{x}_{t}$ depends only on the earlier time steps $\boldsymbol{x}_{1:t-1}$.
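
As a minimal sketch of this factorization, the snippet below sums conditional log-densities to obtain the joint log-density. The Gaussian AR(1) conditional, its coefficient `phi`, and the function names are assumptions chosen only for illustration; they are not part of the model discussed in this post.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def conditional_logpdf(x_t, history, phi=0.8, std=1.0):
    """log p(x_t | x_{1:t-1}) for a toy AR(1) model (illustrative assumption).

    The first step has no history, so its mean is 0.
    """
    mean = phi * history[-1] if history else 0.0
    return gaussian_logpdf(x_t, mean, std)

def joint_logpdf(series):
    """log p(x_{1:T}) = sum over t of log p(x_t | x_{1:t-1})."""
    return sum(conditional_logpdf(x_t, series[:t])
               for t, x_t in enumerate(series))

series = [0.5, 0.1, -0.3, 0.7]
print(joint_logpdf(series))
```

Each term only looks at `series[:t]`, the values before step `t`, which is exactly the autoregressive restriction.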

**Autoregressive flows** are a natural fit for transforming this joint distribution. The dependency of a specific $x_t$ on its predecessors can be expressed as:

$$x_{t}=h(u_{t};\Theta(\boldsymbol{x}_{1:t-1})),\quad t=1,\ldots,T.$$

where $\boldsymbol{u}$ contains $T$ scalar elements, that is, $\boldsymbol{u}=(u_{1},\ldots,u_{T})\in\mathbb{R}^{T}$. 

Specifically, we assume $h$ is an affine transformation and $\Theta$ is a neural network. Its input is $\boldsymbol{x}_{1:t-1}$, and it outputs two scalars, $\alpha_t$ and $\mu_t$, which serve as the log-scale and offset parameters of the affine transformation:

$$x_{t}=u_{t}\exp(\alpha_{t})+\mu_{t}$$
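
The affine transformation above can be sketched in a few lines of plain Python. Everything here is a toy under stated assumptions: `conditioner` is a hypothetical hand-written stand-in for $\Theta$ (a real model would use a neural network), and the base density on $\boldsymbol{u}$ is assumed to be a standard normal. Because the Jacobian of the map from $\boldsymbol{x}$ to $\boldsymbol{u}$ is triangular, its log-determinant is just the negative sum of the log-scales.

```python
import math
import random

def conditioner(history):
    """Hypothetical stand-in for Theta: maps x_{1:t-1} to (alpha, mu).

    This toy rule only illustrates the data flow; it is not a trained model.
    """
    if not history:
        return 0.0, 0.0
    mean_hist = sum(history) / len(history)
    alpha = 0.1 * math.tanh(mean_hist)  # log-scale
    mu = 0.5 * history[-1]              # offset
    return alpha, mu

def forward(u):
    """u -> x. Sequential, because each x_t needs x_{1:t-1} first."""
    x = []
    for u_t in u:
        alpha, mu = conditioner(x)
        x.append(u_t * math.exp(alpha) + mu)
    return x

def inverse(x):
    """x -> (u, log|det du/dx|). Every step only reads the known x."""
    u, log_det = [], 0.0
    for t, x_t in enumerate(x):
        alpha, mu = conditioner(x[:t])
        u.append((x_t - mu) * math.exp(-alpha))
        log_det -= alpha
    return u, log_det

def log_prob(x):
    """log p(x), assuming a standard-normal base density on u."""
    u, log_det = inverse(x)
    base = sum(-0.5 * u_t * u_t - 0.5 * math.log(2.0 * math.pi) for u_t in u)
    return base + log_det

# Usage: sample noise, push it through the flow, and invert it again.
random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(5)]
x = forward(u)
u_back, _ = inverse(x)
print(max(abs(a - b) for a, b in zip(u, u_back)))  # round-trip error
print(log_prob(x))
```

Note the asymmetry: sampling (`forward`) is inherently sequential, while every step of `inverse` depends only on the observed `x`, so a masked network can evaluate all steps in one pass.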


**Affine autoregressive flows** are illustrated in the following figure:

<center> <img src="imgs/af01.jpg" width="700px" /> </center>

The order of dependency is important for autoregressive flows. Here is an example:

<center> <img src="imgs/af02.jpg" width="700px" /> </center>


In the figure, (a) is the target density, where $x_1$ depends on $x_2$. In (b), we model the dependency in the reverse order, which gives a poor fit. In (c), however, a deeper model avoids this poor fit.

We have already said that $\Theta$ is a neural network whose input is $\boldsymbol{x}_{1:t-1}$. Naively, we would need a separate neural network for each prefix $\boldsymbol{x}_{1:t-1}$, $t=1,\ldots,T$. Germain et al. (2015) proposed using masks to improve efficiency. Their method uses only one neural network, illustrated as:

<center> <img src="imgs/af03.jpg" width="500px" /> </center>

The right part of the figure is the masked neural network. The number in each circle is the node's position in the dependency order (this is a hyperparameter, and a different order can be assigned in each layer of the flow). Here the numbers represent the order $(x_2, x_3, x_1)$, that is, $x_1$ depends on $x_2$ and $x_3$, and $x_3$ depends on $x_2$. The same relationship is reflected in the connections between neurons inside the network.
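
The masking rule can be sketched as follows for a single hidden layer and the ordering $(x_2, x_3, x_1)$. The rule follows the MADE construction (a hidden unit may read inputs whose position is at most its own degree; an output may read hidden units whose degree is strictly smaller than its position). The function names and the particular `hidden_degrees` are assumptions for illustration; they are not read off the figure.

```python
def made_masks(input_order, hidden_degrees):
    """Build MADE-style masks for one hidden layer.

    input_order[d]   : position of x_{d+1} in the ordering (1 = first).
    hidden_degrees[k]: degree of hidden unit k, between 1 and D-1.
    """
    # Hidden unit k may read input d if its degree is at least d's position.
    mask_in = [[1 if m_k >= m_d else 0 for m_d in input_order]
               for m_k in hidden_degrees]
    # Output d may read hidden unit k only if d's position is strictly greater.
    mask_out = [[1 if m_d > m_k else 0 for m_k in hidden_degrees]
                for m_d in input_order]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """conn[i][j] is True if output i can be influenced by input j."""
    n_out, n_hidden, n_in = len(mask_out), len(mask_in), len(mask_in[0])
    return [[any(mask_out[i][k] and mask_in[k][j] for k in range(n_hidden))
             for j in range(n_in)] for i in range(n_out)]

# Ordering (x2, x3, x1): x1 is third, x2 is first, x3 is second.
input_order = [3, 1, 2]
hidden_degrees = [1, 2, 1, 2]  # assumed values for this sketch
mask_in, mask_out = made_masks(input_order, hidden_degrees)
for row in connectivity(mask_in, mask_out):
    print(row)
# [False, True, True]    -> output for x1 sees x2 and x3
# [False, False, False]  -> output for x2 sees nothing
# [False, True, False]   -> output for x3 sees x2
```

The printed connectivity reproduces the dependency structure described above without ever training a separate network per variable.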

Here, we use the masking idea to restrict the connections within the neural network, but change the outputs from conditional probabilities to the affine parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\mu}$.
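
A minimal sketch of such a masked network with two output heads is shown below, in plain Python with random, untrained weights. The class `MaskedAffineConditioner`, the helper names, the `tanh` activation, and the hidden degrees are all hypothetical choices for illustration; this is not the implementation from the repository linked below.

```python
import math
import random

def masked_linear(x, weights, bias, mask):
    """y = (W * M) x + b, with the mask applied element-wise to W."""
    return [sum(w * m * x_j for w, m, x_j in zip(w_row, m_row, x)) + b
            for w_row, m_row, b in zip(weights, mask, bias)]

class MaskedAffineConditioner:
    """One-hidden-layer masked network with two output heads (alpha, mu)."""

    def __init__(self, input_order, hidden_degrees, seed=0):
        rng = random.Random(seed)
        n_in, n_hidden = len(input_order), len(hidden_degrees)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.mask_in = [[1 if m_k >= m_d else 0 for m_d in input_order]
                        for m_k in hidden_degrees]
        self.mask_out = [[1 if m_d > m_k else 0 for m_k in hidden_degrees]
                         for m_d in input_order]
        self.w_in, self.b_in = rand(n_hidden, n_in), [0.0] * n_hidden
        # Both heads share the same output mask.
        self.w_alpha, self.b_alpha = rand(n_in, n_hidden), [0.0] * n_in
        self.w_mu, self.b_mu = rand(n_in, n_hidden), [0.0] * n_in

    def __call__(self, x):
        pre = masked_linear(x, self.w_in, self.b_in, self.mask_in)
        h = [math.tanh(v) for v in pre]
        alpha = masked_linear(h, self.w_alpha, self.b_alpha, self.mask_out)
        mu = masked_linear(h, self.w_mu, self.b_mu, self.mask_out)
        return alpha, mu

def to_noise(x, net):
    """x -> (u, log|det du/dx|) in one pass over all coordinates."""
    alpha, mu = net(x)
    u = [(x_i - m) * math.exp(-a) for x_i, a, m in zip(x, alpha, mu)]
    return u, -sum(alpha)

# Ordering (x2, x3, x1); x1 comes last, so no output may depend on it.
net = MaskedAffineConditioner([3, 1, 2], [1, 2, 1, 2])
alpha_a, mu_a = net([0.1, 0.2, 0.3])
alpha_b, mu_b = net([9.0, 0.2, 0.3])  # only x1 changed
print(alpha_a == alpha_b and mu_a == mu_b)  # True
print(to_noise([0.1, 0.2, 0.3], net))
```

Changing $x_1$ leaves every $\alpha$ and $\mu$ untouched because $x_1$ is last in the ordering, so a single network call yields valid parameters for all coordinates at once.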

The following code is a straightforward implementation of autoregressive flows. I will adapt it for use in our project:

https://github.com/ikostrikov/pytorch-flows/blob/master/flows.py

## References

<div STYLE="text-indent: -36px; padding-left: 36px;">
    
<p>Germain, M., Gregor, K., Murray, I. & Larochelle, H. (2015). MADE: Masked Autoencoder for Distribution Estimation. <em>Proceedings of the 32nd International Conference on Machine Learning</em>, pages 881–889.</p>

<p>Jang, E. (2018). Normalizing Flows Tutorial. Retrieved from https://blog.evjang.com/2018/01/nf2.html</p>
    
<p>Murphy, K.P. (in press). <em>Probabilistic Machine Learning: Advanced Topics</em>. MIT Press.</p>
    
<p>Papamakarios, G., Pavlakou, T. & Murray, I. (2017). Masked Autoregressive Flow for Density Estimation. <em>Advances in Neural Information Processing Systems (NIPS)</em>.</p>

</div>
