<center><h1><b> Stacked Approximation Regression Machines from First Principles </b></h1></center>

There has been some buzz on [reddit](https://www.reddit.com/r/MachineLearning/comments/50tbjp/stacked_approximated_regression_machine_a_simple/) about the paper [Stacked Approximated Regression Machine: A Simple Deep Learning Approach](https://arxiv.org/abs/1608.04062). Approaching the paper with a measured dose of skepticism, I was pleasantly surprised to find that it contains a beautiful kernel of an idea - one I can see quickly becoming a fixture in our deep learning toolkit. This is a new kind of layer known as the $k$-Approximated Regression Machine ($k$-ARM). Here is my attempt to explain this layer from first principles.

##### Background

To establish notation, recall a deep neural network is a function which takes an input $x$ into an output, $F(x)$. The word “deep” in deep learning refers to the network being a composition of layer functions,  
$$F(x)=\Psi^{1}(\Psi^{2}( \cdots \,\Psi^{k}(x)\,\cdots) )$$

The most traditional choice of a layer function takes the form $\Psi^{k}(x)=\sigma(W^{T}_k x)$. Here $W^{T}_k$ is a matrix, representing a linear transform such as a convolution, and $\sigma$ is a choice of a non-linear [activation function](https://en.wikipedia.org/wiki/Activation_function). The goal in deep learning is to shape our function $F$, by any means possible, by tweaking the weights till they fit the data.
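As a minimal sketch of this notation in NumPy: the layer sizes, the random weights and the choice of `tanh` for $\sigma$ below are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def layer(W, x, sigma=np.tanh):
    """One traditional layer: sigma(W^T x)."""
    return sigma(W.T @ x)

def network(weights, x, sigma=np.tanh):
    """F(x) = Psi^1(Psi^2(... Psi^k(x) ...)).

    The last matrix in `weights` belongs to the innermost layer,
    so it is applied to the input first.
    """
    for W in reversed(weights):
        x = layer(W, x, sigma)
    return x

rng = np.random.default_rng(0)
# Hypothetical sizes: input in R^8, hidden width 5, output in R^3.
W1 = rng.standard_normal((5, 3))   # Psi^1 maps R^5 -> R^3
W2 = rng.standard_normal((8, 5))   # Psi^2 maps R^8 -> R^5
x = rng.standard_normal(8)

out = network([W1, W2], x)         # F(x) = Psi^1(Psi^2(x))
```

"Tweaking the weights" then means adjusting `W1` and `W2` until `network` fits the data.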

##### The Sparse Regression Layer

As we move away from traditional layers, we enter a fascinating zoo of possible tinker toys which we can use in our model. Here I will define one such hypothetical layer, defined implicitly as the solution of a sparse coding problem

$$\Psi(x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+\lambda\|y\|_{1}\big\}.$$

I dub this the sparse regression layer. It maps its input $x$ to the weights of the sparse linear combination of the columns of $W$ which best reconstructs $x$.

<img src="sparse.svg" width = 450px>

Despite its implicit definition, we can still take its (sub)gradient,
$$\begin{align}
\nabla\Psi(x) & =W^{T}(Wy^{\star}-x)+\lambda\mbox{sign}(y^{\star})\\
\frac{\partial}{\partial W}\Psi(x) & =\frac{\partial}{\partial W}\frac{1}{2}\|Wy^{\star}-x\|^{2}
\end{align}$$

where $y^{\star} =\Psi(x)$, and hence integrate it into any modular deep learning framework. Unfortunately this is kinda expensive. Not ridiculously expensive, since this is a small optimization problem (on the order of the number of features in a datapoint), but still an order of magnitude more difficult than a simple matrix multiplication passed through a nonlinearity.

##### The $k$ Approximate Sparse Regression Layer

Let us instead try to approximate this function. It is a well known fact that $\Psi(x)$ is a fixed point of the map 
$$y\mapsto\Psi_{x}'(y)$$

where
$$
\Psi_{x}'(y) :=\sigma(y-\tfrac{\alpha}{2}\nabla\|W\cdot-x\|^{2}(y))
 =\sigma(y-\alpha W^{T}(Wy-x))
$$

and $\sigma$ is the soft threshold function
$$\sigma(z)_{i}=\mbox{sign}(z_{i})\max(\left| z_{i} \right|-\alpha\lambda,\,0),$$

and $\alpha$ is a small step size which depends on the eigenvalues of $W^T W$ (any $\alpha$ no larger than the reciprocal of the largest eigenvalue of $W^T W$ will do). If you're at all familiar with $L_1$ regularization, this iteration will be familiar to you by one of its many aliases: Iterative Soft Thresholding (or by proxy of its famous cousin, FISTA), [Proximal Gradient](http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf) or, by its oldest name, Forward-Backward Splitting. In informal but suggestive notation, 
$$\Psi(x)=\Psi_{x}'(\Psi_{x}'(\Psi_{x}'(\cdots)))$$
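Here is a minimal NumPy sketch of this fixed-point iteration. The random dictionary `W`, the value of `lam`, the iteration count and the helper names are all assumptions made for illustration. I also assume the threshold is $\alpha\lambda$ and the step is $\alpha = 1/\lambda_{\max}(W^TW)$, which is the scaling under which the fixed point solves the sparse coding problem above.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_step(y, x, W, lam, alpha):
    """One application of the map y -> Psi'_x(y)."""
    return soft_threshold(y - alpha * W.T @ (W @ y - x), alpha * lam)

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))   # hypothetical dictionary: 10 atoms in R^20
x = rng.standard_normal(20)
lam = 0.5                           # hypothetical sparsity penalty
# Step size: reciprocal of the largest eigenvalue of W^T W.
alpha = 1.0 / np.linalg.eigvalsh(W.T @ W).max()

# Apply the map over and over, starting from zero.
y = np.zeros(10)
for _ in range(5000):
    y = ista_step(y, x, W, lam, alpha)

# At convergence, applying the map once more should leave y (numerically) unchanged.
fixed_point_gap = np.linalg.norm(ista_step(y, x, W, lam, alpha) - y)

# Optimality check: every entry of W^T (W y - x) lies in [-lam, lam].
correlation = W.T @ (W @ y - x)
```

If the iteration has converged, `fixed_point_gap` is close to machine precision and `correlation` stays within the band $[-\lambda, \lambda]$, which is the subgradient optimality condition for the sparse coding objective.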

The above formula suggests this series of approximations $\Psi\approx\Psi_{k}$ where
$$\begin{align}
\Psi_{0}(x) & =\Psi_{x}'(\mathbf{0})\\
\Psi_{1}(x) & =\Psi_{x}'(\Psi_{x}'(\mathbf{0}))\\
\Psi_{2}(x) & =\Psi'_{x}(\Psi_{x}'(\Psi_{x}'(\mathbf{0})))\\
 & \vdots\\
\lim_{k\rightarrow\infty}\Psi_{k}(x) & =\Psi(x).
\end{align}$$
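To get a feel for how quickly these approximations improve, here is a rough sketch that evaluates the sparse coding objective at $\Psi_k(x)$ for a few values of $k$. The data, `lam` and the function names are made up for illustration, and the threshold is again assumed to be $\alpha\lambda$ with $\alpha = 1/\lambda_{\max}(W^TW)$.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def psi_k(x, W, lam, alpha, k):
    """Psi_k(x): k + 1 applications of Psi'_x, starting from the zero vector."""
    y = np.zeros(W.shape[1])
    for _ in range(k + 1):
        y = soft_threshold(y - alpha * W.T @ (W @ y - x), alpha * lam)
    return y

def objective(y, x, W, lam):
    """The sparse coding objective 1/2 ||W y - x||^2 + lam ||y||_1."""
    return 0.5 * np.sum((W @ y - x) ** 2) + lam * np.sum(np.abs(y))

rng = np.random.default_rng(1)
W = rng.standard_normal((20, 10))   # hypothetical dictionary
x = rng.standard_normal(20)
lam = 0.5                           # hypothetical sparsity penalty
alpha = 1.0 / np.linalg.eigvalsh(W.T @ W).max()

depths = (0, 1, 2, 10, 100)
values = [objective(psi_k(x, W, lam, alpha, k), x, W, lam) for k in depths]
for k, v in zip(depths, values):
    print(f"k = {k:3d}   objective = {v:.6f}")
```

With this step size, each extra application of $\Psi_x'$ can only lower the objective, so the printed values should be non-increasing in $k$.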

We take the initial point as ${\bf 0}$ for simplicity. Let's take a look at $\Psi_{0}$. 
$$
\Psi_{0}(x)	
  =\Psi_{x}'(\mathbf{0})
  =\sigma(\mathbf{0}-\alpha W^{T}(W\mathbf{0}-x))
  =\sigma(\alpha W^{T}x)
$$

Aha! This should look familiar! Why, it's nothing more than our generic layer discussed at the beginning! So I will do my best impression of Geoff Hinton and say, “$\Psi$ can be seen as a deep neural network with tied weights and an infinite number of layers”, to which I add the footnote: one which we approximate with a $k$-ARM. Architecturally, the $k$-ARM looks like

<p>
<img src = "diagram.svg" width = 500px>
</p>
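The tied-weight structure can be made explicit by expanding $y-\alpha W^{T}(Wy-x)=(I-\alpha W^{T}W)\,y+\alpha W^{T}x$. The sketch below writes the forward pass in that form. The names `k_arm`, `B` and `S` are my own, the data is random, and the threshold is assumed to be $\alpha\lambda$, so treat this as an illustration rather than the paper's implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def k_arm(x, W, lam, alpha, k):
    """k-ARM forward pass written as a stack of k + 1 layers with tied weights."""
    B = alpha * W.T                                # input weights, shared by every layer
    S = np.eye(W.shape[1]) - alpha * W.T @ W       # layer-to-layer weights, also shared
    y = soft_threshold(B @ x, alpha * lam)         # first layer: sigma(alpha W^T x)
    for _ in range(k):                             # k further layers, same S and B
        y = soft_threshold(S @ y + B @ x, alpha * lam)
    return y

rng = np.random.default_rng(2)
W = rng.standard_normal((20, 10))   # hypothetical dictionary
x = rng.standard_normal(20)
lam = 0.5                           # hypothetical sparsity penalty
alpha = 1.0 / np.linalg.eigvalsh(W.T @ W).max()

shallow = k_arm(x, W, lam, alpha, 0)   # a single traditional layer
deeper = k_arm(x, W, lam, alpha, 3)    # four layers with tied weights
```

With `k = 0` this reduces to the single layer $\sigma(\alpha W^{T}x)$. Every additional layer reuses the same `S` and `B` and feeds the input $x$ back in.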

The fascinating insight here is that deep nets may have been secretly doing sparse regression all along.
