<center><h1><b> Stacked Approximated Regression Machines from First Principles </b></h1> (Gabriel Goh)</center>

<img src = "bases.png">

There has been some buzz on [reddit](https://www.reddit.com/r/MachineLearning/comments/50tbjp/stacked_approximated_regression_machine_a_simple/) about the paper [Stacked Approximated Regression Machine: A Simple Deep Learning Approach](https://arxiv.org/abs/1608.04062). Approaching the paper with a measured dose of skepticism, I was pleasantly surprised to find that it contains a beautiful kernel of an idea - one I can see becoming a fixture in our deep learning toolkit. The idea is a new kind of layer known as the $k$-Approximated Regression Machine ($k$-ARM).

### Background - Deep Nets

A deep neural network is a function which takes data, $x$, as input and maps it to an output, $F(x)$ - whatever you want (usually labels). The word “deep” in deep learning refers to the network being a composition of layer functions, like so
$$F(x)=\Psi^{1}(\,\Psi^{2}(\, \cdots \,\Psi^{k}(x)\,\cdots\,) \,).$$

A traditional choice of a layer function looks like $\Psi^{k}(x)=\sigma(W^{T}_k x)$. Here $W^{T}_k$ is a matrix, representing a linear transform, such as a convolution or a fully connected layer. $\sigma$ is a choice of a non-linear [activation function](https://en.wikipedia.org/wiki/Activation_function). The goal in deep learning is to shape our function $F$, by any means possible, by tweaking the weights till they fit the data.
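
To make the composition concrete, here is a minimal NumPy sketch of a stack of traditional layers. The choice of ReLU for $\sigma$, the layer shapes and the random weights are all my own illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def relu(z):
    """One possible activation sigma (an assumption; any non-linearity works)."""
    return np.maximum(z, 0.0)

def layer(W, x, sigma=relu):
    """One traditional layer: sigma(W^T x)."""
    return sigma(W.T @ x)

def deep_net(weights, x, sigma=relu):
    """Compose layers so that F(x) = Psi^1(Psi^2(... Psi^k(x))).

    weights[0] belongs to Psi^1 (applied last), weights[-1] to Psi^k (applied first).
    """
    for W in reversed(weights):
        x = layer(W, x, sigma)
    return x

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((4, 2)),   # Psi^1: R^4 -> R^2
    rng.standard_normal((8, 4)),   # Psi^2: R^8 -> R^4
]
x = rng.standard_normal(8)
out = deep_net(weights, x)         # a vector in R^2
```

Training then amounts to adjusting the entries of `weights` until `deep_net` fits the data.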

### The Layer We Deserve - The Sparse Regression Layer

The traditional layer is powerful, but recent research has triggered a Cambrian explosion of different layers which enjoy tremendous success. These include magic dust like batch normalization and dropout layers, all now conveniently accessible from your favorite deep learning toolkit. Yet there is a layer missing from all these tools, one which nevertheless seems like a useful way to think about layers in a deep net. This layer isn't defined explicitly, with a formula, but is instead defined implicitly as the solution of a *sparse coding problem*
$$\Psi(x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+\lambda\|y\|_{1}\big\}.$$

I dub this the sparse regression layer. 

This layer behaves very differently from a regular deep learning layer, and it might be worthwhile to consider what it does. Firstly, to get things out of the way, $W$ is a set of weights, to be trained, and $\lambda$ is a parameter controlling how sparse the output is. Assuming a trained model, with sensible $W$, this layer is an encoder. It takes $x$, its input, and transforms it into its representation in terms of structural primitives (the columns of $W$).

It is useful to think of a musical metaphor - the input $x$ could be a chord, a linear superposition of notes, $y$. Our map reverses this process, taking the chord and peeling it apart into its individual notes. In the language of engineering, this layer solves a (linear) inverse problem. The forward model is the problem of data generation

<img src="sparse.svg" width = 450px>

And the inverse problem is one of inference.

<img src="inverse.svg" width = 550px>
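
The forward half of this picture is easy to sketch in code. In the snippet below the sizes, the random dictionary and the number of active notes are illustrative assumptions of mine; a trained $W$ would have far more structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_atoms, n_active = 16, 32, 3   # illustrative sizes

# Dictionary W: each column is one structural primitive (a "note").
W = rng.standard_normal((n_features, n_atoms))
W /= np.linalg.norm(W, axis=0)              # normalize columns to unit length

# Sparse code y: only a few notes are played at once.
y = np.zeros(n_atoms)
support = rng.choice(n_atoms, size=n_active, replace=False)
y[support] = rng.standard_normal(n_active)

# Forward model: the chord is a linear superposition of the active notes.
x = W @ y
```

Inference is the reverse task: given only `x` and `W`, recover a sparse `y` that explains it.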

This model is, by its DNA, generative. It is also the mechanism proposed by [Olshausen and Field](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf) to represent the early stages of visual processing, and hence is a way of thinking about how information can be transformed in layers of abstraction in the brain.

Despite its implicit definition, we can still take its (sub) gradient,
$$\begin{align}\nabla\Psi(x) & =W^{T}(W\Psi(x)-x)+\lambda\mbox{sign}(\Psi(x))\\
\frac{\partial}{\partial W}\Psi(x) & =\frac{\partial}{\partial W}\frac{1}{2}\|W\Psi(x)-x\|^{2}
\end{align}$$

and hence there is no technical roadblock to integrating this into a deep learning framework.

All this said, why isn't this layer in more use? The trouble with this layer is that it is expensive. Computing this layer requires the solution of a small optimization problem (on the order of the number of features in a datum). Measured in milliseconds, this may be cheap, but it is still an order of magnitude more expensive than a simple matrix multiplication passed through a nonlinearity. Adding a single layer like this could potentially increase the cost of every training step a hundredfold.

### The Layer We Have - $k$ Approximate Sparse Regression Layer

Let us instead try to approximate this layer. First, I define a cryptic operator
$$
\Psi_{x}'(y) := \sigma(y-\alpha W^{T}(Wy-x)), \qquad \sigma(x)_{i}=\mbox{sign}(x_{i})\max\{0,\left| x_{i} \right|-\lambda\}
$$

where $\alpha$ is a small step size which depends on the eigenvalues of $W^{T}W$. Let's think about what this operator does. If $f(y)=\tfrac{1}{2}\|Wy-x\|^{2}$, the term inside the $\sigma$ is just $y - \alpha\nabla f(y)$, a gradient step. $\sigma$, our thresholding operator, snaps every component of this new iterate lying in $[-\lambda, \lambda]$ to zero, and shrinks the remaining components toward zero by $\lambda$. The first step decreases the quadratic function, and the second step promotes sparsity. So this makes sense. In fact, we can say more about this operator. Indeed,

1. $\Psi(x)$ is the unique fixed point of the map $y\mapsto \Psi_{x}'(y),$ i.e. $\Psi(x)=\Psi_{x}'(\Psi(x))$
2. From any initial point $y$, repeated application of the map will always converge to $\Psi(x).$
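
Both properties can be checked numerically. The sketch below makes two assumptions beyond the text. First, I threshold at $\alpha\lambda$ rather than $\lambda$, which is the usual scaling in iterative soft-thresholding (ISTA) and is what makes the fixed point coincide with the minimizer of the sparse coding objective for any valid step size. Second, I take $\alpha$ to be one over the largest eigenvalue of $W^{T}W$. The problem sizes and random data are purely illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Zero out entries in [-t, t] and shrink the rest toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_step(y, x, W, lam, alpha):
    """One application of Psi'_x: a gradient step on f, then thresholding."""
    return soft_threshold(y - alpha * W.T @ (W @ y - x), alpha * lam)

def objective(y, x, W, lam):
    """The sparse coding objective 0.5 * ||W y - x||^2 + lam * ||y||_1."""
    return 0.5 * np.sum((W @ y - x) ** 2) + lam * np.sum(np.abs(y))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
x = rng.standard_normal(16)
lam = 0.1
alpha = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / largest eigenvalue of W^T W

y = np.zeros(32)
history = [objective(y, x, W, lam)]
for _ in range(5000):
    y = prox_step(y, x, W, lam, alpha)
    history.append(objective(y, x, W, lam))

# How far y is from being a fixed point of the map.
residual = np.linalg.norm(y - prox_step(y, x, W, lam, alpha))
```

With this step size the values in `history` never increase, and `residual` shrinks toward zero as the iterate settles onto the fixed point.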

In informal but suggestive notation, 
$$\Psi(x)=\Psi_{x}'(\Psi_{x}'(\Psi_{x}'(\cdots))).$$

This suggests a series of approximations $\Psi\approx\Psi_{k}$ (starting the iteration at $\mathbf{0}$ for simplicity),
$$\begin{align}
\Psi_{0}(x) & =\Psi_{x}'(\mathbf{0})\\
\Psi_{1}(x) & =\Psi_{x}'(\Psi_{x}'(\mathbf{0}))\\
\Psi_{2}(x) & =\Psi'_{x}(\Psi_{x}'(\Psi_{x}'(\mathbf{0})))\\
 & \vdots\\
\lim_{k\rightarrow\infty}\Psi_{k}(x) & =\Psi(x).
\end{align}$$

This approximation has an architectural diagram looking like this

<p>
<img src = "diagram.svg" width = 500px>
</p>

And our most aggressive approximation, $\Psi_0(x)$, has the form
$$
\Psi_{0}(x)
  =\Psi_{x}'(\mathbf{0})
  =\sigma(\mathbf{0}-\alpha W^{T}(W\mathbf{0}-x))
  =\sigma(\alpha W^{T}x)
$$

which should look familiar. Why, it's nothing more than our traditional layer discussed at the beginning! A tantalizing coincidence.
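
Here is a small sketch of the approximate layer, written directly from the definition of $\Psi_{x}'$ with step size $\alpha$. The function name `arm_layer` and the argument `threshold` are hypothetical names of my own; `threshold` stands for whatever cutoff $\sigma$ uses ($\lambda$ in the text, $\alpha\lambda$ in the standard ISTA scaling). The numerical values are illustrative and untuned.

```python
import numpy as np

def soft_threshold(v, t):
    """sigma: zero out entries in [-t, t], shrink the rest toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def arm_layer(x, W, alpha, threshold, k):
    """Psi_k(x): apply Psi'_x a total of k + 1 times, starting from 0."""
    y = np.zeros(W.shape[1])
    for _ in range(k + 1):
        y = soft_threshold(y - alpha * W.T @ (W @ y - x), threshold)
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
x = rng.standard_normal(16)
alpha, threshold = 0.01, 0.001   # illustrative values, not tuned

psi_0 = arm_layer(x, W, alpha, threshold, k=0)
feedforward = soft_threshold(alpha * W.T @ x, threshold)   # sigma(alpha W^T x)
psi_3 = arm_layer(x, W, alpha, threshold, k=3)             # a deeper approximation
```

`psi_0` and `feedforward` agree, which is the coincidence in code form: with $k=0$ the layer is a single matrix multiplication passed through a nonlinearity, and larger $k$ simply unrolls more iterations.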

### Proximal Gradient Descent

Let's take a step back and think about the derivation, however. The previous section was quite the tease - our operator $\Psi_{x}'(y)$ does not come from nowhere. There is, in fact, a powerful framework in convex optimization, proximal methods, which gives us the tools to craft such operators. These methods deal with problems of the form
$$y^\star = \mbox{argmin}\{f(y)+g(y)\}$$

where $f$ is smooth + convex and $g$ is just convex. For reasons which will be clear later, $f$ is usually chosen to be the vessel of our data, and $g$ to be simple. Sparse coding, discussed earlier, is a special case of this, where
$$f(y)=\tfrac{1}{2}\|Wy-x\|^{2},\qquad g(y)=\lambda\|y\|_{1}$$

Proximal methods give us a recipe for replacing $g$ with any other convex function we'd like. We can use the two norm, total variation norms, or indeed any norm, and can even include constraints, with a bit of syntactic sugar:
$$y^{\star}=\underset{y\in S}{\mbox{argmin}}\{f(y)\}=\underset{y}{\mbox{argmin}}\{f(y)+\delta(y \, | \, S)\},\qquad\delta(y \, | \, S)=\begin{cases}
0 & y\in S\\
\infty & y\notin S
\end{cases}.$$
Given $f$ and $g$, the proximal gradient method defines the map
$$\Psi_{f}'(y)=\sigma_{g}(y-\alpha\nabla f(y)),\qquad\sigma_{g}(y)=\underset{\bar{y}}{\mbox{argmin}}\{\tfrac{1}{2}\|\bar{y}-y\|^{2}+g(\bar{y})\}$$

so that, you guessed it

1. $y^\star$ is the unique fixed point of the map $y\mapsto \Psi_{f}'(y)$, i.e. $y^\star = \Psi_{f}'(y^\star)$
2. From any $y$, for small enough $\alpha>0$ repeated application of the map will always converge to $y^\star$
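
The recipe is generic enough to write once and reuse. Below is a minimal sketch in which `proximal_gradient`, `grad_f` and `prox_g` are hypothetical names of my own. As an example I swap the $\ell_1$ penalty for a non-negativity constraint, whose proximal operator is simply projection onto the non-negative orthant; the step size and the random problem are illustrative assumptions.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, y0, alpha, n_iters):
    """Repeatedly apply y -> prox_g(y - alpha * grad_f(y))."""
    y = y0
    for _ in range(n_iters):
        y = prox_g(y - alpha * grad_f(y))
    return y

# Example: non-negative least squares,
#   f(y) = 0.5 * ||W y - x||^2,   g(y) = indicator of {y >= 0}.
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))
x = rng.standard_normal(20)
alpha = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / largest eigenvalue of W^T W

def grad_f(y):
    return W.T @ (W @ y - x)

def project_nonneg(v):
    return np.maximum(v, 0.0)

y_star = proximal_gradient(grad_f, project_nonneg, np.zeros(5), alpha, 2000)

# y_star should be (numerically) a fixed point of the map.
one_more_step = project_nonneg(y_star - alpha * grad_f(y_star))
```

Only `prox_g` changes when $g$ changes; the gradient step on $f$ stays the same.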

The details of the theory are covered in Boyd's [slides](https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf), but here is just a sampling of proximal operators for different $g$'s
$$\begin{alignat*}{2}
 & g(y)=0 & \qquad & \sigma_{g}(y)=y\\
 & g(y)=\lambda\|y\|_{1} & \qquad & \sigma_{g}(y)_{i}=\mbox{sign}(y_{i})\max\{0,\left|y_{i}\right|-\lambda\}\\
 & g(y)=\lambda\|y\|_{2} & \qquad & \sigma_{g}(y)=\max\{0,1-\lambda/\|y\|_{2}\}y\\
 & g(y)=\delta(y \, | \, \mathbf{R}_{+}) & \qquad & \sigma_{g}(y)_{i}=\max\{0,y_{i}\}\\
 & g(y)=\delta(y \, | \, \mathbf{R}_{+})+\lambda\|y\|_{1} & \qquad & \sigma_{g}(y)_{i}=\max\{0,y_{i}-\lambda\}
\end{alignat*}$$

These, again, look tantalizingly like activation functions, including the `ReLU` and the `Leaky ReLU`.
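
If you don't trust the table, a brute-force check is cheap in one dimension: minimize $\tfrac{1}{2}(\bar{y}-y)^{2}+g(\bar{y})$ over a fine grid and compare against the closed form. The sketch below does this for three of the rows; the grid, the value of $\lambda$ and the helper names are my own illustrative choices.

```python
import numpy as np

lam = 0.5
grid = np.linspace(-3.0, 3.0, 60001)   # candidate values of y_bar, spacing 1e-4

def brute_prox(y, g):
    """Minimize 0.5 * (y_bar - y)^2 + g(y_bar) over the grid."""
    return grid[np.argmin(0.5 * (grid - y) ** 2 + g(grid))]

# Penalties g evaluated on the grid (np.inf encodes the constraint).
def g_l1(v):
    return lam * np.abs(v)

def g_nonneg(v):
    return np.where(v >= 0, 0.0, np.inf)

def g_nonneg_l1(v):
    return np.where(v >= 0, lam * v, np.inf)

# Closed forms from the table.
def prox_l1(y):
    return np.sign(y) * max(0.0, abs(y) - lam)   # soft thresholding

def prox_nonneg(y):
    return max(0.0, y)                           # ReLU

def prox_nonneg_l1(y):
    return max(0.0, y - lam)                     # shifted ReLU

pairs = [(g_l1, prox_l1), (g_nonneg, prox_nonneg), (g_nonneg_l1, prox_nonneg_l1)]
test_points = [-2.0, -0.3, 0.2, 1.5]
errors = [abs(brute_prox(y, g) - prox(y)) for g, prox in pairs for y in test_points]
```

Every entry of `errors` should be no larger than the grid spacing, up to rounding.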

### A Footnote on Training

Much ado has been made about how this network is trained in a totally different way from regular ones. The method of training, repeated k-SVDs, resembles our retro and forgotten hero, the [Restricted Boltzmann Machine (RBM)](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine), which kick-started the rebirth of the neural net. The trouble with RBMs is that nobody quite understood *why* they worked, just that they did. It felt like a gift from an alien god, powerful and incomprehensible. The deep learning community seemed happy to ditch it when it learned how to train networks with gradient descent alone.

But the above interpretation of deep neural nets gives us a decent starting point in making sense of all of this. Perhaps it's time to revisit unsupervised pre-training. Interpreting deep nets as stacked ARMs gives us a justification for training them greedily, one layer at a time, using dictionary learning. This is the method used in the paper, with excellent results.

Time will tell whether this speculation bears out, but for now the SARMs show impressive benchmarks with a minimal amount of training. I hope, at least, that you are convinced this idea is one worth thinking about.


Oh, and here's a link to my [website](http://gabgoh.github.io).

Yes, I am seeking employment.

The picture in the title, as well as the sparse encoding picture, is taken from [Olshausen and Field, Sparse Coding With An Overcomplete Basis Set: A Strategy Employed by V1?](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf)
