<center><h1><b>Stacked Approximated Regression Machines from First Principles</b></h1></center>

<img src = "bases.png">

There has been some buzz on [reddit](https://www.reddit.com/r/MachineLearning/comments/50tbjp/stacked_approximated_regression_machine_a_simple/) about the paper [Stacked Approximated Regression Machine: A Simple Deep Learning Approach](https://arxiv.org/abs/1608.04062). Approaching the paper with a measured dose of skepticism, I was pleasantly surprised to find the paper containing a beautiful kernel of an idea - one I can see becoming a fixture in our deep learning toolkit. This is a new kind of layer known as the $k$-Approximated Regression Machine ($k$-ARM).

##### Background - Deep Nets

A deep neural network is a function which takes an input $x$ into an output, $F(x)$. The word “deep” in deep learning refers to the network being a composition of layer functions, like so
$$F(x)=\Psi^{1}(\,\Psi^{2}(\, \cdots \,\Psi^{k}(x)\,\cdots\,) \,).$$

A traditional choice of layer function looks like $\Psi^{i}(x)=\sigma(W^{T}_i x)$. Here $W^{T}_i$ is a matrix representing a linear transform, such as a convolution or a fully connected layer, and $\sigma$ is a non-linear [activation function](https://en.wikipedia.org/wiki/Activation_function). The goal in deep learning is to shape our function $F$, by any means possible, by tweaking the weights till they fit the data.
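As a minimal sketch of this composition, here is a two-layer network in plain Python. The weights `W1` and `W2`, the helper names, and the choice of ReLU for $\sigma$ are all illustrative assumptions; lists stand in for arrays to keep the example dependency-free.

```python
def matTvec(W, x):
    """Return W^T x, with W stored as a list of rows (one row per entry of x)."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    """One common choice for the activation sigma."""
    return max(0.0, v)

def layer(W, x, sigma=relu):
    """One traditional layer: sigma applied elementwise to W^T x."""
    return [sigma(v) for v in matTvec(W, x)]

def network(weights, x, sigma=relu):
    """F(x) = Psi^1(Psi^2(...Psi^k(x)...)); the innermost layer runs first."""
    for W in reversed(weights):
        x = layer(W, x, sigma)
    return x

# Hypothetical weights: W2 maps R^2 to R^3 and is applied first, W1 maps R^3 to R^1.
W1 = [[1.0], [1.0], [1.0]]
W2 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
out = network([W1, W2], [1.0, 2.0])
print(out)  # [4.0]
```

The inner layer produces `[1.0, 3.0, 0.0]` (the negative entry is clipped by the ReLU), and the outer layer sums these.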

##### The Sparse Regression Layer

Moving away from traditional layers, here is a hypothetical one. This layer, $\Psi$, does not admit a formula, but is instead defined implicitly as the solution of a sparse coding problem
$$\Psi(x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+\lambda\|y\|_{1}\big\}.$$
I dub this the sparse regression layer. This layer maps its input $x$ to the weights of the sparse linear combination of the columns of $W$ which best reconstructs $x$.

<img src="sparse.svg" width = 450px>

Despite its implicit definition, we can still take its (sub) gradient,
$$\begin{align}
\nabla\Psi(x) & =W^{T}(Wy^{\star}-x)+\lambda\mbox{sign}(y^{\star})\\
\frac{\partial}{\partial W}\Psi(x) & =\frac{\partial}{\partial W}\frac{1}{2}\|Wy^{\star}-x\|^{2}
\end{align}$$

where $y^{\star} =\Psi(x)$, and hence integrate it into any modular deep learning framework. Unfortunately, computing this layer requires the solution of a small optimization problem (on the order of the number of features in a datapoint). This may be cheap, but it is still an order of magnitude more expensive than a simple matrix multiplication passed through a nonlinearity.

##### The $k$ Approximate Sparse Regression Layer

Let us instead try to approximate this function. First, I will define a cryptic operator
$$
\Psi_{x}'(y) := \sigma(y-\alpha W^{T}(Wy-x)), \qquad \sigma(v)_{i}=\mbox{sign}(v_{i})\max\{\left| v_{i} \right|-\alpha\lambda,\,0\}
$$

where $\alpha$ is a small step size which depends on the eigenvalues of $W^{T}W$: it should be no larger than the reciprocal of the largest one. Let's take the following two facts as given:

1. $\Psi(x)$ is the unique fixed point of the map $y\mapsto \Psi_{x}'(y),$ i.e. $\Psi(x)=\Psi_{x}'(\Psi(x))$
2. From any initial point $y$, repeated application of the map will always converge to $\Psi(x).$

In informal but suggestive notation, 
$$\Psi(x)=\Psi_{x}'(\Psi_{x}'(\Psi_{x}'(\cdots))).$$

This suggests a series of approximations $\Psi\approx\Psi_{k}$, starting at $\mathbf{0}$ for simplicity,
$$\begin{align}
\Psi_{0}(x) & =\Psi_{x}'(\mathbf{0})\\
\Psi_{1}(x) & =\Psi_{x}'(\Psi_{x}'(\mathbf{0}))\\
\Psi_{2}(x) & =\Psi'_{x}(\Psi_{x}'(\Psi_{x}'(\mathbf{0})))\\
 & \vdots\\
\lim_{k\rightarrow\infty}\Psi_{k}(x) & =\Psi(x).
\end{align}$$

The corresponding architecture looks like this:

<p>
<img src = "diagram.svg" width = 500px>
</p>
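The sequence $\Psi_0, \Psi_1, \Psi_2, \dots$ can be sketched in a few lines of plain Python. Several things here are assumptions of mine rather than details from the paper: the small dictionary `W`, the input `x`, the value of `lam`, and the step size `alpha = 1 / ||W||_F^2`, which is a conservative choice because the squared Frobenius norm is never smaller than the largest eigenvalue of $W^TW$. The sketch takes a gradient step of size `alpha` and thresholds at `alpha * lam`, the usual proximal gradient scaling (this iteration is commonly known as ISTA).

```python
def soft_threshold(v, t):
    """Elementwise sign(v_i) * max(|v_i| - t, 0), written without a sign function."""
    return [max(vi - t, 0.0) + min(vi + t, 0.0) for vi in v]

def matvec(W, y):
    """Return W y, with W stored as a list of rows."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in W]

def matTvec(W, r):
    """Return W^T r."""
    return [sum(W[i][j] * r[i] for i in range(len(W))) for j in range(len(W[0]))]

def psi_step(W, x, y, alpha, lam):
    """One application of the map: a gradient step on 0.5*||Wy - x||^2, then a threshold."""
    r = [a - b for a, b in zip(matvec(W, y), x)]
    g = matTvec(W, r)
    return soft_threshold([yi - alpha * gi for yi, gi in zip(y, g)], alpha * lam)

def psi_k(W, x, k, alpha, lam):
    """Psi_k(x): apply the map k + 1 times, starting from the zero vector."""
    y = [0.0] * len(W[0])
    for _ in range(k + 1):
        y = psi_step(W, x, y, alpha, lam)
    return y

def objective(W, x, y, lam):
    """The sparse coding objective 0.5*||Wy - x||^2 + lam*||y||_1."""
    r = [a - b for a, b in zip(matvec(W, y), x)]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(yi) for yi in y)

# Hypothetical overcomplete dictionary (3 atoms in R^2) and input.
W = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [1.0, 2.0]
lam = 0.1
alpha = 1.0 / sum(w * w for row in W for w in row)  # 1 / ||W||_F^2

for k in (0, 1, 5, 50, 500):
    y = psi_k(W, x, k, alpha, lam)
    print(k, [round(v, 4) for v in y], round(objective(W, x, y, lam), 6))
```

On this toy problem the objective decreases as `k` grows and the first coefficient is driven to exactly zero, which is the sparsity the $L_1$ penalty is meant to produce.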

And our most aggressive approximation, $\Psi_0(x)$, has the form
$$
\Psi_{0}(x)
  =\Psi_{x}'(\mathbf{0})
  =\sigma(\mathbf{0}-\alpha W^{T}(W\mathbf{0}-x))
  =\sigma(\alpha W^{T}x)
$$

which should look familiar. Why, it's nothing more than our generic layer discussed at the beginning! Now, bear in mind this may be nothing more than a tantalizing coincidence. But read on.
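The identity is easy to check numerically. In the sketch below, `prox_grad_step` performs one step of the iteration from an arbitrary point and `thresholded_linear_layer` is an ordinary "linear map, then nonlinearity" layer; both names, the example data, and the `alpha * lam` threshold scaling are my own assumptions for illustration.

```python
def soft_threshold(v, t):
    """Elementwise sign(v_i) * max(|v_i| - t, 0)."""
    return [max(vi - t, 0.0) + min(vi + t, 0.0) for vi in v]

def matvec(W, y):
    """Return W y, with W stored as a list of rows."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in W]

def matTvec(W, r):
    """Return W^T r."""
    return [sum(W[i][j] * r[i] for i in range(len(W))) for j in range(len(W[0]))]

def prox_grad_step(W, x, y, alpha, lam):
    """sigma(y - alpha * W^T (W y - x)) with sigma a soft threshold."""
    r = [a - b for a, b in zip(matvec(W, y), x)]
    g = matTvec(W, r)
    return soft_threshold([yi - alpha * gi for yi, gi in zip(y, g)], alpha * lam)

def thresholded_linear_layer(W, x, alpha, lam):
    """sigma(alpha * W^T x): a linear transform followed by an elementwise nonlinearity."""
    return soft_threshold([alpha * v for v in matTvec(W, x)], alpha * lam)

# Hypothetical data, chosen only for illustration.
W = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [1.0, 2.0]
alpha, lam = 1.0 / 3.0, 0.1

psi0 = prox_grad_step(W, x, [0.0, 0.0, 0.0], alpha, lam)
layer_out = thresholded_linear_layer(W, x, alpha, lam)
print(psi0)
print(layer_out)
```

Starting from zero, the $Wy$ term vanishes, so the two printed vectors agree.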

##### Proximal Gradient Descent

But hold on, you say. Where did $\Psi_x'$ come from? If you've worked with $L_1$ regularization, this map may look all too familiar. Indeed, $\Psi_x'$ comes from a general framework which solves problems with the structure
$$y^{\star}=\mbox{argmin}\{f(y)+g(y)\}$$

where $f$ is smooth and convex, and $g$ is merely convex. The sparse coding problem has this structure, with
$$f(y)=\tfrac{1}{2}\|Wy-x\|^{2},\qquad g(y)=\lambda\|y\|_{1}$$

But the generality of this framework allows you to replace $g$ with a whole variety of other functions. This includes indicator functions, which express constrained optimization with a bit of syntactic sugar:
$$y^{\star}=\underset{y}{\mbox{argmin}}\{f(y)+\delta(y\mid S)\}=\underset{y\in S}{\mbox{argmin}}\{f(y)\},\qquad\delta(y\mid S)=\begin{cases}
0 & y\in S\\
\infty & y\notin S
\end{cases}.$$
Given $f$ and $g$, the proximal gradient method defines the map
$$\Psi_{f}'(y)=\sigma_{\alpha g}(y-\alpha\nabla f(y)),\qquad\sigma_{g}(y)=\underset{\bar{y}}{\mbox{argmin}}\{\tfrac{1}{2}\|\bar{y}-y\|^{2}+g(\bar{y})\}$$

so that, you guessed it

1. $y^\star$ is the unique fixed point of the map $y\mapsto \Psi_{f}'(y)$, i.e. $y^\star = \Psi_{f}'(y^\star)$
2. From any $y$, for small enough $\alpha>0$, repeated application of the map will always converge to $y^\star$
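Here is a hedged sketch of the general iteration with pluggable pieces. The function `proximal_gradient` and its signature are hypothetical, not taken from any library. As a test problem I assume $f(y)=\tfrac{1}{2}\|y-c\|^2$, whose minimizer under each $g$ is known in closed form (it is the proximal operator of $g$ evaluated at $c$), so the fixed point can be checked directly.

```python
def proximal_gradient(grad_f, prox, y0, alpha, iters=100):
    """Iterate y <- prox(y - alpha * grad_f(y), alpha), starting from y0."""
    y = list(y0)
    for _ in range(iters):
        g = grad_f(y)
        y = prox([yi - alpha * gi for yi, gi in zip(y, g)], alpha)
    return y

c = [1.0, -2.0, 3.0]  # hypothetical data
lam = 0.5

def grad_f(y):
    """Gradient of the smooth part f(y) = 0.5 * ||y - c||^2."""
    return [yi - ci for yi, ci in zip(y, c)]

def prox_nonneg(v, alpha):
    """g = indicator of the nonnegative orthant: a projection, so alpha is unused."""
    return [max(0.0, vi) for vi in v]

def prox_l1(v, alpha):
    """g = lam * ||y||_1: soft thresholding at alpha * lam."""
    t = alpha * lam
    return [max(vi - t, 0.0) + min(vi + t, 0.0) for vi in v]

# grad_f is 1-Lipschitz, so a step size of 0.5 is small enough.
y_nonneg = proximal_gradient(grad_f, prox_nonneg, [0.0, 0.0, 0.0], alpha=0.5)
y_l1 = proximal_gradient(grad_f, prox_l1, [0.0, 0.0, 0.0], alpha=0.5)
print(y_nonneg)  # close to [1, 0, 3]
print(y_l1)      # close to [0.5, -1.5, 2.5]
```

Swapping `prox` changes the problem being solved without touching the loop, which is the modularity the framework promises.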

The theory is covered in detail in Boyd's [slides](https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf), but here is a sampling of proximal operators for different $g$'s
$$\begin{alignat*}{2}
 & g(y)=0 &  & \sigma_{g}(y)=y\\
 & g(y)=\lambda\|y\|_{1} &  & \sigma_{g}(y)_{i}=\mbox{sign}(y_{i})\max\{\left|y_{i}\right|-\lambda,0\}\\
 & g(y)=\lambda\|y\|_{2} &  & \sigma_{g}(y)=\max\{0,1-\lambda/\|y\|_{2}\}y\\
 & g(y)=\delta(y\mid\mathbf{R}_{+}) & \qquad & \sigma_{g}(y)_{i}=\max\{0,y_{i}\}\\
 & g(y)=\delta(y\mid\mathbf{R}_{+})+\lambda\|y\|_{1} & \qquad & \sigma_{g}(y)_{i}=\max\{0,y_{i}-\lambda\}\\
 & g(y)=\tfrac{1-\alpha}{2\alpha}\textstyle\sum_{i}\min\{0,y_{i}\}^{2},\ 0<\alpha\le1 &  & \sigma_{g}(y)_{i}=\max\{y_{i},\alpha y_{i}\}
\end{alignat*}$$
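A few of these closed forms can be sanity-checked against the definition of $\sigma_g$ by brute force. The sketch below covers the $L_1$, nonnegativity, and combined rows in the scalar case, plus the $L_2$ row on a small vector. The helper `brute_force_prox`, the grid resolution, and the test values are assumptions chosen for illustration.

```python
import math

def prox_l1(y, lam):
    """g = lam*|y|: soft thresholding."""
    return max(y - lam, 0.0) + min(y + lam, 0.0)

def prox_nonneg(y):
    """g = indicator of y >= 0: the ReLU."""
    return max(0.0, y)

def prox_nonneg_l1(y, lam):
    """g = indicator of y >= 0 plus lam*|y|: a shifted ReLU."""
    return max(0.0, y - lam)

def prox_l2(y, lam):
    """g = lam*||y||_2 on a vector y: shrink the whole vector toward zero."""
    norm = math.sqrt(sum(v * v for v in y))
    scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    return [scale * v for v in y]

def brute_force_prox(g, y, lo=-5.0, hi=5.0, n=10001):
    """Grid search for the t minimizing 0.5*(t - y)^2 + g(t)."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda t: 0.5 * (t - y) ** 2 + g(t))

lam = 0.5

def g_l1(t):
    return lam * abs(t)

def g_nonneg(t):
    return 0.0 if t >= 0 else math.inf

def g_nonneg_l1(t):
    return g_nonneg(t) + lam * abs(t)

worst = 0.0
for y in (-2.0, -0.3, 0.0, 0.3, 2.0):
    worst = max(worst,
                abs(brute_force_prox(g_l1, y) - prox_l1(y, lam)),
                abs(brute_force_prox(g_nonneg, y) - prox_nonneg(y)),
                abs(brute_force_prox(g_nonneg_l1, y) - prox_nonneg_l1(y, lam)))
print("largest gap between closed form and grid search:", worst)

shrunk = prox_l2([3.0, 4.0], 1.0)
print(shrunk)  # the norm drops from 5 to 4, direction unchanged
```

The gap should be no larger than the grid spacing, since each scalar objective is strongly convex and has a single minimizer.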

These, again, look tantalizingly like activation functions, including the `ReLU` and the `Leaky ReLU`. Time will bear this all out, but for now the SARMs show impressive benchmarks with a minimal amount of training, and this idea is a worthy one for meditation.

Oh, and here's a link to my [website](http://gabgoh.github.io).

```python
from IPython.display import HTML

# Apply the notebook's custom stylesheet. Yes, I wrote this in IPython.
with open("style.css", "r") as f:
    css = f.read()
HTML(css)
```