# Feed-forward neural networks 

Neural networks are different things to different people. 

Some common and useful perspectives on neural networks (NN) are listed below:

* **Linear algebra view**. An NN is a series of non-linear maps. 


* **Perceptron view**. An NN is a network of computational neurons. 


* **Computational graph view**. An NN is a computational procedure involving a loss and updatable parameters (often with automatic differentiation). 


* **Machine learning view**. An NN represents a non-linear prediction function $f(\mathbf{x}; \mathbf{W})$ with complicated internal structure. 

## Topics for today

* Why do we need non-linearities?
* Forward and backward passes as instances of dynamic programming.
* Differentiation is a linear approximation.
* Gradient checking.

We'll start with a hypothetical (xkcd-style).

<img src="https://what-if.xkcd.com/imgs/whatif-logo.png">

### What if all the non-linearities in an NN suddenly vanished?

A neural network with an input layer, a middle layer, and an output layer computes the following:

$$\mathbf{y} = g(W^{(2)}g(W^{(1)}g(W^{(0)}\mathbf{x})))$$

Here $W^{(0)}$, $W^{(1)}$ and $W^{(2)}$ are the weight matrices of the three layers, and $g$ is a non-linearity, which could be different for each layer.

If we change $g$ to a linear function (e.g. a scaling factor), it can simply be multiplied into the weight matrices. Below we assume that $g$ is the identity (a scaling factor of 1), which allows us to simplify the expression:

$$\mathbf{y} = (W^{(2)}(W^{(1)}(W^{(0)}\mathbf{x})))$$

Since matrix multiplication is associative:

$$A(BC) = (AB)C,$$

we can get rid of the brackets altogether:

$$\mathbf{y} = W^{(2)}W^{(1)}W^{(0)}\mathbf{x}.$$

The series of linear transformations can be summarized in a single transformation matrix:

$$T = W^{(2)}W^{(1)}W^{(0)}.$$

And so the prediction of the neural network becomes:

$$\mathbf{y} = T\mathbf{x}.$$

The effective number of parameters in the now linear neural network is $|\mathbf{y}| \times |\mathbf{x}|$, which is precisely the same as in a standard linear model.
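
The collapse can be checked numerically. The following is a minimal sketch in plain Python, assuming three weight matrices named `W0`, `W1`, `W2` (in order of application) with hypothetical layer sizes 4, 5, 3 and 2, filled with random values. The helper functions are written for this illustration only.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical layer sizes: 4 inputs -> 5 -> 3 -> 2 outputs.
W0 = random_matrix(5, 4, rng)
W1 = random_matrix(3, 5, rng)
W2 = random_matrix(2, 3, rng)
x = [rng.uniform(-1, 1) for _ in range(4)]

# Layer-by-layer evaluation with the non-linearity removed (g is the identity).
y_layers = matvec(W2, matvec(W1, matvec(W0, x)))

# Collapse the three matrices into a single transformation matrix T.
T = matmul(W2, matmul(W1, W0))
y_single = matvec(T, x)

# Both routes agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_layers, y_single))

# T has |y| x |x| entries, regardless of the hidden layer sizes.
n_params_T = len(T) * len(T[0])
```

With these sizes, `T` has 2 × 4 = 8 entries, while the three separate matrices hold 20 + 15 + 6 = 41 entries between them. Without non-linearities, the extra parameters add no expressive power.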

## Toy network

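*(Image of toy network: inputs $x_0, x_1$; first hidden layer $r_0, r_1$; second hidden layer $s_0, s_1$; output $y$. Each node is connected to every node in the next layer.)*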

## Forward pass (predict)

### Output as a function of the input 

Using the NN structure as a guide, we can write down an expression to calculate the output of the network in terms of the input. Here $(a \to b)$ denotes the weight on the connection from node $a$ to node $b$.

$$
\begin{align}
y = g_y\Big(& g_s\big(g_r(x_0 (x_0 \to r_0) + x_1 (x_1 \to r_0)) (r_0 \to s_0) + g_r(x_0 (x_0 \to r_1) + x_1 (x_1 \to r_1)) (r_1 \to s_0)\big) (s_0 \to y) +\\& g_s\big(g_r(x_0 (x_0 \to r_0) + x_1 (x_1 \to r_0)) (r_0 \to s_1) + g_r(x_0 (x_0 \to r_1) + x_1 (x_1 \to r_1)) (r_1 \to s_1)\big) (s_1 \to y)\Big)
\end{align}
$$

The equation avoids any explicit mention of NN concepts such as layers and nodes. Arguably, this is a silly thing to do, but it serves a purpose here: it demonstrates why these concepts are useful.


### Massive duplication of effort

The input-to-output equation has many repeated calculations, which are marked in blue below (from here on, the weight on the connection from node $a$ to node $b$ is written $W_{a, b}$):

$$
\begin{align}
y = g_y \Big(& g_s \big( \color{blue}{g_r(x_0 W_{x_0, r_0} + x_1 W_{x_1, r_0})} W_{r_0, s_0} + \color{blue}{g_r(x_0 W_{x_0, r_1} + x_1 W_{x_1, r_1})} W_{r_1, s_0} \big) W_{s_0, y} +\\& g_s \big( \color{blue}{g_r(x_0 W_{x_0, r_0} + x_1 W_{x_1, r_0})} W_{r_0, s_1} + \color{blue}{g_r(x_0 W_{x_0, r_1} + x_1 W_{x_1, r_1})} W_{r_1, s_1} \big) W_{s_1, y} \Big)
\end{align}
$$

Expressing the output as a set of node-centered equations avoids duplication.

$$
r_0 = g_r(x_0 W_{x_0, r_0} + x_1 W_{x_1, r_0})\\
r_1 = g_r(x_0 W_{x_0, r_1} + x_1 W_{x_1, r_1})\\
s_0 = g_s(r_0 W_{r_0, s_0} + r_1 W_{r_1, s_0})\\
s_1 = g_s(r_0 W_{r_0, s_1} + r_1 W_{r_1, s_1})\\
y = g_y(s_0 W_{s_0, y} + s_1 W_{s_1, y})
$$

The forward pass (**predict**) in a neural network thus uses dynamic programming, although in a simple form: each node's value is computed once, stored, and reused by every node that depends on it.
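
To make the saving concrete, here is a minimal sketch that evaluates the toy network both ways and counts how often $g_r$ is called. Several things here are assumptions for illustration and not part of the notes: `tanh` for $g_r$ and $g_s$, the identity for $g_y$, the weight values, and the dictionary keys such as `"x0,r0"`.

```python
import math

calls = {"g_r": 0}

def g_r(z):
    """First hidden-layer non-linearity (assumed tanh); counts its own calls."""
    calls["g_r"] += 1
    return math.tanh(z)

g_s = math.tanh          # assumed non-linearity for the s layer

def g_y(z):
    return z             # assumed identity output

# Hypothetical weights; the key "a,b" is the weight from node a to node b.
W = {
    "x0,r0": 0.1, "x1,r0": -0.2, "x0,r1": 0.3, "x1,r1": 0.4,
    "r0,s0": 0.5, "r1,s0": -0.6, "r0,s1": 0.7, "r1,s1": 0.8,
    "s0,y": -0.9, "s1,y": 1.0,
}

def forward_expanded(x0, x1, W):
    """One big expression: r0 and r1 are recomputed for every s node."""
    return g_y(
        g_s(g_r(x0 * W["x0,r0"] + x1 * W["x1,r0"]) * W["r0,s0"]
            + g_r(x0 * W["x0,r1"] + x1 * W["x1,r1"]) * W["r1,s0"]) * W["s0,y"]
        + g_s(g_r(x0 * W["x0,r0"] + x1 * W["x1,r0"]) * W["r0,s1"]
              + g_r(x0 * W["x0,r1"] + x1 * W["x1,r1"]) * W["r1,s1"]) * W["s1,y"]
    )

def forward_nodes(x0, x1, W):
    """Node-centered equations: each node is computed once and reused."""
    r0 = g_r(x0 * W["x0,r0"] + x1 * W["x1,r0"])
    r1 = g_r(x0 * W["x0,r1"] + x1 * W["x1,r1"])
    s0 = g_s(r0 * W["r0,s0"] + r1 * W["r1,s0"])
    s1 = g_s(r0 * W["r0,s1"] + r1 * W["r1,s1"])
    return g_y(s0 * W["s0,y"] + s1 * W["s1,y"])

calls["g_r"] = 0
y_expanded = forward_expanded(0.5, -1.0, W)
expanded_calls = calls["g_r"]   # 4 calls to g_r

calls["g_r"] = 0
y_nodes = forward_nodes(0.5, -1.0, W)
node_calls = calls["g_r"]       # 2 calls to g_r
```

Both functions return the same value, but the expanded form evaluates $g_r$ four times and the node-centered form only twice. The gap grows quickly with wider and deeper networks.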


## Backward pass (derivatives)


High-level idea: calculate the partial derivative with respect to each parameter in the network. Again, this can be done inefficiently, with duplication of effort (marked in blue), as demonstrated below:

$$
\begin{align}
%\frac{\partial f}{\partial w(s_0 \to y)} = & f'(s_0 \to y)\\
%\frac{\partial f}{\partial w(s_1 \to y)} = & f'(s_1 \to y)\\
\frac{\partial f}{\partial (r_0 \to s_0)} = & \frac{\partial f}{\partial (s_0 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_0)}\\
\frac{\partial f}{\partial (r_0 \to s_1)} = & \frac{\partial f}{\partial (s_1 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_1)}\\
\frac{\partial f}{\partial (r_1 \to s_0)} = & \frac{\partial f}{\partial (s_0 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_0)}\\
\frac{\partial f}{\partial (r_1 \to s_1)} = & \frac{\partial f}{\partial (s_1 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_1)}\\
\frac{\partial f}{\partial (x_0 \to r_0)} = & x_0 \frac{\partial g_r}{\partial (x_0 \to r_0)} \Bigg( \color{blue}{\Big( \frac{\partial f}{\partial (s_0 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_0)} \Big)} + \color{blue}{\Big( \frac{\partial f}{\partial (s_1 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_1)} \Big)} \Bigg)\\
\frac{\partial f}{\partial (x_0 \to r_1)} = & x_0 \frac{\partial g_r}{\partial (x_0 \to r_1)} \Bigg( \color{blue}{\Big( \frac{\partial f}{\partial (s_0 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_0)} \Big)} + \color{blue}{\Big( \frac{\partial f}{\partial (s_1 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_1)} \Big)} \Bigg)\\
\frac{\partial f}{\partial (x_1 \to r_0)} = & x_1 \frac{\partial g_r}{\partial (x_1 \to r_0)} \Bigg( \color{blue}{\Big( \frac{\partial f}{\partial (s_0 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_0)} \Big)} + \color{blue}{\Big( \frac{\partial f}{\partial (s_1 \to y)} + r_0 \frac{\partial g_s}{\partial (r_0 \to s_1)} \Big)} \Bigg)\\
\frac{\partial f}{\partial (x_1 \to r_1)} = & x_1 \frac{\partial g_r}{\partial (x_1 \to r_1)} \Bigg( \color{blue}{\Big( \frac{\partial f}{\partial (s_0 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_0)} \Big)} + \color{blue}{\Big( \frac{\partial f}{\partial (s_1 \to y)} + r_1 \frac{\partial g_s}{\partial (r_1 \to s_1)} \Big)} \Bigg)\\
\end{align}
$$
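
The same reuse idea can be sketched in code for the toy network. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes `tanh` for $g_r$ and $g_s$, the identity for $g_y$, hypothetical weight values, and it differentiates the output $y$ directly (no loss function). One chain-rule term is computed per node and reused for every weight feeding into that node. The result is then compared against central finite differences, which is the basic form of gradient checking.

```python
import math

def forward(x, W):
    """Node-centered forward pass; returns every node value for later reuse."""
    x0, x1 = x
    r0 = math.tanh(x0 * W["x0,r0"] + x1 * W["x1,r0"])
    r1 = math.tanh(x0 * W["x0,r1"] + x1 * W["x1,r1"])
    s0 = math.tanh(r0 * W["r0,s0"] + r1 * W["r1,s0"])
    s1 = math.tanh(r0 * W["r0,s1"] + r1 * W["r1,s1"])
    y = s0 * W["s0,y"] + s1 * W["s1,y"]   # identity output (assumption)
    return {"x0": x0, "x1": x1, "r0": r0, "r1": r1, "s0": s0, "s1": s1, "y": y}

def backward(n, W):
    """Partial derivatives of y with respect to every weight."""
    # d[node] = derivative of y w.r.t. the node's pre-activation input.
    # Uses tanh'(z) = 1 - tanh(z)^2, so only stored node values are needed.
    d = {"y": 1.0}
    d["s0"] = d["y"] * W["s0,y"] * (1 - n["s0"] ** 2)
    d["s1"] = d["y"] * W["s1,y"] * (1 - n["s1"] ** 2)
    d["r0"] = (d["s0"] * W["r0,s0"] + d["s1"] * W["r0,s1"]) * (1 - n["r0"] ** 2)
    d["r1"] = (d["s0"] * W["r1,s0"] + d["s1"] * W["r1,s1"]) * (1 - n["r1"] ** 2)
    # Derivative for weight a -> b: value at node a times the term stored at b.
    grads = {}
    for key in W:
        a, b = key.split(",")
        grads[key] = n[a] * d[b]
    return grads

def numeric_gradient(x, W, eps=1e-6):
    """Central finite differences, one weight at a time (slow but simple)."""
    grads = {}
    for key in W:
        W_plus = dict(W)
        W_plus[key] += eps
        W_minus = dict(W)
        W_minus[key] -= eps
        grads[key] = (forward(x, W_plus)["y"] - forward(x, W_minus)["y"]) / (2 * eps)
    return grads

# Hypothetical weights; the key "a,b" is the weight from node a to node b.
W = {
    "x0,r0": 0.1, "x1,r0": -0.2, "x0,r1": 0.3, "x1,r1": 0.4,
    "r0,s0": 0.5, "r1,s0": -0.6, "r0,s1": 0.7, "r1,s1": 0.8,
    "s0,y": -0.9, "s1,y": 1.0,
}
x = (0.5, -1.0)

analytic = backward(forward(x, W), W)
numeric = numeric_gradient(x, W)

# Gradient check: the two estimates should agree to many decimal places.
max_error = max(abs(analytic[k] - numeric[k]) for k in W)
```

The backward pass needs a single forward pass plus a handful of multiplications, whereas the finite-difference check runs two forward passes per weight. That cost is why finite differences are typically used only to verify a gradient implementation, not to train with.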