## Implementing A Neural Network

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\bigoh}[1]{\mathcal{O}\left(#1\right)}
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\trans}[0]{^\intercal}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

![](assets/neural_network_diagram.jpg)

## Calculating Activations In A Neural Network

We will denote the inputs to a neural network as $x$. These are the activations of the first layer in the network. In my example, $x$ is an $m$-dimensional vector.

The second layer is the first *hidden layer*. In my example there are $n$ units in the second layer.

Each unit of the second layer is effectively a logistic regression model. For each unit, we take a weighted sum of the previous layer's activations to calculate a $z$ value (the *preactivation*), then run this value through the logistic function to calculate $a = \sigma(z)$.

Consider the $i$th unit in the second layer. Then:

\\[
\begin{align}
z_i^2 &:= \theta_i^1 x
\\
a_i^2 &:= \sigma\left(z_i^2\right)
\end{align}
\\]

$\theta_i^1$ is the list of $m$ weights to use to calculate the preactivation of the $i$th unit from a weighted sum of the $m$ values of $x$. Every unit in the layer will use a different $\theta_i^1$ weight vector.

To make things compact, we write each $\theta_i^1$ as the $i$th row of an $n$-by-$m$ matrix denoted $\Theta^1$. Using this matrix we can calculate $\Theta^1 x$, which transforms the $m$-dimensional input into an $n$-dimensional output. This output is exactly $z^2$, the vector of individual $z_i^2$ values.

It is common to denote the $i$-th row of $\Theta^1$ by $\Theta_{i, :}^1$. Here the $:$ symbol means "all the columns."

To calculate the $a^2$ values, the activations of the second layer, we just calculate $\sigma\left(z^2\right)$, which applies the sigmoid to each individual coordinate of $z^2$.
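As a minimal sketch of this step in plain Python: the helper names `sigmoid` and `layer_forward`, and the example numbers, are illustrative assumptions rather than part of the original example. It computes the preactivations $z^2$ row by row and then applies the sigmoid to each coordinate.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(theta, a_prev):
    """Return (z, a) for one layer: z = theta * a_prev, a = sigmoid(z) coordinatewise."""
    # Each row of theta holds the weights of one unit in the layer.
    z = [sum(w * a for w, a in zip(row, a_prev)) for row in theta]
    a = [sigmoid(z_i) for z_i in z]
    return z, a

# Hypothetical example with m = 3 inputs and n = 2 hidden units.
x = [1.0, 2.0, -1.0]
theta1 = [
    [0.5, -0.25, 0.0],  # weights of hidden unit 1
    [0.0, 0.5, -1.0],   # weights of hidden unit 2
]
z2, a2 = layer_forward(theta1, x)
print(z2, a2)
```

Storing the matrix as a list of rows mirrors the convention that each row holds the weights of one unit.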


These activations are then used for the next step of the process, which proceeds the same way.

In this case, the third layer happens to be the *output layer*. In my example it looks like there is only one output value. In principle there could be many outputs, or there could be more hidden layers.

Everything works the same way. There is a $1$-by-$n$ matrix $\Theta^2$ that calculates the 1-dimensional preactivation $z^3$ for the third layer as $\Theta^2 a^2$. We again calculate $a^3 = \sigma\left(z^3\right)$.
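Putting both layers together, a full forward pass might look like the sketch below. The function names `matvec` and `forward` and all the weights are hypothetical, chosen only to illustrate the shapes: an $n$-by-$m$ matrix for the hidden layer and a $1$-by-$n$ matrix for the output layer.

```python
import math

def sigmoid(z):
    """Logistic function of a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(theta, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in theta]

def forward(theta1, theta2, x):
    """Forward pass through both layers; returns (z2, a2, z3, a3)."""
    z2 = matvec(theta1, x)         # n preactivations of the hidden layer
    a2 = [sigmoid(z) for z in z2]  # n hidden activations
    z3 = matvec(theta2, a2)        # 1 preactivation of the output layer
    a3 = [sigmoid(z) for z in z3]  # 1 output activation
    return z2, a2, z3, a3

# Hypothetical weights: theta1 is 2-by-3, theta2 is 1-by-2.
x = [1.0, 2.0, -1.0]
theta1 = [[0.5, -0.25, 0.0], [0.0, 0.5, -1.0]]
theta2 = [[1.0, -1.0]]
z2, a2, z3, a3 = forward(theta1, theta2, x)
print(a3)
```

Returning the intermediate values, not just the output, is a deliberate choice here: backpropagation needs them later.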


### Backpropagation

To train neural networks using gradient descent, we need to be able to calculate the partial derivatives of the error function with respect to the parameters.

Calculating these partial derivatives is an exercise in repeated application of the chain rule.

First, we calculate $\fpartial{E}{a_1^3}$. For my purposes it is not important what the error function actually is. To calculate this derivative, we need to know the current value of $a_1^3$. This means that we first must do a *forward pass* to calculate this value on each example.

Once we know $\fpartial{E}{a_1^3}$, we can calculate $\fpartial{E}{z_1^3}$ via the chain rule:

\\[
\fpartial{E}{z_1^3}
=
\fpartial{E}{a_1^3}
\fpartial{a_1^3}{z_1^3}
=
\fpartial{E}{a_1^3}
\sigma'\left(z_1^3\right)
=
\fpartial{E}{a_1^3}
\sigma\left(z_1^3\right)
\left( 1 - \sigma\left(z_1^3\right) \right)
\\]

We can do this because by definition $a_1^3 = \sigma(z_1^3)$.
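To make this concrete, here is a small sketch that checks the formula numerically. The text leaves the error function open, so the half squared error used here is purely an assumption for illustration, as are the names `error` and `dE_dz` and the example values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(a, y):
    # Assumed error function (not from the text): half squared error.
    return 0.5 * (a - y) ** 2

def dE_dz(z, y):
    """Chain rule: dE/dz = dE/da * sigma(z) * (1 - sigma(z))."""
    a = sigmoid(z)
    dE_da = a - y  # derivative of the assumed half squared error
    return dE_da * a * (1.0 - a)

# Hypothetical preactivation and target.
z3, y = 0.3, 1.0
analytic = dE_dz(z3, y)

# Central finite difference of E as a function of z3.
eps = 1e-6
numeric = (error(sigmoid(z3 + eps), y) - error(sigmoid(z3 - eps), y)) / (2 * eps)
print(analytic, numeric)
```

The two printed numbers should agree to many decimal places, which is a handy sanity check whenever you derive a gradient by hand.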

Once we know $\fpartial{E}{z_1^3}$, we can move on to calculate $\fpartial{E}{\Theta_{1, i}^2}$, the partial derivative with respect to the $i$th entry of the single row of the $1$-by-$n$ matrix $\Theta^2$. Here is how:

\\[
\begin{align}
z_1^3
&= \Theta^2 a^2
\\
&= \sum_k \Theta_{1, k}^2 a_k^2
\end{align}
\\]

Therefore:

\\[
\fpartial{z_1^3}{\Theta_{1, i}^2} = a_i^2
\\]

Using this together with what we know about $\fpartial{E}{z_1^3}$, we have:

\\[
\begin{align}
\fpartial{E}{\Theta_{1, i}^2}
&=
\fpartial{E}{z_1^3}
\fpartial{z_1^3}{\Theta_{1, i}^2}
\\
&=
\fpartial{E}{a_1^3}
\sigma\left(z_1^3\right)
\left( 1 - \sigma\left(z_1^3\right) \right)
a_i^2
\end{align}
\\]

This is exactly the partial derivative we will need to use to update $\Theta_{1, i}^2$.
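The derivation above can be sketched in code as follows. The half squared error, the function names `output_error` and `grad_theta2`, and the example numbers are all assumptions made for illustration. Each weight's partial derivative is $\fpartial{E}{z_1^3}$ times the activation that weight multiplies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_error(theta2_row, a2, y):
    """Assumed error: half squared error of the single sigmoid output."""
    z3 = sum(w * a for w, a in zip(theta2_row, a2))
    return 0.5 * (sigmoid(z3) - y) ** 2

def grad_theta2(theta2_row, a2, y):
    """Partial of E for each output weight: dE/dz3 times the matching a_i^2."""
    z3 = sum(w * a for w, a in zip(theta2_row, a2))
    a3 = sigmoid(z3)
    dE_dz3 = (a3 - y) * a3 * (1.0 - a3)  # dE/da3 * sigma'(z3)
    return [dE_dz3 * a for a in a2]

# Hypothetical hidden activations (n = 2), output weights, and target.
a2 = [0.5, 0.88]
theta2_row = [1.0, -1.0]
y = 1.0
grad = grad_theta2(theta2_row, a2, y)

# Finite-difference check on the first weight.
eps = 1e-6
plus = [theta2_row[0] + eps, theta2_row[1]]
minus = [theta2_row[0] - eps, theta2_row[1]]
numeric = (output_error(plus, a2, y) - output_error(minus, a2, y)) / (2 * eps)
print(grad[0], numeric)
```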

### Keep Pushing Backward

We also need to find $\fpartial{E}{\Theta_{i, j}^1}$. To do this, we need to first find $\fpartial{E}{a_i^2}$.

Now, via the chain rule we know we can write:

\\[
\begin{align}
\fpartial{E}{a_i^2}
&=
\fpartial{E}{z_1^3}
\fpartial{z_1^3}{a_i^2}
\end{align}
\\]

We have previously calculated $\fpartial{E}{z_1^3}$. The new part is $\fpartial{z_1^3}{a_i^2}$. But we know:

\\[
z_1^3
=
\sum_k
\Theta_{1, k}^2
a_k^2
\\]

Therefore we have $\fpartial{z_1^3}{a_i^2} = \Theta_{1, i}^2$. Thus we have:

\\[
\fpartial{E}{a_i^2}
=
\fpartial{E}{z_1^3}
\Theta_{1, i}^2
\\]
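Here is a sketch of the same step in code, again assuming a half squared error and using hypothetical names (`output_error`, `grad_a2`) and values. The partial derivative with respect to each hidden activation is $\fpartial{E}{z_1^3}$ times the output weight that multiplies that activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_error(theta2_row, a2, y):
    """Assumed error: half squared error of the single sigmoid output."""
    z3 = sum(w * a for w, a in zip(theta2_row, a2))
    return 0.5 * (sigmoid(z3) - y) ** 2

def grad_a2(theta2_row, a2, y):
    """Partial of E for each hidden activation: dE/dz3 times the matching weight."""
    z3 = sum(w * a for w, a in zip(theta2_row, a2))
    a3 = sigmoid(z3)
    dE_dz3 = (a3 - y) * a3 * (1.0 - a3)
    return [dE_dz3 * w for w in theta2_row]

# Hypothetical hidden activations, output weights, and target.
a2 = [0.5, 0.88]
theta2_row = [1.0, -1.0]
y = 1.0
dE_da2 = grad_a2(theta2_row, a2, y)

# Finite-difference check on the second hidden activation.
eps = 1e-6
plus = [a2[0], a2[1] + eps]
minus = [a2[0], a2[1] - eps]
numeric = (output_error(theta2_row, plus, y) - output_error(theta2_row, minus, y)) / (2 * eps)
print(dE_da2[1], numeric)
```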

### Backpropagating Further: Vectors!

In the second layer there are multiple $a_i^2$ units. Let me write $\fpartial{E}{a^2}$ to mean the vector of partial derivatives with respect to each $a_i^2$.

Then, to calculate $\fpartial{E}{z^2}$ (again, a vector of partial derivatives), I note:

\\[
a^2 = \sigma\left(z^2\right)
\\]

Thus, I have:

\\[
\begin{align}
\fpartial{E}{z^2}
&=
\fpartial{E}{a^2}
\circ
\fpartial{a^2}{z^2}
\\
&=
\fpartial{E}{a^2}
\circ
\fpartial{}{z^2} \sigma\left(z^2 \right)
\\
&=
\fpartial{E}{a^2}
\circ
\sigma\left(z^2\right)
\circ
\left(1 - \sigma\left(z^2\right)\right)
\end{align}
\\]

Note the $\circ$ operation means coordinatewise multiplication.
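In code, the coordinatewise product is just a loop over matching entries. This is a sketch with a hypothetical helper name (`backprop_through_sigmoid`) and made-up input values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_through_sigmoid(dE_da, z):
    """Coordinatewise: dE/dz_i = dE/da_i * sigma(z_i) * (1 - sigma(z_i))."""
    return [g * sigmoid(z_i) * (1.0 - sigmoid(z_i)) for g, z_i in zip(dE_da, z)]

# Hypothetical hidden preactivations and upstream partial derivatives.
z2 = [0.0, 2.0]
dE_da2 = [2.0, -1.0]
dE_dz2 = backprop_through_sigmoid(dE_da2, z2)

# The first entry is 2.0 * 0.5 * 0.5 = 0.5, because sigmoid(0) = 0.5.
print(dE_dz2)
```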


### Partials With Respect To $\Theta^1$

Okay, unlike when we were calculating the partials for $\Theta^2$, where we multiplied the $n$ activations $a^2$ from the previous layer to produce a single value $z^3$, now we have $m$ activations $a^1$ (that is, $x$) multiplied by an $n$-by-$m$ matrix $\Theta^1$ to produce an $n$-dimensional vector $z^2$.

So let's think about this! The entry $\Theta_{i, j}^1$ multiplies $a_j^1$ to contribute a part of $z_i^2$. Therefore,

\\[
\begin{align}
\fpartial{E}{\Theta_{i, j}^1}
&=
\fpartial{E}{z_i^2}
\fpartial{z_i^2}{\Theta_{i, j}^1}
\\
&=
\fpartial{E}{z_i^2}
a_j^1
\end{align}
\\]

This formula shows how to calculate each row of the matrix $\fpartial{E}{\Theta^1}$. Row $i$ is the scalar $\fpartial{E}{z_i^2}$ times the vector $a^1$:

\\[
\fpartial{E}{\Theta_{i, :}^1}
=
\fpartial{E}{z_i^2}
\,
a^1
\\]

Sometimes this is written using the *outer product* operation:

\\[
\fpartial{E}{\Theta^1}
=
\fpartial{E}{z^2}
\otimes
a^1
\\]
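For reference, the outer product is short to write out by hand. The sketch below uses a hypothetical helper `outer` and made-up values. Entry $(i, j)$ of the result is $\fpartial{E}{z_i^2}$ times $a_j^1$, so the result has the same $n$-by-$m$ shape as $\Theta^1$.

```python
def outer(u, v):
    """Outer product: entry (i, j) is u[i] * v[j]."""
    return [[u_i * v_j for v_j in v] for u_i in u]

# Hypothetical values: n = 2 partials with respect to z^2, m = 3 inputs.
dE_dz2 = [0.5, -0.1]
a1 = [1.0, 2.0, -1.0]

dE_dtheta1 = outer(dE_dz2, a1)
print(dE_dtheta1)  # [[0.5, 1.0, -0.5], [-0.1, -0.2, 0.1]]
```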

Honestly, I frequently forget the definition of the outer product, so I just have to re-derive the formula for this matrix of partial derivatives.

You don't have to memorize any of these as formulas. You need to be able to apply the chain rule to figure them out when you need them.

### Partials With Respect To $a^1$

Since $x = a^1$, you don't need to calculate these partial derivatives: you aren't allowed to change $x$, because that's just your input.

Let's do it anyway, because we haven't backpropagated across an $n$-by-$m$ matrix yet.

So let's think. The value $a_j^1$ affects all the $z^2$ values, by virtue of the column $\Theta_{:, j}^1$. Therefore:

\\[
\begin{align}
\fpartial{E}{a_j^1}
&=
\sum_{i = 1}^n
\fpartial{E}{z_i^2}
\Theta_{i, j}^1
\\
&=
\fpartial{E}{z^2}
\cdot
\Theta_{:, j}^1
\end{align}
\\]

Basically, to calculate $\fpartial{E}{a_j^1}$, we take the dot product of $\fpartial{E}{z^2}$ with the column $\Theta_{:, j}^1$. Normal matrix multiplication takes dot products with rows, but this one is with columns.

That suggests maybe we can use a transpose of $\Theta^1$ to write this in terms of regular matrix multiplication. We have:

\\[
\fpartial{E}{a_j^1}
=
\Theta^{1\intercal}_{j, :}
\fpartial{E}{z^2}
\\]

Indeed, we can calculate the whole $\fpartial{E}{a^1}$ vector in one go:

\\[
\fpartial{E}{a^1}
=
\Theta^{1\intercal}
\fpartial{E}{z^2}
\\]
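A quick sketch can confirm that multiplying by the transpose gives the same answer as taking dot products with columns. The helpers `transpose` and `matvec` and all the numbers are hypothetical.

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]

# Hypothetical n-by-m = 2-by-3 weight matrix and partials with respect to z^2.
theta1 = [[0.5, -0.25, 0.0], [0.0, 0.5, -1.0]]
dE_dz2 = [0.5, -0.1]

# One matrix-vector product with the transpose ...
dE_da1 = matvec(transpose(theta1), dE_dz2)

# ... matches the column-by-column dot products from the derivation.
n, m = len(theta1), len(theta1[0])
by_columns = [sum(dE_dz2[i] * theta1[i][j] for i in range(n)) for j in range(m)]
print(dE_da1, by_columns)
```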

### The End!

This is it! This is how neural network partial derivatives are calculated for the parameters $\Theta^i$. Using these partial derivatives we can use normal gradient descent to train the model.
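To tie the pieces together, here is a sketch of the whole procedure on a single training example: a forward pass, the backward pass derived above, and a few gradient descent updates. The half squared error, the learning rate, the starting weights, and every function name are assumptions chosen for illustration, not a definitive implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def forward(theta1, theta2, x):
    z2 = matvec(theta1, x)
    a2 = [sigmoid(z) for z in z2]
    z3 = matvec(theta2, a2)
    a3 = [sigmoid(z) for z in z3]
    return z2, a2, z3, a3

def error(theta1, theta2, x, y):
    # Assumed error function: half squared error on the single output.
    a3 = forward(theta1, theta2, x)[3]
    return 0.5 * (a3[0] - y) ** 2

def gradients(theta1, theta2, x, y):
    """Backward pass: partials of E with respect to theta1 and theta2."""
    z2, a2, z3, a3 = forward(theta1, theta2, x)
    dE_da3 = a3[0] - y
    dE_dz3 = dE_da3 * a3[0] * (1.0 - a3[0])
    dE_dtheta2 = [[dE_dz3 * a for a in a2]]           # 1-by-n
    dE_da2 = [dE_dz3 * w for w in theta2[0]]          # n values
    dE_dz2 = [g * a * (1.0 - a) for g, a in zip(dE_da2, a2)]
    dE_dtheta1 = [[g * x_j for x_j in x] for g in dE_dz2]  # n-by-m outer product
    return dE_dtheta1, dE_dtheta2

def gradient_step(theta, grad, lr):
    """Move each weight a small step against its partial derivative."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(theta, grad)]

# Hypothetical single example and starting weights (m = 3, n = 2).
x, y = [1.0, 2.0, -1.0], 1.0
theta1 = [[0.5, -0.25, 0.1], [0.2, 0.5, -1.0]]
theta2 = [[1.0, -1.0]]

before = error(theta1, theta2, x, y)
for _ in range(100):
    g1, g2 = gradients(theta1, theta2, x, y)
    theta1 = gradient_step(theta1, g1, lr=0.5)
    theta2 = gradient_step(theta2, g2, lr=0.5)
after = error(theta1, theta2, x, y)
print(before, after)
```

On this toy example the error after the updates should be smaller than before. A real training loop would average the gradients over many examples.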

We did it!