## Implementing A Neural Network

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\bigoh}[1]{\mathcal{O}\left(#1\right)}
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

![](assets/neural_network_diagram.jpg)

## Calculating Activations In A Neural Network

We will denote the inputs to a neural network as $x$. These are the activations of the first layer in the network. In my example, $x$ is an $m$-dimensional vector.

The second layer is the first *hidden layer*. In my example there are $n$ units in the second layer.

Each unit of the second layer is effectively a logistic regression model. For each unit, we compute a weighted sum of the previous layer's activations to get a $z$ value (the *preactivation*), then pass this value through the logistic function to get $a = \sigma(z)$.

Consider the $i$th unit in the second layer. Then:

\\[
\begin{align}
z_i^2 &:= \theta_i^1 x
\\
a_i^2 &:= \sigma\left(z_i^2\right)
\end{align}
\\]

$\theta_i^1$ is the vector of $m$ weights used to calculate the preactivation of the $i$th unit as a weighted sum of the $m$ values of $x$, so $\theta_i^1 x$ is a dot product. The superscripts are layer indices, not exponents. Every unit in the layer uses its own $\theta_i^1$ weight vector.
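As a minimal sketch of the calculation for a single unit, assuming NumPy and some made-up values for $x$ and $\theta_i^1$ (the names `sigmoid`, `theta_i`, `z_i`, and `a_i` are illustrative, not part of any library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only so the arithmetic is easy to follow.
x = np.array([1.0, 2.0, 3.0])          # inputs: m = 3 first-layer activations
theta_i = np.array([0.5, -0.25, 0.0])  # weights theta_i^1 for one unit

z_i = theta_i @ x    # weighted sum: 0.5*1 - 0.25*2 + 0*3 = 0.0
a_i = sigmoid(z_i)   # sigma(0) = 0.5
```

Like the equations above, this sketch has no bias term.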

To make things compact, we write each $\theta_i^1$ as the $i$th row of an $n$-by-$m$ matrix denoted $\Theta^1$. Using this matrix we can calculate $\Theta^1 x$, which transforms the $m$-dimensional input into an $n$-dimensional output. This output is exactly $z^2$, the vector of individual $z_i^2$ values.

It is common to denote the $i$-th row of $\Theta^1$ by $\Theta_{i, :}^1$. Here the $:$ symbol means "all the columns."

To calculate the $a^2$ values, the activations of the second layer, we just calculate $\sigma\left(z^2\right)$, which applies the sigmoid to each individual coordinate of $z^2$.
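The vectorized calculation might look like the following in NumPy. This is only a sketch: the sizes `m` and `n`, the random weights, and the variable names are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

m, n = 4, 3                        # hypothetical layer sizes
x = rng.normal(size=m)             # m-dimensional input
Theta1 = rng.normal(size=(n, m))   # n-by-m weight matrix; row i holds theta_i^1

z2 = Theta1 @ x                    # n-dimensional preactivation vector
a2 = sigmoid(z2)                   # sigmoid applied to each coordinate of z2

# Row i of Theta1 reproduces the per-unit weighted sum.
i = 1
z2_i = Theta1[i, :] @ x
```

The slice `Theta1[i, :]` mirrors the $\Theta_{i, :}^1$ notation: fix the row index and take every column.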


These activations are then used for the next step of the process, which proceeds the same way.

In this case, the third layer happens to be the *output layer*. In my example there is only one output value. In principle there could be many outputs, or there could be more hidden layers.

Everything works the same way. There is a $1$-by-$n$ matrix $\Theta^2$ that calculates the 1-dimensional preactivation $z^3$ for the third layer as $\Theta^2 a^2$. We again calculate $a^3 = \sigma\left(z^3\right)$.
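Putting the two steps together, a forward pass through this three-layer network could be sketched as below. The function name `forward`, the layer sizes, and the all-zero weights are hypothetical choices for illustration, and the sketch follows the text in omitting bias terms.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Compute hidden and output activations for a three-layer network."""
    z2 = Theta1 @ x     # preactivations of the hidden layer (n values)
    a2 = sigmoid(z2)    # activations of the hidden layer
    z3 = Theta2 @ a2    # preactivation of the output layer (1 value)
    a3 = sigmoid(z3)    # activation of the output layer
    return a2, a3

m, n = 4, 3                  # hypothetical layer sizes
x = np.ones(m)
Theta1 = np.zeros((n, m))    # n-by-m weights into the hidden layer
Theta2 = np.zeros((1, n))    # 1-by-n weights into the output layer

a2, a3 = forward(x, Theta1, Theta2)
# With all-zero weights every preactivation is 0,
# so every activation is sigma(0) = 0.5.
```

Adding more hidden layers would repeat the same pattern: multiply by the next weight matrix, then apply the sigmoid.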


### Backpropagation

**TODO**