# Week 4: Neural Networks: Representation

## Motivations

### Non-linear Hypotheses

Say we had a classification task with two features, and the decision boundary (the 'line' separating the positive class from the negative) was a fairly complicated shape, such that to represent it we needed to include the quadratic and 3rd degree polynomial terms in the hypothesis.

Our hypothesis would look something like:

$$ h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2 + \theta_5x_1x_2 + \theta_6x_1^3 + \theta_7x_2^3 + \theta_8x_1^2x_2 + \dots )$$

Even with just two original features and including only up to the 3rd degree polynomial, the number of features has grown from 2 to 9, and the growth gets far worse as the number of original features increases.

A lot of problems do require a lot of input features. For example, in computer vision, if you use pixels as input features then even if you restricted yourself to grayscale 50x50 images you would have 2500 original input features. If you wanted to include all the quadratic terms, that gives you roughly 3 million features, already too large to be reasonable.
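
The 3 million figure can be checked by counting the distinct products $x_ix_j$ with $i \le j$. A quick sketch (the function name `count_quadratic_terms` is just an illustrative choice, not something from the course):

```python
def count_quadratic_terms(n: int) -> int:
    """Number of distinct products x_i * x_j with i <= j among n features."""
    return n * (n + 1) // 2

n_pixels = 50 * 50                      # grayscale 50x50 image
print(n_pixels)                         # 2500
print(count_quadratic_terms(n_pixels))  # 3126250
```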

### Neurons and the Brain

Neural networks are biologically inspired learning algorithms, i.e. algorithms that try to mimic the brain. They have seen a recent resurgence because it was only recently that computers became fast enough to run large scale neural networks.

"One learning algorithm" hypothesis:  
The brain uses a single learning algorithm. Evidence: successful neuro-rewiring experiments, e.g. rewiring the auditory cortex so that it's being fed signals from the optic nerve; the auditory cortex then learns to see. The idea is that if a single piece of brain tissue can process sight / sound / touch then maybe there's one learning algorithm that can process sight / sound / touch. If we can figure out what that learning algorithm is and implement it, we'll be sorted.

## Neural Networks

### Model Representation I

Neurons have dendrites, which can be thought of as "input wires", and an axon, which can be thought of as an "output wire". When a neuron receives a bunch of electrical pulses via its dendrites its axon will either fire an electrical pulse (to be received as an input to another neuron's dendrites) or not.

We represent a single neuron as a "logistic unit":

![logistic unit; credit: Andrew Ng Coursera](https://beths3test.s3.amazonaws.com/machine-learning-notes/neuron.png)

In this diagram:

$$ x = \begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
x_3 \\
\end{bmatrix},
\quad
\theta = \begin{bmatrix}
\theta_0 \\
\theta_1 \\
\theta_2 \\
\theta_3 \\
\end{bmatrix}\\
\\
h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}
$$

You can draw the $x_0$ node (the "bias unit/node"), but since it's always 1 it often gets left out.  
As we're using a sigmoid function, this neuron is said to have a sigmoid (logistic) activation function.
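
As a minimal sketch, a single logistic unit could be written like this in plain Python. The names `sigmoid` and `logistic_unit` and the example values are assumptions made for illustration:

```python
import math

def sigmoid(z):
    """Sigmoid (logistic) activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(theta, x):
    """Compute h_theta(x) = g(theta^T x).

    Both lists include the bias entries: x[0] is x_0 = 1, theta[0] is theta_0.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

x = [1.0, 0.5, -1.2, 2.0]       # x_0 = 1 is the bias unit
theta = [0.0, 0.0, 0.0, 0.0]    # illustrative all-zero weights
print(logistic_unit(theta, x))  # 0.5
```

With all-zero weights $\theta^Tx = 0$, so the unit outputs $g(0) = 0.5$ whatever the input.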

A neural network is a load of these logistic units connected together:

![neural network; credit: Andrew Ng Coursera](https://beths3test.s3.amazonaws.com/machine-learning-notes/neural-network.png)

In this diagram, layer 1 is the input layer, layer 3 is the output layer and layer 2 the hidden layer. Again, for each layer (except the output layer) there is a bias unit that isn't always drawn.

$a^{(j)}_i$ is the "activation" of unit $i$ in layer $j$.  
$\Theta^{(j)}$ is the matrix of weights controlling the function mapping from layer $j$ to layer $j + 1$.

$h_\Theta(x)$ in this network gets computed the following way:

First, the activations of the nodes in the hidden layer are calculated:

$$a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\
a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\
a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$$

Then, these results are used to calculate the output:

$$ h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) $$
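
The two steps above can be sketched unit by unit in plain Python. This is only an illustration: the function name `forward_elementwise`, the nested-list layout of the weights and the example values are all assumptions.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_elementwise(Theta1, Theta2, x):
    """Compute h_Theta(x) one unit at a time.

    x      : [x_0, x_1, x_2, x_3] with x_0 = 1 (bias unit)
    Theta1 : 3 rows of 4 weights (layer 1 -> layer 2)
    Theta2 : 1 row of 4 weights (layer 2 -> layer 3)
    """
    # Hidden layer: a_i^(2) = g(sum_k Theta1[i][k] * x[k])
    a2 = [g(sum(w * xk for w, xk in zip(row, x))) for row in Theta1]
    # Prepend the hidden layer's bias unit a_0^(2) = 1
    a2 = [1.0] + a2
    # Output: a_1^(3) = g(sum_k Theta2[0][k] * a2[k])
    return g(sum(w * ak for w, ak in zip(Theta2[0], a2)))

Theta1 = [[0.0] * 4 for _ in range(3)]   # illustrative all-zero weights
Theta2 = [[0.0, 0.0, 0.0, 0.0]]
x = [1.0, 0.2, -0.4, 0.7]
print(forward_elementwise(Theta1, Theta2, x))  # 0.5
```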

If a network has $s_j$ units in layer $j$, $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j + 1} \times (s_j + 1)$.

In our diagram, there are 3 units in layer 1 and 3 units in layer 2, and so $\Theta^{(1)}$ is a $(3\times4)$ matrix. There are 3 units in layer 2 and 1 in layer 3, so $\Theta^{(2)}$ is a $(1\times4)$ matrix.
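
The dimension rule is easy to check mechanically. A small sketch, where `theta_shapes` is a hypothetical helper name:

```python
def theta_shapes(layer_sizes):
    """Shapes of Theta^(1), Theta^(2), ... given units per layer (bias units excluded).

    Theta^(j) has shape s_{j+1} x (s_j + 1); the +1 column is for the bias unit.
    """
    return [(s_next, s_j + 1) for s_j, s_next in zip(layer_sizes, layer_sizes[1:])]

print(theta_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```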

### Model Representation II

Let's try and work out a vectorized implementation of the neural network described above.

Before we described our network as:

$$ a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\
a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\
a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \\
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) $$

Let's write the parameters passed to $g(z)$ like this:

$$ z_1^{(2)} = \Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3 \\
z_2^{(2)} = \Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3 \\
z_3^{(2)} = \Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3 $$

So now we can say that:

$$ a_1^{(2)} = g(z_1^{(2)}) \\
a_2^{(2)} = g(z_2^{(2)}) \\
a_3^{(2)} = g(z_3^{(2)})$$

Let's define the vector $z^{(2)}$ as:

$$ z^{(2)} = \begin{bmatrix}
z_1^{(2)} \\
z_2^{(2)} \\
z_3^{(2)} \\
\end{bmatrix}
$$

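If you look at how the elements of $z^{(2)}$ are defined you can see that $z^{(2)} = \Theta^{(1)}x$: row $i$ of $\Theta^{(1)}$ multiplied by $x$ (which includes $x_0 = 1$) gives $z_i^{(2)}$. So we can calculate the vector $a^{(2)}$ (the second, hidden layer) in just two steps, with $g$ applied element-wise: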

$$ z^{(2)} = \Theta^{(1)}x \\
   a^{(2)} = g(z^{(2)})$$
   
Now we need to add the bias unit $a_0^{(2)} (= 1)$ to the hidden layer so that:

$$ a^{(2)} = \begin{bmatrix}
a_0^{(2)} \\
a_1^{(2)} \\
a_2^{(2)} \\
a_3^{(2)} \\
\end{bmatrix}
$$

And now we can write:

$$z^{(3)} = \Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}$$

As:

$$ z^{(3)} = \Theta^{(2)}a^{(2)} $$

So that the final output of the network is:

$$ h_\Theta(x) = a^{(3)} = g(z^{(3)}) $$

This process is known as *forward propagation* because each layer is computed and the result used to compute the next layer until we arrive at the output.
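
A minimal NumPy sketch of vectorized forward propagation, under some assumptions: the function name `forward_propagate`, the random example weights, and the choice to pass `x` without its bias entry (the function prepends the bias unit at every layer) are all illustrative.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(thetas, x):
    """Compute h_Theta(x) layer by layer.

    thetas : list of weight matrices [Theta^(1), Theta^(2), ...]
    x      : input features WITHOUT the bias entry x_0
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add bias unit a_0 = 1
        z = theta @ a                   # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                  # a^(j+1) = g(z^(j+1))
    return a

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # layer 1 -> layer 2
Theta2 = rng.normal(size=(1, 4))  # layer 2 -> layer 3
x = np.array([0.2, -0.4, 0.7])

h = forward_propagate([Theta1, Theta2], x)
print(h.shape)  # (1,)
```

Looping over a list of weight matrices means the same function works for networks with more hidden layers, as long as each matrix has the $s_{j + 1} \times (s_j + 1)$ shape.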

This is cool because instead of being constrained to use the raw features $x$, this network gets to learn the features $a^{(2)}$, and use those as inputs to logistic regression. This might give us a better hypothesis than just using the raw features would.