# Neural Networks and Deep Learning Notes

Notes and equations from [neuralnetworksanddeeplearning.com](http://neuralnetworksanddeeplearning.com/)

# Chapter 1

## Perceptrons

The **simple perceptron** computes its output by comparing a weighted sum of its inputs against a threshold:

$
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
      \end{array} \right.
\tag{1}\end{eqnarray}
$

The notation can be adjusted to use a dot product and a **bias** term instead of the threshold. The bias controls how easy or difficult it is to get the perceptron to output a 1.

$
\begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\tag{2}\end{eqnarray}
$

The perceptron can implement a NAND gate:

![](http://neuralnetworksanddeeplearning.com/images/tikz2.png)

$
\\ 00 = -2 * 0 + -2 * 0 + 3 = 3 \rightarrow 1
\\ 01 = -2 * 0 + -2 * 1 + 3  = 1 \rightarrow 1
\\ 10 = -2 * 1 + -2 * 0 + 3  = 1 \rightarrow 1
\\ 11 = -2 * 1 + -2 * 1 + 3  = -1 \rightarrow 0
$
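The arithmetic above can be checked with a minimal sketch of equation (2). The function names `perceptron` and `nand` are illustrative choices, not from the book; the weights (-2, -2) and bias 3 are the ones used in the table above.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0 (equation 2)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


def nand(x1, x2):
    """NAND gate as a perceptron with weights (-2, -2) and bias 3."""
    return perceptron([-2, -2], [x1, x2], 3)


# Build the full truth table for the two binary inputs.
nand_table = {(x1, x2): nand(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for bits, output in nand_table.items():
    print(bits, "->", output)
```

Only the input (1, 1) drives the weighted sum below zero, so it is the only input that produces a 0.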

The perceptron can implement a two bit adder. All weights are -2 except for the one marked -4 and all biases are 3:

![](http://neuralnetworksanddeeplearning.com/images/tikz6.png)
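A rough sketch of this adder follows, assuming the figure uses the standard NAND-gate wiring for adding two bits (the wiring is an assumption here, since it is only visible in the image). The name `add_bits` and the intermediate names `a`, `b`, `c` are hypothetical labels for the hidden perceptrons.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def add_bits(x1, x2):
    """Add two bits with NAND perceptrons; returns (sum bit, carry bit)."""
    a = perceptron([-2, -2], [x1, x2], 3)      # NAND(x1, x2)
    b = perceptron([-2, -2], [x1, a], 3)       # NAND(x1, a)
    c = perceptron([-2, -2], [x2, a], 3)       # NAND(x2, a)
    bit_sum = perceptron([-2, -2], [b, c], 3)  # x1 XOR x2
    # The single connection with weight -4 stands in for feeding `a`
    # into both inputs of a NAND gate, which computes NOT a.
    carry = perceptron([-4], [a], 3)           # x1 AND x2
    return bit_sum, carry


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, "+", x2, "->", add_bits(x1, x2))
```

The weight of -4 is just two weights of -2 from the same source merged into one connection.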

## Sigmoid Neurons

Perceptrons are very sensitive: a small change in a weight, bias, or input can flip the output from 0 to 1, which makes it difficult to apply learning algorithms to them. Sigmoid neurons alleviate this because the sigmoid function is smooth, so small changes in the weights and biases produce small changes in the output. A sigmoid neuron is effectively a smoothed-out perceptron.

The **sigmoid function**:

$\begin{eqnarray} 
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}$

The output of a sigmoid neuron with weights $w_j$, inputs $x_j$, and bias $b$:

$\begin{eqnarray} 
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\tag{4}\end{eqnarray}$

Behavior: when $w \cdot x + b$ is large and positive, the sigmoid output approaches 1; when it is large and negative, the output approaches 0. The outputs are continuous values between 0 and 1, so you need some heuristic or convention to interpret them. For example, you might say that any value greater than or equal to 0.5 indicates a "yes", and any value less than 0.5 indicates a "no".
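Equations (3) and (4) and the 0.5 convention can be sketched as follows. The helper names `sigmoid_neuron` and `interpret` are illustrative, and the cutoff of 0.5 is the convention suggested above rather than a fixed rule.

```python
import math


def sigmoid(z):
    """Sigmoid function from equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, inputs, bias):
    """Output of a sigmoid neuron, equation (4)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def interpret(activation, cutoff=0.5):
    """Map a continuous activation to a yes/no answer (assumed convention)."""
    return "yes" if activation >= cutoff else "no"


# Same weights and bias as the NAND perceptron, evaluated on input (1, 1).
activation = sigmoid_neuron([-2, -2], [1, 1], 3)
print(activation, interpret(activation))
```

With the NAND weights, the input (1, 1) gives $z = -1$, so the activation falls below 0.5 and is read as a "no", matching the perceptron's output of 0.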

## Exercises

1) Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a constant, $c > 0$. Show that the behavior of the network doesn't change.

The perceptron's output is determined by the sign of the weighted sum of the inputs plus the bias, i.e. whether $w \cdot x + b > 0$. Multiplying both $w$ and $b$ by a constant $c > 0$ gives the condition $cw \cdot x + cb = c(w \cdot x + b) > 0$. Because $c$ is positive, dividing both sides by $c$ does not flip the inequality, so the two conditions are equivalent:

$
cw \cdot x + cb > 0
\\ c(w \cdot x + b) > 0
\\ w \cdot x + b > \frac{0}{c}
\\ w \cdot x + b > 0
$

The same argument applies to the $\leq 0$ case, so every perceptron's output, and therefore the behavior of the whole network, is unchanged.
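As a quick numerical sanity check of this argument (not a proof), the sketch below scales the NAND perceptron's weights and bias by a few constants and compares outputs on every binary input. The helper `scaled_outputs_match` and the particular values of `c` are arbitrary illustrative choices.

```python
from itertools import product


def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def scaled_outputs_match(weights, bias, c):
    """Check that scaling weights and bias by c leaves every output unchanged."""
    scaled_weights = [c * w for w in weights]
    scaled_bias = c * bias
    return all(
        perceptron(weights, x, bias) == perceptron(scaled_weights, x, scaled_bias)
        for x in product((0, 1), repeat=len(weights))
    )


results = {c: scaled_outputs_match([-2, -2], 3, c) for c in (0.01, 1, 7.5, 1000)}
print(results)
```

Trying a negative constant, for example `scaled_outputs_match([-2, -2], 3, -1)`, returns `False`, which shows why the exercise requires $c > 0$.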

2) Given a network of perceptrons and a fixed input to the network. The weights and biases are such that $w \cdot x + b \neq 0$ for input $x$ to any perceptron in the network. Replace all perceptrons by sigmoid neurons. Multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$, the behavior of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

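The goal is to show that as $c \rightarrow \infty$, the sigmoid output $\frac{1}{1 + e ^ {-c(w \cdot x + b)}}$ matches the perceptron's output for the same input. Write $z = w \cdot x + b$. If $z > 0$, then $e^{-cz} \rightarrow 0$ and the output approaches 1. If $z < 0$, then $e^{-cz} \rightarrow \infty$ and the output approaches 0. These are exactly the perceptron's outputs, and because $z \neq 0$ for every perceptron, the argument carries through the network layer by layer. If $z = 0$ for some neuron, then $\sigma(cz) = \sigma(0) = 0.5$ for every $c$, while the perceptron outputs 0, so the sigmoid output never converges to the perceptron's output and the equivalence can fail.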
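One way to explore this exercise numerically is to evaluate $\sigma(cz)$ for growing $c$ at a positive, a negative, and a zero value of $z = w \cdot x + b$. This is only an illustration under assumed values ($z = \pm 0.5$ and the scales listed in the code), not a proof.

```python
import math


def sigmoid(z):
    """Sigmoid function from equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron_output(z):
    """Perceptron output for a given value of z = w . x + b."""
    return 1 if z > 0 else 0


def scaled_outputs(z, scales=(1, 10, 100, 1000)):
    """Sigmoid outputs after multiplying weights and bias by each scale c."""
    return [sigmoid(c * z) for c in scales]


for z in (0.5, -0.5, 0.0):
    print("z =", z, "perceptron:", perceptron_output(z), "sigmoid:", scaled_outputs(z))
```

For $z = 0.5$ the values climb towards 1 and for $z = -0.5$ they fall towards 0, matching the perceptron. For $z = 0$ every value is exactly 0.5, no matter how large $c$ gets.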

## The Architecture of Neural Networks

Terminology

- Multilayer perceptrons are actually made up of sigmoid neurons (there can be other activation functions used as well).

## A simple network to classify handwritten digits

- Images will be 28 by 28, meaning the input layer has 28 x 28 = 784 neurons.
- There are ten output neurons - each will have an activation value between 0 and 1. The digit gets classified by taking the neuron with the highest activation value.
- You could also do a bit-wise output representation. This would have four neurons, each of which can represent 0 or 1. This would work because all digits from 0 to 9 can be represented with four binary bits (e.g. 0 = 0000, 4 = 0100, 9 = 1001). It turns out the 10-neuron representation performs better.
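The two output conventions can be sketched as decoding functions. The names are hypothetical, and the bit-wise decoder assumes a 0.5 cutoff and most-significant-bit-first ordering (as in 4 = 0100); neither assumption comes from the book.

```python
def classify_one_hot(activations):
    """Return the index of the output neuron with the highest activation."""
    return max(range(len(activations)), key=lambda i: activations[i])


def classify_bitwise(activations, cutoff=0.5):
    """Threshold four activations into bits and read them as a binary number."""
    bits = [1 if a >= cutoff else 0 for a in activations]
    return int("".join(str(bit) for bit in bits), 2)


ten_neuron_output = [0.01, 0.02, 0.01, 0.03, 0.99, 0.01, 0.02, 0.01, 0.01, 0.02]
four_neuron_output = [0.02, 0.97, 0.01, 0.03]

print(classify_one_hot(ten_neuron_output))
print(classify_bitwise(four_neuron_output))
```

Both example outputs decode to the digit 4.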

## Exercise

Suppose you had the 10-neuron output layer but wanted to downsample it to a 4-neuron (bit-wise) output layer. Find a set of weights that makes this possible. Assume that correct outputs in the 3rd layer (old output layer) have activation $\geq 0.99$ and incorrect outputs have activation $< 0.01$.

![](http://neuralnetworksanddeeplearning.com/images/tikz13.png)

We can consider this to be a linear system of equations, $Aw = B$, where $A$ is the activation values from the old output layer, $w$ is the matrix of weights we are looking for, and $B$ is the activation of the bit-wise encoded new output layer.

This allows us to simply populate the matrices $A$ and $B$ and solve for the weight matrix $w$.

$A$ and $B$ will look like this:

$A = \{\{0.99, 0.001, \ldots\}, \{0.001, 0.99, 0.001, \ldots\}, \{0.001, 0.001, 0.99, 0.001, \ldots\}, \ldots\}$

$B = \{\{0, 0, 0, 0\}, \{0, 0, 0, 1\}, \ldots\}$

See the code below for the computed solution to this system of equations.

```python
import numpy as np

A = np.full((10,10), 0.001)
B = np.full((10,4), 0)

# Fill in each digit with a 0.99 in A.
# Fill in the binary representation of each digit in B.
for i in range(0,10):
    A[i][i] = 0.99
    B[i] = [int(b) for b in list('{0:04b}'.format(i))]

# Solve for w in Aw = B. This is the weight matrix.
w = np.linalg.solve(A, B)

# Print results.
print('Aw = b, A = :\n', A)
np.set_printoptions(precision=6)
print('Aw = b, w = :\n', w)
np.set_printoptions(precision=1)
np.set_printoptions(suppress=True)
print('Aw = b, B = :\n', np.dot(A,w))
```

```text
Aw = b, A = :
 [[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]]
Aw = b, w = :
 [[-0.002024 -0.004049 -0.004049 -0.005061]
 [-0.002024 -0.004049 -0.004049  1.006062]
 [-0.002024 -0.004049  1.007074 -0.005061]
 [-0.002024 -0.004049  1.007074  1.006062]
 [-0.002024  1.007074 -0.004049 -0.005061]
 [-0.002024  1.007074 -0.004049  1.006062]
 [-0.002024  1.007074  1.007074 -0.005061]
 [-0.002024  1.007074  1.007074  1.006062]
 [ 1.009098 -0.004049 -0.004049 -0.005061]
 [ 1.009098 -0.004049 -0.004049  1.006062]]
Aw = b, B = :
 [[-0.  0.  0.  0.]
 [-0. -0. -0.  1.]
 [-0. -0.  1. -0.]
 [-0. -0.  1.  1.]
 [-0.  1. -0. -0.
```



## Learning with Gradient Descent

**Notation**

- $X$ denotes the matrix of all training input vectors.
- $x$ denotes a single 28 x 28 training input as a $784 \times 1$ vector.
- $w$ denotes the collection of all weights in the network.
- $b$ denotes the collection of all biases.
- $n$ is the total number of training inputs.
- $a$ is the vector of outputs from the network given a single input $x$.
- $y(x)$ is the correct vector of outputs for a single input $x$.
- $\eta$ is the learning rate.

**Quadratic cost function (aka mean squared error)**

$C(w,b) = \frac{1}{2n} \sum_{x \in X} \| y(x) - a \|^2$

- Minimize this measure by adjusting $w$ and $b$ using gradient descent.
- We use this proxy measure because classification accuracy is not a smooth function of the weights and biases: small changes to $w$ and $b$ usually leave the accuracy unchanged, which makes it difficult to optimize directly.
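A minimal sketch of the quadratic cost, assuming the network outputs $a$ and the targets $y(x)$ are given as plain lists of vectors (the function name and argument layout are illustrative choices):

```python
def quadratic_cost(outputs, targets):
    """Quadratic cost: (1 / 2n) * sum over inputs of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        # Squared Euclidean distance between the target and the output vector.
        total += sum((y_k - a_k) ** 2 for y_k, a_k in zip(y, a))
    return total / (2 * n)


outputs = [[0.5, 0.5], [0.0, 1.0]]   # network outputs a for two inputs
targets = [[1.0, 0.0], [0.0, 1.0]]   # desired outputs y(x)
print(quadratic_cost(outputs, targets))
```

The cost is zero only when every output vector equals its target, and it changes smoothly as the outputs move, which is what makes it suitable for gradient descent.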

**Gradient Descent Explained Mathematically for a Two-variable Cost Function**

- Goal: Minimize the cost function $C(v_1, v_2)$.

- Strategy: Iteratively adjust $v_1$ and $v_2$ in order to reach a global minimum for $C$.

When $(v_1, v_2)$ changes by a small amount $(\Delta v_1, \Delta v_2)$, $C$ changes by an amount $\Delta C$, which to first order is:

$\Delta C = \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2$

It's important to understand that $\frac{\partial C}{\partial v_1} \Delta v_1$ really means the rate at which $C$ changes with respect to $v_1$ (that is, $\frac{\partial C}{\partial v_1}$) times the amount by which $v_1$ changes (that is, $\Delta v_1$). This is analogous to saying "distance = velocity x duration". $\Delta C$ is just the sum of these contributions from $v_1$ and $v_2$.
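To see how well this expression tracks the real change, the sketch below compares it against the actual change in a hypothetical cost $C(v_1, v_2) = v_1^2 + 2 v_2^2$. The function, the starting point, and the step are all arbitrary choices made for illustration.

```python
def cost(v1, v2):
    """Hypothetical two-variable cost: C = v1^2 + 2 * v2^2."""
    return v1 ** 2 + 2 * v2 ** 2


def gradient(v1, v2):
    """Analytic partial derivatives (dC/dv1, dC/dv2) of the cost above."""
    return (2 * v1, 4 * v2)


v = (1.0, 1.0)
delta_v = (0.01, -0.02)

# Predicted change: (dC/dv1) * delta_v1 + (dC/dv2) * delta_v2.
grad = gradient(*v)
predicted_delta_c = grad[0] * delta_v[0] + grad[1] * delta_v[1]

# Actual change: evaluate the cost before and after the step.
actual_delta_c = cost(v[0] + delta_v[0], v[1] + delta_v[1]) - cost(*v)

print("predicted:", predicted_delta_c)
print("actual:   ", actual_delta_c)
```

The two numbers agree closely for a small step, and the gap grows as the step gets larger, which is why the expression should be trusted only for small changes.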

- Checkpoint: How does $\Delta C$ affect $C$? 

$C \rightarrow C' = C + \Delta C$.

So if we can find values for $(\Delta v_1, \Delta v_2)$ such that $\Delta C$ is negative, then we can iteratively minimize $C$.

- Checkpoint: How do we find values for $(\Delta v_1, \Delta v_2)$ to make $\Delta C$ negative?

First we define the "gradient" of $C$ as:

$\nabla C = (\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2})^T$

and the change in $v$ is defined as:

$\Delta v = (\Delta v_1, \Delta v_2)^T$

Remembering that $\frac{\partial C}{\partial v_1} \Delta v_1$ is the product of two separate factors, $\frac{\partial C}{\partial v_1}$ and $\Delta v_1$, we can rewrite $\Delta C$ in a more convenient way:

$
\Delta C = ((\frac{\partial C}{\partial v_1})(\Delta v_1) + (\frac{\partial C}{\partial v_2})(\Delta v_2))
\\ \\ \Delta C = (\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2})^T \cdot (\Delta v_1, \Delta v_2)^T
\\ \\ \Delta C = \nabla C \cdot \Delta v
$

So we've re-written $\Delta C$ as the dot product of $\nabla C$ and $\Delta v$.

- Checkpoint: How do we choose $\Delta v$ to make $\Delta C$ negative?

We can compute $\nabla C$, and $\eta$ is a small positive learning rate that we choose ourselves. So for convenience, let's choose:

$\Delta v = -\eta \nabla C$.

This lets us re-write $\Delta C$ as:

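$\Delta C = \nabla C \cdot (-\eta \nabla C) = -\eta \|\nabla C\|^2$

Because $\eta > 0$ and $\|\nabla C\|^2 \geq 0$, we have $-\eta \|\nabla C\|^2 \leq 0$, so $\Delta C$ is never positive, and it is strictly negative whenever the gradient is nonzero. That is what we wanted in the first place.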

- Checkpoint: How do we update $v$ to "move" towards minimizing $C$?

We've determined that choosing $\Delta v = -\eta \nabla C$ makes $\Delta C$ negative, so we update $v$ as follows:

$v \rightarrow v' = v + \Delta v = v - \eta \nabla C$

Again, as long as $\eta$ is small enough for the linear expression for $\Delta C$ above to be a good approximation, each update decreases $C$.
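The update rule can be sketched as a short loop. The cost $C(v_1, v_2) = v_1^2 + 2 v_2^2$, the starting point, the learning rate $\eta = 0.1$, and the step count are all assumptions picked for illustration; they are not taken from the book.

```python
def cost(v):
    """Hypothetical two-variable cost: C = v1^2 + 2 * v2^2."""
    return v[0] ** 2 + 2 * v[1] ** 2


def gradient(v):
    """Analytic gradient (dC/dv1, dC/dv2) of the cost above."""
    return (2 * v[0], 4 * v[1])


def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C, recording the cost each step."""
    costs = [cost(v)]
    for _ in range(steps):
        grad = gradient(v)
        v = (v[0] - eta * grad[0], v[1] - eta * grad[1])
        costs.append(cost(v))
    return v, costs


final_v, costs = gradient_descent((3.0, -2.0), eta=0.1, steps=100)
print("final v:", final_v)
print("first cost:", costs[0], "last cost:", costs[-1])
```

For this cost and learning rate the recorded cost shrinks at every step and `final_v` ends up very close to the minimum at (0, 0). A much larger $\eta$ can overshoot the minimum, so the decrease is not guaranteed for arbitrary step sizes.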