## Exploration of Gradient Descent and Neural Networks

### Brendan Schlaman

The purpose of this notebook is to incrementally translate the mathematics of
gradient descent into code in the context of a simple neural network.

We will start with the simplest possible neural network, a 1-wide ($m=1$), single-layer ($L = 1$) network, and build up from there.

### Notation

The following conventions will be used:

| Variable or symbol | Definition |
|---|---|
| $\mathbf{w}$ | A single weight matrix that feeds into a hidden layer. $\mathbf{w}$ is an $(m \times n)$ matrix, where $n$ is the width (size) of the inputs, and $m$ is the size of the outputs (the width of the hidden layer). |
| `w1` | When using a fixed number of layers *L*, `w1` will be the weights feeding into the first hidden layer (or the output layer).  `wn` in the code corresponds to $w^{(n)}$ in the math notation. |
| $\mathbf{b}$ | A single bias matrix that contributes to a hidden layer. $\mathbf{b}$ is an $(m \times 1)$ matrix, where $m$ is the size (width) of the associated hidden layer. |
| `b1` | When using a fixed number of layers *L*, `b1` will be the biases feeding into the first hidden layer (or the output layer). `bn` in the code corresponds to $b^{(n)}$ in the math notation. |
| `z` | A preactivated layer.  It is the result of the linear combination (weights times inputs plus bias) before an activation function is applied. |
| `a` | An activated layer, i.e. $\sigma (z)$. |
| $L$ | The depth of the neural network; the number of hidden layers plus the output layer. |
| $C_k$ | The cost of the neural network for training example $k$. |


\*Note that when using a fixed number of layers *L > 1*, the *input layer* is treated as the “0th” layer and does not have weights or biases.
Think of each layer as owning the weights and biases that feed into it.

### Key equations

$z^{(L)} = w^{(L)}a^{(L-1)} + b^{(L)}$

$a^{(L)} = \sigma (z^{(L)})$

We'll use the squared error loss function:

$C_0(\dots) = (a^{(L)} - y)^2$
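The three equations above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's final implementation: it assumes $\sigma$ is the logistic sigmoid (the text does not fix a particular activation), and the helper names `sigmoid`, `forward`, and `cost` as well as the sample values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, a_prev):
    """One layer: z = w a_prev + b, then a = sigma(z)."""
    z = w @ a_prev + b
    a = sigmoid(z)
    return z, a

def cost(a, y):
    """Squared error for a single training example."""
    return np.sum((a - y) ** 2)

# Hypothetical values for a 1-input, 1-output network
w = np.array([[0.5]])   # (m x n) = (1 x 1) weight matrix
b = np.array([[0.0]])   # (m x 1) bias
x = np.array([[2.0]])   # the input plays the role of a^(L-1)
y = np.array([[1.0]])   # target

z, a = forward(w, b, x)
c = cost(a, y)
```

Keeping `z` alongside `a` is a deliberate choice: the derivative $\sigma'(z^{(L)})$ needed for gradient descent is evaluated at the pre-activation value.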

For a single layer, 1-wide network (with squared error loss):

$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \sigma'(z^{(L)}) 2(a^{(L)} - y)$
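A quick way to gain confidence in this chain-rule expression is to compare it against a finite-difference estimate. The sketch below assumes the logistic sigmoid for $\sigma$, for which $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; the function names and the sample values of `w`, `b`, `x`, and `y` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the logistic sigmoid
    s = sigmoid(z)
    return s * (1.0 - s)

def cost_single(w, b, x, y):
    # C_0 = (a - y)^2 for a 1-wide, single-layer network
    return (sigmoid(w * x + b) - y) ** 2

def grad_w_single(w, b, x, y):
    # a^(L-1) * sigma'(z) * 2 (a - y), with a^(L-1) = x
    z = w * x + b
    a = sigmoid(z)
    return x * sigmoid_prime(z) * 2.0 * (a - y)

w, b, x, y = 0.5, 0.1, 2.0, 1.0   # hypothetical scalars
analytic = grad_w_single(w, b, x, y)

# Central finite difference in w as an independent check
eps = 1e-6
numeric = (cost_single(w + eps, b, x, y) - cost_single(w - eps, b, x, y)) / (2 * eps)
```

If the derivation is right, `analytic` and `numeric` should agree to several decimal places.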

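Averaged over all $n$ training examples $x_k$ (here $n$ counts examples, not the input width used in the notation table), the partial derivative of the total cost is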

$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}$
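For the 1-wide, single-layer case, this average can be computed in one vectorized pass over the training set. The sketch below again assumes a logistic sigmoid activation; `mean_grad_w` and the small data set are hypothetical and only meant to illustrate the averaging.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_grad_w(w, b, xs, ys):
    """Average of dC_k/dw over all examples (scalar w and b)."""
    z = w * xs + b                      # pre-activations, one per example
    a = sigmoid(z)                      # activations, one per example
    per_example = xs * a * (1.0 - a) * 2.0 * (a - ys)
    return per_example.mean()           # (1/n) * sum over k

# Hypothetical training set of four scalar examples
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 0.0, 1.0, 1.0])

g = mean_grad_w(0.5, 0.0, xs, ys)
```

Each entry of `per_example` is the single-example gradient $\partial C_k / \partial w^{(L)}$, so taking the mean is exactly the $\frac{1}{n}\sum_k$ in the formula.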

The classic softmax equation:

$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$
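A minimal NumPy sketch of this formula is shown below. Subtracting the maximum of $z$ before exponentiating is a common numerical-stability trick that is not part of the equation itself; it leaves the result unchanged because the shift cancels between numerator and denominator. The input vector is a hypothetical example.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector z."""
    shifted = z - np.max(z)   # avoids overflow in exp for large z
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([1.0, 2.0, 3.0]))
```

The entries of `probs` are positive and sum to 1, with the largest input receiving the largest share.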


### TODO: !Important!  Add a section describing how NN depth + width and training data size are reflected in the dimensionality of my data structures

What is `w[0]`, `w[0][0]`, `b[0]`, `z[0]`?

### 1-Wide Neural Network (L = 1)
#### Two layered network

Start with the simplest NN: a single input node connected to a single output node.


In [None]:
import numpy as np

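def init_params():
    # Small random weight for the single input -> output connection
    w1 = np.random.randn(1) * 0.01
    # Random bias for the output node (standard normal, unscaled)
    b1 = np.random.randn(1)
    return w1, b1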

init_params()
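As a rough sketch of where this is heading, the parameters can be fed through a few gradient descent steps on the 1-input, 1-output network. Everything beyond `init_params` is an assumption made for illustration: the logistic sigmoid activation, the `step` and `mean_cost` helpers, the learning rate, the toy data, and the seeded generator (used instead of `np.random.randn` only so the run is repeatable). The bias gradient uses the same chain rule as the weight gradient, without the $a^{(L-1)}$ factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng):
    # Same shapes and scaling as the cell above, drawn from a seeded generator
    w1 = rng.standard_normal(1) * 0.01
    b1 = rng.standard_normal(1)
    return w1, b1

def mean_cost(w1, b1, xs, ys):
    # Squared error averaged over all examples
    return np.mean((sigmoid(w1 * xs + b1) - ys) ** 2)

def step(w1, b1, xs, ys, lr):
    """One gradient descent update using gradients averaged over the examples."""
    z = w1 * xs + b1
    a = sigmoid(z)
    delta = 2.0 * (a - ys) * a * (1.0 - a)   # dC_k/dz for each example
    w1 = w1 - lr * np.mean(delta * xs)       # dz/dw = x
    b1 = b1 - lr * np.mean(delta)            # dz/db = 1
    return w1, b1

rng = np.random.default_rng(0)
xs = np.array([-2.0, -1.0, 1.0, 2.0])   # hypothetical inputs
ys = np.array([0.0, 0.0, 1.0, 1.0])     # hypothetical targets

w1, b1 = init_params(rng)
cost_before = mean_cost(w1, b1, xs, ys)
for _ in range(200):
    w1, b1 = step(w1, b1, xs, ys, lr=0.5)
cost_after = mean_cost(w1, b1, xs, ys)
```

Comparing `cost_before` with `cost_after` gives a simple sanity check that the updates move the parameters in a direction that lowers the cost.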
