## Exploration of Gradient Descent and Neural Networks

### Brendan Schlaman

**Dependencies:**
- Jupyter (with MathJax support for LaTeX)
- `numpy`

The purpose of this notebook is to incrementally translate the mathematics of
gradient descent into code in the context of a simple neural network.

We will use only basic numerical libraries such as `numpy` to build a more fundamental intuition for the concepts, which
purpose-built libraries like PyTorch or TensorFlow abstract away.

We will start with the simplest possible neural network, a 1-wide ($m=1$), single-layer ($L = 1$) network, and build up from there.

### Notation

The following conventions will be used:

| Variable or symbol | Definition |
|---|---|
| $\mathbf{w}$ | A single weight matrix that feeds into a hidden layer.  $\mathbf{w}$ is an $(m \times n)$ matrix, where $n$ is the width (size) of the inputs, and $m$ is the size of the outputs (the width of the hidden layer). |
| `w1` | When using a fixed number of layers *L*, `w1` will be the weights feeding into the first hidden layer (or the output layer).  `wn` in the code corresponds to $w^{(n)}$ in the math notation. |
| $\mathbf{b}$ | A single bias vector that contributes to a hidden layer.  $\mathbf{b}$ is an $(m \times 1)$ matrix (a column vector), where $m$ is the size (width) of the associated hidden layer.  |
| `b1` | When using a fixed number of layers *L*, `b1` will be the biases feeding into the first hidden layer (or the output layer).  `bn` in the code corresponds to $b^{(n)}$ in the math notation. |
| `z` | A preactivated layer.  It is the result of the linear combination (weights times inputs plus bias) before an activation function is applied. |
| `a` | An activated layer, i.e. $\sigma (z)$. |
| $L$ | The depth of the neural network; the number of hidden layers plus the output layer. |
| $C_k$ | The cost of the neural network for training example $k$. |


> Note that when using a fixed number of layers *L > 1*, the *input layer* is treated as the “0th” layer and does not have weights or biases.
> Think of this as each layer owning the weights and biases that feed into it.

### Key equations

#### Neural network definitions

$\mathbf{z}^{(L)} = \mathbf{w}^{(L)}\mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}$

$\mathbf{a}^{(L)} = \sigma (\mathbf{z}^{(L)})$

> Going forward, I'll drop the boldface notation for vectors and matrices.  $w$ is always a matrix, and $b$, $z$, and $a$ are always vectors.
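The two definitions above translate almost directly into `numpy`. The following is a minimal sketch rather than the notebook's eventual implementation: the helper name `layer_forward`, the example shapes, and the use of a logistic sigmoid as the activation are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid, used here only as a stand-in activation (an assumption).
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(w, b, a_prev, activation=sigmoid):
    """Hypothetical helper: compute z = w a_prev + b, then a = activation(z)."""
    z = w @ a_prev + b      # (m, n) @ (n, 1) + (m, 1) -> (m, 1)
    a = activation(z)       # applied elementwise for the sigmoid
    return z, a

# Example: n = 3 inputs feeding a layer of width m = 2.
w = np.array([[0.1, -0.2, 0.3],
              [0.0, 0.5, -0.5]])            # shape (m, n)
b = np.zeros((2, 1))                        # shape (m, 1)
a_prev = np.array([[1.0], [2.0], [3.0]])    # shape (n, 1)

z, a = layer_forward(w, b, a_prev)
print(z.shape, a.shape)  # (2, 1) (2, 1)
```

Keeping `z` alongside `a` is deliberate: the preactivation is needed again when computing $\sigma'(z)$ during backpropagation.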

#### Nonlinear activation functions

The standard softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K; \; K \geq 1$:

$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \text{ for } i = 1, \dots, K \text{ and } \mathbf{z} = (z_1, \dots, z_K) \in \mathbb{R}^K$
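As a rough illustration, the softmax formula can be written in a few lines of `numpy`. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio. The function name `softmax` and the sample vector are assumptions for this sketch.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector z (a sketch of the formula above)."""
    shifted = z - np.max(z)     # does not change the result; avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p, p.sum())
```

Every entry of `p` is positive and the entries sum to 1, so the output can be read as a probability distribution over the $K$ components.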

#### Computing loss

We'll use the **squared error loss function**:

$C_0(\dots) = (a^{(L)} - y)^2$

For a single-layer, 1-wide network (with squared error loss), the chain rule gives the derivative of the cost with respect to the weight:

$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \sigma'(z^{(L)}) 2(a^{(L)} - y)$
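One way to build confidence in the chain-rule expression is to compare it against a finite-difference estimate. The sketch below assumes a logistic sigmoid for $\sigma$ (a scalar softmax would be constant, so it is not a useful choice for a 1-wide layer); the function names and the sample values of `w`, `b`, `a_prev`, and `y` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation for this sketch.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev, y):
    # C_0 = (a - y)^2 with a = sigma(w * a_prev + b)
    return (sigmoid(w * a_prev + b) - y) ** 2

def dcost_dw(w, b, a_prev, y):
    # Chain rule: dz/dw * da/dz * dC/da
    z = w * a_prev + b
    a = sigmoid(z)
    return a_prev * sigmoid_prime(z) * 2.0 * (a - y)

w, b, a_prev, y = 0.5, -0.1, 1.5, 1.0
analytic = dcost_dw(w, b, a_prev, y)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree to several decimal places; a mismatch usually points to a sign or factor error in the hand-derived gradient.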

The full gradient of the cost function over the entire network, $\nabla C$, comprises the partial derivatives with respect to every weight and bias across all layers.

$
\newcommand{\arraystretch}{1.5}
\nabla C = 
    \begin{bmatrix}
    \frac{\partial C}{\partial w^{(1)}} \\
    \frac{\partial C}{\partial b^{(1)}} \\
    \vdots \\
    \frac{\partial C}{\partial w^{(L)}} \\
    \frac{\partial C}{\partial b^{(L)}}
    \end{bmatrix}
$

Averaged over all $n$ training examples $x_k$ (here $n$ counts examples, not the input width), the derivative of the total cost is

$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}$
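The averaging step can be sketched either as an explicit loop over examples or as a single vectorized `numpy` expression. Both versions below assume the 1-wide, single-layer network with a logistic sigmoid activation and squared error loss; the tiny dataset and the function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation for this sketch.
    return 1.0 / (1.0 + np.exp(-z))

def dcost_dw_single(w, b, x, y):
    # Per-example derivative dC_k/dw for the 1-wide, single-layer network.
    a = sigmoid(w * x + b)
    return x * a * (1.0 - a) * 2.0 * (a - y)

def dcost_dw_mean(w, b, xs, ys):
    # (1/n) * sum over k of dC_k/dw
    grads = [dcost_dw_single(w, b, x, y) for x, y in zip(xs, ys)]
    return sum(grads) / len(grads)

xs = np.array([0.0, 1.0, 2.0])   # hypothetical inputs
ys = np.array([0.0, 1.0, 1.0])   # hypothetical targets
w, b = 0.5, 0.0

loop_grad = dcost_dw_mean(w, b, xs, ys)

# Vectorized equivalent: the same formula applied to all examples at once.
a = sigmoid(w * xs + b)
vec_grad = np.mean(xs * a * (1.0 - a) * 2.0 * (a - ys))

print(loop_grad, vec_grad)
```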


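Cross-entropy loss for a single example with $K$ classes, where $y$ is the target distribution and $\hat{y}$ is the predicted distribution:

$C = -\sum_{j=1}^{K} y_j \log(\hat{y}_j)$

Averaged over $N$ examples:
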
$C = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\hat{y}_{ij})$
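Both cross-entropy formulas can be sketched in `numpy` as follows. The clipping constant `eps` is an added safeguard against $\log(0)$ and is not part of the formulas; the function names and the one-hot example targets are assumptions for illustration.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Single-example cross entropy: -sum_j y_j * log(y_hat_j)."""
    y_hat = np.clip(y_hat, eps, 1.0)   # guard against log(0); not in the formula
    return -np.sum(y * np.log(y_hat))

def mean_cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross entropy averaged over N examples; rows are examples, columns are classes."""
    Y_hat = np.clip(Y_hat, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

# Hypothetical one-hot targets and predicted distributions for K = 3 classes.
Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
Y_hat = np.array([[0.2, 0.7, 0.1],
                  [0.5, 0.3, 0.2]])

single = cross_entropy(Y[0], Y_hat[0])
batch = mean_cross_entropy(Y, Y_hat)
print(single, batch)
```

With one-hot targets, only the predicted probability of the correct class contributes to each example's loss.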

### 1-Wide Neural Network (L = 1)
#### Two-layer network

Start with the simplest neural network: a single input node connected to a single output node.


In [None]:
import numpy as np

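def init_params():
    """Return a randomly initialized weight and bias for the 1-wide, single-layer network."""
    # Small random weight; shape (1,) because there is one input and one output node.
    w1 = np.random.randn(1) * 0.01
    # Bias drawn from a standard normal distribution; shape (1,).
    b1 = np.random.randn(1)
    return w1, b1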

init_params()