# A Simple Example
---
## Network
We are going to start off with a very simple example: a neural network with 2 hidden layers and 1 node in each layer (input, hidden layer 1, hidden layer 2, and the output layer). To simplify things further, the bias terms are not shown in the graphic, every layer uses the same activation function, and we will be dealing with one sample at a time.

<img src="imgs/n1/network1.jpg" alt="nn" width="600"/>

The forward propagation of this network looks like this:

\begin{align}
\large
x \xrightarrow[w_1x + b_1]{\text{linear}} \bar{h}_1 \xrightarrow[f(\bar{h}_1)]{\text{activation}} h_1 \xrightarrow[w_2h_1 + b_2]{\text{linear}} \bar{h}_2 \xrightarrow[f(\bar{h}_2)]{\text{activation}} h_2 \xrightarrow[w_3h_2 + b_3]{\text{linear}} \bar{y} \xrightarrow[f(\bar{y})]{\text{activation}} \hat{y}
\end{align}

Inputs and Outputs
- $x$ is the input (a single value, since the input layer has one node)
- $\bar{h}_1$ is the input to hidden layer 1
- $h_1$ is the output of hidden layer 1
- $\bar{h}_2$ is the input to hidden layer 2
- $h_2$ is the output of hidden layer 2
- $\bar{y}$ is the input to the output layer
- $\hat{y}$ is the prediction


Parameters
- $w_1$ is the weight matrix connecting the input layer to hidden layer 1
- $b_1$ is the bias vector for the input layer to hidden layer 1
- $w_2$ is the weight matrix connecting hidden layer 1 to hidden layer 2
- $b_2$ is the bias vector for hidden layer 1 to hidden layer 2
- $w_3$ is the weight matrix connecting hidden layer 2 to the output layer
- $b_3$ is the bias vector for hidden layer 2 to the output layer


Activation Function
- $f(x)$ is an activation function

### Activation Function
For this example, we are going to use the **sigmoid (logistic)** function as the activation for every layer. We will also need the **derivative** of the sigmoid function later on, so it is included below (you can find the derivation [here](https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e) if you are interested).

\begin{align}
\large
\sigma (x) = \frac{1}{1+e^{-x}}
\end{align}

\begin{align}
\large
\sigma ' (x) = \sigma (x) (1 - \sigma (x))
\end{align}
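
As a quick sanity check on the two formulas above, the sketch below compares the closed-form derivative against a central finite difference. The test points and the step size `eps` are arbitrary choices made for illustration, and the name `sigmoid_prime` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Closed form from the identity above: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite difference at a few sample points
xs = np.array([-2.0, 0.0, 1.5])   # arbitrary test points (assumption)
eps = 1e-6                        # hypothetical step size
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_prime(xs)))
print(max_gap)
```

The printed gap should be tiny, which suggests the identity can be used in place of differentiating the sigmoid numerically.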

---
## Error
We are already familiar with what error is and how it will be the basis for gradient descent. We will be using the error function that was defined in the intro notebook, **Mean Squared Error**.

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \frac{1}{2}(y_{ij} - \hat{y}_{ij})^2
\end{align}

Remember that the above is the error over the **entire dataset**. If we want the error for a single sample, we can drop the summation over $N$ and the $\frac{1}{N}$ factor. In addition, since we only have one output node, $M=1$, so the summation over $M$ and the $\frac{1}{M}$ factor drop out as well. The error for a single sample then looks like this:

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2
\end{align}
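
To make the reduction from the dataset error to the single-sample error concrete, here is a minimal sketch of both formulas. The function names and the target/prediction values are made up for illustration. With $M=1$, averaging the single-sample errors recovers the dataset error.

```python
import numpy as np

def mse_dataset(y, y_hat):
    # y, y_hat: arrays of shape (N, M); averages 1/2 * squared error over all N*M entries
    n, m = y.shape
    return np.sum(0.5 * (y - y_hat) ** 2) / (n * m)

def error_single(y, y_hat):
    # Single sample, single output node (N = 1, M = 1)
    return 0.5 * (y - y_hat) ** 2

y = np.array([[1.0], [0.0]])       # made-up targets (assumption)
y_hat = np.array([[0.8], [0.3]])   # made-up predictions (assumption)

# One error value per sample, then the error over the whole dataset
per_sample = [error_single(t, p) for t, p in zip(y[:, 0], y_hat[:, 0])]
print(per_sample, mse_dataset(y, y_hat))
```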

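We'll also need the **derivative** of this function with respect to the prediction $\hat{y}$. It is easy to derive with the power rule and the chain rule. Note that the inner derivative of $(y - \hat{y})$ with respect to $\hat{y}$ is $-1$, which flips the sign:

\begin{align}
\large
\frac{\partial E}{\partial \hat{y}} = -(y - \hat{y}) = \hat{y} - y
\end{align}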

---
## Gradient
As mentioned in the intro, if we could find the gradient of the **error with respect to the weights/biases**, the negative of that gradient would tell us how to adjust the weights/biases to decrease loss. The gradient vector of the error with respect to the weights and biases would look like this:

\begin{align}
\large
\nabla E_w = (\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3})
\end{align}

\begin{align}
\large
\nabla E_b = (\frac{\partial E}{\partial b_1}, \frac{\partial E}{\partial b_2}, \frac{\partial E}{\partial b_3})
\end{align}
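
Before deriving these partial derivatives analytically, it can help to see the two gradient vectors as plain numbers. The sketch below approximates them with central finite differences for the 1-1-1-1 network and takes one step along the negative gradient. It is an illustration only, not the backpropagation method itself, and the parameter values, the sample, and `learning_rate` are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(params, x, y):
    # params = [w1, w2, w3, b1, b2, b3] for the 1-1-1-1 network
    w, b = params[:3], params[3:]
    h = x
    for wi, bi in zip(w, b):
        h = sigmoid(wi * h + bi)       # linear step, then activation
    return 0.5 * (y - h) ** 2          # single-sample error

def numeric_gradient(params, x, y, eps=1e-6):
    # Central finite differences: one partial derivative per parameter
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (error(params + step, x, y) - error(params - step, x, y)) / (2 * eps)
    return grad

params = np.array([0.5, -0.3, 0.8, 0.0, 0.0, 0.0])  # made-up values (assumption)
x, y = 1.0, 1.0                                      # made-up sample (assumption)
learning_rate = 0.5                                  # hypothetical step size

grad = numeric_gradient(params, x, y)
grad_w, grad_b = grad[:3], grad[3:]                  # the two gradient vectors
new_params = params - learning_rate * grad           # step along the negative gradient
print(error(params, x, y), error(new_params, x, y))
```

The second printed error should be smaller than the first. Finite differences need two forward passes per parameter, which is why an analytic gradient is preferable for anything beyond a toy network.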



In [1]:
# Imports
import numpy as np

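# Activation functions
def sigmoid(x):
    # Logistic function: 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

def d_sigmoid(s):
    # Derivative of the sigmoid, written in terms of the sigmoid's output s = sigmoid(x)
    return s * (1 - s)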

In [None]:
# Weight and Bias Initialization
weights = np.random.normal(loc = 0., scale = 1/2, size = 3)
biases = np.array([0., 0., 0.])

In [8]:
# Feed Forward
def feed_forward(X):
    outputs = []
    
    h1_inputs = X * weights[0] + biases[0]
    h1_outputs = sigmoid(h1_inputs)
    outputs.append(h1_outputs)
    
    h2_inputs = h1_outputs * weights[1] + biases[1]
    h2_outputs = sigmoid(h2_inputs)
    outputs.append(h2_outputs)
    
    h3_inputs = h2_outputs * weights[2] + biases[2]
    h3_outputs = sigmoid(h3_inputs)
    outputs.append(h3_outputs)
    
    return outputs
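
A short usage sketch may make the return value clearer: the list holds the activated output of each layer in order, so it unpacks to `h1`, `h2`, and `y_hat`. The version below is a compact loop-based equivalent written so it runs on its own. The fixed weights and the input value are assumptions chosen for reproducibility, not values from the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fixed values instead of random ones so the run is reproducible (assumption)
weights = np.array([0.5, -0.3, 0.8])
biases = np.array([0.0, 0.0, 0.0])

def feed_forward(X):
    # Returns [h1, h2, y_hat]: the activated output of each layer in order
    outputs = []
    layer_input = X
    for w, b in zip(weights, biases):
        layer_input = sigmoid(layer_input * w + b)
        outputs.append(layer_input)
    return outputs

h1, h2, y_hat = feed_forward(1.0)
print(h1, h2, y_hat)
```

Keeping every layer's output, rather than only `y_hat`, matters because the backward pass reuses them when computing the gradients.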

In [None]:
def backpropagation(X, y, y_hat, outputs):
    weights_gradients = []
    bias_gradients = []
    
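    # dE/dy_hat for E = 1/2 * (y - y_hat)^2
    d1 = y_hat - y
    # dy_hat/dy_bar, computed from the sigmoid output y_hat
    d2 = d_sigmoid(outputs[2])
    error = d1 * d2
    # Gradients for the output layer: dE/dw3 = error * h2, dE/db3 = error
    weights_gradients.append(error * outputs[1])
    bias_gradients.append(error)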
    
    d3 = weights[2]