# Neural Network Backpropagation from Scratch

In [1]:
import numpy as np

A neural network is just a big function. Recall that the forward pass for one neuron is basically this:

$ReLU(\sum{[inputs*weights]} + bias)$

Written out for three inputs, that is

$ReLU(x_0w_0+x_1w_1+x_2w_2+b)$

which should look familiar: it is a dot product. The weighted inputs plus the bias are sent through an activation function, in this case $ReLU$.
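As a small sketch of that equivalence, the explicit sum of products and `np.dot` can be compared directly. The names `relu`, `x`, `w` and `b` are chosen here for illustration; the values are the ones used later in this notebook.

```python
import numpy as np

def relu(z):
    # ReLU passes positive values through and clips negative ones to zero
    return max(0.0, z)

x = [1.0, -2.0, 3.0]   # example inputs
w = [-3.0, -1.0, 2.0]  # example weights
b = 1.0                # example bias

# explicit sum of products, plus bias
z_loop = sum(xi * wi for xi, wi in zip(x, w)) + b
# the same quantity written as a dot product
z_dot = float(np.dot(x, w) + b)

out_loop = relu(z_loop)
out_dot = relu(z_dot)
print(out_loop, out_dot)
```

Both paths compute the same pre-activation value, so the activation is identical as well.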

The need for differentiation comes up naturally: how do the individual parameters, the $x$s and the $w$s, influence the result? Let's work it out for $w_0$. How does $ReLU()$ (params omitted for brevity) change with respect to $w_0$?

There are three functions nested: $ReLU$, $sum$ and $mul$.

Note that writing $mul$ and $sum$ makes it a little more obvious that we're talking about functions.

$x_0w_0+x_1w_1+x_2w_2=sum(mul(x_0,w_0),mul(x_1,w_1),mul(x_2,w_2))$

We apply the **chain rule**.

$
\begin{equation}
    \cfrac{\partial}{\partial w_0}[ReLU(sum(mul(x_0,w_0),mul(x_1,w_1),mul(x_2,w_2)))]\\
    =\cfrac{dReLU(sum(mul(x_0,w_0)))}{dsum(mul(x_0,w_0))}\cdot\cfrac{\partial sum(mul(x_0,w_0))}{\partial mul(x_0,w_0)}\cdot\cfrac{\partial mul(x_0,w_0)}{\partial w_0}\\
    =ReLU'(sum(mul(x_0,w_0)))\cdot sum'(mul(x_0,w_0))\cdot mul'(x_0,w_0)\\
\end{equation}
$

That's it in Leibniz and Lagrange notation. We only write the $x_0w_0$ term and omit $+x_1w_1+x_2w_2$, since those terms do not depend on $w_0$ and their derivative is zero. The derivative of $ReLU$ itself is still evaluated at the full sum. \
This process gives us the derivative with respect to the weight $w_0$, and therefore its influence on the output of the larger $ReLU$ function. This is also how we know which way to tweak this weight for less loss.
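One way to sanity-check a chain-rule result is a finite-difference estimate: nudge one weight a little and see how much the output moves. The sketch below assumes the three-input neuron and the values used later in this notebook; `neuron`, `eps` and `numeric_grad` are names introduced here for illustration only.

```python
import numpy as np

def neuron(x, w, b):
    # forward pass of a single neuron: ReLU(x . w + b)
    return max(0.0, float(np.dot(x, w) + b))

x = np.array([1.0, -2.0, 3.0])
w = np.array([-3.0, -1.0, 2.0])
b = 1.0

eps = 1e-6  # size of the nudge; an arbitrary small value
numeric_grad = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy()
    w_plus[i] += eps
    w_minus = w.copy()
    w_minus[i] -= eps
    # central difference approximates the partial derivative w.r.t. w[i]
    numeric_grad[i] = (neuron(x, w_plus, b) - neuron(x, w_minus, b)) / (2 * eps)

print(numeric_grad)
```

Because the input to $ReLU$ is positive here, the estimate for each weight should come out close to the matching input value, which is what the chain rule predicts.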

In [2]:
def ReLU(x):
    return max(0, x)

In [3]:
X = [1., -2., 3.]  # three inputs to the neuron
weights = np.array([-3., -1., 2.])
bias = 1.

## Forward Pass

In [4]:
fp = ReLU(np.dot(X, weights) + bias)  # the bias is added before the activation
fp

6.0

## Backward Pass

In this example we assume that the backward pass of the following layer has already been done and handed us a gradient of $1$, as that layer also had just one neuron.

Let's do the math we discussed above for the first weight `weights[0]`.

$
\begin{equation}
    \cfrac{\partial}{\partial w_0}[ReLU(sum(mul(x_0,w_0),mul(x_1,w_1),mul(x_2,w_2)))]\\
    =\cfrac{dReLU(sum(mul(x_0,w_0)))}{dsum(mul(x_0,w_0))}\cdot\cfrac{\partial sum(mul(x_0,w_0))}{\partial mul(x_0,w_0)}\cdot\cfrac{\partial mul(x_0,w_0)}{\partial w_0}\\
    =\cfrac{dReLU(sum(mul(1,w_0)))}{dsum(mul(1,w_0))}\cdot\cfrac{\partial sum(mul(1,w_0))}{\partial mul(1,w_0)}\cdot\cfrac{\partial mul(1,w_0)}{\partial w_0}\\
    =1\cdot1\cdot1\\
    =1
\end{equation}
$


In [5]:
d_w0 = 1.

Now for `weights[1]`:

$
\begin{equation}
    \cfrac{\partial}{\partial w_1}[ReLU(sum(mul(x_0,w_0),mul(x_1,w_1),mul(x_2,w_2)))]\\
    =\cfrac{dReLU(sum(mul(-2,w_1)))}{dsum(mul(-2,w_1))}\cdot\cfrac{\partial sum(mul(-2,w_1))}{\partial mul(-2,w_1)}\cdot\cfrac{\partial mul(-2,w_1)}{\partial w_1}\\
    =1\cdot1\cdot(-2)\\
    =-2
\end{equation}
$


In [6]:
d_w1 = -2.

The same for `weights[2]`, with the intermediate steps written out:

$
\begin{equation}
    \cfrac{\partial}{\partial w_2}[ReLU(sum(mul(x_0,w_0),mul(x_1,w_1),mul(x_2,w_2)))]\\
    =\cfrac{dReLU(sum(mul(3,w_2)))}{dsum(mul(3,w_2))}\cdot\cfrac{\partial sum(mul(3,w_2))}{\partial mul(3,w_2)}\cdot\cfrac{\partial mul(3,w_2)}{\partial w_2}\\
    =\cfrac{d\,max(0,sum)}{d\,sum}
     \cdot1\cdot
     \bigg(3\cfrac{\partial w_2}{\partial w_2}\bigg)\\
    =1(sum>0)\cdot1\cdot(3\cdot1)\\
    =1\cdot1\cdot(3\cdot1)\\
    =1\cdot1\cdot3\\
    =3
\end{equation}
$

Pay attention to how $ReLU$ is differentiated: an activation of $0$ inevitably leads to a derivative of $0$ in the backward pass, so no gradient reaches the weights. That's how neurons die.
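To make the dying-neuron case concrete, here is a minimal sketch with weights picked so that the sum is negative. `w_dead` and `relu_grad` are hypothetical names and values, not part of the example in this notebook.

```python
import numpy as np

def relu_grad(z):
    # derivative of max(0, z): 1 where z > 0, else 0
    # (the value at exactly z == 0 is a convention; 0 is used here)
    return 1.0 if z > 0 else 0.0

x = np.array([1.0, -2.0, 3.0])
w_dead = np.array([-3.0, 1.0, -2.0])  # hypothetical weights giving a negative sum
b = 1.0

z = float(np.dot(x, w_dead) + b)      # -3 - 2 - 6 + 1 = -10
activation = max(0.0, z)              # ReLU clips this to 0
upstream = 1.0                        # gradient from the following layer, assumed to be 1
w_grad_dead = upstream * relu_grad(z) * x  # the ReLU factor zeroes every entry

print(z, activation, w_grad_dead)
```

With a zero gradient the weights receive no update, so the neuron stays at zero for this input.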

Interlude: $d$ is preferred over $\partial$ when the function is univariate, since there is then only one first-order derivative. $\partial$, however, explicitly denotes a partial derivative, which is what we want here: we treat all other variables as constants.

In [7]:
d_w2 = 3.

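The same differentiation is done with respect to each element of `X`; the derivative with respect to $x_i$ is the weight $w_i$. These values are not used to optimize the weights and bias, but they are the gradient passed on to the next step of the backward pass.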

In [8]:
d_x0 = -3.
d_x1 = -1.
d_x2 = 2.

A gradient is a vector of partial derivatives.
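The hand-derived values can also be computed in one go. The following is a sketch under the same assumptions as the text (upstream gradient of 1, positive input to $ReLU$); the names `upstream`, `d_relu`, `w_grad_vec`, `x_grad_vec` and `b_grad` are introduced here for illustration.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
w = np.array([-3.0, -1.0, 2.0])
b = 1.0

z = float(np.dot(x, w) + b)    # input to ReLU
upstream = 1.0                 # gradient from the following layer, assumed to be 1
d_relu = upstream * (1.0 if z > 0 else 0.0)  # chain rule through ReLU

w_grad_vec = d_relu * x        # derivative of x_i * w_i w.r.t. w_i is x_i
x_grad_vec = d_relu * w        # derivative of x_i * w_i w.r.t. x_i is w_i
b_grad = d_relu * 1.0          # derivative of (sum + b) w.r.t. b is 1

print(w_grad_vec, x_grad_vec, b_grad)
```

The bias gradient is included only for completeness; the cells in this notebook update the weights alone.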

In [18]:
X_grad = [d_x0, d_x1, d_x2]
w_grad = [d_w0, d_w1, d_w2]
lr = .001  # learning rate (part of optimizer)
X_grad, weights, w_grad

([-3.0, -1.0, 2.0], array([-3., -1.,  2.]), [1.0, -2.0, 3.0])

At last, the updated weights would be:

In [12]:
optim_w = -lr * np.array(w_grad) + weights
optim_w

array([-3.001, -0.998,  1.997])

In [17]:
# another forward pass shows the slight change in result
ReLU(np.dot(X, optim_w) + bias)

5.986000000000001
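Repeating the update a few times shows the output shrinking step by step. This is only a sketch: it treats the neuron's output as the quantity to reduce, keeps the upstream gradient at 1, and leaves the bias fixed as in the cells above. The loop and the `outputs` list are additions for illustration.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
w = np.array([-3.0, -1.0, 2.0])
b = 1.0
lr = 0.001  # learning rate

outputs = []
for step in range(5):
    z = float(np.dot(x, w) + b)      # forward pass, input to ReLU
    outputs.append(max(0.0, z))      # record the activation
    d_relu = 1.0 if z > 0 else 0.0   # upstream gradient assumed to be 1
    w = w - lr * d_relu * x          # gradient descent step on the weights only

print(outputs)
```

The first two recorded values correspond to the two forward passes shown above.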