# Neural Network From Scratch

**Author: rNLKJA**

This notebook contains the code and notes for the YouTube video [Neural Network from Scratch | Mathematics & Python Code](https://www.youtube.com/watch?v=pauPCy_s0Ok&t=1671s).

You can also find the original code on [GitHub](https://github.com/TheIndependentCode/Neural-Network).

I may add some extra notation to make it easier to understand how a neural network works.

### Layer

A layer receives an input and produces an output, which means the layer works like a function.

<!-- <img src="./images/layer1.png" width=500px/> -->

When a layer receives an input and produces an output, we call this process forward propagation. When we instead take the derivative of the error with respect to the layer's output and use it to update the layer's parameters, we call this backward propagation.

To calculate the derivatives, we use the chain rule, which expresses each derivative in terms of $\cfrac{\partial E}{\partial Y}$:

$$
\cfrac{\partial E}{\partial W} = \cfrac{\partial E}{\partial Y}\cfrac{\partial Y}{\partial W}
$$

$$
\cfrac{\partial E}{\partial X} = \cfrac{\partial E}{\partial Y}\cfrac{\partial Y}{\partial X}
$$

A neural network is built as a sequential model, which means the output of the current layer is the input of the next layer.

For example, if a network has three layers, then we have: $Y_1 = X_2$, $Y_2 = X_3$, $\cfrac{\partial E}{\partial Y_1} = \cfrac{\partial E}{\partial X_2}$ and $\cfrac{\partial E}{\partial Y_2} = \cfrac{\partial E}{\partial X_3}$.
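
The chaining described above can be sketched with a toy example. The `Scale` layer below is hypothetical and exists only for illustration (it is not part of the video's code, and its `backward` takes no learning rate because it has nothing to train). It shows how each layer's output feeds the next layer on the way forward, and how each layer's input gradient becomes the previous layer's output gradient on the way back.

```python
import numpy as np

# Hypothetical toy layer used only for illustration: it multiplies its input by a constant.
class Scale:
    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return self.factor * x

    def backward(self, output_gradient):
        # dE/dX = dE/dY * dY/dX, and dY/dX = factor for this toy layer
        return self.factor * output_gradient

layers = [Scale(2.0), Scale(3.0), Scale(0.5)]

x = np.array([[1.0], [2.0]])

# forward pass: Y_1 becomes X_2, Y_2 becomes X_3
out = x
for layer in layers:
    out = layer.forward(out)

# backward pass: dE/dX_3 becomes dE/dY_2, dE/dX_2 becomes dE/dY_1
grad = np.ones_like(out)  # assume dE/dY_3 is all ones
for layer in reversed(layers):
    grad = layer.backward(grad)

print(out.flatten(), grad.flatten())
```

Because the three factors multiply to 3, both the final output per unit of input and the gradient reaching the first layer are scaled by 3.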

In [67]:
# import required modules
import numpy as np
from tqdm import tqdm

#### Create the base layer

In [48]:
# implement the base layer
class Layer:
    def __init__(self):
        self.input = None
        self.output = None
    
    # a forward method takes the input to produce the output
    def forward(self, input):
        # TODO: return output
        pass
    
    # the backward method takes the gradient of the error with respect to the output
    # 1. update the trainable parameters (scaled by the learning rate)
    # 2. return the derivative of the error with respect to the input
    def backward(self, output_gradient, learning_rate):
        # TODO: update parameters and return input gradient
        pass

### Dense Layer

A dense layer connects a set of $i$ inputs (e.g. $x_1, x_2, x_3$) to a set of $j$ outputs (e.g. $y_1, y_2, y_3, y_4$).

Each connection is represented by a weight $w$. We write $w_{ji}$ for the weight that connects input $i$ to output $j$.

In addition, we define a bias term $b_j$ for each output. Hence we have the following formulas:

$$
\begin{equation}
    \begin{split}
        y_1 &= x_1w_{11} + x_2w_{12} + \dots + x_iw_{1i} + b_1 \\
        y_2 &= x_1w_{21} + x_2w_{22} + \dots + x_iw_{2i} + b_2 \\
        y_3 &= x_1w_{31} + x_2w_{32} + \dots + x_iw_{3i} + b_3 \\
        &\vdots \\
        y_j &= x_1w_{j1} + x_2w_{j2} + \dots + x_iw_{ji} + b_j
    \end{split}
\end{equation}
$$

Then we rewrite these equations using matrices:

$$
\begin{bmatrix}
   y_1 \\ y_2 \\ \vdots \\ y_j
\end{bmatrix} = \begin{bmatrix}
    w_{11} & w_{12} & \dots & w_{1i} \\
    w_{21} & w_{22} & \dots & w_{2i} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{j1} & w_{j2} & \dots & w_{ji}
\end{bmatrix} \begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_i
\end{bmatrix} + \begin{bmatrix}b_1 \\ b_2 \\ \vdots \\ b_j\end{bmatrix} \rightarrow Y = W \cdot X + B
$$
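
As a quick sanity check of the matrix form, here is a minimal NumPy sketch. The sizes (3 inputs, 4 outputs) and the random values are assumptions chosen only for illustration. It compares the first row of $Y = W \cdot X + B$ with the scalar formula for $y_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

i, j = 3, 4                      # assumed sizes: 3 inputs, 4 outputs
W = rng.standard_normal((j, i))  # one row per output, one column per input
X = rng.standard_normal((i, 1))  # column vector of inputs
B = rng.standard_normal((j, 1))  # one bias per output

# matrix form: Y = W . X + B
Y = W @ X + B

# scalar form for the first output: y_1 = x_1 w_11 + x_2 w_12 + x_3 w_13 + b_1
y1 = sum(X[k, 0] * W[0, k] for k in range(i)) + B[0, 0]

print(Y.shape)                   # (4, 1)
print(np.isclose(Y[0, 0], y1))   # True
```

Keeping $X$, $B$ and $Y$ as column vectors is what lets the later gradient formulas be written as plain matrix products.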

For a layer, $W$ and $B$ are the trainable parameters. To update them, we need the derivatives of the error $E$ with respect to each of them.

Given the derivative of the error with respect to $Y$, we want to calculate the derivatives with respect to $W$, $B$ and $X$.

$$
\cfrac{\partial E}{\partial Y} = \begin{bmatrix}
 \cfrac{\partial E}{\partial y_1} \\ \cfrac{\partial E}{\partial y_2} \\ \vdots \\ \cfrac{\partial E}{\partial y_j} 
\end{bmatrix} \rightarrow \cfrac{\partial E}{\partial W} = \begin{bmatrix}
    \cfrac{\partial E}{\partial w_{11}} & \cfrac{\partial E}{\partial w_{12}} & \dots & \cfrac{\partial E}{\partial w_{1i}} \\
    \cfrac{\partial E}{\partial w_{21}} & \cfrac{\partial E}{\partial w_{22}} & \dots & \cfrac{\partial E}{\partial w_{2i}} \\
    \vdots & \vdots & \ddots & \vdots \\
    \cfrac{\partial E}{\partial w_{j1}} & \cfrac{\partial E}{\partial w_{j2}} & \dots & \cfrac{\partial E}{\partial w_{ji}} \\
\end{bmatrix}
$$

Given the above formulas, consider a single term such as $w_{12}$. Only $y_1$ depends on $w_{12}$, so every other term in the sum vanishes:

$$
\cfrac{\partial E}{\partial w_{12}} = \cfrac{\partial E}{\partial y_1}\cfrac{\partial y_1}{\partial w_{12}} + \cfrac{\partial E}{\partial y_2} \cfrac{\partial y_2}{\partial w_{12}} + \dots + \cfrac{\partial E}{\partial y_j}\cfrac{\partial y_j}{\partial w_{12}} = \cfrac{\partial E}{\partial y_1}\cfrac{\partial y_1}{\partial w_{12}} = \cfrac{\partial E}{\partial y_1}x_2
$$

Applying the same reasoning to every entry of $W$ gives:

$$
\cfrac{\partial E}{\partial W} = \begin{bmatrix}
    \cfrac{\partial E}{\partial y_1}x_1 & \cfrac{\partial E}{\partial y_1}x_2 & \dots & \cfrac{\partial E}{\partial y_1}x_i \\
    \cfrac{\partial E}{\partial y_2}x_1 & \cfrac{\partial E}{\partial y_2}x_2 & \dots & \cfrac{\partial E}{\partial y_2}x_i \\
    \vdots & \vdots & \ddots & \vdots \\
    \cfrac{\partial E}{\partial y_j}x_1 & \cfrac{\partial E}{\partial y_j}x_2 & \dots & \cfrac{\partial E}{\partial y_j}x_i \\
\end{bmatrix} = \begin{bmatrix}
    \cfrac{\partial E}{\partial y_1} \\ \cfrac{\partial E}{\partial y_2} \\ \vdots \\ \cfrac{\partial E}{\partial y_j}
\end{bmatrix} \begin{bmatrix}x_1 & x_2 & \dots & x_i \end{bmatrix} = \cfrac{\partial E}{\partial Y}\cdot X^t
$$
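
One way to convince yourself of the weight-gradient formula is a finite-difference check. The sketch below is not from the video. It assumes a simple error $E = \frac{1}{2}\sum_k y_k^2$ (so that $\cfrac{\partial E}{\partial Y} = Y$), with random values and sizes chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 1))
B = rng.standard_normal((4, 1))

def error(weights):
    # assumed error function, chosen only so that dE/dY = Y
    Y = weights @ X + B
    return 0.5 * np.sum(Y ** 2)

# analytic gradient: dE/dW = dE/dY . X^t
dE_dY = W @ X + B
analytic = dE_dY @ X.T

# numeric gradient: nudge one weight at a time (central differences)
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus = W.copy()
        W_minus = W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = (error(W_plus) - error(W_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```

The analytic result has the same shape as $W$, which is what makes the update $W \leftarrow W - \text{learning rate} \cdot \cfrac{\partial E}{\partial W}$ possible.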

We do the same for $B$:

$$
\cfrac{\partial E}{\partial Y} = \begin{bmatrix}
    \cfrac{\partial E}{\partial y_1} \\ \cfrac{\partial E}{\partial y_2} \\ \vdots \\ \cfrac{\partial E}{\partial y_j} 
\end{bmatrix} \rightarrow \cfrac{\partial E}{\partial B} = \begin{bmatrix}
    \cfrac{\partial E}{\partial b_1} \\ \cfrac{\partial E}{\partial b_2} \\ \vdots \\ \cfrac{\partial E}{\partial b_j} 
\end{bmatrix}
$$

Taking $b_1$ as an example, only $y_1$ depends on it, and $\cfrac{\partial y_1}{\partial b_1} = 1$:

$$
\cfrac{\partial E}{\partial b_1} = \cfrac{\partial E}{\partial y_1}\cfrac{\partial y_1}{\partial b_1} + \cfrac{\partial E}{\partial y_2}\cfrac{\partial y_2}{\partial b_1}+\dots+\cfrac{\partial E}{\partial y_j}\cfrac{\partial y_j}{\partial b_1} = \cfrac{\partial E}{\partial y_1}
$$

$$
\cfrac{\partial E}{\partial b_j} = \cfrac{\partial E}{\partial y_j} \rightarrow \cfrac{\partial E}{\partial B} = \cfrac{\partial E}{\partial Y}
$$
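
Since $\cfrac{\partial E}{\partial B} = \cfrac{\partial E}{\partial Y}$, a gradient descent step on the bias just subtracts a scaled copy of the output gradient. The sketch below illustrates this with an assumed error $E = \frac{1}{2}\sum_k y_k^2$ and an assumed learning rate of 0.1. Neither comes from the video.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 1))
B = rng.standard_normal((4, 1))
learning_rate = 0.1  # assumed value for illustration

def error(bias):
    # assumed error function, chosen only so that dE/dY = Y
    Y = W @ X + bias
    return 0.5 * np.sum(Y ** 2)

error_before = error(B)

# dE/dB = dE/dY, and dE/dY = Y for the assumed error
dE_dY = W @ X + B
B_updated = B - learning_rate * dE_dY

error_after = error(B_updated)
print(error_before, error_after)
```

For this particular error, one step shrinks every $y_k$ by a factor of $1 - 0.1$, so the error drops to $0.81$ times its previous value.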

The last step is to calculate the derivatives for $X$:

$$
\cfrac{\partial E}{\partial Y} = \begin{bmatrix}
    \cfrac{\partial E}{\partial y_1} \\ \cfrac{\partial E}{\partial y_2} \\ \vdots \\ \cfrac{\partial E}{\partial y_j} 
\end{bmatrix} \rightarrow \cfrac{\partial E}{\partial X} = \begin{bmatrix}
    \cfrac{\partial E}{\partial x_1} \\ \cfrac{\partial E}{\partial x_2} \\ \vdots \\ \cfrac{\partial E}{\partial x_i} 
\end{bmatrix}
$$

$$
\begin{equation}
    \begin{split}
        \cfrac{\partial E}{\partial x_i} &= \cfrac{\partial E}{\partial y_1}\cfrac{\partial y_1}{\partial x_i} + \cfrac{\partial E}{\partial y_2}\cfrac{\partial y_2}{\partial x_i} + \dots + \cfrac{\partial E}{\partial y_j}\cfrac{\partial y_j}{\partial x_i} \\
        &= \cfrac{\partial E}{\partial y_1}w_{1i} + \cfrac{\partial E}{\partial y_2}w_{2i} + \dots + \cfrac{\partial E}{\partial y_j}w_{ji}
    \end{split}
\end{equation}
$$

$$
\cfrac{\partial E}{\partial X} = \begin{bmatrix}
    \cfrac{\partial E}{\partial y_1}w_{11} + \cfrac{\partial E}{\partial y_2}w_{21} + \dots + \cfrac{\partial E}{\partial y_j}w_{j1} \\
    \cfrac{\partial E}{\partial y_1}w_{12} + \cfrac{\partial E}{\partial y_2}w_{22} + \dots + \cfrac{\partial E}{\partial y_j}w_{j2} \\
    \vdots \\
    \cfrac{\partial E}{\partial y_1}w_{1i} + \cfrac{\partial E}{\partial y_2}w_{2i} + \dots + \cfrac{\partial E}{\partial y_j}w_{ji} \\
\end{bmatrix} = \begin{bmatrix}
    w_{11} & w_{21} & \dots & w_{j1} \\
    w_{12} & w_{22} & \dots & w_{j2} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{1i} & w_{2i} & \dots & w_{ji} \\
\end{bmatrix}\begin{bmatrix}
    \cfrac{\partial E}{\partial y_1} \\ \cfrac{\partial E}{\partial y_2} \\ \vdots \\ \cfrac{\partial E}{\partial y_j}
\end{bmatrix} = W^t \cdot \cfrac{\partial E}{\partial Y}
$$
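
The input-gradient formula can be checked numerically in the same spirit. This is a sketch under assumptions, not the video's code: the error $E = \frac{1}{2}\sum_k y_k^2$ (so $\cfrac{\partial E}{\partial Y} = Y$), the sizes and the random values are all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 1))
B = rng.standard_normal((4, 1))

def error(inputs):
    # assumed error function, chosen only so that dE/dY = Y
    Y = W @ inputs + B
    return 0.5 * np.sum(Y ** 2)

# analytic gradient: dE/dX = W^t . dE/dY
dE_dY = W @ X + B
analytic = W.T @ dE_dY

# numeric gradient: nudge one input at a time (central differences)
eps = 1e-6
numeric = np.zeros_like(X)
for r in range(X.shape[0]):
    step = np.zeros_like(X)
    step[r, 0] = eps
    numeric[r, 0] = (error(X + step) - error(X - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```

Note that the result has the shape of $X$, not of $Y$. This is the gradient that gets handed to the previous layer as its output gradient.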

#### Create Dense Layer

In [50]:
class Dense(Layer):
    def __init__(self, input_size, output_size):
        super().__init__()
        # weights has shape (output_size, input_size), bias has shape (output_size, 1)
        self.weights = np.random.randn(output_size, input_size)
        self.bias = np.random.randn(output_size, 1)

    def forward(self, input):
        """
        The forward method calculates the output based on Y = WX + B
        """
        self.input = input
        return np.dot(self.weights, self.input) + self.bias

    def backward(self, output_gradient, learning_rate):
        """
        output_gradient: the derivative of the error with respect to the output (dE/dY),
        which is also the derivative of the error with respect to the bias
        """
        # derivative of the error with respect to the weights: dE/dW = dE/dY . X^t
        weights_gradient = np.dot(output_gradient, self.input.T)
        # derivative of the error with respect to the input: dE/dX = W^t . dE/dY
        # (computed before the update so it uses the same weights as the forward pass)
        input_gradient = np.dot(self.weights.T, output_gradient)

        # gradient descent step
        self.weights -= learning_rate * weights_gradient
        self.bias -= learning_rate * output_gradient

        return input_gradient

### Activation Layer

The activation layer takes a set of inputs and produces the same number of outputs by applying a given activation function to each element, e.g.

$$
y_1 = f(x_1)
$$

Then for forward propagation, we have the formula: $Y = f(X)$

The activation layer has no trainable parameters, but we still need to calculate $\cfrac{\partial E}{\partial X}$ from $\cfrac{\partial E}{\partial Y}$ so that the gradient can be passed to the previous layer.

For $x_1$ we have the following, where only $y_1$ depends on $x_1$:

$$
\cfrac{\partial E}{\partial x_1} = \cfrac{\partial E}{\partial y_1}\cfrac{\partial y_1}{\partial x_1} + \cfrac{\partial E}{\partial y_2}\cfrac{\partial y_2}{\partial x_1} + \dots + \cfrac{\partial E}{\partial y_j}\cfrac{\partial y_j}{\partial x_1} = \cfrac{\partial E}{\partial y_1} f^\prime(x_1)
$$

$$
\cfrac{\partial E}{\partial X} = \cfrac{\partial E}{\partial Y} \odot f^\prime(X)
$$

$\odot$: element-wise multiplication
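
To illustrate the element-wise rule, here is a minimal sketch using $\tanh$ as the activation. The upstream gradient `dE_dY` is a made-up random vector, and the error is assumed to be linear in $Y$ so that its gradient is exactly `dE_dY`. Both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 1))
dE_dY = rng.standard_normal((3, 1))  # hypothetical gradient coming from the next layer

def f(x):
    return np.tanh(x)

def f_prime(x):
    return 1 - np.tanh(x) ** 2

# analytic gradient: dE/dX = dE/dY (element-wise) f'(X)
analytic = dE_dY * f_prime(X)

def error(inputs):
    # assumed error: linear in Y, so its gradient with respect to Y is dE_dY
    return np.sum(dE_dY * f(inputs))

# numeric gradient: nudge one input at a time (central differences)
eps = 1e-6
numeric = np.zeros_like(X)
for r in range(X.shape[0]):
    step = np.zeros_like(X)
    step[r, 0] = eps
    numeric[r, 0] = (error(X + step) - error(X - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Unlike the dense layer, no matrix product is involved: each input only affects its own output, so the gradient is a plain element-wise multiplication.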

#### Create Activation Layer

In [29]:
class Activation(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation # activation function
        self.activation_prime = activation_prime # derivative of the activation function

    def forward(self, input):
        self.input = input
        return self.activation(self.input)

    def backward(self, output_gradient, learning_rate):
        return np.multiply(output_gradient, self.activation_prime(self.input))

### Implement activation functions and loss function

In [37]:
# Hyperbolic Tangent
class Tanh(Activation):
    def __init__(self):
        tanh = lambda x: np.tanh(x)
        tanh_prime = lambda x: 1 - np.tanh(x)**2
        super().__init__(tanh, tanh_prime) # call super constructor

In [38]:
# Mean Squared Error
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

# Derivative of Mean Squared Error formula
def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)
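
As a quick check that `mse_prime` really is the derivative of `mse` with respect to `y_pred`, the sketch below repeats both definitions so it runs on its own and compares the analytic gradient against central finite differences. The sample values of `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)

# made-up sample values
y_true = np.array([[0.0], [1.0], [1.0]])
y_pred = np.array([[0.2], [0.7], [0.9]])

analytic = mse_prime(y_true, y_pred)

# numeric gradient: nudge one prediction at a time
eps = 1e-6
numeric = np.zeros_like(y_pred)
for r in range(y_pred.shape[0]):
    step = np.zeros_like(y_pred)
    step[r, 0] = eps
    numeric[r, 0] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

The division by `np.size(y_true)` comes from the mean in `mse`: each prediction contributes only $1/n$ of the total error.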

### Solve XOR

The XOR logic gate has the following truth table:

| $x_1$ | $x_2$ | $y_1$ |
| ---- | ---- | ---- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

In [39]:
# generate sample dataset
X = np.reshape([[0, 0], [0, 1], [1, 0], [1, 1]], (4, 2, 1))
Y = np.reshape([[0], [1], [1], [0]], (4, 1, 1))

In [40]:
# inspect the data
display(X, Y)

array([[[0],
        [0]],

       [[0],
        [1]],

       [[1],
        [0]],

       [[1],
        [1]]])

array([[[0]],

       [[1]],

       [[1]],

       [[0]]])

In [41]:
# construct the neural network
network = [
    Dense(2, 3),
    Tanh(),
    Dense(3, 1),
    Tanh()
]

In [74]:
epochs = 10000
learning_rate = 0.1

In [76]:
p_bar = tqdm(range(epochs))

for epoch in p_bar:
    error = 0
    for x, y in zip(X, Y):
        
        # perform the forward propagation: compute the output of the neural network
        output = x
        for layer in network:
            output = layer.forward(output)
        
        # compute the error
        error += mse(y, output)
        
        # perform the backward propagation: learning step
        grad = mse_prime(y, output)
        for layer in reversed(network):
            grad = layer.backward(grad, learning_rate)
            
    error /= len(X)
    p_bar.set_description(f"Train: {epoch+1}/{epochs}, error={error:.5f}")

Train: 10000/10000, error=0.00000: 100%|█████████████████████████████████████████| 10000/10000 [00:18<00:00, 547.06it/s]


In [88]:
# run a forward pass to predict the output for the first input [0, 0]
output = X[0]
for layer in network:
    output = layer.forward(output)
    
round(output.flatten()[0], 1)

0.0
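
The forward pass above could be wrapped in a small helper so that every input can be checked, not only the first one. The `predict` function and the `Double` stand-in layer below are hypothetical names introduced for this sketch. `Double` only exists so the snippet runs on its own. In the notebook you would pass the trained `network` list instead.

```python
import numpy as np

def predict(network, x):
    """Run x through every layer's forward method and return the final output."""
    output = x
    for layer in network:
        output = layer.forward(output)
    return output

# Hypothetical stand-in layer so the snippet is self-contained:
# it simply doubles its input and has nothing to do with XOR.
class Double:
    def forward(self, x):
        return 2 * x

demo_network = [Double(), Double()]
X_demo = np.reshape([[0, 0], [0, 1], [1, 0], [1, 1]], (4, 2, 1))

for x in X_demo:
    print(x.flatten(), "->", predict(demo_network, x).flatten())
```

With the trained network from this notebook, calling `predict(network, x)` for each `x` in `X` should give values close to the XOR truth table, though the exact numbers depend on the random initialisation.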