# ANN from scratch 
We will build an Artificial Neural Network (ANN) without deep learning libraries such as TensorFlow or PyTorch. The only libraries we will use are NumPy and pandas.

What better way to learn about ANNs than to build one yourself, instead of relying on a fancy library where a few lines of `ann.add` give you a full neural network?

In [2]:
import numpy as np
import pandas as pd

## Dataset Used
We will be using the famous MNIST dataset, a large database of handwritten digits that is commonly used for training image processing systems. It is also widely used for training and testing in the field of machine learning.

![](./images/MnistExamplesModified.png)

__Sources__
- http://yann.lecun.com/exdb/mnist/
- https://www.wikiwand.com/en/MNIST_database
- https://www.kaggle.com/competitions/digit-recognizer/data


In [3]:
df = pd.read_csv('./datasets/train.csv')
df

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41996,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41997,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41998,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
data = np.array(df)
data

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [7, 0, 0, ..., 0, 0, 0],
       [6, 0, 0, ..., 0, 0, 0],
       [9, 0, 0, ..., 0, 0, 0]], dtype=int64)

## Problem Statement

We want our final product to take a 28x28 image of a handwritten digit as input and predict which digit, from 0 to 9, it shows.

![](./images/Screenshot%202023-10-04%20182409.png)


### How are we going to tackle this problem?
Through Artificial Neural Networks! 


## What is an Artificial Neural Network?
It is a model inspired by the structure and function of neurons in the human brain. In our case, it takes the 784 pixels of an image (28 x 28) as the input to the first layer and, using some maths and magic, outputs which digit the machine thinks the image shows.

![](./images/Screenshot%202023-10-05%20150956.png)

We can think of each neuron in the network as just a number between 0 and 1: the neuron lights up when its value is close to 1 and stays inactive when it is near 0. This number is called the neuron's activation, written $a_n$, where $n$ is the position of the neuron within its layer.
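
To make the "number between 0 and 1" idea concrete for the input layer, here is a minimal sketch of how one row of `data` could be turned into 784 input activations. It assumes the pixel intensities range from 0 to 255 and uses a randomly generated row as a hypothetical stand-in for a real one; the names `row`, `pixels` and `image` are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for one row of `data`: a label followed by 784 pixels.
rng = np.random.default_rng(0)
row = np.concatenate(([3], rng.integers(0, 256, size=784)))

label = int(row[0])              # the first column holds the digit label
pixels = row[1:] / 255.0         # assumed 0..255 intensities scaled into 0..1
image = pixels.reshape(28, 28)   # the same values laid out as a 28x28 grid

print(label, pixels.shape, image.shape)
print(pixels.min() >= 0.0, pixels.max() <= 1.0)
```

Each entry of `pixels` can then play the role of one input-layer activation.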



## MATH
Consider one neuron in the second layer (the first hidden layer). It is connected to every neuron in the input layer (784 connections), and each connection has a weight. This gives us the formula $w_1a_1 + w_2a_2 + w_3a_3 + \cdots + w_na_n$, the weighted sum of all the activations in the previous layer.

A weighted sum like this can be any real number, but for this network we want the neuron's activation to be somewhere between 0 and 1. A common thing to do is to pass the weighted sum through a function that squishes the real number line into the range between 0 and 1. We will be using the sigmoid function $\sigma$ for this. ReLU is another common activation function, although it only clips negative values to 0 and does not cap the output at 1.

$\sigma(w_1a_1 + w_2a_2 + w_3a_3 + \cdots + w_na_n)$
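
As a rough illustration of the two activation functions mentioned above, the sketch below uses the standard definitions of sigmoid and ReLU. The function names and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Clip negatives to 0 and pass positives through (not capped at 1)."""
    return np.maximum(0.0, z)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # all values strictly between 0 and 1, sigmoid(0) = 0.5
print(relu(z))     # negatives become 0, 5.0 stays 5.0
```

Note that only the sigmoid keeps every output between 0 and 1, which is the property the text relies on.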

What if we want the neuron to be biased toward being inactive, so that it only lights up when the weighted sum is large enough? We just add a number $b$, the bias, to the weighted sum before applying the sigmoid squishification function.

$\sigma(w_1a_1 + w_2a_2 + w_3a_3 + \cdots + w_na_n + b)$
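
Here is a small sketch of that formula for a single neuron, showing how a negative bias pushes the neuron toward being inactive. The weights, activations and bias values are made up for illustration, and `neuron_activation` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(weights, activations, bias):
    """Compute sigma(w_1 a_1 + ... + w_n a_n + b) for a single neuron."""
    return sigmoid(np.dot(weights, activations) + bias)

# Illustrative values only
a = np.array([0.2, 0.9, 0.5])
w = np.array([1.0, 2.0, -1.0])   # weighted sum: 0.2 + 1.8 - 0.5 = 1.5

no_bias = neuron_activation(w, a, bias=0.0)     # sigma(1.5), well above 0.5
with_bias = neuron_activation(w, a, bias=-5.0)  # sigma(-3.5), close to 0
print(no_bias, with_bias)
```

With the same inputs and weights, the bias alone decides whether this neuron ends up mostly active or mostly inactive.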

Each neuron in the input layer is connected to every neuron in the first hidden layer, each hidden layer is connected to the next in the same way, and the last hidden layer is connected to the output layer. This gives our network a total of 13,002 weights and biases to tweak and optimise!
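
The figure of 13,002 is consistent with a network of 784 inputs, two hidden layers of 16 neurons each, and 10 outputs. Those hidden layer sizes are an assumption here, since the text does not state them; under that assumption the count can be checked like this:

```python
# Assumed layer sizes: 784 inputs, two hidden layers of 16 neurons, 10 outputs.
layer_sizes = [784, 16, 16, 10]

# One weight per connection between consecutive layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron in every layer except the input layer.
biases = sum(layer_sizes[1:])

print(weights, biases, weights + biases)  # 12960 42 13002
```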

Next, let's write the full transition of activations from one layer to the next as a single equation.

We organise all the activations from the first layer into a column vector.
$$\begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix}$$

Then we organise the weights as a matrix where each row holds the weights of the connections between the previous layer and one particular neuron in the next layer.

$$
\begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix}
\begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} =
\begin{bmatrix} ? \\ ? \\ \vdots \\ ? \end{bmatrix}
$$
Each entry of the resulting vector is the weighted sum for one neuron in the next layer.
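
To see that the matrix-vector product really is just one weighted sum per row, here is a tiny sketch with a made-up 2x3 weight matrix (two neurons in the next layer, three in the previous one). All numbers are illustrative.

```python
import numpy as np

# Each row of W holds the weights feeding one neuron in the next layer.
W = np.array([[1.0,  2.0, 3.0],
              [0.5, -1.0, 0.0]])
a0 = np.array([1.0, 0.0, 2.0])

product = W @ a0                                   # matrix-vector product
by_hand = np.array([np.dot(W[0], a0),              # 1*1 + 2*0 + 3*2 = 7.0
                    np.dot(W[1], a0)])             # 0.5*1 - 1*0 + 0*2 = 0.5

print(product, by_hand)
```

Both approaches give the same two numbers, one per neuron in the next layer.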

Next we add the biases: we organise them into a vector, with one entry per neuron in the next layer, and add it to our matrix-vector product.
$$
\begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix}
\begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} +
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}
$$

Finally, we squish each component of the result with the sigmoid.
$$
\sigma\left(
\begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix}
\begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} +
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}
\right)
$$

We can write this formula compactly by giving the weight matrix and each vector its own symbol:
$$
a^{(1)} = \sigma(Wa^{(0)} + b)
$$
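
The compact formula maps almost directly onto NumPy. The following is a minimal sketch, not a full implementation: the layer sizes (784 inputs, 16 neurons in the next layer), the random initial weights and the helper name `layer_forward` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, a_prev, b):
    """One layer transition: a_next = sigma(W a_prev + b)."""
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
n_in, n_out = 784, 16                            # assumed layer sizes
W = rng.standard_normal((n_out, n_in)) * 0.01    # one row per next-layer neuron
b = np.zeros(n_out)                              # one bias per next-layer neuron
a0 = rng.random(n_in)                            # stand-in input activations in [0, 1)

a1 = layer_forward(W, a0, b)
print(a1.shape)  # (16,)
```

Chaining `layer_forward` once per layer would carry the activations from the input all the way to the output layer.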

## How does the network learn? And MORE math
To understand how a neural network learns, let's dive deeper into the concepts of cost functions, gradient descent, and backpropagation.

### Cost Function
So how do we know how well or badly our model is performing on a given dataset? We use a cost function, a mathematical function that measures how well or poorly a neural network performs on the data. In our case, we add up the squares of the differences between the actual output activations and the values we want them to have in the output layer.

For example, suppose we feed in a picture of a 3. We want the network to predict a 3, meaning an output of 1 for the digit 3 and 0 for every other digit, but the ANN spits out trash:
$$
(0.43 - 0.00)^2 +(0.23 - 0.00)^2 +(0.54 - 0.00)^2 +\colorbox{green}{$(0.88 - 1.00)^2$} +(0.54 - 0.00)^2 +(0.02 - 0.00)^2 +(0.25 - 0.00)^2 +(0.12 - 0.00)^2 +(0.77 - 0.00)^2 +(0.63 - 0.00)^2
$$
We get a result of 1.9025. The higher the cost, the worse the model is performing, with 0 being a perfect prediction.
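
The sum of squared differences from the example can be checked with a few lines of NumPy. This sketch uses the ten output activations from the equation above and a target vector that is 1 at index 3 and 0 elsewhere; the variable names are illustrative.

```python
import numpy as np

# Output activations for digits 0..9, taken from the worked example.
output = np.array([0.43, 0.23, 0.54, 0.88, 0.54, 0.02, 0.25, 0.12, 0.77, 0.63])

# Desired output for a picture of a 3: 1 for digit 3, 0 for everything else.
target = np.zeros(10)
target[3] = 1.0

cost = np.sum((output - target) ** 2)
print(round(float(cost), 4))  # 1.9025
```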
### Gradient Descent
But just knowing how bad the model is isn't very helpful. We need to tell it how to change the weights and biases so that it gets better, i.e. so that the cost goes down. That's where gradient descent comes in.
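
The core idea of gradient descent is to repeatedly nudge each parameter a small step in the direction that lowers the cost, which is the opposite of the gradient. The sketch below shows this on a toy cost with a single parameter, not on the network itself; the cost function, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
def cost(w):
    """Toy cost with its minimum at w = 3 (a stand-in for the network's cost)."""
    return (w - 3.0) ** 2

def cost_gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting value
learning_rate = 0.1   # assumed step size

for step in range(100):
    # Step against the gradient to reduce the cost.
    w -= learning_rate * cost_gradient(w)

print(w, cost(w))  # w ends up very close to 3, cost very close to 0
```

For the real network the same update would be applied to every weight and bias at once, using the gradient of the cost with respect to each of them.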
### Back Propagation

## Code

## Final Result