# Building a Neural Network from Scratch

https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65 

### Abstract Base Class : Layer
Every other layer inherits from the abstract class Layer, which holds two simple properties, an input and an output, and declares a forward and a backward method.

In [1]:
# Base class
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

In the abstract class above, backward_propagation takes an extra parameter, learning_rate, which controls how large each gradient descent update to the parameters is.
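
To make the role of learning_rate concrete, here is a minimal sketch of gradient descent on a hypothetical one-parameter error, $E(w) = (w - 3)^2$. The function names and the error itself are assumptions for illustration and are not part of the network code.

```python
# Hypothetical one-parameter example: minimise E(w) = (w - 3)**2.
def grad(w):
    # dE/dw for E(w) = (w - 3)**2
    return 2.0 * (w - 3.0)

def descend(w, learning_rate, steps):
    # repeat the update rule: parameter -= learning_rate * gradient
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# same starting point and number of steps, two different learning rates
w_small = descend(0.0, learning_rate=0.01, steps=20)
w_large = descend(0.0, learning_rate=0.1, steps=20)
print(w_small, w_large)
```

With the same number of steps, the larger learning rate ends up much closer to the minimum at $w = 3$. A learning rate that is too large can overshoot instead, which this sketch does not show.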

### Backward Propagation
Suppose we are given a matrix containing the derivative of the error with respect to a layer's output, $\frac{\partial E}{\partial Y}$.

We need:
- The derivative of the error with respect to the parameters ($\frac{\partial E}{\partial W}$, $\frac{\partial E}{\partial B}$)
- The derivative of the error with respect to the input ($\frac{\partial E}{\partial X}$)

Let's calculate $\frac{\partial E}{\partial W}$. This matrix should be the same size as $W$ itself: $i \times j$, where $i$ is the number of input neurons and $j$ the number of output neurons. We need one gradient for every weight.
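
As a sketch of how the calculation goes, assume the layer computes $Y = XW + B$ for a single sample, with $X$ a row vector of size $1 \times i$ and $B$, $Y$ row vectors of size $1 \times j$ (this matches the forward_propagation of the fully connected layer coded below). The lower-case symbols $x_i$, $y_j$, $w_{ij}$, $b_j$ are introduced here for the entries of $X$, $Y$, $W$, $B$.

$$
y_j = \sum_i x_i w_{ij} + b_j
$$

$$
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j}\frac{\partial y_j}{\partial w_{ij}} = x_i \frac{\partial E}{\partial y_j}
\quad\Longrightarrow\quad
\frac{\partial E}{\partial W} = X^{T}\frac{\partial E}{\partial Y}
$$

$$
\frac{\partial E}{\partial b_j} = \frac{\partial E}{\partial y_j}
\quad\Longrightarrow\quad
\frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y}
$$

$$
\frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j} w_{ij}
\quad\Longrightarrow\quad
\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} W^{T}
$$

The shapes agree with what we wanted: $X^{T}$ is $i \times 1$ and $\frac{\partial E}{\partial Y}$ is $1 \times j$, so their product is $i \times j$, the same size as $W$.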

### Coding the Fully Connected Layer

In [2]:
#from layer import Layer
import numpy as np

# inherit from base class Layer
class FCLayer(Layer):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        super().__init__()
        # weights and biases start uniformly distributed in [-0.5, 0.5)
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    # returns output for a given input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_error, learning_rate):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)
        # dBias = output_error

        # update parameters
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error
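
One way to gain confidence in the formulas used by backward_propagation is to compare them with finite differences. The following is a minimal sketch: it repeats the layer's arithmetic with plain NumPy arrays rather than the FCLayer class, and the array sizes, the MSE-style error and the helper numerical_gradient are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))        # one sample with 4 input neurons
w = rng.normal(size=(4, 3))        # 4 inputs -> 3 output neurons
b = rng.normal(size=(1, 3))
target = rng.normal(size=(1, 3))   # made-up target for the error

def error():
    # forward pass followed by a mean squared error
    y = np.dot(x, w) + b
    return np.mean((target - y) ** 2)

# analytic gradients, same formulas as in backward_propagation
y = np.dot(x, w) + b
output_error = 2 * (y - target) / target.size   # dE/dY
input_error = np.dot(output_error, w.T)         # dE/dX
weights_error = np.dot(x.T, output_error)       # dE/dW
bias_error = output_error                       # dE/dB

def numerical_gradient(array, eps=1e-6):
    # central finite difference of error() w.r.t. every entry of array
    grad = np.zeros_like(array)
    for idx in np.ndindex(*array.shape):
        original = array[idx]
        array[idx] = original + eps
        plus = error()
        array[idx] = original - eps
        minus = error()
        array[idx] = original          # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_w = numerical_gradient(w)
num_b = numerical_gradient(b)
num_x = numerical_gradient(x)
print(np.allclose(weights_error, num_w, atol=1e-6))
print(np.allclose(bias_error, num_b, atol=1e-6))
print(np.allclose(input_error, num_x, atol=1e-6))
```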

### Activation Layer
All the calculations we did until now were completely linear, and a stack of linear layers can only represent a linear function, so the network cannot learn anything more complex. We need to add non-linearity to the model by applying non-linear functions to the output of some layers.

Now we need to redo the whole process for this new type of layer!

In [3]:
#from layer import Layer

# inherit from base class Layer
class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_error=dE/dY.
    # learning_rate is not used because there are no "learnable" parameters.
    def backward_propagation(self, output_error, learning_rate):
        return self.activation_prime(self.input) * output_error

You can also write some activation functions and their derivatives in a separate file. These will be used later to create an ActivationLayer.

In [4]:
import numpy as np

# activation function and its derivative
def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x)**2
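
Because an activation is applied element by element, its backward pass is an element-wise product: $\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f'(X)$. The sketch below checks tanh_prime against a finite difference and then applies that product; the input values and the output_error array are made up for illustration.

```python
import numpy as np

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

x = np.array([[-2.0, -0.5, 0.0, 0.5, 2.0]])

# central finite difference approximation of d tanh / dx
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

# made-up dE/dY coming from the next layer
output_error = np.array([[0.1, -0.2, 0.3, -0.4, 0.5]])

# element-wise product, as in ActivationLayer.backward_propagation
input_error = tanh_prime(x) * output_error

print(np.max(np.abs(numeric - tanh_prime(x))))
print(input_error)
```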

### Loss Function
Until now, for a given layer, we assumed that $\frac{\partial E}{\partial Y}$ was given (by the next layer). But what happens at the last layer? How does it get $\frac{\partial E}{\partial Y}$? We supply it ourselves, and it depends on how we define the error.
The error of the network, which measures how well or badly the network did for a given input, is defined by you.

There are many ways to define the error, and one of the best known is the mean squared error (MSE).

In [5]:

import numpy as np

# loss function and its derivative
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
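
For $n$ outputs, these two functions compute $E = \frac{1}{n}\sum_i (y^*_i - y_i)^2$ and $\frac{\partial E}{\partial y_i} = \frac{2}{n}(y_i - y^*_i)$, where $y^*$ is the true value and $y$ the prediction (the symbols are chosen here for illustration). A small worked example, with made-up numbers, shows the values and compares one gradient entry with a finite difference:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

# made-up target and prediction for a network with two outputs
y_true = np.array([[1.0, 0.0]])
y_pred = np.array([[0.8, 0.4]])

loss = mse(y_true, y_pred)         # (0.2**2 + 0.4**2) / 2, about 0.1
grad = mse_prime(y_true, y_pred)   # about [[-0.2, 0.4]]

# central finite difference with respect to the first output only
eps = 1e-6
bump = np.array([[eps, 0.0]])
numeric = (mse(y_true, y_pred + bump) - mse(y_true, y_pred - bump)) / (2 * eps)

print(loss, grad, numeric)
```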

### Network Class
Almost done! We are going to make a Network class to create neural networks very easily using the building blocks we have prepared so far.


In [6]:
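# example of a function for calculating softmax for a list of numbers
import numpy as np

# calculate the softmax of a vector
def softmax(vector):
    vector = np.asarray(vector, dtype=float)
    # subtract the maximum first so that exp() cannot overflow
    e = np.exp(vector - np.max(vector))
    return e / e.sum()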

In [7]:
class Network:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

        
    # predict output for given input
    def predict(self, input_data):
        # sample dimension first
        samples = len(input_data)
        result = []

        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)

        return result

    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        '''
        Fit function does the training. 
        Training data is passed 1-by-1 through the network layers during forward propagation.
        Loss (error) is calculated for each input and back propagation is performed via partial 
        derivatives on each layer.
        '''
        # sample dimension first
        samples = len(x_train)

        # training loop
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                err += self.loss(y_train[j], output)

                # backward propagation
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)

            # calculate average error on all samples
            err /= samples
            print('epoch %d/%d   error=%f' % (i+1, epochs, err))

### Building Neural Networks
Finally! We can use our class to create a neural network with as many layers as we want. We are going to build two neural networks: a simple XOR solver and an MNIST solver.


### Solve XOR
XOR is a good first problem, as it's a simple way to tell whether the network is learning anything at all.

In [8]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

# training data
x_train = np.array([[[0,0]], [[0,1]], [[1,0]], [[1,1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

# network
net = Network()
net.add(FCLayer(2, 3))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(3, 1))
net.add(ActivationLayer(tanh, tanh_prime))

# train
net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=1000, learning_rate=0.1)

# test
out = net.predict(x_train)
print(out)

epoch 1/1000   error=0.988294
epoch 2/1000   error=0.392466
epoch 3/1000   error=0.316818
epoch 4/1000   error=0.303850
epoch 5/1000   error=0.299938
epoch 6/1000   error=0.298213
epoch 7/1000   error=0.297206
epoch 8/1000   error=0.296487
epoch 9/1000   error=0.295903
epoch 10/1000   error=0.295391
epoch 11/1000   error=0.294920
epoch 12/1000   error=0.294478
epoch 13/1000   error=0.294058
epoch 14/1000   error=0.293654
epoch 15/1000   error=0.293267
epoch 16/1000   error=0.292893
epoch 17/1000   error=0.292532
epoch 18/1000   error=0.292183
epoch 19/1000   error=0.291846
epoch 20/1000   error=0.291521
epoch 21/1000   error=0.291206
epoch 22/1000   error=0.290902
epoch 23/1000   error=0.290608
epoch 24/1000   error=0.290324
epoch 25/1000   error=0.290049
epoch 26/1000   error=0.289783
epoch 27/1000   error=0.289527
epoch 28/1000   error=0.289278
epoch 29/1000   error=0.289039
epoch 30/1000   error=0.288807
epoch 31/1000   error=0.288583
epoch 32/1000   error=0.288367
epoch 33/1000   e

epoch 445/1000   error=0.001473
epoch 446/1000   error=0.001462
epoch 447/1000   error=0.001451
epoch 448/1000   error=0.001440
epoch 449/1000   error=0.001429
epoch 450/1000   error=0.001418
epoch 451/1000   error=0.001408
epoch 452/1000   error=0.001397
epoch 453/1000   error=0.001387
epoch 454/1000   error=0.001377
epoch 455/1000   error=0.001367
epoch 456/1000   error=0.001357
epoch 457/1000   error=0.001347
epoch 458/1000   error=0.001337
epoch 459/1000   error=0.001328
epoch 460/1000   error=0.001318
epoch 461/1000   error=0.001309
epoch 462/1000   error=0.001300
epoch 463/1000   error=0.001291
epoch 464/1000   error=0.001282
epoch 465/1000   error=0.001273
epoch 466/1000   error=0.001264
epoch 467/1000   error=0.001256
epoch 468/1000   error=0.001247
epoch 469/1000   error=0.001239
epoch 470/1000   error=0.001231
epoch 471/1000   error=0.001223
epoch 472/1000   error=0.001215
epoch 473/1000   error=0.001207
epoch 474/1000   error=0.001199
epoch 475/1000   error=0.001191
epoch 47

epoch 948/1000   error=0.000269
epoch 949/1000   error=0.000269
epoch 950/1000   error=0.000268
epoch 951/1000   error=0.000268
epoch 952/1000   error=0.000267
epoch 953/1000   error=0.000267
epoch 954/1000   error=0.000266
epoch 955/1000   error=0.000266
epoch 956/1000   error=0.000266
epoch 957/1000   error=0.000265
epoch 958/1000   error=0.000265
epoch 959/1000   error=0.000264
epoch 960/1000   error=0.000264
epoch 961/1000   error=0.000263
epoch 962/1000   error=0.000263
epoch 963/1000   error=0.000262
epoch 964/1000   error=0.000262
epoch 965/1000   error=0.000261
epoch 966/1000   error=0.000261
epoch 967/1000   error=0.000261
epoch 968/1000   error=0.000260
epoch 969/1000   error=0.000260
epoch 970/1000   error=0.000259
epoch 971/1000   error=0.000259
epoch 972/1000   error=0.000258
epoch 973/1000   error=0.000258
epoch 974/1000   error=0.000258
epoch 975/1000   error=0.000257
epoch 976/1000   error=0.000257
epoch 977/1000   error=0.000256
epoch 978/1000   error=0.000256
epoch 97

### Solve MNIST
We didn't implement a convolutional layer, but this is not a problem.
All we need to do is reshape our data so that it fits into a fully connected layer.
The MNIST dataset consists of images of digits from 0 to 9, of shape 28x28x1.
The goal is to predict which digit is drawn in a picture.
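
The reshaping and label encoding can be sketched without downloading the dataset. The snippet below uses random arrays as a stand-in for MNIST (an assumption, as are the variable names) and does the one-hot encoding with plain NumPy instead of a Keras helper.

```python
import numpy as np

# Stand-in for MNIST: 5 random 28x28 grayscale "images" and made-up labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28)).astype(np.uint8)
labels = np.array([3, 0, 9, 1, 3])

# flatten every image into one row vector of 28*28 = 784 values,
# giving the (samples, 1, 784) layout that FCLayer expects
flat = images.reshape(images.shape[0], 1, 28 * 28).astype('float32')

# one-hot encode the labels: row k of the identity matrix encodes digit k
one_hot = np.eye(10, dtype='float32')[labels]

print(flat.shape)      # (5, 1, 784)
print(one_hot[0])      # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```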

In [9]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape input data (pixel values are left unscaled, in [0, 255])
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')

# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')

y_test = to_categorical(y_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, the weights are updated after every sample, so training on all 60000 samples would be very slow
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
out = net.predict(x_test[0:3])
print("\n")
print("predicted values : ")
print(out, end="\n")
print("true values : ")
print(y_test[0:3])

epoch 1/35   error=0.267268
epoch 2/35   error=0.136776
epoch 3/35   error=0.129875
epoch 4/35   error=0.122785
epoch 5/35   error=0.119283
epoch 6/35   error=0.116729
epoch 7/35   error=0.116644
epoch 8/35   error=0.112951
epoch 9/35   error=0.109218
epoch 10/35   error=0.111558
epoch 11/35   error=0.109720
epoch 12/35   error=0.108266
epoch 13/35   error=0.108011
epoch 14/35   error=0.106802
epoch 15/35   error=0.105264
epoch 16/35   error=0.105535
epoch 17/35   error=0.104213
epoch 18/35   error=0.103155
epoch 19/35   error=0.107140
epoch 20/35   error=0.106981
epoch 21/35   error=0.106873
epoch 22/35   error=0.105653
epoch 23/35   error=0.105620
epoch 24/35   error=0.102567
epoch 25/35   error=0.099447
epoch 26/35   error=0.098495
epoch 27/35   error=0.097795
epoch 28/35   error=0.097155
epoch 29/35   error=0.096501
epoch 30/35   error=0.095805
epoch 31/35   error=0.095044
epoch 32/35   error=0.094211
epoch 33/35   error=0.093738
epoch 34/35   error=0.093314
epoch 35/35   error=0.0

In [10]:
from sklearn.metrics import classification_report

y_pred = [np.argmax(net.predict(x)) for x in x_test]
y_true = [np.argmax(y) for y in y_test]
print(classification_report(y_true, y_pred))
#default

              precision    recall  f1-score   support

           0       0.73      0.37      0.49       980
           1       0.89      0.55      0.68      1135
           2       0.20      0.00      0.00      1032
           3       0.46      0.17      0.25      1010
           4       0.63      0.14      0.22       982
           5       0.43      0.11      0.17       892
           6       0.12      0.90      0.21       958
           7       0.67      0.53      0.59      1028
           8       0.00      0.00      0.00       974
           9       0.45      0.02      0.04      1009

    accuracy                           0.28     10000
   macro avg       0.46      0.28      0.27     10000
weighted avg       0.47      0.28      0.27     10000





**What can go wrong if we have a wide range of numbers in our input/output data and we feed the neural network with this unprocessed data, without any pre-processing?**

Each neuron sums the products of its inputs and weights, so large raw input values can produce very large numbers flowing through the network during forward propagation. With tanh activations, such large values saturate the function, its derivative gets close to zero, and the weights barely update.
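
A small experiment makes the effect visible. This is a sketch under stated assumptions: random pixel values stand in for one unscaled image, the weights use the same initialisation range as FCLayer, and only the first layer's pre-activation is examined.

```python
import numpy as np

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

rng = np.random.default_rng(0)

# same initialisation range as FCLayer: uniform in [-0.5, 0.5)
weights = rng.random((784, 100)) - 0.5

# random stand-in for one flattened image with raw pixel values in [0, 255]
raw = rng.integers(0, 256, size=(1, 784)).astype('float32')
scaled = raw / 255.0

stats = {}
for name, x in [('raw', raw), ('scaled', scaled)]:
    pre_activation = np.dot(x, weights)
    stats[name] = (
        np.abs(pre_activation).mean(),       # typical size of the values
        tanh_prime(pre_activation).mean(),   # typical size of the gradient factor
    )
    print(name, stats[name])
```

The unscaled input gives far larger pre-activations and a much smaller average tanh derivative than the scaled one, which is why so little gradient reaches the first layer's weights.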

**How do we tackle this problem?**

To tackle this problem, we can use either standardization or normalization.

### Normalization and Standardization

In [11]:
def normalize(values):
    return (values - values.min())/(values.max() - values.min())

def standardize(values):
    return (values - values.mean())/values.std()
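
As a quick illustration of what the two functions do, here they are applied to a made-up array of four pixel values (the array and variable names are assumptions for the example):

```python
import numpy as np

def normalize(values):
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    return (values - values.mean()) / values.std()

pixels = np.array([0.0, 51.0, 102.0, 255.0])

n = normalize(pixels)     # min-max scaling: smallest value -> 0, largest -> 1
s = standardize(pixels)   # shifted and scaled to zero mean, unit standard deviation

print(n)
print(s, s.mean(), s.std())
```

Both functions divide by a quantity that is zero when every value is identical, so a constant input would need special handling.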

In [12]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape input data (normalization is applied further down)
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')

# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')

y_test = to_categorical(y_test)

# scale pixel values into [0, 1]
x_train = normalize(x_train)
x_test = normalize(x_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, the weights are updated after every sample, so training on all 60000 samples would be very slow
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
out = net.predict(x_test[0:3])
print("\n")
print("predicted values : ")
print(out, end="\n")
print("true values : ")
print(y_test[0:3])

epoch 1/35   error=0.240837
epoch 2/35   error=0.100509
epoch 3/35   error=0.078942
epoch 4/35   error=0.065168
epoch 5/35   error=0.055041
epoch 6/35   error=0.047132
epoch 7/35   error=0.041448
epoch 8/35   error=0.036604
epoch 9/35   error=0.032670
epoch 10/35   error=0.029393
epoch 11/35   error=0.026591
epoch 12/35   error=0.024198
epoch 13/35   error=0.022133
epoch 14/35   error=0.020489
epoch 15/35   error=0.019158
epoch 16/35   error=0.018067
epoch 17/35   error=0.017059
epoch 18/35   error=0.016011
epoch 19/35   error=0.015106
epoch 20/35   error=0.014328
epoch 21/35   error=0.013688
epoch 22/35   error=0.013086
epoch 23/35   error=0.012565
epoch 24/35   error=0.012083
epoch 25/35   error=0.011619
epoch 26/35   error=0.011149
epoch 27/35   error=0.010681
epoch 28/35   error=0.010318
epoch 29/35   error=0.009977
epoch 30/35   error=0.009596
epoch 31/35   error=0.009314
epoch 32/35   error=0.009012
epoch 33/35   error=0.008774
epoch 34/35   error=0.008383
epoch 35/35   error=0.0

In [13]:
y_pred = [np.argmax(net.predict(x)) for x in x_test]
y_true = [np.argmax(y) for y in y_test]

print(classification_report(y_true, y_pred))
#normalization

              precision    recall  f1-score   support

           0       0.93      0.91      0.92       980
           1       0.97      0.94      0.96      1135
           2       0.86      0.77      0.81      1032
           3       0.84      0.70      0.76      1010
           4       0.82      0.72      0.77       982
           5       0.59      0.70      0.64       892
           6       0.88      0.76      0.82       958
           7       0.82      0.86      0.84      1028
           8       0.64      0.66      0.65       974
           9       0.60      0.81      0.69      1009

    accuracy                           0.79     10000
   macro avg       0.80      0.78      0.79     10000
weighted avg       0.80      0.79      0.79     10000



In [14]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape input data (standardization is applied further down)
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')

# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')

y_test = to_categorical(y_test)

# rescale pixel values to zero mean and unit standard deviation
x_train = standardize(x_train)
x_test = standardize(x_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, the weights are updated after every sample, so training on all 60000 samples would be very slow
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
out = net.predict(x_test[0:3])
print("\n")
print("predicted values : ")
print(out, end="\n")
print("true values : ")
print(y_test[0:3])

epoch 1/35   error=0.271621
epoch 2/35   error=0.095203
epoch 3/35   error=0.072048
epoch 4/35   error=0.060018
epoch 5/35   error=0.052130
epoch 6/35   error=0.045874
epoch 7/35   error=0.041136
epoch 8/35   error=0.036956
epoch 9/35   error=0.033621
epoch 10/35   error=0.030949
epoch 11/35   error=0.028516
epoch 12/35   error=0.026502
epoch 13/35   error=0.024757
epoch 14/35   error=0.023170
epoch 15/35   error=0.021567
epoch 16/35   error=0.020265
epoch 17/35   error=0.019056
epoch 18/35   error=0.018134
epoch 19/35   error=0.016873
epoch 20/35   error=0.016035
epoch 21/35   error=0.015182
epoch 22/35   error=0.014516
epoch 23/35   error=0.013961
epoch 24/35   error=0.013377
epoch 25/35   error=0.012923
epoch 26/35   error=0.012416
epoch 27/35   error=0.011994
epoch 28/35   error=0.011570
epoch 29/35   error=0.011210
epoch 30/35   error=0.010925
epoch 31/35   error=0.010611
epoch 32/35   error=0.010295
epoch 33/35   error=0.009839
epoch 34/35   error=0.009542
epoch 35/35   error=0.0

In [15]:
y_pred = [np.argmax(net.predict(x)) for x in x_test]
y_true = [np.argmax(y) for y in y_test]

print(classification_report(y_true, y_pred))
#Standardizing

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       980
           1       0.93      0.93      0.93      1135
           2       0.75      0.70      0.72      1032
           3       0.82      0.66      0.73      1010
           4       0.77      0.60      0.67       982
           5       0.73      0.53      0.61       892
           6       0.70      0.76      0.73       958
           7       0.79      0.80      0.80      1028
           8       0.68      0.59      0.64       974
           9       0.44      0.76      0.56      1009

    accuracy                           0.72     10000
   macro avg       0.74      0.72      0.72     10000
weighted avg       0.75      0.72      0.73     10000

