# Implementing a vanilla Backpropagation algorithm

In this section, we will learn how to implement the backpropagation algorithm from scratch using Python. 

**What is Backpropagation?**
Backpropagation is the core of neural network training. It is the method of fine-tuning the weights of a neural network based on the error obtained in the previous forward pass. Properly tuned weights reduce the error rate and help the model generalize to data it has not seen.

Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks: it calculates the gradient of a loss function with respect to all the weights in the network.

The backpropagation algorithm consists of two phases:
1. The forward pass, where we pass our inputs through the network to obtain our output classifications.
2. The backward pass (i.e., the weight update phase), where we apply the chain rule to compute the gradient of the loss function with respect to each weight and then use those gradients to update the weights in our network.
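As a minimal sketch of these two phases, assume a single sigmoid layer and a squared-error loss. The toy values and the names `x`, `y`, `W`, and `loss_for` are illustrative only and are not part of the implementation in this section. The forward pass computes the output, the backward pass applies the chain rule to get the gradient, and a finite-difference estimate serves as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: one input with 3 features and 2 output units (illustrative values)
rng = np.random.default_rng(0)
x = np.array([[0.5, -1.0, 2.0]])
y = np.array([[1.0, 0.0]])
W = rng.standard_normal((3, 2))

def loss_for(weights):
    # forward pass followed by the squared-error loss
    out = sigmoid(x.dot(weights))
    return 0.5 * np.sum((out - y) ** 2)

# forward pass
out = sigmoid(x.dot(W))

# backward pass (chain rule): dL/dW = x^T . ((out - y) * out * (1 - out))
delta = (out - y) * out * (1 - out)
grad = x.T.dot(delta)

# numerical gradient by central differences, for comparison
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        num_grad[i, j] = (loss_for(W + step) - loss_for(W - step)) / (2 * eps)

# one gradient descent step with a small learning rate
alpha = 0.1
W_new = W - alpha * grad

print("gradients match:", np.allclose(grad, num_grad, atol=1e-6))
print("loss before/after:", loss_for(W), loss_for(W_new))
```

If the analytic and numerical gradients agree, the chain-rule expression is consistent with the loss, which is a useful check before scaling the same idea up to several layers.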


<img src="images/ex-backpropagation.png">

In [61]:
import numpy as np


class NeuralNetwork:

	def __init__(self, layers, alpha=0.1):
		# initialize the list of weights matrices
		self.weights = []

		# layers architecture
		self.layers = layers

		# learning rate
		self.learning_rates = alpha
        
		self.load_random_weights(layers)

	# load the initial weights randomly for all layers
	def load_random_weights(self, layers):
		# start looping from the index of the first layer but stop before we reach the last two layers
		for i in np.arange(0, len(layers) - 2):
			current_layer = layers[i]
			next_layer = layers[i + 1]

			# randomly initialize a weight matrix connecting the two layers; the + 1 on
			# each dimension adds an extra row and column to account for the bias term
			w = np.random.randn(current_layer + 1, next_layer + 1)
			# scale the weights by the square root of the number of nodes in the current layer
			self.weights.append(w / np.sqrt(layers[i]))

		# the last two layers are a special case where the input connections need a bias term but the output does not
		last_layer = layers[-1]
		before_last_layer = layers[-2]

		w = np.random.randn(before_last_layer + 1, last_layer)
		self.weights.append(w / np.sqrt(before_last_layer))

	#  this function is useful for debugging:
	def __repr__(self):
		# construct and return a string that represents the network architecture
		return "NeuralNetwork: {}".format("-".join(str(l) for l in self.layers))

	# sigmoid activation function
	def sigmoid(self, x):
		return 1.0 / (1 + np.exp(-x))

	# the derivative of the sigmoid, which we'll use during the backward pass;
	# note that x is expected to already be a sigmoid output, not the raw net input
	def sigmoid_deriv(self, x):
		return x * (1 - x)

	def update_weights(self, activations, deltas):
		for layer in np.arange(0, len(self.weights)):
			# update our weights by taking the dot product of the layer
			# activations with their respective deltas, then multiplying
			# this value by some small learning rate and adding to our
			# weight matrix -- this is where the actual "learning" takes place
			self.weights[layer] += -self.learning_rates * activations[layer].T.dot(deltas[layer])

	def back_propagation(self, activations, y):
		# the first phase of backpropagation is to compute the difference between our *prediction* (the final output
		# activation in the activations list) and the true target value
		# Since the final entry in the activations list activations contains the output of the network,
		# we can access the output prediction via activations[-1]. The value y is the target output for the input data point x
		error = activations[-1] - y
		# from here, we need to apply the chain rule and build our list of deltas;
		# the first entry in the deltas is simply the error of the output layer times the derivative
		# of our activation function for the output value
		deltas = [error * self.sigmoid_deriv(activations[-1])]
		# loop over the layers in reverse order (ignoring the last two since we already have taken them into account)
		for layer in np.arange(len(activations) - 2, 0, -1):
			# the delta for the current layer is equal to the delta of the *previous layer* dotted with the weight matrix
			# of the current layer, followed by multiplying the delta by the derivative of the nonlinear activation function
			# for the activations of the current layer
			delta = deltas[-1].dot(self.weights[layer].T)
			delta = delta * self.sigmoid_deriv(activations[layer])
			deltas.append(delta)
		# since we looped over our layers in reverse order we need to reverse the deltas
		deltas = deltas[::-1]
		return deltas

	def feed_forward(self, x):
		# this list is responsible for storing the output activations for each layer as our data point x
		# forward propagates through the network. We initialize this list with x, which is simply the input data point
		activations = [np.atleast_2d(x)]
		# FEEDFORWARD:
		# loop over the layers in the network
		for layer in np.arange(0, len(self.weights)):
			# feedforward the activation at the current layer by
			# taking the dot product between the activation and the weight matrix -- this is called the "net input"
			# to the current layer
			net = activations[layer].dot(self.weights[layer])
			# computing the "net output" is simply applying our nonlinear activation function to the net input
			out = self.sigmoid(net)
            
			# once we have the net output, add it to our list of activations
			activations.append(out)
		return activations

	def fit(self, data_points, labels, epochs=1000, display_update=100):
		# insert a column of 1’s as the last entry in the feature matrix -- this little trick allows us to treat the bias
		# as a trainable parameter within the weight matrix
		data_points = np.c_[data_points, np.ones((data_points.shape[0]))]

		# loop over the desired number of epochs
		for epoch in np.arange(0, epochs):
			# loop over each individual data point and train our network on it
			for (x, target) in zip(data_points, labels):
				activations = self.feed_forward(x)
				deltas = self.back_propagation(activations, target)
				self.update_weights(activations, deltas)

			# check to see if we should display a training update
			if epoch == 0 or (epoch + 1) % display_update == 0:
				loss = self.calculate_loss(data_points, labels)
				print("[INFO] epoch={}, loss={:.7f}".format(epoch + 1, loss))

	def predict(self, data_point, add_bias=True):
		# initialize the output prediction as the input features -- this
		# value will be (forward) propagated through the network to
		# obtain the final prediction
		p = np.atleast_2d(data_point)
		# check to see if the bias column should be added
		if add_bias:
			# insert a column of 1’s as the last entry in the feature matrix (bias)
			p = np.c_[p, np.ones((p.shape[0]))]

		# loop over our layers in the network
		for layer in np.arange(0, len(self.weights)):
			# computing the output prediction is as simple as taking
			# the dot product between the current activation value 'p'
			# and the weight matrix associated with the current layer,
			# then passing this value through a nonlinear activation
			# function
			p = self.sigmoid(np.dot(p, self.weights[layer]))

		# return the predicted value
		return p

	def calculate_loss(self, X, targets):
		# make predictions for the input data points then compute the loss
		targets = np.atleast_2d(targets)
		predictions = self.predict(X, add_bias=False)
		loss = 0.5 * np.sum((predictions - targets) ** 2)
		return loss
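
Two conventions in the class above are easy to miss, so here is a small standalone check of both. It is a sketch with made-up values, and `X`, `W`, and `b` are illustrative names. First, appending a column of ones to the inputs lets the bias live in the last row of the weight matrix. Second, `sigmoid_deriv` expects the already-activated output rather than the raw net input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(s):
    # s is assumed to be a sigmoid output, as in the class above
    return s * (1 - s)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))   # 4 samples, 3 features
W = rng.standard_normal((3, 2))   # weights without a bias row
b = rng.standard_normal((1, 2))   # explicit bias, one value per output unit

# bias trick: a column of ones on X and b stacked under W give the same net input
X_aug = np.c_[X, np.ones((X.shape[0]))]
W_aug = np.vstack([W, b])
same_net = np.allclose(X.dot(W) + b, X_aug.dot(W_aug))

# derivative check: s * (1 - s) evaluated on sigmoid(z) matches a
# central-difference estimate of d sigmoid / dz
z = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
same_deriv = np.allclose(sigmoid_deriv(sigmoid(z)), numeric, atol=1e-8)

print(same_net, same_deriv)
```

Passing the raw net input to `sigmoid_deriv` instead of the activation would give a different, incorrect value, which is why `back_propagation` feeds it entries of the `activations` list.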


## Backpropagation with Python Example: MNIST Sample

Let’s examine our neural network on a small handwritten digit dataset that is often used as a stand-in for MNIST. It is built into the scikit-learn library (strictly speaking, it is scikit-learn's copy of the UCI handwritten digits test set rather than a true subset of MNIST) and includes 1,797 example digits, each of which is an 8 × 8 grayscale image (the original MNIST images are 28 × 28). When flattened, each image is represented by an 8 × 8 = 64-dim vector.

![alt text](images/mnist-sample.png "")


In [64]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets

# load the MNIST dataset and apply min/max scaling to scale the
# pixel intensity values to the range [0, 1] (each image is
# represented by an 8 x 8 = 64-dim feature vector)
print("[INFO] loading MNIST (sample) dataset...")
digits = datasets.load_digits()
data = digits.data.astype("float")
data = (data - data.min()) / (data.max() - data.min())
print("[INFO] samples: {}, dim: {}".format(data.shape[0],data.shape[1]))

[INFO] loading MNIST (sample) dataset...
[INFO] samples: 1797, dim: 64


In [65]:
# construct the training and testing splits
(trainX, testX, trainY, testY) = train_test_split(data,digits.target, test_size=0.25)

# convert the labels from integers to vectors, fitting the binarizer on the
# training labels only and reusing it for the test labels
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# train the network
print("[INFO] training network...")
nn = NeuralNetwork([trainX.shape[1], 32, 16, 10])

print("[INFO] {}".format(nn))
nn.fit(trainX, trainY, epochs=1000)

[INFO] training network...
[INFO] NeuralNetwork: 64-32-16-10
[INFO] epoch=1, loss=605.1492535
[INFO] epoch=100, loss=6.2829649
[INFO] epoch=200, loss=2.0180309
[INFO] epoch=300, loss=1.0428557
[INFO] epoch=400, loss=0.8497941
[INFO] epoch=500, loss=0.7561554
[INFO] epoch=600, loss=0.7008680
[INFO] epoch=700, loss=0.6644783
[INFO] epoch=800, loss=0.6387617
[INFO] epoch=900, loss=0.6196323
[INFO] epoch=1000, loss=0.6048176
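
The label conversion in the cell above turns each integer digit into a 10-element vector with a single 1, which matches the 10 sigmoid outputs of the network. Below is a NumPy-only sketch of the same idea, with a made-up `labels` array. When all ten digits appear in the fitted labels, the result should match what `LabelBinarizer` produces, since its columns follow the sorted class values. The `argmax` step maps the vectors back to digits.

```python
import numpy as np

labels = np.array([3, 0, 9, 3])          # hypothetical digit labels
one_hot = np.eye(10, dtype=int)[labels]  # row i has a single 1 in column labels[i]

# mapping back from vectors to digits
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

The same `argmax` call is what converts the network's 10 output activations into a predicted digit during evaluation.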


In [66]:
print("[INFO] evaluating network...")
predictions = nn.predict(testX)
predictions = predictions.argmax(axis=1)
print(classification_report(testY.argmax(axis=1), predictions))

[INFO] evaluating network...
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        50
           1       0.98      0.96      0.97        53
           2       1.00      1.00      1.00        55
           3       0.98      0.95      0.97        44
           4       0.94      1.00      0.97        45
           5       0.93      0.98      0.96        44
           6       1.00      1.00      1.00        45
           7       1.00      1.00      1.00        44
           8       0.97      0.88      0.92        32
           9       0.95      0.97      0.96        38

    accuracy                           0.98       450
   macro avg       0.97      0.97      0.97       450
weighted avg       0.98      0.98      0.98       450



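In the report, the precision column shows, for each digit, the fraction of predictions of that digit that were correct, while the recall column shows the fraction of actual instances of that digit that the network identified. Overall, the network classifies 98% of the 450 test images correctly.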
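
To make the columns of the report concrete, here is a small sketch that derives precision, recall, F1, and accuracy from a confusion matrix using NumPy only. The labels are made up for illustration and are unrelated to the results above.

```python
import numpy as np

# hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 0])

# confusion matrix: rows are true classes, columns are predicted classes
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct predictions / all predictions of the class
recall = true_pos / cm.sum(axis=1)     # correct predictions / all true members of the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(cm)
print(precision, recall, f1, accuracy)
```

Precision is computed down the columns of the matrix and recall across the rows, so a class can score high on one and low on the other.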