# Chapter - 8 : Backpropagation

#### Backpropagation

<img src="backPropFormula.png"/>
<h6>If you are reading this, I don't want to scare you but this topic can be a bit hard to understand. Read it again and again, if you still find it difficult, mail me, I can help you.</h6>

In [69]:
# We will write code for backpropagation method
# Recall using CHAIN RULE, we find derivative of a function w.r.t the function inside it.
# We store this result, called gradient, and use it to multiply to the derivative to previous
# layer's output.

# For simplicity we are going to consider that the gradient we recieved from the
# next layer is 1, since multiplying with one wont change anything

# Let's code now

# FORWARD PASS
x = [1.0, -2.0, 3.0]    # inputs
w = [-3.0, -1.0, 2.0]   # weights
b = 1.0                 # bias

xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# output of dense layer
sum = xw0 + xw1 + xw2 + b

# ReLU function
y = max(sum, 0)

# BACKWARD PASS

# The derivative from previous layer
dvalue = 1.0

# One important thing to note here is that the derivative of ReLU function is
# 1 if the input is greater than 1 else 0
drelu_dsum = dvalue * (1.0 if sum > 0 else 0.0)
print(drelu_dsum)

# Another important thing to note is that the partial derivative
# of a sum is always 1 no matter the inputs

# for x0w0 pair
dsum_dmulxw0 = 1
drelu_dmulxw0 = drelu_dsum * dsum_dmulxw0

# for x1w1 pair
dsum_dmulxw1 = 1
drelu_dmulxw1 = drelu_dsum * dsum_dmulxw1

# for x2w2 pair
dsum_dmulxw2 = 1
drelu_dmulxw2 = drelu_dsum * dsum_dmulxw2

# for b
dsum_db = 1
drelu_db = drelu_dsum * dsum_db

print(dsum_dmulxw0, dsum_dmulxw1, dsum_dmulxw2, drelu_db)

# Continuing with the backpropagation
# One more important thing to note here is that the partial derivative of
# a product is the value with which it is being multiplied.
# For eg.
# d(x*y)/d(x) = y
# d(x*y)/d(y) = x
# Therefore, partial derivative of the first weighted-input w.r.t the input equals the weight

# with respect to x values
dmulxw0_dx0 = w[0]
drelu_dx0 = drelu_dmulxw0 * dmulxw0_dx0

dmulxw1_dx1 = w[1]
drelu_dx1 = drelu_dmulxw1 * dmulxw1_dx1

dmulxw2_dx2 = w[2]
drelu_dx2 = drelu_dmulxw2 * dmulxw2_dx2

# with respect to w values (weights)
dmulxw0_dw0 = x[0]
drelu_dw0 = drelu_dmulxw0 * dmulxw0_dw0

dmulxw1_dw1 = x[1]
drelu_dw1 = drelu_dmulxw1 * dmulxw1_dw1

dmulxw2_dw2 = x[2]
drelu_dw2 = drelu_dmulxw2 * dmulxw2_dw2
print(drelu_dx0, drelu_dx1, drelu_dx2, drelu_dw0, drelu_dw1, drelu_dw2)


1.0
1 1 1 1.0
-3.0 -1.0 2.0 1.0 -2.0 3.0


<img src="./drelu_dz_final1.png" height="300px"/>

In [70]:
# The above completes out code for backpropagation for a single neuron.
# We can optimize the code a bit

drelu_dx0 = drelu_dmulxw0 * dmulxw0_dx0
# here
dmulxw0_dx0 = w[0]
# Therefore,
drelu_dx0 = drelu_dmulxw0 * w[0]

# Now
drelu_dmulxw0 = drelu_dsum * dsum_dmulxw0

# Then
drelu_dx0 = drelu_dsum * dsum_dmulxw0 * w[0]

# Here
dsum_dmulxw0 = 1

# Therefore
drelu_dx0 = drelu_dsum * 1 * w[0]
drelu_dx0 = drelu_dsum * w[0]

# where
drelu_dsum = dvalue * (1.0 if sum > 0 else 0.0)

# Finally
drelu_dx0 = dvalue * (1.0 if sum > 0 else 0.0) * w[0]


<img src="./optimized_form.png" height="300px"/>

In [71]:
# All together, the partial derivatives above, combined into a vector, make up our gradients. Our
# gradients could be represented as:

dx = [drelu_dx0, drelu_dx1, drelu_dx2]  # gradient vector for inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2]  # gradient vector for weights
db = drelu_db 


In [72]:
# Moving forward, we will discuss briefly about optimizers.

print("Original weights and biases: ", w, b)

# We then add a negative fraction of our gradients to these weights and biases
# and try to optimize it.
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

# We are applying -ve fraction because we want to decrease the overall value
print("Weights after optimization: ", w, b)

# Now we'll observe the change that occurend due to out calculations
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

sum = xw0 + xw1 + xw2 + b

# ReLU Activation Function
y = max(sum, 0)
print("New result: ", y)

# We’ve successfully decreased this neuron’s output from 6.000 to 5.985.


Original weights and biases:  [-3.0, -1.0, 2.0] 1.0
Weights after optimization:  [-3.001, -0.998, 1.997] 0.999
New result:  5.985


In [73]:
# Since now we've considered the case of a single neuron for the sake 
# of understanding.
# Now we'll move ahead and write code for a proper network with 
# array of inputs and weights.

# Now, let’s replace the current singular neuron with a layer of neurons. As opposed to a single
# neuron, a layer outputs a vector of values instead of a singular value. Each neuron in a layer
# connects to all of the neurons in the next layer. During backpropagation, each neuron from the
# current layer will receive a vector of partial derivatives the same way that we described for a
# single neuron. With a layer of neurons, it’ll take the form of a list of these vectors, or a 2D array.
# We know that we need to perform a sum, but what should we sum and what is the result supposed
# to be? Each neuron is going to output a gradient of the partial derivatives with respect to all of its
# inputs, and all neurons will form a list of these vectors. We need to sum along the inputs — the
# first input to all of the neurons, the second input, and so on. We’ll have to sum columns.

# To calculate the partial derivatives with respect to inputs, we need the weights — the partial
# derivative with respect to the input equals the related weight. This means that the array of
# partial derivatives with respect to all of the inputs equals the array of weights. Since this array is
# transposed, we’ll need to sum its rows instead of columns. To apply the chain rule, we need to
# multiply them by the gradient from the subsequent function.

# In the code to show this, we take the transposed weights, which are the transposed array of the
# derivatives with respect to inputs, and multiply them by their respective gradients (related to
# given neurons) to apply the chain rule. Then we sum along with the inputs. Then we calculate
# the gradient for the next layer in backpropagation. The “next” layer in backpropagation is the
# previous layer in the order of creation of the model.


In [74]:
# Backward Propagation

import numpy as np

# These are the gradients passed from the next layer.
# Note: gradient is a vector which has values equal to the
# neurons in the previous layer
# Eg. We have 3 neurons in our current layer, therefore, the gradients will
# contain 3 values.
dvalues = np.array([[1.0, 1.0, 1.0]])

# 3 neurons with 4 weights
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87]
])

# We know that the derivative of inputs is the related weight.
# Therefore we transpose weights
# gradients for the current layer
inputs = weights.T
dinputs = np.dot(dvalues, inputs.T)

print("Gradients: ", dinputs)


Gradients:  [[ 0.44 -0.38 -0.07  1.37]]


In [75]:
# in case of batch-of-data
import numpy as np

# 3 batches of inputs
dvalues = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
])

# 3 neurons with 4 weights
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87]
])

inputs = weights
dinputs = np.dot(dvalues, inputs)

print("Gradients w.r.t inputs:\n ", dinputs)


Gradients w.r.t inputs:
  [[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]


In [76]:
# similary for dweights
import numpy as np

# 3 batches of inputs
dvalues = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
])

inputs = np.array([
    [1, 2, 3, 2.5],
    [2, 5, -1, 2],
    [-1.5, 2.7, 3.3, -0.8]
])

weights = inputs
dweights = np.dot(weights.T, dvalues)

print("Gradients w.r.t inputs:\n ", dweights)


Gradients w.r.t inputs:
  [[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]


In [77]:
# similarly for biases

# For the biases and derivatives with respect to them, the derivatives come from the sum operation
# and always equal 1, multiplied by the incoming gradients to apply the chain rule. Since gradients
# are a list of gradients (a vector of gradients for each neuron for all samples), we just have to sum
# them with the neurons, column-wise, along axis 0.

# in case of batch-of-data
import numpy as np

# 3 batches of inputs
dvalues = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
])

# 3 neurons with 4 weights
biases = np.array([[2, 3, 0.5]])

dinputs = np.sum(dvalues, axis=0, keepdims=True)

print("Gradients w.r.t biases:\n ", dinputs)


Gradients w.r.t biases:
  [[6. 6. 6.]]


In [78]:
# The last thing to cover here is the derivative of the ReLU function. It equals 1 if the input is
# greater than 0 and 0 otherwise. The layer passes its outputs through the ReLU() activation during
# the forward pass. For the backward pass, ReLU() receives a gradient of the same shape. The
# derivative of the ReLU function will form an array of the same shape, filled with 1 when the
# related input is greater than 0, and 0 otherwise

z = np.array([
    [1, 2, -3, -4],
    [2, -7, -1, -3],
    [-1, 2, 5, -1]
])

dvalues = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

drelu = np.zeros_like(z)
drelu[z > 0] = 1

print(drelu)

drelu *= dvalues

print(drelu)


[[1 1 0 0]
 [1 0 0 0]
 [0 1 1 0]]
[[ 1  2  0  0]
 [ 5  0  0  0]
 [ 0 10 11  0]]


In [89]:
# Full code with forward and backward propagation

import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])


inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

biases = np.array([[2, 3, 0.5]])

# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu = np.maximum(0, layer_outputs)  # ReLU activation

# optimized...
# drelu = relu_outputs.copy()
# drelu[layer_outputs <= 0] = 0

dinputs = np.dot(drelu, weights.T)      # shape => 3,4
dweights = np.dot(inputs.T, drelu)      # shape => 3,4
dbiases = np.sum(drelu, axis=0, keepdims=True)

weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights.T)
print(biases)


[[ 0.179515   0.742093  -0.510153   0.971328 ]
 [ 0.5003665 -0.9152577  0.2529017 -0.5021842]
 [-0.262746  -0.2758402  0.1629592  0.8636583]]
[[1.98489  2.997739 0.497389]]


In [83]:
print(dinputs)
print(dweights.T)
print(weights.T)

[[6. 6. 6.]]
[[ 11.   27.   -2.   12.5]
 [-13.   31.   39.   -3. ]
 [-16.5  29.7  36.3  -8.8]
 [  0.    0.    0.    0. ]]
[[ 0.2   0.8  -0.5   1.  ]
 [ 0.5  -0.91  0.26 -0.5 ]
 [-0.26 -0.27  0.17  0.87]]


<h3>Our full code upto this point</h3>

In [90]:
# Our Full Code

import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

# Dense Layer
class Layer_Dense:

    # Layer init
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

# ReLU activation function
class Activation_ReLU:

    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

# Softmax Activation Function
class Activation_Softmax:

    def forward(self, inputs):
        expo_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        norm_values = expo_values / np.sum(expo_values, axis=1, keepdims=True)
        self.output = norm_values

# Common Loss
class Loss:

    # output => model's prediction
    # y => ground truth
    def calculate(self, output, y):
        # forward method is of specific loss function eg. Cross Entropy
        sample_losses = self.forward(output, y)
        
        data_loss = np.mean(sample_losses)

        return data_loss


# Cross Entropy Loss:
class Loss_Categorical_Cross_Entropy(Loss):
    
    def forward(self, y_pred, y_true):
        
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        # check if y_true is sparse or one-hot-coded
        if len(y_true.shape) == 1:
            correct_confidence = y_pred_clipped[range(len(y_pred_clipped)), y_true]
        else:
            correct_confidence = np.sum(y_pred_clipped * y_true, axis=1)

        # Losses
        neg_log = -np.log(correct_confidence)
        return neg_log


X, y = spiral_data(samples=100, classes=3)

# Initialization
dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

loss_function = Loss_Categorical_Cross_Entropy()

# Forward pass
dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])

loss = loss_function.calculate(activation2.output, y)
print("Avg Loss: ", loss)


# Accuracy
# outputs the index from softmax_output
predictions = np.argmax(activation2.output, axis=1)

if len(y.shape) == 2:
    y = np.argmax(y, axis=2)

accuracy = np.mean(predictions == y)

# True evaluates to 1; False to 0
print("Accuracy: ", accuracy)

[[0.33333334 0.33333334 0.33333334]
 [0.3333332  0.3333332  0.33333364]
 [0.3333329  0.33333293 0.3333342 ]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
Avg Loss:  1.0986104
Accuracy:  0.34


In [None]:
# making changes
# During the forward method for our Layer_Dense class, we will want to remember what the
# inputs were (we will need them during backprop)


# Dense Layer
class Layer_Dense:

    # Layer init
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        # saving inputs
        self.inputs = inputs

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradients
        self.dinputs = np.dot(dvalues, self.weights.T)


# ReLU Activation Function
class Activation_ReLU:

    # Forward
    def forward(self, inputs):
        # saving inputs
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    # Backward
    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0
