### Intro


<u>**Scenario A: 1 neuron in current layer, 1 in next layer**</u>

So far, we have performed an example backward pass with a **single neuron**, which received a **single derivative**, `deriv_from_next_layer`, from the "next" layer to apply the chain rule.

<br>

---

<u>**Scenario B: 1 neuron in current layer, multiple in next layer**</u>

Let’s consider **multiple neurons** in the next layer. A **single neuron** of the current layer connects to all of them, so they all receive this neuron's output.

What will happen during backpropagation? 
- Each neuron in the next layer returns the partial derivative of its function with respect to this input.
- The neuron in the current layer receives a vector of these derivatives.
- A single neuron needs this to be a **single value**.
- To continue backpropagation, we **sum** this vector.
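
The summing step can be sketched in a few lines. The derivative values and the names `derivs_from_next_layer` and `deriv_for_neuron` below are made up for illustration; they assume one neuron feeding three neurons in the next layer.

```python
import numpy as np

# Hypothetical values: each entry is the partial derivative that one
# next-layer neuron returns with respect to our single neuron's output
derivs_from_next_layer = np.array([0.5, -1.0, 2.0])

# Collapse the vector into the single value the neuron needs
deriv_for_neuron = np.sum(derivs_from_next_layer)

print(deriv_for_neuron)  # 1.5
```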

<br>

---

<u>**Scenario C: Multiple neurons in current layer, multiple in next layer**</u>

During backpropagation: 
- Each neuron from the current layer will receive a vector of partial derivatives the same way that we described for Scenario B. 
- With a layer of neurons, these vectors form a list, or a 2D array.
- Each neuron in the next layer outputs a gradient: the vector of partial derivatives with respect to all of its inputs.
- Across all the neurons in the next layer, these gradients form a list of vectors.
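
As a rough sketch of this layout, assume 3 neurons in the next layer and 4 in the current layer. The numbers and the name `grads_from_next_layer` are illustrative only. Each row is the gradient from one next-layer neuron, and summing down the columns gives one value per current-layer neuron.

```python
import numpy as np

# Hypothetical gradients: one row per next-layer neuron,
# one column per current-layer neuron (that neuron's input)
grads_from_next_layer = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 1.0, 1.0, 1.0],
    [-0.5, 0.0, 0.5, 1.0],
])

# Sum over the next-layer neurons (axis 0) so that each
# current-layer neuron ends up with a single value
grad_per_current_neuron = grads_from_next_layer.sum(axis=0)

print(grad_per_current_neuron.shape)  # (4,)
```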



### Our staged scenario

We're zooming in on a layer of 3 neurons with 4 inputs. Let's imagine we have already done the forward pass, are in the middle of backpropagation, and already have the gradients from the "next" layer/sublayer.



In [None]:
import numpy as np

#### Current layer and gradient from next

In [None]:
# -- ------------------------------
# Passed-in gradient from the next layer/sublayer
dvalues = np.array([[1., 1., 1.]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([
    [0.2, 0.8, -0.5, 1],       # weights for neuron 1
    [0.5, -0.91, 0.26, -0.5],  # weights for neuron 2
    [-0.26, -0.27, 0.17, 0.87] # weights for neuron 3
]).T

In [None]:
dvalues

array([[1., 1., 1.]])

In [None]:
dvalues.shape

(1, 3)

<u>**Remember:**</u> `weights` has been transposed:
- Before, each row ("record") represented a neuron and its 4 weights
- Now each row represents one of the 4 input attributes and its 3 weights (one for each neuron)

In [None]:
weights

array([[ 0.2 ,  0.5 , -0.26],
       [ 0.8 , -0.91, -0.27],
       [-0.5 ,  0.26,  0.17],
       [ 1.  , -0.5 ,  0.87]])

In [None]:
weights.shape

(4, 3)

#### Get gradient

Remember from the section _**Sublayer - weights and inputs**_ above that the partial derivative **with respect to an input** equals the related weight.

> Note: `dinputs` is a gradient of the neuron function with respect to inputs.
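
One way to convince yourself of this is a numerical check. The sketch below assumes a single hypothetical neuron with made-up inputs, weights, and bias, and compares a central finite-difference estimate of the gradient against the weights themselves.

```python
import numpy as np

# Hypothetical single neuron: 4 inputs, 4 weights, 1 bias
x = np.array([1.0, 2.0, 3.0, 2.5])
w = np.array([0.2, 0.8, -0.5, 1.0])
b = 2.0

def neuron(inputs):
    # Weighted sum of the inputs plus the bias
    return np.dot(inputs, w) + b

# Central finite difference, nudging one input at a time
eps = 1e-6
numeric_grad = np.zeros_like(x)
for i in range(len(x)):
    step = np.zeros_like(x)
    step[i] = eps
    numeric_grad[i] = (neuron(x + step) - neuron(x - step)) / (2 * eps)

# The numerical gradient with respect to the inputs matches the weights
print(np.allclose(numeric_grad, w))  # True
```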

In [None]:
# Sum weights related to the given input multiplied by
# the gradient related to the given neuron
dx0 = sum([
    weights[0][0]*dvalues[0][0], 
    weights[0][1]*dvalues[0][1],
    weights[0][2]*dvalues[0][2]
])

dx1 = sum([
    weights[1][0]*dvalues[0][0], 
    weights[1][1]*dvalues[0][1],
    weights[1][2]*dvalues[0][2]
])

dx2 = sum([
    weights[2][0]*dvalues[0][0], 
    weights[2][1]*dvalues[0][1],
    weights[2][2]*dvalues[0][2]
])

dx3 = sum([
    weights[3][0]*dvalues[0][0], 
    weights[3][1]*dvalues[0][1],
    weights[3][2]*dvalues[0][2]
])

dinputs = np.array([dx0, dx1, dx2, dx3])

print(dinputs)

[ 0.44 -0.38 -0.07  1.37]


#### Using dot product

We can achieve the same result with `np.dot`:

In [None]:
# sum weights of given input
# and multiply by the passed-in gradient for this neuron
dinputs = np.dot(dvalues[0], weights.T)

print(dinputs)

[ 0.44 -0.38 -0.07  1.37]
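
As a quick sanity check, the manual sums and the dot product can be compared directly. This is a minimal sketch that rebuilds the same `dvalues` and `weights` and writes the manual version as a loop; the names `manual` and `vectorized` are introduced here only for the comparison.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# Manual version: for each of the 4 inputs, sum weight * gradient
# over the 3 neurons
manual = np.array([
    sum(weights[i][j] * dvalues[0][j] for j in range(3))
    for i in range(4)
])

# Vectorized version
vectorized = np.dot(dvalues[0], weights.T)

print(np.allclose(manual, vectorized))  # True
```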


#### Using batch of inputs

With a batch of samples, the "next" layer will return a list of gradients, one row per sample.



##### wrt inputs

Our code just needs a minor tweak: `dvalues` instead of `dvalues[0]`

In [None]:
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# SAME AS BEFORE
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# sum weights of given input
# and multiply by the passed-in gradient for this neuron
dinputs = np.dot(dvalues, weights.T)

print(dinputs)

[[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]
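
To see why dropping the `[0]` is enough, it can help to look at the shapes. The sketch below assumes the same `dvalues` and `weights` and checks that each row of the batched result equals the single-sample computation for that row.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

dinputs = np.dot(dvalues, weights.T)

# (samples, neurons) dot (neurons, inputs) -> (samples, inputs)
print(dvalues.shape, weights.T.shape, dinputs.shape)  # (3, 3) (3, 4) (3, 4)

# Each row of the batch result matches the per-sample dot product
for i, sample_gradient in enumerate(dvalues):
    assert np.allclose(dinputs[i], np.dot(sample_gradient, weights.T))
```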


##### wrt weights


In [None]:
# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

In [None]:
# Let's remember what dvalues looks like:
dvalues

array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.]])

In [None]:
# inputs transposed: each row now holds one input feature across all 3 samples
# (imagine row one holding the eye color of each of the 3 people in the batch)
inputs.T

array([[ 1. ,  2. , -1.5],
       [ 2. ,  5. ,  2.7],
       [ 3. , -1. ,  3.3],
       [ 2.5,  2. , -0.8]])

In [None]:
# sum inputs for given weight
# and multiply by the passed-in gradient for this neuron
dweights = np.dot(inputs.T, dvalues)

print(dweights)

[[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]
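
The dot product above can also be read as a sum over samples: each sample contributes the outer product of its inputs and its gradient row. The loop below is a sketch of that reading, with `dweights_loop` as a name introduced here, and should agree with `np.dot(inputs.T, dvalues)`.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# Accumulate one (inputs x neurons) contribution per sample
dweights_loop = np.zeros((inputs.shape[1], dvalues.shape[1]))
for sample_inputs, sample_gradient in zip(inputs, dvalues):
    dweights_loop += np.outer(sample_inputs, sample_gradient)

dweights = np.dot(inputs.T, dvalues)

print(np.allclose(dweights_loop, dweights))  # True
print(dweights.shape)  # (4, 3)
```

The result has the same shape as `weights`, which is what lets it be used directly in a parameter update.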


##### wrt biases

For the biases, the derivative comes from the sum operation and always equals 1. Applying the chain rule, we multiply it by the incoming gradients, which leaves those gradients unchanged.

Since the incoming gradients are a list (one row of per-neuron gradients for each sample), we just have to sum them per neuron: column-wise, along axis 0.

In [None]:
# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]])

dbiases = np.sum(dvalues, axis=0, keepdims=True)

print(dbiases)

[[6. 6. 6.]]
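
The `keepdims=True` argument only affects the shape of the result. A small sketch, assuming the same `dvalues`, shows the difference; keeping the `(1, neurons)` shape means `dbiases` lines up with `biases`.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# Sum over samples, keeping the row-vector shape of the biases
with_keepdims = np.sum(dvalues, axis=0, keepdims=True)
# Same values, but the summed axis is dropped
without_keepdims = np.sum(dvalues, axis=0)

print(with_keepdims.shape)     # (1, 3)
print(without_keepdims.shape)  # (3,)
```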


##### Sublayer ReLU

Let's now move on to the next sublayer, the ReLU activation.

In [None]:
# Example layer output (before ReLU)
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

# Gradient from "next" layer
dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])



In [None]:
# ReLU activation's derivative
drelu = np.zeros_like(z)
# Set cells to 1 where corresponding z cell > 0
drelu[z > 0] = 1

print(drelu)

[[1 1 0 0]
 [1 0 0 1]
 [0 1 1 0]]


In [None]:
# The chain rule
drelu *= dvalues

print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


###### Simplified

In [None]:
drelu = dvalues.copy()
drelu[z <= 0] = 0

print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]
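
Both versions compute the same thing: the incoming gradient where `z > 0`, and zero elsewhere. The sketch below compares them with a third spelling using `np.where`, offered here only as an alternative and not as part of the original walkthrough.

```python
import numpy as np

z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])
dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# Version 1: build the 0/1 derivative mask, then apply the chain rule
drelu_mask = np.zeros_like(z)
drelu_mask[z > 0] = 1
drelu_mask *= dvalues

# Version 2: copy the gradient and zero it where z <= 0
drelu_copy = dvalues.copy()
drelu_copy[z <= 0] = 0

# Alternative spelling: pick the gradient or 0 element-wise
drelu_where = np.where(z > 0, dvalues, 0)

print(np.array_equal(drelu_mask, drelu_copy))   # True
print(np.array_equal(drelu_copy, drelu_where))  # True
```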


#### Full forward and backward pass

In [None]:
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

In [None]:
# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

In [None]:
# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T


# One bias for each neuron
biases = np.array([[2, 3, 0.5]])

In [None]:
# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu_outputs = np.maximum(0, layer_outputs)  # ReLU activation

In [None]:
# Backpropagation starts here

# ReLU activation - simulates the derivative with respect to input values
# from the next layer, passed to the current layer during backpropagation
# (relu_outputs stands in for that upstream gradient in this example)
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0

In [None]:
# dinputs - multiply by weights
dinputs = np.dot(drelu, weights.T)

In [None]:
# dweights - multiply by inputs
dweights = np.dot(inputs.T, drelu)

In [None]:
# dbiases - sum values, do this over samples (first axis)
dbiases = np.sum(drelu, axis=0, keepdims=True)

In [None]:
# Update parameters
weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights)
print(biases)

[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]
[[1.98489  2.997739 0.497389]]
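
The pass above uses `relu_outputs` as a stand-in for the gradient arriving from the next layer. As a variation, the sketch below assumes that the incremental `dvalues` array is the upstream gradient instead. It is an illustration under that assumption, not the run that produced the numbers above, and it stops before the parameter update.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])

# Assumption: this is the gradient handed back by the next layer
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu_outputs = np.maximum(0, layer_outputs)       # ReLU activation

# Backward pass through ReLU: zero the gradient where the
# pre-activation output was not positive
drelu = dvalues.copy()
drelu[layer_outputs <= 0] = 0

# Backward pass through the dense layer
dinputs = np.dot(drelu, weights.T)
dweights = np.dot(inputs.T, drelu)
dbiases = np.sum(drelu, axis=0, keepdims=True)

print(dbiases)  # [[6. 4. 6.]]
```

Only one pre-activation value is negative here (second sample, second neuron), so only that entry of the gradient is zeroed, which is why the middle bias gradient is 4 and not 6.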


## Conclusion

By this point, we’ve covered everything we need to perform backpropagation, except for the derivative of the Softmax activation function and the derivative of the cross-entropy loss function.




