### Intro


<u>**Scenario A: 1 neuron in current layer, 1 in next layer**</u>

So far, we have performed an example backward pass with a **single neuron**, which received a **single derivative**, `deriv_from_next_layer`, from the "next" layer and used it to apply the chain rule.

<br>

---

<u>**Scenario B: 1 neuron in current layer, multiple in next layer**</u>

Let’s consider **multiple neurons** in the next layer. A **single neuron** of the current layer connects to all of them — they all receive the output of this neuron. 

What will happen during backpropagation? 
- Each neuron from the next layer will return a partial derivative of its function with respect to this input. 
- The neuron in the current layer will receive a vector consisting of these derivatives. 
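- A single neuron needs a **single value**, not a vector.
- To continue backpropagation, we **sum** this vector: the neuron's output affects the loss through every neuron in the next layer, so by the chain rule the contributions add up.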
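
A minimal sketch of that summation. The derivative values and the name `derivs_from_next_layer` are made up for illustration and are not part of the worked example below.

```python
import numpy as np

# Hypothetical partial derivatives returned by three next-layer neurons,
# each taken with respect to the single output of the current neuron.
derivs_from_next_layer = np.array([0.5, -1.0, 2.0])

# Collapse the vector into the one value the current neuron needs.
deriv_from_next_layer = derivs_from_next_layer.sum()

print(deriv_from_next_layer)  # 1.5
```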

<br>

---

<u>**Scenario C: Multiple neurons in current layer, multiple in next layer**</u>

During backpropagation: 
- Each neuron from the current layer will receive a vector of partial derivatives the same way that we described for Scenario B. 
- With a whole layer of neurons, these vectors are collected into a list of vectors, i.e. a 2D array.
- Seen from the other side, each neuron in the next layer returns a gradient: a vector of partial derivatives with respect to all of its inputs.
- Collected across all the neurons in the next layer, these gradients again form a list of vectors.
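
One way to picture this, as a sketch with made-up numbers (the array `grads_from_next_layer` and its values are illustrative assumptions): stack the gradient vectors returned by the next layer into a 2D array, then sum each column so that every current-layer neuron ends up with a single value.

```python
import numpy as np

# Hypothetical values: 2 neurons in the next layer, 3 in the current layer.
# Row j holds the partial derivatives returned by next-layer neuron j,
# one per input (i.e. one per current-layer neuron).
grads_from_next_layer = np.array([
    [0.25, -0.5, 1.0],
    [0.75, 1.5, -2.0],
])

# Each current-layer neuron sums the column of values addressed to it.
grad_per_current_neuron = grads_from_next_layer.sum(axis=0)

print(grad_per_current_neuron)  # [ 1.  1. -1.]
```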



### Our staged scenario

We're zooming into a layer of 3 neurons with 4 inputs. 

![](https://drive.google.com/uc?id=1j6jrEYOttXaVrn9TUSnFX1pSFw9H7jD6)

Let's imagine we have already done the forward pass, are in the middle of backpropagation, and already have the gradients from the "next layer".



## Forward Pass

In [1]:
import numpy as np

### Inputs

Fabricated inputs:

![](https://drive.google.com/uc?id=1-7BfMCQX7EO_aq7ex6p-HyIzCuh2RjuZ)

In [2]:
inputs = np.array([
    [1, 2, 3, 2.5],
    [2., 5., -1., 2],
    [-1.5, 2.7, 3.3, -0.8]
])

In [3]:
inputs.shape

(3, 4)

### Weights

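- We'll keep the weights transposed: shape `[inputs, neurons]`, so each column holds one neuron's weights and `np.dot(inputs, weights)` works without a further transpose.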

![](https://drive.google.com/uc?id=1eGsR-34hXzqdRpqJQnicO0hEmHelbVHA)

In [4]:
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87]
]).T

In [5]:
weights

array([[ 0.2 ,  0.5 , -0.26],
       [ 0.8 , -0.91, -0.27],
       [-0.5 ,  0.26,  0.17],
       [ 1.  , -0.5 ,  0.87]])

In [6]:
weights.shape

(4, 3)

### Biases

One bias for each neuron:

In [7]:
biases = np.array([
    [2, 3, 0.5]
])

In [8]:
biases.shape

(1, 3)

### Forward Pass

In [9]:
from IPython.display import Image


**Inputs:**

![](https://drive.google.com/uc?id=1-7BfMCQX7EO_aq7ex6p-HyIzCuh2RjuZ)

**Weights:**

![](https://drive.google.com/uc?id=1eGsR-34hXzqdRpqJQnicO0hEmHelbVHA)

**Inputs • Weights:**

Remember: the product of a `[3, 4]` matrix and a `[4, 3]` matrix has shape `[3, 3]`.

In our case: `[people, measurements]` • `[weights, neurons]` gives `[people, neurons]`.

![](https://drive.google.com/uc?id=1U8PrVGfhGBd6LYF3kAf_CuXmFHK76Wmy)
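
A quick shape check of this rule, using placeholder arrays of ones rather than the real data (the names `n_samples`, `n_inputs` and `n_neurons` are introduced here for illustration only):

```python
import numpy as np

n_samples, n_inputs, n_neurons = 3, 4, 3

inputs = np.ones((n_samples, n_inputs))    # [people, measurements]
weights = np.ones((n_inputs, n_neurons))   # [weights, neurons]

# The inner dimensions (4 and 4) must match; the outer ones form the result.
outputs = np.dot(inputs, weights)

print(outputs.shape)  # (3, 3)
```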



In [10]:
Image(url='https://drive.google.com/uc?id=1pLNkfC-csbh1ab7zXtWNGc0FvPvZ5XFX')

In [11]:
inputs_dot_weights =  np.dot(inputs, weights)

inputs_dot_weights

array([[ 2.8  , -1.79 ,  1.885],
       [ 6.9  , -4.81 , -0.3  ],
       [-0.59 , -1.949, -0.474]])

In [12]:
biases

array([[2. , 3. , 0.5]])

In [13]:
layer_outputs = inputs_dot_weights + biases  

layer_outputs

array([[ 4.8  ,  1.21 ,  2.385],
       [ 8.9  , -1.81 ,  0.2  ],
       [ 1.41 ,  1.051,  0.026]])

In [14]:
relu_outputs = np.maximum(0, layer_outputs) 

relu_outputs

array([[4.8  , 1.21 , 2.385],
       [8.9  , 0.   , 0.2  ],
       [1.41 , 1.051, 0.026]])

![](https://drive.google.com/uc?id=1Bt-qC8ljXA-SmhutPZs5EbaM2_vNxF0j)

## Backpropagation

### `deriv_relu_wrt_z`

Let's get the ReLU activation's derivative and apply the chain rule.

As a reminder, here's how we handled it before:

```python
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)
```

Over here, we'll be doing the same, albeit in a more optimal fashion for a matrix of data.

---


> Note:
> - For this fabricated example we're using the ReLU's output itself as the "passed-in gradients" (i.e. we're minimizing this output).
> - Don't take this example too seriously.

`layer_outputs` - the input to ReLU:

In [15]:
layer_outputs

array([[ 4.8  ,  1.21 ,  2.385],
       [ 8.9  , -1.81 ,  0.2  ],
       [ 1.41 ,  1.051,  0.026]])

In [16]:
deriv_relu_wrt_z = relu_outputs.copy()

# Set cells to 0 where corresponding layer_outputs cell <= 0
deriv_relu_wrt_z[layer_outputs <= 0] = 0

deriv_relu_wrt_z

array([[4.8  , 1.21 , 2.385],
       [8.9  , 0.   , 0.2  ],
       [1.41 , 1.051, 0.026]])

![](https://drive.google.com/uc?id=1JRQ6jYS78PhbDXJ3hkzKCVCVylszwPzA)
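
As a sanity check, the sketch below compares a vectorized mask against the scalar rule from the single-neuron example applied cell by cell. The names `z`, `upstream`, `vectorized` and `looped` are introduced here for illustration; `upstream` stands in for the gradient passed back from the next layer, which in this fabricated example is the ReLU output itself.

```python
import numpy as np

# Same values as `layer_outputs` above.
z = np.array([
    [4.8, 1.21, 2.385],
    [8.9, -1.81, 0.2],
    [1.41, 1.051, 0.026],
])
upstream = np.maximum(0, z)

# Vectorized: multiply by a boolean mask that is True (1) wherever z > 0.
vectorized = upstream * (z > 0)

# Scalar rule from the single-neuron example, applied cell by cell.
looped = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        looped[i, j] = upstream[i, j] * (1. if z[i, j] > 0 else 0.)

print(np.array_equal(vectorized, looped))  # True
```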

### `deriv_relu_wrt_inputs`



As a reminder, here's how we handled it before (with respect to input $x_0$):

```python
deriv_relu_wrt_x0 = deriv_relu_wrt_z * w[0]
```

Over here, we'll be doing the same, albeit in a more optimal fashion for a matrix of data.

---

![](https://drive.google.com/uc?id=1eGsR-34hXzqdRpqJQnicO0hEmHelbVHA)

![](https://drive.google.com/uc?id=1vUmbQIhcsXF__sgkbMwUhojkyQht_Ur7)

![](https://drive.google.com/uc?id=1JRQ6jYS78PhbDXJ3hkzKCVCVylszwPzA)

![](https://drive.google.com/uc?id=1gJi4JiBcIwat1yoaAAjJf16K1f97S0gv)

In [31]:
Image(url='https://drive.google.com/uc?id=1Hj3U-Vu1ziTe2o821HZ2BkzU_Sa9_9z2')

In [17]:
weights.T

array([[ 0.2 ,  0.8 , -0.5 ,  1.  ],
       [ 0.5 , -0.91,  0.26, -0.5 ],
       [-0.26, -0.27,  0.17,  0.87]])

In [18]:
deriv_relu_wrt_inputs = np.dot(deriv_relu_wrt_z, weights.T)

deriv_relu_wrt_inputs

array([[ 0.9449 ,  2.09495, -1.67995,  6.26995],
       [ 1.728  ,  7.066  , -4.416  ,  9.074  ],
       [ 0.80074,  0.16457, -0.42732,  0.90712]])
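
To see why a single matrix product is enough, the sketch below rebuilds the same result with explicit loops: for every sample and every input, it multiplies the incoming gradients by that input's weights and sums over the neurons (the Scenario B summation). The names `matrix_version` and `loop_version` are introduced here for illustration.

```python
import numpy as np

deriv_relu_wrt_z = np.array([
    [4.8, 1.21, 2.385],
    [8.9, 0., 0.2],
    [1.41, 1.051, 0.026],
])
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87],
]).T  # shape (4, 3): [inputs, neurons]

matrix_version = np.dot(deriv_relu_wrt_z, weights.T)

loop_version = np.zeros((3, 4))
for sample in range(3):
    for input_idx in range(4):
        # weights[input_idx] holds this input's weight in each of the 3 neurons.
        loop_version[sample, input_idx] = np.sum(
            deriv_relu_wrt_z[sample] * weights[input_idx]
        )

print(np.allclose(matrix_version, loop_version))  # True
```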

### `deriv_relu_wrt_weights`

As a reminder, here's how we handled it before (with respect to $w_0$):

```python
deriv_relu_wrt_w0 = deriv_relu_wrt_z * x[0]
```

Over here, we'll be doing the same, albeit in a more optimal fashion for a matrix of data.

---

![](https://drive.google.com/uc?id=1-7BfMCQX7EO_aq7ex6p-HyIzCuh2RjuZ)

![](https://drive.google.com/uc?id=1znI1wls2OEnRFTaCcvSjIU2dOBEFBhY3)

![](https://drive.google.com/uc?id=1JRQ6jYS78PhbDXJ3hkzKCVCVylszwPzA)

![](https://drive.google.com/uc?id=1EpFdlX1P4RIlrzixdRX79e57dGPVI-YP)

In [32]:
Image(url='https://drive.google.com/uc?id=1YZsYDwyyTWoS2IIoMMA3W8Jw5zkZU1Fq')

In [20]:
inputs.T

array([[ 1. ,  2. , -1.5],
       [ 2. ,  5. ,  2.7],
       [ 3. , -1. ,  3.3],
       [ 2.5,  2. , -0.8]])

In [22]:
deriv_relu_wrt_weights = np.dot(inputs.T, deriv_relu_wrt_z)

deriv_relu_wrt_weights

array([[20.485 , -0.3665,  2.746 ],
       [57.907 ,  5.2577,  5.8402],
       [10.153 ,  7.0983,  7.0408],
       [28.672 ,  2.1842,  6.3417]])
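
Another way to read this product, sketched below: each sample contributes the outer product of its inputs and its incoming gradients, and `np.dot(inputs.T, deriv_relu_wrt_z)` adds those contributions over all samples. The names `matrix_version` and `outer_sum` are introduced here for illustration.

```python
import numpy as np

inputs = np.array([
    [1, 2, 3, 2.5],
    [2., 5., -1., 2],
    [-1.5, 2.7, 3.3, -0.8],
])
deriv_relu_wrt_z = np.array([
    [4.8, 1.21, 2.385],
    [8.9, 0., 0.2],
    [1.41, 1.051, 0.026],
])

matrix_version = np.dot(inputs.T, deriv_relu_wrt_z)

# Per-sample version: x[k] * dz[j] for every (input k, neuron j) pair,
# accumulated across the 3 samples.
outer_sum = np.zeros((4, 3))
for sample in range(inputs.shape[0]):
    outer_sum += np.outer(inputs[sample], deriv_relu_wrt_z[sample])

print(np.allclose(matrix_version, outer_sum))  # True
```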

### `deriv_relu_wrt_biases`

Each bias enters its neuron through a sum, so the derivative of the neuron's weighted sum with respect to its bias is always 1. Applying the chain rule, the bias gradient is just the incoming gradient multiplied by 1.

The incoming gradients form a 2D array with one row per sample and one column per neuron, so we sum them column-wise (along axis 0) to get one value per neuron.

In [23]:
deriv_relu_wrt_z

array([[4.8  , 1.21 , 2.385],
       [8.9  , 0.   , 0.2  ],
       [1.41 , 1.051, 0.026]])

In [24]:
deriv_relu_wrt_biases = np.sum(
    deriv_relu_wrt_z, 
    axis=0, 
    keepdims=True
)

deriv_relu_wrt_biases

array([[15.11 ,  2.261,  2.611]])
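
One hedged way to double-check this gradient is a finite-difference test. The sketch assumes the scalar objective is `0.5 * sum(relu_outputs ** 2)`, which is the quantity whose gradient with respect to the ReLU output equals the ReLU output itself (matching the fabricated "passed-in gradients" used here). The helper `loss` and the names `analytic`, `numeric` and `eps` are introduced for illustration.

```python
import numpy as np

inputs = np.array([
    [1, 2, 3, 2.5],
    [2., 5., -1., 2],
    [-1.5, 2.7, 3.3, -0.8],
])
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87],
]).T
biases = np.array([[2, 3, 0.5]])

def loss(b):
    # Assumed objective: 0.5 * sum of squared ReLU outputs.
    relu_out = np.maximum(0, np.dot(inputs, weights) + b)
    return 0.5 * np.sum(relu_out ** 2)

# Analytic gradient, computed the same way as in the cells above.
z = np.dot(inputs, weights) + biases
deriv_relu_wrt_z = np.maximum(0, z) * (z > 0)
analytic = np.sum(deriv_relu_wrt_z, axis=0, keepdims=True)

# Central finite differences, nudging one bias at a time.
eps = 1e-6
numeric = np.zeros_like(biases)
for j in range(biases.shape[1]):
    step = np.zeros_like(biases)
    step[0, j] = eps
    numeric[0, j] = (loss(biases + step) - loss(biases - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```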

## Update parameters

### Update Weights

**Weights before update:**

![](https://drive.google.com/uc?id=1eGsR-34hXzqdRpqJQnicO0hEmHelbVHA)

![](https://drive.google.com/uc?id=1EpFdlX1P4RIlrzixdRX79e57dGPVI-YP)

In [25]:
weights

array([[ 0.2 ,  0.5 , -0.26],
       [ 0.8 , -0.91, -0.27],
       [-0.5 ,  0.26,  0.17],
       [ 1.  , -0.5 ,  0.87]])

In [26]:
deriv_relu_wrt_weights

array([[20.485 , -0.3665,  2.746 ],
       [57.907 ,  5.2577,  5.8402],
       [10.153 ,  7.0983,  7.0408],
       [28.672 ,  2.1842,  6.3417]])

In [27]:
weights += -0.001 * deriv_relu_wrt_weights

print(f"new weights\n{weights}")

new weights
[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]


### Update Biases

**Biases before update:**

In [28]:
biases

array([[2. , 3. , 0.5]])

**Partial derivatives of ReLU with respect to the biases:**

In [29]:
deriv_relu_wrt_biases

array([[15.11 ,  2.261,  2.611]])

In [30]:
biases += -0.001 * deriv_relu_wrt_biases

print(f"new biases\n{biases}")

new biases
[[1.98489  2.997739 0.497389]]
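
To confirm that the update moves in a useful direction, the sketch below repeats the whole step and compares the objective before and after. It assumes the objective is `0.5 * sum(relu_outputs ** 2)`, whose gradient with respect to the ReLU output is the ReLU output itself; the helpers `forward` and `loss` and the name `learning_rate` are introduced here for illustration.

```python
import numpy as np

inputs = np.array([
    [1, 2, 3, 2.5],
    [2., 5., -1., 2],
    [-1.5, 2.7, 3.3, -0.8],
])
weights = np.array([
    [0.2, 0.8, -0.5, 1],
    [0.5, -0.91, 0.26, -0.5],
    [-0.26, -0.27, 0.17, 0.87],
]).T
biases = np.array([[2, 3, 0.5]])
learning_rate = 0.001  # same step size as in the cells above

def forward(w, b):
    return np.maximum(0, np.dot(inputs, w) + b)

def loss(w, b):
    # Assumed objective: 0.5 * sum of squared ReLU outputs.
    return 0.5 * np.sum(forward(w, b) ** 2)

# Backward pass, as in the cells above.
z = np.dot(inputs, weights) + biases
deriv_relu_wrt_z = forward(weights, biases) * (z > 0)
deriv_relu_wrt_weights = np.dot(inputs.T, deriv_relu_wrt_z)
deriv_relu_wrt_biases = np.sum(deriv_relu_wrt_z, axis=0, keepdims=True)

# Gradient-descent step (written out-of-place to keep the old values).
new_weights = weights - learning_rate * deriv_relu_wrt_weights
new_biases = biases - learning_rate * deriv_relu_wrt_biases

loss_before = loss(weights, biases)
loss_after = loss(new_weights, new_biases)
print(loss_after < loss_before)  # True
```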


## Conclusion

By this point, we’ve covered everything we need to perform backpropagation, except for the derivative of the Softmax activation function and the derivative of the cross-entropy loss function.




