## Single Neuron "Network"

Before applying this to a complete neural network, let’s start with a simplified forward pass with just one neuron. Rather than backpropagating from the loss function for a full neural network, let’s backpropagate the ReLU function for a single neuron and act as if we intend to **minimize the output for this single neuron**. 

This example is not something we would do in the real world, where we minimize the loss rather than a neuron's output; it is purely for learning purposes.

![](https://drive.google.com/uc?id=1_lq5wWBDiqhtXbPaOTfvdCrZY0o7qJQz)

In [4]:
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)


In [5]:
print(f'xw0 :{xw0}')
print(f'xw1 :{xw1}')
print(f'xw2 :{xw2}')
print(f'z   :{z}')
print(f'y   :{y}')

xw0 :-3.0
xw1 :2.0
xw2 :6.0
z   :6.0
y   :6.0


### Our Big Function

![](https://drive.google.com/uc?id=1JpU0VqoiiRBLoDyrV7GE1JSED14cwPxo)

<br>

Let’s rewrite our equation to the form that will allow us to determine how to calculate the derivatives more easily:

![](https://drive.google.com/uc?id=17XPz2uAdSPCmTwHn6D0b5I6BDCHZudmP)

<br>

... in pseudo-code:

```
ReLU(
    sum(
        mul(x0, w0), 
        mul(x1, w1), 
        mul(x2, w2), 
        b
    )
)
```

### Partial derivative of x0

Let's start by considering what we need to calculate for the partial derivative with respect to $\large x_0$:

![](https://drive.google.com/uc?id=1JdTSOrQXda3c6a6LOoFHZRUB2HnFP4X5)


> For legibility, we did not denote the $\large ReLU()$ parameter (which is the full sum), nor the $\large sum()$ parameters (which are all of the multiplications of inputs and weights). We excluded this because the equation would be longer and harder to read. 

This equation shows that we have to calculate the derivatives and partial derivatives of all of the atomic operations and multiply them to acquire the impact that $\large x_0$ makes on the output. 
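Written out in LaTeX, with the function arguments abbreviated as in the note above, the chain for $x_0$ can be sketched as follows. This is only a restatement of the idea; each factor is worked out in the subsections that follow. For the forward-pass values above ($z = 6 > 0$, $w_0 = -3$), the three factors are $1$, $1$ and $w_0$.

$$
\frac{\partial}{\partial x_0}\,ReLU(sum(mul(x_0, w_0),\ mul(x_1, w_1),\ mul(x_2, w_2),\ b))
= \frac{d\,ReLU()}{d\,sum()} \cdot \frac{\partial\,sum()}{\partial\,mul(x_0, w_0)} \cdot \frac{\partial\,mul(x_0, w_0)}{\partial x_0}
= 1 \cdot 1 \cdot w_0 = -3
$$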




#### Gradient from next layer

We’ll have multiple chained layers of neurons in the neural network model, followed by the loss function. 

We'll want to know the impact of a given **weight or bias** on the loss. 

The derivative **with respect to the layer’s inputs**, as opposed to the derivative **with respect to the weights and biases**, is not used to update any parameters. Instead, it is used to **chain** to another layer (which is why we backpropagate to the previous layer in a chain).

---


For this example, let’s assume that our neuron receives a gradient of $1$ from the **next layer**. We’re making up this value for demonstration purposes; multiplying by 1 leaves every other derivative unchanged, which makes each step of the process easier to follow.

We are going to use the color of red for derivatives.

![](https://drive.google.com/uc?id=1kdRsvEPwknm-mTohSuPzUrv2S6MCYml-)

#### Sublayer - ReLU

Recall that the derivative of ReLU() **with respect to its input** is 1, if the input is greater than 0, and 0 otherwise.

The input value to the ReLU function is 6, so the derivative equals 1. 

We have to use the chain rule and multiply this derivative with the derivative received from the next layer (which we made up to be 1).

<br>

![](https://drive.google.com/uc?id=1ZBCFRJbpnniz3uZO5vEFC0C_xEPs7LXk)

This results in a derivative of 1:

![](https://drive.google.com/uc?id=1fyBKyZ3Bz-nwGfbgRvbZGFJzojt4fyok)

> Note: `w.r.t` stands for "with respect to"

In [6]:
# -- ---------------------------------------
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
# -- ---------------------------------------


# -- ---------------------------------------
# Backward pass

# The derivative from the next layer
deriv_from_next_layer = 1.0

# Derivative of ReLU and the chain rule
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)
print(deriv_relu_wrt_z)
# -- ---------------------------------------



1.0


#### Sublayer - Sum

The partial derivative of the simple sum operation (i.e. $f(x, y, z) = x + y + z$) is always 1, no matter the inputs:

![](https://drive.google.com/uc?id=1CK16QPZ9QmMF-GQzC44ApNYF2Nm--s7m)


In [7]:
# -- ---------------------------------------
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
# -- ---------------------------------------



# -- ---------------------------------------
# Backward pass

# The derivative from the next layer
deriv_from_next_layer = 1.0

# Derivative of ReLU and the chain rule
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

# -- ----------------------
# Partial derivatives of the sum with chain rule
deriv_sum_wrt_xw0 = 1
deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0

# -- ----------------------

print(f'deriv_relu_wrt_z  : {deriv_relu_wrt_z}')
print(f'deriv_relu_wrt_xw0: {deriv_relu_wrt_xw0}')
# -- ---------------------------------------


deriv_relu_wrt_z  : 1.0
deriv_relu_wrt_xw0: 1.0



![](https://drive.google.com/uc?id=1bLDvheKGVXgleqzgrIBx_tCfC-zoBxkF)

### For all weighted inputs and bias



#### Similarity of the partial derivatives


Note how all the partial derivatives of this "Network" are similar. Due to the chain rule they end up sharing a lot of the same parts.

![](https://drive.google.com/uc?id=1_XloSiT5xFdoJMLBFXjJvC8wNXbXHEK_)

> For legibility, we did not denote the $\large ReLU()$ parameter (which is the full sum), nor the $\large sum()$ parameters (which are all of the multiplications of inputs and weights). We excluded this because the equation would be longer and harder to read. 

Here's the complete one for $\large x_0$:

<br>

$\huge \frac{∂}{∂ x_0} = $ 

```python
# The derivative from the next layer
deriv_from_next_layer = 1.0

# Derivative of ReLU and the chain rule
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

# Partial derivatives of the sum, the chain rule
deriv_sum_wrt_xw0 = 1
deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0

# Partial derivatives of the multiplication, the chain rule
deriv_mul_wrt_x0 = w[0]
deriv_relu_wrt_x0 = deriv_relu_wrt_xw0 * deriv_mul_wrt_x0
```

We can now perform the same operation for the remaining weighted inputs and the bias.

![](https://drive.google.com/uc?id=13bXiryB5dBMpa9vsAJIDVBJEHOsSO1UM)

In [8]:

# -- ---------------------------------------
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
# -- ---------------------------------------


# -- ---------------------------------------
# Backward pass

# -- ----------------------
# The derivative from the next layer
deriv_from_next_layer = 1.0
# -- ----------------------

# -- ----------------------
# Derivative of ReLU and the chain rule
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)
# -- ----------------------

# -- ----------------------
# Partial derivatives of the sum, the chain rule
deriv_sum_wrt_xw0 = 1
deriv_sum_wrt_xw1 = 1
deriv_sum_wrt_xw2 = 1
deriv_sum_wrt_b   = 1

deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0
deriv_relu_wrt_xw1 = deriv_relu_wrt_z * deriv_sum_wrt_xw1
deriv_relu_wrt_xw2 = deriv_relu_wrt_z * deriv_sum_wrt_xw2
deriv_relu_wrt_b   = deriv_relu_wrt_z * deriv_sum_wrt_b
# -- ----------------------

print(f'deriv_relu_wrt_z  : {deriv_relu_wrt_z}')
print(f'deriv_relu_wrt_xw0: {deriv_relu_wrt_xw0}')
print(f'deriv_relu_wrt_xw1: {deriv_relu_wrt_xw1}')
print(f'deriv_relu_wrt_xw2: {deriv_relu_wrt_xw2}')
print(f'deriv_relu_wrt_b  : {deriv_relu_wrt_b}')

# -- ---------------------------------------


deriv_relu_wrt_z  : 1.0
deriv_relu_wrt_xw0: 1.0
deriv_relu_wrt_xw1: 1.0
deriv_relu_wrt_xw2: 1.0
deriv_relu_wrt_b  : 1.0


### Sublayer - weights and inputs

Continuing backward, the next function is the multiplication of weights and inputs. 

The partial derivative of a product with respect to one of its factors is the other factor, i.e. whatever that input is being multiplied by.

Recall:

![](https://drive.google.com/uc?id=1IAQS4eIH242ZSOLRVK3929RMBosiCx7p)

<br>

---

![](https://drive.google.com/uc?id=19GdcFSnltR2iSBV2nCL7EMv_YLELhESa)



In [10]:

# -- ---------------------------------------
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
# -- ---------------------------------------


# -- ---------------------------------------
# Backward pass

# -- ----------------------
# The derivative from the next layer
deriv_from_next_layer = 1.0
# -- ----------------------

# -- ----------------------
# Derivative of ReLU and the chain rule
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)
# -- ----------------------

# -- ----------------------
# Partial derivatives of the sum, the chain rule
deriv_sum_wrt_xw0 = 1
deriv_sum_wrt_xw1 = 1
deriv_sum_wrt_xw2 = 1
deriv_sum_wrt_b   = 1

deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0
deriv_relu_wrt_xw1 = deriv_relu_wrt_z * deriv_sum_wrt_xw1
deriv_relu_wrt_xw2 = deriv_relu_wrt_z * deriv_sum_wrt_xw2
deriv_relu_wrt_b   = deriv_relu_wrt_z * deriv_sum_wrt_b
# -- ----------------------


# -- ----------------------
# Partial derivatives of the multiplication, the chain rule
deriv_mul_wrt_x0 = w[0]
deriv_mul_wrt_x1 = w[1]
deriv_mul_wrt_x2 = w[2]

deriv_mul_wrt_w0 = x[0]
deriv_mul_wrt_w1 = x[1]
deriv_mul_wrt_w2 = x[2]

deriv_relu_wrt_x0 = deriv_relu_wrt_xw0 * deriv_mul_wrt_x0
deriv_relu_wrt_w0 = deriv_relu_wrt_xw0 * deriv_mul_wrt_w0

deriv_relu_wrt_x1 = deriv_relu_wrt_xw1 * deriv_mul_wrt_x1
deriv_relu_wrt_w1 = deriv_relu_wrt_xw1 * deriv_mul_wrt_w1

deriv_relu_wrt_x2 = deriv_relu_wrt_xw2 * deriv_mul_wrt_x2
deriv_relu_wrt_w2 = deriv_relu_wrt_xw2 * deriv_mul_wrt_w2
# -- ----------------------


print(f'deriv_relu_wrt_z  : {deriv_relu_wrt_z}')
print(f'deriv_relu_wrt_xw0: {deriv_relu_wrt_xw0}')
print(f'deriv_relu_wrt_xw1: {deriv_relu_wrt_xw1}')
print(f'deriv_relu_wrt_xw2: {deriv_relu_wrt_xw2}')
print(f'deriv_relu_wrt_b  : {deriv_relu_wrt_b}')
print(f'deriv_relu_wrt_x0 : {deriv_relu_wrt_x0}')
print(f'deriv_relu_wrt_w0 : {deriv_relu_wrt_w0}')
print(f'deriv_relu_wrt_x1 : {deriv_relu_wrt_x1}')
print(f'deriv_relu_wrt_w1 : {deriv_relu_wrt_w1}')
print(f'deriv_relu_wrt_x2 : {deriv_relu_wrt_x2}')
print(f'deriv_relu_wrt_w2 : {deriv_relu_wrt_w2}')

# -- ---------------------------------------


deriv_relu_wrt_z  : 1.0
deriv_relu_wrt_xw0: 1.0
deriv_relu_wrt_xw1: 1.0
deriv_relu_wrt_xw2: 1.0
deriv_relu_wrt_b  : 1.0
deriv_relu_wrt_x0 : -3.0
deriv_relu_wrt_w0 : 1.0
deriv_relu_wrt_x1 : -1.0
deriv_relu_wrt_w1 : -2.0
deriv_relu_wrt_x2 : 2.0
deriv_relu_wrt_w2 : 3.0


Let's see the graph again: 


![](https://drive.google.com/uc?id=19GdcFSnltR2iSBV2nCL7EMv_YLELhESa)


In [18]:
from IPython.display import Image

In [19]:
Image(url='https://drive.google.com/uc?id=1nSY_g3ksfY70Ui77Eg-L5mBcz4f81ruz')

### Simplify wrt x0

In the code above, notice how we applied the chain rule to calculate the partial derivative of the ReLU activation function **with respect to the first input**, $x_0$.

Let’s take the related lines of the code and simplify what's needed for our final **`deriv_relu_wrt_x0`**:

<br>

**Original**:

```python
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

deriv_sum_wrt_xw0 = 1
deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0

deriv_mul_wrt_x0 = w[0]
deriv_relu_wrt_x0 = deriv_relu_wrt_xw0 * deriv_mul_wrt_x0
```

**Replace `deriv_mul_wrt_x0` with `w[0]`**:

```python
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

deriv_sum_wrt_xw0 = 1
deriv_relu_wrt_xw0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0

deriv_relu_wrt_x0 = deriv_relu_wrt_xw0 * w[0]
```


**Replace `deriv_relu_wrt_xw0` with `deriv_relu_wrt_z * deriv_sum_wrt_xw0`**:

```python
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

deriv_sum_wrt_xw0 = 1

deriv_relu_wrt_x0 = deriv_relu_wrt_z * deriv_sum_wrt_xw0 * w[0]
```

**Replace `deriv_sum_wrt_xw0` with `1`**:

```python
deriv_relu_wrt_z = deriv_from_next_layer * (1. if z > 0 else 0.)

deriv_relu_wrt_x0 = deriv_relu_wrt_z * 1 * w[0]
```

**Replace `deriv_relu_wrt_z` with `deriv_from_next_layer * (1. if z > 0 else 0.)`**:

```python
deriv_relu_wrt_x0 = deriv_from_next_layer * (1. if z > 0 else 0.) * w[0]
```



#### Run again

In [11]:
# B"H


# -- ---------------------------------------
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
# -- ---------------------------------------


# -- ---------------------------------------
# Backward pass

deriv_from_next_layer = 1.0

deriv_relu_wrt_x0 = deriv_from_next_layer * (1. if z > 0 else 0.) * w[0]
deriv_relu_wrt_x1 = deriv_from_next_layer * (1. if z > 0 else 0.) * w[1]
deriv_relu_wrt_x2 = deriv_from_next_layer * (1. if z > 0 else 0.) * w[2]

deriv_relu_wrt_w0 = deriv_from_next_layer * (1. if z > 0 else 0.) * x[0]
deriv_relu_wrt_w1 = deriv_from_next_layer * (1. if z > 0 else 0.) * x[1]
deriv_relu_wrt_w2 = deriv_from_next_layer * (1. if z > 0 else 0.) * x[2]

deriv_relu_wrt_b  = deriv_from_next_layer * (1. if z > 0 else 0.) 

print(f'deriv_relu_wrt_x0 : {deriv_relu_wrt_x0}')
print(f'deriv_relu_wrt_w0 : {deriv_relu_wrt_w0}')
print(f'deriv_relu_wrt_x1 : {deriv_relu_wrt_x1}')
print(f'deriv_relu_wrt_w1 : {deriv_relu_wrt_w1}')
print(f'deriv_relu_wrt_x2 : {deriv_relu_wrt_x2}')
print(f'deriv_relu_wrt_w2 : {deriv_relu_wrt_w2}')
print(f'deriv_relu_wrt_b  : {deriv_relu_wrt_b}')
# -- ---------------------------------------


deriv_relu_wrt_x0 : -3.0
deriv_relu_wrt_w0 : 1.0
deriv_relu_wrt_x1 : -1.0
deriv_relu_wrt_w1 : -2.0
deriv_relu_wrt_x2 : 2.0
deriv_relu_wrt_w2 : 3.0
deriv_relu_wrt_b  : 1.0


Let's see the graph again: 


![](https://drive.google.com/uc?id=19GdcFSnltR2iSBV2nCL7EMv_YLELhESa)
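As a sanity check on the values above, we can compare them with a numerical estimate. The sketch below is not part of the original walkthrough: the `neuron` and `numeric_gradient` helpers are hypothetical names introduced here, and the step size `h` is an arbitrary small value. It nudges each input, weight, and the bias up and down by `h` and measures how the neuron's output responds (a central finite difference).

```python
def neuron(x, w, b):
    """Forward pass of the single neuron: ReLU(sum(x_i * w_i) + b)."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)


def numeric_gradient(values, output_fn, h=1e-6):
    """Central finite difference of output_fn with respect to each entry of values."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        down = list(values)
        up[i] += h
        down[i] -= h
        grads.append((output_fn(up) - output_fn(down)) / (2 * h))
    return grads


x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

numeric_x_gradient = numeric_gradient(x, lambda xs: neuron(xs, w, b))
numeric_w_gradient = numeric_gradient(w, lambda ws: neuron(x, ws, b))
numeric_b_gradient = numeric_gradient([b], lambda bs: neuron(x, w, bs[0]))[0]

# These should be close to the analytic results:
# inputs ~ [-3, -1, 2], weights ~ [1, -2, 3], bias ~ 1
print(numeric_x_gradient)
print(numeric_w_gradient)
print(numeric_b_gradient)
```

The estimates agree with the chain-rule results only up to floating-point error, so compare them with a tolerance rather than with `==`.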

### The Gradients

The partial derivatives above, combined into a vector, make up our gradients. 

Our gradients could be represented as:

In [12]:
# gradients on inputs:
x_gradient = [deriv_relu_wrt_x0, deriv_relu_wrt_x1, deriv_relu_wrt_x2]  

# gradients on weights:
w_gradient = [deriv_relu_wrt_w0, deriv_relu_wrt_w1, deriv_relu_wrt_w2]  

# gradient on bias...just 1 bias here:
b_gradient = deriv_relu_wrt_b  

print(f'x_gradient: {x_gradient}')
print(f'w_gradient: {w_gradient}')
print(f'b_gradient: {b_gradient}')

x_gradient: [-3.0, -1.0, 2.0]
w_gradient: [1.0, -2.0, 3.0]
b_gradient: 1.0


> NOTE: 
> - For this single-neuron example, we won’t need `x_gradient`, since there is no preceding layer to pass it to.
> - With many layers, we continue backpropagating to the preceding layers using the partial derivatives with respect to our inputs.

### Apply gradients to the weights

We can now apply these gradients to the weights to hopefully minimize the output. 

This is typically the purpose of the **optimizer** (discussed in following sections), but we can show a simplified version of this task by directly applying a negative fraction of the gradient to our weights. 

We apply a **negative** fraction to this gradient since we want to decrease the final output value, and the gradient shows the direction of the steepest ascent. 

In [15]:
# Our current weights and bias are:
print(f'w: {w}')
print(f'b: {b}')

w: [-3.0, -1.0, 2.0]
b: 1.0


In [14]:
# Our gradients:
print(f'w_gradient: {w_gradient}')
print(f'b_gradient: {b_gradient}')

w_gradient: [1.0, -2.0, 3.0]
b_gradient: 1.0


We can then apply a fraction of the gradients to these values:

In [16]:
w[0] += -0.001 * w_gradient[0]
w[1] += -0.001 * w_gradient[1]
w[2] += -0.001 * w_gradient[2]
b += -0.001 * b_gradient

print(w, b)

[-3.001, -0.998, 1.997] 0.999


### Run another forward pass

In [17]:
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)
print(y)

5.985


We’ve successfully decreased this neuron’s output from 6.000 to 5.985.

Note that it does not make sense to decrease the neuron’s output in a real neural network; we were doing this **purely as a simpler exercise than the full network**. 
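To see the effect of repeating this update, here is a minimal sketch that runs the same forward pass, backward pass, and weight update a few times in a loop. The loop, the `learning_rate` name, and the choice of 5 steps are our own additions for illustration; the `0.001` fraction is the one used above.

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
learning_rate = 0.001  # the same fraction of the gradient used above

outputs = []
for step in range(5):
    # Forward pass
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(z, 0.0)
    outputs.append(y)

    # Backward pass (the gradient from the "next layer" is assumed to be 1)
    relu_grad = 1.0 if z > 0 else 0.0
    w_gradient = [relu_grad * xi for xi in x]
    b_gradient = relu_grad

    # Apply a negative fraction of the gradient
    w = [wi - learning_rate * gi for wi, gi in zip(w, w_gradient)]
    b -= learning_rate * b_gradient

print(outputs)  # the output shrinks a little on every step
```

With these inputs, each step lowers `z` by `0.001 * (1 + 4 + 9 + 1) = 0.015`, which matches the drop from 6.000 to 5.985 seen above.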

## More Complex Example


### Intro


<u>**Scenario A: 1 neuron in current layer, 1 in next layer**</u>

So far, we have performed an example backward pass with a **single neuron**, which received a **singular derivative**, `deriv_from_next_layer` (from "next" layer) to apply the chain rule. 

<br>

---

<u>**Scenario B: 1 neuron in current layer, multiple in next layer**</u>

Let’s consider **multiple neurons** in the next layer. A **single neuron** of the current layer connects to all of them — they all receive the output of this neuron. 

What will happen during backpropagation? 
- Each neuron from the next layer will return a partial derivative of its function with respect to this input. 
- The neuron in the current layer will receive a vector consisting of these derivatives. 
- We need this to be a **singular value** for a singular neuron. 
- To continue backpropagation, we need to **sum** this vector.
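The steps in this list can be sketched for our single neuron as follows. The three derivative values are made up for illustration (they are not from the original), and `z = 6.0` is the pre-activation value from the forward pass earlier.

```python
import numpy as np

# Hypothetical derivatives returned by three next-layer neurons,
# each taken with respect to this neuron's output
derivs_from_next_layer = np.array([0.5, -1.2, 2.0])

# Collapse the vector into the single value this neuron needs
deriv_from_next_layer = derivs_from_next_layer.sum()

# Continue exactly as in the single-derivative case
z = 6.0  # pre-activation value from the earlier forward pass
deriv_relu_wrt_z = deriv_from_next_layer * (1.0 if z > 0 else 0.0)

print(deriv_relu_wrt_z)  # approximately 1.3
```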

<br>

---

<u>**Scenario C: Multiple neurons in current layer, multiple in next layer**</u>

During backpropagation: 
- Each neuron from the current layer will receive a vector of partial derivatives the same way that we described for Scenario B. 
- With a layer of neurons, it’ll take the form of a list of these vectors, or a 2D array. 
- Each neuron in the next layer is going to output a gradient of the partial derivatives with respect to all of its inputs. 
- From all the neurons in the next layer this will form a list of these vectors. 



### Our staged scenario

We're zooming into a layer of 3 neurons with 4 inputs. Let's imagine we have already done the forward pass, are in the middle of backpropagation, and already have the gradients from the "next layer/sublayer".



In [None]:
import numpy as np

#### Current layer and gradient from next

In [None]:
# -- ------------------------------
# Passed-in gradient from the next layer/sublayer
dvalues = np.array([[1., 1., 1.]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([
    [0.2, 0.8, -0.5, 1],       # weights for neuron 1
    [0.5, -0.91, 0.26, -0.5],  # weights for neuron 2
    [-0.26, -0.27, 0.17, 0.87] # weights for neuron 3
]).T

In [None]:
dvalues

array([[1., 1., 1.]])

In [None]:
dvalues.shape

(1, 3)

<u>**Remember:**</u> `weights` has been transposed:
- Before, each "record" represented a neuron and its 4 weights
- Now, each "record" represents an input attribute, of which there are 4, and its 3 weights (one for each neuron)

In [None]:
weights

array([[ 0.2 ,  0.5 , -0.26],
       [ 0.8 , -0.91, -0.27],
       [-0.5 ,  0.26,  0.17],
       [ 1.  , -0.5 ,  0.87]])

In [None]:
weights.shape

(4, 3)

#### Get gradient

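Remember from the section _**Sublayer - weights and inputs**_ above that the partial derivative **with respect to an input** equals the related weight. With several neurons sharing that input, we multiply each related weight by the gradient passed in for its neuron and sum the results.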

> Note: `dinputs` is a gradient of the neuron function with respect to inputs.

In [None]:
# Sum weights related to the given input multiplied by
# the gradient related to the given neuron
dx0 = sum([
    weights[0][0]*dvalues[0][0], 
    weights[0][1]*dvalues[0][1],
    weights[0][2]*dvalues[0][2]
])

dx1 = sum([
    weights[1][0]*dvalues[0][0], 
    weights[1][1]*dvalues[0][1],
    weights[1][2]*dvalues[0][2]
])

dx2 = sum([
    weights[2][0]*dvalues[0][0], 
    weights[2][1]*dvalues[0][1],
    weights[2][2]*dvalues[0][2]
])

dx3 = sum([
    weights[3][0]*dvalues[0][0], 
    weights[3][1]*dvalues[0][1],
    weights[3][2]*dvalues[0][2]
])

dinputs = np.array([dx0, dx1, dx2, dx3])

print(dinputs)

[ 0.44 -0.38 -0.07  1.37]


#### Using dot product

We can achieve the same result with `np.dot`:

In [None]:
# for each input: multiply its weights by the passed-in gradient
# of the matching neuron, then sum the products
dinputs = np.dot(dvalues[0], weights.T)

print(dinputs)

[ 0.44 -0.38 -0.07  1.37]
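Because `dvalues` is all ones above, the multiplication by the passed-in gradient is easy to overlook. The sketch below repeats the comparison with a made-up, non-uniform gradient (`dvalues_example` is a hypothetical name and value, not from the original) to show that the manual sums and `np.dot` still agree.

```python
import numpy as np

# Same weights as above, kept transposed: shape (4 inputs, 3 neurons)
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# Hypothetical gradient from the next layer, one value per neuron
dvalues_example = np.array([[1., 2., 3.]])

# Manual version: for each input, weight * gradient summed over the neurons
manual = np.array([
    sum(weights[i][j] * dvalues_example[0][j] for j in range(3))
    for i in range(4)
])

# Dot-product version
dotted = np.dot(dvalues_example[0], weights.T)

print(np.allclose(manual, dotted))
```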


#### Using batch of inputs

With more samples, the "next" layer will return a list of gradients. 



##### wrt inputs

Our code just needs a minor tweak: `dvalues` instead of `dvalues[0]`

In [None]:
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# SAME AS BEFORE
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# for each sample and input: multiply the input's weights by the
# passed-in gradient of the matching neuron, then sum the products
dinputs = np.dot(dvalues, weights.T)

print(dinputs)

[[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]


##### wrt weights


In [None]:
# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

In [None]:
# Let's remember what dvalues looks like:
dvalues

array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.]])

In [None]:
# inputs transposed: each row now holds one input feature across all 3 samples
# (e.g. imagine row one holding the eye color of each of the 3 people in the batch)
inputs.T

array([[ 1. ,  2. , -1.5],
       [ 2. ,  5. ,  2.7],
       [ 3. , -1. ,  3.3],
       [ 2.5,  2. , -0.8]])

In [None]:
# for each weight: multiply its input by the passed-in gradient
# of its neuron for every sample, then sum over the samples
dweights = np.dot(inputs.T, dvalues)

print(dweights)

[[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]
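One way to see why `np.dot(inputs.T, dvalues)` gives the weight gradients is to build them one sample at a time. In the sketch below (the loop and the `dweights_loop` name are ours, for illustration), each sample contributes the outer product of its inputs with its passed-in gradients, and the contributions are summed over the batch.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# Accumulate one (4 inputs x 3 neurons) contribution per sample
dweights_loop = np.zeros((4, 3))
for sample_inputs, sample_dvalues in zip(inputs, dvalues):
    # entry [i, j] is input i times the gradient passed in for neuron j
    dweights_loop += np.outer(sample_inputs, sample_dvalues)

dweights = np.dot(inputs.T, dvalues)

print(np.allclose(dweights_loop, dweights))
```

The result has the same shape as `weights`, `(4, 3)`, which is what allows it to be applied to the weights directly during the update.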


##### wrt biases

For the biases, the derivatives come from the sum operation and always equal 1; to apply the chain rule, we multiply them by the incoming gradients.

Because the incoming gradient is a list of gradients (one row of per-neuron gradients for each sample), we sum it column-wise, along axis 0, to get one value per neuron.

In [None]:
# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]])

dbiases = np.sum(dvalues, axis=0, keepdims=True)

print(dbiases)

[[6. 6. 6.]]


##### Sublayer ReLU

Let's now move up to the next sublayer

In [None]:
# Example layer output (before ReLU)
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

# Gradient from "next" layer
dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])



In [None]:
# ReLU activation's derivative
drelu = np.zeros_like(z)
# Set cells to 1 where corresponding z cell > 0
drelu[z > 0] = 1

print(drelu)

[[1 1 0 0]
 [1 0 0 1]
 [0 1 1 0]]


In [None]:
# The chain rule
drelu *= dvalues

print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


###### Simplified

In [None]:
drelu = dvalues.copy()
drelu[z <= 0] = 0

print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


#### Full forward and backward pass

In [None]:
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

In [None]:
# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

In [None]:
# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T


# One bias for each neuron
biases = np.array([[2, 3, 0.5]])

In [None]:
# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu_outputs = np.maximum(0, layer_outputs)  # ReLU activation

In [None]:
# Backpropagation starts here

# Simulated gradient from the next layer: for this demonstration we reuse
# the ReLU outputs as the derivative values passed back to this layer,
# then zero them wherever the ReLU input was not positive
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0

In [None]:
# dinputs - multiply by weights
dinputs = np.dot(drelu, weights.T)

In [None]:
# dweights - multiply by inputs
dweights = np.dot(inputs.T, drelu)

In [None]:
# dbiases - sum values, do this over samples (first axis)
dbiases = np.sum(drelu, axis=0, keepdims=True)

In [None]:
# Update parameters
weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights)
print(biases)

[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]
[[1.98489  2.997739 0.497389]]
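One way to read the simulated gradient used in the backward pass above (this is our interpretation, not something stated in the original): passing `relu_outputs` back as the incoming gradient is what we would get if the quantity being minimized were `0.5 * sum(relu_outputs ** 2)`, since the derivative of that expression with respect to each output is the output itself. Under that assumption, the sketch below checks `dweights` against a numerical estimate. The `half_sum_squares` helper and the step size `h` are hypothetical choices made for this check.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])


def half_sum_squares(w, b):
    """Assumed objective: half the sum of the squared ReLU outputs."""
    out = np.maximum(0, np.dot(inputs, w) + b)
    return 0.5 * np.sum(out ** 2)


# Analytic gradient, computed the same way as in the cells above
layer_outputs = np.dot(inputs, weights) + biases
relu_outputs = np.maximum(0, layer_outputs)
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0
dweights = np.dot(inputs.T, drelu)

# Numerical gradient: nudge one weight at a time (central difference)
h = 1e-6
dweights_numeric = np.zeros_like(weights)
for idx in np.ndindex(*weights.shape):
    up = weights.copy()
    down = weights.copy()
    up[idx] += h
    down[idx] -= h
    dweights_numeric[idx] = (
        half_sum_squares(up, biases) - half_sum_squares(down, biases)
    ) / (2 * h)

print(np.allclose(dweights, dweights_numeric, atol=1e-5))
```

A check like this is only reliable when no pre-activation value sits right at 0, where ReLU has a kink; that holds for the values used here.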


## Conclusion

By this point, we’ve covered everything we need to perform backpropagation, except for the derivative of the Softmax activation function and the derivative of the cross-entropy loss function.




