# Backpropagation in depth

In the [last lesson](https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/rnn.ipynb), we learned how to create a recurrent neural network.  We now know how to build several network architectures using components like dense layers, softmax, and recurrent layers.

We've been a bit loose with how we cover backpropagation, to make neural network architecture easier to understand.  In this lesson, we'll do a deep dive into how backpropagation works.  We'll do this by building a computational graph that keeps track of the different operations that transform the input data.

# The Softmax function

In a previous lesson, we introduced the softmax function.  This is used to convert the output of a neural network into probabilities that can be used to make predictions.  The softmax function is defined as:

$$\zeta=\frac{e^{\hat{y}_{i}}}{\sum_{j}e^{\hat{y}_{j}}}$$

For each row of our neural network output, we raise $e$ to the power of our output value, then divide by the sum of $e$ raised to the power of each of the outputs for that row.

The softmax function looks like this in code:

In [2]:
import numpy as np

def softmax(preds):
    # Subtract each row's max from its elements to stabilize the softmax.
    # Otherwise we could raise e to very large or very small values and cause numerical overflow or underflow.
    normalized = preds - np.max(preds, axis=-1, keepdims=True)
    raised = np.exp(normalized)
    output = raised / np.sum(raised, axis=-1, keepdims=True)
    return output

# Demonstrate how the softmax works
input_row = np.arange(0,3).reshape(1,-1)
print(input_row)
softmax(input_row)

[[0 1 2]]


array([[0.09003057, 0.24472847, 0.66524096]])

We didn't do this previously in the softmax function, but in the above code, we subtract the maximum from each element in the row.  This prevents numerical underflow or overflow.  Each [numeric type](https://numpy.org/doc/stable/user/basics.types.html) (float, integer, etc) can only represent a limited range of values.  For example, 16-bit floating point uses 1 sign bit, 5 exponent bits, and 10 fraction bits (each bit is base 2, so this holds less precision than the same number of base-10 digits).  The maximum value we can store in `float16` is `65504`, which NumPy displays rounded as `65500`:

In [3]:
# Check the maximum value we can assign to float16
np.finfo('float16').max

65500.0

In [8]:
# This is an example of numeric overflow, where we assign a value larger than float16 can represent
a = np.array([0], dtype=np.float16)
a[0] = 6.55e5

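When we raise $e$ to a very large number, the result can be too large to store in our data type (overflow).  When we raise $e$ to a very negative number, the result can be too small to represent and round to zero (underflow).  Subtracting the max gives us the same end result, because the shift cancels between the numerator and the denominator, but the largest exponent becomes 0, so the exponential can no longer overflow.  Feel free to try the softmax out with and without subtracting the max to see how it works!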
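
As a rough sketch of that experiment, the cell below compares a softmax without the max subtraction against one with it.  The names `naive_softmax` and `stable_softmax` and the input values are made up for this illustration; they aren't part of the earlier lessons.

```python
import numpy as np

def naive_softmax(preds):
    # No max subtraction: np.exp can overflow to inf for large inputs
    raised = np.exp(preds)
    return raised / np.sum(raised, axis=-1, keepdims=True)

def stable_softmax(preds):
    # Same computation, but shifted so the largest value in each row is 0
    shifted = preds - np.max(preds, axis=-1, keepdims=True)
    raised = np.exp(shifted)
    return raised / np.sum(raised, axis=-1, keepdims=True)

# Illustrative inputs (assumed values): the second row is the first shifted by 1000
small = np.array([[0.0, 1.0, 2.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# On small inputs, both versions agree
print(np.allclose(naive_softmax(small), stable_softmax(small)))

# On large inputs, exp(1000) overflows float64 to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(large))

# The stable version gives the same probabilities as for the small row
print(stable_softmax(large))
```

Because the two rows differ only by a constant shift, the stable version returns the same probabilities for both, while the naive version fails on the large row.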

# Softmax derivative

Instead of computing the softmax derivative, we previously used the fact that the derivative of the softmax and negative log likelihood functions "cancel out", and end up with a derivative of $p-y$.  But what if we want to find the derivative ourselves?  We can approach it analytically, and find the derivative of the entire function.
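
Before deriving anything, we can at least check the $p-y$ shortcut numerically.  The sketch below assumes a single row of outputs and a one-hot target, and the helper `nll` and the example values are hypothetical, chosen only for this check.  It compares $p-y$ against a finite-difference estimate of the gradient.

```python
import numpy as np

def softmax(preds):
    shifted = preds - np.max(preds, axis=-1, keepdims=True)
    raised = np.exp(shifted)
    return raised / np.sum(raised, axis=-1, keepdims=True)

def nll(preds, y):
    # Negative log likelihood of one-hot targets y under softmax(preds)
    return -np.sum(y * np.log(softmax(preds)))

# Assumed example values: one row of raw outputs and a one-hot target
preds = np.array([[0.5, -1.0, 2.0]])
y = np.array([[0.0, 0.0, 1.0]])

# Gradient of the combined softmax + negative log likelihood, using the shortcut
analytic = softmax(preds) - y

# Numerical gradient: nudge each input up and down and measure the change in loss
eps = 1e-6
numeric = np.zeros_like(preds)
for i in range(preds.shape[1]):
    bump = np.zeros_like(preds)
    bump[0, i] = eps
    numeric[0, i] = (nll(preds + bump, y) - nll(preds - bump, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

A numerical check like this confirms the result, but it doesn't show where the result comes from.  For that, the analytic route means differentiating the whole function at once.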

But an easier method is to break the softmax function apart into individual operations.  Each operation will make a single modification to the data:
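
One possible way to split the softmax into single operations is sketched below.  The exact breakdown and the variable names are an assumption made for illustration; the point is only that each line applies one operation to the result of the previous one.

```python
import numpy as np

preds = np.array([[0.0, 1.0, 2.0]])

# Operation 1: find the max of each row
row_max = np.max(preds, axis=-1, keepdims=True)

# Operation 2: subtract the row max from each element
shifted = preds - row_max

# Operation 3: raise e to the power of each element
raised = np.exp(shifted)

# Operation 4: sum each row
row_sum = np.sum(raised, axis=-1, keepdims=True)

# Operation 5: divide each element by its row sum
output = raised / row_sum

print(output)
```

Each intermediate value (`shifted`, `raised`, `row_sum`) is the output of exactly one operation, which is what lets us take the derivative one step at a time.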



To start, let's read in some data and define a 2-layer neural network that can make predictions:

In [None]:
import pandas as pd

# Read in our data, and fill missing values
data = pd.read_csv("../../data/clean_weather.csv", index_col=0)
data = data.ffill()

# Create data sets of our predictors and targets (x and y)
x = data[:10][["tmax", "tmin", "rain"]].to_numpy()
y = data[:10][["tmax_tomorrow"]].to_numpy()

Once we have the data, we'll initialize our parameters for 2 layers.  To keep things simple, we'll omit the bias, so we just need weights for each layer:

In [None]:
import numpy as np

# w1 maps the 3 input features to 3 hidden units; w2 maps those to 1 output
w1 = np.random.rand(3, 3)
w2 = np.random.rand(3, 1)