
# Deriving the Jacobians & Gradients for Neural Networks

In this article, we'll derive all the Jacobians and gradients for a two-layer
neural network from first principles. This includes the derivatives of the linear layers, ReLU, softmax,
and the cross-entropy loss.

## The model

We'll assume a simple neural network with one hidden layer, similar to the one defined in the
[Stanford
CS224n](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/readings/gradient-notes.pdf)
lecture notes. In addition, we'll assume the input and output shapes correspond
to the MNIST dataset (hand-written digits), so that these calculations can
easily be written in code for a later post.

$$ x = \text{input} \in \mathbb{R}^{784 \times 1} $$
$$ z = Wx + b_1 \in \mathbb{R}^{128 \times 1} $$
$$ h = \text{ReLU}(z) $$
$$ \theta = Uh + b_2 \in \mathbb{R}^{10 \times 1} $$
$$ \hat{y} = \text{softmax}(\theta) $$
$$ J = \text{cross-entropy}(y, \hat{y}) $$
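
To make the shapes concrete, here is a minimal NumPy sketch of the forward pass defined above. The random initialisation, the random stand-in for an image, and the helper names `softmax` and `forward` are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the model above; the small random initialisation is an assumption.
W = rng.normal(scale=0.01, size=(128, 784))
b1 = np.zeros((128, 1))
U = rng.normal(scale=0.01, size=(10, 128))
b2 = np.zeros((10, 1))


def softmax(theta):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    e = np.exp(theta - theta.max())
    return e / e.sum()


def forward(x, y):
    z = W @ x + b1                   # pre-activation, (128, 1)
    h = np.maximum(z, 0)             # ReLU, (128, 1)
    theta = U @ h + b2               # logits, (10, 1)
    y_hat = softmax(theta)           # probabilities, (10, 1)
    J = -np.sum(y * np.log(y_hat))   # cross-entropy loss, scalar
    return z, h, theta, y_hat, J


x = rng.random((784, 1))             # stand-in for a flattened 28x28 image
y = np.zeros((10, 1))
y[3] = 1.0                           # one-hot label for the digit 3

z, h, theta, y_hat, J = forward(x, y)
print(z.shape, h.shape, theta.shape, y_hat.shape, float(J))
```

Keeping every vector as an explicit column (for example `(784, 1)` rather than `(784,)`) mirrors the $\mathbb{R}^{n \times 1}$ notation used throughout the derivation.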

## Backwards Pass

### Jacobian of loss with respect to $\theta$

#### Derivative of softmax (with respect to $\theta$)

We begin by calculating the derivative of softmax with respect to $\theta$. To keep the notation simple, we abbreviate the denominator of softmax (the sum over all ten logits) as $\sum$:

$$
    \hat{y}_i = \frac{e^{{\theta}_i}}{\sum_{k=1}^{10}{e^{{\theta}_k}}}
    = \frac{e^{{\theta}_i}}{\sum}
$$

Using the [quotient rule](https://en.wikipedia.org/wiki/Quotient_rule), we can
calculate the partial derivative with respect to a specific logit ${\theta}_j$:

$$
    \frac{\partial{\hat{y}_i}}{\partial{{\theta}_j}} =
    \frac{{\sum} \times {\frac{\partial}{\partial{{\theta}_j}}e^{{\theta}_i}}-{e^{{\theta}_j}}{e^{{\theta}_i}}}{{\sum}^2}
$$

We can simplify this further by noting that $\frac{\partial}{\partial{{\theta}_j}}e^{{\theta}_i}$ is $e^{{\theta}_i}$ if $i = j$ and $0$ otherwise:

$$
    \text{If } i = j \text{ :}
$$

$$
    \frac{\partial{\hat{y}_i}}{\partial{{\theta}_j}} =
    \frac{{\sum} \times
    {e^{{\theta}_i}}-{e^{{\theta}_i}}{e^{{\theta}_j}}}{{\sum}^2} =
    \frac{e^{{\theta}_i}}{\sum} \times \frac{\sum - e^{{\theta}_j}}{\sum}
    = \hat{y}_i(1 - \hat{y}_j)
$$

$$
    \text{Else if } i \neq j \text{ :}
$$

$$
    \frac{\partial{\hat{y}_i}}{\partial{{\theta}_j}} =
    \frac{{\sum} \times {0}-{e^{{\theta}_i}}{e^{{\theta}_j}}}{{\sum}^2}
    = -\hat{y}_i \hat{y}_j
$$
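
The two cases can be combined into a single matrix, $\text{diag}(\hat{y}) - \hat{y}\hat{y}^T$. The sketch below builds that matrix and compares it with a central finite-difference estimate. The helper names `softmax_jacobian` and `numerical_jacobian`, the random logits, and the step size are assumptions made for this check.

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - theta.max())  # shift by the max for numerical stability
    return e / e.sum()


def softmax_jacobian(theta):
    # Entry (i, j) is d y_hat_i / d theta_j:
    #   y_hat_i * (1 - y_hat_j) when i == j, and -y_hat_i * y_hat_j otherwise.
    y_hat = softmax(theta)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)


def numerical_jacobian(f, theta, eps=1e-6):
    # Central differences, one column per perturbed logit.
    n = theta.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return jac


theta = np.random.default_rng(0).normal(size=10)  # arbitrary logits
analytic = softmax_jacobian(theta)
numeric = numerical_jacobian(softmax, theta)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

Each column of this matrix sums to zero, which reflects the fact that the softmax outputs always sum to 1.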

#### Derivative of loss (with respect to $\theta$)

Next we'll calculate the partial derivative of loss with respect to a specific
logit ${\theta}_j$:

$$
    \text{CE}(y, \hat{y}) = -\sum_{i=1}^{10}{y_i\ln(\hat{y}_i)}
$$

$$
    \frac{\partial{J}}{\partial{{\theta}_j}} =
    -\sum_{i=1}{y_i \frac{\partial \ln(\hat{y}_i)}{\partial{{\theta}_j}}}
$$

$$
    = -\sum_{i=1}{\frac{y_i}{\hat{y}_i} \frac{\partial{\hat{y}_i}}{\partial{{\theta}_j}}}
$$

Splitting this sum up (to the $i=j$ part and the $i \neq j$ parts), we get:

$$
    = 
    - \frac{y_j}{\hat{y}_j} \frac{\partial{\hat{y}_j}}{\partial{{\theta}_j}}
    -\sum_{i=1, i \neq j}{\frac{y_i}{\hat{y}_i} \frac{\partial{\hat{y}_i}}{\partial{{\theta}_j}}}
$$

Then we can substitute in the softmax partial derivatives for these two cases:

$$
    \frac{\partial{J}}{\partial{{\theta}_j}} =
    -\frac{y_j}{\hat{y}_j} \times \hat{y}_j (1 - \hat{y}_j)
    - \sum_{i=1, i \neq j}{ \frac{y_i}{\hat{y}_i} \times -(\hat{y}_i \hat{y}_j)}
$$

$$
    = -y_j + y_j \hat{y}_j + \sum_{i=1, i \neq j}{y_i \hat{y}_j}
$$

$$
    = \hat{y}_j \sum_{i}{y_i} - y_j
$$

Since $y$ is a one-hot encoded vector, its entries sum to 1:

$$
    \sum_{i}{y_i} = 1
$$

$$
    \frac{\partial{J}}{\partial{{\theta}_j}} = \hat{y}_j - y_j
$$

#### Jacobian

Finally, collecting the partial derivatives for all ten logits gives us the Jacobian:

$$
    \frac{\partial{J}}{\partial{{\theta}}} = \hat{y} - y \in \mathbb{R}^{10 \times 1}
$$
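
A quick way to gain confidence in this result is a numerical gradient check. The sketch below compares $\hat{y} - y$ against central finite differences of the loss. The random logits, the chosen label, and the helper name `cross_entropy` are assumptions made for illustration.

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - theta.max())  # shift by the max for numerical stability
    return e / e.sum()


def cross_entropy(y, theta):
    # Loss as a function of the logits: softmax followed by cross-entropy.
    return -np.sum(y * np.log(softmax(theta)))


rng = np.random.default_rng(1)
theta = rng.normal(size=10)  # arbitrary logits
y = np.zeros(10)
y[7] = 1.0                   # one-hot label

analytic = softmax(theta) - y  # the result derived above

eps = 1e-6
numeric = np.zeros(10)
for j in range(10):
    step = np.zeros(10)
    step[j] = eps
    numeric[j] = (cross_entropy(y, theta + step)
                  - cross_entropy(y, theta - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```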

### Jacobians for the second linear layer

#### Weights ($U$)

The second linear layer is defined element-wise as follows:

$$
    \theta_j = \sum_{l=1}^{128}{U_{j,l} \times h_l} + b_{2,j}
$$

The gradient with respect to a specific weight $U_{k,l}$ is therefore 0 if $j
\neq k$, as the rules of matrix multiplication mean that row $k$ of $U$ only
contributes to $\theta_k$, so $U_{k,l}$ has no impact on $\theta_j$. Otherwise it is simply $h_l$:

$$
    \frac{\partial{\theta_j}}{\partial{U_{k,l}}} = 
    \begin{cases}
        h_l & \text{if } j = k\\    
        0 & \text{if } j \neq k
    \end{cases}
$$

Then using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule):

$$
    \frac{\partial{J}}{\partial{U_{k,l}}} =
    \frac{\partial{J}}{\partial{\theta}} \times 
    \frac{\partial{\theta}}{\partial{U_{k,l}}}
$$

$$
    = \sum_{j=1}{\frac{\partial{J}}{\partial{\theta_j}}}
    \times \frac{\partial{\theta_j}}{\partial{U_{k,l}}}
$$

Since this is only non-zero when $j = k$:

$$
    \frac{\partial{J}}{\partial{U_{k,l}}} =
    \frac{\partial{J}}{\partial{\theta_k}}
    \times h_l
$$

Collecting these entries into a matrix with the same $\mathbb{R}^{10 \times 128}$ shape as $U$ (so we can simply
subtract from $U$ for stochastic gradient descent), we arrive at the Jacobian:

$$
    \frac{\partial{J}}{\partial{U}} =
    \frac{\partial{J}}{\partial{\theta}}
    \times h^T 
    \in \mathbb{R}^{10 \times 128}
$$
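
In code, this is an outer product between the logit gradient and the hidden activations. The sketch below uses random stand-ins for $h$ and $U$ (an assumption, so the block runs on its own) and checks one entry of the result against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
h = np.maximum(rng.normal(size=(128, 1)), 0)  # stand-in hidden activations
U = rng.normal(scale=0.1, size=(10, 128))     # stand-in weights
b2 = np.zeros((10, 1))
y = np.zeros((10, 1))
y[2] = 1.0                                    # one-hot label


def loss(U):
    theta = U @ h + b2
    e = np.exp(theta - theta.max())
    y_hat = e / e.sum()
    return -np.sum(y * np.log(y_hat)), y_hat


J, y_hat = loss(U)
dtheta = y_hat - y   # dJ/dtheta, (10, 1)
dU = dtheta @ h.T    # (10, 1) @ (1, 128) -> (10, 128), same shape as U

# Finite-difference check of a single entry U[k, l].
k = 4
l = int(np.argmax(h))  # pick a hidden unit that is active
eps = 1e-6
U_plus = U.copy()
U_plus[k, l] += eps
U_minus = U.copy()
U_minus[k, l] -= eps
numeric = (loss(U_plus)[0] - loss(U_minus)[0]) / (2 * eps)

print(float(dU[k, l]), float(numeric))
```

Because `dU` has the same shape as `U`, a gradient descent step can be written as `U -= learning_rate * dU`, where `learning_rate` is a hyperparameter you would choose yourself.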

#### Bias ($b_2$)

As the bias is an $\mathbb{R}^{10 \times 1}$ vector that is simply added to $Uh$ to
produce $\theta$, its gradient will be the same as $\frac{\partial{J}}{\partial{\theta}}$:

$$
\frac{\partial{J}}{\partial{b_2}} =
    \frac{\partial{J}}{\partial{\theta}}
    \in \mathbb{R}^{10 \times 1}
$$
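
To spell out the step behind this, write $b_{2,k}$ for the $k$-th entry of $b_2$. Each logit $\theta_j$ depends only on its own bias entry, so:

$$
    \frac{\partial{\theta_j}}{\partial{b_{2,k}}} =
    \begin{cases}
        1 & \text{if } j = k\\
        0 & \text{if } j \neq k
    \end{cases}
$$

$$
    \frac{\partial{J}}{\partial{b_{2,k}}} =
    \sum_{j=1}^{10}{\frac{\partial{J}}{\partial{\theta_j}} \times \frac{\partial{\theta_j}}{\partial{b_{2,k}}}} =
    \frac{\partial{J}}{\partial{\theta_k}} =
    \hat{y}_k - y_k
$$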

### Jacobian for ReLU

The derivative of ReLU is straightforward (ReLU is not differentiable at $z_p = 0$, so by convention we use 0 there):

$$
    \text{ReLU}(z_p) = \max(z_p, 0)
$$

$$
    \text{ReLU}'(z_p) = 
    \begin{cases}
        1 & \text{if } z_p > 0\\    
        0 & \text{if } z_p \leq 0
    \end{cases}
$$

Therefore using the chain rule:

$$
    \frac{\partial{J}}{\partial{z}} =
    \frac{\partial{J}}{\partial{\theta}}
    \times \frac{\partial{\theta}}{\partial{h}}
    \times \frac{\partial{h}}{\partial{z}}
$$

$$
    \theta = U h + b_2
$$

$$
    \frac{\partial{\theta}}{\partial{h}} = U
    \in \mathbb{R}^{10 \times 128}
$$

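Because we store gradients as column vectors, the chain rule above becomes a
multiplication by $U^T$. Since ReLU acts on each element independently,
$\frac{\partial{h}}{\partial{z}}$ is a diagonal matrix, so the final step reduces
to an element-wise (Hadamard) product $\odot$ with $\text{ReLU}'(z)$:

$$
    \frac{\partial{J}}{\partial{z}} =
    \left(U^T
    \times \frac{\partial{J}}{\partial{\theta}}\right)
    \odot \text{ReLU}'(z)
    \in \mathbb{R}^{128 \times 1}
$$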
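
In code, the ReLU derivative acts as an element-wise mask on the gradient flowing back from the logits. The sketch below uses random stand-ins for $z$ and $U$ (an assumption, so the block runs on its own) and checks one entry of $\frac{\partial{J}}{\partial{z}}$ numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=(128, 1))              # stand-in pre-activations
U = rng.normal(scale=0.1, size=(10, 128))  # stand-in weights
b2 = np.zeros((10, 1))
y = np.zeros((10, 1))
y[5] = 1.0                                 # one-hot label


def loss_from_z(z):
    h = np.maximum(z, 0)
    theta = U @ h + b2
    e = np.exp(theta - theta.max())
    y_hat = e / e.sum()
    return -np.sum(y * np.log(y_hat)), y_hat


J, y_hat = loss_from_z(z)
dtheta = y_hat - y                 # dJ/dtheta, (10, 1)
dh = U.T @ dtheta                  # dJ/dh, (128, 1)
relu_grad = (z > 0).astype(float)  # ReLU'(z): 1 where z > 0, else 0
dz = dh * relu_grad                # element-wise product, (128, 1)

# Finite-difference check on the most active unit, well away from the kink at 0.
p = int(np.argmax(z))
eps = 1e-6
step = np.zeros_like(z)
step[p] = eps
numeric = (loss_from_z(z + step)[0] - loss_from_z(z - step)[0]) / (2 * eps)

print(float(dz[p, 0]), float(numeric))
```

Note that `*` on NumPy arrays is element-wise, whereas `@` is matrix multiplication, so the two operators map directly onto the two kinds of product in the expression for $\frac{\partial{J}}{\partial{z}}$.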

### Jacobians for the first linear layer

Following a similar approach to the second linear layer (using the chain rule, with $z = Wx + b_1$):

$$
    \frac{\partial{J}}{\partial{b_1}} = \frac{\partial{J}}{\partial{z}}
    \in \mathbb{R}^{128 \times 1}
$$

$$
    \frac{\partial{J}}{\partial{W}} = \frac{\partial{J}}{\partial{z}} \times x^T
    \in \mathbb{R}^{128 \times 784}
$$
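
Putting all the pieces together, a full backward pass might look like the sketch below. The random parameters, the random input, and the function names `loss` and `backward` are assumptions made for illustration. One entry of $\frac{\partial{J}}{\partial{W}}$ is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.random((784, 1))                    # stand-in for a flattened image
y = np.zeros((10, 1))
y[1] = 1.0                                  # one-hot label
W = rng.normal(scale=0.05, size=(128, 784))
b1 = np.zeros((128, 1))
U = rng.normal(scale=0.1, size=(10, 128))
b2 = np.zeros((10, 1))


def loss(W, b1, U, b2):
    z = W @ x + b1
    h = np.maximum(z, 0)
    theta = U @ h + b2
    e = np.exp(theta - theta.max())
    y_hat = e / e.sum()
    return -np.sum(y * np.log(y_hat)), (z, h, y_hat)


def backward(z, h, y_hat):
    dtheta = y_hat - y             # (10, 1)
    dU = dtheta @ h.T              # (10, 128)
    db2 = dtheta                   # (10, 1)
    dz = (U.T @ dtheta) * (z > 0)  # (128, 1), ReLU mask applied element-wise
    dW = dz @ x.T                  # (128, 784)
    db1 = dz                       # (128, 1)
    return dW, db1, dU, db2


J, cache = loss(W, b1, U, b2)
dW, db1, dU, db2 = backward(*cache)

# Finite-difference check of one entry W[p, q], chosen so the hidden unit is active.
p = int(np.argmax(cache[0]))
q = int(np.argmax(x))
eps = 1e-6
W_plus = W.copy()
W_plus[p, q] += eps
W_minus = W.copy()
W_minus[p, q] -= eps
numeric = (loss(W_plus, b1, U, b2)[0] - loss(W_minus, b1, U, b2)[0]) / (2 * eps)

print(float(dW[p, q]), float(numeric))
```

Every gradient has the same shape as the parameter it belongs to, which is what makes the gradient descent update a simple element-wise subtraction.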

And that's it! Hopefully you found this helpful, and let me know in the comments
if you spot any issues.
