## Backprop Workbook 01: Backprop to Output Layer

**For these questions, assume that an $x$ input has 1024 dimensions, that the first hidden layer should have $512$ units, a second layer has $256$ units, and that there are $10$ classes to choose from at the end.**

**Cell to run for Latex commands**

\\[
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\grad}[1]{\nabla #1}
\newcommand{\softmax}[0]{\text{SOFTMAX}}
\\]

## Calculus

**1. What is $\fpartial{}{x} \log x$?**

$\frac{1}{x}$


**2. What is $\fpartial{}{x} e^x$?**

$e^x$


**3. What is $\fpartial{}{x} e^{(x^2)}$?**

$2x e^{(x^2)}$


**4. What is $\fpartial{}{x} \log(x^2)$? Use the chain rule.**


Using the chain rule, we have:

\\[
\fpartial{}{x} \log(x^2)
=
\frac{1}{x^2} \fpartial{}{x} x^2
=
\frac{2x}{x^2}
=
\frac{2}{x}
\\]


**5. Same question, but use the property that $\log$ pulls down exponents.**


We have $\log x^2 = 2 \log x$. Since the derivative of $\log x$ is $\frac{1}{x}$, the answer is $2 \cdot \frac{1}{x} = \frac{2}{x}$, which matches the chain-rule result.

**6. Give the rules for $\log a^b$, $\log ab$, and $\log \frac{a}{b}$.**


\\[
\begin{align}
\log ab
&=
\log(b) + \log(a)
\\
\log a^b
&=
b\log(a)
\\
\log \frac{a}{b}
&=
\log(a b^{-1})
=
\log(a) + \log(b^{-1})
=
\log(a) - \log(b)
\end{align}
\\]

Note that, at least when $b$ is a positive integer, the law about exponents follows from the law about products, because $a^b$ is just $a$ multiplied by itself $b$ times.

**7. Explain the chain rule. Given functions $f$ and $g$, write a function for the derivative of the composition function $f \circ g$ where $(f \circ g)(x) := f(g(x))$.**


\\[
(f \circ g)'(x)
=
f'(g(x))g'(x)
\\]
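As a sanity check, here is a minimal sketch that compares the chain-rule formula against a finite-difference estimate, using the $\log(x^2)$ example from question 4. The helper names (`numerical_derivative`, `chain_rule_derivative`) and the test point `x0` are illustrative choices, not part of the workbook.

```python
import math

def numerical_derivative(func, x, eps=1e-6):
    """Central-difference estimate of func'(x)."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

# Question 4 as a composition: f = log, g(x) = x^2, so f(g(x)) = log(x^2).
def g(x):
    return x ** 2

def g_prime(x):
    return 2 * x

def f(u):
    return math.log(u)

def f_prime(u):
    return 1 / u

def composed(x):
    return f(g(x))

def chain_rule_derivative(x):
    # (f o g)'(x) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

x0 = 3.0  # arbitrary positive test point
analytic = chain_rule_derivative(x0)          # should equal 2 / x0
numeric = numerical_derivative(composed, x0)  # finite-difference estimate
print(analytic, numeric)
```

The two printed values should agree to several decimal places; a large gap would suggest a mistake in one of the hand-derived derivatives.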

## Backpropagation: The Concept

Gradient descent is about changing the weight matrices $W^{(i)}$ and the bias vectors $b^{(i)}$ so that the loss goes down. To do this, we need to know how the loss changes as we change the weight matrices and bias vectors.

That means we need to calculate gradients like $\grad_{W^{(i)}} CE(h^{(3)}, y)$ and $\grad_{b^{(i)}} CE(h^{(3)}, y)$.


**0. What is a partial derivative? What is a gradient?**


Say you have a function $f(x_1, x_2) = x_1 x_2$ which is a function of two scalar variables with a single scalar valued output.

There are two derivatives you can take: one which asks how changes in $x_1$ change the output, and another which asks how changes in $x_2$ change the output.

These are denoted:

\\[
\fpartial{}{x_1} f(x_1, x_2)
\\
\fpartial{}{x_2} f(x_1, x_2)
\\]

They are called *partial derivatives*. You calculate them just like normal derivatives, except if you're doing a partial with respect to (written *wrt*) $x_1$, you just treat $x_2$ like a constant (and vice versa).

A gradient is just the vector of partial derivatives.

\\[
\grad f\left((x_1, x_2)\right)
=
\left(
    \fpartial{}{x_1} f(x_1, x_2), \fpartial{}{x_2} f(x_1, x_2)
\right)
\\]
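The "treat the other variable as a constant" idea can be sketched numerically: nudge one coordinate at a time and watch how the output moves. This assumes NumPy; the function names and the test point are illustrative.

```python
import numpy as np

def f(x):
    """f(x1, x2) = x1 * x2, with x a length-2 array."""
    return x[0] * x[1]

def analytic_gradient(x):
    # d/dx1 (x1 * x2) = x2 and d/dx2 (x1 * x2) = x1
    return np.array([x[1], x[0]])

def numerical_gradient(func, x, eps=1e-6):
    """Estimate each partial by nudging one coordinate and holding the rest fixed."""
    grad = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[k] = eps
        grad[k] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

x = np.array([2.0, 5.0])  # arbitrary test point
grad_exact = analytic_gradient(x)
grad_approx = numerical_gradient(f, x)
print(grad_exact, grad_approx)
```

The same finite-difference loop works for any scalar-valued function of a vector, which makes it a handy way to check the gradients derived by hand later in this workbook.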


**1. What is the shape of $\grad_{W^{(3)}} CE(h^{(3)}, y)$? What does the entry at position $(i, j)$ of this matrix represent?**


The shape is $(256, 10)$. The entry at position $(i, j)$ represents how changing $W^{(3)}_{i, j}$ would change the loss $CE(h^{(3)}, y)$. That is:

\\[
\left(
    \grad_{W^{(3)}} CE(h^{(3)}, y)
\right)_{i, j}
=
\fpartial{}{W^{(3)}_{i, j}}
CE(h^{(3)}, y)
\\]


**2. How is it that changing a weight in $W^{(3)}$ might change the loss?**


A change to $W^{(3)}_{i, j}$ changes $z^{(3)}_j$, which changes all the $h^{(3)}$ values. Since that changes the probability of the correct class $h^{(3)}_{y^*}$, this will change the cross entropy loss.

## Backpropagation to $z^{(3)}$


In order to calculate $\grad_{W^{(3)}} CE(h^{(3)}, y)$, we will first calculate $\grad_{z^{(3)}} CE(h^{(3)}, y)$. Once we know how changing $z^{(3)}$ changes the cross entropy, we can then think about how changing $W^{(3)}$ changes $z^{(3)}$.

Sometimes it makes sense to break the process up further:

1. See how changes to $h^{(3)}$ change the loss $CE(h^{(3)}, y)$.
2. See how changes to $z^{(3)}$ change $h^{(3)}$.
3. See how changes to $W^{(3)}$ change $z^{(3)}$.

In the case of the cross entropy loss plus a softmax, it turns out to be nicer to combine steps one and two.

**1. What is the shape of $\grad_{z^{(3)}} CE(h^{(3)}, y)$? What does each entry represent?**

The shape is $(10,)$. Each entry represents how a change in each of ten $z^{(3)}_i$ values would change the loss. Formula-wise this is:

\\[
\left(
    \grad_{z^{(3)}} CE(h^{(3)}, y)
\right)_i
=
\fpartial{}{z^{(3)}_i}
CE(h^{(3)}, y)
\\]

Note how I wrapped the gradient formula in parentheses and subscripted by $i$. This is how I write "get the $i$-th component of the value of this formula wrapped in parentheses."

**1. Recall that $h^{(3)} = \text{SOFTMAX}\left(z^{(3)}\right)$. So before anything, let's write $CE_\text{vector}(h^{(3)}, y)$ in terms of the $z^{(3)}_i$ values by expanding the formulas for cross-entropy and for $h^{(3)}$.**


\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log\left(h^{(3)} \cdot y\right)
\\
&=
    -\log \sum_{i = 0}^{9} h^{(3)}_i y_i
\\
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\end{align}
\\]


**2. Only one term of the sum above matters. Which one? Why? Write the formula without the summation.**


The only term that matters is for $i = y^*$. That's because only the probability on the correct class matters. The other $y_i$ values are all zero.

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\\
&=
    -\log\left(
        y_{y^*}
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\\
&=
    -\log\left(
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\end{align}
\\]


**3. Why are all the $z^{(3)}$ values still present in the formula?**


Because they are all part of the softmax calculation: the $z^{(3)}$ values represent log odds relative to each other, so every one of them appears in the denominator.

**4. Write this log in terms of a difference of logs.**


\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log\left(
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\\
&=
    -\log \left(\exp\left(z^{(3)}_{y^*}\right)\right)
    +
    \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\\
&=
    -z^{(3)}_{y^*}
    +
    \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\end{align}
\\]
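To see that the simplified form really is the same loss, here is a small sketch that evaluates both versions on random logits. It assumes NumPy; the function names, the random seed, and the choice of correct class are all illustrative.

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow and does not change the result.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def cross_entropy_from_probs(h, y):
    """CE(h, y) = -log(h . y), with y a one-hot vector."""
    return -np.log(np.dot(h, y))

def cross_entropy_from_logits(z, correct_class):
    """CE = -z_{y*} + log(sum_j exp(z_j))."""
    return -z[correct_class] + np.log(np.sum(np.exp(z)))

rng = np.random.default_rng(0)
z3 = rng.normal(size=10)   # stand-in for the ten pre-activations
correct_class = 3          # arbitrary choice of y*
y = np.zeros(10)
y[correct_class] = 1.0

loss_a = cross_entropy_from_probs(softmax(z3), y)
loss_b = cross_entropy_from_logits(z3, correct_class)
print(loss_a, loss_b)
```

The two numbers should match up to floating-point rounding.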



**5. We're trying to calculate $\grad_{z^{(3)}} CE(h^{(3)}, y)$. That means calculating each partial derivative $\fpartial{}{z^{(3)}_i} CE(h^{(3)}, y)$. What is the partial of the first term when $i \ne y^*$? What is the partial of the first term when $i = y^*$?**


The first term is just $-z^{(3)}_{y^*}$, so if we differentiate with respect to some $z^{(3)}_i$ where $i$ is *not* the correct class, then this derivative is zero.

Otherwise, the derivative is $-1$.


**6. Why is this term zero when $i\ne y^*$? Why is it negative when $i = y^*$?**


This is saying that changing the $z^{(3)}_i$ values for the wrong classes doesn't change the numerator. But when you increase it for the *right* class, the numerator grows, which increases the probability of the correct class, which reduces the cross-entropy loss.


**7. Now the second term! Using the rule that the derivative of $\log a$ wrt $a$ is $\frac{1}{a}$, and also the chain rule, do the first step of the derivative of the second term wrt $z^{(3)}_i$. You don't have to differentiate the inside of the log yet.**


\\[
\begin{align}
    \fpartial{}{z^{(3)}_i}
        \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
&=
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \fpartial{}{z^{(3)}_i}
        \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\end{align}
\\]


**8. Next, use the rule that the derivative of $e^a$ wrt $a$ is also $e^a$. Also eliminate unnecessary terms in the sum.**


\\[
\begin{align}
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \fpartial{}{z^{(3)}_i}
        \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
&=
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \exp\left(z^{(3)}_i\right)
\end{align}
\\]


**9. Finally, use the definition of $h^{(3)}_i = \text{SOFTMAX}\left(z^{(3)}\right)_i$ to simplify this.**


\\[
\begin{align}
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \exp\left(z^{(3)}_i\right)
&=
    h^{(3)}_i
\end{align}
\\]


**10. Why is the partial of the second term the same formula no matter whether $i$ is $y^*$?**


Because the second term is about the change to the denominator, and you change the denominator regardless of whether you are changing the correct class.

**11. Give the overall formula for the partial when $i \ne y^*$ and for $i = y^*$.**


For $i \ne y^*$ the partial is $h^{(3)}_i$; for $i = y^*$ it is $-1.0 + h^{(3)}_i$.

**12. Let's use these to write the gradient as a vector formula. Note that you want to subtract $1.0$ from exactly one term of $h^{(3)}$ and nothing from the rest...**


\\[
\grad_{z^{(3)}} CE(h^{(3)}, y)
=
h^{(3)} - y
\\]
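The formula $h^{(3)} - y$ can be checked against finite differences. The following is a minimal sketch assuming NumPy; the function names, seed, and correct class are illustrative and not taken from the workbook.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exps / np.sum(exps)

def loss_from_logits(z, y):
    """Cross entropy of softmax(z) against a one-hot y."""
    return -np.log(np.dot(softmax(z), y))

def grad_z(z, y):
    """Hand-derived gradient: softmax(z) - y."""
    return softmax(z) - y

def finite_difference_grad(z, y, eps=1e-6):
    """Central-difference estimate of each partial dCE/dz_i."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (loss_from_logits(z + step, y)
                   - loss_from_logits(z - step, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
z3 = rng.normal(size=10)
correct_class = 7          # arbitrary choice of y*
y = np.zeros(10)
y[correct_class] = 1.0

analytic = grad_z(z3, y)
numeric = finite_difference_grad(z3, y)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

Because the entries of $h^{(3)}$ sum to one and $y$ has a single $1$, the entries of the gradient sum to zero: whatever is pushed onto the correct class is taken from the others.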


**13. Let's do an intuition check. What entries of the gradient are positive? Which are negative? Why?**


All entries except $y^*$ are positive, because increasing their log odds decreases the probability on the right answer. That increases the loss.

Increasing $z^{(3)}_{y^*}$ increases the probability of the correct answer, so it reduces the loss. That entry, $h^{(3)}_{y^*} - 1$, is therefore negative.

## Backprop to $W^{(3)}$


**1. To update the weights $W^{(3)}$ we must calculate $\grad_{W^{(3)}} CE(h^{(3)}, y)$. What is the shape of this "2 dimensional gradient"?**


The shape is $(256, 10)$, the same as $W^{(3)}$.

**2. What does each entry $\left(\grad_{W^{(3)}} CE(h^{(3)}, y)\right)_{i, j}$ mean?**


Each entry of $\grad_{W^{(3)}} CE(h^{(3)}, y)$ is the partial derivative of the cross entropy with respect to the corresponding entry $W^{(3)}_{i, j}$ of $W^{(3)}$.

**3. Which value in the second hidden layer does $W^{(3)}_{i, j}$ connect to which pre-activation in the third layer?**


$W^{(3)}_{i, j}$ connects $h^{(2)}_i$ with $z^{(3)}_j$.

**4. If for a given $j$ we have $\fpartial{}{z^{(3)}_j} CE(h^{(3)}, y)$ is zero, what is $\grad_{W^{(3)}} CE(h^{(3)}, y)$ at position $(i, j)$ for all $i$ and our given $j$? Why?**


If changing $z^{(3)}_j$ has no effect on the cross entropy, then the weights that feed into $z^{(3)}_j$ (that is, $W^{(3)}_{i, j}$ for all $i$ at this $j$) have no effect either. So every entry in column $j$ of the gradient is $0$.

**5. If for a given $i$ we have $h^{(2)}_i = 0$, what is $\grad_{W^{(3)}} CE(h^{(3)}, y)$ at position $(i, j)$ for all $j$ and our given $i$? Why?**


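If the $i$-th activation of the second hidden layer is $0$, then changing the weights $W^{(3)}_{i, j}$ that multiply it has no effect on any $z^{(3)}_j$, and therefore $\grad_{W^{(3)}} CE(h^{(3)}, y)$ is $0$ at position $(i, j)$ for all $j$ at our given $i$.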

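**6. Given the above, what two "forces" does $\left(\grad_{W^{(3)}} CE(h^{(3)}, y)\right)_{i, j}$ need to combine?**

It needs to combine (a) how strongly $W^{(3)}_{i, j}$ affects $z^{(3)}_j$, which depends on the activation $h^{(2)}_i$, and (b) how strongly $z^{(3)}_j$ affects the cross entropy. If either one is zero, the entry is zero.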


**7. Apply the chain rule to $\fpartial{}{W^{(3)}_{i, j}} CE(h^{(3)}, y)$. Break this up into (a) how does $W^{(3)}_{i, j}$ change $z^{(3)}_j$ and (b) how does $z^{(3)}_j$ change the cross-entropy. Write a product of two partials, but you don't need to evaluate the partials.**


\\[
\fpartial{}{W^{(3)}_{i, j}} CE(h^{(3)}, y)
=
\fpartial{z^{(3)}_j}{W^{(3)}_{i, j}}
\fpartial{}{z^{(3)}_j} CE(h^{(3)}, y)
\\]

**8. We must calculate $\fpartial{}{W^{(3)}_{i, j}} z^{(3)}_j$. What is it? Use the formula for $z^{(3)}_j$.**


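\\[
z^{(3)}_j = \sum_{k = 0}^{255} h^{(2)}_k W^{(3)}_{k, j} + b^{(3)}_j
\\]

Only the $k = i$ term involves $W^{(3)}_{i, j}$, so:

\\[
\fpartial{}{W^{(3)}_{i, j}} z^{(3)}_j = h^{(2)}_i
\\]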

**9. We calculated $\fpartial{}{z^{(3)}_j} CE(h^{(3)}, y)$ before. What is it?**


$\left(h^{(3)} - y\right)_j$

**10. Using the last two answers, what is $\fpartial{}{W^{(3)}_{i, j}} CE(h^{(3)}, y)$ then?**


$h^{(2)}_i \left(h^{(3)} - y\right)_j$

**11. We want $\grad_{W^{(3)}} CE(h^{(3)}, y)$, which is a matrix. Each entry of the matrix consists of a product of two factors as above. What is the first factor for every entry in row $i$?**


$h^{(2)}_i$

**11. What is the second factor for every entry in column $j$?**


$\left(h^{(3)} - y\right)_j$

**12. If a vector $u$ has length $a$ and a vector $v$ has length $b$, what is the *outer product* $u \otimes v$? What is its shape?**


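Treat $u$ as a column vector of shape $(a, 1)$ and $v$ as a row vector of shape $(1, b)$, and multiply them. Entry $(i, j)$ of the result is $u_i v_j$.

The shape of the output is $(a, b)$.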
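A tiny sketch of the outer product, assuming NumPy. The vectors `u` and `v` are made-up examples with $a = 3$ and $b = 2$.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])   # length a = 3
v = np.array([10.0, 20.0])      # length b = 2

# Entry (i, j) of the outer product is u[i] * v[j].
outer = np.outer(u, v)

# Same thing written as (column vector) times (row vector).
outer_via_matmul = u.reshape(-1, 1) @ v.reshape(1, -1)

print(outer.shape)
print(outer)
```

Here `outer[2, 1]` is `u[2] * v[1]`, that is $3 \times 20 = 60$.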

**13. Can you write $\grad_{W^{(3)}} CE(h^{(3)}, y)$ as an outer product of two vectors?**


\\[
\grad_{W^{(3)}} CE(h^{(3)}, y)
=
h^{(2)} \otimes \left(h^{(3)} - y\right)
\\]
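Here is a hedged sketch of the whole output-layer weight gradient, assuming NumPy and the convention $z^{(3)} = h^{(2)} W^{(3)} + b^{(3)}$ used above. The random values stand in for real activations and weights; the names `forward_loss`, `grad_W3`, and the spot-checked index are illustrative.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exps / np.sum(exps)

def forward_loss(W3, b3, h2, y):
    """Cross entropy of the output layer, given second-layer activations h2."""
    z3 = h2 @ W3 + b3
    return -np.log(np.dot(softmax(z3), y))

rng = np.random.default_rng(0)
h2 = rng.normal(size=256)                     # stand-in activations
h2[0] = 0.0                                   # a "dead" unit, as in question 5
W3 = rng.normal(scale=0.05, size=(256, 10))   # stand-in weights
b3 = np.zeros(10)
y = np.zeros(10)
y[2] = 1.0                                    # arbitrary correct class

h3 = softmax(h2 @ W3 + b3)
grad_W3 = np.outer(h2, h3 - y)                # shape (256, 10)

# Spot-check one entry with a central difference.
i, j, eps = 7, 2, 1e-6
W_plus, W_minus = W3.copy(), W3.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
numeric_entry = (forward_loss(W_plus, b3, h2, y)
                 - forward_loss(W_minus, b3, h2, y)) / (2 * eps)
print(grad_W3[i, j], numeric_entry)
```

Row `0` of `grad_W3` comes out as all zeros because `h2[0]` is zero, which is exactly the situation described in question 5.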

**14. You just backpropagated a derivative. Way to go!!**

Thanks! I'm so proud!

## Backprop to $b^{(3)}$


**1. We next want to find $\grad_{b^{(3)}} CE(h^{(3)}, y)$. What is the shape of this gradient? Why?**


$(10,)$. There are 10 biases that influence the cross entropy, and therefore 10 partial derivatives.

**2. Let's consider a single $j$ index. Break $\fpartial{}{b^{(3)}_j} CE(h^{(3)}, y)$ into two partial factors as before.**


\\[
\fpartial{}{b^{(3)}_j} CE(h^{(3)}, y)
=
\fpartial{z^{(3)}_j}{b^{(3)}_j}
\fpartial{}{z^{(3)}_j} CE(h^{(3)}, y)
\\]

**3. Evaluate the first partial.**



\\[
\fpartial{z^{(3)}_j}{b^{(3)}_j} = 1
\\]

This is because $b^{(3)}_j$ is simply added to $z^{(3)}_j$.

**4. Therefore, what is the partial overall?**

$\left(h^{(3)} - y\right)_j$

**5. Using this, give me a vector formula for the gradient $\grad_{b^{(3)}} CE(h^{(3)}, y)$.**

$h^{(3)} - y$

**6. What must always be the relationship between $\grad_{b^{(i)}} CE(h^{(3)}, y)$ and $\grad_{z^{(i)}} CE(h^{(3)}, y)$? Why?**

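These two gradients are always the same. Each $z^{(i)}_j$ contains $b^{(i)}_j$ as a plain additive term, so $\fpartial{}{b^{(i)}_j} z^{(i)}_j = 1$, and the chain rule passes the $z^{(i)}$ gradient through to $b^{(i)}$ unchanged.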
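For the output layer, this can be checked numerically. The sketch below assumes NumPy and the convention $z^{(3)} = h^{(2)} W^{(3)} + b^{(3)}$; the random inputs and all names are illustrative.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exps / np.sum(exps)

def loss(b3, W3, h2, y):
    """Cross entropy of the output layer as a function of the bias b3."""
    z3 = h2 @ W3 + b3
    return -np.log(np.dot(softmax(z3), y))

rng = np.random.default_rng(2)
h2 = rng.normal(size=256)                     # stand-in activations
W3 = rng.normal(scale=0.05, size=(256, 10))   # stand-in weights
b3 = rng.normal(scale=0.1, size=10)           # stand-in biases
y = np.zeros(10)
y[4] = 1.0                                    # arbitrary correct class

h3 = softmax(h2 @ W3 + b3)
grad_z3 = h3 - y
grad_b3 = grad_z3.copy()  # dz3_j/db3_j = 1, so the gradient passes through

# Central-difference estimate of each partial dCE/db3_j.
eps = 1e-6
numeric_grad_b3 = np.zeros(10)
for j in range(10):
    step = np.zeros(10)
    step[j] = eps
    numeric_grad_b3[j] = (loss(b3 + step, W3, h2, y)
                          - loss(b3 - step, W3, h2, y)) / (2 * eps)

print(np.max(np.abs(grad_b3 - numeric_grad_b3)))  # should be tiny
```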