## Backprop Workbook 00: Forward Propagation

**For these questions, assume that an $x$ input has 1024 dimensions, that the first hidden layer should have $512$ units, a second layer has $256$ units, and that there are $10$ classes to choose from at the end.**

**Cell to run for LaTeX commands**

\\[
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\grad}[1]{\nabla #1}
\newcommand{\softmax}[0]{\text{SOFTMAX}}
\\]

## Questions about shapes of $X$, $W^{(i)}$, $b^{(i)}$ and $Z^{(i)}$

**1. What do the rows of $X$ represent? What do the columns of $X$ represent? What is the shape of $X$?**


Rows represent each image in the batch.

Columns represent each pixel in the image; a column of values is the same pixel's value in each of the images.

The shape of $X$ is $(bs, 1024)$. $bs$ is the "batch size", or the number of $x$ examples we want to simultaneously train on in our "batch."

**2. You have a first matrix of weights $W^{(1)}$ and a vector of biases $b^{(1)}$. What are the shapes of $W^{(1)}$ and $b^{(1)}$? Why have I written these superscripts?**


The shape of $W^{(1)}$ is: $(1024, 512)$. The shape of $b^{(1)}$ is $(1, 512)$.

This is the first set of weights and biases used to calculate the pre-activations for the first hidden layer. There will be more weights and biases.

**3. For a single $x$ input of shape $(1, 1024)$, what is the formula to calculate the hidden pre-activation values $z^{(1)}$? What is the dimensionality of $z^{(1)}$?**


Formula is:

\\[
z^{(1)} = x W^{(1)} + b^{(1)}
\\]

The dimensionality of $z^{(1)}$ is $(1, 512)$.

**4. For a batch input matrix $X$ of shape $(bs, 1024)$, what is the formula to calculate the hidden pre-activation values $Z^{(1)}$? What is the dimensionality of $Z^{(1)}$?**

**Hint**: in numpy, if $A$ has shape $(10, 24)$ and $b$ is a vector of shape $(1, 24)$, then $A + b$ is the $(10, 24)$ matrix in which $b$ has been added to each row of $A$. This is called *broadcasting* addition. You may write the formula with $+$ interpreted to allow broadcasting addition.


\\[
Z^{(1)} = X W^{(1)} + b^{(1)}
\\]

The dimensionality of $Z^{(1)}$ is $(bs, 512)$.
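
As a quick check of the shapes above, here is a minimal numpy sketch. The batch size of 4 and the random values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

bs = 4                              # assumed batch size, for illustration only
X = rng.normal(size=(bs, 1024))     # one row per example
W1 = rng.normal(size=(1024, 512))   # first-layer weights
b1 = rng.normal(size=(1, 512))      # first-layer biases

# b1 has shape (1, 512), so it is broadcast across all bs rows of X @ W1.
Z1 = X @ W1 + b1

print(Z1.shape)  # (4, 512)
```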

## Questions about the $\sigma$ function

**1. How do I convert a $3:1$ odds of winning to a $75$% probability of winning? If the odds of an event are $odds:1$, how do I convert that to a percentage?**


If the odds are $odds:1$, then the formula for $odds$ to $p$ is $p = \frac{odds}{1 + odds}$. For $3:1$ odds this gives $p = \frac{3}{1 + 3} = 0.75$, or $75$%.


**2. The $\sigma$ function turns a number from $(-\infty, \infty)$ into a number from $(0, 1)$ that can be used as a probability. What is the formula for $\sigma(z)$? Give me the version with $e^z$ in the numerator.**


$\sigma(z) = \frac{e^z}{1 + e^z}$.

**3. Let's say $f(odds) = \frac{odds}{1 + odds}$. $f$ converts an odds to a probability. Can you write sigmoid in terms of $f$?**


$\sigma(z) = f(e^z)$


**4. If we usually interpret the input of $f$ as an odds, then if we try to interpret $e^z$ as an odds, what does that imply we would interpret $z$ as?**


We interpret $z$ as the log of an odds.
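
The relationship between $f$, $\sigma$, and log-odds can be sketched in a few lines of Python. The function names below are chosen for illustration only.

```python
import math

def f(odds):
    """Convert odds of the form odds:1 into a probability."""
    return odds / (1 + odds)

def sigmoid(z):
    """Sigmoid written as f applied to e^z, so z plays the role of log-odds."""
    return f(math.exp(z))

z = math.log(3)      # the log of 3:1 odds
p = sigmoid(z)       # approximately 0.75, the same as f(3)
print(p)
```

Here `sigmoid(0.0)` corresponds to odds of $1:1$ and so returns `0.5`.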


**5a. What $z$ value has $\sigma(z) = 0.50$ (50% probability) equivalent to an odds of $1.0$?**



$z = 0.0$.


**5b. When is the probability $<0.5$, when is the probability $>0.5$?**



When $z$ is negative the probability will be less than half, when $z$ is positive probability will be greater than half.

Note: $\sigma(z)$ isn't necessarily a probability. It can be the "percent activated."


**6a. What is the problem numerically with the $\sigma(z) = \frac{e^z}{1+e^z}$ formula? What happens when $z$ is really large? What will we calculate on our CPU? What do we *want* to calculate theoretically?**


When $z$ is really large then the floating point representation of $e^z$ can overflow and be $\infty$.

That's a problem because both the numerator and denominator will be $\infty$, so their ratio is `NaN` (not a number). We want it to be $1.0$.


**6b. Is there a problem for very negative $z$s with this formula?**


No. For very negative $z$, $e^z$ rounds to $0.0$, so we compute zero divided by one, which is zero. That is the correct answer.


**7. How do we fix this problem? What's the better formula for a computer?**


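$\sigma(z) = \frac{1}{e^{-z} + 1}$. It works for very negative and very positive $z$ values. For very positive $z$, $e^{-z}$ rounds to $0.0$ and the result is $1.0$. For very negative $z$, $e^{-z}$ overflows to $\infty$, and one divided by $\infty$ is $0.0$, which is still correct.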
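
The difference between the two formulas can be observed directly in numpy. This is a minimal sketch; the function names are illustrative, and `np.errstate` is used only to silence the overflow warnings numpy would otherwise print.

```python
import numpy as np

def sigmoid_naive(z):
    # e^z / (1 + e^z): for large z both parts overflow to inf, giving nan.
    with np.errstate(over="ignore", invalid="ignore"):
        return np.exp(z) / (1 + np.exp(z))

def sigmoid_better(z):
    # 1 / (e^-z + 1): an overflow in e^-z still yields the correct 0.0.
    with np.errstate(over="ignore"):
        return 1 / (np.exp(-z) + 1)

z = np.array([-1000.0, 0.0, 1000.0])
naive = sigmoid_naive(z)     # last entry is nan
better = sigmoid_better(z)   # last entry is 1.0
print(naive)
print(better)
```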


## Activations and Pre-Activations

**1. How do I calculate the activation values $H^{(1)}$ from $Z^{(1)}$?**


\\[
H^{(1)} = \sigma(Z^{(1)})
\\]


**2. What is another name for a function like $\sigma$ when used to calculate hidden activations? What are other examples?**


It is called an activation function. Other examples are ReLU and softmax.

**3. What do I call the linear transformation of the $X$ values before being input into the activation function? What symbols do I use?**


These are called pre-activations, and we write them as $z^{(1)}$ (or $Z^{(1)}$ for a batch).

**4. What is the purpose of an activation function? What is an interpretation of the activation function $\sigma$?**


Without activation functions, we just have a series of linear functions, which is equivalent to just a single linear function.

We want the neurons to be able to represent binary feature detectors: the feature is present or not present. Thus it makes sense to reduce to a range of $(0, 1)$ with zero meaning "not detected" and one meaning "detected."

Intermediate values mean "sort-of detected."

Because of the non-linearity, subsequent layers look for features in the first-layer features.
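
The claim that a series of linear layers is equivalent to a single linear layer can be checked numerically. In this sketch the batch size and random values are arbitrary assumptions; `W` and `b` are the combined weights and biases obtained by multiplying out the two layers.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 1024))
W1 = rng.normal(size=(1024, 512))
b1 = rng.normal(size=(1, 512))
W2 = rng.normal(size=(512, 256))
b2 = rng.normal(size=(1, 256))

# Two linear layers applied one after the other, with no activation function.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```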

**TODO**: Something something universal function approximators.

## Investigating columns and rows of $W^{(i)}$ and $Z^{(i)}$

**1. How do we notate the $i$th row of $W^{(1)}$? The $j$th column?**


$W^{(1)}_{i, :}$ and $W^{(1)}_{:, j}$.

**2. What is the formula for calculating a specific pre-activation $z^{(1)}_j$ for a single input $x$? Write as a vector operation with a dot-product. Then write the formula replacing the dot-product with an explicit sum using $\sum$.**


\\[
\begin{align}
z^{(1)}_j &= x \cdot W^{(1)}_{:, j} + b^{(1)}_j
\\
z^{(1)}_j &= \left(
    \sum_{i = 0}^{1023} x_i W^{(1)}_{i, j}
\right) + b^{(1)}_j
\end{align}
\\]
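
Both forms can be compared in numpy. This sketch uses flat vectors of shape `(1024,)` and `(512,)` for $x$ and $b^{(1)}$ to keep the indexing simple, and the choice of `j = 7` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1024)            # a single input
W1 = rng.normal(size=(1024, 512))
b1 = rng.normal(size=512)
j = 7                                # an arbitrary hidden unit

# Dot product of x with the j-th column of W1.
z_j_dot = x @ W1[:, j] + b1[j]

# The same value as an explicit sum over the 1024 input dimensions.
z_j_sum = sum(x[i] * W1[i, j] for i in range(1024)) + b1[j]

print(np.isclose(z_j_dot, z_j_sum))  # True
```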

**3. What does a column of $W^{(1)}_{:, j}$ represent? What does a row of $W^{(1)}_{i, :}$ represent?**


The column consists of weights for each input dimension $x_i$ used to calculate the preactivation $z^{(1)}_j$. They are the weights for the $j$th hidden unit.

The row consists of weights for a single input dimension $x_i$ used to compute the contribution of $x_i$ to each of the hidden pre-activations $z^{(1)}_j$.

## 2nd Hidden Layer, Output Layer


**1. What are the dimensions of $W^{(2)}$ and $b^{(2)}$?**


$(512, 256)$ and $(1, 256)$

**2. What is the dimension of the first-layer activations $H^{(1)}$?**


$(bs, 512)$

**3. What are the formulas for $Z^{(2)}$ and $H^{(2)}$?**


\\[
\begin{align}
Z^{(2)} &= H^{(1)} W^{(2)} + b^{(2)}
\\
H^{(2)} &= \sigma(Z^{(2)})
\end{align}
\\]

**4. What does the row $W^{(2)}_{i, :}$ represent? What does the column $W^{(2)}_{:, j}$ represent?**


The $j$th column of $W^{(2)}$ represents the weights for each output of the first hidden layer used to compute the $j$th pre-activation of the second hidden layer.

The $i$th row of $W^{(2)}$ consists of the weights for the $i$th activation of the hidden layer $H^{(1)}$. They are used to compute the contribution of the $i$th hidden unit of the first layer to the hidden pre-activations $z^{(2)}_j$ for all $j$.

**5. What are the dimensions of $W^{(3)}, b^{(3)}$? What is the shape of $Z^{(3)}$?**


$W^{(3)}$ is $(256, 10)$ and $b^{(3)}$ is $(1, 10)$. $Z^{(3)}$ is $(bs, 10)$.


**6. What is the formula for calculating $Z^{(3)}$?**


\\[
\begin{align}
Z^{(3)} &= H^{(2)} W^{(3)} + b^{(3)}
\end{align}
\\]
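
Putting the three layers together, the shapes can be traced through in numpy. This is only a sketch: the batch size, the zero biases, and the `0.01` scale on the random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (np.exp(-z) + 1)

rng = np.random.default_rng(0)
bs = 4  # assumed batch size

X = rng.normal(size=(bs, 1024))
W1, b1 = 0.01 * rng.normal(size=(1024, 512)), np.zeros((1, 512))
W2, b2 = 0.01 * rng.normal(size=(512, 256)), np.zeros((1, 256))
W3, b3 = 0.01 * rng.normal(size=(256, 10)), np.zeros((1, 10))

Z1 = X @ W1 + b1      # (bs, 512)
H1 = sigmoid(Z1)      # (bs, 512)
Z2 = H1 @ W2 + b2     # (bs, 256)
H2 = sigmoid(Z2)      # (bs, 256)
Z3 = H2 @ W3 + b3     # (bs, 10)

print(H1.shape, H2.shape, Z3.shape)  # (4, 512) (4, 256) (4, 10)
```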

## Output Layer Activations: $\softmax$

**1. What is the formula for calculating $H^{(3)}$? Hint: you don't use $\sigma$ this time!**


\\[
\begin{align}
H^{(3)} &= \softmax(Z^{(3)})
\end{align}
\\]

**2. The $\softmax$ function maps a 10-dimensional vector $z^{(3)}$ to a 10-dimensional vector $h^{(3)}$. How is $h^{(3)}_i$ calculated?**


\\[
h^{(3)}_i = \frac{
    \exp \left(z^{(3)}_i\right)
}{
    \sum_{j = 0}^9 \exp \left(z^{(3)}_j\right)
}
\\]
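
A direct transcription of this formula into numpy, applied to each row of a matrix, might look like the sketch below. The example input is made up. Note that this direct form can overflow for large entries, just as the $\frac{e^z}{1 + e^z}$ formula for $\sigma$ does.

```python
import numpy as np

def softmax(Z):
    """Apply softmax independently to each row of Z."""
    exp_Z = np.exp(Z)
    # keepdims=True keeps shape (rows, 1) so the division broadcasts per row.
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

# One made-up example with 10 pre-activations; class 2 has the largest value.
Z3 = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
H3 = softmax(Z3)

print(H3.shape)          # (1, 10)
print(H3.sum(axis=1))    # each row sums to 1 (up to rounding)
```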

**3. How is the $\softmax$ function like the $\sigma$ function?**


**TODO**.

## Target Outputs $y^*$, $y$, and $Y$

**1. In our problem, we want to classify an input $x$ as one of ten classes. Let's represent the correct answer in our training dataset as $y^*$. What is the shape of $y^*$? What is its range of values?**


Shape of $y^*$ is `()` or just a scalar. The range is zero to nine.


**2. We will denote the one-hot encoding of $y^*$ as simply $y$. What is the shape and range of the values of $y$?**


This is a ten-dimensional vector, where all the values are zero, except at one position. At the position $y^*$, the value is $1.0$.

In a sense, the one-hot $y$ representation is a "perfect" probability distribution for the correct answer.


**3. What is the format for $Y$, which is the one hot encoding of the correct class $y^*$ for each example $x$ in the batch?**



Shape is $(bs, 10)$ and each row $i$ is a one hot encoding of $y_i^*$ ($i$-th correct class).
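
One common way to build such a $Y$ in numpy is to index into an identity matrix. The class labels below are hypothetical.

```python
import numpy as np

y_star = np.array([3, 0, 9])   # hypothetical correct classes for a batch of 3

# Row k of the 10x10 identity matrix is the one-hot vector for class k,
# so indexing with y_star picks out one one-hot row per example.
Y = np.eye(10)[y_star]

print(Y.shape)  # (3, 10)
```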


## Loss Function: Preliminaries

**1. What properties do we want out of our last hidden layer $h^{(3)}$ (aka, the output layer)?**


All values $h^{(3)}_i$ must be between zero and one because otherwise they're not a valid probability.

The probabilities should sum to one so that $h^{(3)}$ forms a proper probability *distribution.*

For an input $x$, we ideally want $h^{(3)}$ to be equal to the one-hot encoding $y$: be all zeros except for at the position of the correct answer, where it has a value 1.0.

**2. The probability we assign to the correct class is $h^{(3)}_{y^*}$. What is the ideal value of $h^{(3)}_{y^*}$?**


1.0 or 100%.


**3. What is the ideal value of $\log h^{(3)}_{y^*}$?**


0.0


**4. What is the worst value of $h^{(3)}_{y^*}$? And $\log h^{(3)}_{y^*}$?**


0.0 and $-\infty$.


**5. If larger values of $h^{(3)}_{y^*}$ are better, then what values of $\log h^{(3)}_{y^*}$ are better? Why?**


Larger ones, because $\log$ is monotonically increasing: a larger $h^{(3)}_{y^*}$ always gives a larger $\log h^{(3)}_{y^*}$.

**6. What are the properties of a loss function?**


Loss function should be non-negative. Should be zero when perfect/correct.

The worse the prediction the greater the loss function.


**7. Can we use $\log h^{(3)}_{y^*}$ by itself as a loss function? Why? What do we have to do to use $\log h^{(3)}_{y^*}$ as a loss function?**


No. It goes negative, and greater values are better, whereas a loss should be non-negative and grow as the prediction gets worse.

We use $-\log h^{(3)}_{y^*}$ as the loss function.


**8. Is there a deep reason for using $-\log h^{(3)}_{y^*}$ rather than say $-h^{(3)}_{y^*}$?**


**TODO**: Maximum likelihood of dataset, add up cross entropy losses.

**9. What do we call this loss function?**


Cross entropy.

## Cross Entropy Calculations

**1. If I give you $h^{(3)}$ and a one-hot encoding $y$ for a single example, what is the formula for the cross-entropy loss?**


\\[
\begin{align}
{CE}_{\text{vector}} (h^{(3)}, y) &= -\log\left(
    h^{(3)} \cdot y
\right)
\\
&= -\log\left(
    \sum_{i = 0}^{9} h^{(3)}_i y_i
\right)
\end{align}
\\]

**2. Now that you can write ${CE}_{\text{vector}}$, can you write ${CE}_{\text{matrix}}$? You may use a summation $\sum$ over each row of $H^{(3)}$ and $Y$, and you may use ${CE}_{\text{vector}}$.**


\\[
\begin{align}
{CE}_{\text{matrix}}(H^{(3)}, Y)
=
\sum_{i = 0}^{bs - 1} {CE}_{\text{vector}}\left(
    H^{(3)}_{i, :},
    Y_{i, :}
\right)
\end{align}
\\]

**3a. We use Python loops to implement $\sum$s. Python loops are slow. Numpy operations are fast. Let's step-by-step learn to eliminate the explicit summation for ${CE}_{\text{matrix}}$. Okay?**


Yeah!


**3b. ${CE}_{\text{matrix}}$ calls ${CE}_{\text{vector}}$ for each pair of corresponding rows of $H^{(3)}$ and $Y$. That involves taking the negative log of a dot product. Let's perform this dot product using numpy. A dot product first multiplies corresponding entries in a vector. How do we get numpy to do this for a matrix?**


\\[
H^{(3)} * Y \quad\text{(numpy)}
\\
H^{(3)} \odot Y \quad\text{(math)}
\\]

**3c. The next step of a dot product is to sum out the products. We need to do this per row. We can do this using `np.sum`. What named argument must we pass `np.sum`? What is the shape of the result? How would you describe the result in words?**


\\[
\text{np.sum}(H^{(3)} * Y, axis = 1) \quad\text{(numpy)}
\\]

This is a vector of shape $(bs,)$. The $i$-th entry is equal to the probability assigned to the correct class $y_i^*$.


**3d. Use the above formula to calculate the cross entropies for each example in the batch ${CE}_{\text{vector}} (H^{(3)}_{i, :}, Y_{i, :})$. What is the shape of this?**


\\[
-\log \text{np.sum}(H^{(3)} * Y, axis = 1) \quad\text{(numpy)}
\\]

The shape is $(bs,)$.

**3e. Last, use `np.sum` to calculate the mean cross entropy for the batch.**


\\[
\text{np.sum}\left(
    -\log \text{np.sum}(H^{(3)} * Y, axis = 1),
    axis = 0
\right)
/ bs
\\]
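
The steps above can be checked against the loop-based definition. In this sketch the function names, the batch size, and the randomly generated `H3` and `Y` are all assumptions for illustration; `H3` is simply normalized so each row sums to one.

```python
import numpy as np

def ce_vector(h, y):
    """Cross entropy for one example: negative log of the dot product."""
    return -np.log(np.dot(h, y))

def ce_matrix_loop(H, Y):
    """Sum of per-example cross entropies, using an explicit Python loop."""
    return sum(ce_vector(H[i], Y[i]) for i in range(H.shape[0]))

def ce_mean_vectorized(H, Y):
    """Mean cross entropy over the batch, with no Python loop."""
    bs = H.shape[0]
    return np.sum(-np.log(np.sum(H * Y, axis=1)), axis=0) / bs

rng = np.random.default_rng(0)
bs = 5

# Made-up output probabilities: positive rows normalized to sum to one.
H3 = rng.random(size=(bs, 10)) + 0.1
H3 = H3 / H3.sum(axis=1, keepdims=True)

# Made-up one-hot targets.
Y = np.eye(10)[rng.integers(0, 10, size=bs)]

loop_mean = ce_matrix_loop(H3, Y) / bs
vectorized_mean = ce_mean_vectorized(H3, Y)
print(np.isclose(loop_mean, vectorized_mean))  # True
```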
