# Backpropagation through time

## RNN review

*An RNN with a single hidden layer with ReLU activation, and an output layer with softmax, is defined by what?*
1. $W_{xh}$
2. $W_{hh}$
3. $b_h$
4. $W_{hq}$
5. $b_q$

In particular, the equivalent of y_netIn ~ $W_{xh} x_t + W_{hh} h_{t-1} + b_h$.

And the equivalent of y_netAct ~ $h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$, where $\phi$ is the ReLU.

And the equivalent of z_netIn ~ $W_{hq} h_t + b_q$.
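To make the mapping concrete, here is a minimal NumPy sketch of one RNN time step. The function names, the column-vector layout, and the sizes are assumptions made for illustration, not part of the project code.

```python
import numpy as np

def softmax(a):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hq, b_q):
    h_netIn = W_xh @ x_t + W_hh @ h_prev + b_h  # equivalent of y_netIn
    h_t = np.maximum(h_netIn, 0)                # equivalent of y_netAct (ReLU)
    o_t = W_hq @ h_t + b_q                      # equivalent of z_netIn
    return h_t, softmax(o_t)

# Hypothetical sizes: 4 inputs, 3 hidden units, 2 output classes.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(3, 4))
W_hh = rng.normal(scale=0.1, size=(3, 3))
b_h = np.zeros((3, 1))
W_hq = rng.normal(scale=0.1, size=(2, 3))
b_q = np.zeros((2, 1))

x_t = rng.normal(size=(4, 1))
h_prev = np.zeros((3, 1))  # initial hidden state
h_t, p_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hq, b_q)
print(h_t.shape, p_t.shape)
```

The only difference from the MLP forward pass is the extra $W_{hh} h_{t-1}$ term, which is how the previous time step enters the current one.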

## One more step!

*An RNN with two hidden layers, each with ReLU activation, and an output layer with softmax, is defined by what?*

1. $W_{xh1}$
2. $W_{hh1}$
3. $b_{h1}$
4. $W_{xh2}$
5. $W_{hh2}$
6. $b_{h2}$
7. $W_{hq}$
8. $b_q$

And we will be tracking two hidden states, one for each hidden layer.
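Here is a rough sketch of one time step for the two-hidden-layer case. It assumes the common stacked arrangement where the second layer takes the first layer's current hidden state as its input; the dictionary of parameters and the sizes are hypothetical.

```python
import numpy as np

def two_layer_rnn_step(x_t, h1_prev, h2_prev, p):
    # Layer 1 reads the input and its own previous hidden state.
    h1_t = np.maximum(p["W_xh1"] @ x_t + p["W_hh1"] @ h1_prev + p["b_h1"], 0)
    # Layer 2 reads layer 1's current hidden state (an assumption)
    # and its own previous hidden state.
    h2_t = np.maximum(p["W_xh2"] @ h1_t + p["W_hh2"] @ h2_prev + p["b_h2"], 0)
    # The output layer only sees the top hidden layer.
    o_t = p["W_hq"] @ h2_t + p["b_q"]
    return h1_t, h2_t, o_t

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 3, 5, 2  # hypothetical sizes
p = {
    "W_xh1": rng.normal(scale=0.1, size=(n_h1, n_in)),
    "W_hh1": rng.normal(scale=0.1, size=(n_h1, n_h1)),
    "b_h1": np.zeros((n_h1, 1)),
    "W_xh2": rng.normal(scale=0.1, size=(n_h2, n_h1)),
    "W_hh2": rng.normal(scale=0.1, size=(n_h2, n_h2)),
    "b_h2": np.zeros((n_h2, 1)),
    "W_hq": rng.normal(scale=0.1, size=(n_out, n_h2)),
    "b_q": np.zeros((n_out, 1)),
}

# Two hidden states are carried from step to step, one per layer.
h1, h2 = np.zeros((n_h1, 1)), np.zeros((n_h2, 1))
for x_t in rng.normal(size=(6, n_in, 1)):  # six time steps
    h1, h2, o_t = two_layer_rnn_step(x_t, h1, h2, p)
```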

# Backpropagation in an MLP

An MLP looks quite similar to an RNN. Imagine an MLP with one hidden layer with ReLU activation, and an output layer with softmax. Bring up your implementation of backpropagation from project 2. Here's the forward pass:
1. y_netIn = X@y_wts + y_b
2. y_netAct = max(y_netIn, 0)
3. z_netIn = y_netAct@z_wts + z_b
4. z_netAct = exp(z_netIn) / sum(exp(z_netIn), axis=1, keepdims=True) (softmax)
5. loss (cross entropy)

What are the derivatives you calculated?

1. dz_netAct (meets loss from above and z_netIn from below): -1/(len(z_netAct)*z_netAct) (derivative of cross entropy loss)
2. Three parts!
   * dz_netIn (meets dz_netAct from above and z_wts from below): dz_netAct * z_netAct * (one_hot(y, num_outputs) - z_netAct)
   * dz_wts (meets dz_netIn from above and y_netAct): (dz_netIn.T@y_netAct).T + reg*z_wts
   * dz_b (meets dz_netIn from above): sum(dz_netIn, axis=0)
3. dy_netAct (meets dz_netIn from above and z_wts from below): dz_netIn@z_wts.T
4. Three parts!
   * dy_netIn (meets dy_netAct from above and y_wts from below): dy_netAct*(np.where(y_netIn <= 0, 0, 1))
   * dy_wts (meets dy_netIn from above and X): (dy_netIn.T@X).T + reg*y_wts
   * dy_b: sum(dy_netIn, axis=0)
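The derivatives above can be collected into a short NumPy sketch. This is not the project 2 solution, just one way the pieces fit together. It assumes a softmax output, integer class labels, and an L2 penalty of `0.5 * reg * sum(w**2)` (so that its gradient is `reg * w`); the function names and sizes are made up for illustration.

```python
import numpy as np

def one_hot(y, num_classes):
    out = np.zeros((len(y), num_classes))
    out[np.arange(len(y)), y] = 1
    return out

def forward(X, y, y_wts, y_b, z_wts, z_b, reg=0.0):
    y_netIn = X @ y_wts + y_b
    y_netAct = np.maximum(y_netIn, 0)
    z_netIn = y_netAct @ z_wts + z_b
    e = np.exp(z_netIn - z_netIn.max(axis=1, keepdims=True))
    z_netAct = e / e.sum(axis=1, keepdims=True)  # softmax
    loss = -np.mean(np.log(z_netAct[np.arange(len(y)), y]))
    loss += 0.5 * reg * (np.sum(y_wts**2) + np.sum(z_wts**2))  # assumed L2 form
    return y_netIn, y_netAct, z_netAct, loss

def backward(X, y, y_netIn, y_netAct, z_netAct, y_wts, z_wts, reg=0.0):
    dz_netAct = -1 / (len(z_netAct) * z_netAct)
    dz_netIn = dz_netAct * z_netAct * (one_hot(y, z_netAct.shape[1]) - z_netAct)
    dz_wts = (dz_netIn.T @ y_netAct).T + reg * z_wts
    dz_b = np.sum(dz_netIn, axis=0)
    dy_netAct = dz_netIn @ z_wts.T
    dy_netIn = dy_netAct * np.where(y_netIn <= 0, 0, 1)
    dy_wts = (dy_netIn.T @ X).T + reg * y_wts
    dy_b = np.sum(dy_netIn, axis=0)
    return dy_wts, dy_b, dz_wts, dz_b

# Hypothetical data: 5 samples, 4 features, 6 hidden units, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = np.array([0, 2, 1, 2, 0])
y_wts, y_b = rng.normal(scale=0.1, size=(4, 6)), np.zeros(6)
z_wts, z_b = rng.normal(scale=0.1, size=(6, 3)), np.zeros(3)

y_netIn, y_netAct, z_netAct, loss = forward(X, y, y_wts, y_b, z_wts, z_b, reg=0.1)
dy_wts, dy_b, dz_wts, dz_b = backward(X, y, y_netIn, y_netAct, z_netAct,
                                      y_wts, z_wts, reg=0.1)
```

A quick sanity check is to nudge one weight by a small amount, recompute the loss, and compare the finite difference with the matching entry of the analytic gradient.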

# Backpropagation in an RNN

Let's stick with the one-hidden-layer RNN defined before. Here's the forward pass:

1. $h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
2. $o_t = W_{hq} h_t + b_q$

The loss is $\frac{1}{T} \sum_{t=1}^T l(o_t, y_t)$.

So for backprop, we need to calculate:

1. $dy_t$ (analogous to dz_netAct and dz_netIn combined, since softmax and cross entropy collapse into one step): (softmax($o_t$) - one_hot($y_t$))/T
2. Two parts!
   * $dW_{hq}$ (meets dy from above and the output from h from below, which we will call h_netAct): dy_t@h_netAct_t.T
   * $db_{q}$ (meets dy from above and nothing from below): dy_t
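3. $dh_t$ (analogous to dy_netAct): $W_{hq}$.T@$dy_t$ + $dhnext_t$, where $dhnext_t$ = $W_{hh}$.T@$dhraw_{t+1}$ is the gradient flowing back from time step $t+1$ (zero at the last time step), **and then backprop through the activation, whether tanh or ReLU, to get $dhraw_t$**
4. Three parts, each summed over all time steps because the weights are shared!
   * $dW_{hh}$ (meets dh from above and $h_{t-1}$): $dhraw_t$@$h_{t-1}$.T
   * $dW_{xh}$ (meets dh from above and $x_t$): $dhraw_t$@$x_t$.T
   * $db_h$ (meets dh from above): $dhraw_t$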

Take some time and define these.
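Once you have tried it yourself, here is one possible sketch to compare against. It follows the structure of Karpathy's min-char-rnn but is not a copy of it. It assumes column-vector inputs, integer class labels, a ReLU hidden layer, and a softmax output with cross-entropy loss averaged over the $T$ steps; the function name `bptt` and the sizes are made up.

```python
import numpy as np

def bptt(xs, ys, h0, W_xh, W_hh, b_h, W_hq, b_q):
    """Return the average loss and the gradients for one sequence.

    xs: list of (num_in, 1) column vectors; ys: list of integer class labels.
    """
    T = len(xs)
    hs, ps = {-1: h0}, {}
    loss = 0.0
    # Forward pass: cache hidden states and softmax outputs for every step.
    for t in range(T):
        hs[t] = np.maximum(W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h, 0)
        o_t = W_hq @ hs[t] + b_q
        e = np.exp(o_t - np.max(o_t))
        ps[t] = e / np.sum(e)
        loss += -np.log(ps[t][ys[t], 0]) / T

    dW_xh, dW_hh, db_h = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(b_h)
    dW_hq, db_q = np.zeros_like(W_hq), np.zeros_like(b_q)
    dhnext = np.zeros_like(h0)  # nothing flows back into the last step
    # Backward pass: walk the time steps in reverse, accumulating gradients.
    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[ys[t]] -= 1                 # softmax + cross entropy in one step
        dy /= T                        # the loss is averaged over T steps
        dW_hq += dy @ hs[t].T
        db_q += dy
        dh = W_hq.T @ dy + dhnext      # from the output and from step t+1
        dhraw = dh * (hs[t] > 0)       # backprop through the ReLU
        dW_hh += dhraw @ hs[t - 1].T
        dW_xh += dhraw @ xs[t].T
        db_h += dhraw
        dhnext = W_hh.T @ dhraw        # passed back to step t-1
    grads = {"W_xh": dW_xh, "W_hh": dW_hh, "b_h": db_h, "W_hq": dW_hq, "b_q": db_q}
    return loss, grads

# Hypothetical sizes: 4 inputs, 3 hidden units, 2 classes, 5 time steps.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(3, 4))
W_hh = rng.normal(scale=0.5, size=(3, 3))
b_h = np.zeros((3, 1))
W_hq = rng.normal(scale=0.5, size=(2, 3))
b_q = np.zeros((2, 1))
xs = [rng.normal(size=(4, 1)) for _ in range(5)]
ys = [0, 1, 1, 0, 1]
h0 = np.zeros((3, 1))

loss, grads = bptt(xs, ys, h0, W_xh, W_hh, b_h, W_hq, b_q)
```

Notice that the only new ingredient compared with the MLP is `dhnext`: each hidden state affects the loss both through its own output and through every later hidden state.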

*How would you handle two hidden layers?*

# Dealing with long histories

The textbook outlines three strategies:
1. Full computation
2. Regular truncation
3. Random truncation

Let's talk about each.
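As a starting point for the discussion, here is a small sketch of the bookkeeping behind regular truncation. It shows the common practice of splitting a sequence into fixed-length windows and backpropagating only within each one; it is not the textbook's exact formulation, and it does not cover random truncation. The helper name `truncation_windows` is hypothetical.

```python
def truncation_windows(T, k):
    """Yield (start, stop) index pairs covering steps 0..T-1 in chunks of at most k."""
    for start in range(0, T, k):
        yield start, min(start + k, T)

# A sequence of 10 steps, truncated every 4 steps.
windows = list(truncation_windows(T=10, k=4))
print(windows)  # [(0, 4), (4, 8), (8, 10)]

# For each window you would:
#   1. run the forward pass on steps start..stop-1, starting from the hidden
#      state left over from the previous window,
#   2. backpropagate only inside the window (the incoming gradient from the
#      future is reset to zero at the window's last step),
#   3. update the weights and carry the final hidden state forward.
```

Setting `k = T` gives a single window, which corresponds to full computation.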

# Karpathy again!

http://karpathy.github.io/2015/05/21/rnn-effectiveness/