# Deep Learning Book

# Chapter 10 Sequence Modeling

## Teacher Forcing, p372

An RNN whose only recurrence is the feedback connection from the output to the hidden layer, rather than from the hidden layer to itself, is **less powerful** than one with direct connections from hidden layer $h^{(t-1)}$ to $h^{(t)}$ (p370).

The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time $t$ to the training target at time $t$, all the time steps are decoupled. Training can thus be parallelized, with the gradient for each step $t$ computed in isolation.

Teacher forcing is a procedure that emerges from the maximum likelihood criterion: during training, the model receives the ground truth output $y^{(t)}$ as input at time $t+1$ instead of its own output from the previous step. Because the ground truth is known in advance, the time steps can be trained in parallel.
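
A minimal NumPy sketch of why teacher forcing decouples the time steps. It assumes a hypothetical RNN whose only recurrence is output-to-hidden; the parameter names (`U`, `W`, `V`, `b`, `c`), the sizes, and the zero vector fed at the first step are illustrative choices, not taken from the book. The loop and the batched version should give the same outputs, because every "previous output" is a known target.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 3, 4, 2

# Hypothetical parameters: the only recurrence is previous output -> hidden.
U = rng.normal(size=(n_hid, n_in))    # input-to-hidden
W = rng.normal(size=(n_hid, n_out))   # previous-output-to-hidden
V = rng.normal(size=(n_out, n_hid))   # hidden-to-output
b = np.zeros(n_hid)
c = np.zeros(n_out)

x = rng.normal(size=(T, n_in))                       # input sequence
y = np.eye(n_out)[rng.integers(0, n_out, size=T)]    # one-hot targets

def step(x_t, prev_out):
    """One time step, conditioned on whatever is fed back as the previous output."""
    h_t = np.tanh(b + W @ prev_out + U @ x_t)
    return c + V @ h_t

# Teacher forcing: step t is fed the ground truth y[t-1], not the model's own output.
y_prev = np.vstack([np.zeros(n_out), y[:-1]])        # targets shifted right by one step

# Sequential form.
o_seq = np.stack([step(x[t], y_prev[t]) for t in range(T)])

# y_prev is known in advance, so all steps can be computed in one batched pass.
H = np.tanh(b + y_prev @ W.T + x @ U.T)
o_par = c + H @ V.T
```

At test time the ground truth is not available, so the model's own output has to be fed back and the steps become sequential again.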

## BPTT

The back-propagation algorithm applied to the unrolled graph, with $O(\tau)$ cost, is called **back-propagation through time (BPTT)**. As soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary.

The example is based on the RNN architecture given in figure 10.3 on page 369. This network is described as:

$$
\begin{aligned}
a^{(t)} &= b + W h^{(t-1)} + U x^{(t)}, &(10.8)\\
h^{(t)} &= \tanh\big(a^{(t)}\big), &(10.9)\\
o^{(t)} &= c + V h^{(t)}, &(10.10)\\
\hat{y}^{(t)} &= \text{softmax}\big( o^{(t)} \big), &(10.11)
\end{aligned}
$$

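Where:

* $x$ is the input sequence
* $h$ is the hidden state
* $o$ is the output (unnormalized log probabilities) for the corresponding $x$
* $y$ is the training target for the corresponding $x$
* $b$, $c$ are the bias vectors
* $U$ is the weight matrix for input-to-hidden connections
* $W$ is the weight matrix for hidden-to-hidden connections
* $V$ is the weight matrix for hidden-to-output connections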
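
The forward pass in equations (10.8) to (10.11) can be sketched in NumPy as follows. The function name `rnn_forward`, the layer sizes, and the random initialization are assumptions made for illustration only.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Run the RNN over a sequence; returns hidden states and predictions."""
    h = h0
    hs, y_hats = [], []
    for x_t in x_seq:
        a = b + W @ h + U @ x_t      # (10.8)
        h = np.tanh(a)               # (10.9)
        o = c + V @ h                # (10.10)
        y_hats.append(softmax(o))    # (10.11)
        hs.append(h)
    return np.stack(hs), np.stack(y_hats)

# Illustrative sizes and random parameters (assumptions, not from the book).
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 6
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)

hs, y_hats = rnn_forward(rng.normal(size=(T, n_in)), np.zeros(n_hid), U, W, V, b, c)
```

Each row of `y_hats` is a probability distribution over the output classes, and each `h` depends on the previous one, which is why the loop cannot be parallelized over time.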

First, the **derivative of `softmax(x)`** (see this [post](http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/)), using the quotient rule:

$$ \big(\frac{f}{g}\big)' = \frac{f'g - fg'}{g^2} $$

Let $f = e^{x_i}$, $g = \sum_k e^{x_k}$, and $y = \text{softmax}(x)$ with $x \in \mathbb{R}^n$, so that $y_i = f / g$ is the result for $x_i$.

For $i = j$:

$$
\begin{aligned}
\frac{\partial y_i}{\partial x_i} &= \frac{e^{x_i} g - e^{x_i} (0 + \cdots + e^{x_i} + 0 + \cdots + 0)}{g^2} \\
&= \frac{e^{x_i} g - e^{x_i} e^{x_i}}{g^2} \\
&= \frac{e^{x_i} (g - e^{x_i})}{g^2} \\
&= \frac{e^{x_i}}{g} \times \frac{g - e^{x_i}}{g}\\
&= y_i \times (1 - \frac{e^{x_i}}{g}) \\
&= y_i \times (1 - y_i)
\end{aligned}
$$

For $i \neq j$:

$$
\begin{aligned}
\frac{\partial y_i}{\partial x_j} &= \frac{0 \times g - e^{x_i} e^{x_j}}{g^2} \\
&= -\frac{e^{x_i}}{g} \frac{e^{x_j}}{g} \\
&= - y_i y_j
\end{aligned}
$$
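
The two cases combine into the Jacobian $\text{diag}(y) - y y^\top$. A quick sanity check, as a sketch: compare that closed form against central finite differences. The helper names and the test vector are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """J[i, j] = dy_i/dx_j: y_i (1 - y_i) on the diagonal, -y_i y_j elsewhere."""
    y = softmax(x)
    return np.diag(y) - np.outer(y, y)

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of f at x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x)
J_num = numerical_jacobian(softmax, x)
```

Every column of `J` sums to zero, since the softmax outputs always sum to 1.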

Back to BPTT, we start from the node immediately preceding the final loss:

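$$ \frac{\partial L}{\partial L^{(t)}} = 1 $$

Then, taking $L^{(t)}$ to be the negative log-likelihood of $y^{(t)}$ under the softmax output $\hat{y}^{(t)}$ and applying the softmax derivative above, we have

$$(\nabla_{o^{(t)}}L)_i = \frac{\partial L}{\partial o^{(t)}_i} = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o^{(t)}_i} = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}} $$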
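
The gradient with respect to $o^{(t)}$ can be checked numerically. This sketch assumes the per-step loss is the negative log-likelihood of the target class under the softmax output; the function names `nll` and `grad_o` and the example values are illustrative.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def nll(o, target):
    """Assumed per-step loss: negative log-likelihood of the target class."""
    return -np.log(softmax(o)[target])

def grad_o(o, target):
    """Closed form: softmax(o) minus the one-hot encoding of the target."""
    g = softmax(o)
    g[target] -= 1.0
    return g

o = np.array([1.0, -0.5, 0.3, 2.0])
target = 2
eps = 1e-6

# Central finite differences along each coordinate direction.
g_num = np.array([
    (nll(o + eps * e_i, target) - nll(o - eps * e_i, target)) / (2 * eps)
    for e_i in np.eye(o.size)
])
g = grad_o(o, target)
```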

# LSTM - Long Short-Term Memory

Chris Olah's [post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Components of LSTM cell:

## Forget gate 

Decides what info we are going to throw away from the cell state. 

* Inputs: $h_{t-1}$, $x_t$
* Output: a number between 0 and 1 for each number in the cell state $C_{t-1}$. 1 means keep it completely, 0 means forget it completely.

$$ f_t = \sigma\big( W_f \times [ h_{t-1}, x_t] + b_f\big) $$

## Input Gate, tanh layer

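The next step is to decide what new info we are going to store in the cell state, done in two parts:

* Input gate: a sigmoid layer that decides which values to update
* tanh layer: creates a vector of new candidate values $\tilde{C}_t$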

$$
\begin{aligned}
i_t &= \sigma\big( W_i \times [h_{t-1}, x_t] + b_i \big) \\
\tilde{C}_t &= \tanh \big( W_C \times [h_{t-1}, x_t] + b_C \big)
\end{aligned}
$$

## Cell State

The next step is to update the **old** cell state, $C_{t-1}$, into the **new** cell state $C_t$. Inputs are the outputs from the **forget gate**, **input gate**, and **tanh layer**. Here $\times$ is elementwise multiplication.

$$ C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t $$

## Output

Work out the output of the cell, using sigmoid to decide which parts of the cell state we are going to output. 

$$
\begin{aligned}
o_t &= \sigma \big( W_o \times [h_{t-1}, x_t] + b_o \big) \\
h_t &= o_t \times \tanh(C_t)
\end{aligned}
$$
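
Putting the four components together, one LSTM step can be sketched in NumPy as below. The parameter dictionary `p`, the function name `lstm_step`, and the sizes and initialization are assumptions for illustration; matrix products use `@` and gate products are elementwise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the forget / input / cell state / output equations."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])           # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])           # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])     # candidate values
    C = f * C_prev + i * C_tilde                   # new cell state (elementwise)
    o = sigmoid(p["W_o"] @ z + p["b_o"])           # output gate
    h = o * np.tanh(C)                             # new hidden state
    return h, C

# Illustrative sizes and small random weights (assumptions).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for name in ("f", "i", "C", "o"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
    p[f"b_{name}"] = np.zeros(n_hid)

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, C = lstm_step(x_t, h, C, p)
```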

# LSTM Variants

## Gers & Schmidhuber (2000)

Adds **peephole connections**, allowing the gate layers to look at the cell state. Modifications:

$$
\begin{aligned}
f_t &= \sigma \big( W_f \times [C_{t-1}, h_{t-1}, x_t] + b_f \big) \\
i_t &= \sigma \big( W_i \times [C_{t-1}, h_{t-1}, x_t] + b_i \big) \\
o_t &= \sigma \big( W_o \times [C_{t}, h_{t-1}, x_t] + b_o \big) \\
\end{aligned}
$$

## Coupled Forget and Input Gates

Couples the **forget** and **input** (cell update) gates so that they make their decisions together: new values are only written where old ones are forgotten.

$$ C_t = f_t \times C_{t-1} + (1-f_t) \times \tilde{C}_t $$
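
A tiny numeric illustration of the coupled update, with made-up values: each element of the new cell state is an interpolation between the old state and the candidate, controlled by $f_t$ alone. The function name is hypothetical.

```python
import numpy as np

def coupled_cell_update(f_t, C_prev, C_tilde):
    """Coupled variant: the input gate is replaced by (1 - f_t)."""
    return f_t * C_prev + (1.0 - f_t) * C_tilde

f_t = np.array([0.0, 0.25, 1.0])        # forget everything / mostly forget / keep everything
C_prev = np.array([0.8, 0.8, 0.8])      # old cell state
C_tilde = np.array([-0.4, -0.4, -0.4])  # candidate values

C_t = coupled_cell_update(f_t, C_prev, C_tilde)
```

With $f_t = 0$ the element takes the candidate value, with $f_t = 1$ it keeps the old value, and in between it is a weighted average of the two.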

## GRU (Gated Recurrent Unit)

Combines the **forget** and **input** gates into a single **update** gate and merges the **cell state** and **hidden state**, plus other changes.

* Input: $h_{t-1}$, $x_t$
* Output: $h_t$

It is simpler than the standard LSTM and has become increasingly popular.

$$
\begin{aligned}
z_t &= \sigma \big( W_z \times [h_{t-1}, x_t] \big) \\
r_t &= \sigma \big( W_r \times [h_{t-1}, x_t] \big) \\
\tilde{h}_t &= \tanh \big( W \times [r_t \times h_{t-1}, x_t] \big) \\
h_t &= (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t
\end{aligned}
$$
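
The GRU equations translate to a short NumPy sketch. As in the equations, there are no bias terms; the function name `gru_step`, the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following the four equations (no biases)."""
    hx = np.concatenate([h_prev, x_t])                         # [h_{t-1}, x_t]
    z = sigmoid(W_z @ hx)                                      # update gate
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # interpolate old and new

# Illustrative sizes and small random weights (assumptions).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_z, W_r, W = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(3))

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h, W_z, W_r, W)
```

Unlike the LSTM sketch, only a single state vector `h` is carried between steps.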


The next big step is **attention** (as of 2015). See Olah's other [post](https://distill.pub/2016/augmented-rnns/).

More on RNNs: Andrej Karpathy's [post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)