# **EECS 498 Deep Learning for Computer Vision (2019)**

These are my notes from Justin Johnson's 2019 course, which is heavily based on Stanford's CS 231N.

### **Lecture 12: Recurrent Neural Networks**

Recurrent neural networks (RNNs) process sequences of inputs, outputs, or both (e.g. video classification, machine translation, per-frame video classification). At every time step the RNN accepts an input $x_t$ and uses it to update its hidden state: $$h_t = f_W(h_{t-1}, x_t)$$ The same function $f_W$ and the same parameters $W$ are used at every time step.

#### **Vanilla RNNs**

A vanilla RNN has $h_t = \tanh(W_{hh} h_{t-1} + W_{xh}x_t)$ and $y_t = W_{hy} h_t$.
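The recurrence above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function names `rnn_step` and `rnn_forward`, the dimensions, and the random weights are assumptions made for the example, and biases are omitted to match the equations as written.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    return np.tanh(W_hh @ h_prev + W_xh @ x)

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    # Apply the same weights at every time step of the sequence xs.
    h = h0
    hs, ys = [], []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh)
        hs.append(h)
        ys.append(W_hy @ h)  # y_t = W_hy h_t
    return np.stack(hs), np.stack(ys)

# Illustrative sizes (assumed): hidden 4, input 3, output 2, 5 time steps.
rng = np.random.default_rng(0)
H, D, O, T = 4, 3, 2, 5
W_hh = rng.normal(scale=0.1, size=(H, H))
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hy = rng.normal(scale=0.1, size=(O, H))
xs = rng.normal(size=(T, D))

hs, ys = rnn_forward(xs, np.zeros(H), W_hh, W_xh, W_hy)
print(hs.shape, ys.shape)
```

Note that the loop reuses `W_hh`, `W_xh`, and `W_hy` at every step, which is the parameter sharing described above.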

In practice, truncated backpropagation through time is used: hidden states are carried forward indefinitely, but backpropagation only traverses a fixed-length chunk of the sequence at a time.

When using RNNs for Image Captioning, the image is passed through a CNN and the output features $v$ are passed to an RNN. The update step now becomes $$h = \tanh(W_{xh} x + W_{hh} h + W_{ih}v) $$  The first input to the RNN is a $\verb|START|$ token.

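RNNs suffer from exploding and vanishing gradients, since backpropagating through many time steps repeatedly multiplies the gradient by $W_{hh}^T$ (and by the derivative of $\tanh$). A remedy for exploding gradients is *gradient clipping*, which rescales the gradient whenever its norm exceeds a threshold. Vanishing gradients are instead addressed by changing the architecture.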
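As a rough sketch of gradient clipping by global norm: the helper name `clip_grad_norm`, the threshold, and the toy gradients below are assumptions chosen for illustration, not something taken from the lecture.

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    # Global L2 norm over all parameter gradients.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        # Rescale so the global norm equals max_norm; direction is unchanged.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy gradients (assumed) whose global norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping only shrinks gradients that are too large, so it does nothing to help gradients that have already vanished.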

#### **Long Short Term Memory (LSTM)**

LSTMs keep two vectors at every time step, a cell state $c_t$ and a hidden state $h_t$, and have the update step $$\begin{align*} \begin{pmatrix} i \\ f \\ o \\ g\end{pmatrix} &= \begin{pmatrix} \sigma \\\sigma \\\sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} \\
c_t &= f*c_{t-1} + i * g \\ h_{t} &= o * \tanh(c_t) \end{align*}$$ The products $*$ are elementwise.


Here $g$ is the candidate update, computed like the vanilla RNN update $\tanh(W_{xh} x + W_{hh}h)$; $i$ is the *input gate*, which regulates what fraction of $g$ is written to the cell state; $f$ is the *forget gate*, which regulates how much of the previous cell state is kept; and $o$ is the *output gate*, which regulates how much of the cell state is exposed in the hidden state $h_t$.
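A single LSTM step might look like the following in NumPy. This is a sketch under assumptions: the name `lstm_step`, the ordering of the four gate blocks inside `W` (i, f, o, g, following the equation), the sizes, and the random weights are all illustrative, and biases are omitted as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W):
    # W has shape (4H, H + D); its rows are assumed stacked as [i; f; o; g].
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x])  # pre-activations, shape (4H,)
    i = sigmoid(a[0:H])          # input gate
    f = sigmoid(a[H:2 * H])      # forget gate
    o = sigmoid(a[2 * H:3 * H])  # output gate
    g = np.tanh(a[3 * H:4 * H])  # candidate update
    c = f * c_prev + i * g       # c_t = f * c_{t-1} + i * g
    h = o * np.tanh(c)           # h_t = o * tanh(c_t)
    return h, c

# Illustrative sizes (assumed): hidden 3, input 2, 4 time steps.
rng = np.random.default_rng(0)
H, D, T = 3, 2, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(T, D)):
    h, c = lstm_step(h, c, x, W)
print(h.shape, c.shape)
```

The cell state is updated only by an elementwise scale and an addition, which is what gives gradients a more direct path backward through time than in the vanilla RNN.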

### **Lecture 13: Attention**

When working with sequence-to-sequence models, we would like to avoid forcing the encoder to compress a long sequence into a single vector. The attention mechanism lets the decoder decide, at each time step, which parts of the input sequence to look at when forming that step's context vector.
 

We compute alignment scores $e_{t,i} = f_{att}(s_{t-1}, h_i)$, where $s_{t}$ is the hidden state of the decoder at time step $t$ and $h_i$ is the hidden state of the encoder at time step $i$. The scores are passed through a softmax to obtain attention weights $a_{t,i}$, which satisfy $\sum_{i} a_{t,i} = 1$. The context vector at this step is $c_t = \sum_{i} a_{t,i} h_{i}$. This context vector is then used in the decoder update $s_t = g_U(y_{t-1}, s_{t-1}, c_t)$.
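The scoring, softmax, and weighted sum can be sketched as below. The notes leave $f_{att}$ unspecified, so this example assumes a plain dot product $f_{att}(s, h) = s^\top h$, which requires the encoder and decoder states to have the same size; the name `attend` and the random states are likewise illustrative.

```python
import numpy as np

def softmax(e):
    # Subtract the max for numerical stability; the result is unchanged.
    w = np.exp(e - np.max(e))
    return w / w.sum()

def attend(s_prev, hs):
    # ASSUMPTION: f_att is a dot product between s_{t-1} and each h_i.
    e = hs @ s_prev   # alignment scores e_{t,i}, shape (T,)
    a = softmax(e)    # attention weights a_{t,i}, sum to 1
    c = a @ hs        # context vector c_t = sum_i a_{t,i} h_i
    return c, a

# Illustrative sizes (assumed): 6 encoder steps, state size 4.
rng = np.random.default_rng(0)
hs = rng.normal(size=(6, 4))   # encoder hidden states h_1..h_6
s_prev = rng.normal(size=4)    # previous decoder state s_{t-1}
c, a = attend(s_prev, hs)
print(a.sum(), c.shape)
```

Because every operation here is differentiable, the attention weights are learned end to end with the rest of the model rather than supervised directly.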

We can use the same attention mechanism to attend to other types of data.

#### **Image Captioning with RNNs and Attention**



Here we pass the image through a CNN, which outputs a feature vector $h_{(i,j)}$ at each spatial position $(i,j)$ of its feature grid. We then use the attention mechanism to attend to these feature vectors: $e_{t, (i,j)} = f_{att}(s_{t-1}, h_{(i,j)})$. As before, the scores are passed through a softmax to get attention weights $a_{t, (i,j)}$, which give the context vector $c_{t} = \sum_{i,j} a_{t, (i,j)} h_{(i,j)}$. The new state is then computed from this context vector and the input word at that step.
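Attending over a feature grid amounts to flattening the grid into a list of vectors and applying the same scoring, softmax, and weighted sum. The sketch below again assumes a dot-product $f_{att}$, which the notes do not specify; the name `spatial_attend` and the random $3 \times 3$ grid are made up for the example.

```python
import numpy as np

def spatial_attend(s_prev, feats):
    # feats has shape (Hg, Wg, D): one D-dimensional vector per grid position.
    Hg, Wg, D = feats.shape
    flat = feats.reshape(Hg * Wg, D)
    # ASSUMPTION: f_att is a dot product between s_{t-1} and each h_(i,j).
    e = flat @ s_prev
    w = np.exp(e - e.max())   # stable softmax over all grid positions
    a = w / w.sum()
    c = a @ flat              # context vector c_t, shape (D,)
    return c, a.reshape(Hg, Wg)

# Illustrative sizes (assumed): 3x3 grid of 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 3, 4))
s_prev = rng.normal(size=4)
c, a = spatial_attend(s_prev, feats)
print(a.shape, a.sum(), c.shape)
```

Reshaping the weights back to the grid gives a map that can be overlaid on the image to see where the model looked when producing each word.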