Understanding Recurrent Neural Networks

Sources:

Andrej Karpathy blog: The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Stanford cs231n (spring 2017) lecture 10: Recurrent Neural Networks https://www.youtube.com/watch?v=6niqTuYFZLQ



### Context:

Neural networks like CNNs typically require some fixed-size input and produce a fixed-size output (see one-to-one below).

RNNs operate on each item of a sequence in turn, so the length of the sequence can vary. This applies to the output as well as the input.

- One to many: e.g., given an input image, produce a sequence of output words (image captioning)
- Many to one: e.g., given an input sequence of words, produce a single output classification (sentiment classification)
- Many to many: e.g., given an input sequence of English words, produce a sequence of French words (machine translation); or given an input sequence of video frames, produce a classification output (whether there's a cat in the frame) for each input frame.

An RNN can also process a fixed-size input sequentially, one piece at a time.

![img](./img/rnn-in-out-size.jpg)

### How does an RNN work?

Like a static variable in a class that gets updated every time some method is called, the hidden state in an RNN is updated each time the network reads a new input. The updated hidden state is then fed back into the model when it reads the next input.
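To make the analogy concrete, here is a minimal sketch of a class-level variable that is updated on every call. It is not an RNN; the class name `Accumulator` and the update rule are made up purely for illustration.

```python
class Accumulator:
    """Toy stateful object: only illustrates state that persists between calls."""
    state = 0.0  # class-level ("static") variable

    def update(self, x):
        # The new state depends on both the old state and the new input
        Accumulator.state = 0.5 * Accumulator.state + x
        return Accumulator.state

acc = Accumulator()
first = acc.update(1.0)   # 0.5 * 0.0 + 1.0 = 1.0
second = acc.update(1.0)  # 0.5 * 1.0 + 1.0 = 1.5
```

The same input produces a different result on the second call because the stored state has changed, which is the behaviour the hidden state gives an RNN.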

![img](./img/rnn-recurrence-formula.png)

### Calculating new state and _recurrence_:

For example: for every word in a sentence, run the word through the RNN function (the input word can be a one-hot encoded vector):
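As a rough sketch of what a one-hot encoded input could look like, assuming a tiny hypothetical vocabulary (the words, their order, and the helper `one_hot` are invented for this example):

```python
import numpy as np

# Hypothetical toy vocabulary; the words and their order are assumptions.
vocab = ["the", "cat", "sat", "down"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

sentence = ["the", "cat", "sat"]
xs = [one_hot(word) for word in sentence]  # one input vector per word
print(xs[1])  # [0. 1. 0. 0.]
```

Each word becomes a vector as long as the vocabulary, and the RNN function is applied to these vectors one at a time.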

$$h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t )$$

At the first step ($t = 1$), the function takes the first word $x_1$ and the initial hidden state $h_0$ as inputs. Since the hidden state is initialized to a vector of zeros, the $W_{hh} h_0$ term vanishes and the first hidden state is simply the tanh of the weighted first word (tanh is applied element-wise; it squashes each value to between -1 and 1):

$$h_1 = \tanh ( W_{xh} x_1 )$$
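The simplification can be checked numerically. The sketch below assumes small arbitrary sizes and random weights (none of these values come from the text) and confirms that the full formula and the shortened one agree when the initial hidden state is all zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 4  # assumed sizes, chosen only for illustration

W_hh = rng.normal(size=(hidden_size, hidden_size))
W_xh = rng.normal(size=(hidden_size, input_size))

h0 = np.zeros(hidden_size)           # initial hidden state
x1 = np.array([0.0, 1.0, 0.0, 0.0])  # a one-hot encoded first word

h1_full = np.tanh(np.dot(W_hh, h0) + np.dot(W_xh, x1))  # general recurrence
h1_short = np.tanh(np.dot(W_xh, x1))                    # simplified first step

assert np.allclose(h1_full, h1_short)
# With a one-hot input, W_xh x_1 simply selects one column of W_xh
assert np.allclose(np.dot(W_xh, x1), W_xh[:, 1])
```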

![img](./img/rnn-step1.png)

The second word $x_2$, together with the previous step's hidden state $h_1$, is fed into the same function to produce a new hidden state $h_2$:

$$h_2 = \tanh ( W_{hh} h_1 + W_{xh} x_2 )$$

![img](./img/rnn-step2.png)

This process is repeated until the end of the sentence.

![img](./img/rnn-step3.png)

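The input $x_t$ and the previous hidden state $h_{t-1}$ each have their own set of weights, $W_{xh}$ and $W_{hh}$ (each acting as a fully connected layer). The same weights are reused at every time step and remain unchanged during the forward pass.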
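Unrolled over a whole sequence, the recurrence might look like the loop below. This is a sketch under assumed sizes with random weights and random one-hot inputs; the point is that `W_hh` and `W_xh` are created once and reused at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 3, 4, 5  # assumed sizes

# Created once, then shared by every time step
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))

# A made-up sequence of one-hot input vectors, shape (seq_len, input_size)
xs = np.eye(input_size)[rng.integers(0, input_size, size=seq_len)]

h = np.zeros(hidden_size)  # h_0
hidden_states = []
for x_t in xs:
    # Same weights at every step; only h and x_t change
    h = np.tanh(np.dot(W_hh, h) + np.dot(W_xh, x_t))
    hidden_states.append(h)
```

After the loop, `hidden_states` holds one hidden state per input, and `h` is the state after the last input.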

![img](./img/rnn-step3-weights.png)

If you want to produce an output $y_t$ at each step, you can introduce another set of weights $W_{hy}$ for the calculation:

$$y_t = W_{hy} h_t$$

The last hidden state $h_T$ at the end of the sentence can be considered to be a summary of the input sentence.
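The two uses of the hidden state can be sketched side by side: an output at every step (many to many), and the final state as a summary of the sequence (many to one). Sizes, weights, and inputs below are arbitrary assumptions, and reading a single output from $h_T$ through $W_{hy}$ is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size, output_size, seq_len = 3, 4, 2, 5  # assumed sizes

W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))

xs = rng.normal(size=(seq_len, input_size))  # made-up input sequence

h = np.zeros(hidden_size)
ys = []
for x_t in xs:
    h = np.tanh(np.dot(W_hh, h) + np.dot(W_xh, x_t))
    ys.append(np.dot(W_hy, h))  # y_t = W_hy h_t at every step

ys = np.stack(ys)            # many to many: shape (seq_len, output_size)
h_T = h                      # final hidden state, a summary of the sequence
y_final = np.dot(W_hy, h_T)  # many to one: a single output from h_T
```

Here `y_final` equals the last row of `ys`; a many-to-one model simply ignores the earlier outputs.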

In code, it looks like this:

In [None]:
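import numpy as np

class RNN:
    '''A single recurrent cell'''
    def __init__(self, input_size, hidden_size, output_size):
        # Small random weights (an illustrative choice); hidden state starts at zero
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.h = np.zeros(hidden_size)

    def step(self, x):
        '''Update the hidden state'''
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # Optional: compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN(input_size=4, hidden_size=3, output_size=4)  # sizes are arbitrary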

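Reminder: `np.dot(a, b)` is the [dot-product](https://en.wikipedia.org/wiki/Dot_product#Algebraic_definition) of 2 vectors (__not__ element-wise!). When `a` is a 2-D weight matrix and `b` is a 1-D vector, as in the cell above, it computes the matrix-vector product.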
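A small worked example may help separate the cases. The matrix `W` and vector `v` below are made-up values, picked so the arithmetic is easy to follow by hand.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([10.0, 1.0])

inner = np.dot(v, v)    # two 1-D vectors: 10*10 + 1*1 = 101
matvec = np.dot(W, v)   # matrix-vector product: [1*10 + 2*1, 3*10 + 4*1] = [12, 34]
elementwise = W * v     # element-wise with broadcasting: [[10, 2], [30, 4]]
```

In the RNN cell, `np.dot` is called with a 2-D weight matrix and a 1-D vector, so it behaves like `matvec` here and returns a vector.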

`np.tanh` is applied element-wise.

Notice that there are __three sets of weights__! $W_{hh}$ (`self.W_hh`) for the hidden state; $W_{xh}$ (`self.W_xh`) for the input; and $W_{hy}$ (`self.W_hy`) for the output.

The hidden state `self.h` is initialized to a vector of zeros.

The `np.tanh` function implements a non-linearity that squashes the activations into the range `(-1, 1)`.
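A quick sketch of the squashing behaviour, using arbitrary sample activations:

```python
import numpy as np

activations = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # made-up values
squashed = np.tanh(activations)  # applied to each element independently

# Large magnitudes saturate close to -1 or 1; zero stays at zero
assert np.all(np.abs(squashed) < 1.0)
assert squashed[2] == 0.0
```

However large an activation grows, the hidden state passed to the next step stays bounded.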

For each step in a sequence, you'd run the line below to update the hidden state:

In [None]:
rnn.step(x) # x is an input vector

Of course, you can daisy-chain (stack) RNN models so that the output of one cell becomes the input of a downstream cell:

In [None]:
# rnn2 is a second RNN instance whose input size matches the size of y1
y1 = rnn.step(x)
y2 = rnn2.step(y1)
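Put together, a stacked pair run over a short sequence could look like the following self-contained sketch. The constructor signature, the sizes, the `seed` argument, and the weight scale are assumptions made for this example. The one constraint that matters is that the second cell's input size equals the first cell's output size.

```python
import numpy as np

class RNN:
    """A single recurrent cell (constructor arguments are assumptions for this sketch)."""
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
        self.h = np.zeros(hidden_size)  # hidden state starts at zero

    def step(self, x):
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        return np.dot(self.W_hy, self.h)

# The output size of the first cell (6) matches the input size of the second
rnn = RNN(input_size=4, hidden_size=8, output_size=6, seed=0)
rnn2 = RNN(input_size=6, hidden_size=8, output_size=2, seed=1)

xs = np.eye(4)[[0, 2, 1]]  # three made-up one-hot inputs

outputs = []
for x in xs:
    y1 = rnn.step(x)    # first cell reads the raw input
    y2 = rnn2.step(y1)  # second cell reads the first cell's output
    outputs.append(y2)
```

Each cell keeps its own hidden state and its own three weight matrices; only the per-step output flows from one cell to the next.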