Understanding Recurrent Neural Networks

Sources:

Andrej Karpathy blog: The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Stanford cs231n (spring 2017) lecture 10: Recurrent Neural Networks https://www.youtube.com/watch?v=6niqTuYFZLQ



### Context:

Neural networks like CNNs typically require some fixed-size input and produce a fixed-size output. 

RNNs operate on the items of a sequence one at a time, so the length of the sequence can vary. This holds for the output as well as the input.

This flexibility allows various combinations of input and output lengths, such as one-to-many, many-to-one, and many-to-many.

You can also use an RNN to process a fixed-size input sequentially.


![img](./img/rnn-in-out-size.jpg)

### How does an RNN work?

Like a static variable in a class that gets updated every time some method is called, the hidden state in an RNN cell retains some of what it has seen and is updated by each new input.

The __Recurrent__ part of RNN:

For example, every word in a sentence is run through the same RNN function, one word at a time. Each input word $x_t$ can be a one-hot encoded vector. The function is:

$$h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t )$$

At the first step ($t = 1$), the function takes the first word $x_1$ and the initial hidden state $h_0$ as inputs. Since $h_0$ is initialized to a vector of zeros, the $W_{hh} h_0$ term vanishes, and the first hidden state is the tanh of the weighted first word:

$$h_1 = \tanh ( W_{xh} x_1 )$$
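
To make the first step concrete, here is a minimal NumPy sketch. The four-word vocabulary, the hidden size of 3, and the random weights are all assumptions made for illustration. Because $x_1$ is one-hot, the product $W_{xh} x_1$ simply picks out one column of $W_{xh}$.

```python
import numpy as np

# Hypothetical toy vocabulary; the words and sizes are illustrative only
vocab = ["the", "cat", "sat", "down"]
hidden_size = 3

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, len(vocab)))    # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden weights

x_1 = one_hot(vocab.index("cat"), len(vocab))        # "cat" is word number 1
h_0 = np.zeros(hidden_size)                          # initial hidden state

# Full update rule; the W_hh term contributes nothing because h_0 is all zeros
h_1 = np.tanh(np.dot(W_hh, h_0) + np.dot(W_xh, x_1))

# Same result: tanh of the column of W_xh selected by the one-hot vector
h_1_shortcut = np.tanh(W_xh[:, 1])
```

Both `h_1` and `h_1_shortcut` hold the same values, which is why the first hidden state depends only on the input weights.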


The second word $x_2$, together with the previous step's hidden state $h_1$, is fed into the same function to produce a new hidden state $h_2$:

$$h_2 = \tanh ( W_{hh} h_1 + W_{xh} x_2 )$$

This process is repeated until the end of the sentence.
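
The repetition can be sketched as a plain loop. This is an illustrative version, not a reference implementation: the function name `run_rnn`, the sizes, and the random weights are assumptions.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t) to each input in turn."""
    h = np.zeros(W_hh.shape[0])   # h_0 is a zero vector
    hidden_states = []
    for x in xs:                  # the same weights are used at every step
        h = np.tanh(np.dot(W_hh, h) + np.dot(W_xh, x))
        hidden_states.append(h)
    return hidden_states

# Illustrative sizes: 4-dimensional inputs, 3-dimensional hidden state
rng = np.random.default_rng(1)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
xs = np.eye(4)[[0, 2, 1]]         # three one-hot inputs (rows of the identity matrix)

hs = run_rnn(xs, W_xh, W_hh)      # hs[0] is h_1, hs[1] is h_2, and so on
```

Each hidden state depends on the current input and on everything the loop has seen so far, carried forward through `h`.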

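The input $x_t$ and the previous hidden state $h_{t-1}$ each have their own weight matrix, $W_{xh}$ and $W_{hh}$ respectively (each acting as a fully connected layer). The same two matrices are reused at every time step and stay fixed throughout the forward pass.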

If you want to produce an output $y_t$ at each step, you can introduce another set of weights $W_{hy}$ for the calculation:

$$y_t = W_{hy} h_t $$

The last hidden state $h_T$ at the end of the sentence can be considered a summary of the input sentence.

As a program, it looks like this:

In [None]:
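import numpy as np

class RNN:
    '''A single recurrent cell (sizes and random init are illustrative)'''
    def __init__(self, input_size=4, hidden_size=3, output_size=4):
        rng = np.random.default_rng(0)
        self.W_xh = rng.normal(scale=0.01, size=(hidden_size, input_size))
        self.W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
        self.W_hy = rng.normal(scale=0.01, size=(output_size, hidden_size))
        self.h = np.zeros(hidden_size)  # hidden state starts as a zero vector

    def step(self, x):
        '''Update the hidden state and return the output vector'''
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # Optional: compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN()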

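Reminder: `np.dot(a, b)` multiplies its arguments and sums over the shared dimension. For a 2-D matrix and a 1-D vector, as used here, it is the matrix-vector product; for two 1-D vectors it is the inner product.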
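
A small worked example of `np.dot`, with made-up numbers chosen so the sums are easy to check by hand:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # shape (3, 2)
x = np.array([10.0, 1.0])       # shape (2,)

# Matrix-vector product: each output entry is one row of W multiplied
# element-wise with x and then summed
manual = np.array([np.sum(row * x) for row in W])
result = np.dot(W, x)           # array([12., 34., 56.]), shape (3,)

# For two 1-D arrays, np.dot is the ordinary inner product
inner = np.dot(x, x)            # 10*10 + 1*1 = 101.0
```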

Notice that there are __three sets of weights__! One (`W_hh`) for the hidden state; one (`W_xh`) for the input; and one (`W_hy`) for the output.

The hidden state `self.h` is initialized with a zero vector.

The `np.tanh` function applies a non-linearity that squashes the activations into the range `(-1, 1)`.
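
A quick way to see the squashing effect, using arbitrary sample values:

```python
import numpy as np

activations = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
squashed = np.tanh(activations)

# tanh is odd (tanh(-a) == -tanh(a)), maps 0 to 0, and saturates
# towards -1 and +1 for inputs of large magnitude
largest = np.max(np.abs(squashed))   # never exceeds 1
```

No matter how large the pre-activation values grow, the hidden state entries stay bounded.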

For each step in a sequence, you'd run the line below to update the hidden state:

In [None]:
rnn.step(x) # x is an input vector
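
Putting it together, a whole sequence can be processed by calling `step` once per item. The sketch below repeats the cell so that it runs on its own. The constructor arguments, the sizes, and the random weights are assumptions for illustration only.

```python
import numpy as np

class RNN:
    """Same cell as above, with an assumed constructor so the sketch is self-contained."""
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
        self.h = np.zeros(hidden_size)       # hidden state starts at zero

    def step(self, x):
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=4, hidden_size=3, output_size=4)
sequence = np.eye(4)[[0, 2, 1, 3]]           # four one-hot input vectors

outputs = [rnn.step(x) for x in sequence]    # one output y_t per step
summary = rnn.h                              # hidden state after the last step
```

Because `self.h` persists between calls, each call to `step` sees the effect of every earlier input in the sequence.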