# Recurrent Neural Networks and Language Models

## Traditional Models

A language model assigns a probability to a sequence of words. For example, given a sequence, it can predict the next word, append it to the input, and continue until a desired length is reached.

### Markov Assumption

The probability of a word is usually conditioned on a window of the `n - 1` previous words. This relies on the Markov assumption, which states that
> The future is independent of the past given the present

$$
P(w_{1}, \dots, w_{m}) = \prod_{i=1}^{m} P(w_{i} \mid w_{1}, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_{i} \mid w_{i + 1 - n}, \dots, w_{i-1})
$$

To estimate these probabilities, we count how often word sequences occur in a corpus. For bigrams and trigrams (conditioning on one or two previous words, respectively):

$$
P(w_{2} \mid w_{1}) = \frac{count(w_{1}, w_{2})}{count(w_{1})}
$$

And

$$
P(w_{3} \mid w_{1}, w_{2}) = \frac{count(w_{1}, w_{2}, w_{3})}{count(w_{1}, w_{2})}
$$

*An n-gram is a contiguous sequence of n items from a given sample of text or speech.*
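
The count-based estimate above can be sketched in a few lines. This is a minimal illustration on a toy sentence; the helper names `ngram_counts` and `bigram_prob` are made up for this example and do not come from any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram (stored as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_prob(tokens, w1, w2):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigrams = ngram_counts(tokens, 1)
    bigrams = ngram_counts(tokens, 2)
    if unigrams[(w1,)] == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return bigrams[(w1, w2)] / unigrams[(w1,)]

tokens = "the cat sat on the mat".split()
p = bigram_prob(tokens, "the", "cat")  # "the" occurs twice, "the cat" once -> 0.5
```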

### Counting

As one can imagine, performance generally improves as the `n` in n-gram increases. However, keeping counts for a single `n` is not enough: a long n-gram may never have appeared in the corpus, and near the beginning of a sentence there are not enough previous words to condition on. A robust model therefore tries the 5-gram first, then falls back to the 4-gram, and so on. This is called **backoff**, but it also means storing counts for every order of n-gram, which takes up a lot of RAM.
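
Here is a rough sketch of the backoff idea, assuming the simplest possible rule: use the longest context that has actually been seen with the word, otherwise drop the oldest context word and retry. It does not discount or renormalize, so the values are not a proper probability distribution; schemes such as Katz backoff handle that. The names `build_counts` and `backoff_prob` are hypothetical.

```python
from collections import Counter

def build_counts(tokens, max_n):
    """Map each order n (1..max_n) to a Counter of n-gram tuples."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def backoff_prob(counts, context, word, max_n):
    """Relative frequency of `word` given the longest context seen with it."""
    # Keep at most max_n - 1 context words.
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    while True:
        n = len(context) + 1
        ngram = context + (word,)
        if n == 1:
            # Nothing left to back off to: use the unigram frequency.
            return counts[1][ngram] / sum(counts[1].values())
        if counts[n][ngram] > 0:
            return counts[n][ngram] / counts[n - 1][context]
        context = context[1:]  # drop the oldest word and try a shorter context

tokens = "the cat sat on the mat".split()
counts = build_counts(tokens, max_n=3)
seen = backoff_prob(counts, ["on", "the"], "mat", max_n=3)       # trigram was seen
backed_off = backoff_prob(counts, ["sat", "the"], "cat", max_n=3)  # falls back to the bigram
```

Note that `counts` holds one table per n-gram order, which is exactly the memory cost mentioned above.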

## RNN
A recurrent neural network addresses the RAM problem of traditional language models. Its RAM requirement scales only with the number of words, rather than with the number of distinct n-grams.

![rnn](./assets/08_rnn.png)

Given a list of word vectors:

$$
x_{1}, ..., x_{t-1}, x_{t}, x_{t+1}, ..., x_{T}
$$

At every time step `t`, we compute a hidden vector.

$$
h_{t} = \sigma\left(W^{hh}h_{t-1} + W^{hx}x_{t}\right)
$$

And an output vector.

$$
\hat{y}_{t} = \text{softmax}\left(W^{S}h_{t}\right)
$$

### Feed Forward

```python
import numpy as np

word_vec_dim = 10
hidden_dim = 10
output_dim = 5

def get_word_vecs(corpus):
    # Assign a random vector to each distinct word in the corpus
    word_vecs = dict()
    for word in corpus.split():
        if word not in word_vecs:
            word_vecs[word] = np.random.rand(word_vec_dim)

    return word_vecs


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    shifted_logits = x - np.max(x, axis=0, keepdims=True)
    Z = np.sum(np.exp(shifted_logits), axis=0, keepdims=True)
    return np.exp(shifted_logits) / Z


def rnn_timestep(Whh, prev_h, Whx, x, Ws):
    h = sigmoid(Whh.dot(prev_h) + Whx.dot(x))
    y = softmax(Ws.dot(h))

    return h, y


corpus = "hello world this is natural language processing"
T = len(corpus.split())
word_vecs = get_word_vecs(corpus)
h0 = np.zeros(hidden_dim) # Use zero vector for initial hidden
Whh = np.random.rand(hidden_dim, hidden_dim)
Whx = np.random.rand(hidden_dim, word_vec_dim)
Ws = np.random.rand(output_dim, hidden_dim)

hs = (T + 1) * [h0] # Hidden vectors for all timesteps
ys = (T + 1) * [None] # Output vectors for all timesteps
for i, word in enumerate(corpus.split()):
    t = i + 1
    x = word_vecs.get(word)
    hs[t], ys[t] = rnn_timestep(Whh, hs[t-1], Whx, x, Ws)
```

### Training

*I am not going to do the backprop here because I have a blog post about this on my machine learning notebook already.* We can use the same cross entropy loss as before to compute the loss at each timestep. Here I use `V` to represent the size of the vocabulary.

$$
J_{t}(\theta) = - \sum_{j = 1}^{V} y_{t}\text{[j]} \; \log \; \hat{y}_{t}\text{[j]}
$$

Thus, the total loss is just the average of all the timestep losses. The minus sign is already part of each $J_{t}$, so it does not appear again here.

$$
J = \frac{1}{T} \sum^{T}_{t=1} J_{t}
$$

More commonly used is the perplexity score, which (taking the logarithm in $J_{t}$ to be base 2) is

$$
\text{Perplexity:} \; 2^{J}
$$
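
As a quick sanity check of these formulas, here is a small sketch that computes the average cross entropy and the perplexity for a toy case. It assumes one-hot targets and base-2 logarithms so that `2 ** J` applies; the function names and the `eps` guard against `log(0)` are choices made for this example.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one timestep; y_true is one-hot, y_pred is a softmax output."""
    return -np.sum(y_true * np.log2(y_pred + eps))

def sequence_loss(ys_true, ys_pred):
    """Average the per-timestep losses over the whole sequence."""
    return np.mean([cross_entropy(y, p) for y, p in zip(ys_true, ys_pred)])

V, T = 4, 3                          # vocabulary size and sequence length
ys_pred = np.full((T, V), 1.0 / V)   # a model that predicts uniformly
ys_true = np.eye(V)[[0, 2, 1]]       # one-hot targets for three timesteps
J = sequence_loss(ys_true, ys_pred)
perplexity = 2 ** J
```

A model that guesses uniformly over `V` words has a perplexity of about `V` (here, about 4), which is a handy way to read the score: lower means the model is choosing among fewer plausible words.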

Training an RNN is actually hard because of the vanishing gradient problem. This can be addressed using LSTMs, which are covered in the next lecture.
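
To get a feel for the vanishing gradient, here is a small numerical sketch. It assumes a sigmoid RNN with small random recurrent weights and no inputs (a simplification for this example), then pushes a gradient backwards through 20 timesteps and records its norm at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 10
Whh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))  # small weights (assumption)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Forward pass: h_t = sigmoid(Whh h_{t-1}), keeping the pre-activations
h = rng.random(hidden_dim)
pre_activations = []
for _ in range(20):
    a = Whh.dot(h)
    pre_activations.append(a)
    h = sigmoid(a)

# Backward pass: dL/dh_{t-1} = Whh^T (dL/dh_t * sigmoid'(a_t))
grad = np.ones(hidden_dim)
norms = [np.linalg.norm(grad)]
for a in reversed(pre_activations):
    s = sigmoid(a)
    grad = Whh.T.dot(grad * s * (1 - s))
    norms.append(np.linalg.norm(grad))
```

Since the sigmoid derivative is at most 0.25, every step backwards shrinks the gradient here, so the early timesteps receive almost no learning signal.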

### Truncated Backpropagation

Remember that we only have three weight matrices: `Whh`, `Whx`, and `Ws`. It is rather inefficient to update these matrices at every timestep. A better approach is to split the corpus into sequences. For each sequence, we reset the hidden vector back to zero, move along the timesteps, and accumulate the gradients from every timestep. For example,

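```python
import numpy as np

def accumulate_gradients(model, x, y, seq_len, hidden_dim, word_vec_dim, output_dim):
    """Sum the gradients over one sequence; x and y are indexed from 1 to seq_len."""
    hs = (seq_len + 1) * [None]
    hs[0] = np.zeros(hidden_dim)  # reset the hidden vector for each new sequence
    grad_Whh = np.zeros((hidden_dim, hidden_dim))
    grad_Whx = np.zeros((hidden_dim, word_vec_dim))
    grad_Ws = np.zeros((output_dim, hidden_dim))
    for i in range(seq_len):
        t = i + 1
        # `model.loss` is a placeholder; it is assumed to return the gradients
        # for this timestep together with the new hidden vector.
        grads, hs[t] = model.loss(x[t], y[t], hs[t-1])
        grad_Whh += grads['Whh']
        grad_Whx += grads['Whx']
        grad_Ws += grads['Ws']
    return grad_Whh, grad_Whx, grad_Ws
```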

We then update `Whh`, `Whx`, and `Ws` in one go and move on to the next sequence. This is known as **truncated** backpropagation. It is very much like stochastic gradient descent, with each sequence playing the role of a mini-batch.
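
The two remaining pieces, splitting the corpus into sequences and applying a single update per sequence, might look like the sketch below. The helper names, the fixed `seq_len`, and the plain gradient-descent step with a `learning_rate` are all assumptions made for illustration.

```python
import numpy as np

def split_into_sequences(tokens, seq_len):
    """Chop a token list into consecutive chunks of at most seq_len tokens."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]

def sgd_update(params, grads, learning_rate=0.1):
    """Apply one in-place gradient step to every weight matrix."""
    for name in params:
        params[name] -= learning_rate * grads[name]

tokens = "hello world this is natural language processing".split()
sequences = split_into_sequences(tokens, seq_len=3)

# Toy values standing in for the accumulated gradients of one sequence
params = {"Whh": np.ones((2, 2))}
grads = {"Whh": np.full((2, 2), 0.5)}
sgd_update(params, grads)
```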

## Bidirectional Deep RNN
For classification problems, you sometimes want to incorporate information from both the preceding and the following words. The vanilla RNN only reads words from left to right, so the current prediction relies only on information from the past. What if the classification of a word depends on the words that come both before and after it? We can use a bidirectional RNN.

![bidirectional_rnn](./assets/08_bidirectional_rnn.png)

Instead of one hidden vector for prediction, we use two: one from a left-to-right pass and one from a right-to-left pass. Together they summarize the past and future around a single token. If that's not enough, try going deep and stacking multiple layers, each with its own hidden vectors at every timestep.

![deep_bidirectional_rnn](./assets/08_deep_bidirectional_rnn.png)
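
The single-layer bidirectional idea can be sketched with the same building blocks as the feed forward code. This is a minimal version under a few assumptions: separate weights for each direction, the two hidden vectors concatenated before the softmax, and no bias terms. The function names are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def run_direction(xs, Whh, Whx):
    """Run a vanilla RNN over xs and return the hidden vector at every step."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x in xs:
        h = sigmoid(Whh.dot(h) + Whx.dot(x))
        hs.append(h)
    return hs

def bidirectional_rnn(xs, fwd, bwd, Ws):
    hs_fwd = run_direction(xs, *fwd)              # left to right
    hs_bwd = run_direction(xs[::-1], *bwd)[::-1]  # right to left, re-aligned to xs
    # Each prediction sees both the past (hf) and the future (hb)
    return [softmax(Ws.dot(np.concatenate([hf, hb])))
            for hf, hb in zip(hs_fwd, hs_bwd)]

rng = np.random.default_rng(0)
T, word_vec_dim, hidden_dim, output_dim = 7, 10, 10, 5
xs = [rng.random(word_vec_dim) for _ in range(T)]
fwd = (rng.random((hidden_dim, hidden_dim)), rng.random((hidden_dim, word_vec_dim)))
bwd = (rng.random((hidden_dim, hidden_dim)), rng.random((hidden_dim, word_vec_dim)))
Ws = rng.random((output_dim, 2 * hidden_dim))  # twice as wide: two hidden vectors
ys = bidirectional_rnn(xs, fwd, bwd, Ws)
```

A deep version would feed the concatenated hidden vectors of one layer in as the inputs `xs` of the next.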