# RECURRENT NEURAL NETWORKS

A limitation of vanilla neural networks is that they are constrained: they accept a fixed-sized vector as input and produce a fixed-sized vector as output. Recurrent networks allow us to operate over sequences of vectors of arbitrary length.

![RNN](img/recurrent_nets.png "RNN")

At each timestep, the hidden layer receives two inputs: the hidden state from the previous timestep $ h^{(t-1)} $ and the input at the current timestep $ x^{(t)} $. The former is multiplied by a weight matrix $ W_{h} $, and the latter by a weight matrix $ W_{x} $. The sum of the two products and a bias $ b $ is then passed through a non-linearity to produce the hidden state at the current timestep $ h^{(t)} $.

$$ h^{(t)} = \tanh(W_{h}h^{(t-1)} + W_{x}x^{(t)} + b) $$

It is important to note that the same weights $ W_{x} $ and $ W_{h} $ are applied repeatedly at each timestep. Thus, the number of parameters the model has to learn is smaller and, most importantly, independent of the length of the input sequence.  
At each timestep we could optionally produce an output using the weight matrix $ W_{o} $.

The initial hidden state $ h^{(0)} $ can be learned as a parameter of the network or it can be assumed to be 0. In some problems the initial hidden state is given as an input to the network.
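
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `src.recurrent_net`: the function names, shapes and weight scales are assumptions, and it uses the row-vector convention (`x @ W`), which is the transpose of the column-vector form in the equation. The initial hidden state is assumed to be zero.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(h_prev @ W_h + x_t @ W_x + b)."""
    return np.tanh(h_prev @ W_h + x_t @ W_x + b)

def rnn_forward(X, h0, W_x, W_h, b):
    """Apply the same step to every timestep of X, which has shape (N, T, D)."""
    N, T, _ = X.shape
    H = h0.shape[1]
    hs = np.zeros((N, T, H))
    h = h0
    for t in range(T):
        # The same W_x, W_h and b are reused at every timestep.
        h = rnn_step(X[:, t], h, W_x, W_h, b)
        hs[:, t] = h
    return hs

rng = np.random.default_rng(0)
N, T, D, H = 2, 3, 4, 5                      # illustrative sizes (assumed)
W_x = rng.normal(scale=0.1, size=(D, H))
W_h = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)
h0 = np.zeros((N, H))                        # initial hidden state assumed to be 0

hs_short = rnn_forward(rng.normal(size=(N, T, D)), h0, W_x, W_h, b)
hs_long = rnn_forward(rng.normal(size=(N, 10 * T, D)), h0, W_x, W_h, b)

# The parameter count does not depend on the sequence length.
n_params = W_x.size + W_h.size + b.size
print(hs_short.shape, hs_long.shape, n_params)
```

The same three arrays process both the 3-step and the 30-step batch, which is the parameter sharing described above.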

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import time
import sys
sys.path.append("../")

from src.recurrent_net import RecurrentNetwork
from src.solver import UnsupervisedSolver
from src.utils.gradient_check import eval_numerical_gradient, rel_error

# for auto-reloading external modules
%load_ext autoreload
%autoreload 2

# plot configuration
%matplotlib inline
plt.rcParams["figure.figsize"] = (15.0, 12.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

When training the network we must compute the output distribution $ \hat{y}^{(t)} $ for every timestep $ t $. After that we compute the loss at every timestep as the cross-entropy between the predicted probability distribution $ \hat{y}^{(t)} $ and the true distribution. Finally, we average this over the entire training batch to get the overall loss.
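
A minimal sketch of this loss, assuming the network emits unnormalized scores of shape `(N, T, V)` and the targets are integer class indices of shape `(N, T)`. The function name is hypothetical, and summing over timesteps before averaging over the batch is an assumption; the text only states that the result is averaged over the batch.

```python
import numpy as np

def sequence_cross_entropy(scores, y):
    """
    scores: (N, T, V) unnormalized scores, y: (N, T) integer targets.
    Returns the cross-entropy summed over timesteps and averaged over the batch.
    """
    N, T, V = scores.shape
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = scores - scores.max(axis=2, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=2, keepdims=True))
    # Pick the log-probability of the true class at every (n, t).
    picked = log_probs[np.arange(N)[:, None], np.arange(T)[None, :], y]
    per_step_loss = -picked                  # shape (N, T)
    return per_step_loss.sum(axis=1).mean()

# With uniform scores every timestep costs log(V).
uniform_loss = sequence_cross_entropy(np.zeros((2, 3, 5)), np.zeros((2, 3), dtype=np.int64))
print(uniform_loss, 3 * np.log(5))
```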

![RNN Training](img/recurrent_net_training.png "RNN Training")

One problem with recurrent neural networks is the vanishing gradient. During backpropagation through time, the gradient is multiplied by the recurrent Jacobian at every step, so its contribution shrinks as it propagates to earlier timesteps. Thus, for long sequences, long-term effects are not accounted for. Due to vanishing gradients, we cannot tell whether there is no dependency between steps $ t $ and $ t+n $ in the data, or whether we simply fail to capture the true dependency.  
In practice it is very difficult for a vanilla RNN to learn to preserve information over many timesteps.
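
The effect can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than an analysis of the model in this notebook: the recurrent matrix is constructed so that its largest singular value is exactly 0.5, the input weights are small, and a unit-norm gradient is pushed backwards through 50 timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 6, 50

# Assumed setup: W_h is 0.5 times an orthogonal matrix, so every singular value is 0.5.
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))
W_h = 0.5 * Q
W_x = rng.normal(scale=0.1, size=(H, H))

# Forward pass, storing the hidden states.
h = np.zeros(H)
hs = []
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ rng.normal(size=H))
    hs.append(h)

# Backward pass: start from a unit-norm gradient at the last timestep.
grad = np.ones(H) / np.sqrt(H)
norms = []
for h_t in reversed(hs):
    # dL/dh_{t-1} = W_h^T ((1 - h_t^2) * dL/dh_t), since tanh'(a) = 1 - tanh(a)^2.
    grad = W_h.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print("gradient norm after 1, 10 and 50 steps back:", norms[0], norms[9], norms[-1])
```

Because each step scales the gradient by at most 0.5 here, its norm decays at least geometrically with the distance in time.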

![RNN Gradient](img/recurrent_net_gradient.png "RNN Gradient")

In [2]:
# Initialize toy example to check the implementation.
np.random.seed(0)

batch_size = 2
timesteps = 3
hidden_dim = 6
input_dim = 20
output_dim = 20

X = np.random.randn(batch_size, timesteps, input_dim)
y = np.random.randint(low=0, high=10, size=(batch_size, timesteps))

In [3]:
# Check the backward pass for a vanilla RNN cell.
rnn_model = RecurrentNetwork(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
                             n_layers=1, cell_type="rnn", dtype=np.float64)

loss, grads = rnn_model.loss(X, y)
f = lambda _ : rnn_model.loss(X, y)[0]

for param_name in sorted(grads):
    param_grad_num = eval_numerical_gradient(f, rnn_model.params[param_name], verbose=False, h=1e-6)
    print('%s relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))

W_out relative error: 4.655994e-04
Wh_0 relative error: 1.131941e-01
Wx_0 relative error: 8.751668e-05
b_0 relative error: 4.989417e-06
b_out relative error: 3.066667e-09
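
For reference, a gradient check of this kind can be sketched with centered finite differences. The helpers below are hypothetical stand-ins and are not the implementations in `src.utils.gradient_check`; in particular, the relative-error formula is one common choice and is an assumption here.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Centered-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                         # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b):
    """Maximum elementwise relative difference between two arrays."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# Check against a function whose gradient is known: f(w) = sum(3 w^2), df/dw = 6 w.
w = np.linspace(0.5, 2.0, 12).reshape(3, 4)
num_grad = numerical_gradient(lambda v: np.sum(3.0 * v ** 2), w)
err = relative_error(num_grad, 6.0 * w)
print("relative error: %e" % err)
```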


## Long Short-Term Memory

In 1997 Sepp Hochreiter and Jürgen Schmidhuber proposed a solution to the vanishing gradient problem in their paper:  
<i>[1] "Long Short-Term Memory"</i>.

In standard RNNs at every timestep a simple non-linearity is applied to the hidden state from the previous timestep $ h^{(t-1)} $ and the input at the current timestep $ x^{(t)} $.

In LSTMs at every timestep there is a hidden state $ h^{(t)} $ and a <b>cell state</b> $ c^{(t)} $. The idea of the cell state is to store long-term information. During the forward pass the LSTM can <i>erase</i>, <i>write</i> and <i>read</i> information from the cell state. The selection of which information is erased/written/read is controlled by three gates, each computed from the previous hidden state and the current input. Gates are a way to optionally let information through. Each gate is a sigmoid layer that outputs numbers between zero and one, describing how much of each component should be let through.

![RNN LSTM](img/recurrent_net_lstm.png "RNN LSTM")

#### Forget Gate

The decision about what information to throw away from the cell state is made by a sigmoid layer called the forget gate. At timestep $ t $ it "looks" at $ h^{(t-1)} $ and $ x^{(t)} $, and outputs a number between zero and one for each number in the cell state $ c^{(t-1)} $.

$$ f_{t} = \sigma \space (W_{fh}h^{(t-1)} + W_{fx}x^{(t)} + b_{f}) $$

#### Input Gate

The next step is to decide what new information to store in the cell state. This operation has two parts:
 * a <i>tanh</i> layer creates a vector of new candidate values
 * a sigmoid layer called the input gate decides which values we will store

$$ i_{t} = \sigma \space (W_{ih}h^{(t-1)} + W_{ix}x^{(t)} + b_{i}) $$
$$ g_{t} = \tanh \space (W_{gh}h^{(t-1)} + W_{gx}x^{(t)} + b_{g}) $$

The new cell state is calculated by multiplying the old state by $ f_{t} $ and adding the candidate values multiplied by $ i_{t} $:

$$ c^{(t)} = f_{t} * c^{(t-1)} + i_{t} * g_{t} $$

#### Output Gate

Finally, the output gate decides what information we are going to output. The output is a filtered version of the current cell state and the output gate is again a sigmoid layer:

$$ o_{t} = \sigma \space (W_{oh}h^{(t-1)} + W_{ox}x^{(t)} + b_{o}) $$
$$ h^{(t)} = o_{t} * \tanh(c^{(t)}) $$

![RNN LSTM-Detailed](img/recurrent_net_lstm_detailed.png "RNN LSTM Detailed")
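
The gate equations can be combined into a single step function. This is a sketch under assumptions, not the implementation in `src.recurrent_net`: the names are hypothetical, the per-gate weight matrices are stacked into one `(D, 4H)` and one `(H, 4H)` array in the order input, forget, output, candidate, and the row-vector convention is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """
    x_t: (N, D), h_prev and c_prev: (N, H), W_x: (D, 4H), W_h: (H, 4H), b: (4H,).
    Gate blocks are stacked along the last axis in the order i, f, o, g (assumed).
    """
    H = h_prev.shape[1]
    a = x_t @ W_x + h_prev @ W_h + b
    i = sigmoid(a[:, 0 * H:1 * H])           # input gate
    f = sigmoid(a[:, 1 * H:2 * H])           # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])           # output gate
    g = np.tanh(a[:, 3 * H:4 * H])           # candidate values
    c_t = f * c_prev + i * g                 # erase, then write
    h_t = o * np.tanh(c_t)                   # read
    return h_t, c_t

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
W_x = rng.normal(scale=0.1, size=(D, 4 * H))
W_h = rng.normal(scale=0.1, size=(H, 4 * H))
b = np.zeros(4 * H)
b[0 * H:1 * H] = -50.0                       # input gate saturated near 0: write nothing
b[1 * H:2 * H] = 50.0                        # forget gate saturated near 1: erase nothing

c = rng.normal(size=(N, H))
c_start = c.copy()
h = np.zeros((N, H))
for _ in range(100):
    h, c = lstm_step(rng.normal(size=(N, D)), h, c, W_x, W_h, b)

print("largest change in the cell state after 100 steps:", np.max(np.abs(c - c_start)))
```

With the forget gate held open and the input gate held shut, the cell state is carried across all 100 steps essentially unchanged, which is the mechanism that lets an LSTM keep long-term information.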

In [4]:
# Check the backward pass for an LSTM cell.
lstm_model = RecurrentNetwork(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
                              n_layers=1, cell_type="lstm", dtype=np.float64)

loss, grads = lstm_model.loss(X, y)
f = lambda _ : lstm_model.loss(X, y)[0]

for param_name in sorted(grads):
    param_grad_num = eval_numerical_gradient(f, lstm_model.params[param_name], verbose=False, h=1e-6)
    print('%s relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))

W_out relative error: 4.572344e-05
Wh_0 relative error: 1.601574e-01
Wx_0 relative error: 1.390295e-01
b_0 relative error: 1.608477e-01
b_out relative error: 3.539295e-09


## Multilayer RNNs

One way to create more complex RNN models is to stack the networks in layers.  
The first layer of the recurrent network reads the input and produces a sequence of hidden states. Now we could use this sequence of hidden states as an input to another recurrent network. We could stack layers in this manner to produce a multilayer RNN.

![RNN Multilayer](img/recurrent_net_multilayer.png "RNN Multilayer")
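
Stacking can be sketched as feeding the hidden-state sequence of one layer into the next. The helper names and sizes below are illustrative assumptions, and vanilla RNN layers are used for brevity; only the first layer sees the input dimension, while every later layer maps hidden states to hidden states.

```python
import numpy as np

def rnn_layer(X, W_x, W_h, b):
    """Run one vanilla RNN layer over X of shape (N, T, D); returns (N, T, H)."""
    N, T, _ = X.shape
    H = W_h.shape[0]
    h = np.zeros((N, H))
    hs = np.zeros((N, T, H))
    for t in range(T):
        h = np.tanh(X[:, t] @ W_x + h @ W_h + b)
        hs[:, t] = h
    return hs

def stacked_rnn(X, layers):
    """Feed the hidden-state sequence of each layer to the next one."""
    out = X
    for W_x, W_h, b in layers:
        out = rnn_layer(out, W_x, W_h, b)
    return out

rng = np.random.default_rng(0)
N, T, D, H, n_layers = 2, 3, 20, 6, 3
layers = []
in_dim = D
for _ in range(n_layers):
    layers.append((rng.normal(scale=0.1, size=(in_dim, H)),
                   rng.normal(scale=0.1, size=(H, H)),
                   np.zeros(H)))
    in_dim = H                               # later layers read hidden states

top = stacked_rnn(rng.normal(size=(N, T, D)), layers)
print(top.shape)
```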

In [5]:
# Check the backward pass for a multilayer RNN.
multilayer_lstm_model = RecurrentNetwork(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
                                         n_layers=3, cell_type="lstm", dtype=np.float64)

loss, grads = multilayer_lstm_model.loss(X, y)
f = lambda _ : multilayer_lstm_model.loss(X, y)[0]

for param_name in sorted(grads):
    param_grad_num = eval_numerical_gradient(f, multilayer_lstm_model.params[param_name], verbose=False, h=1e-6)
    print('%s relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))

W_out relative error: 3.638050e-03
Wh_0 relative error: 1.127214e-07
Wh_1 relative error: 1.848132e-07
Wh_2 relative error: 3.351967e-07
Wx_0 relative error: 3.646608e-03
Wx_1 relative error: 1.249184e-03
Wx_2 relative error: 2.542450e-03
b_0 relative error: 2.923031e-03
b_1 relative error: 1.407397e-02
b_2 relative error: 9.665872e-06
b_out relative error: 7.677437e-10


## Character Level Language Model

We will train a character level language model using an LSTM recurrent network.  
To train the model we will use a simple text file. We will encode every character as a one-hot vector and feed the network one character at a time. The size of our vocabulary will be the number of distinct characters in the dataset. At each timestep the output of the network will be a vector of dimension `vocab_size`, which we will interpret as the scores the RNN assigns to each character coming next in the sequence.

![RNN Chars](img/recurrent_net_chars.png "RNN Chars")
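
As a small illustration of the encoding, the sketch below one-hot encodes the word "hello" and turns a vector of scores into a distribution over the next character. The tiny vocabulary and the score values are made up for the example, and applying a softmax to the scores is an assumption about how they are interpreted.

```python
import numpy as np

text = "hello"
chars = sorted(set(text))                    # ['e', 'h', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

# Integer indices, then one row of the identity matrix per character.
indices = np.array([char_to_idx[ch] for ch in text])
one_hot = np.eye(vocab_size, dtype=np.int64)[indices]   # shape (5, 4)

# Made-up scores for the next character, converted to probabilities.
scores = np.array([0.5, 2.0, -1.0, 0.1])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(indices)
print(one_hot)
print(chars[int(np.argmax(probs))])
```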

During training, the input at each timestep is the ground-truth character from the sequence rather than the model's own prediction, and we stop once the target length is reached. The predicted sequence is then compared against the actual target sequence to calculate the loss.  
During inference, each character predicted by the model is fed back as the next input, and we keep generating until a set number of characters has been produced.
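
The two regimes can be sketched as follows. During training the targets are simply the inputs shifted by one position; during inference a loop feeds each prediction back in. The `generate` and `toy_step` functions are hypothetical: `toy_step` stands in for a trained model, and picking the highest-scoring character greedily is an assumption made for the example.

```python
import numpy as np

# Training: inputs are all characters but the last, targets are shifted by one.
seq = np.array([3, 1, 4, 1, 5, 9])
inputs, targets = seq[:-1], seq[1:]

def generate(step_fn, start_idx, length):
    """Feed every predicted index back as the next input until `length` is reached."""
    out = [start_idx]
    state = None
    for _ in range(length - 1):
        scores, state = step_fn(out[-1], state)
        out.append(int(np.argmax(scores)))   # greedy choice (assumed)
    return out

# Hypothetical stand-in for a trained model: always scores (idx + 1) mod V highest.
V = 10
def toy_step(idx, state):
    scores = np.zeros(V)
    scores[(idx + 1) % V] = 1.0
    return scores, state

generated = generate(toy_step, start_idx=7, length=5)
print(inputs, targets, generated)
```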

In [6]:
# Load the data.
filename = "../datasets/text/the_cask_of_amontillado.txt"
with open(filename, "r") as file:
    text = file.read()

print("Length of the text: %d symbols" % len(text))
print("Example text:\n\n", text[:336])

Length of the text: 13017 symbols
Example text:

 THE CASK OF AMONTILLADO by EDGAR ALLAN POE

THE thousand injuries of Fortunato I had
borne as I best could, but when he ven-
tured upon insult, I vowed revenge. You,
who so well know the nature of my soul, will not
suppose, however, that I gave utterance to a threat.
AT LENGTH I would be avenged; this was a point de-
finitively settle


In [7]:
# Preprocess the data to create char-to-int and int-to-char mappings.
chars = list(set(text))
vocab_size = len(chars)

char_to_idx = {ch: i for i,ch in enumerate(chars)}
idx_to_char = {i: ch for i,ch in enumerate(chars)}

data = [char_to_idx[ch] for ch in text]

print("Number of distinct characters: %d" % vocab_size)
print("Vocab:\n", char_to_idx)

print("Example text after preprocessing:\n", data[:30])

Number of distinct characters: 59
Vocab:
 {'e': 0, 'B': 1, ' ': 2, 'C': 3, 'M': 4, 'u': 5, '\n': 6, ',': 7, 'F': 8, 'V': 9, 'I': 10, 'N': 11, 'n': 12, 'D': 13, 'c': 14, 'q': 15, 'y': 16, "'": 17, 'k': 18, 'z': 19, 'v': 20, '?': 21, 'p': 22, 'w': 23, '.': 24, 'K': 25, 'g': 26, 'l': 27, 'Y': 28, 'H': 29, 's': 30, '!': 31, 'W': 32, 'U': 33, 'E': 34, 'i': 35, ':': 36, 'A': 37, 'G': 38, '"': 39, 'a': 40, ';': 41, 'j': 42, 'r': 43, 't': 44, 'R': 45, 'P': 46, 'S': 47, 'd': 48, 'b': 49, '-': 50, 'T': 51, 'o': 52, 'f': 53, 'O': 54, 'x': 55, 'L': 56, 'm': 57, 'h': 58}
Example text after preprocessing:
 [51, 29, 34, 2, 3, 37, 47, 25, 2, 54, 8, 2, 37, 4, 54, 11, 51, 10, 56, 56, 37, 13, 54, 2, 49, 16, 2, 34, 13, 38]


In [8]:
class text_dataset(object):
    def __init__(self, data, seq_length):
        """
        Inputs:
        - data: List of integers representing the preprocessed data.
          Every integer corresponds to a character.
        - seq_length: Integer giving the length of the sequence to be input
          to the recurrent neural network.
        """
        self.data = data
        self.seq_length = seq_length

    def train_batch(self, batch_size):
        """
        Generate the next batch of examples from the data.

        Returns:
        - batch: A numpy array of integers of shape (batch_size, seq_length) giving
          a batch of training examples.
        """
        seq_length = self.seq_length
        # np.int was removed in NumPy 1.24; use an explicit integer type.
        batch = np.empty((batch_size, seq_length), dtype=np.int64)

        for idx in range(batch_size):
            cursor = np.random.randint(len(self.data)-seq_length)
            batch[idx] = self.data[cursor : cursor + seq_length]

        return batch

    def num_train(self):
        """
        Returns:
        - num_train: Integer, giving the number of training examples in the dataset.
        """
        return len(self.data)

In [9]:
class CharLevelRNN(object):
    def __init__(self, vocab_size, hidden_dim, char_to_idx,
                 n_layers=1, reg=0.0, dtype=np.float32):
        """
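        Inputs:
        - vocab_size: Integer giving the size of the vocabulary.
        - hidden_dim: Integer giving the hidden size.
        - char_to_idx: Dictionary mapping each character to its integer index.
        - n_layers: Integer giving the number of recurrent layers to use.
        - reg: Scalar regularization strength passed to the recurrent network.
        - dtype: Numpy datatype to use for computation.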
        """
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

        self.char_to_idx = char_to_idx
        self.idx_to_char = {idx: ch for ch, idx in char_to_idx.items()}

        self.RNN = RecurrentNetwork(input_dim=vocab_size,
                                    hidden_dim=hidden_dim,
                                    output_dim=vocab_size,
                                    n_layers=n_layers,
                                    weight_scale=1,
                                    reg=reg,
                                    cell_type="lstm",
                                    dtype=dtype)

        self.params = self.RNN.params

    def one_hot_encoding(self, x, V):
        """
        Inputs:
        - x: A numpy array of shape (N,) or (N, T) giving a batch of integer indices.
        - V: Integer giving the size of the vocabulary.

        Returns:
        - embed: A numpy array of shape (N, V) or (N, T, V) giving one-hot
          encodings of each element.
        """
        # np.int was removed in NumPy 1.24; use an explicit integer type.
        x = x.astype(np.int64)

        if x.ndim == 1: # if T = 1
            N = x.shape[0]
            embed = np.zeros((N, V), dtype=np.int64)
            embed[np.arange(N), x] = 1
        else:
            N, T = x.shape
            embed = np.zeros((N, T, V), dtype=np.int64)
            # Broadcast batch and time indices against x to set one entry per (n, t).
            embed[np.arange(N)[:, None], np.arange(T)[None, :], x] = 1

        return embed

    def loss(self, X):
        """
        Inputs:
        - X: A numpy array of shape (N, T) of integers giving the input sequence
          to the RNN. Each element is in the range 0 <= X[i, j] < V.
  
        Returns:
        - loss: A scalar value giving the loss.
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        embed_x = self.one_hot_encoding(X, self.vocab_size)
        embed_in = embed_x[:, : -1]
        y = X[:, 1 :]
        loss, grads = self.RNN.loss(embed_in, y)
        
        return loss, grads

    def sample(self, start=None, max_length=100):
        """
        Inputs:
        - start: A numpy array of shape (N, ) of integers giving input data.
          If None, the sequence starts with a randomly chosen character.
        - max_length: Maximum length T of generated outputs.

        Returns:
        - text: List of N strings decoded from the sampled outputs.
        """
        if start is None:
            start = np.array([np.random.randint(self.vocab_size)])

        N = start.shape[0]
        start_embed = self.one_hot_encoding(start, self.vocab_size)
        output = self.RNN.sample(start_embed, self.one_hot_encoding, max_length=max_length)
        data = np.argmax(output, axis=2)
        # Use the model's own mapping instead of the notebook-level global.
        sequence = [[self.idx_to_char[idx] for idx in seq] for seq in data]
        text = [''.join(seq) for seq in sequence]

        return text

In [10]:
seq_length = 25
batch_size = 64

# Initialize the dataset.
dataset = text_dataset(data, seq_length=seq_length)

# Initialize the model.
np.random.seed(seed=None)
hidden_dim = 256
n_layers = 2
reg = 1e-2

char_level_model = CharLevelRNN(vocab_size=vocab_size,
                                hidden_dim=hidden_dim,
                                char_to_idx=char_to_idx,
                                n_layers=n_layers,
                                reg=reg,
                                dtype=np.float32)

In [11]:
# Train the model.
rnn_solver = UnsupervisedSolver(char_level_model, dataset,
                                update_rule="adam",
                                optim_config={"learning_rate":1e-3},
                                lr_decay=0.99,
                                batch_size=batch_size,
                                clip_norm = 50.0,
                                num_epochs=5,
                                print_every=500,
                                verbose=True)

tic = time.time()
rnn_solver.train()
toc = time.time()
print("training took %.3f minutes" % ((toc - tic) / 60))

Number of iterations per epoch: 203
(Iteration 1 / 1015) loss: 98.05993
(Iteration 1 / 1015); Epoch(1 / 5); loss: 98.05993
Sample:
 vmee
enee
ee
eeed
oooolllloooolllloooollll!!ggtrr;;uttooll                                          
(Iteration 204 / 1015); Epoch(2 / 5); loss: 66.12531
Sample:
 
 he the the the the the the the the the the the the the the the the the the the the the the the the
(Iteration 407 / 1015); Epoch(3 / 5); loss: 57.76694
Sample:
 e the the the the the the the the the the the the the the the the the the the the the the the the th
(Iteration 501 / 1015) loss: 55.58795
(Iteration 610 / 1015); Epoch(4 / 5); loss: 52.30736
Sample:
 Ce the wall the wall the wall the wall the wall the wall the wall the wall the wall the wall the wal
(Iteration 813 / 1015); Epoch(5 / 5); loss: 47.82888
Sample:
 Le and the bece the bested the bested the bested the bested the bested the bested the bested the bes
(Iteration 1001 / 1015) loss: 46.16495
(Iteration 1015 / 1015); Epoch(5 / 5);