# Recurrent Neural Networks

A particular challenge with sequential data and modelling tasks is that the sequence lengths can vary from one dataset example to the next. This makes the use of a fixed input size architecture such as the MLP unsuitable. In addition, there can be many different types of sequential modelling tasks that we want to consider, each of which could have different architectural requirements, not just one-to-one relation like MLP. For example:
- Text sentiment analysis (many-to-one)
- Image captioning (one-to-many)
- Language translation (many-to-many)
- Part-of-speech tagging (many-to-many)

The recurrent neural network (RNN) is designed to handle this variability of lengths in sequence data and diversity of problem tasks.

### Basic RNN Computation

Let $\{\boldsymbol x_t\}_{t=1}^T$ be an example sequence input, with each $\boldsymbol x_t\in\mathbb R^D$. Suppose that we are in the many-to-many setting, and there is a corresponding sequence of labels $\{y_t\}_{t=1}^T$, with $y_t\in Y$, where $Y$ could be $\{0,1\}$ for a binary classification task for example.

The basic RNN computation is given as follows:

$$
\begin{gather}
\boldsymbol h_t^{(1)}=\sigma\left( \mathbf W_{hh}^{(1)}\boldsymbol h_{t-1}^{(1)} + \mathbf W_{xh}^{(1)}\boldsymbol x_t + \boldsymbol b_h^{(1)} \right)
\\
\hat {\boldsymbol y}_t = \sigma_{out}\left( \mathbf W_{hy}\boldsymbol h_t^{(1)} +\boldsymbol b_y \right)
\end{gather}
$$

for $t=1,\cdots,T$, where $\boldsymbol h_t^{(1)}\in\mathbb R^{n_1}$, $\mathbf W_{hh}^{(1)}\in\mathbb R^{n_1\times n_1}$, $\mathbf W_{xh}^{(1)}\in\mathbb R^{n_1\times D}$, $\boldsymbol b_h^{(1)}\in\mathbb R^{n_1}$, $\hat{\boldsymbol y}_t\in\mathbb R^{n_y}$, $\mathbf W_{hy}\in\mathbb R^{n_y\times n_1}$, $\boldsymbol b_y\in\mathbb R^{n_y}$, $\sigma$ and $\sigma_{out}$ are activation functions, $n_1$ is the number of units in the hidden layer, and $n_y$ is the dimension of the output space $Y$. Note that the computation requires an initial hidden state $\boldsymbol h_0^{(1)}$ to be defined, although in practice this is often just set to the zero vector.

![rnn-structure](../../figures/rnn-structure.png)

Recurrent neural networks make use of weight sharing, similar to convolutional neural networks, but this time the weights are shared across time. This allows the RNN to be 'unrolled' for as many time steps as there are in the data input $\boldsymbol x$.

The RNN also has a persistent state, in the form of the hidden layer $\mathbf h^{(1)}$. This hidden state can carry information over an arbitrary number of time steps, and so predictions at a given time step $t$ can depend on events that occurred at any point in the past, at least in principle. As with MLPs, the hidden state stores distributed representations of information, which allows them to store a lot of information, in contrast to hidden Markov models.

### Stacked RNNs

RNNs can also be made more powerful by stacking recurrent layers on top of each other:

$$
\begin{gather}
\boldsymbol h_t^{(k)} = \sigma\left( \mathbf W_{hh}^{(k)}\boldsymbol h_{t-1}^{(k)} + \mathbf W_{xh}^{(k)} \boldsymbol h_t^{(k-1)} + \boldsymbol b_h^{(k)} \right)
\\
\hat y_t = \sigma_{out}\left( \mathbf W_{hy} \boldsymbol h^{(L)} + \boldsymbol b_y \right)
\end{gather}
$$

for $k=1,\cdots,L$.

![stacked-rnn-structure](../../figures/stacked_rnn_structure.png)

### Bidirectional RNNs

Standard recurrent neural networks are uni-directional. That is, they only take past context into account. In some applications, where the full input sequence is available to make predictions, it is possible and desirable for the network to take both past and future context into account.

For example, consider a part-of-speech (POS) tagging problem, where the task is to label each word in a sentence according to its particular part of speech, e.g. noun, adjective, verb etc. In some cases, the correct label can be ambiguous given only the past context, for example the word "light" in the sentence "There's a light..." could be a adjective or a noun depending on how the sentence continues. 

Bidirectional RNNs are designed to look at both future and past context. They consist of two RNNs running forward and backwards in time, whose states are combined in sum way (e.g. adding or concatenating) to produce the final hidden state of the layer.

![bidirectional-rnn-structure](../../figures/bidirectional_rnn-structure.png)

### Training RNNs

RNNs are trained in the same way as multilayer perceptrons and convolutional neural networks. A loss function $L(y_1,\cdots, y_T, \hat y_1,\cdots, \hat y_T) is defined according to the problem task and learning principle, and the network is trained using the backpropagation algorithm and a selected network optimiser. In the many-to-one case (e.g. sentiment analysis), the loss function may be defined as $L(y_T,\hat y_T)$.

Recall the equation describing the backpropagation of errors in the MLP case:

$$
\begin{align}
\delta^{(k)}=\sigma'(\boldsymbol a^{(k)})(\mathbf W^{(k)})^T\delta^{(k+1)}
\end{align}
$$

for $k=1,\cdots,L$ where $k$ indexes the hidden layers. In the case of recurrent neural networks, the errors primarily backpropagate along the time direction, and we obtain the following propagation of errors in the hidden states:

$$
\begin{align}
\delta_{t-1}^{(k)}=\sigma'(\boldsymbol a_{t-1}^{(k)})(\mathbf W_{hh}^{(k)})^T\delta_t^{(k)}
\end{align}
$$

for $t=T,\cdots,1$. For this reason, the backpropagation algorithm for RNNs is referred to as backpropagation through time (BPTT).

Recurrent neural networks can also be trained as generaive models for unlabelled sequence data, by re-writing the network to send the output back as the input to the next step, which is an example of self-supervised learning, which is where we use an unlabelled dataset to frame a supervised learning problem. This can be used to train language models, or generative music models for example. In practical we treat this case the same as a supervised learning problem, where the outputs are the same as the inputs but shifted by one time step. This particular technique is also sometimes referred to as teacher forcing.

In [53]:
import torch
import torch.nn as nn
import torchtext
import numpy as np
import os
import json
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pathlib import Path
import string
from collections import Counter, OrderedDict
from torch.utils.data import DataLoader
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"


In [27]:
# Load the text file into a string

with open(Path('../../datasets/Shakespeare.txt'), 'r', encoding='utf-8') as file:
    text = file.read()

In [28]:
# Create a lit of chunks of text

text_chunks = text.split('.')

In [37]:
# Display some randomly selected text samples

num_samples = 3
indices = np.random.choice(len(text_chunks), num_samples, replace=False)
for chunk in np.array(text_chunks)[indices]:
    print(chunk)
    print('-----')

thou art not noble;
for all the accommodations that thou bear'st
are nursed by baseness
-----
you are plebeians,
if they be senators: and they are no less,
when, both your voices blended, the great'st taste
most palates theirs
-----
hortensio:
who shall begin?

lucentio:
that will i
-----


In [30]:
# Strip any whitespace at the beginning or end of the strings and convert the strings to lowercase

text_chunks = [s.strip().lower() for s in text_chunks]

In [34]:
# Filter out the chunks that are too short or too long

text_chunks = [sentence for sentence in text_chunks if 10 <= len(sentence) <= 400]

In [39]:
# Define a function to create text inputs and targets
def create_pure_inputs_and_targets(chunks):
    inputs = [chunk[:-1] for chunk in chunks]
    targets = [chunk[1:] for chunk in chunks]

    return list(zip(inputs, targets))

In [40]:
# Create pure inputs and targets

pure_ds = create_pure_inputs_and_targets(text_chunks)

# Make train and validation splits

train_set, validation_set = train_test_split(pure_ds, test_size=0.2)

In [73]:
# Define a function that converts the sentences to tokens at the character level

def get_vocab(chunks):
    counter = Counter(''.join(chunks))
    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    ordered_dict = OrderedDict(sorted_by_freq_tuples)
    vocab = torchtext.vocab.vocab(ordered_dict, specials=['<unk>']) # special characters
    vocab.set_default_index(-1) # token for unk characters
    return vocab

In [74]:
# Create the vocabulary

vocab = get_vocab(text_chunks)

In [75]:
# Define a function that preprocesses the data and return dataloaders
def get_loaders(train_set, validation_set, batch_size):
    def collate_batch(batch):
        input_list, target_list = [], []
        for input, target in batch:
            processed_input = torch.tensor([vocab[c] for c in input], dtype=torch.int64)
            processed_target = torch.tensor([vocab[c] for c in target], dtype=torch.int64)

            input_list.append(processed_input)
            target_list.append(processed_target)
        
        # input sequence의 길이가 다를 때 이를 동일하게 맞추어 주기 위하여 padding 추가
        input_tensor = torch.nn.utils.rnn.pad_sequence(input_list, batch_first=True, padding_value=0)
        target_tensor = torch.nn.utils.rnn.pad_sequence(target_list, batch_first=True, padding_value=0)
        return input_tensor.to(device), target_tensor.to(device)
    
    # collate_fn: Batch를 특정한 처리를 하여 결합하고자 할 때 사용하는 함수
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
    validation_loader = DataLoader(validation_set, batch_size=batch_size, collate_fn=collate_batch)

    return train_loader, validation_loader

In [76]:
train_loader, validation_loader = get_loaders(train_set, validation_set, batch_size=32)

In [77]:
# Build a RNN model

class RNN(nn.Module):
    def __init__(self, vocab, embedding_dim, gru_units):
        super(RNN, self).__init__()
        
        self.embedding = nn.Embedding(len(vocab), embedding_dim, padding_idx=0)
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=gru_units, batch_first=True)
        self.linear = nn.Linear(gru_units, len(vocab))
    
    def forward(self, x, h0=None):
        x = self.embedding(x)
        out_gru, h = self.gru(x, h0)
        out = self.linear(out_gru)
        return out, h

In [78]:
rnn_model = RNN(vocab, 256, 1024).to(device)
rnn_model

RNN(
  (embedding): Embedding(39, 256, padding_idx=0)
  (gru): GRU(256, 1024, batch_first=True)
  (linear): Linear(in_features=1024, out_features=39, bias=True)
)

In [79]:
# Create an EarlyStopping class for training

class EarlyStopping:
    def __init__(self, patience):
        self.patience = patience
        self.counter = 0
        self.min_valid_loss = np.inf
    
    def early_stop(self, validation_loss):
        if validation_loss < self.min_valid_loss:
            self.min_valid_loss = validation_loss
            self.counter = 0
        elif validation_loss > self.min_valid_loss:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False

In [80]:
def train_model(model, loss_function, optimiser, train_loader, validation_loader, early_stopping, epochs):
    epoch_losses, epoch_losses_validation = [], []
    epoch_acc, epoch_acc_validation = [], []

    for epoch in range(epochs):
        model.train()

        sum_loss, sum_loss_validation = 0., 0.
        sum_acc, sum_acc_validation = 0., 0.

        for inputs, y_true in train_loader:
            optimiser.zero_grad()

            y_pred = model(inputs)[0]
            loss = loss_function(y_pred, F.one_hot(y_true, len(vocab)).float())
            loss.backward()
            optimiser.step()

            sum_loss += loss.item()
            sum_acc += (y_true == y_pred.argmax(dim=2)).sum() / (inputs.shape[0] * inputs.shape[1])

        with torch.no_grad():
            model.eval()

            for inputs, y_true in validation_loader:
                y_pred = model(inputs)[0]

                loss = loss_function(y_pred, F.one_hot(y_true, len(vocab)).float())
                sum_loss_validation += loss.item()
                sum_acc_validation += (y_true == y_pred.argmax(dim=2)).sum() / (inputs.shape[0] * inputs.shape[1])
        
        avg_epoch_loss = sum_loss / len(train_loader)
        avg_epoch_acc = sum_acc / len(train_loader)

        avg_epoch_loss_validation = sum_loss_validation / len(validation_loader)
        avg_epoch_acc_validation = sum_acc_validation / len(validation_loader)

        epoch_losses.append(avg_epoch_loss)
        epoch_acc.append(avg_epoch_acc)

        epoch_losses_validation.append(avg_epoch_loss_validation)
        epoch_acc_validation.append(avg_epoch_acc_validation)

        print(f"Epoch {epoch + 1} - loss: {avg_epoch_loss:.4f}, val_loss: {avg_epoch_loss_validation:.4f}, "
              f'accuracy: {avg_epoch_acc:.4f}, val_accuracy: {avg_epoch_acc_validation:.4f}')
        
        if early_stopping.early_stop(avg_epoch_loss_validation):
            break

    history = {
        'loss': epoch_losses,
        'val_loss': epoch_losses_validation,
        'accuracy': epoch_acc,
        'val_accuracy': epoch_acc_validation
    }

    return history


In [81]:
# Train the model

history = train_model(rnn_model, nn.CrossEntropyLoss(), torch.optim.Adam(rnn_model.parameters()), train_loader, validation_loader, EarlyStopping(patience=3), epochs=15)

Epoch 1 - loss: nan, val_loss: nan, accuracy: 0.6383, val_accuracy: 0.6459
Epoch 2 - loss: nan, val_loss: nan, accuracy: 0.6412, val_accuracy: 0.6459


KeyboardInterrupt: 