## Recurrent Neural Networks - Supervised Learning II - MDS Computational Linguistics

### Goal of this tutorial
- Introduce Recurrent Neural Networks (RNNs)
- Implement RNN for sentiment analysis
- Implement Long-Short Term Memories (LSTMs) for sentiment analysis
- Implement Gated Recurrent Units (GRUs) for sentiment analysis

### General
- This notebook was last tested on Python 3.6.9 and PyTorch 1.2.0

We would like to acknowledge the following materials which helped as a reference in preparing this tutorial:
- https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf

## Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are used to model sequences of arbitrary length (e.g., sequence of words in a sentence, sequence of sentences in a document, sequence of frames in a video). RNNs typically use their internal state (memory) to process sequence of inputs. At each time-step, RNNs output a prediction and hidden state, feeding its previous hidden state into each next step. RNNs are applied in a wide range of NLP applications:
- language modeling, where RNN can condition on **all** previous words in the corpus unlike n-gram language model
- text classification, where the states act as features (we will see sentiment analysis in this tutorial)
- machine translation, where a RNN is used to process a sentence in source language and another RNN is used to decode the sentence in target language (we will see this in the "Machine Translation" course)
- sequence labeling, where the states in RNN are used to predict a category for each item in the sequence (we might see named entity recognition in the next tutorial)

Recommended reading for understanding the theory of RNNs: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 


### Grabbing few tweets using torchtext

Let us follow **torchtext** tutorial to read few tweets from the [sentiment analysis dataset](http://alt.qcri.org/semeval2016/task4/) used in the previous tutorial on feedforward neural networks. The preprocessed (tokenization, removing URLs, mentions, hashtags and so on) tweets are placed under ``data/sentiment-twitter-2016-task4`` folder in three files as ``train.tsv``, ``dev.tsv`` and ``test.tsv``.  

Let us view few tweets from ``train.tsv`` using pandas.

In [1]:
import pandas as pd
df = pd.read_csv("./data/sentiment-twitter-2016-task4/train.tsv", sep = '\t', header=None) # the separator of tsv file is `\t`
df.head()


Unnamed: 0,0,1
0,dear <<<MENTION>>> the newooffice for mac is g...,2
1,<<<MENTION>>> how about you make a system that...,2
2,i may be ignorant on this issue but should we ...,2
3,thanks to <<<MENTION>>> i just may be switchin...,2
4,if i make a game as a <<<HASHTAG>>> universal ...,0


Let us pick up 5 tweets from the training set and convert them to tensors.

In [2]:
# import related packages
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset

# define the white space tokenizer to get tokens
def tokenize_en(tweet):
    """
    Tokenizes English tweet from a string into a list of strings (tokens)
    """
    return tweet.strip().split()

# define the TorchText's fields
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token=None)

To use the different splits (training, development and testing), we use `TabularDataset` class

In [3]:
train, val, test = TabularDataset.splits(
    path="./data/sentiment-twitter-2016-task4", # the root directory where the data lies
    train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
    format='tsv',
    skip_header=False, # if your tsv file has a header, make sure to pass this to ensure it doesn't get proceesed as data!
    fields=[('tweet', TEXT), ('label', LABEL)])

Build our vocabulary to map words to integers:

In [4]:
TEXT.build_vocab(train, min_freq=2) # buiilds vocabulary based on all the words that occur at least twice in the training set
LABEL.build_vocab(train)

Initialize the iterators for the train, validation, and test data.

In [5]:
from torchtext.data import Iterator, BucketIterator

train_iter, val_iter, test_iter = BucketIterator.splits(
 (train, val, test), # we pass in the datasets we want the iterator to draw data from
 batch_sizes=(5,64,64),
 sort_key=lambda x: len(x.tweet), 
 sort=True,
# A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. 
 sort_within_batch=False
)

Create a batch of five examples and print them

In [6]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.

# print the five examples with padding 
print("processed tweets: ")
for j in range(tweets.shape[1]): # sample loop
    tmp = []
    for i in range(tweets.shape[0]): # token loop
        tmp.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tmp)

processed tweets: 
0  sample: ['new', '<unk>', 'with', 'bentley', 'tomorrow']
1  sample: ['bringing', 'the', 'bentley', 'out', 'tomorrow']
2  sample: ['<<<mention>>>', 'make', 'david', 'beckham', 'tomorrow']
3  sample: ['ihop', 'is', 'the', 'move', 'tomorrow']
4  sample: ['i', 'want', 'ihop', 'tomorrow', 'morning']


### Creating a single hidden layer RNN

PyTorch has ``torch.nn.RNN`` module that implements the vanilla (Elman) RNN with *tanh* or *ReLU* non-linearity. The documentation for this module is [here](https://pytorch.org/docs/stable/nn.html#torch.nn.RNN). Let us use the sample batch of five examples created before to understand this module.

In this tutorial, we will represent the input tweet using a sequence of word embeddings (for each word present in the tweet). We will use ``torch.nn.Embedding module`` to store word vectors corresponding to words in the vocabulary.

Before implementing the embedding module for our usecase, let us compute the size of the word vocabulary.

In [7]:
VOCAB_SIZE = len(TEXT.vocab.stoi)
print(VOCAB_SIZE)

4875


Let us implement the embedding module for our usecase:

In [8]:
# an Embedding module containing 10 dimensional tensor for each word in the vocabulary
import torch
import torch.nn as nn

# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py
embedding = nn.Embedding(VOCAB_SIZE, 10, sparse=True)

Let us now feed the tensors of our sample batch to the embedding module and extract the sequence of word embeddings for each tweet

In [9]:
# print tensor containing word ids for our batch
print("word ids: ", tweets.data)

# feed the "word ids" tensor to the embedding module
tweet_input_embeddings = embedding(tweets)

# print the dimensions of the tweet_embeddings
print("tweet input word embeddings size: ", tweet_input_embeddings.size()) 
# first dimension - number of examples (5)
# second dimension - number of words in that single example (5)
# third dimension - number of features for a word (10) (or word embedding size)

word ids:  tensor([[  49, 1273,    4,  191,    6],
        [   0,    2,   82,   14,   76],
        [  18,  215,   73,    2,  191],
        [ 215,   48,  145,  598,   21],
        [  21,   21,   21,   21,  136]])
tweet input word embeddings size:  torch.Size([5, 5, 10])


Now let us define the RNN module:

In [10]:
"""
define the RNN module
"""
# first input - number of features in x (10, size of the word embedding)
# second input - number of features in hidden state h_t (20, size of the hidden layer)
# third input - number of recurrent layers (1)
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1) # input_size, hidden_size, num_layers

RNN module takes two inputs: *the initial hidden state for each element in the batch* (t=0) and the *input features* (``tweet_input_embeddings`` in our case).

Let us construct the initial hidden state.

In [11]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (1)
# second dimension - number of words in the single example (5)
# third dimension - number of features in hidden state h_t (20, size of the hidden layer)
h0 = torch.randn(1, 5, 20)

Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.

In [12]:
"""
forward propagation over the RNN model
"""
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's of apprpriate size (num_layers, batch, hidden_size) when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the RNN

In [13]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([5, 5, 20])


``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len

In [14]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([1, 5, 20])


You can take the output representation for a tweet after processing the last token (t=seq_len or last timestep) and call the resulting representation as the tweet representation that "summarizes" the information present in the tweet. This tweet representation can further be used for a useful task like tweet classification (we will try out sentiment analysis later in this tutorial) by adding a classification module on top of the tweet representation.

Let us compute the final tweet representation:

In [15]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

tweet output embeddings size:  torch.Size([5, 20])


## Multilayered RNN

For some applications, we may need more than one hidden layer for RNN to model the information flow. Adding more layers requires fews changes.

Firstly, we change the ``num_layers`` argument to reflect the number of layers we want during the RNN module definition.

In [16]:
"""
define a 2 layered RNN module
"""
# first input - number of features in x (10, size of the word embedding)
# second input - number of features in hidden state h_t (20, size of the hidden layer)
# third input - number of recurrent layers (2)
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2) # input_size, hidden_size, num_layers

Similar to single layered RNN, Multilayered RNN module takes two inputs: the initial hidden state for each element in the batch (t=0) and the input features (tweet_input_embeddings in our case).

Let us construct the new initial hidden state for a 2 layered RNN.

In [17]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (2)
# second dimension - number of words in the single example (5)
# third dimension - number of features in hidden state h_t (20, size of the hidden layer)
h0 = torch.randn(2, 5, 20)

Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.

In [18]:
"""
forward propagation over the RNN model
"""
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the RNN

In [19]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([5, 5, 20])


``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len for a 2 layered RNN.

In [20]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 5, 20])


Let us compute the final tweet representation:

In [21]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

tweet output embeddings size:  torch.Size([5, 20])


## RNN for Sentiment Analysis

In this section we will implement RNN for classifying the sentiment of the tweet (same task used in our previous feedforward neural networks tutorial).

We will pick up most of the functions from our feedforward neural networks code:

In [22]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed)

# hyperparameters
MAX_EPOCHS = 5
LEARNING_RATE = 0.05
NUM_CLASSES = 3
EMBEDDING_SIZE = 10

Now we can define the full RNN model:

In [23]:
"""
create a model for RNN
"""
class RNNmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size, hidden_size, num_layers):
    # In the constructor we define the layers for our model
    super(RNNmodel, self).__init__()
    # word embedding lookup table
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size, sparse=True)
    # core RNN module
    self.rnn_layer = nn.RNN(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers) 
    # activation function
    self.activation_fn = nn.ReLU()
    # classification related modules
    self.linear_layer = nn.Linear(hidden_size, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=0)
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.embedding(x)
    out, _ = self.rnn_layer(out) # since we are not feeding h_0 explicitly, h_0 will be initialized to zeros by default
    # classify based on the hidden representation after RNN processes the last token
    out = out[-1]
    out = self.activation_fn(out)
    out = self.linear_layer(out)
    out = self.softmax_layer(out) # accepts 2D or more dimensional inputs
    return out

Some additional hyperparameters for RNN

In [24]:
# hyperparameters of RNN
HIDDEN_SIZE = 20
NUM_LAYERS = 2

Rest of the pipeline looks similar to our feedforward neural networks code (except that we are using **torchtext** instead of **DataLoader**):

In [25]:
# train logic 
def train(loader):
  total_loss = 0.0
  # iterate throught the data loader
  num_batches = 0
  for batch in loader:
    # load the current batch
    batch_input, batch_output = batch.tweet, batch.label
    
    # forward propagation
    # pass the data through the model
    model_outputs = model(batch_input)
    # compute the loss
    cur_loss = criterion(model_outputs, batch_output)
    total_loss += cur_loss.item()
    
    # backward propagation (compute the gradients and update the model)
    # clear the buffer
    optimizer.zero_grad()
    # compute the gradients
    cur_loss.backward()
    # update the weights
    optimizer.step()
    
    num_batches += 1
  return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
  accuracy, num_examples = 0.0, 0
  with torch.no_grad(): # impacts the autograd engine and deactivate it. reduces memory usage and speeds up computation
    for batch in loader:
      # load the current batch
      batch_input, batch_output = batch.tweet, batch.label
      # forward propagation
      # pass the data through the model
      model_outputs = model(batch_input)
      # identify the predicted class for each example in the batch
      _, predicted = torch.max(model_outputs.data, 1)
      # compare with batch_output (gold labels) to compute accuracy
      accuracy += (predicted == batch_output).sum().item()
      num_examples += batch_output.size(0)
  return accuracy/num_examples


Let us define the RNN model.

In [26]:
# define the model
model = RNNmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS) 
model.to(device)
# define the loss function (last node of the graph)
criterion = nn.NLLLoss()

Let us perform the training.

In [27]:
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_iter)  
  # compute the training accuracy
  train_acc = evaluate(train_iter)
  # compute the validation accuracy
  val_acc = evaluate(val_iter)
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/5], Loss: 1.6099, Training Accuracy: 0.3905, Validation Accuracy: 0.3537
Epoch [2/5], Loss: 1.6066, Training Accuracy: 0.3997, Validation Accuracy: 0.3542
Epoch [3/5], Loss: 1.6044, Training Accuracy: 0.4045, Validation Accuracy: 0.3677
Epoch [4/5], Loss: 1.6021, Training Accuracy: 0.3992, Validation Accuracy: 0.3512
Epoch [5/5], Loss: 1.6000, Training Accuracy: 0.4097, Validation Accuracy: 0.3592


## GRUs

Gated Recurrent Units (GRUs) are a variant of RNNs that use more complex units for activation. They are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

GRU is defined by ``torch.nn.GRU`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU). Now let us define the GRU module.

In [28]:
"""
define the GRU module
"""
# first input - number of features in x (10, size of the word embedding)
# second input - number of features in hidden state h_t (20, size of the hidden layer)
# third input - number of recurrent layers (2)
gru_rnn = nn.GRU(input_size=10, hidden_size=20, num_layers=2) # input_size, hidden_size, num_layers

Similar to RNN, GRU module takes two inputs: *the initial hidden state for each element in the batch* (t=0) and the *input features* (``tweet_input_embeddings`` in our case).

Let us feed both the initial hidden state and tweet embeddings to our GRU model.

In [29]:
"""
forward propagation over the GRU model
"""
output, hn = gru_rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contqains the output features $h_t$ from the last layer of the GRU

In [30]:
# output = seq_len, batch, hidden_size (output features from last layer of GRU)
print("output size: ", output.size())

output size:  torch.Size([5, 5, 20])


``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len

In [31]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 5, 20])


Similar to RNN, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [32]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

tweet output embeddings size:  torch.Size([5, 20])


## LSTMs

Long short-term memory (LSTMs) are a variant of RNNs that use more complex units for activation. Similar to the spirit of GRU, they are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

LSTM is defined by ``torch.nn.LSTM`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM). Now let us define the LSTM module.

In [33]:
"""
define the LSTM module
"""
# first input - number of features in x (10, size of the word embedding)
# second input - number of features in hidden state h_t (20, size of the hidden layer)
# third input - number of recurrent layers (2)
lstm_rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2) # input_size, hidden_size, num_layers

Unlike RNN and GRU, LSTM module takes three inputs: the initial hidden state for each element in the batch (t=0), the input features (tweet_input_embeddings in our case) and initial cell state for each element in the batch.

Let us construct the initial cell state (this construction is similar to that of initial hidden state)

In [34]:
"""
cell state at time-step 0 (h_0)
"""
# first dimension - number of LSTM layers (2)
# second dimension - number of words in the single example (5)
# third dimension - number of features in hidden state h_t (20, size of the hidden layer)
c0 = torch.randn(2, 5, 20)

Let us feed the initial hidden state, initial cell state and tweet embeddings to our LSTM model.

In [35]:
"""
forward propagation over the LSTM model
"""
output, (hn, cn) = lstm_rnn(tweet_input_embeddings, (h0, c0)) # h0 and c0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the LSTM

In [36]:
# output = seq_len, batch, hidden_size (output features from last layer of LSTM)
print("output size: ", output.size())

output size:  torch.Size([5, 5, 20])


``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len

In [37]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 5, 20])


``cn`` is a tensor of shape (num_layers, batch, hidden_size) containing the cell state for t = seq_len.

In [38]:
# c_n = num_layers, batch, hidden_size (cell state for t=seq_len or cell state at last timestep)
print("last cell state size: ", hn.size())

last cell state size:  torch.Size([2, 5, 20])


Similar to RNN and GRU, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [39]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

tweet output embeddings size:  torch.Size([5, 20])
