## Recurrent Neural Networks with Attention

### Goal of this tutorial
- Introduce Recurrent Neural Networks (RNNs)
- Implement RNN for sentiment analysis
- Combine RNN and Self-attention
- Implement Long-Short Term Memories (LSTMs) for sentiment analysis
- Implement Gated Recurrent Units (GRUs) for sentiment analysis

### General
- This notebook was last tested on Python 3.7, PyTorch 1.8.0, torchtext 0.5.0

We would like to acknowledge the following materials which helped as a reference in preparing this tutorial:
- https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf

In [1]:
import pandas as pd
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
import torch
import torch.nn as nn
import math
import torch.nn.functional as F

### Embedding Layer

The ``Embedding`` layer in Pytorch is where we pass our vocabulary to get get back a word vector for each word in the vocabulary. We need to know what the input and output of this layer look like. Let's first
do this on a dummy example where we have two sentences ``x_1`` and ``x_2``. Let's assume we have the two sentences as:

In [2]:
x_1 = "He is very nice"
x_2 = "She is very kind"

Let's convert the two sentences into indexes (each word is replaced with its index in the vocabulary).
Let's assume our ``vocabulary size`` is set to 100. Remember, vocabulary size is a hyper-parameter.
Let's also store that ``vocabulary size`` in a variable ``VOCAB_SIZE`` now as we will need to pass it to the ``Embedding`` layer later.

In [3]:
x_1 = [1, 25, 40, 5]
x_2 = [4, 25, 40, 99]
VOCAB_SIZE = 100

### Max sequence length

One last thing we need to think about is the ``length of each sequence``. The two examples above are nicely set to equal length = 4. This does not need to be the case, as we can have sequences of varying lengths. We will be passing a batch of sentences to Pytorch and the max sequence length will be set to the length of the longest sentence in that batch. The rest of sentences (shorter ones) will be padded with zeros. Now, do we need to explicitly provide the max sequence length to Pytorch? And how do we know the max seq length for each batch, if different batches have sequences of varying lengths and each batch is set to the max sent in that batch? Well, rest assured, we don't really need to worry about that. Pytorch will assign a max seq length for each batch. We will be able to inspect the max seq length for a given batch using output of the ``Embedding`` layer. (We will see that soon).

## Size of word vector

The ``Embedding`` layer will give us a vector for each word in the vocabulary.
Now, we will need to tell it what size we want for that vector. Popular values for a vector size are usally between 100-300 for many tasks (e.g., sentiment analysis"). Let's set it to 300 dimensions. (You are encouraged to play with this value as practice). All words in the vocabulary will have the same embedding size. Let's put that hyper-parameter in a variable ``WORD_VEC_SIZE``:

In [4]:
WORD_VEC_SIZE= 300

We are now ready to call the ``Embedding`` class to construct an embeddings tensor:

In [5]:
# Constructing an embedding Layer:
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE)
input = torch.LongTensor([ x_1, x_2 ])
embedded=embedding(input)
print(embedded)
print(embedded.shape) # batch size, sequence lenght, embedding size

tensor([[[ 1.0887, -1.7415,  1.9517,  ..., -0.1760, -0.3818, -0.0337],
         [ 1.6868, -2.0368, -0.3786,  ..., -0.5674,  0.9271,  1.2256],
         [ 2.3381,  1.2338, -0.6665,  ..., -1.6691, -0.0250, -1.0502],
         [ 1.4910,  0.5358,  0.5426,  ..., -0.0715, -0.0492,  0.8235]],

        [[-0.6776, -0.9685,  0.9145,  ...,  0.2226, -1.8891,  0.6292],
         [ 1.6868, -2.0368, -0.3786,  ..., -0.5674,  0.9271,  1.2256],
         [ 2.3381,  1.2338, -0.6665,  ..., -1.6691, -0.0250, -1.0502],
         [-1.3383, -0.1880,  1.4515,  ..., -0.5865,  0.1810, -0.3382]]],
       grad_fn=<EmbeddingBackward>)
torch.Size([2, 4, 300])


**Let's print the shape of this tensor:**

In [6]:
print(embedded.size())

torch.Size([2, 4, 300])


This is telling us:
- We have ``2 examples`` (that is, our ``x_1`` and ``x_2``). (Note: We will be paqssing a whole batch to the ``Embedding`` class and so this first dimension will be equal to the ``batch size``.
- For each of the two examples, we have a ``max sequence length`` = 4 (x_1 and x_2 each had 4 indexes).
- The ``word vector dimension`` is set to 300.

### Max sequence length: Another note

Recall from above we mentioned Pytorch automatically infers the max sequence length for each batch. 
For the example above (as you can see from the second dimension returned by ``embedded.size()``, Pytorch 
inferred the max seq length for this batch of two sentences is 4.
Let's just adjust the second example, **adding two more words** (the string "and kind"). Note, both our ``VOCAB_SIZE`` and ``WORD_VEC_SIZE`` stay the same as before. We assign the word "and" an index of "7" and the word "considerate" and index of "60". Note that we have to pad the first example ``x_1`` with zeros in the end. (Try removing the zero padding. What do you observe when you run your code with the ``Embedding`` class? Hint: You will get an error.):

In [7]:
x_1 = "He is very nice"
x_2 = "She is very kind and considerate"
x_1 = [1, 25, 40, 5, 0, 0]
x_2 = [4, 25, 40, 99, 7, 60]

Now, let's create a new ``embedding`` layer by creating a new instance of the ``Embedding`` class:

In [8]:
# Constructing an embedding Layer:
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE)
input = torch.LongTensor([x_1, x_2 ])
embedded=embedding(input)
print(embedded)

tensor([[[-1.4602, -1.6259, -1.6428,  ..., -2.1957, -0.3970,  1.5695],
         [ 2.3837, -0.4616, -0.5866,  ..., -0.1050,  0.2585,  2.2116],
         [-1.6406, -0.9257,  1.0679,  ..., -1.8236,  0.7405,  0.4554],
         [ 0.3086,  0.9330,  1.1569,  ...,  0.5056,  0.0749, -1.2036],
         [ 1.3079, -1.2819,  1.0718,  ..., -1.5882, -1.4971, -1.0592],
         [ 1.3079, -1.2819,  1.0718,  ..., -1.5882, -1.4971, -1.0592]],

        [[ 0.9356, -0.5437,  0.7814,  ..., -1.0575,  0.4514, -0.2538],
         [ 2.3837, -0.4616, -0.5866,  ..., -0.1050,  0.2585,  2.2116],
         [-1.6406, -0.9257,  1.0679,  ..., -1.8236,  0.7405,  0.4554],
         [-0.3616,  2.0561, -0.2116,  ..., -1.0903, -0.8611, -1.9390],
         [ 0.7153, -0.2962,  1.5125,  ...,  0.0623, -0.0987,  2.6032],
         [ 0.8322,  0.6629,  0.4445,  ..., -0.3016, -0.1937,  0.7088]]],
       grad_fn=<EmbeddingBackward>)


**If we inspect the shape of the new tensor ``embedded``, we will see the second dimension now changed to 6, to match the max sequence length:**

In [9]:
print(embedded.size())

torch.Size([2, 6, 300])


### How does Pytorch initialize word vector dimensions/weights?

Note that Pytorch initializes the word vectors with initialized from a **normal distribution** $ \mathcal{N}(0, 1) $. The word embedding weights are by default learnable parameters in Pytorch and so they will be adjusted during training. (Note: These weights can be initialized from an external word embedding tool such as Word2vec, Fasttext, or Glove. Also, the weights can be frozen, which is a reasonable option when initialized from an external tool. You can choose to keep learning them within the model with your training data). Below we show the ones initialized from a normal distribution by Pytorch.

In [10]:
embedding.weight

Parameter containing:
tensor([[ 1.3079, -1.2819,  1.0718,  ..., -1.5882, -1.4971, -1.0592],
        [-1.4602, -1.6259, -1.6428,  ..., -2.1957, -0.3970,  1.5695],
        [-0.0959,  0.8302,  0.1974,  ..., -0.0876, -0.0506,  0.4265],
        ...,
        [ 0.1803,  0.7093, -1.2665,  ..., -0.5986,  0.4312,  0.7184],
        [-0.0135,  0.2512, -1.8572,  ...,  0.9146, -0.5150,  0.1267],
        [-0.3616,  2.0561, -0.2116,  ..., -1.0903, -0.8611, -1.9390]],
       requires_grad=True)

More information about the ``Embedding`` class can be found [here](https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding).

## Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are used to model sequences of arbitrary length (e.g., sequence of words in a sentence, sequence of sentences in a document, sequence of frames in a video). RNNs typically use their internal state (memory) to process sequence of inputs. At each time-step, RNNs output a prediction and hidden state, feeding its previous hidden state into each next step. RNNs are applied in a wide range of NLP applications:
- language modeling, where RNN can condition on **all** previous words in the corpus unlike n-gram language model
- text classification, where the states act as features (we will see sentiment analysis in this tutorial)
- machine translation, where a RNN is used to process a sentence in source language and another RNN is used to decode the sentence in target language.
- sequence labeling, where the states in RNN are used to predict a category for each item in the sequence.

Recommended reading for understanding the theory of RNNs: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 


### Grabbing few tweets using torchtext

Let us follow **torchtext** tutorial to read few tweets from the [sentiment analysis dataset](http://alt.qcri.org/semeval2016/task4/) used in the previous tutorial on feedforward neural networks. The preprocessed (tokenization, removing URLs, mentions, hashtags and so on) tweets are placed under ``data/sentiment-twitter-2016-task4`` folder in three files as ``train.tsv``, ``dev.tsv`` and ``test.tsv``.  

Let us view few tweets from ``train.tsv`` using pandas.

In [11]:
import pandas as pd
df = pd.read_csv("./data/sentiment-twitter-2016-task4/train.tsv", sep = '\t', header=None, names=['tweet','label']) # the separator of tsv file is `\t`
df.head()

Unnamed: 0,tweet,label
0,dear <<<MENTION>>> the newooffice for mac is g...,2
1,<<<MENTION>>> how about you make a system that...,2
2,i may be ignorant on this issue but should we ...,2
3,thanks to <<<MENTION>>> i just may be switchin...,2
4,if i make a game as a <<<HASHTAG>>> universal ...,0


**We import the relevant packages, define the tokenizer and TorchText's fields.**

In [12]:
# import related packages
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset

# define the white space tokenizer to get tokens
def tokenize_en(tweet):
    """
    Tokenizes English tweet from a string into a list of strings (tokens)
    """
    return tweet.strip().split()

# define the TorchText's fields
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token = None)

**To use the different splits (training, development and testing), we use `TabularDataset` class to load datasets.**

In [13]:
train, val, test = TabularDataset.splits(
    path="./data/sentiment-twitter-2016-task4/", # the root directory where the data lies
    train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
    format='tsv',
    skip_header=False, # if your tsv file has a header, make sure to pass this to ensure it doesn't get proceesed as data!
    fields=[('tweet', TEXT), ('label', LABEL)])

**Build our vocabulary to map words to integers.**

In [14]:
TEXT.build_vocab(train, min_freq=3) # builds vocabulary based on all the words that occur at least twice in the training set
LABEL.build_vocab(train)

**Initialize the iterators for the train, validation, and test data. Note that we set ``sort`` as False so as to not sort examples based on similar lengths which minimizes padding in this example.**

In [15]:
from torchtext.data import Iterator, BucketIterator

train_iter, val_iter, test_iter = BucketIterator.splits(
 (train, val, test), # we pass in the datasets we want the iterator to draw data from
 batch_sizes=(4,64,64),
 sort_key=lambda x: len(x.tweet), 
 sort=False,
# An argument to use for sorting examples in order to batch together examples with similar lengths and minimize padding. 
 sort_within_batch=False
)

**Create a batch of four examples and print them**

In [16]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.

# print the four examples with padding 
print("processed tweets: ")
for j in range(tweets.shape[1]): # sample loop
    tmp = []
    for i in range(tweets.shape[0]): # token loop
        tmp.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tmp)

processed tweets: 
0  sample: ['murray', 'has', 'given', 'it', 'his', 'all', 'which', "wasn't", 'expected', 'after', 'the', 'last', 'few', 'days', 'he', 'has', 'had', 'but', 'federer', 'just', '<unk>', 'tb', 'set', '<<<hashtag>>>', '<pad>']
1  sample: ['chelsea', 'may', 'have', '<<<digit>>>', 'out', 'on', 'loan', 'wait', 'till', 'you', 'see', "what's", 'going', 'on', 'at', 'juventus', '<<<url>>>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
2  sample: ['you', 'may', 'be', '<unk>', 'to', '<unk>', 'with', 'this', 'new', '<unk>', 'apple', 'watch', 'case', '<<<url>>>', '<<<hashtag>>>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
3  sample: ['<<<mention>>>', 'wondering', 'if', 'mariah', 'carey', 'want', 'to', 'play', 'some', 'concerts', 'with', 'us', 'in', '<unk>', 'after', 'jan', '<<<digit>>>', '<<<digit>>>', 'an', 'starting', 'a', 'tour', 'contact', 'john', '<unk>']


**Let us pick up 4 tweets from the training set and convert them to tensors.**

**Create a batch of four examples and print them**

In [17]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.

# print the four examples with padding 
print("processed tweets: ")
for j in range(tweets.shape[1]): # sample loop
    tmp = []
    for i in range(tweets.shape[0]): # token loop
        tmp.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tmp)

processed tweets: 
0  sample: ['apple', 'watch', 'with', 'expected', 'price', 'of', '<unk>', '<<<digit>>>', 'coming', 'to', 'india', 'in', 'this', 'october', '<<<url>>>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
1  sample: ['<<<mention>>>', '<<<mention>>>', 'novak', 'is', 'surely', 'the', 'man', 'to', 'beat', 'federer', 'or', 'murray', 'may', 'run', 'him', 'to', 'the', 'base', 'line', 'the', 'usual', '<unk>', '<pad>', '<pad>', '<pad>']
2  sample: ['david', 'beckham', 'showed', 'off', 'his', '<unk>', 'side', 'on', 'monday', 'as', 'he', 'prepared', 'to', 'take', 'a', 'spin', 'in', 'los', 'angeles', 'on', 'one', 'of', 'his', 'vintage', '<unk>']
3  sample: ['if', 'angela', 'merkel', '<unk>', 'at', 'bayreuth', '<unk>', 'it', 'may', 'be', 'the', 'only', 'thing', 'she', 'has', 'in', 'common', 'with', '<unk>', '<unk>', '<<<url>>>', '<pad>', '<pad>', '<pad>', '<pad>']


### Creating a single hidden layer RNN

PyTorch has ``torch.nn.RNN`` module that implements the vanilla (Elman) RNN with *tanh* or *ReLU* non-linearity. The documentation for this module is [here](https://pytorch.org/docs/stable/nn.html#torch.nn.RNN). Let us use the sample batch of five examples created before to understand this module.

In this tutorial, we will represent the input tweet using a sequence of word embeddings (for each word present in the tweet). We will use ``torch.nn.Embedding module`` to store word vectors corresponding to words in the vocabulary.

Before implementing the embedding module for our usecase, let us compute the size of the word vocabulary.

In [18]:
VOCAB_SIZE = len(TEXT.vocab.stoi)
print(VOCAB_SIZE)

3344


Let us implement the embedding module (whose underlying weight matrix shape is (``vocabulary size`` $\times$ ``word embedding size``) for our usecase:

In [19]:
# an Embedding module containing 10 dimensional tensor for each word in the vocabulary
import torch
import torch.nn as nn
WORD_VEC_SIZE=300
# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE, sparse=True)

Let us now feed the tensors of our sample batch to the embedding module and extract the sequence of word embeddings for each tweet.

In [20]:
# print tensor containing word ids for our batch
print("*"*50, "\n Word ids for the first batch (recall, it has 4 sentences, each column representing a sentence): \n", tweets.data, "\n","*"*50,)

# feed the "word ids" tensor to the embedding module
tweet_input_embeddings = embedding(tweets)

# print the dimensions of the tweet_embeddings
print("*"*50, "\n Tweet input word embeddings size: ", tweet_input_embeddings.size(), "\n","*"*50,) 
# first dimension - sequence length: number of words per example (same across the whole batch, after padding) --> max_seq = 22
# second dimension -  batch size / number of examples in the batch --> 4
# third dimension - number of dimensions in the word vector

************************************************** 
 Word ids for the first batch (recall, it has 4 sentences, each column representing a sentence): 
 tensor([[  55,    4,   73,   38],
        [  56,    4,  145,  171],
        [  18, 3038, 2055,  192],
        [ 665,   14,  103,    0],
        [ 772, 3228,   63,   22],
        [  12,    2,    0, 2659],
        [   0,  255, 1054,    0],
        [  13,    3,    8,   20],
        [ 240,  464,   86,   16],
        [   3,  206,   46,   19],
        [1655,   57,   36,    2],
        [  10,  766, 2468,  108],
        [  27,   16,    3,  269],
        [ 358,  246,  195,  107],
        [   5,  165,    7,   65],
        [   1,    3, 1739,   10],
        [   1,    2,   10, 1594],
        [   1, 2656, 1960,   18],
        [   1, 1030, 1786,    0],
        [   1,    2,    8,    0],
        [   1, 2119,   62,    5],
        [   1,    0,   12,    1],
        [   1,    1,   63,    1],
        [   1,    1, 2569,    1],
        [   1,    1,    0,    1]]

Let's actually view the actual word embeddings tensor for this batch:

In [21]:
print("*"*50, "\n Embeddings for the first batch: \n", tweet_input_embeddings, "\n","*"*50,) 

************************************************** 
 Embeddings for the first batch: 
 tensor([[[-0.1750,  0.0402,  0.5031,  ...,  0.6920, -0.8593, -0.4142],
         [-0.7898, -0.0891, -0.5490,  ..., -1.3227,  0.6840, -0.9691],
         [ 0.9319, -0.3605,  0.3309,  ..., -0.0160, -0.5626, -0.2331],
         [ 1.2326,  1.6668, -0.0129,  ..., -1.9598, -0.0499,  0.9838]],

        [[-0.8705,  0.3441, -0.6270,  ..., -0.0663, -1.0613, -1.1859],
         [-0.7898, -0.0891, -0.5490,  ..., -1.3227,  0.6840, -0.9691],
         [ 0.7920,  0.0173,  1.4477,  ...,  1.2311, -1.0475,  0.0853],
         [ 0.4255,  0.5645, -0.0418,  ..., -0.1219,  0.1175,  0.1358]],

        [[-1.3675, -0.1361,  1.4924,  ..., -1.2587,  0.6639, -0.0805],
         [-0.3887, -0.8374, -1.9132,  ..., -1.9521,  0.2736, -0.7804],
         [ 0.5587, -0.2877,  1.3229,  ...,  0.1692, -1.3463, -0.2517],
         [-0.7239,  1.2030, -0.7945,  ..., -1.6640,  2.2954,  0.8614]],

        ...,

        [[-0.6559,  0.7843,  0.9961,  ...

What we are seeing is the actual word vectors representing each of the 4 sentences (i.e., whole batch).
This is dimension 2 in ``tweet_input_embeddings``. 

In [22]:
tweet_input_embeddings.size()[1]

4

As mentioned, the dimension 1 (second dimension) in ``tweet_input_embeddings`` is ``max_seq length`` in this batch:

In [23]:
tweet_input_embeddings.size()[0]

25

Now, dimension 3 in ``tweet_input_embeddings`` (indexed as 2) is the size of the word vectors:

In [24]:
tweet_input_embeddings.size()[2]

300

Let's look at the vector for the ``first word`` in the ``first sentence`` in the batch:

In [25]:
tweet_input_embeddings[:1, :1, :].shape

torch.Size([1, 1, 300])

In [26]:
tweet_input_embeddings[:1, :1, :]

tensor([[[-0.1750,  0.0402,  0.5031, -0.3411,  0.2437,  0.6142,  2.4782,
           0.5794,  2.1893, -0.2097,  0.1088,  0.8284,  0.0568,  1.0521,
          -0.8394, -0.7548,  0.3653, -0.5104, -0.5117, -0.1925,  0.5494,
           0.2798, -1.3733, -0.4014, -1.0280, -1.4983, -0.3375, -0.7473,
          -0.1076,  0.0623, -0.3046,  1.1644, -1.0801,  1.2685, -0.0205,
          -0.4465,  0.4338,  0.9483, -0.0402, -0.2481,  0.5237, -0.0928,
          -0.3687, -0.2873, -2.3593, -0.1577,  0.2883, -2.8634,  2.2144,
          -0.6007,  0.6401,  0.1054, -0.1743,  0.1829,  1.5399,  1.4356,
           0.1966,  0.7467, -1.6388,  0.0815,  0.5131, -1.2455, -2.4991,
          -1.8950,  0.6431, -0.0721,  1.0959,  0.7057, -0.2337, -1.5975,
           0.4675, -0.6119, -0.5394,  0.6729,  0.3344, -1.8217, -0.8168,
          -0.5869, -0.5226,  0.6395,  0.8858,  0.3659,  1.2546, -0.0798,
           0.8276, -0.8062, -0.3041, -0.8552, -1.2441,  0.4603,  1.0823,
          -1.2414,  2.8956,  0.8722,  1.0787,  0.98

Let's look at the ``first 5 dimensions`` of that same ``first word`` of the ``first sentence``:

In [27]:
tweet_input_embeddings[:1, :1, :5]

tensor([[[-0.1750,  0.0402,  0.5031, -0.3411,  0.2437]]],
       grad_fn=<SliceBackward>)

The following shows you the ``first 5 dimensions`` of the ``first word`` from ``each of the 4 sentences``

In [28]:
tweet_input_embeddings[:1, :, :5]

tensor([[[-0.1750,  0.0402,  0.5031, -0.3411,  0.2437],
         [-0.7898, -0.0891, -0.5490, -0.6521, -0.5019],
         [ 0.9319, -0.3605,  0.3309,  0.0990, -2.0672],
         [ 1.2326,  1.6668, -0.0129, -0.5787, -0.6908]]],
       grad_fn=<SliceBackward>)

The following shows you the ``last 7 dimensions`` of the ``last word`` from ``the last sentence``. Enjoy!

In [29]:
tweet_input_embeddings[-1:, 3:, -7:]

tensor([[[-0.3759,  0.4754,  0.9738, -0.3668, -0.5680,  0.6787,  0.2977]]],
       grad_fn=<SliceBackward>)

We will be passing the embedding vector for the first word, Simultaneously for each senetence in the batch, to the RNN. But let's now define an RNN module first:

In [30]:
"""
define the RNN module
"""
# first input - number of dimesnions for word vectors for a vector x (300, size of the word embedding)
# second input - number of nodes in hidden state h_t (50, size of the hidden layer)
# third input - number of recurrent layers (we set it to 1)
rnn = nn.RNN(input_size=300, hidden_size=50, num_layers=1) # input_size, hidden_size, num_layers

We will now pass the ``tweet_input_embeddings`` (representations of words in our batch) to RNN. Before we do, we need to know RNN also *optionally* takes a parameter for the ``initial hidden state h0`` (that is, the hidden state we will input to the model before it runs. It is just something that we need to start the model. If we don't provide it, Pytorch will just initialize h0 to a tensor of zeros. 

Let's construct an ``initial hidden state h0``. Note the shape of its tensor, and what each of the 3 parameters it takes mean.

In [31]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (1)
# second dimension - number of examples/sentences in a batch
# third dimension - number of nodes in hidden layer (50, size of the hidden layer)
h0 = torch.randn(1, 4, 50)
print("The shape as as expected: ", h0.shape)

The shape as as expected:  torch.Size([1, 4, 50])


Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.
We will get back two objects ``output`` and ``hn`` that we will need to understand.

In [32]:
"""
forward propagation over the RNN model
"""
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's of apprpriate size (num_layers, batch, hidden_size) when not provided

But what is ``output``? Well, let's inspect its shape first:

In [33]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([25, 4, 50])


Here's what we need to know about ``output``:
- The first dimension in the ``output`` tensor is the ``max_seq length``. 
- The second dimension is ``batch_size`` (the number of examples/sentences in our batch = 4).
- The third dimension is the ``size of nodes/units`` in our hidden layer (=50). 

Here's what we need to know about ``hn``:
- ``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of hidden layer nodes) containing the hidden state for the last ``time step`` 
(``t = max_seq_length``). Actually, `output` is tensor containing the output features (h_t) from the layer of RNN, for each t. Namely, `output` returns all the hidden states of all time steps from RNN. Hence, the last element of `output` (i.e., `output[-1, :, :]`) is `h_n`. 

In [34]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([1, 4, 50])


You can take the output representation for a tweet after processing the last token (t=seq_len or last timestep) and call the resulting representation as the tweet representation that **"summarizes" the information present** in the tweet. This tweet representation can further be used for a useful task like tweet classification (we will try out sentiment analysis later in this tutorial) by adding a classification module on top of the tweet representation.

Let us compute the final tweet representation:

In [35]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


### Self Attention Mechanism


The fixed-length context vector carries the burden of encoding the complete "meaning" of the input sequence, regardless of its length. This causes information bottleneck problem. Attention provides a solution to the bottleneck problem.

In [36]:
print(tweet_output_embeddings.size())
print(output.size())

torch.Size([4, 50])
torch.Size([25, 4, 50])


In [37]:
batch_size = output.shape[1]
src_len = output.shape[0]

In [38]:
key_a = nn.Linear(50, 50)
q_a = nn.Linear(50, 50)

In [39]:
output_keys = key_a(output)
tweet_output_embeddings = q_a(tweet_output_embeddings)

Calculate importances between sequence-level representation and token representations.

In [40]:
attn_weights = torch.bmm(output_keys.permute(1,0,2), tweet_output_embeddings.unsqueeze(2)).squeeze(2).permute(1,0)
print(attn_weights.shape)

torch.Size([25, 4])


Ignore "PAD" tokens.

In [41]:
attn_weights = attn_weights.masked_fill(tweets == TEXT.vocab.stoi[TEXT.pad_token], -1e9)

In [42]:
soft_attn_weights = F.softmax(attn_weights, 0).permute(1,0)

In [43]:
soft_attn_weights.shape

torch.Size([4, 25])

In [44]:
new_tweet_output_embedding = torch.bmm(output.permute(1, 2, 0), soft_attn_weights.unsqueeze(2)).squeeze(2)

In [45]:
print(new_tweet_output_embedding.shape)

torch.Size([4, 50])


## Multilayered RNN

For some applications, we may need more than one hidden layer for RNN to model the information flow. Adding more layers requires fews changes.

Firstly, we change the ``num_layers`` argument to reflect the number of layers we want during the RNN module definition.

In [46]:
"""
define the RNN module
"""
# first input - number of dimesnions for word vectors for a vector x (300, size of the word embedding)
# second input - number of nodes in hidden layer (50, size of the hidden layer)
# third input - number of recurrent layers (we set it to 1)
rnn = nn.RNN(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Similar to single layered RNN, Multilayered RNN module takes two inputs: the ``initial hidden state h0`` for each element in the batch (at ``time step t=0``) and the ``input features`` (``tweet_input_embeddings`` in our case).

Let us construct the new initial hidden state for a 2 layered RNN.

In [47]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (2)
# second dimension - number of examples/sentences in a batch (4)
# third dimension - number of nodes in hidden layer (50, size of the hidden layer)
h0 = torch.randn(2, 4, 50)
print("The shape as as expected: ", h0.shape)

The shape as as expected:  torch.Size([2, 4, 50])


Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.

In [48]:
"""
forward propagation over the RNN model
"""
print(tweet_input_embeddings.shape)
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

torch.Size([25, 4, 300])


``output`` tensor contains the output features $h_t$ from the last layer of the RNN

In [49]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([25, 4, 50])


``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of nodes in a hidden layer) containing the hidden state for last time step ``t = max_seq_len`` for the ``2 layered RNN``.

In [50]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 4, 50])


You can get the last hidden from the last layer by:

In [51]:
hn[-1, :, :]

tensor([[ 0.5613,  0.5971,  0.2494,  0.6146, -0.2725, -0.5024,  0.5964,  0.1208,
         -0.0097, -0.1011, -0.2497,  0.1344,  0.2519,  0.6990,  0.1419,  0.1171,
          0.2397,  0.2506, -0.8315,  0.7039, -0.1909, -0.2245,  0.0495,  0.3401,
         -0.6668,  0.7840,  0.2570, -0.6666,  0.6372, -0.1909,  0.7131,  0.3395,
          0.0861,  0.0372,  0.3110, -0.3719,  0.0879,  0.0951, -0.8802, -0.5366,
         -0.0927, -0.4309,  0.8146,  0.1059,  0.1033, -0.5189,  0.2523,  0.4745,
         -0.5739,  0.3350],
        [ 0.5709,  0.5059,  0.2272,  0.6409, -0.0220, -0.5629,  0.3797,  0.2219,
          0.0491,  0.0966, -0.3847, -0.0554,  0.3238,  0.6879,  0.3584,  0.0714,
          0.2485,  0.1130, -0.8374,  0.6227, -0.2054, -0.1036,  0.0371,  0.2283,
         -0.6539,  0.8107,  0.0551, -0.5544,  0.6408, -0.2999,  0.6253,  0.3933,
         -0.1409,  0.2390,  0.3004, -0.3744, -0.1279,  0.1194, -0.9076, -0.4590,
         -0.1771, -0.4199,  0.7359,  0.0903,  0.1877, -0.4955,  0.1666,  0.3282,


Let us compute the final tweet representation:

In [52]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


In [53]:
tweet_output_embeddings

tensor([[ 0.5613,  0.5971,  0.2494,  0.6146, -0.2725, -0.5024,  0.5964,  0.1208,
         -0.0097, -0.1011, -0.2497,  0.1344,  0.2519,  0.6990,  0.1419,  0.1171,
          0.2397,  0.2506, -0.8315,  0.7039, -0.1909, -0.2245,  0.0495,  0.3401,
         -0.6668,  0.7840,  0.2570, -0.6666,  0.6372, -0.1909,  0.7131,  0.3395,
          0.0861,  0.0372,  0.3110, -0.3719,  0.0879,  0.0951, -0.8802, -0.5366,
         -0.0927, -0.4309,  0.8146,  0.1059,  0.1033, -0.5189,  0.2523,  0.4745,
         -0.5739,  0.3350],
        [ 0.5709,  0.5059,  0.2272,  0.6409, -0.0220, -0.5629,  0.3797,  0.2219,
          0.0491,  0.0966, -0.3847, -0.0554,  0.3238,  0.6879,  0.3584,  0.0714,
          0.2485,  0.1130, -0.8374,  0.6227, -0.2054, -0.1036,  0.0371,  0.2283,
         -0.6539,  0.8107,  0.0551, -0.5544,  0.6408, -0.2999,  0.6253,  0.3933,
         -0.1409,  0.2390,  0.3004, -0.3744, -0.1279,  0.1194, -0.9076, -0.4590,
         -0.1771, -0.4199,  0.7359,  0.0903,  0.1877, -0.4955,  0.1666,  0.3282,


#### Update END

## RNN for Sentiment Analysis

In this section we will implement RNN for classifying the sentiment of the tweet (same task used in our previous feedforward neural networks tutorial).

We will pick up most of the functions from our feedforward neural networks code:

In [54]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

# hyperparameters
MAX_EPOCHS = 5
LEARNING_RATE = 0.3
NUM_CLASSES = 3
EMBEDDING_SIZE = 300

Now we can define the full RNN model:

In [74]:
"""
create a model for RNN
"""
class RNNmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size, hidden_size, num_layers, pad_token):
    # In the constructor we define the layers for our model
    super(RNNmodel, self).__init__()
    self.pad_token = pad_token
    # word embedding lookup table
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size, sparse=True)
    # core RNN module
    self.rnn_layer = nn.RNN(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers) 
    
    ## attention layer
    self.key_a = nn.Linear(hidden_size, hidden_size)
    self.q_a = nn.Linear(hidden_size, hidden_size)
    
    # activation function
    self.activation_fn = nn.ReLU()
    # classification related modules
    self.linear_layer = nn.Linear(hidden_size, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=0)
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.embedding(x)
    out, _ = self.rnn_layer(out) # since we are not feeding h_0 explicitly, h_0 will be initialized to zeros by default
    # classify based on the hidden representation after RNN processes the last token
    sequence_rep = out[-1]
    
    # self-attention mechanism
    output_keys = self.key_a(out)
    sequence_rep = self.q_a(sequence_rep)
    
    attn_weights = torch.bmm(output_keys.permute(1,0,2), sequence_rep.unsqueeze(2)).squeeze(2).permute(1,0)
    attn_weights = attn_weights.masked_fill(x == self.pad_token, -1e9)
    
    soft_attn_weights = F.softmax(attn_weights, 0).permute(1,0)
    
    new_output_embedding = torch.bmm(out.permute(1, 2, 0), soft_attn_weights.unsqueeze(2)).squeeze(2)
    ##### end attention
    
    out = self.activation_fn(sequence_rep)
    out = self.linear_layer(out)
    out = self.softmax_layer(out) # accepts 2D or more dimensional inputs
    return out

Some additional hyperparameters for RNN

In [75]:
# hyperparameters of RNN
HIDDEN_SIZE = 50
NUM_LAYERS = 2

### Training and Testing Functions.

In [76]:
from sklearn.metrics import accuracy_score
def train(loader):
    total_loss = 0.0
    # iterate throught the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader):
    all_pred=[]
    all_label = []
    with torch.no_grad(): # impacts the autograd engine and deactivate it. reduces memory usage and speeds up computation
        for batch in loader:
             # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(model_outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(batch_output)
            
    accuracy = accuracy_score(all_label, all_pred)
    return accuracy

Let us define the RNN model.

In [77]:
# define the model
model = RNNmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS, TEXT.vocab.stoi[TEXT.pad_token]) 
model.to(device)
# define the loss function (last node of the graph)
criterion = nn.NLLLoss()

**We need to create a new directory 'ckpt/' to store our model checkpoint.**

In [79]:
import os
# os.mkdir("./ckpt")

**Let us perform the training. We will save our model and optimizer at end of each epoch.**


You can find more information of saving and loading model [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

In [82]:
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

# start the training
for epoch in range(MAX_EPOCHS):
    # train the model for one pass over the data
    train_loss = train(train_iter)  
    # compute the training accuracy
    train_acc = evaluate(train_iter)
    # compute the validation accuracy
    val_acc = evaluate(val_iter)
    
    # print the loss for every epoch
    print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))
    
    # save model, optimizer, and number of epoch to a dictionary
    model_save = {
            'epoch': epoch,  # number of epoch
            'model_state_dict': model.state_dict(), # model parameters 
            'optimizer_state_dict': optimizer.state_dict(), # save optimizer 
            'loss': train_loss # training loss
            }
    
    # use torch.save to store 
    torch.save(model_save, "./ckpt/model_{}.pt".format(epoch))

Epoch [1/5], Loss: 0.3453, Training Accuracy: 0.2685, Validation Accuracy: 0.3097
Epoch [2/5], Loss: 0.3450, Training Accuracy: 0.4145, Validation Accuracy: 0.4127
Epoch [3/5], Loss: 0.3449, Training Accuracy: 0.3890, Validation Accuracy: 0.3512
Epoch [4/5], Loss: 0.3445, Training Accuracy: 0.4002, Validation Accuracy: 0.3602
Epoch [5/5], Loss: 0.3440, Training Accuracy: 0.3782, Validation Accuracy: 0.3837


**Load model checkpoint** 

When we have a trained model checkpint, we can load it using `torch.load()`

In [83]:
# define a new model
model2 = RNNmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS, TEXT.vocab.stoi[TEXT.pad_token]) 
# load checkpoint 
checkpoint = torch.load("./ckpt/model_1.pt")
# assign the parameters of checkpoint to this new model
model2.load_state_dict(checkpoint['model_state_dict'])
model2.to(device)

print(model2)

RNNmodel(
  (embedding): Embedding(3344, 300, sparse=True)
  (rnn_layer): RNN(300, 50, num_layers=2)
  (key_a): Linear(in_features=50, out_features=50, bias=True)
  (q_a): Linear(in_features=50, out_features=50, bias=True)
  (activation_fn): ReLU()
  (linear_layer): Linear(in_features=50, out_features=3, bias=True)
  (softmax_layer): LogSoftmax(dim=0)
)


## GRUs

Gated Recurrent Units (GRUs) are a variant of RNNs that use more complex units for activation. They are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

GRU is defined by ``torch.nn.GRU`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU). Now let us define the GRU module.

In [None]:
"""
define the GRU module
"""
# first input - number of word vector dimensions/embeddings
# second input - number of nodes in hidden layer (50, size of the hidden layer)
# third input - number of recurrent layers (2)
gru_rnn = nn.GRU(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Similar to RNN, GRU module takes two inputs: *the initial hidden state for each element in the batch* (t=0) and the *input features* (``tweet_input_embeddings`` in our case).

Let us feed both the initial hidden state and tweet embeddings to our GRU model.

In [None]:
"""
forward propagation over the GRU model
"""
output, hn = gru_rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contqains the output features $h_t$ from the last layer of the GRU

In [None]:
# output = seq_len, batch, hidden_size (output features from last layer of GRU)
print("output size: ", output.size())

``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of nodes in a hidden layer) containing the hidden state for last time step ``t = max_seq_len`` for the ``2 layered RNN``.

In [None]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

Similar to RNN, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [None]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

## LSTMs

Long short-term memory (LSTMs) are a variant of RNNs that use more complex units for activation. Similar to the spirit of GRU, they are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

LSTM is defined by ``torch.nn.LSTM`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM). Now let us define the LSTM module.

In [None]:
"""
define the LSTM module
"""
# first input - number of features in x (300, size of the word embedding)
# second input - number of number of nodes in a hidden layer (50)
# third input - number of recurrent layers (2)
lstm_rnn = nn.LSTM(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Unlike RNN and GRU, LSTM module takes three inputs: the initial hidden state for each element in the batch (t=0), the input features (tweet_input_embeddings in our case) and initial cell state for each element in the batch.

Let us construct the initial cell state (this construction is similar to that of initial hidden state)

In [None]:
"""
cell state at time-step 0 (h_0)
"""
# first dimension - number of LSTM layers (2)
# second dimension - batch_size (# of tweets/examples/sentences)
# third dimension - hidden_size / number of nodes in a hidden layer (50)
c0 = torch.randn(2, 4, 50)

Let us feed the initial hidden state, initial cell state and tweet embeddings to our LSTM model.

In [None]:
"""
forward propagation over the LSTM model
"""
output, (hn, cn) = lstm_rnn(tweet_input_embeddings, None) # h0 and c0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the LSTM

In [None]:
# output = seq_len, batch_size, hidden_size (output features from last layer of LSTM)
print("output size: ", output.size())

``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len

In [None]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

``cn`` is a tensor of shape (num_layers, batch, hidden_size) containing the cell state for t = seq_len.

In [None]:
# c_n = num_layers, batch_size, hidden_size (cell state for t=seq_len or cell state at last timestep)
print("last cell state size: ", hn.size())

Similar to RNN and GRU, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [None]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)