## Recurrent Neural Networks - Supervised Learning II - MDS Computational Linguistics

### Goal of this tutorial
- Learn about embedding layer
- Introduce Recurrent Neural Networks (RNNs)
- Implement RNN for sentiment analysis
- Implement Long-Short Term Memories (LSTMs) for sentiment analysis
- Implement Gated Recurrent Units (GRUs) for sentiment analysis

### General
- This notebook was last tested on Python 3.8, PyTorch 1.7.1 and TorchText 0.8.1 (**Strongly recommended to use the same version for doing this tutorial and lab3**)
- This notebook uses torchtext to process datasets

We would like to acknowledge the following materials that helped as a reference in preparing this tutorial:
- https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf

### Getting Started

In [1]:
# required imports
import pandas as pd
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
import torch
import torch.nn as nn

#### Embedding Layer

Embedding layer is the ubiquitous input layer of deep neural networks used in NLP.

The [``Embedding`` layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#embedding) in Pytorch (that we used in the last week's tutorial on Word2vec) is a lookup table that is typically (in NLP) used to store word embeddings of a fixed vocabulary and word embedding size. Word embeddings can be retrieved from the lookup table by providing a list of word index as input to the layer. We need to know what the input and the output of this layer look like. Let's first look at a dummy example where we have two sentences ``x_1`` and ``x_2``. Let's assume we have the two sentences as:

In [2]:
x_1 = "He is very nice" # sentence 1
x_2 = "She is very kind" # sentence 2

Let's convert the two sentences into indexes (each word is replaced with its index in the vocabulary).
Let's assume our ``vocabulary size`` is set to 100. Remember, vocabulary size is a hyper-parameter.
Let's also store that ``vocabulary size`` in a variable ``VOCAB_SIZE`` now as we will need to pass it to the ``Embedding`` layer later.

In [3]:
x_1 = [1, 25, 40, 5]
x_2 = [4, 25, 40, 99]
VOCAB_SIZE = 100

#### Max sequence length

One last thing we need to think about is the ``length of each sequence``. The two examples above are nicely set to equal length = 4. This does not need to be the case, as we can have sequences of varying lengths. We will be passing a batch of sentences to Pytorch and the max sequence length will be set to the length of the longest sentence (after tokenization) in that batch. The rest of sentences (shorter ones) will be padded with zeros. Now, do we need to explicitly provide the max sequence length to Pytorch? And how do we know the max seq length for each batch, if different batches have sequences of varying lengths and each batch is set to the max sent in that batch? Well, rest assured, we don't really need to worry about that. Pytorch will assign a max seq length for each batch. We will be able to inspect the max seq length for a given batch using output of the ``Embedding`` layer. (We will see that soon).

#### Size of word vector

The ``Embedding`` layer will give us a vector for each word in the vocabulary.
Now, we will need to tell it what size we want for that vector. Popular values for a vector size are usually between 100-300 for many tasks (e.g., sentiment analysis). Let's set it to 200 dimensions. (You are encouraged to play with this value as practice). All words in the vocabulary will have the same embedding size. Let's put that hyper-parameter in a variable ``WORD_VEC_SIZE``:

In [4]:
WORD_VEC_SIZE= 300 # size of word embedding

We are now ready to call the ``Embedding`` class to construct an embeddings tensor:

In [5]:
# Constructing an embedding Layer:
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE)
print("size of the embedding lookup table = ", embedding.weight.data.size())

# let's create a sample input (word indices) to the embedding layer 
sample_input = torch.LongTensor([ x_1, x_2 ])
print("input (word indices) tensor = \n", sample_input)
print("input (word indices) shape = ", sample_input.size())

size of the embedding lookup table =  torch.Size([100, 300])
input (word indices) tensor = 
 tensor([[ 1, 25, 40,  5],
        [ 4, 25, 40, 99]])
input (word indices) shape =  torch.Size([2, 4])


Let's pass the input to the embedding layer and print the word embeddings:

In [6]:
# let's pass the input to the embedding layer
word_embeddings = embedding(sample_input)
print("word embeddings tensor = \n", word_embeddings)

word embeddings tensor = 
 tensor([[[ 0.9798, -1.1870, -0.6102,  ..., -0.6211,  1.7830, -0.9271],
         [-0.7827,  0.9486,  0.2277,  ...,  0.5441,  0.3347,  0.5759],
         [ 0.6719, -0.4324,  0.2741,  ...,  0.2009, -0.4020, -0.0463],
         [-1.2636, -1.8215,  0.4899,  ...,  0.9269, -1.2691, -0.3428]],

        [[-1.4824, -0.4423,  0.1434,  ..., -0.4140, -2.7924, -1.7520],
         [-0.7827,  0.9486,  0.2277,  ...,  0.5441,  0.3347,  0.5759],
         [ 0.6719, -0.4324,  0.2741,  ...,  0.2009, -0.4020, -0.0463],
         [ 0.1956, -0.6984,  2.5623,  ...,  0.4133,  0.7928,  0.1284]]],
       grad_fn=<EmbeddingBackward>)


Let's print the shape of this tensor:

In [7]:
print("word embeddings shape = ", word_embeddings.size())

word embeddings shape =  torch.Size([2, 4, 300])


Each dimension can be interpreted as:
- **First dimension:** (**2**,4,300): We have ``2 examples`` (that is, our ``x_1`` and ``x_2``). (Note: We will be passing a whole batch to the ``Embedding`` class and so this first dimension will be equal to the ``batch size``.
- **Second dimension:** (2,**4**,300): For each of the two examples, we have a ``max sequence length`` = 4 (x_1 and x_2 each had 4 indexes).
- **Third dimension:** (2,4,**300**): The ``word vector dimension`` is set to 300.

### Max sequence length: Another note

Recall from above we mentioned Pytorch automatically infers the max sequence length for each batch. 
For the example above (as you can see from the second dimension returned by ``word_embeddings.size()``, Pytorch 
inferred the max seq length for this batch of two sentences is 4.

Let's just adjust the second example, **adding two more words** (the string "and kind"). Note, both our ``VOCAB_SIZE`` and ``WORD_VEC_SIZE`` stay the same as before. We assign the word "and" an index of "7" and the word "considerate" an index of "60". Note that we have to pad the first example ``x_1`` with zeros (we will explicitly set the padding index to zero later when we define the embedding layer) in the end. (Try removing the zero padding. What do you observe when you run your code with the ``Embedding`` class? Hint: You will get an error.):

In [8]:
x_1 = "He is very nice"
x_2 = "She is very kind and considerate"
x_1 = [1, 25, 40, 5, 0, 0]
x_2 = [4, 25, 40, 99, 7, 60]

Now, let's create a new ``embedding`` layer by creating a new instance of the ``Embedding`` class:

In [9]:
# constructing an embedding layer:
padded_embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE, padding_idx=0)
print("size of the embedding lookup table = ", padded_embedding.weight.data.size())

size of the embedding lookup table =  torch.Size([100, 300])


Note that **padded_embedding** embedding layer specifies the padding index (corresponds to an embedding initialized to all zeros). In this example, the zeroth index is dedicated for storing padding embedding (vector initialized to all zeros).

Let's create sample input and pass it to the embedding layer.

In [10]:
# let's create a sample input
sample_input = torch.LongTensor([ x_1, x_2 ])
print("input (word indices) tensor = \n", sample_input)
print("input (word indices) shape = ", sample_input.size())

# let's retrieve the word embeddings by passing the sample input to the layer
word_embeddings = padded_embedding(sample_input)
print(word_embeddings)

input (word indices) tensor = 
 tensor([[ 1, 25, 40,  5,  0,  0],
        [ 4, 25, 40, 99,  7, 60]])
input (word indices) shape =  torch.Size([2, 6])
tensor([[[-0.0582,  1.1280,  0.0884,  ..., -0.6426, -0.9552,  1.6909],
         [ 1.6845,  0.8643,  0.2674,  ..., -0.1617, -0.0915, -0.1291],
         [-0.6820, -1.4999, -0.1746,  ..., -0.6563,  2.1366,  0.1573],
         [ 0.1073, -1.8373, -0.9552,  ..., -1.4102, -0.2065,  1.0237],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[ 1.6686, -0.0532,  0.5033,  ..., -0.0621,  0.0885, -0.4420],
         [ 1.6845,  0.8643,  0.2674,  ..., -0.1617, -0.0915, -0.1291],
         [-0.6820, -1.4999, -0.1746,  ..., -0.6563,  2.1366,  0.1573],
         [-0.6999, -0.1217,  1.1299,  ...,  0.1976, -1.0592, -1.5562],
         [-0.8967, -0.3685, -0.9641,  ...,  0.0572,  1.1357, -0.7211],
         [-0.2081, -1.8562,  0.4156,  ...,  0.5736,  0.5827,  0.885

If we inspect the shape of the new tensor ``word_embeddings``, we will see the second dimension now changed to 6, to match the max sequence length:

In [11]:
print(word_embeddings.size())

torch.Size([2, 6, 300])


### How does Pytorch initialize word vector dimensions/weights?

Note that Pytorch initializes the word vectors from a **normal distribution** $ \mathcal{N}(0, 1) $. The word embedding weights are by default learnable parameters in Pytorch and so they will be adjusted during training. (Note: These weights can be initialized from an external word embedding tool such as [Word2vec](https://code.google.com/archive/p/word2vec/), [Fasttext](https://fasttext.cc/), or [Glove](https://nlp.stanford.edu/projects/glove/). Also, the weights can be frozen (by setting ``embedding.weigh.required_grad`` flag to False), which is a reasonable option when initialized from an external tool. You can choose to keep learning them within the model with your training data). Below we show the ones initialized from a normal distribution by Pytorch.

In [12]:
embedding.weight

Parameter containing:
tensor([[-1.0032,  1.1493, -0.6488,  ..., -0.9187, -0.6707, -0.0932],
        [ 0.9798, -1.1870, -0.6102,  ..., -0.6211,  1.7830, -0.9271],
        [ 1.5862,  1.0643,  1.0893,  ...,  0.3026,  0.0307, -2.3768],
        ...,
        [ 0.0657,  0.4122, -0.5181,  ...,  0.4328,  0.3826,  1.4511],
        [ 1.1297, -0.8527, -1.2060,  ..., -1.3148,  0.2078, -1.2655],
        [ 0.1956, -0.6984,  2.5623,  ...,  0.4133,  0.7928,  0.1284]],
       requires_grad=True)

More information about the ``Embedding`` class can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) ([source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding)).

## Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are used to model sequences of arbitrary length (e.g., sequence of words in a sentence, sequence of sentences in a document, sequence of frames in a video). RNNs typically use their internal state (memory) to process sequence of inputs. At each time-step, RNNs output a prediction and hidden state, feeding its previous hidden state into each next step. RNNs are applied in a wide range of NLP applications:
- language modeling, where RNN can condition on **all** previous words in the corpus unlike n-gram language model
- text classification, where the states act as features (we will see sentiment analysis in this tutorial)
- machine translation, where a RNN is used to process a sentence in source language and another RNN is used to decode the sentence in target language (we will see this in the "Machine Translation" course)
- sequence labeling, where the states in RNN are used to predict a category for each item in the sequence 

Recommended reading for understanding the theory of RNNs: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 


### Grabbing few tweets using torchtext

Let us follow **torchtext** tutorial (seen in Week1) to read few tweets from the [sentiment analysis dataset](http://alt.qcri.org/semeval2016/task4/) used in the previous tutorial on feedforward neural networks. The preprocessed (tokenization, removing URLs, mentions, hashtags and so on) tweets are placed under ``data/sentiment-twitter-2016-task4`` folder in three files as ``train.tsv``, ``dev.tsv`` and ``test.tsv``.  

Let us view few tweets from ``train.tsv`` using pandas.

In [13]:
import pandas as pd
df = pd.read_csv("./data/sentiment-twitter-2016-task4/train.tsv", sep = '\t', header=None, names=['tweet','label']) # the separator of tsv file is `\t`
df.head()

Unnamed: 0,tweet,label
0,dear <<<MENTION>>> the newooffice for mac is g...,2
1,<<<MENTION>>> how about you make a system that...,2
2,i may be ignorant on this issue but should we ...,2
3,thanks to <<<MENTION>>> i just may be switchin...,2
4,if i make a game as a <<<HASHTAG>>> universal ...,0


**We import the relevant packages, define the tokenizer and TorchText's fields.**

In [14]:
# import related packages
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset

# define the white space tokenizer to get tokens
def tokenize_en(tweet):
    """
    Tokenizes English tweet from a string into a list of strings (tokens)
    """
    return tweet.strip().split()

# define the TorchText's fields
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token = None)



**To use the different splits (training, development and testing), we use `TabularDataset` class to load datasets.**

In [15]:
train, val, test = TabularDataset.splits(
    path="./data/sentiment-twitter-2016-task4/", # the root directory where the data lies
    train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
    format='tsv',
    skip_header=False, # if your tsv file has a header, make sure to pass this to ensure it doesn't get proceesed as data!
    fields=[('tweet', TEXT), ('label', LABEL)])



**Build our vocabulary to map words to integers.**

In [16]:
TEXT.build_vocab(train, min_freq=3) # builds vocabulary based on all the words that occur at least twice in the training set
LABEL.build_vocab(train)

**Initialize the iterators for the train, validation, and test data. Note that we set ``sort`` as `False` so as to not sort examples based on similar lengths which minimizes padding.**

In [17]:
from torchtext.data import Iterator, BucketIterator

train_iter, val_iter, test_iter = BucketIterator.splits(
 (train, val, test), # we pass in the datasets we want the iterator to draw data from
 batch_sizes=(4,64,64),
 sort_key=lambda x: len(x.tweet), 
 sort=False,
# A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. 
 sort_within_batch=False
)



**Create a batch of four examples and print them**

In [18]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.

# print the four examples with padding and corresponding label
print("processed tweets: ")
for j in range(tweets.shape[1]): # sample loop
    tokens = []
    for i in range(tweets.shape[0]): # token loop
        tokens.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tokens," label:", labels[j].item())

processed tweets: 
0  sample: ['granted', "i'm", '<unk>', 'it', 'to', 'giving', 'birth', 'to', 'our', 'kid', 'but', "i'll", 'add', 'my', 'well', 'that', 'was', '<unk>', 'to', 'the', '<unk>', 're', 'amazon', 'prime', 'day']  label: 2
1  sample: ['<<<mention>>>', 'when', 'you', 'talk', 'about', 'ben', 'carson', 'please', 'show', 'him', 'not', 'jeb', 'bush', 'like', 'you', 'did', 'on', 'your', 'thurs', 'show', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']  label: 2
2  sample: ['in', 'the', 'ufc', 'game', 'conor', 'mcgregor', 'is', 'the', '<unk>', 'in', 'his', 'weight', 'class', 'but', 'guess', 'who', 'is', 'champion', 'now', '<<<url>>>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']  label: 1
3  sample: ["harper's", 'worst', '<unk>', 'against', 'refugees', 'may', 'be', 'climate', 'record', 'as', 'rising', '<unk>', 'add', 'to', '<unk>', 'in', 'the', 'middle', 'east', '<<<url>>>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']  label: 1




**Now we set ``sort`` as `True` so as to sort examples based on similar lengths which minimizes padding.**

**Let us initialize the new iterators for the train, validation, and test data.**

In [19]:
train_iter, val_iter, test_iter = BucketIterator.splits(
 (train, val, test), # we pass in the datasets we want the iterator to draw data from
 batch_sizes=(4,64,64),
 sort_key=lambda x: len(x.tweet), 
 sort=True,
# A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. 
 sort_within_batch=True
)

**Let us pick up 4 tweets from the training set and convert them to tensors.**

**Create a batch of four examples and print them**

In [20]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.

# print the four examples with padding and corresponding label
print("processed tweets: ")
for j in range(tweets.shape[1]): # sample loop
    tokens = []
    for i in range(tweets.shape[0]): # token loop
        tokens.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tokens," label:", labels[j].item())

processed tweets: 
0  sample: ['ihop', 'is', 'the', 'move', 'tomorrow']  label: 0
1  sample: ['<<<mention>>>', 'make', 'david', 'beckham', 'tomorrow']  label: 1
2  sample: ['bringing', 'the', 'bentley', 'out', 'tomorrow']  label: 0
3  sample: ['new', '<unk>', 'with', 'bentley', 'tomorrow']  label: 0


### Creating a single hidden layer RNN

PyTorch has ``torch.nn.RNN`` module that implements the vanilla (Elman) RNN with *tanh* or *ReLU* non-linearity. The documentation for this module is [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html?highlight=nn%20rnn#torch.nn.RNN). Let us use the sample batch of five examples created before to understand this module.

In this tutorial, we will represent the input tweet using a sequence of word embeddings (for each word present in the tweet). We will use ``torch.nn.Embedding`` layer to store word vectors corresponding to words in the vocabulary.

Before implementing the embedding module for our usecase, let us compute the size of the word vocabulary.

In [21]:
# print the size of the word vocabulary
VOCAB_SIZE = len(TEXT.vocab.stoi)
print(VOCAB_SIZE)

3338


We have 3338 unique words in the vocabulary.

Let us implement the embedding module (whose underlying weight matrix shape is (``vocabulary size`` $\times$ ``word embedding size``) for our usecase:

In [22]:
# set the word embedding size
WORD_VEC_SIZE = 300

# an Embedding module containing 300 dimensional tensor for each word in the vocabulary
# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE, sparse=True)
print("lookup table shape = ", embedding.weight.size())

lookup table shape =  torch.Size([3338, 300])


Let us now feed the tensors of our sample batch to the embedding module and extract the sequence of word embeddings for each tweet.

In [23]:
# print tensor containing word ids for our batch
print("*"*50, "\n Word ids for the first batch (recall, it has 4 sentences, each column representing a sentence): \n", tweets.data, "\n","*"*50,)

# feed the "word ids" tensor to the embedding module
tweet_input_embeddings = embedding(tweets)

# print the dimensions of the tweet_embeddings
print("*"*50, "\n Tweet input word embeddings size: ", tweet_input_embeddings.size(), "\n","*"*50,) 
# first dimension - sequence length: number of words per example (same across the whole batch, after padding) --> max_seq = 22
# second dimension -  batch size / number of examples in the batch --> 4
# third dimension - number of dimensions in the word vector

************************************************** 
 Word ids for the first batch (recall, it has 4 sentences, each column representing a sentence): 
 tensor([[ 191,    4, 1273,   49],
        [  14,   82,    2,    0],
        [   2,   73,  215,   18],
        [ 598,  145,   48,  215],
        [  21,   21,   21,   21]]) 
 **************************************************
************************************************** 
 Tweet input word embeddings size:  torch.Size([5, 4, 300]) 
 **************************************************


Let's actually view the actual word embeddings tensor for this batch:

In [24]:
print("*"*50, "\n Embeddings for the first batch: \n", tweet_input_embeddings, "\n","*"*50,) 

************************************************** 
 Embeddings for the first batch: 
 tensor([[[-0.3152, -1.0347,  2.4082,  ...,  0.2543, -0.8464,  0.7146],
         [-0.5987,  1.0007,  0.3721,  ...,  0.8515,  0.0784,  0.6186],
         [-0.2582, -2.1945, -0.0172,  ..., -0.6653, -1.2621,  0.6183],
         [ 0.3153,  0.5245, -0.2765,  ..., -1.8035,  1.1769,  1.6999]],

        [[ 0.8911,  0.1990, -1.3665,  ..., -0.9696,  0.2140,  1.0128],
         [-0.7123,  0.0674,  0.8630,  ..., -0.1489,  1.5371, -0.3056],
         [ 1.4509, -0.9218,  2.5389,  ..., -1.4941,  0.2399,  0.7264],
         [-2.0550,  0.6465, -0.7391,  ...,  0.8182,  0.9721, -0.3993]],

        [[ 1.4509, -0.9218,  2.5389,  ..., -1.4941,  0.2399,  0.7264],
         [ 0.0084,  0.9770, -0.0948,  ..., -0.5597,  1.5041,  1.1349],
         [ 0.2611, -1.3939, -1.3628,  ..., -1.3538, -0.3191,  1.3667],
         [ 0.3166,  0.3445, -0.8028,  ..., -1.0034, -0.2659, -0.4560]],

        [[-0.6088, -0.2902, -0.2125,  ...,  0.3276, -0.

What we are seeing is the actual word vectors representing each of the 4 sentences (i.e., whole batch).
This is dimension 2 in ``tweet_input_embeddings``. 

In [25]:
tweet_input_embeddings.size()[1]

4

As mentioned, ``max_seq length`` for this batch is ``5``, which is dimension 1 (indexed as 0 in Pytorch, similar to Python) 
in ``tweet_input_embeddings``: 

In [26]:
tweet_input_embeddings.size()[0]

5

Now, dimension 3 in ``tweet_input_embeddings`` (indexed as 2) is the size of the word vectors:

In [27]:
tweet_input_embeddings.size()[2]

300

Let's look at the vector for the ``first word`` in the ``first sentence`` in the batch:

In [28]:
tweet_input_embeddings[:1, :1, :].shape

torch.Size([1, 1, 300])

In [29]:
tweet_input_embeddings[:1, :1, :]

tensor([[[-3.1517e-01, -1.0347e+00,  2.4082e+00, -1.0549e+00, -1.1876e+00,
           5.8067e-01,  1.4999e-01,  1.7858e-01, -1.4571e-01,  1.3772e-02,
          -2.2149e-01, -2.6993e-01, -1.9553e-01,  1.4110e+00,  1.7070e+00,
          -1.0162e+00, -1.3707e-01, -5.1514e-01,  1.7628e+00, -7.4432e-02,
          -1.0379e+00,  1.4239e+00,  9.3586e-01,  2.3170e-01,  4.1945e-01,
           5.9157e-01, -2.8613e-01, -1.9314e+00, -1.4645e+00, -7.0700e-01,
          -4.0687e-03,  9.3376e-01, -4.4677e-02, -1.3625e+00,  4.9781e-01,
           5.5226e-01,  1.4603e+00, -2.7101e-01, -1.3471e+00,  4.8935e-01,
           2.8223e-01,  9.9673e-02, -6.0737e-01,  1.5748e-01,  5.2830e-01,
           1.0143e+00, -1.4411e-01,  9.4106e-01,  1.1552e+00, -1.3455e+00,
           2.1770e+00,  2.8645e-01,  3.4648e-01,  8.3725e-01,  4.9333e-01,
           1.4122e+00,  1.4048e+00,  4.7007e-01,  1.5953e+00, -1.0767e+00,
          -1.5796e+00,  8.7937e-01, -9.1238e-01,  7.3900e-01,  1.7272e+00,
          -7.7277e-01, -7

Let's look at the ``first 5 dimensions`` of that same ``first word`` of the ``first sentence``:

In [30]:
tweet_input_embeddings[:1, :1, :5]

tensor([[[-0.3152, -1.0347,  2.4082, -1.0549, -1.1876]]],
       grad_fn=<SliceBackward>)

The following shows you the ``first 5 dimensions`` of the ``first word`` from ``each of the 4 sentences``

In [31]:
tweet_input_embeddings[:1, :, :5]

tensor([[[-0.3152, -1.0347,  2.4082, -1.0549, -1.1876],
         [-0.5987,  1.0007,  0.3721, -0.9764, -0.1339],
         [-0.2582, -2.1945, -0.0172, -1.1707,  0.1426],
         [ 0.3153,  0.5245, -0.2765, -0.4259, -0.1367]]],
       grad_fn=<SliceBackward>)

The following shows you the ``last 7 dimensions`` of the ``last word`` from ``the last sentence``. Enjoy!

In [32]:
tweet_input_embeddings[-1:, -1:, -7:]

tensor([[[-0.3914,  1.2455, -0.7605,  0.9315, -0.3592, -0.1625,  0.0032]]],
       grad_fn=<SliceBackward>)

We will be passing the sequence of word embeddings for each sentence in the batch as input to the RNN. But let's now define an RNN module first:

In [33]:
"""
define the RNN module
"""
# first input - number of dimensions for word vectors for a vector x (300, size of the word embedding)
# second input - number of nodes in hidden state h_t (50, size of the hidden layer)
# third input - number of recurrent layers (we set it to 1)
rnn = nn.RNN(input_size=300, hidden_size=50, num_layers=1) # input_size, hidden_size, num_layers
print(rnn)

RNN(300, 50)


We will now pass the ``tweet_input_embeddings`` (representations of words in our batch) to RNN. Before we do, we need to know RNN also *optionally* takes a parameter for the ``initial hidden state h0`` (that is, the hidden state we will input to the model before the forward propagation starts. If this vector is not explicitly specified, Pytorch will just initialize h0 to a tensor of zeros.)

Let's construct an ``initial hidden state h0``. Note the shape of its tensor, and what each of the 3 parameters it takes mean.

In [34]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (1)
# second dimension - number of examples/sentences in a batch
# third dimension - number of nodes in hidden layer (50, size of the hidden layer, that we specified as hidden_size in RNN construction)
h0 = torch.randn(1, 4, 50)
print("The shape as as expected: ", h0.shape)

The shape as as expected:  torch.Size([1, 4, 50])


Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.
We will get back two objects ``output`` and ``hn`` that we will need to understand.

In [35]:
"""
forward propagation over the RNN model
"""
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's of apprpriate size (num_layers, batch, hidden_size) when not provided

But what is ``output``? Well, let's inspect its shape first:

In [36]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([5, 4, 50])


Here's what we need to know about ``output``:
- The first dimension in the ``output`` tensor is the ``max_seq length`` (5). 
- The second dimension is ``batch_size`` (the number of examples/sentences in our batch = 4).
- The third dimension is the ``size of nodes/units`` in our hidden layer (=50). 

What is the shape of hn (tensor containing the hidden state for t=max_seq_length) ?

In [37]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([1, 4, 50])


Here's what we need to know about ``hn``:
- ``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of hidden layer nodes) containing the hidden state for the last ``time step`` 
(``t = max_seq_length``).

You can take the output representation for a tweet after processing the last token (t=seq_len or last timestep) and call the resulting representation as the tweet representation that **"summarizes" the information present** in the tweet. This tweet representation can further be used for a useful task like tweet classification (we will try out sentiment analysis later in this tutorial) by adding a classification module on top of the tweet representation.

Let us compute the final tweet representation:

In [38]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


## Multilayered RNN

For some applications, we may need more than one hidden layer for RNN to model the information flow. Adding more layers requires fews changes.

Firstly, we change the ``num_layers`` argument to reflect the number of layers we want during the RNN module definition (we will define two hidden layers).

In [39]:
"""
define the RNN module
"""
# first input - number of dimesnions for word vectors for a vector x (300, size of the word embedding)
# second input - number of nodes in hidden layer (50, size of the hidden layer)
# third input - number of recurrent layers (we set it to 2)
rnn = nn.RNN(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Similar to single layered RNN, Multilayered RNN module takes two inputs: the ``initial hidden state h0`` for each element in the batch (at ``time step t=0``) and the ``input features`` (``tweet_input_embeddings`` in our case).

Let us construct the new initial hidden state for a 2 layered RNN.

In [40]:
"""
hidden layer at time-step 0 (h_0)
"""
# first dimension - number of RNN layers (2)
# second dimension - number of examples/sentences in a batch (4)
# third dimension - number of nodes in hidden layer (50, size of the hidden layer)
h0 = torch.randn(2, 4, 50)
print("The shape as as expected: ", h0.shape)

The shape as as expected:  torch.Size([2, 4, 50])


Let us feed both the hidden representation constructed above and tweet embeddings to our RNN model.

In [41]:
"""
forward propagation over the RNN model
"""
print(tweet_input_embeddings.shape)
output, hn = rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

torch.Size([5, 4, 300])


``output`` tensor contains the output features $h_t$ from the last layer of the RNN

In [42]:
# output = seq_len, batch, hidden_size (output features from last layer of RNN)
print("output size: ", output.size())

output size:  torch.Size([5, 4, 50])


``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of nodes in a hidden layer) containing the hidden state for last time step ``t = max_seq_len`` for the ``2 layered RNN``.

In [43]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 4, 50])


#### Building tweet representation

Actually, `output` is tensor containing the output features (h_t) from the last layer of the RNN, for each t. Namely, `output` returns all the hidden states of all time steps from the last layer of the RNN. Hence, the last element of `output` is `h_n`. 
Let us print them out:

In [44]:
print("last element of output:\n", output[-1])

last element of output:
 tensor([[ 2.1954e-02, -4.5405e-01,  1.7242e-01, -2.1612e-01,  8.5858e-01,
         -5.0062e-02,  4.6210e-01,  2.3468e-01, -2.3579e-01, -7.0481e-01,
         -4.2845e-01, -3.2530e-01,  1.5369e-01, -1.4745e-03, -2.4446e-01,
         -7.9483e-01,  5.5982e-01, -2.3135e-01,  3.0479e-01, -5.2203e-01,
         -4.0454e-01,  4.7099e-01, -6.7573e-01, -8.0995e-01,  2.7188e-01,
          6.6281e-01, -4.8907e-01, -7.8016e-01, -1.4256e-01, -2.8643e-01,
          3.8399e-01,  9.2092e-02,  7.7104e-01,  8.5698e-03,  4.1950e-01,
          3.8412e-01, -1.7274e-02,  1.5807e-01,  2.8053e-01, -4.5239e-01,
          6.2239e-01,  1.4339e-01, -6.6002e-01, -6.4669e-01,  5.6979e-01,
          9.2288e-02,  4.3814e-01, -3.7532e-01, -2.6342e-01, -7.0075e-01],
        [-1.5976e-01, -3.8695e-01,  3.6826e-01, -2.2092e-01,  6.9086e-01,
         -5.4064e-01,  6.1590e-01,  7.6471e-02,  4.4412e-01, -6.2336e-01,
         -2.4065e-01, -4.9375e-02,  8.9721e-02,  4.5547e-01,  2.6374e-01,
         -8.

Then let us print out the last hidden state of last layer. 
Notice, we have two RNN layer, we only want to use the hidden state of last layer. 
You can find the these value are same as `output[-1]`.

In [45]:
print("last hidden state h_n:\n", hn[-1])

last hidden state h_n:
 tensor([[ 2.1954e-02, -4.5405e-01,  1.7242e-01, -2.1612e-01,  8.5858e-01,
         -5.0062e-02,  4.6210e-01,  2.3468e-01, -2.3579e-01, -7.0481e-01,
         -4.2845e-01, -3.2530e-01,  1.5369e-01, -1.4745e-03, -2.4446e-01,
         -7.9483e-01,  5.5982e-01, -2.3135e-01,  3.0479e-01, -5.2203e-01,
         -4.0454e-01,  4.7099e-01, -6.7573e-01, -8.0995e-01,  2.7188e-01,
          6.6281e-01, -4.8907e-01, -7.8016e-01, -1.4256e-01, -2.8643e-01,
          3.8399e-01,  9.2092e-02,  7.7104e-01,  8.5698e-03,  4.1950e-01,
          3.8412e-01, -1.7274e-02,  1.5807e-01,  2.8053e-01, -4.5239e-01,
          6.2239e-01,  1.4339e-01, -6.6002e-01, -6.4669e-01,  5.6979e-01,
          9.2288e-02,  4.3814e-01, -3.7532e-01, -2.6342e-01, -7.0075e-01],
        [-1.5976e-01, -3.8695e-01,  3.6826e-01, -2.2092e-01,  6.9086e-01,
         -5.4064e-01,  6.1590e-01,  7.6471e-02,  4.4412e-01, -6.2336e-01,
         -2.4065e-01, -4.9375e-02,  8.9721e-02,  4.5547e-01,  2.6374e-01,
         -8.4

Let us compute the final tweet representation:

In [46]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


In [47]:
tweet_output_embeddings

tensor([[ 2.1954e-02, -4.5405e-01,  1.7242e-01, -2.1612e-01,  8.5858e-01,
         -5.0062e-02,  4.6210e-01,  2.3468e-01, -2.3579e-01, -7.0481e-01,
         -4.2845e-01, -3.2530e-01,  1.5369e-01, -1.4745e-03, -2.4446e-01,
         -7.9483e-01,  5.5982e-01, -2.3135e-01,  3.0479e-01, -5.2203e-01,
         -4.0454e-01,  4.7099e-01, -6.7573e-01, -8.0995e-01,  2.7188e-01,
          6.6281e-01, -4.8907e-01, -7.8016e-01, -1.4256e-01, -2.8643e-01,
          3.8399e-01,  9.2092e-02,  7.7104e-01,  8.5698e-03,  4.1950e-01,
          3.8412e-01, -1.7274e-02,  1.5807e-01,  2.8053e-01, -4.5239e-01,
          6.2239e-01,  1.4339e-01, -6.6002e-01, -6.4669e-01,  5.6979e-01,
          9.2288e-02,  4.3814e-01, -3.7532e-01, -2.6342e-01, -7.0075e-01],
        [-1.5976e-01, -3.8695e-01,  3.6826e-01, -2.2092e-01,  6.9086e-01,
         -5.4064e-01,  6.1590e-01,  7.6471e-02,  4.4412e-01, -6.2336e-01,
         -2.4065e-01, -4.9375e-02,  8.9721e-02,  4.5547e-01,  2.6374e-01,
         -8.4492e-01,  6.4244e-01, -4

## RNN for Sentiment Analysis

In this section we will implement RNN for classifying the sentiment of the tweet (same task used in our previous feedforward neural networks tutorial).

We will pick up most of the functions from our feedforward neural networks code:

In [48]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed (for reproducibility)
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed)

# hyperparameters
MAX_EPOCHS = 5 # number of passes over the training data
LEARNING_RATE = 0.3 # learning rate for the weight update rule
NUM_CLASSES = 3 # number of classes for the problem
EMBEDDING_SIZE = 300 # size of the word embedding

Now we can define the full RNN model:

In [49]:
"""
create a model for RNN
"""
class RNNmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size, hidden_size, num_layers):
    # In the constructor we define the layers for our model
    super(RNNmodel, self).__init__()
    # word embedding lookup table
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size, sparse=True)
    # core RNN module
    self.rnn_layer = nn.RNN(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers) 
    # activation function
    self.activation_fn = nn.ReLU()
    # classification related modules
    self.linear_layer = nn.Linear(hidden_size, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=0)
    self.debug = False
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    if self.debug:
        print("input word indices shape = ", x.size())
    out = self.embedding(x)
    if self.debug:
        print("word embeddings shape = ", out.size())
    out, _ = self.rnn_layer(out) # since we are not feeding h_0 explicitly, h_0 will be initialized to zeros by default
    if self.debug:
        print("RNN output (features from last layer of RNN for all timesteps) shape = ", out.size())
    # classify based on the hidden representation after RNN processes the last token
    out = out[-1]
    if self.debug:
        print("Tweet embeddings or RNN output (features from last layer of RNN for the last timestep only) shape = ", out.size())
    out = self.activation_fn(out)
    if self.debug:
        print("ReLU output shape = ", out.size())
    out = self.linear_layer(out)
    if self.debug:
        print("linear layer output shape = ", out.size())
    out = self.softmax_layer(out) # accepts 2D or more dimensional inputs
    if self.debug:
        print("softmax layer output shape = ", out.size())
    return out

Some additional hyperparameters for RNN

In [50]:
# hyperparameters of RNN
HIDDEN_SIZE = 50 # no. of units in the hidden layer
NUM_LAYERS = 2 # no. of hidden layers

Rest of the pipeline looks similar to our feedforward neural networks code (except that we are using **torchtext** instead of **DataLoader**):

In [51]:
from sklearn.metrics import accuracy_score

def train(loader):
    total_loss = 0.0
    # iterate throught the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader):
    all_pred=[]
    all_label = []
    with torch.no_grad(): # impacts the autograd engine and deactivate it. reduces memory usage and speeds up computation
        for batch in loader:
             # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(model_outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(batch_output)
            
    accuracy = accuracy_score(all_label, all_pred)
    return accuracy

Let us define the RNN model.

In [52]:
# define the model
model = RNNmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS) 
model.to(device) # ship it to the right device

# define the loss function (last node of the graph)
criterion = nn.NLLLoss()

Let us make a full forward propagation pass over a sample input batch to the RNN model. Closely pay attention to the shapes of intermediate layers (by turning on debug mode of the model)

In [53]:
# turn on the debug mode
model.debug = True

# print the sample input batch and labels
print("sample input = ", tweets)
print("sample output = ", labels)

# feed the batch as input to the RNN model
model_prediction = model(tweets)
print('model prediction shape = ', model_prediction.size())

# feed the model prediction and labels to the loss function
loss = criterion(model_prediction, labels)
print("loss = ", loss.item())

# turn off the debug mode (as we go for training from now)
model.debug = False

sample input =  tensor([[ 191,    4, 1273,   49],
        [  14,   82,    2,    0],
        [   2,   73,  215,   18],
        [ 598,  145,   48,  215],
        [  21,   21,   21,   21]])
sample output =  tensor([0, 1, 0, 0])
input word indices shape =  torch.Size([5, 4])
word embeddings shape =  torch.Size([5, 4, 300])
RNN output (features from last layer of RNN for all timesteps) shape =  torch.Size([5, 4, 50])
Tweet embeddings or RNN output (features from last layer of RNN for the last timestep only) shape =  torch.Size([4, 50])
ReLU output shape =  torch.Size([4, 50])
linear layer output shape =  torch.Size([4, 3])
softmax layer output shape =  torch.Size([4, 3])
model prediction shape =  torch.Size([4, 3])
loss =  1.3757504224777222


**We need to create a new directory 'ckpt/' to store our model checkpoint.**

In [54]:
import os
if not os.path.exists("./ckpt"): # check if the directory doesn't exist already
    os.mkdir("./ckpt")

**Let us perform the training. We will save our model and optimizer at end of each epoch.**


You can find more information of saving and loading model [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

In [55]:
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

# start the training
for epoch in range(MAX_EPOCHS):
    # train the model for one pass over the data
    train_loss = train(train_iter)  
    # compute the training accuracy
    train_acc = evaluate(train_iter)
    # compute the validation accuracy
    val_acc = evaluate(val_iter)
    
    # print the loss for every epoch
    print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))
    
    # save model, optimizer, and number of epoch to a dictionary
    model_save = {
            'epoch': epoch,  # number of epoch
            'model_state_dict': model.state_dict(), # model parameters 
            'optimizer_state_dict': optimizer.state_dict(), # save optimizer 
            'loss': train_loss # training loss
            }
    
    # use torch.save to store 
    torch.save(model_save, "./ckpt/model_{}.pt".format(epoch))



Epoch [1/5], Loss: 0.3480, Training Accuracy: 0.3655, Validation Accuracy: 0.3417




Epoch [2/5], Loss: 0.3464, Training Accuracy: 0.4268, Validation Accuracy: 0.3922




Epoch [3/5], Loss: 0.3450, Training Accuracy: 0.3728, Validation Accuracy: 0.3472




Epoch [4/5], Loss: 0.3437, Training Accuracy: 0.4113, Validation Accuracy: 0.3742




Epoch [5/5], Loss: 0.3471, Training Accuracy: 0.3773, Validation Accuracy: 0.3607


We trained the network only for 5 epochs, but it already overfits on validation set after epoch 2. 
In the coming sessions, we will look at methods to ``regularize`` the network (this will help us deal with overfitting).

**Load model checkpoint** 

When we have a trained model checkpint, we can load it using `torch.load()`

In [56]:
# define a new model
model2 = RNNmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS) 

# load checkpoint 
checkpoint = torch.load("./ckpt/model_1.pt") # loading the model obatined after 2nd epoch

# assign the parameters of checkpoint to this new model
model2.load_state_dict(checkpoint['model_state_dict'])
model2.to(device)

print(model2) # can be used for inference or for further training

RNNmodel(
  (embedding): Embedding(3338, 300, sparse=True)
  (rnn_layer): RNN(300, 50, num_layers=2)
  (activation_fn): ReLU()
  (linear_layer): Linear(in_features=50, out_features=3, bias=True)
  (softmax_layer): LogSoftmax(dim=0)
)


## GRUs

Gated Recurrent Units (GRUs) are a variant of RNNs that use more complex units for activation. They are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

GRU is defined by ``torch.nn.GRU`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#gru). Now let us define the GRU module.

In [57]:
"""
define the GRU module
"""
# first input - number of word vector dimensions/embeddings
# second input - number of nodes in hidden layer (50, size of the hidden layer)
# third input - number of recurrent layers (2)
gru_rnn = nn.GRU(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Similar to RNN, GRU module takes two inputs: *the initial hidden state for each element in the batch* (t=0) and the *input features* (``tweet_input_embeddings`` in our case).

Let us feed both the initial hidden state and tweet embeddings to our GRU model.

In [58]:
"""
forward propagation over the GRU model
"""
output, hn = gru_rnn(tweet_input_embeddings, h0) # h0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the GRU

In [59]:
# output = seq_len, batch, hidden_size (output features from last layer of GRU)
print("output size: ", output.size())

output size:  torch.Size([5, 4, 50])


``hn`` is a tensor of shape (num_layers, batch_size, hidden_size / number of nodes in a hidden layer) containing the hidden state for last time step ``t = max_seq_len`` for the ``2 layered RNN``.

In [60]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 4, 50])


Similar to RNN, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [61]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (5)
# second dimension - number of features in hidden state h_t (20, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


## LSTMs

Long short-term memory (LSTMs) are a variant of RNNs that use more complex units for activation. Similar to the spirit of GRU, they are created to have more persistent memory thereby making them easier for RNNs to capture long-term dependencies. To learn the theory behind GRUs, we recommend: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/RNN.pdf 

LSTM is defined by ``torch.nn.LSTM`` module and its documentation can be fetched [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#lstm). Now let us define the LSTM module.

In [62]:
"""
define the LSTM module
"""
# first input - number of features in x (300, size of the word embedding)
# second input - number of number of nodes in a hidden layer (50)
# third input - number of recurrent layers (2)
lstm_rnn = nn.LSTM(input_size=300, hidden_size=50, num_layers=2) # input_size, hidden_size, num_layers

Unlike RNN and GRU, LSTM module takes **three inputs**: the **initial hidden state** for each element in the batch (t=0), the **input features** (tweet_input_embeddings in our case) and the **initial cell state** for each element in the batch.

Let us construct the initial cell state (this construction is similar to that of initial hidden state)

In [63]:
"""
cell state at time-step 0 (h_0)
"""
# first dimension - number of LSTM layers (2)
# second dimension - batch_size (# of tweets/examples/sentences)
# third dimension - hidden_size / number of nodes in a hidden layer (50)
c0 = torch.randn(2, 4, 50)

Let us feed the initial hidden state, initial cell state and tweet embeddings to our LSTM model.

In [64]:
"""
forward propagation over the LSTM model
"""
output, (hn, cn) = lstm_rnn(tweet_input_embeddings, (h0, c0)) # h0 and c0 is optional input, defaults to tensor of 0's when not provided

``output`` tensor contains the output features $h_t$ from the last layer of the LSTM

In [65]:
# output = seq_len, batch_size, hidden_size (output features from last layer of LSTM)
print("output size: ", output.size())

output size:  torch.Size([5, 4, 50])


``hn`` is a tensor of shape (num_layers, batch, hidden_size) containing the hidden state for t = seq_len

In [66]:
# h_n = num_layers, batch, hidden_size (hidden state for t=seq_len or hidden state at last timestep)
print("last hidden state size: ", hn.size())

last hidden state size:  torch.Size([2, 4, 50])


``cn`` is a tensor of shape (num_layers, batch, hidden_size) containing the cell state for t = seq_len.

In [67]:
# c_n = num_layers, batch_size, hidden_size (cell state for t=seq_len or cell state at last timestep)
print("last cell state size: ", hn.size())

last cell state size:  torch.Size([2, 4, 50])


Similar to RNN and GRU, you can compute the final tweet representation (representation from last hidden state for each tweet) as follows.

In [68]:
tweet_output_embeddings = output[-1,:,:] # -1 fetches the embeddings from the last timestep
print("tweet output embeddings size: ", tweet_output_embeddings.size())
# first dimension - number of tweets in the batch (4)
# second dimension - number of features in hidden state h_t (50, size of the hidden layer)

tweet output embeddings size:  torch.Size([4, 50])


That's it!