# LSTM Example in PyTorch
## STAT 940 - Deep Learning

### Introduction

In this tutorial we will train an LSTM (Long Short-Term Memory) network for an NLP (Natural Language Processing) task in the following 4 steps:
- Step 1)  Import libraries and set parameters
- Step 2) Load and prepare data 
- Step 3) Model set up
- Step 4) Set up optimizer and loss, & train the model

We will also see how to:
- Save and load learned models
- Predict labels of test data
- Check accuracy of model

All hyperparameters are named in caps.

**Disclaimer: This notebook is a demonstration of an LSTM model for NLP on PyTorch. Use it at your own risk.**

## STEP 1 - Import libraries and set up some configurations

First, we will import the libraries and packages needed to build and train the LSTM network.
- 1) Python essentials
- 2) `torch`, `torch.nn`, `torch.nn.functional`, `torch.optim`: PyTorch imports for building custom neural networks containing
    - Sequential model type: This provides linear stack of neural network layers
    - Core layers (Linear, Dropout, ReLU): these layers are used in most neural networks
    - Recurrent layers 
    - Data tools: `torch.utils.data.Dataset` for custom data sets, `torch.utils.data.DataLoader` for loading in data
    - Optimizers in `torch.optim`
- 3) `torchtext`: PyTorch package for NLP (Natural Language Processing) tasks, which contains popular datasets and utils for easy use.
- 4) `tokenizers`: a [HuggingFace](https://huggingface.co/) library that enables quick tokenizer training and vocab building.

In [1]:
# 1)
import numpy as np
# 2)
import torch     
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
# 3)
#!pip install torchtext  # if your environment doesn't have this package 
import torchtext

Seed everything!

In [2]:
SEED = 940
np.random.seed(SEED)    # set seed for reproducibility (within numpy)
torch.manual_seed(SEED) # set seed for reproducibility (within pytorch)

<torch._C.Generator at 0x7fbe8e296bb0>
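
As a quick check of what seeding buys you, here is a minimal sketch. The helper name `seed_everything` is hypothetical (it is not part of this notebook), and it additionally seeds Python's built-in `random` module, which the cell above does not do. Re-seeding and drawing again should reproduce the same random numbers.

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    """Seed the Python, NumPy and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


seed_everything(940)
first = torch.rand(3)       # three random numbers after seeding

seed_everything(940)
second = torch.rand(3)      # same seed, so the same three numbers

print(torch.equal(first, second))  # True
```

Note that seeding alone may not make GPU training bit-for-bit reproducible, since some CUDA kernels are non-deterministic.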

#### Set up GPU (optional)

Leverage GPUs for faster training of neural networks. This section will help you set up GPUs on PyTorch. Run this code whether or not you are using a GPU, and it will detect your current device and pass the info to PyTorch.

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # check if GPU is available
print(device)

cuda:0
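
Once `device` is known, both the model and every input tensor have to live on it. A minimal sketch with a toy linear layer (the layer and tensor here are illustrative only, not the model used later):

```python
import torch

# same device detection as above
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.ones(2, 3)                       # created on the CPU by default
x_dev = x.to(device)                       # copy of x on the selected device
layer = torch.nn.Linear(3, 1).to(device)   # parameters moved to the same device

y = layer(x_dev)                           # inputs and parameters must share a device
print(y.device.type == device.type)        # True
```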


## STEP 2 - Load and prepare data

Our toy dataset, provided by the function `torchtext.datasets.IMDB()`, is for a sentiment analysis task in which we classify English movie reviews into two sentiments, positive or negative. This dataset was contributed by the AI research team at Stanford (see their [website](http://ai.stanford.edu/~amaas/data/sentiment/)). It consists of 25,000 training examples and 25,000 test examples. After executing the cell below, you can check the `'./IMDB/aclImdb/README'` file for more info on the dataset.

If the data is not stored at the PyTorch site, you might need to load data from your own directory using different types of dataset objects, or even writing a custom subclass of `torch.utils.data.Dataset`.
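
For reference, a custom map-style dataset only needs `__len__` and `__getitem__`. The sketch below is a hypothetical example (the class name `ReviewDataset` and the toy reviews are made up) that yields `(label, text)` pairs in the same order as the IMDB dataset used here:

```python
from torch.utils.data import Dataset, DataLoader


class ReviewDataset(Dataset):
    """Map-style dataset over in-memory (label, text) pairs (hypothetical example)."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]


toy_pairs = [("pos", "great movie"), ("neg", "terrible plot"), ("pos", "loved it")]
toy_data = ReviewDataset(toy_pairs)

# identity collate_fn: each batch stays a plain list of (label, text) tuples
toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False,
                        collate_fn=lambda batch: batch)
batches = list(toy_loader)
print(len(batches))  # 2
```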

In [4]:
from torchtext.datasets import IMDB as imdb

# Download dataset from pytorch and untar
train = imdb(root='./', split='train')  # "./" represents the current working directory
test = imdb(root='./', split='test')

print('# of training examples:', len(train))
print("# of test examples:", len(test))

# of training examples: 25000
# of test examples: 25000


In [5]:
# show some examples
train = list(train)
test = list(test)
train[0:5]

[('neg',
  'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far 

In [6]:
test[0:5]

[('neg',
  'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish a

In [7]:
del train[:]; del test[:]

# delete all elements in the list to release memory
# we won't be storing the datasets in RAM all the time

### Train tokenizer and build vocabulary

This notebook shows you how to perform tokenization and vocab building using the HuggingFace library. Check out their [github repo](https://github.com/huggingface/tokenizers). 

To learn more about different trainable tokenizers & vocab builders, see this reddit [post](https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/). 

In [8]:
%%capture
!pip install tokenizers

In [9]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# get a tokenizer model
tokenizer = Tokenizer(WordPiece())

# pre-tokenizer splits a sentence to 'prototype' tokens according to a set of rules
# which will then be fed to the tokenizer trainer 
# and the final trained tokens will be generated from disentangling and recombining those prototype tokens
tokenizer.pre_tokenizer = Whitespace() 

def yield_text(data_iter):
    '''Creates a generator for all texts in the whole dataset'''

    for _, text in data_iter:
        yield text

VOCAB_SIZE = 30000
trainer = WordPieceTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"])

# the IMDB function returns an iterator of all datapoints
# so you will see this line multiple times
train_data = imdb(root='./', split='train')

# train tokenizer on training data
tokenizer.train_from_iterator(yield_text(train_data), trainer=trainer)

In [10]:
# get tokens of a string
tokenizer.encode('This is a notebook demonstrating the LSTM model for STAT 940.').tokens

['This',
 'is',
 'a',
 'note',
 '##book',
 'demonstrating',
 'the',
 'L',
 '##ST',
 '##M',
 'model',
 'for',
 'ST',
 '##AT',
 '94',
 '##0',
 '.']
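
The `##` prefix marks a piece that continues a word rather than starting one. A minimal sketch of greedy longest-match-first splitting, the lookup strategy commonly used to apply a WordPiece vocabulary, is shown below. The function and the toy vocabulary are illustrative assumptions; this is not the `tokenizers` implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"note", "##book", "L", "##ST", "##M", "model"}
print(wordpiece_split("notebook", toy_vocab))  # ['note', '##book']
print(wordpiece_split("LSTM", toy_vocab))      # ['L', '##ST', '##M']
print(wordpiece_split("xyz", toy_vocab))       # ['[UNK]']
```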

In [11]:
# get token ids of a string
tokenizer.encode('This is a notebook demonstrating the LSTM model for STAT 940.').ids

[524,
 351,
 67,
 2924,
 10474,
 24657,
 327,
 46,
 2131,
 233,
 4886,
 388,
 3407,
 2059,
 25702,
 225,
 16]

In [12]:
# vocab is a dictionary that maps included tokens to integer ids
# it is already computed when the tokenizer is trained, and stored in tokenizer.get_vocab()
list(tokenizer.get_vocab().items())[0:10]

[('##irds', 6118),
 ('Cros', 12247),
 ('##OPLE', 16908),
 ('Transyl', 20237),
 ('EAR', 19142),
 ('stole', 9585),
 ('##arios', 12281),
 ('brainwashed', 22845),
 ('Firstly', 11266),
 ('town', 1786)]

In [13]:
tokenizer.get_vocab()["[UNK]"]

0

### Define data pipelines

Text pipeline: tokenize, map to integers, cut extra length

Label pipeline: encode labels as integers

In [14]:
MAXLEN = 80

def cut(x, maxlen=128):
    """Cuts a sequence x if its length exceeds `maxlen` """

    return x[0:maxlen] if len(x) > maxlen else x

def text_pipeline(x):
    """Pipeline of preprocessing text data. 
       Transformations include tokenizing, mapping from tokens to integers, and cutting extra length"""

    x = tokenizer.encode(x).ids
    x = cut(x, maxlen=MAXLEN)
    return x

def label_pipeline(x): 
    """Pipeline of encoding labels. 
    New label <- 1 if original label == 'pos', new label <- 0 if otherwise"""

    return 1 if x == 'pos' else 0


Next step: **collate function for DataLoaders**.

PyTorch `DataLoader`s are essential for loading data before training. A collate function is the part of the dataloader that batches examples from the training set together and converts the batched examples to a single tensor. 

However, the sequences coming out of the preprocessing pipeline can still differ in length (those shorter than `MAXLEN` pass through uncut), so we need to zero-pad the shorter ones before batching them; otherwise PyTorch cannot stack tensors of different sizes. This is also done inside the collate function `collate_batch`, where `pad_sequence` pads every sequence to the length of the longest one in the batch (at most `MAXLEN`).

In [15]:
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    """
    Transforms texts and labels according to corresponding pipelines,
    zero pad shorter sequences, and stack all sequences in the batch along a new axis
    
    Input: batch -- a list of tuples with size == BATCH_SIZE, 
            1st element of the tuple is the label, 2nd element of the tuple is the text
    Output: a tuple of two tensors, the 1st tensor contains labels in the batch, of size (BATCH_SIZE,),
            the 2nd tensor contains processed sequences in the batch, of size (BATCH_SIZE, L),
            where L <= MAXLEN is the length of the longest sequence in the batch"""

    label_list, text_list = [], []

    for (_label, _text) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)

    label_list = torch.tensor(label_list, dtype=torch.float32)  
    # float type labels required for BCE loss
    
    # zero pad sequences to the length of the longest one in the batch (at most MAXLEN)
    text_list = pad_sequence(text_list, batch_first=True)
    
    return label_list.to(device), text_list.to(device)

# now you can call DataLoader to load up data
# train_loader = DataLoader(split_train_, batch_size=BATCH_SIZE,
#                              shuffle=True, collate_fn=collate_batch)
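
To see what `pad_sequence` does on its own, here is a small sketch with toy id sequences. It also shows one possible way to pad all the way to a fixed length with `torch.nn.functional.pad`; the target length of 5 is an arbitrary illustrative value. Keep in mind that in this notebook id 0 is also the `[UNK]` token, so padding and unknown tokens share the same embedding row.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8]), torch.tensor([9, 10])]

# pads with 0 (the default padding_value) up to the longest sequence in the list
padded = pad_sequence(seqs, batch_first=True)
print(padded.tolist())   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]

# optional: right-pad the last dimension further, up to a fixed length
fixed_len = 5
fixed = F.pad(padded, (0, fixed_len - padded.size(1)), value=0)
print(fixed.shape)       # torch.Size([3, 5])
```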

## STEP 3 - Model set up
Here, we define a model with an embedding layer, three bidirectional LSTM layers (`num_layers=3, bidir=True`), and a linear layer with sigmoid activation. These arguments give you the flexibility to tweak your LSTM. For example, you can reduce it to its simplest form by setting `bidir=False`, so that it is unidirectional, and `num_layers=1`. You can also pass `dropout=0.5` to `nn.LSTM` to add dropout with probability 0.5 after every LSTM layer except the final one. 

In [16]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=3, 
                 output_size=1,
                 bidir=True):
        """Defines individual layers and config the model class"""

        super().__init__()
        self.hidden_dim = hidden_dim
        self.output_size = output_size
        self.embedding = nn.Embedding(vocab_size,embed_dim,padding_idx=0)
        self.num_layers = num_layers  # of recurrent layers in LSTM

        # the embed vector at the padding_idx remains all zeros during training
        self.lstm = nn.LSTM(embed_dim,hidden_dim,num_layers,
                           batch_first=True,
                           bidirectional=bidir)
      
        # A multiplier of layers if bidirectional
        self.bidir = 2 if bidir else 1
                
        self.fc = nn.Linear(hidden_dim*self.num_layers*self.bidir, self.output_size)
        self.sigmoid = nn.Sigmoid()

        # weights have to be initialized away from zeros if using SGD
        self.init_weights()   

    def init_weights(self):
        """Initializes model parameters"""

        initrange = 0.5
        nn.init.uniform_(self.embedding.weight,-initrange,initrange)
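        for k in range(self.num_layers):
            # note: only the forward-direction parameters are touched here; with bidir=True
            # the `*_reverse` parameters keep PyTorch's default initialization
            nn.init.uniform_(getattr(self.lstm, f"weight_hh_l{k}"), -initrange, initrange)
            nn.init.uniform_(getattr(self.lstm, f"weight_ih_l{k}"), -initrange, initrange)
            nn.init.zeros_(getattr(self.lstm, f"bias_hh_l{k}"))
            nn.init.zeros_(getattr(self.lstm, f"bias_ih_l{k}"))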
        nn.init.uniform_(self.fc.weight,-initrange,initrange)
        nn.init.zeros_(self.fc.bias)

    def init_hidden(self, bsz):
        """
        Initializes hidden state and cell state. 
        The first output is the hidden state, and the 2nd is the cell state"""

        return (torch.zeros(self.bidir*self.num_layers, bsz, self.hidden_dim).to(device),
                torch.zeros(self.bidir*self.num_layers, bsz, self.hidden_dim).to(device))

    def forward(self, text, hidden):
        """Forward pass function, called during both training and evaluation"""

        embedded = self.embedding(text)
        # embedded tensor: (BATCH_SIZE, MAXLEN, EMBED_DIM)

        lstm_out, (hn, cn) = self.lstm(embedded, hidden)
        # lstm_out: (BATCH_SIZE, MAXLEN, bidir*HIDDEN_DIM)

        # swap 1st and 2nd axes so that datapoint indices corresponds to the 1st axis
        hn = torch.transpose(hn,0,1)  

        # for each datapoint, concat hidden states of all layers  
        hn = hn.reshape(-1, self.num_layers*self.hidden_dim*self.bidir)
        # hn: (BATCH_SIZE, NUM_LAYERS*bidir*HIDDEN_DIM)
        
        a = self.fc(hn)
        # a: (BATCH_SIZE, 1)

        return self.sigmoid(a).squeeze(-1) # drop the trailing length-1 dimension -> (BATCH_SIZE,), also safe when the batch holds a single example

EMBED_DIM  = 128   # size of the embedding 
HIDDEN_DIM = 64    # size of the hidden layer in LSTM
net = Net(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, num_layers=3, bidir=True).to(device)
print(net)

Net(
  (embedding): Embedding(30000, 128, padding_idx=0)
  (lstm): LSTM(128, 64, num_layers=3, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=384, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
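
The `in_features=384` of the final linear layer comes from concatenating the last hidden state of every layer and direction: 3 layers × 2 directions × 64 hidden units. A minimal sketch that checks these shapes on a dummy batch (the batch size of 4 is arbitrary; the other sizes mirror the hyperparameters above):

```python
import torch
import torch.nn as nn

batch, seq_len, embed_dim, hidden_dim, num_layers = 4, 80, 128, 64, 3
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
               batch_first=True, bidirectional=True)

x = torch.zeros(batch, seq_len, embed_dim)
lstm_out, (hn, cn) = lstm(x)   # initial states default to zeros when omitted

print(lstm_out.shape)  # torch.Size([4, 80, 128]) -> 2 directions * 64
print(hn.shape)        # torch.Size([6, 4, 64])   -> 3 layers * 2 directions

# same transpose + reshape as in Net.forward
flat = hn.transpose(0, 1).reshape(batch, -1)
print(flat.shape)      # torch.Size([4, 384])
```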


## STEP 4 - Set up optimizer and loss, train the model

In [17]:
LR = 1e-2                   # learning rate
MOMENTUM = 0.9              # SGD momentum

# binary cross entropy loss
criterion = nn.BCELoss() 

# stochastic gradient descent as optimizer
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=MOMENTUM)
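
For intuition about the loss values reported during training, binary cross entropy can be computed by hand. The sketch below is a plain-Python illustration of the formula (averaged over examples, which matches the default `reduction='mean'` of `nn.BCELoss`); it is not how PyTorch implements it.

```python
import math


def bce(probs, labels):
    """Mean binary cross entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over examples."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(probs, labels)]
    return sum(terms) / len(terms)


# a model that always outputs 0.5 scores ln(2) regardless of the labels
print(round(bce([0.5, 0.5], [1.0, 0.0]), 4))   # 0.6931

# confident, correct predictions give a lower loss
print(bce([0.9, 0.1], [1.0, 0.0]) < bce([0.5, 0.5], [1.0, 0.0]))  # True
```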

#### Train & evaluate functions

In [18]:
import time

def train(dataloader):
    # tell the model that we're training it now
    net.train()

    # initialize hidden states to all zeros
    hidden = net.init_hidden(bsz=BATCH_SIZE)

    # initialize useful training metrics
    total_acc, total_count, avg_loss = 0, 0, 0
    log_interval = 100
    start_time = time.time()


    for idx, (label, text) in enumerate(dataloader):
        # at the final batch, the actual batch size < BATCH_SIZE
        # so reinit the hidden states to match the correct shapes
        if label.size(0) != BATCH_SIZE:
            hidden = net.init_hidden(bsz=label.size(0))

        # reset gradients 
        optimizer.zero_grad()

        # forward pass 
        output = net(text, hidden) 

        # compute loss
        loss = criterion(output, label)
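        # .item() converts the loss to a Python float, so the running total does not
        # keep a reference to the autograd graph.
        # note: `loss` is already the batch mean, so this running value is a scaled
        # sum over batches rather than a per-example average
        avg_loss += loss.item()/label.size(0)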

        # compute gradients
        loss.backward()
        
        # update model parameters
        optimizer.step()

        # count accurate predictions 
        total_acc += (torch.round(output).int() == label).sum().item()
        
        # count cumulative examples in the current epoch
        total_count += label.size(0)

        # report training progress every 100 batches
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| loss {:.4f} | accuracy {:8.3f}'.format(epoch, idx, len(dataloader), 
                                              avg_loss, total_acc/total_count))
            # reset metrics every 100 batches
            total_acc, total_count, avg_loss = 0, 0, 0
            start_time = time.time()

def evaluate(dataloader):
    # switch to evaluation mode
    net.eval()
    hidden = net.init_hidden(bsz=BATCH_SIZE)

    total_acc, total_count, avg_loss = 0, 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            if label.size(0) != BATCH_SIZE:
                hidden = net.init_hidden(bsz=label.size(0))
            output = net(text, hidden)
            loss = criterion(output, label)
            avg_loss += loss.item()/label.size(0)  # accumulate as a Python float
            total_acc += (torch.round(output).int() == label).sum().item()
            total_count += label.size(0)
    
    # return validation loss & accuracy
    return avg_loss, total_acc/total_count
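
In the functions above, `avg_loss` adds up each batch's mean loss divided by its batch size, so its scale depends on how many batches were accumulated. If a per-example average is preferred, one option is to weight each batch mean by its size. A minimal sketch with made-up numbers (the helper name is hypothetical):

```python
def weighted_mean_loss(batch_losses, batch_sizes):
    """Per-example mean loss from per-batch mean losses and batch sizes."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


losses = [0.70, 0.60, 0.50]   # mean loss of each batch (made-up values)
sizes = [64, 64, 32]          # the last batch is smaller

mean_loss = weighted_mean_loss(losses, sizes)
print(round(mean_loss, 4))    # 0.62
```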

#### Training Loop

In [19]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

EPOCHS = 40       # number of epochs
BATCH_SIZE = 64   # batch size for training
TRAIN_SPLIT = 0.8 # split 80% of training data to train model, 20% to end-of-epoch validation

train_data = imdb(root='./', split='train')

# in order to shuffle data during training (which is essential for training)
# convert the dataset type to map-style
train_data = to_map_style_dataset(train_data)

# training and validation split
num_train = int(len(train_data) * TRAIN_SPLIT)
split_train_, split_valid_ = \
    random_split(train_data, [num_train, len(train_data) - num_train])

# data loaders for the training loop
train_loader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_loader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=False, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    # record starting time of the epoch
    epoch_start_time = time.time()

    # training
    train(train_loader)

    # validation
    loss_val, accu_val = evaluate(valid_loader)

    print('-' * 90)
    print('| end of epoch {:3d} | time: {:5.2f}s '
          '| valid loss {:.4f} | valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           loss_val, accu_val))
    print('-' * 90)

| epoch   1 |   100/  313 batches | loss 1.1711 | accuracy    0.504
| epoch   1 |   200/  313 batches | loss 1.1177 | accuracy    0.518
| epoch   1 |   300/  313 batches | loss 1.1058 | accuracy    0.514
------------------------------------------------------------------------------------------
| end of epoch   1 | time: 14.69s | | valid loss 0.9581 | valid accuracy    0.522 
------------------------------------------------------------------------------------------
| epoch   2 |   100/  313 batches | loss 1.0859 | accuracy    0.553
| epoch   2 |   200/  313 batches | loss 1.0768 | accuracy    0.547
| epoch   2 |   300/  313 batches | loss 1.0741 | accuracy    0.546
------------------------------------------------------------------------------------------
| end of epoch   2 | time: 14.35s | | valid loss 0.9450 | valid accuracy    0.536 
------------------------------------------------------------------------------------------
| epoch   3 |   100/  313 batches | loss 1.0702 | accuracy    
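
The `313` batches per epoch in the log follow directly from the split and batch size. A short arithmetic sketch using the values of `TRAIN_SPLIT` and `BATCH_SIZE` from the cell above:

```python
import math

n_total, train_split, batch_size = 25000, 0.8, 64

num_train = int(n_total * train_split)              # examples used for training
num_valid = n_total - num_train                     # examples held out for validation
train_batches = math.ceil(num_train / batch_size)   # the last batch may be partial
last_batch = num_train - (train_batches - 1) * batch_size

print(num_train, num_valid)       # 20000 5000
print(train_batches, last_batch)  # 313 32
```

The smaller final batch is the reason `train` and `evaluate` re-initialize the hidden state when `label.size(0) != BATCH_SIZE`.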

### Save and load the model

In [20]:
model_path = './imdb_lstm.pth'

torch.save(net.state_dict(), model_path) # save model to path

net = Net(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM).to(device)
net.load_state_dict(torch.load(model_path)) # load the weights of saved model

<All keys matched successfully>
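
If a model saved on a GPU machine has to be loaded on a machine without one, `torch.load` accepts a `map_location` argument. A minimal sketch of the same save/load round trip with a toy linear layer and a temporary file (both are illustrative stand-ins for `net` and `model_path`):

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pth")
    torch.save(model.state_dict(), path)              # save only the parameters
    state = torch.load(path, map_location="cpu")      # remap all tensors to the CPU

restored = nn.Linear(4, 1)             # must be built with the same architecture
restored.load_state_dict(state)
restored.eval()                        # switch to evaluation mode before inference

print(torch.equal(model.weight, restored.weight))  # True
```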

### Check test accuracy of model

In [21]:
test_data = imdb(root='./', split='test')
test_data = to_map_style_dataset(test_data)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False,
                             collate_fn=collate_batch) 
# if your test data is unlabelled, collate_batch will not work as-is because it calls
# label_pipeline on every example; in that case, write a separate collate function
# (or loop) that applies only text_pipeline and padding, and pass that instead
# of collate_fn=collate_batch

# here the test set does have labels, so collate_batch can be reused

_, test_acc = evaluate(test_loader)
print("Test accuracy:", test_acc)

Test accuracy: 0.62888
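
For unlabelled data, a text-only collate function could look like the sketch below. The names `make_text_collate` and `toy_encode` are hypothetical, and the toy whitespace encoder only stands in for the trained tokenizer; in this notebook one would presumably pass something like `lambda t: tokenizer.encode(t).ids` and `MAXLEN` instead.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def make_text_collate(encode, maxlen):
    """Builds a collate function for batches that contain raw texts only."""
    def collate_texts(batch):
        # encode, truncate to maxlen, then zero-pad to the longest sequence in the batch
        seqs = [torch.tensor(encode(text)[:maxlen], dtype=torch.int64)
                for text in batch]
        return pad_sequence(seqs, batch_first=True)
    return collate_texts


# toy encoder: unknown words map to id 0
toy_ids = {"good": 1, "bad": 2, "film": 3}

def toy_encode(text):
    return [toy_ids.get(tok, 0) for tok in text.split()]


collate_texts = make_text_collate(toy_encode, maxlen=3)
batch = collate_texts(["good film", "bad bad bad film", "film"])
print(batch.tolist())  # [[1, 3, 0], [2, 2, 2], [3, 0, 0]]
```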


### Get predictions from one batch

In [22]:
_, seqs = next(iter(test_loader))
net.eval()
with torch.no_grad():
    h = net.init_hidden(bsz=seqs.size(0))
    outputs = net(seqs, h)
    preds = torch.round(outputs).int()
preds

tensor([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1,
        1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], device='cuda:0',
       dtype=torch.int32)