# Homework 5: Compare LSTM and GRU

By Yiheng Xiao

I will compare:
- Long short-term memory RNN (LSTM)
- Gated recurrent unit RNN (GRU)

Both models reach roughly 87% test accuracy (86.81% for the LSTM and 87.65% for the GRU).

In [1]:
import torch
from torchtext import data
from torchtext import datasets
import random
print("Let's use", torch.cuda.device_count(), "GPU(s)!")

Let's use 1 GPU(s)!


## Preparing Data

The same as before, we'll set the seed, define the `Fields` and get the train/valid/test splits.

In [2]:
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

train, test = datasets.IMDB.splits(TEXT, LABEL)

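# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())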

The first update is the addition of pre-trained word embeddings. These vectors have been trained on corpora of billions of tokens. Instead of initializing our word embeddings randomly, we initialize them with these pre-trained vectors, in which words that appear in similar contexts lie close together in the vector space.

The first step is to specify which vectors to use by passing the `vectors` argument to `build_vocab`, which also downloads them. In a name such as `glove.6B.100d`, `glove` is the algorithm used to compute the vectors (see [here](https://nlp.stanford.edu/projects/glove/) for more), `6B` indicates that the vectors were trained on 6 billion tokens, and `100d` indicates that they are 100-dimensional.

**Note**: at the time of writing, the download of `glove.6B.100d` from https://nlp.stanford.edu/projects/glove/ was unavailable, so I use `charngram.100d` instead.
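To make "close together in the vector space" concrete, here is a minimal sketch using cosine similarity, a common closeness measure for word vectors. The three 4-dimensional vectors are made-up toy values chosen for illustration, not real `charngram` or GloVe entries.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (the real ones are 100-dimensional and learned from data)
toy_vectors = {
    "good":  [0.9, 0.8, 0.1, 0.0],
    "great": [0.8, 0.9, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9, 0.8],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_table = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
print(f"good vs great: {sim_good_great:.3f}")
print(f"good vs table: {sim_good_table:.3f}")
```

With these toy values, the two sentiment words score much higher with each other than either does with the unrelated word.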

In [3]:
TEXT.build_vocab(train, max_size=25000, vectors = "charngram.100d")
# available pretrained vectors: ['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']
LABEL.build_vocab(train)

As before, we create the iterators.

**Note:** Choose a `BATCH_SIZE` that fits the memory of your GPU. With more GPU memory, say 8 GB, you can raise `BATCH_SIZE` to 64.

In [4]:
BATCH_SIZE = 8

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
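`BucketIterator` batches examples of similar length together so that little padding is needed, which is why it is given a `sort_key` based on text length. The idea can be sketched in plain Python as follows. This is not torchtext's actual implementation; `make_buckets`, `padding_needed` and the toy reviews are hypothetical names used only for illustration.

```python
def make_buckets(examples, batch_size, sort_key):
    """Sort examples by length, then cut the sorted list into batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batch):
    """Number of pad tokens needed to bring every example to the batch maximum."""
    longest = max(len(ex) for ex in batch)
    return sum(longest - len(ex) for ex in batch)

# Toy tokenized reviews of very different lengths
reviews = [["w"] * n for n in (3, 50, 4, 48, 5, 52)]

# Batches taken in the original order mix short and long reviews
naive_batches = [reviews[i:i + 3] for i in range(0, len(reviews), 3)]
# Bucketed batches group reviews of similar length
bucketed_batches = make_buckets(reviews, batch_size=3, sort_key=len)

naive_padding = sum(padding_needed(b) for b in naive_batches)
bucketed_padding = sum(padding_needed(b) for b in bucketed_batches)
print("padding without bucketing:", naive_padding)
print("padding with bucketing:", bucketed_padding)
```

Less padding means fewer wasted time steps in the recurrent layers for each batch.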

# LSTM

A standard RNN is difficult to train because the gradient decays exponentially along the sequence, causing the RNN to "forget" what happened earlier in the sequence. As in a standard RNN, the hidden state of an LSTM can be thought of as a summary of the words the model has seen so far. LSTMs add an extra recurrent state called a _cell_, which acts as the "memory" of the LSTM and can retain information over many time steps. LSTMs also use multiple _gates_, which control the flow of information into and out of this memory.

![](https://i.imgur.com/knsIzeh.png)
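The gates described above can be written out for a single time step. The sketch below follows the gate equations given in PyTorch's `nn.LSTMCell` documentation and reuses that module's weights, so the manual result can be checked against the library. `manual_lstm_step` and the tiny dimensions are illustrative choices, not part of the models trained in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3
cell = nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(1, input_dim)          # one input vector (batch size 1)
h_prev = torch.zeros(1, hidden_dim)    # previous hidden state
c_prev = torch.zeros(1, hidden_dim)    # previous cell state

def manual_lstm_step(x, h_prev, c_prev, cell):
    """One LSTM time step written out gate by gate."""
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stores the four gate blocks in the order: input, forget, cell, output
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g       # forget part of the old memory, write new content
    h = o * torch.tanh(c)        # the output gate decides what the cell exposes
    return h, c

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(x, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-6))
print(torch.allclose(c_manual, c_ref, atol=1e-6))
```

The cell update is additive (`f * c_prev + i * g`), which is what lets information survive over many time steps when the forget gate stays close to 1.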

In [5]:
import torch.nn as nn

class RNN_LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        param:
            vocab_size: dimension of the one-hot vectors, which is the length of TEXT.vocab
            embedding_dim: dimension of the dense word vectors (rule of thumb: approx sqrt(vocab_size));
                the embedding layer converts one-hot vectors to dense word vectors
            hidden_dim: size of the hidden state, which is updated as successive words are fed in
            output_dim: The output dimension is usually the number of classes, 
                however in the case of only 2 classes the output can be 1-dimensional, 
                i.e. a single scalar that the sigmoid maps to a value between 0 and 1.
            n_layers: number of stacked recurrent layers.
                The idea is that we add additional RNNs on top of the initial RNN, where each RNN added is another layer.
            bidirectional: adds a second RNN in each layer that processes the sequence from last to first,
                in addition to the one that processes it from first to last
            dropout: a regularization method to reduce overfitting; during training it randomly zeroes
                activations in the forward pass so the model cannot rely on any single unit
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # multi-layer LSTM built from the specifications above
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        # the linear layer takes the concatenated forward and backward final hidden states,
        # so its input is twice the hidden dimension (this assumes bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        return: one logit per example in the batch, computed from the final hidden states.
            Note that nn.LSTM returns the output and a tuple of the final hidden state and the final cell state, 
            whereas the standard RNN only returns the output and the final hidden state.
        """
        #x = [sent len, batch size]
        
        # dropout on the embeddings for regularization
        embedded = self.dropout(self.embedding(x))
        #embedded = [sent len, batch size, emb dim]
        
        # output holds the top-layer hidden state at every time step;
        # hidden and cell hold the final states of every layer and direction
        output, (hidden, cell) = self.rnn(embedded)        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        #cell = [num layers * num directions, batch size, hid. dim]
        
        # concatenate the final forward (hidden[-2]) and backward (hidden[-1]) hidden states
        # of the top layer, then apply dropout to reduce overfitting
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
            
        return self.fc(hidden.squeeze(0))


class RNN_GRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        """
        same as LSTM, use GRU package instead
        """
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        the difference between LSTM and GRU here is that the GRU has no cell state,
        so nn.GRU returns only the output and the final hidden state
        """
        
        #x = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        
        # concatenate the final forward and backward hidden states of the top layer
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
            
        return self.fc(hidden.squeeze(0))
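Both classes read the sentence representation from `hidden[-2]` and `hidden[-1]`. A small sketch on random data, assuming a 2-layer bidirectional `nn.LSTM` with tiny made-up dimensions, checks what these two slices contain by comparing them with `output`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sent_len, batch_size, emb_dim, hid_dim = 5, 3, 4, 6

rnn = nn.LSTM(emb_dim, hid_dim, num_layers=2, bidirectional=True)
embedded = torch.randn(sent_len, batch_size, emb_dim)

with torch.no_grad():
    output, (hidden, cell) = rnn(embedded)

print(hidden.shape)   # [num layers * num directions, batch size, hid dim]
print(output.shape)   # [sent len, batch size, hid dim * num directions]

# hidden[-2]: top-layer forward direction, i.e. the forward half of the last time step
forward_matches = torch.allclose(hidden[-2], output[-1, :, :hid_dim])
# hidden[-1]: top-layer backward direction, i.e. the backward half of the first time step
backward_matches = torch.allclose(hidden[-1], output[0, :, hid_dim:])
print(forward_matches, backward_matches)

# Concatenating both gives one vector of size 2 * hid_dim per example
final_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final_hidden.shape)
```

The concatenated vector has `hid_dim * 2` entries, which is why the linear layer is declared as `nn.Linear(hidden_dim*2, output_dim)`.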

## Set Parameters

In [6]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model_lstm = RNN_LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)
model_gru = RNN_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)
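One structural difference between the two models is the parameter count: an LSTM layer has four gate blocks and a GRU layer has three, so with identical sizes the GRU has three quarters of the recurrent parameters. The sketch below counts them for the settings above, assuming PyTorch's layout of one input weight matrix, one hidden weight matrix and two bias vectors per gate block. `recurrent_params` is a hypothetical helper, and the counts exclude the embedding and linear layers.

```python
def recurrent_params(n_gates, input_dim, hidden_dim, n_layers, bidirectional):
    """Count the weights and biases of a stacked (bi)directional gated RNN."""
    n_directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(n_layers):
        per_direction = n_gates * (
            layer_input * hidden_dim      # input-to-hidden weights
            + hidden_dim * hidden_dim     # hidden-to-hidden weights
            + 2 * hidden_dim              # two bias vectors
        )
        total += n_directions * per_direction
        # the next layer reads the concatenated outputs of all directions
        layer_input = hidden_dim * n_directions
    return total

lstm_params = recurrent_params(n_gates=4, input_dim=100, hidden_dim=256,
                               n_layers=2, bidirectional=True)
gru_params = recurrent_params(n_gates=3, input_dim=100, hidden_dim=256,
                              n_layers=2, bidirectional=True)
print(f"LSTM recurrent parameters: {lstm_params:,}")
print(f"GRU recurrent parameters:  {gru_params:,}")
```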

In [7]:
"""
Check size of pretrained embeddings
"""

pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [8]:
"""
Assign pretrained embeddings to the embedding layer of the LSTM model
(the GRU model gets the same embeddings in the GRU section)
"""
model_lstm.embedding.weight.data.copy_(pretrained_embeddings)


tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-1.2320, -0.2417, -0.8178,  ...,  0.3751, -0.3104,  1.0578],
        [-0.0731, -0.1633, -0.1367,  ..., -0.2390,  0.2612,  0.2366],
        ...,
        [-0.2472, -0.1776, -0.1791,  ...,  0.0936, -0.2388,  0.5512],
        [-0.1875, -0.0243, -0.3291,  ..., -0.0950, -0.3361,  0.6293],
        [-0.0925, -0.1457, -0.2654,  ..., -0.4071, -0.0766,  0.4163]])

In [10]:
import torch.optim as optim
"""
change optim.SGD to optim.Adam; 
note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default learning rate.
"""
optimizer_lstm = optim.Adam(model_lstm.parameters())
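For reference, the default learning rate of `optim.Adam` is `1e-3`. The sketch below writes out one Adam update for a single scalar parameter with the default hyperparameters (`betas=(0.9, 0.999)`, `eps=1e-8`). `adam_step` is a hypothetical helper, not PyTorch's implementation. It illustrates why a fixed default is often workable: after bias correction, the first step has a size of about `lr` whatever the scale of the gradient.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two very different gradient magnitudes, same starting point
small_step, _, _ = adam_step(param=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large_step, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(small_step, large_step)
```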

In [11]:
"""
specify loss function: BCE with logits loss
"""
criterion = nn.BCEWithLogitsLoss()

"""
use GPU if available, otherwise use CPU
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_lstm = model_lstm.to(device)
criterion = criterion.to(device)
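`nn.BCEWithLogitsLoss` takes the raw model outputs (logits) and applies the sigmoid internally, which is why the models above have no sigmoid layer. A minimal sketch of the per-example computation, using a standard numerically stable rearrangement of the formula. The function names are hypothetical, and PyTorch additionally averages over the batch by default.

```python
import math

def bce_naive(logit, target):
    """Sigmoid followed by binary cross-entropy; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Same quantity, rearranged so that exp() only sees non-positive arguments."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Both versions agree for moderate logits
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))
print(bce_naive(-1.5, 0.0), bce_with_logits(-1.5, 0.0))

# For a confident wrong prediction the naive version would evaluate log(0);
# the stable version returns a finite loss
stable_value = bce_with_logits(100.0, 0.0)
print(stable_value)
```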

In [12]:
import torch as F  # note: F is just an alias for torch here, not torch.nn.functional

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8
    """

    # apply the sigmoid to the logits, then round to the closest integer (0 or 1)
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
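Rounding the sigmoid output amounts to thresholding the raw logit at zero, since `sigmoid(0) = 0.5`. A small pure-Python sketch of the same accuracy computation on made-up logits, where `binary_accuracy_from_logits` is a hypothetical stand-in for the tensor version above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_from_logits(logits, labels):
    """Fraction of examples whose rounded sigmoid output equals the label."""
    rounded = [round(sigmoid(z)) for z in logits]
    correct = [float(p == y) for p, y in zip(rounded, labels)]
    return sum(correct) / len(correct)

# Made-up logits and labels for five reviews
logits = [2.3, -1.1, 0.4, -0.2, 3.0]
labels = [1, 0, 0, 0, 1]

acc = binary_accuracy_from_logits(logits, labels)
# Thresholding the logits at zero gives the same predictions as rounding the sigmoid
thresholded = [int(z > 0) for z in logits]
print(acc, thresholded)
```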

In [13]:
def train_model(model, iterator, optimizer, criterion):
    """
    main train function to train with each batch in iterator, iterates through all examples
    """
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        """
        optimization step
        """
        
        # first zero the gradients
        optimizer.zero_grad()
        
        # feed batch of sentences to model
        predictions = model(batch.text).squeeze(1)
        
        # calculate loss
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        # calculate gradient
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [14]:
def evaluate(model, iterator, criterion):
    """
    main function for evaluation
    similar to train
    do not need to zero gradients
    do not update parameters
    """
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
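The calls to `model.train()` and `model.eval()` matter here because the models use dropout: in training mode `nn.Dropout` zeroes activations at random and rescales the rest by `1 / (1 - p)`, while in evaluation mode it passes its input through unchanged. A small sketch on a tensor of ones:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # each entry is either 0 or 1 / (1 - 0.5) = 2

dropout.eval()
eval_out = dropout(x)    # identical to x

print(train_out.unique())
print(torch.equal(eval_out, x))
```

Evaluating with dropout still active would add noise to the validation and test metrics, so `evaluate` switches it off first.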

## Training

In [15]:
N_EPOCHS = 5
"""
train model for 5 epochs and output training statistics and validation statistics
"""
for epoch in range(N_EPOCHS):

    train_loss_lstm, train_acc_lstm = train_model(model_lstm, train_iterator, optimizer_lstm, criterion)
    valid_loss_lstm, valid_acc_lstm = evaluate(model_lstm, valid_iterator, criterion)
    torch.cuda.empty_cache()
    print("LSTM training set")
    print(f'Epoch: {epoch+1:04}, Train Loss: {train_loss_lstm:.4f}, Train Acc: {train_acc_lstm*100:.4f}%, Val. Loss: {valid_loss_lstm:.4f}, Val. Acc: {valid_acc_lstm*100:.4f}%')


LSTM training set
Epoch: 0001, Train Loss: 0.6973, Train Acc: 50.9255%, Val. Loss: 0.6909, Val. Acc: 51.9323%
LSTM training set
Epoch: 0002, Train Loss: 0.6744, Train Acc: 57.6325%, Val. Loss: 0.6317, Val. Acc: 65.5251%
LSTM training set
Epoch: 0003, Train Loss: 0.5072, Train Acc: 75.9426%, Val. Loss: 0.3264, Val. Acc: 87.2468%
LSTM training set
Epoch: 0004, Train Loss: 0.2683, Train Acc: 89.6881%, Val. Loss: 0.2729, Val. Acc: 89.1125%
LSTM training set
Epoch: 0005, Train Loss: 0.1772, Train Acc: 93.6586%, Val. Loss: 0.2732, Val. Acc: 89.7255%


## Testing

In [16]:
"""
test final model
"""
test_loss_lstm, test_acc_lstm = evaluate(model_lstm, test_iterator, criterion)
torch.cuda.empty_cache()
print("LSTM test result")
print(f'Test Loss: {test_loss_lstm:.3f}, Test Acc: {test_acc_lstm*100:.2f}%')


LSTM test result
Test Loss: 0.340, Test Acc: 86.81%


# GRU

In [9]:
model_gru.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-1.2320, -0.2417, -0.8178,  ...,  0.3751, -0.3104,  1.0578],
        [-0.0731, -0.1633, -0.1367,  ..., -0.2390,  0.2612,  0.2366],
        ...,
        [-0.2472, -0.1776, -0.1791,  ...,  0.0936, -0.2388,  0.5512],
        [-0.1875, -0.0243, -0.3291,  ..., -0.0950, -0.3361,  0.6293],
        [-0.0925, -0.1457, -0.2654,  ..., -0.4071, -0.0766,  0.4163]])

In [10]:
import torch.optim as optim
"""
change optim.SGD to optim.Adam; 
note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default learning rate.
"""
optimizer_gru = optim.Adam(model_gru.parameters())

In [11]:
"""
specify loss function: BCE with logits loss
"""
criterion = nn.BCEWithLogitsLoss()

"""
use GPU if available, otherwise use CPU
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_gru = model_gru.to(device)
criterion = criterion.to(device)

In [12]:
print(device)

cuda


In [17]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss_gru, train_acc_gru = train_model(model_gru, train_iterator, optimizer_gru, criterion)
    valid_loss_gru, valid_acc_gru = evaluate(model_gru, valid_iterator, criterion)
    torch.cuda.empty_cache()
    print("GRU training set")
    print(f'Epoch: {epoch+1:04}, Train Loss: {train_loss_gru:.4f}, Train Acc: {train_acc_gru*100:.4f}%, Val. Loss: {valid_loss_gru:.4f}, Val. Acc: {valid_acc_gru*100:.4f}%')


GRU training set
Epoch: 0001, Train Loss: 0.5897, Train Acc: 65.8307%, Val. Loss: 0.3630, Val. Acc: 84.1818%
GRU training set
Epoch: 0002, Train Loss: 0.2834, Train Acc: 88.5740%, Val. Loss: 0.2952, Val. Acc: 88.3662%
GRU training set
Epoch: 0003, Train Loss: 0.1771, Train Acc: 93.4129%, Val. Loss: 0.2381, Val. Acc: 90.0187%
GRU training set
Epoch: 0004, Train Loss: 0.1177, Train Acc: 95.7724%, Val. Loss: 0.2577, Val. Acc: 91.0581%
GRU training set
Epoch: 0005, Train Loss: 0.0860, Train Acc: 97.1549%, Val. Loss: 0.3565, Val. Acc: 88.8859%
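The validation losses in the two training logs suggest when each model starts to overfit. A small sketch that picks the epoch with the lowest validation loss for each model, with the numbers copied from the logs. `best_epoch` is a hypothetical helper; the notebook itself always evaluates the model as it stands after epoch 5.

```python
# Validation losses per epoch, copied from the LSTM and GRU training logs
lstm_val_loss = [0.6909, 0.6317, 0.3264, 0.2729, 0.2732]
gru_val_loss = [0.3630, 0.2952, 0.2381, 0.2577, 0.3565]

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

lstm_best = best_epoch(lstm_val_loss)
gru_best = best_epoch(gru_val_loss)
print("LSTM best epoch:", lstm_best)
print("GRU best epoch:", gru_best)
```

In these runs the GRU reaches its lowest validation loss two epochs before training stops while its training loss keeps falling, which is consistent with overfitting in the last epochs.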


In [18]:
test_loss_gru, test_acc_gru = evaluate(model_gru, test_iterator, criterion)
torch.cuda.empty_cache()
print("RNN-GRU test result")
print(f'Test Loss: {test_loss_gru:.4f}, Test Acc: {test_acc_gru*100:.4f}%')


RNN-GRU test result
Test Loss: 0.3823, Test Acc: 87.6480%


# Examples

In [19]:
import spacy
nlp = spacy.load('en')

def predict_sentiment_lstm(sentence):
    # tokenize the sentence and map each token to its vocabulary index
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    # add a batch dimension: [sent len] -> [sent len, 1]
    tensor = tensor.unsqueeze(1)
    # the model outputs a logit; the sigmoid maps it to a value between 0 and 1
    prediction_lstm = torch.sigmoid(model_lstm(tensor))
    return prediction_lstm.item()

def predict_sentiment_gru(sentence):
    # same steps as above, using the GRU model
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction_gru = torch.sigmoid(model_gru(tensor))
    return prediction_gru.item()
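The lookup `TEXT.vocab.stoi[t]` does not fail on unseen words because, in this version of torchtext, `stoi` behaves like a dictionary that falls back to the index of the `<unk>` token. A toy sketch of that behavior, with a made-up five-entry vocabulary and whitespace tokenization standing in for spaCy (`toy_stoi` and `to_indices` are hypothetical names):

```python
from collections import defaultdict

UNK_INDEX = 0

# Unknown tokens fall back to the <unk> index instead of raising KeyError
toy_stoi = defaultdict(lambda: UNK_INDEX)
toy_stoi.update({"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "great": 4})

def to_indices(sentence):
    """Lowercase, split on whitespace, and map each token to its index."""
    tokens = sentence.lower().split()
    return [toy_stoi[t] for t in tokens]

indices = to_indices("The movie was great")
print(indices)
```

The two special tokens also account for the embedding matrix having 25002 rows: 25000 words from `max_size` plus `<unk>` and `<pad>`.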

Here is the closing line from a review of Titanic: `Titanic, one of the greatest movies of all time, as it manages to capture audiences' heart also skilfully enrapturing us with the dedication shown by people.` Let's judge its sentiment with the LSTM and the GRU.

## LSTM

In [48]:
# positive
predict_sentiment_lstm("Titanic, one of the greatest movies of all time, as it manages to capture audiences' heart also skilfully enrapturing us with the dedication shown by people.")

0.9940310120582581

The predicted value is 0.99, which is sufficiently close to 1: positive sentiment!

In [47]:
# negative
predict_sentiment_lstm("An disappointing plot comes after a dull routine debut. The conclsion is boring.")

0.0016002169577404857

The predicted value is 0.0016, which is sufficiently close to 0: negative sentiment!

## GRU

In [20]:
# positive
predict_sentiment_gru("Titanic, one of the greatest movies of all time, as it manages to capture audiences' heart also skilfully enrapturing us with the dedication shown by people.")

0.9991788268089294

The predicted value is 0.9992, which is sufficiently close to 1: positive sentiment!

In [21]:
# negative
predict_sentiment_gru("An disappointing plot comes after a dull routine debut. The conclsion is boring.")

0.00265431497246027

The predicted value is 0.0027, which is sufficiently close to 0: negative sentiment!