# Homework 5
## Wanlin Li (wl596)

### - Compare GRU model with LSTM model

## Step 1: Preparing Data

In [1]:
# Import libraries and download datasets.
import torch
from torchtext import data
from torchtext import datasets # Downloads the IMDb dataset.
import random

# Set the random seeds for reproducibility.
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

# The parameters of a Field specify how the data should be processed.

# The TEXT field has tokenize='spacy', which defines that the "tokenization" 
# (the act of splitting the string into discrete "tokens") should be done using the spaCy English tokenizer. 
# If no tokenize argument is passed, the default is simply splitting the string on spaces.

# LABEL is defined by a LabelField, a special subset of the Field class specifically for handling labels.
# LABEL indicates the 'pos' or 'neg' of a given movie comment.
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

# Split the IMDb datasets into the canonical train/test splits as torchtext.datasets objects. 
# It uses the Fields we have previously defined.
train, test = datasets.IMDB.splits(TEXT, LABEL)

# Create a validation set using the .split() method.
# random.seed() returns None, so seed the generator first and then pass its state
# to the random_state argument to get the same train/validation split each time.
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())
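The role of the random state in a reproducible split can be sketched with the standard library alone. The helper `split_indices` below is hypothetical and is not torchtext's implementation; it only assumes that a split shuffles example indices with a generator restored from a state captured by `random.getstate()`.

```python
import random

SEED = 1234

def split_indices(n_examples, split_ratio=0.7, random_state=None):
    """Shuffle range(n_examples) and cut it into (train, valid) index lists."""
    rnd = random.Random()
    if random_state is not None:
        # Restore a state captured earlier with random.getstate().
        rnd.setstate(random_state)
    indices = list(range(n_examples))
    rnd.shuffle(indices)
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]

# random.seed() only seeds the global generator; its return value is None.
seed_return_value = random.seed(SEED)

# Capture the seeded state once and reuse it for every split.
state = random.getstate()
train_a, valid_a = split_indices(10, random_state=state)
train_b, valid_b = split_indices(10, random_state=state)
print(train_a == train_b and valid_a == valid_b)
```

Because both calls start from the same captured state, they produce the same partition. Passing the return value of `random.seed(SEED)` instead would pass `None`.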

The first update is the addition of pre-trained word embeddings. These vectors have been trained on corpora of billions of tokens. Instead of initializing our word embeddings randomly, we initialize them with these pre-trained vectors, in which words that appear in similar contexts lie close together in the vector space.

The first step is to specify the vectors by passing their name as the `vectors` argument to `build_vocab`, which also downloads them. `glove` is the algorithm used to compute the vectors (see [here](https://nlp.stanford.edu/projects/glove/) for more). `6B` indicates that the vectors were trained on 6 billion tokens, and `100d` indicates that they are 100-dimensional.

**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection.
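As a rough illustration of what "nearby in vector space" means, closeness between word vectors is commonly measured with cosine similarity. The three-dimensional vectors below are made-up toy values, not real GloVe vectors, and `cosine_similarity` is a helper written for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration only (real GloVe vectors here are 100-dimensional).
toy_vectors = {
    "great": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.1],
}

sim_close = cosine_similarity(toy_vectors["great"], toy_vectors["good"])
sim_far = cosine_similarity(toy_vectors["great"], toy_vectors["terrible"])
print(sim_close > sim_far)
```

With real pre-trained vectors, the same comparison could be run on rows of `TEXT.vocab.vectors` once the vocabulary has been built.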

In [2]:
# To cut down the vocabulary, we keep only the 25,000 most common words by specifying max_size=25000.
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)
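The effect of `max_size` can be sketched as follows. This is a simplified stand-in for what `build_vocab` does, not its actual code: `build_toy_vocab` is a hypothetical helper that keeps the most frequent tokens and prepends the two special tokens `<unk>` and `<pad>`, which is why the embedding matrix printed later has 25002 rows rather than 25000.

```python
from collections import Counter

def build_toy_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, preceded by the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [word for word, _ in counts.most_common(max_size)]
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi

# Tiny made-up corpus standing in for the tokenized training reviews.
reviews = [
    ["this", "film", "is", "great"],
    ["this", "film", "is", "terrible"],
    ["great", "acting"],
]

itos, stoi = build_toy_vocab(reviews, max_size=3)
print(len(itos))  # max_size plus the two special tokens
```

Tokens that fall outside the kept set, such as "terrible" in this toy corpus, are later mapped to `<unk>`.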

Then, we create the iterators.

In [3]:
# BucketIterator first sorts the examples using the sort_key, 
# here the length of the sentences (i.e., sort_key=lambda x: len(x.text)), 
# and then partitions them into buckets of batch_size=BATCH_SIZE examples. 

# When the iterator is called it returns a batch of examples from the same bucket. 
# This will return a batch of examples where each example is a similar length, minimizing the amount of padding.

BATCH_SIZE = 4

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
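To see why grouping examples of similar length reduces padding, the following sketch counts the pad tokens needed when every example in a batch is padded to that batch's longest example. The sentence lengths are made up and `padding_needed` is a hypothetical helper; `BucketIterator` itself is more elaborate than a plain sort.

```python
def padding_needed(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Made-up sentence lengths: a mix of very short and very long reviews.
lengths = [5, 120, 7, 118, 6, 119, 8, 121]

unsorted_padding = padding_needed(lengths, batch_size=4)
sorted_padding = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_padding, sorted_padding)
```

Sorting by length before batching puts the short reviews together and the long reviews together, so far fewer pad tokens are processed by the model.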

## Step 2: Build the Model

The model features the most drastic changes.

### - LSTM Model

In [4]:
import torch.nn as nn

class RNN_LSTM(nn.Module):
    
    # Define the layers of the module in __init__ function
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        vocab_size: int; dimension of the one-hot vector (i.e., len(TEXT.vocab)). 
        
        embedding_dim: int; the size of the dense word vectors, this is usually around the square root of the vocab size.
        
        hidden_dim:  int; the size of the hidden states, this is usually around 100-500 dimensions, 
                     but depends on the vocab size, embedding dimension and the complexity of the task.
                     
        output_dim:  int; usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 
                     and thus can be 1-dimensional, i.e. a single scalar.
                     
        n_layers: int; number of recurrent layers; 
                  e.g., setting n_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, 
                  with the second LSTM taking in outputs of the first LSTM and computing the final results. 
        
        bidirectional: bool; if True, becomes a bidirectional LSTM
        
        dropout: float; regularization parameter; 
                 if non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, 
                 with dropout probability equal to dropout.
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    # Define the forward process
    def forward(self, x):
        
        #x = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, (hidden, cell) = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        #cell = [num layers * num directions, batch size, hid. dim]
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden [batch size, hid. dim * num directions]
            
        return self.fc(hidden.squeeze(0))
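The indexing `hidden[-2]` and `hidden[-1]` picks the final forward and final backward hidden states of the top layer. The sketch below checks this on a small randomly initialized bidirectional LSTM; the sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sent_len, batch_size, emb_dim, hid_dim, n_layers = 7, 3, 10, 16, 2
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)

# Random stand-in for a batch of embedded sentences: [sent len, batch size, emb dim].
embedded = torch.randn(sent_len, batch_size, emb_dim)
output, (hidden, cell) = lstm(embedded)

# hidden is [num layers * num directions, batch size, hid dim];
# the last two entries belong to the top layer (forward, then backward).
final = torch.cat((hidden[-2], hidden[-1]), dim=1)

# The forward state matches the last time step of output,
# the backward state matches the first time step of output.
forward_matches = torch.allclose(hidden[-2], output[-1, :, :hid_dim])
backward_matches = torch.allclose(hidden[-1], output[0, :, hid_dim:])
print(final.shape, forward_matches, backward_matches)
```

The concatenated tensor has `hid_dim * 2` features per example, which is why the linear layer is defined with `hidden_dim*2` inputs when the model is bidirectional.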

### - GRU Model

In [5]:
import torch.nn as nn

class RNN_GRU(nn.Module):
    
    # Define the layers of the module in __init__ function
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        vocab_size: int; dimension of the one-hot vector (i.e., len(TEXT.vocab)). 
        
        embedding_dim: int; the size of the dense word vectors, this is usually around the square root of the vocab size.
        
        hidden_dim:  int; the size of the hidden states, this is usually around 100-500 dimensions, 
                     but depends on the vocab size, embedding dimension and the complexity of the task.
                     
        output_dim:  int; usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 
                     and thus can be 1-dimensional, i.e. a single scalar.
                     
        n_layers: int; number of recurrent layers; 
                  e.g., setting n_layers=2 would mean stacking two GRUs together to form a stacked GRU, 
                  with the second GRU taking in outputs of the first GRU and computing the final results. 
        
        bidirectional: bool; if True, becomes a bidirectional GRU
        
        dropout: float; regularization parameter; 
                 if non-zero, introduces a Dropout layer on the outputs of each GRU layer except the last layer, 
                 with dropout probability equal to dropout.
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    # Define the forward process
    def forward(self, x):
        """
        Note: GRU does not have cell state.
        """
        
        #x = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden [batch size, hid. dim * num directions]
            
        return self.fc(hidden.squeeze(0))
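One concrete way to compare the two recurrent layers is to count their weights. The sketch below assumes PyTorch's parameterization, in which each layer and direction holds an input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors per gate block, with four blocks for an LSTM and three for a GRU. The helper `recurrent_param_count` is hypothetical, and the sizes are the hyperparameters used later in this notebook.

```python
def recurrent_param_count(n_gates, input_dim, hidden_dim, n_layers, bidirectional):
    """Number of weights in a stacked recurrent layer with n_gates gate blocks."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers above the first take the concatenated outputs of both directions.
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        per_direction = n_gates * (hidden_dim * (layer_input + hidden_dim) + 2 * hidden_dim)
        total += per_direction * n_directions
    return total

# EMBEDDING_DIM=100, HIDDEN_DIM=256, N_LAYERS=2, BIDIRECTIONAL=True
lstm_params = recurrent_param_count(4, 100, 256, 2, True)
gru_params = recurrent_param_count(3, 100, 256, 2, True)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under this assumption the GRU's recurrent layers have three quarters as many weights as the LSTM's; the embedding and linear layers are identical in both models.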

Next, we create an instance of each RNN class, passing the arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors can be loaded into the model, the `EMBEDDING_DIM` must be equal to that of the pre-trained GloVe vectors loaded earlier.

In [6]:
# Specify the input parameters 
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

# Create instances of the RNN_LSTM class and RNN_GRU class
model_lstm = RNN_LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)
model_gru = RNN_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model.

We retrieve the embeddings from the field's vocab, and ensure they're the correct size, _**[vocab size, embedding dim]**_ 

In [7]:
# Retrieve the embeddings from the field's vocab, and ensure they're the correct size
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

In [8]:
# Replace the initial weights of the embedding layer with the pre-trained embeddings

model_lstm.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [9]:
# Replace the initial weights of the embedding layer with the pre-trained embeddings

model_gru.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])
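The requirement that `EMBEDDING_DIM` match the pre-trained vectors can be checked on a small example. In this sketch, a random tensor stands in for `TEXT.vocab.vectors`, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 4

# Random stand-in for the pre-trained vectors: [vocab size, embedding dim].
pretrained = torch.randn(vocab_size, embedding_dim)

# Matching shapes: the copy overwrites the randomly initialized weights in place.
embedding = nn.Embedding(vocab_size, embedding_dim)
embedding.weight.data.copy_(pretrained)
weights_match = torch.equal(embedding.weight.data, pretrained)

# Mismatched embedding dimension: the in-place copy raises a RuntimeError.
wrong_dim = nn.Embedding(vocab_size, embedding_dim + 1)
try:
    wrong_dim.weight.data.copy_(pretrained)
    copy_failed = False
except RuntimeError:
    copy_failed = True

print(weights_match, copy_failed)
```

The copied weights are still ordinary parameters, so they continue to be updated during training unless they are explicitly frozen.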

## Step 3: Train the Model

Now to training the model.

In [10]:
import torch.optim as optim

# Create an optimizer used to update the parameters of the module.
# lstm model
optimizer_lstm = optim.Adam(model_lstm.parameters())
# gru model
optimizer_gru = optim.Adam(model_gru.parameters())

The rest of the steps are training the model.

We define the criterion and place the model and criterion on the GPU (if available)...

In [11]:
# Define the loss function, which is commonly called a criterion in PyTorch
# The loss function here is binary cross entropy with logits
# This loss function combines a sigmoid layer, which squashes the unbounded model output to a value between 0 and 1,
# with the binary cross entropy loss in a single, numerically more stable step.
criterion = nn.BCEWithLogitsLoss()

# PyTorch has excellent support for NVIDIA GPUs via CUDA. torch.cuda.is_available() returns True if PyTorch detects a GPU.
# Using .to, we can place the model and the criterion on the GPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# lstm model
model_lstm = model_lstm.to(device)
# gru model
model_gru = model_gru.to(device)
criterion = criterion.to(device)
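What "binary cross entropy with logits" computes for a single example can be written out in plain Python. This is a sketch of the underlying formula, not PyTorch's code: `bce_naive` applies a sigmoid and then the cross entropy, while `bce_with_logits` uses an algebraically equivalent form that avoids overflow for large logits. Both function names are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid followed by binary cross entropy."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Same quantity, rearranged so that no large exponent is ever computed."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# The two forms agree for moderate logits.
pairs = [(0.0, 1.0), (2.5, 1.0), (2.5, 0.0), (-1.3, 1.0), (-1.3, 0.0)]
max_difference = max(abs(bce_naive(x, y) - bce_with_logits(x, y)) for x, y in pairs)

# The rearranged form stays finite even for an extreme logit.
extreme_loss = bce_with_logits(1000.0, 0.0)
print(max_difference, extreme_loss)
```

This is also why the model's final layer has no sigmoid of its own: the criterion expects raw, unbounded scores.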

We implement the function to calculate accuracy...

In [12]:
import torch.nn.functional as F

# Define a function to calculate the accuracy

# This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, 
# and then rounds them to the nearest integer, so any value greater than 0.5 becomes 1 (a positive sentiment).

# We then calculate how many rounded predictions equal the actual labels and average it across the batch.
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
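A small worked example makes the rounding step concrete. The function below restates the accuracy computation using `torch.sigmoid`, and the logits and labels are made up for illustration.

```python
import torch

def binary_accuracy_sketch(preds, y):
    """Fraction of examples whose rounded sigmoid output equals the label."""
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# Made-up raw model outputs (logits) and labels for a batch of four reviews.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

# Positive logits round to 1 and negative logits round to 0,
# so only the third example (logit 0.5, label 0) is misclassified.
accuracy = binary_accuracy_sketch(logits, labels)
print(accuracy.item())
```

Since the sigmoid of a logit exceeds 0.5 exactly when the logit is positive, the same decision could be made by checking the sign of the raw output.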

We define a function for training our model...

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [13]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    # model.train() is used to put the model in "training mode", which turns on dropout and batch normalization.
    model.train()
    
    # The train function iterates over all examples, a batch at a time.
    for batch in iterator:
        
        # Zero the gradients.
        optimizer.zero_grad()
        
        # Feed the batch of sentences, batch.text, into the model.
        # The squeeze is needed as the predictions are initially size [batch size, 1], 
        # and we need to remove the dimension of size 1.
        predictions = model(batch.text).squeeze(1)
        
        # Calculate the loss and accuracy using predictions and the labels, batch.label.
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        
        # Calculate the gradient of each parameter with loss.backward().
        loss.backward()
        
        # Update the parameters using the gradients and optimizer algorithm.
        optimizer.step()
        
        # The loss and accuracy are accumulated across the epoch.
        # .item() method is used to extract a scalar from a tensor which only contains a single value.
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
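The reason for zeroing the gradients at the start of every batch is that `backward()` adds to any gradient already stored on a parameter. The sketch below shows this on a single made-up weight, without any model or optimizer.

```python
import torch

# A single trainable weight; the "loss" is simply 3 * w, so d(loss)/dw = 3.
w = torch.tensor([1.0], requires_grad=True)

(3.0 * w).sum().backward()
grad_after_first = w.grad.clone()

# A second backward pass without zeroing adds to the stored gradient.
(3.0 * w).sum().backward()
grad_after_second = w.grad.clone()

# Zeroing resets the stored gradient, which optimizer.zero_grad() does for every parameter.
w.grad.zero_()
grad_after_zero = w.grad.clone()

print(grad_after_first.item(), grad_after_second.item(), grad_after_zero.item())
```

Without the reset, each update would be computed from the sum of the gradients of all batches seen so far rather than from the current batch alone.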

We define a function for testing our model...

**Note**: as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [14]:
def evaluate(model, iterator, criterion):
    """
    Evaluate is similar to train, with a few modifications as we don't need to update the parameters when evaluating.
    """
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
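The difference between training mode and evaluation mode can be observed directly on a dropout layer. This sketch uses a standalone `nn.Dropout` with a made-up input of ones rather than the full model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p).
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is the identity function.
dropout.eval()
eval_out = dropout(x)

print(sorted(set(train_out.tolist())), torch.equal(eval_out, x))
```

Calling `model.eval()` switches every dropout layer inside the model in the same way, so predictions on the validation and test sets are deterministic.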

Finally, we train our model...

### - Train LSTM Model

In [17]:
# Train the model through multiple epochs, an epoch being a complete pass through all examples in the split.
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model_lstm, train_iterator, optimizer_lstm, criterion)
    valid_loss, valid_acc = evaluate(model_lstm, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.666, Train Acc: 57.59%, Val. Loss: 0.426, Val. Acc: 81.37%
Epoch: 02, Train Loss: 0.329, Train Acc: 86.54%, Val. Loss: 0.266, Val. Acc: 89.04%
Epoch: 03, Train Loss: 0.203, Train Acc: 92.69%, Val. Loss: 0.278, Val. Acc: 89.53%
Epoch: 04, Train Loss: 0.139, Train Acc: 95.05%, Val. Loss: 0.291, Val. Acc: 89.92%
Epoch: 05, Train Loss: 0.098, Train Acc: 96.74%, Val. Loss: 0.315, Val. Acc: 88.92%


...and evaluate the trained model on the test set.

In [18]:
# Calculate the test loss and accuracy.
test_loss, test_acc = evaluate(model_lstm, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.400, Test Acc: 86.03%


### - Train GRU Model

In [15]:
# Train the model through multiple epochs, an epoch being a complete pass through all examples in the split.
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model_gru, train_iterator, optimizer_gru, criterion)
    valid_loss, valid_acc = evaluate(model_gru, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.462, Train Acc: 76.81%, Val. Loss: 0.282, Val. Acc: 88.81%
Epoch: 02, Train Loss: 0.228, Train Acc: 91.34%, Val. Loss: 0.267, Val. Acc: 89.53%
Epoch: 03, Train Loss: 0.153, Train Acc: 94.47%, Val. Loss: 0.252, Val. Acc: 90.87%
Epoch: 04, Train Loss: 0.096, Train Acc: 96.70%, Val. Loss: 0.311, Val. Acc: 89.63%
Epoch: 05, Train Loss: 0.068, Train Acc: 97.66%, Val. Loss: 0.384, Val. Acc: 89.87%


In [16]:
# Calculate the test loss and accuracy.
test_loss, test_acc = evaluate(model_gru, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.526, Test Acc: 86.54%


## Step 4: Model Implementation
### - User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

Our `predict_sentiment_lstm` and `predict_sentiment_gru` functions do a few things:
- tokenize the sentence, i.e. split it from a raw string into a list of tokens
- index the tokens by converting them into their integer representation from our vocabulary
- convert the indexes, which are a Python list, into a PyTorch tensor
- add a batch dimension by `unsqueeze`ing
- squash the output prediction from an unbounded real number to a value between 0 and 1 with the `sigmoid` function
- convert the tensor holding a single value into a Python float with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
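The indexing and batch-dimension steps can be sketched without the trained model. The vocabulary below is a made-up toy mapping that sends unknown tokens to index 0, mimicking the `<unk>` fallback; whitespace splitting stands in for the spaCy tokenizer, and nested lists stand in for tensors. Both helper functions are hypothetical.

```python
from collections import defaultdict

# Toy vocabulary; any token that is not listed falls back to index 0 (<unk>).
toy_stoi = defaultdict(int, {"<unk>": 0, "<pad>": 1, "This": 2, "film": 3, "is": 4, "great": 5})

def sentence_to_indices(sentence, stoi):
    """Split on whitespace (a stand-in for spaCy) and look up each token."""
    tokens = sentence.split()
    return [stoi[token] for token in tokens]

def add_batch_dimension(indices):
    """Turn a [sent len] list into a [sent len, 1] nested list, like unsqueeze(1)."""
    return [[index] for index in indices]

known = sentence_to_indices("This film is great", toy_stoi)
unknown = sentence_to_indices("This film is mesmerising", toy_stoi)
column = add_batch_dimension(known)
print(known, unknown, column)
```

A word that never appeared in the training vocabulary therefore contributes only the `<unk>` embedding, so predictions on sentences with many unseen words should be treated with caution.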

### - LSTM Model

In [19]:
import spacy
nlp = spacy.load('en')

def predict_sentiment_lstm(sentence):
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model_lstm(tensor))
    return prediction.item()

An example negative review...

In [20]:
predict_sentiment_lstm("This film is terrible")



0.0016802072059363127

An example positive review...

In [21]:
predict_sentiment_lstm("This film is great")



0.9873371720314026

### - GRU Model

In [22]:
nlp = spacy.load('en')

def predict_sentiment_gru(sentence):
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model_gru(tensor))
    return prediction.item()

An example negative review...

In [23]:
predict_sentiment_gru("This film is terrible")



0.0021957994904369116

An example positive review...

In [24]:
predict_sentiment_gru("This film is great")



0.9987700581550598

## Step 5: Conclusion

We've now built a decent sentiment analysis model for movie reviews, using both the LSTM model and the GRU model.  
According to the training step, the test accuracy is 86.54% with the GRU model and 86.03% with the LSTM model. The two accuracies are therefore similar, with the GRU ahead of the LSTM by about 0.5 percentage points.  
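Note, however, that for both models the training accuracy after five epochs (96.74% for the LSTM, 97.66% for the GRU) is well above the test accuracy, and the validation loss rises in the later epochs while the training loss keeps falling, so both models show signs of overfitting.  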
Since the LSTM is more complex than the GRU and is not more accurate here, the GRU is preferable in this case.