# Advanced RNN Model

In the Simple RNN Model, we observed very poor performance. We will now try to improve that performance by using the following:
- packed padded sequences
- pre-trained word embeddings
- different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer

Making these enhancements helps us achieve ~84% test accuracy. The notebook can be run by changing the runtime type to "GPU" and selecting "Run all".

In [None]:
# testing for GPU
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

## Preparing the Data

Firstly, we set the seed, define the `Fields`, and retrieve the train/val/test splits.

In [None]:
!pip install torchtext==0.6

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.6
  Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.2/64.2 kB 5.5 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.1.98-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 52.8 MB/s eta 0:00:00
Installing collected packages: sentencepiece, torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.1
    Uninstalling torchtext-0.15.1:
      Successfully uninstalled torchtext-0.15.1
Successfully installed sentencepiece-0.1.98 torchtext-0.6.0


We use *packed padded sequences*, which make our RNN process only the non-padded elements of each sequence; for any padded element the `output` is a zero tensor. To enable this, we set `include_lengths = True` on our `TEXT` field. `batch.text` then becomes a tuple: the first element is our sentence (a numericalized tensor that has been padded) and the second element holds the actual lengths of our sentences.
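
As a minimal sketch of what packing does, the toy example below runs a small LSTM on a made-up batch of two sequences with true lengths 4 and 2. The tensor sizes and variable names are assumptions chosen for illustration and are not taken from this notebook's data.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch with shape [seq len = 4, batch size = 2, features = 3].
embedded = torch.randn(4, 2, 3)
lengths = torch.tensor([4, 2])  # true lengths, longest first, kept on the CPU
embedded[2:, 1] = 0.0           # positions 2 and 3 of the second sequence are padding

rnn = nn.LSTM(input_size=3, hidden_size=5)

# Pack so the LSTM skips the padded positions, then unpack the result.
packed = pack_padded_sequence(embedded, lengths)
packed_output, (hidden, cell) = rnn(packed)
output, output_lengths = pad_packed_sequence(packed_output)

print(output.shape)    # [seq len, batch size, hidden size]
print(output[2:, 1])   # outputs at the padded positions
print(output_lengths)  # the original lengths
```

The outputs at the padded positions are all zeros, and the final hidden state of the shorter sequence is taken from its last real time step rather than from a padding step.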

In [None]:
import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths = True)

LABEL = data.LabelField(dtype = torch.float)

print(TEXT)
print(LABEL)

<torchtext.data.field.Field object at 0x7f82c3d52370>
<torchtext.data.field.LabelField object at 0x7f835fcbf520>


Load the IMDb dataset and split into train and test sets.

In [None]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)


downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 11.2MB/s]


Then create the validation set from our training set.

In [None]:
import random

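# seed Python's RNG and pass its state explicitly so the split is reproducible
# (random.seed() returns None, so it cannot be passed as random_state directly)
random.seed(SEED)
train_data, valid_data = train_data.split(split_ratio = 0.5, random_state = random.getstate())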

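We indicate that we want to use pre-trained word embeddings by passing `vectors = "glove.6B.100d"` as an argument to `build_vocab`. GloVe is the algorithm used to calculate the vectors; `6B` indicates that they were trained on 6 billion tokens, and `100d` that each vector has 100 dimensions.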

In [None]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.36MB/s]                           
100%|█████████▉| 399999/400000 [00:19<00:00, 20604.59it/s]


## Batch Sizes
In the cell below, we test batch sizes of 32, 64, and 128 for the iterator. Additionally, we set `sort_within_batch = True` in the iterator so that the sequences in each batch are sorted by length, as packed padded sequences require.
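
To illustrate why sorting matters, here is a plain-Python sketch of the idea: order the sentences longest first, record their true lengths, and pad the rest. The helper `pad_batch` and the example sentences are hypothetical; `BucketIterator` does this work internally and produces tensors rather than lists.

```python
def pad_batch(sequences, pad_token="<pad>"):
    """Sort a non-empty batch longest-first and pad every sequence to the same length."""
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0]
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths

batch = [["great", "film"], ["i", "did", "not", "like", "it"], ["terrible"]]
padded, lengths = pad_batch(batch)

for sentence, length in zip(padded, lengths):
    print(length, sentence)
```

The returned `lengths` list plays the role of the second element of `batch.text`: it tells the packing step how many tokens in each row are real.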

In [None]:
# We can adjust the batch size here
BATCH_SIZE = 64
# BATCH_SIZE = 32
# BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

## Build the Model
#### Different RNN Architecture
We use an entirely different RNN architecture, the LSTM, to mitigate the vanishing gradient problem that standard RNNs suffer from.

#### Bidirectional RNN
As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$. 
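
The indexing above can be checked with a small sketch. The example sentence is made up; the only point is which word each direction sees at time step $t$.

```python
words = ["this", "film", "is", "great"]
T = len(words)

pairs = []
for t in range(1, T + 1):         # 1-based time steps, as in the text
    forward_word = words[t - 1]   # x_t
    backward_word = words[T - t]  # x_{T-t+1}, written as a 0-based index
    pairs.append((t, forward_word, backward_word))

for t, forward_word, backward_word in pairs:
    print(f"t={t}: forward reads {forward_word!r}, backward reads {backward_word!r}")
```

At the first step the forward RNN reads the first word while the backward RNN reads the last one, and the two meet in the middle of the sentence.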

#### Multi-layer RNN
We also use a multi-layer RNN (also known as a deep RNN): additional RNNs are stacked on top of the initial RNN, each one forming a new layer that takes the outputs of the layer below as its inputs.

#### Regularization

To combat poor generalization and overfitting, we use a method of regularization called dropout. 
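
As a rough illustration of the idea, the sketch below implements "inverted" dropout in plain Python: during training each value is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while at evaluation time the values pass through unchanged. The function and its arguments are hypothetical; the model in this notebook relies on `nn.Dropout`, which follows the same scheme.

```python
import random

def dropout(values, p, training=True, rng=random):
    """Inverted dropout on a list of floats; assumes 0 <= p < 1."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(1234)
activations = [1.0] * 10

train_out = dropout(activations, p=0.5, rng=rng)        # some zeros, survivors doubled
eval_out = dropout(activations, p=0.5, training=False)  # unchanged

print(train_out)
print(eval_out)
```

The rescaling keeps the expected value of each activation the same in training and evaluation, which is why no adjustment is needed after calling `model.eval()`.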

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout) 
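        # hidden_dim * 2: the final forward and backward states are concatenated (assumes bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)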
        self.dropout = nn.Dropout(dropout)
        
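    def forward(self, text, text_lengths):
        # text: [sent len, batch size] -> embedded: [sent len, batch size, emb dim]
        embedded = self.dropout(self.embedding(text))
        # pack sequence and ensure lengths are on CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        # unpack sequence; output is not used below, only the final hidden states are
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # hidden: [num layers * num directions, batch size, hid dim]
        # concatenate the top layer's final forward (hidden[-2]) and backward (hidden[-1]) states
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        return self.fc(hidden)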
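
The indexing `hidden[-2]` and `hidden[-1]` can be checked on a toy LSTM. The sketch below uses small, made-up dimensions and unpadded input, and compares the final hidden states against the per-step `output` of the top layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hid_dim, n_layers = 6, 3, 4, 5, 2
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)

x = torch.randn(seq_len, batch_size, emb_dim)
output, (hidden, cell) = lstm(x)

# hidden stacks every layer and direction: [n_layers * 2, batch size, hid dim]
last_forward = hidden[-2]   # top layer, forward direction (after the last word)
last_backward = hidden[-1]  # top layer, backward direction (after the first word)
combined = torch.cat((last_forward, last_backward), dim=1)

print(hidden.shape)    # [4, 3, 5]
print(combined.shape)  # [3, 10], which is why the linear layer takes hid_dim * 2 inputs

# output holds the top layer at every step, forward and backward halves side by side
print(torch.allclose(output[-1, :, :hid_dim], last_forward))
print(torch.allclose(output[0, :, hid_dim:], last_backward))
```

Both comparisons print `True` for full-length sequences: the forward state matches the top layer's output at the last step, and the backward state matches it at the first step.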

Now we create an instance of our RNN class with our new parameters and arguments for the number of layers, bidirectionality, and dropout probability.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5 # test with 0.6, 0.8

# look up the index of the pad token in the vocabulary (stoi maps strings to indices)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [None]:
# printing the number of parameters in our model 
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 4,810,857 trainable parameters
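
This figure can be reproduced by hand. The sketch below assumes PyTorch's LSTM layout (four gates per direction, each with input weights, recurrent weights, and two bias vectors) and a vocabulary of 25,002 entries, i.e. `MAX_VOCAB_SIZE` plus the `<unk>` and `<pad>` tokens. The helper `lstm_direction_params` is hypothetical.

```python
def lstm_direction_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

VOCAB_SIZE = 25_002  # 25,000 words + <unk> + <pad>
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

embedding = VOCAB_SIZE * EMBEDDING_DIM
# layer 1 reads the embeddings; layer 2 reads both directions of layer 1
layer_1 = 2 * lstm_direction_params(EMBEDDING_DIM, HIDDEN_DIM)
layer_2 = 2 * lstm_direction_params(2 * HIDDEN_DIM, HIDDEN_DIM)
fc = 2 * HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM

total = embedding + layer_1 + layer_2 + fc
print(f"{total:,}")  # 4,810,857
```

More than half of the parameters (2,500,200) sit in the embedding layer, which is the part initialized from the pre-trained vectors.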


In [None]:
# retrieve the pre-trained word embeddings from the vocabulary
pretrained_embeddings = TEXT.vocab.vectors

# print shape to check if embeddings are the correct size, [vocab size, embedding dimension]
print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.4514,  0.2532, -0.4848,  ..., -0.8656,  0.0834,  0.3125],
        [-2.1498,  0.0503, -2.1136,  ...,  0.7646, -0.3180, -0.1118],
        [-0.6409,  1.7305,  1.1259,  ...,  0.0879,  0.1361,  0.4924]])

In [None]:
# initialize both the <unk> and <pad> tokens to all zeros to explicitly tell the model that they are irrelevant
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.4514,  0.2532, -0.4848,  ..., -0.8656,  0.0834,  0.3125],
        [-2.1498,  0.0503, -2.1136,  ...,  0.7646, -0.3180, -0.1118],
        [-0.6409,  1.7305,  1.1259,  ...,  0.0879,  0.1361,  0.4924]])


## Training and Choosing Optimizer and Criterion for the Model

Now we train the model. We choose `Adam` for the optimizer, use `BCEWithLogitsLoss` for the criterion, and place the model and criterion on a GPU if one is available. Additionally, we experiment with another optimizer, `SGD`.
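
For reference, `BCEWithLogitsLoss` applies a sigmoid to the raw model output and then computes binary cross-entropy. The plain-Python sketch below evaluates this for a single example using a commonly used numerically stable form; the function names are hypothetical, and the real criterion additionally averages over the batch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logit, target = 2.0, 1.0

stable = bce_with_logits(logit, target)
# textbook definition: -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
naive = -(target * math.log(sigmoid(logit))
          + (1 - target) * math.log(1 - sigmoid(logit)))

print(stable, naive)
```

For targets of 0 or 1 this loss is never negative, so a negative value during training would suggest that the labels fall outside that range.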

In [None]:
import torch.optim as optim

# trying SGD
# optimizer = optim.SGD(model.parameters(), lr=1e-3)

# trying Adam
optimizer = optim.Adam(model.parameters())

In [None]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

We implement the function to calculate accuracy...

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

We implement the function to calculate precision.

In [None]:
def binary_precision(preds, y):
    """
    Returns precision per batch: true positives / (true positives + false positives),
    e.g. 8 true positives out of 10 predicted positives returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))

    # true positives: predicted 1 and labelled 1
    true_pos = ((rounded_preds == 1.) & (y == 1.)).sum()

    # true + false positives: everything predicted as 1
    pred_ones = (rounded_preds == 1.).sum()

    # the small constant avoids division by zero when nothing is predicted positive
    precision = true_pos / (pred_ones + 1e-8)
    return precision

We implement the function to calculate recall.

In [None]:
import numpy as np 

def binary_recall(preds, y):
    """
    Returns recall per batch: true positives / (true positives + false negatives),
    e.g. 8 true positives out of 10 actual positives returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))

    # true positives: predicted 1 and labelled 1
    true_pos = ((rounded_preds == 1.) & (y == 1.)).sum()

    # true positives + false negatives: everything labelled 1
    t_pos_f_neg = (y == 1.).sum()

    # the small constant avoids division by zero when the batch has no positive labels
    recall = true_pos / (t_pos_f_neg + 1e-8)
    return recall

We implement the function to calculate the F1 score.

In [None]:
def binary_f1(prec, recall):
    """
    Returns f1 per batch
    """
    f1 = 2 * prec * recall / ((prec + recall) + 1e-8)
    return f1
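
As a worked example of how the three metrics relate, the sketch below computes them from raw counts on a made-up set of eight predictions. The helper `confusion_counts` and the data are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count true positives, false positives and false negatives for binary labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, fp, fn

predicted = [1, 1, 1, 1, 0, 0, 0, 0]
actual    = [1, 1, 0, 0, 1, 0, 0, 0]

tp, fp, fn = confusion_counts(predicted, actual)  # 2, 2, 1
precision = tp / (tp + fp)                        # 2 / 4
recall = tp / (tp + fn)                           # 2 / 3
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that averaging per-batch precision and recall and then combining them, as the training loop in this notebook does, only approximates the F1 score obtained by accumulating the counts over the whole epoch.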

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    epoch_prec = 0
    epoch_rec = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()        
        text, text_lengths = batch.text        
        predictions = model(text, text_lengths).squeeze(1)        
        loss = criterion(predictions, batch.label)        
        acc = binary_accuracy(predictions, batch.label)
        prec = binary_precision(predictions, batch.label)
        rec = binary_recall(predictions, batch.label)       
        loss.backward()        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        epoch_prec += prec.item()
        epoch_rec += rec.item()
       
    return epoch_loss / len(iterator), epoch_acc / len(iterator), epoch_prec / len(iterator), epoch_rec / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    epoch_prec = 0
    epoch_rec = 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text           
            predictions = model(text, text_lengths).squeeze(1)            
            loss = criterion(predictions, batch.label)           
            acc = binary_accuracy(predictions, batch.label)
            prec = binary_precision(predictions, batch.label)
            rec = binary_recall(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            epoch_prec += prec.item()
            epoch_rec += rec.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator), epoch_prec / len(iterator), epoch_rec / len(iterator)

We define a function to inform us how long each epoch takes to run.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Train the Model

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()  
    train_loss, train_acc, train_prec, train_rec = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc, valid_prec, valid_rec = evaluate(model, valid_iterator, criterion)   
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')

    train_f1 = binary_f1(train_prec, train_rec)
    valid_f1 = binary_f1(valid_prec, valid_rec)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f} | Train Prec: {train_prec*100:.2f} | Train Recall: {train_rec*100:.2f} | Train F1: {train_f1*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f} | Val. Prec: {valid_prec*100:.2f} | Val Recall: {valid_rec*100:.2f} | Val F1: {valid_f1*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 30s
	Train Loss: -78.023 | Train Acc: 50.72 | Train Prec: 0.00 | Train Recall: 0.00 | Train F1: 0.00%
	 Val. Loss: -105.602 |  Val. Acc: 49.26 | Val. Prec: 0.00 | Val Recall: 0.00 | Val F1: 0.00%
Epoch: 02 | Epoch Time: 0m 30s
	Train Loss: -126.359 | Train Acc: 50.74 | Train Prec: 0.00 | Train Recall: 0.00 | Train F1: 0.00%
	 Val. Loss: -155.292 |  Val. Acc: 49.26 | Val. Prec: 0.00 | Val Recall: 0.00 | Val F1: 0.00%
Epoch: 03 | Epoch Time: 0m 29s
	Train Loss: -175.204 | Train Acc: 50.65 | Train Prec: 0.00 | Train Recall: 0.00 | Train F1: 0.00%
	 Val. Loss: -205.341 |  Val. Acc: 49.26 | Val. Prec: 0.00 | Val Recall: 0.00 | Val F1: 0.00%
Epoch: 04 | Epoch Time: 0m 30s
	Train Loss: -223.781 | Train Acc: 50.72 | Train Prec: 0.00 | Train Recall: 0.00 | Train F1: 0.00%
	 Val. Loss: -255.407 |  Val. Acc: 49.26 | Val. Prec: 0.00 | Val Recall: 0.00 | Val F1: 0.00%
Epoch: 05 | Epoch Time: 0m 30s
	Train Loss: -271.890 | Train Acc: 50.76 | Train Prec: 0.00 | Train Recall

### Obtain Test Accuracy

In [None]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc, test_prec, test_recall = evaluate(model, test_iterator, criterion)

test_f1 = binary_f1(test_prec, test_recall)

# print(test_recall)

# the loss is not a percentage, so it is printed without a % sign
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% | Test Precision: {test_prec*100:.2f}% | Test Recall: {test_recall*100:.2f}% | Test F1 Score: {test_f1*100:.2f}%')