## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which consists of data processing utilities and popular datasets for natural language.

In [1]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
import os
if torch.cuda.is_available():
    device = 'cuda'
    torch.cuda.set_device(3)  # GPU index on the author's machine; adjust to an index that exists on yours
else:
    device = 'cpu'


print("device: ", device)

device:  cuda


In [3]:
torch.cuda.current_device()

3

### Preparing Data

In [4]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField()

In [5]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

In [6]:
%%time
TEXT.build_vocab(trn)

CPU times: user 1.45 s, sys: 126 ms, total: 1.58 s
Wall time: 1.88 s


In [7]:
LABEL.build_vocab(trn)

TEXT.vocab.freqs is a collections.Counter object, so we can look at the most frequent words.

In [8]:
TEXT.vocab.freqs.most_common(10)

[('the', 227670),
 ('a', 112248),
 ('and', 111447),
 ('of', 102003),
 ('to', 94477),
 ('is', 73358),
 ('in', 63743),
 ('i', 49263),
 ('this', 48896),
 ('that', 46565)]
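
For readers who have not used collections.Counter before, here is a minimal sketch of the same kind of frequency table built by hand. The two reviews are made up for illustration, and splitting on whitespace is only an assumption about tokenization.

```python
from collections import Counter

# Two made-up reviews, lowercased and split on whitespace.
reviews = ["the movie was great and the cast was great",
           "the plot was thin"]

freqs = Counter()
for review in reviews:
    freqs.update(review.lower().split())

print(freqs.most_common(2))  # [('the', 3), ('was', 3)]
```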

In [9]:
print("TEXT vocabulary size:",len(TEXT.vocab))
print("LABEL vocabulary size:",len(LABEL.vocab))

TEXT vocabulary size: 202507
LABEL vocabulary size: 2


### Creating the Iterator (2 points)

During training, we'll use a special kind of iterator called the **BucketIterator**. When we pass data into a neural network, the sequences in a batch must be padded to the same length so that they can be processed together:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar length into the same batch to minimize padding.
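
The effect of bucketing can be sketched in plain Python. This is a simplified illustration of the idea, not how torchtext implements it: the helper names (`pad_batch`, `padding_cells`, `make_batches`) and the synthetic sequence lengths are made up for the example.

```python
import random

PAD = 0  # padding index used in the example above

def pad_batch(batch, pad=PAD):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad] * (max_len - len(seq)) for seq in batch]

def padding_cells(batches):
    """Count how many padding positions a list of batches needs in total."""
    return sum(max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
               for b in batches)

def make_batches(seqs, batch_size):
    """Cut a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

print(pad_batch([[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]]))

# Synthetic sequences with lengths between 1 and 200 (made-up data).
rng = random.Random(0)
seqs = [[7] * rng.randint(1, 200) for _ in range(1024)]

random_batches = make_batches(seqs, 64)                     # arbitrary order
bucketed_batches = make_batches(sorted(seqs, key=len), 64)  # similar lengths together

print(padding_cells(random_batches), padding_cells(bucketed_batches))
```

Batching length-sorted sequences needs far fewer padding positions than batching them in arbitrary order, which is the saving the BucketIterator aims for.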

Complete the definition of the **BucketIterator** object

In [10]:
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(64, 64, 64),
        sort=False,
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        device=device,
        repeat=False
)

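Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by the layout: the **Field** above was created without **batch_first=True**, so the tensor is shaped (sequence length, batch size), each column holds one review, and the rows of 1s at the bottom are padding.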

In [11]:
batch = next(iter(train_iter)); batch.text

tensor([[ 2294,    10, 21966,  ...,     9,    10,     2],
        [   10,     7,   310,  ...,   266,    20,   331],
        [   14,     3,    34,  ...,    10,    74,   329],
        ...,
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1]], device='cuda:3')
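
To make the two layouts concrete, here is a small sketch in plain PyTorch rather than torchtext. It pads three toy sequences with index 1, the value that fills the padded positions in the output above. The token ids are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token-id sequences of different lengths (illustrative values only).
seqs = [torch.tensor([3, 15, 2, 7]),
        torch.tensor([4, 5]),
        torch.tensor([5, 5, 6, 8, 9])]

# Default layout: (max_len, batch). Each column is one sequence,
# so the padding shows up in the bottom rows.
time_major = pad_sequence(seqs, padding_value=1)

# batch_first=True gives (batch, max_len): each row is one sequence.
batch_major = pad_sequence(seqs, batch_first=True, padding_value=1)

print(time_major.shape)   # torch.Size([5, 3])
print(batch_major.shape)  # torch.Size([3, 5])
print(torch.equal(time_major.t(), batch_major))  # True
```

The recurrent layers in torch.nn expect the first layout unless they are constructed with batch_first=True.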

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [12]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [13]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, n_layers, device=device):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim)
        self.gru = nn.GRU(self.emb_dim, self.hidden_dim, num_layers=self.n_layers, dropout=0.5, bidirectional=True)
        self.linear = nn.Linear(in_features=self.hidden_dim*2, out_features=1)
        self.act = nn.Sigmoid()
            
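    def forward(self, seq):
        # seq: (seq_len, batch) tensor of token indices
        emb = self.embedding(seq)
        outputs, hidden = self.gru(emb)
        # Concatenate the last layer's final forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        outputs = self.linear(hidden)
        preds = self.act(outputs)

        return preds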
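
The indexing `hidden[-2]` and `hidden[-1]` in `forward` relies on how a bidirectional `nn.GRU` lays out its final hidden state. The sketch below checks that layout on random input. The tiny dimensions are arbitrary choices for the example, not the ones used in the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hidden_dim, n_layers = 7, 4, 5, 3, 2
gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)

emb = torch.randn(seq_len, batch_size, emb_dim)   # (seq_len, batch, emb_dim)
outputs, hidden = gru(emb)

print(outputs.shape)  # torch.Size([7, 4, 6]): top layer, both directions, every step
print(hidden.shape)   # torch.Size([4, 4, 3]): (n_layers * 2, batch, hidden_dim)

# hidden[-2] is the top layer's forward state, hidden[-1] its backward state.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)    # torch.Size([4, 6])

# The forward state equals the last time step of `outputs`,
# the backward state equals the first time step.
print(torch.allclose(hidden[-2], outputs[-1, :, :hidden_dim]))  # True
print(torch.allclose(hidden[-1], outputs[0, :, hidden_dim:]))   # True
```

This is why the linear layer takes `hidden_dim*2` input features: it receives one final state per direction.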

In [14]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 2

model = RNNBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

RNNBaseline(
  (embedding): Embedding(201435, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [15]:
model.cuda()

RNNBaseline(
  (embedding): Embedding(201435, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

### The training loop (3 points)

Define the optimizer and the loss function.

In [16]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()
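
As a reminder of what binary cross-entropy measures, here is a minimal pure-Python sketch of the formula, the mean of -(y·log p + (1-y)·log(1-p)) over the batch. The function name `bce` and the sample probabilities are made up for illustration. It assumes probabilities strictly between 0 and 1.

```python
import math

def bce(preds, targets):
    """Mean binary cross-entropy for probabilities in (0, 1) and 0/1 targets."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(preds, targets)]
    return sum(losses) / len(losses)

# A classifier that always answers 0.5 scores ln 2 on any labels.
print(round(bce([0.5, 0.5], [1, 0]), 4))                  # 0.6931

# More confident correct predictions give a lower loss.
print(bce([0.9, 0.1], [1, 0]) < bce([0.6, 0.4], [1, 0]))  # True
```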

Define the stopping criteria.

In [17]:
epochs = 3

In [18]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:

        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze(1)  # (batch, 1) -> (batch,)
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # Sum of per-batch mean losses divided by the number of examples,
    # so the printed value is scaled by roughly 1 / batch size.
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_iter:

            x = batch.text
            y = batch.label

            preds = model(x).squeeze(1)

            loss = loss_func(preds, y.float())
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010312874983038222, Validation Loss: 0.009885733830928802
Epoch: 2, Training Loss: 0.00819954081262861, Validation Loss: 0.006121420164903005
Epoch: 3, Training Loss: 0.004075386614458902, Validation Loss: 0.0052307642281055455
CPU times: user 3min 54s, sys: 55.8 s, total: 4min 50s
Wall time: 5min 26s
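
A note on reading the numbers above: the loop sums per-batch mean losses and divides by `len(trn)`, the number of reviews rather than the number of batches, so the printed values are roughly the mean loss divided by the batch size. The sketch below illustrates the scale with made-up per-batch losses. It assumes 17,500 training reviews (the default 70% split of the 25,000 IMDB training reviews) and a batch size of 64.

```python
import math

n_examples = 17500          # assumed size of trn
batch_size = 64
n_batches = math.ceil(n_examples / batch_size)

# Pretend every batch has a mean loss of ln 2 (a classifier that always answers 0.5).
batch_losses = [math.log(2)] * n_batches
running_loss = sum(batch_losses)

per_example_scaled = running_loss / n_examples   # what the loop prints
per_batch_mean = running_loss / n_batches        # mean loss per batch

print(n_batches)                     # 274
print(round(per_example_scaled, 4))  # 0.0109
print(round(per_batch_mean, 4))      # 0.6931
```

Under these assumptions, a first-epoch value near 0.010 corresponds to a per-batch mean loss a little below ln 2.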


### Calculate performance of the trained model (5 points)

In [19]:
tp = 0
fp = 0
fn = 0
tn = 0
model.eval()
with torch.no_grad():
    for batch in test_iter:
        x = batch.text
        y = batch.label
        predictions = model(x).squeeze(1)
        rounded_preds = torch.round(predictions)
        # Dividing a 0/1 prediction by a 0/1 label encodes the four outcomes:
        # 1/1 = 1 (tp), 1/0 = inf (fp), 0/0 = nan (tn), 0/1 = 0 (fn).
        confusion_vector = rounded_preds / y.float()

        tp += torch.sum(confusion_vector == 1).item()
        fp += torch.sum(confusion_vector == float('inf')).item()
        tn += torch.sum(torch.isnan(confusion_vector)).item()
        fn += torch.sum(confusion_vector == 0).item()
    

In [21]:
accuracy = (tp+tn)/(tp+tn+fn+fp)
print("accuracy: ", accuracy)
precision = tp/(tp+fp)
print("precision: ", precision)
recall = tp/(tp+fn)
print("recall: ", recall)
f1 = 2*precision*recall/(precision+recall)
print("f1: ", f1)

accuracy:  0.85732
precision:  0.8618650247103622
recall:  0.85104
f1:  0.8564183069677576
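
The evaluation cell derives the four confusion counts by dividing predictions by labels (1/1 = 1, 1/0 = inf, 0/0 = nan, 0/1 = 0). The same bookkeeping can be written with explicit comparisons. The sketch below uses made-up predictions and labels, and the helper names `confusion_counts` and `scores` are hypothetical.

```python
def confusion_counts(preds, labels):
    """Count tp, fp, fn, tn for 0/1 predictions and 0/1 labels."""
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

preds  = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(preds, labels)
print(counts)           # (3, 1, 1, 3)
print(scores(*counts))  # (0.75, 0.75, 0.75, 0.75)
```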


Write down the calculated performance

### Accuracy:  0.8573
### Precision: 0.8619
### Recall: 0.8510
### F1: 0.8564

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

## 1.
### Use LSTM instead of GRU, keeping the Adam optimizer.

In [13]:
class LSTMBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, n_layers, device=device):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim)
        self.lstm = nn.LSTM(self.emb_dim, self.hidden_dim, num_layers=self.n_layers, dropout=0.5, bidirectional=True)
        self.linear = nn.Linear(in_features=self.hidden_dim*2, out_features=1)
        self.act = nn.Sigmoid()
            
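    def forward(self, seq):
        # seq: (seq_len, batch) tensor of token indices
        emb = self.embedding(seq)
        # The LSTM returns both the hidden state and the cell state; only the hidden state is used
        outputs, (hidden, cell) = self.lstm(emb)
        # Concatenate the last layer's final forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        outputs = self.linear(hidden)
        preds = self.act(outputs)

        return preds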

In [23]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 2

In [24]:
model = LSTMBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

LSTMBaseline(
  (embedding): Embedding(201435, 200)
  (lstm): LSTM(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [25]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [26]:
model.cuda()

LSTMBaseline(
  (embedding): Embedding(201435, 200)
  (lstm): LSTM(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [27]:
epochs = 5

In [28]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:

        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze(1)  # (batch, 1) -> (batch,)
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # Sum of per-batch mean losses divided by the number of examples,
    # so the printed value is scaled by roughly 1 / batch size.
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_iter:

            x = batch.text
            y = batch.label

            preds = model(x).squeeze(1)

            loss = loss_func(preds, y.float())
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010667503012929644, Validation Loss: 0.010176380038261414
Epoch: 2, Training Loss: 0.008569264927932195, Validation Loss: 0.008188266328970592
Epoch: 3, Training Loss: 0.005305669948032924, Validation Loss: 0.0060926742573579155
Epoch: 4, Training Loss: 0.00366311129118715, Validation Loss: 0.006672459904352824
Epoch: 5, Training Loss: 0.0024167179827179227, Validation Loss: 0.005812775735060374
CPU times: user 7min 2s, sys: 1min 44s, total: 8min 46s
Wall time: 9min 46s


In [29]:
tp = 0
fp = 0
fn = 0
tn = 0
model.eval()
with torch.no_grad():
    for batch in test_iter:
        x = batch.text
        y = batch.label
        predictions = model(x).squeeze(1)
        rounded_preds = torch.round(predictions)
        # Dividing a 0/1 prediction by a 0/1 label encodes the four outcomes:
        # 1/1 = 1 (tp), 1/0 = inf (fp), 0/0 = nan (tn), 0/1 = 0 (fn).
        confusion_vector = rounded_preds / y.float()

        tp += torch.sum(confusion_vector == 1).item()
        fp += torch.sum(confusion_vector == float('inf')).item()
        tn += torch.sum(torch.isnan(confusion_vector)).item()
        fn += torch.sum(confusion_vector == 0).item()
    
accuracy = (tp+tn)/(tp+tn+fn+fp)
print("accuracy: ", accuracy)
precision = tp/(tp+fp)
print("precision: ", precision)
recall = tp/(tp+fn)
print("recall: ", recall)
f1 = 2*precision*recall/(precision+recall)
print("f1: ", f1)

accuracy:  0.8438
precision:  0.8275783215184084
recall:  0.86856
f1:  0.8475740661227995


## 2.
### Change the number of layers.

In [14]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 4
model = LSTMBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

LSTMBaseline(
  (embedding): Embedding(202507, 200)
  (lstm): LSTM(200, 300, num_layers=4, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [15]:
model.cuda()

LSTMBaseline(
  (embedding): Embedding(202507, 200)
  (lstm): LSTM(200, 300, num_layers=4, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [16]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [17]:
epochs = 5

In [18]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:

        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze(1)  # (batch, 1) -> (batch,)
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # Sum of per-batch mean losses divided by the number of examples,
    # so the printed value is scaled by roughly 1 / batch size.
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_iter:

            x = batch.text
            y = batch.label

            preds = model(x).squeeze(1)

            loss = loss_func(preds, y.float())
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010766579798289707, Validation Loss: 0.010535019485155742
Epoch: 2, Training Loss: 0.010609476024763925, Validation Loss: 0.010886118125915528
Epoch: 3, Training Loss: 0.01066178231239319, Validation Loss: 0.010738253450393677
Epoch: 4, Training Loss: 0.010423775676318577, Validation Loss: 0.010440375026067098
Epoch: 5, Training Loss: 0.008297995703560965, Validation Loss: 0.008229103203614552
CPU times: user 15min 50s, sys: 5min 10s, total: 21min 1s
Wall time: 22min 39s


In [19]:
tp = 0
fp = 0
fn = 0
tn = 0
model.eval()
with torch.no_grad():
    for batch in test_iter:
        x = batch.text
        y = batch.label
        predictions = model(x).squeeze(1)
        rounded_preds = torch.round(predictions)
        # Dividing a 0/1 prediction by a 0/1 label encodes the four outcomes:
        # 1/1 = 1 (tp), 1/0 = inf (fp), 0/0 = nan (tn), 0/1 = 0 (fn).
        confusion_vector = rounded_preds / y.float()

        tp += torch.sum(confusion_vector == 1).item()
        fp += torch.sum(confusion_vector == float('inf')).item()
        tn += torch.sum(torch.isnan(confusion_vector)).item()
        fn += torch.sum(confusion_vector == 0).item()
    
accuracy = (tp+tn)/(tp+tn+fn+fp)
print("accuracy: ", accuracy)
precision = tp/(tp+fp)
print("precision: ", precision)
recall = tp/(tp+fn)
print("recall: ", recall)
f1 = 2*precision*recall/(precision+recall)
print("f1: ", f1)

accuracy:  0.75804
precision:  0.8401349783823685
recall:  0.63736
f1:  0.7248328253650549
