## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which provides data processing utilities and popular datasets for natural language.

In [84]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [85]:
if torch.cuda.is_available():
    device = 'cuda'
    torch.cuda.set_device(0)
else:
    device = 'cpu'


print("device: ", device)

device:  cuda


### Preparing Data

In [86]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField()

In [87]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

In [88]:
%%time
TEXT.build_vocab(trn)

CPU times: user 1.54 s, sys: 64 ms, total: 1.6 s
Wall time: 1.6 s


In [89]:
LABEL.build_vocab(trn)

vocab.freqs is a collections.Counter object, so we can look at the most frequent words.

In [90]:
TEXT.vocab.freqs.most_common(10)

[('the', 224196),
 ('a', 111988),
 ('and', 111048),
 ('of', 100799),
 ('to', 93554),
 ('is', 72929),
 ('in', 63266),
 ('i', 49492),
 ('this', 48857),
 ('that', 46209)]

In [91]:
print("TEXT vocabulary size:",len(TEXT.vocab))
print("LABEL vocabulary size:",len(LABEL.vocab))

TEXT vocabulary size: 202013
LABEL vocabulary size: 2
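With 202,013 entries, a 200-dimensional embedding table (the size used later in this notebook) holds about 40 million parameters, many of them probably for rare tokens. The sketch below uses a plain `collections.Counter`, the same type as vocab.freqs, to show how a minimum-frequency cutoff shrinks a vocabulary. The toy corpus, the helper `vocab_with_cutoff`, and the cutoff value are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data).
corpus = [
    "the movie was great".split(),
    "the movie was awful".split(),
    "the plot was thin".split(),
]

# Count every token, exactly as a frequency Counter would.
freqs = Counter(token for review in corpus for token in review)


def vocab_with_cutoff(freqs, min_freq):
    """Keep only tokens that occur at least `min_freq` times."""
    return sorted(tok for tok, count in freqs.items() if count >= min_freq)


full_vocab = vocab_with_cutoff(freqs, min_freq=1)
small_vocab = vocab_with_cutoff(freqs, min_freq=2)

print(freqs.most_common(3))
print(len(full_vocab), "->", len(small_vocab))
```

As far as I recall, the legacy `Field.build_vocab` forwards `max_size` and `min_freq` keyword arguments for this purpose; it is worth checking the installed torchtext version before relying on them.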


### Creating the Iterator (2 points)

During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the sequences padded to the same length so that we can process them as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.
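The padding example and the effect of grouping by length can be reproduced with plain PyTorch. This is a minimal sketch, not how BucketIterator is implemented: `padding_cells` is a hypothetical helper and the review lengths are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [
    torch.tensor([3, 15, 2, 7]),
    torch.tensor([4, 1]),
    torch.tensor([5, 5, 6, 8, 1]),
]

# Pad every sequence with 0 up to the length of the longest one.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)


def padding_cells(lengths, batch_size):
    """Count padded positions when `lengths` are batched in the given order."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        total += sum(max(chunk) - n for n in chunk)
    return total


lengths = [2, 50, 3, 48, 4, 52]             # hypothetical review lengths
unsorted_cost = padding_cells(lengths, 3)   # arbitrary order
bucketed_cost = padding_cells(sorted(lengths), 3)  # similar lengths together
print(unsorted_cost, bucketed_cost)
```

Batching the sorted lengths leaves far fewer padded positions, which is the saving that bucketing aims for.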

Complete the definition of the **BucketIterator** object.

In [92]:
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(64, 64, 64),
        sort=False,
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        device=device,
        repeat=False
)

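Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by the layout: the **Field** was created without **batch_first=True**, so the tensor has shape [sequence length, batch size], and the rows of 1s at the bottom are padding.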

In [93]:
batch = next(iter(train_iter)); batch.text

tensor([[   3,   94,  950,  ..., 5981,    8,    9],
        [ 222,    2,   10,  ...,    6, 2480,  492],
        [   5,  695,   20,  ...,  110,    5,  182],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0')

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [94]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple first. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [60]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, n_layers, device=device):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim)
        self.gru = nn.GRU(self.emb_dim, self.hidden_dim, num_layers=self.n_layers, dropout=0.5, bidirectional=True)
        self.linear = nn.Linear(in_features=self.hidden_dim*2, out_features=1)
        self.act = nn.Sigmoid()
            
    def forward(self, seq):
        
        emb = self.embedding(seq)
        outputs, hidden = self.gru(emb)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        outputs = self.linear(hidden)
        preds = self.act(outputs)

        return preds
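The indexing `hidden[-2]` and `hidden[-1]` in `forward` relies on how a bidirectional multi-layer GRU lays out its final hidden states. The following sketch uses small made-up dimensions and random input to make the shapes explicit.

```python
import torch
import torch.nn as nn

seq_len, batch_size, emb_dim, hidden_dim, n_layers = 7, 4, 5, 3, 2

gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
emb = torch.randn(seq_len, batch_size, emb_dim)  # [seq_len, batch, emb_dim]

outputs, hidden = gru(emb)
# outputs: top-layer states for every time step, both directions concatenated.
# hidden:  final state per layer and direction, ordered
#          (layer 0 fwd, layer 0 bwd, layer 1 fwd, layer 1 bwd).
print(outputs.shape)  # torch.Size([7, 4, 6])
print(hidden.shape)   # torch.Size([4, 4, 3])

# The last two entries are the top layer's forward and backward final states.
sentence_vec = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(sentence_vec.shape)  # torch.Size([4, 6])

# Cross-check against `outputs`: the forward direction finishes at the last
# time step, the backward direction finishes at the first one.
assert torch.allclose(outputs[-1, :, :hidden_dim], hidden[-2])
assert torch.allclose(outputs[0, :, hidden_dim:], hidden[-1])
```

This is why the linear layer takes `hidden_dim*2` input features.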

In [90]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 2

model = RNNBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

RNNBaseline(
  (embedding): Embedding(202477, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [91]:
model.cuda()

RNNBaseline(
  (embedding): Embedding(202477, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

### The training loop (3 points)

Define the optimizer and the loss function.

In [92]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()
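`nn.BCELoss` expects probabilities, which is why the model ends with a sigmoid. A common alternative, not used in this notebook, is to return raw logits and use `nn.BCEWithLogitsLoss`, which applies the sigmoid internally in a more numerically stable way. The sketch below, on made-up logits and targets, checks that the two formulations agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                       # made-up raw scores
targets = torch.randint(0, 2, (8,)).float()   # made-up binary labels

# Option used in this notebook: sigmoid first, then BCELoss on probabilities.
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Alternative: feed logits directly to BCEWithLogitsLoss.
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_probs.item(), loss_logits.item())
```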

Define the stopping criteria.

In [93]:
epochs = 5
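A fixed number of epochs is the simplest stopping criterion. A possible refinement is early stopping on the validation loss; the sketch below is one way to write it, where the `EarlyStopping` class, its `patience` parameter, and the loss values are all hypothetical.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.60, 0.55, 0.58, 0.61, 0.70]  # made-up validation losses
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real training loop, `stopper.step(val_loss)` would be called once per epoch after the validation pass, typically together with saving the best model weights.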

In [95]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze()
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label
        
        preds = model(x).squeeze()  # match the target shape [batch]
        
        loss = loss_func(preds, y.float())
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: 0.006406023572172437, Validation Loss: 0.005995762546857198
Epoch: 2, Training Loss: 0.003078684065597398, Validation Loss: 0.005681985412041346
Epoch: 3, Training Loss: 0.0010622044105614934, Validation Loss: 0.009135842084884644
Epoch: 4, Training Loss: 0.00043887639168782957, Validation Loss: 0.011650349374612172
Epoch: 5, Training Loss: 0.0002040364532415489, Validation Loss: 0.017380772598584494
CPU times: user 7min, sys: 1min 46s, total: 8min 46s
Wall time: 9min 53s
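Because `nn.BCELoss` averages within each batch by default, summing the batch losses and dividing by the number of examples, as the loop above does, gives roughly the per-example loss divided by the batch size. Training and validation values remain comparable with each other, but they are not on the usual scale. Below is a sketch of an evaluation pass that weights each batch by its size and disables gradient tracking; the toy model and the `(x, y)` tuple batches are stand-ins for the notebook's model and iterator.

```python
import torch
import torch.nn as nn


def evaluate(model, batches, loss_func):
    """Return the mean loss per example over an iterable of (x, y) batches."""
    model.eval()                      # disable dropout
    total_loss, n_examples = 0.0, 0
    with torch.no_grad():             # no gradients needed for evaluation
        for x, y in batches:
            preds = model(x).squeeze(1)
            # loss_func averages within the batch, so weight by batch size.
            total_loss += loss_func(preds, y.float()).item() * y.size(0)
            n_examples += y.size(0)
    return total_loss / n_examples


# Hypothetical stand-ins for the notebook's model and validation iterator.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
toy_batches = [
    (torch.randn(64, 10), torch.randint(0, 2, (64,))),
    (torch.randn(40, 10), torch.randint(0, 2, (40,))),
]

print(evaluate(toy_model, toy_batches, nn.BCELoss()))
```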


### Calculate performance of the trained model (5 points)

In [96]:
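# Note: x and y are overwritten on each iteration, so after the loop
# they hold only the last test batch.
for batch in test_iter:
    x = batch.text
    y = batch.label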

In [103]:
predictions = model(x).squeeze()

In [104]:
predictions

tensor([1.0000, 0.9999, 1.0000, 1.0000, 1.0000, 0.0889, 0.9999, 0.9995, 0.9982,
        0.9999, 0.9993, 0.9995, 0.9944, 0.9999, 0.9999, 0.9999, 1.0000, 1.0000,
        0.9999, 0.9817, 1.0000, 1.0000, 0.9948, 0.9999, 0.9999, 0.9982, 0.9999,
        1.0000, 0.9968, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9995, 0.9997,
        0.9348, 1.0000, 0.0728, 0.9998], device='cuda:2',
       grad_fn=<SqueezeBackward0>)

In [105]:
y

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2')

In [112]:
rounded_preds = torch.round(predictions)
correct = (rounded_preds == y.float()).float() 
accuracy = correct.sum()/len(correct)
print(accuracy)

tensor(0.9500, device='cuda:2')


In [113]:
correct

tensor([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 0., 1.], device='cuda:2')

In [116]:
correct_true_amount =  (correct * y.float()).sum()
recall = correct_true_amount / y.float().sum()
print(recall)

tensor(0.9500, device='cuda:2')


In [117]:
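# Note: the denominator sums raw probabilities; the count of predicted
# positives would be rounded_preds.sum().
precision = correct_true_amount / predictions.sum()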
print(precision)

tensor(0.9985, device='cuda:2', grad_fn=<DivBackward0>)


In [119]:
f1 = 2*precision*recall / (precision + recall)
print(f1)

tensor(0.9736, device='cuda:2', grad_fn=<DivBackward0>)


Write down the calculated performance

### Accuracy:  0.9500
### Precision: 0.9985
### Recall: 0.9500
### F1: 0.9736
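For reference, the four metrics can also be derived from confusion-matrix counts, which keeps precision tied to the number of predicted positives and avoids NaN when a denominator is empty. This is a sketch on made-up predictions; `binary_metrics` and the 0.5 threshold are assumptions, and returning 0.0 for an empty denominator is only one possible convention.

```python
import torch


def binary_metrics(probs, labels, threshold=0.5):
    """Accuracy, precision, recall and F1 from predicted probabilities."""
    preds = (probs >= threshold).long()
    labels = labels.long()
    tp = ((preds == 1) & (labels == 1)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    tn = ((preds == 0) & (labels == 0)).sum().item()

    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    # Guard against empty denominators, e.g. a batch with no positive labels.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


probs = torch.tensor([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])   # made-up predictions
labels = torch.tensor([1, 1, 1, 0, 0, 0])              # made-up labels
accuracy, precision, recall, f1 = binary_metrics(probs, labels)
print(accuracy, precision, recall, f1)
```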

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

## 1.
### Use an LSTM instead of the GRU, keeping the Adam optimizer.

In [67]:
class LSTMBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, n_layers, device=device):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim)
        self.lstm = nn.LSTM(self.emb_dim, self.hidden_dim, num_layers=self.n_layers, dropout=0.5, bidirectional=True)
        self.linear = nn.Linear(in_features=self.hidden_dim*2, out_features=1)
        self.act = nn.Sigmoid()
            
    def forward(self, seq):
        
        emb = self.embedding(seq)
        outputs, (hidden, cell) = self.lstm(emb)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        outputs = self.linear(hidden)
        preds = self.act(outputs)

        return preds

In [68]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 2

In [75]:
model = LSTMBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

LSTMBaseline(
  (embedding): Embedding(202013, 200)
  (lstm): LSTM(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [76]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [77]:
model.cuda()

LSTMBaseline(
  (embedding): Embedding(202013, 200)
  (lstm): LSTM(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [78]:
epochs = 5

In [79]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze()
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label
        
        preds = model(x).squeeze() 
        
        loss = loss_func(preds, y.float())
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.01051305285862514, Validation Loss: 0.009663916476567586
Epoch: 2, Training Loss: 0.009447988227435521, Validation Loss: 0.008468868442376454
Epoch: 3, Training Loss: 0.006968900963238308, Validation Loss: 0.008018794314066569
Epoch: 4, Training Loss: 0.004223328038624355, Validation Loss: 0.006067880711952845
Epoch: 5, Training Loss: 0.002570329620582717, Validation Loss: 0.00728962122797966
CPU times: user 7min 24s, sys: 1min 39s, total: 9min 4s
Wall time: 9min 4s


In [95]:
for batch in test_iter:
    x = batch.text
    y = batch.label

In [97]:
predictions = model(x).squeeze()
print("predictions: ", predictions)
print("y: ", y)
rounded_preds = torch.round(predictions)
correct = (rounded_preds == y.float()).float() 
accuracy = correct.sum()/len(correct)
print("accuracy: ", accuracy)
correct_true_amount =  (correct * y.float()).sum()
recall = correct_true_amount / y.float().sum()
print("recall: ", recall)
precision = correct_true_amount / predictions.sum()
print("precision: ", precision)
f1 = 2*precision*recall / (precision + recall)
print("f1: ", f1)

predictions:  tensor([0.0089, 0.0068, 0.0336, 0.0046, 0.0124, 0.3891, 0.0861, 0.4370, 0.0465,
        0.0058, 0.1585, 0.0079, 0.0511, 0.0152, 0.0106, 0.0428, 0.0056, 0.0164,
        0.0080, 0.0472, 0.0056, 0.0060, 0.0355, 0.0067, 0.0076, 0.0117, 0.1587,
        0.0062, 0.0131, 0.0056, 0.0043, 0.0062, 0.0083, 0.0059, 0.0082, 0.4056,
        0.3826, 0.0060, 0.2953, 0.0121], device='cuda:0',
       grad_fn=<SqueezeBackward0>)
y:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
accuracy:  tensor(1., device='cuda:0')
recall:  tensor(nan, device='cuda:0')
precision:  tensor(0., device='cuda:0', grad_fn=<DivBackward0>)
f1:  tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)


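Accuracy looks very good, but **y** contains only zeros, which is suspicious. The likely cause is that the loop over **test_iter** overwrites **x** and **y** on every iteration, so only the last batch (40 reviews) is evaluated, and since the test iterator does not appear to shuffle, that batch holds reviews of a single class. With no positive labels in the batch, recall and F1 come out as NaN, so these numbers do not describe the whole test set.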
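One way to evaluate on the whole test set is to collect predictions from every batch before computing any metric. A minimal sketch, with a toy model and tuple batches standing in for the notebook's model and **test_iter** (both hypothetical):

```python
import torch
import torch.nn as nn


def predict_all(model, batches):
    """Collect probabilities and labels over every batch, not just the last."""
    model.eval()
    all_probs, all_labels = [], []
    with torch.no_grad():
        for x, y in batches:
            all_probs.append(model(x).squeeze(1).cpu())
            all_labels.append(y.cpu())
    return torch.cat(all_probs), torch.cat(all_labels)


# Hypothetical unshuffled "test set": all positives first, then all negatives,
# split into batches of 4.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
features = torch.randn(10, 3)
labels = torch.tensor([1] * 5 + [0] * 5)
batches = [(features[i:i + 4], labels[i:i + 4]) for i in range(0, 10, 4)]

probs, ys = predict_all(toy_model, batches)
print(ys.tolist())               # both classes are present
print(batches[-1][1].tolist())   # the last batch alone holds a single class
```

Metrics computed on `probs` and `ys` then cover every example, while the last batch on its own would contain only one label value.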

## 2.
### Increase the number of LSTM layers from 2 to 4.

In [100]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 4
model = LSTMBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

LSTMBaseline(
  (embedding): Embedding(202013, 200)
  (lstm): LSTM(200, 300, num_layers=4, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [101]:
model.cuda()

LSTMBaseline(
  (embedding): Embedding(202013, 200)
  (lstm): LSTM(200, 300, num_layers=4, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

In [102]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [103]:
epochs = 5

In [104]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze()
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label
        
        preds = model(x).squeeze() 
        
        loss = loss_func(preds, y.float())
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010511533805302212, Validation Loss: 0.009953414384524028
Epoch: 2, Training Loss: 0.008259902024269104, Validation Loss: 0.008046958049138388
Epoch: 3, Training Loss: 0.005282344491992678, Validation Loss: 0.00589160217444102
Epoch: 4, Training Loss: 0.003222078377859933, Validation Loss: 0.006427731857697169
Epoch: 5, Training Loss: 0.0018292595868664129, Validation Loss: 0.006051302402218183
CPU times: user 15min 54s, sys: 4min 16s, total: 20min 11s
Wall time: 20min 10s


In [105]:
for batch in test_iter:
    x = batch.text
    y = batch.label

In [106]:
predictions = model(x).squeeze()
print("predictions: ", predictions)
print("y: ", y)
rounded_preds = torch.round(predictions)
correct = (rounded_preds == y.float()).float() 
accuracy = correct.sum()/len(correct)
print("accuracy: ", accuracy)
correct_true_amount =  (correct * y.float()).sum()
recall = correct_true_amount / y.float().sum()
print("recall: ", recall)
precision = correct_true_amount / predictions.sum()
print("precision: ", precision)
f1 = 2*precision*recall / (precision + recall)
print("f1: ", f1)

predictions:  tensor([0.0925, 0.0348, 0.0335, 0.0053, 0.5546, 0.7839, 0.0399, 0.7578, 0.0246,
        0.0104, 0.0431, 0.0376, 0.4270, 0.5420, 0.0063, 0.2888, 0.0105, 0.1067,
        0.0129, 0.0311, 0.0139, 0.0054, 0.0131, 0.1670, 0.0071, 0.0405, 0.0437,
        0.0053, 0.3596, 0.0095, 0.0056, 0.0292, 0.0759, 0.0094, 0.0830, 0.4709,
        0.7604, 0.0102, 0.9877, 0.0097], device='cuda:0',
       grad_fn=<SqueezeBackward0>)
y:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
accuracy:  tensor(0.8500, device='cuda:0')
recall:  tensor(nan, device='cuda:0')
precision:  tensor(0., device='cuda:0', grad_fn=<DivBackward0>)
f1:  tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
