## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which consists of data processing utilities and popular datasets for natural language.

In [1]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
if torch.cuda.is_available():
    device = 'cuda'
    torch.cuda.set_device(2)
else:
    device = 'cpu'


print("device: ", device)

device:  cuda


### Preparing Data

In [3]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField()

In [4]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

In [5]:
%%time
TEXT.build_vocab(trn)

CPU times: user 1.79 s, sys: 115 ms, total: 1.9 s
Wall time: 2.39 s


In [6]:
LABEL.build_vocab(trn)

TEXT.vocab.freqs is a collections.Counter object, so we can look at the most frequent words.

In [7]:
TEXT.vocab.freqs.most_common(10)

[('the', 225879),
 ('a', 112390),
 ('and', 110975),
 ('of', 101369),
 ('to', 93979),
 ('is', 73121),
 ('in', 63012),
 ('i', 48940),
 ('this', 48915),
 ('that', 46219)]

In [8]:
print("TEXT vocabulary size:",len(TEXT.vocab))
print("LABEL vocabulary size:",len(LABEL.vocab))

TEXT vocabulary size: 202477
LABEL vocabulary size: 2
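As a rough mental model (a simplified sketch, not torchtext's actual implementation), a vocabulary can be pictured as a `collections.Counter` of token frequencies plus a token-to-index table in which special tokens come first. The helper name `build_toy_vocab` and the toy documents are hypothetical; placing `<unk>` at index 0 and `<pad>` at index 1 is an assumption chosen to match the padding index 1 seen in the batches of this notebook.

```python
from collections import Counter


def build_toy_vocab(tokenized_docs, specials=("<unk>", "<pad>")):
    """Count tokens and assign indices: specials first, then by frequency."""
    freqs = Counter(tok for doc in tokenized_docs for tok in doc)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return freqs, itos, stoi


# Hypothetical, already tokenized and lower-cased documents.
docs = [["the", "movie", "was", "great"], ["the", "plot", "was", "thin"]]
freqs, itos, stoi = build_toy_vocab(docs)

print(freqs.most_common(2))
# Tokens missing from the vocabulary fall back to the <unk> index.
print([stoi.get(tok, stoi["<unk>"]) for tok in ["the", "ending", "was"]])
```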


### Creating the Iterator (2 points)

During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the sequences padded to the same length so that they can be processed as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.
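The effect of bucketing can be illustrated in plain Python. This is a simplified sketch under stated assumptions: the helper names are hypothetical, the padding index is 0 as in the example above, and sorting the whole dataset by length stands in for the real BucketIterator, which also shuffles.

```python
def pad_batch(seqs, pad_idx=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]


def make_batches(seqs, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def count_padding(batches):
    """Total number of padding positions needed across all batches."""
    return sum(
        max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
        for b in batches
    )


# The example from the text.
padded = pad_batch([[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]])
print(padded)

# Hypothetical sequences of very different lengths.
toy = [[1] * n for n in (2, 30, 3, 28)]
unsorted_pad = count_padding(make_batches(toy, 2))
bucketed_pad = count_padding(make_batches(sorted(toy, key=len), 2))
print(unsorted_pad, bucketed_pad)
```

Grouping by length keeps short reviews away from long ones, so far fewer positions in each batch are spent on padding.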

Complete the definition of the **BucketIterator** object.

In [9]:
train_iter, val_iter, test_iter = BucketIterator.splits(
    (trn, vld, tst),
    batch_sizes=(64, 64, 64),
    sort=False,
    sort_key=lambda x: len(x.text),  # bucket examples by review length
    sort_within_batch=False,
    device=device,  # reuse the device selected above instead of hard-coding 'cuda'
    repeat=False
)

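Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by the layout: the Field above was not created with **batch_first=True**, so the tensor has shape (sequence length, batch size), and shorter reviews are padded with index 1 in the trailing rows.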

In [10]:
batch = next(iter(train_iter))
batch.text

tensor([[   10,   290,     9,  ...,   968,     9,     3],
        [   24,  1377,    87,  ...,   937,   814, 12126],
        [    7,    41,    23,  ...,    16,     6,  5238],
        ...,
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1]], device='cuda:2')

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [11]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple first. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [60]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, n_layers, device=device):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim)
        self.gru = nn.GRU(self.emb_dim, self.hidden_dim, num_layers=self.n_layers, dropout=0.5, bidirectional=True)
        self.linear = nn.Linear(in_features=self.hidden_dim*2, out_features=1)
        self.act = nn.Sigmoid()
            
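    def forward(self, seq):
        # seq: (seq_len, batch) token indices
        emb = self.embedding(seq)  # (seq_len, batch, emb_dim)
        outputs, hidden = self.gru(emb)
        # hidden: (n_layers * 2, batch, hidden_dim); the last two entries are the
        # final forward and backward states of the top layer
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        outputs = self.linear(hidden)  # (batch, 1)
        preds = self.act(outputs)  # probabilities in (0, 1)

        return preds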
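The shapes involved in the forward pass can be checked on random data. The sketch below uses small, arbitrary sizes (assumptions for illustration only) and no dropout. It shows that, for a bidirectional GRU, the last two entries of `hidden` are the top layer's final forward and backward states.

```python
import torch
import torch.nn as nn

# Small, arbitrary sizes chosen only for illustration.
seq_len, batch_size, emb_dim, hidden_dim, n_layers = 7, 4, 5, 3, 2

gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
emb = torch.randn(seq_len, batch_size, emb_dim)  # (seq_len, batch, emb_dim)

outputs, hidden = gru(emb)
# outputs: (seq_len, batch, 2 * hidden_dim), top layer, both directions
# hidden:  (n_layers * 2, batch, hidden_dim), final state per layer and direction

# Concatenate the top layer's final forward and backward states.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)  # (batch, 2 * hidden_dim)

print(outputs.shape, hidden.shape, final.shape)
```

The forward state in `hidden[-2]` matches the forward half of `outputs` at the last time step, while the backward state in `hidden[-1]` matches the backward half of `outputs` at the first time step.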

In [90]:
em_sz = 200
nh = 300
vocab_size = len(TEXT.vocab)
num_layers = 2

model = RNNBaseline(vocab_size=vocab_size, hidden_dim=nh, emb_dim=em_sz, n_layers=num_layers, device=device); model

RNNBaseline(
  (embedding): Embedding(202477, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [91]:
model.cuda()

RNNBaseline(
  (embedding): Embedding(202477, 200)
  (gru): GRU(200, 300, num_layers=2, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=600, out_features=1, bias=True)
  (act): Sigmoid()
)

### The training loop (3 points)

Define the optimizer and the loss function.

In [92]:
opt = optim.Adam(model.parameters())
loss_func = nn.BCELoss()

Define the stopping criteria.

In [93]:
epochs = 5
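A fixed number of epochs is the simplest stopping criterion. One common alternative is early stopping on the validation loss. The sketch below is only an illustration: the helper name `best_epoch_with_patience`, the `patience` value, and the loss values are all hypothetical and are not part of the assignment code.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for patience-based early stopping.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs. Epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical validation losses, one per epoch.
hypothetical_val_losses = [0.60, 0.52, 0.55, 0.58, 0.70]
best, stopped = best_epoch_with_patience(hypothetical_val_losses, patience=2)
print(best, stopped)
```

In a real loop, the model weights from the best epoch would typically be saved and restored after stopping.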

In [95]:
%%time
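for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:
        x = batch.text
        y = batch.label
        opt.zero_grad()
        preds = model(x).squeeze(1)  # (batch, 1) -> (batch,)
        loss = loss_func(preds, y.float())
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # loss.item() is already a batch mean, so dividing the sum by the number of
    # examples reports roughly (mean loss / batch size), not the mean loss itself.
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_iter:
            x = batch.text
            y = batch.label
            # squeeze so that preds and y have the same shape, as in training
            preds = model(x).squeeze(1)
            loss = loss_func(preds, y.float())
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))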



Epoch: 1, Training Loss: 0.006406023572172437, Validation Loss: 0.005995762546857198
Epoch: 2, Training Loss: 0.003078684065597398, Validation Loss: 0.005681985412041346
Epoch: 3, Training Loss: 0.0010622044105614934, Validation Loss: 0.009135842084884644
Epoch: 4, Training Loss: 0.00043887639168782957, Validation Loss: 0.011650349374612172
Epoch: 5, Training Loss: 0.0002040364532415489, Validation Loss: 0.017380772598584494
CPU times: user 7min, sys: 1min 46s, total: 8min 46s
Wall time: 9min 53s
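The losses printed above are sums of per-batch mean losses divided by the number of examples, so they are roughly a factor of the batch size smaller than the mean loss per example. A minimal sketch of a size-weighted average, assuming each batch loss is a mean over that batch (the default reduction of nn.BCELoss); the helper name and the numbers are hypothetical:

```python
def mean_loss_per_example(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [0.70, 0.50, 0.40]
batch_sizes = [64, 64, 40]

per_example = mean_loss_per_example(batch_losses, batch_sizes)
# Normalization used in the loop above: sum of batch means / number of examples.
as_in_loop = sum(batch_losses) / sum(batch_sizes)

print(per_example, as_in_loop)
```

Either normalization preserves the trend across epochs, which is why the rising validation loss after epoch 2 is still a valid sign of overfitting.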


### Calculate performance of the trained model (5 points)

In [96]:
# Each iteration overwrites x and y, so after the loop they hold only the last test batch.
for batch in test_iter:
    x = batch.text
    y = batch.label

In [103]:
predictions = model(x).squeeze()

In [104]:
predictions

tensor([1.0000, 0.9999, 1.0000, 1.0000, 1.0000, 0.0889, 0.9999, 0.9995, 0.9982,
        0.9999, 0.9993, 0.9995, 0.9944, 0.9999, 0.9999, 0.9999, 1.0000, 1.0000,
        0.9999, 0.9817, 1.0000, 1.0000, 0.9948, 0.9999, 0.9999, 0.9982, 0.9999,
        1.0000, 0.9968, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9995, 0.9997,
        0.9348, 1.0000, 0.0728, 0.9998], device='cuda:2',
       grad_fn=<SqueezeBackward0>)

In [105]:
y

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2')

In [112]:
rounded_preds = torch.round(predictions)
correct = (rounded_preds == y.float()).float() 
accuracy = correct.sum()/len(correct)
print(accuracy)

tensor(0.9500, device='cuda:2')


In [113]:
correct

tensor([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 0., 1.], device='cuda:2')

In [116]:
correct_true_amount =  (correct * y.float()).sum()
recall = correct_true_amount / y.float().sum()
print(recall)

tensor(0.9500, device='cuda:2')


In [117]:
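# The denominator here sums raw probabilities; summing rounded_preds would count predicted positives.
precision = correct_true_amount / predictions.sum()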
print(precision)

tensor(0.9985, device='cuda:2', grad_fn=<DivBackward0>)


In [119]:
f1 = 2*precision*recall / (precision + recall)
print(f1)

tensor(0.9736, device='cuda:2', grad_fn=<DivBackward0>)


Write down the calculated performance

### Accuracy:  0.9500
### Precision: 0.9985
### Recall: 0.9500
### F1: 0.9736
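Precision, recall, and F1 are conventionally defined on thresholded predictions and computed over the whole test set. The sketch below accumulates confusion counts over every batch in plain Python. The function names, the 0.5 threshold, and the toy batches are assumptions for illustration. With the tensors from this notebook, each batch's probabilities and labels would first be converted with `.tolist()`.

```python
def confusion_counts(batches, threshold=0.5):
    """Accumulate TP, FP, TN, FN over (probabilities, labels) batches."""
    tp = fp = tn = fn = 0
    for probs, labels in batches:
        for p, y in zip(probs, labels):
            pred = 1 if p >= threshold else 0
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 0:
                tn += 1
            else:
                fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1), using 0.0 for empty denominators."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical batches of (predicted probabilities, true labels).
toy_batches = [([0.9, 0.2, 0.7], [1, 0, 0]), ([0.4, 0.8], [1, 1])]
counts = confusion_counts(toy_batches)
accuracy, precision, recall, f1 = classification_metrics(*counts)
print(counts, accuracy, precision, recall, f1)
```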

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

### 1. ?
### 2. ?
### 3. ?