# **Practical 6: Deep Learning for NLP**
for Machine Learning Summer School (MLSS) 2020 by Genta Indra Winata.

https://mlss.telkomuniversity.ac.id/

This tutorial is divided into three main sections:
1. Implement a simple neural network with an embedding layer and a single linear layer, trained from scratch.
2. Leverage pre-trained word embeddings for transfer learning.
3. Explore techniques to improve the model's robustness and generalization.

We will use the existing IMDB sentiment analysis task.

## Train from scratch

First, we import the required modules from the PyTorch library and define the hyper-parameters we will use to train our model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
print(f"PyTorch version: {torch.__version__}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {device}")

PyTorch version: 1.6.0+cu101
device: cuda


Define the hyper-parameters here

In [None]:
from collections import namedtuple

args = {
  "hidden_size": 100,
  "output_size": 1,
  "lr": 1e-4,
  "seed": 1234,
  "max_vocab_size": 1000,
  "batch_size": 64,
  "num_epoch": 10,
  "dropout": 0.0
}
args = namedtuple('Struct', args.keys())(*args.values())

### Data Preprocessing
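We will load the IMDB sentiment analysis dataset by downloading it through the `torchtext` package.
Using `data.Field`, we specify the tokenizer. Note that this practical relies on the `Field`-based API of the `torchtext` release that accompanies PyTorch 1.6; newer releases moved it to `torchtext.legacy` and later removed it.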

In [None]:
from torchtext import datasets
from torchtext import data
import random

TEXT = data.Field(tokenize = 'spacy', include_lengths = True) # add spacy tokenizer
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
random.seed(args.seed) # random.seed() returns None, so seed first and pass the resulting state
train_data, valid_data = train_data.split(random_state = random.getstate())

We can check the data split and print a training sample.

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of valid examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(f"Text: {vars(train_data.examples[0])['text']}")
print(f"Label: {vars(train_data.examples[0])['label']}")

Number of training examples: 17500
Number of valid examples: 7500
Number of testing examples: 25000
Text: ['This', 'film', 'is', 'the', 'freshman', 'effort', 'of', 'Stephanie', 'Beaton', 'and', 'her', 'new', 'production', 'company', '.', 'While', 'it', 'suffers', 'from', 'a', 'few', 'problems', ',', 'as', 'every', 'low', 'budget', 'production', 'does', ',', 'it', 'is', 'a', 'good', 'start', 'for', 'Ms.', 'Beaton', 'and', 'her', 'company.<br', '/><br', '/>The', 'story', 'is', 'not', 'terribly', 'new', 'having', 'been', 'done', 'in', 'films', 'like', 'The', 'Burning', 'and', 'every', 'Friday', 'the', '13th', 'since', 'part', '2', '.', 'But', ',', 'the', 'performances', 'are', 'heartfelt', '.', 'So', 'many', 'big', 'budget', 'movies', 'just', 'have', 'the', 'actors', 'going', 'through', 'the', 'motions', ',', 'its', 'always', 'nice', 'to', 'see', 'actors', 'really', 'trying', 'to', 'hone', 'their', 'craft.<br', '/><br', '/>The', 'story', 'deals', 'with', 'the', 'murder(and', 'possible', '

We can also utilize the `torchtext` field classes for building the vocabulary and labels.

In [None]:
TEXT.build_vocab(train_data, max_size=args.max_vocab_size)
LABEL.build_vocab(train_data)

In [None]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 1002
Unique tokens in LABEL vocabulary: 2
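
The count of 1002 is consistent with `max_size=1000` plus the two special tokens `<unk>` and `<pad>`. The bookkeeping can be sketched in plain Python as follows. `build_vocab` and `numericalize` are hypothetical helpers written for illustration, not the `torchtext` implementation.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Hypothetical helper: keep the `max_size` most frequent tokens plus specials."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [["I", "love", "cat"], ["I", "love", "black", "cat"], ["I", "love"]]
itos, stoi = build_vocab(corpus, max_size=3)
print(len(itos))                                  # kept tokens + 2 specials
print(numericalize(["I", "love", "dog"], stoi))   # "dog" is out of vocabulary
```

With a small `max_size`, rare words such as `black` fall out of the vocabulary and are all mapped to the same `<unk>` index.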


### Build the model

Let's define a linear layer model. This model consists of an embedding layer and a linear layer as the decoder.

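Padding: sentences in a batch have different lengths, so shorter sentences are filled with a padding index (0 here) up to the length of the longest one: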
I love cat = [1 2 3]
I love black cat = [1 2 4 3]
I love = [1 2]
[
[1 2 4 3]
[1 2 3 0]
[1 2 0 0]
]
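
The padding scheme above can be reproduced with a few lines of plain Python. This is only a sketch: `pad_batch` is a hypothetical helper, and in the practical itself the iterator handles padding and sorting.

```python
def pad_batch(sequences, pad_idx=0):
    """Sort by length (longest first) and right-pad every sequence with pad_idx."""
    ordered = sorted(sequences, key=len, reverse=True)
    max_len = len(ordered[0])
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in ordered]
    lengths = [len(seq) for seq in ordered]
    return padded, lengths

# "I love cat", "I love black cat", "I love" with the indices used above
batch = [[1, 2, 3], [1, 2, 4, 3], [1, 2]]
padded, lengths = pad_batch(batch)
for row in padded:
    print(row)
print(lengths)
```

Returning the original lengths alongside the padded batch is what later lets a model tell real tokens from padding.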

In [None]:
class SingleNNLayer(nn.Module):
  def __init__(self, vocab_size, hidden_size, output_size, dropout=0.0, pad_idx=0):
    super(SingleNNLayer, self).__init__()
    self.emb = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
    self.layer = nn.Linear(hidden_size, output_size)
    self.drop = nn.Dropout(dropout)

  def forward(self, inputs, inputs_len):
    """
      inputs: LongTensor (seq_len, batch_size)
    """
    inputs = inputs.transpose(0, 1) # (batch_size, seq_len)
    embedded_inputs = self.drop(self.emb(inputs)) # (batch_size, seq_len, emb_size)
    pooled_inputs = F.avg_pool2d(embedded_inputs, (embedded_inputs.shape[1], 1)).squeeze(1)  # (batch_size, emb_size)
    outputs = self.drop(self.layer(pooled_inputs)) # (batch_size, output_size)
    return outputs

We use accuracy as our metric to evaluate the model. We apply the sigmoid function to the model output and round the result: a probability above 0.5 is classified as positive, otherwise as negative.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    rounded_preds = torch.round(torch.sigmoid(preds)) # threshold 0.5
    correct = (rounded_preds == y).float() # convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
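
Because the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is the same as checking the sign of the raw model output (the logit). The plain-Python sketch below illustrates this; the helper names and the example values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_label(logit):
    """Threshold the sigmoid probability at 0.5."""
    return 1 if sigmoid(logit) > 0.5 else 0

def accuracy(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [predict_label(z) for z in logits]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

logits = [2.3, -0.7, 0.1, -1.5]   # hypothetical raw model outputs
labels = [1, 0, 0, 0]
print(accuracy(logits, labels))
# The same decisions follow from the sign of the logit alone
print([predict_label(z) == int(z > 0) for z in logits])
```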

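We define the `train` function, which iterates over the batches produced by the iterator.

For each batch, `optimizer.zero_grad()` is called to zero out the gradients accumulated in the previous step.
Then we compute the loss and accuracy, backpropagate, and update the parameters. The function returns the loss and accuracy averaged over all batches.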

In [None]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward() # compute gradient
        optimizer.step() # update parameters
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define the `evaluate` function to compute the loss and accuracy on the validation or test set.

`model.eval()` switches the model to evaluation mode, which disables dropout.

`torch.no_grad()` turns off gradient tracking, since no parameters are updated during evaluation.

In [None]:
def evaluate(model, iterator):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    criterion = nn.BCEWithLogitsLoss()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define the `train_model` function, which trains the model for `args.num_epoch` epochs and saves a checkpoint whenever the validation loss improves.

In [None]:
import time

def train_model(model, train_iterator, valid_iterator, saved_name="best_model.pt", optimizer=None):
  criterion = nn.BCEWithLogitsLoss()
  if optimizer is None:
    optimizer = optim.Adam(model.parameters())

  model.to(device)
  criterion.to(device)

  best_valid_loss = float('inf')
  for i in range(args.num_epoch):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    end_time = time.time()
    valid_loss, valid_acc = evaluate(model, valid_iterator)
    elapsed_time = end_time - start_time

    print(f"Epoch: {i+1} train loss:{train_loss:.3f} acc:{train_acc:.3f} valid loss:{valid_loss:.3f} acc:{valid_acc:.3f} time:{elapsed_time:.3f}s")

    # Choose the best valid loss
    if best_valid_loss > valid_loss:
      torch.save(model.state_dict(), saved_name)
      best_valid_loss = valid_loss
      print("Save model")

### Training
The data is split into train, valid, and test.
Each batch is sorted according to the sequence length.

In [None]:
# Data
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = args.batch_size, 
    sort_within_batch = True,
    device = device)

Let's create our model and start the training.

In [None]:
# Model to train
model = SingleNNLayer(len(TEXT.vocab), args.hidden_size, args.output_size)
train_model(model, train_iterator, valid_iterator, "best_model.pt")

Epoch: 1 train loss:0.672 acc:0.639 valid loss:0.644 acc:0.697 time:1.686s
Save model
Epoch: 2 train loss:0.592 acc:0.740 valid loss:0.550 acc:0.755 time:1.616s
Save model
Epoch: 3 train loss:0.499 acc:0.787 valid loss:0.478 acc:0.794 time:1.584s
Save model
Epoch: 4 train loss:0.439 acc:0.816 valid loss:0.433 acc:0.815 time:1.609s
Save model
Epoch: 5 train loss:0.402 acc:0.833 valid loss:0.407 acc:0.827 time:1.597s
Save model
Epoch: 6 train loss:0.379 acc:0.844 valid loss:0.391 acc:0.832 time:1.601s
Save model
Epoch: 7 train loss:0.364 acc:0.849 valid loss:0.381 acc:0.838 time:1.613s
Save model
Epoch: 8 train loss:0.352 acc:0.855 valid loss:0.375 acc:0.840 time:1.595s
Save model
Epoch: 9 train loss:0.343 acc:0.860 valid loss:0.368 acc:0.842 time:1.599s
Save model
Epoch: 10 train loss:0.337 acc:0.863 valid loss:0.364 acc:0.845 time:1.574s
Save model


### Evaluation
We evaluate our best model.

In [None]:
model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

test loss:0.354 acc:0.850


## Transfer Learning with Pre-trained Word Embeddings

We will use pre-trained GloVe word embeddings.

### Data Preprocessing
We load `glove.6B.100d` embeddings from the `torchtext` package.

In [None]:
TEXT.build_vocab(train_data, 
  max_size = args.max_vocab_size, 
  vectors = "glove.6B.100d", 
  unk_init = torch.Tensor.normal_)

In [None]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 1002


### Training

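We look up the indices of the padding and unknown tokens (`PAD_IDX` and `UNK_IDX`), copy the pre-trained vectors into the model's embedding layer, and set the embedding rows of these two tokens to zero.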

In [None]:
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

# Load pretrained word embeddings
glove_model = SingleNNLayer(len(TEXT.vocab), args.hidden_size, args.output_size, pad_idx=PAD_IDX)
pretrained_embeddings = TEXT.vocab.vectors
glove_model.emb.weight.data.copy_(pretrained_embeddings)
glove_model.emb.weight.data[UNK_IDX] = torch.zeros(args.hidden_size)
glove_model.emb.weight.data[PAD_IDX] = torch.zeros(args.hidden_size)

train_model(glove_model, train_iterator, valid_iterator, "best_glove_model.pt")

Epoch: 1 train loss:0.673 acc:0.639 valid loss:0.639 acc:0.723 time:1.497s
Save model
Epoch: 2 train loss:0.593 acc:0.741 valid loss:0.537 acc:0.781 time:1.459s
Save model
Epoch: 3 train loss:0.496 acc:0.798 valid loss:0.458 acc:0.816 time:1.478s
Save model
Epoch: 4 train loss:0.432 acc:0.823 valid loss:0.416 acc:0.832 time:1.467s
Save model
Epoch: 5 train loss:0.396 acc:0.839 valid loss:0.394 acc:0.835 time:1.507s
Save model
Epoch: 6 train loss:0.373 acc:0.845 valid loss:0.381 acc:0.840 time:1.493s
Save model
Epoch: 7 train loss:0.358 acc:0.851 valid loss:0.374 acc:0.840 time:1.503s
Save model
Epoch: 8 train loss:0.347 acc:0.856 valid loss:0.366 acc:0.845 time:1.454s
Save model
Epoch: 9 train loss:0.338 acc:0.861 valid loss:0.363 acc:0.846 time:1.454s
Save model
Epoch: 10 train loss:0.332 acc:0.863 valid loss:0.362 acc:0.846 time:1.492s
Save model
Epoch: 11 train loss:0.326 acc:0.866 valid loss:0.359 acc:0.848 time:1.467s
Save model
Epoch: 12 train loss:0.322 acc:0.867 valid loss:0.35

### Evaluation

In [None]:
glove_model.load_state_dict(torch.load('best_glove_model.pt'))
test_loss, test_acc = evaluate(glove_model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

test loss:0.344 acc:0.856


## Improving the Model's Robustness and Generalization

We can apply simple techniques to improve the generalization of the model.

### Adding Dropout
Dropout: A Simple Way to Prevent Neural Networks from Overfitting 
http://jmlr.org/papers/v15/srivastava14a.html
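
As a rough illustration of the idea, the sketch below implements "inverted" dropout in plain Python: during training each value is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, which is the scaling convention `nn.Dropout` follows; in evaluation mode the input passes through unchanged. The function and variable names are made up for this example.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(1234)
activations = [1.0] * 10
train_out = dropout(activations, p=0.2, training=True, rng=rng)
eval_out = dropout(activations, p=0.2, training=False, rng=rng)
print(train_out)   # a mix of zeros and rescaled values
print(eval_out)    # unchanged
```

The rescaling keeps the expected value of each activation the same in training and evaluation, so no correction is needed at test time.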

In [None]:
# Load pretrained word embeddings
glove_model = SingleNNLayer(len(TEXT.vocab), args.hidden_size, args.output_size, pad_idx=PAD_IDX, dropout=0.2)
pretrained_embeddings = TEXT.vocab.vectors
glove_model.emb.weight.data.copy_(pretrained_embeddings)
glove_model.emb.weight.data[UNK_IDX] = torch.zeros(args.hidden_size)
glove_model.emb.weight.data[PAD_IDX] = torch.zeros(args.hidden_size)
train_model(glove_model, train_iterator, valid_iterator, "best_glove_model_with_dropout.pt")

Epoch: 1 train loss:0.676 acc:0.613 valid loss:0.648 acc:0.713 time:1.527s
Save model
Epoch: 2 train loss:0.615 acc:0.691 valid loss:0.565 acc:0.780 time:1.508s
Save model
Epoch: 3 train loss:0.541 acc:0.736 valid loss:0.491 acc:0.813 time:1.505s
Save model
Epoch: 4 train loss:0.491 acc:0.759 valid loss:0.447 acc:0.825 time:1.503s
Save model
Epoch: 5 train loss:0.462 acc:0.766 valid loss:0.421 acc:0.831 time:1.511s
Save model
Epoch: 6 train loss:0.443 acc:0.775 valid loss:0.402 acc:0.834 time:1.530s
Save model
Epoch: 7 train loss:0.429 acc:0.777 valid loss:0.391 acc:0.838 time:1.515s
Save model
Epoch: 8 train loss:0.420 acc:0.786 valid loss:0.381 acc:0.844 time:1.519s
Save model
Epoch: 9 train loss:0.415 acc:0.783 valid loss:0.375 acc:0.844 time:1.529s
Save model
Epoch: 10 train loss:0.408 acc:0.786 valid loss:0.371 acc:0.846 time:1.515s
Save model
Epoch: 11 train loss:0.404 acc:0.793 valid loss:0.367 acc:0.846 time:1.504s
Save model
Epoch: 12 train loss:0.399 acc:0.793 valid loss:0.36

In [None]:
glove_model.load_state_dict(torch.load('best_glove_model_with_dropout.pt'))
test_loss, test_acc = evaluate(glove_model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

test loss:0.341 acc:0.859


### Adding a non-linear function

We can add a non-linear activation function (here ReLU) between layers to introduce non-linearity into the model.
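
The general motivation can be seen in a scalar toy example: two stacked linear maps without an activation collapse into a single linear map, while inserting a ReLU between them does not. The weights below are arbitrary values chosen for illustration only.

```python
def linear(x, w, b):
    return w * x + b

def relu(x):
    return max(0.0, x)

# Arbitrary illustrative weights
w1, b1, w2, b2 = 2.0, -1.0, 3.0, 0.5

def stacked_linear(x):
    return linear(linear(x, w1, b1), w2, b2)

def collapsed_linear(x):
    # The same function as stacked_linear, written as one affine map
    return linear(x, w2 * w1, w2 * b1 + b2)

def stacked_with_relu(x):
    return linear(relu(linear(x, w1, b1)), w2, b2)

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(x, stacked_linear(x), collapsed_linear(x), stacked_with_relu(x))
```

For negative pre-activations the ReLU version departs from the purely linear one, which is exactly the extra expressiveness the activation provides.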

In [None]:
class SingleNNLayerWithRELU(nn.Module):
  def __init__(self, vocab_size, hidden_size, output_size, dropout=0.0, pad_idx=0):
    super(SingleNNLayerWithRELU, self).__init__()
    self.emb = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
    self.layer = nn.Linear(hidden_size, output_size)
    self.drop = nn.Dropout(dropout)

  def forward(self, inputs, inputs_len):
    """
      inputs: LongTensor (seq_len, batch_size)
    """
    inputs = inputs.transpose(0, 1)
    embedded_inputs = self.drop(self.emb(inputs)) # (batch_size, seq_len, emb_size)
    pooled_inputs = F.avg_pool2d(embedded_inputs, (embedded_inputs.shape[1], 1)).squeeze(1)  # (batch_size, emb_size)
    outputs = self.drop(self.layer(F.relu(pooled_inputs))) # (batch_size, output_size)
    return outputs

In [None]:
# Load pretrained word embeddings
glove_model = SingleNNLayerWithRELU(len(TEXT.vocab), args.hidden_size, args.output_size, pad_idx=PAD_IDX, dropout=0.3)
pretrained_embeddings = TEXT.vocab.vectors
glove_model.emb.weight.data.copy_(pretrained_embeddings)
glove_model.emb.weight.data[UNK_IDX] = torch.zeros(args.hidden_size)
glove_model.emb.weight.data[PAD_IDX] = torch.zeros(args.hidden_size)
train_model(glove_model, train_iterator, valid_iterator, "best_glove_model_with_relu.pt")

Epoch: 1 train loss:0.691 acc:0.532 valid loss:0.668 acc:0.613
Save model
Epoch: 2 train loss:0.675 acc:0.609 valid loss:0.597 acc:0.706
Save model
Epoch: 3 train loss:0.648 acc:0.650 valid loss:0.528 acc:0.747
Save model
Epoch: 4 train loss:0.617 acc:0.679 valid loss:0.480 acc:0.774
Save model
Epoch: 5 train loss:0.589 acc:0.704 valid loss:0.443 acc:0.798
Save model
Epoch: 6 train loss:0.565 acc:0.718 valid loss:0.425 acc:0.815
Save model
Epoch: 7 train loss:0.543 acc:0.725 valid loss:0.424 acc:0.823
Save model
Epoch: 8 train loss:0.527 acc:0.731 valid loss:0.418 acc:0.829
Save model
Epoch: 9 train loss:0.515 acc:0.738 valid loss:0.421 acc:0.834
Epoch: 10 train loss:0.504 acc:0.739 valid loss:0.427 acc:0.835
Epoch: 11 train loss:0.492 acc:0.743 valid loss:0.437 acc:0.839
Epoch: 12 train loss:0.486 acc:0.744 valid loss:0.442 acc:0.842
Epoch: 13 train loss:0.477 acc:0.744 valid loss:0.452 acc:0.845
Epoch: 14 train loss:0.477 acc:0.745 valid loss:0.457 acc:0.848
Epoch: 15 train loss:0.47

In [None]:
glove_model.load_state_dict(torch.load('best_glove_model_with_relu.pt'))
test_loss, test_acc = evaluate(glove_model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

test loss:0.432 acc:0.821
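
Note that `train_model` selects the checkpoint by validation loss, not by validation accuracy. Replaying the validation losses from the log above through the same rule shows which epochs trigger a save. This is a plain-Python sketch; `checkpoint_epochs` is a hypothetical helper that mirrors the `best_valid_loss > valid_loss` test.

```python
# Validation losses for epochs 1-14, copied from the training log above
valid_loss = [0.668, 0.597, 0.528, 0.480, 0.443, 0.425, 0.424,
              0.418, 0.421, 0.427, 0.437, 0.442, 0.452, 0.457]

def checkpoint_epochs(losses):
    """Return the 1-based epochs at which the loss improves on the best so far."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if best > loss:
            saved.append(epoch)
            best = loss
    return saved

saved = checkpoint_epochs(valid_loss)
print(saved)       # epochs at which "Save model" is printed
print(saved[-1])   # the checkpoint that is reloaded for testing
```

The reloaded model is therefore the one from the last improving epoch, even though the validation accuracy keeps rising in the later epochs shown in the log.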


# Practical 7: RNNs and Transformers

The tutorial is divided into two sections:
1. Implement an LSTM-based language model on the English Penn Treebank data.
2. Implement an LSTM and a Transformer model for the IMDB sentiment classification task.

## LSTM Language Model

References:

Some parts of the code are taken from https://github.com/salesforce/awd-lstm-lm/ and https://github.com/gentaiscool/multi-task-cs-lm

### Download dataset
English Penn Tree Bank

In [None]:
import urllib.request
import os

TRAIN_PATH = "https://raw.githubusercontent.com/tmatha/lstm/master/ptb.train.txt"
VALID_PATH = "https://raw.githubusercontent.com/tmatha/lstm/master/ptb.valid.txt"
TEST_PATH = "https://raw.githubusercontent.com/tmatha/lstm/master/ptb.test.txt"

### Build the model
Let's define our model `RNNModel`. You can set the model to instantiate an `RNN`, `LSTM`, or `GRU`. Setting `tie_weights=True` shares one weight matrix between the input embedding and the output (decoder) layer, which requires `nhid` to equal `ninp`.
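
To get a feel for what weight tying saves, the sketch below counts the parameters of the embedding and the decoder for the sizes used in this practical (a vocabulary of 10001 tokens and 200-dimensional embeddings and hidden states). It is plain arithmetic with hypothetical helper names, not a call into PyTorch.

```python
def embedding_params(ntoken, ninp):
    return ntoken * ninp

def decoder_params(nhid, ntoken):
    # Weight matrix plus bias vector of a linear layer from nhid to ntoken
    return nhid * ntoken + ntoken

def total_io_params(ntoken, ninp, nhid, tie_weights):
    if tie_weights and ninp != nhid:
        raise ValueError("tying requires ninp == nhid")
    enc = embedding_params(ntoken, ninp)
    dec = decoder_params(nhid, ntoken)
    # When tied, the decoder weight is the encoder weight, so count it once
    return enc + dec - (enc if tie_weights else 0)

ntoken, ninp, nhid = 10001, 200, 200
untied = total_io_params(ntoken, ninp, nhid, tie_weights=False)
tied = total_io_params(ntoken, ninp, nhid, tie_weights=True)
print(untied, tied, untied - tied)
```

Tying removes one full `ntoken x ninp` matrix, which is a large share of the parameters in a small language model like this one.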

In [None]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from torch.autograd import Variable

class RNNModel(nn.Module):
    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(RNNModel, self).__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp

        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)

        self.decoder = nn.Linear(nhid, ntoken)

        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight
        self.tie_weights = tie_weights
        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

        self.init_weights()

    def init_weights(self):
        initrange = 0.1

        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))
        
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)

        # Normalize over the vocabulary dimension (an explicit dim avoids the implicit-dim warning)
        decoded = F.log_softmax(self.decoder(output.view(output.size(0)*output.size(1), output.size(2))), dim=1)
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

    def init_hidden(self, bsz):
        # Zero initial states with the same device and dtype as the model parameters
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

We define the `Dictionary` class. This helper class stores the map from words to their indices, and vice versa in `word2idx` and `idx2word`.

In [None]:
import os
import torch

class Dictionary:
    def __init__(self):
        self.word2idx = {} # {"apple": 1, "hello": 2}
        self.idx2word = {} # {1 : "apple", 2: "hello"}

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word[len(self.idx2word)] = word
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

This is the `Corpus` class to store our preprocessed train, valid, and test data.

In [None]:
class Corpus:
    def __init__(self):
        self.dictionary = Dictionary()
        self.train = self.tokenize(TRAIN_PATH)
        print("train:", len(self.dictionary))
        self.valid = self.tokenize(VALID_PATH)
        print("valid:", len(self.dictionary))
        self.test = self.tokenize(TEST_PATH)
        print("test:", len(self.dictionary))
        print("dictionary size:", len(self.dictionary))

    def tokenize(self, path):
        """Tokenizes a text file."""

        # Add words to the dictionary
        self.dictionary.add_word("<oov>")

        with urllib.request.urlopen(path) as f:
            tokens = 0
            for line in f:
                line = line.strip().decode("utf-8")
                line = line.replace("  ", " ")
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with urllib.request.urlopen(path) as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                line = line.decode("utf-8")
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids

### Training
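
The training cell in this section reshapes the token stream with `batchify` and cuts it into chunks with `get_batch`. The plain-Python sketch below mimics both on a toy stream of letters to show the resulting layout; it works on nested lists instead of tensors, and takes `bptt` as an explicit argument, so it is an illustration rather than a drop-in replacement.

```python
def batchify(tokens, bsz):
    """Trim the stream to a multiple of bsz and lay it out as nbatch rows of bsz."""
    nbatch = len(tokens) // bsz
    trimmed = tokens[: nbatch * bsz]
    columns = [trimmed[i * nbatch:(i + 1) * nbatch] for i in range(bsz)]
    # Transpose so each row is one time step across the bsz parallel streams
    return [list(row) for row in zip(*columns)]

def get_batch(source, i, bptt):
    """Inputs are rows i..i+seq_len-1; targets are the same rows shifted by one."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target

stream = list("abcdefghijklmn")     # 14 tokens
source = batchify(stream, bsz=3)    # 4 time steps x 3 streams; the last 2 tokens are dropped
data, target = get_batch(source, 0, bptt=2)
for row in source:
    print(row)
print(data, target)
```

Each column is a contiguous slice of the original text, so the hidden state can be carried from one chunk to the next within a column.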

In [None]:
import argparse
import time
import math
import os
import unicodedata
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
from collections import namedtuple

args = {
    "name": "name",
    "model": "LSTM",
    "emsize": 200,
    "nhid": 200,
    "nlayers": 2,
    "lr": 20,
    "clip": 0.25,
    "epochs": 20,
    "batch_size": 20,
    "bptt": 35,
    "dropout": 0.2,
    "tied": False,
    "pad": True,
    "seed": 1234,
    "cuda": True,
    "save": ".",
    "log_path": ".",
    "log_interval": 200
}

args = namedtuple('Struct', args.keys())(*args.values())
log_name = str(args.name) + "_model" + str(args.model) + "_layers" + str(args.nlayers) + "_nhid" + str(args.nhid) + "_emsize" + str(args.emsize) + ".txt"
log_file = open(args.log_path + "/" + log_name, "w+")

save_path = args.save + "/" + log_name + ".pt"

is_pad = False
if args.pad:
    is_pad = args.pad

torch.manual_seed(args.seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(args.seed)

# Load data
corpus = Corpus()

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    if args.cuda:
        data = data.cuda()
    return data


eval_batch_size = 32

train_data = batchify(corpus.train, args.batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

# Build the model
ntokens = len(corpus.dictionary)
model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied)
print(model)
if args.cuda:
    model.cuda()

# Training code
def repackage_hidden(h):
    """Wraps hidden states in new Tensors,
    to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

def get_batch(source, i, evaluation=False):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

word2idx = corpus.dictionary.word2idx
idx2word = corpus.dictionary.idx2word
num_word = len(corpus.dictionary.idx2word)

def evaluate(data_source, type_evaluation="val"):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    criterion = nn.CrossEntropyLoss()

    # No gradients are needed during evaluation
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i, evaluation=True)
            output, hidden = model(data, hidden)
            output_flat = output.view(-1, ntokens)
            # Weight each chunk's loss by its number of time steps
            total_loss += len(data) * criterion(output_flat, targets).item()
            hidden = repackage_hidden(hidden)
    return total_loss / len(data_source)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)
    criterion = nn.CrossEntropyLoss()
    
    batch_idx = 0

    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        
        output, hidden = model(data, hidden)

        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        batch_idx += data.size(1)

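        # Clip the gradient norm to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        # Plain SGD keeps no state, so re-creating it each step simply picks up the current lr
        opt = optim.SGD(model.parameters(), lr=lr)
        opt.step()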

        total_loss += loss.data
        
        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss.item() / args.log_interval
            elapsed = time.time() - start_time

            log = '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | word_loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))
            print(log)
            total_loss = 0
            start_time = time.time()

lr = args.lr
best_val_loss = None
counter = 0

for epoch in range(1, args.epochs+1):
    epoch_start_time = time.time()
    train()
    val_loss = evaluate(val_data, "dev")

    log = '-' * 89 + "\n" + '| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                        val_loss, math.exp(val_loss)) + '-' * 89
    print(log)

    # Save the model if the validation loss is the best we've seen so far.
    if not best_val_loss or val_loss < best_val_loss:
        with open(save_path, 'wb') as f:
            torch.save(model, f)
        best_val_loss = val_loss
        counter = 0
    else:
        lr /= 4.0
        counter += 1

        if counter == 5:
            break

# Load the best saved model.
with open(save_path, 'rb') as f:
    model = torch.load(f)

# Run on test data.
test_loss = evaluate(test_data, "test")

log = ('=' * 89) + '| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)) + ('=' * 89)
print(log)

train: 10001
valid: 10001
test: 10001
dictionary size: 10001
RNNModel(
  (drop): Dropout(p=0.2, inplace=False)
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200, num_layers=2, dropout=0.2)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)




| epoch   1 |   200/ 1327 batches | lr 20.00 | ms/batch  8.67 | word_loss  6.93 | ppl  1019.09
| epoch   1 |   400/ 1327 batches | lr 20.00 | ms/batch  8.32 | word_loss  6.30 | ppl   545.50
| epoch   1 |   600/ 1327 batches | lr 20.00 | ms/batch  8.35 | word_loss  6.04 | ppl   421.60
| epoch   1 |   800/ 1327 batches | lr 20.00 | ms/batch  8.31 | word_loss  5.77 | ppl   319.78
| epoch   1 |  1000/ 1327 batches | lr 20.00 | ms/batch  8.30 | word_loss  5.63 | ppl   278.45
| epoch   1 |  1200/ 1327 batches | lr 20.00 | ms/batch  8.33 | word_loss  5.47 | ppl   237.58
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 11.36s | valid loss  5.38 | valid ppl   217.10-----------------------------------------------------------------------------------------
| epoch   2 |   200/ 1327 batches | lr 20.00 | ms/batch  8.42 | word_loss  5.38 | ppl   217.88
| epoch   2 |   400/ 1327 batches | lr 20.00 | ms/batch  8.45 | word_loss  5.31 | 
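
The `ppl` column in the log is the perplexity, computed as the exponential of the average per-token loss (`math.exp(cur_loss)` in the training code). The sketch below, with hypothetical helper names, shows the relationship and checks it against the validation figures reported after the first epoch; the match is only approximate because the logged loss is rounded to two decimals.

```python
import math

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def mean_nll(token_probs):
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that is uniformly unsure over the 10001-word vocabulary
uniform_ppl = perplexity(mean_nll([1 / 10001] * 5))
# Validation loss reported after the first epoch in the log above
epoch1_ppl = perplexity(5.38)
print(uniform_ppl, epoch1_ppl)
```

A perplexity of around 217 can be read as the model being about as uncertain as a uniform choice among roughly 217 words, compared with 10001 for a model that has learned nothing.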

### Generating a sentence
What if we generate some sentences from the trained model?

Let's take our trained model and generate some words. In this part, we start the generation from the word `the`. We keep the hidden states of the model and the predicted word, and use them to generate the next word.

This process is done in an autoregressive manner.
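
The sampling loop can be sketched without PyTorch. In the version below, `toy_model` is a hypothetical stand-in that returns fixed log-probabilities and ignores the hidden state, so the example only illustrates the two ingredients of generation: temperature scaling of the log-probabilities and feeding each sampled word back in as the next input.

```python
import math
import random

def temperature_probs(log_probs, temperature):
    """Divide log-probabilities by the temperature, exponentiate, renormalize."""
    weights = [math.exp(lp / temperature) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]

def generate(next_log_probs, start, steps, temperature, rng):
    """Autoregressive loop: each sampled token becomes the next input."""
    tokens = [start]
    for _ in range(steps):
        vocab, log_probs = next_log_probs(tokens[-1])
        probs = temperature_probs(log_probs, temperature)
        tokens.append(rng.choices(vocab, weights=probs, k=1)[0])
    return tokens

# Hypothetical toy "model": a fixed next-word distribution regardless of context
def toy_model(prev_token):
    vocab = ["cat", "dog", "<eos>"]
    return vocab, [math.log(0.6), math.log(0.3), math.log(0.1)]

sharp = temperature_probs(toy_model("the")[1], temperature=0.5)
flat = temperature_probs(toy_model("the")[1], temperature=2.0)
sample = generate(toy_model, "the", steps=5, temperature=1.0, rng=random.Random(0))
print(sharp, flat, sample)
```

A temperature below 1 sharpens the distribution towards the most likely word, while a temperature above 1 flattens it and produces more varied output.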

In [None]:
temperature = 1
words = 10

with open(save_path, 'rb') as f:
    model = torch.load(f)
model.eval()

start_word = "the"

with torch.no_grad():
  ntokens = len(corpus.dictionary)
  hidden = model.init_hidden(1)
  # Start from the index of the seed word; shape (seq_len=1, batch_size=1)
  input = torch.full((1, 1), corpus.dictionary.word2idx[start_word], dtype=torch.long)
  if args.cuda:
      input = input.cuda()

  sentences = start_word + " "
  for i in range(words):
      output, hidden = model(input, hidden)
      # The model returns log-probabilities; scale by the temperature and exponentiate
      word_weights = output.squeeze().div(temperature).exp().cpu()
      word_idx = torch.multinomial(word_weights, 1)[0]
      # Feed the sampled word back in as the next input
      input.fill_(word_idx.item())
      word = corpus.dictionary.idx2word[word_idx.item()]
      sentences += word + " "
  print(sentences)

the <unk> manufacturer of new greed when the trip is a 




## Revisiting the Sentiment Analysis
Let's implement `LSTMModel` and `TransformerModel` for IMDB sentiment text analysis.

Import all required libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
print(f"PyTorch version: {torch.__version__}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {device}")

PyTorch version: 1.6.0+cu101
device: cuda


Define the hyper-parameters

In [None]:
from collections import namedtuple

args = {
  "hidden_size": 100,
  "output_size": 1,
  "lr": 1e-4,
  "seed": 1234,
  "max_vocab_size": 1000,
  "batch_size": 64,
  "num_epoch": 10,
  "dropout": 0.0
}
args = namedtuple('Struct', args.keys())(*args.values())

Define the metric for evaluation. We reuse the same function as in Practical 6.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    rounded_preds = torch.round(torch.sigmoid(preds)) # threshold 0.5
    correct = (rounded_preds == y).float() # convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

### Data Preprocessing

In [None]:
from torchtext import datasets
from torchtext import data
import random

TEXT = data.Field(tokenize = 'spacy', include_lengths = True) # add spacy tokenizer
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
random.seed(args.seed) # random.seed() returns None, so seed first and pass the resulting state
train_data, valid_data = train_data.split(random_state = random.getstate())

TEXT.build_vocab(train_data, max_size=args.max_vocab_size)
LABEL.build_vocab(train_data)

# Data
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = args.batch_size, 
    sort_within_batch = True,
    device = device)



### Build an LSTM model

Let's define the `LSTMModel` for sequence classification. Note that we use `pack_padded_sequence` and `pad_packed_sequence` so that the LSTM skips the padded positions. This requires each batch to be sorted by decreasing length, which `sort_within_batch = True` in the iterator provides.
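
Conceptually, packing records how many sequences are still active at each time step (a `PackedSequence` keeps this in its `batch_sizes` field), so the LSTM can stop processing a sequence once it ends. The plain-Python sketch below computes those counts for a small sorted batch; `packed_batch_sizes` is a hypothetical helper, not a PyTorch function.

```python
def packed_batch_sizes(lengths):
    """For lengths sorted in decreasing order, count active sequences per time step."""
    if any(a < b for a, b in zip(lengths, lengths[1:])):
        raise ValueError("lengths must be sorted in decreasing order")
    return [sum(1 for n in lengths if n > t) for t in range(lengths[0])]

lengths = [4, 3, 2]                        # e.g. "I love black cat", "I love cat", "I love"
sizes = packed_batch_sizes(lengths)        # sequences still running at steps 0..3
padded_steps = len(lengths) * lengths[0]   # positions processed on the padded batch
packed_steps = sum(sizes)                  # positions processed when packed
print(sizes, padded_steps, packed_steps)
```

Besides saving computation, this means the final hidden state of each sequence comes from its last real token rather than from a padding position.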

In [None]:
class LSTMModel(nn.Module):
  def __init__(self, vocab_size, hidden_size, output_size, dropout=0.0, pad_idx=0, num_layer=2):
    super(LSTMModel, self).__init__()
    self.emb = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
    self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers=num_layer, bidirectional=True)
    self.layer = nn.Linear(hidden_size*2, output_size)
    self.drop = nn.Dropout(dropout)

  def forward(self, inputs, inputs_len):
    """
      inputs: LongTensor (seq_len, batch_size)
    """
    embedded_inputs = self.drop(self.emb(inputs)) # (seq_len, batch_size, emb_size)
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded_inputs, inputs_len.cpu()) # newer PyTorch versions require the lengths on the CPU
    packed_output, (hidden, cell) = self.rnn(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

    hidden = self.drop(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)) # (batch size, hid_dim * num directions)
    return self.layer(hidden)
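
The effect of packing can be checked in isolation. The sketch below uses a small 2-layer bidirectional LSTM with made-up sizes and random toy inputs (none of these values come from the notebook). It compares the final hidden state of a short sequence when it is packed, when it is run alone without padding, and when the padding is fed to the LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, num_layers=2, bidirectional=True)

# Two sequences of lengths 5 and 2, padded to length 5: (seq_len, batch, emb).
lengths = torch.tensor([5, 2])  # sorted in descending order, kept on the CPU
batch = torch.randn(5, 2, 4)
batch[2:, 1, :] = 0.0  # padded positions of the shorter sequence

# Packed: the LSTM stops at each sequence's true length.
packed = nn.utils.rnn.pack_padded_sequence(batch, lengths)
_, (h_packed, _) = lstm(packed)

# Reference: run the short sequence on its own, with no padding at all.
_, (h_alone, _) = lstm(batch[:2, 1:2, :])

# Unpacked: the LSTM also consumes the three padding steps.
_, (h_padded, _) = lstm(batch)

packed_matches = torch.allclose(h_packed[:, 1], h_alone[:, 0], atol=1e-5)
padded_matches = torch.allclose(h_padded[:, 1], h_alone[:, 0], atol=1e-5)

# hidden has shape (num_layers * 2, batch, hidden); the last two entries are the
# top layer's forward and backward states, which the model concatenates.
sentence_vec = torch.cat((h_packed[-2], h_packed[-1]), dim=1)
print(packed_matches, padded_matches, sentence_vec.shape)
```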

#### Training

In [None]:
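pad_idx = TEXT.vocab.stoi[TEXT.pad_token] # look up the padding index in the vocabulary instead of assuming 0
model = LSTMModel(len(TEXT.vocab), args.hidden_size, args.output_size, dropout=args.dropout, pad_idx=pad_idx, num_layer=2)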
train_model(model, train_iterator, valid_iterator, "best_lstm_model.pt")

Epoch: 1 train loss:0.622 acc:0.640 valid loss:0.577 acc:0.703 time:15.180s
Save model
Epoch: 2 train loss:0.498 acc:0.760 valid loss:0.563 acc:0.710 time:14.933s
Save model
Epoch: 3 train loss:0.464 acc:0.776 valid loss:0.518 acc:0.751 time:15.036s
Save model
Epoch: 4 train loss:0.411 acc:0.813 valid loss:0.490 acc:0.787 time:15.099s
Save model
Epoch: 5 train loss:0.427 acc:0.802 valid loss:0.453 acc:0.796 time:15.037s
Save model
Epoch: 6 train loss:0.389 acc:0.826 valid loss:0.461 acc:0.789 time:15.254s
Epoch: 7 train loss:0.380 acc:0.835 valid loss:0.482 acc:0.782 time:15.180s
Epoch: 8 train loss:0.331 acc:0.860 valid loss:0.376 acc:0.835 time:15.185s
Save model
Epoch: 9 train loss:0.320 acc:0.864 valid loss:0.384 acc:0.833 time:15.169s
Epoch: 10 train loss:0.286 acc:0.884 valid loss:0.394 acc:0.825 time:15.222s


#### Evaluation

In [None]:
model.load_state_dict(torch.load('best_lstm_model.pt'))
test_loss, test_acc = evaluate(model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

test loss:0.352 acc:0.847


### Build a Transformer model
Here, we implement `PositionalEncoding` and `TransformerModel`.

In [None]:
import math

class PositionalEncoding(nn.Module):
  def __init__(self, d_model, dropout=0.1, max_len=5000):
      super(PositionalEncoding, self).__init__()
      self.dropout = nn.Dropout(p=dropout)

      pe = torch.zeros(max_len, d_model)
      position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
      div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
      pe[:, 0::2] = torch.sin(position * div_term)
      pe[:, 1::2] = torch.cos(position * div_term)
      pe = pe.unsqueeze(0).transpose(0, 1)
      self.register_buffer('pe', pe)

  def forward(self, x):
      x = x + self.pe[:x.size(0), :]
      return self.dropout(x)

class TransformerModel(nn.Module):
  def __init__(self, vocab_size, hidden_size, output_size, dropout=0.0, pad_idx=0, num_layer=2):
    super(TransformerModel, self).__init__()
    self.emb = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
    encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4)
    self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layer) # use the constructor argument instead of a hard-coded 2
    self.pos_encoder = PositionalEncoding(hidden_size, dropout)
    self.layer = nn.Linear(hidden_size, output_size)
    self.drop = nn.Dropout(dropout)
    self.hidden_size = hidden_size
    self.src_mask = None

  def _generate_square_subsequent_mask(self, sz):
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

  def forward(self, inputs, inputs_len):
    """
      inputs: LongTensor (seq_len, batch_size)
    """
    # nn.TransformerEncoder expects (seq_len, batch_size, emb_size) by default,
    # so we keep the sequence-first layout produced by the iterator.
    # No causal mask is applied: a classifier may attend to the whole review.
    # Note: self-attention memory grows quadratically with seq_len.
    embedded = self.drop(self.emb(inputs)) * math.sqrt(self.hidden_size) # (seq_len, batch_size, emb_size)
    embedded = self.pos_encoder(embedded) # add position information before encoding
    encoded = self.encoder(embedded) # (seq_len, batch_size, emb_size)
    pooled_inputs = encoded.mean(dim=0) # average over time: (batch_size, emb_size)

    return self.layer(pooled_inputs)
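
For reference, the buffer built in `PositionalEncoding.__init__` follows the sinusoidal scheme: even dimensions hold `sin(pos / 10000^(i / d_model))` and the following odd dimension holds the cosine of the same angle, where `i` is the even dimension index. Below is a plain-Python restatement of that formula, assuming an even `d_model` as in the class above. The function name `positional_encoding` and the small sizes are chosen only for this illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same quantity as position * div_term in the PyTorch version.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimension
            pe[pos][i + 1] = math.cos(angle)  # odd dimension
    return pe

pe_table = positional_encoding(max_len=3, d_model=4)
for row in pe_table:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating zeros and ones, and the higher dimensions change more slowly with position than the lower ones.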

#### Training

In [None]:
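pad_idx = TEXT.vocab.stoi[TEXT.pad_token] # look up the padding index in the vocabulary instead of assuming 0
model = TransformerModel(len(TEXT.vocab), args.hidden_size, args.output_size, dropout=args.dropout, pad_idx=pad_idx, num_layer=2)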
train_model(model, train_iterator, valid_iterator, "best_transformer_model.pt")

Epoch: 1 train loss:0.914 acc:0.512 valid loss:0.701 acc:0.511 time:12.462s
Save model
Epoch: 2 train loss:0.606 acc:0.662 valid loss:0.592 acc:0.684 time:12.436s
Save model
Epoch: 3 train loss:0.489 acc:0.765 valid loss:0.499 acc:0.754 time:12.428s
Save model
Epoch: 4 train loss:0.451 acc:0.789 valid loss:0.456 acc:0.793 time:12.390s
Save model
Epoch: 5 train loss:0.438 acc:0.798 valid loss:0.436 acc:0.799 time:12.456s
Save model
Epoch: 6 train loss:0.417 acc:0.812 valid loss:0.418 acc:0.813 time:12.435s
Save model
Epoch: 7 train loss:0.406 acc:0.816 valid loss:0.417 acc:0.816 time:12.388s
Save model
Epoch: 8 train loss:0.389 acc:0.826 valid loss:0.407 acc:0.817 time:12.348s
Save model
Epoch: 9 train loss:0.390 acc:0.829 valid loss:0.405 acc:0.819 time:12.301s
Save model
Epoch: 10 train loss:0.378 acc:0.832 valid loss:0.475 acc:0.778 time:12.364s


#### Evaluation

In [None]:
model.load_state_dict(torch.load('best_transformer_model.pt'))
test_loss, test_acc = evaluate(model, test_iterator)
print(f"test loss:{test_loss:.3f} acc:{test_acc:.3f}")

RuntimeError: ignored

### Build a model with a pre-trained BERT model

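We can also take a pre-trained contextual language model and adapt it to a downstream task, such as sentiment analysis. In the model below, BERT's weights stay frozen (its forward pass runs under `torch.no_grad()`), and only the linear classifier on top is trained.

Training is very slow on the Colab GPU server (about 20 minutes per epoch in the run below). Consider running it on your own or a GCP GPU server.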

#### Install Hugging Face's `transformers` package

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

#### Training

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens

init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  include_lengths = True,
                  tokenize = tokenize,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)
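
The `tokenize` function truncates each review to `max_input_length - 2` tokens because the field then adds one token at the start (`init_token`, BERT's `[CLS]`) and one at the end (`eos_token`, BERT's `[SEP]`). The following pure-Python sketch shows that budget on placeholder ids. The helper `add_special_tokens` is hypothetical, and the default ids 101 and 102 are assumptions about `bert-base-uncased`; in practice they should be read from the tokenizer as in the cell above.

```python
def add_special_tokens(token_ids, max_len, cls_id=101, sep_id=102):
    """Truncate to leave room for two special tokens, then wrap the sequence."""
    body = token_ids[:max_len - 2]
    return [cls_id] + body + [sep_id]

# A placeholder "review" of 600 token ids, longer than the 512-token limit.
long_review = list(range(1000, 1600))
encoded_long = add_special_tokens(long_review, max_len=512)

# A short review is left untouched apart from the two special tokens.
short_review = [2023, 3185, 2003, 2307]
encoded_short = add_special_tokens(short_review, max_len=512)

print(len(encoded_long), len(encoded_short))
```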

In [None]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
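random.seed(args.seed) # random.seed() returns None, so seed first and pass the resulting state explicitly
train_data, valid_data = train_data.split(random_state = random.getstate())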

In [None]:
LABEL.build_vocab(train_data)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = args.batch_size, 
    device = device)

Implement `BERTTransformerModel`

In [None]:
from transformers import BertTokenizer, BertModel

class BERTTransformerModel(nn.Module):
    def __init__(self, bert, hidden_size, output_size, dropout=0.0):
        super(BERTTransformerModel, self).__init__()
        
        self.bert = bert
        self.out = nn.Linear(hidden_size, output_size)
        self.drop = nn.Dropout(dropout)
        
    def forward(self, inputs, inputs_len):
        """
          inputs: LongTensor (batch_size, seq_len)
        """
        with torch.no_grad():
          embedded_inputs = self.bert(inputs)[0]

        pooled_inputs = F.avg_pool2d(embedded_inputs, (embedded_inputs.shape[1], 1)).squeeze(1)  # (batch_size, emb_size)
        output = self.out(pooled_inputs)
        
        return output
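
`F.avg_pool2d` averages over every position in the batch, including padding. One possible alternative, shown here only as a sketch, is to average over the real tokens of each review, using the lengths that the iterator already provides through `include_lengths = True`. The helper name `masked_mean` and the toy tensors are hypothetical and not part of the model above.

```python
import torch

def masked_mean(hidden, lengths):
    """
    hidden:  FloatTensor (batch_size, seq_len, dim)
    lengths: LongTensor  (batch_size,) number of real tokens per sequence
    """
    seq_len = hidden.size(1)
    positions = torch.arange(seq_len, device=hidden.device).unsqueeze(0)  # (1, seq_len)
    mask = positions < lengths.unsqueeze(1)                                # (batch_size, seq_len)
    mask = mask.unsqueeze(-1).to(hidden.dtype)                             # (batch_size, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                                    # padding contributes zero
    return summed / lengths.unsqueeze(1).to(hidden.dtype)

# First sequence has 2 real tokens followed by one padded position (value 100).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
lengths = torch.tensor([2, 3])

pooled = masked_mean(hidden, lengths)
plain_mean = hidden.mean(dim=1)  # what unmasked average pooling would give
print(pooled)
print(plain_mean)
```

In the toy batch, the padded position drags the plain mean of the first sequence far away from its real tokens, while the masked mean ignores it.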

In [None]:
bert = BertModel.from_pretrained('bert-base-uncased')
embedding_dim = bert.config.to_dict()['hidden_size']
model = BERTTransformerModel(bert, embedding_dim, args.output_size)
train_model(model, train_iterator, valid_iterator, "best_transformer_model.pt")

Epoch: 1 train loss:0.649 acc:0.665 valid loss:0.523 acc:0.796 time:1218.773s
Save model
Epoch: 2 train loss:0.578 acc:0.760 valid loss:0.455 acc:0.804 time:1226.984s
Save model
Epoch: 3 train loss:0.538 acc:0.782 valid loss:0.414 acc:0.822 time:1228.352s
Save model
Epoch: 4 train loss:0.509 acc:0.798 valid loss:0.387 acc:0.834 time:1228.644s
Save model
