
# Question Two

Please make sure to run all cells in order.

## Setup

### Download wikitext-2 dataset

In [1]:
!rm -r data
!mkdir data
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip -P /content/data
!unzip /content/data/wikitext-2-v1.zip -d /content/data

rm: cannot remove 'data': No such file or directory
--2020-02-08 07:30:51--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.254
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475746 (4.3M) [application/zip]
Saving to: ‘/content/data/wikitext-2-v1.zip’


2020-02-08 07:30:59 (2.47 MB/s) - ‘/content/data/wikitext-2-v1.zip’ saved [4475746/4475746]

Archive:  /content/data/wikitext-2-v1.zip
   creating: /content/data/wikitext-2/
  inflating: /content/data/wikitext-2/wiki.test.tokens  
  inflating: /content/data/wikitext-2/wiki.valid.tokens  
  inflating: /content/data/wikitext-2/wiki.train.tokens  


### Import libraries

In [0]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.onnx
import argparse
import time
import os
from io import open

### Specify arg variables

In [0]:
data = './data/wikitext-2'  # location of the data corpus
args_model = 'FNN'          # model type ('FNN' selects the feed-forward model)
emsize = 200                # size of word embeddings
nhid = 200                  # number of hidden units per layer
nlayers = 2                 # number of layers
lr = 20                     # initial learning rate
clip = 0.25                 # gradient clipping
epochs = 40                 # upper epoch limit
batch_size = 20             # batch size
bptt = 35                   # sequence length
dropout = 0.2               # dropout applied to layers (0 = no dropout)
tied = False                # tie the word embedding and softmax weights
seed = 1111                 # random seed
cuda = True                 # use CUDA
log_interval = 200          # report interval
save = 'model.pt'           # path to save the final model
onnx_export = ''            # path to export the final model in onnx format
nhead = 2                   # the number of heads in the encoder/decoder of the transformer model

### Data Preprocessing Classes

In [0]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'wiki.train.tokens'))
        self.valid = self.tokenize(os.path.join(path, 'wiki.valid.tokens'))
        self.test = self.tokenize(os.path.join(path, 'wiki.test.tokens'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)

        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids
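
To make the tokenization step concrete, here is a minimal sketch that applies the same word-to-index idea to in-memory lines instead of files. The helper `tokenize_lines` and the sample sentences are hypothetical and exist only for illustration; plain Python lists stand in for the int64 tensors returned by `Corpus.tokenize`.

```python
# Illustrative sketch only: tokenize_lines and the sample text are hypothetical,
# and plain lists stand in for the int64 tensors used by Corpus.tokenize.

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Assign the next free index the first time a word is seen.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def tokenize_lines(dictionary, lines):
    """Map whitespace-separated words to ids, appending <eos> after every line."""
    ids = []
    for line in lines:
        for word in line.split() + ['<eos>']:
            ids.append(dictionary.add_word(word))
    return ids


toy_dictionary = Dictionary()
toy_ids = tokenize_lines(toy_dictionary, ["the cat sat", "the dog sat"])
print(toy_ids)                   # ids follow the order of first appearance
print(toy_dictionary.idx2word)   # index -> word lookup table
```

Repeated words reuse their existing index, so the vocabulary size equals the number of distinct tokens, including `<eos>`.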

### Helper Methods

In [0]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e.g. 'g' on 'f' cannot be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)


# get_batch subdivides the source data into chunks of length bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.

def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    if args_model not in ['Transformer', 'FNN']:
        hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            if args_model in ['Transformer', 'FNN']:
                output = model(data)
            else:
                output, hidden = model(data, hidden)
                hidden = repackage_hidden(hidden)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    if args_model not in ['Transformer', 'FNN']:
        hidden = model.init_hidden(batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        # model.zero_grad()
        
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output
        if args_model in ['Transformer', 'FNN']:
            output = model(data)
        else:
            hidden = repackage_hidden(hidden)
            output, hidden = model(data, hidden)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(output.view(-1, ntokens), targets)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        # torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        # for p in model.parameters():
        #     p.data.add_(-lr, p.grad.data)

        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.3f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // bptt, optimizer.param_groups[0]['lr'],
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()


def export_onnx(path, batch_size, seq_len):
    print('The model is also exported in ONNX format at {}'.
          format(os.path.realpath(path)))
    model.eval()
    dummy_input = torch.LongTensor(seq_len * batch_size).zero_().view(-1, batch_size).to(device)
    hidden = model.init_hidden(batch_size)
    torch.onnx.export(model, (dummy_input, hidden), path)
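
The alphabet example from the comments on `batchify` and `get_batch` can be reproduced directly. The sketch below uses hypothetical CPU-only copies of the two helpers (`batchify_cpu` and `get_batch_demo`, which take `bptt` as an argument instead of reading the notebook's globals) so that it runs on its own.

```python
import torch

def batchify_cpu(data, bsz):
    # Same logic as batchify, minus the .to(device) call.
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    return data.view(bsz, -1).t().contiguous()

def get_batch_demo(source, i, bptt):
    # Same logic as get_batch, with bptt passed explicitly.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

def to_letters(tensor):
    # Map 0..25 back to 'a'..'z', reading the tensor in row-major order.
    return ''.join(chr(ord('a') + int(v)) for v in tensor.flatten())

alphabet = torch.arange(26)                 # stands in for a tokenized corpus
columns = batchify_cpu(alphabet, 4)         # 'y' and 'z' are trimmed off
demo_data, demo_target = get_batch_demo(columns, 0, bptt=2)

print(columns.shape)
print(to_letters(columns[0]))               # first row across the 4 columns
print(to_letters(demo_data), to_letters(demo_target))
```

Each target element is the token that follows the corresponding input element within the same column, which is what the cross-entropy loss in `train` compares against.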

### Set torch seed and CUDA device

In [0]:
# Set the random seed manually for reproducibility.
torch.manual_seed(seed)

# Set cuda device, falling back to the CPU when CUDA is unavailable
if torch.cuda.is_available():
    if not cuda:
        print("WARNING: You have a CUDA device, so you should probably set cuda = True")

device = torch.device("cuda" if cuda and torch.cuda.is_available() else "cpu")

### Load corpus

In [0]:
corpus = Corpus(data)

eval_batch_size = 10
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

ntokens = len(corpus.dictionary)

## (iii) Write a _class FNNModel(nn.Module)_.

The FNNModel should implement a language model with a feed-forward network architecture. It has one hidden layer with a tanh activation, and the output layer is a softmax layer. For each input of (n-1) previous word indices, the model outputs the probabilities of the |_V_| words in the vocabulary.

In [0]:
class FNNModel(nn.Module):
    """ Container module with an encoder, a feed forward module, and a decoder. """

    def __init__(self, ntokens, emsize, hidden_size, tie_weights=False):
        super(FNNModel, self).__init__()
        self.encoder = nn.Embedding(ntokens, emsize)
        self.hidden = nn.Linear(emsize, hidden_size)
        self.tanh = nn.Tanh()
        self.decoder = nn.Linear(hidden_size, ntokens)

        if tie_weights:
            if hidden_size != emsize:
                raise ValueError('When using the tied flag, hidden_size must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        # self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input):
        emb = self.encoder(input)
        hid = self.hidden(emb)
        output = self.tanh(hid)
        decoded = self.decoder(output)
        return decoded
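
The task statement describes conditioning on (n-1) previous words, whereas `FNNModel.forward` passes each position's single embedding through the hidden layer. As a rough sketch of how a wider context could be handled, the hypothetical `NGramFNN` below concatenates the embeddings of `context_size` previous words before the hidden layer. The class name, the `context_size` parameter, and the toy sizes are assumptions for illustration; this variant is not part of the original notebook and was not used to produce any of the reported results.

```python
import torch
import torch.nn as nn

class NGramFNN(nn.Module):
    """Hypothetical variant: predicts the next word from `context_size` previous words."""

    def __init__(self, ntokens, emsize, hidden_size, context_size):
        super().__init__()
        self.encoder = nn.Embedding(ntokens, emsize)
        # The hidden layer sees all context embeddings side by side.
        self.hidden = nn.Linear(context_size * emsize, hidden_size)
        self.tanh = nn.Tanh()
        self.decoder = nn.Linear(hidden_size, ntokens)

    def forward(self, context):
        # context: (batch, context_size) tensor of word indices
        emb = self.encoder(context)                 # (batch, context_size, emsize)
        flat = emb.reshape(emb.size(0), -1)         # (batch, context_size * emsize)
        return self.decoder(self.tanh(self.hidden(flat)))   # (batch, ntokens) logits

toy_model = NGramFNN(ntokens=50, emsize=8, hidden_size=16, context_size=3)
toy_context = torch.randint(50, (4, 3))             # 4 examples, 3 context words each
toy_logits = toy_model(toy_context)
toy_probs = torch.softmax(toy_logits, dim=-1)       # softmax turns logits into probabilities
print(toy_logits.shape)
```

As in `FNNModel`, the module returns unnormalized logits; `nn.CrossEntropyLoss` applies the log-softmax internally, so an explicit softmax is only needed when probabilities are inspected.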

## (iv-1) Train the model with Adam optimizer (without sharing weights).

In [12]:
tied = False
model = FNNModel(ntokens, emsize, nhid, tied).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())


# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
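            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            # Note: Adam was created with its default learning rate, so changing `lr` here
            # does not affect the optimizer.
            lr /= 4.0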
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

| epoch   1 |   200/ 2983 batches | lr 0.001 | ms/batch 18.97 | loss  8.08 | ppl  3234.78
| epoch   1 |   400/ 2983 batches | lr 0.001 | ms/batch 17.86 | loss  6.59 | ppl   728.55
| epoch   1 |   600/ 2983 batches | lr 0.001 | ms/batch 17.97 | loss  6.38 | ppl   592.88
| epoch   1 |   800/ 2983 batches | lr 0.001 | ms/batch 18.02 | loss  6.31 | ppl   549.91
| epoch   1 |  1000/ 2983 batches | lr 0.001 | ms/batch 18.07 | loss  6.23 | ppl   507.14
| epoch   1 |  1200/ 2983 batches | lr 0.001 | ms/batch 18.15 | loss  6.22 | ppl   503.70
| epoch   1 |  1400/ 2983 batches | lr 0.001 | ms/batch 18.17 | loss  6.20 | ppl   490.41
| epoch   1 |  1600/ 2983 batches | lr 0.001 | ms/batch 18.27 | loss  6.18 | ppl   485.13
| epoch   1 |  1800/ 2983 batches | lr 0.001 | ms/batch 18.32 | loss  6.07 | ppl   430.87
| epoch   1 |  2000/ 2983 batches | lr 0.001 | ms/batch 18.39 | loss  6.10 | ppl   444.95
| epoch   1 |  2200/ 2983 batches | lr 0.001 | ms/batch 18.42 | loss  6.00 | ppl   401.94
[... remaining training output truncated ...]


| epoch   2 |   200/ 2983 batches | lr 0.001 | ms/batch 18.86 | loss  5.73 | ppl   307.82
| epoch   2 |   400/ 2983 batches | lr 0.001 | ms/batch 18.96 | loss  5.68 | ppl   292.37
| epoch   2 |   600/ 2983 batches | lr 0.001 | ms/batch 18.87 | loss  5.56 | ppl   258.84
| epoch   2 |   800/ 2983 batches | lr 0.001 | ms/batch 18.89 | loss  5.59 | ppl   268.52
| epoch   2 |  1000/ 2983 batches | lr 0.001 | ms/batch 18.92 | loss  5.56 | ppl   259.41
| epoch   2 |  1200/ 2983 batches | lr 0.001 | ms/batch 18.92 | loss  5.58 | ppl   264.33
| epoch   2 |  1400/ 2983 batches | lr 0.001 | ms/batch 19.29 | loss  5.61 | ppl   272.79
| epoch   2 |  1600/ 2983 batches | lr 0.001 | ms/batch 19.76 | loss  5.61 | ppl   271.92
| epoch   2 |  1800/ 2983 batches | lr 0.001 | ms/batch 20.16 | loss  5.52 | ppl   250.46
| epoch   2 |  2000/ 2983 batches | lr 0.001 | ms/batch 20.20 | loss  5.58 | ppl   264.73
| epoch   2 |  2200/ 2983 batches | lr 0.001 | ms/batch 20.21 | loss  5.48 | ppl   239.13
[... remaining training output truncated ...]
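
The log above reports `lr 0.001` throughout because the optimizer is Adam with its default learning rate, and the loop's `lr` variable is never passed to it. If annealing on a validation plateau is wanted, one option is PyTorch's `ReduceLROnPlateau` scheduler. The following is a minimal sketch on a toy parameter with made-up validation losses; the factor of 0.25 mirrors the `lr /= 4.0` in the loop, and none of this was used for the reported results.

```python
import torch

toy_param = torch.nn.Parameter(torch.zeros(1))
toy_optimizer = torch.optim.Adam([toy_param])        # default lr = 0.001
toy_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    toy_optimizer, mode='min', factor=0.25, patience=0)

lr_history = []
for toy_val_loss in [6.0, 5.8, 5.9, 5.7]:            # made-up validation losses
    # In a real loop this call would follow evaluate(val_data) at the end of each epoch.
    toy_scheduler.step(toy_val_loss)
    lr_history.append(toy_optimizer.param_groups[0]['lr'])

print(lr_history)   # the rate drops after the epoch where the loss got worse
```

With `patience=0`, a single epoch without improvement triggers the reduction, which matches the behaviour the original comment describes.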

## (v-1) Show the perplexity score on the test set. You should select your best model based on the _valid_ set.

In [13]:
# Load the best saved model.
with open(save, 'rb') as f:
    model = torch.load(f)

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.73 | test ppl   308.45
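
The perplexity printed here is the exponential of the average per-token cross-entropy (in nats), as computed by `math.exp(test_loss)`. A quick sanity check, sketched below with a hypothetical toy vocabulary: a model that assigns equal scores to every word should have a perplexity equal to the vocabulary size.

```python
import math
import torch
import torch.nn as nn

toy_vocab = 1000                                   # hypothetical vocabulary size
uniform_logits = torch.zeros(5, toy_vocab)         # equal score for every word
toy_targets = torch.randint(toy_vocab, (5,))       # arbitrary target words

# CrossEntropyLoss averages -log p(target) over the 5 positions.
toy_loss = nn.CrossEntropyLoss()(uniform_logits, toy_targets).item()
toy_ppl = math.exp(toy_loss)
print(toy_loss, toy_ppl)                           # loss is log(1000), perplexity is about 1000
```

Read this way, a test perplexity of roughly 308 means the model is, on average, about as uncertain as if it were choosing uniformly among 308 words.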


## (iv-2) Train the model with Adam optimizer, but now with sharing the input (look-up matrix) and output layer embeddings (final layer weights).

In [15]:
tied = True
model = FNNModel(ntokens, emsize, nhid, tied).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())


# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
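            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            # Note: Adam was created with its default learning rate, so changing `lr` here
            # does not affect the optimizer.
            lr /= 4.0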
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

| epoch   1 |   200/ 2983 batches | lr 0.001 | ms/batch 17.35 | loss 14.22 | ppl 1502783.63
| epoch   1 |   400/ 2983 batches | lr 0.001 | ms/batch 17.25 | loss  8.03 | ppl  3073.87
| epoch   1 |   600/ 2983 batches | lr 0.001 | ms/batch 17.44 | loss  7.31 | ppl  1488.76
| epoch   1 |   800/ 2983 batches | lr 0.001 | ms/batch 17.68 | loss  7.13 | ppl  1243.80
| epoch   1 |  1000/ 2983 batches | lr 0.001 | ms/batch 17.86 | loss  7.00 | ppl  1101.65
| epoch   1 |  1200/ 2983 batches | lr 0.001 | ms/batch 18.14 | loss  6.97 | ppl  1067.11
| epoch   1 |  1400/ 2983 batches | lr 0.001 | ms/batch 18.32 | loss  6.91 | ppl  1003.37
| epoch   1 |  1600/ 2983 batches | lr 0.001 | ms/batch 18.61 | loss  6.89 | ppl   986.13
| epoch   1 |  1800/ 2983 batches | lr 0.001 | ms/batch 18.84 | loss  6.79 | ppl   885.45
| epoch   1 |  2000/ 2983 batches | lr 0.001 | ms/batch 18.79 | loss  6.80 | ppl   899.18
| epoch   1 |  2200/ 2983 batches | lr 0.001 | ms/batch 18.40 | loss  6.71 | ppl   820.16
[... remaining training output truncated ...]


| epoch   2 |   200/ 2983 batches | lr 0.001 | ms/batch 17.51 | loss  6.54 | ppl   692.33
| epoch   2 |   400/ 2983 batches | lr 0.001 | ms/batch 17.42 | loss  6.47 | ppl   644.98
| epoch   2 |   600/ 2983 batches | lr 0.001 | ms/batch 17.42 | loss  6.36 | ppl   579.72
| epoch   2 |   800/ 2983 batches | lr 0.001 | ms/batch 17.32 | loss  6.40 | ppl   599.72
| epoch   2 |  1000/ 2983 batches | lr 0.001 | ms/batch 17.34 | loss  6.36 | ppl   576.65
| epoch   2 |  1200/ 2983 batches | lr 0.001 | ms/batch 17.29 | loss  6.38 | ppl   588.80
| epoch   2 |  1400/ 2983 batches | lr 0.001 | ms/batch 17.33 | loss  6.38 | ppl   589.49
| epoch   2 |  1600/ 2983 batches | lr 0.001 | ms/batch 17.31 | loss  6.38 | ppl   590.09
| epoch   2 |  1800/ 2983 batches | lr 0.001 | ms/batch 17.41 | loss  6.28 | ppl   532.73
| epoch   2 |  2000/ 2983 batches | lr 0.001 | ms/batch 17.47 | loss  6.32 | ppl   553.87
| epoch   2 |  2200/ 2983 batches | lr 0.001 | ms/batch 17.52 | loss  6.23 | ppl   507.36
[... remaining training output truncated ...]

## (v-2) Show the perplexity score on the test set. You should select your best model based on the _valid_ set.

In [16]:
# Load the best saved model.
with open(save, 'rb') as f:
    model = torch.load(f)

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.92 | test ppl   372.39
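
Weight tying, as done in `FNNModel` via `self.decoder.weight = self.encoder.weight`, makes the embedding matrix and the output projection one and the same parameter. The sketch below illustrates this on toy sizes; the names and sizes are hypothetical, and `count_unique_parameters` is a helper written for this example.

```python
import torch
import torch.nn as nn

toy_ntokens, toy_emsize = 100, 16                  # hypothetical toy sizes
toy_encoder = nn.Embedding(toy_ntokens, toy_emsize)    # weight: (100, 16)
toy_decoder = nn.Linear(toy_emsize, toy_ntokens)       # weight: (100, 16), bias: (100,)
toy_decoder.weight = toy_encoder.weight            # tie: both modules now hold one Parameter

def count_unique_parameters(modules):
    # Count each Parameter object once, even if several modules reference it.
    seen = {}
    for module in modules:
        for p in module.parameters():
            seen[id(p)] = p.numel()
    return sum(seen.values())

shares_storage = toy_decoder.weight.data_ptr() == toy_encoder.weight.data_ptr()
tied_count = count_unique_parameters([toy_encoder, toy_decoder])
print(shares_storage, tied_count)                  # one 100x16 matrix plus the 100-entry bias
```

The shapes only line up when the decoder's input size equals the embedding size, which is why `FNNModel` raises a `ValueError` when `hidden_size != emsize` and tying is requested.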


## (vii) Adapt generate.py so that you can generate texts using your language model (FNNModel)

In [28]:
data = './data/wikitext-2'  # location of the data corpus
checkpoint = './model.pt'   # model checkpoint to use
outf = 'generated.txt'      # output file for generated text
words = 1000                # number of words to generate
seed = 1111                 # random seed
cuda = True                 # use CUDA
temperature = 1.0           # temperature - higher will increase diversity
log_interval = 100          # reporting interval

with open(checkpoint, 'rb') as f:
  model = torch.load(f).to(device)
model.eval()

input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

with open(outf, 'w') as outf:
  with torch.no_grad(): # no tracking history
    for i in range(words):
      output = model(input)
      word_weights = output.squeeze().div(temperature).exp().cpu()
      word_idx = torch.multinomial(word_weights, 1)[0]
      input.fill_(word_idx)

      word = corpus.dictionary.idx2word[word_idx]

      outf.write(word + ('\n' if i % 20 == 19 else ' '))

      if i % log_interval == 0:
        print('| Generated {}/{} words'.format(i, words))

| Generated 0/1000 words
| Generated 100/1000 words
| Generated 200/1000 words
| Generated 300/1000 words
| Generated 400/1000 words
| Generated 500/1000 words
| Generated 600/1000 words
| Generated 700/1000 words
| Generated 800/1000 words
| Generated 900/1000 words


In [29]:
!cat generated.txt

located at O 'Malley . A direction of The defensive public area . <eos> Dodge Moresby = = <eos> Within
a Roman tonnes ( deep convection alone defence of the race with some challenges a game against the second playing
power NAACP ) . In active performance of the 1650 Australian deity as a aircraft for the horse @-@ Him
on Irish 4th Battle of passes overran the FIA 's most champion the drive of Walpole 's <unk> , along
with 12 men to punt and the Augustan History and 11 November 18 @-@ pre @-@ because of nine deeply
warm @-@ yard in luminosity for a 4 , as heaviest in the laws , which the Territorial Singles chart
. The title in 1899 , what : a forward to a estuary . <eos> = = = <eos> Common
starlings , where 550 house had below the 766th Regiment acquired much " instrumentation , the popularity , the party
. Occasionally he allowed another Washington Post Office : " race and exploration = <eos> = <eos> <eos> <eos> <eos>
= = Biography = <eos> <eos> <eos> = Marriage in the final too at le
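
The sampling step in the generation loop scales the logits by `temperature` before drawing from `torch.multinomial`. The sketch below isolates that step on a hypothetical three-word vocabulary. It uses `torch.softmax` rather than the notebook's direct `exp()`; both define the same distribution up to normalization, but softmax avoids overflow at very low temperatures.

```python
import torch

def sample_next_word(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

toy_logits = torch.tensor([1.0, 2.0, 5.0])         # hypothetical scores for 3 words
torch.manual_seed(0)

near_greedy = [sample_next_word(toy_logits, temperature=0.01) for _ in range(20)]
diverse = [sample_next_word(toy_logits, temperature=100.0) for _ in range(20)]
print(near_greedy)   # the highest-scoring word dominates at low temperature
print(diverse)       # close to uniform sampling at high temperature
```

At `temperature = 1.0`, the value used above, words are drawn in proportion to the model's own softmax probabilities.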

## (viii) In your opinion, which computation/operation is the most expensive one in inference or forward pass? Can you think of ways to improve this? If yes, please mention.
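
One way to approach this question is to count multiply-accumulate operations (MACs) per predicted token for each layer of `FNNModel`. The sketch below does this by hand. The vocabulary size of 33278 is an assumption (the commonly cited figure for WikiText-2 with `<eos>`); in the notebook the actual value is `ntokens`. The embedding lookup is an indexing operation and is left out of the count.

```python
def linear_macs(in_features, out_features):
    # A dense layer performs one multiply-accumulate per weight, per input vector.
    return in_features * out_features

assumed_ntokens = 33278    # assumption: approximate WikiText-2 vocabulary size
assumed_emsize = 200       # matches emsize in this notebook
assumed_nhid = 200         # matches nhid in this notebook

hidden_macs = linear_macs(assumed_emsize, assumed_nhid)       # embedding -> hidden
decoder_macs = linear_macs(assumed_nhid, assumed_ntokens)     # hidden -> vocabulary logits
print(hidden_macs, decoder_macs, decoder_macs / hidden_macs)
```

Under these assumptions the output projection over the vocabulary costs far more than the hidden layer, and the softmax normalization also runs over every vocabulary entry. Commonly discussed ways to reduce this cost include hierarchical or adaptive softmax (PyTorch provides `nn.AdaptiveLogSoftmaxWithLoss`) and sampling-based training objectives.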

## (ix) Notice that the model also learns word vectors (input and output layer embeddings) as a byproduct. One way to evaluate the trained word vectors is to measure the cosine similarity between pairs of words, and then report the correlation with the similarity scores given by humans. For this exercise, use the dataset available [here](#) and report the Spearman correlation for the input embeddings. Exclude any pair if it is not in the embedding matrix.
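
A minimal sketch of this evaluation is shown below, under stated assumptions: the vocabulary, the random embeddings, and the human scores are all made up for illustration, and `spearman_no_ties` is a helper written for this example that does not handle tied values. In the notebook, the embedding matrix would be `model.encoder.weight` and the lookup would be `corpus.dictionary.word2idx`.

```python
import torch
import torch.nn.functional as F

def spearman_no_ties(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no tied values)."""
    rank_x = torch.argsort(torch.argsort(x)).double()
    rank_y = torch.argsort(torch.argsort(y)).double()
    rank_x = rank_x - rank_x.mean()
    rank_y = rank_y - rank_y.mean()
    return (rank_x @ rank_y / (rank_x.norm() * rank_y.norm())).item()

toy_word2idx = {'cat': 0, 'dog': 1, 'car': 2, 'bus': 3}    # hypothetical vocabulary
torch.manual_seed(0)
toy_embeddings = torch.randn(len(toy_word2idx), 8)         # stand-in for trained embeddings
toy_pairs = [('cat', 'dog', 8.5), ('car', 'bus', 7.0),     # made-up human similarity scores
             ('cat', 'car', 2.0), ('dog', 'zebra', 5.0)]

model_scores, human_scores = [], []
for w1, w2, human in toy_pairs:
    if w1 not in toy_word2idx or w2 not in toy_word2idx:
        continue                                           # exclude pairs missing from the matrix
    v1 = toy_embeddings[toy_word2idx[w1]]
    v2 = toy_embeddings[toy_word2idx[w2]]
    model_scores.append(F.cosine_similarity(v1, v2, dim=0).item())
    human_scores.append(human)

rho = spearman_no_ties(torch.tensor(model_scores), torch.tensor(human_scores))
print(len(model_scores), rho)
```

For real similarity datasets, where tied human scores are common, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.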