# An RNN Transducer-based Language Model¶

In this problem we will
- Build an LSTM transducer-based language model using early stopping and compute the text perplexity.
- Use the model to generate sentences.
- Extend the model and compare performance when we 
    - replace the LSTM with a GRU or a Simple RNN
    - increase the number of LSTM layers
    - add dropout
    - add gradient clipping
    
You can develop on your local machine, but to train on the full training set requires GPUs.  We recommend using the GPUs at [Google Colab](https://colab.research.google.com). To upload a notebook, choose the "Files" dropdown menu and then "Upload."  To use a GPU, choose Runtime > Change runtime type and select GPU.    
    
Acknowledgement:  This assignment was originally written by Zewei Chu, and was inspired by a [homework in CS287](https://github.com/harvard-ml-courses/cs287-s18/blob/master/HW2/Homework%202.ipynb) at Harvard.
    

### Development vs full version

Choose the appropriate version using the switches `DEVELOPING` and `COLAB.`

In [455]:
!pip install torchtext==0.9.0
!pip install torch==1.8.1




distutils: c:\users\hocke\appdata\local\programs\python\python39\Include\UNKNOWN
sysconfig: c:\users\hocke\appdata\local\programs\python\python39\Include
user = False
home = None
root = None
prefix = None
distutils: c:\users\hocke\appdata\local\programs\python\python39\Include\UNKNOWN
sysconfig: c:\users\hocke\appdata\local\programs\python\python39\Include
user = False
home = None
root = None
prefix = None
You should consider upgrading via the 'c:\users\hocke\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.


In [456]:
import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random

USE_CUDA = torch.cuda.is_available()

if USE_CUDA:
    DEVICE = torch.device('cuda')
    print("Using cuda.")
else:
    DEVICE = torch.device('cpu')
    print("Using cpu.")

seed = 53113    
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if USE_CUDA:
    torch.cuda.manual_seed(seed)

# Change the following to false when training on
# the full set
# DEVELOPING = True    
DEVELOPING = False

if DEVELOPING:
    print('Small development version')
    BATCH_SIZE = 4
    EMBEDDING_SIZE = 20
    MAX_VOCAB_SIZE = 5000
    TRAIN_DATA_SET = "lm-train-small.txt"
    DEV_DATA_SET = "lm-dev-small.txt"
    TEST_DATA_SET = "lm-test-small.txt"
    BPTT_LENGTH = 8
    COLAB = False
    #COLAB = True
else:
    print('Full version')
    BATCH_SIZE = 32
    EMBEDDING_SIZE = 650
    MAX_VOCAB_SIZE = 50000
    TRAIN_DATA_SET = "lm-train.txt"
    DEV_DATA_SET = "lm-dev.txt"
    TEST_DATA_SET = "lm-test.txt"
    BPTT_LENGTH = 32
    # COLAB = False
    COLAB = True

# For uploading data to Colab see, e.g., 
# https://medium.com/@philipplies/transferring-data-from-google-drive-to-google-cloud-storage-using-google-colab-96e088a8c041    

if COLAB:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    PATH = "/content/drive/MyDrive/datasets"
else:
    PATH = ".\datasets"
    
LOG_FILE = "language-model.log"


Using cpu.
Full version


### Preprocessing using the legacy component of TorchText

TorchText is being upgraded.  For our preprocessing we have use its legacy component.  [Documentation](https://torchtext.readthedocs.io/en/latest/index.html) for this legacy component torchtext is relatively sparse (and, unfortunately, not very clear), but [Ben Trevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb) has a useful tutorial.  (If you are keen to understand this component, you may also want to look at the [source code](https://github.com/pytorch/text/tree/master/torchtext/legacy).)

All the **legacy torchtext code is already provided**. 


In [457]:
TEXT = torchtext.legacy.data.Field(lower=True)

train, val, test = torchtext.legacy.datasets.LanguageModelingDataset.splits(path=PATH, 
    train=TRAIN_DATA_SET, validation=DEV_DATA_SET, test=TEST_DATA_SET, text_field=TEXT)

TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
VOCAB_SIZE = len(TEXT.vocab)

print(f'Vocabulary size: {VOCAB_SIZE}')

train_iter, val_iter, test_iter = torchtext.legacy.data.BPTTIterator.splits(
    (train, val, test), batch_size=BATCH_SIZE, device=DEVICE, bptt_len=BPTT_LENGTH, 
    repeat=False)


Vocabulary size: 50002


### Back propagation through time (BPTT) iterator

The [BPTTIterator](https://torchtext.readthedocs.io/en/latest/data.html#bpttiterator) is a custom torchtext iterator for language modeling using RNNs.  Suppose the text in an example is "the quick brown fox".  The target in the transducer-based RNN language model would then be "quick brown fox jumps".  This allows every prefix of the text to be used as an training example, with the corresponding word in the target text as the target word.  So the above would lead to four examples, written as text sequence -> target word:
* "the" -> "quick"
* "the quick" -> "brown"
* "the quick brown" -> "fox"
* "the quick brown fox" -> "jump"

(Unlike some of the examples in class, here we treat words as part of a sequence without special consideration for sentences.  In particular, we don't use start/end of setence tags.)

One very **significant feature** of the BPTTIterator is that examples continue across batches.  To illustrate let the original data be one long seqence $w_1, w_2, \ldots, w_N$, in which, say, $N = 4,000$.  Further let each batch consist of $4$ examples, each of length 8.  Then the first batch created by BPTTIterator would be the following 4 examples---

- $(w_1, w_2, \ldots, w_{8}), (w_{1001}, w_{1002}, \ldots, w_{1008}), \ldots, (w_{3001}, w_{3002}, \ldots, w_{3008}).$ 

and the second batch would be---

- $(w_{9}, w_{10}, \ldots, w_{16}), (w_{1009}, w_{1010}, \ldots, w_{1016}), \ldots, (w_{3009}, w_{3010}, \ldots, w_{3016}).$

This has implications on how the hidden state of the RNN is set for the second batch onwards.

In [458]:
it = iter(train_iter)
batch = next(it)
print("The first three text/target sequences from the first batch are:\n")
indent = " " * 4
for j in range(3):
    print(indent, f"Text Sequence {j}:", 
          " ".join([TEXT.vocab.itos[i] for i in batch.text[:,j].data]))
    print(indent, f"Target Sequence {j}:",
          " ".join([TEXT.vocab.itos[i] for i in batch.target[:,j].data]))
    print()
 
print(f"Each sequence has BPTT_LENGTH = {BPTT_LENGTH}.\n")
print("Also the sequences continue in the next batch!\n")
batch = next(it)
for j in range(3):
    print(indent, f"Text Sequence {j}:", 
          " ".join([TEXT.vocab.itos[i] for i in batch.text[:,j].data]))
    print(indent, f"Target Sequence {j}:",
          " ".join([TEXT.vocab.itos[i] for i in batch.target[:,j].data]))
    print()

The first three text/target sequences from the first batch are:

     Text Sequence 0: anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term
     Target Sequence 0: originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is

     Text Sequence 1: had dropped to just three zero zero zero k it was then cool enough to allow the nuclei to capture electrons this process is called recombination during which the first neutral atoms
     Target Sequence 1: dropped to just three zero zero zero k it was then cool enough to allow the nuclei to capture electrons this process is called recombination during which the first neutral atoms took

     Text Sequence 2: in eastern asia after the great expansion from the african continent 

#### Initializing hidden vectors from the detached hidden vectors of previous batch

Since sequences continue across batches, for proper training, **the final output hidden vectors in a batch should be used to initialize the hidden vectors for the next batch**.  But care should be taken to detach vectors used for initialization from the computational graph, else gradients would flow "from one batch to the previous" and training would be increasingly slow. 

### Define the model


Our RNN based language model (when using an LSTM) for a language model is as follows:
- Let the input sequence---the *context*---be $w_1, w_2, \ldots, w_n$, and let the target sequence be $w_2, \ldots, w_n, w_{n+1}$.
- At step $i$ of the input, for $1 \leq i \leq n$:
    - $x_i = E_{[w_i]}$.
    - $y_i, (h_i, c_i) = \text{LSTM}(x_i, (h_{i-1}, c_{i-1}))$.  For LSTMs, $y_i$ equals $h_i$.
    - $\widehat{y}_i = \text{softmax}(y_i W + b)$, in which $\widehat{y}_i$ is the predicted probability distribution for $w_{i+1}$.
    - In the above 
        - $x_i$ is $1 \times \text{embedding dim}$ 
        - $y_i$, $h_i$ and $c_i$ are $1 \times \text{hidden dim}$
        - $\widehat{y}_i$ is $1 \times \text{vocab size}$.
- The loss $\ell = \sum_{i=1}^n \log \widehat{y}_{i_{[w_{i+1}]}}$, in which $\log \widehat{y}_{i_{[w_{i+1}]}}$ is the component of $\log \widehat{y}_{i}$ corresponding to the element $w_{i+1}$.

Since the sequences continue across batches we retain the hidden states across batches. Specifically, consider the $k$th example in batch $j$.  For $j=1$, i.e., first batch, the corresponding $(h_0, c_0)$ for the $k$th example is set to all zeros.  But for $j > 1$, the corresponding $(h_0, c_0)$ is set to $(h_{n}, c_{n})$ of the $k$th example in batch $j-1$.

In PyTorch we do not call the forward function separately for each step $i$.  Instead we call the model with

- tensors corresponding to $(w_1, w_2, \ldots, w_n)$ and $(h_0, c_0)$

and receive as ouput

- $(y_1, y_2, \ldots, y_n)$ and $(h_n, c_n)$.

Further the above is combined for several examples into one batch.  Please read the PyTorch documentation to learn more about building models
            with RNNs.  E.g., see the documentation on [LSTMs](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) and [Robert Gutherie's Tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py) on working with LSTMs.
            
The above can be adapted easily to [GRUs](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) or [Simple RNNs](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) since the PyTorch interface is very similar.

**Task 1** [10 points]: Complete the code for the class `RNNLM` based on the description above.  (Some extra parameters are provided since in a later task, you'll modify your code to incorporate the following: (i) replace the LSTM with a GRU or a Simple RNN, (ii) increase the number of LSTM layers, and (iii) add dropout.) 

In [459]:
import torch
import torch.nn as nn


class RNNLM(nn.Module):
    """ Container module with an linear encoder/embedding, an RNN module, and a linear decoder.
    """

    def __init__(self, rnn_type, vocab_size, embedding_dim, hidden_dim, num_layers, dropout=0.5):
        ''' Initialize model parameters corresponding to ---
            - embedding layer
            - recurrent neural network layer---one of LSTM, GRU, or RNN---with 
              optionally more than one layer
            - linear layer to map from hidden vector to the vocabulary
            - optionally, dropout layers.  Dropout layers can be placed after 
              the embedding layer or/and after the RNN layer. Dropout within
              an RNN is only applied when there are two or more num_layers.
            - optionally, initialize the model parameters.
            
            The arguments are:
            
            rnn_type: One of 'LSTM', 'GRU', 'RNN_TANH', 'RNN_RELU'
            vocab_size: size of vocabulary
            embedding_dim: size of an embedding vector
            hidden_dim: size of hidden/state vector in RNN
            num_layers: number of layers in RNN
            dropout: dropout probability.
            
        '''
        super(RNNLM, self).__init__()
        ## YOUR CODE HERE ##
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.rnn_type = rnn_type
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.gru = nn.GRU(embedding_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.linear = nn.Linear(hidden_dim, vocab_size) # fully connected hidden layer
        
    def forward(self, input, hidden0):
        ''' 
        Run forward propagation for a given minibatch of inputs using
        hidden0 as the initial hidden state.
        In LSTMs hidden0 = (h0, c0). 
        The output of the RNN includes the hidden vector hiddenn = (hn, cn).
        Return this as well so that it can be used to initialize the next
        batch.
        Unlike previous homework sets do not apply softmax or logsoftmax here, since we'll use
        the more efficient CrossEntropyLoss.  See 
        https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html.
        '''
        ###YOUR CODE HERE###
        x = self.embeddings(input)  # 1 x embedding_dim
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(DEVICE) 
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(DEVICE)  
        hidden0 = (h0, c0)

        if self.rnn_type == "RNN":
          yi, _ = self.rnn(x, h0)  # 1 x hidden_dim
        elif self.rnn_type == "GRU":
          yi, _ = self.gru(x, h0)  # 1 x hidden_dim
        elif self.rnn_type == "LSTM":
          yi, _ = self.lstm(x, hidden0)  # 1 x hidden_dim
        else:
          print(f"{self.rnn_type} is not a valid rnn_type. Try RNN, LSTM, or GRU.")
          return

        yi_hat = self.linear(yi) 
        return yi_hat, hidden0
        
    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.hidden_dim),
                torch.zeros(self.num_layers, sequence_length, self.hidden_dim))
 


### Evaluate on a given data set

The function for evaluation is provided below.

In [460]:
def evaluate(model, data):
    '''
    Evaluate the model on the given data.
    '''
    model.eval()
    it = iter(data)
    total_count = 0. # Number of target words seen
    total_loss = 0. # Loss over all target words
    with torch.no_grad():
        # No gradients need to be maintained during evaluation
        # There are no hidden tensors for the first batch, and so will default to zeros.
        hidden = None 
        for i, batch in enumerate(it):
            ''' 
              Do the following:
                - Extract the text and target from the batch, and if using CUDA (essentially, using GPUs), place 
                  the tensors on cuda, using a commands such as "text = text.cuda()".  More details are at
                  https://pytorch.org/docs/stable/notes/cuda.html.
                - Pass the hidden state vector from output of previous batch as the initial hidden vector for
                  the current batch. 
                - Call forward propagation to get output and final hidden state vector.
                - Compute the cross entropy loss
                - The loss_fn computes the average loss per target word in the batch.  Count the number of target
                  words in the batch (it is usually the same, except for the last batch), and use it to track the 
                  total count (of target words) and total loss see so far over all batches.
            '''
            text, target = batch.text, batch.target
            if USE_CUDA:
                text, target = text.cuda(), target.cuda()
            output, hidden = model(text, hidden)
            loss = loss_fn(output.view(-1, output.size(-1)), target.view(-1))
                  
            total_count += np.multiply(*text.size())
            total_loss += loss.item()*np.multiply(*text.size())
                
    loss = total_loss / total_count
    model.train()
    return loss


### Train the model

Training the model is mostly similar to previous homework sets except for:
- A detached hidden vector is applied to the second batch onwards as described above.
- Every, say, 10,000 iterations evaluate the model on a validation set, and if the mean loss is the lowest so far, save a copy of it.  After training, this "best model" is used for testing. 
      
**Task 2** [15]: Complete the code below for training the model.

In [461]:
GRAD_CLIP = 1 # [0, 0.5, 1]
NUM_EPOCHS = 2
NUM_LAYERS = 2  # [2, 3, 4, 5, 6]
DROPOUT = 0.5 # [0, 0.25, 0.5, 0.75, 1]

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if h is None:
        return None
    elif isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

model = RNNLM("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT)
if USE_CUDA:
    model = model.cuda()

loss_fn = nn.CrossEntropyLoss() ## Used instead of NLLLoss.
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
val_losses = []
best_model = None
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    # There are no hidden tensors for the first batch, and so will default to zeros.
    hidden = None
    for i, batch in enumerate(it):
      ''' Do the following:
          - Extract the text and target from the batch, and if using CUDA (essentially, using GPUs), place 
            the tensors on cuda, using a commands such as "text = text.cuda()".  More details are at
            https://pytorch.org/docs/stable/tensors.html#torch.Tensor.cuda
          - Pass the hidden state vector from output of previous batch as the initial hidden vector for
            the current batch. But detach each tensor in the hidden state vector using tensor.detach() or
            the provided repackage_hidden(). See
            https://pytorch.org/docs/master/generated/torch.Tensor.detach_.html#torch-tensor-detach
          - Zero out the model gradients to reset backpropagation for current batch
          - Call forward propagation to get output and final hidden state vector.
          - Compute the cross entropy loss
          - Run back propagation to set the gradients for each model parameter.
          - Clip the gradients that may have exploded. See Sec 5.2.4 in the Goldberg textbook, and
            https://pytorch.org/docs/master/generated/torch.nn.utils.clip_grad_norm_.html#torch-nn-utils-clip-grad-norm
          - Run a step of gradient descent. 
          - Print the batch loss after every few iterations. (Say every 100 when developing, every 1000 otherwise.)
          - Evaluate your model on the validation set after every, say, 10000 iterations and save it to val_losses. If
            your model has the lowest validation loss so far, copy it to best_model. For that it is recommended that
            copy the state_dict rather than use deepcopy, since the latter doesn't work on Colab.  See discussion at 
            https://discuss.pytorch.org/t/deep-copying-pytorch-modules/13514. This is Early Stopping and is described
            in Sec 2.3.1 of Lecture notes by Cho: 
            https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
      '''
      ###YOUR CODE HERE###
      text, target = batch.text, batch.target
      if USE_CUDA:
          text, target = text.cuda(), target.cuda()
      hidden = repackage_hidden(hidden)
      optimizer.zero_grad()
      yi_hat, hidden = model(text, hidden)  # forward propogation
      loss = loss_fn(yi_hat.view(-1, yi_hat.size(-1)), target.view(-1))
      loss.backward() # backward propogation
      nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
      optimizer.step()
      if DEVELOPING:
        if i % 100 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      else:
        if i % 1000 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      min_loss = float('inf')
      if i % 10000 == 0:
        if loss < min_loss:
          best_model = type(model)("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT)  # get a new instance
          best_model.load_state_dict(model.state_dict())  # copy weights and stuff
        val_losses.append(loss)


At iteration 0 the loss is 10.823.
At iteration 1000 the loss is 6.632.
At iteration 2000 the loss is 6.586.
At iteration 3000 the loss is 6.421.
At iteration 4000 the loss is 5.763.
At iteration 5000 the loss is 6.255.
At iteration 6000 the loss is 6.155.
At iteration 7000 the loss is 5.970.
At iteration 8000 the loss is 6.239.
At iteration 9000 the loss is 5.952.
At iteration 10000 the loss is 6.033.
At iteration 11000 the loss is 6.189.
At iteration 12000 the loss is 6.161.
At iteration 13000 the loss is 5.930.
At iteration 14000 the loss is 5.795.
At iteration 0 the loss is 5.973.
At iteration 1000 the loss is 5.973.
At iteration 2000 the loss is 6.044.
At iteration 3000 the loss is 5.974.
At iteration 4000 the loss is 5.424.
At iteration 5000 the loss is 5.974.
At iteration 6000 the loss is 5.863.
At iteration 7000 the loss is 5.744.
At iteration 8000 the loss is 5.894.
At iteration 9000 the loss is 5.627.
At iteration 10000 the loss is 5.786.
At iteration 11000 the loss is 5.949.

In [462]:
'''
Evaluate the loss of best_model on the validation set and compute its perplexity.
'''
val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))

perplexity:  377.5835641936008


### Use the best model to evaluate the test dataset. 

We expect a test perplexity of less than 250 on the full model after a couple of epochs.

In [463]:
'''
Evaluate the loss of best_model on the test set and compute its perplexity.
'''
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))

perplexity:  442.47708682557


### Use the model to generate some sentences

**Task 3** [10]: Write code to generate random sentences.  Section 9.5 in the Goldberg textbook describes how this can be done.  Since we don't have a start symbol, for the first word simply pick a random word from the vocabulary.

You'll notice that the full sequences don't make much sense, but subsequences sound reasonably correct. 

In [464]:
'''
Use the model to generate 5 random sequences of length 50 each.

Pick a word, uniformly at random.
Form an N-gram out of the picked word and the last N−1 words.
Look up the probability of that particular N-gram.
Generate a uniform random number between 0 and 1. 
If that number is smaller than the probability of your N-gram, "accept" the new word. Otherwise, go back to the start.
'''
###YOUR CODE HERE###
def random_sentence_generator(model, start, sequence_len=50):
    model.eval()
    words = start.split(' ')
    h0, c0 = model.init_state(len(words)) 

    for i in range(sequence_len):
        x = torch.tensor([[TEXT.vocab[w] for w in words[i:]]])
        yi_hat, (h0, c0) = model(x, (h0, c0))

        last_word_logits = yi_hat[0][-1]
        p = nn.functional.softmax(last_word_logits, dim=0).detach().numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(TEXT.vocab.itos[word_index])
        
    sentence = ""
    for word in words[:-1]:
        sentence += word + " "
    sentence += f"{words[-1]}."
    return sentence


for i in range(5):
    start = TEXT.vocab.itos[np.random.randint(VOCAB_SIZE)]  # choose a random word from the vocabulary
    print(random_sentence_generator(best_model, start))
    print()

cory go to the academy of alexander s goal off his first down during his hands back soviet union alexander the term image was first one nine nine in armies who can be of judah the reporter in its name but as lens of winning lines of alexander s soldiers in.

sternum drunken originates at both bengal wood and human beings western cities in asia based nor called d where the united nations complex and to solve it some types of the seven three during the delaware in several of the <unk> creators found null <unk> off freedom one seven irish invasion.

stunned alexander new technology there is typically related networks example of computer science that had close to reducing the advance the euro approaches the outposts of the higher and turkish military coups in our super exports zur <unk> and newspapers such an approximately one terrorist players who until the orthodox empire.

sponges not go passing line which were originally founded by the key when quantum but not listed from the armed for

### Choose the best sentence from alternatives

Generating random sentences as above is, however, not the objective of a language model.  Rather it is used as an auxiliary tool to choose the best sequence given some choices by comparing their perplexities.

**Task 4** [5]: Use the code below to compute perplexities of the given six sentences.  Discuss the model's performance in choosing the best alternative.  (The code uses TorchText functions which are designed for much larger datasets.  So the perplexities below are approximate. Even so they illustrate the usefullness of our model.)

In [465]:
'''
Notes on perplexity:
Perplexity is the probability of the test set, normalized by the number of words.
Minimizing perplexity is the same as maximizing probability. 
Perplexity is "on average, how many different words could come next?" 
If 10 words could come next each with 1/10 probability, then perplexity is 10. 
This is an average of the weighted probabilities. "average branch factor"

Task 4:
Developer Mode: 
Sentence 6 outperforms all of the other sentences with the lowest perplexity. 
This makes sense because the model is evaluating a set of words that we know make sense in sequence,
and we then repeat those words.

The model also impressively increase the perplexity after swapping the word "world" for "cat" as if to 
note that it makes more sense for us to be talking about world (plural implied) herd immunity to the virus, 
rather than a single cat.

I do not have an explanation for why sen5 has a lower perplexity than sen1 - sen4. 
I suspect this is an error of running in Developer mode with a lack of training.

Google Colab Mode:
TBD - unable to run GPU.
'''


'\nNotes on perplexity:\nPerplexity is the probability of the test set, normalized by the number of words.\nMinimizing perplexity is the same as maximizing probability. \nPerplexity is "on average, how many different words could come next?" \nIf 10 words could come next each with 1/10 probability, then perplexity is 10. \nThis is an average of the weighted probabilities. "average branch factor"\n\nTask 4:\nDeveloper Mode: \nSentence 6 outperforms all of the other sentences with the lowest perplexity. \nThis makes sense because the model is evaluating a set of words that we know make sense in sequence,\nand we then repeat those words.\n\nThe model also impressively increase the perplexity after swapping the word "world" for "cat" as if to \nnote that it makes more sense for us to be talking about world (plural implied) herd immunity to the virus, \nrather than a single cat.\n\nI do not have an explanation for why sen5 has a lower perplexity than sen1 - sen4. \nI suspect this is an error

In [466]:
sen1 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is " 
"crushing "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen2 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is "
"dancing "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen3 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is " 
"run "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen4 = ("Early in the pandemic, there was hope that the "
"cat "
" would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is"
"run "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen5 = sen1.split()
random.shuffle(sen5)
sen5 = " ".join(sen5)

sen6 = " ".join(['Early in the pandemic']*8)

sen_list = [sen1, sen2, sen3, sen4, sen5, sen6]

for sen in sen_list:

    print(sen)
    with open(PATH + "temp_sentence.txt", 'w') as text_file:
        print(sen, file = text_file)

    temp_ds = torchtext.legacy.datasets.LanguageModelingDataset(path=PATH + 'temp_sentence.txt', text_field=TEXT)


    sen_iter = torchtext.legacy.data.BPTTIterator(temp_ds, batch_size=BATCH_SIZE, device=DEVICE, 
                                                  bptt_len=BPTT_LENGTH, repeat=False)
        
    sen_loss = evaluate(best_model, sen_iter)
    print("perplexity: ", np.exp(sen_loss))
    print()


Early in the pandemic, there was hope that the world would one day achieve herd immunity, the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is crushing India with a fearsome second wave and surging in countries from Asia to Latin America.
perplexity:  1163.2668983855197

Early in the pandemic, there was hope that the world would one day achieve herd immunity, the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is dancing India with a fearsome second wave and surging in countries from Asia to Latin America.
perplexity:  1257.308100991085

Early in the pandemic, there was hope that the world would one day achieve herd immunity, the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is run India with a fearsome second wave and surging in countries from Asia to Latin America.
perplexity:  1206.5105081050533

Early in the pandemic, there was hope that the cat  would o

### Extensions

**Task 5** [10]: Extend your model to incorporate the following options: (i) substitute the LSTM with a GRU or a Simple RNN, (ii) increase the number of LSTM layers, (iii) add dropout, (iv) add gradient clipping.  Report on the combination of these options which gives the best performance.

In [467]:
'''
Task 5:
Small Development Mode:
Perplexities after running the models with various numbers of layers:

NUM LAYERS	LSTM	LSTM	LSTM sen1-3	LSTM sen4	LSTM sen5	LSTM sen6	RNN	    RNN	    GRU 	GRU 	average
2	        1974	1784	1899	    21.82	    1627	    762	        1881	1679	2179	1942	1574.882
3	        2216	1974	2145	    2552	    1852	    816	        2024	1788	1867	1665	1889.9
4	        1973	1757	1863	    2210	    1668	    774	        1922	1720	1971	1739	1759.7
5	        1862	1658	1805	    2162	    1625	    787	        1970	1744	2050	1807	1747
6	        2208	1978	2048	    2439	    1773	    807	        2097	1837	1962	1734	1888.3

2 layers appears to be optimal.

Perplexities after running the models with various Dropout values:

Dropout	LSTM	LSTM	LSTM sen1-3	LSTM sen4	LSTM sen5	LSTM sen6	RNN	    RNN	    GRU 	GRU 	average
0	    1908	1721	1715	    1963	    1573	    677	        1872	1676	1887	1688	1668
0.25	1934	1744	1767	    2039	    1588	    710	        1848	1660	2105	1885	1728
0.5	    1974	1784	1899	    2182	    1627    	762	        1881	1679	2179	1942	1790.9
0.75	2069	1872	2069	    2409	    1718	    847	        1973	1753	2119	1909	1873.8
1	    1995	1847	2103	    2440	    1743	    997	        2075	1860	2112	1925	1909.7

Dropout = 0 appears to be optimal.

Perplexities after running the models with various clipping values:

Grad Clip	LSTM	LSTM	LSTM sen1-3	LSTM sen4	LSTM sen5	LSTM sen6	RNN	    RNN	    GRU 	GRU 	average
0	        1995	1847	2103	    2440	    1743	    997	        2075	1860	2112	1925	1909.7
0.5	        1911	1722	1718	    1968	    1569	    681	        1851	1661	1907	1723	1671.1
1	        1908	1721	1715	    1963	    1573	    677	        1872	1676	1927	1738	1677

There was not much difference between 0.5 and 1.0 clipping, but both were better than 0 clipping.

I would choose 2 layers, 0 dropout, and 0.5 clipping.


Full Version (Using 2 layers, 0 dropout, and 0.5 clipping):
LSTM	LSTM	LSTM sen1   LSTM sen2   LSTM sen3	LSTM sen4	LSTM sen5	LSTM sen6
377     442     1163        1257        1206        1657        12980       55678
'''


'\nTask 5:\nPerplexities after running the models with various numbers of layers:\n\nNUM LAYERS\tLSTM\tLSTM\tLSTM sen1-3\tLSTM sen4\tLSTM sen5\tLSTM sen6\tRNN\t    RNN\t    GRU \tGRU \taverage\n2\t        1974\t1784\t1899\t    21.82\t    1627\t    762\t        1881\t1679\t2179\t1942\t1574.882\n3\t        2216\t1974\t2145\t    2552\t    1852\t    816\t        2024\t1788\t1867\t1665\t1889.9\n4\t        1973\t1757\t1863\t    2210\t    1668\t    774\t        1922\t1720\t1971\t1739\t1759.7\n5\t        1862\t1658\t1805\t    2162\t    1625\t    787\t        1970\t1744\t2050\t1807\t1747\n6\t        2208\t1978\t2048\t    2439\t    1773\t    807\t        2097\t1837\t1962\t1734\t1888.3\n\n2 layers appears to be optimal.\n\nPerplexities after running the models with various Dropout values:\n\nDropout\tLSTM\tLSTM\tLSTM sen1-3\tLSTM sen4\tLSTM sen5\tLSTM sen6\tRNN\t    RNN\t    GRU \tGRU \taverage\n0\t    1908\t1721\t1715\t    1963\t    1573\t    677\t        1872\t1676\t1887\t1688\t1668\n0.25\t1934

In [468]:
### RNN ###
model = RNNLM("RNN", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT)
if USE_CUDA:
    model = model.cuda()

loss_fn = nn.CrossEntropyLoss()  # Used instead of NLLLoss.
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
val_losses = []
best_model = None
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    hidden = None
    for i, batch in enumerate(it):
      text, target = batch.text, batch.target
      if USE_CUDA:
          text, target = text.cuda(), target.cuda()
      hidden = repackage_hidden(hidden)
      optimizer.zero_grad()
      yi_hat, hidden = model(text, hidden)  # forward propogation
      loss = loss_fn(yi_hat.view(-1, yi_hat.size(-1)), target.view(-1))
      loss.backward()  # backward propogation
      nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
      optimizer.step()
      if DEVELOPING:
        if i % 100 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      else:
        if i % 1000 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      min_loss = float('inf')
      if i % 10000 == 0:
        if loss < min_loss:
          best_model = type(model)("RNN", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT) 
          best_model.load_state_dict(model.state_dict())
        val_losses.append(loss)

'''
Evaluate the loss of best_model on the validation set and compute its perplexity.
'''
val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))
'''
Evaluate the loss of best_model on the test set and compute its perplexity.
'''
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))


At iteration 0 the loss is 10.842.
At iteration 1000 the loss is 6.556.


KeyboardInterrupt: 

In [None]:
### GRU ###
model = RNNLM("GRU", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT)
if USE_CUDA:
    model = model.cuda()

loss_fn = nn.CrossEntropyLoss()  # Used instead of NLLLoss.
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
val_losses = []
best_model = None
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    hidden = None
    for i, batch in enumerate(it):
      text, target = batch.text, batch.target
      if USE_CUDA:
          text, target = text.cuda(), target.cuda()
      hidden = repackage_hidden(hidden)
      optimizer.zero_grad()
      yi_hat, hidden = model(text, hidden)  # forward propogation
      loss = loss_fn(yi_hat.view(-1, yi_hat.size(-1)), target.view(-1))
      loss.backward()  # backward propogation
      nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
      optimizer.step()
      if DEVELOPING:
        if i % 100 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      else:
        if i % 1000 == 0:
          print(f'At iteration {i} the loss is {loss:.3f}.')
      min_loss = float('inf')
      if i % 10000 == 0:
        if loss < min_loss:
          best_model = type(model)("GRU", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, NUM_LAYERS, dropout=DROPOUT)
          best_model.load_state_dict(model.state_dict())
        val_losses.append(loss)

'''
Evaluate the loss of best_model on the validation set and compute its perplexity.
'''
val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))
'''
Evaluate the loss of best_model on the test set and compute its perplexity.
'''
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))


At iteration 0 the loss is 8.231.
At iteration 100 the loss is 7.573.
At iteration 200 the loss is 6.492.
At iteration 300 the loss is 6.776.
At iteration 400 the loss is 5.931.
At iteration 500 the loss is 7.974.
At iteration 0 the loss is 6.174.
At iteration 100 the loss is 6.231.
At iteration 200 the loss is 6.105.
At iteration 300 the loss is 6.306.
At iteration 400 the loss is 5.695.
At iteration 500 the loss is 7.796.
perplexity:  1907.9382018673427
perplexity:  1723.3735961812633
