<a href="https://colab.research.google.com/github/deniztokmakoglu/CAPP30254_Project/blob/main/hw4_rnn_lang_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An RNN Transducer-based Language Model¶

In this homework we will
- Build an LSTM transducer-based language model using early stopping and compute the text perplexity.
- Use the model to generate sentences.
- Extend the model and compare performance when we 
    - replace the LSTM with a GRU or a Simple RNN
    - increase the number of LSTM layers
    - add dropout
    - add gradient clipping
    
You can develop on your local machine, but to train on the full training set requires GPUs.  We recommend using the GPUs at [Google Colab](https://colab.research.google.com). To upload a notebook, choose the "Files" dropdown menu and then "Upload."  To use a GPU, choose Runtime > Change runtime type and select GPU.    
    
Acknowledgement:  This assignment was originally written by Zewei Chu, and was inspired by a [homework in CS287](https://github.com/harvard-ml-courses/cs287-s18/blob/master/HW2/Homework%202.ipynb) at Harvard.
    

### Using an older version of torchtext

Torchtext is undergoing rapid development.  The latest version of the library has dropped some components, which are expected to be revamped and added back in the future.  So for this homework, we'll have to work with a slightly older version, 0.11.2.  Please use the following command to install the correct version, if your version is different.

`!pip install torchtext==0.11.2`


In [3]:
#!pip install torchtext==0.11.2

In [76]:
# Check your version
import torch
import torchtext
# On Colab, you'll see ('1.10.2+cu102', '0.11.2')
torch.__version__, torchtext.__version__

('1.10.2+cu102', '0.11.2')

### Development vs full version

Choose the appropriate version of the parameters using the switches `DEVELOPING` and `COLAB.`

In [77]:
import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random

USE_CUDA = torch.cuda.is_available()

if USE_CUDA:
    DEVICE = torch.device('cuda')
    print("Using cuda.")
else:
    DEVICE = torch.device('cpu')
    print("Using cpu.")

random.seed(30255)
np.random.seed(30255)
torch.manual_seed(30255)
if USE_CUDA:
    torch.cuda.manual_seed(30255)

# Change the following to false when training on
# the full set
#DEVELOPING = True    
DEVELOPING = False

if DEVELOPING:
    print('Small development version')
    BATCH_SIZE = 4
    EMBEDDING_SIZE = 20
    MAX_VOCAB_SIZE = 5000
    TRAIN_DATA_SET = "lm-train-small.txt"
    DEV_DATA_SET = "lm-dev-small.txt"
    TEST_DATA_SET = "lm-test-small.txt"
    BPTT_LENGTH = 8
else:
    print('Full version')
    BATCH_SIZE = 32
    EMBEDDING_SIZE = 650
    MAX_VOCAB_SIZE = 50000
    TRAIN_DATA_SET = "lm-train.txt"
    DEV_DATA_SET = "lm-dev.txt"
    TEST_DATA_SET = "lm-test.txt"
    BPTT_LENGTH = 32

# For uploading data to Colab see, e.g., 
# https://medium.com/@philipplies/transferring-data-from-google-drive-to-google-cloud-storage-using-google-colab-96e088a8c041    
#COLAB = False
COLAB = True
if COLAB:
    from google.colab import drive 
    drive.mount('/content/gdrive')
    PATH = "gdrive/My Drive/"
else:
    PATH = "/Users/amitabh/mlpp22/Homework/hw4/"
    
    
LOG_FILE = "language-model.log"

Using cuda.
Full version
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Preprocessing using the legacy component of TorchText

For our preprocessing we'll use the legacy component in TorchText version 0.11.2.  [Documentation](https://torchtext.readthedocs.io/en/latest/index.html) for this legacy component torchtext is relatively sparse (and, unfortunately, not very clear), but [Ben Trevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb) has a useful tutorial.  (If you are keen to understand this component, you may also want to look at the [source code](https://github.com/pytorch/text/tree/master/torchtext/legacy).)

All the **legacy torchtext code is already provided**. 


In [78]:
TEXT = torchtext.legacy.data.Field(lower=True)

train, val, test = torchtext.legacy.datasets.LanguageModelingDataset.splits(path=PATH, 
    train=TRAIN_DATA_SET, validation=DEV_DATA_SET, test=TEST_DATA_SET, text_field=TEXT)

TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
VOCAB_SIZE = len(TEXT.vocab)

print(f'Vocabulary size: {VOCAB_SIZE}')

train_iter, val_iter, test_iter = torchtext.legacy.data.BPTTIterator.splits(
    (train, val, test), batch_size=BATCH_SIZE, device=DEVICE, bptt_len=BPTT_LENGTH, 
    repeat=False)


Vocabulary size: 50002


### Back propagation through time (BPTT) iterator

The [BPTTIterator](https://torchtext.readthedocs.io/en/latest/data.html#bpttiterator) is a custom torchtext iterator for language modeling using RNNs.  Suppose the text in an example is "the quick brown fox".  The target in the transducer-based RNN language model would then be "quick brown fox jumps".  This allows every prefix of the text to be used as an training example, with the corresponding word in the target text as the target word.  So the above would lead to four examples, written as text sequence -> target word:
* "the" -> "quick"
* "the quick" -> "brown"
* "the quick brown" -> "fox"
* "the quick brown fox" -> "jump"

(Unlike some of the examples in class, here we treat words as part of a sequence without special consideration for sentences.  In particular, we don't use start/end of setence tags.)

One very **significant feature** of the BPTTIterator is that examples continue across batches.  To illustrate let the original data be one long seqence $w_1, w_2, \ldots, w_N$, in which, say, $N = 4,000$.  Further let each batch consist of $4$ examples, each of length 8.  Then the first batch created by BPTTIterator would be the following 4 examples---

- $(w_1, w_2, \ldots, w_{8}), (w_{1001}, w_{1002}, \ldots, w_{1008}), \ldots, (w_{3001}, w_{3002}, \ldots, w_{3008}).$ 

and the second batch would be---

- $(w_{9}, w_{10}, \ldots, w_{16}), (w_{1009}, w_{1010}, \ldots, w_{1016}), \ldots, (w_{3009}, w_{3010}, \ldots, w_{3016}).$

This has implications on how the hidden state of the RNN is set for the second batch onwards.

In [79]:
it = iter(train_iter)
batch = next(it)
print("The first three text/target sequences from the first batch are:\n")
indent = " " * 4
for j in range(3):
    print(indent, f"Text Sequence {j}:", 
          " ".join([TEXT.vocab.itos[i] for i in batch.text[:,j].data]))
    print(indent, f"Target Sequence {j}:",
          " ".join([TEXT.vocab.itos[i] for i in batch.target[:,j].data]))
    print()
 
print(f"Each sequence has BPTT_LENGTH = {BPTT_LENGTH}.\n")
print("Also the sequences continue in the next batch!\n")
batch = next(it)
for j in range(3):
    print(indent, f"Text Sequence {j}:", 
          " ".join([TEXT.vocab.itos[i] for i in batch.text[:,j].data]))
    print(indent, f"Target Sequence {j}:",
          " ".join([TEXT.vocab.itos[i] for i in batch.target[:,j].data]))
    print()

The first three text/target sequences from the first batch are:

     Text Sequence 0: anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term
     Target Sequence 0: originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is

     Text Sequence 1: of natural history albert einstein the albert einstein institution the economist one zero zero years of einstein einstein home distributed computing project searching for gravitational waves predicted by einstein s theories world
     Target Sequence 1: natural history albert einstein the albert einstein institution the economist one zero zero years of einstein einstein home distributed computing project searching for gravitational waves predicted by einstein s theories world 

#### Initializing hidden vectors from the detached hidden vectors of previous batch

Since sequences continue across batches, for proper training, **the final output hidden vectors in a batch should be used to initialize the hidden vectors for the next batch**.  But care should be taken to detach vectors used for initialization from the computational graph, else gradients would flow "from one batch to the previous" and training would be increasingly slow. 

### Define the model


Our RNN based language model (when using an LSTM) for a language model is as follows:
- Let the input sequence---the *context*---be $w_1, w_2, \ldots, w_n$, and let the target sequence be $w_2, \ldots, w_n, w_{n+1}$.
- At step $i$ of the input, for $1 \leq i \leq n$:
    - $x_i = E_{[w_i]}$.
    - $y_i, (h_i, c_i) = \text{LSTM}(x_i, (h_{i-1}, c_{i-1}))$.  For LSTMs, $y_i$ equals $h_i$.
    - $\widehat{y_i} = \text{softmax}(y_i W + b)$ in which $\widehat{y_i}$ is the predicted probability distribution for $w_{i+1}$.
    - In the above 
        - $x_i$ is $1 \times \text{embedding dim}$ 
        - $y_i$, $h_i$ and $c_i$ are $1 \times \text{hidden dim}$
        - $\widehat{y}_i$ is $1 \times \text{vocab size}$.
- The loss $\ell = \sum_{i=1}^n \log \widehat{y}_{i_{[w_{i+1}]}}$, in which $\log \widehat{y}_{i_{[w_{i+1}]}}$ is the component of $\log \widehat{y}_{i}$ corresponding to the element $w_{i+1}$.

Since the sequences continue across batches we retain the hidden states across batches. Specifically, consider the $k$th example in batch $j$.  For $j=1$, i.e., first batch, the corresponding $(h_0, c_0)$ for the $k$th example is set to all zeros.  But for $j > 1$, the corresponding $(h_0, c_0)$ is set to $(h_{n}, c_{n})$ of the $k$th example in batch $j-1$.

In PyTorch we do not call the forward function separately for each step $i$.  Instead we call the model with

- tensors corresponding to $(w_1, w_2, \ldots, w_n)$ and $(h_0, c_0)$

and receive as ouput

- $(y_1, y_2, \ldots, y_n)$ and $(h_n, c_n)$.

Further the above is combined for several examples into one batch.  Please read the PyTorch documentation to learn more about building models
            with RNNs.  E.g., see the documentation on [LSTMs](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) and [Robert Gutherie's Tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py) on working with LSTMs.
            
The above can be adapted easily to [GRUs](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) or [Simple RNNs](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) since the PyTorch interface is very similar.

**Task 1** [20 points]: Complete the code for the class `RNNLM` based on the description above.  (Some extra parameters are provided since in a later task, you'll modify your code to incorporate the following: (i) replace the LSTM with a GRU or a Simple RNN, (ii) increase the number of LSTM layers, and (iii) add dropout.) 

In [130]:
import torch
import torch.nn as nn


class RNNLM(nn.Module):
    """ Container module with an linear encoder/embedding, an RNN module, and a linear decoder.
    """

    def __init__(self, rnn_type, vocab_size, embedding_dim, hidden_dim, num_layers, 
                 dropout=0.5):
        ''' Initialize model parameters corresponding to ---
            - embedding layer
            - recurrent neural network layer---one of LSTM, GRU, or RNN---with 
              optionally more than one layer
            - linear layer to map from hidden vector to the vocabulary
            - optionally, dropout layers.  Dropout layers can be placed after 
              the embedding layer or/and after the RNN layer. Dropout within
              an RNN is only applied when there are two or more num_layers.
            - optionally, initialize the model parameters.
            
            The arguments are:
            
            rnn_type: One of 'LSTM', 'GRU', 'RNN_TANH', 'RNN_RELU'
            vocab_size: size of vocabulary
            embedding_dim: size of an embedding vector
            hidden_dim: size of hidden/state vector in RNN
            num_layers: number of layers in RNN
            dropout: dropout probability.
            
        '''
        super(RNNLM, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if rnn_type == "LSTM":
          self.lstm = nn.LSTM(input_size=embedding_dim,
                            hidden_size=hidden_dim,
                            num_layers = num_layers)
        elif rnn_type == "GRU":
          self.GRU = nn.GRU(input_size=embedding_dim,
                            hidden_size=hidden_dim,
                            num_layers = num_layers)
        elif rnn_type == "RNN_TANH":
          self.RNN_TANH = nn.RNN(input_size=embedding_dim,
                            hidden_size=hidden_dim,
                            num_layers = num_layers,
                            nonlinearity = "tanh")
        elif rnn_type == "RNN_RELU":
          self.RNN_RELU =nn.RNN(input_size=embedding_dim,
                             hidden_size=hidden_dim,
                            num_layers = num_layers,
                            nonlinearity = "relu")
        else:
          assert "Incorrect RNN Type."
        
        self.dropout = nn.Dropout(p = dropout)
        self.decoder = nn.Linear(hidden_dim, vocab_size) #vocab_size?
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.n_layers = num_layers
        self.nhid = hidden_dim

    def forward(self, input, hidden0):
        ''' 
        Run forward propagation for a given minibatch of inputs using
        hidden0 as the initial hidden state.

        In LSTMs hidden0 = (h_0, c_0). 

        The output of the RNN includes the hidden vector hiddenn = (h_n, c_n).
        Return this as well so that it can be used to initialize the next
        batch.
        
        Unlike previous homework sets do not apply softmax or logsoftmax here, since we'll use
        the more efficient CrossEntropyLoss.  See 
        https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html.
        '''
        encoded = self.embedding(input.long())
        encoded = self.dropout(encoded)
        lstm_out, hidden = self.lstm(encoded, hidden0)
        lstm_out = self.dropout(lstm_out)
        decoded = self.decoder(lstm_out)
        decoded = torch.squeeze(decoded)

        
        return decoded.view(BPTT_LENGTH, self.vocab_size, -1), hidden
        
    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, bsz, self.nhid),
            weight.new_zeros(self.n_layers, bsz, self.nhid))
        
 

### Evaluate on a given data set

The function for evaluation is provided below.

In [131]:
def evaluate(model, data):
    '''
    Evaluate the model on the given data.
    '''
  
    model.eval()
    it = iter(data)
    total_count = 0. # Number of target words seen
    total_loss = 0. # Loss over all target words
    with torch.no_grad():
        # No gradients need to be maintained during evaluation
        # There are no hidden tensors for the first batch, and so will default to zeros.
        hidden = None 
        for i, batch in enumerate(it):
            ''' Do the following:
                - Extract the text and target from the batch, and if using CUDA (essentially, using GPUs), place 
                  the tensors on cuda, using a commands such as "text = text.cuda()".  More details are at
                  https://pytorch.org/docs/stable/notes/cuda.html.
                - Pass the hidden state vector from output of previous batch as the initial hidden vector for
                  the current batch. 
                - Call forward propagation to get output and final hidden state vector.
                - Compute the cross entropy loss
                - The loss_fn computes the average loss per target word in the batch.  Count the number of target
                  words in the batch (it is usually the same, except for the last batch), and use it to track the 
                  total count (of target words) and total loss see so far over all batches.
            '''
            text, target = batch.text, batch.target
            if USE_CUDA:
                text, target = text.cuda(), target.cuda()
            if target.shape[0] != 10: #kicking out the last batch
              output, hidden = model(text, hidden)
              loss = loss_fn(output, target)
              total_count += np.multiply(*text.size())
              total_loss += loss.item()*np.multiply(*text.size())
                
    loss = total_loss / total_count
    model.train()
    return loss


### Train the model

Training the model is mostly similar to previous homework sets except for:
- A detached hidden vector is applied to the second batch onwards as described above.
- Every, say, 10,000 iterations evaluate the model on a validation set, and if the mean loss is the lowest so far, save a copy of it.  After training, this "best model" is used for testing. 
      
**Task 2** [30]: Complete the code below for training the model.

In [132]:
def train_model(model, data, i, hidden = None, USE_CUDA = True, print_every = 250):
  start_time = time.time()
  total_loss = 0
  text, target = data.text, data.target
  if USE_CUDA:
      text, target = text.cuda(), target.cuda()
  hidden = repackage_hidden(hidden)
  output, hidden = model(text, hidden)

  if output.shape[0] == target.shape[0]:
    loss = loss_fn(output, target.long())
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
    total_loss += loss.item()
  if (i + 1) % print_every == 0:
      cur_loss = total_loss / print_every
      elapsed = time.time() - start_time
      print(f"Average loss so far: {round(cur_loss, 2)} | Batch {i+1} | Time {round(elapsed, 2)}")
      total_loss = 0
      start_time = time.time()

In [133]:
GRAD_CLIP = 1.
NUM_EPOCHS = 2
LOG_INTERVAL = 100
import os
SAVE_BEST = os.path.join(PATH, 'model.pt')
import time
import math

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if h is None:
        return None
    elif isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

model = RNNLM("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)
if USE_CUDA:
    model = model.cuda()

loss_fn = nn.CrossEntropyLoss() ## Used instead of NLLLoss.
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
val_losses = []
best_model = None
for epoch in range(NUM_EPOCHS):
    print(f""" #### EPOCH {epoch} #####""")
    model.train()
    # There are no hidden tensors for the first batch, and so will default to zeros.
    hidden = None
    min_val_loss = math.inf
    best_model = None
    for i, batch in enumerate(it):
      train_model(model, batch, i)
  
      if (i + 1) % 1000 == 0:
        val_loss = evaluate(model, val_iter)
        
        val_losses.append(val_loss.item())
        min_val_loss = min(min_val_loss, val_loss.item())
        print(f"Average val loss {sum(val_losses) / len(val_losses)}")
        if min_val_loss == val_loss.item():
          with open(SAVE_BEST, 'wb') as f:
            torch.save(model, f)
          print("New best model saved!")

        
       

    ''' Do the following:
      
        - Pass the hidden state vector from output of previous batch as the initial hidden vector for
          the current batch. But detach each tensor in the hidden state vector using tensor.detach() or
          the provided repackage_hidden(). See
          https://pytorch.org/docs/master/generated/torch.Tensor.detach_.html#torch-tensor-detach
        - Zero out the model gradients to reset backpropagation for current batch
        - Call forward propagation to get output and final hidden state vector.
        - Compute the cross entropy loss
        - Run back propagation to set the gradients for each model parameter.
        - Clip the gradients that may have exploded. See Sec 5.2.4 in the Goldberg textbook, and
          https://pytorch.org/docs/master/generated/torch.nn.utils.clip_grad_norm_.html#torch-nn-utils-clip-grad-norm
        - Run a step of gradient descent. 
        - Print the batch loss after every few iterations. (Say every 100 when developing, every 1000 otherwise.)
        - Evaluate your model on the validation set after every, say, 10000 iterations and save it to val_losses. If
          your model has the lowest validation loss so far, copy it to best_model. For that it is recommended that
          copy the state_dict rather than use deepcopy, since the latter doesn't work on Colab.  See discussion at 
          https://discuss.pytorch.org/t/deep-copying-pytorch-modules/13514. This is Early Stopping and is described
          in Sec 2.3.1 of Lecture notes by Cho: 
          https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
    '''
    
        
        

 #### EPOCH 0 #####
 #### EPOCH 1 #####


In [134]:
with open(SAVE_BEST, 'rb') as f:
    best_model = torch.load(f)
    # After loading the RNN params, they are not a continuous chunk of memory.
    # flatten_paramters() makes them a continuous chunk, and will speed up the forward pass.
    # Currently, only RNN model supports flatten_parameters function.
    best_model.lstm.flatten_parameters()


In [135]:
'''
Evaluate the loss of best_model on the validation set and compute its perplexity.
'''
## load best model
with open(SAVE_BEST, 'rb') as f:
    best_model = torch.load(f)
    # After loading the RNN params, they are not a continuous chunk of memory.
    # flatten_paramters() makes them a continuous chunk, and will speed up the forward pass.
    # Currently, only RNN model supports flatten_parameters function.
    best_model.lstm.flatten_parameters()

best_model_set = RNNLM("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)


val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))

perplexity:  16812.515798108216


### Use the best model to evaluate the test dataset. 

We expect a test perplexity of less than 250 on the full model after a couple of epochs.

In [136]:
'''
Evaluate the loss of best_model on the test set and compute its perplexity.
'''
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))

perplexity:  16190.229473819982


### Use the model to generate some sentences

**Task 3** [20]: Write code to generate random sentences.  Section 9.5 in the Goldberg textbook describes how this can be done.  Since we don't have a start symbol, for the first word simply pick a random word from the vocabulary.

You'll notice that the full sequences don't make much sense, but subsequences sound reasonably correct. 

In [189]:
'''
Use the model to generate 5 random sequences of length 50 each.
'''
def generate_text(sampling_func, model = best_model):
    # # Generation with LSTM lm given a sampling function and a prompt
    prompt = random.choice(TEXT.vocab.itos)
    id_word = TEXT.vocab.itos.index(prompt)
    max_length = 50
    hidden = model.init_hidden(1)
    hidden = repackage_hidden(hidden)
    with torch.no_grad():  # no tracking history
        input = torch.tensor(id_word).to("cuda")
        output, hidden = model(torch.tensor([[id_word] * 32]).to("cuda"), None)
        word_prob = torch.nn.functional.softmax(output[-1,:], dim=0).cpu()
        generations = []
        for i in range(max_length):
            word_idx = sampling_func(word_prob)
            word = TEXT.vocab.itos[word_idx]
            generations.append(word)
            if word == "<eos>":
                break
            new_word = torch.LongTensor([[word_idx] * 32]).to("cuda")
            output, hidden = model(new_word, None)
            word_prob = torch.nn.functional.softmax(output[-1,:], dim=0).cpu()
    return generations

def topk_sampling_5(word_prob):
    k = 10
    topk = torch.topk(word_prob.flatten(), k)
    values = topk.values / topk.values.sum()
    indices = topk.indices
    index = torch.multinomial(values, 1).item()
    word_id = list(indices)[index]
    return word_id

In [190]:
for i in range(5):
  generations = generate_text(topk_sampling_5) # replace sample_func with the sampling function that you would like to try
  print('prompt: ' + " ".join(generations))

prompt: would people so so so people may called if th called often after into may after if however would may if called people would so may i if if if war would so so so only if people after people will would will south i if i d so however
prompt: known so through so i may if i so however b called during will so people may may may so however b hermitian b article called people would would through i may would b may will people called b asparagine system people d i so d where would may b
prompt: d article called b nl after called would b vitrification b scale so may called d however may however will however may would would called d see known if may if would during b aau if would often b states will so called called however after may after i b
prompt: however may would south would united will may people would called if called so b called people so so so so over if may would i people b external if so will will than people may after so would would called i so called only people would so peop

### Choose the best sentence from alternatives

Generating random sentences as above is, however, not the objective of a language model.  Rather it is used as an auxiliary tool to choose the best sequence given some choices by comparing their perplexities.

**Task 4** [5]: Use the code below to compute perplexities of the given six sentences.  Discuss the model's performance in choosing the best alternative.  (The code uses TorchText functions which are designed for much larger datasets.  So the perplexities below are approximate. Even so they illustrate the usefullness of our model.)

In [196]:
sen1 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is " 
"crushing "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen2 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is "
"dancing "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen3 = ("Early in the pandemic, there was hope that the world would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is " 
"run "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen4 = ("Early in the pandemic, there was hope that the "
"cat "
" would one day achieve herd immunity, "
"the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is"
"run "
"India with a fearsome second wave and surging in countries from Asia to Latin America.")

sen5 = sen1.split()
random.shuffle(sen5)
sen5 = " ".join(sen5)

sen6 = " ".join(['Early in the pandemic']*8)

sen_list = [sen1, sen2, sen3, sen4, sen5, sen6]

for sen in sen_list:

    print(sen)
    with open(PATH + "temp_sentence.txt", 'w') as text_file:
        print(sen, file = text_file)

    temp_ds = torchtext.legacy.datasets.LanguageModelingDataset(path=PATH + 'temp_sentence.txt', 
                                                                text_field=TEXT)


    sen_iter = torchtext.legacy.data.BPTTIterator(temp_ds, batch_size=BATCH_SIZE, device=DEVICE, 
                                                  bptt_len=BPTT_LENGTH, repeat=False)
        
    sen_loss = evaluate(best_model, sen_iter)
    print("perplexity: ", np.exp(sen_loss))
    print()


Early in the pandemic, there was hope that the world would one day achieve herd immunity, the point when the coronavirus lacks hosts to spread easily. But over a year later, the virus is crushing India with a fearsome second wave and surging in countries from Asia to Latin America.


ValueError: ignored

sen_iter

In [None]:
### Extensions

**Task 5** [25]: Extend your model to incorporate the following options: (i) substitute the LSTM with a GRU or a Simple RNN, (ii) increase the number of LSTM layers, (iii) add dropout, (iv) add gradient clipping.  Report on the combination of these options which gives the best performance.