 The following Python libraries are required for this part, and have been tested on Python 3.9 and Python 3.7.
 If you use Google Colab, PyTorch is already installed.
  - [PyTorch](https://pytorch.org/get-started/locally/) (tested with 1.10)

## Data

In [2]:
# You may prefer to upload the data to your google drive and mount your google drive to this colab, 
# because the data will be erased if you stop using this colab for a while.
# Uncomment the code below to do so. After mounting, navigate to the appropriate folder, right click, and "copy path".
# Assign DATA_DIR global variable to that path.
# For more mounting instructions: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=XDg9OBaYqRMd
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
# If imported from google drive, config for your file directory. Mine is 'lm_data'.
DATA_DIR = "/content/drive/MyDrive/nlp/a3/lm_data"

# the goal is that DATA_DIR points to where the training/validation/test data is. 

In [4]:
import os
from io import open
import torch
import math
import torch.nn as nn
import time
import numpy as np

In [5]:
SEED = 0
TRAIN_BATCH_SIZE = 100
TEST_BATCH_SIZE = 100
WORD_EMBED_DIM = 200
HID_EMBED_DIM = 200 
N_LAYERS = 2 
DROPOUT = 0.5 
LOG_INTERVAL = 100
EPOCHS = 10
BPTT = 50 # sequence length
CLIP = 0.25
TIED = False
SAVE_BEST = os.path.join(DATA_DIR, 'model.pt')

## Build vocabulary and convert text in corpus to lists of word index

In [6]:
class WordDict(object):
    def __init__(self):
        # mapping between word type to its index
        self.word2idx = {'<sos>': 0, '<eos>': 1}
        # mapping between index to word type
        self.idx2word = ['<sos>', '<eos>']

    def add_word(self, word):
        # TODO: add word to the dictionary by updating both word2idx and idx2word
        if word not in self.word2idx:
            self.word2idx[word] = len(self.word2idx)
            self.idx2word.append(word)

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.train_file = os.path.join(path, 'train.txt')
        self.valid_file = os.path.join(path, 'valid.txt')
        self.test_file = os.path.join(path, 'test.txt')

        self.dictionary = WordDict() 

        self.train = self.tokenize(self.train_file)
        self.valid = self.tokenize(self.valid_file)
        self.test = self.tokenize(self.test_file)
                                   
    def tokenize(self, filename):
        ################################
        ## TODO: 
        ## (1) build vocabulary on three given files, using class WordDict 
        ## (2) tokenize each file content with the vocabulary, return a list of token ids
        ## Note that in this implementation, we add words in validation and test file into the vocabulary,
        ## so there is no unknown word.
        ################################ 
        
        ids = []
        with open(filename) as f:
            for line in f: 
                if line.strip() != '': 
                    tokens = line.lower().split()
                    for token in tokens:
                        self.dictionary.add_word(token)
                    tokens.append('<eos>')
                    tokens.insert(0, '<sos>')
                    ids += [self.dictionary.word2idx[token] for token in tokens]

        return ids

corpus = Corpus(DATA_DIR)
print(len(corpus.train))
print(len(corpus.valid))
print(len(corpus.test))
print(len(corpus.dictionary))

2099444
218808
246993
28913


In [7]:
def batchify(ids, batch_size):
    """
    batchify arranges the dataset into columns.
    # Parameters
    data : Tensor
        1-dimensional tensor of token ids
    batch_size : Int
        batch_size
    # Returns
    data: a torch.LongTensor with shape of (batch_size, len(ids)//batch_size)
        batchified corpus data

    For example, the input ids [1,2,3,4,5,6,7,8,9] and batch_size=2
    output is:
    [ [1, 5],
      [2, 6],
      [3, 7],
      [4, 8] ]
    The shape of the tensor is 4x2. 
    We trim off any extra elements (9 in this example) that wouldn't cleanly fit.
    ***Again, note that the text order is in the column.***
    """ 
    ########
    # TODO #
    ########

    dimension = len(ids) // batch_size
    id_t = ids[:dimension * batch_size]
    numpy_ids = np.array(id_t).reshape(batch_size, dimension)
    
    return torch.permute(torch.Tensor(numpy_ids), (1, 0))    

train_data = batchify(corpus.train, TRAIN_BATCH_SIZE)
val_data = batchify(corpus.valid, TEST_BATCH_SIZE)
test_data = batchify(corpus.test, TEST_BATCH_SIZE)

print(train_data.shape)
print(val_data.shape)
print(test_data.shape)

torch.Size([20994, 100])
torch.Size([2188, 100])
torch.Size([2469, 100])


In [9]:
def get_batch(source, i):
    """
    # Parameters
    source : Tensor
        corpus as 2-dimensional tensor
    i : Int
        minibatch index

    # Returns
    data : 2D tensor 
        LSTM input
    target : 1D tensor
        LSTM output target

    Consider the following example where "source" is a 2d tensor of shape (4, 2).
    In this example we have 4 batches, each of size 2.
    [ [1, 5],
      [2, 6],
      [3, 7],
      [4, 8] ]

    Suppose we set BPTT (backpropagation through time, see A3 pdf for details) to 2.
    At index i = 0, the input to our LSTM becomes:
    [ [1, 5],
      [2, 6] ]
    This corresponds to the first 2 batches in the sequence.
    The target would correspondingly be (again, since BPTT is 2): 
    [ [2, 6],
      [3, 7] ]
    However, we need to reshape it from 2-dimensions to 1-dimension:
    [2, 6, 3, 7]

    For the next batch, index i = prev_i + BPTT = 2. 
    However, i + BPTT = 2 + 2 = 4 and 4 >= len(source). This wouldn't work.
    So, to account for this edge case, we consider BPTT to be:
    len(source) - 1 - i = 4 - 1 - 2 = 1
    As such, our input now becomes:
    [ [3, 7] ]
    and target is 
    [4, 8]. 
    """ 
    ###################################################
    # TODO Assign these variables to the right values #
    ###################################################

    seq_len = BPTT
    if i + BPTT >= len(source):
        seq_len = len(source) - 1 - i

    data = source[i : i + seq_len]
    target = source[i + 1 : i + seq_len + 1].reshape(-1)
    return data, target

data, targets = get_batch(train_data, 50)
print(data)
print(targets)

tensor([[3.8000e+01, 1.4000e+01, 6.2000e+01,  ..., 1.7113e+04, 2.0000e+00,
         5.4070e+03],
        [3.9000e+01, 3.5380e+03, 1.0000e+01,  ..., 2.6070e+03, 1.8169e+04,
         3.8000e+01],
        [4.0000e+01, 1.8620e+03, 1.8000e+01,  ..., 3.6000e+01, 1.2759e+04,
         1.4018e+04],
        ...,
        [3.8000e+01, 3.4900e+03, 1.8000e+01,  ..., 1.6000e+01, 5.5000e+01,
         6.2570e+03],
        [6.1000e+01, 2.7500e+02, 5.5460e+03,  ..., 4.7200e+02, 1.5633e+04,
         4.6910e+03],
        [1.8000e+01, 1.0000e+01, 6.2000e+01,  ..., 6.0000e+01, 8.5430e+03,
         1.7000e+01]])
tensor([  39., 3538.,   10.,  ..., 1938.,   38., 1654.])


In [10]:
################################
## TODO: Implement RNN LSTM
## documentation of pytorch LSTM interface: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
################################

class LSTMModel(nn.Module):

    def __init__(self, vocab_size, word_embedding_size, nhid, nlayers, dropout=0.5, tied_weights=False):
        super(LSTMModel, self).__init__()

        self.nhid = nhid # hidden dimension of LSTM
        self.nlayers = nlayers # number of LSTM layers
        # TODO: initialize the required modules for the LSTM model
        # HINT: batch_first should be False in LSTM since our data structure is not batch first.
        self.vocab_size = vocab_size
        self.word_embedding_size = word_embedding_size
        self.encoder = nn.Embedding(vocab_size, word_embedding_size)
        self.lstm = nn.LSTM(word_embedding_size, nhid, nlayers, batch_first=False)
        # self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(dropout)
        self.decoder = nn.Linear(nhid, vocab_size)

        self.init_weights()

    def init_weights(self):
        """
        For example:
        # initrange = 0.1
        # nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        This is not all that you need!
        """
        # TODO: initialize the parameters
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)
        nn.init.uniform_(self.decoder.bias, -initrange, initrange)

    def forward(self, input, hidden):
        """
        # Parameters
        input: input embedding
        hidden: hidden states in LSTM
        # Returns
        decoded: refers to the output of decoder layer over the vocabulary. Note that you don't need to pass it through the softmax layer
        hidden: stores the hidden states in LSTM
        """
        # TODO
        embeddings = self.encoder(input.long())
        embeddings = self.dropout(embeddings)

        lstm_out, hidden = self.lstm(embeddings, hidden)
        lstm_out = self.dropout(lstm_out)

        decoded = self.decoder(lstm_out.view(-1, self.word_embedding_size))

        return decoded, hidden

    # initialize parameters in LSTM
    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, bsz, self.nhid),
            weight.new_zeros(self.nlayers, bsz, self.nhid))

In [11]:
# Set the random seed for reproducibility.
torch.manual_seed(SEED)
# set device as GPU/CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [37]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

def train():
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(TRAIN_BATCH_SIZE)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, BPTT)):
        data, targets = get_batch(train_data, i)
        data = data.to(device)
        targets = targets.to(device)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        optimizer.zero_grad()
        hidden = repackage_hidden(hidden) # Note that the main advantage here is that the hidden value is continual from the previous forward pass
        output, hidden = model(data, hidden)
        loss = criterion(output, targets.to(torch.int64))
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        optimizer.step()

        total_loss += loss.item()

        if batch % LOG_INTERVAL == 0 and batch > 0:
            print("LOG INTERVAL", LOG_INTERVAL)
            cur_loss = total_loss / LOG_INTERVAL
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // BPTT,
                elapsed * 1000 / LOG_INTERVAL, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

In [48]:
# TODO: Compute the loss of model on data_source
def evaluate(data_source):
    model.eval()
    total_loss = 0.
    # TODO: get the average negative log likelihood on the data_source
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(TEST_BATCH_SIZE)
    for batch, i in enumerate(range(0, data_source.size(0) - 1, BPTT)):
        data, targets = get_batch(data_source, i)
        data = data.to(device)
        targets = targets.to(device)
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)

        loss = criterion(output, targets.to(torch.int64))

        total_loss += loss.item()

    return total_loss / (batch + 1)

In [49]:
# prepare the model, loss, and optimizer
ntokens = len(corpus.dictionary)
model = LSTMModel(ntokens, WORD_EMBED_DIM, HID_EMBED_DIM, N_LAYERS, DROPOUT, TIED).to(device)
criterion = nn.CrossEntropyLoss() # use crossentropy loss
optimizer = torch.optim.Adam(model.parameters()) # use adam optimizer with default setting
best_val_loss = None

# Training framework
for epoch in range(1, EPOCHS+1):
    epoch_start_time = time.time()
    train()
    val_loss = evaluate(val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
        'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
        val_loss, math.exp(val_loss)))
    print('-' * 89)
    
    # Save the model if the validation loss is the best we've seen so far.
    if not best_val_loss or val_loss < best_val_loss:
        with open(SAVE_BEST, 'wb') as f:
            torch.save(model, f)
            print("save new best model!")
        best_val_loss = val_loss

LOG INTERVAL 100
| epoch   1 |   100/  419 batches | ms/batch 57.05 | loss  7.62 | ppl  2028.70
LOG INTERVAL 100
| epoch   1 |   200/  419 batches | ms/batch 56.57 | loss  6.82 | ppl   914.18
LOG INTERVAL 100
| epoch   1 |   300/  419 batches | ms/batch 56.36 | loss  6.56 | ppl   705.75
LOG INTERVAL 100
| epoch   1 |   400/  419 batches | ms/batch 56.34 | loss  6.43 | ppl   619.84
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 24.53s | valid loss  6.06 | valid ppl   429.52
-----------------------------------------------------------------------------------------
save new best model!
LOG INTERVAL 100
| epoch   2 |   100/  419 batches | ms/batch 56.85 | loss  6.37 | ppl   586.80
LOG INTERVAL 100
| epoch   2 |   200/  419 batches | ms/batch 56.28 | loss  6.23 | ppl   506.15
LOG INTERVAL 100
| epoch   2 |   300/  419 batches | ms/batch 56.29 | loss  6.16 | ppl   474.77
LOG INTERVAL 100
| epoch   2 |   400/  419 batches | 

In [51]:
# Load the best saved model.
with open(SAVE_BEST, 'rb') as f:
    model = torch.load(f)
    # After loading the RNN params, they are not a continuous chunk of memory.
    # flatten_paramters() makes them a continuous chunk, and will speed up the forward pass.
    # Currently, only RNN model supports flatten_parameters function.
    model.lstm.flatten_parameters()

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.10 | test ppl   163.91


In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [5]:
# Generation with GPT-2
# Check this tutorial: https://huggingface.co/blog/how-to-generate
# It comes with a notebook. You need to run through that notebook and understand different sampling procedures.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/523M [00:00<?, ?B/s]

### Greedy Search

In [15]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I went to', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to the hospital and was told that I was going to be taken to the hospital. I was told that I was going to be taken to the hospital and that I was going to be taken to the hospital. I was told that I was


### Beam Search

In [16]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to the doctor and said, 'I don't know what's going on. I don't know what's going on. I don't know what's going on. I don't know what's going on. I don't know what


In [17]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to the doctor and said, 'I don't know what's wrong with you, but I'm going to take care of you.'"

The doctor said he had no idea what was wrong. "I didn't want to do anything


### Sampling

In [18]:
# RANDOM W/O TEMPERATURE
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to Spain to learn Spanish.

How did you know Kristina was taking a class on birth control?

We sat down. Ship parents talked about history and child development. We were set up to be indicative of people who didn


In [19]:
# WITH TEMPERATURE
# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to work, and one of my friends came into the office with a very wide towel. She said, "You've got to go for your baby." I said, "No." She came out and gave me the towel. Then she


In [20]:
# TOP-K SAMPLING
# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to college at New York University, and I spent my early-50s living in L.A. for two years to try to find work. I did some internships, and then I got married. I remember doing all of them


In [21]:
# TOP-P
# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I went to a colleague at Stanford who was critical of our pace-fixing process. He was very vague about the technology. We are not cracking it yet," said Graham Watters, a Stanford Admissions professor who now has a Ph.D


In [52]:
def generate_text(prompt, sampling_func):
    # # Generation with LSTM lm given a sampling function and a prompt
    max_length = 30
    ids = []
    for word in prompt.split():
        ids.append(corpus.dictionary.word2idx[word])
    hidden = model.init_hidden(1)
    with torch.no_grad():  # no tracking history
        output, hidden = model(torch.LongTensor([[wid] for wid in ids]).to(device), hidden)
        word_prob = torch.nn.functional.softmax(output[-1,:], dim=0).cpu()
        generations = []
        for i in range(max_length):
            word_idx = sampling_func(word_prob)
            word = corpus.dictionary.idx2word[word_idx]
            generations.append(word)
            if word == "<eos>":
                break
            new_word = torch.LongTensor([[word_idx]]).to(device)
            output, hidden = model(new_word, hidden)
            word_prob = torch.nn.functional.softmax(output[-1,:], dim=0).cpu()
    return generations

In [53]:
import random
def greedy_sampling(word_prob):
    # TODO: return the word with the max probability
    word_id = torch.argmax(word_prob)
    return word_id

def random_sampling(word_prob):
    # TODO: sample a random word based on the probability vector
    word_id = random.randint(0, len(word_prob) - 1)
    return word_id

def topk_sampling(word_prob):
    # TODO: top k sampling as explained in the assignment
    indices = torch.sort(word_prob, descending = True)[:50][1]
    index = random.randint(0, k - 1)
    return indices[index]

In [54]:
prompt = "i went to".lower()
generations = generate_text(prompt, greedy_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: i went to
the <unk> " . <eos>


In [55]:
generations = generate_text(prompt, random_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: i went to
chagas exhaled transfers citizens palm pune conspirators bhai facial charity emigrated recruit imprison munro dispatched jeremi dubious writ träumerei rested antonescu northampton collision database stars shoemaker same exposes memoir boy


In [60]:
k = 10
generations = generate_text(prompt, topk_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: i went to
each the most story he made what at example he makes many back through a group in your . [ the female have an only and two characters like him


In [59]:
k = 20
generations = generate_text(prompt, topk_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: i went to
each years before they get off what we 'm actually get , because i am ' we may come against as for only being well , an few ways were


In [58]:
k = 50
generations = generate_text(prompt, topk_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: i went to
that another of 20 knots may and get back all . although even in may 28 august 1940 after much the previous episode has only know by an least high


In [61]:
prompt = 'today at school'
generations = generate_text(prompt, random_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: today at school
bed republican 1967 exacerbated outlaw cheke der harper trees genetics guthrie lawlessness nandi profitable ace curse santa gunshot inflation stylus fitzwarin amenable chests whoever sags pretend regionally exploration shows electronica


In [62]:
prompt = 'it seems like'
generations = generate_text(prompt, random_sampling) # replace sample_func with the sampling function that you would like to try
print('prompt: ' + prompt)
print(' '.join(generations))

prompt: it seems like
longitudinal curd delegates broods poke preparatory icp gabonica katzenjammer token indoctrination composition 28th lookouts bats ignorant rush variously doo tragic tripoli rajamouli forest merengue real airlines sensible sabo gather slovenes
