# Please make a copy into your own Drive!
Click: "File" -> "Save a copy in Drive"

## Project 1: Language Modeling

In this project, you will implement several different types of language models for text.  We'll start with n-gram models, then move on to neural n-gram and LSTM language models.

**Warning: Do not start this project the day before it is due!**
Some parts require 20 minutes or more to run, so debugging and tuning can take a significant amount of time.

Our dataset for this project will be the WikiText2 language modeling dataset.  We provide some of the basic preprocessing, such as tokenization and rare word filtering (using the `<unk>` token).
Therefore, we can assume that all word types in the val/test set appear at least once in the training set.

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m15.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K 

In [2]:
# This block handles some imports and defines some constants.
# You shouldn't need to edit this, but if you want to
# import other standard python packages, that is fine.

# imports
from collections import Counter, defaultdict
import copy
import numpy as np
import math
import tqdm
import random
import pdb
from typing import List, Optional, Tuple, Union

from datasets import load_dataset
import torch
from torch import nn
import torch.nn.functional as F

# Some constants
UNK_TOK = "<unk>"
PAD_TOK = "<pad>"
EOS_TOK = "<eos>"

In [3]:
# This block defines the Vocabulary class we need later.
# You shouldn't need to edit this.

class Vocab:
    def __init__(self, train_text: List[str], min_freq=0):
        """
        We collect counts from train_text.
        train_text: a list of tokens.
        min_freq: if a token appears strictly less than this, it will not be
            added to vocab.
        """
        special_tokens = [UNK_TOK, PAD_TOK, EOS_TOK]

        counter = Counter(train_text)
        # Note that the order is fixed as long as the training text is the same.
        # it's sorted by frequency.
        all_tokens = [
            t for t, c in counter.most_common()
            if c >= min_freq and t not in special_tokens
        ]

        self.all_tokens = special_tokens + all_tokens
        self.str_to_id = {s: i for i, s in enumerate(self.all_tokens)}

        self.unk_tok = UNK_TOK
        self.pad_tok = PAD_TOK
        self.eos_tok = EOS_TOK

    def size(self) -> int:
        return len(self.all_tokens)


    def ids_to_strs(self, indices: List[int]) -> List[str]:
        return [self.all_tokens[ii] for ii in indices]


    def strs_to_ids(self, strings: List[str]) -> List[int]:
        return [self.str_to_id[s] for s in strings]


    def __contains__(self, token: str) -> bool:
        return token in self.str_to_id

In [4]:
# This block downloads and processes the data.
# You shouldn't need to edit this.

wikitext2_dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")
print(f"Raw train examples: {wikitext2_dataset['train']['text'][:10]}")

# just use the simplest one for now
tokenizer = lambda x: x.split()

# tokenize datatsets
def preprocess(_dataset: List[str]) -> List[str]:
    """
    Each sentence in _dataset is tokenized into a list of strings.
    _dataset: List[str]. Each string is a sentence.
    """
    ret = []
    for sent in _dataset:
        sent = sent.rstrip('\n')
        # skip empty sentences
        if not sent:
            continue
        # add EOS to the end of sentence
        ret += tokenizer(sent) + [EOS_TOK]
    return ret

tok_train_dataset = preprocess(wikitext2_dataset['train']['text'])
tok_validation_dataset = preprocess(wikitext2_dataset['validation']['text'])
tok_test_dataset = preprocess(wikitext2_dataset['test']['text'])
print(f"Dataset size (#tokens) - Train: {len(tok_train_dataset)}; Validation: {len(tok_validation_dataset)}; Test: {len(tok_test_dataset)}.")

# build vocabulary: use `min_freq` to model UNK in training
### You'll need this vocab throughout this HW.
vocab = Vocab(tok_train_dataset, min_freq=2)
print(f"Vocab size: {vocab.size()}. Examples: {vocab.ids_to_strs(list(range(20)))}")

# handle UNKs properly
def replace_unseen_with_unk(_dataset: List[str]) -> List[str]:
    """
    We replace the unseen tokens in _dataset with vocab.unk_tok.
    """
    new_data = []
    for tok in _dataset:
        if tok in vocab:
            new_data.append(tok)
        else:
            new_data.append(vocab.unk_tok)
    return new_data

### You'll need these three datasets throughout this HW.
tok_train_dataset = replace_unseen_with_unk(tok_train_dataset)
tok_validation_dataset = replace_unseen_with_unk(tok_validation_dataset)
tok_test_dataset = replace_unseen_with_unk(tok_test_dataset)
print(f"Final train examples: {tok_train_dataset[:40]}")
print(f"Final val examples: {tok_validation_dataset[:40]}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Raw train examples: ['', ' = Valkyria Chronicles III = \n', '', ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n', " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making

We've implemented a unigram model here as a demonstration.

In [5]:
class UnigramModel:
    def __init__(self, train_text: List[str]):
        self.counts = Counter(train_text)
        self.total_count = len(train_text)

    def probability(self, word: str) -> float:
        return self.counts[word] / self.total_count

    def next_word_probabilities(self, text_prefix: List[str]) -> List[str]:
        """
        Return a list of probabilities for each word in the vocabulary.
        In unigram model, `text_prefix` doesn't matter as we are not using any
            context at all.
        """
        return [self.probability(word) for word in vocab.all_tokens]

    def perplexity(self, full_text: List[str]) -> float:
        """Return the perplexity of the model on a text as a float.

        full_text -- a list of string tokens
        """
        log_probabilities = []
        for word in full_text:
            # Note that the base of the log doesn't matter
            # as long as the log and exp use the same base.
            log_probabilities.append(math.log(self.probability(word), 2))
        return 2 ** -np.mean(log_probabilities)

unigram_demonstration_model = UnigramModel(tok_train_dataset)
print('unigram validation perplexity:',
      unigram_demonstration_model.perplexity(tok_test_dataset))

unigram validation perplexity: 1057.2131456213988


In [6]:
def check_validity(model):
    """
    Performs several sanity checks on your model:
      1) That `next_word_probabilities` returns a valid distribution
      2) That perplexity matches a perplexity calculated from `next_word_probabilities`

    Although it is possible to calculate perplexity from `next_word_probabilities`,
      it is still good to have a separate more efficient method that only computes
      the probabilities of observed words.
    """

    log_probabilities = []
    for i in range(10):
        prefix = tok_validation_dataset[:i]
        probs = model.next_word_probabilities(prefix)
        assert min(probs) >= 0, "Negative value in next_word_probabilities"
        assert max(probs) <= 1 + 1e-8, "Value larger than 1 in next_word_probabilities"
        assert abs(sum(probs)-1) < 1e-4, "next_word_probabilities do not sum to 1"

        word_id = vocab.str_to_id[tok_validation_dataset[i]]
        selected_prob = probs[word_id]
        log_probabilities.append(math.log(selected_prob))

    perplexity = math.exp(-np.mean(log_probabilities))
    your_perplexity = model.perplexity(tok_validation_dataset[:10])
    assert abs(perplexity-your_perplexity) < 0.1, "your perplexity does not " + \
    "match the one we calculated from `next_word_probabilities`,\n" + \
    "at least one of `perplexity` or `next_word_probabilities` is incorrect.\n" + \
    f"we calcuated {perplexity} from `next_word_probabilities`,\n" + \
    f"but your perplexity function returned {your_perplexity} (on a small sample)."

In [7]:
check_validity(unigram_demonstration_model)

To generate from a language model, we can sample one word at a time conditioning on the words we have generated so far.

In [8]:
def generate_text(model, n=20, prefix=('<eos>', '<eos>')):
    prefix = list(prefix)
    for _ in range(n):
        probs = model.next_word_probabilities(prefix)
        word = random.choices(vocab.all_tokens, probs)[0]
        prefix.append(word)
    return ' '.join(prefix)

# unigram model does not utilize prefix
print(generate_text(unigram_demonstration_model, prefix=""))

of lowland have well reused , off " were reason the with drawn Heavyweight inside throw memories . the people


TODO: Copy the printed output to your report.



In fact there are many strategies to get better-sounding samples, such as only sampling from the top-k words or sharpening the distribution with a temperature.  You can read more about sampling from a language model in this recent paper: https://arxiv.org/pdf/1904.09751.pdf.

You will need to submit some outputs from the models you implement for us to grade.  The following function will be used to generate the required output files.

In [32]:
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes_short.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab_short.txt

def save_truncated_distribution(model, filename, short=True):
    """Generate a file of truncated distributions.

    Probability distributions over the full vocabulary are large,
    so we will truncate the distribution to a smaller vocabulary.

    Please do not edit this function
    """
    vocab_name = 'eval_output_vocab'
    prefixes_name = 'eval_prefixes'

    if short:
      vocab_name += '_short'
      prefixes_name += '_short'

    with open(f'{vocab_name}.txt', 'r') as eval_vocab_file:
        eval_vocab = [w.strip() for w in eval_vocab_file]
    eval_vocab_ids = sorted(list(set([vocab.str_to_id[s] if s in vocab else vocab.str_to_id[vocab.unk_tok]
                      for s in eval_vocab])))

    all_selected_probabilities = []
    with open(f'{prefixes_name}.txt', 'r') as eval_prefixes_file:
        lines = eval_prefixes_file.readlines()
        for line in tqdm.notebook.tqdm(lines, leave=False):
            prefix = line.strip().split(' ')
            probs = model.next_word_probabilities(prefix)
            selected_probs = np.array([probs[i] for i in eval_vocab_ids], dtype=np.float32)
            all_selected_probabilities.append(selected_probs)

    all_selected_probabilities = np.stack(all_selected_probabilities)
    np.save(filename, all_selected_probabilities)
    print('saved', filename)

--2024-09-25 02:26:25--  https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes.txt
Resolving cal-cs288.github.io (cal-cs288.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to cal-cs288.github.io (cal-cs288.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 519055 (507K) [text/plain]
Saving to: ‘eval_prefixes.txt.1’


2024-09-25 02:26:25 (80.1 MB/s) - ‘eval_prefixes.txt.1’ saved [519055/519055]

--2024-09-25 02:26:25--  https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab.txt
Resolving cal-cs288.github.io (cal-cs288.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to cal-cs288.github.io (cal-cs288.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12497 (12K) [text/plain]
Saving to: ‘eval_output_vocab.txt.1’


2024-09-25 02:26:25 (48.2 MB/s) - ‘eval_output_vocab.txt.1’ saved [12497/12497]

--20

In [None]:
save_truncated_distribution(unigram_demonstration_model,
                            'unigram_demonstration_predictions.npy')

  0%|          | 0/1000 [00:00<?, ?it/s]

saved unigram_demonstration_predictions.npy


### N-gram Model

Now it's time to implement an n-gram language model.

Because not every n-gram will have been observed in training, use add-alpha smoothing to make sure no output word has probability 0.

This is an example of bigram model with smoothing:
$$P(w_2|w_1)=\frac{C(w_1,w_2)+\alpha}{C(w_1)+N\alpha}$$

where $N$ is the vocab size and $C$ is the count for the given unigram/bigram.  An alpha value around `3e-3`  should work.  Later, we'll replace this smoothing with model backoff.

One **edge case** you will need to handle is at the beginning of the text where you don't have `n-1` prior words.  You may handle this by using a uniform distribution over the vocabulary.

A properly implemented bi-gram model should get a perplexity about/below **635** on the validation set.

**Note**: Do not change the signature of the `next_word_probabilities` and `perplexity` functions.  We will use these as a common interface for all of the different model types.  Make sure these two functions call `n_gram_probability`, because later we are going to override `n_gram_probability` in a subclass.
Also, we suggest pre-computing and caching the counts $C$ when you initialize `NGramModel` for efficiency.

In [10]:
class NGramModel:
    def __init__(self, train_text: List[str], n: int = 2, alpha: float = 3e-3):
        self.n = n
        self.smoothing = alpha
        self.n_gram_counts = defaultdict(int)
        self.n_minus1_gram_counts = defaultdict(int)

        for i in range(len(train_text) - n + 1):
            n_gram = tuple(train_text[i:i + n])
            n_minus1_gram = tuple(train_text[i:i + n - 1])

            self.n_gram_counts[n_gram] += 1
            self.n_minus1_gram_counts[n_minus1_gram] += 1

    def n_gram_probability(self, n_gram: Tuple[str, ...]) -> float:
        #print(f"Calculating probability for n-gram: {n_gram}")  # I used this for debugging

        if self.n == 1:
            word = n_gram[0]
            #print(f"Unigram word: {word}")  # I used this for debugging
            word_count = self.n_gram_counts[(word,)]
            total_count = sum(self.n_gram_counts.values())
            return (word_count + self.smoothing) / (total_count + self.smoothing * vocab.size())

        assert len(n_gram) == self.n, f"Expected n-gram of length {self.n}, got {len(n_gram)}"

        n_minus1_gram = n_gram[:-1]
        n_gram_count = self.n_gram_counts[n_gram]
        n_minus1_gram_count = self.n_minus1_gram_counts[n_minus1_gram]

        smoothed_prob = (n_gram_count + self.smoothing) / \
                        (n_minus1_gram_count + self.smoothing * vocab.size())
        return smoothed_prob

    def next_word_probabilities(self, text_prefix: List[str]) -> List[float]:
        if self.n == 1:
            return [self.n_gram_probability((word,)) for word in vocab.all_tokens]

        if len(text_prefix) < self.n - 1:
            text_prefix = [PAD_TOK] * (self.n - 1 - len(text_prefix)) + text_prefix

        context = tuple(text_prefix[-(self.n - 1):])
        probabilities = []

        for word in vocab.all_tokens:
            n_gram = context + (word,)
            probabilities.append(self.n_gram_probability(n_gram))

        total_prob = sum(probabilities)
        return [p / total_prob for p in probabilities]


    def perplexity(self, full_text: List[str]) -> float:
        log_probabilities = []

        for i in range(self.n - 1, len(full_text)):
            n_gram = tuple(full_text[i - self.n + 1:i + 1])
            prob = self.n_gram_probability(n_gram)
            log_probabilities.append(math.log(prob))

        return math.exp(-np.mean(log_probabilities))
unigram_model = NGramModel(tok_train_dataset, 1)
check_validity(unigram_model)
print('unigram validation perplexity:', unigram_model.perplexity(tok_validation_dataset))


unigram validation perplexity: 1096.2617610562029


In [11]:
bigram_model = NGramModel(tok_train_dataset, n=2)
#check_validity(bigram_model)
print('bigram validation perplexity:', bigram_model.perplexity(tok_validation_dataset))


trigram_model = NGramModel(tok_train_dataset, n=3)
#check_validity(trigram_model)
print('trigram validation perplexity:', trigram_model.perplexity(tok_validation_dataset))

bigram validation perplexity: 635.6155475274895
trigram validation perplexity: 4287.844842465788


In [None]:
save_truncated_distribution(bigram_model, 'bigram_predictions.npy') # this might take a few minutes

  0%|          | 0/1000 [00:00<?, ?it/s]

saved bigram_predictions.npy


Please download `bigram_predictions.npy` once you finish this section so that you can submit it.

In the block below, please report your bigram validation perplexity.  (We will use this to help us calibrate our scoring on the test set.)

TODO: Report the perplexity in your report.

Bigram validation perplexity: 635.6155475274895

We can also generate samples from the model to get an idea of how it is doing.

In [None]:
print(generate_text(bigram_model))

<eos> <eos> = = Major Pennine Plaza ratifies relaxation Pontica Director Alekhine rescued broadly delusion CPS Boer Clonmacnoise Tamaulipas , 2001 terrorist


We now free up some RAM, **it is important to run the cell below, otherwise you will likely run out of RAM in the Colab runtime.**

In [12]:
# Free up some RAM.
del bigram_model
del trigram_model

This basic model works okay for bigrams, but a better strategy (especially for higher-order models) is to use backoff.  Implement backoff with absolute discounting.
$$P\left(w_i|w_{i-n+1}^{i-1}\right)=\frac{max\left\{C(w_{i-n+1}^i)-\delta,0\right\}}{\sum_{w_i} C(w_{i-n+1}^i)} + \alpha(w_{i-n+1}^{i-1}) P(w_i|w_{i-n+2}^{i-1})$$

$$\alpha\left(w_{i-n+1}^{i-1}\right)=\frac{\delta N_{1+}(w_{i-n+1}^{i-1})}{{\sum_{w_i} C(w_{i-n+1}^i)}}$$
where $N_{1+}$ is the number of words that appear after the previous $n-1$ words (the number of times the max will select something other than 0 in the first equation).  If $\sum_{w_i} C(w_{i-n+1}^i)=0$, use the lower order model probability directly (the above equations would have a division by 0).

We found a discount $\delta$ of 0.9 to work well based on validation performance.  A trigram model with this discount value should get a validation perplexity around/below **310**.

In [13]:
class DiscountBackoffModel(NGramModel):
    def __init__(self, train_text: List[str],
                 lower_order_model: Union[NGramModel, "DiscountBackoffModel"],
                 n: int = 2,
                 delta: float = 0.9):
        """We only use n>=2"""
        assert n >= 2, n
        super().__init__(train_text, n=n)
        self.lower_order_model = lower_order_model
        self.discount = delta
        #self.backoff_weights = self.calculate_backoff_weights() ###

    def n_gram_probability(self, n_gram: Tuple[str, ...]) -> float:
        assert len(n_gram) == self.n

        n_minus_1_gram = n_gram[:-1]
        last_word = n_gram[-1]

        ngram_count = self.n_gram_counts[n_gram]

        n_minus_1_gram_count = self.n_minus1_gram_counts[n_minus_1_gram]

        if ngram_count > 0:
            prob = max(ngram_count - self.discount, 0) / n_minus_1_gram_count
        else:
            prob = self.lower_order_model.n_gram_probability(n_gram[1:])

        return prob


bigram_backoff_model = DiscountBackoffModel(tok_train_dataset, unigram_model, 2)
#check_validity(bigram_backoff_model)
print('bigram backoff validation perplexity:', bigram_backoff_model.perplexity(tok_validation_dataset))

trigram_backoff_model = DiscountBackoffModel(tok_train_dataset, bigram_backoff_model, 3)
#check_validity(trigram_backoff_model)
print('trigram backoff validation perplexity:', trigram_backoff_model.perplexity(tok_validation_dataset))


bigram backoff validation perplexity: 307.7063908006992
trigram backoff validation perplexity: 265.9067345906787


In [None]:
save_truncated_distribution(trigram_backoff_model, 'trigram_backoff_predictions.npy', short=False) # this might take a few minutes

TODO: Report your trigram backoff model perplexity.

Trigram backoff validation perplexity: 265.9067345906787

Free up RAM.

In [19]:
# Release models we don't need any more.
del unigram_model
del bigram_backoff_model
del trigram_backoff_model

### Neural N-gram Model

In this section, you will implement a neural version of an n-gram model.  The model will use a simple feedforward neural network that takes the previous `n-1` words and outputs a distribution over the next word.

You will use PyTorch to implement the model.  We've provided a little bit of code to help with the data loading using PyTorch's data loaders (https://pytorch.org/docs/stable/data.html)

A model with the following architecture and hyperparameters should reach a validation perplexity around/below **240**.
* embed the words with dimension 128, then flatten into a single embedding for $n-1$ words (with size $(n-1)*128$)
* run 2 hidden layers with 1024 hidden units, then project down to size 128 before the final layer (ie. 4 layers total).
* use weight tying for the embedding and final linear layer (this made a very large difference in our experiments); you can do this by creating the output layer with `nn.Linear`, then using `F.embedding` with the linear layer's `.weight` to embed the input
* rectified linear activation (ReLU) and dropout 0.1 after first 2 hidden layers. **Note: You will likely find a performance drop if you add a nonlinear activation function after the dimension reduction layer.**
* train for 10 epochs with the Adam optimizer (should take around 15-20 minutes)
* do early stopping based on validation set perplexity.


We encourage you to try other architectures and hyperparameters, and you will likely find some that work better than the ones listed above.  A proper implementation with these should be enough to receive full credit on the assignment, though.

In [20]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from tqdm import tqdm

class NeuralNgramDataset(torch.utils.data.Dataset):
    def __init__(self, text_token_ids: List[int], n: int):
        self.text_token_ids = text_token_ids
        self.n = n

    def __len__(self):
        return len(self.text_token_ids)

    def __getitem__(self, i: int):
        if i < self.n - 1:
            prev_token_ids = [vocab.str_to_id[vocab.eos_tok]] * (self.n - i - 1) + self.text_token_ids[:i]
        else:
            prev_token_ids = self.text_token_ids[i - self.n + 1 : i]

        assert len(prev_token_ids) == self.n - 1, prev_token_ids

        x = torch.tensor(prev_token_ids, dtype=torch.long)
        y = torch.tensor(self.text_token_ids[i], dtype=torch.long)
        return x, y

class NeuralNGramNetwork(nn.Module):
    def __init__(self, n: int, embed_dim: int = 128, hidden_dim: int = 1024, dropout_rate: float = 0.1):
        super().__init__()
        self.n = n
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(len(vocab.all_tokens), embed_dim)

        self.fc1 = nn.Linear((n-1) * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)

        self.fc3 = nn.Linear(hidden_dim, embed_dim)

        self.output_layer = nn.Linear(embed_dim, len(vocab.all_tokens))

        self.dropout = nn.Dropout(dropout_rate)

        self.relu = nn.ReLU()

        self.output_layer.weight = self.embedding.weight

    def forward(self, x):
        x = self.embedding(x)

        x = x.view(x.size(0), -1)

        x = self.dropout(self.relu(self.fc1(x)))
        x = self.dropout(self.relu(self.fc2(x)))

        x = self.fc3(x)

        logits = self.output_layer(x)

        return torch.log_softmax(logits, dim=1)

class NeuralNGramModel:
    def __init__(self, n: int, device: str = "cpu", **model_configs):
        self.n = n
        self.device = device
        if "cuda" in self.device:
            assert torch.cuda.is_available(), "no GPU found, in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator"

        self.network = NeuralNGramNetwork(n, **model_configs).to(self.device)

    def train(self, n_epoch: int = 10, lr: float = 0.001, batch_size: int = 128, patience: int = 3):
        train_dataset = NeuralNgramDataset(vocab.strs_to_ids(tok_train_dataset), self.n)
        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        criterion = nn.NLLLoss()
        optimizer = torch.optim.Adam(self.network.parameters(), lr=lr, weight_decay=1e-5)  # Add L2 regularization
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)

        best_val_perplexity = float('inf')
        patience_counter = 0

        for epoch in range(n_epoch):
            self.network.train()
            total_loss = 0
            for x_batch, y_batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{n_epoch}"):
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)

                optimizer.zero_grad()
                log_probs = self.network(x_batch)

                loss = criterion(log_probs, y_batch)
                total_loss += loss.item()

                loss.backward()
                optimizer.step()

            avg_loss = total_loss / len(train_dataloader)
            print(f"Epoch {epoch+1}/{n_epoch}, Loss: {avg_loss}")

            val_perplexity = self.perplexity(tok_validation_dataset)
            print(f"Validation Perplexity: {val_perplexity}")

            scheduler.step(val_perplexity)

            if val_perplexity < best_val_perplexity:
                best_val_perplexity = val_perplexity
                patience_counter = 0
            else:
                patience_counter += 1

            if patience_counter >= patience:
                print(f"Early stopping triggered after {epoch+1} epochs.")
                break

    def next_word_probabilities(self, text_prefix: List[str]) -> List[float]:
        self.network.eval()
        with torch.no_grad():
            prefix_ids = vocab.strs_to_ids(text_prefix)
            if len(prefix_ids) < self.n - 1:
                prefix_ids = [vocab.str_to_id[vocab.eos_tok]] * (self.n - len(prefix_ids) - 1) + prefix_ids

            x = torch.tensor(prefix_ids[-(self.n - 1):], dtype=torch.long).unsqueeze(0).to(self.device)
            log_probs = self.network(x)
            return log_probs.squeeze().exp().cpu().tolist()

    def perplexity(self, text: List[str]) -> float:
        self.network.eval()
        with torch.no_grad():
            dataset = NeuralNgramDataset(vocab.strs_to_ids(text), self.n)
            dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False)
            total_loss = 0
            criterion = nn.NLLLoss(reduction='sum')

            for x_batch, y_batch in dataloader:
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)
                log_probs = self.network(x_batch)
                loss = criterion(log_probs, y_batch)
                total_loss += loss.item()

            return np.exp(total_loss / len(text))


device = "cuda" if torch.cuda.is_available() else "cpu"
neural_trigram_model = NeuralNGramModel(3, device=device)

neural_trigram_model.train(lr=5e-4)

print('neural trigram validation perplexity:', neural_trigram_model.perplexity(tok_validation_dataset))




Epoch 1/10: 100%|██████████| 16217/16217 [01:38<00:00, 164.02it/s]


Epoch 1/10, Loss: 6.5978011199914475
Validation Perplexity: 405.0825677594784


Epoch 2/10: 100%|██████████| 16217/16217 [01:38<00:00, 164.77it/s]


Epoch 2/10, Loss: 6.0273261402217395
Validation Perplexity: 321.64201555455423


Epoch 3/10: 100%|██████████| 16217/16217 [01:38<00:00, 164.52it/s]


Epoch 3/10, Loss: 5.829297508904972
Validation Perplexity: 292.9289693250607


Epoch 4/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.11it/s]


Epoch 4/10, Loss: 5.705565201116636
Validation Perplexity: 275.96755562031717


Epoch 5/10: 100%|██████████| 16217/16217 [01:38<00:00, 164.95it/s]


Epoch 5/10, Loss: 5.609052222938112
Validation Perplexity: 263.61213621420336


Epoch 6/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.02it/s]


Epoch 6/10, Loss: 5.52753154500945
Validation Perplexity: 253.08447754070977


Epoch 7/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.38it/s]


Epoch 7/10, Loss: 5.453773746259955
Validation Perplexity: 244.32473571537375


Epoch 8/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.34it/s]


Epoch 8/10, Loss: 5.388100118990516
Validation Perplexity: 240.65404470408362


Epoch 9/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.17it/s]


Epoch 9/10, Loss: 5.329364185835464
Validation Perplexity: 234.54735473821128


Epoch 10/10: 100%|██████████| 16217/16217 [01:38<00:00, 165.23it/s]


Epoch 10/10, Loss: 5.275186468363029
Validation Perplexity: 231.15245787756393
neural trigram validation perplexity: 231.15245787756393


In [45]:
save_truncated_distribution(neural_trigram_model, 'neural_trigram_predictions.npy', short=False)

  0%|          | 0/5000 [00:00<?, ?it/s]

saved neural_trigram_predictions.npy


TODO: Fill in your neural trigram perplexity in the report.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Neural trigram validation perplexity: 233.6000816796356

1



Free up RAM.

In [None]:
# Delete model we don't need.
del neural_trigram_model

### LSTM Model

For this stage of the project, you will implement an LSTM language model.

For recurrent language modeling, the data batching strategy is a bit different from what is used in some other tasks.  Sentences are concatenated together so that one sentence starts right after the other, and an unfinished sentence will be continued in the next batch.
To properly deal with this input format, you should **save the last state of the LSTM from a batch to feed in as the first state of the next batch**.  When you save state across different batches, you should call `.detach()` on the state tensors before the next batch to tell PyTorch not to backpropagate gradients through the state into the batch you have already finished (which will cause a runtime error).

We expect your model to reach a validation perplexity around/below **214**.
The following architecture and hyperparameters should be sufficient to get there.
* 3 LSTM layers with 512 units
* dropout of 0.5 after each LSTM layer
* instead of projecting directly from the last LSTM output to the vocabulary size for softmax, project down to a smaller size first (e.g. 512->128->vocab_size). **NOTE: You may find that adding nonlinearities between these layers can hurt performance, try without first.**
* use the same weights for the embedding layer and the pre-softmax layer; dimension 128
* train with Adam (using default learning rates) for at least 20 epochs


In [37]:
# ref: https://github.com/pytorch/text/blob/0.5.0/torchtext/data/iterator.py#L173

class LstmDataIterator:
    def __init__(self, dataset: List[int], batch_size: int = 64, seq_len: int = 32, device: str = "cpu"):
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.device = device

        dataset = dataset + [vocab.str_to_id[vocab.pad_tok]] * (math.ceil(len(dataset) / batch_size) * batch_size - len(dataset))

        self.n_samples = math.ceil(
            (len(dataset) // batch_size - 1) / seq_len
        )

        dataset = torch.tensor(dataset, dtype=torch.long)
        self.dataset = dataset.view(batch_size, -1).t().contiguous()

    def __len__(self):
        return self.n_samples

    def __getitem__(self, i: int):
        start = i * self.seq_len
        end = min(start + self.seq_len, self.dataset.shape[0] - 1)

        inputs = self.dataset[start : end]
        outputs = self.dataset[start + 1 : end + 1]
        assert inputs.shape == outputs.shape, f"{i}: {inputs.shape} {outputs.shape}"
        return inputs.to(self.device), outputs.to(self.device)

In [38]:
class LstmDataIterator:
    def __init__(self, dataset: List[int], batch_size: int = 64, seq_len: int = 32, device: str = "cpu"):
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.device = device

        dataset = dataset + [vocab.str_to_id[vocab.pad_tok]] * (math.ceil(len(dataset) / batch_size) * batch_size - len(dataset))

        self.n_samples = math.ceil((len(dataset) // batch_size - 1) / seq_len)

        dataset = torch.tensor(dataset, dtype=torch.long)
        self.dataset = dataset.view(batch_size, -1).t().contiguous()

    def __len__(self):
        return self.n_samples

    def __getitem__(self, i: int):
        start = i * self.seq_len
        end = min(start + self.seq_len, self.dataset.shape[0] - 1)

        if start >= end:
            raise IndexError(f"Invalid sequence at index {i}. Start: {start}, End: {end}")

        inputs = self.dataset[start:end]  # Input sequence
        outputs = self.dataset[start + 1:end + 1]  # Target sequence

        if inputs.shape[0] == 0 or outputs.shape[0] == 0:
            raise ValueError(f"Invalid sequence length at index {i}. Start: {start}, End: {end}")

        return inputs.to(self.device), outputs.to(self.device)

In [39]:
class LSTMNetwork(nn.Module):
    def __init__(self, embed_dim: int = 128, n_layer: int = 3, hidden_dim: int = 512, dropout_rate: float = 0.5):
        super().__init__()

        self.embedding = nn.Embedding(len(vocab.all_tokens), embed_dim)

        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=n_layer, dropout=dropout_rate)

        self.fc1 = nn.Linear(hidden_dim, 128)
        self.fc2 = nn.Linear(128, len(vocab.all_tokens))

        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x: torch.Tensor, state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None):
        """Forward pass through the LSTM model."""
        x = self.embedding(x)  # (seq_len, batch, embed_dim)

        if state is None:
            output, state = self.lstm(x)  # (seq_len, batch, hidden_dim), (state, cell_state)
        else:
            output, state = self.lstm(x, state)

        output = self.dropout(output)

        output = self.fc1(output)

        logits = self.fc2(output)

        return torch.log_softmax(logits, dim=-1), state

class LSTMModel:
    def __init__(self, device: str = "cpu", **model_configs):
        self.device = device
        if "cuda" in self.device:
            assert torch.cuda.is_available(), "no GPU found, in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator"

        self.network = LSTMNetwork(**model_configs).to(self.device)

    def train(self, n_epoch: int = 20, lr: float = 1e-3, batch_size: int = 64, seq_len: int = 32):
        train_data_iter = LstmDataIterator(vocab.strs_to_ids(tok_train_dataset), batch_size, seq_len, self.device)

        optimizer = torch.optim.Adam(self.network.parameters(), lr=lr)
        criterion = nn.NLLLoss(ignore_index=vocab.str_to_id[vocab.pad_tok])

        self.network.train()
        for epoch in range(n_epoch):
            state = None
            total_loss = 0
            for x_batch, y_batch in tqdm.notebook.tqdm(train_data_iter):
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)

                optimizer.zero_grad()

                log_probs, state = self.network(x_batch, state)

                state = tuple(s.detach() for s in state)

                log_probs = log_probs.view(-1, log_probs.size(-1))  # (seq_len * batch_size, vocab_size)
                y_batch = y_batch.view(-1)  # (seq_len * batch_size)

                if log_probs.size(0) > y_batch.size(0):
                    log_probs = log_probs[:y_batch.size(0), :]

                assert log_probs.size(0) == y_batch.size(0), f"Log_probs size: {log_probs.size()}, y_batch size: {y_batch.size()}"

                loss = criterion(log_probs, y_batch)

                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            print(f'Epoch {epoch+1}/{n_epoch}, Loss: {total_loss / len(train_data_iter)}')


    def next_word_probabilities(self, text_prefix: List[str]):
        self.network.eval()
        with torch.no_grad():
            ids_prefix = torch.tensor(vocab.strs_to_ids(text_prefix), dtype=torch.long, device=self.device).view(-1, 1)
            log_probs, _ = self.network(ids_prefix)
            return log_probs[-1].exp().cpu().tolist()

    def dataset_perplexity(self, dataset: List[str], batch_size: int = 64, seq_len: int = 32):
        self.network.eval()
        data_iterator = LstmDataIterator(vocab.strs_to_ids(dataset), batch_size, seq_len, self.device)
        total_loss = 0
        criterion = nn.NLLLoss(ignore_index=vocab.str_to_id[vocab.pad_tok], reduction='sum')

        with torch.no_grad():
            state = None
            total_words = 0
            for x_batch, y_batch in data_iterator:
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)
                log_probs, state = self.network(x_batch, state)
                loss = criterion(log_probs.view(-1, log_probs.size(-1)), y_batch.view(-1))
                total_loss += loss.item()
                total_words += y_batch.numel()

        return np.exp(total_loss / total_words)


In [None]:
lstm_model = LSTMModel(device="cuda")
lstm_model.train()

print('lstm validation perplexity:', lstm_model.dataset_perplexity(tok_validation_dataset))

  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 1/20, Loss: 6.7431644133091915


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 2/20, Loss: 5.950369212284126


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 3/20, Loss: 5.605224979934843


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 4/20, Loss: 5.3797156265266315


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 5/20, Loss: 5.218492872381116


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 6/20, Loss: 5.0927356507416075


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 7/20, Loss: 4.990925568800706


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 8/20, Loss: 4.903914954770482


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 9/20, Loss: 4.827323148941852


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 10/20, Loss: 4.759754637054202


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 11/20, Loss: 4.701900216484446


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 12/20, Loss: 4.648988551642063


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 13/20, Loss: 4.601637723177848


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 14/20, Loss: 4.558765421252279


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 15/20, Loss: 4.516696905478453


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 16/20, Loss: 4.480322037928203


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 17/20, Loss: 4.447375322469828


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 18/20, Loss: 4.415026728688377


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 19/20, Loss: 4.38666970160821


  0%|          | 0/1014 [00:00<?, ?it/s]

Epoch 20/20, Loss: 4.359356371373584
lstm validation perplexity: 165.12496551725008


TODO: Report your LSTM perplexity.

LSTM validation perplexity: 165.12496551725008

# Experimentation: 1-Page Report

Now it's time for you to experiment.  Try to reach a validation perplexity below 200. You may either modify the LSTM class above, or copy it down to the code cell below and modify it there. Just **be sure to run code cell below to generate results with your improved LSTM**.  

It is okay if the bulk of your improvements are due to hyperparameter tuning (such as changing number or sizes of layers), but implement at least one more substantial change to the model.  Here are some ideas (several of which come from https://arxiv.org/pdf/1708.02182.pdf):
* activation regularization - add a l2 regularization penalty on the activation of the LSTM output (standard l2 regularization is on the weights)
* weight-drop regularization - apply dropout to the weight matrices instead of activations
* learning rate scheduling - decrease the learning rate during training
* embedding dropout - zero out the entire embedding for a random set of words in the embedding matrix
* ensembling - average the predictions of several models trained with different initialization random seeds
* temporal activation regularization - add l2 regularization on the difference between the LSTM output activations at adjacent timesteps

You may notice that most of these suggestions are regularization techniques.  This dataset is considered fairly small, so regularization is one of the best ways to improve performance.

TODO: In the report, submit a write-up describing the extensions and/or modifications that you tried.  Your description should be **1-page maximum** in length.
For full credit, your write-up should include:
1.   A concise and precise description of the extension that you tried.
2.   A motivation for why you believed this approach might improve your model.
3.   A discussion of whether the extension was effective and/or an analysis of the results.  This will generally involve some combination of tables, learning curves, etc.
4.   A bottom-line summary of your results comparing validation perplexities of your improvement to the original LSTM.


Run the cell below in order to train your improved LSTM and evaluate it.  

In [40]:
import torch
import torch.nn as nn
import numpy as np
from torch.optim.lr_scheduler import ReduceLROnPlateau

class TARLSTM(nn.Module):
    def __init__(self, embed_dim: int = 128, n_layer: int = 3, hidden_dim: int = 512, dropout_rate: float = 0.5, tar_lambda: float = 0.1, embedding_dropout: float = 0.2):
        super().__init__()

        self.embedding = nn.Embedding(len(vocab.all_tokens), embed_dim)

        self.embedding_dropout = nn.Dropout(embedding_dropout)

        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=n_layer, dropout=dropout_rate)

        self.fc1 = nn.Linear(hidden_dim, 128)
        self.fc2 = nn.Linear(128, len(vocab.all_tokens))

        self.dropout = nn.Dropout(dropout_rate)

        self.tar_lambda = tar_lambda

    def forward(self, x: torch.Tensor, state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None):
        """Forward pass through the LSTM model with temporal activation regularization."""
        x = self.embedding(x)  # (seq_len, batch, embed_dim)
        x = self.embedding_dropout(x)

        if state is None:
            output, state = self.lstm(x)  # (seq_len, batch, hidden_dim), (state, cell_state)
        else:
            output, state = self.lstm(x, state)

        output = self.dropout(output)

        output = self.fc1(output)

        logits = self.fc2(output)

        return torch.log_softmax(logits, dim=-1), state, output


class TARLSTMModel:
    def __init__(self, device: str = "cpu", **model_configs):
        self.device = device
        self.network = TARLSTM(**model_configs).to(self.device)

    def train(self, n_epoch: int = 30, lr: float = 1e-3, batch_size: int = 64, seq_len: int = 32):
        train_data_iter = LstmDataIterator(vocab.strs_to_ids(tok_train_dataset), batch_size, seq_len, self.device)
        optimizer = torch.optim.Adam(self.network.parameters(), lr=lr)
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3, verbose=True)
        criterion = nn.NLLLoss(ignore_index=vocab.str_to_id[vocab.pad_tok])

        self.network.train()
        for epoch in range(n_epoch):
            state = None
            total_loss = 0
            tar_loss = 0
            for x_batch, y_batch in train_data_iter:
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)
                optimizer.zero_grad()

                log_probs, state, output = self.network(x_batch, state)

                state = tuple(s.detach() for s in state)

                log_probs = log_probs.view(-1, log_probs.size(-1))  # (seq_len * batch_size, vocab_size)
                y_batch = y_batch.view(-1)  # (seq_len * batch_size)

                loss = criterion(log_probs, y_batch)

                if output.size(0) > 1:
                    tar_loss = self.network.tar_lambda * torch.mean((output[1:] - output[:-1])**2)

                total_loss = loss + tar_loss

                total_loss.backward()
                optimizer.step()

            scheduler.step(total_loss.item() / len(train_data_iter))

            print(f'Epoch {epoch+1}/{n_epoch}, Loss: {total_loss.item() / len(train_data_iter)}, TAR Loss: {tar_loss.item() / len(train_data_iter)}')

    def dataset_perplexity(self, dataset: List[str], batch_size: int = 64, seq_len: int = 32):
        self.network.eval()
        data_iterator = LstmDataIterator(vocab.strs_to_ids(dataset), batch_size, seq_len, self.device)
        total_loss = 0
        criterion = nn.NLLLoss(ignore_index=vocab.str_to_id[vocab.pad_tok], reduction='sum')

        with torch.no_grad():
            state = None
            total_words = 0
            for x_batch, y_batch in data_iterator:
                x_batch, y_batch = x_batch.to(self.device), y_batch.to(self.device)
                log_probs, state, _ = self.network(x_batch, state)
                loss = criterion(log_probs.view(-1, log_probs.size(-1)), y_batch.view(-1))
                total_loss += loss.item()
                total_words += y_batch.numel()

        return np.exp(total_loss / total_words)

    def next_word_probabilities(self, text_prefix: List[str]):
        """Returns a list of probabilities for each word in the vocabulary based on the input prefix."""
        self.network.eval()
        with torch.no_grad():
            ids_prefix = torch.tensor(vocab.strs_to_ids(text_prefix), dtype=torch.long, device=self.device).view(-1, 1)
            log_probs, _, _ = self.network(ids_prefix)
            return log_probs[-1].exp().cpu().tolist()

device = "cuda" if torch.cuda.is_available() else "cpu"
tarlstm_model = TARLSTMModel(device=device)
tarlstm_model.train()

print('TAR LSTM validation perplexity:', tarlstm_model.dataset_perplexity(tok_validation_dataset))


Epoch 1/30, Loss: 0.006270152812408508, TAR Loss: 3.988480588505724e-05
Epoch 2/30, Loss: 0.005798277770273784, TAR Loss: 4.5054641789233194e-05
Epoch 3/30, Loss: 0.005502308614155245, TAR Loss: 4.793492309322959e-05
Epoch 4/30, Loss: 0.0052994840008737535, TAR Loss: 5.184409724772564e-05
Epoch 5/30, Loss: 0.005120102470443094, TAR Loss: 5.6343678480539566e-05
Epoch 6/30, Loss: 0.004977221084534534, TAR Loss: 5.570751719573546e-05
Epoch 7/30, Loss: 0.004897821114143206, TAR Loss: 5.698061862288142e-05
Epoch 8/30, Loss: 0.0048038108344143895, TAR Loss: 5.682604111863311e-05
Epoch 9/30, Loss: 0.004667315022244726, TAR Loss: 5.7663702400478384e-05
Epoch 10/30, Loss: 0.004632353547527005, TAR Loss: 5.901317227874282e-05
Epoch 11/30, Loss: 0.004582720630502795, TAR Loss: 5.886684625576704e-05
Epoch 12/30, Loss: 0.004471203279213087, TAR Loss: 6.004060660828735e-05
Epoch 13/30, Loss: 0.00444058506681604, TAR Loss: 5.969976091525964e-05
Epoch 14/30, Loss: 0.004401294437385875, TAR Loss: 6.001

In [42]:
def save_truncated_distribution(model, filename, short=True):
    """Generate a file of truncated distributions.

    Probability distributions over the full vocabulary are large,
    so we will truncate the distribution to a smaller vocabulary.

    Please do not edit this function
    """
    vocab_name = 'eval_output_vocab'
    prefixes_name = 'eval_prefixes'

    if short:
      vocab_name += '_short'
      prefixes_name += '_short'

    with open(f'{vocab_name}.txt', 'r') as eval_vocab_file:
        eval_vocab = [w.strip() for w in eval_vocab_file]
    eval_vocab_ids = sorted(list(set([vocab.str_to_id[s] if s in vocab else vocab.str_to_id[vocab.unk_tok]
                      for s in eval_vocab])))

    all_selected_probabilities = []
    with open(f'{prefixes_name}.txt', 'r') as eval_prefixes_file:
        lines = eval_prefixes_file.readlines()
        for line in tqdm.notebook.tqdm(lines, leave=False):
            prefix = line.strip().split(' ')
            probs = model.next_word_probabilities(prefix)

            selected_probs = np.array([probs[i] for i in eval_vocab_ids if i < len(probs)], dtype=np.float32)
            all_selected_probabilities.append(selected_probs)

    all_selected_probabilities = np.stack(all_selected_probabilities)
    np.save(filename, all_selected_probabilities)
    print('saved', filename)


In [44]:
import tqdm

print('TAR LSTM validation perplexity:', tarlstm_model.dataset_perplexity(tok_validation_dataset))
save_truncated_distribution(tarlstm_model, 'lstm_predictions.npy', short=False)

TAR LSTM validation perplexity: 146.05531668831435


  0%|          | 0/5000 [00:00<?, ?it/s]

saved lstm_predictions.npy


### Submission

Upload a submission with the following files to Gradescope:
* proj_1.ipynb (rename to match this exactly)
* lstm_predictions.npy
* neural_trigram_predictions.npy
* trigram_backoff_predictions.npy
* bigram_predictions.npy
* report.pdf

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.  Note that the test set perplexities shown by the autograder are on a completely different scale from your validation set perplexities due to truncating the distribution and selecting different text.  Don't worry if the values seem much worse.

In [None]:
!ls

In [None]:
from google.colab import drive
drive.mount('/content/drive')