# Instructions

The second half of Assignment 3 will be different from previous assignments in this class. You will **not** be interacting with TextWorld or TextWorld-Express for this assignment.

Your task for this assignment is to create a simple language model based on unigrams, bigrams, and trigrams. This language model can then be used to generate text that is similar to the text the model is trained on. We will evaluate the accuracy of your implementation on hidden test data. You will implement code to build all three models (unigram, bigram, and trigram) and to generate text from each of them.

For this part of the assignment, you will be using NumPy extensively, so please familiarize yourself with the NumPy API (see https://numpy.org/doc/stable/reference/index.html). You are **only** allowed to use a restricted set of libraries for this assignment. All packages that come with the default Python installation are permitted, as well as any imports we have already provided for you. You may not use any other libraries than the ones we have provided. If you attempt to use other libraries, the autograder will not be able to run your code.

# Installs and Imports

In [263]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


## Imports

In [264]:
# export - DO NOT MODIFY THIS CELL
import numpy as np
import random
from helpers import (
    EOS,
    EOS_token,
    SOS,
    SOS_token,
    Vocab,
    normalize_string,
    make_vocab,
    insert_eos_sos,
)

In [265]:
# export - DO NOT MODIFY OR MOVE THIS LINE
# add any additional imports here (from the Python Standard Library only!)
from collections import defaultdict

In [266]:
# DO NOT MODIFY THIS CELL
np.random.seed(42)
random.seed(42)

# Language Models Setup

## Vocabularies for Language Models

A *language model* relates the probability of a word to the tokens that appear before it.

A bigram language model relates the probability of a word to the most recent previous word (for a total of two words modeled): $p(w_n | w_{n-1})$.

A trigram language model relates the probability of a word to the most recent two previous words (for a total of three words modeled): $p(w_n | w_{n-1}, w_{n-2})$.

We will construct a probability matrix for a bigram language model. The probability table will be a 2D table where dimension 0 is the word in position $n-1$, dimension 1 is the word in position $n$, and the value stored at that position is the probability that $w_n$ occurs directly after $w_{n-1}$. Thus, ideally, if our probability matrix is `p` then `p['you', 'are'] = 0.0089`.

But how do we know where each word should go in the matrix? We will assign a unique number to each word, from 0 to $|W|-1$, where $|W|$ is the total number of distinct words in our corpus. This number will be referred to as a "token", but it is also the "index" in an arbitrary list of words. Thus, if the token for "you" is 2 and the token for "are" is 3, then `p[2, 3] = 0.0089`.

The following creates a "vocabulary" object, which maps the words in a corpus to tokens and vice versa. The vocabulary will have two special tokens, `SOS_token` and `EOS_token`, to indicate the start of a sequence and end of a sequence, respectively. These tokens can be used to demarcate the beginnings and ends of sequences, which will be useful for generation later.
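The word-to-token mapping can be illustrated with a minimal sketch. The dictionaries `word2token` and `token2word` below are hypothetical stand-ins for the provided `Vocab` class (whose internals may differ), and the value 0.0089 is only the illustrative figure used in the text.

```python
import numpy as np

# Hypothetical stand-in for the provided Vocab class: a plain dict
# that assigns each distinct word the next unused integer.
words = "SOS EOS you are in the kitchen".split()
word2token = {}
for word in words:
    if word not in word2token:
        word2token[word] = len(word2token)
token2word = {token: word for word, token in word2token.items()}

# A |W| x |W| table indexed by (previous token, next token).
p = np.zeros((len(word2token), len(word2token)))
p[word2token["you"], word2token["are"]] = 0.0089  # illustrative value only

print(word2token["you"], word2token["are"])  # 2 3
print(p[2, 3])                               # 0.0089
```

Because tokens are plain integers, they can be used directly as array indices, which is what makes the matrix representation convenient.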

## Make a Vocabulary

In [267]:
with open("observations.txt", "r") as f:
    observations = f.read()

observations = normalize_string(insert_eos_sos(observations))
print(observations)

SOS you are in the kitchen  EOS SOS in one part of the room you see a stove  EOS SOS there is also an oven  EOS SOS you also see a fridge that is closed  EOS SOS in another part of the room you see a counter that has nothing on it  EOS SOS in one part of the room you see a kitchen cupboard that is closed  EOS SOS there is also a cutlery drawer that is closed  EOS SOS you also see a trash can that is closed  EOS SOS in another part of the room you see a dishwasher that is closed  EOS SOS in one part of the room you see a dining chair that has nothing on it  EOS SOS to the north you see a closed wood door  EOS SOS to the south you see a closed plain door  EOS SOS to the east you see the corridor  EOS SOS inventory maximum capacity is 2 items    your inventory is currently empty  EOS SOS you are in the kitchen  EOS SOS in one part of the room you see a stove  EOS SOS there is also an oven  EOS SOS you also see a fridge that is closed  EOS SOS in another part of the room you see a counter 

In [268]:
VOCAB = make_vocab([observations], "observations")
print("vocab size:", VOCAB.num_words())

vocab size: 117


Map a word to a token

In [269]:
VOCAB.word2token("kitchen")

6

Map a token to a word

In [270]:
VOCAB.token2word(42)

'east'

# Probabilistic Language Modeling **(TO-DO)**

## **Hints:**

*   Use `Vocab.word2token` to get a unique token for each word (or SOS_token/EOS_token)
*   Use `Vocab.token2word` to get the word associated with that particular token value
*   Do **not** use `Vocab.word2count`!
*   N-grams should represent the probability P(word_n | word_n-1, word_n-2, ...) for all combinations of words/tokens, so make sure these probabilities make sense (normalize!)
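One way to read the normalization hint is that, for every context that was actually observed, the probabilities over the next token should sum to 1. The check below is a sketch under that assumption; the helper name `rows_are_normalized` is hypothetical and is not part of the provided helpers or tests.

```python
import numpy as np

def rows_are_normalized(table):
    """Hypothetical check: every last-axis row sums to 1 or is all zeros."""
    sums = table.sum(axis=-1)
    return bool(np.all(np.isclose(sums, 1.0) | np.isclose(sums, 0.0)))

good = np.array([[0.0, 0.75, 0.25],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.0]])        # last context never observed
bad = np.array([[3.0, 1.0], [0.0, 2.0]])  # raw counts, not probabilities

print(rows_are_normalized(good))  # True
print(rows_are_normalized(bad))   # False
```

Summing over the last axis means the same check applies unchanged to a 3D trigram table.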

## NumPy

Because we will be using numpy matrices for this assignment, you may want to check out the [numpy package documentation](https://numpy.org/doc/stable/). API functions that will be particularly useful are:
- Creating an empty matrix: `np.zeros(shape)` will create a matrix of shape `shape=(size_of_dim_0, size_of_dim_1, ...)` filled entirely with zeros. For example `np.zeros((2, 3))` will create a $2\times 3$ matrix of zeros.
- Getting the size of a matrix: `some_matrix.shape` is the shape of a numpy matrix as a tuple where each element is the size of each dimension.
- Getting: an element at a given position can be retrieved using `some_matrix[index1, index2, ...]` or `some_matrix[index1][index2]...`.
- Setting: `some_matrix[index1, index2] = some_value` will set an individual element in a matrix at the given indices. This can also be done as `some_matrix[index1][index2]... = some_value`.
- Slicing: sub-matrices can be extracted by giving spans of indices, similar to how python lists work, but with multiple dimensions. For example `some_matrix[5:10,5:10]` will construct a $5\times 5$ square matrix that has elements from the original matrix in positions $(5,5)$ through $(9,9)$.
- Summing: `some_matrix.sum()` adds all the elements of a matrix together. Numpy matrices have a number of built-in mathematical functions like this.
- Mathematical operations: all the normal mathematical operations work as expected in linear algebra. `+` adds element-wise, `*` multiplies element-wise, etc. Matrix shapes must match.
- Type casting: `some_matrix.astype(type)` converts every element in the matrix to a particular type, e.g. `some_matrix.astype(int)`.
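
The operations listed above can be tried together in a short sketch. The matrix `m` and its values are arbitrary and chosen only for illustration.

```python
import numpy as np

m = np.zeros((2, 3))          # 2x3 matrix of zeros
print(m.shape)                # (2, 3)

m[0, 1] = 4.0                 # set one element
m[1][2] = 6.0                 # equivalent chained indexing
print(m[0, 1], m[1][2])       # 4.0 6.0

sub = m[0:2, 1:3]             # rows 0-1, columns 1-2
print(sub.shape)              # (2, 2)
print(m.sum())                # 10.0

doubled = m + m               # element-wise addition
squared = m * m               # element-wise multiplication
print(doubled[1, 2], squared[1, 2])  # 12.0 36.0

as_int = m.astype(int)        # cast every element to int
print(as_int.dtype.kind)      # i
```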

## Unigram Language Model **(TO-DO)**

Complete the `make_unigram_model()` function. It takes a text `text` which is a string of text data (normalized to be easy to work with). It also takes in a `Vocab` object. The function should return a 1D NumPy matrix with the probability of seeing a word as the value based on the word's token index from the vocab.

In [271]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def make_unigram_model(text: str, vocab: Vocab) -> np.ndarray:
    probabilities = np.zeros([vocab.num_words()])
    ### YOUR CODE BELOW HERE
    words = text.split()
    for i in range(len(words)):
        probabilities[vocab.word2token(words[i])] += 1
    probabilities/=len(words)
    ### YOUR CODE ABOVE HERE
    return probabilities

Make the unigram model

In [272]:
with open("observations.txt", "r") as f:
    observations = f.read()

observations = normalize_string(insert_eos_sos(observations))
VOCAB = make_vocab([observations], "observations")
unigram_lm = make_unigram_model(observations, VOCAB)

The following should be a high-probability word (note: this doesn't mean that the raw probability will be high, but that it will be higher than other, lower-probability words)

In [273]:
unigram_lm[VOCAB.word2token("kitchen")]

np.float64(0.005304760231863693)

### Test Unigram Model

In [274]:
from tests import test_unigram_probabilities

test_unigram_probabilities(build_unigram_function=make_unigram_model)

Beginning sonnet tests
 - Passed check:  EOS
 - Passed check:  SOS
 - Passed check:  thou
 - Passed check:  beauty
Passed sonnet tests!
Beginning observation tests
 - Passed check:  EOS
 - Passed check:  SOS
 - Passed check:  you
 - Passed check:  kitchen
Passed observation tests!
All tests passed!


## Bigram Language Model **(TO-DO)**

Complete the `make_bigram_model()` function. It takes a text `text` which is a string of text data (normalized to be easy to work with). It also takes in a `Vocab` object. The function should return a 2D NumPy matrix where each element is the probability that the token indexed by dimension 0 is followed by the token indexed by dimension 1.


In [275]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def make_bigram_model(text: str, vocab: Vocab) -> np.ndarray:
    probabilities = np.zeros([vocab.num_words()] * 2)
    ### YOUR CODE BELOW HERE
    def defaultdict_func():
        return defaultdict(int)
    bigrams = defaultdict(defaultdict_func)
    toks = [vocab.word2token(word) for word in text.split()]
    for i in range(len(toks) - 1):
        bigrams[toks[i]][toks[i + 1]] += 1
    for bc in bigrams:
        tot = sum(bigrams[bc].values())
        for c in bigrams[bc]:
            probabilities[bc, c] = bigrams[bc][c] / tot
    ### YOUR CODE ABOVE HERE
    return probabilities
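
As an alternative to nested dictionaries, the counts can be accumulated directly in a NumPy array. The sketch below is not the approach used in the cell above; it assumes the text has already been converted to a list of integer tokens, and the helper name `bigram_probs_from_tokens` is hypothetical.

```python
import numpy as np

def bigram_probs_from_tokens(tokens, vocab_size):
    """Hypothetical helper: row-normalized bigram table from integer tokens."""
    toks = np.asarray(tokens)
    counts = np.zeros((vocab_size, vocab_size))
    # np.add.at accumulates correctly even when the same (prev, next)
    # pair occurs more than once in the index arrays.
    np.add.at(counts, (toks[:-1], toks[1:]), 1)
    totals = counts.sum(axis=1, keepdims=True)
    # Divide only where a context was observed; unseen rows stay all-zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

table = bigram_probs_from_tokens([0, 2, 3, 2, 3, 1], vocab_size=4)
print(table[2, 3])  # 1.0
print(table[3])     # token 3 is followed by 2 once and by 1 once
```

The `where=` argument avoids a division-by-zero warning for tokens that never appear as a context, such as the final token of the corpus.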

Make the bigram model

In [276]:
with open("observations.txt", "r") as f:
    observations = f.read()

observations = normalize_string(insert_eos_sos(observations))
VOCAB = make_vocab([observations], "observations")
bigram_lm = make_bigram_model(observations, VOCAB)

The following should be a high-probability transition (note: this doesn't mean that the raw probability will be high, but that it will be higher than other, lower-probability transitions)

In [277]:
bigram_lm[VOCAB.word2token("you"), VOCAB.word2token("see")]

np.float64(0.7187864644107351)

The following should be a lower-probability transition.

In [278]:
bigram_lm[VOCAB.word2token("a"), VOCAB.word2token("coin")]

np.float64(0.01577503429355281)

### Test Bigram Model

In [279]:
from tests import test_bigram_probabilities

test_bigram_probabilities(build_bigram_function=make_bigram_model)

Beginning sonnet tests
 - Passed check:  SOS betwixt
 - Passed check:  SOS but
 - Passed check:  me EOS
 - Passed check:  friend EOS
 - Passed check:  thou art
 - Passed check:  my joy
Passed sonnet tests!
Beginning observations tests
 - Passed check:  SOS there
 - Passed check:  SOS you
 - Passed check:  closed EOS
 - Passed check:  coin EOS
 - Passed check:  you see
 - Passed check:  a sofa
Passed observations tests!
All tests passed!


## Trigram Language Model **(TO-DO)**

Complete the `make_trigram_model()` function. It takes a text `text` which is a string of sentences. It also takes in a `Vocab` object. The function should return a 3D NumPy matrix where each element is the probability that the token indexed by dimension 0, followed by the token indexed by dimension 1, is then followed by the token indexed by dimension 2.


In [280]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def make_trigram_model(text: str, vocab: Vocab) -> np.ndarray:
    probabilities = np.zeros([vocab.num_words()] * 3)
    ### YOUR CODE BELOW HERE
    def defaultdict_funcThird():
        return defaultdict(int)
    def defaultdict_funcSecond():
        return defaultdict(defaultdict_funcThird)
    def defaultdict_funcFirst():
        return defaultdict(defaultdict_funcSecond)
    trigrams = defaultdict_funcFirst()
    toks = [vocab.word2token(word) for word in text.split()]
    for i in range(len(toks) - 2):
        trigrams[toks[i]][toks[i + 1]][toks[i + 2]] += 1
    for fc in trigrams:
        for sc in trigrams[fc]:
            tot = sum(trigrams[fc][sc].values())
            for tc in trigrams[fc][sc]:
                probabilities[fc, sc, tc] = trigrams[fc][sc][tc] / tot
    ### YOUR CODE ABOVE HERE
    return probabilities

In [281]:
with open("observations.txt", "r") as f:
    observations = f.read()

observations = normalize_string(insert_eos_sos(observations))
VOCAB = make_vocab([observations], "observations")
trigram_lm = make_trigram_model(observations, VOCAB)

In [282]:
trigram_lm[VOCAB.word2token("you"), VOCAB.word2token("see"), VOCAB.word2token("a")]

np.float64(0.7248376623376623)

### Test Trigram Model

In [283]:
from tests import test_trigram_probabilities

test_trigram_probabilities(build_trigram_function=make_trigram_model)

Beginning sonnet tests
 - Passed check:  SOS thou art
 - Passed check:  SOS how heavy
 - Passed check:  so dear EOS
 - Passed check:  of me EOS
 - Passed check:  thou wilt be
 - Passed check:  my love thou
Passed sonnet tests!
Beginning observations tests
 - Passed check:  SOS there is
 - Passed check:  SOS in one
 - Passed check:  screen door EOS
 - Passed check:  is closed EOS
 - Passed check:  you see a
 - Passed check:  see a sofa
Passed observations tests!
All tests passed!


# Text Generation **(TO-DO)**

### Unigram Generation

Complete the following function to generate text from a unigram language model.
`generate_from_unigram()` will take the following parameters:
- `first_token`: the token that will appear first in the generated text (e.g., `SOS_token`).
- `probabilities`: a unigram model, a numpy 1D array of probability values for each token.
- `vocab`: a Vocab object.
- `max_length`: the maximum number of words to generate.

The function should generate words until either `max_length` words have been generated or an `EOS_token` is generated.

The return value will be a list of words whose first element is always the word that corresponds to `first_token`.

In [284]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def generate_from_unigram(
    first_token: int, probabilities: np.ndarray, vocab: Vocab, max_length: int
) -> list:
    # An empty list of words to populate
    words = [vocab.token2word(first_token)]
    ### YOUR CODE BELOW HERE
    if max_length <= 0:
        return words
    for i in range(max_length - 2):
        probCopy = probabilities.copy()
        probCopy[SOS_token] = 0.0
        tok = random.choices(
            range(len(probCopy)), weights=probCopy
        )[0]
        words.append(
            vocab.token2word(tok)
        )
        if tok == EOS_token:
            return words
    words.append(vocab.token2word(EOS_token))
    ### YOUR CODE ABOVE HERE
    return words

In [285]:
" ".join(generate_from_unigram(SOS_token, unigram_lm, VOCAB, max_length=128))

'SOS has EOS'

In [286]:
from tests import test_generate_from_unigram

test_generate_from_unigram(
    generate_from_unigram_function=generate_from_unigram,
    build_unigram_function=make_unigram_model,
)

['SOS', 'to', 'eternal', 'me', 'break', 'in', 'thee', 'when', 'to', 'remembrance', 'when', 'the', 'i', 'death', 'to', 'sweets', 'heart', 'EOS']
['SOS', 'answer', 'EOS']
All tests passed!


### Bigram Generation

Complete the following function to generate text from a bigram language model.
`generate_from_bigram()` will take the following parameters:
- `first_token`: the token that will appear first in the generated text (e.g., `SOS_token`).
- `probabilities`: a bigram model, a numpy 2D array of probability values where `probabilities[i][j]` is the probability that token `i` is followed by token `j`.
- `vocab`: a Vocab object.
- `max_length`: the maximum number of words to generate.

The function should generate words until either `max_length` words have been generated or an `EOS_token` is generated.

The return value will be a list of words whose first element is always the word that corresponds to `first_token`.

To implement this function, start by sampling a word based on the probability of occurring after `first_token`. Then continue to sample tokens based on their probability of occurring after each subsequently generated token. Terminate when `EOS_token` is generated or when the maximum length of tokens is generated. Convert each token into a word.

You might find the [Numpy random sampling functions](https://numpy.org/doc/stable/reference/random/index.html) helpful.
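
The sampling loop described above can be sketched with NumPy's `Generator.choice`. The 4-token table is made up, and the token numbering (0 for SOS, 1 for EOS) is an assumption for illustration rather than the numbering used by the `helpers` module.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Toy bigram table; token 0 plays the role of SOS and token 1 of EOS
# (illustrative numbering, not necessarily that of the helpers module).
toy = np.array([
    [0.0, 0.0, 1.0, 0.0],   # SOS -> 2
    [0.0, 0.0, 0.0, 0.0],   # EOS has no successors
    [0.0, 0.0, 0.0, 1.0],   # 2 -> 3
    [0.0, 0.5, 0.5, 0.0],   # 3 -> EOS or back to 2
])

def sample_next(prev_token, table):
    """Draw the next token from the row for prev_token."""
    row = table[prev_token]
    return int(rng.choice(len(row), p=row))

tokens = [0]
while tokens[-1] != 1 and len(tokens) < 10:
    tokens.append(sample_next(tokens[-1], toy))
print(tokens)
```

`Generator.choice` requires the `p` vector to sum to 1, so a row of all zeros needs separate handling before sampling.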

In [287]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def generate_from_bigram(
    first_token: int, probabilities: np.ndarray, vocab: Vocab, max_length: int
) -> list:
    # An empty list of words to populate
    words = [vocab.token2word(first_token)]
    ### YOUR CODE BELOW HERE
    if max_length <= 0:
        return words
    tok = first_token
    for i in range(max_length - 2):
        p = probabilities[tok]
        if p.sum() == 0:
            break
        tok2 = random.choices(range(len(p)), weights=p)[0]
        words.append(vocab.token2word(tok2))
        if tok2 == EOS_token:
            return words
        tok = tok2
    words.append(vocab.token2word(EOS_token))
    ### YOUR CODE ABOVE HERE
    return words

Generate from the bigram model.

In [288]:
" ".join(generate_from_bigram(SOS_token, bigram_lm, VOCAB, max_length=128))

'SOS to the room you also see the room you see a desk chair that has nothing on it EOS'

In [289]:
from tests import test_generate_from_bigram

test_generate_from_bigram(
    generate_from_bigram_function=generate_from_bigram,
    build_bigram_function=make_bigram_model,
)

['SOS', 'weary', 'with', 'my', 'love', 'and', 'make', 'the', 'strong', 'offences', 'cross', 'EOS']
['SOS', 'but', 'EOS']
All tests passed!


### Trigram Generation

Complete the following function to generate text from a trigram language model.
`generate_from_trigram()` will take the following parameters:
- `first_token`: the token that will appear first in the generated text (e.g., `SOS_token`).
- `probabilities`: a trigram model, a numpy 3D array of probability values where `probabilities[i][j][k]` is the probability that token `i` followed by token `j` is then followed by token `k`.
- `vocab`: a Vocab object.
- `max_length`: the maximum number of words to generate.

The function should generate words until either `max_length` words have been generated or an `EOS_token` is generated.

The return value will be a list of words whose first element is always the word that corresponds to `first_token`.

To implement this function, start by sampling a word based on the probability of occurring after `first_token`. To do this you will need to figure out how to make the trigram model operate like a bigram model. Once you have a sequence of two tokens, you can proceed by sampling a next token based on its probability of occurring after the token at position $n-2$ and token at position $n-1$. If the probability vector for the current sequence of two tokens is all zeros, you should select the next token randomly. Terminate when `EOS_token` is generated or when the maximum length of tokens is generated. Convert each token into a word.

You might find the [Numpy random sampling functions](https://numpy.org/doc/stable/reference/random/index.html) helpful.
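
One possible reading of "make the trigram model operate like a bigram model" is to sum the trigram table over its first axis, so that only the most recent token is used as context. This is an assumption about the intended approach, not something the notebook or the autograder prescribes, and the helper name `next_token_distribution` is hypothetical. The sketch also shows a uniform fallback for contexts whose probability vector is all zeros.

```python
import numpy as np

def next_token_distribution(trigram, context):
    """Hypothetical helper returning a sampling distribution over next tokens.

    context is a list of previously generated tokens (length >= 1).
    """
    vocab_size = trigram.shape[0]
    if len(context) >= 2:
        weights = trigram[context[-2], context[-1]]
    else:
        # Only one token so far: pool over every possible earlier token.
        weights = trigram[:, context[-1], :].sum(axis=0)
    total = weights.sum()
    if total == 0:
        # Unseen context: fall back to a uniform random choice.
        return np.full(vocab_size, 1.0 / vocab_size)
    return weights / total

# Toy table built from the token sequence 0 2 3 1 (made-up data).
toy = np.zeros((4, 4, 4))
toy[0, 2, 3] = 1.0
toy[2, 3, 1] = 1.0

print(next_token_distribution(toy, [0, 2]))  # all mass on token 3
print(next_token_distribution(toy, [2]))     # pooled over first axis: token 3
print(next_token_distribution(toy, [0]))     # unseen: uniform over 4 tokens
```

The returned vector always sums to 1, so it can be passed as weights to a sampling function without a further zero check.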

In [290]:
# export - DO NOT MODIFY OR MOVE THIS LINE
def generate_from_trigram(
    first_token: int, probabilities: np.ndarray, vocab: Vocab, max_length: int
) -> list:
    # An empty list of words to populate
    words = [vocab.token2word(first_token)]
    ### YOUR CODE BELOW HERE
    if max_length <= 0:
        return words
    # Seed the two-token context with first_token repeated.
    s = [first_token, first_token]
    for i in range(max_length - len(s) - 1):
        # Distribution over the next token given the last two tokens.
        p = probabilities[s[-2]][s[-1]]
        if p.sum() == 0:
            # Context never seen in training: stop generating.
            break
        tok = random.choices(range(len(p)), weights=p)[0]
        s.append(tok)
        words.append(vocab.token2word(tok))
        if tok == EOS_token:
            return words
    words.append(vocab.token2word(EOS_token))
    ### YOUR CODE ABOVE HERE
    return words

Generate from the trigram model.

In [291]:
" ".join(generate_from_trigram(SOS_token, trigram_lm, VOCAB, max_length=256))

'SOS EOS'

In [292]:
from tests import test_generate_from_trigram

test_generate_from_trigram(
    generate_from_trigram_function=generate_from_trigram,
    build_trigram_function=make_trigram_model,
)

['SOS', 'EOS']
['SOS', 'EOS']
All tests passed!


## Perplexity

*Perplexity* is a measure of how well a language model captures a dataset. It is a measure of how "surprised" the model is when it sees a sequence. If a model has perplexity close to its minimum (1, which is 0 on a log scale), it means that any sequence you throw at the model is very probable according to the model. If the perplexity is high, it means that the sequences it is seeing seem very improbable according to the model. If the sequences are real data from the corpus, then a low perplexity is good because the model is not being surprised by real data. Thus, lower is better.

The formula for trigram perplexity is $ppl(w)=\exp\big(-\frac{1}{n}\displaystyle\sum\limits_{t=1}^{n}\log p(w_t|w_{t-1}, w_{t-2})\big)$ where $w$ is a word vector consisting of $n$ words $w_1...w_n$. The $\log$ maps values between $0...1$ to a scale from $-\infty...0$ to avoid multiplying small probability values together. We must add log probabilities instead of multiplying regular probabilities. The $\frac{1}{n}$ gives us the average log probability. The $-1$ converts log probabilities from numbers less than 0 to numbers greater than 0 so that unlikely word vectors (probabilities close to 0 or log probabilities close to $-\infty$) become very high positive numbers (high perplexity). This is also called the *negative log likelihood*, which, counter-intuitively, is a positive score.

To get the final perplexity score, you would then apply an $\exp(\cdot)$ to map the values in the log scale back to non-log scale.

To summarize: convert probabilities to log-scale, combine probability chains, get the mean, flip to positive valued numbers, then convert to non-log-scale.

We will be making one slight modification to this: we will **not** be converting back to non-log scale. This is because perplexity is mainly used to compare different text generation schemes to each other. Since log perplexity can be compared in the same way perplexity can, we will not be using an $\exp(\cdot)$ in our calculations.
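
As a worked illustration, the log-scale score can be computed by hand for a toy table. For brevity the sketch uses a bigram table rather than a trigram one. It is not the `perplexity2` helper imported below, whose internals are not shown here; it scores each token after the first given its predecessor and assumes every transition in the sequence has non-zero probability.

```python
import math

def bigram_log_perplexity(tokens, table):
    """Sketch: mean negative log probability of each token given the previous one."""
    log_probs = [math.log(table[prev][nxt]) for prev, nxt in zip(tokens, tokens[1:])]
    return -sum(log_probs) / len(log_probs)

# Toy table as nested lists: table[prev][next] (made-up probabilities).
table = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.25, 0.75, 0.0],
]

certain = bigram_log_perplexity([1, 2], table)      # single transition with p = 1.0
mixed = bigram_log_perplexity([0, 1, 2, 0], table)  # p = 0.5, 1.0, 0.25
print(certain)
print(mixed, math.exp(mixed))  # log-scale score and ordinary perplexity
```

In the second sequence the three transition probabilities multiply to 0.125, so the mean negative log probability is $\log 2$ and the ordinary perplexity is 2; a fully predictable sequence scores 0 on the log scale.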

In [293]:
from helpers import perplexity1, perplexity2, perplexity3

with open("observations.txt", "r") as f:
    observations = f.read()
observations = normalize_string(insert_eos_sos(observations))
VOCAB = make_vocab([observations], "observations")

seq = [VOCAB.word2token(w) for w in observations.split()]

observation_unigram_lm = make_unigram_model(observations, VOCAB)
observation_bigram_lm = make_bigram_model(observations, VOCAB)
observation_trigram_lm = make_trigram_model(observations, VOCAB)

print(perplexity1(seq, observation_unigram_lm))
print(perplexity2(seq, observation_unigram_lm, observation_bigram_lm))
print(
    perplexity3(
        seq, observation_unigram_lm, observation_bigram_lm, observation_trigram_lm
    )
)

3.741279033524569
0.715972504684526
0.43539901499431016


In [294]:
with open("sonnets.txt", "r") as f:
    sonnets = f.read()
sonnets = normalize_string(insert_eos_sos(sonnets))
VOCAB = make_vocab([sonnets], "sonnets")

seq = [VOCAB.word2token(w) for w in sonnets.split()]

sonnets_unigram_lm = make_unigram_model(sonnets, VOCAB)
sonnets_bigram_lm = make_bigram_model(sonnets, VOCAB)
sonnets_trigram_lm = make_trigram_model(sonnets, VOCAB)

print(perplexity1(seq, sonnets_unigram_lm))
print(perplexity2(seq, sonnets_unigram_lm, sonnets_bigram_lm))
print(perplexity3(seq, sonnets_unigram_lm, sonnets_bigram_lm, sonnets_trigram_lm))

6.077064985815244
2.197837560885327
0.33340405928528066


## Write text with perplexity **(TO-DO)**

For this exercise, create two text sequences of your own design that have a perplexity above or below a required value. This exercise uses the observations.txt file for this sequence so all the words must exist in the observations vocabulary. Based on your understanding of perplexity, can you craft a sequence with the desired perplexity?

In [295]:
perplexity_under_3 = "you see a trash can that is closed."
perplexity_over_7 = "you see a coin that is fridge that is open plain door that is closed."

In [296]:
with open("observations.txt", "r") as f:
    observations = f.read()
observations = normalize_string(insert_eos_sos(observations))
VOCAB = make_vocab([observations], "observations")

observation_unigram_lm = make_unigram_model(observations, VOCAB)
observation_bigram_lm = make_bigram_model(observations, VOCAB)
observation_trigram_lm = make_trigram_model(observations, VOCAB)

In [297]:
low_perplexity = normalize_string(insert_eos_sos(perplexity_under_3))
low_seq = [VOCAB.word2token(w) for w in low_perplexity.split()]

print(
    perplexity3(
        low_seq, observation_unigram_lm, observation_bigram_lm, observation_trigram_lm
    )
)

1.7406409283240045


In [298]:
high_perplexity = normalize_string(insert_eos_sos(perplexity_over_7))
high_seq = [VOCAB.word2token(w) for w in high_perplexity.split()]

print(
    perplexity3(
        high_seq, observation_unigram_lm, observation_bigram_lm, observation_trigram_lm
    )
)

5.8065504708768225


### Test your sequences

In [299]:
from tests import test_perplexity

test_perplexity(
    perplexity_under_3,
    perplexity_over_7,
    build_unigram_function=make_unigram_model,
    build_bigram_function=make_bigram_model,
    build_trigram_function=make_trigram_model,
)

AssertionError: 

# Testing

## Evaluate student models

We will evaluate your models by training them on half of the data, and testing perplexity on the remaining half. To build your own tests, you can partition the given text files differently or download your own text files. You may share your results for any new text files with other students. As a reminder, you are **not** permitted to share code with other students.

In [None]:
with open("observations.txt", "r") as f:
    observations = f.read()

sil = normalize_string(insert_eos_sos(observations))
sil = " ".join(sil.split()[:2000])
sil_train = " ".join(sil.split()[:1500])
sil_test = " ".join(sil.split()[1500:])

In [None]:
sil_vocab = make_vocab([sil], "sil")
sil_vocab.num_words()

82

Build Student Models on the training data

In [None]:
unigram_lm = make_unigram_model(sil_train, sil_vocab)
bigram_lm = make_bigram_model(sil_train, sil_vocab)
trigram_lm = make_trigram_model(sil_train, sil_vocab)

Evaluate on Training Data

In [None]:
print("Training Results")

token_sequence = [sil_vocab.word2token(w) for w in sil_train.split()]

print("Unigram perplexity: ", perplexity1(token_sequence, unigram_lm))
print("Bigram perplexity: ", perplexity2(token_sequence, unigram_lm, bigram_lm))
print(
    "Trigram perplexity: ",
    perplexity3(token_sequence, unigram_lm, bigram_lm, trigram_lm),
)

Training Results
Unigram perplexity:  3.714397330463374
Bigram perplexity:  0.717304206438192
Trigram perplexity:  0.42575687600416595


Evaluate on Testing Data

In [None]:
print("Testing Results")

testing_token_seq = [sil_vocab.word2token(w) for w in sil_test.split()]

print("Unigram perplexity: ", perplexity1(testing_token_seq, unigram_lm))
print("Bigram perplexity: ", perplexity2(testing_token_seq, unigram_lm, bigram_lm))
print(
    "Trigram perplexity: ",
    perplexity3(testing_token_seq, unigram_lm, bigram_lm, trigram_lm),
)

Testing Results
Unigram perplexity:  3.745781771662003
Bigram perplexity:  0.7942959167866147
Trigram perplexity:  0.679413999328657


Run tests to ensure your perplexity falls within the acceptable bounds

In [None]:
from tests import test_model_perplexity

test_model_perplexity(
    build_unigram_function=make_unigram_model,
    build_bigram_function=make_bigram_model,
    build_trigram_function=make_trigram_model,
)

Training Results:
Unigram perplexity:  3.667381057272907
Passed unigram training perplexity
Bigram perplexity:  0.6715943147035198
Passed bigram training perplexity
Trigram perplexity:  0.38385807590353377
Passed trigram training perplexity
Testing Results:
Unigram perplexity:  3.745781771662003
Passed unigram test perplexity
Bigram perplexity:  0.7942959167866147
Passed bigram test perplexity
Trigram perplexity:  0.679413999328657
Passed trigram test perplexity
All tests passed!


# Grading

We will be grading your submission based on your probability tables, text generation, as well as the perplexity of your code evaluated on randomly chosen text data (it will be similar to the dataset we provide in this notebook). These tests will be very similar to the tests provided but on a different dataset.

Part B is worth 50 points based on your performance across several trials of randomly sampled text data (from the same data provided), and your overall score will be weighted as follows:
- 10 points for correctly calculating unigram probabilities
- 10 points for correctly calculating bigram probabilities
- 10 points for correctly calculating trigram probabilities
- 10 points for correctly generating text from your models
- 10 points for appropriate perplexity for all three models

Do not try to subvert or game the autograder, as these cases will lead to a score of 0. We have a built-in curve to our autograder to help student scores. As such, the sanity checks provided do not guarantee similar performance on the overall assignment, but they do help ensure that your code runs correctly as you intend.

# Submission

Upload this notebook to Gradescope with the file name `submission.ipynb`. The autograder will **only** run successfully if your file is named this way. You must ensure that you have removed all print statements from **your** code, or the autograder may fail to run. Excessive print statements will also result in muddled test case outputs, which makes it more difficult to interpret your score.

We've added appropriate comments to the top of certain cells for the autograder to export (`# export`). You do NOT have to do anything (e.g. remove print statements) to cells we have provided - anything related to those have been handled for you. You are responsible for ensuring your own code has no syntax errors or unnecessary print statements. You ***CANNOT*** modify the export comments at the top of the cells, or the autograder will fail to run on your submission.

You should ***not*** add any cells that your code requires to the notebook when submitting. You're welcome to add extra cells for any code you need while testing, but they will not be graded. Only the provided cells will be graded. As mentioned in the top of the notebook, **any helper functions that you add should be nested within the function that uses them.**

If you encounter any issues with the autograder, please feel free to make a post on Ed Discussion. We highly recommend making a public post to clarify any questions, as it's likely that other students have the same questions as you! If you have a question that needs to be private, please make a private post.