# Week 4: N-gram language models

In [None]:
from nltk.corpus import brown
import random
import math
import pandas as pd
from collections import Counter
random.seed(123)

The Brown Corpus comes already tokenized into words, so `brown.words()` returns a sequence of word tokens.

In [None]:
dataset = brown.words()
len(dataset)

For the purpose of experimentation, let's create a train/test split of the dataset: the first 1,000,000 tokens for training and the remainder for testing.

In [None]:
train_data = dataset[:1000000]
test_data = dataset[1000000:]

### Train a unigram language model

Let's start by reimplementing our bag-of-words model from last week, which is a unigram model.
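
The function below applies add-one (Laplace) smoothing, so every type in the vocabulary receives a nonzero probability. As a reading aid, and using notation introduced here rather than taken from the notebook, write $c(w)$ for the count of $w$ in the training tokens, $N$ for the total number of training tokens, and $|V|$ for the vocabulary size. The estimate is then:

$P(w) = \frac{c(w) + 1}{N + |V|}$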

In [None]:
def get_unigram_vocabulary(dataset):
    types = list(set(dataset))
    return types

def unigram_lm(sequence_tokens, vocabulary):
    # Laplace (add-one) smoothing: every type gets a pseudo-count of 1,
    # so the denominator grows by the vocabulary size.
    vocab_size = len(vocabulary)
    counts = Counter(sequence_tokens)  # missing tokens count as 0
    total = sum(counts.values())
    BoW = {t: (counts[t] + 1) / (total + vocab_size) for t in vocabulary}
    return BoW

Let's fit our unigram model to our dataset!

In [None]:
brown_unigrams = unigram_lm(train_data, get_unigram_vocabulary(dataset))

### Train a bigram language model

Now let's write a function that returns a bigram model. The first step is a function that returns the set of possible bigrams in our dataset.
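
As a warm-up on a toy sequence, here is a minimal sketch of one way to enumerate padded n-grams. The helper name `padded_ngrams` and the choice to pad only at the start with `n - 1` pad tokens are assumptions made for illustration; they are not prescribed by the exercise.

```python
def padded_ngrams(tokens, n, pad_token="[PAD]"):
    """Return all n-grams of `tokens` after left-padding with n-1 pad tokens.

    Hypothetical helper: left-only padding is an assumption, not a requirement.
    """
    padded = [pad_token] * (n - 1) + list(tokens)
    # zip over n shifted copies of the sequence yields consecutive n-tuples
    return list(zip(*(padded[i:] for i in range(n))))

toy = ["the", "cat", "sat", "the", "cat"]
toy_bigrams = padded_ngrams(toy, 2)        # every bigram token, in order
toy_bigram_types = sorted(set(toy_bigrams))  # distinct bigram types
```

Taking the `set` of the n-gram tokens gives the types, in the same way `get_unigram_vocabulary` does for single words.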

In [None]:
def get_bigram_vocabulary(dataset):
    bigram_types = []
    pad_token = "[PAD]"
    ## TO DO

    ##
    return bigram_types

Now that we have a way to get the set of bigram types, let's write the bigram model. Don't forget to implement Laplace smoothing.
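
For reference, Laplace smoothing for bigrams adds 1 to each bigram count and the vocabulary size to the count of the preceding word. The sketch below shows this on a toy sequence only. The function name `laplace_bigram_probs` is hypothetical, and it assumes a single pad token at the start that is counted as part of the vocabulary; your implementation may reasonably differ.

```python
from collections import Counter

def laplace_bigram_probs(tokens, pad_token="[PAD]"):
    """Toy add-one smoothed bigram probabilities for the observed bigrams.

    Assumption: one pad token at the start, included in the vocabulary size.
    """
    padded = [pad_token] + list(tokens)
    vocab_size = len(set(padded))
    bigram_counts = Counter(zip(padded, padded[1:]))
    # Each token except the last one serves as the context of one bigram
    context_counts = Counter(padded[:-1])
    return {
        (prev, word): (count + 1) / (context_counts[prev] + vocab_size)
        for (prev, word), count in bigram_counts.items()
    }

toy_probs = laplace_bigram_probs(["the", "cat", "sat", "the", "cat"])
```

Here "the" occurs twice as a context and is followed by "cat" both times, and the vocabulary has four types including the pad token, so `toy_probs[("the", "cat")]` is (2 + 1) / (2 + 4).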

In [None]:
def bigram_lm(train_data, dataset):
    bigrams = get_bigram_vocabulary(dataset)
    unigrams = get_unigram_vocabulary(dataset)+["[PAD]"]
    bigram_counts = {t: 1 for t in bigrams}
    unigram_counts = {t: 0 for t in unigrams}
    unigram_counts["[PAD]"] = 1
    bigram_probs = dict()
    vocab_size = len(unigrams)
    ## TO DO

    ##
    return bigram_probs

Let's fit a bigram model to the Brown Corpus.

In [None]:
brown_bigrams = bigram_lm(train_data, dataset)

### Train a trigram language model

Let's now repeat these steps for a trigram model.
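
The smoothed estimate has the same shape as in the bigram case, with the two preceding words as context. Using illustrative notation not defined elsewhere in this notebook, where $c(\cdot)$ is a count in the training data and $|V|$ is the vocabulary size, one common form is:

$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |V|}$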

In [None]:
def get_trigram_vocabulary(dataset):
    trigram_types = []
    pad_token = "[PAD]"
    ## TO DO

    ##
    return trigram_types

In [None]:
def trigram_lm(train_data, dataset):
    trigrams = get_trigram_vocabulary(dataset)
    bigrams = get_bigram_vocabulary(dataset)+[("[PAD]","[PAD]")]
    unigrams = get_unigram_vocabulary(dataset)+["[PAD]"]
    trigram_counts = {t: 1 for t in trigrams}
    bigram_counts = {t: 0 for t in bigrams}
    trigram_probs = dict()
    vocab_size = len(unigrams)
    ## TO DO

    ##
    return trigram_probs

Let's fit a trigram model to the Brown Corpus.

In [None]:
brown_trigrams = trigram_lm(train_data, dataset)

### Compare the perplexity of each model on the test data

Which of these models performs best at representing the test data distribution? Write a function that takes a fitted n-gram model and a test dataset and returns the perplexity of that dataset.

Since the probability of the test data is the product of the probabilities of the n-grams that compose it, it is a very small number, and we risk floating-point underflow (the product rounding to zero) when trying to compute it directly. Thus, we should sum log base 2 probabilities instead of multiplying raw probabilities. Here is the formula.

$PP(W) = 2^{-\frac{1}{n}\log_2 P(W)}$
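
To see why log space matters, here is a small sketch that works on a plain list of probabilities rather than on a fitted model. The function name `log2_perplexity` is hypothetical, and it takes $n$ to be the number of probabilities scored, which is an assumption of this example.

```python
import math

def log2_perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each item."""
    n = len(token_probs)
    # log2 P(W) is the sum of the per-item log2 probabilities
    log_prob = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_prob / n)

tiny_probs = [1e-5] * 100
direct_product = math.prod(tiny_probs)  # underflows to 0.0 in double precision
stable_perplexity = log2_perplexity(tiny_probs)
```

The direct product is 0.0, so taking its logarithm would fail, while the log-space computation gives a perplexity of about 100000, the inverse of the per-item probability.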

In [None]:
def get_perplexity(ngram_lm, test_data):
    # Infer the n-gram order from the first key: tuples for n > 1, strings for unigrams.
    first_key = next(iter(ngram_lm))
    ngram_size = len(first_key) if isinstance(first_key, tuple) else 1
    perplexity = 0.0
    n = len(test_data)+(ngram_size-1)
    ## TO DO
    
    ##
    return perplexity

In [None]:
get_perplexity(brown_trigrams, train_data)

In [None]:
def compare_perplexity_scores(models, dataset):
    results = [get_perplexity(lm, dataset) for lm in models]
    return results

models = [brown_unigrams, brown_bigrams, brown_trigrams]
train_perplexity = compare_perplexity_scores(models, train_data)
test_perplexity = compare_perplexity_scores(models, test_data)

results = {'models':['unigram','bigram','trigram'],
 'train_perplexity':train_perplexity,
 'test_perplexity':test_perplexity}

df_results = pd.DataFrame(data=results)
df_results

**What do you notice about these results? Why might that be?**