In [1]:
version = "v1.6.092820"

---
# Assignment 1 Part 2: N-gram Language Models (Cont.) (30 pts)

In this assignment, we're going to train an n-gram language model that is able to "imitate" William Shakespeare's writing. 

In [2]:
# Configure nltk

import nltk

nltk_data_path = "assets/nltk_data"
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

In [3]:
# Copy and paste the functions you wrote in Part 1 here and import any libraries necessary
# We have tried a more elegant solution by using
# from ipynb.fs.defs.assignment1_part1 import load_data, build_vocab, build_ngrams
# but it doesn't work with the autograder...

from itertools import chain
from nltk.tokenize import word_tokenize
from nltk import ngrams
from nltk.lm.preprocessing import pad_both_ends
import pandas as pd

def load_data():
    """
    Load text data from a file and produce a list of token lists
    """
    with open('assets/gutenberg/THE_SONNETS.txt', 'r') as f:
        # Strip surrounding whitespace (including the trailing newline)
        lines = [line.strip() for line in f]

    # Drop empty lines and lines that consist only of digits
    lines = [line for line in lines if line and not line.isdigit()]

    # Tokenize each remaining line and lowercase every token
    sentences = [[token.lower() for token in word_tokenize(line)] for line in lines]

    return sentences



def build_vocab(sentences):
    """
    Take a list of sentences and return a vocab
    """
    # Unique tokens across all sentences, plus the two padding symbols
    vocab = list(set(chain.from_iterable(sentences)))
    vocab.extend(['<s>', '</s>'])

    return vocab

def build_ngrams(n, sentences):
    """
    Take a list of unpadded sentences and create all n-grams as specified by the argument "n" for each sentence
    """
    all_ngrams = []

    for sentence in sentences:
        # Pad both ends with n - 1 boundary symbols, then slide a window of size n
        padded = list(pad_both_ends(sentence, n=n))
        all_ngrams.append(list(ngrams(padded, n)))

    return all_ngrams

    


## Question 4: Guess the next token (20 pts)

Let's first warm ourselves up by answering the following question as a review on $n$-grams:

Assume we are now working with bi-grams. What is the most likely token to follow the sequence `<s> <s> <s>`, and how likely is it? Remember that a bi-gram language model is essentially a first-order Markov chain. So, what determines the next state in a first-order Markov chain?

**Complete the function below to return a `tuple`, where `tuple[0]` is a `str` representing the most likely token and `tuple[1]` is a `float` representing its (conditional) probability of being the next token.**
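
The Markov property mentioned above can be checked on a toy corpus. The sketch below is a minimal pure-Python illustration, not the assignment solution: the corpus `toy_sentences` and the helper `next_token_distribution` are hypothetical names introduced only for this example. It conditions on the last token of the history alone, which is all a bi-gram model can see.

```python
from collections import Counter

# Hypothetical toy corpus; the assignment itself uses the sonnets.
toy_sentences = [
    ["shall", "i", "compare", "thee"],
    ["shall", "i", "die"],
    ["thou", "art", "fair"],
]

def next_token_distribution(history, sentences):
    """Return P(next | history) under a bi-gram (first-order Markov) model."""
    last = history[-1]  # only the final token of the history matters
    counts = Counter()
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            if prev == last:
                counts[nxt] += 1
    total = sum(counts.values())
    if total == 0:  # the last token never occurs as a bi-gram prefix
        return {}
    return {token: count / total for token, count in counts.items()}

one_pad = next_token_distribution(("<s>",), toy_sentences)
three_pads = next_token_distribution(("<s>",) * 3, toy_sentences)
best = max(three_pads, key=three_pads.get)
print(best, three_pads[best])
```

Both histories yield the same distribution, because the extra `<s>` tokens are invisible to a model that looks back only one step.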

In [4]:
from collections import Counter

def bigram_next_token(start_tokens=("<s>", ) * 3):
    """
    Take some starting tokens and produce the most likely token that follows under a bi-gram model
    """
    ngram_length = 2
    sentences = load_data()
    bigrams = build_ngrams(ngram_length, sentences)

    # A bi-gram model is a first-order Markov chain, so only the last
    # starting token determines the distribution over the next token.
    context = start_tokens[-1]

    # Count every token that follows the context token in the corpus
    followers = Counter(
        second
        for sentence in bigrams
        for first, second in sentence
        if first == context
    )

    # Normalize by how often the context occurs as a bi-gram prefix,
    # instead of dividing by a hard-coded constant
    total = sum(followers.values())
    next_token, count = followers.most_common(1)[0]
    prob = count / total

    return next_token, prob
    

    


In [6]:
# Autograder tests

stu_ans = bigram_next_token(start_tokens=("<s>", ) * 3)

assert isinstance(stu_ans, tuple), "Q4: Your function should return a tuple. "
assert len(stu_ans) == 2, "Q4: Your tuple should have two elements. "
assert isinstance(stu_ans[0], str), "Q4: tuple[0] should be a str. "
assert isinstance(stu_ans[1], float), "Q4: tuple[1] should be a float. "

# Some hidden tests


del stu_ans

## Question 5: Train an $n$-gram language model (10 pts)

Now we are well positioned to start training an $n$-gram language model. We can fit a language model using the `MLE` class from `nltk.lm`. It requires two inputs: a list of all $n$-grams for each sentence and a vocabulary. You have already written functions to build both, so now it's time to put them to work together.

**Complete the function below to return a trained $n$-gram language model.**
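
For intuition about what a maximum-likelihood $n$-gram model estimates, the sketch below computes the relative frequency $c(h, w) / c(h)$ of a word $w$ after a context $h$ by plain counting. It is a pure-Python stand-in written for illustration and makes no claim about how `nltk.lm.MLE` is implemented; `fit_counts`, `mle_score`, and `toy_ngrams` are hypothetical names. The input has the same shape as the output of `build_ngrams`: one list of $n$-gram tuples per sentence.

```python
from collections import Counter, defaultdict

def fit_counts(ngrams_per_sentence):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for sentence_ngrams in ngrams_per_sentence:
        for gram in sentence_ngrams:
            *context, word = gram
            counts[tuple(context)][word] += 1
    return counts

def mle_score(counts, word, context):
    """Relative frequency of `word` after `context`; 0.0 for unseen contexts."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

# Hypothetical bi-grams for three tiny "sentences"
toy_ngrams = [
    [("<s>", "my"), ("my", "love"), ("love", "</s>")],
    [("<s>", "my"), ("my", "muse"), ("muse", "</s>")],
    [("<s>", "thy"), ("thy", "love"), ("love", "</s>")],
]
counts = fit_counts(toy_ngrams)
print(mle_score(counts, "my", ["<s>"]))   # two of the three sentences start with "my"
print(mle_score(counts, "love", ["my"]))  # "my" is followed by "love" half the time
```

Because the estimate is a pure relative frequency, any word never observed after a given context gets probability zero.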

In [9]:
from nltk.lm import MLE

def train_ngram_lm(n):
    """
    Train an n-gram language model as specified by the argument "n"
    """
    sentences = load_data()
    list_of_ngrams = build_ngrams(n, sentences)
    vocabulary = build_vocab(sentences)

    # Fit a maximum-likelihood model of order n on the per-sentence n-grams
    lm = MLE(n)
    lm.fit(list_of_ngrams, vocabulary)

    return lm

In [10]:
# Autograder tests

stu_n = 4
stu_lm = train_ngram_lm(stu_n)
stu_vocab = build_vocab(load_data())

assert hasattr(stu_lm, "vocab") and len(stu_lm.vocab) == len(stu_vocab) + 1, "Q5: Your language model wasn't trained properly. "

del stu_n, stu_lm, stu_vocab

FINALLY, are you ready to compose sonnets like the real Shakespeare?! We provide some starter code below, but absolutely feel free to modify any part of it on your own. It'd be interesting to see how the "authenticity" of the sonnets is related to the parameter $n$. Do the sonnets feel more Shakespearean when you increase $n$?
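
One way to make the question about $n$ concrete is to count how many distinct tokens can follow each context. The sketch below is a hedged illustration on a hypothetical toy corpus (`toy_sentences` and `average_branching` are names introduced here, not part of the assignment). It pads each sentence with $n - 1$ boundary symbols on both sides, as `pad_both_ends` does, and reports the average number of distinct continuations per context.

```python
from collections import defaultdict

def average_branching(sentences, n):
    """Average number of distinct next tokens per (n-1)-token context."""
    followers = defaultdict(set)
    for sentence in sentences:
        padded = ["<s>"] * (n - 1) + sentence + ["</s>"] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            followers[context].add(padded[i + n - 1])
    return sum(len(f) for f in followers.values()) / len(followers)

# Hypothetical toy corpus
toy_sentences = [
    ["my", "love", "is", "true"],
    ["my", "love", "is", "blind"],
    ["thy", "love", "is", "mine"],
]

for n in (2, 4):
    print(n, average_branching(toy_sentences, n))
```

On this toy corpus the average drops as $n$ grows: longer contexts leave fewer alternatives, so a high-order model increasingly replays the lines it was trained on rather than composing new ones.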

In [12]:
# Every time it runs, depending on its mood, a different sonnet is written. 
n = 8
num_lines = 14
num_words_per_line = 8
text_seed = ["<s>"] * (n - 1)

lm = train_ngram_lm(n)

sonnet = []
while len(sonnet) < num_lines:
    while True:  # keep generating a line until success
        try:
            line = lm.generate(num_words_per_line, text_seed=text_seed)
        except ValueError:  # the generation is not always successful. need to capture exceptions
            continue
        else:
            line = [x for x in line if x not in ["<s>", "</s>"]]
            sonnet.append(" ".join(line))
            break

# pretty-print your sonnet
print("\n".join(sonnet))

thy self away , art present still with
is lust in action , and till action
i will be true despite thy scythe and
dulling my lines , and doing me disgrace
but like a sad slave stay and think
save breed to brave him , when he
such is my love , to thee i
thine eyes , that taught the dumb on
points on me graciously with fair aspect ,
against confounding age ’ s cruel knife ,
against the wrackful siege of batt ’ ring
whose worth ’ s unknown , although his
ay , fill it full with wills ,
nay if thou lour ’ st on me
