Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your collaborators below:

In [None]:
COLLABORATORS = ""

---

In [None]:
import numpy as np

Change the following variable to `True` if you want us to grade this challenge problem. 

**IMPORTANT**: You can only get points for _one_ challenge problem per problem set. This means that even if you complete both challenge problems on this problem set, we will only count your best score between the two.

In [None]:
SUBMIT_CHALLENGE = False

## Language Models

One way to add a non-deterministic element to a chatbot is to include a language model. In this context, a language model is a probability function $p$ that assigns probabilities to particular word sequences. For example, a language model might assign a probability of 0.15 to the phrase ${\bf w}_1 = [{\tt I}, {\tt love}, {\tt Comp}, {\tt Models}]$. We can use language models to estimate whether one phrase is more plausible than another. For example, the same language model might assign a probability of 0.22 to the phrase ${\bf w}_2 = [{\tt I}, {\tt love}, {\tt ice}, {\tt cream}]$, indicating that, with no additional evidence, we are more likely to encounter ${\bf w}_2$ than ${\bf w}_1$.

### N-Gram Models

One simple type of language model is an **n-gram model**. In these models, the probability of the next word in a sequence is calculated using the previous $n-1$ words as context. For example, if the first three words in a sentence are

$$\texttt{Turn, in, your}$$

a 4-gram language model might generate the following predictions:

$$P({\tt homework}\  | \ {\tt turn}, {\tt in}, {\tt your}) = 0.45$$
$$P({\tt assignment} \ | \ {\tt turn}, {\tt in}, {\tt your}) = 0.35$$
$$P({\tt badge} \ | \ {\tt turn}, {\tt in}, {\tt your}) = 0.05$$
$$P({\tt grave} \ | \ {\tt turn}, {\tt in}, {\tt your}) = 0.15$$
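
As a minimal sketch of how such a distribution might be used, the snippet below draws a next word in proportion to the four probabilities above and also picks the single most probable word. The names `candidates`, `probs`, and `rng` are illustrative only, and the random seed is arbitrary.

```python
import numpy as np

# Toy next-word distribution copied from the 4-gram example above.
candidates = np.array(["homework", "assignment", "badge", "grave"])
probs = np.array([0.45, 0.35, 0.05, 0.15])

# Conditional probabilities over all candidate words must sum to 1.
assert np.isclose(probs.sum(), 1.0)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
next_word = rng.choice(candidates, p=probs)  # random draw, weighted by probs
most_likely = candidates[np.argmax(probs)]   # greedy alternative
print(next_word, most_likely)
```

Sampling gives varied continuations from run to run, whereas always taking the most probable word yields the same continuation every time.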

The probability of a particular sequence of word tokens in an n-gram model can be calculated using raw token frequencies. Given a training corpus (i.e., a collection of texts), the probability of a word given the $n-1$ preceding tokens is simply

$$P(w_n | w_1, \ldots, w_{n-1} ) = \frac{C(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})}$$

where $C(w_1, \ldots, w_n)$ denotes the number of times the n-gram sequence $w_1, w_2, \ldots, w_n$ was observed in the training corpus. Note that the above definition corresponds to the **maximum likelihood estimate (MLE)** for the training data n-grams. Under this model, the probability of generating a particular sentence consisting of $k$ words (for $k > n$) is given by

$$ P(w_1, \ldots, w_k) = P(w_1) \times P(w_2 | w_1) \times P(w_3|w_1, w_2) \times \ldots \times P(w_n|w_1, \ldots, w_{n-1}) \times \ldots \times P(w_k|w_{k-n+1}, \ldots, w_{k-1})$$
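
To make the two formulas concrete, here is a small sketch on a made-up toy corpus with $n = 2$. The helper names `mle_counts` and `mle_log_sentence_prob` are hypothetical, not part of the assignment's interface. The sketch makes two simplifying assumptions: contexts are counted only where they start an n-gram, and the first $n-1$ words of the sentence are treated as given rather than scored with the shorter-context terms of the product above. It works in log space so that long products do not underflow.

```python
import math
from collections import Counter

def mle_counts(tokens, n):
    """Count every n-gram and the (n-1)-word context that starts it."""
    ngrams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1
    return ngrams, contexts

def mle_log_sentence_prob(sentence, ngrams, contexts, n):
    """Sum log P(w_i | previous n-1 words), treating the first n-1 words as given."""
    total = 0.0
    for i in range(n - 1, len(sentence)):
        gram = tuple(sentence[i - n + 1:i + 1])
        if ngrams[gram] == 0:
            return float("-inf")  # unseen n-gram: the MLE assigns probability 0
        total += math.log(ngrams[gram] / contexts[gram[:-1]])
    return total

# Made-up toy corpus, already split into tokens.
toy_corpus = "i love ice cream i love comp models i love ice skating".split()
ngrams, contexts = mle_counts(toy_corpus, n=2)

log_p = mle_log_sentence_prob(["i", "love", "ice", "cream"], ngrams, contexts, n=2)
print(math.exp(log_p))  # P(love|i) * P(ice|love) * P(cream|ice) = 1 * 2/3 * 1/2
```

Note that any sentence containing an n-gram absent from the corpus receives probability zero under the plain MLE, which is one reason practical models usually add some form of smoothing.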

### Assignment

For this challenge problem you will be building an n-gram model for use within, e.g., a chatbot. You will interact with your model using the following functions:
1. `log_sentence_prob`: takes a complete sentence and returns its log probability according to your n-gram model. The input to this function might be something like `np.array(["this", "is", "a", "complete", "sentence"])`. 
2. `generate`: generates a random sentence using your model. 

It is up to you to select a corpus of texts that you would like to train your model on. A good starting place might be one of the texts from Project Gutenberg (e.g., [Jane Eyre](https://www.gutenberg.org/cache/epub/1260/pg1260.txt)). 

* While more sophisticated NLP approaches often perform significant preprocessing on a text before constructing a language model, the current assignment does not require that you do this. However, if you are interested, the following [lecture notes](http://pages.cs.wisc.edu/~jerryzhu/cs769/text_preprocessing.pdf) describe a few common techniques that may help to improve your model.

Constructing your model will generally entail:
1. Cleaning up your corpus to remove extraneous formatting, punctuation, and unwanted tokens
2. Counting the frequencies of each unique sequence of $n$ word tokens in your corpus
3. Using these sequence frequencies ($n$-grams) to compute the probability of that sequence within your corpus
4. Using these sequence probabilities to construct a sentence by sequentially generating the most probable word given the previous $n-1$ words in the sentence.
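
The first two steps above might be sketched as follows. This is only one possible approach, shown on a short sample string rather than a full corpus. The helpers `simple_tokenize` and `sliding_ngrams` are hypothetical names, and the choice to lowercase everything and keep only letters and apostrophes is an assumption that you may well want to change for your own corpus.

```python
import re
from collections import Counter

def simple_tokenize(raw):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", raw.lower())

def sliding_ngrams(tokens, n):
    """Return an iterator over each run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Short sample string standing in for a real corpus.
raw = "Reader, I married him.  A quiet wedding we had: he and I."
tokens = simple_tokenize(raw)
trigram_counts = Counter(sliding_ngrams(tokens, 3))

print(tokens[:4])
print(trigram_counts[("i", "married", "him")])
```

Because `zip` stops at the shortest of its inputs, a list of $m$ tokens yields exactly $m - n + 1$ n-grams, with no partial windows at the end.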

In [None]:
# load your corpus
import requests
target_url = 'Replace this with the link to your corpus!'
text = requests.get(target_url).text

In [None]:
def preprocess(text):
    """
    Given a text, returns a list consisting solely of the word 
    tokens contained therein. Feel free to write helper 
    functions for removing formatting and punctuation, converting 
    word case, creating equivalence classes, etc. 
    
    Parameters
    ----------
    text: string
        A string containing the raw text of your corpus
 
    Returns
    -------
    cleaned_text: list of strings
        a list of the words in your corpus presented in the same 
        order as they appear in text
    """
    # YOUR CODE HERE
    raise NotImplementedError()

    
def calc_ngram_probs(cleaned_text, n=3):
    """
    Calculates the log probability of each unique n-word sequence 
    within a text. 
    
    Parameters
    ----------
    cleaned_text: list of strings
        A list of the words in your corpus presented in the same 
        order as they appear in text
    
    n: int
        The gram-size for your model (i.e., the number of words in 
        your n-gram sequences).
    
    Returns
    -------
    ngram_probs: dict
        A dictionary of key,value pairs where the keys correspond 
        to unique n-gram word sequences and the values correspond 
        to the log probability of the sequence occurring within 
        your corpus
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    

def generate(ngram_probs, seed):
    """
    Completes a sentence using the probabilities from your n-gram model.
    
    Parameters
    ----------
    ngram_probs: dict
        A dictionary of key,value pairs where keys correspond 
        to n-gram word sequences and values correspond to the 
        log probability of the sequence occurring within your 
        corpus
    
    seed: list
        A list containing the first n words of a sentence.
    
    Returns
    -------
    sentence: list
        A list of at least n+1 word tokens constituting a sentence. Every
        word from the (n+1)th onward should be generated using the
        probabilities of your n-gram model.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
    
def log_sentence_prob(ngram_probs, sentence):
    """
    Returns the log probability of a sentence using the probabilities
    of your n-gram model. You may assume that the probability of generating 
    the first n-1 words (i.e., the seed words) is 1.
    
    Parameters
    ----------
    ngram_probs: dict
        A dictionary of key,value pairs where keys correspond 
        to n-gram word sequences and values correspond to the 
        log probability of the sequence occurring within your 
        corpus
    
    sentence: list
        A list of word tokens constituting a single sentence
    
    Returns
    -------
    prob: float
        The log probability of generating the words in sentence
        according to your n-gram model.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Test the model for n=3 (tri-grams)
n = 3
seed = ["Today", "I", "went"]

cleaned_text = preprocess(text)
ngram_probs = calc_ngram_probs(cleaned_text, n)
sentence = generate(ngram_probs, seed)
log_prob = log_sentence_prob(ngram_probs, sentence)

print('Sentence: {}'.format(sentence))
print('log P(Sentence): {}'.format(log_prob))

---

Before turning this problem in remember to do the following steps:

1. **Restart the kernel** (Kernel$\rightarrow$Restart)
2. **Run all cells** (Cell$\rightarrow$Run All)
3. **Save** (File$\rightarrow$Save and Checkpoint)

<div class="alert alert-danger">After you have completed these three steps, ensure that the following cell has printed "No errors". If it has <b>not</b> printed "No errors", then your code has a bug in it and has thrown an error! Make sure you fix this error before turning in your problem set.</div>

In [None]:
print("No errors!")