# N-Gram Language Models
In this exercise, we use n-gram language models to estimate the probability of a text and to generate new text.

In [1]:
import nltk
from nltk.corpus import gutenberg

First, we load Jane Austen's *Emma* from NLTK's Gutenberg corpus, which we also used in a previous exercise. Tokenize and lowercase the text to obtain a list of words.

In [2]:
nltk.download('gutenberg')
raw_text = gutenberg.raw('austen-emma.txt')
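# TODO: tokenize and lowercase the text, save a list of words, print the number of words in this novel
# Note: word_tokenize relies on NLTK's Punkt tokenizer data; if it is missing,
# NLTK raises a LookupError that names the resource to download.
words = nltk.word_tokenize(raw_text)
words = [w.lower() for w in words]
print(len(words))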

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/blackbook/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


191776


Write an n-gram language model class that takes the word list and a parameter `n` as inputs, where `n` is an integer greater than 1 that sets the order of the n-gram LM. The LM should build a dictionary of n-gram counts from the word list.

In [3]:
from collections import defaultdict
class NGramLanguageModel:
    
    def __init__(self, words, n):
        assert n > 1, "n needs to be a positive integer > 1"
        assert n <= len(words), "n can't be larger than the number of words"
        
        # TODO: build a dictionary of n-gram counts
        self.n: int = n
        self.ngram_counts = defaultdict(int)
        
        self.build_ngram_counts(words)
        
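    def build_ngram_counts(self, words) -> None:
        """Slide a window of size n over the words and count each n-gram."""
        for i in range(len(words) - self.n + 1):
            # Tuples are hashable, so they can serve as dictionary keys.
            ngram = tuple(words[i:i+self.n])
            self.ngram_counts[ngram] += 1

    def get_ngram_counts(self):
        """Return the n-gram counts as a plain dict (a copy)."""
        return dict(self.ngram_counts)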
        
        
n = 2
model = NGramLanguageModel(words, n)
print(model.get_ngram_counts())

Now we "train" the n-gram LM by building the n-gram counts of the Emma novel. Use a low `n` (i.e. 2 or 3).

Let's add a method `log_probability` to the n-gram LM class that computes the log-probability of an input string. Since multiplying many probabilities (each <= 1) yields very small numbers that can underflow, we sum the log-probabilities instead.

In [20]:
def log_probability(self, input_string):
    """ Returns the log-probability of the input string."""
    pass

NGramLanguageModel.log_probability = log_probability
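
One possible way to fill in the stub is sketched below as a standalone function, so that it runs without the class or NLTK. Several things here are assumptions and not part of the exercise: the helper names `build_counts` and `sketch_log_probability`, whitespace tokenization via `str.split()`, and add-one (Laplace) smoothing so that unseen n-grams do not lead to `log(0)`.

```python
import math
from collections import Counter


def build_counts(words, n):
    """Count n-grams and their (n-1)-word contexts in a token list."""
    ngram_counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    context_counts = Counter()
    for ngram, count in ngram_counts.items():
        context_counts[ngram[:-1]] += count
    return ngram_counts, context_counts


def sketch_log_probability(tokens, ngram_counts, context_counts, vocab_size, n):
    """Sum log P(word | previous n-1 words), using add-one smoothing (an assumption)."""
    total = 0.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        numerator = ngram_counts.get(ngram, 0) + 1
        denominator = context_counts.get(ngram[:-1], 0) + vocab_size
        total += math.log(numerator / denominator)
    return total


# Toy corpus instead of the novel, to keep the example self-contained.
toy_words = "the cat sat on the mat".split()
ngram_counts, context_counts = build_counts(toy_words, n=2)
vocab_size = len(set(toy_words))

seen = sketch_log_probability("the cat sat".split(), ngram_counts, context_counts, vocab_size, n=2)
unseen = sketch_log_probability("mat the sat".split(), ngram_counts, context_counts, vocab_size, n=2)
print(seen, unseen)  # the word order seen in training scores higher
```

Inside the class, the same loop could read the counts from `self.ngram_counts` and tokenize `input_string` the same way the training text was tokenized. Other smoothing schemes would work just as well; add-one is only the simplest to write down.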

Shorter texts tend to have a higher log-probability than longer texts because fewer negative terms are summed, so we normalize the log-probability by the number of words in the input string.

-3.6389612731102003
-9.41029012916906
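
The effect of the normalization can be illustrated with a small worked example. The numbers are a toy assumption (every word gets probability 0.1), and `per_word_log_probability` is a hypothetical helper name, not part of the exercise.

```python
import math


def per_word_log_probability(total_log_prob, num_words):
    """Average log-probability per word (hypothetical helper)."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return total_log_prob / num_words


# Toy assumption: every word has probability 0.1, regardless of text length.
short_total = 5 * math.log(0.1)    # 5-word text
long_total = 50 * math.log(0.1)    # 50-word text
print(short_total, long_total)     # the longer text has a much lower total

short_avg = per_word_log_probability(short_total, 5)
long_avg = per_word_log_probability(long_total, 50)
print(short_avg, long_avg)         # per word, both texts score the same
```

After dividing by the length, texts of different sizes become comparable, which is what the comparison between novels further down relies on.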


Let's score two other texts under our trained model: Jane Austen's *Sense and Sensibility* (`austen-sense.txt`) and Shakespeare's *Hamlet* (`shakespeare-hamlet.txt`).
- What do you expect will happen?
- What do you observe?

-11.813733285332715
-9.79124704762797
-15.993466330627491


How many n-grams are known in each input?
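
One way to approach this question is to count how many of the input's n-grams also occur in the training counts. The sketch below uses a toy corpus and a hypothetical helper `known_ngram_fraction`; whitespace tokenization is again an assumption.

```python
from collections import Counter


def known_ngram_fraction(tokens, ngram_counts, n):
    """Return (known, total, fraction) of n-grams in `tokens` found in `ngram_counts`."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0, 0, 0.0
    # Membership tests do not add keys, even if ngram_counts is a defaultdict.
    known = sum(1 for ngram in ngrams if ngram in ngram_counts)
    return known, len(ngrams), known / len(ngrams)


train = "the cat sat on the mat".split()
counts = Counter(tuple(train[i:i + 2]) for i in range(len(train) - 1))

known, total, fraction = known_ngram_fraction("the cat sat on a mat".split(), counts, n=2)
print(known, total, fraction)  # 3 5 0.6
```

Using `in` instead of indexing matters if the counts live in a `defaultdict(int)`, as in the class above: indexing with an unseen n-gram would silently insert it with a count of 0.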

Let's add a method `generate` that takes the start of a sentence ("prompt") and a number of words to generate, then continues our prompt.

In [None]:
def generate(self, prompt, num_words=10):
    """ Continues a text starting with `prompt` for the `num_words` next words. """
    pass

NGramLanguageModel.generate = generate
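
A possible shape for `generate` is sketched below as standalone functions on a toy corpus. The names `build_continuations` and `sketch_generate`, the optional `rng` argument, and the decision to stop early on an unseen context are all assumptions made for illustration. Next words are sampled in proportion to their counts; always picking the most frequent word would be a simpler alternative.

```python
import random
from collections import Counter, defaultdict


def build_continuations(words, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    continuations = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        continuations[context][words[i + n - 1]] += 1
    return continuations


def sketch_generate(prompt_tokens, continuations, n, num_words=10, rng=None):
    """Extend `prompt_tokens` by sampling next words in proportion to their counts."""
    rng = rng or random.Random()
    generated = list(prompt_tokens)
    for _ in range(num_words):
        context = tuple(generated[-(n - 1):])
        options = continuations.get(context)
        if not options:
            break  # unseen context: stop early (one simple choice among several)
        candidates = list(options.keys())
        weights = list(options.values())
        generated.append(rng.choices(candidates, weights=weights, k=1)[0])
    return generated


toy_words = "the cat sat on the mat and the cat slept".split()
continuations = build_continuations(toy_words, n=2)
result = sketch_generate(["the"], continuations, n=2, num_words=5, rng=random.Random(0))
print(" ".join(result))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing prompts. In the class, the prompt string would first be tokenized and lowercased the same way as the training text.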

Play around with a few different prompts.