# N-Gram Language Models
In this exercise, we use n-gram language models to estimate the probability of a text and to generate new text.

In [144]:
import nltk
from nltk.corpus import gutenberg

First, we load Jane Austen's *Emma* from NLTK's Gutenberg corpus, which we also used in a previous exercise. Tokenize and lowercase the text so that we have a list of words.

In [145]:
raw_text = gutenberg.raw('austen-emma.txt')
words = [w.lower() for w in nltk.word_tokenize(raw_text)]
len(words)

191855

Write an n-gram language model class that takes the word list and a parameter `n` as inputs, where `n` is an integer greater than 1 that sets the order of the n-gram LM. The LM should build a dictionary of n-gram counts from the word list, along with counts of the (n-1)-word contexts needed later to compute conditional probabilities.

In [146]:
def preprocess(string: str) -> list[str]:
    return [w.lower() for w in nltk.word_tokenize(string)]

In [147]:
from collections import defaultdict

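class NGramLanguageModel:
    """ N-gram LM backed by counts of n-grams and their (n-1)-word contexts. """

    def __init__(self, words, n):
        assert n > 1, "n needs to be a positive integer > 1"
        assert n <= len(words), "n can't be larger than the number of words"
        # Maps n-gram tuples and (n-1)-gram tuples to their frequencies.
        self.counts: dict[tuple[str, ...], int] = defaultdict(int)
        self.n = n
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i : i+n])
            self.counts[ngram] += 1
            # Count the context (the first n-1 words) as well.
            self.counts[ngram[:-1]] += 1
        # The loop misses the final (n-1)-gram of the text, so count it here.
        self.counts[ngram[1:]] += 1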
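
To see what the constructor stores, here is a minimal standalone sketch of the same counting scheme on a toy word list. The helper name `count_ngrams` and the example sentence are illustrative assumptions, not part of the exercise.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count all n-grams and all (n-1)-grams of `words` (hypothetical helper)."""
    counts = Counter()
    for size in (n, n - 1):
        for i in range(len(words) - size + 1):
            counts[tuple(words[i : i + size])] += 1
    return counts

toy_counts = count_ngrams("the cat sat on the mat".split(), 2)
print(toy_counts[("the", "cat")])  # 1: the bigram occurs once
print(toy_counts[("the",)])        # 2: "the" occurs twice as a context
```

Keeping both tuple lengths in one dictionary works because an n-gram key and an (n-1)-gram key can never collide: they differ in length.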

Now we "train" the n-gram LM by building the n-gram counts of the Emma novel. Use a low `n` (i.e. 2 or 3).

In [148]:
lm = NGramLanguageModel(words, 2)

Let's add a method `log_probability` to the n-gram LM class that scores an input string. Multiplying many probabilities (each <= 1) yields very small numbers that can underflow, so we sum log-probabilities instead of multiplying probabilities.
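
For reference, the standard n-gram approximation with count-based (maximum-likelihood) estimates can be written as follows. The symbols are notation assumed here for illustration: $c(\cdot)$ is a count in the training text and $m$ is the number of words in the input.

$$
\log P(w_1, \dots, w_m) \approx \sum_{i=n}^{m} \log \frac{c(w_{i-n+1}, \dots, w_i)}{c(w_{i-n+1}, \dots, w_{i-1})}
$$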

In [161]:
import math

def log_probability(self, input_string) -> float:
    """ Returns the log-probability of the input string, normalized by its number of words."""
    input_words = preprocess(input_string)
    if not input_words:
        return 0.0  # avoid dividing by zero for empty input
    log_prob = 0.0
    for i in range(len(input_words) - self.n + 1):
        ngram = tuple(input_words[i : i + self.n])
        context = ngram[:-1]
        # N-grams unseen in training are skipped: they add 0 to the sum,
        # as if their conditional probability were 1.
        if ngram in self.counts:
            log_prob += math.log(self.counts[ngram] / self.counts[context])
    return log_prob / len(input_words)

NGramLanguageModel.log_probability = log_probability
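
The underflow concern that motivates the log-space sum can be reproduced in a few lines. This is a small sketch with made-up probabilities (1000 factors of 0.01), not values taken from the trained model.

```python
import math

# Hypothetical per-n-gram probabilities: 1000 factors of 0.01 each.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -4605.17
```

The product is no longer distinguishable from zero, while the sum of logs remains an ordinary float that can still be compared across texts.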

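Every n-gram contributes a non-positive term to the sum, so shorter texts tend to have a higher (less negative) log-probability than longer ones. To make scores comparable across lengths, the method above divides the sum by the number of words in the input string.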

In [162]:
print(lm.log_probability("What is the meaning of life?"))
print(lm.log_probability("What is the meaning of life, given I am a student?"))

-3.5103721425666126
-2.1851329382217575


Let's compute the log-probabilities of two texts under our trained model: Jane Austen's *Sense and Sensibility* (`austen-sense.txt`) and Shakespeare's *Hamlet* (`shakespeare-hamlet.txt`).
- What do you expect will happen?
- What do you observe?

In [163]:
austen_sense = gutenberg.raw('austen-sense.txt')
shakespeare_hamlet = gutenberg.raw('shakespeare-hamlet.txt')

print("Austen Sense: ", lm.log_probability(austen_sense))
print("Shakespeare Hamlet: ", lm.log_probability(shakespeare_hamlet))

Austen Sense:  -2.5932445929509567
Shakespeare Hamlet:  -1.5432730138297792


How many of the n-grams in each input are known to the model, i.e. occur in the training text?

In [164]:
def count_known_n_grams(self, input_string) -> int:
    """ Returns how many n-grams of the input string occur in the training text."""
    input_words = preprocess(input_string)
    count = 0
    for i in range(len(input_words) - self.n + 1):
        ngram = tuple(input_words[i : i + self.n])
        if ngram in self.counts:
            count += 1
    return count

NGramLanguageModel.count_known_n_grams = count_known_n_grams

In [165]:
print("Austen Sense: ", lm.count_known_n_grams(austen_sense))
print("Shakespeare Hamlet: ", lm.count_known_n_grams(shakespeare_hamlet))

Austen Sense:  97236
Shakespeare Hamlet:  13568
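
Absolute counts are hard to compare when the inputs differ in length, so it can help to look at the fraction of known n-grams as well. The following is a standalone sketch on toy data; `known_ngram_ratio` and the example sentences are hypothetical and not part of the class above.

```python
def known_ngram_ratio(known_ngrams, words, n):
    """Fraction of n-grams in `words` that appear in `known_ngrams` (hypothetical helper)."""
    total = len(words) - n + 1
    if total <= 0:
        return 0.0  # input too short to contain any n-gram
    known = sum(tuple(words[i : i + n]) in known_ngrams for i in range(total))
    return known / total

train = "the cat sat on the mat".split()
known = {tuple(train[i : i + 2]) for i in range(len(train) - 1)}
ratio = known_ngram_ratio(known, "the cat sat on the sofa".split(), 2)
print(ratio)  # 0.8: four of the five bigrams were seen in training
```

With the model above, the same idea would amount to dividing `count_known_n_grams` by the number of n-grams in the input.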


Let's add a method `generate` that takes the start of a sentence ("prompt") and a number of words to generate, then continues our prompt.

In [210]:
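def generate(self, prompt, num_words=10):
    """ Continues a text starting with `prompt` for the `num_words` next words. """
    prompt_words = preprocess(prompt)
    # The context is the last n-1 words of the prompt.
    context = tuple(prompt_words[-self.n + 1:])
    result = []
    for _ in range(num_words):
        # All known n-grams that extend the current context by one word.
        # (The full scan of `self.counts` is slow but keeps the class simple.)
        candidates = []
        if context in self.counts:
            candidates = [(k, v) for k, v in self.counts.items() if k[:-1] == context]
        if candidates:
            # Greedy choice: last word of the most frequent matching n-gram.
            next_word = max(candidates, key=lambda kv: kv[1])[0][-1]
        else:
            next_word = "?"  # unknown context or no known continuation
        result.append(next_word)
        # Slide the context window forward by one word.
        context = tuple(list(context[1:]) + [next_word])
    return " ".join(prompt_words + result)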

NGramLanguageModel.generate = generate

Play around with a few different prompts.

In [212]:
lm.generate("Emma is a young an talented woman")

'emma is a young an talented woman , and the same time , and the same time'
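
Because `generate` always picks the most frequent continuation, the output above falls into the loop ", and the same time". One common alternative is to sample the next word in proportion to its count. The sketch below is an assumption-laden illustration on toy data: `build_continuations`, `sample_text`, and the toy sentence are hypothetical names and inputs, and the index it builds is a different data structure from the `counts` dictionary used in the class.

```python
import random
from collections import Counter, defaultdict

def build_continuations(words, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    continuations = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i : i + n - 1])
        continuations[context][words[i + n - 1]] += 1
    return continuations

def sample_text(continuations, prompt_words, n, num_words=10, seed=None):
    """Continue `prompt_words` by sampling next words in proportion to their counts."""
    rng = random.Random(seed)
    result = list(prompt_words)
    for _ in range(num_words):
        context = tuple(result[len(result) - (n - 1):])
        options = continuations.get(context)
        if not options:
            break  # unseen context: stop instead of emitting a placeholder
        next_words, weights = zip(*options.items())
        result.append(rng.choices(next_words, weights=weights)[0])
    return result

toy_words = "the cat sat on the mat and the dog sat on the cat".split()
toy_index = build_continuations(toy_words, 2)
sampled = sample_text(toy_index, ["the"], 2, num_words=8, seed=0)
print(" ".join(sampled))
```

Indexing continuations by context also avoids scanning every stored n-gram for each generated word. Passing a `seed` makes a run repeatable, while leaving it as `None` gives a different continuation each time.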