# Language Models

Let's start with a simple trigram model with additive (Laplace-style) smoothing:

In [2]:
from collections import defaultdict
import numpy as np
import nltk

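smoothing = 0.001
# Every count starts at `smoothing` instead of 0 the first time it is accessed
counts = defaultdict(lambda: defaultdict(lambda: smoothing))

# Each line of the file is treated as one whitespace-tokenized sentence
with open('../data/moby_dick.txt') as f:
    corpus = [line.strip().split() for line in f]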

for sentence in corpus:
    tokens = ['*', '*'] + sentence + ['STOP']
    for u, v, w in nltk.ngrams(tokens, 3):
        counts[(u, v)][w] += 1

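def logP(u, v, w):
    # log P(w | u, v); looking up an unseen trigram inserts it into `counts` with value `smoothing`
    context = counts[(u, v)]
    return np.log(context[w]) - np.log(sum(context.values()))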

def sentence_logP(S):
    tokens = ['*', '*'] + S + ['STOP']
    return sum([logP(u, v, w) for u, v, w in nltk.ngrams(tokens, 3)])

We can now score arbitrary sentences:

In [80]:
sentence_logP('Captain Ahab is a white whale .'.split())

-29.31730693419735
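
One detail of `logP` is worth noting: its denominator sums only over the words stored for that context, so the scores are not normalized over a fixed vocabulary, and a context that was never seen in training ends up with a single entry and a log-probability of 0. The following is a minimal sketch of the textbook add-$\alpha$ variant, which adds $\alpha$ for every word in the vocabulary. The toy corpus and the names `toy_corpus`, `alpha`, `trigram_counts`, `vocab` and `additive_logP` are assumptions made for illustration and are not part of the notebook above.

```python
import math
from collections import Counter, defaultdict

# Toy corpus and all names below are illustrative assumptions.
toy_corpus = [["the", "whale", "swims"], ["the", "whale", "dives"]]
alpha = 0.001  # same value as `smoothing` above

trigram_counts = defaultdict(Counter)
vocab = {"STOP"}  # words that can be predicted; '*' is only ever context
for sentence in toy_corpus:
    tokens = ["*", "*"] + sentence + ["STOP"]
    vocab.update(sentence)
    # zip over three shifted views yields the same windows as nltk.ngrams(tokens, 3)
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(u, v)][w] += 1

def additive_logP(u, v, w):
    """log P(w | u, v) with add-alpha smoothing over the fixed vocabulary."""
    # .get() and Counter lookups do not insert keys, so scoring has no side effects
    context = trigram_counts.get((u, v), Counter())
    numerator = context[w] + alpha
    denominator = sum(context.values()) + alpha * len(vocab)
    return math.log(numerator / denominator)

# The probabilities over the vocabulary sum to 1 for seen and unseen contexts alike.
for ctx in [("the", "whale"), ("never", "seen")]:
    total = sum(math.exp(additive_logP(*ctx, w)) for w in vocab)
    print(ctx, round(total, 6))
```

With this variant an unseen context falls back to a uniform distribution over the vocabulary instead of assigning probability 1 to whichever word is queried first.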

## Generation

We can reuse the counts to generate text by repeatedly sampling the next word given the previous two:

In [29]:
def sample_next_word(u, v):
    keys, values = zip(*counts[(u, v)].items())
    values = np.array(values)
    values /= values.sum()
    return keys[np.argmax(np.random.multinomial(1, values))]

def generate():
    result = ['*', '*']
    next_word = None
    while next_word != 'STOP':
        next_word = sample_next_word(result[-2], result[-1])
        result.append(next_word)

    # Drop the two start symbols and the final 'STOP'
    return ' '.join(result[2:-1])

We can now generate nonsensical sentences:

In [81]:
generate()

'For when three days he was so afraid of black eyes that he is never hunted .'
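
The sampling step in `sample_next_word` draws a word with probability proportional to its smoothed count. The sketch below shows the same idea using only the standard library, as one possible alternative to the one-hot multinomial draw. The counts in `toy_context` and the helper name `sample_word` are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter

# Hypothetical smoothed counts for a single context; words and values are illustrative.
toy_context = {"whale": 3.001, "ship": 1.001, "STOP": 0.001}

def sample_word(context_counts, rng=random):
    """Draw one key with probability proportional to its (unnormalized) count."""
    words = list(context_counts)
    weights = list(context_counts.values())
    # random.choices accepts relative weights, so no explicit normalization is needed
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = Counter(sample_word(toy_context, rng) for _ in range(10_000))
print(draws.most_common())
```

Words with larger counts dominate the draws, while entries that only carry the smoothing mass, such as `'STOP'` here, are sampled very rarely.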

## Exercise

Extend the code above to arbitrary $n$-gram sizes. Use another corpus to try it with $n=4$.

It might be helpful to use a `class` for the LM, make the smoothing a parameter, `counts` a class property, and add a function `fit()`.

In [None]:
# Your code here