# Playing with bigram language models

This notebook uses NLTK's `LanguageModel` class and its subclass `MLE`, a maximum likelihood language model that uses relative corpus frequencies as word probabilities.

This notebook is not intended as a demo of how to implement language models from scratch but as a demo of language model output.

## The "Sam corpus"

Our first example uses a few lines from the poem "Green Eggs and Ham" as a "corpus" on which we train a language model.

In [1]:
# we import the natural language toolkit
import nltk
# and its maximum likelihood estimation (MLE) language model
from nltk.lm import MLE

# lm_sam is an object of type MLE, in particular
# a bigram language model, because we pass the order 2
lm_sam = MLE(2)

# Here is our corpus, first as plain text
text = "I am Sam. Sam I am. I do not like green eggs and ham."
# ... and here it is cut up into sentences
text_sents = nltk.sent_tokenize(text)
# ... and cut up into words
text_words = nltk.word_tokenize(text)
# ... and cut up into sentences, each of which
# is cut up into words
text_sent_words = [ nltk.word_tokenize(s) for s in text_sents]

print("Corpus as a list of sentences:", text_sents)

# The language model needs to know the vocabulary.
# We turn the list of words into a set, which eliminates
# duplicates, then turn that back into a list
vocab = list(set(text_words))

print("Vocabulary:", vocab)

# The language model needs to be trained
# with both bigrams and unigrams. 
# We start with the bigrams,
# turning each sentence of the text
# into a list of bigrams
text_bigrams = [ list(nltk.bigrams(s)) for s in text_sent_words]
print("Corpus as a list of sentences, each of which")
print("is a list of bigrams:", text_bigrams)

# The method fit() trains the language model on the text.
# The first time we call fit(), we need to also pass
# it the vocabulary, or it will raise an error.
lm_sam.fit(text_bigrams, vocabulary_text = vocab)

# Now we train with unigrams.
# We turn each sentence of the text into a list of unigrams
text_unigrams = [[(w,) for w in s] for s in text_sent_words]
print("Corpus as a list of sentences, each of which")
print("is a list of unigrams:", text_unigrams)
# Then we use fit() again. Since we have used it before, 
# we do not need to pass it the vocabulary again.
lm_sam.fit(text_unigrams)

Corpus as a list of sentences: ['I am Sam.', 'Sam I am.', 'I do not like green eggs and ham.']
Vocabulary: ['I', 'do', 'and', 'Sam', 'not', 'eggs', 'green', '.', 'am', 'ham', 'like']
Corpus as a list of sentences, each of which
is a list of bigrams: [[('I', 'am'), ('am', 'Sam'), ('Sam', '.')], [('Sam', 'I'), ('I', 'am'), ('am', '.')], [('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '.')]]
Corpus as a list of sentences, each of which
is a list of unigrams: [[('I',), ('am',), ('Sam',), ('.',)], [('Sam',), ('I',), ('am',), ('.',)], [('I',), ('do',), ('not',), ('like',), ('green',), ('eggs',), ('and',), ('ham',), ('.',)]]
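
To make "relative corpus frequencies as word probabilities" concrete, here is a minimal sketch that recomputes a few bigram probabilities by hand. It is plain Python rather than NLTK, and the names `follower_counts` and `bigram_prob` are our own, not part of the library. NLTK's own estimate should be available through `lm_sam.score("am", ["I"])`, which we would expect to match.

```python
from collections import Counter, defaultdict

# Tokenized sentences, copied from the output above.
sentences = [
    ["I", "am", "Sam", "."],
    ["Sam", "I", "am", "."],
    ["I", "do", "not", "like", "green", "eggs", "and", "ham", "."],
]

# follower_counts[prev][w] = how often w directly follows prev
# within a sentence (hypothetical helper structure, not NLTK's).
follower_counts = defaultdict(Counter)
for sent in sentences:
    for prev, w in zip(sent, sent[1:]):
        follower_counts[prev][w] += 1


def bigram_prob(w, prev):
    """Relative-frequency estimate of P(w | prev); 0.0 if prev has no followers."""
    total = sum(follower_counts[prev].values())
    return follower_counts[prev][w] / total if total else 0.0


print(bigram_prob("am", "I"))    # 2 of the 3 bigrams that start with "I"
print(bigram_prob("Sam", "am"))  # 1 of the 2 bigrams that start with "am"
print(bigram_prob("I", "."))     # "." never starts a bigram, so 0.0
```

The last line already hints at a problem: because bigrams are built per sentence, the model has no evidence about what follows a ".".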


In [2]:
# Now we use the language model to generate text.
# The first word is "I". We print it.
currentword = "I"
print(currentword, end = " ")
# The language model knows, for each word w,
# the conditional probability P(w|currentword).
# We use that to generate text:
# lm_sam.generate() picks a word to follow currentword.
# It does not always pick the most likely one,
# but a more likely following word has a higher chance of being picked:
for i in range(20):
    currentword = lm_sam.generate(text_seed= [currentword])
    print(currentword,end = " ")

I am Sam I do not like green eggs and ham . green eggs and ham . I am . am 

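As you can see, the language model does something weird at sentence boundaries: after a ".", it picks words such as "green" or "am" that have never been observed at the start of a sentence. The reason is that the bigrams were built separately for each sentence, so "." is never followed by anything in the training data, and the model has no bigram evidence for what should come next. In that situation NLTK's `generate()` appears to fall back to a shorter context, which here means drawing from all words in the corpus.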
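
The sampling behavior can be imitated in a few lines of plain Python. This is a sketch under assumptions, not NLTK's implementation: `sample_next` is a hypothetical helper, and the fallback to overall word counts when a word has no followers is our assumption about what happens after ".".

```python
import random
from collections import Counter, defaultdict

sentences = [
    ["I", "am", "Sam", "."],
    ["Sam", "I", "am", "."],
    ["I", "do", "not", "like", "green", "eggs", "and", "ham", "."],
]

follower_counts = defaultdict(Counter)  # bigram counts, built per sentence
unigram_counts = Counter()              # overall word counts
for sent in sentences:
    unigram_counts.update(sent)
    for prev, w in zip(sent, sent[1:]):
        follower_counts[prev][w] += 1


def sample_next(prev, rng):
    """Draw a follower of prev with probability proportional to its count.

    ASSUMPTION: if prev was never followed by anything, fall back to
    the overall word counts and ignore the context.
    """
    counts = follower_counts.get(prev) or unigram_counts
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that runs are repeatable
draws = Counter(sample_next("I", rng) for _ in range(3000))
print(draws)                  # "am" should come up about twice as often as "do"
print(sample_next(".", rng))  # after ".", any corpus word may be drawn
```

After "I" only "am" and "do" can be drawn, but after "." every word in the corpus is a candidate, which is exactly the odd behavior seen in the output above.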

If we want the language model to behave exactly as in the book, we can do that by giving it the whole text as a single "sentence", but using sentence boundary markers ``<s>`` and ``</s>``. 

In [3]:
# Here is the corpus again.
text = "<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>"
# This time we don't make sentences
# and we use split() to split, so it does not 
# cut up our sentence start and end markers.
text_words = text.split()
print("Corpus as a list of words:", text_words)

# we make another bigram language model
lm_sam2 = MLE(2)

# vocabulary: words in the text, without repetitions
vocab = list(set(text_words))
print("Vocabulary:", vocab)

# training input: a list of sentences,
# where each sentence is a list of bigrams.
# Here the list holds a single "sentence": the whole text.
text_bigrams = [ list(nltk.bigrams(text_words)) ]
print("Corpus as bigrams:", text_bigrams)

lm_sam2.fit(text_bigrams, vocabulary_text = vocab)

# now let's do unigrams
text_unigrams = [[(w,) for w in text_words] ]
print("Corpus as unigrams:", text_unigrams)
lm_sam2.fit(text_unigrams)

Corpus as a list of words: ['<s>', 'I', 'am', 'Sam', '</s>', '<s>', 'Sam', 'I', 'am', '</s>', '<s>', 'I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '</s>']
Vocabulary: ['I', 'do', 'and', '<s>', 'not', 'eggs', 'green', 'am', 'Sam', 'ham', 'like', '</s>']
Corpus as bigrams: [[('<s>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</s>'), ('</s>', '<s>'), ('<s>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</s>'), ('</s>', '<s>'), ('<s>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</s>')]]
Corpus as unigrams: [[('<s>',), ('I',), ('am',), ('Sam',), ('</s>',), ('<s>',), ('Sam',), ('I',), ('am',), ('</s>',), ('<s>',), ('I',), ('do',), ('not',), ('like',), ('green',), ('eggs',), ('and',), ('ham',), ('</s>',)]]
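
With the boundary markers in place, the words that can follow `<s>` are exactly the words that started a sentence in the corpus. The following sketch checks this by counting followers directly. It is plain Python, and `follower_probs` is a hypothetical helper of ours, not an NLTK function.

```python
from collections import Counter, defaultdict

# The marked-up corpus from the cell above, as one long token list.
text = "<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>"
tokens = text.split()

# Count followers across the whole token stream, so that
# "</s>" is followed by the next "<s>".
follower_counts = defaultdict(Counter)
for prev, w in zip(tokens, tokens[1:]):
    follower_counts[prev][w] += 1


def follower_probs(prev):
    """Relative frequencies of the words observed directly after prev."""
    counts = follower_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


print(follower_probs("<s>"))   # only "I" and "Sam" can start a sentence
print(follower_probs("</s>"))  # "</s>" is always followed by "<s>"
```

Since every word now has at least one observed follower, except the very last `</s>` of the text, generation no longer needs to guess at sentence boundaries.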


In [4]:
# We generate once more, this time starting with
# the sentence start marker
word = "<s>"
print(word, end = " ")
for i in range(30):
    word = lm_sam2.generate(text_seed= [word])
    print(word,end = " ")

<s> I am </s> <s> I am </s> <s> I am Sam I do not like green eggs and ham </s> <s> I am Sam </s> <s> Sam </s> <s> I 

## Inaugural addresses

We now use a larger corpus: the corpus of inaugural addresses available through NLTK.

In [5]:
# Let's try training a language model on inaugural speeches.
# If the corpus is not installed yet, run nltk.download('inaugural') once.
from nltk.corpus import inaugural

# Getting the inaugural addresses corpus.
# To run this on a different corpus,
# just switch out the definitions of corpus_sents
# and corpus_words; everything else stays the same.

# Our corpus, cut up into sentences
corpus_sents = inaugural.sents()
# and cut up into words
corpus_words = inaugural.words()

# Each sentence turned into a list of bigrams
bigrams = [list(nltk.bigrams(s)) for s in corpus_sents]

# making a language model, bigram again
lm_inaug = MLE(2)

# vocabulary: a list of all words 
# in the corpus, without duplicates
vocab = list(set(corpus_words))

# training with the bigrams
lm_inaug.fit(bigrams, vocabulary_text = vocab)

# and training with the unigrams
unigrams = [[(w,) for w in s] for s in corpus_sents]
lm_inaug.fit(unigrams)
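
Since only the corpus changes between the examples, the data preparation can be collected in one place. The following is a minimal sketch with a hypothetical helper, `ngram_training_data`, which is not part of NLTK. It is written in plain Python so it runs without any corpus installed, and `zip(s, s[1:])` produces the same pairs of neighboring words as `nltk.bigrams(s)`.

```python
def ngram_training_data(sents):
    """Turn tokenized sentences into (bigrams, unigrams, vocab).

    bigrams and unigrams use the layout from the cells above:
    one list per sentence, each holding n-gram tuples.
    """
    sents = [list(s) for s in sents]  # accept any iterable of sentences
    bigrams = [list(zip(s, s[1:])) for s in sents]
    unigrams = [[(w,) for w in s] for s in sents]
    vocab = sorted({w for s in sents for w in s})
    return bigrams, unigrams, vocab


# Tiny stand-in corpus; any list of tokenized sentences would do.
toy_sents = [["I", "am", "Sam", "."], ["Sam", "I", "am", "."]]
bigrams, unigrams, vocab = ngram_training_data(toy_sents)
print(bigrams[0])
print(unigrams[0])
print(vocab)
```

The three results could then be passed to `fit()` in the same order as in the cell above: first the bigrams together with the vocabulary, then the unigrams.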

In [6]:
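# Now we generate from the inaugural language model.
# Called without a seed, generate() has no context to condition on,
# so the first word is drawn from the corpus as a whole.
word = lm_inaug.generate()
print(word, end = " ")
for i in range(50):
    word = lm_inaug.generate(text_seed= [word])
    print(word,end = " ")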

we can not attain their place where it or to the policy of the same glorious beauty ; and usefulness ends of those who are worse in each generation -- until it . Jefferson , and no new States have few leading strings , for their exposition of nations which there 