In this IPython notebook, we evaluate a bigram language model on a subset of the Brown corpus from NLTK.

We first train the bigram language model (and, as a baseline, a unigram model) on the original, unshuffled token sequences.

The trained model is then evaluated on:

1. A test corpus in which the individual tokens of each sentence are **shuffled**, e.g. ["I", "have", "an", "apple"] ==> ["an", "have", "apple", "I"], which yields the bigram sequence [("an", "have"), ("have", "apple"), ("apple", "I")].

2. A test corpus with the order of tokens **unchanged**.
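
As a minimal sketch of the first condition, the snippet below shuffles a copy of one sentence and lists the adjacent-token bigrams that result. The helper name `to_bigrams` is hypothetical and is not part of the `language_model` module used in this notebook; the particular shuffled order depends on the random seed.

```python
import random


def to_bigrams(tokens):
    """Return the list of adjacent-token pairs in `tokens` (hypothetical helper)."""
    return list(zip(tokens, tokens[1:]))


sentence = ["I", "have", "an", "apple"]

# order unchanged
bigrams_fwd = to_bigrams(sentence)

# shuffle a copy so that the original sentence is left intact
rng = random.Random(0)
shuffled = list(sentence)
rng.shuffle(shuffled)
bigrams_shf = to_bigrams(shuffled)

print(bigrams_fwd)
print(bigrams_shf)
```

A sentence of n tokens always yields n - 1 bigrams, so shuffling changes which pairs occur but not how many.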

In [1]:
import copy
import itertools

import numpy as np
from nltk.corpus import brown

# only the `Bigram` and `Unigram` classes of this local module are used below
from language_model import Bigram, Unigram

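# read sentences in the "fiction" category of the Brown corpus
corpus = list(brown.sents(categories="fiction"))
len_corpus = len(corpus)

# mean log-probability per iteration, for each model (bigram/unigram)
# and each test condition (fwd: order unchanged, shf: tokens shuffled)
bigram_fwd = []
bigram_shf = []
unigram_fwd = []
unigram_shf = []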

In [2]:
corpus[:2]

[[u'Thirty-three'],
 [u'Scotty', u'did', u'not', u'go', u'back', u'to', u'school', u'.']]

In [3]:
"""
For each of the ten iterations, shuffle the corpus.
Take the first 80% as training corpus and remaining 20% as test.
""" 
for k in range(10):
    np.random.shuffle(corpus)
    print k

    corpus_train = corpus[:int(len_corpus * 0.8)]
    corpus_test = corpus[int(len_corpus * 0.8):]
    
    # find the list of unique tokens in training corpus, which is used 
    # to find the number of out-of-vocabulary words later
    corpus_train_tokens = set(itertools.chain(*corpus_train))
    corpus_test_tokens = list(itertools.chain(*corpus_test))

    # forward sequences (order unchanged)
    
    # bigram
    bigram = Bigram(special_token=True)
    # oov: out-of-vocabulary words
    oov = filter(lambda token: token not in corpus_train_tokens, corpus_test_tokens)
    # `len(set(oov))`: number of oov
    bigram.fit(corpus_train, len(set(oov)))
    # for each sentence in `corpus_test` (a list of str), compute the log-probability score 
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test)
    # save the mean of the `logprob` for the `k`th iteration
    bigram_fwd.append(np.mean(logprob))

    # repeat the above for the unigram model
    unigram = Unigram(special_token=False)
    unigram.fit(corpus_train, len(set(oov)))
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test)
    unigram_fwd.append(np.mean(logprob))

    # shuffle the tokens within each sentence of the test corpus
    corpus_test_shf = copy.deepcopy(corpus_test)
    for x in corpus_test_shf:
        np.random.shuffle(x)

    # repeat the analysis on shuffled tokens
    # on bigram
    bigram = Bigram(special_token=True)
    oov = filter(lambda token: token not in corpus_train_tokens, corpus_test_tokens)
    bigram.fit(corpus_train, len(set(oov)))
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test_shf)
    bigram_shf.append(np.mean(logprob))

    # on unigram
    unigram = Unigram(special_token=False)
    unigram.fit(corpus_train, len(set(oov)))
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test_shf)
    unigram_shf.append(np.mean(logprob))

0
1
2
3
4
5
6
7
8
9


In [4]:
print "Average log-prob of bigrams (order of tokens unchanged):", np.mean(bigram_fwd)

Average log-prob of bigrams (order of tokens unchanged): -5.34970822898


In [5]:
print "Average log-prob of bigrams (order of tokens shuffled):", np.mean(bigram_shf)

Average log-prob of bigrams (order of tokens shuffled): -6.92859091893


In [6]:
print "Average log-prob of unigrams (order of tokens unchanged):", np.mean(unigram_fwd)

Average log-prob of unigrams (order of tokens unchanged): -6.56173921049


In [7]:
print "Average log-prob of unigrams (order of tokens shuffled):", np.mean(unigram_shf)

Average log-prob of unigrams (order of tokens shuffled): -6.56173921049


As we can see, the average log-probability under the bigram model is much lower when the tokens are shuffled (-6.93) than when their order is unchanged (-5.35). This makes sense because the bigram model takes into account the relative order of adjacent tokens in a sentence.
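
The effect can be reproduced on a toy corpus. The sketch below is not the `Bigram` class from `language_model`, whose internals are not shown here; it assumes add-one smoothing and `<s>` / `</s>` sentence-boundary tokens, and the function names `train_bigram` and `bigram_logprob` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_logprob(tokens, unigrams, bigrams):
    """Add-one smoothed log-probability of a sentence (assumed smoothing scheme)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # P(cur | prev) with add-one smoothing over the training vocabulary
        total += math.log((bigrams[(prev, cur)] + 1.0) / (unigrams[prev] + vocab_size))
    return total


train = [["I", "have", "an", "apple"], ["I", "have", "an", "orange"]]
unigrams, bigrams = train_bigram(train)

ordered = ["I", "have", "an", "apple"]
shuffled = ["an", "have", "apple", "I"]

lp_ordered = bigram_logprob(ordered, unigrams, bigrams)
lp_shuffled = bigram_logprob(shuffled, unigrams, bigrams)
print(lp_ordered > lp_shuffled)
```

None of the bigrams in the shuffled sentence was seen in training, so each one receives only the smoothed probability mass, and the total log-probability drops.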

In contrast, the unigram model assigns exactly the same log-probability (-6.56) to the shuffled and the unchanged sequences. It scores each token independently of its neighbors, so the result depends only on which tokens occur and how often, not on their order.
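
A small sketch of this order invariance follows. It is again not the `Unigram` class from `language_model`; it assumes add-one smoothing over the training vocabulary, and `unigram_logprob` is a hypothetical name.

```python
import math
import random
from collections import Counter


def unigram_logprob(tokens, counts, total):
    """Add-one smoothed log-probability: a sum of independent per-token terms."""
    vocab_size = len(counts)
    return sum(math.log((counts[t] + 1.0) / (total + vocab_size)) for t in tokens)


train = [["I", "have", "an", "apple"], ["I", "have", "an", "orange"]]
counts = Counter(token for sent in train for token in sent)
total = sum(counts.values())

ordered = ["I", "have", "an", "apple"]
shuffled = list(ordered)
random.Random(0).shuffle(shuffled)

lp_ordered = unigram_logprob(ordered, counts, total)
lp_shuffled = unigram_logprob(shuffled, counts, total)

# Only the order of the summands changes, so the two scores agree
# up to floating-point rounding.
print(abs(lp_ordered - lp_shuffled) < 1e-12)
```

Because the score is a sum over tokens, any permutation of a sentence produces the same terms in a different order, which is why the two unigram averages above coincide.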