## Bigram with shuffled tokens

In this example, we evaluate a bigram language model on a subset of the Brown corpus from NLTK.

We first train the bigram language model on the original bigram sequences.

The trained bigram model is then evaluated on two versions of the test corpus:

1. Test corpus where the individual tokens of each sentence were **shuffled**, e.g. ["I", "have", "an", "apple"] ==> ["an", "have", "apple", "I"], which yields the bigram sequence [("an", "have"), ("have", "apple"), ("apple", "I")].
    
2. Test corpus with the order of tokens **unchanged**.
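
To make the first case concrete, the sketch below lists the bigrams of a toy sentence before and after shuffling. The helper `to_bigrams` is a hypothetical name introduced here for illustration; it is not part of the `language_model` module used in the cells that follow.

```python
import random

def to_bigrams(tokens):
    """Return the list of adjacent token pairs in `tokens`."""
    return list(zip(tokens[:-1], tokens[1:]))

sentence = ["I", "have", "an", "apple"]
print(to_bigrams(sentence))
# [('I', 'have'), ('have', 'an'), ('an', 'apple')]

# Shuffle a copy so the original sentence stays intact.
rng = random.Random(0)
shuffled = sentence[:]
rng.shuffle(shuffled)

# Same tokens, different order, hence (usually) different bigrams.
print(shuffled)
print(to_bigrams(shuffled))
```

A sentence of n tokens always produces n - 1 bigrams, so shuffling changes which pairs occur but not how many.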

In [1]:
import numpy as np
import sys
import itertools
sys.path.append("/home/jichao/Desktop/reddit/language_model")
from nltk.corpus import brown
from language_model import *
import copy

# read sentences in the "news" category of the Brown corpus
corpus = list(brown.sents(categories="news"))
# keep only sentences with at least two tokens (i.e. at least one bigram)
corpus = filter(lambda doc: len(doc) >= 2, corpus)
len_corpus = len(corpus)

print corpus[:2]

[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.']]


In [2]:
"""
For each of the ten iterations, shuffle the corpus.
Take the first 80% as training corpus and remaining 20% as test.
"""     
    
bigram_mean= []
bigram_mean_shuffled = []
unigram_mean = []
unigram_mean_shuffled = []    
    
for k in range(10):
    np.random.shuffle(corpus)

    corpus_train = corpus[:int(len_corpus * 0.8)]
    corpus_test = corpus[int(len_corpus * 0.8):]
    
    # find the list of unique tokens in training corpus, which is used 
    # to find the number of out-of-vocabulary words later    
    corpus_train_tokens = set(itertools.chain(*corpus_train))
    corpus_test_tokens = set(itertools.chain(*corpus_test))

    # 1. Test corpus where tokens in sentences are unchanged.
    
    # bigram
    bigram = Bigram(special_token=False)
    # oov: out-of-vocabulary words
    oov = filter(lambda token: token not in corpus_train_tokens, corpus_test_tokens)
    # `len(set(oov))`: number of oov
    bigram.fit(corpus_train, len(set(oov)))
    # for each sentence in `corpus_test` (a list of str), compute the log-probability score 
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test)
    # save the mean of the `logprob` for the `k`th iteration
    bigram_mean.append(np.mean(logprob))

    # unigram
    # repeat the above for the unigram model
    unigram = Unigram(special_token=False)
    unigram.fit(corpus_train, len(set(oov)))
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test)
    unigram_mean.append(np.mean(logprob))

    # shuffle tokens in sentences in test corpus
    corpus_test_shuffled = copy.deepcopy(corpus_test)
    for x in corpus_test_shuffled:
        np.random.shuffle(x)

    # repeat the analysis on shuffled tokens
    # on bigram
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test_shuffled)
    bigram_mean_shuffled.append(np.mean(logprob))

    # on unigram
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test_shuffled)
    unigram_mean_shuffled.append(np.mean(logprob))    

In [3]:
print corpus_test[0]

[u'``', u'Leading', u'Nations', u'of', u'the', u'West', u'and', u'of', u'the', u'East', u'keep', u'busy', u'making', u'newer', u'nuclear', u'weapons', u'to', u'defend', u'themselves', u'in', u'the', u'event', u'the', u'constantly', u'threatening', u'nuclear', u'war', u'should', u'break', u'out', u'.']


In [4]:
print corpus_test_shuffled[0]

[u'in', u'the', u'Leading', u'threatening', u'themselves', u'the', u'East', u'of', u'``', u'weapons', u'the', u'newer', u'.', u'event', u'nuclear', u'nuclear', u'to', u'making', u'constantly', u'of', u'defend', u'keep', u'and', u'West', u'war', u'busy', u'out', u'Nations', u'should', u'break', u'the']


In [5]:
print "Average log-prob of bigrams (order of tokens unchanged):", np.mean(bigram_mean)
print "Average log-prob of bigrams (order of tokens shuffled):", np.mean(bigram_mean_shuffled)
print "Average log-prob of unigrams (order of tokens unchanged):", np.mean(unigram_mean)
print "Average log-prob of unigrams (order of tokens shuffled):", np.mean(unigram_mean_shuffled)

Average log-prob of bigrams (order of tokens unchanged): -6.59852267966
Average log-prob of bigrams (order of tokens shuffled): -7.85181773523
Average log-prob of unigrams (order of tokens unchanged): -7.2025810017
Average log-prob of unigrams (order of tokens shuffled): -7.2025810017


As we can see, the bigram model assigns a much lower average log-probability to the test sentences whose tokens were shuffled (about -7.85) than to the sentences whose token order was unchanged (about -6.60). This makes sense because the bigram model takes into account the relative order of adjacent tokens in a sentence.

In contrast, the unigram model assigns exactly the same log-probability to the shuffled and the unchanged sentences, because it scores each token independently of its neighbors: the score depends only on which tokens occur and how often, not on their order.
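
The difference can be reproduced on a toy corpus. The sketch below is an assumption-laden stand-in, not the `Bigram` and `Unigram` classes from `language_model` (whose internals are not shown here): it uses add-one smoothing, and the function names are hypothetical. Reversing the sentence stands in for a random shuffle so that the result is deterministic.

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size):
    """Add-one smoothed log-probability: a sum of independent per-token terms."""
    return sum(math.log((counts[t] + 1.0) / (total + vocab_size)) for t in tokens)

def bigram_logprob(tokens, pair_counts, counts, vocab_size):
    """Add-one smoothed log-probability of each token given its predecessor."""
    return sum(
        math.log((pair_counts[(a, b)] + 1.0) / (counts[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

# Toy training corpus and the counts derived from it.
train = [["I", "have", "an", "apple"], ["I", "have", "a", "pear"]]
counts = Counter(t for sent in train for t in sent)
pair_counts = Counter(p for sent in train for p in zip(sent[:-1], sent[1:]))
total = sum(counts.values())
vocab_size = len(counts)

ordered = ["I", "have", "an", "apple"]
backwards = ordered[::-1]  # deterministic reordering of the same tokens

uni_ordered = unigram_logprob(ordered, counts, total, vocab_size)
uni_backwards = unigram_logprob(backwards, counts, total, vocab_size)
bi_ordered = bigram_logprob(ordered, pair_counts, counts, vocab_size)
bi_backwards = bigram_logprob(backwards, pair_counts, counts, vocab_size)

# The unigram score is a sum over the same terms, so reordering cannot change it
# (a small tolerance allows for floating-point summation order).
print(abs(uni_ordered - uni_backwards) < 1e-12)  # True
# The bigram score drops because the reordered pairs were never seen in training.
print(bi_ordered > bi_backwards)  # True
```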

**Shuffle the tokens within sentences of the training corpus**

Note that in the previous example we shuffled the tokens within each sentence of the test corpus, while keeping the training corpus as is.
Now let's try shuffling the tokens within each sentence of the training corpus while keeping the test corpus as is.

In [6]:
bigram_mean= []
bigram_mean_shuffled = []
unigram_mean = []
unigram_mean_shuffled = []    
    
for k in range(10):
    np.random.shuffle(corpus)

    corpus_train = corpus[:int(len_corpus * 0.8)]
    corpus_test = corpus[int(len_corpus * 0.8):]
    
    # find the list of unique tokens in training corpus, which is used 
    # to find the number of out-of-vocabulary words later    
    corpus_train_tokens = set(itertools.chain(*corpus_train))
    corpus_test_tokens = set(itertools.chain(*corpus_test))

    # 1. Baseline: train on the corpus with the order of tokens unchanged.
    
    # bigram
    bigram = Bigram(special_token=False)
    # oov: out-of-vocabulary words
    oov = filter(lambda token: token not in corpus_train_tokens, corpus_test_tokens)
    # `len(set(oov))`: number of oov
    bigram.fit(corpus_train, len(set(oov)))
    # for each sentence in `corpus_test` (a list of str), compute the log-probability score 
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test)
    # save the mean of the `logprob` for the `k`th iteration
    bigram_mean.append(np.mean(logprob))

    # unigram
    # repeat the above for the unigram model
    unigram = Unigram(special_token=False)
    unigram.fit(corpus_train, len(set(oov)))
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test)
    unigram_mean.append(np.mean(logprob))

    # shuffle tokens in sentences in train corpus
    corpus_train_shuffled = copy.deepcopy(corpus_train)
    for x in corpus_train_shuffled:
        np.random.shuffle(x)

    # repeat the analysis on shuffled tokens
    # on bigram
    bigram = Bigram(special_token=False)
    bigram.fit(corpus_train_shuffled, len(set(oov)))
    logprob = map(lambda tokens: bigram.predict(tokens), corpus_test)
    bigram_mean_shuffled.append(np.mean(logprob))

    # on unigram
    unigram = Unigram(special_token=False)
    unigram.fit(corpus_train_shuffled, len(set(oov)))
    logprob = map(lambda tokens: unigram.predict(tokens), corpus_test)
    unigram_mean_shuffled.append(np.mean(logprob))    

In [7]:
print "Average log-prob of bigrams (order of tokens unchanged):", np.mean(bigram_mean)
print "Average log-prob of bigrams (order of tokens shuffled):", np.mean(bigram_mean_shuffled)
print "Average log-prob of unigrams (order of tokens unchanged):", np.mean(unigram_mean)
print "Average log-prob of unigrams (order of tokens shuffled):", np.mean(unigram_mean_shuffled)

Average log-prob of bigrams (order of tokens unchanged): -6.60034228183
Average log-prob of bigrams (order of tokens shuffled): -7.59580165101
Average log-prob of unigrams (order of tokens unchanged): -7.19853773406
Average log-prob of unigrams (order of tokens shuffled): -7.19853773406


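The result still holds for the bigram model: training on sentences with shuffled tokens yields a lower average log-probability on the test corpus (about -7.60) than training on the unchanged sentences (about -6.60). The unigram score is again identical in both cases, since shuffling tokens within a sentence does not change the token counts.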
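
A small sketch of why this happens, under stated assumptions: the helper names below are hypothetical, the corpus is a toy one, and sentence reversal stands in for a random shuffle so that the outcome is deterministic. Reordering tokens within training sentences leaves the unigram counts untouched, but it replaces the bigrams seen in training, so fewer test bigrams are covered.

```python
from collections import Counter

def count_unigrams(corpus):
    """Count every token in a corpus given as a list of token lists."""
    return Counter(tok for sent in corpus for tok in sent)

def count_bigrams(corpus):
    """Count every adjacent token pair within each sentence."""
    return Counter(pair for sent in corpus for pair in zip(sent[:-1], sent[1:]))

def seen_fraction(test_sent, bigram_counts):
    """Fraction of the test sentence's bigrams that occurred in training."""
    pairs = list(zip(test_sent[:-1], test_sent[1:]))
    return sum(1 for p in pairs if p in bigram_counts) / float(len(pairs))

train = [
    ["the", "jury", "said", "the", "election", "was", "fair"],
    ["the", "election", "was", "held", "on", "Friday"],
]
# Reorder the tokens inside each training sentence.
train_reordered = [sent[::-1] for sent in train]
test_sent = ["the", "jury", "said", "the", "election", "was", "held"]

same_unigrams = count_unigrams(train) == count_unigrams(train_reordered)
frac_original = seen_fraction(test_sent, count_bigrams(train))
frac_reordered = seen_fraction(test_sent, count_bigrams(train_reordered))

print(same_unigrams)   # True: token counts ignore order
print(frac_original)   # 1.0: every test bigram was seen in training
print(frac_reordered)  # 0.0: none of the test bigrams survive the reordering
```

With a real random shuffle on a large corpus the drop is less extreme than in this toy case, since some frequent pairs reappear by chance.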