Now let's try a more interesting Markov model, one that simulates the linguistic style of authors with recognizable voices.  To begin with, we'll train a Markov model on the complete works of William Shakespeare.  Much as in the weather example, training consists of finding the transition probabilities from one word to the next.  Specifically, we'll compute prior probabilities representing the relative frequency of each word in the body of work, and we'll also compute a gigantic set of transition probabilities from each word to every word that follows it.  First, we'll read in the Shakespeare text:

In [1]:
from __future__ import division,print_function

import numpy as np
import string
from collections import Counter
import re
import json
import unicodedata

sequence_shakespeare = []
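with open('t8.shakespeare.txt','r') as file:
    for line in file:
        # Keep only lines that start with two spaces.
        if line[:2] == '  ':
            # Split into words (apostrophes kept) and punctuation tokens.
            line_words = re.findall(r"[\w']+|[.,!?;]",line)
            # Drop all-caps tokens and numbers, then lowercase the rest.
            # Note that this also drops one-letter capitals such as 'I'.
            line_words = [str(w).lower() for w in line_words if not w.isupper() and not w.isdigit()]

            sequence_shakespeare.extend(line_words)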
        
print (sequence_shakespeare[:100])

['from', 'fairest', 'creatures', 'we', 'desire', 'increase', ',', 'that', 'thereby', "beauty's", 'rose', 'might', 'never', 'die', ',', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', 'his', 'tender', 'heir', 'might', 'bear', 'his', 'memory', 'but', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', ',', "feed'st", 'thy', "light's", 'flame', 'with', 'self', 'substantial', 'fuel', ',', 'making', 'a', 'famine', 'where', 'abundance', 'lies', ',', 'thy', 'self', 'thy', 'foe', ',', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', 'thou', 'that', 'art', 'now', 'the', "world's", 'fresh', 'ornament', ',', 'and', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', ',', 'within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', ',', 'and', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding', 'pity']


He goes on and on.  Now we'll import a model that I wrote for this purpose.

In [3]:
from markov_models import FirstOrderMarkovModel

mm_shakespeare = FirstOrderMarkovModel(sequence_shakespeare)
mm_shakespeare.build_transition_matrices()

The last line has built the transition matrices, which are not actually matrices in this implementation but dictionaries that store entries only for transitions that occur in the text.  For example, 'the' is never followed by another 'the', so it would be a waste to explicitly keep track of that zero-probability case.  The same is true for the vast majority of word pairs, so avoiding a dense 30000 by 30000 matrix saves a great deal of memory.
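To make the dictionary idea concrete, here is a minimal sketch of how priors and sparse transition probabilities could be estimated from a token list.  It is not the code in `markov_models`; the function name `build_bigram_tables` and the toy sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def build_bigram_tables(words):
    """Return (priors, transitions) estimated from a list of tokens."""
    # Prior: relative frequency of each word in the corpus.
    counts = Counter(words)
    total = sum(counts.values())
    priors = {w: c / total for w, c in counts.items()}

    # Transition counts: only word pairs that actually occur get an entry.
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        pair_counts[prev][nxt] += 1

    # Normalize each row so the probabilities out of a word sum to one.
    transitions = {}
    for prev, followers in pair_counts.items():
        row_total = sum(followers.values())
        transitions[prev] = {w: c / row_total for w, c in followers.items()}
    return priors, transitions

toy = "to be or not to be".split()
priors, transitions = build_bigram_tables(toy)
print(transitions["to"])  # {'be': 1.0}
print(transitions["be"])  # {'or': 1.0}
```

Pairs that never occur simply have no key, so memory grows with the number of distinct observed pairs rather than with the square of the vocabulary.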

With this model in hand, we can do interesting things like generate synthetic data.
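Conceptually, generating from a first-order model means drawing a first word from the prior and then repeatedly drawing the next word from the current word's transition distribution.  The sketch below shows that loop on a hand-made toy model; the stopping rule (end on sentence punctuation or after `max_words`) is an assumption for illustration and may differ from what `markov_models` does.

```python
import random

def generate_phrase(priors, transitions, max_words=30,
                    stop=(".", "!", "?"), rng=random):
    """Sample a phrase from a first-order Markov model stored as dicts."""
    # Draw the first word according to the prior probabilities.
    words = [rng.choices(list(priors), weights=list(priors.values()))[0]]
    while len(words) < max_words and words[-1] not in stop:
        followers = transitions.get(words[-1])
        if not followers:
            break  # the current word was never followed by anything
        # Draw the next word according to the transition probabilities.
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        words.append(nxt)
    return " ".join(words)

# Toy model with a single possible path, so the output is deterministic.
priors = {"we": 1.0}
transitions = {
    "we": {"are": 1.0},
    "are": {"with": 1.0},
    "with": {"you": 1.0},
    "you": {"!": 1.0},
}
print(generate_phrase(priors, transitions))  # we are with you !
```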

In [5]:
mm_shakespeare.generate_phrase()

"stands on my brother's love bear it and kneel , why , present trouble to make thee not be gone ; firm security ; you . "

The feel is right, if not exactly sensible!  These models are great at capturing tone and style, but not so much the meaning.  Another thing that we can do is use our statistical model to evaluate the probability of new examples.  For example, if I wanted to evaluate how probable it is that Shakespeare would generate the phrase

In [7]:
test_string = 'to be or not to be'

I would just evaluate the prior probability of 'to', multiply that by $P(\text{be} \mid \text{to})$, then by $P(\text{or} \mid \text{be})$, and so on.  In practice we'll use log probabilities to avoid underflow:

In [8]:
log_like_shakespeare = mm_shakespeare.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)
print (log_like_shakespeare)

-25.871950531869842
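The computation can be sketched in a few lines.  This is a simplified illustration rather than the actual contents of `markov_models`: the function name `log_likelihood` and the toy model are made up, and the roles given to `lam` (blending each transition probability with the prior of the next word) and `unknown` (a floor for words the model has never seen) are assumptions about what such smoothing parameters typically do.

```python
import math

def log_likelihood(phrase, priors, transitions, lam=0.01, unknown=1e-10):
    """Log-probability of a phrase under a first-order Markov model."""
    words = phrase.split()
    # Start with the prior probability of the first word.
    logp = math.log(priors.get(words[0], unknown))
    for prev, nxt in zip(words, words[1:]):
        p_trans = transitions.get(prev, {}).get(nxt, 0.0)
        p_prior = priors.get(nxt, unknown)
        # Blend the two (assumed smoothing) so an unseen pair never gives log(0).
        logp += math.log((1.0 - lam) * p_trans + lam * p_prior)
    return logp

# Hand-made toy model, not the Shakespeare model.
priors = {"to": 0.5, "be": 0.5}
transitions = {"to": {"be": 1.0}}

print(log_likelihood("to be", priors, transitions))  # seen transition
print(log_likelihood("be to", priors, transitions))  # unseen transition, still finite
```

Summing logs instead of multiplying probabilities matters for long phrases: a product of a few hundred small probabilities underflows to zero in floating point, while the sum of their logs stays well within range.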


A single log-likelihood isn't that interesting on its own.  A better use for log-likelihoods is classification: if I had a second statistical model built on another corpus of text, I could compute the likelihood of a phrase under both models and decide which writer more plausibly produced it.

One contemporary goldmine of idiosyncratic text is the Twitter account of Donald J. Trump.  Let us create a model for his tweets.

In [10]:
sequence_trump = []
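with open('trup_tweets.json','r') as file:
    tweet_list = json.load(file)
for t in tweet_list:
    # Strip accents and other non-ASCII characters, then decode back to text
    # so that the regular expressions below also work on Python 3.
    tweet = unicodedata.normalize('NFKD',t['text']).encode('ascii','ignore').decode('ascii')
    # Remove URLs.
    tweet = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', tweet)
    # Split into words and punctuation tokens, drop numbers, and lowercase.
    line_words = re.findall(r"[\w']+|[.,!?;]",tweet)
    line_words = [str(w).lower() for w in line_words if not w.isdigit()]

    sequence_trump.extend(line_words)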

print (sequence_trump[:100])
    
mm_trump = FirstOrderMarkovModel(sequence_trump)
mm_trump.build_transition_matrices()


['the', 'whitehouse', 'is', 'partnering', 'with', 'interior', 'and', 'natlparkservice', 'to', 'bring', 'the', "nscsafety's", 'prescribed', 'to', 'death', 'opioid', 'memorial', 'to', 'the', 'ellipse', 'beginning', 'tomorrow', ',', 'april', '12th', 'to', 'april', '18th', '.', 'more', 'information', 'speaker', 'paul', 'ryan', 'is', 'a', 'truly', 'good', 'man', ',', 'and', 'while', 'he', 'will', 'not', 'be', 'seeking', 're', 'election', ',', 'he', 'will', 'leave', 'a', 'legacy', 'of', 'achievement', 'that', 'nobody', 'can', 'question', '.', 'we', 'are', 'with', 'you', 'paul', '!', 'much', 'of', 'the', 'bad', 'blood', 'with', 'russia', 'is', 'caused', 'by', 'the', 'fake', 'amp', ';', 'corrupt', 'russia', 'investigation', ',', 'headed', 'up', 'by', 'the', 'all', 'democrat', 'loyalists', ',', 'or', 'people', 'that', 'worked', 'for', 'obama']


We can, again, generate data using this model:

In [16]:
mm_trump.generate_phrase()

"forward to obama administration , that's what does not drinking again ! "

And now we can evaluate the likelihood of 'to be or not to be' under the Trump model:

In [17]:
log_like_trump = mm_trump.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)

trump_factor = np.exp(log_like_trump - log_like_shakespeare)
print(trump_factor)

0.2544507833969866


'to be or not to be' is about 4 times more likely under the Shakespeare model, which is not very strong evidence.  Let's try again with a longer phrase.

In [18]:
test_string = 'to be or not to be , that is the question'
log_like_shakespeare = mm_shakespeare.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)
log_like_trump = mm_trump.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)

trump_factor = np.exp(log_like_trump - log_like_shakespeare)
print(trump_factor)

0.00260913427407909


Now that we have more data, the phrase is roughly 380 times more likely under the Shakespeare model.  Conversely, let's try something that Trump actually said, but that was certainly not in the tweet corpus.

In [19]:
test_string = "i moved on her very heavily"
log_like_shakespeare = mm_shakespeare.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)
log_like_trump = mm_trump.compute_log_likelihood(test_string,lamda=0.01,unknown_probability=1e-10)

trump_factor = np.exp(log_like_trump - log_like_shakespeare)
print(trump_factor)

58901.90596761701


This is much more likely to have come from Trump: by a factor of nearly 59,000.
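The ratio used above can also be turned into a probability for each author.  The sketch below assumes equal prior belief in the two authors (an assumption, not something estimated from data) and subtracts the largest log-likelihood before exponentiating so that very negative values don't underflow.  The helper name `posterior_from_log_likelihoods` is made up; the numbers are the ones printed for 'to be or not to be', with the Trump log-likelihood recovered from the printed ratio.

```python
import math

def posterior_from_log_likelihoods(log_likes):
    """Posterior over authors under equal priors, computed stably."""
    m = max(log_likes.values())
    # Subtract the max before exponentiating (the log-sum-exp trick).
    unnorm = {name: math.exp(v - m) for name, v in log_likes.items()}
    z = sum(unnorm.values())
    return {name: u / z for name, u in unnorm.items()}

log_s = -25.871950531869842                   # printed above for Shakespeare
log_t = log_s + math.log(0.2544507833969866)  # implied by the printed ratio
posterior = posterior_from_log_likelihoods({"shakespeare": log_s, "trump": log_t})
print(posterior)

# Choosing the author only requires comparing the logs directly.
print("shakespeare" if log_s > log_t else "trump")  # shakespeare
```

Comparing the log-likelihoods directly is also the safer choice for long texts, where exponentiating the difference could overflow.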

Finally, models like these, in which each word depends only on its immediate predecessor, are called 'bigram' models.  They aren't particularly good at generating realistic text.  Better results can be had by conditioning on the previous two or more words, albeit with a commensurate increase in cost and in the tendency to overfit.  Let's see what kind of data a trigram model generates:

In [20]:
from markov_models import SecondOrderMarkovModel

mm_shakespeare = SecondOrderMarkovModel(sequence_shakespeare)
mm_shakespeare.build_transition_matrices()
mm_trump = SecondOrderMarkovModel(sequence_trump)
mm_trump.build_transition_matrices()
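The only structural change in going from a bigram to a trigram model is that the context becomes the previous two words rather than one.  A minimal sketch, assuming the same dictionary layout as before but keyed on word pairs (the function name `build_trigram_table` is made up and this is not the code in `markov_models`):

```python
from collections import Counter, defaultdict

def build_trigram_table(words):
    """Map each (word1, word2) context to a distribution over the next word."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1

    # Normalize the counts for each two-word context.
    table = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        table[context] = {w: c / total for w, c in followers.items()}
    return table

toy = "to be or not to be , that is the question".split()
table = build_trigram_table(toy)
print(table[("to", "be")])  # {'or': 0.5, ',': 0.5}
```

Because there are far more distinct word pairs than distinct words, each context is seen fewer times, which is where the extra cost and the tendency to overfit come from.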

We can use these two models to play a game: we'll randomly select one of them, generate a phrase, and try to decide who said it!

In [29]:
models = [mm_shakespeare,mm_trump]
index = np.random.randint(2)
models[index].generate_phrase()

', and give me a taper in some sort , lechery eats itself . '

In [24]:
print (index)

1
