# Investigation 02 - Training Markov Models

Brian Bahmanyar - Due: Wednesday, April 13

___

In [1]:
import pandas as pd
import numpy as np
from collections import Counter

In [2]:
## Please disregard, just some css for styling
from IPython.display import HTML
HTML("""<style>@import "http://fonts.googleapis.com/css?family=Lato|Source+Code+Pro|Montserrat:400,700";#notebook-container{-webkit-box-shadow:none;box-shadow:none}h1,h2,h3,h4,h5,h6{font-family:'Avenir Next'}h1{font-size:4.5em}h2{font-size:4rem}h3{font-size:3.5rem}h4{font-size:3rem}h5{font-size:2.5rem}h6{font-size:2rem}p{font-family:'Avenir Next';font-size:12pt;line-height:15pt;color:#2F4F4F}.CodeMirror pre{font-family:'Source Code Pro', monospace;font-size:0.95em}div.input_area{border:none;background:whitesmoke}</style>""")

## The Data

All the Markov models are trained on the corpus in [trump.txt](./trump.txt), which comprises two of Donald Trump's speeches hosted by the Washington Post ([Article 1](https://www.washingtonpost.com/news/post-politics/wp/2015/06/16/full-text-donald-trump-announces-a-presidential-bid/), [Article 2](https://www.washingtonpost.com/news/post-politics/wp/2016/02/20/transcript-donald-trumps-victory-speech-after-the-south-carolina-gop-primary/)). I scraped, cleaned, and tokenized the text from the web pages' HTML, then wrote it to [trump.txt](./trump.txt).

In [3]:
%%bash
wc -lw trump.txt

     167    8264 trump.txt


The corpus contains 8264 words in total, divided among 167 lines.

___

## The Markov Models

In [4]:
## I'm importing functions from the python scripts for use here without copying and pasting
##     all the functions. Please refer to the .py files for the implementations.
from train_markov_chain import get_transition_matrix
from generate_text import simulate_markov_states, get_text
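
The implementations live in the .py files and are not reproduced here. As a rough idea of what training involves, below is a minimal sketch that counts transitions between consecutive n-gram states and normalizes each row. The helper `sketch_transition_probs` and the toy sentence are hypothetical, and the sketch uses plain dictionaries; it is not the actual `get_transition_matrix`.

```python
from collections import Counter, defaultdict

def sketch_transition_probs(tokens, order=1):
    """Estimate transition probabilities between consecutive n-gram states.

    Hypothetical helper for illustration; not the notebook's get_transition_matrix.
    """
    # Each state is a tuple of `order` consecutive tokens.
    states = [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]
    counts = defaultdict(Counter)
    # Count how often each state is followed by the next (overlapping) state.
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    # Normalize each row of counts so its probabilities sum to one.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Toy token list (made up for illustration, not read from trump.txt)
tokens = "we need trump we need money we have people".split()
probs = sketch_transition_probs(tokens, order=1)
print(probs[("we",)])
```

In this sketch the final state of the token list has no successor and so gets no row; how the real script treats that case is not shown in this notebook.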

#### First Order Markov Model (Bi-Gram Model)

In [5]:
P = get_transition_matrix('trump.txt', markov_model_order=1)

In [6]:
P.shape

(1265, 1265)

Our transition probability matrix is square, as expected, and the state space contains 1265 states, one for each distinct word in the corpus.

This transition matrix is very sparse, since most words are followed by only a handful of other words, so showing a slice of it won't be very helpful. Instead I'll display a subset of the state space and confirm that the probabilities in each of the first 10 rows sum to one:

In [7]:
P.index

Index(['('1',)', '('10',)', '('100',)', '('10000',)', '('12',)', '('1290',)',
       '('13',)', '('15',)', '('16',)', '('18',)',
       ...
       '('yesterday',)', '('yet',)', '('york',)', '('you',)', '('young',)',
       '('your',)', '('youre',)', '('yourself',)', '('youve',)', '('zero',)'],
      dtype='object', length=1265)

In [8]:
P.sum(axis=1)[:10]

('1',)        1
('10',)       1
('100',)      1
('10000',)    1
('12',)       1
('1290',)     1
('13',)       1
('15',)       1
('16',)       1
('18',)       1
dtype: float64
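
The output above covers only the first ten rows. A minimal sketch of how every row could be checked at once, and how sparsity could be quantified, is shown below on a small made-up matrix; `toy_P` is a stand-in for illustration, not the real `P`.

```python
import numpy as np

# Toy row-stochastic matrix standing in for P (values are made up for illustration).
toy_P = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.25, 0.25, 0.25, 0.25],
])

row_sums = toy_P.sum(axis=1)
# True only if every row is a valid probability distribution (up to float error).
all_rows_valid = bool(np.allclose(row_sums, 1.0))
# Fraction of entries that are exactly zero.
sparsity = float((toy_P == 0).mean())

print(all_rows_valid, sparsity)
```

Assuming `P` is a numeric pandas DataFrame, as its `.index` and `.sum(axis=1)` suggest, the same check would read `np.allclose(P.sum(axis=1), 1.0)`.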

Now that we have confirmed that our transition matrix is looking good, we can start using it to generate text that resembles the structure of Donald Trump's speeches.

In [9]:
print(' '.join(get_text(simulate_markov_states(P, num_states=200))))

takes care of 30 the most highly sought after all of political people that sells because my time when do something wrong with oil and thats longterm debt because dont know the world so put together its impossible theyre gonna be the middle east iran from all im totally destabilize the deal its great deals the good thing and my father is going to nevada and lets go to us in in the prison area my life ive watched the world per pupil than us not using a nation in a clue they beat mexico love the airconditioner didnt know henry right on time magazine last time when we cant do a company so thank you look at all talk jobs because buy the smartest negotiators in in mexico thats the cards because again but still hate to do we want to educate their united states zero tax until the military so well you its gonna go to do it it right good we cant do it said is just a good company so tough you cant happen if a gun on television well as an incredible and certify to believe me they dont need somebod

With a first-order Markov model and 200 generated states, the text is pretty scattered and hard to follow (although, in the model's defense, so is the corpus itself).
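
For intuition about how such text can be produced, here is a minimal sketch of a random walk over states followed by a conversion of the state path into words. The names `sketch_simulate` and `sketch_get_text` and the toy probabilities are hypothetical; the real `simulate_markov_states` and `get_text` are in the .py files and may work differently.

```python
import random

def sketch_simulate(probs, start, num_states, seed=None):
    """Random walk over states; probs maps state -> {next_state: probability}."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(num_states - 1):
        followers = probs.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        next_states = list(followers)
        weights = [followers[s] for s in next_states]
        # Draw the next state in proportion to its transition probability.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

def sketch_get_text(path):
    """The first state contributes all its words; each later state adds its last word."""
    return list(path[0]) + [state[-1] for state in path[1:]]

# Toy second-order transition probabilities (made up for illustration).
toy_probs = {
    ("we", "need"): {("need", "trump"): 0.5, ("need", "money"): 0.5},
    ("need", "trump"): {("trump", "we"): 1.0},
    ("need", "money"): {("money", "we"): 1.0},
    ("trump", "we"): {("we", "need"): 1.0},
    ("money", "we"): {("we", "need"): 1.0},
}

path = sketch_simulate(toy_probs, ("we", "need"), num_states=8, seed=0)
words = sketch_get_text(path)
print(" ".join(words))
```

Because consecutive states overlap in all but one word, only the last word of each new state is appended to the output.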

To get a sense of the word frequencies this model outputs, we need to simulate much more text. Below I simulate 100,000 states from the model trained on the 8264-word corpus and display the 20 most frequent generated words with their counts.

In [10]:
word_count = Counter() 
word_count.update(get_text(simulate_markov_states(P, num_states=100000)))

In [11]:
list(zip(range(1,21), word_count.most_common(20)))

[(1, ('the', 3727)),
 (2, ('and', 3552)),
 (3, ('to', 2867)),
 (4, ('a', 2250)),
 (5, ('we', 2207)),
 (6, ('you', 1709)),
 (7, ('of', 1669)),
 (8, ('they', 1569)),
 (9, ('it', 1556)),
 (10, ('have', 1497)),
 (11, ('that', 1470)),
 (12, ('in', 1257)),
 (13, ('going', 1104)),
 (14, ('our', 1006)),
 (15, ('so', 970)),
 (16, ('people', 910)),
 (17, ('its', 842)),
 (18, ('be', 825)),
 (19, ('is', 804)),
 (20, ('are', 788))]

Unsurprisingly, the most frequently generated words, such as 'the', 'and', and 'to', are also the most common words in the corpus and among the most frequently used words in the English language.

___

#### Second Order Markov Model (Tri-Gram Model)

In [12]:
P = get_transition_matrix('trump.txt', markov_model_order=2)

In [13]:
P.shape

(5310, 5310)

Our transition probability matrix is square, as expected, and the state space contains 5310 states. It makes sense that the state space is larger here, since each state is now a pair of consecutive words rather than a single word.
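
To illustrate how the number of states depends on the order, the sketch below counts distinct n-grams in a small made-up token list. The helper `count_states` and the toy sentence are hypothetical and are not taken from the training script.

```python
def count_states(tokens, order):
    """Number of distinct n-gram states of the given order in a token list."""
    return len({tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)})

# Toy token list (made up for illustration, not read from trump.txt)
toy_tokens = "we need trump we need money we have people we have losers".split()

sizes = {order: count_states(toy_tokens, order) for order in (1, 2, 3)}
print(sizes)  # {1: 7, 2: 9, 3: 10}
```

The count grows with the order but can never exceed the number of n-gram positions in the text, so the growth levels off at higher orders.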

Below is a subset of the state space:

In [14]:
P.index

Index(['('1', 'billion')', '('10', 'billion')', '('10', 'feet')',
       '('10', 'to')', '('100', 'percent')', '('10000', 'we')',
       '('12', 'billion')', '('1290', 'avenue')', '('13', 'trillion')',
       '('15', 'million')',
       ...
       '('youre', 'not')', '('youre', 'right')', '('yourself', 'how')',
       '('youve', 'seen')', '('zero', 'chance')', '('zero', 'horrible')',
       '('zero', 'ill')', '('zero', 'our')', '('zero', 'tax')',
       '('zero', 'whoever')'],
      dtype='object', length=5310)

Let's generate 200 states using our second-order model:

In [15]:
print(' '.join(get_text(simulate_markov_states(P, num_states=200))))

wall just got 10 feet taller its true and these are the walls because we need trump now our country needs we need somebody we need money were dying were dying we need that thinking we have tremendous people we have nothing and every truck and every part manufactured in this building they make weapons right now and then were going to be amazingly destructive doctors are quitting have a clue hes a bad negotiator hes the one that did bergdahl we get bergdahl we get a lot of them that are obsolete weve got social security thats going to have the opposite thinking we have to get environmental clearance and the finest when mexico sends its people theyre not sending you theyre sending us not the right people so ford will come back theyll all come back and make it great again so ladies and gentlemen am officially running for president would one of my family melania barron kai donnie don vanessa tiffany evanka did a great negotiator learned so much he was a done deal its going to do terrific an

The text generated by our second-order Markov model is much more readable and closer to real language than the first-order output. This is expected because we are breaking fewer links between the words of the original corpus.

To get a sense of the word frequencies this model outputs, we again need to simulate much more text. Below I simulate 100,000 states from the model trained on the 8264-word corpus and display the 20 most frequent generated words with their counts.

In [16]:
word_count.clear()
word_count.update(get_text(simulate_markov_states(P, num_states=100000)))

In [17]:
list(zip(range(1,21), word_count.most_common(20)))

[(1, ('the', 3799)),
 (2, ('and', 3657)),
 (3, ('to', 2922)),
 (4, ('we', 2209)),
 (5, ('a', 2189)),
 (6, ('of', 1695)),
 (7, ('you', 1656)),
 (8, ('have', 1628)),
 (9, ('it', 1522)),
 (10, ('that', 1507)),
 (11, ('they', 1460)),
 (12, ('in', 1244)),
 (13, ('going', 1140)),
 (14, ('our', 1023)),
 (15, ('so', 986)),
 (16, ('people', 973)),
 (17, ('its', 854)),
 (18, ('be', 831)),
 (19, ('is', 782)),
 (20, ('were', 775))]

As expected, the most frequently generated words are again 'the', 'and', and 'to'. These counts estimate the unconditional (marginal) frequency of each word in the corpus, so given enough simulated states they should not change much from one model order to the next.
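
One way to see why the marginal frequencies are stable is that the corpus word frequencies form a stationary distribution of the first-order chain estimated from that same corpus. The sketch below checks this on a toy token list. It assumes the last word wraps around to the first so that every word has a successor; that assumption is mine and is not necessarily what the training script does.

```python
import numpy as np

# Toy token list (made up for illustration, not read from trump.txt)
tokens = "we need trump we need money we have people".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
# Assumption: wrap the last word around to the first so every word has a successor.
for a, b in zip(tokens, tokens[1:] + tokens[:1]):
    counts[idx[a], idx[b]] += 1
# Row-normalize counts into first-order transition probabilities.
P_toy = counts / counts.sum(axis=1, keepdims=True)

# Unconditional word frequencies in the toy corpus.
pi = np.array([tokens.count(w) for w in vocab]) / len(tokens)

# pi is stationary if one step of the chain leaves it unchanged.
is_stationary = bool(np.allclose(pi @ P_toy, pi))
print(is_stationary)
```

Under that wrap-around assumption the check holds exactly, which is why long simulations tend toward the corpus word frequencies.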

#### Third Order Markov Model (4-Gram Model)

In [18]:
P = get_transition_matrix('trump.txt', markov_model_order=3)

In [19]:
P.shape

(7272, 7272)

In [20]:
print(' '.join(get_text(simulate_markov_states(P, num_states=200))))

im making speeches all the time we want trump well you need somebody because politicians are all talk no action nothings gonna get done they will not as an example ive been on the circuit making speeches and hear my fellow republicans and theyre wonderful people like them they all want me to support them they dont see me anymore im making speeches all the time we want trump we want trump we want trump we want trump now our country needs we need that thinking we have the opposite thinking we have losers we have losers we have losers we have losers we have people that are selling this country down the drain so put together this and before say it have to say ted and marco did a really good plan and ill add in the third we had a 9000 we had a 9000 we had a really good job and they wanted to do a lot of beautiful work were going to do a lot of those votes also you dont just add them together so think were going to nevada lead lead with the hispanics im leading in every poll with the hispani

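As the order of the Markov model increases, we break fewer links between the words of the corpus. The generated text becomes more natural and readable, but it also stays much closer to the original: with 7272 distinct states in an 8264-word corpus, most three-word states occur only once and therefore have a single possible successor.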
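
One way to quantify how closely generated text follows the corpus is the fraction of its n-grams that also occur verbatim in the corpus. The sketch below uses hypothetical helpers (`ngrams`, `verbatim_fraction`) and made-up token lists rather than the notebook's actual output.

```python
def ngrams(tokens, n):
    """All consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def verbatim_fraction(generated, corpus, n):
    """Fraction of the generated text's n-grams that occur verbatim in the corpus."""
    corpus_ngrams = set(ngrams(corpus, n))
    generated_ngrams = ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    hits = sum(g in corpus_ngrams for g in generated_ngrams)
    return hits / len(generated_ngrams)

# Toy token lists (made up for illustration)
corpus = "we need trump we need money we have people".split()
generated = "we need money we need trump we have".split()

frac_2 = verbatim_fraction(generated, corpus, 2)
frac_4 = verbatim_fraction(generated, corpus, 4)
print(frac_2, frac_4)  # 1.0 0.4
```

In this toy case every generated word pair appears in the corpus, as it would for text from a first-order model, while only some four-word runs do. A higher-order model would be expected to push the longer-run fractions toward one.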