<h1>Markov Model Classifier / Poetry Generator<h/1>



<h3>Tips<h/3>

<h6>1. Use smoothing (add-one smoothing) to avoid zero probabilities <br>
2. Consider whether you need A and pi or log(A) and log(pi)<br>
3. For Bayes' rule, compute the priors: p(class = k) <h/6>

* Libraries


In [402]:
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.model_selection import train_test_split

* Load the data

In [403]:
from urllib.request import urlopen
eap_data_link ="https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/edgar_allan_poe.txt"
rf_data_link = "https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt"
eap_data = urlopen(eap_data_link).read().decode('utf-8')
rf_data = urlopen(rf_data_link).read().decode('utf-8')

In [404]:
type(eap_data)

str

* Loop through each file, save each line to a list (one line == one sample)

In [405]:
eap_list = eap_data.split('\n')
eap_list

["LO! Death hath rear'd himself a throne",
 'In a strange city, all alone,',
 'Far down within the dim west',
 'Where the good, and the bad, and the worst, and the best,',
 'Have gone to their eternal rest.',
 '\u2009',
 'There shrines, and palaces, and towers',
 'Are not like any thing of ours',
 'Oh no! O no! ours never loom',
 'To heaven with that ungodly gloom!',
 'Time-eaten towers that tremble not!',
 'Resemble nothing that is ours.',
 'Around, by lifting winds forgot,',
 'Resignedly beneath the sky',
 'The melancholy waters lie.',
 '\u2009',
 'No holy rays from heaven come down',
 'On the long night-time of that town,',
 'But light from out the lurid sea',
 'Streams up the turrets silently',
 'Up thrones up long-forgotten bowers',
 "Of scultur'd ivy and stone flowers",
 'Up domes up spires up kingly halls',
 'Up fanes up Babylon-like walls',
 'Up many a melancholy shrine',
 'Whose entablatures intertwine',
 'The mask the viol and the vine.',
 '\u2009',
 'There open temples open 

In [406]:
rf_list = rf_data.split('\n')
rf_list

['Two roads diverged in a yellow wood,',
 'And sorry I could not travel both',
 'And be one traveler, long I stood',
 'And looked down one as far as I could',
 'To where it bent in the undergrowth; ',
 '',
 'Then took the other, as just as fair,',
 'And having perhaps the better claim',
 'Because it was grassy and wanted wear,',
 'Though as for that the passing there',
 'Had worn them really about the same,',
 '',
 'And both that morning equally lay',
 'In leaves no step had trodden black.',
 'Oh, I kept the first for another day! ',
 'Yet knowing how way leads on to way',
 'I doubted if I should ever come back.',
 '',
 'I shall be telling this with a sigh',
 'Somewhere ages and ages hence:',
 'Two roads diverged in a wood, and I,',
 'I took the one less traveled by,',
 'And that has made all the difference.',
 '',
 'Whose woods these are I think I know.',
 'His house is in the village, though; ',
 'He will not see me stopping here',
 'To watch his woods fill up with snow.',
 '',
 'My 

<small><b> Solution: </b></small> 

In [407]:
#collect data into lists
input_text = []
labels = []

for line in eap_list:
    line = line.lower()
    if line:
        line = line.translate(str.maketrans('', '', string.punctuation))
        labels.append(0)
        input_text.append(line)
        
for line in rf_list:
    line = line.lower()
    if line:
        line = line.translate(str.maketrans('', '', string.punctuation))
        labels.append(1)
        input_text.append(line)
        
input_text


['lo death hath reard himself a throne',
 'in a strange city all alone',
 'far down within the dim west',
 'where the good and the bad and the worst and the best',
 'have gone to their eternal rest',
 '\u2009',
 'there shrines and palaces and towers',
 'are not like any thing of ours',
 'oh no o no ours never loom',
 'to heaven with that ungodly gloom',
 'timeeaten towers that tremble not',
 'resemble nothing that is ours',
 'around by lifting winds forgot',
 'resignedly beneath the sky',
 'the melancholy waters lie',
 '\u2009',
 'no holy rays from heaven come down',
 'on the long nighttime of that town',
 'but light from out the lurid sea',
 'streams up the turrets silently',
 'up thrones up longforgotten bowers',
 'of sculturd ivy and stone flowers',
 'up domes up spires up kingly halls',
 'up fanes up babylonlike walls',
 'up many a melancholy shrine',
 'whose entablatures intertwine',
 'the mask the viol and the vine',
 '\u2009',
 'there open temples open graves',
 'are on a level 

* Save the labels too (another list)

In [408]:
#create labels
eap_labels = [0]*len(eap_list)
rf_labels = [1]*len(rf_list)

#combine data and labels
eap = list(zip(eap_list, eap_labels))
rf = list(zip(rf_list, rf_labels))

#combine all data
data = eap + rf

data

[("LO! Death hath rear'd himself a throne", 0),
 ('In a strange city, all alone,', 0),
 ('Far down within the dim west', 0),
 ('Where the good, and the bad, and the worst, and the best,', 0),
 ('Have gone to their eternal rest.', 0),
 ('\u2009', 0),
 ('There shrines, and palaces, and towers', 0),
 ('Are not like any thing of ours', 0),
 ('Oh no! O no! ours never loom', 0),
 ('To heaven with that ungodly gloom!', 0),
 ('Time-eaten towers that tremble not!', 0),
 ('Resemble nothing that is ours.', 0),
 ('Around, by lifting winds forgot,', 0),
 ('Resignedly beneath the sky', 0),
 ('The melancholy waters lie.', 0),
 ('\u2009', 0),
 ('No holy rays from heaven come down', 0),
 ('On the long night-time of that town,', 0),
 ('But light from out the lurid sea', 0),
 ('Streams up the turrets silently', 0),
 ('Up thrones up long-forgotten bowers', 0),
 ("Of scultur'd ivy and stone flowers", 0),
 ('Up domes up spires up kingly halls', 0),
 ('Up fanes up Babylon-like walls', 0),
 ('Up many a melanc

* Train-test split

In [409]:
#shuffle data
import random
random.shuffle(data)


In [410]:
split = int(0.8*len(data))
train = data[:split]
test = data[split:]

train
    

[('In the seraphic glancing of thine eyes-', 0),
 ('As if the turret-tops had given', 0),
 ("The road there, if you'll let a guide direct you", 1),
 ("You'll be expecting John. I pity Estelle; ", 1),
 ("And last year's berries shining scarlet red.", 1),
 ('Oh, I kept the first for another day! ', 1),
 ('And the other listening things)', 0),
 ('From north to south across the blue;', 1),
 ('In her highest noon,', 0),
 ('And built a stone baptismal font outdoors-', 1),
 ("No doubt it's grown up some to woods around it.", 1),
 ('She seems to have the housework, and besides, ', 1),
 ('', 1),
 ('voted the only times he ever voted,', 1),
 ("They don't dispose me, either one of them,", 1),
 ('Say it right out- no matter for her mother. ', 1),
 ('The moon was waiting for her chill effect.', 1),
 ('I listened till it almost climbed the stairs', 1),
 ('I never could have done the thing I did', 1),
 ('Of all who owe thee most- whose gratitude', 0),
 ('', 0),
 ('I had a glimpse through curtain lace

<small><b> Solution: </b></small> 

In [411]:
train_text, test_text, Ytrain, Ytest = train_test_split(input_text, labels)

In [412]:
len(Ytrain), len(Ytest)

(1618, 540)

In [413]:
train_text[:5]  #first 5 lines of training data

['ye avengers of libertys wrongs',
 'i know and it was raining ',
 'what bones the cellar bones out of the grave',
 'in the ultimate climes of the pole',
 'to go west to a worse fight with the desert']

In [414]:
Ytrain[:5]  #first 5 labels of training data

[0, 1, 1, 0, 1]

* Create a mapping from unique word to unique integer

In [415]:
#mapping words to integers
word2idx = {}
current_idx = 0
for sentence, _ in (train + test):
    for word in sentence.split(" "):
        if word not in word2idx:
            word2idx[word] = current_idx
            current_idx += 1
            
V = len(word2idx)
V


4523

<small><b> Solution: </b></small> 

In [416]:
idx = 1
word2idx = {'<unk>': 0}

In [417]:
#populate word2idx
for text in train_text:
    tokens = text.split()
    for token in tokens:
        if token not in word2idx:
            word2idx[token] = idx
            idx += 1

In [418]:
word2idx

{'<unk>': 0,
 'ye': 1,
 'avengers': 2,
 'of': 3,
 'libertys': 4,
 'wrongs': 5,
 'i': 6,
 'know': 7,
 'and': 8,
 'it': 9,
 'was': 10,
 'raining': 11,
 'what': 12,
 'bones': 13,
 'the': 14,
 'cellar': 15,
 'out': 16,
 'grave': 17,
 'in': 18,
 'ultimate': 19,
 'climes': 20,
 'pole': 21,
 'to': 22,
 'go': 23,
 'west': 24,
 'a': 25,
 'worse': 26,
 'fight': 27,
 'with': 28,
 'desert': 29,
 'as': 30,
 'stardials': 31,
 'hinted': 32,
 'morn': 33,
 'or': 34,
 'like': 35,
 'poetesss': 36,
 'heart': 37,
 'love': 38,
 'elfin': 39,
 'from': 40,
 'green': 41,
 'grass': 42,
 'me': 43,
 'creation': 44,
 'height': 45,
 'adventure': 46,
 'is': 47,
 'monarch': 48,
 'thoughts': 49,
 'dominion': 50,
 'do': 51,
 'along': 52,
 'him': 53,
 'stop': 54,
 'his': 55,
 'shouting': 56,
 'disturbed': 57,
 'slumbering': 58,
 'village': 59,
 'street': 60,
 'kitchen': 61,
 'had': 62,
 'been': 63,
 'dark': 64,
 'own': 65,
 'perhaps': 66,
 'you': 67,
 'have': 68,
 'art': 69,
 'mean': 70,
 'poetess': 71,
 'who': 72,
 'wro

In [419]:
len(word2idx)

2512

* Loop through data, tokenize each line(string split should suffice)

In [420]:
#tokenize each line
import numpy as np
X_train = np.empty(len(train), dtype = object)
Y_train = np.empty(len(train), dtype = object)
for i in range(len(train)):
    X_train[i] = train[i][0].split(" ")
    Y_train[i] = train[i][1]
    
X_test = np.empty(len(test), dtype = object)    
Y_test = np.empty(len(test), dtype = object)
for i in range(len(test)):  
    X_test[i] = test[i][0].split(" ")
    Y_test[i] = test[i][1]
    


<small><b> Solution: </b></small> 

In [421]:
#integer formar
train_text_int = []
test_text_int = []

for text in train_text:
    tokens = text.split()
    line_as_int = [word2idx[token] for token in tokens]
    train_text_int.append(line_as_int)
    
for text in test_text:
    tokens = text.split()
    line_as_int = [word2idx.get(token, 0) for token in tokens]
    test_text_int.append(line_as_int)

In [422]:
train_text_int[100:105]

[[131, 374, 375, 130, 376, 377],
 [6, 378, 379, 9, 52, 380, 381],
 [130, 382, 383, 3, 14, 384, 228, 385, 386],
 [380, 14, 387, 388, 130, 89, 389],
 [6, 390, 196, 6, 391, 380, 392, 393]]

* Assign each unique word a unique integer index ("mapping" == "dict)

* Create a special index for unknown words (words in the test set but not in the training set)

In [423]:
#unknown words
unknown_words = set()
for sentence, _ in (train + test):
    for word in sentence.split(" "):
        if word not in word2idx:
            unknown_words.add(word)

unknown_words


{'',
 'sap',
 "'Company?'",
 'life?',
 'clear',
 'remained;',
 '(Who',
 'spectre.',
 "'How",
 'snows.',
 'Somewhere',
 'purport',
 'yield,',
 'broken.',
 "safely.'",
 'glances',
 'Calypso."',
 'Flattered',
 'important,',
 'Lowes,',
 'What',
 'ledges',
 'Fell',
 'skies,',
 'lady,',
 'Uttered',
 'astride,',
 'vista,',
 'sere,',
 'birth:',
 'cruel',
 'round,',
 'track.',
 "He's",
 "Ulalume!'",
 'forever,',
 'thinking.',
 "'You've",
 'loved?-',
 'throne;',
 'decently',
 'mistake.',
 'technical.-',
 'enkindle-',
 'slight,',
 'worse.',
 'muffled.',
 'dormant',
 'walk.',
 'midnight-',
 'Mars',
 'Winds',
 "stand.'",
 'bliss!',
 'wilder,',
 'legs.',
 'October',
 'Time',
 'love,',
 'Among',
 'farming',
 'As',
 'sport.',
 "Granny's,",
 'Smith,',
 'awful!',
 'ridiculous.',
 'attic.',
 'dim-remembered',
 'orchard.',
 'fervour',
 'dust.',
 'terminates-',
 'lately?',
 'bare,',
 'Until',
 '(With',
 'outsiders.',
 'respect',
 'twelvemonth',
 'hands,',
 "piano's",
 'Faith-',
 'avalanches,',
 'clothes,',

* Train a Markov Model for each class (Edgar Allen Poe / Rober Frost)


<small><b> Solution: </b></small> 

In [424]:
#initialize A and pi matrices - for both classes
V = len(word2idx)

A0 = np.ones((V, V))
A1 = np.ones((V, V))

pi0 = np.zeros(V)
pi1 = np.zeros(V)

#initial fake counts

* Write a function to compute the posterior for each class, given an input

In [425]:
def compute_p_y_given_x(x, eap_markov, rf_markov):
    p_eap = 1
    p_rf = 1
    for i in range(len(x)-1):
        p_eap *= eap_markov[x[i], x[i+1]]
        p_rf *= rf_markov[x[i], x[i+1]]
    return p_eap, p_rf

<small><b> Solution: </b></small> 

In [426]:
#compute counts for A and pi
def compute_counts(text_as_int, A, pi):
    for tokens in text_as_int:
        last_idx = None
        for idx in tokens:
            if last_idx is None:
                #it's the first word in the sentence
                pi[idx] += 1
            else:
                #the last word exists, so count a transition
                A[last_idx, idx] += 1
            #Update the last index
            last_idx = idx    

In [427]:
compute_counts([t for t in zip(train_text_int, Ytrain) if y == 0], A0, pi0) 
compute_counts([t for t in zip(train_text_int, Ytrain) if y == 1], A1, pi1)

In [428]:
#normalize A and pi so they are valid probability matrices
A0 /= A0.sum(axis = 1, keepdims = True)
pi0 /= pi0.sum()

A1 /= A1.sum(axis = 1, keepdims = True)
pi1 /= pi1.sum()

  pi0 /= pi0.sum()


In [429]:
#compute log probabilities. we don't need the actaul probabilities, just the log probabilities
logA0 = np.log(A0)
logA1 = np.log(A1)

logpi0 = np.log(pi0)

#divide by zero encountered in log
if np.isnan(logpi0).any():
    logpi0[np.isnan(logpi0)] = 0
    
logpi1 = np.log(pi1)
        


  logpi1 = np.log(pi1)


In [430]:
#compute priors
count0 = sum(y==0 for y in Ytrain)
count1 = sum(y==1 for y in Ytrain)

total = len(Ytrain)

p0 = count0 / total
p1 = count1 / total

logp0 = np.log(p0)
logp1 = np.log(p1)

p0, p1

#posterior probability - map method

(0.3368355995055624, 0.6631644004944376)

* Take the argmax over the posteriors to get the predicted class

<small><b> Solution: </b></small> 

In [431]:
#build a classifier
class Classifier:
    def __init__(self, logAs, logpis, logpriors):
        self.logAs = logAs
        self.logpis = logpis
        self.logpriors = logpriors
        self.K = len(logpriors) #number of classes
        
  
        
    def _compute_log_likelihood(self, input_, class_):
        logA = self.logAs[class_]
        logpi = self.logpis[class_]
        
        last_idx = None
        logprob = 0
        for idx in input_:
            if last_idx is None:
                #it's the first token
                logprob += logpi[idx]
            else:
                logprob += logA[last_idx, idx]
                
            #update the last index
            last_idx = idx
            
        return logprob
    
    def predict(self, inputs):
        predictions = np.zeros(len(inputs))
        for i, input_ in enumerate(inputs):
            #compute the log likelihood for each class
            posteriors = [self._compute_log_likelihood(input_, c) + self.logpriors[c] 
                          for c in range(self.K)]
            pred = np.argmax(posteriors)
            predictions[i] = pred
            
        return predictions
    
        

* Make predictions for both train and test sets 

<small><b> Solution: </b></small> 

In [432]:
#each array must be in order since classes are assumed to index these lists
clf = Classifier([logA0, logA1], [logpi0, logpi1], [logp0, logp1])

* Compute the accuracy for train/test sets

<small><b> Solution: </b></small> 

In [433]:
Ptrain = clf.predict(train_text_int)
Ptest = clf.predict(test_text_int)

#print the accuracy
print("Train accuracy:", np.mean(Ptrain == Ytrain))
print("Test accuracy:", np.mean(Ptest == Ytest))

Train accuracy: 0.33498145859085293
Test accuracy: 0.4166666666666667


* Check for class imbalance

* If there is class imbalance, check confusion matrix and F1-score

<small><b> Solution: </b></small> 

In [434]:
from sklearn.metrics import confusion_matrix, f1_score

In [435]:
cm_train = confusion_matrix(Ytrain, Ptrain)
cm_train


array([[ 542,    3],
       [1073,    0]], dtype=int64)

In [436]:
cm_test = confusion_matrix(Ytest, Ptest)
cm_test

array([[143,  34],
       [281,  82]], dtype=int64)

In [437]:
f1_score(Ytrain, Ptrain)

0.0

In [438]:
f1_score(Ytest, Ptest)

0.34237995824634654