##### Markov Model Captures "Flow," Not "Meaning"

The Markov model estimates the probability that a given word follows the previous one
($P(w_t \mid w_{t-1})$). An author's signature lies not only in the words they use, but also in how
they order and inflect them.

- **Poe**: "The bird hath flown" (archaic English usage)  
- **Frost**: "The bird has flown" (modern usage)

If you apply lemmatization:  
- *hath* → *have*  
- *has* → *have*  
- *flown* → *fly*  

Both sentences collapse into: **"the bird have fly"**.  
This destroys the author's unique grammatical style (their *voice*), making it harder for the model  
to distinguish between Poe and Frost.
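The collapse can be checked with a small sketch. `TOY_LEMMAS` below is a hypothetical three-entry lookup table standing in for a real lemmatizer, and `bigrams` and `lemmatize` are illustrative helpers that are not part of this notebook. The idea is to compare the word pairs $(w_{t-1}, w_t)$ a first-order Markov model would see before and after lemmatization.

```python
# Toy lemma table: a hypothetical stand-in for a real lemmatizer.
TOY_LEMMAS = {"hath": "have", "has": "have", "flown": "fly"}

def bigrams(tokens):
    """Return the adjacent word pairs (w_{t-1}, w_t) of a token list."""
    return list(zip(tokens, tokens[1:]))

def lemmatize(tokens):
    """Replace each token with its toy lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

poe = "the bird hath flown".split()
frost = "the bird has flown".split()

raw_poe, raw_frost = set(bigrams(poe)), set(bigrams(frost))
lem_poe, lem_frost = set(bigrams(lemmatize(poe))), set(bigrams(lemmatize(frost)))

# Word pairs that occur in only one of the two sentences
print(sorted(raw_poe ^ raw_frost))   # four distinguishing pairs
print(lem_poe ^ lem_frost)           # empty: nothing left to tell the authors apart
```

Before lemmatization the two sentences differ in four word pairs. Afterwards they share every pair, so the transition counts carry no information about the author.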


In [49]:
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.model_selection import train_test_split
import re

In [None]:
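def get_data(TXT_DIR):
    """Read a text file and return one list of lowercase tokens per line."""
    line_list = []
    TXT_DIR = "../" + TXT_DIR  # the path is resolved relative to the parent directory
    with open(TXT_DIR, 'r', encoding='utf-8') as file:
        for line in file:
            # Normalization: trim surrounding whitespace and lowercase
            line = line.strip()
            line = line.lower()
            line = re.sub(r'[^\w\s]', '', line)  # remove punctuation (anything that is not a word character or whitespace)
            line = re.sub(r'\d', '', line)       # remove digits
            if line != "":
                tokens = line.split()
                line_list.append(tokens)  # one token list per line, the format the Markov model expects
    return line_list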
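To see what the normalization steps do to a single line, here is a minimal sketch. `normalize_line` is a hypothetical helper that repeats the same cleaning steps as `get_data` on an in-memory string, and the sample strings are made up for illustration.

```python
import re

def normalize_line(line):
    """Hypothetical helper: apply the same cleaning steps as get_data to one string."""
    line = line.strip().lower()
    line = re.sub(r'[^\w\s]', '', line)  # remove punctuation
    line = re.sub(r'\d', '', line)       # remove digits
    return line.split()

print(normalize_line("  Two roads diverged in a yellow wood, 1916!  "))
# → ['two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood']

print(normalize_line("* * *"))
# → []
```

Punctuation is deleted rather than replaced by a space, so a contraction such as `o'er` becomes `oer`. A line made up only of separated punctuation or digits is reduced to whitespace and yields an empty token list.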

In [51]:
poe_list = get_data('data/edgar_allan_poe.txt')
frost_list = get_data('data/robert_frost.txt')

In [52]:
print(len(poe_list))
print(len(frost_list))


718
1436


In [53]:
# Label mapping: Poe = 0, Frost = 1
inputs = poe_list + frost_list
labels = [0]*len(poe_list) + [1]*len(frost_list)

print(len(inputs))
print(len(labels))

2154
2154


In [54]:
X_train, X_test, y_train, y_test = train_test_split(inputs, labels, shuffle=True, train_size=0.80, test_size=0.20, random_state=42)

print(len(X_train), "-", len(y_train))
print(len(X_test), "-", len(y_test))
print(X_train[100], "-", y_train[100])

1723 - 1723
431 - 431
['not', 'lupine', 'living', 'on', 'sand', 'and', 'drouth'] - 1


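* A Markov model needs every unique word mapped to an integer so that it can process the text.
* Real-world text will contain words the model has never seen, so the vocabulary should be built from `X_train` only. Words that occur only in the test set are then treated as unknown.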

In [None]:
#Old code:

#all_x_train = ""
#for sentence in X_train:
#    all_x_train += ' '.join(sentence) + ' '
#unique_words = list(set(all_x_train.split()))
#unique_words.insert(0, '<UNKNOWN>')
##create a unique word dict
#word2idx = {word: i for i, word in enumerate(unique_words)}
#
#print(len(word2idx))

2604


In [90]:
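# Python strings are immutable, so building one large string with += is slow
# (each concatenation creates a new string).
# Faster way: collect the words directly in a set.

# Create an empty set
word_set = set()
# Add the words of each sentence straight into the set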
for sentence in X_train:
    word_set.update(sentence)
idx2word = list(word_set)
idx2word.insert(0, '<UNKNOWN>')

word2idx = {word:i for i, word in enumerate(idx2word)}
print(len(word2idx))
print(list(word2idx.items())[0:5])

2604
[('<UNKNOWN>', 0), ('grace', 1), ('mosses', 2), ('ghastly', 3), ('point', 4)]


In [123]:
print(X_train[0])
print([word2idx[word] for word in X_train[0]])

['and', 'many', 'must', 'have', 'seen', 'him', 'make']
[2086, 445, 94, 2076, 641, 117, 1232]


In [None]:
##turn inputs into numbers
#
#test_idx = []
#for sentence in test_list:
#    sample = []
#    for word in sentence:
#        try:            
#            sample.append(word2idx[word])
#        except:
#            sample.append(0)
#    test_idx.append(sample)
#test_idx

[[2086, 445, 94, 2076, 641, 117, 1232, 0, 445],
 [2457, 0, 838, 1540, 2257, 1052, 155, 2508, 403]]

In [114]:
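def word_numerizer(arr):
    """Convert a list of token lists into a list of integer-index lists."""
    int_list = []
    for sentence in arr:
        # Words missing from word2idx fall back to 0, the index of '<UNKNOWN>'
        sample = [word2idx.get(word, 0) for word in sentence]
        int_list.append(sample)
    return int_list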

In [115]:
X_train_int = word_numerizer(X_train)
X_test_int  = word_numerizer(X_test)

In [119]:
print(X_test[10])
print(X_test_int[10])

['in', 'joy', 'and', 'wo', 'in', 'good', 'and', 'ill']
[2508, 2207, 2086, 0, 2508, 434, 2086, 929]
