## Exercise 12.17

In [1]:
import numpy as np
import string
from hmmlearn import hmm

### Part 1

In [2]:
def vec_translate(a, my_dict):
    # translate array from symbols to state numbers or vice versa
    
    return np.vectorize(my_dict.__getitem__)(a)

def prep_data(filename):
    
    # Get the data as a single string
    with open(filename) as f:
        data = f.read().lower() # read and convert to lower case
        
    # remove punctuation and newlines
    remove_punct = {ord(char) : None for char in string.punctuation+"\n\r"}
    data = data.translate(remove_punct)
    symbls = sorted(list(set(data)))
    
    # convert the data to a NumPy array of symbols
    a = np.array(list(data))
                
    # make a conversion dict from symbols to state number
    symbls_to_obs = {x:i for i,x in enumerate(symbls)}
    
    # convert the symbols in a to state numbers
    obs_sequence = vec_translate(a, symbls_to_obs)
    
    return symbls, obs_sequence

symbols, obs = prep_data('declaration.txt')
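To make the encoding step concrete, here is a minimal plain-Python sketch of what `prep_data` does, applied to a short in-memory string instead of a file. The helper name `encode_text` and the sample string are illustrative assumptions, not part of the exercise.

```python
import string

def encode_text(text):
    """Hypothetical helper mirroring prep_data on a string rather than a file."""
    # Lower-case, then delete punctuation and line breaks (same table as prep_data).
    remove_punct = {ord(char): None for char in string.punctuation + "\n\r"}
    cleaned = text.lower().translate(remove_punct)
    # Sorted unique characters define the observation alphabet.
    alphabet = sorted(set(cleaned))
    to_index = {ch: i for i, ch in enumerate(alphabet)}
    return alphabet, [to_index[ch] for ch in cleaned]

demo_symbols, demo_obs = encode_text("Hi, Bob!\nBye.")
print(demo_symbols)
print(demo_obs)
```

Because line breaks are deleted rather than replaced by a space, "Bob!" and "Bye." end up joined as "bobbye" here. The same would happen in `prep_data` to any two words separated only by a line break.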

### Part 3

In [3]:
# In hmmlearn >= 0.3, this discrete-symbol model is provided as hmm.CategoricalHMM
dec = hmm.MultinomialHMM(n_components=2, n_iter=200, tol=0.0001)

### Part 4

In [4]:
dec.fit(obs.reshape(-1, 1))

MultinomialHMM(n_components=2, n_iter=200,
               random_state=RandomState(MT19937) at 0x7F19D8538340, tol=0.0001)

### Part 5

In [5]:
B = dec.emissionprob_.T
for i,b in enumerate(B):
    print(f"{symbols[i]} : {b[0]:0.4f}, {b[1]:0.4f}")

  : 0.2992, 0.0495
a : 0.1316, 0.0000
b : 0.0000, 0.0226
c : 0.0000, 0.0438
d : 0.0000, 0.0600
e : 0.2370, 0.0000
f : 0.0000, 0.0428
g : 0.0000, 0.0309
h : 0.0003, 0.0828
i : 0.1239, 0.0000
j : 0.0000, 0.0038
k : 0.0004, 0.0030
l : 0.0000, 0.0543
m : 0.0000, 0.0343
n : 0.0000, 0.1149
o : 0.1382, 0.0029
p : 0.0000, 0.0328
q : 0.0000, 0.0014
r : 0.0000, 0.1011
s : 0.0000, 0.1138
t : 0.0000, 0.1523
u : 0.0577, 0.0000
v : 0.0000, 0.0176
w : 0.0000, 0.0231
x : 0.0000, 0.0021
y : 0.0117, 0.0092
z : 0.0000, 0.0010


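As mentioned in the book, the two states separate vowels from consonants: the first column puts its mass on the space and the vowels a, e, i, o, u, while the second column covers the consonants (y is split between the two).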
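One way to read a table like this is to rank the symbols within each state. The sketch below does that for a handful of rows copied from the two-state output; the dictionary `rows` and the helper `top_symbols` are illustrative names, not part of the exercise.

```python
# A few rows copied from the two-state table: symbol -> (state 0, state 1).
rows = {
    " ": (0.2992, 0.0495),
    "a": (0.1316, 0.0000),
    "e": (0.2370, 0.0000),
    "h": (0.0003, 0.0828),
    "i": (0.1239, 0.0000),
    "n": (0.0000, 0.1149),
    "o": (0.1382, 0.0029),
    "r": (0.0000, 0.1011),
    "s": (0.0000, 0.1138),
    "t": (0.0000, 0.1523),
}

def top_symbols(emissions, state, k=3):
    """Return the k symbols with the largest emission probability in a state."""
    ranked = sorted(emissions, key=lambda sym: emissions[sym][state], reverse=True)
    return ranked[:k]

for state in range(2):
    print(state, top_symbols(rows, state))
```

On the fitted model itself, the same ranking could presumably be obtained by applying `np.argsort` to each row of `dec.emissionprob_`.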

In [6]:
dec = hmm.MultinomialHMM(n_components=3, n_iter=200, tol=0.0001)
dec.fit(obs.reshape(-1, 1))
B = dec.emissionprob_.T
for i,b in enumerate(B):
    print(f"{symbols[i]} : {b[0]:0.4f}, {b[1]:0.4f}, {b[2]:0.4f}")

  : 0.2712, 0.2076, 0.0164
a : 0.1735, 0.0000, 0.0000
b : 0.0000, 0.0018, 0.0342
c : 0.0000, 0.0170, 0.0539
d : 0.0000, 0.0690, 0.0320
e : 0.1822, 0.1478, 0.0000
f : 0.0000, 0.0103, 0.0584
g : 0.0000, 0.0330, 0.0189
h : 0.0000, 0.1440, 0.0000
i : 0.1427, 0.0234, 0.0000
j : 0.0000, 0.0049, 0.0016
k : 0.0009, 0.0028, 0.0017
l : 0.0000, 0.0426, 0.0470
m : 0.0011, 0.0146, 0.0399
n : 0.0000, 0.0018, 0.1804
o : 0.1505, 0.0409, 0.0000
p : 0.0000, 0.0141, 0.0391
q : 0.0000, 0.0016, 0.0008
r : 0.0001, 0.0397, 0.1238
s : 0.0016, 0.0639, 0.1200
t : 0.0000, 0.0832, 0.1652
u : 0.0760, 0.0000, 0.0000
v : 0.0000, 0.0046, 0.0237
w : 0.0000, 0.0020, 0.0347
x : 0.0000, 0.0000, 0.0034
y : 0.0000, 0.0295, 0.0036
z : 0.0000, 0.0000, 0.0015


The first state is still a vowel state: apart from the space, every letter with a large entry in the first column is a vowel. 
The second column is mixed: it puts most of its mass on the space, e, and h, with the rest spread over consonants such as t, d, and s. 
The final column is a pure consonant state, led by n, t, r, and s. 
Of the letters usually listed as the most common in English ($E, A, R, I, O, T, N, S$), only the consonants $R, T, N, S$ appear in the final column; the vowels have probability zero there. 
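A quick way to check which state is "the vowel state" is to add up the emission probability that each state assigns to the vowels. The sketch below does this with the vowel rows and the space row copied from the three-state table; the variable names are illustrative.

```python
# Vowel rows copied from the three-state table: letter -> (state 0, state 1, state 2).
vowel_rows = {
    "a": (0.1735, 0.0000, 0.0000),
    "e": (0.1822, 0.1478, 0.0000),
    "i": (0.1427, 0.0234, 0.0000),
    "o": (0.1505, 0.0409, 0.0000),
    "u": (0.0760, 0.0000, 0.0000),
}
space_row = (0.2712, 0.2076, 0.0164)

# Total probability each state assigns to vowels, without and with the space.
vowel_mass = [sum(row[s] for row in vowel_rows.values()) for s in range(3)]
vowel_and_space = [vowel_mass[s] + space_row[s] for s in range(3)]

for s in range(3):
    print(f"state {s}: vowels {vowel_mass[s]:.4f}, vowels + space {vowel_and_space[s]:.4f}")
```

The first state spends almost all of its mass on vowels and the space, the second a little under half, and the third essentially none, which matches the reading of the table.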

In [7]:
dec = hmm.MultinomialHMM(n_components=4, n_iter=200, tol=0.0001)
dec.fit(obs.reshape(-1, 1))
B = dec.emissionprob_.T
for i,b in enumerate(B):
    print(f"{symbols[i]} : {b[0]:0.4f}, {b[1]:0.4f}, {b[2]:0.4f}, {b[3]:0.4f}")

  : 0.2678, 0.0027, 0.0000, 0.4267
a : 0.1944, 0.0000, 0.0036, 0.0000
b : 0.0000, 0.0382, 0.0020, 0.0020
c : 0.0000, 0.0524, 0.0393, 0.0000
d : 0.0000, 0.0152, 0.1337, 0.0031
e : 0.1222, 0.0000, 0.0000, 0.3773
f : 0.0000, 0.0624, 0.0221, 0.0000
g : 0.0000, 0.0184, 0.0549, 0.0000
h : 0.0000, 0.0000, 0.2200, 0.0000
i : 0.1581, 0.0000, 0.0139, 0.0294
j : 0.0000, 0.0004, 0.0095, 0.0000
k : 0.0011, 0.0017, 0.0047, 0.0000
l : 0.0000, 0.0485, 0.0641, 0.0092
m : 0.0024, 0.0363, 0.0301, 0.0040
n : 0.0000, 0.1994, 0.0123, 0.0000
o : 0.1650, 0.0000, 0.0578, 0.0143
p : 0.0000, 0.0371, 0.0304, 0.0024
q : 0.0000, 0.0000, 0.0038, 0.0000
r : 0.0063, 0.1227, 0.0637, 0.0158
s : 0.0009, 0.1117, 0.1179, 0.0196
t : 0.0000, 0.1831, 0.0664, 0.0731
u : 0.0818, 0.0000, 0.0000, 0.0071
v : 0.0000, 0.0252, 0.0097, 0.0000
w : 0.0000, 0.0368, 0.0073, 0.0000
x : 0.0000, 0.0039, 0.0000, 0.0000
y : 0.0000, 0.0024, 0.0326, 0.0159
z : 0.0000, 0.0017, 0.0000, 0.0000


Again, the first state is a vowel state: apart from the space, the letters with large entries in the first column are a, e, i, o, and u. 
The second column is a consonant state led by n, t, r, and s. 
The third column is also mostly consonants, led by h, d, and s, with a small share of o. 
The final column is dominated by the space (0.4267) and e (0.3773), with t a distant third. 
It therefore does not track the most common letters overall ($E, A, R, I, O, T, N, S$): a and n have probability zero there, and r, i, o, and s are all below 0.03. 

## Exercise 12.18

In [8]:
symbols, obs = prep_data('WarAndPeace.txt')
dec = hmm.MultinomialHMM(n_components=2, n_iter=200, tol=0.0001)
dec.fit(obs.reshape(-1, 1))
B = dec.emissionprob_.T
for i,b in enumerate(B):
    print(f"{symbols[i]} : {b[0]:0.4f}, {b[1]:0.4f}")

  : 0.2146, 0.0877
а : 0.0000, 0.1760
б : 0.0250, 0.0000
в : 0.0655, 0.0000
г : 0.0296, 0.0000
д : 0.0385, 0.0000
е : 0.0180, 0.1427
ж : 0.0140, 0.0000
з : 0.0252, 0.0000
и : 0.0016, 0.1315
й : 0.0149, 0.0000
к : 0.0497, 0.0010
л : 0.0719, 0.0000
м : 0.0381, 0.0000
н : 0.0973, 0.0000
о : 0.0000, 0.2407
п : 0.0346, 0.0062
р : 0.0597, 0.0000
с : 0.0513, 0.0280
т : 0.0780, 0.0000
у : 0.0000, 0.0590
ф : 0.0018, 0.0003
х : 0.0111, 0.0000
ц : 0.0049, 0.0000
ч : 0.0167, 0.0038
ш : 0.0109, 0.0000
щ : 0.0047, 0.0000
ъ : 0.0003, 0.0003
ы : 0.0000, 0.0376
ь : 0.0009, 0.0433
э : 0.0000, 0.0066
ю : 0.0079, 0.0024
я : 0.0128, 0.0328
ё : 0.0000, 0.0001


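It looks like the second state is the vowel state: the Cyrillic characters а, е, и, and о carry most of its mass, followed by у, ы, я, and э, while the first state covers the consonants. 
The soft sign ь also falls in the second state even though it is not a vowel, and ю leans toward the first state. 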
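The vowel/consonant reading can be checked mechanically by assigning each letter to the state that gives it the larger emission probability and comparing the result with the ten Russian vowel letters. The sketch below uses the rows of the two-state table above; the vowel set and all names are supplied here as assumptions for illustration.

```python
# The ten Russian vowel letters (supplied for this check, not taken from the exercise).
RUSSIAN_VOWELS = set("аеёиоуыэюя")

# Rows copied from the two-state table: letter, state 0, state 1 (space omitted).
TABLE = """
а 0.0000 0.1760
б 0.0250 0.0000
в 0.0655 0.0000
г 0.0296 0.0000
д 0.0385 0.0000
е 0.0180 0.1427
ж 0.0140 0.0000
з 0.0252 0.0000
и 0.0016 0.1315
й 0.0149 0.0000
к 0.0497 0.0010
л 0.0719 0.0000
м 0.0381 0.0000
н 0.0973 0.0000
о 0.0000 0.2407
п 0.0346 0.0062
р 0.0597 0.0000
с 0.0513 0.0280
т 0.0780 0.0000
у 0.0000 0.0590
ф 0.0018 0.0003
х 0.0111 0.0000
ц 0.0049 0.0000
ч 0.0167 0.0038
ш 0.0109 0.0000
щ 0.0047 0.0000
ъ 0.0003 0.0003
ы 0.0000 0.0376
ь 0.0009 0.0433
э 0.0000 0.0066
ю 0.0079 0.0024
я 0.0128 0.0328
ё 0.0000 0.0001
"""

emissions = {}
for line in TABLE.split("\n"):
    if line.strip():
        letter, p0, p1 = line.split()
        emissions[letter] = (float(p0), float(p1))

# Letters whose emission probability is strictly larger in state 1.
state1_letters = {ch for ch, (p0, p1) in emissions.items() if p1 > p0}

print("in state 1 but not a vowel:", sorted(state1_letters - RUSSIAN_VOWELS))
print("vowel but not in state 1:", sorted(RUSSIAN_VOWELS - state1_letters))
```

Under this rule the second state recovers nine of the ten vowel letters; the only disagreements are ь and ю.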

In [9]:
dec = hmm.MultinomialHMM(n_components=3, n_iter=200, tol=0.0001)
dec.fit(obs.reshape(-1, 1))
B = dec.emissionprob_.T
for i,b in enumerate(B):
    print(f"{symbols[i]} : {b[0]:0.4f}, {b[1]:0.4f}, {b[2]:0.4f}")

  : 0.4197, 0.0426, 0.0979
а : 0.0000, 0.1967, 0.0000
б : 0.0103, 0.0000, 0.0327
в : 0.0466, 0.0000, 0.0716
г : 0.0142, 0.0000, 0.0372
д : 0.0216, 0.0010, 0.0453
е : 0.0486, 0.1535, 0.0000
ж : 0.0050, 0.0000, 0.0188
з : 0.0186, 0.0000, 0.0272
и : 0.0095, 0.1427, 0.0000
й : 0.0337, 0.0000, 0.0000
к : 0.0316, 0.0000, 0.0581
л : 0.0322, 0.0000, 0.0920
м : 0.0249, 0.0000, 0.0433
н : 0.0349, 0.0000, 0.1307
о : 0.0026, 0.2672, 0.0000
п : 0.0218, 0.0000, 0.0464
р : 0.0124, 0.0000, 0.0865
с : 0.0960, 0.0002, 0.0434
т : 0.0286, 0.0000, 0.1042
у : 0.0018, 0.0646, 0.0000
ф : 0.0016, 0.0000, 0.0021
х : 0.0077, 0.0000, 0.0122
ц : 0.0004, 0.0000, 0.0076
ч : 0.0149, 0.0000, 0.0202
ш : 0.0038, 0.0000, 0.0147
щ : 0.0000, 0.0000, 0.0076
ъ : 0.0000, 0.0006, 0.0002
ы : 0.0000, 0.0420, 0.0000
ь : 0.0000, 0.0500, 0.0000
э : 0.0098, 0.0000, 0.0000
ю : 0.0175, 0.0029, 0.0000
я : 0.0299, 0.0359, 0.0000
ё : 0.0000, 0.0001, 0.0000


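Again, one state captures the vowels: the second column puts its mass on а, е, и, о, у, ы, and я, and it also holds the soft sign ь. 
The third column contains only the space and consonants, led by н, т, л, and р. 
The first column is dominated by the space (0.4197) and mixes the remaining consonant mass with й, э, ю, and я, so the split is not simply between common and rare characters. 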