### Erik Holmgren
2018-10-04  
CS425: Assignment 3  
Dr. Hamid Chitsaz  

### 1:  Implementing the forward algorithm

In [3]:
import numpy as np

In [4]:
seq = [0.1, 0.3, 0.2, 0.4, 0.7, 0.9, 1, 1, 0.8, 0.9]
states = ['r', 'o'] # r -> hyper, o -> hypo
emissions = {0 : {'r' : 0.8, 'o' : 0.2}, 1 : {'r' : 0.2, 'o' : 0.8}}
transitions = {'r' : {'r' : 0.9, 'o' : 0.1}, 'o' : {'r' : 0.1, 'o' : 0.9}, 's' : {'r' : 0.5, 'o' : 0.5}}

In [174]:
def forward(X, Q, A, E):
    """
    X - emission sequence
    Q - state list
    A - state transition probabilities, indexed A[from_state][to_state] ('s' is the start state)
    E - emission probabilities, indexed E[symbol][state]
    """
    
    X = np.copy(np.round(X)) # we consider x in [0, .5] low and x in (.5, 1] high
    
    f = np.zeros((len(X), len(Q)))
    
    # initialize the first row: for every state s, the initial probability of s
    # times the emission probability of the first sequence element given s
    for r in range(len(Q)):
        f[0, r] = A['s'][Q[r]] * E[X[0]][Q[r]]
    
    # compute the recurrence for all sequence elements after the first, for all states
    for i in range(1, len(X)):
        for k in range(len(Q)):
            # sum over previous states r; A[Q[r]][Q[k]] is the transition r -> k
            f[i, k] = sum([f[i - 1, r] * A[Q[r]][Q[k]] * E[X[i]][Q[k]] for r in range(len(Q))])
                
    return np.sum(f[-1, :]), f

In [94]:
np.log(forward(seq, states, transitions, emissions)[0])

-5.449526801374106

Thus the natural log of the probability of the sequence 0.1, 0.3, 0.2, 0.4, 0.7, 0.9, 1, 1, 0.8, 0.9 under the HMM described above is -5.4495.
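
As a sanity check on this value, the likelihood can also be computed by brute force: enumerate all 2^10 state paths and sum the joint probability of the sequence and each path. The sketch below is self-contained and assumes the same model dictionaries as above; `brute_force_likelihood` is a name introduced here for illustration. It should agree with the forward algorithm up to floating-point error.

```python
import itertools
import numpy as np

seq = [0.1, 0.3, 0.2, 0.4, 0.7, 0.9, 1, 1, 0.8, 0.9]
states = ['r', 'o']
emissions = {0: {'r': 0.8, 'o': 0.2}, 1: {'r': 0.2, 'o': 0.8}}
transitions = {'r': {'r': 0.9, 'o': 0.1},
               'o': {'r': 0.1, 'o': 0.9},
               's': {'r': 0.5, 'o': 0.5}}

def brute_force_likelihood(X, Q, A, E):
    """Sum the joint probability P(X, path) over every possible state path."""
    symbols = [int(round(x)) for x in X]  # same low/high binarization as forward()
    total = 0.0
    for path in itertools.product(Q, repeat=len(symbols)):
        # start transition and first emission
        p = A['s'][path[0]] * E[symbols[0]][path[0]]
        # remaining transitions (previous state -> current state) and emissions
        for i in range(1, len(symbols)):
            p *= A[path[i - 1]][path[i]] * E[symbols[i]][path[i]]
        total += p
    return total

log_p = np.log(brute_force_likelihood(seq, states, transitions, emissions))
print(log_p)
```

Enumeration is exponential in the sequence length, so it is only usable as a check on short sequences like this one.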

### 2: Implementing the backward algorithm

In [95]:
def backward(X, Q, A, E):
    X = np.copy(np.round(X)) # we consider [0, .5] low and (.5, 1] high
            
    b = np.zeros((len(X), len(Q)))
    
    # initialize the backward probability for the last sequence element to 1 for each state
    b[-1,:] = 1
    
    # compute the recurrence for each sequence element, from the second-to-last back to the first, for each state
    for i in reversed(range(len(X) - 1)):
        for k in range(len(Q)):
            b[i, k] = sum([b[i + 1, r] * A[Q[k]][Q[r]] * E[X[i + 1]][Q[r]] for r in range(len(Q))])
    
    return b

In [96]:
def posterior_decode(X, Q, A, E):
    p, f = forward(X, Q, A, E)
    b = backward(X, Q, A, E)
    
    post = (f * b) / p
        
    return post

In [102]:
post_dec = posterior_decode(seq, states, transitions, emissions)
np.log(post_dec[4][states.index('r')])

-1.6215009686655406

The natural log of the probability that the 5th element of the sequence comes from a hypermethylated region is -1.6215.
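
For reference, the quantities computed by `forward`, `backward` and `posterior_decode` can be written as the recurrences below. The notation is introduced here and is not from the assignment: $a_{rk}$ is the probability of moving from state $r$ to state $k$ (`A[r][k]`), $a_{sk}$ is the start probability of state $k$ (`A['s'][k]`), $e_k(x_i)$ is the probability of emitting $x_i$ from state $k$ (`E[x_i][k]`), and $n$ is the sequence length.

```latex
\begin{align*}
f_1(k) &= a_{sk}\, e_k(x_1) \\
f_i(k) &= e_k(x_i) \sum_{r} f_{i-1}(r)\, a_{rk}, \qquad i = 2, \dots, n \\
P(X)   &= \sum_{k} f_n(k) \\
b_n(k) &= 1 \\
b_i(k) &= \sum_{r} a_{kr}\, e_r(x_{i+1})\, b_{i+1}(r), \qquad i = n-1, \dots, 1 \\
P(\pi_i = k \mid X) &= \frac{f_i(k)\, b_i(k)}{P(X)}
\end{align*}
```

For this HMM the first row is $f_1(r) = 0.5 \cdot 0.8 = 0.4$ and $f_1(o) = 0.5 \cdot 0.2 = 0.1$, since the first element rounds to the low symbol. The equations are 1-based while the arrays are 0-based, so the 5th element corresponds to row 4 of `post_dec`.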

### 3: Learning HMM parameters from mouse embryo stem cell genome

In [175]:
# pip install hmmlearn
from hmmlearn import hmm
import pandas as pd

In [110]:
GSE30202 = pd.read_csv('GSE30202_BisSeq_ES_CpGmeth.tsv', delimiter='\t')
GSE30202.head(5)

Unnamed: 0,chr,position,nTot,nMeth
0,chr1,3000574,37,33
1,chr1,3000726,5,5
2,chr1,3000901,2,0
3,chr1,3001346,13,12
4,chr1,3001394,17,12


In [124]:
GSE30202_chrY = GSE30202[(GSE30202['chr'] == 'chrY')]

In [132]:
GSE30202_chrY_nMeth_nTot = GSE30202_chrY['nMeth'] / GSE30202_chrY['nTot']
GSE30202_chrY_nMeth_nTot.head(5)

19068860    0.714286
19068861    0.833333
19068862    0.857143
19068863    1.000000
19068864    0.900000
dtype: float64

In [130]:
GSE30202_chrY_nMeth_nTot_np = np.array(GSE30202_chrY_nMeth_nTot)
GSE30202_chrY_nMeth_nTot_np

array([0.71428571, 0.83333333, 0.85714286, ..., 0.50980392, 0.10294118, 0.82352941])

In [149]:
GSE30202_chrY_nMeth_nTot_np = GSE30202_chrY_nMeth_nTot_np.round()
GSE30202_chrY_nMeth_nTot_np

array([1., 1., 1., ..., 1., 0., 1.])
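
`np.round` rounds halves to the nearest even value, so a ratio of exactly 0.5 becomes 0, which matches the "[0, .5] low, (.5, 1] high" convention used in the forward and backward functions. A minimal sketch of an equivalent explicit threshold is shown below; it also produces integer symbols rather than floats, which discrete-HMM libraries commonly expect (whether that applies to a particular library is an assumption to check against its documentation). The `ratios` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical methylation ratios (nMeth / nTot), not taken from the dataset.
ratios = np.array([0.0, 0.25, 0.5, 0.51, 0.75, 1.0])

rounded = np.round(ratios)            # float symbols; 0.5 rounds to 0.0
symbols = (ratios > 0.5).astype(int)  # integer symbols with the same cut-off

print(rounded)
print(symbols)
```

The explicit comparison makes the cut-off visible in the code instead of relying on the rounding rule.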

In [141]:
start_prob = [0.5, 0.5]
transmition_prob = [[0.9, 0.1],
                    [0.1, 0.9]]
emission_prob = [[0.8, 0.2],
                 [0.2, 0.8]]

In [137]:
GSE30202_model = hmm.MultinomialHMM(n_components=2, transmat_prior=transmition_prob, startprob_prior=start_prob, )
GSE30202_model.startprob_prior
GSE30202_model

MultinomialHMM(algorithm='viterbi', init_params='ste', n_components=2,
        n_iter=10, params='ste', random_state=None, startprob_prior=1.0,
        tol=0.01, transmat_prior=1.0, verbose=False)

In [140]:
GSE30202_model.fit(GSE30202_chrY_nMeth_nTot_np.reshape(-1, 1), len(GSE30202_chrY_nMeth_nTot_np))

ValueError: expected a sample from a Multinomial distribution.

#### **Ignore the hmmlearn cells above; I'm still trying to figure out how to use that library.**

In [166]:
# pip install hidden_markov
import hidden_markov

In [167]:
states_t = tuple(states)
alphabet_t = (0, 1)
start_prob = np.matrix('0.5, 0.5')
trans_prob = np.matrix('0.9, 0.1 ; 0.1, 0.9')
emissions_prob = np.matrix('0.8, 0.2 ; 0.2, 0.8')

In [168]:
observation_t = [tuple(GSE30202_chrY_nMeth_nTot_np)]
# observation_t

In [169]:
model = hidden_markov.hmm(states_t, alphabet_t, start_prob, trans_prob, emissions_prob)

In [170]:
e, t, s = model.train_hmm(observation_t, 1000, [len(observation_t[0])])
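
Parameter training for an HMM is usually done with Baum-Welch (expectation-maximization), and I assume that is what `train_hmm` does internally. The sketch below shows one Baum-Welch re-estimation step for a single integer-coded sequence, using scaled forward and backward passes so that long sequences do not underflow. It is not the `hidden_markov` implementation; `baum_welch_step` is a hypothetical name, and the matrices follow the convention `trans[from, to]` and `emit[state, symbol]`.

```python
import numpy as np

def baum_welch_step(obs, start, trans, emit):
    """One EM update for a discrete HMM on a single integer-coded sequence."""
    obs = np.asarray(obs)
    n, k = len(obs), len(start)

    # Scaled forward pass: each row of alpha is normalized to sum to 1.
    alpha = np.zeros((n, k))
    scale = np.zeros(n)
    alpha[0] = start * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
        scale[i] = alpha[i].sum()
        alpha[i] /= scale[i]

    # Scaled backward pass, reusing the forward scaling factors.
    beta = np.ones((n, k))
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1]) / scale[i + 1]

    # gamma[i, s]: posterior probability of state s at position i.
    gamma = alpha * beta

    # xi[s, t]: expected number of s -> t transitions over the sequence.
    xi = np.zeros((k, k))
    for i in range(n - 1):
        xi += np.outer(alpha[i], emit[:, obs[i + 1]] * beta[i + 1]) * trans / scale[i + 1]

    # Re-estimate the parameters from the expected counts.
    new_start = gamma[0]
    new_trans = xi / xi.sum(axis=1, keepdims=True)
    new_emit = np.zeros_like(emit)
    for sym in range(emit.shape[1]):
        new_emit[:, sym] = gamma[obs == sym].sum(axis=0)
    new_emit /= new_emit.sum(axis=1, keepdims=True)

    # Log-likelihood of obs under the parameters passed in (before the update).
    return new_start, new_trans, new_emit, np.log(scale).sum()

# Toy run on the rounded 10-element sequence from section 1.
obs = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.8, 0.2], [0.2, 0.8]])

log_likelihoods = []
for _ in range(5):
    start, trans, emit, ll = baum_welch_step(obs, start, trans, emit)
    log_likelihoods.append(ll)
print(log_likelihoods)
```

The first log-likelihood in the list is the forward probability under the initial parameters, and each later value should be no smaller than the one before it, which is a quick way to check an EM implementation.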

In [171]:
e

matrix([[0.76217898, 0.23782102],
        [0.05259006, 0.94740994]])

In [172]:
t

matrix([[0.81121887, 0.18878113],
        [0.01217247, 0.98782753]])

In [173]:
s

matrix([[0.00329042, 0.99670958]])
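
To read these matrices it helps to attach the state and symbol labels explicitly. The sketch below copies the values from the outputs above and assumes the orientation in which the initial matrices were passed to the library: rows of `e` and `t` follow the order of `states_t` (`'r'` = hyper, `'o'` = hypo), and columns of `e` follow the order of `alphabet_t` (0 = low, 1 = high). The label lists are names introduced here.

```python
import numpy as np

state_names = ['hyper', 'hypo']  # order of states_t = ('r', 'o')
symbol_names = ['low', 'high']   # order of alphabet_t = (0, 1)

# Values copied from the e, t and s outputs above.
e = np.array([[0.76217898, 0.23782102],
              [0.05259006, 0.94740994]])
t = np.array([[0.81121887, 0.18878113],
              [0.01217247, 0.98782753]])
s = np.array([0.00329042, 0.99670958])

# Assumption: e[i, j] = P(symbol j | state i), t[i, j] = P(next state j | state i).
for i, state in enumerate(state_names):
    print(f"P(start in {state}) = {s[i]:.4f}")
    for j, sym in enumerate(symbol_names):
        print(f"P({sym} | {state}) = {e[i, j]:.4f}")
    for j, nxt in enumerate(state_names):
        print(f"P({state} -> {nxt}) = {t[i, j]:.4f}")
```

Under this reading every row of `e` and `t` is a probability distribution, which is consistent with each row summing to 1.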

### Learned probabilities:
emissions, P(observation | state), so each column sums to 1:  

|   | Hyper   | Hypo   |
|---|---------|--------|
| **Low** | 0.7622 | 0.0526 |
| **High** | 0.2378 | 0.9474 |

transitions (row = from, column = to):  

| | Hyper | Hypo |
|---|---|---|
|__Hyper__| 0.8112 | 0.1888 |
|__Hypo__| 0.0122 | 0.9878 |

start:

|Hyper|Hypo|
|---|---|
|0.0033|0.9967|