# Introduction

This IPython notebook is my written report and source code for COMPSCI 369 Assignment 3. To generate a PDF version of the report, run `rake` at the command line; the output is written to `report.pdf`. Building the report requires the following software:

* Graphviz
* IPython 3
* \LaTeX
* NumPy
* pandoc
* R
* Rake

In [None]:
from IPython.display import display_markdown, display_pdf
import itertools as it
import numpy as np
import subprocess

# Question 1

Consider the following HMM for secondary structure in protein sequences. The secondary structure at a residue is either an $\alpha$-helix, a $\beta$-strand, or a loop, represented respectively by the states:

In [None]:
STATES = ('H', 'E', 'T')

The transition probabilities of this HMM are shown in the following diagram. The initial state of the sequence is distributed uniformly over the three states.

In [None]:
A = {
    'H': {'H': 15/16, 'E': 3/160, 'T': 7/160},
    'E': {'H': 1/15, 'E': 5/6, 'T': 1/10},
    'T': {'H': 1/8, 'E': 3/4, 'T': 1/8},
}

edges = ('{} -> {} [label = {:.5g}];'.format(x, y, A[x][y])
         for x, y in it.product(STATES, repeat=2))
hmm = 'digraph {{{}}}'.format(''.join(edges)).encode()
display_pdf(subprocess.check_output(['dot', '-Tpdf'], input=hmm), raw=True)

Each state emits a hydrophobic, hydrophilic, or neutral amino acid, represented respectively by the symbols:

In [None]:
EMISSIONS = ('B', 'I', 'N')

The emission probabilities for each state are given by the following table.

In [None]:
E = {
    'H': {'B': 0.3, 'I': 0.5, 'N': 0.2},
    'E': {'B': 0.15, 'I': 0.55, 'N': 0.3},
    'T': {'B': 0.4, 'I': 0.1, 'N': 0.5},
}

|       |**B**|**I**|**N**|
|:-----:|:---:|:---:|:---:|
| **H** | 0.3 | 0.5 | 0.2 |
| **E** | 0.15| 0.55| 0.3 |
| **T** | 0.4 | 0.1 | 0.5 |
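
Each row of the transition table `A` and of the emission table `E` is a probability distribution, so it should sum to one. A minimal sketch of that check follows. It repeats the two tables so that the snippet stands alone, and the helper name `is_row_stochastic` is hypothetical rather than part of the assignment code.

```python
import math

A = {
    'H': {'H': 15/16, 'E': 3/160, 'T': 7/160},
    'E': {'H': 1/15, 'E': 5/6, 'T': 1/10},
    'T': {'H': 1/8, 'E': 3/4, 'T': 1/8},
}
E = {
    'H': {'B': 0.3, 'I': 0.5, 'N': 0.2},
    'E': {'B': 0.15, 'I': 0.55, 'N': 0.3},
    'T': {'B': 0.4, 'I': 0.1, 'N': 0.5},
}

def is_row_stochastic(table):
    """True if every row of a nested-dict table sums to 1 up to rounding."""
    return all(math.isclose(sum(row.values()), 1.0) for row in table.values())

print(is_row_stochastic(A), is_row_stochastic(E))
```

A check like this is cheap to run before simulating, since `np.random.choice` rejects a `p` argument that does not sum to one.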

Here, we simulate a sequence of length 200 under the HMM.

In [None]:
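def d2l(d, K):
    # Dict-to-list: the values of d in the order given by the keys K.
    return [d[k] for k in K]

def simulate_hmm(length):
    # Returns (states, symbols), two lists with `length` entries each.
    states = []
    symbols = []
    if length > 0:
        # The initial state is uniform over STATES.
        states.append(np.random.choice(STATES))
        symbols.append(np.random.choice(EMISSIONS, p=d2l(E[states[-1]], EMISSIONS)))
        for _ in range(1, length):
            # Transition from the previous state, then emit from the new one.
            states.append(np.random.choice(STATES, p=d2l(A[states[-1]], STATES)))
            symbols.append(np.random.choice(EMISSIONS, p=d2l(E[states[-1]], EMISSIONS)))
    return states, symbols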

states, symbols = simulate_hmm(200)
print('\n'.join(','.join(s) for s in np.array_split(states, 5)))
print()
print('\n'.join(','.join(s) for s in np.array_split(symbols, 5)))
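
The simulation above draws from NumPy's global random state, so each run of the notebook produces a different sequence. If a reproducible report is wanted, one option is a seeded generator. The sketch below assumes NumPy 1.17 or later (for `np.random.default_rng`); the function name `simulate_hmm_seeded` and its `seed` parameter are hypothetical, and the tables are repeated so that the snippet stands alone.

```python
import numpy as np

STATES = ('H', 'E', 'T')
EMISSIONS = ('B', 'I', 'N')
A = {
    'H': {'H': 15/16, 'E': 3/160, 'T': 7/160},
    'E': {'H': 1/15, 'E': 5/6, 'T': 1/10},
    'T': {'H': 1/8, 'E': 3/4, 'T': 1/8},
}
E = {
    'H': {'B': 0.3, 'I': 0.5, 'N': 0.2},
    'E': {'B': 0.15, 'I': 0.55, 'N': 0.3},
    'T': {'B': 0.4, 'I': 0.1, 'N': 0.5},
}

def simulate_hmm_seeded(length, seed):
    """Simulate (states, symbols) from the HMM using a seeded generator."""
    rng = np.random.default_rng(seed)
    states, symbols = [], []
    for i in range(length):
        # p=None means a uniform draw, used only for the initial state.
        p = None if i == 0 else [A[states[-1]][s] for s in STATES]
        states.append(str(rng.choice(STATES, p=p)))
        # Emit a symbol from the state that was just chosen.
        q = [E[states[-1]][e] for e in EMISSIONS]
        symbols.append(str(rng.choice(EMISSIONS, p=q)))
    return states, symbols

first = simulate_hmm_seeded(20, seed=0)
second = simulate_hmm_seeded(20, seed=0)
print(first == second)  # the same seed reproduces the same sequence
```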

Then we calculate the base-2 logarithm of the joint probability, $\log_2 P\left(x,\pi\right)$, of the simulated states $\pi$ and symbols $x$.
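
The code accumulates this quantity term by term. As a sketch, writing $a_{kl}$ for the transition probabilities `A`, $e_k(b)$ for the emission probabilities `E`, and $L$ for the sequence length (notation assumed here rather than taken from the assignment), the uniform initial distribution over the three states gives

$$\log_2 P\left(x,\pi\right) = -\log_2 3 + \log_2 e_{\pi_1}\left(x_1\right) + \sum_{i=2}^{L}\left[\log_2 a_{\pi_{i-1}\pi_i} + \log_2 e_{\pi_i}\left(x_i\right)\right]$$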

In [None]:
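def joint_logp(states, symbols):
    # Base-2 log of P(x, pi) for a state path and the symbols it emitted.
    logp = 0
    if len(states) > 0:
        # Uniform initial state, then the first emission.
        logp -= np.log2(len(STATES))
        logp += np.log2(E[states[0]][symbols[0]])
        for i in range(1, len(states)):
            # One transition and one emission per remaining position.
            logp += np.log2(A[states[i-1]][states[i]])
            logp += np.log2(E[states[i]][symbols[i]])
    return logp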

joint_logp(states, symbols)

Given that
$$\pi = H,H,H,H,H,T,T,E,E,E,H,H,H,H,H,H,E,E,E,E,E,E$$
$$x = N,I,N,B,N,I,I,B,N,I,B,B,I,N,B,I,I,N,B,B,N,B$$
then $\log_2 P\left(x,\pi\right)$ is

In [None]:
pi = ['H','H','H','H','H','T','T','E','E','E','H','H','H','H','H','H','E','E','E','E','E','E']
x = ['N','I','N','B','N','I','I','B','N','I','B','B','I','N','B','I','I','N','B','B','N','B']
joint_logp(pi, x)

The base-2 log probability of the simulated symbols, $\log_2 P\left(x\right)$, is given by the forward algorithm.
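
The probability of a long sequence is very small, so the sums inside the forward algorithm are usually evaluated in log space by factoring out the largest term first. A minimal sketch of why that matters, using a hypothetical helper `stable_log2sum` and made-up log values chosen only to trigger underflow:

```python
import numpy as np

def stable_log2sum(x):
    """Return log2(sum(2**x)) without underflow (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    # Factor out 2**m so that the largest term becomes exactly 1.
    return m + np.log2(np.sum(np.exp2(x - m)))

log_terms = np.array([-2000.0, -2001.0])

# Naive evaluation: 2**-2000 underflows to 0.0, so the logarithm is -inf.
with np.errstate(divide='ignore'):
    naive = np.log2(np.sum(np.exp2(log_terms)))

# Stable evaluation: equals -2000 + log2(1.5).
stable = stable_log2sum(log_terms)
print(naive, stable)
```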

In [None]:
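def log2sum(x):
    # Stable log2(sum(2**x)): subtract the maximum before exponentiating.
    m = np.max(x)
    return m + np.log2(np.sum(np.exp2(x - m)))

def forward(x):
    if len(x) == 0:
        return 0.0
    # Initialisation: uniform initial state times the first emission.
    logp = np.array([np.log2(E[s][x[0]]) for s in STATES]) - np.log2(len(STATES))
    # Recursion in log2 space: emit from state s after arriving from any state r.
    for sym in x[1:]:
        logp = np.array([np.log2(E[s][sym])
                         + log2sum(logp + np.log2([A[r][s] for r in STATES]))
                         for s in STATES])
    # Termination: sum over the final state.
    return log2sum(logp)

forward(symbols)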
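
One way to check a forward implementation on short inputs is to compare it against brute-force enumeration, since $P(x)$ is the sum of $P(x,\pi)$ over every state path $\pi$. The sketch below works in probability space for readability, which is only safe for short sequences. The helper names `path_prob`, `brute_force_log2p`, and `forward_log2p` are hypothetical, and the tables are repeated so that the snippet stands alone.

```python
import itertools as it
import math

STATES = ('H', 'E', 'T')
A = {
    'H': {'H': 15/16, 'E': 3/160, 'T': 7/160},
    'E': {'H': 1/15, 'E': 5/6, 'T': 1/10},
    'T': {'H': 1/8, 'E': 3/4, 'T': 1/8},
}
E = {
    'H': {'B': 0.3, 'I': 0.5, 'N': 0.2},
    'E': {'B': 0.15, 'I': 0.55, 'N': 0.3},
    'T': {'B': 0.4, 'I': 0.1, 'N': 0.5},
}

def path_prob(path, x):
    """Joint probability P(x, path) with a uniform initial state."""
    p = E[path[0]][x[0]] / len(STATES)
    for i in range(1, len(x)):
        p *= A[path[i - 1]][path[i]] * E[path[i]][x[i]]
    return p

def brute_force_log2p(x):
    """log2 P(x) by summing P(x, path) over all 3**len(x) state paths."""
    paths = it.product(STATES, repeat=len(x))
    return math.log2(sum(path_prob(path, x) for path in paths))

def forward_log2p(x):
    """log2 P(x) by the forward recursion, in probability space."""
    f = {s: E[s][x[0]] / len(STATES) for s in STATES}
    for sym in x[1:]:
        # New forward value for s: emit sym from s after arriving from any r.
        f = {s: E[s][sym] * sum(f[r] * A[r][s] for r in STATES)
             for s in STATES}
    return math.log2(sum(f.values()))

x_short = ['N', 'I', 'N', 'B']
print(brute_force_log2p(x_short), forward_log2p(x_short))
```

The two printed values should agree up to floating-point rounding. Enumeration grows as $3^L$, so it is practical only as a test on sequences of a few symbols.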


# Question 2