In [37]:
import torch
import numpy as np

# Data Engineering

Suppose we are given an amino acid sequence s and a class label sequence c, both of length n.
For simplicity, we assume there are 20 amino acids. We define a bijection from the amino acids to the integers $\{1, \dots, 20\}$ and map each amino acid sequence s to a vector $s' \in R^n$ whose $i^{th}$ entry is the integer assigned to the $i^{th}$ amino acid of s.

We handle secondary structures in the same way. We assume three class labels: alpha helix, beta sheet, and no secondary structure. We map each class label sequence c to a vector $c' \in R^n$ whose $i^{th}$ entry is the integer assigned to the $i^{th}$ class label of c.

In [22]:
def aa_to_vec(s):
    # Sort the 20 amino acid letters so the mapping is deterministic (A -> 1, ..., Y -> 20).
    amino_acids = sorted("AGILPVFWYDERHKSTCMNQ")
    aa_mapping = {letter: i + 1 for i, letter in enumerate(amino_acids)}
    return [aa_mapping[letter] for letter in s]
        

def ss_to_vec(c):
    #TODO
    return list(c)

# shorthand aliases
aa = aa_to_vec
ss = ss_to_vec
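
The label mapping can follow the same pattern as `aa_to_vec`. The sketch below assumes a hypothetical three-letter label alphabet (`H` for alpha helix, `E` for beta sheet, `-` for no secondary structure). The symbols actually used in the data file are not stated above and may differ, so `SS_LABELS` and `ss_to_vec_sketch` are illustrative names only.

```python
# Hypothetical label alphabet: the symbols used in the real data may differ.
SS_LABELS = ("H", "E", "-")  # helix, sheet, none (assumed)


def ss_to_vec_sketch(c, labels=SS_LABELS):
    """Map a class label string to a list of integers in [1, len(labels)]."""
    ss_mapping = {label: i + 1 for i, label in enumerate(labels)}
    return [ss_mapping[label] for label in c]


example = ss_to_vec_sketch("HHE-")
```

An unknown label raises a `KeyError`, which is probably preferable to silently passing unmapped characters through.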

# Model Parameterization

For notation, let $\theta$ denote the hidden Markov model parameters (state transition probabilities and symbol emission probabilities), and let $\phi$ denote the class emission parameters.

Our objective is to maximize the conditional probability of the class label sequence c given the amino acid sequence s, the hidden Markov model parameters $\theta$, and the class emission parameters $\phi$.

Choosing the number of hidden states usually requires some expert insight. Here, we adopt the hidden Markov model setup introduced in assignment two, which has two hidden states A and B. This gives 4 state transition probabilities $(t_{aa}, t_{ab}, t_{ba}, t_{bb})$, 20 symbol emission probabilities for state A $(e_{a1}, \dots, e_{a20})$, and 20 symbol emission probabilities for state B $(e_{b1}, \dots, e_{b20})$, for a total of 44 parameters in $\theta$.


In this setup, we also have 3 class emission probabilities for state A $(\phi_{a1}, \phi_{a2}, \phi_{a3})$ and 3 class emission probabilities for state B $(\phi_{b1}, \phi_{b2}, \phi_{b3})$, for a total of 6 parameters in $\phi$.

For simplicity, we initialize these variables to a uniform distribution.

In [11]:
def init_theta():
    return [1 / 44 for _ in range(44)]

def init_phi():
    return [1 / 6 for _ in range(6)]
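
One caveat with assigning the same value to all 44 entries is that the transition and emission parameters of each state then do not sum to 1. A minimal sketch of a per-distribution uniform initialization follows, assuming the parameters are stored as dictionaries with one normalized row per state. That layout, and the names `init_theta_rows` and `init_phi_rows`, are choices made for this example rather than the representation used by `init_theta` and `init_phi`.

```python
STATES = ("A", "B")
N_SYMBOLS = 20  # amino acids
N_CLASSES = 3   # alpha helix, beta sheet, no secondary structure


def init_theta_rows():
    """Uniform transitions and symbol emissions, one normalized row per state."""
    return {
        "transition": {a: {b: 1 / len(STATES) for b in STATES} for a in STATES},
        "emission": {a: [1 / N_SYMBOLS] * N_SYMBOLS for a in STATES},
    }


def init_phi_rows():
    """Uniform class emissions, one normalized row per state."""
    return {a: [1 / N_CLASSES] * N_CLASSES for a in STATES}


theta0 = init_theta_rows()
phi0 = init_phi_rows()
```

The parameter count is unchanged (4 transitions plus 40 symbol emissions, and 6 class emissions), but each row is now a valid probability distribution.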

# Gradient Calculations

We can turn our objective into a minimization problem by minimizing the negative log likelihood instead.

$$\hat{\theta} = \arg \max_{\theta} P(c | s, \theta, \phi) = \arg \max_{\theta} \frac{P(c, s | \theta, \phi)}{P(s | \theta)}$$
$$\hat{\theta} = \arg \min_{\theta} -\log\left(\frac{P(c, s | \theta, \phi)}{P(s | \theta)}\right)$$

Writing $L$ for the negative log likelihood above, we can express its gradient in terms of quantities computed by the forward-backward algorithm.

$$\frac{dL}{d\theta_k} = -\frac {m_k(c, s) - n_k(s)}{\theta_k}$$

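$$n_k(s) := \text{expected number of times } \theta_k \text{ is used along a hidden state path, given } s$$

$$m_k(c, s) := \text{expected number of times } \theta_k \text{ is used along a hidden state path, given } s \text{ and } c$$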

In [12]:
def n_k(s):
    #TODO
    pass

def m_k(c, s):
    #TODO
    pass

def gradient(c, s, theta):
    #TODO
    pass
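
As an illustration of how expected usage counts of the kind needed for $n_k(s)$ could be obtained, the sketch below runs the forward-backward algorithm and returns expected transition and symbol emission counts. It uses plain Python and unscaled probabilities, so it is only suitable for short, non-empty sequences. The function name `expected_counts`, the uniform start distribution, and the nested-list parameter layout are assumptions made for this example.

```python
def expected_counts(s, trans, emit, start=None):
    """Expected parameter usage counts for a sequence s coded as integers 1..20.

    trans[i][j] is P(next state j | state i); emit[i][x] is P(symbol x + 1 | state i).
    Returns (P(s), expected transition counts, expected emission counts).
    """
    n_states, n = len(trans), len(s)
    obs = [x - 1 for x in s]  # shift to 0-based symbol indices
    if start is None:
        start = [1 / n_states] * n_states  # assumed uniform start distribution

    # Forward pass: f[t][i] = P(s_1..s_t, state at t is i)
    f = [[0.0] * n_states for _ in range(n)]
    for i in range(n_states):
        f[0][i] = start[i] * emit[i][obs[0]]
    for t in range(1, n):
        for j in range(n_states):
            incoming = sum(f[t - 1][i] * trans[i][j] for i in range(n_states))
            f[t][j] = emit[j][obs[t]] * incoming

    # Backward pass: b[t][i] = P(s_{t+1}..s_n | state at t is i)
    b = [[1.0] * n_states for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(n_states):
            b[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * b[t + 1][j]
                for j in range(n_states)
            )

    likelihood = sum(f[n - 1])

    # Expected number of i -> j transitions, summed over positions.
    trans_counts = [[0.0] * n_states for _ in range(n_states)]
    for t in range(n - 1):
        for i in range(n_states):
            for j in range(n_states):
                joint = f[t][i] * trans[i][j] * emit[j][obs[t + 1]] * b[t + 1][j]
                trans_counts[i][j] += joint / likelihood

    # Expected number of times state i emits each symbol.
    emit_counts = [[0.0] * len(emit[0]) for _ in range(n_states)]
    for t in range(n):
        for i in range(n_states):
            emit_counts[i][obs[t]] += f[t][i] * b[t][i] / likelihood

    return likelihood, trans_counts, emit_counts


# Two states with uniform parameters and a short example sequence.
trans = [[0.5, 0.5], [0.5, 0.5]]
emit = [[1 / 20] * 20 for _ in range(2)]
likelihood, t_counts, e_counts = expected_counts([1, 5, 20], trans, emit)
```

The label-conditioned counts $m_k(c, s)$ could plausibly be computed with the same recursions by also multiplying each state's symbol emission by its class emission probability for the observed label, though that variant is not shown here.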


# Gradient Descent

We use gradient descent to minimize our objective function.

We repeat the update $\theta' = \theta - \alpha \nabla L$ until convergence. For simplicity, we fix the step size $\alpha$.

In [None]:
def naive_gd(init_theta, step_size):
    #TODO
    pass
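
The update rule above can be sketched as a generic loop. The example assumes a caller-supplied gradient function `grad_fn`, a tolerance `tol`, and an iteration cap `max_iter` (all hypothetical names), and it uses a simple quadratic objective rather than the HMM likelihood so that it runs on its own. It also ignores the requirement that probabilities stay non-negative and normalized, which a real implementation would need to handle.

```python
def gd_sketch(theta0, grad_fn, step_size, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent on a flat list of parameters."""
    theta = list(theta0)
    for _ in range(max_iter):
        grad = grad_fn(theta)
        new_theta = [t - step_size * g for t, g in zip(theta, grad)]
        # Stop once the largest coordinate change falls below the tolerance.
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


# Toy objective L(theta) = sum((theta_k - 3)^2), with gradient 2 * (theta_k - 3).
result = gd_sketch([0.0, 10.0], lambda th: [2 * (t - 3) for t in th], step_size=0.1)
```

With a fixed step size, convergence depends on the curvature of the objective, so the value 0.1 used here should not be read as a recommendation for the HMM problem.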

# Validation Results

Since our dataset is limited, we estimate the validation error using k-fold cross-validation; in particular, we use LOOCV (leave-one-out cross-validation).

In [10]:
def predict(theta, s):
    #TODO
    pass

def error_est(s_data, c_data):
    #TODO
    pass
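
A minimal sketch of the leave-one-out loop follows. It assumes hypothetical `fit` and `predict_fn` callables and uses the per-position error rate as the metric, since the metric is not specified above. The toy majority-label model exists only to make the example runnable.

```python
from collections import Counter


def loocv_error(s_data, c_data, fit, predict_fn):
    """Average per-position error when each sequence is held out in turn."""
    errors = []
    for i in range(len(s_data)):
        # Train on every sequence except the i-th one.
        train_s = s_data[:i] + s_data[i + 1:]
        train_c = c_data[:i] + c_data[i + 1:]
        model = fit(train_s, train_c)
        pred = predict_fn(model, s_data[i])
        wrong = sum(p != c for p, c in zip(pred, c_data[i]))
        errors.append(wrong / len(c_data[i]))
    return sum(errors) / len(errors)


# Toy model: always predict the most common training label.
def fit_majority(train_s, train_c):
    counts = Counter(label for seq in train_c for label in seq)
    return counts.most_common(1)[0][0]


def predict_majority(model, s):
    return [model] * len(s)


toy_s = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_c = [[1, 1, 1], [1, 1], [1, 1, 2, 2]]
toy_error = loocv_error(toy_s, toy_c, fit_majority, predict_majority)
```

Passing the training and prediction steps in as functions keeps the cross-validation loop independent of how the model is actually fitted.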

# Workflow on Human Data

In [None]:
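with open("HUMAN_training_data.txt") as f:
    # Drop blank lines (e.g. from a trailing newline) so sequences and labels stay paired.
    content = [line.strip() for line in f if line.strip()]

# Even-indexed lines hold amino acid sequences; odd-indexed lines hold class labels.
s_seqs = [aa(line) for line in content[0::2]]
c_seqs = [ss(line) for line in content[1::2]]

# Materialize the pairs, since a bare zip object can only be iterated once.
human_train = list(zip(s_seqs, c_seqs))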