# Profile HMMs

---
## Learning Objectives

* Apply HMMs to describe PSSMs
* Develop model structure for Profile HMM
* Build and apply model



---

Today we will review Profile HMMs in class, including a demonstration of how to implement them using our existing framework.

This is a diagram of the Hidden Markov Model used in HMMER (from the HMMER User Guide by Sean Eddy). The chain of match (M), insert (I), and deletion (D) states can be extended to match the length of the multiple sequence alignment that is used as the training set to produce a model. Individual sequences may then be aligned to the model and scored based on the probability that the model would emit that sequence.

<center><img src='./figures/HMM_Diagram.PNG'/></center>
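To make the chain of M, I, and D states concrete, here is a minimal sketch that enumerates the states and allowed transitions for a profile with `n` match columns. The function `profile_topology` and the state names (`B`, `E`, `M1`, `I0`, `D1`, ...) are illustrative assumptions, not names taken from HMMER or from HMM.py, and the layout follows the common textbook profile HMM, which may differ in detail from HMMER's own architecture.

```python
def profile_topology(n):
    """Return (states, transitions) for a profile HMM with n match columns.

    State names are hypothetical: B/E are begin/end, Mk/Ik/Dk are the
    match, insert, and delete states of column k (I0 sits before column 1).
    """
    states = ["B", "I0"]
    for k in range(1, n + 1):
        states += [f"M{k}", f"I{k}", f"D{k}"]
    states.append("E")

    transitions = {}
    for k in range(0, n + 1):
        # States located at column k: B and I0 for k == 0, else Mk, Ik, Dk.
        sources = ["B", "I0"] if k == 0 else [f"M{k}", f"I{k}", f"D{k}"]
        if k < n:
            # Move on to the next column, or stay in this column's insert state.
            targets = [f"M{k + 1}", f"I{k}", f"D{k + 1}"]
        else:
            # Last column: only trailing insertions or the end state remain.
            targets = [f"I{k}", "E"]
        for source in sources:
            transitions[source] = targets
    return states, transitions


states, transitions = profile_topology(3)
print(len(states))        # 12
print(transitions["M1"])  # ['M2', 'I1', 'D2']
```

Every transition not listed for a state is structurally forbidden, which is what gives the profile its left-to-right shape.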


We will be implementing a Profile HMM using the BAR domain discussed in the slides.

To accomplish this, we will implement two functions. First, `get_valid_states()` will provide a list of states that meet our heuristic threshold in the model (denoted as *s in the slides). Second, we will implement `build_profileHMM()`, which will use our existing HMM class structure (included here in HMM.py) to develop a model in the above structure.
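As a small illustration of the threshold heuristic, the sketch below works on a toy alignment held in memory rather than a fasta file. The helper `match_columns` is hypothetical, and it assumes that a column counts as a match state when its fraction of non-gap characters is at least the threshold; the slides may define the cutoff slightly differently.

```python
def match_columns(alignment, threshold=0.7):
    """Flag columns whose non-gap fraction is at least `threshold`."""
    flags = []
    for column in zip(*alignment):  # zip(*rows) iterates over columns
        non_gap = sum(1 for char in column if char != "-")
        flags.append(non_gap / len(column) >= threshold)
    return flags


toy = ["AC-T",
       "A--T",
       "ACGT",
       "-C-T"]
print(match_columns(toy, 0.7))  # [True, True, False, True]
```

The third column is mostly gaps (only 1 of 4 sequences has a residue there), so it would be modeled by an insert state rather than a match state.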

A few caveats: Our HMM implementation requires that all possible emissions and transitions exist in the dictionary. That is, every hidden state must have a probability of emitting each character in the alphabet, and the transition matrix must have a probability for every state going to every other state in the model. These probabilities can be set to 0 to create the profile HMM structure, but they must be set.
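One way to satisfy this requirement is to start from fully populated tables of zeros and then switch on only the entries the profile structure allows. The sketch below assumes nested dictionaries keyed by state name; the actual layout expected by HMM.py is not shown here, so treat `dense_tables`, the state names, and the dictionary shape as assumptions to adapt.

```python
def dense_tables(states, alphabet):
    """Build fully populated transition and emission dictionaries of zeros."""
    transitions = {src: {dst: 0.0 for dst in states} for src in states}
    emissions = {state: {char: 0.0 for char in alphabet} for state in states}
    return transitions, emissions


states = ["M1", "I1", "D1", "M2"]  # hypothetical state names
alphabet = list("ACGT")            # small alphabet, for illustration only
transitions, emissions = dense_tables(states, alphabet)

# Enable only the transitions that the profile structure allows.
transitions["M1"]["M2"] = 0.9
transitions["M1"]["I1"] = 0.1

print(transitions["M1"])
```

Every state pair and every state/character pair has an entry, so lookups never fail, while forbidden moves such as `M2` to `M1` keep probability 0.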

In [1]:
# Imports
from data_readers import get_fasta
from collections import Counter, defaultdict
from HMM import HMM
import json

In [None]:
def get_valid_states(fasta_file, threshold=0.7):
    ''' 
    Function to determine which positions in an alignment are valid states in the profile HMM given a threshold
    
    Args: 
        fasta_file (str): fasta file containing alignments
        threshold (float): minimum fraction of non-gap characters a column needs (default = 0.7)

    Returns:
        valid_states (list of bools): list of booleans (True/False) if each position meets the threshold
        
    Example:
        >>> get_valid_states("data/BAR.fa", 0.7) #doctest: +ELLIPSIS
        [True, True, True, True, False, ...]
    '''
    # Collect every aligned sequence; all are assumed to have the same length.
    seqs = [seq for name, seq in get_fasta(fasta_file)]

    # One boolean per alignment column.
    valid_states = []

    # zip(*seqs) walks the alignment column by column.
    for column in zip(*seqs):
        non_gap = sum(1 for char in column if char != "-")
        # Assumption: the threshold is read as the required non-gap fraction.
        # Adjust the comparison if the slides define it differently.
        valid_states.append(non_gap / len(column) >= threshold)

    return valid_states
    
    

In [4]:
test1 = []

In [5]:
tracker = "True, False, True"

In [6]:
test1.append(tracker)

In [7]:
test1

['True, False, True']

As mentioned in the slides, we need to update the transition and emission probabilities of the HMM using the following equations:

$a_{kl} = A_{kl} / \sum_{l'}A_{kl'}$

$e_{k}(a) = E_{k}(a) / \sum_{a'}E_{k}(a')$

Where $k$ and $l$ represent state indices, $a$ is an emitted character, $a_{kl}$ and $e_{k}(a)$ are transition and emission probabilities, respectively, and $A_{kl}$ and $E_{k}(a)$ are the corresponding frequencies (counts) observed in the alignment.
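Both equations are the same operation: divide each count in a row by the row total. A minimal sketch follows, with a hypothetical helper `normalize` and an optional pseudocount added to every count before normalizing. Adding the pseudocount to the counts is an assumption about how it is used; the slides may apply it differently.

```python
def normalize(counts, pseudocount=0.0):
    """Turn one row of counts into probabilities: p(x) = c(x) / sum of c(x')."""
    adjusted = {key: value + pseudocount for key, value in counts.items()}
    total = sum(adjusted.values())
    if total == 0:
        # Nothing observed and no pseudocount: leave the row at zero.
        return {key: 0.0 for key in adjusted}
    return {key: value / total for key, value in adjusted.items()}


emission_counts = {"A": 3, "C": 1, "G": 0, "T": 0}
print(normalize(emission_counts))
# {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
print(normalize(emission_counts, pseudocount=1))
# {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```

For transitions, pass only the entries that the profile structure allows. Otherwise the pseudocount would give forbidden transitions a nonzero probability.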


In [None]:
# Train model using the BAR domain data in data/BAR.fa

# In order to build our model, we will need to set default parameters in an 
# initialized HMM using pseudocounts and then update these values with the 
# information in the fasta file
def build_profileHMM(alphabet, valid_states, fasta_file, pseudocount=0.01):
    ''' 
    Function to initialize a Profile HMM structure
    
    Args: 
        alphabet (list): alphabet characters for the model
        valid_states (list of bools): all positions in the alignments that are in match states
        fasta_file (str): fasta file containing alignments
        pseudocount (float): value to set as initial probabilities

    Returns:
        profile_HMM (HMM): HMM object
        
    Pseudocode:
    Initialize full initial, emission, and transition matrix to 0s (our HMM object requires all transitions and emissions are set to at least 0)
    Initialize possible emissions and transitions to pseudocount
    Calculate probabilities at each position given valid_states and fasta_file
        
    '''
    
    # Not implemented yet: the steps in the pseudocode above still need to be written.
    raise NotImplementedError("build_profileHMM has not been implemented yet")
    
    
    
    

In [None]:
valid_states = get_valid_states("data/BAR_Short.fa", 0.5)
alphabet = list('GALMFWKQESPVICYHRNDT')
profile = build_profileHMM(alphabet, valid_states, "data/BAR_Short.fa")

In [None]:
# Exact example from slides
sequence = "TKLDDDFKE"

print("Forward:")
f_Px, f_matrix = profile.forward(list(sequence))
print(f_Px)


Expected output:
Forward:
9.605647218365268e-05

In [None]:
print(profile)