# HMM 1
## Specification
- Name: Cade Mirchandani
- Group: Jodie, Gabe
- Name your notebooks as: problem19.ipynb, problem20.ipynb, problem 22.ipynb
- options: none
- input: filename passed as first parameter to main
- output: a text file. ( using print ( .... file=someFileObject) is a handy way to do this after you have opened someFileObject as a text file). I find it name to name these files by creating a string by concatenating the string named infile with ".out" ... rosalind_ba10a.txt.out ( for example).
-Rosalind Problem Names:
    - Compute the Probability of a Hidden Path
    - Compute the Probability of an Outcome Given a Hidden Path
    - Implement the Viterbi Algorithm
    - Compute the Probability of a String Emitted by an HMM

As always, include an Inspection Intro Markdown that describes your specific algorithm at the beginning of the notebook, and another Inspection Results markdown at the end of the notebook that documents: your inspection team, the findings of the team, and your resolution of those findings.

Please submit your notebooks, an example of one of the Rosalind files that you ran and passed, and the output that your program generated as a text file.

## Description
These are drawn from material presented in Ch. 10 of Compeau and Pevzner.

We begin an exploration of models that provide us with inference into the hidden meaning lying within a sequence. Four HMM Rosalind examples provide a framework that will extend throughout the course. Note that these examples operate with probabilities rather than log-probabilities or log-odds scores. The concepts of the models are consistent, however.

Please submit the rosalind input file that you tested with. This would be one of the "Extra Datasets" for each problem

## Hints
1) use of the python numpy module will simplify many of your operators, especially when higher precision math is required ( when products of many probabilities are calculated), or when vector operations are called for.

2) Concepts presented in class will mirror those in the reading

3) write your code for reusability. 

4) Consider your data structures carefully. You will likely need transition and emission matrices. Carefully consider how to address these. For those who are comfortable with dot product and matrix multiplication, these can dramatically simplify your code.

Here is a template to consider.

## Inspection Intro

For this problem we had to implement the forward algorithim to compute the probability of a string emitted by an HMM

We reuse our nested dicts to store the transition and emission matrices.
We use a deque to create the path when we backtrack becuase it has the appendleft fucntion.

In [9]:
from collections import defaultdict
from math import inf
class HMM:
    def __init__(self, inFile):
        """
        Class HMM holds the code for working with HMMs from rosalind.

        Args:
            inFile (str): path to input file
        """
        self.inFile = inFile
        # parse input file to initalize needed variables.
        self.string, self.alphabet, self.states, self.transition, self.emission = self.parseInput()
           
    def parseInput(self):
        """
        Reads the rosalind input file and parses the string, alphabet, states, and transition and emission matrices.
        Returns:
            string (str): the input string
            alphabet (list): the alphabet of the string
            states (list): the states
            transition (dict): the transition matrix
            emission (dict): the emission matrix
        """
        with open(self.inFile, "r") as f:
            # read the string
            string = f.readline().rstrip()
            # junk is the ----- separator
            junk = f.readline().rstrip()
            alphabet = f.readline().rstrip().split()
            junk = f.readline().rstrip()
            states = f.readline().rstrip().split()
            junk = f.readline().rstrip()    
            
            # initalize and fill in transition matrix
            transition = defaultdict(dict)
            columns = f.readline().rstrip().split()
            for i in range(len(states)):
                line = f.readline().rstrip().split()
                rowState = line[0]
                probs = [float(f) for f in line[1:]]
                for i in range(len(probs)):
                    transition[rowState][columns[i]] = probs[i] 
            
            junk = f.readline().rstrip()
            
            # initalize and fill in emission matrix
            emission = defaultdict(dict)
            columns = f.readline().rstrip().split()
            for i in range(len(states)):
                line = f.readline().rstrip().split()
                rowState = line[0]
                probs = [float(f) for f in line[1:]]
                for i in range(len(probs)):
                    emission[rowState][columns[i]] = probs[i] 
        
        return string, alphabet, states, transition, emission

    def calcProb(self):
        """
        Calculates probablity of emit string

        Returns:
            out (float): the probability of the emit string
        """
        # initalize distance matrix
        dists = {s: {} for s in self.states}
        # iterate each emit in emit string
        for i, emit in enumerate(self.string):
            # iterate each state
            for state in self.states:
                # set starting probabilites when at first emit
                if i == 0:
                    dists[state][i] = 1 / len(self.states) * self.emission[state][emit]
                else:
                    dists[state][i] = 0
                    for otherState in self.states:
                        # get prob of nextState and add it
                        nextProb = self.transition[state][otherState] * self.emission[state][emit] 
                        dists[state][i] += dists[otherState][i-1] * nextProb
        out = 0
        for state in self.states:
           out += dists[state][len(self.string)-1]
        return out
        

    def pathProbability(self, path):
        """
        Computes probability of given path using transition matrix.

        Args:
            path (str): Input path
            
        
        Returns:
            prob (float): Computed probability.
        """
        prob = 0.5
        for i in range(len(path)-1):
            prob *= self.transitionMatrix[path[i]][path[i+1]]
        return prob
    
    def conditionalProb(self, hiddenPath, observedPath):
        """
        Calculates conditional probability of hidden path given observered path

        Args:
            hiddenPath (str): the hidden path
            observedPath (str): the observed path

        Returns:
            _type_: the probability of the observed path given the hidden path
        """
        prob = 1
        for i in range(len(hiddenPath)):
            prob *= self.emissionMatrix[hiddenPath[i]][observedPath[i]]
        return prob

def main(inFile = None):
    '''
    Do the main thing
    '''
    p = HMM(inFile)

    out = p.calcProb()

    print(out)
    with open("cm_22_out.txt", "w") as f:
        print(out, file=f)
    
if __name__ == "__main__":
    main(inFile = 'rosalind_ba10d.txt') 

1.2719603197444308e-48


## Inspection Results

- Added inline comments to explain algo
- Changed intial state prob to 1 / #num states * emissionProb(state, emit)
- Set prob to 0 not -inf for new nodes.