In [1]:
import math  # Just ignore this :-)

def log(x):
    if x == 0:
        return float('-inf')
    return math.log(x)

# CTiB - Week 12 - Practical Exercises

In the exercise below, you will implement and experiment with training-by-counting as a way to select the parameters of an HMM (i.e. training) as explained in the lectures in week 12.

# 1 - Background

Below you will implement and experiment with estimating the parameters (transition, start, and emission probabilities) for an HMM from data (sequences of observations, ${\bf X}$, and corresponding sequences of hidden states, ${\bf Z}$) using the training-by-counting method.
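
As a reminder, and assuming the usual maximum-likelihood form of training-by-counting, the estimates for a model with $K$ hidden states and $D$ observable symbols can be written as below. The count symbols $c_{ij}$, $e_{ik}$ and $s_i$ are notation introduced for this note and may differ from the lecture slides.

$$
\hat{A}_{ij} = \frac{c_{ij}}{\sum_{j'=1}^{K} c_{ij'}}, \qquad
\hat{\phi}_{ik} = \frac{e_{ik}}{\sum_{k'=1}^{D} e_{ik'}}, \qquad
\hat{\pi}_{i} = \frac{s_{i}}{\sum_{i'=1}^{K} s_{i'}}
$$

Here $c_{ij}$ is the number of transitions from state $i$ to state $j$ in ${\bf Z}$, $e_{ik}$ is the number of times symbol $k$ is observed while in state $i$, and $s_i$ is the number of training sequences that start in state $i$. With a single training pair $({\bf X}, {\bf Z})$, $\hat{\pi}$ puts all its mass on the first state of ${\bf Z}$.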

We will be working with the 7-state model (`hmm_7_state`) that we also worked with last time. The model is included below.

We will use your implementation of the Viterbi algorithm (`compute_w_log` and `opt_path_prob_log`) from week 10 to investigate your trained models, so you need to add them below.

In [2]:
class hmm:
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.init_probs = init_probs
        self.trans_probs = trans_probs
        self.emission_probs = emission_probs

In [3]:
init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]

trans_probs_7_state = [
    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],
]

emission_probs_7_state = [
    #   A     C     G     T
    [0.30, 0.25, 0.25, 0.20],
    [0.20, 0.35, 0.15, 0.30],
    [0.40, 0.15, 0.20, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.30, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.15, 0.30, 0.20, 0.35],
]

hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)

We also need the helper functions for translating between observations/paths and indices.

In [4]:
def translate_path_to_indices(path):
    return list(map(lambda x: int(x), path))

def translate_indices_to_path(indices):
    return ''.join([str(i) for i in indices])

def translate_observations_to_indices(obs):
    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    return [mapping[symbol.lower()] for symbol in obs]

def translate_indices_to_observations(indices):
    mapping = ['a', 'c', 'g', 't']
    return ''.join(mapping[idx] for idx in indices)

Additionally, you're given the function below that constructs a table of a specific size filled with zeros.

In [5]:
def make_table(m, n):
    """Make a table with `m` rows and `n` columns filled with zeros."""
    return [[0] * n for _ in range(m)]



You'll be testing your code with the same two sequences as last time, i.e.:

In [6]:
x_short = 'GTTTCCCAGTGTATATCGAGGGATACTACGTGCATAGTAACATCGGCCAA'
z_short = '33333333333321021021021021021021021021021021021021'

In [7]:
x_long = 'TGAGTATCACTTAGGTCTATGTCTAGTCGTCTTTCGTAATGTTTGGTCTTGTCACCAGTTATCCTATGGCGCTCCGAGTCTGGTTCTCGAAATAAGCATCCCCGCCCAAGTCATGCACCCGTTTGTGTTCTTCGCCGACTTGAGCGACTTAATGAGGATGCCACTCGTCACCATCTTGAACATGCCACCAACGAGGTTGCCGCCGTCCATTATAACTACAACCTAGACAATTTTCGCTTTAGGTCCATTCACTAGGCCGAAATCCGCTGGAGTAAGCACAAAGCTCGTATAGGCAAAACCGACTCCATGAGTCTGCCTCCCGACCATTCCCATCAAAATACGCTATCAATACTAAAAAAATGACGGTTCAGCCTCACCCGGATGCTCGAGACAGCACACGGACATGATAGCGAACGTGACCAGTGTAGTGGCCCAGGGGAACCGCCGCGCCATTTTGTTCATGGCCCCGCTGCCGAATATTTCGATCCCAGCTAGAGTAATGACCTGTAGCTTAAACCCACTTTTGGCCCAAACTAGAGCAACAATCGGAATGGCTGAAGTGAATGCCGGCATGCCCTCAGCTCTAAGCGCCTCGATCGCAGTAATGACCGTCTTAACATTAGCTCTCAACGCTATGCAGTGGCTTTGGTGTCGCTTACTACCAGTTCCGAACGTCTCGGGGGTCTTGATGCAGCGCACCACGATGCCAAGCCACGCTGAATCGGGCAGCCAGCAGGATCGTTACAGTCGAGCCCACGGCAATGCGAGCCGTCACGTTGCCGAATATGCACTGCGGGACTACGGACGCAGGGCCGCCAACCATCTGGTTGACGATAGCCAAACACGGTCCAGAGGTGCCCCATCTCGGTTATTTGGATCGTAATTTTTGTGAAGAACACTGCAAACGCAAGTGGCTTTCCAGACTTTACGACTATGTGCCATCATTTAAGGCTACGACCCGGCTTTTAAGACCCCCACCACTAAATAGAGGTACATCTGA'
z_long = '3333321021021021021021021021021021021021021021021021021021021021021021033333333334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210321021021021021021021021033334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563333333456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456332102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102103210210210210210210210210210210210210210210210210210210210210210'

Remember to translate these sequences to indices before using them with your algorithms.

In [8]:
# Your implementations of compute_w_log and opt_path_prob_log from week 10

def compute_w_log(model, x):
    k = len(model.init_probs)
    n = len(x)

    w = make_table(k, n)

    # Base case: fill out w[i][0] for i = 0..k-1
    for i in range(k):
        w[i][0] = log(model.init_probs[i]) + log(model.emission_probs[i][x[0]])

    # Inductive case: fill out w[i][j] for i = 0..k-1, j = 1..n-1
    for j in range(1, n):
        for i in range(k):
            # Start from -inf so the max is taken over log-probabilities only,
            # never over the value the table was initialised with
            best = float('-inf')
            for h in range(k):
                best = max(best, w[h][j-1] + log(model.trans_probs[h][i]))
            w[i][j] = best + log(model.emission_probs[i][x[j]])

    return w


def opt_path_prob_log(w):
    # The optimal log-probability is the largest entry in the last column
    return max(row[-1] for row in w)

# 2 - Training-by-counting using `x_short` and `z_short`

Assume that (`x_short`, `z_short`) is our given training data. Estimate the parameters of a 7-state hmm (`hmm_7_state_tbc_short`) as declared below from this data using training-by-counting. 

In [9]:
# Declaration of an 'empty' 7-state HMM, i.e. one where all params are set to zero


# Probability of starting in each of the 7 states
init_probs_7_state_tbc_short = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]


# Row i holds the probabilities of going from state i to each of the 7 states
trans_probs_7_state_tbc_short = [
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
]


# Row i holds the probabilities of emitting A, C, G, T from state i
emission_probs_7_state_tbc_short = [
    #   A     C     G     T
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
]

hmm_7_state_tbc_short = hmm(init_probs_7_state_tbc_short, trans_probs_7_state_tbc_short, emission_probs_7_state_tbc_short)




In [10]:
x_short = 'GTTTCCCAGTGTATATCGAGGGATACTACGTGCATAGTAACATCGGCCAA'
z_short = '33333333333321021021021021021021021021021021021021'



def init_probs_fill(z, K):
    res = [0 for i in range(K)]
    res[z[0]] = 1
    return res

m = init_probs_fill(translate_path_to_indices(z_short), 7)
#print(m)


def trans_probs_fill(z_int, K):
    # Count the observed transitions; start from zeros so that only
    # transitions present in the training data contribute
    matrix_trans = [[0] * K for _ in range(K)]
    for i in range(len(z_int) - 1):
        curr_state = z_int[i]
        next_state = z_int[i + 1]
        matrix_trans[curr_state][next_state] += 1

    # Normalise each row by its sum; a state with no outgoing transitions
    # in the training data keeps an all-zero row (avoids division by zero)
    for i in range(K):
        row_sum = sum(matrix_trans[i])
        for j in range(K):
            matrix_trans[i][j] = matrix_trans[i][j] / row_sum if row_sum > 0 else 0.0

    return matrix_trans

print(trans_probs_fill(translate_path_to_indices(z_short), 7))


[[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.08333333333333333, 0.9166666666666666, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]


Now adjust the parameters of the above model to reflect the training data (`x_short`, `z_short`). You can probably do this by hand since the sequences are quite short. You can of course also implement some code for doing it.

**Explain to another student how you do this.**

Adjust the parameters in the above declaration of `hmm_7_state_tbc_short` to the parameters that you get by training-by-counting. (Remember to validate that they are legal parameters.)
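
One way to validate the parameters is to check that every distribution has entries in $[0, 1]$ that sum to 1. The following is a minimal sketch of such a check; the names `is_legal_distribution` and `check_hmm_params` are hypothetical helpers, not part of the course code, and the toy numbers are made up for illustration.

```python
import math

def is_legal_distribution(probs, tol=1e-9):
    """True if all entries lie in [0, 1] and they sum to 1 (within tol)."""
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and math.isclose(sum(probs), 1.0, abs_tol=tol)

def check_hmm_params(init_probs, trans_probs, emission_probs):
    """Return a list of problems found; an empty list means everything is legal."""
    problems = []
    if not is_legal_distribution(init_probs):
        problems.append("init_probs is not a legal distribution")
    for i, row in enumerate(trans_probs):
        if not is_legal_distribution(row):
            problems.append(f"trans_probs row {i} is not a legal distribution")
    for i, row in enumerate(emission_probs):
        if not is_legal_distribution(row):
            problems.append(f"emission_probs row {i} is not a legal distribution")
    return problems

# Toy 2-state, 2-symbol parameters (hypothetical, not the 7-state model)
toy_init = [1.0, 0.0]
toy_trans = [[0.9, 0.1], [0.0, 0.0]]   # state 1 has an all-zero row
toy_emit = [[0.5, 0.5], [0.25, 0.75]]

print(check_hmm_params(toy_init, toy_trans, toy_emit))
```

Note that `z_short` only visits states 0 to 3, so pure counting gives no information about states 4, 5 and 6. A check like this will flag their rows, and you need to decide how to treat them.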

Now compute the probability of the Viterbi decoding of `x_short` using your trained model and the 'original' 7-state model from the previous weeks, i.e.:

In [11]:
# Probability of the Viterbi decoding of x_short using hmm_7_state_tbc_short
w = compute_w_log(hmm_7_state_tbc_short, translate_observations_to_indices(x_short))
print(opt_path_prob_log(w))

# Probability of the Viterbi decoding of x_short using hmm_7_state
w = compute_w_log(hmm_7_state, translate_observations_to_indices(x_short))
print(opt_path_prob_log(w))

How do the two probabilities compare? What do you expect?

Now compute the probability of the Viterbi decoding of `x_long` using the same two models, i.e.:

In [None]:
# Probability of the Viterbi decoding of x_long using hmm_7_state_tbc_short
w = compute_w_log(hmm_7_state_tbc_short, translate_observations_to_indices(x_long))
print(opt_path_prob_log(w))

# Probability of the Viterbi decoding of x_long using hmm_7_state
w = compute_w_log(hmm_7_state, translate_observations_to_indices(x_long))
print(opt_path_prob_log(w))

How do the two probabilities compare? What do you expect?

# 3 - Training-by-counting using `x_long` and `z_long`

Now we want to redo what we did above, but with (`x_long`, `z_long`) as our given training data. Estimate the parameters of a 7-state HMM (`hmm_7_state_tbc_long`) as declared below from this data using training-by-counting.

In [None]:
# Declaration of an 'empty' 7-state HMM, i.e. one where all params are set to zero

init_probs_7_state_tbc_long = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]

trans_probs_7_state_tbc_long = [
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
]

emission_probs_7_state_tbc_long = [
    #   A     C     G     T
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
]

hmm_7_state_tbc_long = hmm(init_probs_7_state_tbc_long, trans_probs_7_state_tbc_long, emission_probs_7_state_tbc_long)

Now adjust the parameters of the above model to reflect the training data (`x_long`, `z_long`). You can still probably do this by hand, but since the sequences are longer, you might want to implement some code to assist you in counting.

**Explain to another student how you do this, and see section 4 below for how to implement code for training-by-counting.**

Adjust the parameters in the above declaration of `hmm_7_state_tbc_long` to the parameters that you get by training-by-counting. (Remember to validate that they are legal parameters.)

In [None]:
# Your code from doing training-by-counting using (x_long, z_long) as training data

Now compute the probability of the Viterbi decoding of `x_short` using your trained model and the 'original' 7-state model from the previous weeks, i.e.:

In [None]:
# Probability of the Viterbi decoding of x_short using hmm_7_state_tbc_long
w = compute_w_log(hmm_7_state_tbc_long, translate_observations_to_indices(x_short))
print(opt_path_prob_log(w))

# Probability of the Viterbi decoding of x_short using hmm_7_state
w = compute_w_log(hmm_7_state, translate_observations_to_indices(x_short))
print(opt_path_prob_log(w))

How do the two probabilities compare? What do you expect?

Now compute the probability of the Viterbi decoding of `x_long` using the same two models, i.e.:

In [None]:
# Probability of the Viterbi decoding of x_long using hmm_7_state_tbc_long
w = compute_w_log(hmm_7_state_tbc_long, translate_observations_to_indices(x_long))
print(opt_path_prob_log(w))

# Probability of the Viterbi decoding of x_long using hmm_7_state
w = compute_w_log(hmm_7_state, translate_observations_to_indices(x_long))
print(opt_path_prob_log(w))

How do the two probabilities compare? What do you expect?

# 4 - Training-by-counting in general

Training a hidden Markov model is a matter of estimating the initial, transition and emission probabilities. If we are given training data, i.e. a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, we can do "training by counting" by counting the transitions and emissions observed in the training data, as explained in the lecture.

Given ${\bf X}$ and ${\bf Z}$ we would like to count the number of transitions from one state to another, and the number of times that symbol $k$ was observed while being in state $i$.  That is, we want to construct a $K \times K$ matrix such that entry $i, j$ is the number of times that a transition from state $i$ to state $j$ is observed in the training data, and a $K \times D$ matrix where entry $i, k$ contains the number of times that symbol $k$ is observed in the training data while being in state $i$.
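
To make the two kinds of counts concrete, here is a small sketch on made-up toy data with two states and the symbols `a` and `c`. It uses `collections.Counter` on pairs rather than the $K \times K$ and $K \times D$ matrices asked for, so it illustrates the idea and is not a solution to the exercise; `x_toy` and `z_toy` are hypothetical.

```python
from collections import Counter

# Toy training data (hypothetical, not taken from the exercise)
x_toy = "aacca"
z_toy = "00011"

# A transition is a pair of consecutive hidden states (z[t], z[t+1])
transition_counts = Counter(zip(z_toy, z_toy[1:]))

# An emission pairs each hidden state with the symbol observed in it (z[t], x[t])
emission_counts = Counter(zip(z_toy, x_toy))

for (src, dst), count in sorted(transition_counts.items()):
    print(f"transition {src} -> {dst}: {count}")
for (state, symbol), count in sorted(emission_counts.items()):
    print(f"state {state} emits {symbol}: {count}")
```

A sequence of length $n$ contributes $n - 1$ transitions and $n$ emissions, which is a useful property to remember when checking your own counts.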

Implement this as the below function:

In [None]:
def count_transitions_and_emissions(K, D, x, z):
    """
    Returns a KxK matrix and a KxD matrix containing counts cf. above
    """
    pass

Test your implementation of `count_transitions_and_emissions` on (prefixes) of `x_long` and `z_long` above in order to conclude that your implementation works as expected.
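
One possible sanity check, sketched below, relies on the fact that a training sequence of length $n$ contains exactly $n - 1$ transitions and $n$ emissions. The helper name `counts_look_consistent` is hypothetical, and the matrices are hand-counted for a made-up toy example rather than produced by your function.

```python
def counts_look_consistent(trans_counts, emit_counts, n):
    """Check the totals of count matrices built from one sequence of length n."""
    total_transitions = sum(sum(row) for row in trans_counts)
    total_emissions = sum(sum(row) for row in emit_counts)
    return total_transitions == n - 1 and total_emissions == n

# Hand-counted matrices for the toy pair x = "aacca", z = "00011"
# (2 states; emission columns are ordered a, c)
toy_trans_counts = [
    [2, 1],  # 0 -> 0 twice, 0 -> 1 once
    [0, 1],  # 1 -> 1 once
]
toy_emit_counts = [
    [2, 1],  # state 0 emits a twice, c once
    [1, 1],  # state 1 emits a once, c once
]

print(counts_look_consistent(toy_trans_counts, toy_emit_counts, n=5))
```

Passing this check does not prove the counts are right, but failing it shows that something is wrong. For short prefixes you can also compare individual entries against counts made by hand.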

In [None]:
# Your code here ...

Use your implementation of `count_transitions_and_emissions` to implement a function `training_by_counting` that, given the number of hidden states, $K$, the number of observables, $D$, a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, returns an HMM (as an instance of `class hmm`), where the transition, emission, and initial probabilities are set cf. training by counting on ${\bf X}$ and ${\bf Z}$.
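
The step from counts to probabilities is a row-wise normalisation. A minimal sketch follows, assuming that rows without any counts are simply left as zeros; `normalize_rows` is a hypothetical helper name and the counts are made up.

```python
def normalize_rows(counts):
    """Turn each row of counts into probabilities; all-zero rows stay all-zero."""
    probs = []
    for row in counts:
        total = sum(row)
        # Guard against division by zero for states that never occur
        probs.append([c / total if total > 0 else 0.0 for c in row])
    return probs

# Toy transition counts for 3 states (hypothetical); state 2 is never left
toy_counts = [
    [3, 1, 0],
    [0, 2, 0],
    [0, 0, 0],
]
print(normalize_rows(toy_counts))
# -> [[0.75, 0.25, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

An all-zero row is not a legal probability distribution, so how to treat states that never occur in the training data is a decision you need to make yourself.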

In [None]:
def training_by_counting(K, D, x, z):
    """
    Returns an HMM trained on x and z cf. training-by-counting.
    """
    pass

You can now construct an HMM trained on `x_long` and `z_long` as:

In [None]:
hmm_7_state_tbc_long = training_by_counting(7, 4, x_long, z_long)