# Markov Chains

---

## Before Class
In class today we will implement a Markov chain to process sentences.
Prior to class, please do the following:
1. Review the slides on Markov chains in detail
2. Explore the Python dict() structure and how a dict() can contain nested dict() structures
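
As a warm-up for the dict() item above, here is a minimal sketch of a nested dictionary. The words and counts are made up for illustration, and the variable name `follows` is hypothetical.

```python
# Outer keys are words; each inner dict maps a following word to a count.
# The words and counts here are illustrative only.
follows = {"fish": {"two": 1, "red": 1}}

# Update an existing inner entry
follows["fish"]["red"] += 1

# Create the inner dict on first use, then set the count
follows.setdefault("one", {})
follows["one"]["fish"] = follows["one"].get("fish", 0) + 1

print(follows)
```

Both `setdefault` and `get` avoid a `KeyError` when a key has not been seen yet, which is the same problem the explicit `if ... in ...` checks solve in class.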

---
## Learning Objectives

1. Conceptually understand Markov chains
2. Implement a Markov chain


---
## Background

Recall from the lectures that Markov chains represent a series of events following the Markov property: the process is memoryless, in that the next state depends only on the current state. This idea extends to higher-order Markov models, which remember more than one previous state (the basic chain, with a memory of one state, is a 1st order Markov model). Markov models consist of fully observable states. A common example is predicting the weather: we can clearly see the current weather and would like to predict tomorrow's weather. As shown in the slides, Markov models are also applicable to biology, with one case being CpG islands.
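
To make the weather example concrete, here is a minimal sketch of a two-state chain. The transition probabilities are invented for illustration (they are not from the slides), and `forecast` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical transition probabilities (rows: today, columns: tomorrow)
states = ["sunny", "rainy"]
transition = np.array([
    [0.8, 0.2],   # sunny -> sunny, sunny -> rainy
    [0.4, 0.6],   # rainy -> sunny, rainy -> rainy
])

def forecast(today, days):
    """Return the distribution over states `days` steps after `today`."""
    dist = np.zeros(len(states))
    dist[states.index(today)] = 1.0
    # Each step depends only on the current distribution (Markov property)
    for _ in range(days):
        dist = dist @ transition
    return dist

tomorrow = forecast("sunny", 1)
day_after = forecast("sunny", 2)
print(tomorrow, day_after)
```

Each row of the matrix sums to 1, because from any state the chain must move to some next state.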

Our goal today will be to implement a Markov model built from words. For our example text, we will use the classic example of Dr. Seuss because of the repetitive nature of the text.

---
## Train Markov model

For our initial implementation of the Markov Model, we will use the simple example of Dr. Seuss: "One fish two fish red fish blue fish."



In [1]:
def build_markov_model(markov_model, new_text):
    '''
    Function to build or add to a 1st order Markov model given a string of text
    We will store the markov model as a dictionary of dictionaries
    The key in the outer dictionary represents the current state
    and the inner dictionary maps each next state to the number of times
    that transition was observed (the counts are normalized into
    transition probabilities when we generate text).
    Note: This would be easier to read if we were to build a class representation
           of the model rather than a dictionary of dictionaries, but for simplicity
           our implementation will just use this structure.
    
    Args: 
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
        new_text (str): a string to build or add to the markov_model

    Returns:
        markov_model (dict of dicts): an updated markov_model
        
    Pseudocode:
        Add artificial states for start and end
        For each word in text:
            Increment markov_model[word][next_word]
        
    '''
    
    # Split up our sentence by spaces
    words = new_text.split()
    
    # Add our artificial start and end states to the sentence
    words.insert(0, "*S*")
    words.append("*E*")
    
    # Iterate over each word in the input
    for i in range(0, len(words)-1):
        
        # Here we do a lot of checking to properly initialize the dictionaries
        # But the main goal is to increment a counter if we have seen words[i+1]
        # after words[i]
        if words[i] in markov_model:                     # If we have already seen this word
            if words[i+1] in markov_model[words[i]]:     # and if we have already seen the next word
                markov_model[words[i]][words[i+1]] += 1  # increment the word counter
            else:
                markov_model[words[i]][words[i+1]] = 1   # If we haven't seen the next word, create it
        else:
            markov_model[words[i]] = {}                  # If we haven't seen the word then create it
            markov_model[words[i]][words[i+1]] = 1
    
    return markov_model
    


In [2]:
markov_model = dict()
text = "one fish two fish red fish blue fish"
markov_model = build_markov_model(markov_model, text)
print(markov_model)

{'*S*': {'one': 1}, 'one': {'fish': 1}, 'fish': {'two': 1, 'red': 1, 'blue': 1, '*E*': 1}, 'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 1}}
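
The model above stores raw counts rather than probabilities. As a minimal sketch, the counts can be normalized into transition probabilities by dividing each count by the total for its state. The helper name `counts_to_probabilities` is hypothetical; the `counts` dictionary is copied from the output above.

```python
def counts_to_probabilities(markov_model):
    """Convert next-word counts into transition probabilities (hypothetical helper)."""
    probabilities = {}
    for state, next_counts in markov_model.items():
        total = sum(next_counts.values())
        probabilities[state] = {word: count / total
                                for word, count in next_counts.items()}
    return probabilities

# Counts copied from the 1st order model printed above
counts = {'*S*': {'one': 1}, 'one': {'fish': 1},
          'fish': {'two': 1, 'red': 1, 'blue': 1, '*E*': 1},
          'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 1}}

probabilities = counts_to_probabilities(counts)
print(probabilities['fish'])
```

After "fish", each of the four observed next states has probability 0.25, while every other state has a single next state with probability 1.0.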


### Nth order Markov chain
In the above model, each word depends only on the previous state, with no memory of any earlier states. While this is useful in some cases, typical biological applications of Markov chains require higher-order models to accurately capture what we know about a system. For instance, in attempting to identify coding regions of a genome, we know that open reading frames (ORFs) contain codon triplets, and so a third or sixth order Markov chain would better describe these regions. Here you will implement a generalized form of our previous Markov chain to allow for Nth order chains.
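
Before implementing the full model, it may help to see how the states of an Nth order chain are formed. The sketch below slides a window of length `order` over a sequence and pairs each window with the symbol that follows it. The helper name `contexts` and the DNA string are illustrative assumptions, and no start or end states are added here.

```python
def contexts(symbols, order):
    """Yield (context, next_symbol) pairs for a chain of the given order."""
    for i in range(len(symbols) - order):
        # The context is a tuple so it can be used as a dictionary key
        yield tuple(symbols[i:i + order]), symbols[i + order]

# A made-up DNA fragment, treated as a sequence of single-letter states
pairs = list(contexts("ATGGCC", 3))
for context, next_symbol in pairs:
    print(context, "->", next_symbol)
```

With `order=3`, each context is one triplet of bases, which is why a third order chain is a natural fit for codons.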


In [3]:
def build_markov_model(markov_model, text, order=1):
    '''
    Function to build or add to an Nth order Markov model given a string of text

    Args: 
        markov_model (dict of dicts): a dictionary of word_tuple:(next_word:frequency pairs)
            or None if a new model is being built
        text (str): a string to build or add to the markov_model
        order (int): the number of previous states to consider for the model
        
    Returns:
        markov_model (dict of dicts): an updated/new markov_model
    '''
    # Start a new model if none was provided
    if markov_model is None:
        markov_model = {}

    # Split the text into words and add the artificial end state
    words = text.split()
    words.append("*E*")
    
    # Add one artificial start state per order so the first word has a full context
    words = ["*S*"] * order + words
  
    # Each state is a tuple of `order` consecutive words; count the word that follows it
    for i in range(0, len(words)-order):
        word_set = tuple(words[i:i+order])
        
        if word_set in markov_model:
            if words[i+order] in markov_model[word_set]:
                markov_model[word_set][words[i+order]] += 1
            else:
                markov_model[word_set][words[i+order]] = 1
        else:
            markov_model[word_set] = {}
            markov_model[word_set][words[i+order]] = 1
                            
    return markov_model

In [4]:
markov_model = dict()
text = "one fish two fish red fish blue red fish blue"
markov_model = build_markov_model(markov_model, text, order=2)
markov_model

{('*S*', '*S*'): {'one': 1},
 ('*S*', 'one'): {'fish': 1},
 ('one', 'fish'): {'two': 1},
 ('fish', 'two'): {'fish': 1},
 ('two', 'fish'): {'red': 1},
 ('fish', 'red'): {'fish': 1},
 ('red', 'fish'): {'blue': 2},
 ('fish', 'blue'): {'red': 1, '*E*': 1},
 ('blue', 'red'): {'fish': 1}}

## Generate text from Markov Model

Markov models are "generative models". That is, new output can be generated by repeatedly sampling the next state from the model's transition probabilities.

We will now generate a sequence of text from the Markov model. For this section, I recommend using np.random.choice, which lets you provide a probability distribution for drawing the next state in the chain.
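
As a minimal sketch of that call, the example below draws one next word using `np.random.choice` with its `p` argument. The word list and counts are taken from the "fish" state of the earlier 1st order example; the seed value is arbitrary.

```python
import numpy as np

# Possible next states and how often each was observed
next_words = ["two", "red", "blue", "*E*"]
counts = [1, 1, 1, 1]

# Normalize the counts so they sum to 1, as required by the `p` argument
probabilities = [c / sum(counts) for c in counts]

np.random.seed(0)  # arbitrary seed so the draw is repeatable
draw = np.random.choice(next_words, p=probabilities)
print(draw)
```

The probabilities passed as `p` must be in the same order as the options and must sum to 1.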

In [5]:
import numpy as np

def get_next_word(current_word, markov_model, seed=42):
    '''
    Function to randomly move to a valid next state given a markov model
    and a current state
    
    Args: 
        current_word (tuple): the current state, a tuple of words that exists as a key in our model
        markov_model (dict of dicts): a dictionary of word_tuple:(next_word:frequency pairs)
        seed (int): seed passed to np.random.seed before every draw

    Returns:
        next_word (str): a randomly selected next word based on transition probabilities
        
    Pseudocode:
        Calculate transition probabilities for all next states from a given state (counts/sum)
        Randomly draw from these to generate the next state
        
    '''
    # Get all of our possible next states
    next_words = list(markov_model[current_word].keys())
    
    # Calculate the probabilities to move to those based on word counts
    next_words_frequencies = list(markov_model[current_word].values())
    next_words_probabilities = [x / sum(next_words_frequencies) for x in next_words_frequencies]

    # Randomly move to the next state
    np.random.seed(seed)
    next_state = np.random.choice(next_words, 1, p=next_words_probabilities)

    # Return next state
    return next_state[0]

def generate_random_text(markov_model, seed=42):
    '''
    Function to generate text given a markov model
    
    Args: 
        markov_model (dict of dicts): a dictionary of word_tuple:(next_word:frequency pairs)
        seed (int): seed forwarded to get_next_word for every draw

    Returns:
        sentence (str): a randomly generated sequence given the model
        
    Pseudocode:
        Initialize sentence at start state
        Until End State:
            append get_next_word(current_word, markov_model)
        Return sentence
        
    '''
    # Infer the order of the model from the length of a key
    # (assumes every key was built with the same order)
    order = len(next(iter(markov_model)))
    
    # We must start at the initial state of the model: one start state per order
    current_tuple = ("*S*",) * order

    # Keeping track of the sentence as a list (ignoring the start state)
    sentence = list()
    current_word = ''

    # Until the model generates an end state, keep adding random words
    while current_word != "*E*":
        current_word = get_next_word(current_tuple, markov_model, seed=seed)
        
        # Don't append the end state to our output
        if current_word != "*E*":
            sentence.append(current_word)
            
        # Slide the state window: drop the oldest word and add the new one
        current_tuple = current_tuple[1:] + (current_word,)

    # Return the words with spaces between them
    return ' '.join(sentence)
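
One detail of the code above is worth a closer look: `get_next_word` calls `np.random.seed(seed)` on every call, so each draw restarts from the same random stream. The sketch below isolates that behavior. The option list is made up, and seeding once is shown only as a possible alternative, not as what the code above does.

```python
import numpy as np

options = ["car", "star"]

# Reseeding before every draw: each call returns the same value
reseeded = []
for _ in range(5):
    np.random.seed(2)
    reseeded.append(np.random.choice(options))

# Seeding once: later draws continue along the same random stream
np.random.seed(2)
seeded_once = [np.random.choice(options) for _ in range(5)]

print(reseeded)
print(seeded_once)
```

In practice this means that, for a fixed seed, the generator makes the same choice every time it visits a given state, so the output is repeatable and different seeds are needed to see different sentences.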

In [6]:
# Now just add some more training data to the markov model
markov_model = dict()
markov_model = build_markov_model(markov_model, "black fish blue fish old fish new fish", order=2)
markov_model = build_markov_model(markov_model, "this one has a little car", order=2)
markov_model = build_markov_model(markov_model, "this one has a little star", order=2)
markov_model = build_markov_model(markov_model, "say what a lot of fish there are", order=2)
markov_model = build_markov_model(markov_model, "yes some are red and some are blue", order=2)
markov_model = build_markov_model(markov_model, "some are old and some are new", order=2)

print(generate_random_text(markov_model, seed=2))

this one has a little car


In [7]:
# An example of a more complex text that we can use to generate more complex output
sonnet_markov_model = dict()
sonnet = ""
with open("data/sonnets.txt", "r") as file:
    for line in file:
        line = line.strip()
        if line == "":
            # Empty line marks the end of a sonnet, so add it to the model
            sonnet_markov_model = build_markov_model(sonnet_markov_model, sonnet, order=2)
            sonnet = ""
        else:
            sonnet = sonnet + ' ' + line

print(generate_random_text(sonnet_markov_model, seed=7))

When forty winters shall besiege thy brow, And dig deep trenches in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, or else this glutton be, To eat the world's fresh ornament, And only herald to the very same And that unfair which fairly doth excel; For never-resting time leads summer on To hideous winter, and confounds him there; Sap checked with frost, and lusty leaves quite gone, Beauty o'er-snowed and bareness every where: Then were not summer's distillation left, A liquid prisoner pent in walls of glass, Beauty's effect with beauty were bereft, Nor it, nor no remembrance what it was: But flowers distill'd, though they with winter meet, Leese but their show; their substance still lives sweet.
