# Markov Chains

---

## Before Class
In class today we will be implementing a Markov chain to process sentences.
Prior to class, please do the following:
1. Review slides on Markov chains in detail
2. Explore using the Python dict() structure and how a dict() can contain nested dict() structures

---
## Learning Objectives

1. Conceptually understand Markov Chains
2. Implement a Markov Chain


---
## Background

Recall from the lectures that Markov Chains represent a series of events following the Markov Property: the process is memory-less, in that the next state depends only on the current state. This can be extended to models with a longer memory, where the next state depends on the previous N states (an Nth order Markov model; the basic chain described above is 1st order). Markov models consist of fully observable states. A common example is predicting the weather: we can clearly see the current weather and would like to predict tomorrow's weather. As shown in the slides, this is also applicable to biology, with one case being CpG islands.
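As a small illustration of the weather example, the sketch below stores a first-order chain as a dictionary of dictionaries and pushes a probability distribution forward by one and two days. The two states, the probabilities, and the helper name `step` are made up for illustration; they are not taken from the slides.

```python
# Hypothetical transition probabilities: weather_model[today][tomorrow]
weather_model = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(distribution, model):
    """Advance a probability distribution over states by one day."""
    next_distribution = {state: 0.0 for state in model}
    for today, p_today in distribution.items():
        for tomorrow, p_transition in model[today].items():
            # Probability of being in `today` times probability of moving to `tomorrow`
            next_distribution[tomorrow] += p_today * p_transition
    return next_distribution

today = {"sunny": 1.0, "rainy": 0.0}   # we observe that today is sunny
tomorrow = step(today, weather_model)
day_after = step(tomorrow, weather_model)
print(tomorrow)
print(day_after)
```

Because of the Markov Property, each call to `step` needs only the current distribution, not the history of earlier days.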

Our goal today will be to implement a Markov model built from words. For our example text, we will use the classic example of Dr. Seuss because of the repetitive nature of the text.

---
## Train Markov model

For our initial implementation of the Markov Model, we will use the simple example of Dr. Seuss: "One fish two fish red fish blue fish."



In [6]:
def build_markov_model(markov_model, new_text):
    '''
    Function to build or add to a 1st order Markov model given a string of text
    We will store the markov model as a dictionary of dictionaries
    The key in the outer dictionary represents the current state
    and the inner dictionary maps each next state to the number of times
    that transition was seen (the counts are turned into transition
    probabilities later, when we generate text).
    Note: This would be easier to read if we were to build a class representation
           of the model rather than a dictionary of dictionaries, but for simplicity
           our implementation will just use this structure.
    
    Args: 
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
        new_text (str): a string to build or add to the markov_model

    Returns:
        markov_model (dict of dicts): an updated markov_model
        
    Pseudocode:
        Add artificial states for start and end
        For each word in text:
            Increment markov_model[word][next_word]
        
    '''
    # Goal: build just a first-order Markov model.
    # First order in the case of DNA: the previous base is considered.
    # Second order with DNA: the two previous bases are considered.

    # Turn the new_text string into a list of words
    new_text_lst = new_text.split()

    # Add artificial begin and end states
    new_text_lst.insert(0, "START")
    new_text_lst.append("END")

    # Iterate over each word by index so we can also look at the word after it.
    # Stop one short of the end: the last element ("END") has no next word,
    # so index + 1 would be out of range.
    for index in range(len(new_text_lst) - 1):
        current_word = new_text_lst[index]
        next_word = new_text_lst[index + 1]

        # Indexing a dictionary with a missing key raises a KeyError, so we must
        # check as we build. First make sure the current word has an inner dictionary.
        if current_word not in markov_model:
            markov_model[current_word] = {}

        # Then increment the count for this transition, or start it at 1
        if next_word in markov_model[current_word]:
            markov_model[current_word][next_word] += 1
        else:
            markov_model[current_word][next_word] = 1

    # Return the dictionary markov_model
    return markov_model
    

In [7]:
markov_model = dict()
text = "one fish two fish red fish blue fish"
markov_model = build_markov_model(markov_model, text) #markov_model is set to equal the output of the function. 
print (markov_model)

{'START': {'one': 1}, 'one': {'fish': 1}, 'fish': {'two': 1, 'red': 1, 'blue': 1, 'END': 1}, 'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 1}}
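The model above stores raw counts. A minimal sketch of turning those counts into transition probabilities (counts divided by the row sum) follows; `to_transition_probabilities` is a hypothetical helper name, not part of the exercise.

```python
def to_transition_probabilities(count_model):
    """Normalize each inner dictionary of counts so that its values sum to 1."""
    probability_model = {}
    for state, next_counts in count_model.items():
        total = sum(next_counts.values())
        probability_model[state] = {
            next_state: count / total for next_state, count in next_counts.items()
        }
    return probability_model

# The counts printed above for "one fish two fish red fish blue fish"
counts = {'START': {'one': 1}, 'one': {'fish': 1},
          'fish': {'two': 1, 'red': 1, 'blue': 1, 'END': 1},
          'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 1}}

probabilities = to_transition_probabilities(counts)
print(probabilities['fish'])   # {'two': 0.25, 'red': 0.25, 'blue': 0.25, 'END': 0.25}
```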


In [8]:
markov_model = dict()
text = "one fish two fish red fish blue fish blue fish Amelia"
markov_model = build_markov_model(markov_model, text) #markov_model is set to equal the output of the function. 
print (markov_model)

{'START': {'one': 1}, 'one': {'fish': 1}, 'fish': {'two': 1, 'red': 1, 'blue': 2, 'Amelia': 1}, 'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 2}, 'Amelia': {'END': 1}}


In [4]:
test = {'*S*': {'one': 1}, 'one': {'fish': 1}, 'fish': {'two': 1, 'red': 1, 'blue': 1, '*E*': 1}, 'two': {'fish': 1}, 'red': {'fish': 1}, 'blue': {'fish': 1}}

In [5]:
test["one"]["fish"]

1

In [9]:
string = "I will be driving from Michigan back to Wisconsin"
string_list = string.split()
string_list
string_list.insert(0, "START")
string_list
string_list.append("END")
string_list

['START',
 'I',
 'will',
 'be',
 'driving',
 'from',
 'Michigan',
 'back',
 'to',
 'Wisconsin',
 'END']

###  Nth order Markov chain
In the above model, each event or word is output from only the previous state with no memory of any prior states. While this is useful in some cases, typical biological applications of Markov chains require higher-order models to accurately capture what we know about a system. For instance, in attempting to identify coding regions of a genome, we know that open reading frames (ORFs) contain codon triplets, and so a third or sixth order Markov chain would better describe these regions. Here you will implement a generalized form of our previous Markov Chain to allow for Nth order chains.
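To make the idea of order concrete for sequence data, here is a sketch that counts which base follows each run of `order` consecutive bases in a DNA string. The function name `build_dna_markov_model` and the toy sequence are assumptions for illustration, and unlike the word model this sketch adds no start or end states.

```python
def build_dna_markov_model(sequence, order=2):
    """Count which base follows each run of `order` consecutive bases."""
    model = {}
    for index in range(len(sequence) - order):
        context = sequence[index:index + order]   # the previous `order` bases
        next_base = sequence[index + order]       # the base that follows them
        inner = model.setdefault(context, {})     # create the inner dict on first sight
        inner[next_base] = inner.get(next_base, 0) + 1
    return model

dna_model = build_dna_markov_model("ATGATGATC", order=2)
print(dna_model)   # {'AT': {'G': 2, 'C': 1}, 'TG': {'A': 2}, 'GA': {'T': 2}}
```

With `order=2`, the context "AT" is followed by "G" twice and "C" once, information that a 1st order model keyed on "T" alone would blur together with every other "T" in the sequence.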


In [1]:
def build_markov_model2(markov_model, text, order=1):
    '''
    Function to build or add to an Nth order Markov model given a string of text

    Args: 
        markov_model (dict of dicts): a dictionary of (word tuple):(next_word:frequency pairs)
            or None if a new model is being built
        text (str): a string to build or add to the markov_model
        order (int): the number of previous states to consider for the model
        
    Returns:
        markov_model (dict of dicts): an updated/new markov_model
    '''
    # Start a new model if none was passed in
    if markov_model is None:
        markov_model = {}

    # Split the text into words and add the END state
    new_text_lst = text.split()
    new_text_lst.append("END")

    # Pad the front with one START per order, so the first real word
    # has a full-length context of `order` states
    for j in range(order):
        new_text_lst.insert(0, "START")

    # Last time we stopped 1 short of the end since it was a first order model.
    # Now we stop `order` short, so that index + order stays in range.
    for index in range(len(new_text_lst) - order):
        # The key is no longer a single word but the `order` consecutive words
        # starting at index. It must be a tuple: tuples are hashable, lists are not.
        top_key_words = tuple(new_text_lst[index:index + order])
        next_word = new_text_lst[index + order]

        # Same checks as before: create the inner dictionary the first time
        # we see this context...
        if top_key_words not in markov_model:
            markov_model[top_key_words] = {}

        # ...then increment the count for this transition, or start it at 1
        if next_word in markov_model[top_key_words]:
            markov_model[top_key_words][next_word] += 1
        else:
            markov_model[top_key_words][next_word] = 1

    return markov_model
    

In [2]:
# Build a 2nd order model: each key is a tuple of two consecutive words,
# and the text is padded with two START states (one per order).
markov_model = dict()
text = "one fish two fish red fish blue fish"
markov_model = build_markov_model2(markov_model, text, order=2)
markov_model

{('START', 'START'): {'one': 1},
 ('START', 'one'): {'fish': 1},
 ('one', 'fish'): {'two': 1},
 ('fish', 'two'): {'fish': 1},
 ('two', 'fish'): {'red': 1},
 ('fish', 'red'): {'fish': 1},
 ('red', 'fish'): {'blue': 1},
 ('fish', 'blue'): {'fish': 1},
 ('blue', 'fish'): {'END': 1}}

In [20]:
tup = ("START", "START")

In [23]:
next_word = markov_model[tup].keys()
next_word_lst = list(next_word)
next_word_lst

['one']

In [46]:
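# Note: the KeyError below is raised when markov_model has no ('START', 'START') key,
# which would be the case if the model had been rebuilt with a different order
# before this cell was run (the cell numbers suggest it was run much later).
freq = markov_model[tup].values()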
next_word_freq = list(freq)
next_word_freq

KeyError: ('START', 'START')
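A KeyError like the one above is raised whenever the context tuple is not a key of the model, for example when the model was built with a different order. A minimal sketch of a lookup that tolerates this, using `dict.get` with an empty default, is shown here; `next_word_counts` is a hypothetical helper, not part of the exercise.

```python
def next_word_counts(markov_model, context):
    """Return (words, counts) for a context, or two empty lists if it was never seen."""
    inner = markov_model.get(context, {})   # no KeyError for unseen contexts
    return list(inner.keys()), list(inner.values())

# A 1st order model keyed by 1-tuples, as build_markov_model2 produces with order=1
first_order = {('START',): {'one': 1}, ('one',): {'fish': 1}}

print(next_word_counts(first_order, ('START',)))          # (['one'], [1])
print(next_word_counts(first_order, ('START', 'START')))  # ([], [])
```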

In [11]:
test = ["a", "b", "c"]

In [13]:
test2 = tuple(test)

In [14]:
num = 2
num

2

In [15]:
empty = []
for index in range(num):
    empty.append("hi")
tuple(empty)

('hi', 'hi')

In [16]:
full = ["a", "b", "c"]
for index in range(num):
    full.insert(0, "hi")
print(full)

['hi', 'hi', 'a', 'b', 'c']


## Generate text from Markov Model

Markov models are "generative models". That is, the transition probabilities stored in the model can be used to generate new output that follows the same conditional probabilities.

We will now generate a sequence of text from the Markov model. For this section, I recommend using np.random.choice, which allows you to provide a probability distribution for drawing the next state in the chain.
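As a quick illustration of `np.random.choice` with a probability distribution, the sketch below draws from a made-up set of next words and counts. With `size=None` (the default) a single value is returned; with an integer `size` an array is returned.

```python
import numpy as np

# Made-up next words and counts, in the same shape as one inner dictionary of the model
next_words = ['two', 'red', 'blue', 'END']
counts = [1, 1, 2, 1]
probabilities = [count / sum(counts) for count in counts]

np.random.seed(0)  # seed once so the draws are reproducible

# A single draw: returns one value, not a list
single = np.random.choice(next_words, p=probabilities)

# Many draws: returns a NumPy array of the requested size
several = np.random.choice(next_words, size=1000, p=probabilities)

print(single)
print((several == 'blue').mean())  # fraction of draws that were 'blue', expected near 0.4
```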

In [12]:
import numpy as np

def get_next_word2(current_word, markov_model, seed=None):
    '''
    Function to randomly move to a valid next state given a markov model
    and a current state (tuple of words)
    
    Args: 
        current_word (tuple): a state (tuple of words) that exists in our model
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
        seed (int or None): if given, reseed NumPy's global random generator
            before drawing; leave as None when calling this in a loop

    Returns:
        next_word (str): a randomly selected next word based on transition probabilities
        
    Pseudocode:
        Calculate transition probabilities for all next states from a given state (counts/sum)
        Randomly draw from these to generate the next state
        
    '''
    # Possible next words: the keys of the inner dictionary.
    # np.random.choice needs a list here, not a dict_keys view.
    next_words_lst = list(markov_model[current_word].keys())

    # Now get the frequencies (all the integer counts)
    next_state_freq = list(markov_model[current_word].values())

    # Now calculate the probabilities: each count divided by the total
    total = sum(next_state_freq)
    next_words_probabilities = [x / total for x in next_state_freq]

    # Only reseed when asked to. Reseeding with the same value before every
    # draw would make every draw from a given state identical.
    if seed is not None:
        np.random.seed(seed)

    # np.random.choice(a, size=None, replace=True, p=None)
    #   a    = the random sample is generated from its elements
    #   size = default is None, in which case a single value is returned
    #   p    = probabilities associated with each entry in a
    next_word = np.random.choice(next_words_lst, None, p=next_words_probabilities)

    # With size=None the result is a single value, not a list, so return it
    # directly as a string; indexing it with [0] would return only its first character.
    return str(next_word)

In [13]:
import numpy as np

def generate_random_text2(markov_model, seed=42):
    '''
    Function to generate text given a markov model
    
    Args: 
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
            whose keys are tuples (as built by build_markov_model2)

    Returns:
        sentence (str): a randomly generated sequence given the model
        
    Pseudocode:
        Initialize sentence at start state
        Until End State:
            append get_next_word(current_word, markov_model)
        Return sentence
        
    '''
    # Seed once, up front. Reseeding with the same value before every draw would
    # restart the generator each time, so each state would always pick the same
    # next word and the walk could cycle forever.
    np.random.seed(seed)

    # We first need to estimate the order of the model from the length of the
    # first key. dict.keys() cannot be indexed in Python 3, so convert it to a list.
    order = len(list(markov_model.keys())[0])

    # We must start at the initial state: a tuple holding `order` copies of "START".
    # It has to be a tuple because tuples are hashable and lists are not.
    sentence_lst = []
    for i in range(order):
        sentence_lst.append("START")
    start_tup = tuple(sentence_lst)   # e.g. ("START", "START") for order 2

    sentence_test = []   # the generated words; this is what we'll be returning
    current_word = ""

    # Keep drawing words until the model generates the "END" state
    while current_word != "END":
        # Get the next word and keep it; seed=None so we do not reseed inside the loop
        current_word = get_next_word(start_tup, markov_model, seed=None)

        # We don't want "END" in the output
        if current_word == "END":
            break
        sentence_test.append(current_word)

        # Slide the state forward: drop the oldest word, append the new word,
        # and turn it back into a tuple.
        # ("START", "START") -> one, then ("START", "one") -> fish
        current = list(start_tup)
        current.pop(0)
        current.append(current_word)
        start_tup = tuple(current)

    # Join the words with spaces between them
    sentence = " ".join(sentence_test)

    return sentence
    

In [15]:
import numpy as np

def get_next_word(current_word, markov_model, seed=None):
    '''
    Function to randomly move to a valid next state given a markov model
    and a current state (tuple of words)
    
    Args: 
        current_word (tuple): a state (tuple of words) that exists in our model
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
        seed (int or None): if given, reseed NumPy's global random generator
            before drawing; leave as None when calling this in a loop

    Returns:
        next_word (str): a randomly selected next word based on transition probabilities
        
    Pseudocode:
        Calculate transition probabilities for all next states from a given state (counts/sum)
        Randomly draw from these to generate the next state
        
    '''
    # Get all of our possible next states: the keys of the inner dictionary
    # for the given state. This is the `a` argument of np.random.choice.
    next_words = list(markov_model[current_word].keys())
    
    # Calculate the probabilities to move to those based on word counts
    next_words_frequencies = list(markov_model[current_word].values())
    total = sum(next_words_frequencies)
    next_words_probabilities = [x / total for x in next_words_frequencies]

    # Only reseed when asked to. Reseeding with the same value before every
    # draw would make every draw from a given state identical.
    if seed is not None:
        np.random.seed(seed)

    # Randomly move to the next state
    next_state = np.random.choice(next_words, 1, p=next_words_probabilities)

    # With size 1 the result is a one-element array, so take element 0
    # and convert it to a plain string.
    return str(next_state[0])

    
def generate_random_text(markov_model, seed=42):
    '''
    Function to generate text given a markov model
    
    Args: 
        markov_model (dict of dicts): a dictionary of word:(next_word:frequency pairs)
            whose keys are tuples (as built by build_markov_model2)

    Returns:
        sentence (str): a randomly generated sequence given the model
        
    Pseudocode:
        Initialize sentence at start state
        Until End State:
            append get_next_word(current_word, markov_model)
        Return sentence
        
    '''
    # Seed once, up front, so the output is reproducible. Seeding inside the loop
    # would restart the generator before every draw, so each state would always
    # pick the same next word and the walk could cycle forever.
    np.random.seed(seed)

    # Estimate the order from the length of the first key
    order = len(list(markov_model.keys())[0])
    
    # We must start at the initial state of the model
    current_keyList = []
    for i in range(order):
        current_keyList.append("START")
        
    current_tuple = tuple(current_keyList)

    # Keeping track of the sentence as a list (ignoring the start state)
    sentence = list()
    current_word = ''

    # Until the model generates an end state, keep adding random words
    while current_word != "END":
        current_word = get_next_word(current_tuple, markov_model)
        
        # Don't append the end state to our output
        if current_word != "END":
            sentence.append(current_word)
            
        # Slide the state forward: drop the oldest word and append the new one
        current_list = list(current_tuple)
        current_list.pop(0)
        current_list.append(current_word)
        current_tuple = tuple(current_list)

    # Return the words with spaces between them
    return ' '.join(sentence)
    


In [None]:
# Now just add some more training data to the markov model
markov_model = dict()
markov_model = build_markov_model2(markov_model, "black fish blue fish old fish new fish")
markov_model = build_markov_model2(markov_model, "this one has a little car")
markov_model = build_markov_model2(markov_model, "this one has a little star")
markov_model = build_markov_model2(markov_model, "say what a lot of fish there are")
markov_model = build_markov_model2(markov_model, "yes some are red and some are blue")
markov_model = build_markov_model2(markov_model, "some are old and some are new")

print (generate_random_text(markov_model,seed=7))
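
Where the random generator is seeded matters when generating text. In this sketch (the word list is made up), reseeding with the same seed before every draw returns the same word every time, while seeding once gives a reproducible sequence whose draws are free to differ.

```python
import numpy as np

words = ['fish', 'car', 'star', 'END']

# Reseeding before every draw: each draw restarts the generator from the same state
reseeded = []
for _ in range(10):
    np.random.seed(7)
    reseeded.append(str(np.random.choice(words)))

# Seeding once: the generator state advances from draw to draw
np.random.seed(7)
seeded_once = [str(np.random.choice(words)) for _ in range(10)]

print(reseeded)      # ten copies of the same word
print(seeded_once)   # starts with that same word, then continues from the advanced state
```

In a Markov chain this means that a generator reseeded before each step always leaves a given state by the same transition, so a walk that enters a cycle never reaches the end state.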

In [None]:
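# An example of a more complex text that we can use to generate more complex output
sonet_markov_model = dict()
sonet = ""
# Use a context manager so the file is closed once we are done reading
with open("data/sonnets.txt", "r") as file:
    for line in file:
        line = line.strip()
        if line == "":
            # Empty line marks the end of a sonnet, so add it to the model.
            # generate_random_text expects tuple keys, so use build_markov_model2.
            if sonet:
                sonet_markov_model = build_markov_model2(sonet_markov_model, sonet)
            sonet = ""
        else:
            sonet = sonet + ' ' + line

# Add the last sonnet in case the file does not end with an empty line
if sonet:
    sonet_markov_model = build_markov_model2(sonet_markov_model, sonet)

print (generate_random_text(sonet_markov_model,seed=7))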

In [108]:
tup = (("Four", "Three", "Two", "One"))
tup
tup_list = list(tup)
tup_list

['Four', 'Three', 'Two', 'One']

In [109]:
test

{'*S*': {'one': 1},
 'one': {'fish': 1},
 'fish': {'two': 1, 'red': 1, 'blue': 1, '*E*': 1},
 'two': {'fish': 1},
 'red': {'fish': 1},
 'blue': {'fish': 1}}

In [111]:
list(test["fish"].keys())

['two', 'red', 'blue', '*E*']

In [112]:
list(test["fish"].values())

[1, 1, 1, 1]

In [119]:
order = 3
