We will now turn to randomized algorithms that flip coins and roll dice in order to search for motifs. Making random algorithmic decisions may sound like a disastrous idea; just imagine a chess game in which every move is decided by rolling a die. However, the 18th-century French mathematician and naturalist Comte de Buffon first showed that randomized algorithms can be useful by randomly dropping needles onto parallel strips of wood and using the results of this experiment to accurately approximate the constant π (see DETOUR: Buffon’s Needle).
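
As a rough illustration of Buffon's experiment, the sketch below simulates needle drops and turns the fraction of line crossings into an estimate of π. It assumes a needle whose length equals the spacing between lines (both set to 1); the function name `buffon_pi_estimate` and its parameters are made up for this example, and the direction of each needle is sampled without using π itself.

```python
import random

def buffon_pi_estimate(num_needles, seed=None):
    """Estimate pi by simulating needle drops (needle length == line spacing == 1)."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(num_needles):
        # Distance from the needle's center to the nearest line.
        center = rng.uniform(0.0, 0.5)
        # Pick a random direction without using pi: rejection-sample a point
        # in the unit disk and read off |sin(angle)| from its coordinates.
        while True:
            x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
            r2 = x * x + y * y
            if 0.0 < r2 <= 1.0:
                break
        sin_angle = abs(y) / r2 ** 0.5
        # The needle crosses a line when half its length, projected
        # perpendicular to the lines, reaches the nearest line.
        if center <= 0.5 * sin_angle:
            crossings += 1
    if crossings == 0:
        raise ValueError("no crossings observed; drop more needles")
    # P(crossing) = 2 / pi in this setup, so pi is roughly 2 * drops / crossings.
    return 2.0 * num_needles / crossings

print(buffon_pi_estimate(100000, seed=1))
```

The estimate changes from run to run (or from seed to seed) and only approaches π as the number of needles grows, which is typical of a Monte Carlo calculation.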

Randomized algorithms may be nonintuitive because they lack the control of traditional algorithms. Some randomized algorithms are Las Vegas algorithms, which deliver solutions that are guaranteed to be exact, despite the fact that they rely on making random decisions. Yet most randomized algorithms, including the motif finding algorithms that we will consider in this chapter, are Monte Carlo algorithms. These algorithms are not guaranteed to return exact solutions, but they do quickly find approximate solutions. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.

We previously defined Profile(Motifs) as the profile matrix constructed from a collection of k-mers Motifs in Dna. Now, given a collection of strings Dna and an arbitrary 4 x k matrix Profile, we define Motifs(Profile, Dna) as the collection of k-mers formed by the Profile-most probable k-mers in each string from Dna. For example, consider the following Profile and Dna: 

Write a function Motifs(Profile, Dna) that takes a profile matrix Profile corresponding to a list of strings Dna as input and returns a list of the Profile-most probable k-mers in each string from Dna. Then add this function to Motifs.py.

In [6]:
from numba import jit

# Input:  A profile matrix Profile and a list of strings Dna
# Output: Motifs(Profile, Dna)
# Note: the functions in this cell are plain Python. The @jit decorators were dropped
# because they returned empty strings on ProfileMostProbableKmer and the dictionary-based
# profile is not a good fit for compilation.
def Motifs(Profile, Dna):
    output_motifs = []
    k = len(Profile["A"])  # k is the number of columns in the profile
    for text in Dna:  # for each string, find its Profile-most probable k-mer
        output_motifs.append(ProfileMostProbableKmer(text, k, Profile))
    return output_motifs

# Input:  A k-mer Motif and profile matrix Profile
# Output: Pr(Motif, Profile)
def Pr(Motif, Profile):
    prob = 1.0
    for char in range(len(Motif)):
        # probability of seeing this character at this position
        prob *= Profile[Motif[char]][char]
    return prob

# The profile matrix assumes that the first row corresponds to A, the second corresponds to C,
# the third corresponds to G, and the fourth corresponds to T.
# You should represent the profile matrix as a dictionary whose keys are 'A', 'C', 'G', and 'T' and whose values are lists of floats
def ProfileMostProbableKmer(text, k, Profile):
    most_prob = ""
    high_prob = -1.0  # below any real probability, so the first k-mer always wins initially
    for index in range(len(text)-k+1):
        Motif = text[index:index+k]
        Motif_prob = Pr(Motif, Profile)
        if Motif_prob > high_prob:  # strict '>' keeps the first k-mer in case of ties
            high_prob = Motif_prob
            most_prob = Motif
    return most_prob

In [7]:
ProfileTestCase0 = {'A': [0.8, 0.0, 0.0, 0.2],
                    'C': [0.0, 0.6, 0.2, 0.0],
                    'G': [0.2, 0.2, 0.8, 0.0],
                    'T': [0.0, 0.2, 0.0, 0.8]}
DnaTC0 = ['TTACCTTAAC', 'GATGTCTGTC', 'ACGGCGTTAG', 'CCCTAACGAG', 'CGTCAGAGGT']
# Expected output
out0 = ["ACCT", "ATGT", "GCGT", "ACGA", "AGGT"]

Motifs(ProfileTestCase0,DnaTC0)

['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']
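
To see where the first entry of this output comes from, the following sketch scores every 4-mer window of the first string against the same profile. It is a standalone illustration: the profile is repeated so the snippet runs on its own, and `kmer_probability` is a hypothetical helper name that mirrors what `Pr` does above.

```python
# Same profile as ProfileTestCase0 above, repeated so this snippet runs on its own.
profile = {
    'A': [0.8, 0.0, 0.0, 0.2],
    'C': [0.0, 0.6, 0.2, 0.0],
    'G': [0.2, 0.2, 0.8, 0.0],
    'T': [0.0, 0.2, 0.0, 0.8],
}

def kmer_probability(kmer, profile):
    """Multiply the profile entries for each symbol of kmer (hypothetical helper name)."""
    prob = 1.0
    for position, symbol in enumerate(kmer):
        prob *= profile[symbol][position]
    return prob

text, k = 'TTACCTTAAC', 4
window_probs = {}
for start in range(len(text) - k + 1):
    kmer = text[start:start + k]
    window_probs[kmer] = kmer_probability(kmer, profile)
    print(kmer, window_probs[kmer])

# The window with the largest probability is the Profile-most probable k-mer.
best_kmer = max(window_probs, key=window_probs.get)
print(best_kmer)
```

Every window other than ACCT starts with T or C, which this profile gives probability 0 in the first position, so ACCT (0.8 × 0.6 × 0.2 × 0.8 = 0.0768) is the only window with a nonzero probability.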

In general, we can begin from a collection of randomly chosen k-mers Motifs in Dna, construct Profile(Motifs), and use this profile to generate a new collection of k-mers:

  Motifs(Profile(Motifs), Dna).

Why would we do this? Because our hope is that Motifs(Profile(Motifs), Dna) has a better score than the original collection of k-mers Motifs. We can then form the profile matrix of these k-mers,

Profile(Motifs(Profile(Motifs), Dna))

and use it to form the most probable k-mers,

Motifs(Profile(Motifs(Profile(Motifs), Dna)), Dna).

We can continue to iterate. . .

...Profile(Motifs(Profile(Motifs(Profile(Motifs), Dna)), Dna))...

for as long as the score of the constructed motifs keeps improving, which is exactly what RandomizedMotifSearch does. To implement this algorithm, we will need to randomly select the initial collection of k-mers that form the motif matrix Motifs. To do so, we will first need to implement a random number generator that is equally likely to return any integer between 1 and M. You might like to think about this random number generator as an unbiased M-sided die.

STOP and Think: How would you implement an algorithm for generating a random integer? 
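
One possible answer, sketched under the assumption that a source of uniform random floats in [0, 1) is already available (here Python's `random.random()`): scale the float by M and truncate. The names `roll_die` and `random_start` are made up for this illustration. In practice the standard library's `random.randint(1, M)` does the same job directly.

```python
import random

def roll_die(M, rng=random):
    """Return an integer between 1 and M, each with probability 1/M."""
    # rng.random() is uniform on [0, 1), so int(M * rng.random()) is uniform on 0..M-1.
    return int(M * rng.random()) + 1

def random_start(text, k, rng=random):
    """Pick a 0-based start position for a k-mer in text using the die."""
    # A string of length n has n - k + 1 possible k-mer start positions.
    return roll_die(len(text) - k + 1, rng) - 1

start = random_start('TTACCTTAAC', 4)
print('TTACCTTAAC'[start:start + 4])
```

The die is 1-based to match the description above, while Python string indices are 0-based, hence the `- 1` when it is used to choose a k-mer position.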

In [10]:
import random
# Input:  A list of strings Dna, and integers k and t
# Output: RandomMotifs(Dna, k, t)
# HINT:   You might not actually need to use t since t = len(Dna), but you may find it convenient
def RandomMotifs(Dna, k, t):
    rand_motifs = []
    for text in Dna:
        # random.randint includes both endpoints, so the last valid start is len(text) - k.
        # Using len(text) rather than len(Dna[0]) keeps this correct if the strings differ in length.
        i = random.randint(0, len(text) - k)
        rand_motifs.append(text[i:i+k])
    return rand_motifs

In [11]:
k, t = 3, 5

RandomMotifs(DnaTC0, k, t)

['TTA', 'TGT', 'TTA', 'TAA', 'CGT']

In [12]:
out0 = ["aTTAz",
    "aGTCz",
    "aACGz",
    "aACGz",
    "aGAGz"]
RandomMotifs(out0, k, t)

['aTT', 'GTC', 'aAC', 'aAC', 'aGA']

We are now ready to develop RandomizedMotifSearch. We start by generating a collection of random motifs using the function from the previous step, which we set as the best-scoring collection of motifs.
```
    M = RandomMotifs(Dna, k, t)
    BestMotifs = M
```
The code below stops running as soon as the score of the motifs that we generate stops improving. It uses the loop “while True”, which iterates until it encounters a return statement. It can be dangerous to use such a loop, since it could lead to an infinite loop in which a program never terminates. However, in this particular case, the loop continues only when the score strictly decreases, and the score is a whole number that can never drop below zero, so RandomizedMotifSearch must eventually terminate.
```
    while True:
        Profile = ProfileWithPseudocounts(M)
        M = Motifs(Profile, Dna)
        if Score(M) < Score(BestMotifs):
            BestMotifs = M
        else:
            return BestMotifs
```
Code Challenge (1 point): Put this code into a function RandomizedMotifSearch that takes a list of strings Dna along with integers k and t as input.   Then add this function to Motifs.py.

In [21]:
# Input:  A set of kmers Motifs
# Output: A consensus string of Motifs.
def Consensus(Motifs):
    consensus = ""
    # Adding the same pseudocount to every entry does not change which symbol is most frequent
    counts = CountWithPseudocounts(Motifs)

    for j in range(len(Motifs[0])):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if counts[symbol][j] > m:
                m = counts[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol  # add the most frequent symbol in column j
    return consensus


# Input:  A set of kmers Motifs
# Output: Score(Motifs), the number of symbols that differ from the consensus string
def Score(Motifs):
    consensus = Consensus(Motifs)
    score = 0
    for motif in Motifs:
        for j in range(len(consensus)):
            # Count real mismatches only, so pseudocounts do not inflate the score
            if motif[j] != consensus[j]:
                score += 1
    return score


# Input:  A set of kmers Motifs
# Output: CountWithPseudocounts(Motifs)
def CountWithPseudocounts(Motifs):
    k = len(Motifs[0])
    # Every entry starts at 1 (the pseudocount) instead of 0
    count = {nucleotide: [1.0] * k for nucleotide in ["A","C","G","T"]}
    for motif in Motifs:
        for j in range(k):
            count[motif[j]][j] += 1.0  # one more occurrence of this nucleotide in column j
    return count

# Input:  A set of kmers Motifs
# Output: ProfileWithPseudocounts(Motifs)
def ProfileWithPseudocounts(Motifs):
    t = len(Motifs)
    profile = {}
    counts = CountWithPseudocounts(Motifs)

    for nucleotide in ["A","C","G","T"]:
        # Each column holds t real counts plus one pseudocount for each of the
        # four nucleotides, so every count is divided by t + 4
        profile[nucleotide] = [count / (t + 4.0) for count in counts[nucleotide]]
    return profile

# Input:  A list of strings Dna, followed by positive integers k and t
# Output: RandomizedMotifSearch(Dna, k, t)
def RandomizedMotifSearch(Dna, k, t):

    M = RandomMotifs(Dna, k, t)
    BestMotifs = M
    
    while True:
        Profile = ProfileWithPseudocounts(M)
        M = Motifs(Profile, Dna)
        if Score(M) < Score(BestMotifs):
            BestMotifs = M
        else:
            return BestMotifs 

In [24]:
k,t = 8, 5
Dna = ["CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"]

RandomizedMotifSearch(Dna,k,t)

['GGGTGTTC', 'AGTGCCAA', 'AGTACCGA', 'AGGTGCAC', 'ACGTGCAA']

Exercise Break (2 points): In practice, we retain the best-scoring set of motifs over many runs of RandomizedMotifSearch. Add an input parameter N representing the number of runs of RandomizedMotifSearch, and then find the best-scoring motifs with k-mer length equal to 15 in the DosR dataset over N runs. Don't forget to use pseudocounts!
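
The "best of N runs" pattern itself can be sketched independently of motif finding. The helper below is a hypothetical illustration (the name `best_of_n_runs` and its parameters are not part of Motifs.py): it takes any zero-argument randomized search and any scoring function where lower is better, and a toy number-guessing search stands in for the real algorithm.

```python
import random

def best_of_n_runs(search, score, N):
    """Run search() N times and return (best_result, best_score); lower scores are better."""
    best_result, best_score = None, float("inf")
    for _ in range(N):
        result = search()
        result_score = score(result)
        if result_score < best_score:
            best_result, best_score = result, result_score
    return best_result, best_score

# Toy stand-in for a randomized search: guess a number, score it by its distance from 42.
rng = random.Random(0)
best, best_score = best_of_n_runs(lambda: rng.randint(0, 1000), lambda x: abs(x - 42), N=200)
print(best, best_score)
```

With the functions from this notebook, the same helper could be called as `best_of_n_runs(lambda: RandomizedMotifSearch(Dna, k, t), Score, N)`.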

In [25]:
Dna = ["GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC",
"CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG",
"ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC",
"GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC",
"GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG",
"CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA",
"GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA",
"GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG",
"GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG",
"TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC"]

In [27]:

# Set t equal to the number of strings in Dna, k equal to 15, and N equal to 100.
k, t = 15, len(Dna)
N = 100

# Call RandomizedMotifSearch(Dna, k, t) N times, keeping the best-scoring set of motifs
# in best_motifs and its score in best_score
best_score = float("inf")
best_motifs = []
for i in range(N):
    motifs = RandomizedMotifSearch(Dna, k, t)
    score = Score(motifs)  # score each run once and reuse the value
    if score < best_score:
        best_score = score
        best_motifs = motifs

# Print the best motifs and their score
print(best_motifs)
print(best_score)

"""Passed test #2. Correct! Below is the best set of Motifs in the DosR with k = 15 using RandomizedMotifSearch (with a score of 60):
CCTGGCTTCGATGGC
ACCCGCGGCCCACGG
CGCCGCGTCCACCTC
CGACGCATCCACTGC
ACCGACGTCCCCAGC
ACGCTGTTCGCCGGC
CCGCTGTTCGATATC
AGCCTCAGCTTCTGC
GGTGGGGACCAACGC
CCACTGATCCAAGTC"""
