Our proposed greedy motif search algorithm, GreedyMotifSearch, starts by setting BestMotifs equal to the first k-mer from each string in Dna. These strings will serve as the best-scoring motifs found thus far.
```
    BestMotifs = []
    for i in range(0, t):
        BestMotifs.append(Dna[i][0:k])
```
It then ranges over all possible k-mers in Dna[0], trying each one as Motifs[0]. For a given choice of k-mer in Dna[0] for Motifs[0], the algorithm then builds a profile matrix Profile for this lone k-mer, and sets Motifs[1] equal to the Profile-most probable k-mer in Dna[1]. GreedyMotifSearch then iterates by updating Profile as the profile matrix formed from Motifs[0] and Motifs[1], and sets Motifs[2] equal to the Profile-most probable k-mer in Dna[2]. In general, after finding k-mers Motifs in the first i strings of Dna, GreedyMotifSearch constructs Profile(Motifs) and sets Motifs[i] equal to the Profile-most probable k-mer from Dna[i] based on this profile matrix.
```
    n = len(Dna[0])
    for i in range(n-k+1):
        Motifs = []
        Motifs.append(Dna[0][i:i+k])
        for j in range(1, t):
            P = Profile(Motifs[0:j])
            Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
```
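To make the inner loop concrete, here is a minimal sketch that traces one pass of it for a single starting k-mer. The helper names (`profile_of`, `probability`, `most_probable_kmer`, `greedy_pass`) are stand-ins written for this illustration, not the course's own functions, and the sketch assumes that ties are broken in favor of the first k-mer encountered.

```python
def profile_of(motifs):
    """Column-wise nucleotide frequencies of equal-length k-mers (no pseudocounts)."""
    t = len(motifs)
    k = len(motifs[0])
    return {n: [sum(m[j] == n for m in motifs) / t for j in range(k)] for n in "ACGT"}

def probability(kmer, profile):
    """Probability of kmer under profile: the product of its per-column frequencies."""
    p = 1.0
    for j, n in enumerate(kmer):
        p *= profile[n][j]
    return p

def most_probable_kmer(text, k, profile):
    """First k-mer in text with the highest probability under profile."""
    best, best_p = "", -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = probability(kmer, profile)
        if p > best_p:  # strict comparison keeps the earliest k-mer on ties
            best, best_p = kmer, p
    return best

def greedy_pass(dna, k, start):
    """Fix Motifs[0] at dna[0][start:start+k], then pick one k-mer per remaining string."""
    motifs = [dna[0][start:start + k]]
    for j in range(1, len(dna)):
        # The profile is rebuilt from all motifs chosen so far
        motifs.append(most_probable_kmer(dna[j], k, profile_of(motifs)))
    return motifs

dna = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC", "CACGTCAATCAC", "CAATAATATTCG"]
trace = greedy_pass(dna, 3, 6)  # start from "CAG" at index 6 of the first string
print(trace)  # ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
```

Note that the third string contains no "CAG", so every 3-mer in it has probability zero at that step and the sketch simply falls back to its first 3-mer, "CAA".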
After selecting a k-mer from each string in Dna to obtain a collection of strings Motifs, GreedyMotifSearch checks whether Motifs outscores (that is, has a lower Score than) the current best-scoring collection of motifs, BestMotifs.
```
        if Score(Motifs) < Score(BestMotifs):
            BestMotifs = Motifs
```
It then returns to the top of the for loop and moves one symbol over in Dna[0], beginning the entire process of generating Motifs again. After generating a collection Motifs for every possible initial k-mer from Dna[0], it returns the best-scoring strings BestMotifs.
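The comparison between Motifs and BestMotifs relies on a score in which lower is better. As a rough illustration, the sketch below counts, column by column, the symbols that disagree with the most common symbol. The function name `score` and the use of `collections.Counter` are choices made for this example, not the course's `Score` implementation.

```python
from collections import Counter

def score(motifs):
    """Number of symbols that differ from the most common symbol in their column."""
    total = 0
    for column in zip(*motifs):  # iterate over columns of the motif matrix
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total

first_kmers = ["GGC", "AAG", "CAA", "CAC", "CAA"]  # first 3-mer of each example string
candidate = ["CAG", "CAG", "CAA", "CAA", "CAA"]    # a more conserved collection

print(score(first_kmers), score(candidate))  # 6 2
```

Because the candidate collection has the lower score, a test of the form `score(candidate) < score(first_kmers)` would replace the initial collection with it.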

Note: Graeme Benstead-Hume wrote a post explaining GreedyMotifSearch in greater detail. Graeme was one of the excellent learners in our first ever MOOC session, and has since gone on to become a PhD student in bioinformatics at the University of Sussex! If you need more time to understand the algorithm, you may want to check out his post: http://www.mrgraeme.co.uk/greedy-motif-search/.

Code Challenge (1 point): Consolidate this code into a function GreedyMotifSearch(Dna, k, t), and then add this function to Motifs.py.

In [1]:
# Input:  A set of kmers Motifs
# Output: Count(Motifs)
def Count(Motifs):
    count = {} # initializing the count dictionary
    # Initialize each nucleotide with an empty list
    for nucleotide in ["A","C","G","T"]:
        count[nucleotide] = []
    for ind in range(len(Motifs[0])):
        for nucleotide in ["A","C","G","T"]:
            count[nucleotide].append(0) # every count starts at 0 for this column
        for motif in range(len(Motifs)): # for each motif, look at its symbol in this column
            count[Motifs[motif][ind]][ind] += 1 # increment that symbol's count for this column
    return count

# Input:  A list of kmers Motifs
# Output: the profile matrix of Motifs, as a dictionary of lists.
def Profile(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    profile = {}
    counts = Count(Motifs)
    for nucleotide in ["A","C","G","T"]:
        # every count must be divided by the total number of motifs
        profile[nucleotide] = [count/float(t) for count in counts[nucleotide]]
    return profile

# Input:  A set of kmers Motifs
# Output: A consensus string of Motifs, together with the count matrix Count(Motifs).
def Consensus(Motifs):
    consensus = "" #empty string.
    counts = Count(Motifs)

    for j in range(len(Motifs[0])):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if counts[symbol][j] > m:
                m = counts[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol # Add most frequent symbol
    return consensus, counts

# Input:  String Motif and profile matrix Profile
# Output: Pr(Motif, Profile)
def Pr(Motif, Profile):
    prob=1.0
    for char in range(len(Motif)):
        prob *= Profile[Motif[char]][char] # What is the probability that this character is in this position?
    
    return prob

# Input:  A list of kmers Motifs
# Output: Score(Motifs), the number of symbols that differ from the consensus string
def Score(Motifs):
    # Consensus also returns the count matrix, so Count(Motifs) is computed only once
    consensus, counts = Consensus(Motifs)
    score = 0
    for char in range(len(consensus)):
        nucleotide = consensus[char] # the consensus nucleotide in this column
        keys = [key for key in ["A","C","G","T"] if key != nucleotide] # the non-consensus nucleotides
        for key in keys:
            score += counts[key][char] # add this column's mismatches to the score
    return score

# Write your ProfileMostProbableKmer() function here.
# The profile matrix assumes that the first row corresponds to A, the second corresponds to C,
# the third corresponds to G, and the fourth corresponds to T.
# You should represent the profile matrix as a dictionary whose keys are 'A', 'C', 'G', and 'T' and whose values are lists of floats
def ProfileMostProbableKmer(text, k, Profile):
    most_prob = ""
    high_prob = -1.0
    for index in range(len(text)-k+1):
        Motif = text[index:index+k]
        Motif_prob = Pr(Motif, Profile)
        if Motif_prob > high_prob:
            high_prob = Motif_prob
            most_prob = Motif
    return most_prob

In [2]:
# Input:  A list of strings Dna, and integers k and t (where t is the number of strings in Dna)
# Output: GreedyMotifSearch(Dna, k, t)
def GreedyMotifSearch(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
        BestMotifs.append(Dna[i][0:k])
        
    n = len(Dna[0])
    for i in range(n-k+1):
        Motifs = []
        Motifs.append(Dna[0][i:i+k])
        for j in range(1, t):
            P = Profile(Motifs[0:j])
            Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
        if Score(Motifs) < Score(BestMotifs):
            BestMotifs = Motifs
    return BestMotifs

In [3]:
Dna = [
"GGCGTTCAGGCA", 
"AAGAATCAGTCA", 
"CAAGGAGTTCGC",
"CACGTCAATCAC", 
"CAATAATATTCG"
]

GreedyMotifSearch(Dna, 3, 5)

['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

In 2003, biologists found the dormancy survival regulator (DosR), a transcription factor that regulates many genes whose expression dramatically changes under hypoxic conditions. However, it remained unclear how DosR regulates these genes, and its transcription factor binding site remained unknown. In an attempt to resolve this puzzle, biologists performed a DNA array experiment and found 25 genes whose expression levels significantly changed in hypoxic conditions. Given the upstream regions of these genes, each of which is 250 nucleotides long, we would like to discover the “hidden message” that DosR uses to control the expression of these genes.

To simplify the problem a bit, we have selected just 10 of the 25 genes, resulting in the DosR dataset. In this chapter, we will try to identify motifs in this dataset using the motif finding algorithms that we will develop. However, we will not give you a hint about the DosR motif.

In [9]:
k, t = 3, 5
DNA = ["GGCGTTCAGGCA",
    "AAGAATCAGTCA",
    "CAAGGAGTTCGC",
    "CACGTCAATCAC",
    "CAATAATATTCG"]
GreedyMotifSearch(DNA, 3, 5)

['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

In [10]:
Dna = [
"GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC",
"CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG",
"ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC", 
"GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC",
"GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG", 
"CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA", 
"GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA", 
"GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG", 
"GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG", 
"TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC"
]

In [14]:
# set t equal to the number of strings in Dna and k equal to 15
k, t = 15, len(Dna)
print(k, t)
Motifs = GreedyMotifSearch(Dna, k, t)

15 10


In [15]:
Score(Motifs)

64

GreedyMotifSearch is fast and can be run with k = 15 to find candidate motifs in the DosR dataset, as we saw in the previous step. However, it trades accuracy for speed: when we run it on the Subtle Motif Dataset, GreedyMotifSearch returns the 15-mer "gtAAAtAgaGatGtG" (total score: 58), which varies greatly from the true implanted motif "AAAAAAAAGGGGGGG". This makes us highly suspicious of the results we obtained when running this algorithm on the DosR dataset.

STOP and Think: Why does GreedyMotifSearch perform so poorly?

At first glance, GreedyMotifSearch may seem like a reasonable algorithm, but it is not! Let’s see whether GreedyMotifSearch will find the (4, 1)-motif "ACGT" implanted in the following strings Dna:
```
ttACCTtaac
gATGTctgtc
acgGCGTtag
ccctaACGAg
cgtcagAGGT
```
We will assume that the algorithm has already correctly chosen the implanted 4-mer "ACCT" from the first string in Dna and constructed the corresponding Profile:
```
A: 1 0 0 0
C: 0 1 1 0
G: 0 0 0 0
T: 0 0 0 1
```
The algorithm is now ready to search for a Profile-most probable 4-mer in the second string. The issue, however, is that there are so many zeroes in the profile matrix that the probability of every 4-mer but "ACCT" is zero! Thus, unless "ACCT" is present in every string in Dna, there is little chance that GreedyMotifSearch will find the implanted motif. Zeroes in the profile matrix are not just a minor annoyance but rather a persistent problem that we must address.
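The effect of the zeroes can be checked directly. The following sketch hard-codes the profile shown above and computes the probability of every 4-mer in the second string. The names `probability` and `probs` are illustrative, and the string is uppercased only so that it can be used to index the profile.

```python
# Profile built from the single 4-mer "ACCT" (copied from the matrix above)
profile = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 1.0, 0.0],
    "G": [0.0, 0.0, 0.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}

def probability(kmer, profile):
    """Probability of kmer under profile: the product of its per-column frequencies."""
    p = 1.0
    for j, n in enumerate(kmer):
        p *= profile[n][j]
    return p

second = "gATGTctgtc".upper()  # second string of Dna
k = 4

# A list of (4-mer, probability) pairs, kept in order of appearance
probs = [(second[i:i + k], probability(second[i:i + k], profile))
         for i in range(len(second) - k + 1)]

for kmer, p in probs:
    print(kmer, p)  # every 4-mer, including the implanted "ATGT", prints 0.0
```

Since all probabilities are tied at zero, a search that keeps the first maximum would select "GATG" here rather than the implanted "ATGT".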