Many algorithms are iterative procedures that must choose among various alternatives at each iteration. Some of these alternatives may lead to correct solutions, whereas others may not. Greedy algorithms select the “most attractive” alternative at each iteration. For example, a greedy algorithm in chess might attempt to capture an opponent’s most valuable piece at every move. Yet anyone who has played chess knows that a strategy looking only one move ahead will likely produce disastrous results.

Most greedy algorithms fail to find an exact solution to the problem; instead, they are fast heuristics that trade accuracy for speed in order to find an approximate solution. Nevertheless, greedy algorithms prove quite useful for many biological problems.

In this section, we will explore a greedy approach to motif finding. Again, let Motifs be a collection of k-mers taken from t strings Dna. We can view each column of Profile(Motifs) as a four-sided biased die with one nucleotide on each side. Thus, a profile matrix with k columns can be viewed as a collection of k dice that we will roll to randomly generate a k-mer. For example, if the first column of the profile matrix is (0.2, 0.1, 0.0, 0.7), then we generate A as the first nucleotide with probability 0.2, C with probability 0.1, G with probability 0.0, and T with probability 0.7.
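The dice analogy can be sketched in a few lines of Python. This is a minimal illustration, not part of the course code: the name `generate_kmer` and the two-column `example_profile` are hypothetical, and only the first column (0.2, 0.1, 0.0, 0.7) comes from the text above.

```python
import random

NUCLEOTIDES = "ACGT"

def generate_kmer(profile, rng=random):
    """Roll one biased four-sided die per profile column and return the k-mer."""
    k = len(profile["A"])
    kmer = []
    for i in range(k):
        # Column i holds the weights of A, C, G, T at position i.
        weights = [profile[n][i] for n in NUCLEOTIDES]
        kmer.append(rng.choices(NUCLEOTIDES, weights=weights)[0])
    return "".join(kmer)

# Hypothetical profile; its first column is the example column from the text.
example_profile = {
    "A": [0.2, 0.5],
    "C": [0.1, 0.5],
    "G": [0.0, 0.0],
    "T": [0.7, 0.0],
}

print(generate_kmer(example_profile))
```

Because G has probability 0.0 in the first column, no generated 2-mer starts with G, and about 70% of them start with T.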

In the figure below, we reproduce the profile matrix for the NF-κB binding sites, where the lone colored entry in the i-th column corresponds to the i-th nucleotide in "ACGGGGATTACC". The probability Pr("ACGGGGATTACC", Profile) that Profile generates "ACGGGGATTACC" is computed by multiplying the highlighted entries in the profile matrix.



Figure: Generating a random string based on a profile matrix by selecting the i-th nucleotide in the string with the probability corresponding to that nucleotide in the i-th column of the profile matrix. The probability that a profile matrix will produce a given string is given by the product of individual nucleotide probabilities.


A k-mer tends to have a higher probability when it is more similar to the consensus string of a profile. For example, for the NF-κB profile matrix (reproduced at bottom) and its consensus string "TCGGGGATTTCC",

Pr("TCGGGGATTTCC", Profile) = .7 · .6 · 1 · 1 · .9 · .9 · .9 · .5 · .8 · .7 · .4 · .6 = 0.0205753 ,

which is larger than the value of Pr("ACGGGGATTACC", Profile) = 0.000839808 that we computed in the previous step.
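To check this arithmetic, the sketch below derives the consensus string from the NF-κB profile matrix and evaluates both products. The helper names `consensus` and `probability` and the variable `nfkb_profile` are illustrative assumptions rather than names from the course; `math.prod` requires Python 3.8 or later.

```python
import math

# NF-κB profile matrix from this section, one list of column values per nucleotide.
nfkb_profile = {
    "A": [0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.1, 0.1, 0.3, 0.0],
    "C": [0.1, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.1, 0.2, 0.4, 0.6],
    "G": [0.0, 0.0, 1.0, 1.0, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    "T": [0.7, 0.2, 0.0, 0.0, 0.1, 0.1, 0.0, 0.5, 0.8, 0.7, 0.3, 0.4],
}

def consensus(profile):
    """Pick the most probable nucleotide in each column."""
    k = len(profile["A"])
    # max() returns the first maximal item, so ties resolve in A, C, G, T order.
    return "".join(max("ACGT", key=lambda n: profile[n][i]) for i in range(k))

def probability(kmer, profile):
    """Multiply the profile entries selected by each nucleotide of kmer."""
    return math.prod(profile[n][i] for i, n in enumerate(kmer))

print(consensus(nfkb_profile))                    # TCGGGGATTTCC
print(probability("TCGGGGATTTCC", nfkb_profile))  # approximately 0.0205753
print(probability("ACGGGGATTACC", nfkb_profile))  # approximately 0.000839808
```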

Exercise Break (1 point): Compute Pr("TCGTGGATTTCC", Profile) for the NF-κB profile matrix shown below. (Note: you can do this exercise with a calculator.)

In [2]:
# Input:  String Motif and profile matrix Profile (dict mapping 'A', 'C', 'G', 'T' to lists of floats)
# Output: Pr(Motif, Profile), the probability that Profile generates Motif
def Pr(Motif, Profile):
    prob = 1.0
    for i, nucleotide in enumerate(Motif):
        # Multiply by the probability of this nucleotide at position i.
        prob *= Profile[nucleotide][i]
    return prob

In [3]:
#Profile:
Profile = {
    'A': [0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.1, 0.1, 0.3, 0.0],
    'C': [0.1, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.1, 0.2, 0.4, 0.6],
    'G': [0.0, 0.0, 1.0, 1.0, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    'T': [0.7, 0.2, 0.0, 0.0, 0.1, 0.1, 0.0, 0.5, 0.8, 0.7, 0.3, 0.4]
}

Motif = "TCGTGGATTTCC"
Pr(Motif, Profile)

0.0

Given a profile matrix Profile, we can compute the probability of every k-mer in a string Text and find a Profile-most probable k-mer in Text, i.e., a k-mer that was most likely to have been generated by Profile among all k-mers in Text. For the NF-κB profile matrix, "ACGGGGATTACC" is the Profile-most probable 12-mer in "ggtACGGGGATTACCt". Indeed, every other 12-mer in this string has probability 0. In general, if there are multiple Profile-most probable k-mers in Text, then we select the first such k-mer occurring in Text.
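The claim about "ggtACGGGGATTACCt" can be checked by listing every 12-mer with its probability. The sketch below is illustrative: `kmer_probabilities` and `nfkb_profile` are hypothetical names, and it assumes that the lowercase flanking letters are ordinary nucleotides, so it uppercases the text before the lookup.

```python
# NF-κB profile matrix from this section.
nfkb_profile = {
    "A": [0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.1, 0.1, 0.3, 0.0],
    "C": [0.1, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.1, 0.2, 0.4, 0.6],
    "G": [0.0, 0.0, 1.0, 1.0, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    "T": [0.7, 0.2, 0.0, 0.0, 0.1, 0.1, 0.0, 0.5, 0.8, 0.7, 0.3, 0.4],
}

def kmer_probabilities(text, k, profile):
    """Return (k-mer, probability) pairs for every k-mer in text, left to right."""
    text = text.upper()  # assumption: lowercase letters are ordinary nucleotides
    pairs = []
    for start in range(len(text) - k + 1):
        kmer = text[start:start + k]
        prob = 1.0
        for i, nucleotide in enumerate(kmer):
            prob *= profile[nucleotide][i]
        pairs.append((kmer, prob))
    return pairs

for kmer, prob in kmer_probabilities("ggtACGGGGATTACCt", 12, nfkb_profile):
    print(kmer, prob)
```

Of the five 12-mers in this string, only "ACGGGGATTACC" avoids a zero entry of the profile, so it is the only one with nonzero probability.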
```
Profile-most Probable k-mer Problem: Find a Profile-most probable k-mer in a string. 
 Input: A string Text, an integer k, and a 4 x k matrix Profile.
 Output: A Profile-most probable k-mer in Text.
```
Code Challenge (3 points): Solve the Profile-most Probable k-mer Problem by writing a function ProfileMostProbableKmer(Text, k, Profile). (Hint: use the function Pr defined above as a subroutine.)

In [4]:
# Write your ProfileMostProbableKmer() function here.
# The profile matrix assumes that the first row corresponds to A, the second corresponds to C,
# the third corresponds to G, and the fourth corresponds to T.
# You should represent the profile matrix as a dictionary whose keys are 'A', 'C', 'G', and 'T' and whose values are lists of floats.
def ProfileMostProbableKmer(text, k, Profile):
    most_prob = ""
    # Start below 0 so that the first k-mer is selected even when every k-mer has probability 0.
    high_prob = -1.0
    for index in range(len(text) - k + 1):
        Motif = text[index:index + k]
        Motif_prob = Pr(Motif, Profile)
        # The strict inequality keeps the first k-mer in case of ties.
        if Motif_prob > high_prob:
            high_prob = Motif_prob
            most_prob = Motif
    return most_prob

In [5]:
ProfileTestCase0= { 'A': [0.2, 0.2, 0.3, 0.2, 0.3],'C': [0.4, 0.3, 0.1, 0.5, 0.1], 'G': [0.3, 0.3, 0.5, 0.2, 0.4], 'T': [0.1, 0.2, 0.1, 0.1, 0.2]}
text0=    "ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT"
k0=    5
out0="CCGAG"
assert (ProfileMostProbableKmer(text0,k0,ProfileTestCase0)==out0)

ProfileTestCase1= { 'A': [ 0.7, 0.2, 0.1, 0.5, 0.4, 0.3, 0.2, 0.1],'C': [ 0.2, 0.2, 0.5, 0.4, 0.2, 0.3, 0.1, 0.6], 'G': [0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.4, 0.2], 'T': [ 0.0, 0.3, 0.2, 0.0, 0.2, 0.3, 0.3, 0.1]}
text1=     "AGCAGCTTTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATCTGAACTGGTTACCTGCCGTGAGTAAAT"
k1=     8
out1="AGCAGCTT"
assert (ProfileMostProbableKmer(text1,k1,ProfileTestCase1)==out1)
ProfileTestCase2= { 'A': [ 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5],'C': [ 0.3, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.4, 0.3, 0.2, 0.2, 0.1], 'G': [0.2, 0.1, 0.4, 0.3, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.2, 0.1], 'T': [ 0.3, 0.4, 0.1, 0.1, 0.1, 0.1, 0.0, 0.2, 0.4, 0.4, 0.2, 0.3]}
text2=     "TTACCATGGGACCGCTGACTGATTTCTGGCGTCAGCGTGATGCTGGTGTGGATGACATTCCGGTGCGCTTTGTAAGCAGAGTTTA"
k2=     12
out2="AAGCAGAGTTTA"
assert (ProfileMostProbableKmer(text2,k2,ProfileTestCase2)==out2)
ProfileTestCase3= { 'A': [ 1.0 ,1.0, 1.0],'C': [ 0.0, 0.0, 0.0], 'G': [ 0.0, 0.0, 0.0], 'T': [ 0.0, 0.0, 0.0]}
text3=     "AACCGGTT"
k3=     3
out3="AAC"
assert (ProfileMostProbableKmer(text3,k3,ProfileTestCase3)==out3)
ProfileTestCase4= { 'A': [ 0.2, 0.2, 0.3, 0.2, 0.3], 'C': [ 0.4, 0.3, 0.1, 0.5, 0.1], 'G': [ 0.3, 0.3, 0.5, 0.2, 0.4], 'T': [0.1, 0.2, 0.1 ,0.1, 0.2]}
text4=     "TTACCATGGGACCGCTGACTGATTTCTGGCGTCAGCGTGATGCTGGTGTGGATGACATTCCGGTGCGCTTTGTAAGCAGAGTTTA"
k4=     5
out4="CAGCG"
assert (ProfileMostProbableKmer(text4,k4,ProfileTestCase4)==out4)

In [6]:
#Quiz
Profile = {
    "A": [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
    "C": [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
    "G": [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
    "T": [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]
}

Pr("GAGCTA",Profile)

0.0054

In [8]:
Pr("TCGGTA",Profile)

0.00405