Finally, we can form a consensus string, denoted Consensus(Motifs), from the most popular nucleotides in each column of the motif matrix (ties are broken arbitrarily). If we select Motifs correctly from the collection of upstream regions, then Consensus(Motifs) provides a candidate regulatory motif for these regions. For example, as shown below, the consensus string for the NF-κB binding sites is "TCGGGGATTTCC".

In [22]:
# Input:  A list of k-mers Motifs
# Output: The profile matrix of Motifs, as a dictionary of lists.
def Profile(Motifs):
    t = len(Motifs)
    profile = {}
    counts = Count(Motifs)
    for nucleotide in ["A","C","G","T"]:
        # Divide each count by the total number of motifs to get a frequency
        profile[nucleotide] = [count/float(t) for count in counts[nucleotide]]
    return profile

# Input:  A list of k-mers Motifs
# Output: Count(Motifs), a dictionary mapping each nucleotide to its per-column counts
def Count(Motifs):
    count = {}  # the count dictionary
    # Start each nucleotide with an empty list
    for nucleotide in ["A","C","G","T"]:
        count[nucleotide] = []
    for ind in range(len(Motifs[0])):
        for nucleotide in ["A","C","G","T"]:
            count[nucleotide].append(0)  # every count in column ind starts at 0
        for motif in Motifs:
            count[motif[ind]][ind] += 1  # count this motif's nucleotide in column ind
    return count

# Input:  A list of k-mers Motifs
# Output: The consensus string of Motifs, together with Count(Motifs).
def Consensus(Motifs):
    consensus = ""
    counts = Count(Motifs)

    for j in range(len(Motifs[0])):
        m = 0
        frequentSymbol = ""
        # Ties go to the first symbol in "ACGT" order, because the comparison is strict
        for symbol in "ACGT":
            if counts[symbol][j] > m:
                m = counts[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol  # append the most frequent symbol in column j
    return consensus, counts

# Input:  A list of k-mers Motifs
# Output: Score(Motifs), the number of symbols that differ from the consensus string
def Score(Motifs):
    # Consensus also returns the count matrix, so Count(Motifs) is not computed twice
    consensus, counts = Consensus(Motifs)
    score = 0
    for char in range(len(consensus)):
        nucleotide = consensus[char]  # consensus nucleotide in this column
        # The non-consensus nucleotides
        keys = [key for key in ["A","C","G","T"] if key != nucleotide]
        for key in keys:
            score += counts[key][char]  # mismatches contributed by this nucleotide
    return score

In [21]:
motifs1=    ["AACGTA",
    "CCCGTT",
    "CACCTT",
    "GGATTA",
    "TTCCGG"]
# Output: Consensus returns the consensus string and the count matrix.
# Each list in the count matrix holds one nucleotide's number of occurrences at each position across all motifs.
out1= "CACCTA"
print (Consensus(motifs1))
assert (Consensus(motifs1)[0]==out1)
"""Test 1 # Full dataset
Input:"""
motifs2=    ["GTACAACTGT",
    "CAACTATGAA",
    "TCCTACAGGA",
    "AAGCAAGGGT",
    "GCGTACGACC",
    "TCGTCAGCGT",
    "AACAAGGTCA",
    "CTCAGGCGTC",
    "GGATCCAGGT",
    "GGCAAGTACC"]
out2= "GACTAAGGGT"
assert (Consensus(motifs2)[0]==out2)

('CACCTA', {'A': [1, 2, 1, 0, 0, 2], 'C': [2, 1, 4, 2, 0, 0], 'G': [1, 1, 0, 2, 1, 1], 'T': [1, 1, 0, 1, 4, 2]})


Finally, we can compute Score(Motifs) by first constructing Consensus(Motifs) and then summing the number of symbols in the j-th column of Motifs that do not match the symbol in position j of the consensus string. We leave this task to you as an exercise.

In [3]:
out1=14
assert (Score(motifs1)==out1)
out2=57
assert (Score(motifs2)==out2)

Biologists also commonly use a motif logo, a diagram for visualizing motif conservation that consists of a stack of letters at each position (see the figure below). The relative sizes of letters indicate their frequency in the column, i.e., highly conserved columns in the motif matrix correspond to tall symbols in the motif logo. (For more on motif logos, see DETOUR: Motif Scoring Functions).
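
As a rough text-only stand-in for a motif logo, the sketch below lists the symbols in each column from most to least frequent, together with their frequencies. The function name `column_stacks` is a hypothetical helper, not part of the notebook, and it follows the simplified description above, in which letter size tracks frequency only.

```python
def column_stacks(motifs):
    """For each column, list (symbol, frequency) pairs, most frequent first."""
    t = len(motifs)
    stacks = []
    for column in zip(*motifs):
        # Keep only the symbols that actually occur in this column.
        freqs = [(s, column.count(s) / t) for s in "ACGT" if s in column]
        # Stable sort: symbols with equal frequency stay in "ACGT" order.
        freqs.sort(key=lambda pair: pair[1], reverse=True)
        stacks.append(freqs)
    return stacks


for j, stack in enumerate(column_stacks(["AT", "AT", "AC", "GT"])):
    print(j, stack)
```

A column with a single dominant symbol, which would appear as one tall letter in a logo, shows up here as a stack whose first frequency is close to 1.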




Now that we have a good grasp of scoring a collection of k-mers, we are ready to formulate a computational problem for motif finding. 
```
Motif Finding Problem:  Given a collection of strings, find a set of k-mers, one from each string, that minimizes the score of the resulting motif. 
 Input: A collection of strings Dna and an integer k. 
 Output: A collection Motifs of k-mers, one from each string in Dna, minimizing Score(Motifs) among
    all possible choices of k-mers.
```
Brute force search (also known as exhaustive search) is a general problem-solving technique that explores all possible candidate solutions and checks whether each candidate solves the problem. Such algorithms require little effort to design and are guaranteed to produce a correct solution, but they may take an enormous amount of time, and the number of candidates may be too large to check.

A brute force algorithm for the Motif Finding Problem, BruteForceMotifSearch, considers every possible choice of k-mers Motifs from Dna (one k-mer from each string of n nucleotides) and returns the collection Motifs having minimum score.
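
The description above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `brute_force_motif_search` and `score` are assumptions, and `score` is a compact, self-contained equivalent of Score(Motifs) so that the block runs on its own. It is only practical for very small inputs.

```python
from itertools import product


def score(motifs):
    """Count the symbols that differ from the most common symbol in each column."""
    total = 0
    for column in zip(*motifs):
        most_common = max(column.count(s) for s in "ACGT")
        total += len(column) - most_common
    return total


def brute_force_motif_search(dna, k):
    """Try every choice of one k-mer per string and keep a lowest-scoring choice."""
    # All k-mers of each string, in left-to-right order.
    kmers_per_string = [
        [text[i:i + k] for i in range(len(text) - k + 1)] for text in dna
    ]
    best_motifs, best_score = None, float("inf")
    # product(...) enumerates every combination of one k-mer from each string.
    for motifs in product(*kmers_per_string):
        current = score(motifs)
        if current < best_score:
            best_motifs, best_score = list(motifs), current
    return best_motifs, best_score


print(brute_force_motif_search(["AAACGT", "TTACGA", "GACGCC"], 3))
# → (['ACG', 'ACG', 'ACG'], 0)
```

When several collections share the minimum score, this sketch keeps the first one it encounters.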

STOP and Think: Do you see any potential issues with the proposed BruteForceMotifSearch algorithm?

Throughout this chapter, we will benchmark our motif finding algorithms on the Subtle Motif Problem, which refers to implanting a 15-mer with four random mutations in ten randomly generated 600-nucleotide-long strings (the typical length of many upstream regulatory regions). The instance of the Subtle Motif Problem that we will use has the implanted 15-mer "AAAAAAAAGGGGGGG".

To benchmark BruteForceMotifSearch, note that there are n-k+1 choices of k-mers in each of t strings, so there are `(n-k+1)^t` different ways to form Motifs. For each choice of Motifs, the algorithm calculates Score(Motifs), which requires k⋅t steps. Thus, assuming that k is much smaller than n (as is the case for biological datasets), the overall running time of the brute force motif finding algorithm is on the order of `(n-k+1)^t⋅k⋅t` steps. For the Subtle Motif Problem, this is on the order of `10^29` steps. You may recall that the naive algorithm we developed in Chapter 1 to generate a symbol array took several days to carry out just `10^13` steps. In this case, the earth will have been destroyed by the sun long before BruteForceMotifSearch terminates. It goes without saying that we need to devise a faster algorithm!
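
The estimate can be checked with a few lines of arithmetic. The sketch below assumes the step count is the number of candidate collections, n-k+1 raised to the power t, multiplied by k⋅t steps per score, and plugs in the Subtle Motif Problem parameters (n = 600, k = 15, t = 10). The function name `brute_force_steps` is hypothetical.

```python
def brute_force_steps(n, k, t):
    """Estimated step count: (n - k + 1)^t candidate collections, k*t steps each."""
    candidates = (n - k + 1) ** t
    return candidates * k * t


steps = brute_force_steps(n=600, k=15, t=10)
print(f"{steps:.1e}")
```

Python integers have arbitrary precision, so the product is computed exactly before it is formatted for display.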

We have also thus far assumed that the value of k is known in advance, which is not the case in practice. As a result, we are forced to run our motif finding algorithms for different values of k and then try to deduce the correct motif length. Since some regulatory motifs are rather long, BruteForceMotifSearch will be too slow to find them.

In [23]:
#Quiz

Motifs = ["AGGTGA","AAGCTA","AAGCCA","AGGTCA","AAGTGA","TCGCGA"]
#Motifs = ["AAGTGA","TCGCGA"]


prof = {"A": [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
        "C": [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
        "G": [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
        "T": [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}

Consensus(Motifs)
Profile(Motifs)

{'A': [0.8333333333333334, 0.5, 0.0, 0.0, 0.0, 1.0],
 'C': [0.0, 0.16666666666666666, 0.0, 0.5, 0.3333333333333333, 0.0],
 'G': [0.0, 0.3333333333333333, 1.0, 0.0, 0.5, 0.0],
 'T': [0.16666666666666666, 0.0, 0.0, 0.5, 0.16666666666666666, 0.0]}
