# Finding the Regulatory Regions/Genes

LHY, CCA1, and TOC1 are able to control the transcription of other genes because the regulatory proteins that they encode are transcription factors, or master regulatory proteins that turn other genes on and off. A transcription factor regulates a gene by binding to a specific short DNA interval called a regulatory motif, or transcription factor binding site, in the gene's upstream region, a 600-1000 nucleotide-long region preceding the start of the gene. For example, CCA1 binds to AAAAAATCT in the upstream region of many genes regulated by CCA1.

The life of a bioinformatician would be easy if regulatory motifs were completely conserved, but the reality is more complex, as regulatory motifs may vary at some positions, e.g., CCA1 may instead bind to AAGAACTCT. But how can we locate these regulatory motifs without knowing what they look like in advance? We need to develop algorithms for motif finding, the problem of discovering a “hidden message” shared by a collection of strings.
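To make the idea of a variable motif concrete, the short sketch below counts the positions at which the two CCA1 binding sites quoted above differ. The helper name `count_mismatches` is an illustrative assumption; it is meant to play the same role as the `hamming_distance` function that the later cells rely on.

```python
def count_mismatches(p, q):
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(p, q))


canonical_site = "AAAAAATCT"
variant_site = "AAGAACTCT"
print(count_mismatches(canonical_site, variant_site))  # 2
```

The two sites differ at only two of nine positions, which is why an exact string search for the canonical site would miss the variant.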

In [7]:
import import_ipynb
from Week_1 import *
from Week_2 import *

## A brute force algorithm for motif finding
Given a collection of strings Dna and an integer d, a k-mer is a (k,d)-motif if it appears in every string from Dna with at most d mismatches. For example, a 15-mer implanted into every string with at most four mutations per copy is a (15,4)-motif.

Implanted Motif Problem: Find all (k, d)-motifs in a collection of strings.

Input: A collection of strings Dna, and integers k and d.

Output: All (k, d)-motifs in Dna.

Brute force (also known as exhaustive search) is a general problem-solving technique that explores all possible solution candidates and checks whether each candidate solves the problem. Such algorithms require little effort to design and are guaranteed to produce a correct solution, but they may take an enormous amount of time, and the number of candidates may be too large to check.
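As a rough illustration of exhaustive search, the sketch below checks every one of the 4^k possible k-mers against every string. This is a simpler variant than the neighbor-based enumeration used in this notebook, and the helper names (`mismatches`, `appears_with_mismatches`, `implanted_motifs`) are assumptions made for the example.

```python
from itertools import product


def mismatches(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def appears_with_mismatches(pattern, text, d):
    """True if some k-mer of text is within d mismatches of pattern."""
    k = len(pattern)
    return any(mismatches(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))


def implanted_motifs(dna, k, d):
    """All k-mers that occur in every string of dna with at most d mismatches."""
    candidates = ("".join(letters) for letters in product("ACGT", repeat=k))
    return [c for c in candidates
            if all(appears_with_mismatches(c, text, d) for text in dna)]


dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(implanted_motifs(dna, 3, 1))  # ['ATA', 'ATT', 'GTT', 'TTT']
```

Because the number of candidates grows as 4^k, this approach is only practical for small k, which is the cost of brute force described above.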

In [8]:
def motif_enumeration(dna, k, d):
    """Map each (k, d)-motif to the k-mers that match it, one per string in dna."""
    patterns_in_all = {}
    first_str = dna[0]
    for i in range(len(first_str) - k + 1):
        kmer = first_str[i:i + k]
        # every (k, d)-motif is a d-neighbor of some k-mer in the first string
        for kmer_nbr in neighbors(kmer, d):
            matches = []
            for dna_str in dna:
                # first k-mer of dna_str with at most d mismatches to kmer_nbr
                match = next((dna_str[j:j + k]
                              for j in range(len(dna_str) - k + 1)
                              if hamming_distance(kmer_nbr, dna_str[j:j + k]) <= d),
                             None)
                if match is None:
                    # kmer_nbr is missing from this string, so it is not a motif
                    break
                matches.append(match)
            else:
                # no break: kmer_nbr occurs in every string, so record it
                patterns_in_all[kmer_nbr] = matches
    return patterns_in_all


a = "ATTTGGC TGCCTTA CGGTATC GAAAATT".split(" ")

Given a k-mer Pattern and a set of strings Dna = {Dna1, … , Dnat}, we define d(Pattern, Dna) as the sum of the distances between Pattern and each string in Dna, where d(Pattern, Dnai) is the minimum Hamming distance between Pattern and any k-mer in Dnai:

$$d(\textit{Pattern}, \textit{Dna}) = \sum_{i=1}^{t} d(\textit{Pattern}, \textit{Dna}_i).$$
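A small worked example may help here. The sketch below assumes, as the `d` function in this notebook does, that the distance between Pattern and a single string is the minimum Hamming distance over all k-mers of that string. The five strings and the helper names are illustrative, not part of the notebook's own data.

```python
def mismatches(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def distance_to_string(pattern, text):
    """Minimum Hamming distance between pattern and any k-mer of text."""
    k = len(pattern)
    return min(mismatches(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))


def distance_to_collection(pattern, dna):
    """Sum of the per-string minimum distances, i.e. d(Pattern, Dna)."""
    return sum(distance_to_string(pattern, text) for text in dna)


dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]
per_string = [distance_to_string("AAA", text) for text in dna]
print(per_string)                          # [1, 1, 2, 0, 1]
print(distance_to_collection("AAA", dna))  # 5
```

Each term of the sum is the best match Pattern can find in one string, so a small total means Pattern is close to some k-mer in every string.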

In [9]:
def consensus(seqs):
    cons_seq = ""
    for i in range(len(seqs[0])):
        count = {"A":0, "C":0, "G":0,"T":0}
        for seq in seqs:
            count[seq[i]] +=1
        count_max = max(count.values())
        for key,value in count.items():
            if value == count_max:
                cons_seq += key
                break
    return cons_seq

In [10]:
# returns the sum, over all strings in dna, of the minimum Hamming distance
# between pattern and any k-mer of that string
def d(pattern, dna):
    k = len(pattern)
    distance = 0
    for dna_str in dna:
        # k is an upper bound on the Hamming distance between two k-mers
        str_dist = k
        for i in range(len(dna_str) - k + 1):
            str_dist = min(str_dist, hamming_distance(pattern, dna_str[i:i+k]))
        distance += str_dist
    return distance

## Median String Problem
Our goal is to find a k-mer Pattern that minimizes d(Pattern, Dna) over all k-mers Pattern, the same task that the Equivalent Motif Finding Problem is trying to achieve. We call such a k-mer a median string for Dna.

Median String Problem: Find a median string.

Input: A collection of strings Dna and an integer k.
Output: A k-mer Pattern that minimizes d(Pattern, Dna) among all possible choices of k-mers.
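The problem statement translates almost directly into a brute-force search over all 4^k candidate k-mers. The sketch below is self-contained and uses hypothetical helper names (`mismatches`, `total_distance`, `brute_force_median`) and made-up example strings; when several k-mers tie for the minimum, it returns the lexicographically first one.

```python
from itertools import product


def mismatches(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def total_distance(pattern, dna):
    """d(Pattern, Dna): sum of per-string minimum Hamming distances."""
    k = len(pattern)
    return sum(min(mismatches(pattern, text[i:i + k])
                   for i in range(len(text) - k + 1))
               for text in dna)


def brute_force_median(dna, k):
    """A k-mer minimizing total_distance over all 4**k candidates."""
    candidates = ("".join(letters) for letters in product("ACGT", repeat=k))
    return min(candidates, key=lambda pattern: total_distance(pattern, dna))


dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]
best = brute_force_median(dna, 3)
print(best, total_distance(best, dna))
```

Unlike motif enumeration, this search needs no mismatch threshold d as input, at the cost of scoring every possible k-mer.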

In [11]:
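def median_string(dna, k):
    # normalize to upper case without modifying the caller's list
    dna = [dna_str.upper() for dna_str in dna]
    # start above any achievable distance so the first k-mer is always accepted
    distance = float("inf")
    median = ""
    for kmer in get_all_kmers(k):
        kmer_dist = d(kmer, dna)
        if kmer_dist < distance:
            distance = kmer_dist
            median = kmer
    return median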



mtf = """CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC
GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC
GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG
""".split()  # split() drops the empty string left by the trailing newline

# median_string(mtf, 7)
# mtf
# d("AATCCTA", mtf)

In [12]:
def profile_gen(motifs):
    profile = []
    for pos in range(len(motifs[0])):
        # start every count at 1 (Laplace's rule of succession) so that no probability is zero
        nc_count = [1 for i in range(4)]

        for motif in motifs:
            if motif[pos] == "A":
                nc_count[0] +=1
            elif motif[pos] == "C":
                nc_count[1] +=1
            elif motif[pos] == "G":
                nc_count[2] +=1
            else:
                nc_count[3] +=1
        nc_prob = [round(i/sum(nc_count), 3) for i in nc_count]
        profile.append(nc_prob)
    # transpose so that each row is one nucleotide (A, C, G, T) and each column one position
    profile = [list(row) for row in zip(*profile)]
    return profile

In [13]:
def profile_most_probable(dna, k, pr):
    scores = []
    nc_list = {"A":0, "C":1, "G":2 ,"T":3}
    for i in range(len(dna) - k + 1):
        score = 1
        for pos in range(k):
            nc = dna[i:i+k][pos]
            score *= pr[nc_list[nc]][pos]
        scores.append(score)
    maxscore = max(scores)
    return dna[scores.index(maxscore):scores.index(maxscore)+k]


prfl = """0.4 0.3 0.0 0.1 0.0 0.9
0.2 0.3 0.0 0.4 0.0 0.1
0.1 0.3 1.0 0.1 0.5 0.0
0.3 0.1 0.0 0.4 0.5 0.0""".split("\n")


## Greedy Motif Search
Our proposed greedy motif search algorithm, GreedyMotifSearch, starts by forming a motif matrix from arbitrarily selected k-mers in each string from Dna (which in our specific implementation is the first k-mer in each string). It then attempts to improve this initial motif matrix by trying each of the k-mers in Dna1 as the first motif. For a given choice of k-mer Motif1 in Dna1, it builds a profile matrix Profile for this lone k-mer, and sets Motif2 equal to the Profile-most probable k-mer in Dna2. It then iterates by updating Profile as the profile matrix formed from Motif1 and Motif2, and sets Motif3 equal to the Profile-most probable k-mer in Dna3. In general, after finding i − 1 k-mers Motifs in the first i − 1 strings of Dna, GreedyMotifSearch constructs Profile(Motifs) and selects the Profile-most probable k-mer from Dnai based on this profile matrix. After obtaining a k-mer from each string to obtain a collection Motifs, GreedyMotifSearch tests to see whether Motifs outscores the current best scoring collection of motifs and then moves Motif1 one symbol over in Dna1, beginning the entire process of generating Motifs again.
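The first step of this procedure can be sketched in isolation: build a profile from a lone k-mer, then pick the profile-most probable k-mer in the next string. The sketch assumes pseudocounts of 1 (as `profile_gen` in this notebook uses), stores the profile as a dictionary rather than a list of rows, and uses two made-up strings; the function names are illustrative.

```python
def profile_with_pseudocounts(motifs):
    """Per-position nucleotide frequencies, with 1 added to every count."""
    k = len(motifs[0])
    profile = {nt: [] for nt in "ACGT"}
    for pos in range(k):
        column = [motif[pos] for motif in motifs]
        for nt in "ACGT":
            # each column has len(motifs) real counts plus 4 pseudocounts
            profile[nt].append((column.count(nt) + 1) / (len(motifs) + 4))
    return profile


def most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under profile."""
    def probability(kmer):
        p = 1.0
        for pos, nt in enumerate(kmer):
            p *= profile[nt][pos]
        return p

    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return max(kmers, key=probability)


motif1 = "ACCT"                      # a k-mer chosen from the first string
profile = profile_with_pseudocounts([motif1])
print(profile["A"])                  # [0.4, 0.2, 0.2, 0.2]

motif2 = most_probable_kmer("CCCTAACGAG", 4, profile)
print(motif2)                        # CCCT
```

Without pseudocounts, a profile built from a single k-mer would assign probability zero to every k-mer that differs from it anywhere, so the choice of Motif2 would be arbitrary unless an exact copy existed.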

In [14]:
def score(motifs):
    total_score = 0
    for pos in range(len(motifs[0])):
        nc_count = [0 for _ in range(4)]
        for motif in motifs:
            if motif[pos] == "A":
                nc_count[0] +=1
            elif motif[pos] == "C":
                nc_count[1] +=1
            elif motif[pos] == "G":
                nc_count[2] +=1
            else:
                nc_count[3] +=1
        total_score += sum(nc_count) - max(nc_count)
    return total_score

In [22]:
def greedy_motif_search(dna, k, t):
    best_motifs = [dna_str[:k] for dna_str in dna]
    for i in range(len(dna[0])-k+1):
        # move through first dna string
        motifs = [dna[0][i:i+k]]
        # make profile of motifs (to be updated)
        motif_profile = profile_gen(motifs)
        for j in range(1,t):
            # improve profile by adding more motifs by finding most_probable motifs in next dna strings
            motifs.append(profile_most_probable(dna[j], k , motif_profile))
            # update the profile
            motif_profile = profile_gen(motifs)
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
    return best_motifs

m = """ttACCTtaac
gATGTctgtc
acgGCGTtag
ccctaACGAg
cgtcagAGGT""".upper().split("\n")
# m = """GGCGTTCAGGCA
# AAGAATCAGTCA
# CAAGGAGTTCGC
# CACGTCAATCAC
# CAATAATATTCG""".split("\n")
mot = """GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG""".split("\n")

mot = """TATTCACCCTTTAGGCTTACTGCTCGGAGTCAGAAGGCGAGACGCAACATCTTTAGCTATCGTGGTAAGGTATTGTTGCTACTAGCACCGGTGCTACCAGCCCGGTGTCAGCACCTGTGCTATTTACCCCCTATAGCTGCGACGTCTAATTAGCAG
CGTCAGGCGGGCCTACGGTATCCATATAAGCAGTAATTCGATTATCTTTGATTTGCCGCTTAGGATTCGAGTATAGATGCGAGGACTCCTCTTGCTCCGTCAGATATTTAAACTGGAAAACGTATCCTCGGATATTTATCAGATGACATTGTTCGT
TCCGGTAGCTAGTGTGAGGCCTGCTAAGGCGTCGAGTACCATCCAAACGTAGATGTGCTACCTCGCTCCATACAACTCGGGCATACTACCGATACTCTATCAAGAAAGGGCTGTTGCCCGGATTTAAATGACATACCTGCGACTAGGACCTCGCCT
AGACTGACCGATCATGGAGAGAAGTATCGACATTTTTCGCTGACCCTCAGGGGAGGGAGCAAATCCATCGCGAGAGCCCGTCGCGATGTCGTGAGCGCTAGCTTCACGGTGTGGGAACTGGGCGCAGTGACTATATTTGCGACCTCCCTTATAGGG
CCGGACGTAACTGTTGGGCTTGGAGCGGCTTGTCGGGAGTTGCGCTGAGATGCCATTGCCATCCAGGATCTAATGTATTAATGCACTTACATGCCCGCGTTTTAGTCTTAACGATGGACGGCTACAATTCACATACTTGCGATCAGGATTAGGATT
TATTTTCCCCACAATGTGAATCGGATAGATGCGACCGGACTTACAGCGTCCCATTGCAATCTTGATATGGGGTTTAGAGATCTATGTTCATTATGCTTTCTCATGGGAGTAAAGCATGGATACTCGTAATCTCAAGTAAATGCTAAGTCGCATTAC
ACTTGCGTCAGGTCGTTGATATCATCTCATCCTGCCAATACATAACTACATCAGCCACCCACAGGTAATAAGGCTAGCTCGTCAAAACAACAGCTGTAAAATCGGATTGAGTGTGAAGTAAGAGAGGGTGACGAGCTATCGTCGATACGTGCGAAC
GAAGAGGGCTTCGCCAGCTGAACTATAGTTGCGACTACTGAAAGCGCAAAGATCCGATATAAGCCAAAGAAAGCCTTGATCTCCTTTGCCCTGCCTCACCGGACTCCACACTCATCTGAACGGGAGGTCAGTCTAAGGAAGCTACCCTGTTCATTC
GTCTATCTGCCGGGGTTGAGTTACATGCTCGCGGTATCCAGGACCGAAATACATGCGAAACGAGCCGCTCGGGTCCTTGGGCCAGTCGGCTCCTACGTAATGAACCTATCCGCACCTACTCGTAGGCTTACATTGGTTTTTTATCTTATCGTAGTT
AGGGGATACCTCTGGGGGAATGGGCTTATTGGGTTTGGCATGACTCATCATTGTTCCCCGTAATCCCTGGGAGGGCTTCCCCAATGCCGCTAGTTCTCCGCTGAGTTCTTCTGCCGGTACGCAACGCGTAGTATAGCTGCGACTCCATGATGATCC
TAGAAATACCCGGAGCAGGGCCAGGGCGGTTACCTCACAGATAGAGCTGCTGCAGCGCTGATCCAAGACAAGATAGATGCGAAGCGTATCAACCTCCTGCTTGGGGGACGATTTCCTCCGCTGCGACTCAGTTAGCTTGGTAGCAACATCCTACAA
CCCTAGGATTCGATAACTGCGATCGCCGAGCTTTACATTAAGTACTCGCCTCTACATTTAAAGCGAGGTGATAAGGACAGCTAGCATCCGAAGAATACAGTTATTCTCAACAACCACCACTTCATTCCTACTCGTACGTCGTGCCGTACTTTTTAT
TATGCGTCGCCTCGATACATGCCCGGAAGGACGTTCACACACTAAAAATTGTGCAAGGTAACGATCATGCGCGGACTTCGTGTTAACCGACAAGGCCAGGAATAAAGTCTGAATATCGTAATAGATGCGATGCAGTGCGCGAAACTACCTCTCGTG
CGTGAACCTGCCAAGTTAAGCTACGCGCCGCATGACTCGAACACAATAATAGATATTGAGGAGCTTAAAGGGCAGACCGATAGGTGACGGTGGTTTCAGGTGGCTCACCTAGTTCGTACAACTGCTGGCCTGATAAATGCGAAGAATGGTCACGGT
TGTGGAAATGGTCTGGTACGCTTGACGCTATGCTCGTCGTAGTTGCATCCACATATAGAGATACGTGCGATGCCTTCGACCGTTAAGGGGCATTCTCGCCACGCAACTCCATGTTTCTACAAAGCCGTTCATGTTAACAAGTTGGTTTTCTGTACG
AGAAGTACTCCATAACCATTTCGGGTATTAGAAAGCATACCTGCGAGGACATGCTGGTGGTCATAAATAGGGGACTACGTACGACGAGATAGTTGTCTTTCCGCAAACACGGGTCACTGAGGACCCCCGTCGGATTCTGACCGACATTCTATTTTT
CTATTGCCCTCTTTGCCACCAGCCGATACTAGGATTCACGTAACGCTAACGTGGCGGGCGATACTTGCGACCGTGCACGCCTACGAGTTAGTCTAGTTGGGGCGTTAGCTGACATCAACAGTGTCCATGGGGTTTCCCCGACTCTGGCGCGGCTAA
TTTATTCTAGGTGAACCGAGGCGCATATATGCGACTCTGGGACACACAAAACTGCAAGAGCTGGCGATCTATGGGACCCTGTACGCTAACCCGGTTACACTGTCGACCATACAGCAATGGTAAGTCTAAAGGGAGGGGCTGAGCGCGAAATGTCCA
GTCATGCTGGGCGCATTGAGTGAAATAGCTGCGACTTGGCGACGTCACGAGACGCGGTATGTGTTGAAGGCAACTGGCTGTCAGACTTTTATGAATTGGAACGAGGAACCCTGTAGCCATATACAGGTCACGATATGTGCGGTCCTAACTTCCTCC
CACGACTCACCCTGACTAGGTCATAGATGCCACCCGCGTTCTACGAAAGCCCTCCGGGTGGCAAATGCATTGACCTTCCCTGCTCTCTCCAAAAGTAGTTTCGAGAAGCTAGATTGATTGCTGTATAGGGTTATAAGTGCGAGTAGGCATCATGAG
GGTCCACGTTGTAACCAGCTGGTAGTTTTATAATTTGGAAGCCTAATCCTGCGCGCACAGCAACAGTTAGATCAAGCCTAGTGTAACACCTGGTTGCGCATGGCTGTGTGTAGTAACTTGAAAGCATGGAACGACAAAAGTAGTATACATGCGAGA
CCCACACCATCACAGCGGTGCGGTGACAATGTTTCAACTACCTTCCTCCACGGTAGGAAGGGGCAGATACAGAGTTACCGGCGGACTATCCGCGGAACCCACAAACGACATACGCCGCGTTCAGTGGAGCGGTCTGCTTGGGTCATACCTGCGATA
CCGCATCGGTAGGTTCAACCCCCGAAGTATGACAAAACATTTAGGCGCATAGATGCGAGGCACAATAAGAGCGCATTGCTCAAGCAGACTACCTGAACACTACAGTAATATGAAAATCGTGCAAAGTCTGTCAGGTTAGGCCATTGATTAACGGTC
ATACTATGGCACCTCGGAGCTGCGTCCGAATTGCAAGATCCTGGTTAACAGAAGTGCAACTAATTCTGGCACGTATCATACCGTGTTTCCAGGCCCATAGAAGCTGATATATTTGCGAAGCAACCTTCGCGAACAGGGATGCCTCTTCGTCGTACT
TCCAAATGGTATCTTCAGCGGATTACGTGACGTCCTGATCTTGTGTGAGCGTTAGCCGTGCATCCAGCATACTCGGAACGTATGTCGTCGACTCTGTCTCAGTACGAACGCATATTCGTCCATGAGATCACGACTCGCTCCCGAATAGATGCGAGA""".split("\n")
mot ="""TGAGATAGCGGATGGTCTGTCGCCTTCTTGTCTGATACCGCCTTCTTCTACCGACAGAAACTAAGGCCCAGCGACATGCTCAGTCAAATTGGGTCAAATTTCGTGGTATAGTTGGAAGACTAGATTTGGGTGCGTCAATCGATGAGCCAACCTCGC
GGTCGGTATTTGCCTTTTGATGTAGACTATATGTGGTCTCAACTATAGTTATCACGGTGTGGCGGCCTGTGGCCAAGGGCCACATGAGGTCAGGTTTTACAGTCCACGACCGCCTTCGGTACTGTGAACAGTAGGAATCGGTCATGGTAGGTTTAT
GGGCTTGTACCGCGCGGTTGTTCTCCGATTCGTGAGCCCGTCCGGAGCTATGTATGTTCGTGGGTATCTCAACCGCGCGCCGGCTGCACCCCTTTCCGAAGGGCCAGGACGGTGACGATTTGAGTATGAGGGATGTAACCCTATCAATGACTGTCC
CGAAGGACCAGATCACCTACGGGCAGGAGAGATCGCAACTAAGGGAAGGTCGCGTGCTGCGCGCCAGAATACATACGCGAGCTCCCGATGGCACATGGATTGGTAAGCCTAGATCATCGGTCGTGCAGAGCCGAATACGGCCTGTCCGCACTGTTA
ATATGCCCTAAATTGAATATACTTGGCGCCTTACTGCCAAGGTCCACTGGCGGGCGGTTTCTCGGATAGGGGCACAATGTCGAGCAGGCCATAAGCTGGCGTTGCAACGTACAAGAAGATGTGGAACTCACACAAACTTACAATAGCGCGTGAGCA
CGCGAGGCGATGGGGGGGATCATCGCTGAATGCGCCGGCATACGAAGGTATCTATGCAAATATAACCATAAATCACTCGATGTGGCTAGGCCGCTGACTGCGGTACACTGTTCCGCGACCCAAAGGCCCACCCGGTCAGTCGTTCTCGGAGTATGT
TTCCGTGGGTGTCTATCGTTGGCTCAAAGGCCCACTCGTTACAGCCCATTAGAATAGTCCCATGGTGGCAGGCGCTTCTATGGCGCTATCGCCCAGCAGTCCTTTGCGAACTGGAGACAATGAGGACTCGCGTCAGGGTACACGTATGCGACCGCT
CAGTAATGGCGACTAAGGGCCATCATGGTCAGGCGCATGGAATGTTGGTCATTTTACTTTTGATTGTATGACCCGTAACCGTGACAAAGGCTCGGACACCATACAATGTGGAGGAGAAATCACTTCTACAGCGGATGGGCGGGTTTTAAAAGAGAC
GTCCCGCATGCACAAAGGACCAAGCCAGTACGTTCGAGGGTCATCTATATGAACTTGACAGTCTTTACCGCAGGTTACCGTTCGAACTGGGGTTTGAAGTTTTTCTGACGTAATAAACACTTCTCGGTAGCGTAGGTCTCCAGACACCGCCGCTGT
TACGTTGTCAAGACATAAGAACCTAGTGATTATTAAACCGCTGGCCAAACCGAAGCGCCCTCCCGATCCTCTGTCTCCCATGGTGCTTTCATTAAGCGTCAAGCCCAACGAAGGGCCACGTCGTCTTTCGATGCGCGTGGAATGAGTTGGAGCCTG
GATCATTGCAGCGACTATCAGCTTAATCTTCTCGACAGGACACCATCGCAGATTCTCCGACCCTCGGCCGCTGAAGCAACCAACGCTGTAGCTGCAGTAGAGTGGAGAGGTGGCGAATGGGTGTCTCAGCACCAAAGGTCCAAAGAACTAGCTCAA
ATAAACGAGATGCTCGCAAGTGGAACTCCATCCTATCGAAGGTCCACCAGAACTGCAGAGGCCGACATGCTATTCCTCGCACATTGCAAAGTATGATGGAAAAGAATGAGTCGAGAGTAAACAAGTAGCGATATATTCTGCTAACGTCCCCTGGGG
GTTGCTGATTATAGGCCTGGGAAGCCAAGGGCCACGTAAATGATCACTGTTGTACCGGATGTGTAATATTCGTTACGGACACTGTTAGCTGATACACCGCAAGTAATAGTGTCTGGTACCATCGTGAGTGGGATATACTTGATGTTCCAAAGTATT
AGGAGCACCGGTCAAAGGACCATTTCGTTCGAATGGGAAACACTTAGGGCCAGGGAGATTCTCGGACTAGTGCAAGCACTTCTTTGCCTTAGCACAGACAGATTTATACCGGATGACGTGTTTGACTTCAAAAAACTTCCTGAAGTCGACAGGGTA
GTATCAACATCGCTAAGGGCCACTTATGACTTCAAGCATCGCAATCTCGTTCCGAGCTCGCTGAGCTCTCCTATGTTGCCTCCACAACTTGGTAAAGTTCCGGTTGTCCTGGCTATATTTGGTACCTTTCGAATAACAATCGCAAAATCTATTTAT
TGTAATTTCGTCCGAAGGCCCAGCTGTATACTTCCCCTGCGGACCATGATCAGGCAAGTGGCGATACTTTAATTCTCTCACGGGTTAGTGCTGGGCGCTAGTTGTTACGTCCAGGAACATCCTACGCAGGAAGATCAGTAGCTCCATTTCAGGCGC
AATCGGACATTTAACGCCAACTTGTTGCCATTTCAGCTCTCCCCAAGAAACCGCAGCTACTACGAGCCTCGGCAAAGGTCCACCCCCTCAAACATTTTCTCTGGCAAACGATCCATTAGCCAGAAGAGGTACAAAAAAGACAAACTCTCACGATCG
ATTCTTCTGGATTCTGTGAAGGAATGTCATATTTACCACTCGAATACACTGAGGTACACCCTGTAAGAATGACAAAGGGCCAGTCTTTGGAAGGGAGATGATGGTTCCAATAGTCAGACCTCAGTTACAGTGAGGCTCGAACCTCACCACGCTCAG
AGTCCTGTGCACCTAAGGGCCAGATGGCCTGGAGCCTTCTTTGGGCCGCTTTCCGGCGCGATCTGGGTGGCTTGAGGGGGTTCACTTCTAAGCTGACGGTAACTATTATATAAACATGCACTAGTGACATTCAGCGTTTCACTTAACCCCATTCGA
CCGGAGCGGGAGCAGACTCGGTTCTCAACATGGGAGGAAAATCCGCCATCTATCGTGCTGGCTCGCCGAATGCTGGGTCCCATAGCTAAGGAGGATAGGACGTGCTAGAATCTCTCGGGACTAAGGACCATGCAAGTGCCGGCAAATGGCGTGCAG
AGAGGCTCTACACGAGAAGCGGAGCGGCTCTAGGCCGTGCAGGGATTGCGGACTTATGGGCTGTAAAGTAAGTTAGCGCGTGCAGGTTCACAAGTCCTAAGGTCCACACTACAGAATCTCAATGAGCGTGGCTCATATCAGTCCGGTGGCTGACAT
ACAACGCGAGGAAGGTCCGTATCCTGGCGATGATGTGTATGGCCATGAGGCCAACGAGACCTTAAAGTGCGTCACACCGCCAAAGGGACCAACCTGCCGGCATGGAAAATACACATTTGCCTAAGGACCAACGCTGGCTATGTATCGCGGGCGCGT
TTTGAAACACCTCCTCAGTCCATCCATATGATGCTGTGGTCGTATCGACCGTCTAAAAGGGAGGCAAGAGTGCCAAGGTCCATTTACAATGTTCACCGTTACAGCCAAATCGTCTTTACTCCGTTGACATCTAAGCTGGTGTCTTCGACGCCAACT
TTAGGTCATTTGAAGCTCAGGCAGGCTCCCATTGAGATCAGCAAAGCTTCAGTGTTAAAACGTTCCCTCCAGCTGTTCTTAAGATCAGCACCTGTCCGAAGGCCCAGATTACACGCCGACTCACGCTCAGAAAGCGAGCGTTTCTTATTTACATGC
CGCGCCACAGCGTAATGATCGTAAGCGATAATACCCTCTTGGGAGCGAGGTAAGTTTATGGGTATCCTCGCGGGTTTCGCCGATCTGCCTTTTTTGCTAGCGCACTCCTCCCCTAATCCTCAAAGGTCCAAAGAAATTATGGACACTATCAAGAGG""".split("\n")

dnaa = """GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC
CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG
ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC
GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC
GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG
CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA
GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA
GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG
GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG
TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC""".split("\n")


for k in range(8,13):
    ans = greedy_motif_search(dnaa,k,10)
#     print(ans)
    print(k ,consensus(ans), score(ans))
# consensus(['ACCT', 'ATGT', 'ACGG', 'ACGA', 'AGGT'])
# consensus()

8 CCGGCGGG 12
9 CCATCGGCC 17
10 CTATCGGCCC 19
11 GGACTTCCGGC 20
12 GGACTTCCGGCC 24


### Input: DNA Strings and a profile matrix
### Output: Most probable motifs in all the DNA-strings

In [66]:
def get_motifs(dna, profile):
    motifs = []
    for dna_str in dna:
        motif = profile_most_probable(dna_str, len(profile[0]), profile)
        motifs.append(motif)
    return motifs

# prfl = """0.4 0.3 0.0 0.1 0.0 0.9
# 0.2 0.3 0.0 0.4 0.0 0.1
# 0.1 0.3 1.0 0.1 0.5 0.0
# 0.3 0.1 0.0 0.4 0.5 0.0""".split("\n")
# prfl = [i.split(" ") for i in prfl] 
# prfl1 = []
# for i in prfl:
#     l = []
#     for j in i:
#         l.append(float(j))
#     prfl1.append(l)

dnas = ["ttaccttaac", "gatgtctgtc", "acggcgttag", "ccctaacgag", "cgtcagaggt"]
dnas = [i.upper() for i in dnas]
p = [[0.8, 0, 0, 0.2], [0, 0.6, 0.2, 0], [0.2, 0.2, 0.8, 0], [0, 0.2, 0, 0.8]]

get_motifs(dnas, p)

['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']