# Week 3: Which DNA patterns serve as molecular clocks?

## 1.2 Code Challenge: Implement `MotifEnumeration`

Input: Integers k and d, followed by a space-separated collection of strings Dna.
Output: All (k, d)-motifs in Dna.

```
MotifEnumeration(Dna, k, d)
    Patterns ← an empty set
    for each k-mer Pattern in Dna
        for each k-mer Pattern' differing from Pattern by at most d mismatches
            if Pattern' appears in each string from Dna with at most d mismatches
                add Pattern' to Patterns
    remove duplicates from Patterns
    return Patterns

```

### Sample Input:

```
3 1
ATTTGGC TGCCTTA CGGTATC GAAAATT
```

### Sample Output:

```
ATA ATT GTT TTT
```
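A brute-force scan over all 4^k candidate k-mers is one way to cross-check an implementation against the sample. This is a sketch, not the pseudocode's method: it tests every possible k-mer rather than only the neighbors of k-mers found in Dna, which yields the same set but is only practical for small k. The names `hamming`, `appears_with_mismatches` and `motif_enumeration_bruteforce` are illustrative.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def appears_with_mismatches(pattern, text, d):
    """True if some k-mer of text is within d mismatches of pattern."""
    k = len(pattern)
    return any(hamming(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))

def motif_enumeration_bruteforce(dna, k, d):
    """Sorted list of (k, d)-motifs, found by testing all 4**k k-mers."""
    motifs = set()
    for letters in product('ACGT', repeat=k):
        candidate = ''.join(letters)
        if all(appears_with_mismatches(candidate, text, d) for text in dna):
            motifs.add(candidate)
    return sorted(motifs)

sample = ['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']
print(' '.join(motif_enumeration_bruteforce(sample, 3, 1)))
```

The result is sorted so that it can be compared directly with the sample output.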

In [1]:
def change_one(sequence, position):
    bases = ['A', 'C', 'G', 'T']
    seq = list(sequence)
    i = position
    variants = [''.join(seq[:i] + [b] + seq[i+1:]) for b in bases if b != seq[i]]
    return variants

def change_many(sequence, positions):
    seqs = [sequence]
    while positions:
        p = positions.pop()
        seqs.extend([v for s in seqs for v in change_one(s, p)])
    return seqs


from itertools import combinations

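def neighbors(pattern, threshold):
    # All strings within Hamming distance `threshold` of `pattern`.
    found = {pattern}  # renamed so the local set no longer shadows the function
    for i in range(1, threshold + 1):
        # Choose which i positions to mutate, then substitute bases there.
        for c in combinations(range(len(pattern)), i):
            found.update(change_many(pattern, list(c)))
    return list(found)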
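A quick sanity check for `neighbors`: the d-neighborhood of a k-mer should contain exactly the sum over i = 0..d of C(k, i) · 3^i strings (choose i positions, then one of 3 other bases at each). The sketch below compares that count against an independent brute-force neighborhood. The helper names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import product
from math import comb

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def neighborhood(pattern, d):
    """All k-mers within Hamming distance d of pattern (brute force over 4**k)."""
    k = len(pattern)
    candidates = (''.join(letters) for letters in product('ACGT', repeat=k))
    return {c for c in candidates if hamming(pattern, c) <= d}

def neighborhood_size(k, d):
    """Closed-form size of the d-neighborhood of any k-mer."""
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))

# For k = 3, d = 1: the pattern itself plus 3 positions * 3 substitutions.
print(len(neighborhood('ACG', 1)), neighborhood_size(3, 1))
```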

In [2]:
def motif_enumeration(dna, k, d):
    # For each string, collect the d-neighborhood of all of its k-mers.
    # A (k, d)-motif is a pattern that is present in every one of these sets.
    patterns = []
    for chunk in dna:
        kmers = [chunk[i:i+k] for i in range(len(chunk) - k + 1)]
        extended_kmers = {kmer_prime for kmer in kmers for kmer_prime in neighbors(kmer, d)}
        patterns.append(extended_kmers)
    return list(set.intersection(*patterns))

In [4]:
k = 5
d = 1

dna = [
    'CTGCGGGCTGCCATAGATATCTCGA', 
    'TATATTTCTGGTGAGGCGGGGGCTG', 
    'CCAATACGTCAGGCGTGAATGGCTG', 
    'CGTGTGGCTGTCCCACCTGGTTCGG', 
    'TTTCAGAAATCGCTTGACTGACCCA', 
    'AGCCTACCCCGGCTGTTAAGTACGG'
]

' '.join(motif_enumeration(dna=dna, k=k, d=d))

'GTCTG CTGGC GACTG GGCTG GCCTG GGTTG TGGCT CGCTG GGCTT'

## 1.4 Code Challenge: Implement MedianString

Input: An integer k, followed by a space-separated collection of strings Dna.
Output: A k-mer Pattern that minimizes d(Pattern, Dna) among all possible choices of k-mers. (If there are multiple such strings Pattern, then you may return any one.)

```
MedianString(Dna, k)
    distance ← ∞
    for each k-mer Pattern from AA…AA to TT…TT
        if distance > d(Pattern, Dna)
             distance ← d(Pattern, Dna)
             Median ← Pattern
    return Median
```

### Sample Input:
```
3
AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG
```

### Sample Output:

```
GAC
```

In [1]:
def hamming_distance(seq, seq_prime):
    diff = [int(a != b) for a, b in zip(seq, seq_prime)]
    return sum(diff)

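def distance(kmer, dna):
    # d(Pattern, Dna): sum of the minimum Hamming distances to each string.
    k = len(kmer)  # derive k from the pattern instead of relying on a global
    d = 0
    for chunk in dna:
        d += min(hamming_distance(kmer, chunk[i:i+k]) for i in range(len(chunk) - k + 1))
    return d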

In [2]:
from numpy import inf

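def median_string(dna, k):
    # Note: unlike the pseudocode, which scans all 4**k k-mers, this only
    # considers k-mers that occur in dna. A true median string need not occur
    # in any input string, so treat this as a faster heuristic.
    median = None
    d = inf
    kmers = list(set([chunk[i:i+k] for chunk in dna for i in range(len(chunk) - k + 1)]))
    for kmer in kmers:
        d_prime = distance(kmer, dna)
        if d > d_prime:
            d = d_prime
            median = kmer
    return median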
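
`median_string` above draws its candidates only from k-mers that occur in `dna`, whereas the pseudocode scans every k-mer from AA…AA to TT…TT. A minimal exhaustive sketch, closer to the pseudocode, might look like the following. The names `pattern_distance` and `median_string_exhaustive` are illustrative, and the 4^k scan is only feasible for small k.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def pattern_distance(pattern, dna):
    """d(Pattern, Dna): sum over strings of the minimum Hamming distance to any k-mer."""
    k = len(pattern)
    return sum(
        min(hamming(pattern, text[i:i + k]) for i in range(len(text) - k + 1))
        for text in dna
    )

def median_string_exhaustive(dna, k):
    """Scan all 4**k k-mers in lexicographic order; return (median, distance)."""
    best, best_distance = None, float('inf')
    for letters in product('ACGT', repeat=k):
        pattern = ''.join(letters)
        current = pattern_distance(pattern, dna)
        if current < best_distance:  # strict: the first minimizer wins ties
            best, best_distance = pattern, current
    return best, best_distance

sample = ['AAATTGACGCAT', 'GACGACCACGTT', 'CGTCAGCGCCTG',
          'GCTGAGCACCGG', 'AGTTCGGGACAG']
print(median_string_exhaustive(sample, 3))
print(pattern_distance('GAC', sample))
```

Because several k-mers can tie, comparing distances is a more robust check than comparing the returned strings.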

In [3]:
k = 3
dna = [
    'AAATTGACGCAT',
    'GACGACCACGTT',
    'CGTCAGCGCCTG',
    'GCTGAGCACCGG',
    'AGTTCGGGACAG',
]
sample_in = (dna, k)
sample_out = 'GAC'

In [4]:
assert median_string(*sample_in) == sample_out

In [5]:
k = 6
dna = [
    'ACGCCAGCCTGCTCCTTGGAAGGGGCTCTACTATATCGTAGT',
    'TCTCTACACAATACTTATGCTTGAGCCTGATAAGTAAGGTAT',
    'AATCCAACTCTACCTAAATGTATACGCAGTAACATAGCCGGC',
    'TCTTTCATAGATTGCAGAGGTTGGTCTCTAATCTTAGCGATC',
    'CATAGCCCTCTAAACAAACGCATTGTCCTTCCACTTTAAAGG',
    'ATTCCGTGACATCCTCTATTGAGCGGTGGCACGTAGCACACG',
    'AACCTATCATTGACAATAGCTCTAGAGATAGTGTGTGTGGTC',
    'GGTGCTGCAAGAGCTCTAAAGGGTATGTAATGTTAGATCCGT',
    'AACAGCGCTCTAACTGATAACCGTCTTCGAGAGAAAACGCGT',
    'TATGATAGATGACTGGATGGCCGGATGCTATATGGTTCTCTA',
]

median_string(dna, k)

'GCTCTA'

In [27]:
k = 7
dna = [
    'CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC',
    'GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC',
    'GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG',
]

median_string(dna, k)

'GAACCAC'

## Code Challenge: Solve the Profile-most Probable k-mer Problem.
Profile-most Probable k-mer Problem: Find a Profile-most probable k-mer in a string.

Input: A string Text, an integer k, and a 4 × k matrix Profile.
Output: A Profile-most probable k-mer in Text.

### Sample Input:

```
ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT
5
0.2 0.2 0.3 0.2 0.3
0.4 0.3 0.1 0.5 0.1
0.3 0.3 0.5 0.2 0.4
0.1 0.2 0.1 0.1 0.2
```


### Sample Output:

```
CCGAG
```
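
For reference, the probability of a k-mer given a profile is the product of the profile entries selected by each base, column by column. The sketch below works this out for the sample output `CCGAG`, using the sample profile with rows ordered A, C, G, T. The helper name `kmer_probability` is illustrative.

```python
def kmer_probability(kmer, profile):
    """Pr(kmer | profile) for a 4 x k profile with rows ordered A, C, G, T."""
    rows = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    probability = 1.0
    for column, base in enumerate(kmer):
        probability *= profile[rows[base]][column]
    return probability

profile = [
    [0.2, 0.2, 0.3, 0.2, 0.3],  # A
    [0.4, 0.3, 0.1, 0.5, 0.1],  # C
    [0.3, 0.3, 0.5, 0.2, 0.4],  # G
    [0.1, 0.2, 0.1, 0.1, 0.2],  # T
]

# CCGAG picks 0.4 * 0.3 * 0.5 * 0.2 * 0.4, i.e. about 0.0048.
print(kmer_probability('CCGAG', profile))
# The first 5-mer of the sample text, ACCTG, scores much lower (about 0.00024).
print(kmer_probability('ACCTG', profile))
```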

In [6]:
def profile_most_probable_kmer(text, k, profile):
    bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    probabilities = dict()
    for i in range(len(text) - k + 1):
        kmer = text[i:i+k]
        if kmer not in probabilities:
            # Pr(kmer | profile) is the product of the per-column entries.
            probability = 1
            for j, base in enumerate(kmer):
                probability *= float(profile[bases[base]][j])
            probabilities[kmer] = probability
    # max() returns the first maximal item and dicts preserve insertion order
    # (Python 3.7+), so ties go to the k-mer occurring first in text.
    most_probable, _ = max(probabilities.items(), key=lambda x: x[1])
    return most_probable

In [7]:
text = 'ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT'
k = 5
profile_ = """
0.2 0.2 0.3 0.2 0.3
0.4 0.3 0.1 0.5 0.1
0.3 0.3 0.5 0.2 0.4
0.1 0.2 0.1 0.1 0.2
"""
profile = [line.split(' ') for line in profile_.split('\n') if line]

sample_in = (text, k, profile)
sample_out = 'CCGAG'

In [8]:
assert profile_most_probable_kmer(*sample_in) == sample_out

In [11]:
text = 'TCTCATGGGGGAGTCTAGTACTTTAAACAGATTTCTTTTGCACACCCGTAGCCGCTGATCAAAATACAATGGACTTCTCAAAAGATCGAGGGACCCTCACCCGGGGGAAATGATAAAGTCTGATAGGAAGCAATCCGGAGTTGAATCACGAAGACGCCCGAAACACCCGATCTAAATAGAATCTTAGGCATAACCCGCACGTGCAACTGGGGAACATTCTGCGAAGTTGAGGGGGAATGTGATTCCGCAATCCGATATAATAACTTTTAAGAGTGACTGGGGACTGCACAGCCACCTTCATGAACACCCTAGGGTTTAGGACCGAAGCAGTAATCTCTCAGCCGTACCGATTTATTGAGGCGGCCAGTTACCGCCCTAGTGGCTCCTTTAACAAGCACTTCTCTTGTTTCTCAATCGGTTAGGTAATTTCACCCAAATACTCTGGTAGTAACAGTACTCACCCGCCAACCCACATACACTCATTAGCTTCGAGTTGTAGTGGCGGATCTAAGAGTTCGGGTCGAAACCTGTTTCTTCTCTGTTTTCCAGGCTCGGTGAAGAAAGCTAAAGGGGTCACTTCCCTCTTGCACTCCTTGATTAGCACCCACATTTCTCTCTGTATTTATCCAATCAATAGCTCGTAGCCTAAGTATGTATTTGAATTGCGCAATTGGGGCTACCCCGTCCGTGCACGTCGGAGGTACTGTCCTCGGAGGCAAAACCAACGTGTATTCCGCACCCATCGAAATACTGTATGTGACATGAAGACCAAAGAAAGGCTAATGCTGTTGCGTCTAATCCGCTTAACGTTATGTTGCTACGCGGTAACCTATGGGATTTGGGCAGGACTCTCTTGCTGATTATACCAATTATTCTGAGCGCAACGCCGCCCGGATCTTCAGTCTCTTATGTGACGATAGACAATTGACAGCAAAACTTGGGATCAAATAGTCATCACCGTGACTAATCGAGGTTTGAAGTTGCAGCGTTTAGCTGAGGT'
k = 15
profile_ = """
0.152 0.273 0.303 0.227 0.197 0.136 0.242 0.258 0.167 0.242 0.258 0.121 0.242 0.242 0.227
0.258 0.197 0.197 0.212 0.348 0.212 0.318 0.318 0.258 0.167 0.288 0.288 0.288 0.182 0.258
0.258 0.227 0.318 0.212 0.182 0.333 0.197 0.182 0.258 0.379 0.167 0.273 0.273 0.273 0.227
0.333 0.303 0.182 0.348 0.273 0.318 0.242 0.242 0.318 0.212 0.288 0.318 0.197 0.303 0.288
"""

profile = [line.split(' ') for line in profile_.split('\n') if line]

profile_most_probable_kmer(text, k, profile)

'GACTCTCTTGCTGAT'

## Code Challenge: Implement GreedyMotifSearch.

Input: Integers k and t, followed by a space-separated collection of strings Dna.
Output: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t). If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring first.

```
GreedyMotifSearch(Dna, k, t)
    BestMotifs ← motif matrix formed by first k-mers in each string from Dna
    for each k-mer Motif in the first string from Dna
        Motif1 ← Motif
        for i = 2 to t
            form Profile from motifs Motif1, …, Motifi - 1
            Motifi ← Profile-most probable k-mer in the i-th string in Dna
        Motifs ← (Motif1, …, Motift)
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
```

### Sample Input:

```
3 5
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
```

### Sample Output:
```
CAG CAG CAA CAA CAA
```
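
Score(Motifs) can be read as the total number of mismatches between each motif and the consensus string, which is the most frequent base in each column. The sketch below evaluates it on the sample output. The helper names `consensus` and `score` are illustrative, and ties in a column go to whichever base `Counter` encountered first, which does not affect the score.

```python
from collections import Counter

def consensus(motifs):
    """Most frequent base in each column of the motif matrix."""
    return ''.join(Counter(column).most_common(1)[0][0] for column in zip(*motifs))

def score(motifs):
    """Total number of positions where a motif differs from the consensus."""
    cons = consensus(motifs)
    return sum(sum(a != b for a, b in zip(motif, cons)) for motif in motifs)

best = ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
print(consensus(best))  # the third column holds G, G, A, A, A
print(score(best))      # only the two G's disagree with the consensus
```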

In [12]:
import numpy as np

def form_profile(motifs):
    bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    counts = np.zeros((4, len(motifs[0])))
    for line in motifs:
        for j, base in enumerate(line):
            counts[bases[base]][j] += 1
    freqs = counts / counts.sum(axis=0, keepdims=True)
    return freqs.tolist()

def score_motifs(motifs):
    bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    n = len(motifs[0])
    counts = np.zeros((4, n))
    for line in motifs:
        for j, base in enumerate(line):
            counts[bases[base]][j] += 1
    score = sum(len(motifs) - counts.max(axis=0))
    return score

In [13]:
assert score_motifs([
    'TC',
    'CC',
    'AC',
    'TT',
    'AA',
    'TT',
    'TC',
    'TA',
    'TC',
    'TC',
]) == 7

In [14]:
def greedy_motif_search(dna, k, t):
    # t (the number of strings) mirrors the pseudocode's signature; the loop
    # below simply walks over dna, so it is not used directly.
    best_motifs = [chunk[0:k] for chunk in dna]
    for i in range(len(dna[0]) - k + 1):
        kmer = dna[0][i:i+k]
        motifs = [kmer]
        for chunk in dna[1:]:
            # Build the profile from all motifs chosen so far.
            profile = form_profile(motifs)
            motifs.append(profile_most_probable_kmer(chunk, k, profile))
        if score_motifs(motifs) < score_motifs(best_motifs):
            best_motifs = motifs
    return best_motifs

In [15]:
k = 3
t = 5
dna = [
    'GGCGTTCAGGCA',
    'AAGAATCAGTCA',
    'CAAGGAGTTCGC',
    'CACGTCAATCAC',
    'CAATAATATTCG',
]
sample_in = (dna, k, t)
sample_out = ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

assert greedy_motif_search(*sample_in) == sample_out

In [17]:
k = 12 
t = 25
dna = [
    'TTACCTGGCTGTACCTTTACTCTATCTCGGCTCTGGAATATGTCGGCTTCAAGGAGACTGACCCTGCTGGTCAGCGAAATAGACACCGCTCCTTGCAGAGGACTCTCGCGTCATGCCCGGACCGGACTCGTAAATGACCGGCTTGTAAGTAGGTAT', 
    'CATCCCGCAACGAGGAGCGCGTGGCAGGACGTCGCGTCGTTGATAGGTAATATAAGGATTATTCTGATTACTGTTAGTATGTTAACTGCACTAGTAAAGTGCAGATCCGCCGTGTTAGGCTCGCTTATGTTGTACCTCGGGTACCCCACAACTAGT', 
    'TCTCCAGTTCGTGAGTAATATATAAATGAGGCAGGGCAAGGACATGCAGCCGGTCCCGCCGGCTCACAAAGGGGGATTACGCAACCAGCACTGGTAATTTGGAGAGTTGGACTTTGGGGAAGCGGCTCAGGGTCTGGAACTATAAACCGGTTTGAG', 
    'TATCATGTGTAATACCACCGACCACAATAATGATCCGTGCGCCCTCATAACGCGAGCTTAACGGTACTCGTACCCACTTATGTAGCCGTAACCCCAATGCGTATCTAATGGATGCAAGTGTCAGCGTGTAAAGGCTGAGTCTCTGACAGATACGTG', 
    'TGATTCAAGTTCGTACGATCGAAGGCTTTCACCCTCCAGACACCATATAGTCGCAAAACAGCGCTGGTTCGACTCAGGAGGTCTTGTAAGTAACGTGGTTCTAAGACAAGAGGTGAGAACGCGGGACTGGTAGGTAGAGGTGTAACACCTATCTCT', 
    'TATACATCATACCCGCGTGGCGCTGTTTACAGGCAGGCTGAACTCGTAGCCTTATGTCACTCCCAGACAATCCGCTGGTCACACAGGCTCCAGGGTATCCCAGGGCACATAGCCCGTCAAAGTGCGGGTCCTCCAACGTGTCCCGTAAGCACAGCC', 
    'TGAAGATGTGGCTGTAGCTTCCCACGTTCTATGCAAACTCTCCGTAGCAAGCTAATGATCTACACGACTCAGGCGTACTCTCATTCAGTACTGGTATGGAGACATCGACCAATGATTCACGAGCACGAATGTTGGCATAATCTAGTGAATCGCCTG', 
    'ACGCCATCCTCATTAATGGTCCGCCATCAGCTAAATTCTAGCCTCAAGCTGTTACTTAGCGAACGAAGGTCCTGGAAGAGATGGATGCTGCAGGGCGCCGAACTCGTATGAAACGTTTCCACCAGATACGATTAGCACTTAACGGGATAGTGACTA', 
    'GATGTTGCCGGTCGCATCATGTTGTATCAAATTTCTGCAGATATTGCATCAGTGCTAACGGTTTCAAAAGGGGCCGGACTGGTATTTTGTTGCCCGATAATTCATGGCGGGCGCCTCACCTTTTAGATCCTTTCTGGCCGCTCTTGCGAAAGATCT', 
    'CCTGCACTGGTAGCGGTACAGCTGATATGATGGATAGGTCGGGTCTCCTTCAGAAGAAAAGCGGGACTATTGCAGAGTGGCGGACTATACGATAAGTGCAGAGAGGTTGTTAAGAGTCGTACGCGCGGAGGTGTTCACAACTGTTCCCCATTTGCC', 
    'GACCGGTAACGTACTATTTCGAAATAACAGTGAAAATATTAATTGGTACCAATCTTTCGCCAGAGCAGGGGCTGGCGAACGACTGCACCCCATCTCCCCGGTCGTGTGTCGTTAGAGACGTAGAATACTAATCACTCCGCAATCACTGGACTGGTA', 
    'CACATCTCTTGGATCTAATTTGCTGCGACAATCATTGCGTGCGGCGAACGAACAGAACGGTCGTTTACCTGGTGCCCCAAACTTACAGTGCTTTCCGATTACCCGTGTGCCGGACTCGTATAATGAAGGTGGGTAAATAATTCCGTTTGAAAACAC', 
    'ACGGCACTTGTACTTCTGACGTGAGGCGAGAAGTAGGCCTACGCTAATCCTCCCTACGCCCAGGTGTCACAGGTAAAGGAAGTATCGGCTCAGTATTTTTATCGGCTTTCCGGAGTCATCACACCAACTGAACGTTCACCTTTCTCTAAGGTGAGT', 
    'ACCGTACTTGTATTCTGCTGAGTCTTGTTCCCTCCAACTGCCATCTAATACGTCCCAATATGCCCCGGAGCAGCCGACCCGCCAGCTCCAGGGGCCGGTATTTTACTATTAGCGGTAGTATGTAGTTGTTTTTTGACCGTCTATATTTGAACGAGT', 
    'TTTGAAATAGGTAGGCATACGCCCAAAACCACGCGAAAGGTCTGACCACAACGCAACGCGCTTGGCTCGTGGGCTCTCATCTGTCTTGTCGGCTTACCAGGACTGGTAGTAGCATTCTCGAGGATTTCTAATTGCTATGGTTGTTATTCTTGCCTG', 
    'ACCTTACCGTAAAGCTTTAGGCGCAAGCCCGTTTGCAAAGGGCTGGGGCCAGTACTAGTAAGAATCGCACACCGAGGATTCCCCCAGCTGTCCCGGCACGGATAACGTCGCTAGCTGGTCCGAAGATAGAACCCAACGGACCTGATATAAGCGGAC', 
    'TGGGAAGGCTATCGTCACTAGGATAATTAGAGACAGCATTAGTGAAGCTGACGGAGTAAGAAGAGTGCGTGAATTAGTTGCGCCATCTCTTAATCTTACGGGAGTGGATGCAACTGGACCATACGGTTCCCTAGGTTAACTTATACTGGACTTGTA', 
    'GGAGCACAGTTCTCTCAAAACGGTGTGACGTGAAAGGATATCTAAATTCCGGGACTGGTAGTGTCCCCTAATCACTCCATAAACGTGAATATCCAGCGTATAATTCATCAACCGCGCCAGGAATAAAAGGGCATGGGTATCCAGTGCCTCGAAATC', 
    'TAGACCGAGTCCCACTTGTTAAAACAATGTGTGGTGTCTGCACTAGTAGTTCGATGATACCCCCTTCGTGGTATCCAGACGCCTTAGTGCTGCCCACGGGCGGCTTGAGCCCATGTTAGGCTGAATGTTTAGGTGTCTTTCATGAAAATGAACCGG', 
    'ACATGTACGTCCCAGACCGGAGGTATAGAGGTGGTCGGGGATGCTCGCATTCGCCAGTAAGGAACTGTCGATGCGGTACTTGTAACTTCTTTTATAATCGTCATCACGCACTTGACTTCCCTTCTATAAATCACACCTCCACTCCATATTTAATGT', 
    'GATTATTTCCTCATACAGGCGGCGGGTGCTGGAGACCGCTTCAGTGCCTGAAAGTTTCAGCCCATAATAATCCCTGCCCACCCCCTGAGGATGGACTCGAACCAGCACTCAGTACTGGTAAATCCACCCACAAGATTCACTATTTGAGGACAATAT', 
    'TAGTTGACATGATTCATCTGGATACTTGCAGCTGACAATTCCCGGACGGATTCCACCGGGCATTGAAAGAATCACGTGAATAAGACAAAATACTCTAGGCTGCGGGACCCAGCACTCGTAAGGGTCTTTCTGTACCGATGTCACGTCGTGGACCTT', 
    'GTCACCCGTATTTGACCACGAGATGCATCATTGGGAAAGTAACGCCCGGGATTGCAAGCCGTATTTTCTAACGCCACCCCATAGCCGGGACTTGTATTAGGATCATTGCTTCGTCACATGGCCACTGGGTTCATTTCTATCAGTCATTTCGCCTTG', 
    'TTGACCACCGCAAGATGCAGTTATAGTACTGTTGTTCTTGTCTCCTTGAGGGAAACTTTACCAGAACTGGTACGCCGTACTATTGTTACGGGGATAAGAATCGCATTCAAACAGCCGTGTGCTTACTTAAGAGGCCCCAGAGTCGGCGAACTTACG', 
    'CGCCTCAACTTCCGTGGCTCGAGTTGAACCAAACAGACGCCGTCGATCCATCAAGGTTCGGTCAAACGCTGGTCTGTGGAAAGTAGGTTGACTGTGTCGGTACTGGTAAAGTGTTGCACGCTCCTATGCAATAATCCAGCTTCGTCTAAGTTCGGC',
]

greedy_motif_search(dna, k, t)

['GCCCGGACCGGA',
 'CATCCCGCAACG',
 'TCTCCAGTTCGT',
 'TATCATGTGTAA',
 'TGATTCAAGTTC',
 'TGTCCCGTAAGC',
 'TCACGAGCACGA',
 'GAACTCGTATGA',
 'TCTTGCGAAAGA',
 'CCTTCAGAAGAA',
 'TCTTTCGCCAGA',
 'GGACTCGTATAA',
 'TCACAGGTAAAG',
 'CGACCCGCCAGC',
 'CCACGCGAAAGG',
 'GAACCCAACGGA',
 'GCATTAGTGAAG',
 'TGACGTGAAAGG',
 'TCATGAAAATGA',
 'TCACGCACTTGA',
 'CCACCCACAAGA',
 'GCACTCGTAAGG',
 'TCACCCGTATTT',
 'GCATTCAAACAG',
 'GAACCAAACAGA']

## Code Challenge: Implement GreedyMotifSearch with pseudocounts.

Input: Integers k and t, followed by a space-separated collection of strings Dna.
Output: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t) with pseudocounts. If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring first.

### Sample Input:

```
3 5
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
```

### Sample Output:

```
TTC ATC TTC ATC TTC
```
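
Pseudocounts (Laplace's rule of succession) add a small constant, here 1, to every cell of the count matrix before normalizing. No profile entry is then zero, so a single unseen base cannot force a k-mer's probability to zero. The sketch below uses exact fractions to make the effect visible. The name `profile_with_pseudocounts` and its `pseudocount` parameter are illustrative.

```python
from fractions import Fraction

def profile_with_pseudocounts(motifs, pseudocount=1):
    """Profile as {base: [column probabilities]}, adding pseudocount to every count."""
    total = len(motifs) + 4 * pseudocount  # each column gains 4 * pseudocount
    columns = list(zip(*motifs))
    return {
        base: [Fraction(column.count(base) + pseudocount, total) for column in columns]
        for base in 'ACGT'
    }

motifs = ['AT', 'AT', 'AC']
plain = profile_with_pseudocounts(motifs, pseudocount=0)
smoothed = profile_with_pseudocounts(motifs)

# Column 0 is all A: without pseudocounts C, G and T get probability 0.
print(plain['A'][0], plain['C'][0])
# With pseudocounts the same column becomes 4/7 for A and 1/7 for the others.
print(smoothed['A'][0], smoothed['C'][0])
```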

In [18]:
import numpy as np

def pseudo_form_profile(motifs):
    bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    counts = np.ones((4, len(motifs[0])))
    for line in motifs:
        for j, base in enumerate(line):
            counts[bases[base]][j] += 1
    freqs = counts / counts.sum(axis=0, keepdims=True)
    return freqs.tolist()

def pseudo_greedy_motif_search(dna, k, t):
    # Same greedy search as before, but profiles are built with pseudocounts.
    best_motifs = [chunk[0:k] for chunk in dna]
    for i in range(len(dna[0]) - k + 1):
        kmer = dna[0][i:i+k]
        motifs = [kmer]
        for chunk in dna[1:]:
            # Build the profile from all motifs chosen so far.
            profile = pseudo_form_profile(motifs)
            motifs.append(profile_most_probable_kmer(chunk, k, profile))
        if score_motifs(motifs) < score_motifs(best_motifs):
            best_motifs = motifs
    return best_motifs

In [19]:
k = 3
t = 5
dna = [
    'GGCGTTCAGGCA',
    'AAGAATCAGTCA',
    'CAAGGAGTTCGC',
    'CACGTCAATCAC',
    'CAATAATATTCG',
]
sample_in = (dna, k, t)
sample_out = ['TTC', 'ATC', 'TTC', 'ATC', 'TTC']

assert pseudo_greedy_motif_search(*sample_in) == sample_out

In [20]:
k = 12 
t = 25
dna = [
    'ATTTTCCATGGTCAGGTGGCGGTTGGTGAAACTAACCATCGTACAGCTCTGAATAACCTTCATTGTCTTGAACCTCTGATCCCTCTCCGGAACCTAAGATCCACTCGAGCGATTATAAGTGGCATCTCGGCAGACACTAGGCGTTTCGTTATCATC',
    'ACGATGTTGTACTGATAGGCAAATCGTCGCACGGTGGAACCACACGCCGGCCAATGCCTCAATCGTTCGGCCGATCAACTATTAGTGAGACAGCAGGCCAGCCAGAGAGTCCGGGGCCCAGGGGTACAAAAACTTACAGCGATTCTGGGAGTCCGC',
    'ACTCACCCGGTGTTCCGGCTCCGATCGACGATCCTCAATCTTGTAAGTAATTGTGACTCCGGTTTCAACAAACGCTCTCGAATAGCGGATATTCTACTACAACGATTTAATCACGCTGGTAAGGAGTTGGGCAGGCTACGTGCCTGGCTGTATAGG',
    'AGTAGGTTGTTTGTAAATCTTCTGGCCAGGTAGGTGCCTCTTTGGTATTCGTCGAGGCAGGCGTTTGGGCCACCCTGGGTGCTCTGAAAATCCCCGTGTTTCGAGGAGAGCCGCATAACCTGCAACGCTACTCTCCGGAACCCACCTTGACTAGGA',
    'TGTAACGTTTAAGCGCGACTTTAGCTCTCGAGGGCTGCTGTTACAAGGACTCCCATACTCGAGCCAGTTAAGACTCCGCAGCGCGTCCGGTACCAACCACAGCAGCTTTTCAGGTCAGCGAACAGCTCAGGTGAGAGCGCATAGGGCTCTCATGGC',
    'CACAGATAGCGTACGCCAATTTGTCCTCGACACTAACTGTTTCTCGCGTAGTCATTTAGGCTCCGCTTATGTCAACTACGATTCTTCCGGAGCCGAGGTTTCCCTGAACACTGCGATGCTTCCTGGTAACTCACCCTCTTATTTGGTCTCCGGCTT',
    'GCGTATCTTCAGTTCCGGAACCGAGGTGTGGATAAACTCAGTCTGCACTTCATCGGTGAGCTAACGTTGTAATACCGGACAAAGGATGACTCGACGCGTTCCTGTCTTGCTTATAGGTGGCTCAGATCTGACTGAGATCATAAGGGTCCTGTAGCC',
    'GCTCTGCAGGCGGGAATTTCGGGTCTGGGCTGTGGAGGCTCCTTGCACACGAGTGTCGCGCGTACGAGCGGCGAATTAAGAATATTCCGGTGCCTAGGTCCAGCCGAAGAATGTACTATTACGTAATATTGTATCGGCCAAGGATTTGACCCAAGT',
    'GTGTAGATCTGTTTCAATGCTCGTGAAGGACACGGAGTCCGGGGCCTACCTAGCACCAAAGCGCTTTCCCAACGAGTTGGGTGCTCCGTCGTACCACGTCGGCAACTTTTATTCCACGCCTCGGTGGATGCCCGACTGTCCGCTCAACGTACTGCC',
    'ACAGCGCCAATTACATTAGCATTGCGCACCACGGCGTTCCGGACCCAAGCACGAGCGTGGGTCCAATTGGTGAAAAAACCGCGATTCCAAGGAGCCTTTGTAGGATGCGAAGAAGCTTACAATTAGCCACTGAGAAGCATGGTACAAACGCTCCAC',
    'CTCCGGGACCTACTCGTGTATAGTATATTTCAATATCGTTTTACGGCTCTCGAGACCGTTCTAGCAGGACTGGACGATAACACGGTTGTTTATGACAGTGTTGAATACGTGATGATACGGTCGGGCTCTCTTGGCTAAATCCCACATTAGTGTGCC',
    'AAGGCAATGAGGAGTTTATGGGAAGGGGGTCTAGACGGAGGCGCCGTCGAGACGAGGGTTGTCCGGTACCCAACCCAATGCCTTCAAGGCCACTGCTTTGCGAGGTTCTAGTCGCTGAATTTTTGAACAGGGCACCTGGCAATTGCTGTGTGTGTG',
    'ACGACCGGGGAATATGTCTGATATTATGGGCGACCTATCCGGCCCCTAGCGCGACACCCAGAACTTAATAATCAGGTAGCGGGGACCACGGAATTTCCCAGGTACCAGTACTGTCCTCGGCAGGATAACGCTCCCATTGTCGCCTGATTCACTTCT',
    'TCTGGGCTCTTTGACAGGTGCAGCATCCGGGTCCCACTTCTGATTACGTGTGAAAAAATGCGTATTTCGTTTACCTGCCATTGAGAAGTCCCAGCTACGTTGGAGACGTATGCGTAGTCGGGTTCGTGAACGTACACGCGTCAAAGAAGTTGTAGC',
    'CTATCTTTCGTGAAAAACGCGGCTCTCCGGGACCTAATAGTCACGGGTATGTATTAAACTAATTCTGGGGTATTTAACCCTCGTACCGTTCGGAAGATAGTCCTGTGGGCATAGGATACCGAGTTACACCTGTGTGCTTGCTGTGCTGGGAAGGCT',
    'CCCGCATTGTATGAAAGCGTTTGGCTCGAGCGGAACCCAGCTTCGTTGTGCAAGCGGCGGACGTGTACAGAGGCGCGGGGCGCTTGCTGACCATCAAATACAATCGGACTCCGGCACCTAAGACCTCCGCGTTCATAGGGGTTATGCGTTCCGCCC',
    'CTCAAAGCTACGCTTGGGCAACTTAAGAATCCGTCACTCCGGAACCAAAACATGCGGCTTCTGAGATTCCTACGATACATCTTTACTAATACCGGAAATAGATCGGCTCTGTTCTCCTATTAAAGTGGCTGTGAGCTAATCCTGGCGTCAGGGATC',
    'GTCAAACCCGGATTAACGCATCGGAGTGTCCCAATGTAAACTTAGACTCGTCAACCACGAAGTCAATATTTTGAGAACATAGCAGGCAAAGACACGGTAATTCTCGTCCGCTGTTACAGTTATTTATTCCTTGTCCGGTGCCTAAAAAAGCGATTT',
    'CTTGCGATCCTCATATTTCGGTATTCAGAGTTGACCGACTTGGTCGCATACGGGGCCAGGGGCCTAACTTCTTCGTAGAGTATCTTCCGGTGCCGATCGACTGACAATTCGGTCTTCTGCTTCAATTGCGTACGCCAATATTTGGTCTGGTTTGCA',
    'TCACCGATTTCCTATATGCGAGTCTGCGGTAACGCAAATAGCAAATATACTGATGGTTACTTCCGGAACCAATGGGAGGTAAACGAGGCAAGATCTACATTTCTGTGCACATCGGCCACGTTTGCGATGCCGCACCCCATCTCCTGGCCATGGTAT',
    'CTCCGGCGCCAATGATAAGTCTAATCAACTAGTAATGATCTGCAATCGTTGCACTCCTCAACAGTCTTATCAGCTGTCGAGGGCTCTGCAGAACCTGTCCACCTGTACCCTGCATGGGATGCGTGGGGTCTCCTACAACTTCCGTGCCATGGAAAT',
    'GGCAGATCGCCCTGGGAGTCACAAAAGCAACAGGACAACATCCGCTAGTATGGCAAGAACAGAATGCATTTTTAAAGTTTCAAGGAATATAACTAGTATGCATGGTCGGACGTACGCGTATCCTCGCGTTCCGTCCGGATCCAACGTATTAAAAAA',
    'TCCGCCGTCGGCTGTCTATGCATTTGACCATACTCAGTCCGGTCCCTACTGTGGCACAATGCCACGAGCTTGATCAGAAAATAGCTCGCTCGGTCTGACAACGATTGATGCATGAACGAGGTCCTCATACTCTCGTCGAACGGAGGTCGGATCCGA',
    'CTTCATCGGCTATTAGACTAGTATGCGCAAATCGTGTCGTCATAGACACGCCGGTACTTGAACCGTAGGTAGGGGTATCTGAGGCTACAGCAGTGTGGTATGCGTAATTTCCGGTTCCGAAGAAACTCGAAGCATTTAAGGCCTCAATTTTATGTG',
    'CGTCAATTTACGGTCGCCTCGGGCTTACAAGCGAAGACTCGGGTGTGCGCCCGAATACACCTTACTTCTATGCTATGAATATCCCTCCGGGGCCAAGGGGCTAAGCCAAGCCCTAGATAGTCGGGACTGATCAATAGCCAGGTTACAGTTGATGGA',
]

print('\n'.join(pseudo_greedy_motif_search(dna, k, t)))

CTCCGGAACCTA
GTCCGGGGCCCA
TTCCGGCTCCGA
CTCCGGAACCCA
GTCCGGTACCAA
TTCCGGAGCCGA
TTCCGGAACCGA
TTCCGGTGCCTA
GTCCGGGGCCTA
TTCCGGACCCAA
CTCCGGGACCTA
GTCCGGTACCCA
ATCCGGCCCCTA
ATCCGGGTCCCA
CTCCGGGACCTA
CTCCGGCACCTA
CTCCGGAACCAA
GTCCGGTGCCTA
TTCCGGTGCCGA
TTCCGGAACCAA
CTCCGGCGCCAA
GTCCGGATCCAA
GTCCGGTCCCTA
TTCCGGTTCCGA
CTCCGGGGCCAA


In [22]:
def PatternStringDistance(pattern, dna):
    # Sum, over all strings, of the minimum Hamming distance between
    # pattern and any k-mer of that string.
    k = len(pattern)
    distance = 0
    for seq in dna:
        hd = float('inf')
        for i in range(len(seq) - k + 1):
            hdCurr = hamming_distance(pattern, seq[i:i+k])
            if hd > hdCurr:
                hd = hdCurr
        distance += hd
    return distance

In [24]:
pattern = "AAA" 
dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]

PatternStringDistance(pattern, dna)

5

In [25]:
pattern = 'CCCTACA'
dna = [
    'CCCATTCGGCATGGAAGGAGCATTCGCAATCTGCTTGCCAACGGTGTCAAGACGCCGCAGAAGCGGTCATAGGTGCAGCGGCTTCACCTG',
    'CTTAGTTTTAAGGGTAACCGACACTGCTTGTATGCCCGGACATAAGCGTCAAAGGCTTTAACCGGATCAGTTATTCAATAAATCCTTTGC',
    'GAAAGGGACGGGGCCGTAGCTCAACGTGCGAGCCTCGAGACAGTTGGAATCAATTCTGGGGACTGTTGTGCCTGAGTACCCGCCTGCGTT',
    'CCCCATTCTCACTGAGCTCCTATATCGTAGGACCGTCATCCTAACTATAGCTCCAACCCCTCGACTTTAGCTGTTCCTCAGGTGAACACC',
    'AGATCCCGACGGAAACGGGAAACTAGGTGTATGGAGACTCCATAATGTGCAAGATCTGACATACCTTCCTAAACAGTCTATGCTAGGTGG',
    'AAGAGCAGCTCACTAAACCCCTCAAAGGTGCAACCGAAAACAGTGGAAATCGGACTCTATGTTCCGTCCATAACCAGCTCTTTGTATTGC',
    'TCATTATGACGTATGCCTGATAGGCCGTGTAAAGCAAGAGCACCTAGGGGTATCGAGCTACATGAACGTAAGTGAAGGATTCCCTAGACG',
    'AACCTAAGGGTGGCAGCGCAAATGCCTTCTAAGATCATCAAGGAGGAAGAATTCTCAGTTCCGAATCCAACCTCGAGCGTCGTGCACCAG',
    'ATATACGTCCTGAGTAGACCAGTGTGTGCGCCGAATAGCTTCCGTGGTGAGTTAAATGCTGCTCCTCCTTCGGTGGCATCCCCGCACAGT',
    'CAAAGACCGAGGAGCGGAGGGAAAAGGTCACTATAGCAAGAAAGAGCTCCGCCGAGCTGCGAGCGATTACCGTTCCCTGAAAACTCCGTG',
    'CTACAAGCTACTATGCCGCTCGCTAAACGGAACTGCCCGGTAGTGAGAGTATTACAGAACATCGTTTTAATAGACGTGTTATCTGCAGAG',
    'CAGTTTTTCCGTATTCCCGCACCTTGGTTAAGAAACAATCGTGGCGTGCCCAGGTAAAAGTACAATGAGAATCTTTGTTTAGTAGTACGA',
    'TTTAAGTGTTGTTTGTCGGAAATCCGCTAGGAGGTGTTTCCGAGCACATACTTCGGTGCACTTTGATACTTAATGAGGATGTTAGATTCA',
    'TTAGACTGCCGGCTGGCCCTGTCCAGCGGACATGTACACGAGTCATGAGCTGGGTAAGTTCGGACGAACAAGGCCACAACATACGGTTCT',
    'TTGACCCCCGCCAGCAAATTAATTGGTCCGGTTATTTGGCCGGGCACACTGGGAGGAGGGATTGTGGATAGCGTATTTTCTGCGACGAAT',
    'GTTCAACCCTAGGTGAAAGAGTTGAAACTTCCGTGAGAAGTGCACCTGGATCATCCCATCTTGACGGTTGTAGGTAATTACGGTTGGTTT',
    'TCGCAACGGTTATTCCGCTAAACGCCCAATATGACACCTATTGCATCTTACGTAGCGACATGACTGAACGCCGGCCTATTCGGATCGAAG',
    'GGAGCTGCACTGTATGGCTAGAAGCCACATGGAATCTGCGCTGCTCTCTAGCACATGTTTCTGGTTACTCTCTAGTTCACACAATTATCA',
    'GCCTTCGGGACAGCATATCCATCACGATGAGAGGCGTAATGAATGGGCTCTGTTAGCTTGGACATCGTACTTCCCATCGTATGTGTTGTA',
    'TGGCCGGCCGACTTTCCAGACGGAGCTGAGTTCGAGGACTGTCGCCTCCCGACGTCATCGTCTTAGTAGGGCGTCTTGACAGGACACCCA',
    'TAGGTATGTCCTGTTTTCTATATAAGTTTACCCTGGAGCACGCCGTACGTCTGTAGTCTTAACATACTTCGGTCTTTGTTGCTGTTAAGG',
    'ACTGCGAGTTATCACACTTCCTGATGTCAGTTATCTATTGATAACTTGATTACATCGCATTCCATTGATTTAAGTACTTTAATGCGCGGA',
    'CCCGTCGGCTTGTTCCGTTCCTAGCGACGTCGGGATAGAATCCGAATCATCCTGCAGGGTAGCCTACTTCTTTTTGCGACTACGCCCCAG',
    'TGCTTTAAATGTCGTCCGTTTCACCGCGATAAACGTAGCGACGCCCCCGAAACTGGCGTATGAGAAGTTATAGGCGTCGACGCGCGAGAG',
    'CACATATCAATATCGATAATGCTAGTGGTTGGCCGGCCCCGAATTCTCATGCACTCAAGTACGCTGAATGCGCAGAGGTTGAACGCTCTC',
    'TATGCTCGATCCAAGATTGTTCACTACCGTGGACTGGGCTCACGCGGCATGATAGGGCGCTCTGAGATGCTCCGCTTAACGTTGCTTTGA',
    'TGTGATATATAAGGTGGATCCGCATACGCCGAGTATATAGTTGTGTCCGCCTATGGAAAACACTCAGCCTAGCTCGGCTGTAGGCACGCC',
    'CTATTCGAAGACAAGAATTCCTCAAGGCTCGAATCGCCCAGACCGTACCCGAGCCGAAACATAATCGACATAAAAATTAATCGGCGCCGT',
    'CAAATCAAACGCTTGTTACTGATGCCCTACGCCTTCGAGGGGATGAATATTATAACGACTCTCCAGAGTAGGGAGTCCTAACCTCTCACA',
    'CTGTTCTCAAGACCTGATACCGCCATACTTGCGATCCCAAACCCAAAACCCGTCTGTTGCCTGTATGTTACTGATGATGTGAATCGTCTC',
    'GTTGTATCGAAAGCCAAGCTGCGGCTGCCTGCTGTAGACAGTAAACCTATCGGAACAATGGCTCGCGTGTTGGTGACCCAGCTCTGATTG',
]

In [26]:
PatternStringDistance(pattern, dna)

71