# Chapter 2


**Brute force algorithm for motif finding**  

Given a collection of strings Dna and an integer d, a k-mer is a (k, d)-motif if it appears in every string from Dna with at most d mismatches. 

**Implanted motif problem**  
Find all (k, d)-motifs in a collection of strings.  
Input: A collection of strings Dna, and integers k and d.  
Output: All (k,d)-motifs in Dna.
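
To make the definition concrete, here is a minimal sketch of a membership test for a single candidate k-mer. The helper names `hamming` and `is_kd_motif` are hypothetical and are not part of the notebook. The example strings are the ones used in the test run further down.

```python
def hamming(a, b):
    # number of positions at which two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

def is_kd_motif(kmer, dna, d):
    # True if kmer occurs in every string of dna with at most d mismatches
    k = len(kmer)
    return all(
        any(hamming(kmer, text[i:i + k]) <= d for i in range(len(text) - k + 1))
        for text in dna
    )

dna = ['TATCGA', 'ATGCA', 'ACGGT']
print(is_kd_motif('ATG', dna, 1))  # True: ATC, ATG and ACG are each within 1 mismatch
print(is_kd_motif('ATC', dna, 1))  # False: no 3-mer of ACGGT is within 1 mismatch
```

Note that ATG is a (3, 1)-motif even though it never occurs exactly in the first string.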

In [14]:
# from ch 1
def HammingDist(patternA, patternB):
    "Return the number of mismatches (Hamming distance) between two strings of equal length."
    # start the mismatch counter at 0
    number_mismatch = 0
    assert len(patternA) == len(patternB), 'please input strings of equal length'
    for i in range(len(patternA)):
        if patternA[i] != patternB[i]:
            number_mismatch += 1
    return number_mismatch

In [11]:
def ApproxPattMatch(Pattern, Text, d):
    '''
    - This function returns the starting positions of a kmer
    of interest within a DNA string. The kmer may have up to
    d mismatches. Indices are 0 based.
    - Pattern: motif of interest
    - Text: DNA string 
    - d: tolerated number of mismatches
    '''
    # number of windows of length len(Pattern) that fit in Text
    overlap = len(Text) - len(Pattern) + 1
    matches = []
    # slide a window of length len(Pattern) across Text
    for i in range(overlap):
        # window of Text starting at position i
        subset_Text = Text[i:i + len(Pattern)]

        # record i if the window differs from Pattern in at most d positions
        if HammingDist(Pattern, subset_Text) <= d:
            matches.append(i)
        
    return matches

In [25]:
def MotifEnumerate(Dna, k, d):
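    '''
    Exploratory step: for each k-mer of the first DNA string, print the
    0-based positions where it occurs with at most d mismatches in every
    string of Dna.
    Dna: a list of DNA strings
    k: length of the motif (k-mer)
    d: tolerated number of mismatches
    '''
    # Every (k, d)-motif lies within d mismatches of some k-mer of the
    # first string, so the first string is enough to seed the search.
    # Only the exact k-mers are checked here, not their d-mismatch
    # variants, and no set of motifs is returned yet.
    Dna_string1 = Dna[0]
    overlap = len(Dna_string1) - k + 1
    # slide a window of length k across the first string
    for i in range(overlap):
        kmer = Dna_string1[i:i + k]
        # report where this k-mer approximately matches each string
        for Dna_other in Dna:
            print(kmer, ": ", Dna_other, ApproxPattMatch(Pattern = kmer, Text = Dna_other, d = d))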


In [21]:
DNA = ['TATCGA', 'ATGCA', 'ACGGT']

In [26]:
MotifEnumerate(DNA, 3, 1)

TAT :  TATCGA [0]
TAT :  ATGCA []
TAT :  ACGGT []
ATC :  TATCGA [1]
ATC :  ATGCA [0]
ATC :  ACGGT []
TCG :  TATCGA [2]
TCG :  ATGCA []
TCG :  ACGGT [0]
CGA :  TATCGA [3]
CGA :  ATGCA []
CGA :  ACGGT [1]
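
The printed table shows that none of the exact 3-mers of the first string occurs in all three strings with at most one mismatch. A (k, d)-motif does not have to occur exactly anywhere, though, so a full brute-force search also has to try every d-mismatch variant of each k-mer. The following is a minimal, self-contained sketch of that idea. The names `hamming`, `appears`, `neighbors` and `motif_enumeration` are hypothetical, and the neighbor generation simply filters all 4^k strings, which is only reasonable for small k.

```python
from itertools import product

def hamming(a, b):
    # number of positions at which two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

def appears(pattern, text, d):
    # True if some window of text is within d mismatches of pattern
    k = len(pattern)
    return any(hamming(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))

def neighbors(kmer, d):
    # all strings over ACGT with the same length and Hamming distance <= d
    # (brute force over all 4^k candidates; an assumption made for brevity)
    candidates = ("".join(p) for p in product("ACGT", repeat=len(kmer)))
    return {c for c in candidates if hamming(kmer, c) <= d}

def motif_enumeration(dna, k, d):
    # seed with k-mers of the first string: any (k, d)-motif must be
    # within d mismatches of one of them
    patterns = set()
    first = dna[0]
    for i in range(len(first) - k + 1):
        for candidate in neighbors(first[i:i + k], d):
            if all(appears(candidate, text, d) for text in dna):
                patterns.add(candidate)
    return patterns

DNA = ['TATCGA', 'ATGCA', 'ACGGT']
print(sorted(motif_enumeration(DNA, 3, 1)))
# ['ACG', 'ATG', 'CGC', 'GCG', 'GGA', 'TGG', 'TGT']
```

Collecting the results in a set removes duplicates when the same candidate is reached from more than one seed k-mer.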
