# Day 4 Lab - Motif Finding
The goal of today's lab is to write some code to find and score motifs.

***
## Step 1: Count, Profile and Entropy

#### (1) Write two functions, `Count` and `Profile`, that take a collection of DNA strings and return the count matrix and the profile matrix, respectively.
If you prefer a single function, use a flag parameter in its definition to indicate whether the count matrix or the profile matrix should be returned.

Use numpy arrays to store your profile (or count) matrix.


In [11]:
import numpy as np
import math
from Day2_Lab import *

In [2]:
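def count(dna):
    # Rows are ordered A, C, G, T; columns are positions in the motifs.
    l = len(dna[0])
    matrix = np.zeros((4, l))
    for i in range(l):
        counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
        for seq in dna:
            counts[seq[i]] += 1
        for j, key in enumerate(counts):
            matrix[j, i] = counts[key]
    return matrix

def profile(dna):
    # Divide by the number of sequences (not a fixed 4) so each column sums to 1.
    return count(dna) / len(dna)

x = ["ATA", "ATT", "GTT", "TTT"]
print(count(x))
print(profile(x))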

[[2. 0. 1.]
 [0. 0. 0.]
 [1. 0. 0.]
 [1. 4. 3.]]
[[0.5  0.   0.25]
 [0.   0.   0.  ]
 [0.25 0.   0.  ]
 [0.25 1.   0.75]]
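
The prompt also mentions a flag parameter. A minimal sketch of that variant follows, assuming a hypothetical function name `count_or_profile` and a hypothetical `as_profile` flag (neither appears in the lab). It is one possible layout, not the required solution.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # row order of the matrix (an assumption of this sketch)

def count_or_profile(dna, as_profile=False):
    """Return the 4 x k count matrix, or the profile matrix if as_profile is True."""
    k = len(dna[0])
    matrix = np.zeros((4, k))
    for seq in dna:
        for i, base in enumerate(seq):
            matrix[NUCLEOTIDES.index(base), i] += 1
    if as_profile:
        # Normalise by the number of sequences so every column sums to 1.
        matrix = matrix / len(dna)
    return matrix

motifs = ["ATA", "ATT", "GTT", "TTT"]
print(count_or_profile(motifs))
print(count_or_profile(motifs, as_profile=True))
```

Dividing by `len(dna)` rather than a fixed constant keeps the profile valid for any number of input strings.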


#### (2) Write a function that will calculate the entropy for each column of your profile.

A call of `Entropy(profile)` would yield a 1D array of entropy values.

In [3]:
def entropy(profile):
    col = len(profile[0])
    result = np.zeros(col)  # 1D array, one entropy value per column
    for i in range(col):
        total = 0
        for j in range(4):
            p = profile[j, i]
            if p > 0:  # skip zeros, since log2(0) is undefined
                total += p * math.log2(p)
        # Negate the sum, but avoid storing -0.0 for fully conserved columns.
        result[i] = -total if total != 0 else 0
    return result
print(entropy(profile(x)))

[1.5        0.         0.81127812]
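
The entropy of a column is $H = -\sum_{b} p_b \log_2 p_b$ over the four nucleotides, with terms where $p_b = 0$ contributing nothing. As a cross-check, the same quantity can be computed without explicit loops. The sketch below is an alternative formulation, with `column_entropy` as a hypothetical name; it returns a 1D array as the prompt describes.

```python
import numpy as np

def column_entropy(profile):
    """Shannon entropy (in bits) of each column of a 4 x k profile matrix."""
    profile = np.asarray(profile, dtype=float)
    # Substitute 1 for zeros inside the log so that 0 * log2(0) counts as 0.
    safe = np.where(profile > 0, profile, 1.0)
    # Adding 0.0 turns a possible -0.0 into 0.0 for fully conserved columns.
    return -(profile * np.log2(safe)).sum(axis=0) + 0.0

example = [[0.5, 0.0, 0.25],
           [0.0, 0.0, 0.0],
           [0.25, 0.0, 0.0],
           [0.25, 1.0, 0.75]]
print(column_entropy(example))
```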


## Step 2: A Brute Force Algorithm: 
The goal of this step is to implement a brute force solution that looks through all possible motifs to come up with a consensus motif:

$$Motifs \rightarrow Consensus(Motifs)$$

#### (3) Write a brute force algorithm to find Motifs:

`MotifEnumeration(["ATTTGGC","TGCCTTA","CGGTATC", "GAAAATT"], 3, 1)` would yield a result of `["ATA", "ATT","GTT","TTT"]`

All patterns found by this algorithm represent what is called a (k,d)-motif. Each pattern found exists in every string of DNA with at most d mismatches. These are our possible consensus motifs.

In [4]:
def get_kmers(dna, k):
    # All distinct k-mers of a single DNA string.
    return list(set(dna[i:i+k] for i in range(len(dna) - k + 1)))

def MotifEnumeration(Dna, k, d):
    # Brute force algorithm for motif finding.
    # Given a collection of strings Dna and integers k and d,
    # a k-mer is a (k,d)-motif if it appears in every string from
    # Dna with at most d mismatches.
    patterns = set()
    for pattern in get_kmers(Dna[0], k):
        for pat in neighbors(pattern, d):
            match_all = True
            for dna in Dna[1:]:  # Check each remaining string of DNA
                # pat matches if any of its d-neighbors occurs in this string
                match_all = match_all and any(neighbor in dna for neighbor in neighbors(pat, d))
            if match_all:  # pat occurs in every string with at most d mismatches
                patterns.add(pat)
    return patterns

x = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(MotifEnumeration(x, 3, 1))

{'ATA', 'GTT', 'TTT', 'ATT'}
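
`MotifEnumeration` depends on `neighbors`, which is imported from `Day2_Lab` and is not shown in this notebook. The sketch below is one possible implementation, assuming `neighbors(pattern, d)` returns the set of all strings within Hamming distance `d` of `pattern`; the actual Day 2 code may differ. The names `hamming_distance_sketch` and `neighbors_sketch` are hypothetical.

```python
def hamming_distance_sketch(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def neighbors_sketch(pattern, d):
    """Set of all strings within Hamming distance d of pattern (recursive construction)."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    result = set()
    for suffix in neighbors_sketch(pattern[1:], d):
        if hamming_distance_sketch(pattern[1:], suffix) < d:
            # Mismatch budget left over: the first symbol may be anything.
            result.update(base + suffix for base in "ACGT")
        else:
            # Budget used up by the suffix: the first symbol must match.
            result.add(pattern[0] + suffix)
    return result

print(sorted(neighbors_sketch("ATT", 1)))
```

For a 3-mer with `d = 1` this yields 10 strings: the pattern itself plus 3 substitutions at each of the 3 positions.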


#### (4) Write a function to compute $d(Pattern, Dna) = \sum_{i=1}^{t}HammingDistance(Pattern,Dna_i)$

`DistanceBetweenPatternAndStrings("AAA",['TTACCTTAAC', 'GATATCTGTC', 'ACGGCGTTCG', 'CCCTAAAGAG', 'CGTCAGAGGT'])` will yield an output of `5`

In [5]:
def distBetweenPatternAndStrings(pattern, dnaList):
    k = len(pattern)
    dist = 0
    for dna in dnaList:
        # Smallest Hamming distance between pattern and any k-mer of dna
        hDist = math.inf
        for kmer in get_kmers(dna, k):
            hDist = min(hDist, hammingDist(pattern, kmer))
        dist += hDist
    return dist
y = ['TTACCTTAAC', 'GATATCTGTC', 'ACGGCGTTCG', 'CCCTAAAGAG', 'CGTCAGAGGT']
print(distBetweenPatternAndStrings("AAA", y))

5
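
To see where the total of 5 comes from, the sketch below prints the minimum Hamming distance for each string separately. It is self-contained and uses a hypothetical helper `min_distance` with an inline Hamming distance, rather than `hammingDist` from `Day2_Lab`.

```python
def min_distance(pattern, text):
    """Smallest Hamming distance between pattern and any k-mer of text."""
    k = len(pattern)
    return min(
        sum(a != b for a, b in zip(pattern, text[i:i + k]))
        for i in range(len(text) - k + 1)
    )

strings = ['TTACCTTAAC', 'GATATCTGTC', 'ACGGCGTTCG', 'CCCTAAAGAG', 'CGTCAGAGGT']
per_string = [min_distance("AAA", s) for s in strings]
print(per_string)       # [1, 1, 2, 0, 1]
print(sum(per_string))  # 5
```

Only the fourth string contains `AAA` exactly, so it contributes 0; the third string has no k-mer with more than one `A`, so it contributes 2.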


#### (5) Use the function from part 4 to score all consensus motifs found using `MotifEnumeration(["ATTTGGC","TGCCTTA","CGGTATC", "GAAAATT"], 3, 1)`. Which consensus motif had the best score?

In [6]:
def score(motifList, dnaList):
    # Map each candidate motif to its total distance; lower is better.
    scores = {}
    for motif in motifList:
        scores[motif] = distBetweenPatternAndStrings(motif, dnaList)
    return scores
print(score(MotifEnumeration(x,3,1),x))

{'ATA': 4, 'GTT': 4, 'TTT': 3, 'ATT': 2}


## Step 3: Median Strings Algorithm:
This is a variation of a brute force algorithm. Now, instead of looking through all possible Motifs, we are going to be looking through all possible consensus patterns to come up with a collection of motifs. This algorithm will return the consensus pattern with the lowest score.  

$$ Consensus(Motifs) \rightarrow Motifs$$

#### (6) Write a function called `AllStrings` that will generate a list of all possible $4^k$ k-mers.
Can you think of how to use the `Neighbors` function we defined yesterday to generate all k-mers from AA..AA to TT..TT?

In [7]:
def allString(k):
    # Every k-mer is within k mismatches of "AA..AA",
    # so its k-neighborhood contains all 4^k k-mers.
    string = "A" * k
    return list(neighbors(string, k))
print(allString(3))

['TCG', 'GTA', 'GAA', 'TGA', 'TGT', 'CGC', 'ACG', 'TAG', 'CTG', 'TTC', 'GCT', 'AAA', 'AAC', 'AGG', 'AGC', 'CAA', 'CCC', 'TGC', 'TAC', 'GAC', 'ATA', 'GGA', 'CCG', 'GGT', 'GCG', 'CGT', 'ACT', 'CAG', 'AAT', 'TCA', 'GAT', 'TAT', 'TTA', 'GGG', 'AGT', 'AAG', 'GAG', 'TAA', 'CGA', 'TGG', 'GTT', 'GCC', 'ATC', 'CAC', 'ATT', 'CCA', 'TCT', 'ACA', 'AGA', 'CTT', 'GGC', 'TTT', 'ATG', 'CTC', 'CAT', 'CTA', 'CGG', 'TCC', 'CCT', 'GTG', 'GCA', 'GTC', 'TTG', 'ACC']
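
The `neighbors`-based approach returns the k-mers in arbitrary set order, as the output above shows. If a deterministic order from AA..AA to TT..TT is wanted, one alternative (not the approach the question hints at) is `itertools.product` from the standard library. The name `all_strings_sorted` is hypothetical.

```python
from itertools import product

def all_strings_sorted(k):
    """All 4**k k-mers over ACGT in lexicographic order."""
    return ["".join(chars) for chars in product("ACGT", repeat=k)]

kmers = all_strings_sorted(3)
print(len(kmers), kmers[0], kmers[-1])  # 64 AAA TTT
```

A fixed order also makes tie-breaking between equally scored patterns reproducible from run to run.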


#### (7) Write a function that will solve the Median String Problem.
`MedianString(['AAATTGACGCAT','GACGACCACGTT','CGTCAGCGCCTG','GCTGAGCACCGG','AGTTCGGGACAG'],3)` will yield an output of `'GAC'`

In [8]:
def medianString(dna, k):
    dist = math.inf
    median = None
    for pattern in allString(k):
        # Compute the distance once per pattern and keep the best seen so far.
        current = distBetweenPatternAndStrings(pattern, dna)
        if dist > current:
            dist = current
            median = pattern
    return median
z=['AAATTGACGCAT','GACGACCACGTT','CGTCAGCGCCTG','GCTGAGCACCGG','AGTTCGGGACAG']
print(medianString(z,3))

GAC


***
## Step 4: Let's look at how long the two algorithms take side by side.

Download the `subtle_motif_dataset.txt` from Moodle. This dataset has 10 DNA strings of 600 nucleotides each. Inside of the dataset there is an implanted variation of the motif **AAAAAAAAGGGGGGG**. These motifs are currently marked with an `*`. 

#### (8) Read the file in and pull out all 15-mers marked with the `*`. Store them in a list. 

Regular expressions will be useful for this step.

In [10]:
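import re
# Read the whole file; the with-block closes it afterwards.
with open("subtle_motif_dataset.txt", 'r') as f:
    motif = f.read()
# Each implanted motif is the 15 characters that follow a '*'.
motifs = re.findall(r"(?<=[*]).{15}", motif)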
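
The lookbehind `(?<=[*])` matches a position immediately after a `*` without consuming it, so `.{15}` captures the 15 characters that follow. A small sketch on a made-up string (the sample text is invented for illustration and is not taken from the dataset):

```python
import re

# Invented sample: two 15-mers, each preceded by '*'.
first = "A" * 8 + "G" * 7
second = "AAAATAAAGGCGGGG"
sample = "ACGT*" + first + "TTCA*" + second + "CC"

found = re.findall(r"(?<=[*]).{15}", sample)
print(found)
```

Note that `.` does not match a newline by default, so a marked 15-mer that is split across two lines of the file would not be captured by this pattern.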

#### (9) Before we move further we should spend a few minutes cleaning our dataset. Use appropriate string methods to remove all occurrences of the `*` and make sure each DNA string is stored in a list.
At the end of this problem, you should have a list of length 10 containing strings of length 600. 

In [None]:
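def cleanDna(dnaList):
    # Assumes dnaList holds one raw DNA string per entry; drops '*' markers and stray whitespace.
    return [dna.replace("*", "").strip() for dna in dnaList if dna.strip()]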

#### (10) Call your brute force algorithm `MotifEnumeration` to look for motifs of length 15 (k = 15) with up to 4 mismatches (d = 4). Time how long it takes to run. 

You can use the `time` function from the `time` module.

Store the output of your algorithm into a variable.
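
A minimal timing sketch using `time.time`. The helper name `timed` is hypothetical, and `sum(range(1_000_000))` merely stands in for the call to `MotifEnumeration`:

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    return result, elapsed

# Placeholder workload; in the lab this would be the MotifEnumeration call.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, took {seconds:.4f} s")
```

Returning the result together with the elapsed time keeps the algorithm's output available in a variable, as the step asks. `time.perf_counter` is a higher-resolution alternative from the same module.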

#### (11) How long did it take for your brute force algorithm to run?

#### (12) Look at the output from the algorithm. Was it able to find the correct motifs?

#### (13) Now run your `MedianString` algorithm. Time how long it takes to run. 

#### (14) How did the time for the `MedianString` algorithm compare to the `MotifEnumeration` algorithm? 

#### (15) Was your `MedianString` able to find the motif `AAAAAAAAGGGGGGG`?