## Day 5 Lab - Randomized Motif Finding

Yesterday we wrote two brute-force algorithms to find the best consensus sequence for a set of DNA sequences. These algorithms gave the correct answer but were too slow to be useful in practice.

Now we are going to use a Monte Carlo randomized algorithm, which quickly gives an approximation of the correct answer. Because each run is fast, we can run it many times and keep the best approximation.

#### (1) Import your relevant functions from the last several days here.

In [1]:
import numpy as np
from Day2_Lab import *

#### (2) Write a new `ProfilePseudocounts` function that will calculate the profile for provided motifs using pseudocounts.

A call of `ProfilePseudocounts(['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT'])` would yield an output of `np.array([[0.55555556, 0.11111111, 0.11111111, 0.22222222], [0.11111111, 0.44444444, 0.22222222, 0.11111111], [0.22222222, 0.22222222, 0.55555556, 0.11111111], [0.11111111, 0.22222222, 0.11111111, 0.55555556]])`

In [3]:
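def countPseudo(dna):
    # Add a pseudocount of 1 to every entry of the 4 x k count matrix
    return count(dna) + 1
x = ['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']
print(countPseudo(x))

def profilePseudo(dna):
    # Divide each column by its total (number of motifs + 4 pseudocounts)
    counts = countPseudo(dna)
    return counts / counts.sum(axis=0)
print(profilePseudo(x))

[[5. 1. 1. 2.]
 [1. 4. 2. 1.]
 [2. 2. 5. 1.]
 [1. 2. 1. 5.]]
[[0.55555556 0.11111111 0.11111111 0.22222222]
 [0.11111111 0.44444444 0.22222222 0.11111111]
 [0.22222222 0.22222222 0.55555556 0.11111111]
 [0.11111111 0.22222222 0.11111111 0.55555556]]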


#### (3) Write a function that will generate a consensus sequence for provided motifs.
A call of `Consensus(['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT'])` would return `'ACGT'`.

If two or more nucleotides are most common at a position, pick one to move forward with.

In [7]:
def consensus(seqList):
    pro = profilePseudo(seqList)
    k = pro.shape[1]  # number of motif positions (columns of the profile)
    nucleotides = "ACGT"  # row order of the profile
    final = ""
    for i in range(k):
        # argmax returns the first maximum, so ties go to the earlier row
        final += nucleotides[pro[:, i].argmax()]
    return final
print(consensus(x))

ACGT


#### (4) Write a scoring function. You decide which score to use: column score, row score, or entropy score. Note whether a higher or lower score is better for your function; this distinction will matter when you implement your randomized algorithm.
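
As one illustration, here is a minimal sketch of a column-style score. It assumes the score counts, for each column, how many motifs disagree with that column's most common nucleotide, so a lower score is better. The name `Score` is a placeholder, and your course's definition of the column score may differ.

```python
from collections import Counter

def Score(motifs):
    """Count the nucleotides that differ from the most common one in each column.

    Lower is better: a score of 0 means every motif is identical.
    """
    total = 0
    for column in zip(*motifs):  # zip(*motifs) walks the motifs column by column
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total

print(Score(['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']))  # → 5
```

If you pick a score where higher is better instead, every comparison in the randomized search has to be flipped to match.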

#### (5) Write a `KmerProbability` function that will calculate the probability of a provided k-mer given a specific profile: `KmerProbability(Profile, kmer)`.

A call of `KmerProbability(np.array([[0.2, 0.2, 0.3],[0.4, 0.3, 0.1],[0.3, 0.3, 0.5],[0.1, 0.2, 0.1]]), 'CGG')` would return `0.06`. 
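
The probability of a k-mer is the product of the profile entries selected by its nucleotides, one per position. The sketch below assumes the profile rows are ordered A, C, G, T, which is consistent with the example above (`'CGG'` gives 0.4 × 0.3 × 0.5 = 0.06). The lookup table `NUCLEOTIDE_ROW` is a helper introduced here for illustration.

```python
import numpy as np

# Assumed row order of the profile matrix: A, C, G, T
NUCLEOTIDE_ROW = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def KmerProbability(profile, kmer):
    """Multiply the profile entry for each nucleotide at its position."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[NUCLEOTIDE_ROW[nucleotide], position]
    return probability

profile = np.array([[0.2, 0.2, 0.3],
                    [0.4, 0.3, 0.1],
                    [0.3, 0.3, 0.5],
                    [0.1, 0.2, 0.1]])
print(KmerProbability(profile, 'CGG'))
```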

#### (6) Write a function that will find the most probable k-mer in a single DNA sequence: `KmerMostProbable(Profile, DnaSeq, k)`.

A call of `KmerMostProbable(np.array([[0.2, 0.2, 0.3],[0.4, 0.3, 0.1],[0.3, 0.3, 0.5],[0.1, 0.2, 0.1]]), 'CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA', 3)` would return `'CGG'`.

If multiple most probable k-mers exist, pick the first to move forward with.
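
A sliding-window scan is one way to do this: score every k-mer in the sequence and keep the best one. The sketch below assumes A, C, G, T row order and redefines `KmerProbability` so that it runs on its own. Using a strict `>` comparison means the first of several tied k-mers is kept.

```python
import numpy as np

ROW = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # assumed profile row order

def KmerProbability(profile, kmer):
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[ROW[nucleotide], position]
    return probability

def KmerMostProbable(profile, dna_seq, k):
    """Return the first k-mer in dna_seq with the highest probability."""
    best_kmer, best_probability = None, -1.0
    for start in range(len(dna_seq) - k + 1):
        kmer = dna_seq[start:start + k]
        probability = KmerProbability(profile, kmer)
        if probability > best_probability:  # strict '>' keeps the first tie
            best_kmer, best_probability = kmer, probability
    return best_kmer

profile = np.array([[0.2, 0.2, 0.3],
                    [0.4, 0.3, 0.1],
                    [0.3, 0.3, 0.5],
                    [0.1, 0.2, 0.1]])
print(KmerMostProbable(profile, 'CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA', 3))  # → CGG
```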

#### (7) Write your `Motifs(profile, Dna, k)` function.
This function will return the list of k-mer motifs, one from each DNA sequence, that are most probable given your profile.

A call of `Motifs(np.array([[0.4, 0.2, 0.2,0.2],[0.2, 0.4, 0.2,0.2],[0.2, 0.2, 0.4,0.2],[0.2, 0.2, 0.2,0.4]]), ['TTACCTTAAC','GATGTCTGTC','CCGGCGTTAG','CACTAACGAG','CGTCAGAGGT'], 4)` would yield an output of `['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']`.
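
One way to build this is to apply the most-probable-k-mer search to each sequence in turn. The sketch below is self-contained, so it carries a compact version of that search inline. It assumes A, C, G, T row order, and the helper name `_most_probable` is introduced here only for illustration.

```python
import numpy as np

ROW = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # assumed profile row order

def _most_probable(profile, seq, k):
    # Scan every k-mer; strict '>' keeps the first one when several tie.
    best_kmer, best_probability = None, -1.0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        probability = 1.0
        for position, nucleotide in enumerate(kmer):
            probability *= profile[ROW[nucleotide], position]
        if probability > best_probability:
            best_kmer, best_probability = kmer, probability
    return best_kmer

def Motifs(profile, dna, k):
    """Return the most probable k-mer from each sequence in dna."""
    return [_most_probable(profile, seq, k) for seq in dna]

profile = np.array([[0.4, 0.2, 0.2, 0.2],
                    [0.2, 0.4, 0.2, 0.2],
                    [0.2, 0.2, 0.4, 0.2],
                    [0.2, 0.2, 0.2, 0.4]])
dna = ['TTACCTTAAC', 'GATGTCTGTC', 'CCGGCGTTAG', 'CACTAACGAG', 'CGTCAGAGGT']
print(Motifs(profile, dna, 4))  # → ['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']
```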

#### (8) Write your `RandomizedMotifSearch` algorithm.
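
The sketch below shows one common form of the algorithm. It starts from a random k-mer in each sequence, then alternates between building a pseudocount profile and picking the most probable k-mers, and it stops as soon as the score no longer improves. The details are assumptions rather than a required design: the compact helpers, the column-style score (lower is better), A, C, G, T row order, and the optional `rng` parameter for reproducible runs.

```python
import random
import numpy as np

ROW = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # assumed profile row order

def ProfilePseudocounts(motifs):
    # Start every count at 1 (the pseudocount), then normalize each column.
    counts = np.ones((4, len(motifs[0])))
    for motif in motifs:
        for position, nucleotide in enumerate(motif):
            counts[ROW[nucleotide], position] += 1
    return counts / counts.sum(axis=0)

def KmerMostProbable(profile, seq, k):
    # Strict '>' keeps the first k-mer when several tie.
    best_kmer, best_prob = seq[:k], -1.0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        prob = np.prod([profile[ROW[n], i] for i, n in enumerate(kmer)])
        if prob > best_prob:
            best_kmer, best_prob = kmer, prob
    return best_kmer

def Score(motifs):
    # Column score: mismatches against each column's most common nucleotide.
    return sum(len(col) - max(col.count(n) for n in 'ACGT')
               for col in zip(*motifs))

def RandomizedMotifSearch(dna, k, rng=random):
    # Pick one random k-mer from every sequence as the starting motifs.
    motifs = []
    for seq in dna:
        start = rng.randrange(len(seq) - k + 1)
        motifs.append(seq[start:start + k])
    best = motifs
    while True:
        profile = ProfilePseudocounts(motifs)
        motifs = [KmerMostProbable(profile, seq, k) for seq in dna]
        if Score(motifs) < Score(best):
            best = motifs  # improvement: keep iterating
        else:
            return best    # no improvement: stop

dna = ['TTACCTTAAC', 'GATGTCTGTC', 'CCGGCGTTAG', 'CACTAACGAG', 'CGTCAGAGGT']
print(RandomizedMotifSearch(dna, 4, rng=random.Random(0)))
```

Because the starting motifs are random, a single run can stop at a poor local optimum, and different seeds can give different answers.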

#### (9) Our `RandomizedMotifSearch` is a Monte Carlo algorithm, so we should execute it *many* times to get the best approximation possible. Write a driver function that will call `RandomizedMotifSearch` `N` times. Your function should keep track of the best approximation.
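
A driver of this kind can be sketched as a loop that remembers the best result seen so far. To keep the snippet runnable on its own, this version takes the search and scoring functions as arguments. That choice, the name `RepeatedRandomizedMotifSearch`, and the lower-is-better comparison are all assumptions for illustration.

```python
def RepeatedRandomizedMotifSearch(search, score, dna, k, N):
    """Call search(dna, k) N times and return the best motifs with their score.

    Assumes a lower score is better; flip the comparison otherwise.
    """
    best_motifs = search(dna, k)
    best_score = score(best_motifs)
    for _ in range(N - 1):
        motifs = search(dna, k)
        current_score = score(motifs)
        if current_score < best_score:
            best_motifs, best_score = motifs, current_score
    return best_motifs, best_score

# Hypothetical usage, once your own functions exist:
# best, best_score = RepeatedRandomizedMotifSearch(
#     RandomizedMotifSearch, Score, dna, 15, 1000)
```

Returning the score alongside the motifs makes it easy to compare the results of drivers run with different values of `N`.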

In [1]:
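import re
# Read the whole dataset into one string; the file is closed automatically
with open("subtle_motif_dataset.txt", 'r') as f:
    motif = f.read()
# Capture every 15-character run enclosed by `*` markers
motifs = re.findall(r"(?<=[*]).{15}(?=[*])", motif)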

#### (10) Download the `subtle_motif_dataset.txt` from Moodle. This dataset has 10 DNA strings of 600 nucleotides each. The dataset contains implanted variations of the motif **AAAAAAAAGGGGGGG**, which are currently marked with `*`.