The daily schedules of animals, plants, and even bacteria are controlled by an internal timekeeper called the circadian clock. Anyone who has experienced the misery of jet lag knows that this clock never stops ticking. Rats and research volunteers alike, when placed in a bunker, naturally maintain a roughly 24-hour cycle of activity and rest in total darkness. And, like any timepiece, the circadian clock can malfunction, resulting in a genetic disease known as delayed sleep-phase syndrome (DSPS).

The circadian clock must have some basis on the molecular level, which presents many questions. How do individual cells in animals and plants (let alone bacteria) know what time it is? Is there a “clock gene”? Can we explain why heart attacks occur more often in the morning, while asthma attacks are more common at night? And can we identify genes that are responsible for “breaking” the circadian clock to cause DSPS?

In the early 1970s, Ron Konopka and Seymour Benzer identified mutant flies with abnormal circadian patterns and traced the flies’ mutations to a single gene. Biologists needed two more decades to discover a similar clock gene in mammals, which was just the first piece of the puzzle. Today, many more circadian genes have been discovered; these genes, having names like timeless, clock, and cycle, orchestrate the behavior of hundreds of other genes and display a high degree of evolutionary conservation across species.

We will first focus on plants, since maintaining the circadian clock in plants is a matter of life and death. Consider how many plant functions depend on when the sun rises and sets. Indeed, biologists estimate that over a thousand plant genes are circadian, including the genes related to photosynthesis, photoreception, and flowering. But what does it mean for a gene to be circadian?

The Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.” The DNA corresponding to a gene is first transcribed into a strand of RNA composed of four ribonucleotides: adenine, cytosine, guanine, and uracil (which replaces thymine in DNA). Then, the RNA transcript is translated into the amino acid sequence of a protein, which performs some function in the cell.

Much like DNA replication, the chemical machinery underlying transcription and translation is fascinating, but from a computational perspective, both processes are straightforward. Transcription simply replaces all occurrences of T in a DNA string with U. The resulting strand of RNA is then translated into an amino acid sequence as follows. The RNA strand is partitioned into non-overlapping 3-mers called codons, and each codon is converted into one of 20 amino acids via the genetic code; the resulting sequence can be represented as an amino acid string over a 20-letter alphabet. As illustrated in the figure below, each of the 64 RNA codons encodes an amino acid (several codons may encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation (see DETOUR: Discovery of Codons and Split Genes). For example, the DNA string "TATACGAAA" transcribes into the RNA string "UAUACGAAA", which in turn translates into the amino acid string "YTK".
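The two steps just described can be sketched in a few lines of Python. This is only an illustration, not code from the course: the names `transcribe`, `translate`, and `GENETIC_CODE` are our own, and the codon table is the standard genetic code written compactly, with `*` marking the three stop codons.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Replace every T in a DNA string with U."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate non-overlapping codons until a stop codon or the end of the string."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = GENETIC_CODE[rna[i:i + 3]]
        if amino_acid == "*":  # stop codon halts translation
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = transcribe("TATACGAAA")
print(rna)             # UAUACGAAA
print(translate(rna))  # YTK
```

The printed values reproduce the example from the text: "TATACGAAA" becomes "UAUACGAAA", which translates into "YTK".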

The key point is that cells are able to transcribe different genes into RNA at different rates. This variance in the production of a gene’s transcripts, or gene expression, explains how a brain cell and a skin cell can have the same DNA but perform vastly different functions. Variation in gene expression through the day also accounts for how the cell can keep track of time.

It turns out that every plant cell keeps track of day and night independently of other cells, and that just three plant genes, called LHY, CCA1, and TOC1, are the clock’s master timekeepers. Such genes, and the regulatory proteins that they encode, are often controlled by external factors (e.g., nutrient availability or sunlight) in order to allow organisms to adjust their gene expression.

For example, regulatory proteins controlling the circadian clock in plants coordinate circadian activity as follows. TOC1 promotes the expression of LHY and CCA1, whereas LHY and CCA1 repress the expression of TOC1, resulting in a negative feedback loop. In the morning, sunlight activates the transcription of LHY and CCA1, triggering the repression of TOC1 transcription. As light diminishes, so does the production of LHY and CCA1, which in turn do not repress TOC1 any more. Transcription of TOC1 peaks at night and starts promoting the transcription of LHY and CCA1, which in turn repress the transcription of TOC1, and the cycle begins again.

LHY, CCA1, and TOC1 are able to control the transcription of other genes because the regulatory proteins that they encode are transcription factors, or master regulatory proteins that turn other genes on and off. A transcription factor regulates a gene by binding to a specific short DNA interval called a regulatory motif, or transcription factor binding site, in the gene’s upstream region, a 600-1000 nucleotide-long region preceding the start of the gene. For example, CCA1 binds to "AAAAAATCT" in the upstream region of many genes regulated by CCA1.

The life of a bioinformatician would be easy if regulatory motifs were completely conserved, but the reality is more complex, as regulatory motifs may vary at some positions, e.g., CCA1 may instead bind to "AAGAACTCT". But how can we locate these regulatory motifs without knowing what they look like in advance? We need to develop algorithms for motif finding, the problem of discovering a “hidden message” shared by a collection of strings.

In 2000, Steve Kay used DNA arrays (see DETOUR: DNA Arrays) to determine which genes in the plant Arabidopsis thaliana are activated at different times of the day. He then extracted the upstream regions of nearly 500 genes that exhibited circadian behavior and looked for frequently appearing patterns in their upstream regions. If you concatenated these upstream regions into a single string, you would find that "AAAATATCT" is a surprisingly frequent word, appearing 46 times.

Kay named "AAAATATCT" the evening element and performed a simple experiment to prove that it is indeed the regulatory motif responsible for circadian gene expression in Arabidopsis thaliana. After he mutated the evening element in the upstream region of one gene, the gene lost its circadian behavior. 

STOP and Think: What is the possible downside of concatenating all the upstream regions into a single string and looking for frequent words in order to find a motif?

Whereas the evening element in plants is highly conserved, and thus easy to find, motifs having many mutations are more elusive. For example, if you infect a fly with a bacterium, the fly will switch on its immunity genes to fight the infection. Thus, some of the genes with elevated expression levels after the infection are likely to be immunity genes. Indeed, some of these genes have 12-mers similar to "TCGGGGATTTCC" in their upstream regions, the binding site of a transcription factor called NF-κB that activates various immunity genes in flies. However, NF-κB binding sites are nowhere near as conserved as the evening element. The figure below shows ten NF-κB binding sites from the Drosophila melanogaster genome; the most popular nucleotides in every column are shown by upper case colored letters.
```
 1  T C G G G G g T T T t t
 2  c C G G t G A c T T a C
 3  a C G G G G A T T T t C
 4  T t G G G G A c T T t t
 5  a a G G G G A c T T C C
 6  T t G G G G A c T T C C
 7  T C G G G G A T T c a t
 8  T C G G G G A T T c C t
 9  T a G G G G A a c T a C
10  T C G G G t A T a a C C
```
Figure: The ten candidate NF-κB binding sites appearing in the Drosophila melanogaster genome. The colored upper case letters indicate the most frequent nucleotide in each column.

Our aim is to turn the biological challenge of finding regulatory motifs into a computational problem. Below, we have implanted a 15-mer hidden message at a randomly selected position in each of ten randomly generated DNA strings. This example mimics a transcription factor binding site hiding in the upstream regions of ten genes.
```
 1 "atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg"
 2 "acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga"
 3 "tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga"
 4 "gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga"
 5 "tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag"
 6 "gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa"
 7 "cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat"
 8 "aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta"
  9 "ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag" 
10 "ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga" 
```

This is a simple problem: applying the FrequentWords algorithm that we developed in the previous chapter to the concatenation of these strings will immediately reveal the most frequent 15-mer shown below as the implanted pattern. Since these short strings were randomly generated, it is unlikely that they contain other frequent 15-mers.
```
 1 "atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg"
 2 "acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa"
 3 "tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga"
 4 "gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga"
 5 "tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag"
 6 "gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa"
 7 "cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat"
 8 "aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta"
  9 "ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag" 
10 "ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa" 
```
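To make this concrete, here is a minimal sketch of the experiment. The FrequentWords implementation from the previous chapter is not shown in this section, so `frequent_words` below is our own stand-in with an assumed interface (it returns every most frequent k-mer). The ten strings are rebuilt from the flanking sequences in the figure above, with the 15-mer inserted between each prefix and suffix.

```python
from collections import Counter

PATTERN = "aaaaaaaaggggggg"  # the implanted 15-mer, in lower case as in the raw strings

# (prefix, suffix) flanks read off from the ten strings above.
FLANKS = [
    ("atgaccgggatactgat", "ggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg"),
    ("acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaata", "a"),
    ("tgagtatccctgggatgactt", "tgctctcccgatttttgaatatgtaggatcattcgccagggtccga"),
    ("gctgagaattggatg", "tccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga"),
    ("tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaat", "cttatag"),
    ("gtcaatcatgttcttgtgaatggattt", "gaccgcttggcgcacccaaattcagtgtgggcgagcgcaa"),
    ("cggttttggcccttgttagaggcccccgt", "caattatgagagagctaatctatcgcgtgcgtgttcat"),
    ("aacttgagtt", "ctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta"),
    ("ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcat", "accgaaagggaag"),
    ("ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagctt", "a"),
]
dna = [prefix + PATTERN + suffix for prefix, suffix in FLANKS]

def frequent_words(text, k):
    """Return all k-mers that occur most often in text (stand-in for FrequentWords)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    best = max(counts.values())
    return sorted(word for word, count in counts.items() if count == best)

# Concatenate the ten strings and look for the most frequent 15-mer.
print(frequent_words("".join(dna), 15))
```

Because the pattern occurs once in every string, it is the only 15-mer returned.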

Now imagine that instead of implanting exactly the same pattern into all strings, we mutate the pattern before inserting it into each string by randomly changing the nucleotides at four randomly selected positions within each implanted 15-mer, as shown below.
```
 1 "atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg"
 2 "acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa"
 3 "tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga"
 4 "gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga"
 5 "tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag"
 6 "gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa"
 7 "cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat"
 8 "aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta"
  9 "ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag" 
10 "ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa" 
```
FrequentWords is no longer going to help us, since AAAAAAAAGGGGGGG does not even appear in the strings above. We could adapt the Frequent Words Problem into a “Frequent Words with Mismatches Problem”. However, concatenating all the strings into a single string is inadequate because it does not correctly model the biological problem of motif finding. A DnaA box is a pattern that clumps, or appears frequently, within a DNA string. In contrast, a regulatory motif is a pattern that appears at least once in each one of several different regions that are scattered throughout the genome. 
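The effect of the mutations can be checked directly. The sketch below lists the ten implanted 15-mers as we read them off from the mutated strings above (upper case letters are unchanged positions, lower case letters are mutations) and compares each with the original pattern. The helper name `hamming_distance` is our own choice, not something defined in this section.

```python
PATTERN = "AAAAAAAAGGGGGGG"

# The mutated copies of the pattern, one per string, as they appear above.
IMPLANTED = [
    "AgAAgAAAGGttGGG",
    "cAAtAAAAcGGcGGG",
    "AAAAtAAtGGaGtGG",
    "cAAAAAAAGGGattG",
    "AtAAtAAAGGaaGGG",
    "AAcAAtAAGGGctGG",
    "AtAAAcAAGGaGGGc",
    "AAAAAAtAGGGaGcc",
    "ActAAAAAGGaGcGG",
    "ActAAAAAGGaGcGG",
]

def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

distances = [hamming_distance(PATTERN, kmer.upper()) for kmer in IMPLANTED]
print(distances)                                        # every entry is 4
print(PATTERN in [kmer.upper() for kmer in IMPLANTED])  # False
```

Every implanted copy differs from the pattern in exactly four positions, so an exact-match search never sees the pattern itself, even though each string contains a close variant of it.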

Furthermore, when Steve Kay used a DNA array to infer the set of circadian genes in plants, he did not expect that all genes in the resulting set would have the evening element (or its variants) in their upstream regions. Similarly, biologists do not expect that all genes with an elevated expression level in infected flies must be regulated by NF-κB. DNA array experiments are inherently noisy, and some genes identified by these experiments have nothing to do with the circadian clock in plants or immunity genes in flies.



A computational problem formulation for motif finding would score individual instances of motifs depending on how similar they are to an “ideal” motif (i.e., a transcription factor binding site that binds the best to the transcription factor). However, since the ideal motif is unknown, we attempt to select a k-mer from each string and score these k-mers depending on how similar they are to each other.

To define scoring, consider a list of t DNA strings Dna, where each string has length n, and select a k-mer from each string to form a collection Motifs, which we represent as a t x k motif matrix. In the figure below, which shows the motif matrix for the ten NF-κB binding sites listed earlier, we indicate the most frequent nucleotide in each column of the motif matrix by upper case letters. If there are multiple most frequent nucleotides in a column, then we arbitrarily select one of them to break the tie. Note that positions 2 and 3 (counting from 0, as Python does) are the most conserved, since nucleotide G is completely conserved in these positions, whereas position 10 is the least conserved.


Figure: The NF-κB binding sites form a 10 x 12 motif matrix, with the most frequent nucleotide in each column shown in upper case letters and all other nucleotides shown in lower case letters.

By varying the choice of k-mers in each string, we can construct a large number of different motif matrices from a given sample of DNA strings. Our goal is to select k-mers resulting in the most “conserved” motif matrix, meaning the matrix with the most upper case letters (and thus the fewest lower case letters). Leaving aside the question of how we select such k-mers, we will first focus on how to score the resulting motif matrices, defining Score(Motifs) as the number of unpopular (lower case) letters in the motif matrix Motifs (see updated figure below). Our goal is to find a collection of k-mers that minimizes this score (for more on motif scoring functions, see DETOUR: Motif Scoring Functions).
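As a quick illustration of this definition, the sketch below computes the score of the ten NF-κB binding sites shown earlier, written here in upper case. It is not the course's own implementation: the function name `score` and the use of `collections.Counter` are our choices. For each column, the number of unpopular letters is t minus the count of the most frequent nucleotide, so ties do not affect the result.

```python
from collections import Counter

# The ten NF-kB binding sites from the figure, converted to upper case.
NFKB_SITES = [
    "TCGGGGGTTTTT",
    "CCGGTGACTTAC",
    "ACGGGGATTTTC",
    "TTGGGGACTTTT",
    "AAGGGGACTTCC",
    "TTGGGGACTTCC",
    "TCGGGGATTCAT",
    "TCGGGGATTCCT",
    "TAGGGGAACTAC",
    "TCGGGTATAACC",
]

def score(motifs):
    """Count the letters that differ from the most frequent letter of their column."""
    t = len(motifs)
    total = 0
    for column in zip(*motifs):  # iterate over the columns of the motif matrix
        most_frequent_count = Counter(column).most_common(1)[0][1]
        total += t - most_frequent_count
    return total

print(score(NFKB_SITES))  # 30
```

The result, 30, equals the number of lower case letters in the NF-κB figure.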

For a given choice of Motifs, we can construct a 4 x k count matrix, denoted Count(Motifs), counting the number of occurrences of each nucleotide in each column of the motif matrix; element (i,j) of Count(Motifs) stores the number of times that nucleotide i appears in column j of Motifs (see updated figure below).




To generate a count matrix from an arbitrary list of strings Motifs, we need to first initialize the count matrix, represented as a dictionary:

    `count = {}`
We then range over all nucleotides symbol and create a list of zeroes corresponding to count[symbol].
```
    k = len(Motifs[0])
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
             count[symbol].append(0)
```
Note that the first line above sets k equal to the length of Motifs[0], the first string in Motifs, which is the length of every string in Motifs. Also, note the difference between the line count = {} (which forms an empty dictionary) and the line count[symbol] = [] (which forms an empty list). Finally, we need to range over all elements symbol = Motifs[i][j] of the motif matrix and add 1 to count[symbol][j].

```
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
```
Code Challenge (1 point): Write a function Count(Motifs) that takes a list of strings Motifs as input and returns the count matrix of Motifs (as a dictionary of lists). Then place this function into a new Python file for this chapter called "Motifs.py".

```
# Input:  A list of k-mers Motifs
# Output: Count(Motifs), the count matrix as a dictionary of lists
def Count(Motifs):
    k = len(Motifs[0])  # every string in Motifs has the same length
    # Start with a list of k zeroes for each nucleotide
    count = {symbol: [0] * k for symbol in "ACGT"}
    for motif in Motifs:
        for j, symbol in enumerate(motif):
            count[symbol][j] += 1  # symbol occurs in column j of this motif
    return count
```

```
# Test 0: sample dataset (your code is not run on this dataset)
motifs1 = ["AACGTA",
           "CCCGTT",
           "CACCTT",
           "GGATTA",
           "TTCCGG"]
# Expected output: for each nucleotide, entry j is the number of times
# that nucleotide appears in position j across all motifs.
out1 = {'A': [1, 2, 1, 0, 0, 2], 'C': [2, 1, 4, 2, 0, 0], 'G': [1, 1, 0, 2, 1, 1], 'T': [1, 1, 0, 1, 4, 2]}
assert Count(motifs1) == out1

# Test 1: full dataset
motifs2 = ["GTACAACTGT",
           "CAACTATGAA",
           "TCCTACAGGA",
           "AAGCAAGGGT",
           "GCGTACGACC",
           "TCGTCAGCGT",
           "AACAAGGTCA",
           "CTCAGGCGTC",
           "GGATCCAGGT",
           "GGCAAGTACC"]
# Expected output
out2 = {'A': [2, 3, 3, 3, 6, 4, 2, 2, 1, 3], 'C': [2, 3, 4, 3, 2, 3, 2, 1, 3, 3], 'G': [4, 2, 3, 0, 1, 3, 4, 5, 5, 0], 'T': [2, 2, 0, 4, 1, 0, 2, 2, 1, 4]}
assert Count(motifs2) == out2
```

```
Count(motifs)

{'A': [3, 5, 3, 4, 2, 1, 5, 1, 2, 3],
 'C': [2, 2, 3, 1, 4, 3, 2, 4, 2, 3],
 'G': [2, 1, 2, 4, 3, 3, 2, 3, 4, 3],
 'T': [3, 2, 2, 1, 1, 3, 1, 2, 2, 1]}
```

As shown below, we will further divide all of the elements in the count matrix by t, the number of rows in Motifs. This results in a profile matrix Profile(Motifs) for which element (i,j) is the frequency of the i-th nucleotide in the j-th column of the motif matrix (i.e., the number of occurrences of the i-th nucleotide divided by t, the number of nucleotides in the column). Note that the elements of any column of the profile matrix sum to 1.


```
# Input:  A list of k-mers Motifs
# Output: the profile matrix of Motifs, as a dictionary of lists
def Profile(Motifs):
    t = len(Motifs)
    counts = Count(Motifs)
    profile = {}
    for nucleotide in "ACGT":
        # Divide every count by t, the total number of motifs
        # (in Python 3, / is true division, so no float() conversion is needed)
        profile[nucleotide] = [count / t for count in counts[nucleotide]]
    return profile
```

```
out1 = {'A': [0.2, 0.4, 0.2, 0.0, 0.0, 0.4], 'C': [0.4, 0.2, 0.8, 0.4, 0.0, 0.0], 'G': [0.2, 0.2, 0.0, 0.4, 0.2, 0.2], 'T': [0.2, 0.2, 0.0, 0.2, 0.8, 0.4]}
assert Profile(motifs1) == out1
out2 = {'A': [0.2, 0.3, 0.3, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.3], 'C': [0.2, 0.3, 0.4, 0.3, 0.2, 0.3, 0.2, 0.1, 0.3, 0.3], 'G': [0.4, 0.2, 0.3, 0.0, 0.1, 0.3, 0.4, 0.5, 0.5, 0.0], 'T': [0.2, 0.2, 0.0, 0.4, 0.1, 0.0, 0.2, 0.2, 0.1, 0.4]}
assert Profile(motifs2) == out2
```
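The remark that every column of the profile matrix sums to 1 can also be verified in code. The sketch below is self-contained and uses its own compact `profile` helper (a hypothetical lower case name, chosen so it does not clash with the Profile function above). It compares the column sums with `math.isclose` rather than `==`, because sums of floating point fractions are not guaranteed to equal 1.0 exactly.

```python
import math

def profile(motifs):
    """Compact profile matrix: frequency of each nucleotide in each column."""
    t = len(motifs)
    k = len(motifs[0])
    return {
        symbol: [sum(motif[j] == symbol for motif in motifs) / t for j in range(k)]
        for symbol in "ACGT"
    }

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
p = profile(motifs)

# Sum the four frequencies in each column and compare with 1 up to rounding error.
column_sums = [sum(p[symbol][j] for symbol in "ACGT") for j in range(len(motifs[0]))]
assert all(math.isclose(total, 1.0) for total in column_sums)
```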