## Some Hidden Messages are More Surprising than Others.
### Step 1

Recall that nucleotides `A` & `T` are complements of each other, as are `G` & `C`.

- The beginning and end of a DNA strand are denoted `5'` (pronounced "five prime") and `3'` (pronounced "three prime").
- The figure below shows a template strand `AGTCGCATAGT` and its complementary strand `ACTATGCGACT`.
- Each strand is read in the `5' -> 3'` direction.


![](http://bioinformaticsalgorithms.com/images/Replication/reverse_complement.png)


### Step 2

Given a nucleotide `p`, we denote its complementary nucleotide as `p*`.

The **reverse complement** of a string $Pattern=p_1...p_n$ is the string $Pattern_{rc}=p_n^{*}...p_1^{*}$ formed by taking the complement of each nucleotide in `Pattern`, then reversing the resulting string.

---
***code challenge***

**Reverse Complement Problem**: Find the reverse complement of a DNA string.

**Input**: A DNA string `Pattern`

**Output**: $Pattern_{rc}$, the reverse complement of `Pattern`

sample input: `AAAACCCGGT`

sample output: `ACCGGGTTTT`

In [1]:
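# Sample input:  AAAACCCGGT
# Sample output: ACCGGGTTTT
# Complement pairs: A <-> T, G <-> C
def reverse_code(text):
    """Return the reverse complement of the DNA string `text`."""
    code = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    new_text = ''
    # Walk the string from its last character to its first,
    # appending the complement of each nucleotide.
    for i in text[::-1]:
        if i in code:
            new_text += code[i]
        else:
            # Characters other than A, C, G, T are reported and skipped.
            print('not!')
    return new_text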

In [2]:
s_in='AAAACCCGGT'
reverse_code(s_in)

'ACCGGGTTTT'
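
As an alternative to the loop above, here is a minimal sketch built on Python's `str.maketrans` and `str.translate`. The name `reverse_complement` is hypothetical and not part of the original notebook. It assumes the input contains only uppercase `A`, `C`, `G`, `T`, and it raises an error on anything else instead of skipping it:

```python
# Hypothetical helper, not from the original notebook.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(pattern):
    """Return the reverse complement of an uppercase DNA string."""
    bad = set(pattern) - set("ACGT")
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    # Complement every nucleotide, then reverse the result.
    return pattern.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAACCCGGT"))
```

Translating the whole string in one call avoids building the result one character at a time, which may matter for long sequences.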

### Step 3
Look again at the four most frequent 9-mers in the *ori* of *Vibrio cholerae*:
 - ATGATCAAG
 - CTTGATCAT
 - TCTTGATCA
 - CTCTTGATC
 
 
```
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat tgataatgaatttacatgcttccgcgacgatttacCTCTTGATCATcgatccgattgaag atcttcaattgttaattctcttgcctcgactcatagccatgatgagCTCTTGATCATgtt tccttaaccctctattttttacggaagaATGATCAAGctgctgCTCTTGATCATcgtttc 
```

### Step 4
Among those four, the following two are reverse complements of each other:
 - ATGATCAAG
 - CTTGATCAT

Together they account for the six total occurrences shown below:


```text
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
```

In [3]:
reverse_code('ATGATCAAG')

'CTTGATCAT'

In [4]:
reverse_code('CTTGATCAT')

'ATGATCAAG'

Finding a 9-mer that appears six times (either as itself or as its reverse complement) in a DNA string of length 500 is far more surprising than finding a 9-mer that appears three times (as itself). This observation leads us to the working hypothesis that `ATGATCAAG` and its reverse complement `CTTGATCAT` indeed represent DnaA boxes in *Vibrio cholerae*.
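
To get a rough sense of how surprising this is, here is a back-of-the-envelope sketch. It assumes, as a simplification not made in the text, that every position is drawn independently and uniformly from the four nucleotides. Under that assumption, the expected number of occurrences of one fixed 9-mer in a string of length 500 is the number of 9-long windows divided by $4^9$. The function name is hypothetical.

```python
def expected_occurrences(text_length, k, alphabet_size=4):
    """Expected count of one fixed k-mer in a uniformly random string."""
    windows = text_length - k + 1          # number of k-long windows
    return windows / alphabet_size ** k    # each window matches with probability 1 / 4^k

e = expected_occurrences(500, 9)
print(f"{e:.5f}")   # roughly 0.0019 for length 500 and k = 9
```

Under this simplified model, even three occurrences of one particular 9-mer lie far above expectation, and six lie further still. Real genomes are not uniformly random, so the figure is only an order-of-magnitude guide.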

This computational conclusion makes sense biologically because the DnaA protein that binds to DnaA boxes and initiates replication does not care which of the two strands it binds to. Thus, for our purposes, both `ATGATCAAG` and `CTTGATCAT` represent DnaA boxes.
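
Since either strand can carry the box, a count should treat a pattern and its reverse complement as the same hit. The sketch below illustrates this on a short toy string rather than the real *ori*. The function names are hypothetical, and the input is assumed to contain only `A`, `C`, `G`, `T` in either case:

```python
# Hypothetical helpers, not from the original notebook.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(pattern):
    """Complement each nucleotide, then reverse the string."""
    return pattern.translate(_COMPLEMENT)[::-1]

def count_with_reverse_complement(text, pattern):
    """Count windows of text equal to pattern or to its reverse complement."""
    text = text.upper()                      # the ori listing mixes upper and lower case
    pattern = pattern.upper()
    k = len(pattern)
    # A set keeps a self-complementary pattern from being counted twice.
    targets = {pattern, reverse_complement(pattern)}
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] in targets)

toy = "ATGATCAAGGGCTTGATCATTT"   # toy string: one ATGATCAAG and one CTTGATCAT
print(count_with_reverse_complement(toy, "ATGATCAAG"))
```

Passing `CTTGATCAT` instead of `ATGATCAAG` gives the same count, because both patterns define the same pair of targets.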

However, before concluding that we have found the DnaA box of *Vibrio cholerae*, the careful bioinformatician should check if there are other short regions in the *Vibrio cholerae* genome exhibiting multiple occurrences of `ATGATCAAG` (or `CTTGATCAT`). After all, maybe these strings occur as repeats throughout the entire *Vibrio cholerae* genome, rather than just in the *ori* region. To this end, we need to solve the following problem.

#### Code Challenge: Solve the Pattern Matching Problem.
- Input: Two strings, `Pattern` and `Genome`.
- Output: A collection of space-separated integers specifying all starting positions where `Pattern` appears as a substring of `Genome`.

Example (positions are 0-based):
- Input:
    - pattern: `ATAT`
    - genome: `GATATATGCATATACTT`
- Output:
    - `1 3 9`

In [5]:
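# Concise, but slow on a large genome: find() is called again for every position n
pat = 'ATAT'
gen = 'GATATATGCATATACTT'

def find_pat_loc(pattern, genome):
    # Keep each index n at which the next match of pattern starts exactly at n
    loc = [n for n in range(len(genome)) if genome.find(pattern, n) == n]
    return loc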
        
        
find_pat_loc(pat,gen)

[1, 3, 9]
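
For a genome-sized input, one possible sketch jumps from match to match with `str.find` instead of testing every index. The name `pattern_positions` is hypothetical. The last line prints the positions in the space-separated form the problem statement asks for:

```python
def pattern_positions(pattern, genome):
    """Return all 0-based start positions of pattern in genome, overlaps included."""
    positions = []
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so that overlapping matches are found too.
        start = genome.find(pattern, start + 1)
    return positions

hits = pattern_positions("ATAT", "GATATATGCATATACTT")
print(" ".join(map(str, hits)))
```

Each call to `find` continues from just past the previous hit, so the work is spread over one pass through the genome in the typical case.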

### Step 6
Exercise Break (HT): Return a space-separated list of starting positions (in increasing order) where `CTTGATCAT` appears as a substring in the *Vibrio cholerae* genome.

In [6]:
# Load the genome sequence from a local text file
filename = "/home/nbuser/Vibrio_cholerae.txt"

with open(filename) as f:
    vibr_c = f.read()
f.closed  # True: the with-block has already closed the file

True

In [7]:
pat = 'ATAT'
gen = 'GATATATGCATATACTT'

def v_patt(genome,pattern):
    # Slide a window of len(pattern) along genome and record every match start
    pat_len = len(pattern)
    result = []
    for i in range(len(genome) - pat_len + 1):
        pat_fin = genome[i:i+pat_len]
        if pat_fin == pattern:
            result.append(i)
    return result

v_patt(gen,pat)

[1, 3, 9]

In [13]:
# 16 instances (the output comes from the dict-returning v_patt defined in cell In [12])
pat_6 = 'CTTGATCAT'
v_patt(vibr_c,pat_6)

{60039: 'CTTGATCAT',
 98409: 'CTTGATCAT',
 129189: 'CTTGATCAT',
 152283: 'CTTGATCAT',
 152354: 'CTTGATCAT',
 152411: 'CTTGATCAT',
 163207: 'CTTGATCAT',
 197028: 'CTTGATCAT',
 200160: 'CTTGATCAT',
 357976: 'CTTGATCAT',
 376771: 'CTTGATCAT',
 392723: 'CTTGATCAT',
 532935: 'CTTGATCAT',
 600085: 'CTTGATCAT',
 622755: 'CTTGATCAT',
 1065555: 'CTTGATCAT'}

In [14]:
# 17 instances (the output comes from the dict-returning v_patt defined in cell In [12])
pat_7 = 'ATGATCAAG'
v_patt(vibr_c,pat_7)

{116556: 'ATGATCAAG',
 149355: 'ATGATCAAG',
 151913: 'ATGATCAAG',
 152013: 'ATGATCAAG',
 152394: 'ATGATCAAG',
 186189: 'ATGATCAAG',
 194276: 'ATGATCAAG',
 200076: 'ATGATCAAG',
 224527: 'ATGATCAAG',
 307692: 'ATGATCAAG',
 479770: 'ATGATCAAG',
 610980: 'ATGATCAAG',
 653338: 'ATGATCAAG',
 679985: 'ATGATCAAG',
 768828: 'ATGATCAAG',
 878903: 'ATGATCAAG',
 985368: 'ATGATCAAG'}

In [10]:
pat_6 == reverse_code(pat_7)

True

In [11]:
pat_7 == reverse_code(pat_6)

True

In [12]:
pat = 'ATAT'
gen = 'GATATATGCATATACTT'

def v_patt(genome,pattern):
    # Same scan as in cell In [7], but return {start position: matched k-mer}
    pat_len = len(pattern)
    dic = {}
    for i in range(len(genome) - pat_len + 1):
        pat_fin = genome[i:i+pat_len]
        if pat_fin == pattern:
            dic[i] = pat_fin
    return dic

v_patt(gen,pat)

{1: 'ATAT', 3: 'ATAT', 9: 'ATAT'}

After solving the Pattern Matching Problem, we discover that `ATGATCAAG` appears 17 times, at the following starting positions of the *Vibrio cholerae* genome:

```python
[116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527,
307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368]
```
With the exception of the three occurrences of `ATGATCAAG` in *ori* at starting positions **151913, 152013, and 152394**, no other instances of `ATGATCAAG` form clumps, i.e., appear close to each other in a small region of the genome.

You may check that the same conclusion is reached when searching for `CTTGATCAT`. We now have strong statistical evidence that `ATGATCAAG/CTTGATCAT` may represent the hidden message to DnaA to start replication.
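
The clump check can be sketched directly on the position lists reported above. The function name is hypothetical, and so are the thresholds: "clump" here is assumed to mean at least three occurrences whose 9-mers all fit inside a window of 500 nucleotides (the *ori* length used earlier).

```python
# Starting positions copied from the notebook outputs above.
ATGATCAAG_POSITIONS = [116556, 149355, 151913, 152013, 152394, 186189, 194276,
                       200076, 224527, 307692, 479770, 610980, 653338, 679985,
                       768828, 878903, 985368]
CTTGATCAT_POSITIONS = [60039, 98409, 129189, 152283, 152354, 152411, 163207,
                       197028, 200160, 357976, 376771, 392723, 532935, 600085,
                       622755, 1065555]

def find_clumps(positions, k, window=500, min_count=3):
    """Return runs of min_count consecutive hits that fit inside one window."""
    ordered = sorted(positions)
    clumps = []
    for i in range(len(ordered) - min_count + 1):
        first, last = ordered[i], ordered[i + min_count - 1]
        # The run fits if the last k-mer ends within `window` of the first start.
        if last + k - first <= window:
            clumps.append(ordered[i:i + min_count])
    return clumps

print(find_clumps(ATGATCAAG_POSITIONS, k=9))
print(find_clumps(CTTGATCAT_POSITIONS, k=9))
```

With these assumed thresholds, each list yields a single clump, and both clumps fall in the same region of the genome around position 152000.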