# Lab Day 3 - Origin of Replication Part 2
The goal of lab today is to write some code to help us find the origin of replication for any bacterial genome.

*Adapted from Chapter 1 of Bioinformatics Algorithms*

***
Since we are picking up where we left off yesterday, you should import your functions from yesterday into your workspace. I have posted `Day2_Lab.py` on Moodle, which includes all relevant functions. You may import that file here if you wish.

```python
from Day2_Lab import *  # Load all functions directly into your workspace
```

*** 

## Step 3: Use Skew to Find Ori

We found hundreds of possibilities and are no closer to finding the Ori in E. coli. Let's try using our skew calculations.

*** 

#### (1) Write a function that will calculate the skew at each position of a genome.

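$$skew_i(genome) = count_i(G) - count_i(C)$$

Here $count_i(X)$ is the number of times nucleotide $X$ appears in the first $i$ positions of the genome.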

$skew_0(genome) = 0$ <br><br>
$$skew_{i+1}(genome) = \begin{cases}
skew_i(genome), & genome_i \in \{A, T\} \\
skew_i(genome) + 1, & genome_i = G \\
skew_i(genome) - 1, & genome_i = C
\end{cases}$$

`Skew("CATGGGCATCGGCCATACGCC")` would have an output of `[0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]`
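One possible way to turn the recurrence above into code is sketched here. It is only a sketch, not the required solution, and it assumes the genome is an uppercase string in which anything other than `G` or `C` leaves the skew unchanged.

```python
def Skew(genome):
    """Return the running G-C skew as a list with len(genome) + 1 entries."""
    skew = [0]  # skew_0 is defined to be 0
    for base in genome:
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0  # assumption: A, T (and anything else) leave the skew unchanged
        skew.append(skew[-1] + step)
    return skew


print(Skew("CATGGGCATCGGCCATACGCC"))
```

Building the list one step at a time keeps the work linear in the genome length, which matters for a genome with millions of bases.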

#### (2) Try plotting E. coli's skew.

The package matplotlib has many plotting capabilities. Use `import matplotlib.pyplot as plt` (the convention in the field) and the `plt.plot` function to visualize the skew. For help, try `help(plt.plot)` or search Google.

#### (3) Calculate and plot the skew for the E. coli genome.

#### (4) Now, write a function to find all the locations where the skew is minimized. These are our candidates for Ori.
`MinimizeSkew("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT")` would have an output of `[11, 24]`
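A minimal sketch of one approach follows. It recomputes the running skew inline so that it runs on its own; in your notebook you could call your own `Skew` function instead. The returned indices are positions in the skew list, so index `i` refers to the skew after the first `i` nucleotides.

```python
def MinimizeSkew(genome):
    """Return every index at which the running G-C skew reaches its minimum."""
    skew = [0]
    for base in genome:
        # Comparisons evaluate to True/False, which Python treats as 1/0 in arithmetic
        skew.append(skew[-1] + (base == "G") - (base == "C"))
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


print(MinimizeSkew("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"))
```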

#### (5) What position(s) minimize the skew in E. coli?

#### (6) Use that position and a window size of 500 to see whether any 9-mers, together with their reverse complements, appear 3 or more times.

***
## Step 4: Account for Mismatches
It seems that we are stuck again. We have an approximate location for the Ori but we haven't found the DnaA Box yet. We need to turn back to biology for some help here. 

Small genetic variations occur frequently, and biology is often flexible enough to handle them. In particular, DnaA will bind imperfect DnaA boxes, so we need to account for that in our code.

We can measure how close two k-mers are using the <font color = #800080><b>Hamming Distance</b></font>: the number of mismatches (positions where the two nucleotides differ) between two k-mers of equal length. For example, `CGAAT` and `CGGAC` have two mismatches.

#### (7) Write a function to calculate the Hamming Distance between two strings.

An input of `HammingDist("GGGCCGTTGGT", "GGACCGTTGAC")` will have an output of `3`
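As a sketch, the Hamming Distance can be computed by pairing up the characters of the two strings and counting the pairs that differ. Raising an error for strings of unequal length is an assumption made here, since the lab does not say how that case should be handled.

```python
def HammingDist(first, second):
    """Count the positions at which two equal-length strings differ."""
    if len(first) != len(second):
        # Assumption: unequal lengths are treated as an error
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(first, second))


print(HammingDist("GGGCCGTTGGT", "GGACCGTTGAC"))
print(HammingDist("CGAAT", "CGGAC"))
```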

#### (8) Now, let's rewrite our `PatternCount` function to allow for approximate matches: matches with a Hamming Distance $\le$ d when compared to the pattern. Your function will take in a genome, a pattern, and a Hamming Distance cutoff of d.

`ApproximatePatternCount("TTTAGAGCCTTCAGAGG", "GAGG", 2)` will have an output of 4.
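One straightforward sketch slides a window of the pattern's length along the genome and counts the windows that fall within the cutoff. The argument order (genome, pattern, d) follows the example call above, and `HammingDist` is repeated so the block runs on its own.

```python
def HammingDist(first, second):
    """Count the positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(first, second))


def ApproximatePatternCount(genome, pattern, d):
    """Count windows of the genome within Hamming Distance d of the pattern."""
    k = len(pattern)
    count = 0
    for start in range(len(genome) - k + 1):
        if HammingDist(genome[start:start + k], pattern) <= d:
            count += 1
    return count


print(ApproximatePatternCount("TTTAGAGCCTTCAGAGG", "GAGG", 2))
```

With `d = 0` this reduces to an exact pattern count, which is a quick way to sanity check it against your Day 2 `PatternCount`.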

#### (9) Write a function that will do Approximate Pattern Matching. We want to find the locations of all approximate occurrences of a pattern in a string. Your function will take in a genome, pattern and a Hamming Distance cutoff `d`.

`ApproximateMatch("CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT", "ATTCTGGA", 3)` will have an output of `[6, 7, 26, 27]`
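The same sliding-window idea can collect starting positions instead of a count. This is a sketch under the same assumptions as before: 0-based positions, and `HammingDist` repeated so the block is self-contained.

```python
def HammingDist(first, second):
    """Count the positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(first, second))


def ApproximateMatch(genome, pattern, d):
    """Return 0-based start positions of windows within distance d of the pattern."""
    k = len(pattern)
    return [
        start
        for start in range(len(genome) - k + 1)
        if HammingDist(genome[start:start + k], pattern) <= d
    ]


genome = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT"
print(ApproximateMatch(genome, "ATTCTGGA", 3))
```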

#### (10) Write a `Neighbors` function to generate all the k-mers that are in the <i>d</i>-neighborhood of a given pattern.

This function uses recursion to generate the collection of k-mers in the <b><i>d</i>-neighborhood</b> of the pattern. Every k-mer generated by `Neighbors(Pattern,d)` should have a Hamming Distance of at most d from the pattern.

`Neighbors('AA',1)` should yield an output of `{'AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA'}`

`Neighbors('ACG',1)` should yield an output of `{'AAG', 'ACA', 'ACC', 'ACG', 'ACT', 'AGG', 'ATG', 'CCG', 'GCG', 'TCG'}`
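One common recursive formulation, sketched here as an illustration rather than the only valid design, builds the neighborhood of a pattern from the neighborhood of its suffix. If a suffix neighbor has used fewer than `d` mismatches, any first base may be prepended; otherwise the first base must match the original. The DNA alphabet `ACGT` and the handling of an empty pattern are assumptions of this sketch.

```python
def HammingDist(first, second):
    """Count the positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(first, second))


def Neighbors(pattern, d):
    """Return the set of all k-mers within Hamming Distance d of the pattern."""
    if d == 0 or not pattern:
        return {pattern}  # no mismatches allowed (or nothing to mutate)
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    suffix = pattern[1:]
    neighborhood = set()
    for suffix_neighbor in Neighbors(suffix, d):
        if HammingDist(suffix, suffix_neighbor) < d:
            # A mismatch is still available, so the first base is free
            for base in "ACGT":
                neighborhood.add(base + suffix_neighbor)
        else:
            # All d mismatches are used, so the first base must match
            neighborhood.add(pattern[0] + suffix_neighbor)
    return neighborhood


print(sorted(Neighbors("AA", 1)))
print(sorted(Neighbors("ACG", 1)))
```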

#### (11) Next up is the re-writing of our Frequent Words problem. Now, we want to find Frequent Words with Mismatches. 
To do this efficiently, we will use a frequency table (dictionary) to count the number of times each k-mer has an approximate match in `Text`.

Our `Neighbors` function will come in handy here.

`FrequentWordsWithMismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)` will generate a list similar to `["ATGC", "ATGT", "GATG"]`
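A sketch of the frequency-table idea: every window of the text "votes" for each k-mer in its own d-neighborhood, so a k-mer's final tally equals its approximate pattern count. The helpers are repeated so the block runs on its own, and sorting the result is a choice made here only to give a stable order.

```python
from collections import defaultdict


def HammingDist(first, second):
    return sum(a != b for a, b in zip(first, second))


def Neighbors(pattern, d):
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    suffix = pattern[1:]
    neighborhood = set()
    for suffix_neighbor in Neighbors(suffix, d):
        if HammingDist(suffix, suffix_neighbor) < d:
            for base in "ACGT":
                neighborhood.add(base + suffix_neighbor)
        else:
            neighborhood.add(pattern[0] + suffix_neighbor)
    return neighborhood


def FrequentWordsWithMismatches(text, k, d):
    """Return the k-mers with the most approximate matches (up to d mismatches)."""
    freq = defaultdict(int)  # k-mer -> number of windows within distance d
    for start in range(len(text) - k + 1):
        for neighbor in Neighbors(text[start:start + k], d):
            freq[neighbor] += 1
    if not freq:
        return []
    best = max(freq.values())
    return sorted(kmer for kmer, count in freq.items() if count == best)


print(FrequentWordsWithMismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
```

Note that the most frequent k-mer with mismatches does not have to appear exactly anywhere in the text, which is why the table is keyed by neighbors rather than by the windows themselves.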

#### (12) Finally, can you edit your function from part 11 to account for the reverse complement? 

You want your function to find all k-mers that maximize `ApproximatePatternCount(Text, Pattern, d) + ApproximatePatternCount(Text, Pattern_rc, d)` over all possible k-mers. 

`FrequentWordsWithMismatchesRC("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)` should give an output of `["ATGT", "ACAT"]` (the order may vary)
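One way to extend the frequency-table sketch: each neighbor of a window adds one vote for itself and one vote for its reverse complement, so a k-mer's tally equals its own approximate count plus that of its reverse complement. The helper `reverse_complement` is a hypothetical name introduced for this sketch; your Day 2 function may be named differently.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern):
    # Hypothetical helper: complement each base, then reverse the string
    return pattern.translate(COMPLEMENT)[::-1]


def HammingDist(first, second):
    return sum(a != b for a, b in zip(first, second))


def Neighbors(pattern, d):
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    suffix = pattern[1:]
    neighborhood = set()
    for suffix_neighbor in Neighbors(suffix, d):
        if HammingDist(suffix, suffix_neighbor) < d:
            for base in "ACGT":
                neighborhood.add(base + suffix_neighbor)
        else:
            neighborhood.add(pattern[0] + suffix_neighbor)
    return neighborhood


def FrequentWordsWithMismatchesRC(text, k, d):
    """Return k-mers maximizing approximate count of the k-mer plus its reverse complement."""
    freq = defaultdict(int)
    for start in range(len(text) - k + 1):
        for neighbor in Neighbors(text[start:start + k], d):
            freq[neighbor] += 1                      # window approximately matches neighbor
            freq[reverse_complement(neighbor)] += 1  # and counts toward its reverse complement
    if not freq:
        return []
    best = max(freq.values())
    return sorted(kmer for kmer, count in freq.items() if count == best)


print(FrequentWordsWithMismatchesRC("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
```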

#### (13) Let's try finding the DnaA box for E. coli again. Using the position we found in problem 5 that minimized skew and a window size of 500, look for 9-mers (with 1 mismatch and reverse complement).

#### (14) As a part of this Ori finding exercise, we looked at three different bacterial genomes. Inside the Ori of these different genomes, we found k-mers, other than our DnaA box, that occurred surprisingly frequently. Why do you think this is?