In the table containing nucleotide counts for T. petrophila (reproduced below), we noted that not just C but also G has peculiar statistics on the forward and reverse half-strands.

In practice, scientists use a more accurate approach that accounts for both G and C when searching for ori. As the above figure illustrates, the difference between the total amount of guanine and the total amount of cytosine is negative on the reverse half-strand and positive on the forward half-strand.






Thus, our idea is to traverse the genome, keeping a running total of the difference between the counts of G and C. If this difference starts increasing, then we guess that we are on the forward half-strand; on the other hand, if this difference starts decreasing, then we guess that we are on the reverse half-strand (see figure below).

Figure: Because of deamination, each forward half-strand has more guanine than cytosine, and each reverse half-strand has more cytosine than guanine. The difference between the counts of G and C is therefore positive on the forward half-strand and negative on the reverse half-strand.



We will keep track of the difference between the total number of occurrences of G and the total number of occurrences of C that we have encountered so far in Genome by using a skew array. This array, denoted Skew, is defined by setting Skew[i] equal to the number of occurrences of G minus the number of occurrences of C in the first i nucleotides of Genome (see figure below). We also set Skew[0] equal to zero.

Given a string Genome, we can form its skew array by setting Skew[0] equal to 0 and then ranging through the genome. At position i of Genome, if we encounter an A or a T, we set Skew[i+1] equal to Skew[i]; if we encounter a G, we set Skew[i+1] equal to Skew[i]+1; if we encounter a C, we set Skew[i+1] equal to Skew[i]-1.

Code Challenge (3 points): Write a function SkewArray(Genome) that takes a DNA string Genome as input and returns the skew array of Genome in the form of a list whose i-th element is Skew[i]. Then add this function to Replication.py.



In [1]:
input_str = "CATGGGCATCGGCCATACGCC"

In [7]:
# Input:  A String Genome
# Output: The skew array of Genome as a list.
def SkewArray(Genome):
    Skew = [0]  # Skew[0] is 0 by definition
    for i in range(len(Genome)):
        if Genome[i] == 'G':
            Skew.append(Skew[i] + 1)  # G raises the skew by 1
        elif Genome[i] == 'C':
            Skew.append(Skew[i] - 1)  # C lowers the skew by 1
        else:
            Skew.append(Skew[i])  # A and T (and any other symbol) leave the skew unchanged
    return Skew

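# Input:  A DNA string Genome
# Output: A list of all positions i at which Skew[i] attains its minimum
def MinimumSkew(Genome):
    positions = [] # output variable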
    Skew = SkewArray(Genome)
    min_val = min(Skew) 
    for i in range(len(Skew)):
        if Skew[i] == min_val:
            positions.append(i)
    return positions

In [3]:
SkewArray(input_str)

[0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]

In [4]:
import matplotlib.pyplot as plt
%matplotlib inline


The skew diagram of Genome is defined by plotting i against Skew[i] as i ranges from 0 to len(Genome). The figure below shows the skew diagram for the genome from the previous step.
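The definition above can be sketched in code as follows. This is only a minimal illustration: the helper names `skew_array`, `skew_diagram_points`, and `plot_skew_diagram` are hypothetical (they are not part of Replication.py), and the plotting step assumes matplotlib is installed, which is why it is imported only inside the plotting function.

```python
def skew_array(genome):
    """Return the skew array: skew[i] = (#G - #C) in the first i nucleotides."""
    skew = [0]
    for nucleotide in genome:
        if nucleotide == "G":
            skew.append(skew[-1] + 1)
        elif nucleotide == "C":
            skew.append(skew[-1] - 1)
        else:
            skew.append(skew[-1])
    return skew


def skew_diagram_points(genome):
    """Return the x and y coordinates of the skew diagram.

    x runs over i = 0, 1, ..., len(genome); y is Skew[i].
    """
    skew = skew_array(genome)
    positions = list(range(len(genome) + 1))
    return positions, skew


def plot_skew_diagram(genome):
    """Draw the skew diagram (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt  # imported lazily so the helpers above work without it

    positions, skew = skew_diagram_points(genome)
    plt.plot(positions, skew, marker="o")
    plt.xlabel("position i")
    plt.ylabel("Skew[i]")
    plt.title("Skew diagram")
    plt.show()


positions, skew = skew_diagram_points("CATGGGCATCGGCCATACGCC")
```

Calling `plot_skew_diagram("CATGGGCATCGGCCATACGCC")` in a notebook cell should then draw the diagram for the genome from the previous step.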



The figure below depicts the skew diagram for a linearized E. coli genome. The pattern is even stronger than the pattern observed when we visualized the symbol array! It turns out that the skew diagram for many bacterial genomes has a similar characteristic shape.

Figure: The skew diagram for E. coli achieves a maximum and minimum at positions 1550413 and 3923620, respectively.

Let’s follow the 5' → 3' direction of DNA and walk along the chromosome from ter to ori (along a reverse half-strand), then continue on from ori to ter (along a forward half-strand). In the figure below, we see that the skew is decreasing along the reverse half-strand and increasing along the forward half-strand. Thus, the skew should achieve a minimum at the position where the reverse half-strand ends and the forward half-strand begins, which is exactly the location of ori!




In [None]:
#Test 0 # Sample Dataset (your code is not run on this dataset)
#Input:
text=    "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
#Output:
output=    [11, 24]
assert(MinimumSkew(text)==output)
#Test 1 # Check for index off-by-one errors (either indices are 1 too large or 1 too small)
#Input:
text=    "ACCG"
#Output:
output=    [3]
assert(MinimumSkew(text)==output)
#Test 2 # Check if you're missing the last value
#Input:
text=    "ACCC"
#Output:
output=    [4]
assert(MinimumSkew(text)==output)
#Test 3 # Check to make sure you're not finding maximum skew instead of minimum skew
#Input:
text=    "CCGGGT"
#Output:
output=    [2]
assert(MinimumSkew(text)==output)
#Test 4 # Check if you're not finding all of the indices
#Input:
text=    "CCGGCCGG"
#Output:
output=    [2, 6]
assert(MinimumSkew(text)==output)
#Test 5 # Full dataset
#Input:
text=    "AGCGTGCCGAAATATGCCGCCAGACCTGCTGCGGTGGCCTCGCCGACTTCACGGATGCCAAGTGCATAGAGGAAGCGAGCAAAGGTGGTTTCTTTCGCTTTATCCAGCGCGTTAACCACGTTCTGTGCCGACTTT"
#Output:
output=    [52]
assert(MinimumSkew(text)==output)

Solving the Minimum Skew Problem now provides us with an approximate location of ori at position 3923620 in E. coli. In an attempt to confirm this hypothesis, let’s look for a hidden message representing a potential DnaA box near this location. Solving the Frequent Words Problem in a window of length 500 starting at position 3923620 (shown below) reveals no 9-mers (along with their reverse complements) that appear three or more times! Even if we have located the position of ori in E. coli, it appears that we still have not found the DnaA boxes that jump-start replication in this bacterium . . . 
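The window search described above might be sketched as follows. This is a rough illustration under stated assumptions, not the course's reference solution: the function names `reverse_complement`, `frequent_words_with_rc`, and `frequent_in_window` are hypothetical, positions are taken to be 0-based, no mismatches are allowed, and the toy string stands in for the real E. coli genome, which is not loaded here.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(pattern):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[n] for n in reversed(pattern))


def frequent_words_with_rc(text, k):
    """Map each k-mer in text to its count plus the count of its reverse complement."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    combined = {}
    for kmer in counts:
        rc = reverse_complement(kmer)
        # Assumption: a k-mer equal to its own reverse complement is counted once.
        if kmer == rc:
            combined[kmer] = counts[kmer]
        else:
            combined[kmer] = counts[kmer] + counts.get(rc, 0)
    return combined


def frequent_in_window(genome, start, window_length, k, min_count):
    """List k-mers whose combined count in the window is at least min_count."""
    window = genome[start:start + window_length]
    combined = frequent_words_with_rc(window, k)
    return sorted(kmer for kmer, total in combined.items() if total >= min_count)


# Toy example: a window of length 9 starting at position 2, searching for 3-mers.
toy_genome = "CCATGCATATGCC"
hits = frequent_in_window(toy_genome, 2, 9, 3, 3)
```

For the search in the text, the corresponding call would use a window length of 500, k = 9, and a minimum count of 3, with `start` set to the minimum-skew position.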



In [8]:
#Quiz
#text="GATACACTTCCCGAGTAGGTACTG"
#MinimumSkew(text) # For this quiz, min was temporarily swapped for max inside MinimumSkew, so the output below lists the maximum-skew positions.

[1, 2, 3, 4]

In [9]:
x=0
for y in range(0,5):
    x+=y
print(x)

10
