In the last section, we saw that as the replication fork expands, DNA polymerase synthesizes DNA quickly on the reverse half-strand but suffers delays on the forward half-strand. We will explore the asymmetry of DNA replication to design a new algorithm for finding ori. 

How in the world can the asymmetry of replication possibly help us locate ori? Notice that since the replication of a reverse half-strand proceeds quickly, it lives double-stranded for most of its life. Conversely, a forward half-strand spends a much larger amount of its life single-stranded, waiting to be used as a template for replication. This discrepancy between the forward and reverse half-strands is important because single-stranded DNA has a much higher mutation rate than double-stranded DNA. In particular, if one of the four nucleotides has a greater tendency than the others to mutate in single-stranded DNA, then we should observe a shortage of this nucleotide on the forward half-strand.

Following up on this thought, let’s compare the nucleotide counts of the reverse and forward half-strands. If these counts differ substantially, then we will design an algorithm that attempts to track down these differences in genomes for which ori is unknown. The nucleotide counts for Thermotoga petrophila are shown in the figure below.


Although the frequencies of A and T are practically identical on the two half-strands, C is more frequent on the reverse half-strand than on the forward half-strand, resulting in a difference of 219518 - 207901 = +11617. Its complementary nucleotide G is less frequent on the reverse half-strand than on the forward half-strand, resulting in a difference of 201634 - 211607 = -9973.
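
The two differences quoted above can be rechecked with a few lines of Python. This is only a sketch: the counts are the four numbers given in the text, while the dictionary layout and the function name `strand_difference` are illustrative assumptions.

```python
# Counts of C and G quoted in the text for Thermotoga petrophila.
# The dictionary layout is an assumption made for this example.
counts = {
    "C": {"reverse": 219518, "forward": 207901},
    "G": {"reverse": 201634, "forward": 211607},
}

def strand_difference(nucleotide):
    """Return the reverse-minus-forward count for one nucleotide."""
    c = counts[nucleotide]
    return c["reverse"] - c["forward"]

for nucleotide in ("C", "G"):
    # the "+d" format prints the sign explicitly, as in the text
    print(nucleotide, f"{strand_difference(nucleotide):+d}")
```

A positive value means the nucleotide is more frequent on the reverse half-strand; a negative value means it is more frequent on the forward half-strand.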




Let’s see if we can take advantage of these peculiar statistics to locate ori in a circular bacterial genome. They are caused by deamination, the mutation of C into T, which occurs far more often in single-stranded DNA and therefore depletes C on the forward half-strand. Since we know that C is more frequent in half of the genome and less frequent in the other half, our idea is to slide a giant window of length len(Genome)//2 down the genome, counting the number of occurrences of C in each window. (Note: in Python, the double slash // indicates integer division, which discards any remainder; therefore, 11//2 is equal to 5, not 5.5.) Inspired by the nucleotide counts table for Thermotoga petrophila (reproduced below), our hope is that the window having the fewest occurrences of C will roughly correspond to the forward half-strand and that the window having the most occurrences of C will roughly correspond to the reverse half-strand. And if we know where the forward and reverse half-strands are, then we have found ori!




Although most bacteria have circular genomes, we have thus far assumed that genomes were linear, a reasonable simplifying assumption because the length of the window was much shorter than the length of the genome. This time, because we are sliding a giant window, we should account for windows that “wrap around” the end of Genome. To do so, we will define a string ExtendedGenome as Genome + Genome[0:n//2], where n = len(Genome). That is, we copy the first len(Genome)//2 nucleotides of Genome to the end of the string (figure below).
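
To make the wrap-around concrete, here is a minimal sketch on a toy genome of length 8. The helper name `extend_genome` and the toy string are assumptions chosen for illustration; only the slicing expression comes from the text.

```python
def extend_genome(genome):
    """Append the first half of genome to its end so windows can wrap around."""
    n = len(genome)
    return genome + genome[0:n // 2]

genome = "AAAAGGGG"
extended = extend_genome(genome)
print(extended)        # AAAAGGGGAAAA

# The window of length n//2 that starts at the last position of the genome
# now runs past the original end and picks up the first three symbols again.
n = len(genome)
last_window = extended[n - 1:n - 1 + n // 2]
print(last_window)     # GAAA
```

Without the extension, the slice starting at position 7 would contain only the single symbol "G", and the count for that window would be wrong.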



We will keep track of the total number of occurrences of C that we encounter in each window of ExtendedGenome by using a symbol array. The i-th element of the symbol array is equal to the number of occurrences of the symbol in the window of length len(Genome)//2 starting at position i of ExtendedGenome. For example, see the figure below.




In [2]:
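def SymbolArray(Genome, symbol):
    # array[i] = number of occurrences of symbol in the half-genome window starting at i
    array = {}
    n = len(Genome)
    # append the first half of Genome so that windows can wrap around the end
    ExtendedGenome = Genome + Genome[0:n//2]
    for i in range(n):
        array[i] = PatternCount(symbol, ExtendedGenome[i:i+(n//2)])
    return array

def PatternCount(Pattern, Text):
    # count the (possibly overlapping) occurrences of Pattern in Text
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count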



In [3]:
# Test 0: sample dataset (your code is not run on this dataset)
test = "AAAAGGGG"
symbol = "A"
# each key is a window start position; each value is the count of symbol in that window
out = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}
assert(SymbolArray(test,symbol)==out)
#print (SymbolArray(test,symbol))
# Test 1: full dataset

test="AGCGTGCCGAAATATGCCGCCAGACCTGCTGCGGTGGCCTCGCCGACTTCACGGATGCCAAGTGCATAGAGGAAGCGAGCAAAGGTGGTTTCTTTCGCTTTATCCAGCGCGTTAACCACGTTCTGTGCCGACTTT"
symbol = "CC"

out =   {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7, 7: 6, 8: 6, 9: 6, 10: 6, 11: 6, 12: 6, 13: 6, 14: 6, 15: 6, 16: 6, 17: 5, 18: 5, 19: 5, 20: 4, 21: 4, 22: 4, 23: 4, 24: 4, 25: 3, 26: 3, 27: 3, 28: 3, 29: 3, 30: 3, 31: 3, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 3, 40: 3, 41: 3, 42: 3, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2, 48: 2, 49: 2, 50: 3, 51: 3, 52: 3, 53: 3, 54: 3, 55: 3, 56: 3, 57: 3, 58: 2, 59: 2, 60: 2, 61: 2, 62: 3, 63: 3, 64: 3, 65: 3, 66: 3, 67: 3, 68: 3, 69: 3, 70: 3, 71: 3, 72: 3, 73: 3, 74: 3, 75: 3, 76: 4, 77: 4, 78: 4, 79: 4, 80: 4, 81: 4, 82: 4, 83: 4, 84: 4, 85: 4, 86: 5, 87: 5, 88: 5, 89: 6, 90: 6, 91: 6, 92: 6, 93: 6, 94: 7, 95: 7, 96: 7, 97: 7, 98: 7, 99: 7, 100: 7, 101: 7, 102: 7, 103: 7, 104: 6, 105: 6, 106: 6, 107: 7, 108: 7, 109: 7, 110: 7, 111: 7, 112: 8, 113: 8, 114: 8, 115: 8, 116: 7, 117: 7, 118: 7, 119: 7, 120: 7, 121: 7, 122: 7, 123: 7, 124: 7, 125: 7, 126: 7, 127: 8, 128: 7, 129: 7, 130: 7, 131: 7, 132: 7, 133: 7, 134: 7}
assert(SymbolArray(test,symbol)==out)

In [4]:
with open('E_coli.txt', 'r') as file:
    e_coli = file.read()

#SymbolArray(e_coli,"C")
# Can take days to finish! Each window is recounted from scratch, but when the window slides by one position only two symbols matter: the one that leaves the window and the one that enters it.

In [1]:
def FasterSymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]

    # look at the first half of Genome to compute first array value
    array[0] = PatternCount(symbol, Genome[0:n//2])

    for i in range(1, n):
        # start by setting the current array value equal to the previous array value
        array[i] = array[i-1]

        # the current array value can differ from the previous array value by at most 1
        if ExtendedGenome[i-1] == symbol:
            array[i] = array[i]-1
        if ExtendedGenome[i+(n//2)-1] == symbol:
            array[i] = array[i]+1
    return array
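
A quick way to gain confidence in the incremental update is to compare it with the brute-force count on small inputs. The sketch below restates both approaches under the hypothetical names `naive_symbol_array` and `fast_symbol_array` so that it runs on its own; it assumes a single-character symbol, because the update compares one character of the genome at a time and would never match a longer pattern such as "CC".

```python
import random

def naive_symbol_array(genome, symbol):
    """Recount symbol in every half-genome window (single-character symbol assumed)."""
    n = len(genome)
    extended = genome + genome[0:n // 2]
    return {i: extended[i:i + n // 2].count(symbol) for i in range(n)}

def fast_symbol_array(genome, symbol):
    """Update the previous count instead of recounting each window."""
    n = len(genome)
    extended = genome + genome[0:n // 2]
    array = {0: genome[0:n // 2].count(symbol)}
    for i in range(1, n):
        array[i] = array[i - 1]
        # the symbol at position i-1 has just left the window
        if extended[i - 1] == symbol:
            array[i] -= 1
        # the symbol at position i+n//2-1 has just entered the window
        if extended[i + n // 2 - 1] == symbol:
            array[i] += 1
    return array

# Sample dataset from the earlier test cell.
sample = fast_symbol_array("AAAAGGGG", "A")
print(sample)

# Randomized comparison on short genomes (the seed is fixed for repeatability).
random.seed(0)
for _ in range(100):
    genome = "".join(random.choice("ACGT") for _ in range(random.randint(2, 40)))
    assert fast_symbol_array(genome, "C") == naive_symbol_array(genome, "C")
```

The brute-force version does work proportional to the window length for every start position, while the incremental version does a constant amount of work per position, which is why only the latter is practical on a genome the size of E. coli.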

In [5]:
FasterSymbolArray(e_coli,"C")

{0: 579589,
 1: 579589,
 2: 579590,
 3: 579589,
 4: 579589,
 5: 579590,
 6: 579591,
 7: 579591,
 8: 579590,
 9: 579591,
 10: 579591,
 11: 579592,
 12: 579591,
 13: 579592,
 14: 579592,
 15: 579592,
 16: 579591,
 17: 579591,
 18: 579591,
 19: 579590,
 20: 579590,
 21: 579590,
 22: 579590,
 23: 579591,
 24: 579591,
 25: 579591,
 26: 579590,
 27: 579591,
 28: 579591,
 29: 579591,
 30: 579592,
 31: 579593,
 32: 579594,
 33: 579594,
 34: 579594,
 35: 579594,
 36: 579593,
 37: 579593,
 38: 579593,
 39: 579593,
 40: 579593,
 41: 579593,
 42: 579593,
 43: 579594,
 44: 579594,
 45: 579594,
 46: 579594,
 47: 579595,
 48: 579595,
 49: 579595,
 50: 579595,
 51: 579595,
 52: 579595,
 53: 579595,
 54: 579595,
 55: 579596,
 56: 579596,
 57: 579596,
 58: 579597,
 59: 579597,
 60: 579596,
 61: 579596,
 62: 579596,
 63: 579596,
 64: 579596,
 65: 579596,
 66: 579597,
 67: 579596,
 68: 579597,
 69: 579597,
 70: 579596,
 71: 579597,
 72: 579598,
 73: 579597,
 74: 579597,
 75: 579597,
 76: 579597,
 77: 5795... (output truncated)

The figure below visualizes the symbol array for E. coli and symbol equal to "C". Notice the clear pattern in the data! The maximum value of the array occurs around position 1600000, and the minimum value of the array occurs around position 4000000. We can therefore infer that the reverse half-strand begins around position 1600000, and that the forward half-strand begins around position 4000000. Because we know that ori occurs where the reverse half-strand transitions to the forward half-strand, we have discovered that ori is located in the neighborhood of position 4000000 of the E. coli genome, without ever needing to put on a lab coat!
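
The positions of the extremes can also be read directly from the symbol array rather than from a plot. The following is a minimal sketch, assuming the array is a dictionary that maps window start positions to counts, as returned by the functions above; the helper name `extreme_positions` is hypothetical, and the toy array is the one from the sample dataset, not E. coli data.

```python
def extreme_positions(array):
    """Return (start of the window with fewest symbols, start of the window with most)."""
    min_pos = min(array, key=array.get)
    max_pos = max(array, key=array.get)
    return min_pos, max_pos

# Symbol array for "AAAAGGGG" and symbol "A" from the sample dataset.
toy = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}
min_pos, max_pos = extreme_positions(toy)
print(min_pos, max_pos)   # 4 0
```

On a real genome, the counts fluctuate locally, so several positions may tie for or come close to the extreme value; `min` and `max` return the first such position, which is why the text speaks only of a neighborhood of ori.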