However, before concluding that we have found the DnaA box of Vibrio cholerae, the careful bioinformatician should check if there are other short regions in the Vibrio cholerae genome with multiple occurrences of "ATGATCAAG" (or "CTTGATCAT"). After all, maybe these strings occur as repeats throughout the entire Vibrio cholerae genome, rather than just in the ori region. This discussion implies the following computational problem.

Pattern Matching Problem:  Find all occurrences of a pattern in a string. 
     Input: Strings Pattern and Genome.
     Output: All starting positions in Genome where Pattern appears as a substring.

We feel confident that you are ready to solve the Pattern Matching Problem on your own.  But we will give you a hint: note how similar this problem is to counting the number of occurrences of a pattern within a longer text.  In both problems, we range over the text with a window whose length is equal to the length of the pattern.  The only way in which our solution to the Pattern Matching Problem differs is that rather than counting the number of occurrences of the pattern, we first form an empty list and then append each starting position of the pattern to it when we encounter a match.

In [2]:
Pattern = "ATAT"
Genome = "GATATATGCATATACTT"

In [11]:
def PatternMatching(Pattern, Genome):
    positions = []  # 0-based starting positions of every match
    k = len(Pattern)  # window size
    for i in range(len(Genome) - k + 1):
        # compare the length-k window starting at i with Pattern
        if Genome[i:i+k] == Pattern:
            positions.append(i)
    return positions

In [12]:
PatternMatching(Pattern, Genome)

[1, 3, 9]

In [13]:
v_cholerae = "ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCATGTTTCCTTAACCCTCTATTTTTTACGGAAGAATGATCAAGCTGCTGCTCTTGATCATCGTTTC"
Pattern = "CTTGATCAT" 
PatternMatching(Pattern, v_cholerae)

[397, 468, 525]

After solving the Pattern Matching Problem, we discover that "ATGATCAAG" appears 17 times in the Vibrio cholerae genome, at the following starting positions:

116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527,
307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368

With the exception of the three occurrences of "ATGATCAAG" in ori at starting positions 151913, 152013, and 152394, no other instances of "ATGATCAAG" form “clumps”, i.e., appear close to each other in a small region of the genome. The preceding exercise verifies that the same conclusion is reached when searching for "CTTGATCAT". We now have strong statistical evidence that "ATGATCAAG"/"CTTGATCAT" may represent the hidden message to DnaA to start replication.
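The clump check described above can be sketched in a few lines. This is only an illustration: the helper name `find_position_clumps` is hypothetical, and the window length of 500 and the threshold of 3 occurrences are assumptions chosen to match the size of the ori region, not values fixed by the text at this point.

```python
def find_position_clumps(positions, k, L, t):
    """Return every run of t consecutive occurrences that fits in a window of length L.

    positions: starting positions of one k-mer in the genome.
    A run fits when the last k-mer in it ends no more than L characters
    after the first one starts.
    """
    positions = sorted(positions)
    clumps = []
    for i in range(len(positions) - t + 1):
        first = positions[i]
        last = positions[i + t - 1]
        if last + k - first <= L:
            clumps.append(tuple(positions[i:i + t]))
    return clumps


# Starting positions of "ATGATCAAG" listed above
atgatcaag_positions = [
    116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527,
    307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368,
]

# Assumed parameters: 9-mer, window of 500, at least 3 occurrences
print(find_position_clumps(atgatcaag_positions, k=9, L=500, t=3))
# [(151913, 152013, 152394)]
```

Sorting the positions first means only consecutive occurrences need to be compared, so the check is a single pass over the list.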

STOP and Think: Is it safe to conclude that "ATGATCAAG"/"CTTGATCAT" also represents a DnaA box in other bacterial genomes?

We should not jump to the conclusion that "ATGATCAAG"/"CTTGATCAT" is a hidden message for all bacterial genomes without first checking whether it even appears in known ori regions from other bacteria. After all, maybe the clumping effect of "ATGATCAAG"/"CTTGATCAT" in the ori region of Vibrio cholerae is simply a statistical fluke that has nothing to do with replication. Or maybe different bacteria have different DnaA boxes . . .



In [20]:
# Let's check the proposed ori region of Thermotoga petrophila, a bacterium that
# thrives in extremely hot environments; its name derives from its discovery in the
# water beneath oil reservoirs, where temperatures can exceed 80 °C.
def PatternCount(Text, Pattern):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count

Text = "aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactgaaactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaattacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaaacaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggtttctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattcaagattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtatccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggtaagttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaacctaccacctgcgtcccctattatttactactactaataatagcagtataattgatctga"
Pattern = "ATGATCAAG"
count_1 = PatternCount(Text, Pattern.lower())
print(count_1)

0


In [21]:
Pattern = "CTTGATCAT"
count_2 = PatternCount(Text, Pattern.lower())
print(count_2)

0


In [22]:
#Test Output
Text = "GCGCG"
Pattern = "GCG"
PatternCount(Text, Pattern)

2

This region does not contain a single occurrence of "ATGATCAAG" or "CTTGATCAT"! Thus, different bacteria may use different DnaA boxes as “hidden messages” to the DnaA protein.
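Because a DnaA box can sit on either strand, it can be convenient to count a k-mer and its reverse complement in one pass. The sketch below is one possible way to do that; `reverse_complement` and `count_both_strands` are hypothetical helper names, and the short test string is made up for illustration rather than taken from a real genome.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(pattern):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


def count_both_strands(text, pattern):
    """Count overlapping occurrences of pattern or its reverse complement in text."""
    text = text.upper()        # accept lowercase sequences such as the ori region above
    pattern = pattern.upper()
    k = len(pattern)
    targets = {pattern, reverse_complement(pattern)}
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] in targets)


print(reverse_complement("ATGATCAAG"))  # CTTGATCAT
# Made-up string containing one copy of each 9-mer
print(count_both_strands("ttATGATCAAGccCTTGATCATgg", "ATGATCAAG"))  # 2
```

Normalizing the case inside the function avoids the need to call `.lower()` on the pattern before each search.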

Application of the Frequent Words Problem to the ori region above reveals that the following six 9-mers appear in this region three or more times:

"AACCTACCA"  "AAACCTACC"  "ACCTACCAC"
"CCTACCACC"  "GGTAGGTTT"  "TGGTAGGTT"

Something peculiar must be happening because it is extremely unlikely that six different 9-mers will occur so frequently within the same short region in a random string. We will cheat a little and consult with Ori-Finder, a software tool for finding replication origins in DNA sequences. This software chooses "CCTACCACC" (along with its reverse complement "GGTGGTAGG") as a working hypothesis for the DnaA box in Thermotoga petrophila. Together, these two complementary 9-mers appear five times in the replication origin:

aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttGGTGGTAGGttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaaCCTACCACCaaactctgtattgaccattttaggacaacttcagGGTGGTAGGttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttaCCTACCACCcgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
CCTACCACCtgcgtcccctattatttactactactaataatagcagtataattgatctga
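The Frequent Words search used above to list the 9-mers that occur three or more times can be sketched as follows. This is a minimal version built on `collections.Counter`, not necessarily the implementation used to produce the list; the function name `frequent_kmers` and the short toy string are assumptions made for illustration.

```python
from collections import Counter


def frequent_kmers(text, k, t):
    """Return the k-mers occurring at least t times in text (overlaps allowed), sorted."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return sorted(kmer for kmer, n in counts.items() if n >= t)


# Toy string for illustration; for the ori region one would pass Text, k=9, t=3
toy = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
print(frequent_kmers(toy, k=4, t=3))  # ['CATG', 'GCAT']
```

Counting every window once and filtering afterwards keeps the run time linear in the length of the text for a fixed k.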

Now imagine that you are trying to find ori in a newly sequenced bacterial genome. Searching for “clumps” of "ATGATCAAG"/"CTTGATCAT" or "CCTACCACC"/"GGTGGTAGG" is unlikely to help, since this new genome may use a completely different hidden message! Before we lose all hope, let’s change our computational focus: instead of finding clumps of a specific k-mer, let’s try to find every k-mer that forms a clump in the genome. Hopefully, the locations of these clumps will shed light on the location of ori.

Our plan is to slide a window of fixed length L along the genome, looking for a region where a k-mer appears several times in short succession. The parameter value L = 500 reflects the typical length of ori in bacterial genomes.

We think of a k-mer as a “clump” if it appears many times within a short interval of the genome. More formally, given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (longer) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times. (This definition assumes that the k-mer completely fits within the interval.) For example, "TGCA" forms a (25, 3)-clump in the following Genome:

From our previous examples of ori regions, "ATGATCAAG" forms a (500, 3)-clump in the Vibrio cholerae genome, and "CCTACCACC" forms a (500, 3)-clump in the Thermotoga petrophila genome. We are now ready to formulate the following problem.

Clump Finding Problem:  Find patterns forming clumps in a string. 
     Input: A string Genome, and integers k, L, and t. 
     Output: All distinct k-mers forming (L, t)-clumps in Genome.
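A straightforward, unoptimized way to approach the Clump Finding Problem is to count the k-mers inside every window of length L and keep those that reach the threshold t. The sketch below takes that route; it is not the algorithm referred to in the text, the name `find_clumps` is hypothetical, and the short genome is an illustrative string in which "TGCA" occurs three times within 25 characters.

```python
from collections import Counter


def find_clumps(genome, k, L, t):
    """Return the set of k-mers that occur at least t times in some length-L window."""
    clumps = set()
    # If the genome is shorter than L, the range is empty and no clumps are reported.
    for start in range(len(genome) - L + 1):
        window = genome[start:start + L]
        counts = Counter(window[i:i + k] for i in range(L - k + 1))
        clumps.update(kmer for kmer, n in counts.items() if n >= t)
    return clumps


# Illustrative string (an assumption, not real genome data)
toy_genome = "GATCAGCATAAGGGTCCCTGCAATGCATGACAAGCCTGCAGTTGTTTTAC"
print("TGCA" in find_clumps(toy_genome, k=4, L=25, t=3))  # True
```

Recounting each window from scratch costs roughly len(genome) × L operations, which is fine for short strings but slow for whole genomes; updating the counts incrementally as the window slides would be the natural refinement.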

Don’t worry about writing an algorithm to solve the Clump Finding Problem; we have done it for you. When we used this algorithm to look for clumps in the genome of Escherichia coli (E. coli), the workhorse of bacterial genomics, we found hundreds of different 9-mers forming (500, 3)-clumps. It is entirely unclear which of these 9-mers might represent a DnaA box in the bacterium’s ori region.

STOP and Think: Should we give up? If not, what would you do now?

In [23]:
#Quiz
PatternCount("CGCGATACGTTACATACATGATAGACCGCGCGCGATCATATCGCGATTATC", "CGCG")

5