# Where in the Genome Does DNA Replication Begin?

I omit these sections:
* 1.1 The Simplest Way to Replicate DNA
* 1.2 Asymmetry of Replication

They are basically a recap of how DNA replication works.



## 1.3 Peculiar Statistics of the Forward and Reverse Half-Strands
### Minimum Skew Problem
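It turns out there is an interesting phenomenon called GC skew: a measure of the difference in the frequency of guanine (G) and cytosine (C) between the leading and lagging strands during DNA replication. In many organisms, the leading strand (which is synthesized continuously) tends to have a higher frequency of G and T, while the lagging strand (which is synthesized discontinuously) tends to have a higher frequency of A and C. The function below walks along the genome and keeps a running skew, #G - #C, over each prefix. The positions where this running skew reaches its minimum are the candidate locations of the origin of replication (ori).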

In [None]:
def minimum_skew_problem(dna):
    skew = [0]
    for i in range(len(dna)):
        if dna[i] == 'C':
            skew.append(skew[i] - 1)
        elif dna[i] == 'G':
            skew.append(skew[i] + 1)
        else:
            skew.append(skew[i])
    min_skew = min(skew)
    return [i for i, x in enumerate(skew) if x == min_skew]

# Example usage
dna = 'TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT'
print(minimum_skew_problem(dna))  # Output: [11, 24]

[11, 24]
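
To see what the function tracks internally, here is a minimal sketch that returns the whole running skew array instead of only the minimum positions. The helper name `skew_array` and the short sample string are illustrative assumptions, not part of the code above.

```python
from itertools import accumulate

def skew_array(dna):
    """Hypothetical helper: running #G - #C for every prefix of dna."""
    # +1 for G, -1 for C, 0 for A and T
    steps = [1 if base == 'G' else -1 if base == 'C' else 0 for base in dna]
    # initial=0 is the skew of the empty prefix, so the result has len(dna) + 1 entries
    return list(accumulate(steps, initial=0))

print(skew_array('CATGGGCATCGGCCATACGCC'))
```

For this sample the array ends at its lowest value, -2, so the only minimum-skew position is the last index, 21.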


## 1.4 Some Hidden Messages are More Elusive than Others
Now that we have found the (approximate) position of the ori using the GC skew, we would like to identify which k-mers in that region are the DnaA boxes.
In the previous lecture, we used frequent-word searching to count k-mers. However, it turns out that you might not find any frequent words around the ori at all. What gives?

It turns out that the DNA fragments where DnaA binds don't have to be exactly the same. The binding tolerates a few nucleotide mismatches!

### Hamming Distance Problem
That being said, we now have to alter our k-mer search to allow some mismatches. But first, let us measure how far a mismatched k-mer is from the original one. We use the Hamming distance for this.

We say that position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two mismatches. The number of mismatches between strings p and q is called the Hamming distance between these strings and is denoted HammingDistance(p, q).

In [1]:
def hamming_distance(p, q):
    return sum([1 for i in range(len(p)) if p[i] != q[i]])

# Example usage
p = 'GGGCCGTTGGT'
q = 'GGACCGTTGAC'
print(hamming_distance(p, q))  # Output: 3

3
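
The definition only makes sense for strings of equal length. Below is a small sketch of a variant that checks this explicitly; the name `hamming_distance_checked` and the choice to raise `ValueError` are assumptions made for illustration.

```python
def hamming_distance_checked(p, q):
    """Hypothetical variant: Hamming distance that rejects unequal lengths."""
    if len(p) != len(q):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    # zip pairs the characters position by position; each True counts as 1 in sum()
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance_checked('CGAAT', 'CGGAC'))  # 2, the example from the text
```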


### Approximate Pattern (Similar Pattern) Matching Problem
Now that we have the Hamming distance, we can do pattern matching that tolerates mismatches.

We say that a k-mer `Pattern` appears as a substring of `Text` with at most `d` mismatches if there is some k-mer substring of `Text` whose Hamming distance from `Pattern` is `d` or less. The function below returns the starting positions of all such substrings.

In [2]:
def approximate_pattern_matching(pattern, text, d):
    positions = []
    for i in range(len(text) - len(pattern) + 1):
        if hamming_distance(pattern, text[i:i + len(pattern)]) <= d:
            positions.append(i)
    return positions

# Example usage
pattern = 'ATTCTGGA'
text = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
d = 3
print(approximate_pattern_matching(pattern, text, d))  # Output: [6, 7, 26, 27]

[6, 7, 26, 27]


### Counting the Approximate Pattern (Similar Pattern)
We would also like to count how many k-mers in `Text` approximately match the original k-mer, that is, how many differ from it in at most `d` positions.

You can do this with the code below, or simply take the length of the list returned by the `approximate_pattern_matching` function above.

In [3]:
def approximate_pattern_count(pattern, text, d):
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if hamming_distance(pattern, text[i:i + len(pattern)]) <= d:
            count += 1
    return count

# Example usage
pattern = 'GAGG'
text = 'TTTAGAGCCTTCAGAGG'
d = 2
print(approximate_pattern_count(pattern, text, d))  # Output: 4

4
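
As a quick check of the shortcut mentioned earlier, the sketch below counts matches by taking the length of the position list. The two functions are redefined in compact form only so that the snippet runs on its own.

```python
def hamming_distance(p, q):
    return sum(a != b for a, b in zip(p, q))

def approximate_pattern_matching(pattern, text, d):
    k = len(pattern)
    # keep every start index whose window is within d mismatches of pattern
    return [i for i in range(len(text) - k + 1)
            if hamming_distance(pattern, text[i:i + k]) <= d]

positions = approximate_pattern_matching('GAGG', 'TTTAGAGCCTTCAGAGG', 2)
print(positions)       # [2, 4, 11, 13]
print(len(positions))  # 4, same as approximate_pattern_count
```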


### Frequent Words with Mismatches
With `approximate_pattern_matching` and `approximate_pattern_count` in hand, we can reformulate frequent-word searching from exact matches to approximate matches.

The code below finds the most frequent k-mers with up to `d` mismatches and returns them as a set. Note that a most frequent k-mer does not need to appear exactly in `text`. For every k-mer window in `text`, we use the `neighbors` function to generate all k-mers that differ from it in at most `d` positions (the course calls these its neighbors) and add one to the count of each.

In [11]:
def frequent_words_with_mismatches(text, k, d):
    patterns = set()
    freq_map = {}
    n = len(text)
    max_count = 0
    for i in range(n - k + 1):
        pattern = text[i:i + k]
        neighborhood = neighbors(pattern, d)
        for neighbor in neighborhood:
            if neighbor in freq_map:
                freq_map[neighbor] += 1
            else:
                freq_map[neighbor] = 1
            if freq_map[neighbor] > max_count:
                max_count = freq_map[neighbor]
    for key in freq_map:
        if freq_map[key] == max_count:
            patterns.add(key)
    return patterns

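def neighbors(pattern, d):
    # All k-mers whose Hamming distance from pattern is at most d.
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return {'A', 'C', 'G', 'T'}
    neighborhood = set()
    # Recurse on the suffix, then decide what the first symbol may be.
    suffix_neighbors = neighbors(pattern[1:], d)
    for text in suffix_neighbors:
        if hamming_distance(pattern[1:], text) < d:
            # Mismatch budget left: any first symbol is allowed.
            for x in ['A', 'C', 'G', 'T']:
                neighborhood.add(x + text)
        else:
            # Budget used up: the first symbol must match the pattern.
            neighborhood.add(pattern[0] + text)
    return neighborhood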

# Example usage
text = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
k = 4
d = 1
print(frequent_words_with_mismatches(text, k, d))  # Output (a set, order may vary): {'ATGT', 'GATG', 'ATGC'}

{'GATG', 'ATGC', 'ATGT'}
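
For comparison, here is a non-recursive sketch of the same neighborhood idea: pick which positions to change, then try every substitution at those positions. The function name `neighbors_by_substitution` is hypothetical, and this is one possible alternative rather than the approach used in the course.

```python
from itertools import combinations, product

def neighbors_by_substitution(pattern, d):
    """Hypothetical alternative: all strings within Hamming distance d of pattern."""
    alphabet = 'ACGT'
    result = {pattern}  # distance 0
    # m is the exact number of positions we change
    for m in range(1, min(d, len(pattern)) + 1):
        for positions in combinations(range(len(pattern)), m):
            # at each chosen position, only letters different from the original
            choices = [[c for c in alphabet if c != pattern[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(pattern)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                result.add(''.join(chars))
    return result

print(sorted(neighbors_by_substitution('ACG', 1)))
```

For `'ACG'` with `d = 1` this gives 10 strings: the pattern itself plus 3 substitutions at each of the 3 positions.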


### Frequent Words with Mismatches and Reverse Complements Problem
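Now that we have all the necessary functions, we can build one more that takes both strands into account. DnaA boxes can appear on either strand, so instead of counting only approximate occurrences of a k-mer, we count approximate occurrences of the k-mer plus approximate occurrences of its reverse complement, and return the k-mers that maximize this sum.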

In [7]:
def reverse_complement(pattern):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(complement[nucleotide] for nucleotide in reversed(pattern))

def frequent_words_with_mismatches_and_reverse_complements(dna, k, d):
    # kmer_counts[p] accumulates the approximate count of p
    # plus the approximate count of reverse_complement(p).
    kmer_counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i+k]
        kmer_neighbors = neighbors(kmer, d)
        for neighbor in kmer_neighbors:
            # This window approximately matches neighbor, so it counts toward
            # the score of neighbor and of its reverse complement.
            reverse_neighbor = reverse_complement(neighbor)
            if neighbor not in kmer_counts:
                kmer_counts[neighbor] = 0
            if reverse_neighbor not in kmer_counts:
                kmer_counts[reverse_neighbor] = 0
            kmer_counts[neighbor] += 1
            kmer_counts[reverse_neighbor] += 1
    max_count = max(kmer_counts.values())
    return [kmer for kmer, count in kmer_counts.items() if count == max_count]


# Example usage
text = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
k = 4
d = 1
print(frequent_words_with_mismatches_and_reverse_complements(text, k, d))  # Output (a list): ['ATGT', 'ACAT']

['ATGT', 'ACAT']
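
The two k-mers in the result are reverse complements of each other, which is expected because both get the same combined score. The sketch below checks this with an alternative reverse-complement helper built on `str.translate`; the name `reverse_complement_translate` is an assumption for illustration.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement_translate(pattern):
    """Hypothetical variant: complement every base, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]

print(reverse_complement_translate('ATGT'))        # ACAT
print(reverse_complement_translate('AAAACCCGGT'))  # ACCGGGTTTT
```

One difference from the dictionary-based version: `str.translate` leaves characters that are not in the table (for example `N`) unchanged, whereas the dictionary lookup raises `KeyError` on them.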
