# Week 2

In [25]:
def BetterFrequentWords(Text, k):
    '''
    Function to find the most common k-mers of certain length in text
    '''
    
    frequentPatterns = list()
    freqMap = FrequencyTable(Text, k)

    max_freq = MaxFreq(freqMap)

    for pattern in freqMap:
        if freqMap[pattern] == max_freq:
            frequentPatterns.append(pattern)

    return frequentPatterns


def FrequencyTable(Text, k):
    '''
    Function to create dictionary of all k-mers and their counts in chosen Text
    '''
    freqMap = dict()

    for i in range(len(Text) - k + 1):
        Pattern = Text[i:i+k]
        if Pattern not in freqMap:
            freqMap[Pattern] = 1
        else:
            freqMap[Pattern] += 1
    return freqMap


def MaxFreq(Dictionary: dict):
    '''
    Function to return max value from dictionary
    '''

    maxValue = max(
        Dictionary, key=Dictionary.get
    )
    return Dictionary[maxValue]

    
def ReverseComplement(String):
    '''
    Function to return reverse complement of given DNA strand at 5->3 direction
    '''

    nucleotidesDict = {
        "A": "T", "T": "A", "G": "C", "C": "G"
    }

    reverseComplement = str()
    for i in String:
        reverseComplement += nucleotidesDict[i]
    
    return reverseComplement[::-1]

    
def SortPatternsDict(Dictionary):
    '''
    Function return dictionary sorted by value.
    '''

    sortedDict = {
        key: value for key, value in sorted(Dictionary.items(), key=lambda item: item[1], reverse=True)
    }
    return sortedDict

In our previous discussion, we saw that the skew is decreasing along the reverse half-strand and increasing along the forward half-strand. Thus, the skew should achieve a minimum at the position where the reverse half-strand ends and the forward half-strand begins, which is exactly the location of ori!

### Skew Plots

In [3]:
def Skew(String, print_skew=False):
    skew = 0
    skew_list = [0]

    for i in String:
        if i == 'C':
            skew -= 1
        elif i == 'G':
            skew += 1
        skew_list.append(skew)
    if print_skew:
        print(*skew_list)

    return skew_list


def SkewMinIndices(String, print_indices=False):
    skewList = Skew(String)
    min_skew = min(skewList)

    indices = [
        index for index, element in enumerate(skewList) if element==min_skew
    ]
    
    if print_indices:
        print(*indices)
    else:
        return indices


In [9]:
SkewMinIndices(
    String='TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT',
    print_indices=True
)

11 24


Solving the Minimum Skew Problem now provides us with an approximate location of ori at position of E. coli. 

In [10]:
with open("/Users/olegsuchalko/BioinformaticsAlgorithms/Part_1/E_coli.txt") as genome_file:
    ecoli_genome = genome_file.readline()
    
SkewMinIndices(String=ecoli_genome, print_indices=True)

3923620 3923621 3923622 3923623


In [11]:
BetterFrequentWords(
    Text=ecoli_genome[3923620-500:3923620+500],
    k=9
)

['AAGGATCCG', 'AGGATCCGG']

In [12]:
x = SortPatternsDict(
    FrequencyTable(
        Text=ecoli_genome[3923620-500:3923620+500],
        k=9
    )
)
for i in x:
    if x[i] == MaxFreq(x):
        print(i, '\t', x[i]) 

AAGGATCCG 	 2
AGGATCCGG 	 2


DnaA can bind not only to "perfect" DnaA boxes but to their slight variations as well!

We will calculate Hamming distance to compare, how similar are our patterns:

In [13]:
def HammingDistance(p, q):
    hammingDist = 0

    for i in range(len(p)):
        if p[i] != q[i]:
            hammingDist += 1

    return hammingDist


HammingDistance('GGGCCGTTGGT', 'GGACCGTTGAC')

3

In [14]:
HammingDistance(
    'CAGAGATGCACTTAGGAAGCCCTTAAGCTAGTTGCCTGACGAGGAGCGAATGGTAGACCGGTGGAACCACTGCGGGTACGATCCTCTGGGCAGGAACCCTCAAACTGCCCTTCCTGGGGGCGATCACCATAGGTCAATTTCCTGCCATCCTCAACGTTATTCGGTGATGCGTTATCTGGGAGGTTTCTGTGAAGAGAGCGGCGTCCTGTCGGCGATCGCGCCCGGTTATTGACTTTGGAAAATGTTCCAAGGATCACTCGTACGAATCAATGCATGGCTCTTGACCCCTAACCTGCAAGTTTACTAGCACGTAGTGAATTTGAACGAATGGTATGGGAGGTTACGTCTTGAATGGCCCGTATCAAGCCATCGGACCTTACAATAGTTCGATATAGTTTGTTAGTAATCGACAGCGGTAATGTTATGTAGCCACCGAATATCCGGCCTAATCTACCTCGTACCAAGGGTTCTGTCAGTGAGCCTTCATGAATCTGACGGGCTAGCAGTAGGACGGCCAGCGTAACTTTACAAGCTTCTAGAGCGTACGGAAATTGGTCGCATGGGTCTGCTCTGTCTAATCTGGATACAGGGGCCATCCGGCGGACTTTCGTGCTTGTAGTAGGTCACCTATTCTCCACAAAAGACTAACTTACTACAATTAGCTCAGAATCCTTGGACCCGAAGACACTAGCAGCTGGCATCAATCAGACTGCCTCGGACTCTAGTATCGAGCGTGGTCAGGCTCCAGTCTTTGGATCGTTTGCTAGCCCCTCCCCAAATATTTAGGATAGACGCGGCGCGGCCGGCAAGGCGTCTGACCGGGAGTATTTGTGACTAGTGGGTAGTTTAAATCTTTGAGGGCTTGTTCCTCGTAGATTGCACTACGCGTCGCAGAGTTCACCCATGACTTCCCAATCGTCTAGTCAAAACTGCTGGAAGTTCAGACTATCAGCTTATGCTCTGTTGTTCCTCGTTTACTCTTACATTTGATCAAGCAAGCGGTCAATACTTTGGACCGGAGTTAGCACAGGGCAATCGATGCGGACGCCTAACTAGCGTTTCTAATGAGCAAAATGTGAACAAATTGTCGCCGCACTACCGGTTGGATAAAGCCCTCTGCACCCTATAGCCGGAAAACGCGCAGAGCTCCCCACCCTGCCGACATCCCATAATAATAC',
    'GTTCGACTATGGTTAATACTTTGAGCGCGGGGGATGCTCCGCTATCTTCGTACGTACAGTACATTTAGTGCAGAGAGACCTTCGTCCGATGTGAATTGGGCATCTGTTGTGATGTAGACTAAACTATGATAATTCGCGTATGCCTAGCCTGCATATAATGGGCAGCTGCAAATCCGACGCCTGTGCTCCTTACTTAAGTAAACGGAACTAGCCTTCAGAAAAGCACGATGCGATGATTAGAGTAAAGAACGGGACAATTCGCGTTCTTCTAAAAAACTTTAACACTGCAGTATGCCTTTGGCTGGTTAGTACTCACTTTACGGTTCGGCAGACCCGTACAATTCAAGGTTTCTACTGGGATCCTACGCTTTGCGTCTACATGCCCCCTCAACTCAAATGGGATCCGCGAGATTATATCGCGCACCTGGGTGCACGCTGGTGTTGTTAAAAAATTCGCGGGCCCCCCTAACAAGCCAGCTGTTACCCGACTGCCTATATTCACCGCCCTGAATACGGCCCAGGAGCAGTTCGAGGCCGCGAAACAGAACCAGACAATTGAGCACTGCGGCGCTGAAATGGGGCCGTCCTTTTGCCTTTCTCAGAGAGTGCTGATCCTACTGTGCCGTCCCGTTGTGCGCAATATCGTGCGTATCTTCTCAAGGAGTGGCTACAGGATGCCGACGACCTCACACGCTCGAGCGAGAACGGGTGGTTTGAAAAAGACGCAGCCACTGCGGCTTGAAAGTACGGGTTAGTCCCAAAGTTTCAGATGAACCGTGCGTTCAACACAGAGTTGATTGACTGACGCTGAGGCTAGTGGAATAGTAGAATAGTTATTCTCCTGGACTTATCCATCAACTTGGACTACTGGTCGGAGCCCTTCTAGTGACTTACCTTCGAAGTCCACCTTGTTCGTTAATACAACCTACCCGAAGAGCAGCCCCTCTGATTCGCCACGGACCAGCTTGAGAATCCTCTGATAAAACTACCAATTCTCACGGGCTACGTCAGTTGTAGTCTAGCCTATTCGCCACCAGGCGCCTCGGATTGCATGAAGAGACAGGCTGCGATGCTCCGGAGCTCCCTGTGTATGCGGCCAGTAAACCTAGAAACAAGCAAAATGTTCACATCGCGTTCACCCCGGAACGGTTCGTCGCTAAAGAAGGGTAAAACGACTC'
)

896

Then we try to find Matches of Pattern or similar Patterns

In [15]:
def ApproximatePatternMatching(Text, String, d, print_values=False):

    patternLength = len(String)
    patternMatches = list()

    for i in range(len(Text) - patternLength + 1):
        Window = Text[i:i+patternLength]

        if Window == String:
            patternMatches.append(i)
        elif HammingDistance(Window, String) <= d:
            patternMatches.append(i)
    
    if print_values:
        print(*patternMatches)
    else:
        return patternMatches


def CountSimilarPatterns(Text, String, d):
    return len(ApproximatePatternMatching(Text, String, d))


In [16]:
ApproximatePatternMatching(
    String='ATTCTGGA',
    Text='CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT',
    d=3,
    print_values=True
)

CountSimilarPatterns(
    String='ATTCTGGA',
    Text='CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT',
    d=3
)

6 7 26 27


4

Now we want to count Pattern and similar Patterns occurance in Text

In [17]:
CountSimilarPatterns(
    Text='GTCACCATTGTCGCCCCTGTTTCATTGTTTGAAATTATAATAGGCGGGCGGATTAGACTCCAGGACGTAAATCTGGAAACCTCGATGTCCTGCGACTTTGCCTTAGAGGCCGAGTGGTTCTCAACGATCCTTACGCGGAAACGATGCCCTCGTAAAAGAACACTTTAATCAAGGCCGCGTCACGTTTTTGGCGGTGTTACAGTGTTGGCCAAGGCCGCAGATCCAATTGTTGCGAAACATCTCCCTGCTTTCGCAGGATCGCGACGAACGACTTTGGCCTTAGTTGGTGTCTCTAAGTGTGACTAACGGGAGGATGTCGTTACTTACACCCGCAACGTTCACTCTTTGTA',
    String='CTCCC',
    d=2
)

36

#### Frequent Words with Mismatches Problem: 
Find the most frequent k-mers with mismatches in a string.

The collection of all such k-mers is called the d-neighborhood of Pattern, denoted Neighbors(Pattern, d)

In [18]:
def Suffix(Pattern):
    return Pattern[1:]
    

def Neighbors(Pattern, d):
    if d == 0:
        return Pattern
    if len(Pattern) == 1:
        return {'A', 'T', 'G', 'C'}
    
    Neighborhood = set()
    NeighborSuffix = Neighbors(Suffix(Pattern), d)

    for Neighbor in NeighborSuffix:
        if HammingDistance(Suffix(Pattern), Neighbor) < d:
            for i in 'ATGC':
                Neighborhood.add(i+Neighbor)
        else:
            Neighborhood.add(Pattern[0] + Neighbor)
            
    return list(Neighborhood)


Use Recursion to create set of Neighbors with distance d

In [19]:
x = Neighbors('CGGCCGGC', 2)
print(*x)

CGGCCTGT CAGCCGCC CGCCCCGC CGGCGTGC CAGCCGGT CGGCCCCC CGGCAGGA TGGACGGC AGCCCGGC CGGTTGGC CGCCCGTC CGGCTGGG CAGTCGGC CGCCCGAC CGACCGGA TGGCAGGC CGACTGGC CGCCTGGC CCGCCGGA GCGCCGGC CGGAGGGC GGGACGGC GGGCCGGT CGGCCGAC AGGCCAGC CCGTCGGC CCCCCGGC CGTCCCGC CGGCTCGC CGGCGGAC AGGCGGGC CTGGCGGC CGGCCGCC CGCTCGGC CGGCCTCC GGCCCGGC CGGCCGTG CGTCCGGG AGGCCGGT CGGCAGGG CGGCTGTC CGGACCGC CCGCCGGC CGGCCAAC CGGCAGGC AGGCTGGC CGGCTGAC CGGGCGGA CTGCCGAC CGACAGGC CAGGCGGC CGACCGGT CGGCTGGA CGGGCCGC CCGCCGCC CGGTCGGT CGGGCGAC CGGCCCGC CGGCCGGC CGGGCGCC AGTCCGGC CGGACGGG CGGGCAGC CGGCGGTC CGGTCGGG CGGACGTC ACGCCGGC AGGCCGGC CGCGCGGC CGAACGGC CGTCCGGC CGGCCCGT TGGCCGAC CTGCCGTC CGGCCGAT CGGCGAGC CAGCCGGA CCGCCGGG TTGCCGGC CGGCCAGT CGGCAAGC GGGCCGCC CGAGCGGC CGGCCTGC CGGACGGT TGGCCGGA CGCCGGGC CGGCTGCC CGGCCATC CGTCCAGC CGGCGGGA TGGCCCGC CGGTCGCC CGGCCGCT TAGCCGGC GGGTCGGC CTGCTGGC CTTCCGGC CGGGCGTC CCACCGGC TGGCCGGC CGGCAGAC CGGGAGGC CGGTCGGA CGGCACGC CCGCCGAC CGGCCGGG GGGCCAGC CGACCGTC CGTCCGCC GGGCCGGC C

Now we are ready to combine generation of **Neighbors** of variance **d** with **Frequent Words Matching** problem.

In [20]:
def FrequentWordsWithMismatches(
    Text, k, d, print_output=False
):
    Patterns = list()
    freqMap = dict()
    n = len(Text)

    for i in range(0, n-k+1):
        Pattern = Text[i:i+k] # перебираем паттерны по строке
        neighborhood = Neighbors(Pattern, d) # создаем множество соседей - паттернов, которые отличаются от нашего паттерна на d

        for j in range(len(neighborhood) - 1): # проверяем всех соседей по словарю, добавляем +1 счет
            neighbor = neighborhood[j]

            if neighbor not in freqMap:
                freqMap[neighbor] = 1
            else:
                freqMap[neighbor] += 1

    maxPattern = MaxFreq(freqMap) # находим максимумальное значение по строке в словаре и выводим все паттерны, у которых столько совпадений со строкой
    for key in freqMap:
        if freqMap[key] == maxPattern:
            Patterns.append(key)

    if print_output:
        print(*Patterns)
    else:
        return Patterns

In [21]:
FrequentWordsWithMismatches(
    Text='CAACACAAGGCAGGCAGGCAGCGCAATCTTAGGCAGGCTCTTCAACACAAGCGAGCGTCTTAGGCTCTTCACAAGCGCAATCTTCACATCTTAGGCCACAAGCGAGGCAGCGTCTTCAAAGGCCACACAACAACAAAGGCAGCGCAATCTTCAACAACACACAATCTTTCTTTCTTTCTTCAAAGGCAGGCAGGCAGGCCACAAGCGAGGCCACAAGGCAGCGAGGCAGCGAGCGCACAAGCGTCTTAGCGAGGCAGCGCACAAGCGAGGCAGCGAGCGTCTTTCTTAGGCCAATCTTTCTTAGGCAGCGCACAAGGCAGGCTCTTAGGC',
    k=7, d=2
)

['AGCAAGC']

And finally we want to add ReverseComplement into previous function:

In [22]:
def FrequentWordsWithMismatchesAndReverseComplement(
    Text, k, d, print_output=False
):
    Patterns = list()
    freqMap = dict()
    n = len(Text)

    for i in range(0, n-k+1):
        Pattern = Text[i:i+k] # перебираем паттерны по строке
        neighborhood = Neighbors(Pattern, d) + Neighbors(ReverseComplement(Pattern), d) 
        # создаем множество соседей - паттернов, которые отличаются от нашего паттерна на d, а также множество соседей для RC

        for j in range(len(neighborhood) - 1): # проверяем всех соседей по словарю, добавляем +1 счет
            neighbor = neighborhood[j]

            if neighbor not in freqMap:
                freqMap[neighbor] = 1
            else:
                freqMap[neighbor] += 1

    maxPattern = MaxFreq(freqMap) # находим максимальное значение по строке в словаре и выводим все паттерны, у которых столько совпадений со строкой
    for key in freqMap:
        if freqMap[key] == maxPattern:
            Patterns.append(key+'_x'+str(freqMap[key]))
    if print_output:
        print(*Patterns)
    else:
        return Patterns

In [23]:
FrequentWordsWithMismatches(
    Text='CCTCCGCCTTGTTGTCCTAAACCTAAACCGTGTCCTTGTAAACCGCCTTGTCCTCCTAAACCGAAACCTTGTCCTCCTTGTAAATGTCCGCCGCCTCCTCCGTGTCCGCCTCCGCCGAAATGTTGTCCGCCTAAACCGAAACCTCCTAAATGTCCGCCTTGTCCGTGTCCTAAACCTTGTAAAAAATGTCCTCCTTGTAAATGTAAACCTTGTCCTCCTCCTTGTTGTTGTCCGCCT',
    k=7, d=2, print_output=True
)

CTCCTCC TCCTCCT


In [26]:
FrequentWordsWithMismatchesAndReverseComplement(
    Text='CCTCCGCCTTGTTGTCCTAAACCTAAACCGTGTCCTTGTAAACCGCCTTGTCCTCCTAAACCGAAACCTTGTCCTCCTTGTAAATGTCCGCCGCCTCCTCCGTGTCCGCCTCCGCCGAAATGTTGTCCGCCTAAACCGAAACCTCCTAAATGTCCGCCTTGTCCGTGTCCTAAACCTTGTAAAAAATGTCCTCCTTGTAAATGTAAACCTTGTCCTCCTCCTTGTTGTTGTCCGCCT',
    k=7, d=2, print_output=True
)

CTCCTCC_x26 GGAGGAG_x26 TCCTCCT_x26 AGGAGGA_x26


E.coli genome is still too big. But we can look at possible ori region, found via Skew

In [27]:
ecoli_indices = SkewMinIndices(String=ecoli_genome)
ecoli_indices

[3923620, 3923621, 3923622, 3923623]

In [28]:
print(*sorted(FrequentWordsWithMismatchesAndReverseComplement(
    Text=ecoli_genome[
        3923620:3924120
    ], k=9, d=1
)))

AAGAGATCT_x4 AAGGATCCT_x4 AATGATCCG_x4 AGAACAACA_x4 AGATCTCTT_x4 AGCTGGGAT_x4 AGGATCCTT_x4 ATCCCAGCT_x4 CAGAAGATC_x4 CCAGGATCC_x4 CGGATCATT_x4 CTGGGATCA_x4 CTGTTGATC_x4 GATCAACAG_x4 GATCCCAGC_x4 GATCTTCTG_x4 GCTGGGATC_x4 GGATCCTGG_x4 GGTTATCCA_x4 GTGGATAAC_x4 GTTATCCAC_x4 GTTGATCCT_x4 TCTGGATAA_x4 TGATCAACA_x4 TGATCCCAG_x4 TGGATAACC_x4 TGTGAATAA_x4 TGTGGATAA_x4 TGTTGATCA_x4 TGTTGTTCT_x4 TTATCCACA_x4 TTATCCAGA_x4 TTATTCACA_x4


#### Now we can compute the most frequent k-mers with mismatches and RC!

Thus, the moral of this chapter is that even though computational predictions can be powerful, bioinformaticians should collaborate with biologists to verify their computational predictions

## Final Challenge by working with the Salmonella enterica genome

### Find a DnaA box in Salmonella enterica.

In [29]:
with open('/Users/olegsuchalko/BioinformaticsAlgorithms/Part_1/Salmonella_enterica.txt') as genome_file:
    salmonella_genome = genome_file.readlines()

In [30]:
salmonella_genome_strip = [ line.strip() for line in salmonella_genome[1:]]
salmonella_genome_seq = ''.join(salmonella_genome_strip)


len(salmonella_genome_seq)

4809037

In [31]:
SkewMinIndices(salmonella_genome_seq)

[3764856, 3764858]

In origin region, defined by Skew plot and SkewMinIndices func we can find most common k-mers with length 9 and mm rate 1

In [32]:
FrequentWordsWithMismatchesAndReverseComplement(
    Text=salmonella_genome_seq[3764786-500:3764786+500],
    k=9, d=1, print_output=True
)

TTATCCACA_x6 TGTGGATAA_x6


And those are the same as fo E.coli !