1. Implement a modification of the **Faster Frequent Words** algorithm that uses a dictionary (<code>dict</code>) to store counters only for those <code>k</code>-mers that actually occur in the string <code>text</code>.

In [5]:
def computing_frequencies_dict(text, k):
    frequency_dict = {}
    
    n = len(text)
    for i in range(0, n-k+1):
        pattern = text[i : i+k]
        
        if pattern not in frequency_dict:
            frequency_dict[pattern] = 1
        else:
            frequency_dict[pattern] += 1
            
    return frequency_dict

In [12]:
def faster_frequent_words_dict(text, k):
    frequency_dict = computing_frequencies_dict(text, k)

    # No k-mers at all (e.g. len(text) < k): nothing is frequent.
    if not frequency_dict:
        return []

    max_count = max(frequency_dict.values())

    # Keep every k-mer whose count equals the maximum.
    frequent_patterns = set()
    for pattern, count in frequency_dict.items():
        if count == max_count:
            frequent_patterns.add(pattern)

    return list(frequent_patterns)

In [13]:
oriC = 'ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCA'
k = 9

print(faster_frequent_words_dict(oriC, k))

['CTCTTGATC', 'GCATGATCA', 'TCTTGATCA', 'ATGATCAAG', 'AGCATGATC', 'AAGCATGAT']
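
As a cross-check of the dictionary-based counting above, the same idea can be sketched with <code>collections.Counter</code> from the standard library. The function name <code>frequent_words_counter</code> is hypothetical (it is not part of the exercise), and the result is sorted only to make the output order reproducible.

```python
from collections import Counter


def frequent_words_counter(text, k):
    """Return the most frequent k-mers of text, sorted alphabetically."""
    # Counter stores an entry only for k-mers that actually occur in text.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))

    # Text shorter than k (or empty): there are no k-mers to report.
    if not counts:
        return []

    max_count = max(counts.values())
    return sorted(p for p, c in counts.items() if c == max_count)


print(frequent_words_counter('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4))
```

Unlike an array indexed by all 4^k possible <code>k</code>-mers, the counter holds at most <code>len(text) - k + 1</code> entries, which is the point of the dictionary modification.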


2. Recursively implement the function <code>neighbors</code>, which finds all strings at Hamming distance at most <code>d</code> from a given string <code>pattern</code>. The algorithm first finds the neighbors of the suffix of length <code>k-1</code> (where <code>k</code> is the length of <code>pattern</code>). If a suffix neighbor differs from the suffix in fewer than <code>d</code> positions, any of the four nucleotides may be prepended to it (the first character is allowed to change); otherwise, only the first character of the original pattern is prepended. The pseudocode of the algorithm is on page 50 of the book.

In [14]:
def hamming_distance(string1, string2):
    n = len(string1)       
    
    distance = 0
    for i in range(n):
        if string1[i] != string2[i]:
            distance += 1
            
    return distance                  

In [15]:
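def neighbors(pattern, d):
    # Distance 0: the only neighbor is the pattern itself.
    if d == 0:
        return [pattern]

    # A single character with d >= 1 can become any nucleotide.
    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']

    neighborhood = set()

    # Neighbors of the suffix (pattern without its first character).
    suffix_neighbors = neighbors(pattern[1:], d)

    for suffix_pattern in suffix_neighbors:
        if hamming_distance(pattern[1:], suffix_pattern) < d:
            # Fewer than d mismatches so far: the first character may change too.
            for nucleotide in ['A', 'C', 'G', 'T']:
                neighborhood.add(nucleotide + suffix_pattern)
        else:
            # Already d mismatches: the first character must stay unchanged.
            neighborhood.add(pattern[0] + suffix_pattern)

    return list(neighborhood)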

In [16]:
pattern = 'AAA'
d = 1
print(neighbors(pattern, d))

['AAG', 'GAA', 'AGA', 'TAA', 'AAA', 'ACA', 'AAT', 'ATA', 'CAA', 'AAC']
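
One way to sanity-check a recursive neighborhood is to compare it against plain enumeration and against the closed-form neighborhood size. The sketch below is only a check, not the method from the book: the names <code>neighbors_brute_force</code> and <code>neighborhood_size</code> are hypothetical, and the enumeration visits all 4^k strings, so it is practical only for small <code>k</code>.

```python
from itertools import product
from math import comb


def neighbors_brute_force(pattern, d, alphabet='ACGT'):
    """All strings of len(pattern) within Hamming distance d, by enumeration."""
    result = set()
    for candidate in product(alphabet, repeat=len(pattern)):
        # Count positions where the candidate differs from the pattern.
        mismatches = sum(a != b for a, b in zip(pattern, candidate))
        if mismatches <= d:
            result.add(''.join(candidate))
    return result


def neighborhood_size(k, d, alphabet_size=4):
    """Choose i of the k positions to change; each has 3 alternative letters."""
    return sum(comb(k, i) * (alphabet_size - 1) ** i
               for i in range(min(d, k) + 1))


print(sorted(neighbors_brute_force('AAA', 1)))
print(neighborhood_size(3, 1))
```

For <code>pattern = 'AAA'</code> and <code>d = 1</code> both approaches give 10 strings, the same ones printed above, so comparing <code>set(neighbors(pattern, d))</code> with the brute-force set is a cheap regression test.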


3. Implement a modification of the **Frequent Words with Mismatches** algorithm that uses a dictionary (<code>dict</code>) to store counters only for those <code>k</code>-mers that occur in the string <code>text</code> and for their <code>d</code>-neighbors (<code>k</code>-mers at Hamming distance at most <code>d</code>).

In [7]:
def approximate_pattern_count(text, pattern, d):
    n = len(text)
    k = len(pattern)
    count = 0
    
    for i in range(n-k+1):
        if hamming_distance(pattern, text[i : i+k]) <= d:
            count += 1
            
    return count

In [8]:
def frequent_words_with_missmatches_dict(text, k, d):
    frequency_dict = {}

    n = len(text)
    for i in range(n-k+1):
        pattern = text[i : i + k]
        neighborhood = neighbors(pattern, d)

        # Count each candidate only once, the first time it is seen.
        for neighbor in neighborhood:
            if neighbor not in frequency_dict:
                frequency_dict[neighbor] = approximate_pattern_count(text, neighbor, d)

    # No k-mers at all (e.g. len(text) < k): nothing is frequent.
    if not frequency_dict:
        return []

    max_count = max(frequency_dict.values())

    # Keep every candidate whose approximate count equals the maximum.
    frequent_patterns = set()
    for pattern, count in frequency_dict.items():
        if count == max_count:
            frequent_patterns.add(pattern)

    return list(frequent_patterns)

In [9]:
text = 'ATATGCTAGTGTCGATGTGCTA'
k = 4
d = 2

print(frequent_words_with_missmatches_dict(text, k, d))

['GTTG', 'GGTT', 'TTGG']
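
The solution above rescans the whole text with <code>approximate_pattern_count</code> for every new candidate. A possible alternative, sketched here under the assumption that <code>k >= 1</code>, increments a counter for every neighbor of every window, so the text is traversed only once. The names <code>neighbors_set</code> and <code>frequent_words_with_mismatches_one_pass</code> are hypothetical; the block carries its own neighborhood helper to stay self-contained.

```python
from collections import defaultdict


def neighbors_set(pattern, d):
    """Recursive d-neighborhood of pattern over ACGT, returned as a set."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set('ACGT')
    result = set()
    for suffix in neighbors_set(pattern[1:], d):
        mismatches = sum(a != b for a, b in zip(pattern[1:], suffix))
        if mismatches < d:
            # Room for one more mismatch: any first character is allowed.
            result.update(n + suffix for n in 'ACGT')
        else:
            result.add(pattern[0] + suffix)
    return result


def frequent_words_with_mismatches_one_pass(text, k, d):
    """Most frequent k-mers with up to d mismatches, sorted alphabetically."""
    if k < 1:
        raise ValueError('k must be at least 1')
    counts = defaultdict(int)
    for i in range(len(text) - k + 1):
        # Each window adds one approximate occurrence to each of its neighbors.
        for neighbor in neighbors_set(text[i:i + k], d):
            counts[neighbor] += 1
    if not counts:
        return []
    max_count = max(counts.values())
    return sorted(p for p, c in counts.items() if c == max_count)


print(frequent_words_with_mismatches_one_pass('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1))
```

Because each window contributes exactly one occurrence to every <code>k</code>-mer within distance <code>d</code> of it, the final counters should equal the values returned by <code>approximate_pattern_count</code>, while the dictionary still holds only <code>k</code>-mers from the text and their <code>d</code>-neighbors.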
