# Week 4

## Randomized Motif Search

Some randomized algorithms are Las Vegas algorithms, which deliver solutions that are guaranteed to be exact, despite the fact that they rely on making random decisions. Yet most randomized algorithms, including the motif finding algorithms that we will consider in this chapter, are Monte Carlo algorithms. These algorithms are not guaranteed to return exact solutions, but they do quickly find approximate solutions. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.

In [1]:
from collections import Counter
import math


def CalculateProbability(Pattern, k, Profile:dict):
    # Probability of Pattern under Profile: the product of the per-position base probabilities
    return math.prod([Profile[Pattern[i]][i] for i in range(k)])



def CreateProfile(DnaList):
    # Build a dict mapping each base to its list of per-position frequencies,
    # computed over the columns of nucleotides taken from the list of sequences
    Profile = { base: [
        Nucleotides.count(base) / len(Nucleotides)
        for Nucleotides in [''.join([Dna[i] for Dna in DnaList]) for i in range(len(DnaList[0]))]
    ] for base in 'ACGT' }
    return Profile


def CreateProfilePseudocounts(DnaList):
    # Same as CreateProfile, but with Laplace pseudocounts: add 1 to every count,
    # so each column total grows by 4 (one per base) and the probabilities still sum to 1
    Profile = { base: [
        (Nucleotides.count(base)+1) / (len(Nucleotides)+4)
        for Nucleotides in [''.join([Dna[i] for Dna in DnaList]) for i in range(len(DnaList[0]))]
    ] for base in 'ACGT' }
    return Profile


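def Score(Motifs):
    # Transpose Motifs so that each inner list is one column of the motif matrix
    motifsList = [
        [Motifs[i][j] for i in range(len(Motifs))] for j in range(len(Motifs[0]))
    ]
    score = 0
    for column_i in motifsList:
        count = Counter(column_i)
        maxFreq = count.most_common(1)[0][1]
        score += len(column_i) - maxFreq  # letters that differ from the column's most common base
    return score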


def ProfileMostProbable(DnaString, k, Profile:dict):

    kMerList = [ DnaString[i:i+k] for i in range(len(DnaString) - k + 1) ]

    mostProbable = kMerList[0]
    probability = CalculateProbability(kMerList[0], k, Profile) # initialize with the first k-mer

    for kMer in kMerList[1:]:
        probkMer = CalculateProbability(kMer, k, Profile)
        if probkMer > probability:
            probability = probkMer
            mostProbable = kMer

    return mostProbable
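
The helpers above can be checked by hand on a tiny input. The following is a minimal sketch, not the notebook's own code: the motif matrix, the string `"TTACGTT"`, and the snake_case function names are made up for illustration, and the profile assumes Laplace pseudocounts of the form (count + 1) / (t + 4), where t is the number of motifs.

```python
from collections import Counter
import math

# Hypothetical toy motif matrix: t = 4 rows, k = 3 columns.
motifs = ["ACG", "ACG", "ATG", "CCG"]


def profile_with_pseudocounts(motifs):
    """Laplace-smoothed profile: (count + 1) / (t + 4) per base and column."""
    t = len(motifs)
    columns = ["".join(col) for col in zip(*motifs)]
    return {base: [(col.count(base) + 1) / (t + 4) for col in columns]
            for base in "ACGT"}


def kmer_probability(kmer, profile):
    """Product of the profile entries selected by each letter of the k-mer."""
    return math.prod(profile[base][i] for i, base in enumerate(kmer))


def most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under the profile."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return max(kmers, key=lambda kmer: kmer_probability(kmer, profile))


def score(motifs):
    """Number of letters that differ from the most common letter of their column."""
    return sum(len(col) - Counter(col).most_common(1)[0][1] for col in zip(*motifs))


profile = profile_with_pseudocounts(motifs)
# Every column of a valid profile should sum to 1.
column_sums = [sum(profile[base][i] for base in "ACGT") for i in range(3)]
best = most_probable_kmer("TTACGTT", 3, profile)

print(column_sums)
print(best, kmer_probability(best, profile))
print("Score:", score(motifs))
```

Because every count is incremented, no profile entry is zero, so a k-mer that disagrees with the current motifs in one position still receives a small nonzero probability.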

In general, we can begin from a collection of randomly chosen k-mers Motifs in Dna, construct Profile(Motifs), and use this profile to generate a new collection of k-mers:

In [7]:
import numpy as np


def RandomNumber(N):
    return np.random.randint(0, N) # return a random integer in the range [0, N)

In [5]:
def RandomizedMotfifSearch(DnaList, k, t, print_output=False):
    Motifs = [  # build a list of random motifs of length k, one from each string
        Dna[i:i+k] for Dna in DnaList for i in [RandomNumber(len(Dna)-k+1)]
    ]
    bestMotifs = Motifs[:] # copy into a separate variable and compute its Score
    bestScore = Score(bestMotifs)

    while True: # keep rebuilding the profile and recomputing the Score
        Profile = CreateProfilePseudocounts(Motifs)
        Motifs = [
            ProfileMostProbable(DnaString=Dna, k=k, Profile=Profile) for Dna in DnaList
        ]
        currentScore = Score(Motifs)
        
        if currentScore < bestScore: # normally each new round should lower the Score
            bestMotifs = Motifs[:]
            bestScore = currentScore
        else:   # once the Score stops decreasing, report the set of Motifs with the lowest Score
            if print_output:
                print(*bestMotifs)
                print("Score is:", bestScore)
            return [bestMotifs, bestScore]


def MultipleRandomizedMotifSearch(DnaList, k, t, i, print_output=False):
    lastMotifs, lastScore = RandomizedMotfifSearch( # build the initial set of motifs
        DnaList, k, t
    ) 

    for index in range(i-1):    # repeat the required number of times, finding a new bestMotifs list and its Score
        bestMotifs, bestScore = RandomizedMotfifSearch(
            DnaList, k, t
        )
        if bestScore < lastScore: # if this list turns out to be better, record it
            lastMotifs = bestMotifs[:]
            print(bestScore, lastScore)
            lastScore = bestScore
        print("Iteration number:", index+1)
    
    print('Complete!')
    if print_output:
        print(*lastMotifs)
        print("Score:", lastScore)
    else:
        return lastMotifs
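
For experimenting without typing input, the whole procedure can be condensed into a reproducible, self-contained sketch. This is an illustration under stated assumptions rather than the notebook's implementation: it uses the standard library's `random.Random` with a fixed seed instead of NumPy, the snake_case names are hypothetical, and the four toy strings (each containing a planted `ACGTACGT`) are invented for the example.

```python
import math
import random
from collections import Counter


def score(motifs):
    """Letters that differ from the most common letter of their column."""
    return sum(len(col) - Counter(col).most_common(1)[0][1] for col in zip(*motifs))


def profile_with_pseudocounts(motifs):
    """Laplace-smoothed profile: (count + 1) / (t + 4) per base and column."""
    t = len(motifs)
    return {b: [(col.count(b) + 1) / (t + 4) for col in zip(*motifs)] for b in "ACGT"}


def most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under the profile."""
    kmers = (text[i:i + k] for i in range(len(text) - k + 1))
    return max(kmers, key=lambda km: math.prod(profile[c][i] for i, c in enumerate(km)))


def randomized_motif_search(dna, k, rng):
    """One run: random start, then iterate profile -> motifs until the score stops improving."""
    motifs = []
    for text in dna:
        start = rng.randrange(len(text) - k + 1)
        motifs.append(text[start:start + k])
    best, best_score = motifs, score(motifs)
    while True:
        profile = profile_with_pseudocounts(motifs)
        motifs = [most_probable_kmer(text, k, profile) for text in dna]
        current = score(motifs)
        if current < best_score:
            best, best_score = motifs, current
        else:
            return best, best_score


def best_of_runs(dna, k, runs, seed=0):
    """Repeat the search from independent random starts and keep the lowest score."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same result
    results = [randomized_motif_search(dna, k, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[1])


# Hypothetical toy input: "ACGTACGT" is planted in every string.
dna = [
    "TTGACGTACGTCC",
    "ACGTACGTGGATT",
    "CCATTACGTACGT",
    "GGACGTACGTTAA",
]
motifs, best_score = best_of_runs(dna, k=8, runs=50)
print(*motifs)
print("Score:", best_score)
```

Each run terminates because the score is a non-negative integer that must strictly decrease for the loop to continue. A single run may still stop at a poor local optimum, which is why the result is taken as the best over many random starts.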


In [6]:
DnaList = [dna.strip() for dna in input().split()]

MultipleRandomizedMotifSearch(
    DnaList, k=8, t=5, i=100, print_output=True
)

15 21
Iteration number: 1
Iteration number: 2
Iteration number: 3
Iteration number: 4
14 15
Iteration number: 5
Iteration number: 6
Iteration number: 7
Iteration number: 8
Iteration number: 9
Iteration number: 10
Iteration number: 11
9 14
Iteration number: 12
Iteration number: 13
Iteration number: 14
Iteration number: 15
Iteration number: 16
Iteration number: 17
Iteration number: 18
Iteration number: 19
Iteration number: 20
Iteration number: 21
Iteration number: 22
Iteration number: 23
Iteration number: 24
Iteration number: 25
Iteration number: 26
Iteration number: 27
Iteration number: 28
Iteration number: 29
Iteration number: 30
Iteration number: 31
Iteration number: 32
Iteration number: 33
Iteration number: 34
Iteration number: 35
Iteration number: 36
Iteration number: 37
Iteration number: 38
Iteration number: 39
Iteration number: 40
Iteration number: 41
Iteration number: 42
Iteration number: 43
Iteration number: 44
Iteration number: 45
Iteration number: 46
Iteration number: 47
Itera