# BC205: Algorithms for Bioinformatics.
## IV. Motif Discovery

### Christoforos Nikolaou

### Up to now we have seen how:
* We define a sequence motif
* We can search for known short motifs with a determined degree
of ambiguity
* We can estimate the existence of a motif in a sequence
* We can define the strength of the motif in the sequence in an
Entropy-based score

### Recap of results in code:
a. Creating a PWM from a given set of sequences

In [1]:
# Code here
# Read the motif instances, skipping FASTA header lines
import re
seqs = []
with open("files/gata.fa", 'r') as f:
    for line in f:
        x=re.match(">", line)
        if x is None:
            seqs.append(line.rstrip())
#
def pwm(sequences):
    # Rows follow the order A, C, G, T; columns are motif positions
    import numpy as np
    nuc = ['A', 'C', 'G', 'T']
    profile=[[0 for i in range(len(sequences[0]))] for j in range(len(nuc))]
    # Count each nucleotide at each position
    for instance in sequences:
        for j in range(len(instance)):
            residue=instance[j]
            profile[nuc.index(residue)][j]+=1
    # Convert counts to relative frequencies
    pwm = np.array(profile)
    pwm = pwm/len(sequences)
    return(pwm)

mypwm=pwm(seqs)


In [2]:
print(mypwm)


[[0.6  0.02 0.96 0.02 0.72 0.44]
 [0.04 0.02 0.02 0.   0.06 0.06]
 [0.02 0.9  0.02 0.04 0.04 0.4 ]
 [0.34 0.06 0.   0.94 0.18 0.1 ]]


b. Transformation of PWM into PSSM on the basis of a given sequence

We can transform this table by applying a normalization and log-transformation against nucleotide occurrences from a given sequence.

We shall first write a function to read the sequence in FASTA format:

In [3]:
def readfasta(fastafile):
    import re
    seq = ""
    with open(fastafile, 'r') as f:
        for line in f:
            x=re.match(">", line)
            if x is None:
                # strip the line ending rather than assuming a trailing newline
                seq=seq+line.rstrip()
    # discard ambiguous bases
    seq=seq.replace('N','')
    return(seq)

targetsequence=readfasta("files/ecoli.fa")


We then write another function to calculate nucleotide frequencies:

In [4]:

def nuccomp(sequence):
    import numpy as np
    nucfreq = [0, 0, 0, 0]
    nuc = ['A', 'C', 'G', 'T']
    for i in range(len(nuc)):
        nucfreq[i]=sequence.count(nuc[i])
    nucfreq=np.array(nucfreq)/len(sequence)
    return(nucfreq)

nucfreqs=nuccomp(targetsequence)

In [5]:
print(nucfreqs)

[0.24592455 0.2537243  0.25370882 0.24664232]


The next obvious step is to combine the PWM with the array of the nucleotide composition of the target sequence and log-transform the resulting table into a PSSM.
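
Written out, the transformation applied by the `pssm` function in the next cell can be summarized as follows. This is a compact restatement rather than a new definition; the symbol $\epsilon$ stands for the pseudocount, which that function sets to 0.01, and $f_i$ is the background frequency of nucleotide $i$ in the target sequence.

$$
\mathrm{PSSM}_{ij} = \log_2 \frac{\mathrm{PWM}_{ij} + \epsilon}{f_i}, \qquad i \in \{A, C, G, T\}, \quad j = 1, \dots, k
$$

For example, for A at the first position: $\log_2\big((0.6 + 0.01) / 0.24592455\big) \approx 1.31$, which agrees with the first entry of the PSSM printed below.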

In [7]:
# Creation of a PSSM
def pssm(pwm, nucfreqs):
    import numpy as np
    import math
    pseudocount=0.01
    pssm=[[0 for i in range(len(pwm[0]))] for j in range(len(nucfreqs))]
    for i in range(len(nucfreqs)):
        pssm[i]=(np.array(pwm[i])+pseudocount)/nucfreqs[i]
    for i in range(len(pssm)):
        for k in range(len(pssm[0])):
            pssm[i][k]=math.log(pssm[i][k])/math.log(2)
    return(pssm)

mypssm=pssm(mypwm, nucfreqs)

In [8]:
print(mypssm)

[array([ 1.31059348, -3.03518135,  1.97976899, -3.03518135,  1.56968071,
        0.87170924]), array([-2.34326171, -3.0802273 , -3.0802273 , -4.66518981, -1.85783488,
       -1.85783488]), array([-3.0801393 ,  1.84269284, -3.0801393 , -2.3431737 , -2.3431737 ,
        0.69245021]), array([ 0.50493453, -1.81699356, -4.62434848,  1.94550713, -0.37642097,
       -1.16491686])]


c. Searching a sequence with a PWM/PSSM or with Hamming Distance

In [9]:
def pssmSearch(pssm, sequence, threshold):
    nuc = ['A', 'C', 'G', 'T']
    hits = []
    instances = []
    for i in range(len(sequence)-len(pssm[0])+1): # +1 so the last window is also scored
        instance=sequence[i:i+len(pssm[0])]
        score=0
        for l in range(len(instance)):
            score=score+pssm[nuc.index(instance[l])][l]
        if (score > threshold):
            hits.append(i)
            instances.append(instance) 
    return(hits, instances)

out=pssmSearch(mypssm, targetsequence, 9)

In [10]:
print(out)

([5034, 6229, 8707, 9759, 9781, 17600, 21827, 37967, 38567, 39228, 39335, 40623, 41571, 45665, 47187, 50637, 53835, 56483, 58430, 59521, 62109, 62143, 62232, 64327, 64958, 66241, 66563, 66696, 69201, 73270, 73645, 73911, 75670, 77388, 84571, 87243, 92861, 92932, 108733, 117334, 119995, 120562, 120841, 122402, 123009, 123867, 126036, 126332, 126726, 127030, 127424, 127499, 127521, 129086, 129090, 131742, 133150, 133862, 136502, 140160, 142042, 142324, 142987, 145864, 148276, 148634, 153116, 153691, 153868, 154195, 156270, 156975, 159698, 159848, 160010, 162919, 166958, 170267, 179495, 180542, 181354, 182337, 182477, 183366, 191531, 191553, 191921, 194536, 198077, 199955, 200188, 203448, 208649, 210955, 218953, 221807, 226138, 231284, 233221, 234455, 234617, 236376, 237312, 238629, 239360, 241473, 242954, 243866, 244828, 247126, 248401, 252575, 253838, 253847, 255585, 256199, 257930, 258174, 258455, 258862, 258881, 259957, 260524, 261664, 263094, 266041, 267657, 268151, 268363, 268690, 2
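
The cell above covers the PSSM search only. A minimal sketch of the Hamming-distance alternative mentioned in the heading might look like the following. The function names `hamming` and `hamming_search`, the parameter `max_mismatches`, and the toy sequence are all assumptions made for illustration; they are not part of the course code.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def hamming_search(motif, sequence, max_mismatches):
    """Return start positions and instances within max_mismatches of motif."""
    k = len(motif)
    hits, instances = [], []
    # +1 so that the last window of the sequence is also examined
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if hamming(motif, window) <= max_mismatches:
            hits.append(i)
            instances.append(window)
    return hits, instances


# Toy example (made-up sequence, not the E. coli sequence used above)
toy_hits, toy_instances = hamming_search("AGATAA", "CCAGATAAGGTGATAAC", 1)
print(toy_hits, toy_instances)
```

Unlike the PSSM score, the Hamming distance treats every mismatch equally, regardless of position or of how often the mismatching base occurs in the known instances.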

d. Entropy calculations

We can then apply Entropy and Information Content calculations on the resulting hits/matches

In [None]:
# code to be added
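
As a placeholder for the cell above, here is one possible sketch of the calculation. It assumes the hits are available as a list of equal-length strings (such as the `instances` returned by `pssmSearch`), uses a maximum of 2 bits per position for a four-letter alphabet, and applies no small-sample correction. The function name `information_content` and the toy instances are hypothetical.

```python
import math


def information_content(instances):
    """Per-position information content (bits) of equal-length DNA instances."""
    nuc = ['A', 'C', 'G', 'T']
    k = len(instances[0])
    info = []
    for j in range(k):
        column = [inst[j] for inst in instances]
        entropy = 0.0
        for base in nuc:
            p = column.count(base) / len(column)
            if p > 0:  # 0 * log2(0) is taken as 0
                entropy -= p * math.log2(p)
        info.append(2.0 - entropy)  # 2 bits is the maximum for 4 letters
    return info


# Toy instances, made up for illustration
toy_instances = ["AGATAA", "AGATAG", "TGATAA", "AGATAC"]
toy_info = information_content(toy_instances)
print([round(x, 3) for x in toy_info], round(sum(toy_info), 3))
```

Fully conserved positions reach 2 bits, while a position where all four bases are equally frequent contributes 0 bits.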

### The next problem. Discover a new motif from a given set of sequences

#### Part 1. Formulating the problem
1. Given a set of sequences that each contains an instance of the motif, find the motif.

#### A first approach 

Assuming we have a way to calculate the distance of a k-mer from a given sequence, we can proceed as follows:


In [None]:
# code here

1. Given a set of s sequences: find a set of k-mers (for a given
length k, one from each sequence) that maximizes the score (or
minimizes the distance) between each k-mer and its sequence
2. Collect the k-mers
3. Create a motif from them


In [None]:
# code

### Brute Force Approach

What is the complexity of the BFA?
1. Number of k-mers: $4^k$
2. Number of k-mers in each sequence: $(n - k + 1)$
3. Number of calculations for each k-mer given s sequences of
length n: $(n - k + 1)^s$
4. Total number of calculations: $4^k (n - k + 1)^s$

The complexity of the algorithm is at least $O(n^s)$.
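
To get a feeling for these numbers, the short sketch below evaluates them for one set of assumed values: 50 sequences of 100 bases (the size of the data set used in the Gibbs sampling cell later on) and a motif length of 6. The function names are hypothetical, and the count of position combinations assumes that one k-mer start position is chosen independently in each of the s sequences.

```python
def possible_kmers(k):
    """Number of distinct DNA k-mers."""
    return 4 ** k


def windows_per_sequence(n, k):
    """Number of k-mer start positions in a sequence of length n."""
    return n - k + 1


def position_combinations(n, k, s):
    """Ways to choose one k-mer start position in each of s sequences."""
    return windows_per_sequence(n, k) ** s


# Assumed example values: 50 sequences of 100 bases, motif length 6
n, k, s = 100, 6, 50
print("possible k-mers:", possible_kmers(k))
print("windows per sequence:", windows_per_sequence(n, k))
# Python integers are exact, so the order of magnitude can be read off the digit count
magnitude = len(str(position_combinations(n, k, s))) - 1
print("position combinations: about 10^%d" % magnitude)
```

The exponent s is what makes the exhaustive search hopeless: every additional sequence multiplies the number of combinations by the number of windows per sequence.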

We need something faster!


In [None]:
# code

* Assuming we have a way to calculate the distance of a k-mer k
from a given sequence seq:

```
for k in kmers:
    for seq in sequences:
        if distance(k, seq) < min_distance:
            min_distance <- distance(k, seq)
            motif[seq] <- k
```

* Each k-mer needs to pass only once through each sequence, so
the cost is dominated by the number of k-mers: the median string
approach has $O(4^k)$ complexity, which is manageable because k is
(usually) much shorter than the length of the sequence.

* However, it is still quite slow, and for k > 10 it remains
impractical.
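
A minimal runnable sketch of the median string idea is given below. It assumes that `distance(kmer, seq)` is the smallest Hamming distance between the k-mer and any window of the sequence, and that the quantity to minimize is the sum of these distances over all sequences. Both choices, as well as the function names and the toy sequences, are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def distance(kmer, seq):
    """Smallest Hamming distance between kmer and any window of seq."""
    k = len(kmer)
    return min(hamming(kmer, seq[i:i + k]) for i in range(len(seq) - k + 1))


def median_string(sequences, k):
    """Return the k-mer with the smallest total distance to all sequences."""
    best_kmer, best_total = None, float("inf")
    for letters in product("ACGT", repeat=k):  # all 4**k candidate k-mers
        kmer = "".join(letters)
        total = sum(distance(kmer, seq) for seq in sequences)
        if total < best_total:
            best_kmer, best_total = kmer, total
    return best_kmer, best_total


# Toy sequences, made up for illustration; each contains GATA
toy_sequences = ["CCGATACC", "TTTGATAT", "GATAGGGG"]
print(median_string(toy_sequences, 4))
```

The outer loop over `product("ACGT", repeat=k)` is where the $4^k$ term comes from: each extra position in the motif multiplies the running time by four.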


In [None]:
# code

### An alternative

* Assume a greedy approach to go through all sequences
updating a motif every time

* Starting from sequence i:

1. find the most common k-mer
2. create a profile from it (adding pseudocounts to all 0-values)
3. go to the next sequence
4. choose the k-mer that best fits the profile
5. store that k-mer in the collection and update profile
6. iterate steps 3->5.

* We've just described a Greedy approach for discovering a
motif p of a given length k among t sequences.
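
The six steps above can be sketched as follows. This is only one possible reading of them: the starting sequence is taken to be the first one, "best fits the profile" is interpreted as the highest product of profile frequencies, and a pseudocount of 1 is added to every cell. The names `build_profile`, `best_fit` and `greedy_motif` are hypothetical, and the sketch assumes the sequences contain only A, C, G and T.

```python
from collections import Counter

NUC = "ACGT"


def build_profile(kmers, pseudocount=1):
    """Column-wise nucleotide frequencies with a pseudocount in every cell."""
    k = len(kmers[0])
    total = len(kmers) + 4 * pseudocount
    return [[(sum(km[j] == b for km in kmers) + pseudocount) / total
             for j in range(k)] for b in NUC]


def best_fit(seq, profile):
    """The k-mer of seq with the highest probability under the profile."""
    k = len(profile[0])

    def prob(kmer):
        p = 1.0
        for j, base in enumerate(kmer):
            p *= profile[NUC.index(base)][j]
        return p

    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(windows, key=prob)


def greedy_motif(sequences, k):
    first = sequences[0]
    counts = Counter(first[i:i + k] for i in range(len(first) - k + 1))
    collection = [counts.most_common(1)[0][0]]       # step 1
    for seq in sequences[1:]:                        # steps 3 and 6
        profile = build_profile(collection)          # steps 2 and 5
        collection.append(best_fit(seq, profile))    # step 4
    return collection, build_profile(collection)


# Toy sequences, made up for illustration
toy_sequences = ["GATAGATA", "CCGATACC", "TTTGATAT"]
toy_collection, toy_profile = greedy_motif(toy_sequences, 4)
print(toy_collection)
```

Each sequence is scanned exactly once, which is where the speed comes from, but the result depends entirely on the k-mer chosen in the first sequence.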


In [None]:
# code

### A greedy Approach

* Assuming a set of s sequences and a given k-mer length k:

* We will construct a PWM "on the go" as we move from one
sequence to the next.
1. Start with the first sequence (i = 1).
2. For each k-mer in seq_1:  
    2.1 For i = 2 to i = s:  
    2.2 Find the best (smallest distance) k-mer in seq_i  
    2.3 Add it to the profile  
    2.4 If score(profile) is better than all previous ones, keep this profile  
3. Repeat for the next k-mer of seq_1.


### Analysis. Greedy Approach

1. Why Greedy: It takes k-mers from the first sequence only and
scans for them in the following ones. Thus it doesn't go through
all combinations of sequences and k-mers. What we gain from this
trade-off is speed.
2. KEY: It assumes that all sequences contain the motif. If the
first sequence doesn't contain the motif (in any variation), then
we are doomed to look for something nonsensical.
3. A way around this is to sample a small percentage of
sequences randomly, which brings us to the next-to-last chapter
of the motif finding problem.


### A randomized approach

* In the Greedy Approach we take the k-mers from the first
sequence and scan over the rest. In this way an initial wrong
choice may lead to disastrous results.
* In a Randomized Approach we start instead with a random
collection of s k-mers, one from each sequence, build a profile,
scan the sequences with that profile, update it and repeat until
the k-mer set is a good enough match for the updated profile.
* Stop and think of the problems we get rid of with this
approach.


### Pseudocode

```
for seq in sequences:
    profile[seq] <- random(k, seq)
while distance(profile, sequences) > threshold:
    for seq in sequences:
        profile[seq] <- max(k, profile, seq)
```
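
A runnable sketch of this loop is shown below, under a few stated assumptions. Instead of a fixed distance threshold it stops as soon as the score no longer improves, where the score is the total number of mismatches to the column-wise consensus. It also uses a pseudocount of 1 in the profile and expects sequences made of A, C, G and T only. All function names are hypothetical.

```python
import random

NUC = "ACGT"


def build_profile(kmers, pseudocount=1):
    """Column-wise nucleotide frequencies with a pseudocount in every cell."""
    k = len(kmers[0])
    total = len(kmers) + 4 * pseudocount
    return [[(sum(km[j] == b for km in kmers) + pseudocount) / total
             for j in range(k)] for b in NUC]


def best_fit(seq, profile):
    """The k-mer of seq with the highest probability under the profile."""
    k = len(profile[0])

    def prob(kmer):
        p = 1.0
        for j, base in enumerate(kmer):
            p *= profile[NUC.index(base)][j]
        return p

    return max((seq[i:i + k] for i in range(len(seq) - k + 1)), key=prob)


def score(kmers):
    """Total mismatches to the column-wise consensus (lower is better)."""
    total = 0
    for j in range(len(kmers[0])):
        column = [km[j] for km in kmers]
        total += len(column) - max(column.count(b) for b in NUC)
    return total


def randomized_motif_search(sequences, k, rng=random):
    # start from one random k-mer per sequence
    kmers = []
    for seq in sequences:
        start = rng.randint(0, len(seq) - k)
        kmers.append(seq[start:start + k])
    best = kmers
    while True:
        profile = build_profile(kmers)
        kmers = [best_fit(seq, profile) for seq in sequences]
        if score(kmers) < score(best):
            best = kmers
        else:
            return best  # no improvement: stop


# Toy sequences, made up for illustration
toy_sequences = ["CCGATACC", "TTTGATAT", "GATAGGGG"]
random.seed(1)
toy_result = randomized_motif_search(toy_sequences, 4)
print(toy_result, score(toy_result))
```

Because the starting k-mers are random, a single run may settle on a poor local optimum. In practice the search would be repeated many times and the best-scoring set kept.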

In [11]:
# Gibbs Sampling to locate a motif in a set of sequences
# code
sequences = []
with open ('files/motifs_in_sequence.fa') as file:
    for line in file:
        sequences.append(line.strip()) # 50 sequences as elements of a list. 100 bases each sequence
    

##Gibbs sampler##
import random
import numpy as np

def Gibbs_sampler(sequences,k): #k is the length of the motif, sequence is a list with the sequences
    
    dictionary = {'A':0,
                  'T':1,
                  'C':2,
                  'G':3}
    
    column_sum = len(sequences) #number of rows (50) or number of sequences
    length = len(sequences[0]) #number of columns or number of nucleotides in seq
    Imax = 1.8*k #threshold of I
    
    pwm = np.zeros([4,k]) # A,T,C,G X len(motif)
    
    for seq in sequences:
        rand_start = random.randint(0, length-k) #pick a random nucleotide from each sequence
        motif = seq[rand_start:rand_start+k] #and take substring as the motif
       
        lst = enumerate(motif) #finding the index of each nucleotide in the motif to access the correct column
                               #and using the dictionary to access the correct row 
        for i in lst: #making the first random pwm
            pwm[dictionary[i[1]],i[0]]+=1
            
    pwm = pwm/column_sum
    
    information = np.zeros([1,k])
    count=0
    while (np.sum(information)) < Imax: #while information_content of the pwm 
        motives=[]                      #is less than the threshold
        
        information_old = np.sum(information) #keeping the previous value of information content
                                             #to check convergence in case the threshold 
        for row in range (column_sum):                #is never reached
            maxx=0
            rand_seq = random.randint(0, column_sum-1) #pick a random index - sequence
            seq = sequences[rand_seq]               
            for i in range(len(seq)-k+1):   #take each k-mer from the sequence (including the last one)
                score = 0
                substring = seq[i:i+k]
                lst = enumerate(substring)
                
                for j in lst:                         #scoring each k-mer based on the pwm
                    score+=pwm[dictionary[j[1]],j[0]]   #keeping the motif with the highest score
                                                         #from each sequence
                if score > maxx:  
                    maxx = score
                    motif = substring
                    
            motives += [motif] #keep all the motifs with the highest score in the list motives
        
        pwm = np.zeros([4,k]) # A,T,C,G X len(motif) 
        
        for elem in motives: 
            lst = enumerate(elem)
            for i in lst:         #making the new pwm
                pwm[dictionary[i[1]],i[0]]+=1
                 
        pwm = pwm/column_sum
        
        information = np.zeros([1,k]) #computing the information of each position
        for i in range(k):
            information[0,i] = 2-abs(sum([elem*np.log2(elem) for elem in pwm[:,i] if elem > 0]))
            

        if abs(information_old - np.sum(information)) <= 0.5: #checking convergence 
            count+=1
            if count == 10: #if the difference of the information content is less or equal to 0.5
                break       #for consecutive 10 iterations then break
        else:
            count=0
    
    max_index_col = np.argmax(pwm, axis=0) #extracting the motif according to the   
                                           #highest frequency of each nucleotide in each position
    motif=''
    for values in max_index_col:
        for keys in dictionary.keys():
            if values == dictionary[keys]:
                motif+= keys
        
    return pwm,information,motif


#repeat the algorithm 100 times for each k (3 to 7) and keep the pwm and motif with the highest information content
#this process takes approximately 4 min (on my computer)
    
####100-cycled Gibbs#####
for k in range (3,8):
    highest_info = 0
    for i in range (100):
        summ=0
        pwm, information_content,motif = Gibbs_sampler(sequences,k)
        summ+=np.sum(information_content)
        if summ > highest_info:
            highest_info = summ
            pwm_ret = pwm
            motif_ret = motif
        
    print('\nThe information content of the motif divided by it\'s length is:',highest_info/k) #divide by length to normalize and compare among other k
    print('The pwm of the motif is:\n',pwm_ret)
    print('The motif is:',motif_ret)



#To find the motif with the highest information content for each k, I have to repeat the Gibbs sampler many times because of the randomness involved.
#The motif that returns the highest scaled information content is GAT (k = 3, I/k = 1.9528 or 2!), but I think that the motif we are looking for
#is GATA (k=4, I/k = 1.857), which contains GAT and is contained in all the longer motifs found.
#(We also know that GATA transcription factors exist, and GATA is indeed part of their binding sites.)
#Sometimes the algorithm returns other 3-mers (e.g. ATG, GGC, AAG) with high information content as well, but
#GATA is returned in almost every repetition of the 100-cycled Gibbs, which suggests that it may be the only 4-mer motif with such high information content.
#Finally, the threshold that I chose for checking convergence may not be the best choice and should be stricter.
#I kept this threshold anyway because otherwise the algorithm would have been a lot slower (it is already slow though :P).


The information content of the motif divided by it's length is: 2.0
The pwm of the motif is:
 [[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 0.]
 [1. 0. 0.]]
The motif is: GAT

The information content of the motif divided by it's length is: 1.9294269527293963
The pwm of the motif is:
 [[0.02 1.   0.   1.  ]
 [0.   0.   1.   0.  ]
 [0.02 0.   0.   0.  ]
 [0.96 0.   0.   0.  ]]
The motif is: GATA

The information content of the motif divided by it's length is: 1.600816685265535
The pwm of the motif is:
 [[0.52 0.   0.98 0.   0.92]
 [0.4  0.   0.   0.98 0.08]
 [0.08 0.   0.02 0.   0.  ]
 [0.   1.   0.   0.02 0.  ]]
The motif is: AGATA

The information content of the motif divided by it's length is: 1.4418722646853723
The pwm of the motif is:
 [[0.6  0.   1.   0.   0.92 0.52]
 [0.36 0.02 0.   1.   0.02 0.14]
 [0.04 0.   0.   0.   0.02 0.02]
 [0.   0.98 0.   0.   0.04 0.32]]
The motif is: AGATAA

The information content of the motif divided by it's length is: 1.2317978680328203
The pwm of the motif is:
 