# BC205: Algorithms for Bioinformatics.
## IV. Motif Discovery

### Christoforos Nikolaou

#### Up to now we have seen how:
* We define a sequence motif
* We can search for known short motifs with a determined degree
of ambiguity
* We can estimate the existence of a motif in a sequence
* We can define the strength of the motif in the sequence with an
Entropy-based score

![Motifs on Genome Sequences](figures/GenomicMotifs.PNG)


#### Consensus Sequences

A _consensus sequence_ is a grammatically encoded pattern that represents the ambiguity of the pattern's instances in a given corpus. We usually encode consensus sequences in the form of regular expression notation. 

![Consensus Sequence Definition](figures/ConsensusSequence.png)
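As a minimal illustration of the regular expression notation, the sketch below scans a toy sequence with the hypothetical consensus `[AT]GATA[AG]`. Both the consensus and the sequence are made up for this example and are not taken from the figure. It uses Python's built-in `re` module, and the lookahead group is there so that overlapping instances would also be reported.

```python
import re

# Hypothetical consensus in regular expression notation:
# position 1 is A or T, positions 2-5 are fixed, position 6 is A or G.
consensus = "[AT]GATA[AG]"

# Toy sequence, made up for this example.
sequence = "CCAGATAGTTTGATAACCGGATAG"

# A lookahead group reports every start position, even for overlapping matches.
pattern = re.compile("(?=(" + consensus + "))")
matches = [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

for start, instance in matches:
    print(start, instance)
```

The last `GGATAG` in the toy sequence is not reported, because its first residue is neither A nor T.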

#### Exercise

1. Write the Python code to:  
   a. Read a set of aligned sequences of the same length  
   b. Produce the consensus sequence   
   c. Print it on the screen using regular expression notation  
   d. Accompany the consensus sequence with a sequence bearing the most common residue in each position.  



#### Creating a PWM from a given set of sequences

Remember that a PWM is an _MxN_ table, where _M_ is the number of different residue types (e.g. _M=4_ for DNA) and _N_ is the length of the motif. Each position _(i,j)_ in a PWM holds the frequency of occurrence (probability) of residue _i_ at position _j_.

Such a table may be seen below:
![PWM](figures/PWM.PNG)

Below we start by creating a function that reads multiple fasta sequences into a list.

In [2]:
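# A function to read a multi-fasta file and return the sequences as a list
# (assumes each sequence sits on a single line, as in the motif file used here)
def readmultifasta(file):
    seqs = []
    # The context manager closes the file once reading is done
    with open(file, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip header lines (starting with ">") and empty lines
            if line and not line.startswith(">"):
                seqs.append(line)
    return seqs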


In [3]:
seqs = readmultifasta("files/gata.fa")
print(seqs)

['TTATAG', 'AGATAT', 'AGATAG', 'ATATCT', 'AGATAG', 'AGATAG', 'AGATAG', 'TGATAA', 'AGATAA', 'AGATAA', 'CGATAG', 'AGAGTT', 'TGATAA', 'TGATAA', 'AGATGG', 'AGATAG', 'AGATTG', 'AGATAA', 'TGATAA', 'AGATAA', 'AGATAG', 'TGATAG', 'TGATCA', 'TTATCA', 'AGATGG', 'TGATAT', 'AGATAG', 'TGATAA', 'GGATAC', 'AGATAA', 'CGATAA', 'TGATAG', 'AGATAA', 'TGATTA', 'AGATAA', 'AGATAG', 'TGATAT', 'AGATAA', 'TCAGAG', 'AAGTAG', 'AGATTA', 'TGATAG', 'TGATAG', 'AGATAC', 'TGATTG', 'AGATTA', 'AGAATA', 'AGATAA', 'AGATTA', 'AGCTTC']


We next proceed with a function that takes a set of sequences of **the same length** and produces a PWM table from this set. 

In [4]:

# A function to create a PWM from a set of aligned sequences of the same length
import numpy as np

def pwm(sequences):
    nuc = ['A', 'C', 'G', 'T']
    # One row per nucleotide, one column per motif position
    profile = np.zeros((len(nuc), len(sequences[0])))
    # Count how often each residue occurs at each position
    for instance in sequences:
        for j, residue in enumerate(instance):
            profile[nuc.index(residue)][j] += 1
    # Turn counts into frequencies
    return profile / len(sequences)

mypwm=pwm(seqs)


In [5]:
print(mypwm)

[[0.6  0.02 0.96 0.02 0.72 0.44]
 [0.04 0.02 0.02 0.   0.06 0.06]
 [0.02 0.9  0.02 0.04 0.04 0.4 ]
 [0.34 0.06 0.   0.94 0.18 0.1 ]]


Notice that we list the nucleotides in alphabetical order at the beginning of the function, which means that "A" corresponds to the first row, "C" to the second, and so on.
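To make the row convention concrete, here is a small sketch that looks up single entries and multiplies the per-position frequencies of a k-mer. The matrix values are copied from the printed PWM above; `row_of` and `kmer_probability` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

# PWM values copied from the printed output above (rows: A, C, G, T).
pwm_example = np.array([
    [0.60, 0.02, 0.96, 0.02, 0.72, 0.44],
    [0.04, 0.02, 0.02, 0.00, 0.06, 0.06],
    [0.02, 0.90, 0.02, 0.04, 0.04, 0.40],
    [0.34, 0.06, 0.00, 0.94, 0.18, 0.10],
])

# Map each nucleotide to its row index, following the alphabetical order.
row_of = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_probability(kmer, pwm):
    """Product of the per-position frequencies of a k-mer under the PWM."""
    prob = 1.0
    for position, residue in enumerate(kmer):
        prob *= pwm[row_of[residue], position]
    return prob

print(pwm_example[row_of["G"], 1])              # frequency of G at the second position
print(kmer_probability("AGATAG", pwm_example))  # a typical motif instance
print(kmer_probability("CCCCCC", pwm_example))  # contains a residue never seen at position 4
```

A single zero entry drives the whole product to zero, which is one reason pseudocounts are introduced when the PWM is turned into a PSSM.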

#### Transformation of PWM into PSSM on the basis of a given sequence

We can transform this table by normalizing it against the nucleotide frequencies of a given sequence and log-transforming the result, to create a PSSM.

PSSMs are context-dependent: because they incorporate the background nucleotide composition of the target sequence, they can better capture a motif match in that sequence.

![PSSM Matrix](figures/PSSM.PNG)
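Assuming the figure follows the usual log-odds definition (which is also what the `pssm` function defined later in this section computes), each PSSM entry can be written as follows. Here $p_{ij}$ is the PWM frequency of residue $i$ at position $j$, $b_i$ is the background frequency of residue $i$ in the target sequence, and $c$ is a small pseudocount that avoids taking the logarithm of zero:

$$
S_{ij} = \log_2 \frac{p_{ij} + c}{b_i}
$$

Positive values mark residues that are more frequent in the motif than in the background, negative values mark residues that are depleted.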



We shall first write a function to read the target sequence in fasta format

In [6]:
def readfasta(fastafile):
    chunks = []
    # The context manager closes the file once reading is done
    with open(fastafile, 'r') as f:
        for line in f:
            # Skip header lines; strip the newline from sequence lines
            if not line.startswith(">"):
                chunks.append(line.strip())
    # Join the lines and remove ambiguous bases
    seq = "".join(chunks).replace('N', '')
    return seq


And then write another function to calculate nucleotide frequencies on a given input sequence/genome

In [7]:
# Calculating mononucleotide composition
def nuccomp(sequence):
    import numpy as np
    nucfreq = [0, 0, 0, 0]
    nuc = ['A', 'C', 'G', 'T']
    for i in range(len(nuc)):
        nucfreq[i]=sequence.count(nuc[i])
    nucfreq=np.array(nucfreq)/len(sequence)
    return(nucfreq)

We can then run them sequentially

In [8]:
targetsequence=readfasta("files/ecoli.fa")
nucfreqs=nuccomp(targetsequence)
print(nucfreqs)

[0.24592455 0.2537243  0.25370882 0.24664232]


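Alternatively, we can reuse the kmers function from a previous class and run it for k=1. Note that the call below reads a different genome file (`files/Staaur.fa`) and overwrites `nucfreqs`, so the PSSM that follows is built against that background composition.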

In [9]:
def kmers(genomefile, k):
    import re

    # Read the sequence, skipping the fasta header (first line)
    chunks = []
    with open(genomefile, 'r') as file:
        for count, line in enumerate(file):
            if count > 0:
                chunks.append(line.strip())

    # Keep only unambiguous nucleotides
    seq = re.sub("[^AGCT]", "", "".join(chunks))

    # Count every k-mer; the last window starts at len(seq)-k
    kmertable = {}
    nwindows = len(seq) - k + 1
    for i in range(nwindows):
        DNA = seq[i:i+k]
        kmertable[DNA] = kmertable.get(DNA, 0) + 1

    # Convert counts to frequencies, sorted alphabetically by k-mer
    kmertable = {kmer: n / nwindows for kmer, n in sorted(kmertable.items())}
    return list(kmertable.values())

In [10]:
nucfreqs = kmers("files/Staaur.fa", 1)
print(nucfreqs)

[0.3358589850106648, 0.16534241159706717, 0.16384496106086052, 0.3349532806283794]


#### Creation of a PSSM

The next obvious step is to combine the PWM with the array of the nucleotide composition of the target sequence and log-transform the resulting table into a PSSM.

In [11]:
# Creation of a PSSM
def pssm(pwm, nucfreqs):
    import numpy as np
    # A small pseudocount avoids taking the logarithm of zero
    pseudocount = 0.01
    pwm = np.asarray(pwm, dtype=float)
    # Background frequencies as a column, so that each row is divided
    # by the frequency of its own nucleotide
    background = np.asarray(nucfreqs, dtype=float).reshape(-1, 1)
    # Log2 ratio of the motif frequencies against the background
    return np.log2((pwm + pseudocount) / background)


In [12]:
mypssm=pssm(mypwm, nucfreqs)
print(mypssm)


[[ 0.86095362 -3.48482122  1.53012912 -3.48482122  1.12004084  0.42206938]
 [-1.72545683 -2.46242243 -2.46242243 -4.04738493 -1.24003001 -1.24003001]
 [-2.4492969   2.47353524 -2.4492969  -1.7123313  -1.7123313   1.32329261]
 [ 0.06339504 -2.25853305 -5.06588798  1.50396763 -0.81796046 -1.60645636]]
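
As a quick sanity check, two entries of the matrix above can be recomputed by hand. The sketch below assumes the pseudocount of 0.01 used in `pssm` and the background frequencies printed by the `kmers` call; the variable names are introduced only for this example.

```python
import math

pseudocount = 0.01                  # same pseudocount as in pssm()
background_A = 0.3358589850106648   # background frequency of A, from the kmers output
background_T = 0.3349532806283794   # background frequency of T, from the kmers output

# A at the first position has PWM frequency 0.6
score_A1 = math.log2((0.6 + pseudocount) / background_A)

# T at the third position has PWM frequency 0, so only the pseudocount remains
score_T3 = math.log2((0.0 + pseudocount) / background_T)

print(score_A1)
print(score_T3)
```

The two values should agree with the first entry of the first row and the third entry of the last row. The second one also shows what the pseudocount buys us: without it the score would be minus infinity.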


#### Searching a sequence with a PWM/PSSM 

In the following, we define a pssmSearch function that takes as input a PSSM, a target sequence and a score threshold, which we express as a fraction of the maximum possible PSSM score.

In [13]:
### Update to allow for percentage thresholds

def pssmSearch(pssm, sequence, threshold):

    nuc = ['A', 'C', 'G', 'T']
    width = len(pssm[0])
    hits = []       # start positions of the matches
    instances = []  # matching subsequences
    allscores = []  # scores of the matches, for plotting reasons

    # Step 1: Calculation of maximum possible PSSM score
    maxPssm = 0
    for j in range(width):
        maxPssm = maxPssm + max(pssm[:,j])

    # Step 2: Search (the last window starts at len(sequence)-width)
    for i in range(len(sequence)-width+1):
        instance=sequence[i:i+width]
        score=0
        for l in range(width):
            score=score+pssm[nuc.index(instance[l])][l]
        if (score > threshold*maxPssm):
            hits.append(i)
            instances.append(instance)
            allscores.append(score)

    return(hits, instances, allscores)


In [55]:
pssmSearch(mypssm, targetsequence, 0.95)

([5034,
  6229,
  8707,
  9759,
  9781,
  17600,
  21827,
  37967,
  38567,
  39228,
  39335,
  40623,
  41571,
  45665,
  47187,
  50637,
  53835,
  56483,
  58430,
  59521,
  62109,
  62143,
  62232,
  64327,
  64958,
  66241,
  66563,
  66696,
  69201,
  73270,
  73645,
  73911,
  75670,
  77388,
  84571,
  87243,
  92861,
  92932,
  108733,
  117334,
  119995,
  120562,
  120841,
  122402,
  123009,
  123867,
  126036,
  126332,
  126726,
  127030,
  127424,
  127499,
  127521,
  129086,
  129090,
  131742,
  133150,
  133862,
  136502,
  140160,
  142042,
  142324,
  142987,
  145864,
  148276,
  148634,
  153116,
  153691,
  153868,
  154195,
  156270,
  156975,
  159698,
  159848,
  160010,
  162919,
  166958,
  170267,
  179495,
  180542,
  181354,
  182337,
  182477,
  183366,
  191531,
  191553,
  191921,
  194536,
  198077,
  199955,
  200188,
  203448,
  208649,
  210955,
  218953,
  221807,
  226138,
  231284,
  233221,
  234455,
  234617,
  236376,
  237312,
  238629,
  ... (output truncated)

#### Exercise

Perform the same process as above but now using:  
    a. A fixed most frequent word pattern instead of a PSSM   
    b. A Hamming Distance approach for matches   
    c. As in this example, allow for different levels of Hamming Distance to report matches. 

#### Entropy calculations

Shannon Entropy and Information Content are quantities that we use to assess the ambiguity in a motif, mostly when it is expressed in the form of a PWM.

Remember the definition of Shannon Entropy:

![Shannon Entropy](figures/ShannonEntropyFormula.PNG)

We can then apply Entropy and Information Content calculations on the resulting hits/matches.

![Information](figures/InformationFormula.PNG)
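Assuming the two figures show the standard definitions, the per-position quantities for a DNA motif can be written as follows, where $p_{ij}$ is the PWM frequency of residue $i$ at position $j$ and terms with $p_{ij} = 0$ are taken to contribute zero:

$$
H_j = -\sum_{i \in \{A,C,G,T\}} p_{ij} \log_2 p_{ij}, \qquad I_j = \log_2 4 - H_j = 2 - H_j
$$

A fully conserved position has $H_j = 0$ and $I_j = 2$ bits, while a position where all four residues are equally likely has $H_j = 2$ and $I_j = 0$.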

In [53]:
### Entropy and Information Content of a PWM

def pwmEntropyInformation(pwm):
    
    import numpy as np
    k = pwm.shape[1]

    information = np.zeros([1,k]) #computing the information of each position
    for i in range(k):
        information[0,i] = 2-abs(sum([elem*np.log2(elem) for elem in pwm[:,i] if elem > 0]))
    
    sumInfo = np.sum(information)
    scaledSumInfo = sumInfo/k
    
    return(information, sumInfo, scaledSumInfo)
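
To see how the per-position values behave, the following self-contained sketch applies the same calculation to three toy columns. `column_information` is a hypothetical helper written for this example; it mirrors the expression used inside `pwmEntropyInformation` for a single column.

```python
import numpy as np

def column_information(column):
    """Information content (bits) of one DNA PWM column: 2 minus its Shannon entropy."""
    column = np.asarray(column, dtype=float)
    nonzero = column[column > 0]            # zero frequencies contribute nothing
    entropy = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 - entropy

uniform = column_information([0.25, 0.25, 0.25, 0.25])   # all residues equally likely
conserved = column_information([1.0, 0.0, 0.0, 0.0])     # a single residue
two_way = column_information([0.5, 0.0, 0.5, 0.0])       # two equally likely residues

print(uniform, conserved, two_way)
```

The three columns span the range of possible values: no information, the maximum of 2 bits, and 1 bit for a position that only halves the number of possible residues.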


#### PWM vs PSSM Searches

Let's compare the output of searches with a PWM and a PSSM. Our previously defined function can handle both. Below we search for instances of the motif with the starting PWM and with the PSSM, keeping the hits that score above 90% of the maximum possible score.

In [79]:
pwmsearches = pssmSearch(mypwm, targetsequence, 0.90)[1]
pssmsearches = pssmSearch(mypssm, targetsequence, 0.90)[1]


In [81]:
print(pssmsearches)

['AGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'AGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'AGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG', 'TGATAG',

And now we check the Information Content of each of the two search results

In [82]:
pwmEntropyInformation(pwm(pwmsearches))

(array([[1.0231508 , 2.        , 2.        , 2.        , 2.        ,
         0.28962496]]),
 9.312775760245215,
 1.5521292933742024)

In [83]:
pwmEntropyInformation(pwm(pssmsearches))

(array([[1.00658002, 2.        , 2.        , 2.        , 2.        ,
         2.        ]]),
 11.006580021263819,
 1.8344300035439698)

We see that, at the same threshold, the PSSM search returns a more restricted and more informative set of motif instances.

### The next problem. Discover a new motif from a given set of sequences

#### Part 1. Formulating the problem
1. Given a set of sequences that each contains an instance of the motif, find the motif.

#### A Naive Brute-Force Approach

Find the most common k-mer in a set of sequences by iteratively scanning all the sequences.

1. Given a set of s sequences, find a set of k-mers (for a given length k, one from each sequence) such that each k-mer maximizes the score (or minimizes the distance) against its sequence
2. Collect the k-mers
3. Create a motif from them
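The simplest version of this idea, counting exact k-mer occurrences over all sequences, can be sketched as below. This is only a starting point under a strong assumption: it finds a motif only if the same exact k-mer recurs, and it ignores any notion of score or distance. The function name and the toy sequences are made up for the example.

```python
from collections import Counter

def most_common_kmer(sequences, k):
    """Count every k-mer over all sequences and return the most frequent one with its count."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of length k over the sequence
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(1)[0]

# Toy sequences, each carrying one exact copy of the same 6-mer
toy = ["CCAGATAGTT", "TTAGATAGCC", "GGCAGATAGA"]
best_kmer, best_count = most_common_kmer(toy, 6)
print(best_kmer, best_count)
```

As soon as the instances differ by even one residue, exact counting breaks down, which is why the approaches that follow work with distances and profiles instead.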


### Brute Force Approach

What is the complexity of the BFA?
1. Number of k-mers: $4^k$
2. Number of k-mers in each sequence: $(n - k + 1)$
3. Number of calculations for each k-mer given s sequences of
length n: $(n - k + 1)^s$
4. Total number of calculations: $4^k (n - k + 1)^s$

The complexity of the algorithm is at least $O(n^s)$.

We need something faster!
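To get a feeling for these numbers, the sketch below evaluates the counts for hypothetical values of k, n and s. It reads the expressions as powers, that is $4^k$ possible k-mers and $(n-k+1)^s$ ways of picking one window from each sequence; this reading is an assumption of the example.

```python
# Hypothetical problem size: motif length, sequence length, number of sequences
k, n, s = 6, 100, 10

num_kmers = 4 ** k                          # all possible k-mers over A, C, G, T
windows_per_sequence = n - k + 1            # k-mers contained in one sequence
combinations = windows_per_sequence ** s    # one window chosen from each sequence

print(num_kmers)
print(windows_per_sequence)
print(combinations)
```

Even for ten short sequences the number of window combinations is beyond what can be enumerated, while the number of possible 6-mers is only a few thousand.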


#### A more nuanced Brute-Force Approach

Find a set of k-mers, one from each sequence, that together, create a PWM that has maximal Information content.

* Assuming we have a way to calculate the distance of a k-mer from a given sequence seq:

```
min_distance <- infinity
for kmer in kmers:
    total <- 0
    for seq in sequences:
        total <- total + distance(kmer, seq)
    if total < min_distance:
        min_distance <- total
        motif <- kmer
```

* Because each k-mer needs to pass only once through each
sequence, the median string approach has $O(4^k)$ complexity, given that k is
(usually) much shorter than the length of the sequence.

* However, it is still quite slow, and for k>10 it remains
impractical.
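A minimal Python sketch of the median string idea is given below. It assumes that `distance` is the smallest Hamming distance between the k-mer and any window of the sequence, and that the motif is the k-mer with the smallest total distance over all sequences. The function names and toy sequences are illustrative, not a reference implementation.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def distance(kmer, seq):
    """Smallest Hamming distance between kmer and any window of seq."""
    k = len(kmer)
    return min(hamming(kmer, seq[i:i + k]) for i in range(len(seq) - k + 1))

def median_string(sequences, k):
    """Try all 4**k k-mers and keep the one with the smallest total distance."""
    best_kmer, best_total = None, float("inf")
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        total = sum(distance(kmer, seq) for seq in sequences)
        if total < best_total:
            best_kmer, best_total = kmer, total
    return best_kmer, best_total

# Toy sequences carrying slightly different instances of a GATA-like motif
toy = ["CCAGATAGTT", "TTTGATAGCC", "GGCAGATAAA"]
motif, total_distance = median_string(toy, 5)
print(motif, total_distance)
```

The outer loop runs over all `4**k` candidates, so the running time grows fourfold with every extra position in the motif.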



### An alternative

* Assume a greedy approach that goes through all sequences,
updating a motif every time

* Starting from sequence i:

1. find the most common k-mer
2. create a profile from it (adding pseudocounts to all 0-values)
3. go to the next sequence
4. choose the k-mer that best fits the profile
5. store that k-mer in the collection and update profile
6. iterate steps 3->5.

* We've just described a Greedy approach for discovering a
motif p of a given length k among t sequences.

* Which problems do you see in this approach?
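The six steps above can be sketched as follows. This is one possible reading, with several assumptions: the starting sequence is simply the first one in the list, "best fits the profile" is taken to mean the window with the highest product of profile frequencies, and a pseudocount of 1 is added to every cell. All names are hypothetical.

```python
from collections import Counter
import numpy as np

NUC = "ACGT"

def build_profile(kmers, pseudocount=1.0):
    """Column-wise residue frequencies of a k-mer collection, with pseudocounts."""
    k = len(kmers[0])
    counts = np.full((4, k), pseudocount)
    for kmer in kmers:
        for j, residue in enumerate(kmer):
            counts[NUC.index(residue), j] += 1
    return counts / counts.sum(axis=0)

def best_fit(seq, profile):
    """The window of seq with the highest probability under the profile."""
    k = profile.shape[1]
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    def prob(window):
        return np.prod([profile[NUC.index(r), j] for j, r in enumerate(window)])
    return max(windows, key=prob)

def greedy_motif(sequences, k):
    first = sequences[0]
    # Step 1: most common k-mer of the starting sequence
    counts = Counter(first[i:i + k] for i in range(len(first) - k + 1))
    collection = [counts.most_common(1)[0][0]]
    # Steps 3-6: visit the remaining sequences one by one
    for seq in sequences[1:]:
        profile = build_profile(collection)        # steps 2 and 5
        collection.append(best_fit(seq, profile))  # step 4
    return collection, build_profile(collection)

toy = ["AGATAGCCAGATAG", "TTTGATAGCC", "GGCAGATAAA"]
collection, profile = greedy_motif(toy, 6)
print(collection)
```

Every choice after the first depends on the k-mer picked in step 1, so changing the order of `toy` can change the result.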


### Greedy Approach. Problems

* Dependence on the starting sequence. For robustness we can repeat the process above with a different starting sequence every time.
* It expects most (if not all) sequences in the collection to contain the motif. This doesn't happen often.
* It will work worse as the motif becomes "degenerate", i.e. of low information content.


### Analysis. Greedy Approach

1. Why Greedy: It takes k-mers only from the first sequence and
scans for them in the following ones. Thus it doesn't go through all
combinations of sequences and k-mers. As we've seen above,
what we gain in this trade-off is speed.
2. KEY: It assumes that all sequences contain the motif. If the
first sequence doesn't contain the motif (in any variation), then
we are doomed to look for something nonsensical.
3. A way around this is to sample a small percentage of
sequences randomly, which brings us to the next-to-last chapter
of the motif finding problem.


### A randomized approach

* In the Greedy Approach we take the k-mers from the first
sequence and scan over the rest. In this way, an initial wrong
choice may lead to disastrous results.
* In a Randomized Approach we start instead with a
collection of s k-mers, one from each sequence, build a profile,
scan the sequences with that profile, update it and repeat until
the k-mer set is a good enough match for the updated profile.
* Stop and think of *the problems we get rid of* with this
approach.

![Gibbs Sampling for Motif Identification](figures/GibbsSampler.PNG)


#### Pseudocode

```
for seq in sequences:
    profile[seq] <- random(k, seq)
while distance(profile, sequences) > threshold:
    for seq in sequences:
        profile[seq] <- max(k, profile, seq)
```

* Instead of "distance" you can use alternatives such as Information Content.
* The main idea is unchanged.
  1. You start with a random collection of k-mers, one from each sequence
  2. You create a PWM out of these
  3. You scan each sequence with the PWM and substitute the chosen k-mer with the one that best fits the PWM
  4. You update the PWM and calculate its Information Content I
  5. If I >= threshold, stop. Otherwise repeat 3-5.
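A compact sketch of these five steps is shown below. It makes a few assumptions that are not part of the description above: all k-mers are replaced in one pass before the PWM is rebuilt, pseudocounts of 1 keep every frequency above zero, and the loop also stops when the k-mer set no longer changes or after `max_iter` rounds. The function names, `max_iter` and `seed` are hypothetical.

```python
import random
import numpy as np

NUC = "ACGT"

def build_pwm(kmers, pseudocount=1.0):
    """Column-wise residue frequencies of the k-mer set, with pseudocounts."""
    counts = np.full((4, len(kmers[0])), pseudocount)
    for kmer in kmers:
        for j, residue in enumerate(kmer):
            counts[NUC.index(residue), j] += 1
    return counts / counts.sum(axis=0)

def information(pwm):
    """Total information content in bits, summed over all positions."""
    return float(np.sum(2 + np.sum(pwm * np.log2(pwm), axis=0)))

def best_fit(seq, pwm):
    """The window of seq with the highest probability under the PWM."""
    k = pwm.shape[1]
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(windows, key=lambda w: np.prod([pwm[NUC.index(r), j] for j, r in enumerate(w)]))

def randomized_motif(sequences, k, threshold, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: one random k-mer from each sequence
    kmers = []
    for seq in sequences:
        start = rng.randrange(len(seq) - k + 1)
        kmers.append(seq[start:start + k])
    for _ in range(max_iter):
        pwm = build_pwm(kmers)                                 # steps 2 and 4
        if information(pwm) >= threshold:                      # step 5
            break
        new_kmers = [best_fit(seq, pwm) for seq in sequences]  # step 3
        if new_kmers == kmers:                                 # converged without reaching the threshold
            break
        kmers = new_kmers
    return kmers, build_pwm(kmers)

toy = ["CCAGATAGTT", "TTTGATAGCC", "GGCAGATAAA"]
found, found_pwm = randomized_motif(toy, 6, threshold=9.0)
print(found, information(found_pwm))
```

Because the start is random, a single run may settle on a poor k-mer set; in practice one would repeat the search with several seeds and keep the PWM with the highest Information Content.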